subject:"\[RFC\] Vectorization of indexed elements"

Re: [RFC] Vectorization of indexed elements

2013-12-06 Thread Richard Biener

On Wed, 4 Dec 2013, Vidya Praveen wrote:

> Hi Richi,
> 
> Apologies for the late response. I was on vacation.
> 
> On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote:
> > > void
> > > foo (int *__restrict__ a,
> > >  int *__restrict__ b,
> > >  int c)
> > > {
> > >   int i;
> > > 
> > >   for (i = 0; i < 8; i++)
> > > a[i] = b[i] * c;
> > > }
> > 
> > Both cases can be handled by patterns that match
> > 
> >   (mul:VXSI (reg:VXSI
> >  (vec_duplicate:VXSI reg:SI)))
> 
> How do I arrive at this pattern in the first place? Assuming vec_init with
> uniform values are expanded as vec_duplicate, it will still be two 
> expressions.
> 
> That is,
> 
> (set reg:VXSI (vec_duplicate:VXSI (reg:SI)))
> (set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI)))

Yes, but then combine comes along and creates

 (set reg:VXSI (mul:VXSI (reg:VXSI (vec_duplicate:VXSI (reg:SI)

which matches one of your define_insn[_and_split]s.

> > You'd then "consume" the vec_duplicate and implement it as
> > load scalar into element zero of the vector and use index mult
> > with index zero.
> 
> If I understand this correctly, you are suggesting to leave the scalar
> load from memory as it is but treat the 
> 
> (mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI)))
> 
> as 
> 
> load reg:VXSI[0], reg:SI
> mul reg:VXSI, reg:VXSI, re:VXSI[0] // by reusing the destination register 
> perhaps
> 
> either by generating instructions directly or by using define_split. Am I 
> right?

Possibly.  Or allow memory as operand 2 for your pattern (so, not
reg:SI but mem:SI).  Combine should be happy with that, too.
 
> If I'm right, then my concern is that it may be possible to simplify this 
> further
> by loading directly to a indexed vector register from memory but it's too 
> late at
> this point for such simplification to be possible.
> 
> Please let me know what am I not understanding.

Not sure.  Did you try it?

Richard.

Re: [RFC] Vectorization of indexed elements

2013-12-04 Thread Vidya Praveen

Hi Jakub,

Apologies for the late response.

On Fri, Oct 11, 2013 at 04:05:24PM +0100, Jakub Jelinek wrote:
> On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> > Here's a compilable example:
> > 
> > void 
> > foo (int *__restrict__ a,
> >  int *__restrict__ b,
> >  int *__restrict__ c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> > a[i] = b[i] * c[2];
> > }
> > 
> > This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> > of target instructions that can take a vector register as second argument 
> > but
> > use only one element (by using the same value for all the lanes) of the 
> > vector register.
> > 
> > Eg. mul , , [index]
> > mla , , [index] // multiply and add
> > 
> > But for a loop like the one in the C example given, I will have to load the
> > c[2] in one element of the vector register (leaving the remaining unused)
> > rather. This is why I was proposing to load just one element in a vector 
> > register (what I meant as "lane specific load"). The benefit of doing this 
> > is
> > that we avoid explicit duplication, however such a simplification can only
> > be done where such support is available - the reason why I was thinking in
> > terms of optional standard pattern name. Another benefit is we will also be
> > able to support scalars in the expression like in the following example:
> > 
> > void
> > foo (int *__restrict__ a,
> >  int *__restrict__ b,
> >  int c)
> > {
> >   int i;
> > 
> >   for (i = 0; i < 8; i++)
> > a[i] = b[i] * c;
> > }
> 
> So just during combine let the broadcast operation be combined with the
> arithmetics?  

Yes. I can do that. But I always want it to be possible to recognize and load
directly to the indexed vector register from memory.


> Intel AVX512 ISA has similar feature, not sure what exactly
> they are doing for this. 

Thanks. I'll try to go through the code to understand.

> That said, the broadcast is likely going to be
> hoisted before the loop, and in that case is it really cheaper to have
> it unbroadcasted in a vector register rather than to broadcast it before the
> loop and just use there?

Could you explain what do you mean by unbroadcast? The constructor needs to be
expanded in one way or another, isn't it? I thought expanding to vec_duplicate
when the values are uniform is the most efficient when vec_duplicate could be
supported by the target. If you had meant that each element of vector is loaded
separately, I am thinking how can I combine such an operation with the 
arithmetic
operation.

Thanks
VP.

Re: [RFC] Vectorization of indexed elements

2013-12-04 Thread Vidya Praveen

Hi Richi,

Apologies for the late response. I was on vacation.

On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote:
> On Fri, 11 Oct 2013, Vidya Praveen wrote:
> 
> > On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > > > 
> > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > > > [...]
> > > > > > > > > > I can't really insist on the single lane load.. something 
> > > > > > > > > > like:
> > > > > > > > > > 
> > > > > > > > > > vc:V4SI[0] = c
> > > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > > > > > > > 
> > > > > > > > > > Or is there any other way to do this?
> > > > > > > > > 
> > > > > > > > > Can you elaborate on "I can't really insist on the single 
> > > > > > > > > lane load"?
> > > > > > > > > What's the single lane load in your example? 
> > > > > > > > 
> > > > > > > > Loading just one lane of the vector like this:
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > > > 
> > > > > > > > or 
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c[2] 
> > > > > > > > 
> > > > > > > > is what I meant by single lane load. In this example:
> > > > > > > > 
> > > > > > > > t = c[2] 
> > > > > > > > ...
> > > > > > > > vb:v4si = b[0:3] 
> > > > > > > > vc:v4si = { t, t, t, t }
> > > > > > > > va:v4si = vb:v4si  vc:v4si 
> > > > > > > > 
> > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at 
> > > > > > > > vec_init, I cannot
> > > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] 
> > > > > > > > (which could be 
> > > > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > > > 
> > > > > > > > > I'd expect the instruction
> > > > > > > > > pattern as quoted to just work (and I hope we expand an 
> > > > > > > > > uniform
> > > > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > > > 
> > > > > > > > As much as I went through the code, this is only done using 
> > > > > > > > vect_init. It is
> > > > > > > > not expanded as vec_duplicate from, for example, 
> > > > > > > > store_constructor() of expr.c
> > > > > > > 
> > > > > > > Do you see any issues if we expand such constructor as 
> > > > > > > vec_duplicate directly 
> > > > > > > instead of going through vect_init way? 
> > > > > > 
> > > > > > Sorry, that was a bad question.
> > > > > > 
> > > > > > But here's what I would like to propose as a first step. Please 
> > > > > > tell me if this
> > > > > > is acceptable or if it makes sense:
> > > > > > 
> > > > > > - Introduce standard pattern names 
> > > > > > 
> > > > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > > > > 
> > > > > > Example:
> > > > > > 
> > > > > > (define_insn "vmuliv4si4"
> > > > > >[set (match_operand:V4SI 0 "register_operand")
> > > > > > (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > > > >   (vec_duplicate:V4SI
> > > > > > (vec_select:SI
> > > > > >   (match_operand:V4SI 2 "register_operand")
> > > > > >   (match_operand:V4SI 3 "immediate_operand)]
> > > > > >  ...
> > > > > > )
> > > > > 
> > > > > We could factor this with providing a standard pattern name for
> > > > > 
> > > > > (define_insn "vdupi"
> > > > >   [set (match_operand: 0 "register_operand")
> > > > >(vec_duplicate:
> > > > >   (vec_select:
> > > > >  (match_operand: 1 "register_operand")
> > > > >  (match_operand:SI 2 "immediate_operand]
> > > > 
> > > > This is good. I did think about this but then I thought of avoiding the 
> > > > need
> > > > for combiner patterns :-) 
> > > > 
> > > > But do you find the lane specific mov pattern I proposed, acceptable? 
> > > 
> > > The specific mul pattern?  As said, consider factoring to vdupi to
> > > avoid an explosion in required special optabs.
> > > 
> > > > > (you use V4SI for the immediate?  
> > > > 
> > > > Sorry typo again!! It should've been SI.
> > > > 
> > > > > Ideally vdupi has another custom
> > > > > mode for the vector index).
> > > > > 
> > > > > Note that this factored pattern is already available as 
> > > > > vec_perm_const!
> > > > > It is simply (vec_perm_const:V4SI   
> > > > > ).
> > > > > 
> > > > > Which means that on the GIMPLE level we should try to combine
> > > > > 
> > > > > el_4 = BIT_FIELD_REF ;
> > > > > v_5 = { el_4, el_4, ... };
> > > > 
> > > > I don't think we reach this state at all for the scenarios in 
> > > > discussion.
> > > > what we generally have is:
> > > > 
> > > >  el_4 = MEM_REF < array + index*size >
> > > >  v_5 = { el_4, ... }
> > > > 
> > > >

Re: [RFC] Vectorization of indexed elements

2013-10-14 Thread Richard Biener

On Fri, 11 Oct 2013, Vidya Praveen wrote:

> On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > 
> > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > > 
> > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > > [...]
> > > > > > > > > I can't really insist on the single lane load.. something 
> > > > > > > > > like:
> > > > > > > > > 
> > > > > > > > > vc:V4SI[0] = c
> > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > > > > > > 
> > > > > > > > > Or is there any other way to do this?
> > > > > > > > 
> > > > > > > > Can you elaborate on "I can't really insist on the single lane 
> > > > > > > > load"?
> > > > > > > > What's the single lane load in your example? 
> > > > > > > 
> > > > > > > Loading just one lane of the vector like this:
> > > > > > > 
> > > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > > 
> > > > > > > or 
> > > > > > > 
> > > > > > > vc:V4SI[0] = c[2] 
> > > > > > > 
> > > > > > > is what I meant by single lane load. In this example:
> > > > > > > 
> > > > > > > t = c[2] 
> > > > > > > ...
> > > > > > > vb:v4si = b[0:3] 
> > > > > > > vc:v4si = { t, t, t, t }
> > > > > > > va:v4si = vb:v4si  vc:v4si 
> > > > > > > 
> > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, 
> > > > > > > I cannot
> > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] 
> > > > > > > (which could be 
> > > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > > 
> > > > > > > > I'd expect the instruction
> > > > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > > 
> > > > > > > As much as I went through the code, this is only done using 
> > > > > > > vect_init. It is
> > > > > > > not expanded as vec_duplicate from, for example, 
> > > > > > > store_constructor() of expr.c
> > > > > > 
> > > > > > Do you see any issues if we expand such constructor as 
> > > > > > vec_duplicate directly 
> > > > > > instead of going through vect_init way? 
> > > > > 
> > > > > Sorry, that was a bad question.
> > > > > 
> > > > > But here's what I would like to propose as a first step. Please tell 
> > > > > me if this
> > > > > is acceptable or if it makes sense:
> > > > > 
> > > > > - Introduce standard pattern names 
> > > > > 
> > > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > > > 
> > > > > Example:
> > > > > 
> > > > > (define_insn "vmuliv4si4"
> > > > >[set (match_operand:V4SI 0 "register_operand")
> > > > > (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > > >   (vec_duplicate:V4SI
> > > > > (vec_select:SI
> > > > >   (match_operand:V4SI 2 "register_operand")
> > > > >   (match_operand:V4SI 3 "immediate_operand)]
> > > > >  ...
> > > > > )
> > > > 
> > > > We could factor this with providing a standard pattern name for
> > > > 
> > > > (define_insn "vdupi"
> > > >   [set (match_operand: 0 "register_operand")
> > > >(vec_duplicate:
> > > >   (vec_select:
> > > >  (match_operand: 1 "register_operand")
> > > >  (match_operand:SI 2 "immediate_operand]
> > > 
> > > This is good. I did think about this but then I thought of avoiding the 
> > > need
> > > for combiner patterns :-) 
> > > 
> > > But do you find the lane specific mov pattern I proposed, acceptable? 
> > 
> > The specific mul pattern?  As said, consider factoring to vdupi to
> > avoid an explosion in required special optabs.
> > 
> > > > (you use V4SI for the immediate?  
> > > 
> > > Sorry typo again!! It should've been SI.
> > > 
> > > > Ideally vdupi has another custom
> > > > mode for the vector index).
> > > > 
> > > > Note that this factored pattern is already available as vec_perm_const!
> > > > It is simply (vec_perm_const:V4SI   
> > > > ).
> > > > 
> > > > Which means that on the GIMPLE level we should try to combine
> > > > 
> > > > el_4 = BIT_FIELD_REF ;
> > > > v_5 = { el_4, el_4, ... };
> > > 
> > > I don't think we reach this state at all for the scenarios in discussion.
> > > what we generally have is:
> > > 
> > >  el_4 = MEM_REF < array + index*size >
> > >  v_5 = { el_4, ... }
> > > 
> > > Or am I missing something?
> > 
> > Well, but in that case I doubt it is profitable (or even valid!) to
> > turn this into a vector lane load from the array.  If it is profitable
> > to perform a vector read (because we're going to use the other elements
> > of the vector as well) then the vectorizer should produce a vector
> > load and materialize the uniform vector from one of its elements.
> > 
> > Maybe

Re: [RFC] Vectorization of indexed elements

2013-10-11 Thread Jakub Jelinek

On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> Here's a compilable example:
> 
> void 
> foo (int *__restrict__ a,
>  int *__restrict__ b,
>  int *__restrict__ c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
> a[i] = b[i] * c[2];
> }
> 
> This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> of target instructions that can take a vector register as second argument but
> use only one element (by using the same value for all the lanes) of the 
> vector register.
> 
> Eg. mul , , [index]
> mla , , [index] // multiply and add
> 
> But for a loop like the one in the C example given, I will have to load the
> c[2] in one element of the vector register (leaving the remaining unused)
> rather. This is why I was proposing to load just one element in a vector 
> register (what I meant as "lane specific load"). The benefit of doing this is
> that we avoid explicit duplication, however such a simplification can only
> be done where such support is available - the reason why I was thinking in
> terms of optional standard pattern name. Another benefit is we will also be
> able to support scalars in the expression like in the following example:
> 
> void
> foo (int *__restrict__ a,
>  int *__restrict__ b,
>  int c)
> {
>   int i;
> 
>   for (i = 0; i < 8; i++)
> a[i] = b[i] * c;
> }

So just during combine let the broadcast operation be combined with the
arithmetics?  Intel AVX512 ISA has similar feature, not sure what exactly
they are doing for this.  That said, the broadcast is likely going to be
hoisted before the loop, and in that case is it really cheaper to have
it unbroadcasted in a vector register rather than to broadcast it before the
loop and just use there?

Jakub

Re: [RFC] Vectorization of indexed elements

2013-10-11 Thread Vidya Praveen

On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> On Mon, 30 Sep 2013, Vidya Praveen wrote:
> 
> > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > [...]
> > > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > > 
> > > > > > > > vc:V4SI[0] = c
> > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > > > > > 
> > > > > > > > Or is there any other way to do this?
> > > > > > > 
> > > > > > > Can you elaborate on "I can't really insist on the single lane 
> > > > > > > load"?
> > > > > > > What's the single lane load in your example? 
> > > > > > 
> > > > > > Loading just one lane of the vector like this:
> > > > > > 
> > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > > 
> > > > > > or 
> > > > > > 
> > > > > > vc:V4SI[0] = c[2] 
> > > > > > 
> > > > > > is what I meant by single lane load. In this example:
> > > > > > 
> > > > > > t = c[2] 
> > > > > > ...
> > > > > > vb:v4si = b[0:3] 
> > > > > > vc:v4si = { t, t, t, t }
> > > > > > va:v4si = vb:v4si  vc:v4si 
> > > > > > 
> > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I 
> > > > > > cannot
> > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which 
> > > > > > could be 
> > > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > > 
> > > > > > > I'd expect the instruction
> > > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > > 
> > > > > > As much as I went through the code, this is only done using 
> > > > > > vect_init. It is
> > > > > > not expanded as vec_duplicate from, for example, 
> > > > > > store_constructor() of expr.c
> > > > > 
> > > > > Do you see any issues if we expand such constructor as vec_duplicate 
> > > > > directly 
> > > > > instead of going through vect_init way? 
> > > > 
> > > > Sorry, that was a bad question.
> > > > 
> > > > But here's what I would like to propose as a first step. Please tell me 
> > > > if this
> > > > is acceptable or if it makes sense:
> > > > 
> > > > - Introduce standard pattern names 
> > > > 
> > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > > 
> > > > Example:
> > > > 
> > > > (define_insn "vmuliv4si4"
> > > >[set (match_operand:V4SI 0 "register_operand")
> > > > (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > >   (vec_duplicate:V4SI
> > > > (vec_select:SI
> > > >   (match_operand:V4SI 2 "register_operand")
> > > >   (match_operand:V4SI 3 "immediate_operand)]
> > > >  ...
> > > > )
> > > 
> > > We could factor this with providing a standard pattern name for
> > > 
> > > (define_insn "vdupi"
> > >   [set (match_operand: 0 "register_operand")
> > >(vec_duplicate:
> > >   (vec_select:
> > >  (match_operand: 1 "register_operand")
> > >  (match_operand:SI 2 "immediate_operand]
> > 
> > This is good. I did think about this but then I thought of avoiding the need
> > for combiner patterns :-) 
> > 
> > But do you find the lane specific mov pattern I proposed, acceptable? 
> 
> The specific mul pattern?  As said, consider factoring to vdupi to
> avoid an explosion in required special optabs.
> 
> > > (you use V4SI for the immediate?  
> > 
> > Sorry typo again!! It should've been SI.
> > 
> > > Ideally vdupi has another custom
> > > mode for the vector index).
> > > 
> > > Note that this factored pattern is already available as vec_perm_const!
> > > It is simply (vec_perm_const:V4SI   ).
> > > 
> > > Which means that on the GIMPLE level we should try to combine
> > > 
> > > el_4 = BIT_FIELD_REF ;
> > > v_5 = { el_4, el_4, ... };
> > 
> > I don't think we reach this state at all for the scenarios in discussion.
> > what we generally have is:
> > 
> >  el_4 = MEM_REF < array + index*size >
> >  v_5 = { el_4, ... }
> > 
> > Or am I missing something?
> 
> Well, but in that case I doubt it is profitable (or even valid!) to
> turn this into a vector lane load from the array.  If it is profitable
> to perform a vector read (because we're going to use the other elements
> of the vector as well) then the vectorizer should produce a vector
> load and materialize the uniform vector from one of its elements.
> 
> Maybe at this point you should show us a compilable C testcase
> with a loop that should be vectorized using your instructions in
> the end?

Here's a compilable example:

void 
foo (int *__restrict__ a,
 int *__restrict__ b,
 int *__restrict__ c)
{
  int i;

  for (i = 0; i < 8; i++)
a[i] = b[i] * c[2];
}

This is vecto

Re: [RFC] Vectorization of indexed elements

2013-10-01 Thread Richard Biener

On Mon, 30 Sep 2013, Vidya Praveen wrote:

> On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > 
> > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > [...]
> > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > 
> > > > > > > vc:V4SI[0] = c
> > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > > > > 
> > > > > > > Or is there any other way to do this?
> > > > > > 
> > > > > > Can you elaborate on "I can't really insist on the single lane 
> > > > > > load"?
> > > > > > What's the single lane load in your example? 
> > > > > 
> > > > > Loading just one lane of the vector like this:
> > > > > 
> > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > 
> > > > > or 
> > > > > 
> > > > > vc:V4SI[0] = c[2] 
> > > > > 
> > > > > is what I meant by single lane load. In this example:
> > > > > 
> > > > > t = c[2] 
> > > > > ...
> > > > > vb:v4si = b[0:3] 
> > > > > vc:v4si = { t, t, t, t }
> > > > > va:v4si = vb:v4si  vc:v4si 
> > > > > 
> > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I 
> > > > > cannot
> > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which 
> > > > > could be 
> > > > > seen as vec_select:SI (vect_t 0) ). 
> > > > > 
> > > > > > I'd expect the instruction
> > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > 
> > > > > As much as I went through the code, this is only done using 
> > > > > vect_init. It is
> > > > > not expanded as vec_duplicate from, for example, store_constructor() 
> > > > > of expr.c
> > > > 
> > > > Do you see any issues if we expand such constructor as vec_duplicate 
> > > > directly 
> > > > instead of going through vect_init way? 
> > > 
> > > Sorry, that was a bad question.
> > > 
> > > But here's what I would like to propose as a first step. Please tell me 
> > > if this
> > > is acceptable or if it makes sense:
> > > 
> > > - Introduce standard pattern names 
> > > 
> > > "vmulim4" - vector muliply with second operand as indexed operand
> > > 
> > > Example:
> > > 
> > > (define_insn "vmuliv4si4"
> > >[set (match_operand:V4SI 0 "register_operand")
> > > (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > >   (vec_duplicate:V4SI
> > > (vec_select:SI
> > >   (match_operand:V4SI 2 "register_operand")
> > >   (match_operand:V4SI 3 "immediate_operand)]
> > >  ...
> > > )
> > 
> > We could factor this with providing a standard pattern name for
> > 
> > (define_insn "vdupi"
> >   [set (match_operand: 0 "register_operand")
> >(vec_duplicate:
> >   (vec_select:
> >  (match_operand: 1 "register_operand")
> >  (match_operand:SI 2 "immediate_operand]
> 
> This is good. I did think about this but then I thought of avoiding the need
> for combiner patterns :-) 
> 
> But do you find the lane specific mov pattern I proposed, acceptable? 

The specific mul pattern?  As said, consider factoring to vdupi to
avoid an explosion in required special optabs.

> > (you use V4SI for the immediate?  
> 
> Sorry typo again!! It should've been SI.
> 
> > Ideally vdupi has another custom
> > mode for the vector index).
> > 
> > Note that this factored pattern is already available as vec_perm_const!
> > It is simply (vec_perm_const:V4SI   ).
> > 
> > Which means that on the GIMPLE level we should try to combine
> > 
> > el_4 = BIT_FIELD_REF ;
> > v_5 = { el_4, el_4, ... };
> 
> I don't think we reach this state at all for the scenarios in discussion.
> what we generally have is:
> 
>  el_4 = MEM_REF < array + index*size >
>  v_5 = { el_4, ... }
> 
> Or am I missing something?

Well, but in that case I doubt it is profitable (or even valid!) to
turn this into a vector lane load from the array.  If it is profitable
to perform a vector read (because we're going to use the other elements
of the vector as well) then the vectorizer should produce a vector
load and materialize the uniform vector from one of its elements.

Maybe at this point you should show us a compilable C testcase
with a loop that should be vectorized using your instructions in
the end?

Thanks,
Richard.

Re: [RFC] Vectorization of indexed elements

2013-09-30 Thread Vidya Praveen

On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> On Mon, 30 Sep 2013, Vidya Praveen wrote:
> 
> > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > [...]
> > > > > > I can't really insist on the single lane load.. something like:
> > > > > > 
> > > > > > vc:V4SI[0] = c
> > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > > > 
> > > > > > Or is there any other way to do this?
> > > > > 
> > > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > > What's the single lane load in your example? 
> > > > 
> > > > Loading just one lane of the vector like this:
> > > > 
> > > > vc:V4SI[0] = c // from the above scalar example
> > > > 
> > > > or 
> > > > 
> > > > vc:V4SI[0] = c[2] 
> > > > 
> > > > is what I meant by single lane load. In this example:
> > > > 
> > > > t = c[2] 
> > > > ...
> > > > vb:v4si = b[0:3] 
> > > > vc:v4si = { t, t, t, t }
> > > > va:v4si = vb:v4si  vc:v4si 
> > > > 
> > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I 
> > > > cannot
> > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which 
> > > > could be 
> > > > seen as vec_select:SI (vect_t 0) ). 
> > > > 
> > > > > I'd expect the instruction
> > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > 
> > > > As much as I went through the code, this is only done using vect_init. 
> > > > It is
> > > > not expanded as vec_duplicate from, for example, store_constructor() of 
> > > > expr.c
> > > 
> > > Do you see any issues if we expand such constructor as vec_duplicate 
> > > directly 
> > > instead of going through vect_init way? 
> > 
> > Sorry, that was a bad question.
> > 
> > But here's what I would like to propose as a first step. Please tell me if 
> > this
> > is acceptable or if it makes sense:
> > 
> > - Introduce standard pattern names 
> > 
> > "vmulim4" - vector muliply with second operand as indexed operand
> > 
> > Example:
> > 
> > (define_insn "vmuliv4si4"
> >[set (match_operand:V4SI 0 "register_operand")
> > (mul:V4SI (match_operand:V4SI 1 "register_operand")
> >   (vec_duplicate:V4SI
> > (vec_select:SI
> >   (match_operand:V4SI 2 "register_operand")
> >   (match_operand:V4SI 3 "immediate_operand)]
> >  ...
> > )
> 
> We could factor this with providing a standard pattern name for
> 
> (define_insn "vdupi"
>   [set (match_operand: 0 "register_operand")
>(vec_duplicate:
>   (vec_select:
>  (match_operand: 1 "register_operand")
>  (match_operand:SI 2 "immediate_operand]

This is good. I did think about this but then I thought of avoiding the need
for combiner patterns :-) 

But do you find the lane specific mov pattern I proposed, acceptable? 

> (you use V4SI for the immediate?  

Sorry typo again!! It should've been SI.

> Ideally vdupi has another custom
> mode for the vector index).
> 
> Note that this factored pattern is already available as vec_perm_const!
> It is simply (vec_perm_const:V4SI   ).
> 
> Which means that on the GIMPLE level we should try to combine
> 
> el_4 = BIT_FIELD_REF ;
> v_5 = { el_4, el_4, ... };

I don't think we reach this state at all for the scenarios in discussion.
what we generally have is:

 el_4 = MEM_REF < array + index*size >
 v_5 = { el_4, ... }

Or am I missing something?

> 
> into
> 
> v_5 = VEC_PERM_EXPR ;
> 
> which it should already do with simplify_permutation.
> 
> But I'm not sure what you are after at then end ;)
> 
> Richard.
>
 
Regards
VP

Re: [RFC] Vectorization of indexed elements

2013-09-30 Thread Richard Biener

On Mon, 30 Sep 2013, Vidya Praveen wrote:

> On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > [...]
> > > > > I can't really insist on the single lane load.. something like:
> > > > > 
> > > > > vc:V4SI[0] = c
> > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > > 
> > > > > Or is there any other way to do this?
> > > > 
> > > > Can you elaborate on "I can't really insist on the single lane load"?
> > > > What's the single lane load in your example? 
> > > 
> > > Loading just one lane of the vector like this:
> > > 
> > > vc:V4SI[0] = c // from the above scalar example
> > > 
> > > or 
> > > 
> > > vc:V4SI[0] = c[2] 
> > > 
> > > is what I meant by single lane load. In this example:
> > > 
> > > t = c[2] 
> > > ...
> > > vb:v4si = b[0:3] 
> > > vc:v4si = { t, t, t, t }
> > > va:v4si = vb:v4si  vc:v4si 
> > > 
> > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could 
> > > be 
> > > seen as vec_select:SI (vect_t 0) ). 
> > > 
> > > > I'd expect the instruction
> > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > 
> > > As much as I went through the code, this is only done using vect_init. It 
> > > is
> > > not expanded as vec_duplicate from, for example, store_constructor() of 
> > > expr.c
> > 
> > Do you see any issues if we expand such constructor as vec_duplicate 
> > directly 
> > instead of going through vect_init way? 
> 
> Sorry, that was a bad question.
> 
> But here's what I would like to propose as a first step. Please tell me if 
> this
> is acceptable or if it makes sense:
> 
> - Introduce standard pattern names 
> 
> "vmulim4" - vector muliply with second operand as indexed operand
> 
> Example:
> 
> (define_insn "vmuliv4si4"
>[set (match_operand:V4SI 0 "register_operand")
> (mul:V4SI (match_operand:V4SI 1 "register_operand")
>   (vec_duplicate:V4SI
> (vec_select:SI
>   (match_operand:V4SI 2 "register_operand")
>   (match_operand:V4SI 3 "immediate_operand)]
>  ...
> )

We could factor this with providing a standard pattern name for

(define_insn "vdupi"
  [set (match_operand: 0 "register_operand")
   (vec_duplicate:
  (vec_select:
 (match_operand: 1 "register_operand")
 (match_operand:SI 2 "immediate_operand]

(you use V4SI for the immediate?  Ideally vdupi has another custom
mode for the vector index).

Note that this factored pattern is already available as vec_perm_const!
It is simply (vec_perm_const:V4SI   ).

Which means that on the GIMPLE level we should try to combine

el_4 = BIT_FIELD_REF ;
v_5 = { el_4, el_4, ... };

into

v_5 = VEC_PERM_EXPR ;

which it should already do with simplify_permutation.

But I'm not sure what you are after at then end ;)

Richard.

Re: [RFC] Vectorization of indexed elements

2013-09-30 Thread Vidya Praveen

On Wed, Sep 25, 2013 at 10:22:05AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
> 
> > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> > > On Mon, 9 Sep 2013, Marc Glisse wrote:
> > > 
> > > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > > 
> > > > > Hello,
> > > > > 
> > > > > This post details some thoughts on an enhancement to the vectorizer 
> > > > > that
> > > > > could take advantage of the SIMD instructions that allows indexed 
> > > > > element
> > > > > as an operand thus reducing the need for duplication and possibly 
> > > > > improve
> > > > > reuse of previously loaded data.
> > > > > 
> > > > > Appreciate your opinion on this.
> > > > > 
> > > > > ---
> > > > > 
> > > > > A phrase like this:
> > > > > 
> > > > > for(i=0;i<4;i++)
> > > > >   a[i] = b[i]  c[2];
> > > > > 
> > > > > is usually vectorized as:
> > > > > 
> > > > >  va:V4SI = a[0:3]
> > > > >  vb:V4SI = b[0:3]
> > > > >  t = c[2]
> > > > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at 
> > > > > vec_init
> > > > >  ...
> > > > >  va:V4SI = vb:V4SI  vc:V4SI
> > > > > 
> > > > > But this could be simplified further if a target has instructions that
> > > > > support
> > > > > indexed element as a parameter. For example an instruction like this:
> > > > > 
> > > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > > > 
> > > > > can perform multiplication of each element of v2.4s with the third 
> > > > > element
> > > > > of
> > > > > v2.4s (specified as v2.4s[2]) and store the results in the 
> > > > > corresponding
> > > > > elements of v0.4s.
> > > > > 
> > > > > For this to happen, vectorizer needs to understand this idiom and 
> > > > > treat the
> > > > > operand c[2] specially (and by taking in to consideration if the 
> > > > > machine
> > > > > supports indexed element as an operand for  through a target hook 
> > > > > or
> > > > > macro)
> > > > > and consider this as vectorizable statement without having to 
> > > > > duplicate the
> > > > > elements explicitly.
> > > > > 
> > > > > There are fews ways this could be represented at gimple:
> > > > > 
> > > > >  ...
> > > > >  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 
> > > > > 2))
> > > > >  ...
> > > > > 
> > > > > or by allowing a vectorizer treat an indexed element as a valid 
> > > > > operand in a
> > > > > vectorizable statement:
> > > > 
> > > > Might as well allow any scalar then...
> > > 
> > > I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> > > would necessarily be two extra separate statements and thus subject
> > > to CSE obfuscating it enough for RTL expansion to no longer notice it.
> > 
> > I also thought about having a specialized expression like
> > 
> > VEC_INDEXED__EXPR < arg0, arg1, arg2, index> 
> > 
> > to mean:
> > 
> > arg0 = arg1  arg2[index]
> > 
> > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
> > is handled in expr.c. But I dropped this idea since we may need to introduce
> > many such nodes.
> > 
> > > 
> > > That said, allowing mixed scalar/vector ops isn't very nice and
> > > your scheme can be simplified by just using
> > > 
> > >   vc:V4SI = VEC_DUPLICATE_EXPR <...>
> > >   va:V4SI = vb:V4SI  vc:V4SI
> > > 
> > > where the expander only has to see that vc:V4SI is defined by
> > > a duplicate.
> > 
> > I did try out something like this quickly before I posted this RFC, though
> > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())
> > 
> > for: 
> > 
> >   for(i=0;i<8;i++)
> > a[i] = b[2] * c[i];
> > 
> > I could generate:
> > 
> >   ...
> >   :
> >   _88 = prolog_loop_adjusted_niters.6_60 * 4;
> >   vectp_c.13_87 = c_10(D) + _88;
> >   vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; 
> >   vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); 
> >   _96 = prolog_loop_adjusted_niters.6_60 * 4;
> >   vectp_a.19_95 = a_6(D) + _96;
> >   vect__12.14_115 = MEM[(int *)vectp_c.13_87];
> >   vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93;   
> >   MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; 
> >   vectp_c.12_118 = vectp_c.13_87 + 16;
> >   vectp_a.18_119 = vectp_a.19_95 + 16;
> >   ivtmp_120 = 1;
> >   if (ivtmp_120 < bnd.8_62)
> > goto ;
> >   else
> > goto ;
> > 
> >   :
> >   # vectp_c.12_89 = PHI 
> >   # vectp_a.18_97 = PHI 
> >   # ivtmp_14 = PHI 
> >   vect__12.14_91 = MEM[(int *)vectp_c.12_89];  
> >   vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; 
> >   MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
> >   ...
> > 
> > It's a crude implementation so VEC_DUP is printed as:
> > 
> >   (vect_ldidx_.16_92) <<< ??? >>> (0);
> > 
> > 
> > > > >  ...
> > > > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
> > > > >  ...
> > > > >
> > > > > For the sake of explanation, the above two representations assumes 
> > > > > that
> > > > > c[0:3

Re: [RFC] Vectorization of indexed elements

2013-09-30 Thread Vidya Praveen

On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> [...]
> > > > I can't really insist on the single lane load.. something like:
> > > > 
> > > > vc:V4SI[0] = c
> > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > va:V4SI = vb:V4SI  vt:V4SI
> > > > 
> > > > Or is there any other way to do this?
> > > 
> > > Can you elaborate on "I can't really insist on the single lane load"?
> > > What's the single lane load in your example? 
> > 
> > Loading just one lane of the vector like this:
> > 
> > vc:V4SI[0] = c // from the above scalar example
> > 
> > or 
> > 
> > vc:V4SI[0] = c[2] 
> > 
> > is what I meant by single lane load. In this example:
> > 
> > t = c[2] 
> > ...
> > vb:v4si = b[0:3] 
> > vc:v4si = { t, t, t, t }
> > va:v4si = vb:v4si  vc:v4si 
> > 
> > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> > seen as vec_select:SI (vect_t 0) ). 
> > 
> > > I'd expect the instruction
> > > pattern as quoted to just work (and I hope we expand an uniform
> > > constructor { a, a, a, a } properly using vec_duplicate).
> > 
> > As much as I went through the code, this is only done using vect_init. It is
> > not expanded as vec_duplicate from, for example, store_constructor() of 
> > expr.c
> 
> Do you see any issues if we expand such constructor as vec_duplicate directly 
> instead of going through vect_init way? 

Sorry, that was a bad question.

But here's what I would like to propose as a first step. Please tell me if this
is acceptable or if it makes sense:

- Introduce standard pattern names 

"vmulim4" - vector muliply with second operand as indexed operand

Example:

(define_insn "vmuliv4si4"
   [set (match_operand:V4SI 0 "register_operand")
(mul:V4SI (match_operand:V4SI 1 "register_operand")
  (vec_duplicate:V4SI
(vec_select:SI
  (match_operand:V4SI 2 "register_operand")
  (match_operand:V4SI 3 "immediate_operand)]
 ...
)

"vlmovmn3" - move where one of the operands is specific lane of a vector and 
 other is a scalar. 

Example:

(define_insn "vlmovv4sisi3"
  [set (vec_select:SI (match_operand:V4SI 0 "register_operand")
  (match_operand:SI 1 "immediate_operand"))
   (match_operand:SI 2 "memory_operand")]
  ...
)

- Identify the following idiom and expand through the above standard patterns:

  t = c[m] 
  vc[0:n] = { t, t, t, t}
  a[0:n] = b[0:n] * vc[0:n] 

as 

 (insn (set (vec_select:SI (reg:V4SI 0) 0) (mem:SI ... )))
 (insn (set (reg:V4SI 1)
(mult:V4SI (reg:V4SI 2)
   (vec_duplicate:V4SI (vec_select:SI (reg:V4SI 0) 0)

If this path is acceptable, then I can extend this to support 

"vmaddim4" - multiply and add (with indexed element as multiplier)
"vmsubim4" - multiply and subtract (with indexed element as multiplier)

Please let me know your thoughts.

Cheers
VP

Re: [RFC] Vectorization of indexed elements

2013-09-27 Thread Vidya Praveen

On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
[...]
> > > I can't really insist on the single lane load.. something like:
> > > 
> > > vc:V4SI[0] = c
> > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > va:V4SI = vb:V4SI  vt:V4SI
> > > 
> > > Or is there any other way to do this?
> > 
> > Can you elaborate on "I can't really insist on the single lane load"?
> > What's the single lane load in your example? 
> 
> Loading just one lane of the vector like this:
> 
> vc:V4SI[0] = c // from the above scalar example
> 
> or 
> 
> vc:V4SI[0] = c[2] 
> 
> is what I meant by single lane load. In this example:
> 
> t = c[2] 
> ...
> vb:v4si = b[0:3] 
> vc:v4si = { t, t, t, t }
> va:v4si = vb:v4si  vc:v4si 
> 
> If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
> insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
> seen as vec_select:SI (vect_t 0) ). 
> 
> > I'd expect the instruction
> > pattern as quoted to just work (and I hope we expand an uniform
> > constructor { a, a, a, a } properly using vec_duplicate).
> 
> As much as I went through the code, this is only done using vect_init. It is
> not expanded as vec_duplicate from, for example, store_constructor() of expr.c

Do you see any issues if we expand such constructor as vec_duplicate directly 
instead of going through vect_init way? 

VP

Re: [RFC] Vectorization of indexed elements

2013-09-27 Thread Vidya Praveen

On Wed, Sep 25, 2013 at 10:24:56AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
> 
> > On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > Hello,
> > > >
> > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > could take advantage of the SIMD instructions that allows indexed 
> > > > element
> > > > as an operand thus reducing the need for duplication and possibly 
> > > > improve
> > > > reuse of previously loaded data.
> > > >
> > > > Appreciate your opinion on this.
> > > >
> > > > ---
> > > >
> > > > A phrase like this:
> > > >
> > > > for(i=0;i<4;i++)
> > > >   a[i] = b[i]  c[2];
> > > >
> > > > is usually vectorized as:
> > > >
> > > >  va:V4SI = a[0:3]
> > > >  vb:V4SI = b[0:3]
> > > >  t = c[2]
> > > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at 
> > > > vec_init
> > > >  ...
> > > >  va:V4SI = vb:V4SI  vc:V4SI
> > > >
> > > > But this could be simplified further if a target has instructions that 
> > > > support
> > > > indexed element as a parameter. For example an instruction like this:
> > > >
> > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > >
> > > > can perform multiplication of each element of v2.4s with the third 
> > > > element of
> > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > elements of v0.4s.
> > > >
> > > > For this to happen, vectorizer needs to understand this idiom and treat 
> > > > the
> > > > operand c[2] specially (and by taking in to consideration if the machine
> > > > supports indexed element as an operand for  through a target hook 
> > > > or macro)
> > > > and consider this as vectorizable statement without having to duplicate 
> > > > the
> > > > elements explicitly.
> > > >
> > > > There are fews ways this could be represented at gimple:
> > > >
> > > >  ...
> > > >  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  ...
> > > >
> > > > or by allowing a vectorizer treat an indexed element as a valid operand 
> > > > in a
> > > > vectorizable statement:
> > > 
> > > Might as well allow any scalar then...
> > 
> > Yes, I had given an example below.
> > 
> > > 
> > > >  ...
> > > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
> > > >  ...
> > > >
> > > > For the sake of explanation, the above two representations assumes that
> > > > c[0:3] is loaded in vc for some other use and reused here. But when 
> > > > c[2] is the
> > > > only use of 'c' then it may be safer to just load one element and use 
> > > > it like
> > > > this:
> > > >
> > > >  vc:V4SI[0] = c[2]
> > > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > This could also mean that expressions involving scalar could be treated
> > > > similarly. For example,
> > > >
> > > >  for(i=0;i<4;i++)
> > > >a[i] = b[i]  c
> > > >
> > > > could be vectorized as:
> > > >
> > > >  vc:V4SI[0] = c
> > > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > Such a change would also require new standard pattern names to be 
> > > > defined for
> > > > each .
> > > >
> > > > Alternatively, having something like this:
> > > >
> > > >  ...
> > > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  va:V4SI = vb:V4SI  vt:V4SI
> > > >  ...
> > > >
> > > > would remove the need to introduce several new standard pattern names 
> > > > but have
> > > > just one to represent vec_duplicate(vec_select()) but ofcourse this 
> > > > will expect
> > > > the target to have combiner patterns.
> > > 
> > > The cost estimation wouldn't be very good, but aren't combine patterns 
> > > enough for the whole thing? Don't you model your mul instruction as:
> > > 
> > > (mult:V4SI
> > >(match_operand:V4SI)
> > >(vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI
> > > 
> > > anyway? Seems that combine should be able to handle it. What currently 
> > > happens that we fail to generate the right instruction?
> > 
> > At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> > I can't really insist on the single lane load.. something like:
> > 
> > vc:V4SI[0] = c
> > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > va:V4SI = vb:V4SI  vt:V4SI
> > 
> > Or is there any other way to do this?
> 
> Can you elaborate on "I can't really insist on the single lane load"?
> What's the single lane load in your example? 

Loading just one lane of the vector like this:

vc:V4SI[0] = c // from the above scalar example

or 

vc:V4SI[0] = c[2] 

is what I meant by single lane load. In this example:

t = c[2] 
...
vb:v4si = b[0:3] 
vc:v4si = { t, t, t, t }
va:v4si = vb:v4si  vc:v4si 

If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be 
seen as vec_select:SI (vect_t 0) ). 

> I'd expect the instruction
> pattern as quoted to just work (and I hope we expand an uniform

Re: [RFC] Vectorization of indexed elements

2013-09-25 Thread Richard Biener

On Tue, 24 Sep 2013, Vidya Praveen wrote:

> On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > 
> > > Hello,
> > >
> > > This post details some thoughts on an enhancement to the vectorizer that
> > > could take advantage of the SIMD instructions that allows indexed element
> > > as an operand thus reducing the need for duplication and possibly improve
> > > reuse of previously loaded data.
> > >
> > > Appreciate your opinion on this.
> > >
> > > ---
> > >
> > > A phrase like this:
> > >
> > > for(i=0;i<4;i++)
> > >   a[i] = b[i]  c[2];
> > >
> > > is usually vectorized as:
> > >
> > >  va:V4SI = a[0:3]
> > >  vb:V4SI = b[0:3]
> > >  t = c[2]
> > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at 
> > > vec_init
> > >  ...
> > >  va:V4SI = vb:V4SI  vc:V4SI
> > >
> > > But this could be simplified further if a target has instructions that 
> > > support
> > > indexed element as a parameter. For example an instruction like this:
> > >
> > >  mul v0.4s, v1.4s, v2.4s[2]
> > >
> > > can perform multiplication of each element of v2.4s with the third 
> > > element of
> > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > elements of v0.4s.
> > >
> > > For this to happen, vectorizer needs to understand this idiom and treat 
> > > the
> > > operand c[2] specially (and by taking in to consideration if the machine
> > > supports indexed element as an operand for  through a target hook or 
> > > macro)
> > > and consider this as vectorizable statement without having to duplicate 
> > > the
> > > elements explicitly.
> > >
> > > There are fews ways this could be represented at gimple:
> > >
> > >  ...
> > >  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  ...
> > >
> > > or by allowing a vectorizer treat an indexed element as a valid operand 
> > > in a
> > > vectorizable statement:
> > 
> > Might as well allow any scalar then...
> 
> Yes, I had given an example below.
> 
> > 
> > >  ...
> > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
> > >  ...
> > >
> > > For the sake of explanation, the above two representations assumes that
> > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] 
> > > is the
> > > only use of 'c' then it may be safer to just load one element and use it 
> > > like
> > > this:
> > >
> > >  vc:V4SI[0] = c[2]
> > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > This could also mean that expressions involving scalar could be treated
> > > similarly. For example,
> > >
> > >  for(i=0;i<4;i++)
> > >a[i] = b[i]  c
> > >
> > > could be vectorized as:
> > >
> > >  vc:V4SI[0] = c
> > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > >
> > > Such a change would also require new standard pattern names to be defined 
> > > for
> > > each .
> > >
> > > Alternatively, having something like this:
> > >
> > >  ...
> > >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  va:V4SI = vb:V4SI  vt:V4SI
> > >  ...
> > >
> > > would remove the need to introduce several new standard pattern names but 
> > > have
> > > just one to represent vec_duplicate(vec_select()) but ofcourse this will 
> > > expect
> > > the target to have combiner patterns.
> > 
> > The cost estimation wouldn't be very good, but aren't combine patterns 
> > enough for the whole thing? Don't you model your mul instruction as:
> > 
> > (mult:V4SI
> >(match_operand:V4SI)
> >(vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI
> > 
> > anyway? Seems that combine should be able to handle it. What currently 
> > happens that we fail to generate the right instruction?
> 
> At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> I can't really insist on the single lane load.. something like:
> 
> vc:V4SI[0] = c
> vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> va:V4SI = vb:V4SI  vt:V4SI
> 
> Or is there any other way to do this?

Can you elaborate on "I can't really insist on the single lane load"?
What's the single lane load in your example?  I'd expect the instruction
pattern as quoted to just work (and I hope we expand an uniform
constructor { a, a, a, a } properly using vec_duplicate).

Richard.

> Cheers
> VP
> 
> > 
> > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
> > for vec_duplicate, adding new nodes is always painful.
> > 
> > > This enhancement could possibly help further optimizing larger scenarios 
> > > such
> > > as linear systems.
> > >
> > > Regards
> > > VP
> > 
> > -- 
> > Marc Glisse
> >
> 
> 
> 

-- 
Richard Biener 
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend

Re: [RFC] Vectorization of indexed elements

2013-09-25 Thread Richard Biener

On Tue, 24 Sep 2013, Vidya Praveen wrote:

> On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> > On Mon, 9 Sep 2013, Marc Glisse wrote:
> > 
> > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > > 
> > > > Hello,
> > > > 
> > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > could take advantage of the SIMD instructions that allows indexed 
> > > > element
> > > > as an operand thus reducing the need for duplication and possibly 
> > > > improve
> > > > reuse of previously loaded data.
> > > > 
> > > > Appreciate your opinion on this.
> > > > 
> > > > ---
> > > > 
> > > > A phrase like this:
> > > > 
> > > > for(i=0;i<4;i++)
> > > >   a[i] = b[i]  c[2];
> > > > 
> > > > is usually vectorized as:
> > > > 
> > > >  va:V4SI = a[0:3]
> > > >  vb:V4SI = b[0:3]
> > > >  t = c[2]
> > > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at 
> > > > vec_init
> > > >  ...
> > > >  va:V4SI = vb:V4SI  vc:V4SI
> > > > 
> > > > But this could be simplified further if a target has instructions that
> > > > support
> > > > indexed element as a parameter. For example an instruction like this:
> > > > 
> > > >  mul v0.4s, v1.4s, v2.4s[2]
> > > > 
> > > > can perform multiplication of each element of v2.4s with the third 
> > > > element
> > > > of
> > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > elements of v0.4s.
> > > > 
> > > > For this to happen, vectorizer needs to understand this idiom and treat 
> > > > the
> > > > operand c[2] specially (and by taking in to consideration if the machine
> > > > supports indexed element as an operand for  through a target hook or
> > > > macro)
> > > > and consider this as vectorizable statement without having to duplicate 
> > > > the
> > > > elements explicitly.
> > > > 
> > > > There are fews ways this could be represented at gimple:
> > > > 
> > > >  ...
> > > >  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > >  ...
> > > > 
> > > > or by allowing a vectorizer treat an indexed element as a valid operand 
> > > > in a
> > > > vectorizable statement:
> > > 
> > > Might as well allow any scalar then...
> > 
> > I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> > would necessarily be two extra separate statements and thus subject
> > to CSE obfuscating it enough for RTL expansion to no longer notice it.
> 
> I also thought about having a specialized expression like
> 
> VEC_INDEXED__EXPR < arg0, arg1, arg2, index> 
> 
> to mean:
> 
> arg0 = arg1  arg2[index]
> 
> and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
> is handled in expr.c. But I dropped this idea since we may need to introduce
> many such nodes.
> 
> > 
> > That said, allowing mixed scalar/vector ops isn't very nice and
> > your scheme can be simplified by just using
> > 
> >   vc:V4SI = VEC_DUPLICATE_EXPR <...>
> >   va:V4SI = vb:V4SI  vc:V4SI
> > 
> > where the expander only has to see that vc:V4SI is defined by
> > a duplicate.
> 
> I did try out something like this quickly before I posted this RFC, though
> I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())
> 
> for: 
> 
>   for(i=0;i<8;i++)
> a[i] = b[2] * c[i];
> 
> I could generate:
> 
>   ...
>   :
>   _88 = prolog_loop_adjusted_niters.6_60 * 4;
>   vectp_c.13_87 = c_10(D) + _88;
>   vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; 
>   vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); 
>   _96 = prolog_loop_adjusted_niters.6_60 * 4;
>   vectp_a.19_95 = a_6(D) + _96;
>   vect__12.14_115 = MEM[(int *)vectp_c.13_87];
>   vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93;   
>   MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; 
>   vectp_c.12_118 = vectp_c.13_87 + 16;
>   vectp_a.18_119 = vectp_a.19_95 + 16;
>   ivtmp_120 = 1;
>   if (ivtmp_120 < bnd.8_62)
> goto ;
>   else
> goto ;
> 
>   :
>   # vectp_c.12_89 = PHI 
>   # vectp_a.18_97 = PHI 
>   # ivtmp_14 = PHI 
>   vect__12.14_91 = MEM[(int *)vectp_c.12_89];  
>   vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; 
>   MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
>   ...
> 
> It's a crude implementation so VEC_DUP is printed as:
> 
>   (vect_ldidx_.16_92) <<< ??? >>> (0);
> 
> 
> > > >  ...
> > > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
> > > >  ...
> > > >
> > > > For the sake of explanation, the above two representations assumes that
> > > > c[0:3] is loaded in vc for some other use and reused here. But when 
> > > > c[2] is
> > > > the
> > > > only use of 'c' then it may be safer to just load one element and use it
> > > > like
> > > > this:
> > > >
> > > >  vc:V4SI[0] = c[2]
> > > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > > > 
> > > > This could also mean that expressions involving scalar could be treated
> > > > similarly. For example,
> >

Re: [RFC] Vectorization of indexed elements

2013-09-24 Thread Vidya Praveen

On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> On Mon, 9 Sep 2013, Vidya Praveen wrote:
> 
> > Hello,
> >
> > This post details some thoughts on an enhancement to the vectorizer that
> > could take advantage of the SIMD instructions that allows indexed element
> > as an operand thus reducing the need for duplication and possibly improve
> > reuse of previously loaded data.
> >
> > Appreciate your opinion on this.
> >
> > ---
> >
> > A phrase like this:
> >
> > for(i=0;i<4;i++)
> >   a[i] = b[i]  c[2];
> >
> > is usually vectorized as:
> >
> >  va:V4SI = a[0:3]
> >  vb:V4SI = b[0:3]
> >  t = c[2]
> >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> >  ...
> >  va:V4SI = vb:V4SI  vc:V4SI
> >
> > But this could be simplified further if a target has instructions that 
> > support
> > indexed element as a parameter. For example an instruction like this:
> >
> >  mul v0.4s, v1.4s, v2.4s[2]
> >
> > can perform multiplication of each element of v2.4s with the third element 
> > of
> > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > elements of v0.4s.
> >
> > For this to happen, vectorizer needs to understand this idiom and treat the
> > operand c[2] specially (and by taking in to consideration if the machine
> > supports indexed element as an operand for  through a target hook or 
> > macro)
> > and consider this as vectorizable statement without having to duplicate the
> > elements explicitly.
> >
> > There are fews ways this could be represented at gimple:
> >
> >  ...
> >  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  ...
> >
> > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > vectorizable statement:
> 
> Might as well allow any scalar then...

Yes, I had given an example below.

> 
> >  ...
> >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
> >  ...
> >
> > For the sake of explanation, the above two representations assumes that
> > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is 
> > the
> > only use of 'c' then it may be safer to just load one element and use it 
> > like
> > this:
> >
> >  vc:V4SI[0] = c[2]
> >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> >
> > This could also mean that expressions involving scalar could be treated
> > similarly. For example,
> >
> >  for(i=0;i<4;i++)
> >a[i] = b[i]  c
> >
> > could be vectorized as:
> >
> >  vc:V4SI[0] = c
> >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> >
> > Such a change would also require new standard pattern names to be defined 
> > for
> > each .
> >
> > Alternatively, having something like this:
> >
> >  ...
> >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  va:V4SI = vb:V4SI  vt:V4SI
> >  ...
> >
> > would remove the need to introduce several new standard pattern names but 
> > have
> > just one to represent vec_duplicate(vec_select()) but ofcourse this will 
> > expect
> > the target to have combiner patterns.
> 
> The cost estimation wouldn't be very good, but aren't combine patterns 
> enough for the whole thing? Don't you model your mul instruction as:
> 
> (mult:V4SI
>(match_operand:V4SI)
>(vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI
> 
> anyway? Seems that combine should be able to handle it. What currently 
> happens that we fail to generate the right instruction?

At vec_init, I can recognize an idiom in order to generate vec_duplicate but
I can't really insist on the single lane load.. something like:

vc:V4SI[0] = c
vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
va:V4SI = vb:V4SI  vt:V4SI

Or is there any other way to do this?

Cheers
VP

> 
> In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
> for vec_duplicate, adding new nodes is always painful.
> 
> > This enhancement could possibly help further optimizing larger scenarios 
> > such
> > as linear systems.
> >
> > Regards
> > VP
> 
> -- 
> Marc Glisse
>

Re: [RFC] Vectorization of indexed elements

2013-09-24 Thread Vidya Praveen

On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote:
> On Mon, 9 Sep 2013, Marc Glisse wrote:
> 
> > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > 
> > > Hello,
> > > 
> > > This post details some thoughts on an enhancement to the vectorizer that
> > > could take advantage of the SIMD instructions that allows indexed element
> > > as an operand thus reducing the need for duplication and possibly improve
> > > reuse of previously loaded data.
> > > 
> > > Appreciate your opinion on this.
> > > 
> > > ---
> > > 
> > > A phrase like this:
> > > 
> > > for(i=0;i<4;i++)
> > >   a[i] = b[i]  c[2];
> > > 
> > > is usually vectorized as:
> > > 
> > >  va:V4SI = a[0:3]
> > >  vb:V4SI = b[0:3]
> > >  t = c[2]
> > >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at 
> > > vec_init
> > >  ...
> > >  va:V4SI = vb:V4SI  vc:V4SI
> > > 
> > > But this could be simplified further if a target has instructions that
> > > support
> > > indexed element as a parameter. For example an instruction like this:
> > > 
> > >  mul v0.4s, v1.4s, v2.4s[2]
> > > 
> > > can perform multiplication of each element of v2.4s with the third element
> > > of
> > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > elements of v0.4s.
> > > 
> > > For this to happen, vectorizer needs to understand this idiom and treat 
> > > the
> > > operand c[2] specially (and by taking in to consideration if the machine
> > > supports indexed element as an operand for  through a target hook or
> > > macro)
> > > and consider this as vectorizable statement without having to duplicate 
> > > the
> > > elements explicitly.
> > > 
> > > There are fews ways this could be represented at gimple:
> > > 
> > >  ...
> > >  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > >  ...
> > > 
> > > or by allowing a vectorizer treat an indexed element as a valid operand 
> > > in a
> > > vectorizable statement:
> > 
> > Might as well allow any scalar then...
> 
> I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
> would necessarily be two extra separate statements and thus subject
> to CSE obfuscating it enough for RTL expansion to no longer notice it.

I also thought about having a specialized expression like

VEC_INDEXED__EXPR < arg0, arg1, arg2, index> 

to mean:

arg0 = arg1  arg2[index]

and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR
is handled in expr.c. But I dropped this idea since we may need to introduce
many such nodes.

> 
> That said, allowing mixed scalar/vector ops isn't very nice and
> your scheme can be simplified by just using
> 
>   vc:V4SI = VEC_DUPLICATE_EXPR <...>
>   va:V4SI = vb:V4SI  vc:V4SI
> 
> where the expander only has to see that vc:V4SI is defined by
> a duplicate.

I did try out something like this quickly before I posted this RFC, though
I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select())

for: 

  for(i=0;i<8;i++)
a[i] = b[2] * c[i];

I could generate:

  ...
  :
  _88 = prolog_loop_adjusted_niters.6_60 * 4;
  vectp_c.13_87 = c_10(D) + _88;
  vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; 
  vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); 
  _96 = prolog_loop_adjusted_niters.6_60 * 4;
  vectp_a.19_95 = a_6(D) + _96;
  vect__12.14_115 = MEM[(int *)vectp_c.13_87];
  vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93;   
  MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; 
  vectp_c.12_118 = vectp_c.13_87 + 16;
  vectp_a.18_119 = vectp_a.19_95 + 16;
  ivtmp_120 = 1;
  if (ivtmp_120 < bnd.8_62)
goto ;
  else
goto ;

  :
  # vectp_c.12_89 = PHI 
  # vectp_a.18_97 = PHI 
  # ivtmp_14 = PHI 
  vect__12.14_91 = MEM[(int *)vectp_c.12_89];  
  vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; 
  MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94;
  ...

It's a crude implementation so VEC_DUP is printed as:

  (vect_ldidx_.16_92) <<< ??? >>> (0);


> > >  ...
> > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
> > >  ...
> > >
> > > For the sake of explanation, the above two representations assumes that
> > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] 
> > > is
> > > the
> > > only use of 'c' then it may be safer to just load one element and use it
> > > like
> > > this:
> > >
> > >  vc:V4SI[0] = c[2]
> > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > > 
> > > This could also mean that expressions involving scalar could be treated
> > > similarly. For example,
> > > 
> > >  for(i=0;i<4;i++)
> > >a[i] = b[i]  c
> > > 
> > > could be vectorized as:
> > > 
> > >  vc:V4SI[0] = c
> > >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > > 
> > > Such a change would also require new standard pattern names to be defined
> > > for
> > > each .
> > > 
> > > Alternatively, having something like this:
> > > 
> > >  ...
> > >  vt:V4S

Re: [RFC] Vectorization of indexed elements

2013-09-10 Thread Richard Biener

On Mon, 9 Sep 2013, Marc Glisse wrote:

> On Mon, 9 Sep 2013, Vidya Praveen wrote:
> 
> > Hello,
> > 
> > This post details some thoughts on an enhancement to the vectorizer that
> > could take advantage of the SIMD instructions that allows indexed element
> > as an operand thus reducing the need for duplication and possibly improve
> > reuse of previously loaded data.
> > 
> > Appreciate your opinion on this.
> > 
> > ---
> > 
> > A phrase like this:
> > 
> > for(i=0;i<4;i++)
> >   a[i] = b[i]  c[2];
> > 
> > is usually vectorized as:
> > 
> >  va:V4SI = a[0:3]
> >  vb:V4SI = b[0:3]
> >  t = c[2]
> >  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
> >  ...
> >  va:V4SI = vb:V4SI  vc:V4SI
> > 
> > But this could be simplified further if a target has instructions that
> > support
> > indexed element as a parameter. For example an instruction like this:
> > 
> >  mul v0.4s, v1.4s, v2.4s[2]
> > 
> > can perform multiplication of each element of v2.4s with the third element
> > of
> > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > elements of v0.4s.
> > 
> > For this to happen, vectorizer needs to understand this idiom and treat the
> > operand c[2] specially (and by taking in to consideration if the machine
> > supports indexed element as an operand for  through a target hook or
> > macro)
> > and consider this as vectorizable statement without having to duplicate the
> > elements explicitly.
> > 
> > There are fews ways this could be represented at gimple:
> > 
> >  ...
> >  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  ...
> > 
> > or by allowing a vectorizer treat an indexed element as a valid operand in a
> > vectorizable statement:
> 
> Might as well allow any scalar then...

I agree.  The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form
would necessarily be two extra separate statements and thus subject
to CSE obfuscating it enough for RTL expansion to no longer notice it.

That said, allowing mixed scalar/vector ops isn't very nice and
your scheme can be simplified by just using

  vc:V4SI = VEC_DUPLICATE_EXPR <...>
  va:V4SI = vb:V4SI  vc:V4SI

where the expander only has to see that vc:V4SI is defined by
a duplicate.

> >  ...
> >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
> >  ...
> > 
> > For the sake of explanation, the above two representations assumes that
> > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is
> > the
> > only use of 'c' then it may be safer to just load one element and use it
> > like
> > this:
> > 
> >  vc:V4SI[0] = c[2]
> >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > 
> > This could also mean that expressions involving scalar could be treated
> > similarly. For example,
> > 
> >  for(i=0;i<4;i++)
> >a[i] = b[i]  c
> > 
> > could be vectorized as:
> > 
> >  vc:V4SI[0] = c
> >  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
> > 
> > Such a change would also require new standard pattern names to be defined
> > for
> > each .
> > 
> > Alternatively, having something like this:
> > 
> >  ...
> >  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> >  va:V4SI = vb:V4SI  vt:V4SI
> >  ...
> > 
> > would remove the need to introduce several new standard pattern names but
> > have
> > just one to represent vec_duplicate(vec_select()) but ofcourse this will
> > expect
> > the target to have combiner patterns.
> 
> The cost estimation wouldn't be very good, but aren't combine patterns enough
> for the whole thing? Don't you model your mul instruction as:
> 
> (mult:V4SI
>   (match_operand:V4SI)
>   (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI
> 
> anyway? Seems that combine should be able to handle it. What currently happens
> that we fail to generate the right instruction?
> 
> In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for
> vec_duplicate, adding new nodes is always painful.

True, though CONSTRUCTOR isn't a good vec_duplicate primitive.  But yes,
we have it that way at the moment and indeed adding new nodes is always
painful.

> > This enhancement could possibly help further optimizing larger scenarios
> > such
> > as linear systems.

Given that the vectorizer already handles all cases you quote but
just the expansion doesn't use the targets special abilities - can't
you just teach the expander to lookup the definition of the
vectors and see if it is an uniform CONSTRUCTOR?

Richard.

Re: [RFC] Vectorization of indexed elements

2013-09-09 Thread Marc Glisse


On Mon, 9 Sep 2013, Vidya Praveen wrote:


Hello,

This post details some thoughts on an enhancement to the vectorizer that
could take advantage of the SIMD instructions that allows indexed element
as an operand thus reducing the need for duplication and possibly improve
reuse of previously loaded data.

Appreciate your opinion on this.

---

A phrase like this:

for(i=0;i<4;i++)
  a[i] = b[i]  c[2];

is usually vectorized as:

 va:V4SI = a[0:3]
 vb:V4SI = b[0:3]
 t = c[2]
 vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
 ...
 va:V4SI = vb:V4SI  vc:V4SI

But this could be simplified further if a target has instructions that support
indexed element as a parameter. For example an instruction like this:

 mul v0.4s, v1.4s, v2.4s[2]

can perform multiplication of each element of v2.4s with the third element of
v2.4s (specified as v2.4s[2]) and store the results in the corresponding
elements of v0.4s.

For this to happen, vectorizer needs to understand this idiom and treat the
operand c[2] specially (and by taking in to consideration if the machine
supports indexed element as an operand for  through a target hook or macro)
and consider this as vectorizable statement without having to duplicate the
elements explicitly.

There are fews ways this could be represented at gimple:

 ...
 va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
 ...

or by allowing a vectorizer treat an indexed element as a valid operand in a
vectorizable statement:


Might as well allow any scalar then...


 ...
 va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
 ...

For the sake of explanation, the above two representations assumes that
c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
only use of 'c' then it may be safer to just load one element and use it like
this:

 vc:V4SI[0] = c[2]
 va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)

This could also mean that expressions involving scalar could be treated
similarly. For example,

 for(i=0;i<4;i++)
   a[i] = b[i]  c

could be vectorized as:

 vc:V4SI[0] = c
 va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)

Such a change would also require new standard pattern names to be defined for
each .

Alternatively, having something like this:

 ...
 vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
 va:V4SI = vb:V4SI  vt:V4SI
 ...

would remove the need to introduce several new standard pattern names but have
just one to represent vec_duplicate(vec_select()) but ofcourse this will expect
the target to have combiner patterns.


The cost estimation wouldn't be very good, but aren't combine patterns 
enough for the whole thing? Don't you model your mul instruction as:


(mult:V4SI
  (match_operand:V4SI)
  (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI

anyway? Seems that combine should be able to handle it. What currently 
happens that we fail to generate the right instruction?


In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
for vec_duplicate, adding new nodes is always painful.



This enhancement could possibly help further optimizing larger scenarios such
as linear systems.

Regards
VP


--
Marc Glisse

[RFC] Vectorization of indexed elements

2013-09-09 Thread Vidya Praveen

Hello,

This post details some thoughts on an enhancement to the vectorizer that 
could take advantage of the SIMD instructions that allows indexed element
as an operand thus reducing the need for duplication and possibly improve
reuse of previously loaded data.

Appreciate your opinion on this. 

--- 

A phrase like this:

 for(i=0;i<4;i++)
   a[i] = b[i]  c[2];

is usually vectorized as:

  va:V4SI = a[0:3]
  vb:V4SI = b[0:3]
  t = c[2]
  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
  ...
  va:V4SI = vb:V4SI  vc:V4SI

But this could be simplified further if a target has instructions that support
indexed element as a parameter. For example an instruction like this:

  mul v0.4s, v1.4s, v2.4s[2]

can perform multiplication of each element of v2.4s with the third element of
v2.4s (specified as v2.4s[2]) and store the results in the corresponding 
elements of v0.4s. 

For this to happen, vectorizer needs to understand this idiom and treat the 
operand c[2] specially (and by taking in to consideration if the machine 
supports indexed element as an operand for  through a target hook or macro)
and consider this as vectorizable statement without having to duplicate the 
elements explicitly. 

There are fews ways this could be represented at gimple:

  ...
  va:V4SI = vb:V4SI  VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  ...

or by allowing a vectorizer treat an indexed element as a valid operand in a 
vectorizable statement:

  ...
  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 2)
  ...

For the sake of explanation, the above two representations assumes that 
c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
only use of 'c' then it may be safer to just load one element and use it like
this:

  vc:V4SI[0] = c[2]
  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)

This could also mean that expressions involving scalar could be treated 
similarly. For example,

  for(i=0;i<4;i++)
a[i] = b[i]  c

could be vectorized as:

  vc:V4SI[0] = c
  va:V4SI = vb:V4SI  VEC_SELECT_EXPR (vc:V4SI 0)
  
Such a change would also require new standard pattern names to be defined for
each .

Alternatively, having something like this:

  ...
  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  va:V4SI = vb:V4SI  vt:V4SI
  ...

would remove the need to introduce several new standard pattern names but have
just one to represent vec_duplicate(vec_select()) but ofcourse this will expect
the target to have combiner patterns.

This enhancement could possibly help further optimizing larger scenarios such 
as linear systems.

Regards
VP

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

Re: [RFC] Vectorization of indexed elements

[RFC] Vectorization of indexed elements

20 matches

Site Navigation

Mail list logo

Footer information