On Wed, Sep 25, 2013 at 10:24:56AM +0100, Richard Biener wrote:
> On Tue, 24 Sep 2013, Vidya Praveen wrote:
>
> > On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote:
> > > On Mon, 9 Sep 2013, Vidya Praveen wrote:
> > >
> > > > Hello,
> > > >
> > > > This post details some thoughts on an enhancement to the vectorizer that
> > > > could take advantage of the SIMD instructions that allows indexed
> > > > element
> > > > as an operand thus reducing the need for duplication and possibly
> > > > improve
> > > > reuse of previously loaded data.
> > > >
> > > > Appreciate your opinion on this.
> > > >
> > > > ---
> > > >
> > > > A phrase like this:
> > > >
> > > > for(i=0;i<4;i++)
> > > > a[i] = b[i] <op> c[2];
> > > >
> > > > is usually vectorized as:
> > > >
> > > > va:V4SI = a[0:3]
> > > > vb:V4SI = b[0:3]
> > > > t = c[2]
> > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at
> > > > vec_init
> > > > ...
> > > > va:V4SI = vb:V4SI <op> vc:V4SI
> > > >
> > > > But this could be simplified further if a target has instructions that
> > > > support
> > > > indexed element as a parameter. For example an instruction like this:
> > > >
> > > > mul v0.4s, v1.4s, v2.4s[2]
> > > >
> > > > can perform multiplication of each element of v2.4s with the third
> > > > element of
> > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding
> > > > elements of v0.4s.
> > > >
> > > > For this to happen, vectorizer needs to understand this idiom and treat
> > > > the
> > > > operand c[2] specially (and by taking in to consideration if the machine
> > > > supports indexed element as an operand for <op> through a target hook
> > > > or macro)
> > > > and consider this as vectorizable statement without having to duplicate
> > > > the
> > > > elements explicitly.
> > > >
> > > > There are fews ways this could be represented at gimple:
> > > >
> > > > ...
> > > > va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > > ...
> > > >
> > > > or by allowing a vectorizer treat an indexed element as a valid operand
> > > > in a
> > > > vectorizable statement:
> > >
> > > Might as well allow any scalar then...
> >
> > Yes, I had given an example below.
> >
> > >
> > > > ...
> > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
> > > > ...
> > > >
> > > > For the sake of explanation, the above two representations assumes that
> > > > c[0:3] is loaded in vc for some other use and reused here. But when
> > > > c[2] is the
> > > > only use of 'c' then it may be safer to just load one element and use
> > > > it like
> > > > this:
> > > >
> > > > vc:V4SI[0] = c[2]
> > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > This could also mean that expressions involving scalar could be treated
> > > > similarly. For example,
> > > >
> > > > for(i=0;i<4;i++)
> > > > a[i] = b[i] <op> c
> > > >
> > > > could be vectorized as:
> > > >
> > > > vc:V4SI[0] = c
> > > > va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
> > > >
> > > > Such a change would also require new standard pattern names to be
> > > > defined for
> > > > each <op>.
> > > >
> > > > Alternatively, having something like this:
> > > >
> > > > ...
> > > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
> > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > ...
> > > >
> > > > would remove the need to introduce several new standard pattern names
> > > > but have
> > > > just one to represent vec_duplicate(vec_select()) but ofcourse this
> > > > will expect
> > > > the target to have combiner patterns.
> > >
> > > The cost estimation wouldn't be very good, but aren't combine patterns
> > > enough for the whole thing? Don't you model your mul instruction as:
> > >
> > > (mult:V4SI
> > > (match_operand:V4SI)
> > > (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI))))
> > >
> > > anyway? Seems that combine should be able to handle it. What currently
> > > happens that we fail to generate the right instruction?
> >
> > At vec_init, I can recognize an idiom in order to generate vec_duplicate but
> > I can't really insist on the single lane load.. something like:
> >
> > vc:V4SI[0] = c
> > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > va:V4SI = vb:V4SI <op> vt:V4SI
> >
> > Or is there any other way to do this?
>
> Can you elaborate on "I can't really insist on the single lane load"?
> What's the single lane load in your example?
Loading just one lane of the vector like this:
vc:V4SI[0] = c // from the above scalar example
or
vc:V4SI[0] = c[2]
is what I meant by single lane load. In this example:
t = c[2]
...
vb:v4si = b[0:3]
vc:v4si = { t, t, t, t }
va:v4si = vb:v4si <op> vc:v4si
If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot
insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be
seen as vec_select:SI (vect_t 0) ).
> I'd expect the instruction
> pattern as quoted to just work (and I hope we expand an uniform
> constructor { a, a, a, a } properly using vec_duplicate).
As much as I went through the code, this is only done using vect_init. It is
not expanded as vec_duplicate from, for example, store_constructor() of expr.c
VP
>
> Richard.
>
> > Cheers
> > VP
> >
> > >
> > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR
> > > for vec_duplicate, adding new nodes is always painful.
> > >
> > > > This enhancement could possibly help further optimizing larger
> > > > scenarios such
> > > > as linear systems.
> > > >
> > > > Regards
> > > > VP
> > >
> > > --
> > > Marc Glisse
> > >
> >
> >
> >
>
> --
> Richard Biener <[email protected]>
> SUSE / SUSE Labs
> SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
> GF: Jeff Hawn, Jennifer Guild, Felix Imend
>