Re: [RFC] Vectorization of indexed elements
On Wed, 4 Dec 2013, Vidya Praveen wrote: > Hi Richi, > > Apologies for the late response. I was on vacation. > > On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote: > > > void > > > foo (int *__restrict__ a, > > > int *__restrict__ b, > > > int c) > > > { > > > int i; > > > > > > for (i = 0; i < 8; i++) > > > a[i] = b[i] * c; > > > } > > > > Both cases can be handled by patterns that match > > > > (mul:VXSI (reg:VXSI > > (vec_duplicate:VXSI reg:SI))) > > How do I arrive at this pattern in the first place? Assuming vec_init with > uniform values are expanded as vec_duplicate, it will still be two > expressions. > > That is, > > (set reg:VXSI (vec_duplicate:VXSI (reg:SI))) > (set reg:VXSI (mul:VXSI (reg:VXSI) (reg:VXSI))) Yes, but then combine comes along and creates (set reg:VXSI (mul:VXSI (reg:VXSI (vec_duplicate:VXSI (reg:SI) which matches one of your define_insn[_and_split]s. > > You'd then "consume" the vec_duplicate and implement it as > > load scalar into element zero of the vector and use index mult > > with index zero. > > If I understand this correctly, you are suggesting to leave the scalar > load from memory as it is but treat the > > (mul:VXSI (reg:VXSI (vec_duplicate:VXSI reg:SI))) > > as > > load reg:VXSI[0], reg:SI > mul reg:VXSI, reg:VXSI, re:VXSI[0] // by reusing the destination register > perhaps > > either by generating instructions directly or by using define_split. Am I > right? Possibly. Or allow memory as operand 2 for your pattern (so, not reg:SI but mem:SI). Combine should be happy with that, too. > If I'm right, then my concern is that it may be possible to simplify this > further > by loading directly to a indexed vector register from memory but it's too > late at > this point for such simplification to be possible. > > Please let me know what am I not understanding. Not sure. Did you try it? Richard.
Re: [RFC] Vectorization of indexed elements
Hi Jakub, Apologies for the late response. On Fri, Oct 11, 2013 at 04:05:24PM +0100, Jakub Jelinek wrote: > On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote: > > Here's a compilable example: > > > > void > > foo (int *__restrict__ a, > > int *__restrict__ b, > > int *__restrict__ c) > > { > > int i; > > > > for (i = 0; i < 8; i++) > > a[i] = b[i] * c[2]; > > } > > > > This is vectorized by duplicating c[2] now. But I'm trying to take advantage > > of target instructions that can take a vector register as second argument > > but > > use only one element (by using the same value for all the lanes) of the > > vector register. > > > > Eg. mul , , [index] > > mla , , [index] // multiply and add > > > > But for a loop like the one in the C example given, I will have to load the > > c[2] in one element of the vector register (leaving the remaining unused) > > rather. This is why I was proposing to load just one element in a vector > > register (what I meant as "lane specific load"). The benefit of doing this > > is > > that we avoid explicit duplication, however such a simplification can only > > be done where such support is available - the reason why I was thinking in > > terms of optional standard pattern name. Another benefit is we will also be > > able to support scalars in the expression like in the following example: > > > > void > > foo (int *__restrict__ a, > > int *__restrict__ b, > > int c) > > { > > int i; > > > > for (i = 0; i < 8; i++) > > a[i] = b[i] * c; > > } > > So just during combine let the broadcast operation be combined with the > arithmetics? Yes. I can do that. But I always want it to be possible to recognize and load directly to the indexed vector register from memory. > Intel AVX512 ISA has similar feature, not sure what exactly > they are doing for this. Thanks. I'll try to go through the code to understand. > That said, the broadcast is likely going to be > hoisted before the loop, and in that case is it really cheaper to have > it unbroadcasted in a vector register rather than to broadcast it before the > loop and just use there? Could you explain what do you mean by unbroadcast? The constructor needs to be expanded in one way or another, isn't it? I thought expanding to vec_duplicate when the values are uniform is the most efficient when vec_duplicate could be supported by the target. If you had meant that each element of vector is loaded separately, I am thinking how can I combine such an operation with the arithmetic operation. Thanks VP.
Re: [RFC] Vectorization of indexed elements
Hi Richi, Apologies for the late response. I was on vacation. On Mon, Oct 14, 2013 at 09:04:58AM +0100, Richard Biener wrote: > On Fri, 11 Oct 2013, Vidya Praveen wrote: > > > On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote: > > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > > > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote: > > > > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > > > > > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > > > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > > > > > > > [...] > > > > > > > > > > I can't really insist on the single lane load.. something > > > > > > > > > > like: > > > > > > > > > > > > > > > > > > > > vc:V4SI[0] = c > > > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > > > > > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > > > > > > > > > > > > > > > Or is there any other way to do this? > > > > > > > > > > > > > > > > > > Can you elaborate on "I can't really insist on the single > > > > > > > > > lane load"? > > > > > > > > > What's the single lane load in your example? > > > > > > > > > > > > > > > > Loading just one lane of the vector like this: > > > > > > > > > > > > > > > > vc:V4SI[0] = c // from the above scalar example > > > > > > > > > > > > > > > > or > > > > > > > > > > > > > > > > vc:V4SI[0] = c[2] > > > > > > > > > > > > > > > > is what I meant by single lane load. In this example: > > > > > > > > > > > > > > > > t = c[2] > > > > > > > > ... > > > > > > > > vb:v4si = b[0:3] > > > > > > > > vc:v4si = { t, t, t, t } > > > > > > > > va:v4si = vb:v4si vc:v4si > > > > > > > > > > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at > > > > > > > > vec_init, I cannot > > > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] > > > > > > > > (which could be > > > > > > > > seen as vec_select:SI (vect_t 0) ). > > > > > > > > > > > > > > > > > I'd expect the instruction > > > > > > > > > pattern as quoted to just work (and I hope we expand an > > > > > > > > > uniform > > > > > > > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > > > > > > > > > > > > > As much as I went through the code, this is only done using > > > > > > > > vect_init. It is > > > > > > > > not expanded as vec_duplicate from, for example, > > > > > > > > store_constructor() of expr.c > > > > > > > > > > > > > > Do you see any issues if we expand such constructor as > > > > > > > vec_duplicate directly > > > > > > > instead of going through vect_init way? > > > > > > > > > > > > Sorry, that was a bad question. > > > > > > > > > > > > But here's what I would like to propose as a first step. Please > > > > > > tell me if this > > > > > > is acceptable or if it makes sense: > > > > > > > > > > > > - Introduce standard pattern names > > > > > > > > > > > > "vmulim4" - vector muliply with second operand as indexed operand > > > > > > > > > > > > Example: > > > > > > > > > > > > (define_insn "vmuliv4si4" > > > > > >[set (match_operand:V4SI 0 "register_operand") > > > > > > (mul:V4SI (match_operand:V4SI 1 "register_operand") > > > > > > (vec_duplicate:V4SI > > > > > > (vec_select:SI > > > > > > (match_operand:V4SI 2 "register_operand") > > > > > > (match_operand:V4SI 3 "immediate_operand)] > > > > > > ... > > > > > > ) > > > > > > > > > > We could factor this with providing a standard pattern name for > > > > > > > > > > (define_insn "vdupi" > > > > > [set (match_operand: 0 "register_operand") > > > > >(vec_duplicate: > > > > > (vec_select: > > > > > (match_operand: 1 "register_operand") > > > > > (match_operand:SI 2 "immediate_operand] > > > > > > > > This is good. I did think about this but then I thought of avoiding the > > > > need > > > > for combiner patterns :-) > > > > > > > > But do you find the lane specific mov pattern I proposed, acceptable? > > > > > > The specific mul pattern? As said, consider factoring to vdupi to > > > avoid an explosion in required special optabs. > > > > > > > > (you use V4SI for the immediate? > > > > > > > > Sorry typo again!! It should've been SI. > > > > > > > > > Ideally vdupi has another custom > > > > > mode for the vector index). > > > > > > > > > > Note that this factored pattern is already available as > > > > > vec_perm_const! > > > > > It is simply (vec_perm_const:V4SI > > > > > ). > > > > > > > > > > Which means that on the GIMPLE level we should try to combine > > > > > > > > > > el_4 = BIT_FIELD_REF ; > > > > > v_5 = { el_4, el_4, ... }; > > > > > > > > I don't think we reach this state at all for the scenarios in > > > > discussion. > > > > what we generally have is: > > > > > > > > el_4 = MEM_REF < array + index*size > > > > > v_5 = { el_4, ... } > > > > > > > >
Re: [RFC] Vectorization of indexed elements
On Fri, 11 Oct 2013, Vidya Praveen wrote: > On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote: > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote: > > > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > > > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > > > > > > [...] > > > > > > > > > I can't really insist on the single lane load.. something > > > > > > > > > like: > > > > > > > > > > > > > > > > > > vc:V4SI[0] = c > > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > > > > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > > > > > > > > > > > > > Or is there any other way to do this? > > > > > > > > > > > > > > > > Can you elaborate on "I can't really insist on the single lane > > > > > > > > load"? > > > > > > > > What's the single lane load in your example? > > > > > > > > > > > > > > Loading just one lane of the vector like this: > > > > > > > > > > > > > > vc:V4SI[0] = c // from the above scalar example > > > > > > > > > > > > > > or > > > > > > > > > > > > > > vc:V4SI[0] = c[2] > > > > > > > > > > > > > > is what I meant by single lane load. In this example: > > > > > > > > > > > > > > t = c[2] > > > > > > > ... > > > > > > > vb:v4si = b[0:3] > > > > > > > vc:v4si = { t, t, t, t } > > > > > > > va:v4si = vb:v4si vc:v4si > > > > > > > > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, > > > > > > > I cannot > > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] > > > > > > > (which could be > > > > > > > seen as vec_select:SI (vect_t 0) ). > > > > > > > > > > > > > > > I'd expect the instruction > > > > > > > > pattern as quoted to just work (and I hope we expand an uniform > > > > > > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > > > > > > > > > > > As much as I went through the code, this is only done using > > > > > > > vect_init. It is > > > > > > > not expanded as vec_duplicate from, for example, > > > > > > > store_constructor() of expr.c > > > > > > > > > > > > Do you see any issues if we expand such constructor as > > > > > > vec_duplicate directly > > > > > > instead of going through vect_init way? > > > > > > > > > > Sorry, that was a bad question. > > > > > > > > > > But here's what I would like to propose as a first step. Please tell > > > > > me if this > > > > > is acceptable or if it makes sense: > > > > > > > > > > - Introduce standard pattern names > > > > > > > > > > "vmulim4" - vector muliply with second operand as indexed operand > > > > > > > > > > Example: > > > > > > > > > > (define_insn "vmuliv4si4" > > > > >[set (match_operand:V4SI 0 "register_operand") > > > > > (mul:V4SI (match_operand:V4SI 1 "register_operand") > > > > > (vec_duplicate:V4SI > > > > > (vec_select:SI > > > > > (match_operand:V4SI 2 "register_operand") > > > > > (match_operand:V4SI 3 "immediate_operand)] > > > > > ... > > > > > ) > > > > > > > > We could factor this with providing a standard pattern name for > > > > > > > > (define_insn "vdupi" > > > > [set (match_operand: 0 "register_operand") > > > >(vec_duplicate: > > > > (vec_select: > > > > (match_operand: 1 "register_operand") > > > > (match_operand:SI 2 "immediate_operand] > > > > > > This is good. I did think about this but then I thought of avoiding the > > > need > > > for combiner patterns :-) > > > > > > But do you find the lane specific mov pattern I proposed, acceptable? > > > > The specific mul pattern? As said, consider factoring to vdupi to > > avoid an explosion in required special optabs. > > > > > > (you use V4SI for the immediate? > > > > > > Sorry typo again!! It should've been SI. > > > > > > > Ideally vdupi has another custom > > > > mode for the vector index). > > > > > > > > Note that this factored pattern is already available as vec_perm_const! > > > > It is simply (vec_perm_const:V4SI > > > > ). > > > > > > > > Which means that on the GIMPLE level we should try to combine > > > > > > > > el_4 = BIT_FIELD_REF ; > > > > v_5 = { el_4, el_4, ... }; > > > > > > I don't think we reach this state at all for the scenarios in discussion. > > > what we generally have is: > > > > > > el_4 = MEM_REF < array + index*size > > > > v_5 = { el_4, ... } > > > > > > Or am I missing something? > > > > Well, but in that case I doubt it is profitable (or even valid!) to > > turn this into a vector lane load from the array. If it is profitable > > to perform a vector read (because we're going to use the other elements > > of the vector as well) then the vectorizer should produce a vector > > load and materialize the uniform vector from one of its elements. > > > > Maybe
Re: [RFC] Vectorization of indexed elements
On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote: > Here's a compilable example: > > void > foo (int *__restrict__ a, > int *__restrict__ b, > int *__restrict__ c) > { > int i; > > for (i = 0; i < 8; i++) > a[i] = b[i] * c[2]; > } > > This is vectorized by duplicating c[2] now. But I'm trying to take advantage > of target instructions that can take a vector register as second argument but > use only one element (by using the same value for all the lanes) of the > vector register. > > Eg. mul , , [index] > mla , , [index] // multiply and add > > But for a loop like the one in the C example given, I will have to load the > c[2] in one element of the vector register (leaving the remaining unused) > rather. This is why I was proposing to load just one element in a vector > register (what I meant as "lane specific load"). The benefit of doing this is > that we avoid explicit duplication, however such a simplification can only > be done where such support is available - the reason why I was thinking in > terms of optional standard pattern name. Another benefit is we will also be > able to support scalars in the expression like in the following example: > > void > foo (int *__restrict__ a, > int *__restrict__ b, > int c) > { > int i; > > for (i = 0; i < 8; i++) > a[i] = b[i] * c; > } So just during combine let the broadcast operation be combined with the arithmetics? Intel AVX512 ISA has similar feature, not sure what exactly they are doing for this. That said, the broadcast is likely going to be hoisted before the loop, and in that case is it really cheaper to have it unbroadcasted in a vector register rather than to broadcast it before the loop and just use there? Jakub
Re: [RFC] Vectorization of indexed elements
On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote: > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote: > > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > > > > > [...] > > > > > > > > I can't really insist on the single lane load.. something like: > > > > > > > > > > > > > > > > vc:V4SI[0] = c > > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > > > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > > > > > > > > > > > Or is there any other way to do this? > > > > > > > > > > > > > > Can you elaborate on "I can't really insist on the single lane > > > > > > > load"? > > > > > > > What's the single lane load in your example? > > > > > > > > > > > > Loading just one lane of the vector like this: > > > > > > > > > > > > vc:V4SI[0] = c // from the above scalar example > > > > > > > > > > > > or > > > > > > > > > > > > vc:V4SI[0] = c[2] > > > > > > > > > > > > is what I meant by single lane load. In this example: > > > > > > > > > > > > t = c[2] > > > > > > ... > > > > > > vb:v4si = b[0:3] > > > > > > vc:v4si = { t, t, t, t } > > > > > > va:v4si = vb:v4si vc:v4si > > > > > > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I > > > > > > cannot > > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which > > > > > > could be > > > > > > seen as vec_select:SI (vect_t 0) ). > > > > > > > > > > > > > I'd expect the instruction > > > > > > > pattern as quoted to just work (and I hope we expand an uniform > > > > > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > > > > > > > > > As much as I went through the code, this is only done using > > > > > > vect_init. It is > > > > > > not expanded as vec_duplicate from, for example, > > > > > > store_constructor() of expr.c > > > > > > > > > > Do you see any issues if we expand such constructor as vec_duplicate > > > > > directly > > > > > instead of going through vect_init way? > > > > > > > > Sorry, that was a bad question. > > > > > > > > But here's what I would like to propose as a first step. Please tell me > > > > if this > > > > is acceptable or if it makes sense: > > > > > > > > - Introduce standard pattern names > > > > > > > > "vmulim4" - vector muliply with second operand as indexed operand > > > > > > > > Example: > > > > > > > > (define_insn "vmuliv4si4" > > > >[set (match_operand:V4SI 0 "register_operand") > > > > (mul:V4SI (match_operand:V4SI 1 "register_operand") > > > > (vec_duplicate:V4SI > > > > (vec_select:SI > > > > (match_operand:V4SI 2 "register_operand") > > > > (match_operand:V4SI 3 "immediate_operand)] > > > > ... > > > > ) > > > > > > We could factor this with providing a standard pattern name for > > > > > > (define_insn "vdupi" > > > [set (match_operand: 0 "register_operand") > > >(vec_duplicate: > > > (vec_select: > > > (match_operand: 1 "register_operand") > > > (match_operand:SI 2 "immediate_operand] > > > > This is good. I did think about this but then I thought of avoiding the need > > for combiner patterns :-) > > > > But do you find the lane specific mov pattern I proposed, acceptable? > > The specific mul pattern? As said, consider factoring to vdupi to > avoid an explosion in required special optabs. > > > > (you use V4SI for the immediate? > > > > Sorry typo again!! It should've been SI. > > > > > Ideally vdupi has another custom > > > mode for the vector index). > > > > > > Note that this factored pattern is already available as vec_perm_const! > > > It is simply (vec_perm_const:V4SI ). > > > > > > Which means that on the GIMPLE level we should try to combine > > > > > > el_4 = BIT_FIELD_REF ; > > > v_5 = { el_4, el_4, ... }; > > > > I don't think we reach this state at all for the scenarios in discussion. > > what we generally have is: > > > > el_4 = MEM_REF < array + index*size > > > v_5 = { el_4, ... } > > > > Or am I missing something? > > Well, but in that case I doubt it is profitable (or even valid!) to > turn this into a vector lane load from the array. If it is profitable > to perform a vector read (because we're going to use the other elements > of the vector as well) then the vectorizer should produce a vector > load and materialize the uniform vector from one of its elements. > > Maybe at this point you should show us a compilable C testcase > with a loop that should be vectorized using your instructions in > the end? Here's a compilable example: void foo (int *__restrict__ a, int *__restrict__ b, int *__restrict__ c) { int i; for (i = 0; i < 8; i++) a[i] = b[i] * c[2]; } This is vecto
Re: [RFC] Vectorization of indexed elements
On Mon, 30 Sep 2013, Vidya Praveen wrote: > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote: > > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > > > > [...] > > > > > > > I can't really insist on the single lane load.. something like: > > > > > > > > > > > > > > vc:V4SI[0] = c > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > > > > > > > > > Or is there any other way to do this? > > > > > > > > > > > > Can you elaborate on "I can't really insist on the single lane > > > > > > load"? > > > > > > What's the single lane load in your example? > > > > > > > > > > Loading just one lane of the vector like this: > > > > > > > > > > vc:V4SI[0] = c // from the above scalar example > > > > > > > > > > or > > > > > > > > > > vc:V4SI[0] = c[2] > > > > > > > > > > is what I meant by single lane load. In this example: > > > > > > > > > > t = c[2] > > > > > ... > > > > > vb:v4si = b[0:3] > > > > > vc:v4si = { t, t, t, t } > > > > > va:v4si = vb:v4si vc:v4si > > > > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I > > > > > cannot > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which > > > > > could be > > > > > seen as vec_select:SI (vect_t 0) ). > > > > > > > > > > > I'd expect the instruction > > > > > > pattern as quoted to just work (and I hope we expand an uniform > > > > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > > > > > > > As much as I went through the code, this is only done using > > > > > vect_init. It is > > > > > not expanded as vec_duplicate from, for example, store_constructor() > > > > > of expr.c > > > > > > > > Do you see any issues if we expand such constructor as vec_duplicate > > > > directly > > > > instead of going through vect_init way? > > > > > > Sorry, that was a bad question. > > > > > > But here's what I would like to propose as a first step. Please tell me > > > if this > > > is acceptable or if it makes sense: > > > > > > - Introduce standard pattern names > > > > > > "vmulim4" - vector muliply with second operand as indexed operand > > > > > > Example: > > > > > > (define_insn "vmuliv4si4" > > >[set (match_operand:V4SI 0 "register_operand") > > > (mul:V4SI (match_operand:V4SI 1 "register_operand") > > > (vec_duplicate:V4SI > > > (vec_select:SI > > > (match_operand:V4SI 2 "register_operand") > > > (match_operand:V4SI 3 "immediate_operand)] > > > ... > > > ) > > > > We could factor this with providing a standard pattern name for > > > > (define_insn "vdupi" > > [set (match_operand: 0 "register_operand") > >(vec_duplicate: > > (vec_select: > > (match_operand: 1 "register_operand") > > (match_operand:SI 2 "immediate_operand] > > This is good. I did think about this but then I thought of avoiding the need > for combiner patterns :-) > > But do you find the lane specific mov pattern I proposed, acceptable? The specific mul pattern? As said, consider factoring to vdupi to avoid an explosion in required special optabs. > > (you use V4SI for the immediate? > > Sorry typo again!! It should've been SI. > > > Ideally vdupi has another custom > > mode for the vector index). > > > > Note that this factored pattern is already available as vec_perm_const! > > It is simply (vec_perm_const:V4SI ). > > > > Which means that on the GIMPLE level we should try to combine > > > > el_4 = BIT_FIELD_REF ; > > v_5 = { el_4, el_4, ... }; > > I don't think we reach this state at all for the scenarios in discussion. > what we generally have is: > > el_4 = MEM_REF < array + index*size > > v_5 = { el_4, ... } > > Or am I missing something? Well, but in that case I doubt it is profitable (or even valid!) to turn this into a vector lane load from the array. If it is profitable to perform a vector read (because we're going to use the other elements of the vector as well) then the vectorizer should produce a vector load and materialize the uniform vector from one of its elements. Maybe at this point you should show us a compilable C testcase with a loop that should be vectorized using your instructions in the end? Thanks, Richard.
Re: [RFC] Vectorization of indexed elements
On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote: > On Mon, 30 Sep 2013, Vidya Praveen wrote: > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > > > [...] > > > > > > I can't really insist on the single lane load.. something like: > > > > > > > > > > > > vc:V4SI[0] = c > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > > > > > > > Or is there any other way to do this? > > > > > > > > > > Can you elaborate on "I can't really insist on the single lane load"? > > > > > What's the single lane load in your example? > > > > > > > > Loading just one lane of the vector like this: > > > > > > > > vc:V4SI[0] = c // from the above scalar example > > > > > > > > or > > > > > > > > vc:V4SI[0] = c[2] > > > > > > > > is what I meant by single lane load. In this example: > > > > > > > > t = c[2] > > > > ... > > > > vb:v4si = b[0:3] > > > > vc:v4si = { t, t, t, t } > > > > va:v4si = vb:v4si vc:v4si > > > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I > > > > cannot > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which > > > > could be > > > > seen as vec_select:SI (vect_t 0) ). > > > > > > > > > I'd expect the instruction > > > > > pattern as quoted to just work (and I hope we expand an uniform > > > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > > > > > As much as I went through the code, this is only done using vect_init. > > > > It is > > > > not expanded as vec_duplicate from, for example, store_constructor() of > > > > expr.c > > > > > > Do you see any issues if we expand such constructor as vec_duplicate > > > directly > > > instead of going through vect_init way? > > > > Sorry, that was a bad question. > > > > But here's what I would like to propose as a first step. Please tell me if > > this > > is acceptable or if it makes sense: > > > > - Introduce standard pattern names > > > > "vmulim4" - vector muliply with second operand as indexed operand > > > > Example: > > > > (define_insn "vmuliv4si4" > >[set (match_operand:V4SI 0 "register_operand") > > (mul:V4SI (match_operand:V4SI 1 "register_operand") > > (vec_duplicate:V4SI > > (vec_select:SI > > (match_operand:V4SI 2 "register_operand") > > (match_operand:V4SI 3 "immediate_operand)] > > ... > > ) > > We could factor this with providing a standard pattern name for > > (define_insn "vdupi" > [set (match_operand: 0 "register_operand") >(vec_duplicate: > (vec_select: > (match_operand: 1 "register_operand") > (match_operand:SI 2 "immediate_operand] This is good. I did think about this but then I thought of avoiding the need for combiner patterns :-) But do you find the lane specific mov pattern I proposed, acceptable? > (you use V4SI for the immediate? Sorry typo again!! It should've been SI. > Ideally vdupi has another custom > mode for the vector index). > > Note that this factored pattern is already available as vec_perm_const! > It is simply (vec_perm_const:V4SI ). > > Which means that on the GIMPLE level we should try to combine > > el_4 = BIT_FIELD_REF ; > v_5 = { el_4, el_4, ... }; I don't think we reach this state at all for the scenarios in discussion. what we generally have is: el_4 = MEM_REF < array + index*size > v_5 = { el_4, ... } Or am I missing something? > > into > > v_5 = VEC_PERM_EXPR ; > > which it should already do with simplify_permutation. > > But I'm not sure what you are after at then end ;) > > Richard. > Regards VP
Re: [RFC] Vectorization of indexed elements
On Mon, 30 Sep 2013, Vidya Praveen wrote: > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > > [...] > > > > > I can't really insist on the single lane load.. something like: > > > > > > > > > > vc:V4SI[0] = c > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > > > > > Or is there any other way to do this? > > > > > > > > Can you elaborate on "I can't really insist on the single lane load"? > > > > What's the single lane load in your example? > > > > > > Loading just one lane of the vector like this: > > > > > > vc:V4SI[0] = c // from the above scalar example > > > > > > or > > > > > > vc:V4SI[0] = c[2] > > > > > > is what I meant by single lane load. In this example: > > > > > > t = c[2] > > > ... > > > vb:v4si = b[0:3] > > > vc:v4si = { t, t, t, t } > > > va:v4si = vb:v4si vc:v4si > > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could > > > be > > > seen as vec_select:SI (vect_t 0) ). > > > > > > > I'd expect the instruction > > > > pattern as quoted to just work (and I hope we expand an uniform > > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > > > As much as I went through the code, this is only done using vect_init. It > > > is > > > not expanded as vec_duplicate from, for example, store_constructor() of > > > expr.c > > > > Do you see any issues if we expand such constructor as vec_duplicate > > directly > > instead of going through vect_init way? > > Sorry, that was a bad question. > > But here's what I would like to propose as a first step. Please tell me if > this > is acceptable or if it makes sense: > > - Introduce standard pattern names > > "vmulim4" - vector muliply with second operand as indexed operand > > Example: > > (define_insn "vmuliv4si4" >[set (match_operand:V4SI 0 "register_operand") > (mul:V4SI (match_operand:V4SI 1 "register_operand") > (vec_duplicate:V4SI > (vec_select:SI > (match_operand:V4SI 2 "register_operand") > (match_operand:V4SI 3 "immediate_operand)] > ... > ) We could factor this with providing a standard pattern name for (define_insn "vdupi" [set (match_operand: 0 "register_operand") (vec_duplicate: (vec_select: (match_operand: 1 "register_operand") (match_operand:SI 2 "immediate_operand] (you use V4SI for the immediate? Ideally vdupi has another custom mode for the vector index). Note that this factored pattern is already available as vec_perm_const! It is simply (vec_perm_const:V4SI ). Which means that on the GIMPLE level we should try to combine el_4 = BIT_FIELD_REF ; v_5 = { el_4, el_4, ... }; into v_5 = VEC_PERM_EXPR ; which it should already do with simplify_permutation. But I'm not sure what you are after at then end ;) Richard.
Re: [RFC] Vectorization of indexed elements
On Wed, Sep 25, 2013 at 10:22:05AM +0100, Richard Biener wrote: > On Tue, 24 Sep 2013, Vidya Praveen wrote: > > > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote: > > > On Mon, 9 Sep 2013, Marc Glisse wrote: > > > > > > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > > > > > > > Hello, > > > > > > > > > > This post details some thoughts on an enhancement to the vectorizer > > > > > that > > > > > could take advantage of the SIMD instructions that allows indexed > > > > > element > > > > > as an operand thus reducing the need for duplication and possibly > > > > > improve > > > > > reuse of previously loaded data. > > > > > > > > > > Appreciate your opinion on this. > > > > > > > > > > --- > > > > > > > > > > A phrase like this: > > > > > > > > > > for(i=0;i<4;i++) > > > > > a[i] = b[i] c[2]; > > > > > > > > > > is usually vectorized as: > > > > > > > > > > va:V4SI = a[0:3] > > > > > vb:V4SI = b[0:3] > > > > > t = c[2] > > > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at > > > > > vec_init > > > > > ... > > > > > va:V4SI = vb:V4SI vc:V4SI > > > > > > > > > > But this could be simplified further if a target has instructions that > > > > > support > > > > > indexed element as a parameter. For example an instruction like this: > > > > > > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > > > > > > > can perform multiplication of each element of v2.4s with the third > > > > > element > > > > > of > > > > > v2.4s (specified as v2.4s[2]) and store the results in the > > > > > corresponding > > > > > elements of v0.4s. > > > > > > > > > > For this to happen, vectorizer needs to understand this idiom and > > > > > treat the > > > > > operand c[2] specially (and by taking in to consideration if the > > > > > machine > > > > > supports indexed element as an operand for through a target hook > > > > > or > > > > > macro) > > > > > and consider this as vectorizable statement without having to > > > > > duplicate the > > > > > elements explicitly. > > > > > > > > > > There are fews ways this could be represented at gimple: > > > > > > > > > > ... > > > > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI > > > > > 2)) > > > > > ... > > > > > > > > > > or by allowing a vectorizer treat an indexed element as a valid > > > > > operand in a > > > > > vectorizable statement: > > > > > > > > Might as well allow any scalar then... > > > > > > I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form > > > would necessarily be two extra separate statements and thus subject > > > to CSE obfuscating it enough for RTL expansion to no longer notice it. > > > > I also thought about having a specialized expression like > > > > VEC_INDEXED__EXPR < arg0, arg1, arg2, index> > > > > to mean: > > > > arg0 = arg1 arg2[index] > > > > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR > > is handled in expr.c. But I dropped this idea since we may need to introduce > > many such nodes. > > > > > > > > That said, allowing mixed scalar/vector ops isn't very nice and > > > your scheme can be simplified by just using > > > > > > vc:V4SI = VEC_DUPLICATE_EXPR <...> > > > va:V4SI = vb:V4SI vc:V4SI > > > > > > where the expander only has to see that vc:V4SI is defined by > > > a duplicate. > > > > I did try out something like this quickly before I posted this RFC, though > > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select()) > > > > for: > > > > for(i=0;i<8;i++) > > a[i] = b[2] * c[i]; > > > > I could generate: > > > > ... > > : > > _88 = prolog_loop_adjusted_niters.6_60 * 4; > > vectp_c.13_87 = c_10(D) + _88; > > vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; > > vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); > > _96 = prolog_loop_adjusted_niters.6_60 * 4; > > vectp_a.19_95 = a_6(D) + _96; > > vect__12.14_115 = MEM[(int *)vectp_c.13_87]; > > vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93; > > MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; > > vectp_c.12_118 = vectp_c.13_87 + 16; > > vectp_a.18_119 = vectp_a.19_95 + 16; > > ivtmp_120 = 1; > > if (ivtmp_120 < bnd.8_62) > > goto ; > > else > > goto ; > > > > : > > # vectp_c.12_89 = PHI > > # vectp_a.18_97 = PHI > > # ivtmp_14 = PHI > > vect__12.14_91 = MEM[(int *)vectp_c.12_89]; > > vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; > > MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94; > > ... > > > > It's a crude implementation so VEC_DUP is printed as: > > > > (vect_ldidx_.16_92) <<< ??? >>> (0); > > > > > > > > > ... > > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > > > > ... > > > > > > > > > > For the sake of explanation, the above two representations assumes > > > > > that > > > > > c[0:3
Re: [RFC] Vectorization of indexed elements
On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote: > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: > [...] > > > > I can't really insist on the single lane load.. something like: > > > > > > > > vc:V4SI[0] = c > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > > > Or is there any other way to do this? > > > > > > Can you elaborate on "I can't really insist on the single lane load"? > > > What's the single lane load in your example? > > > > Loading just one lane of the vector like this: > > > > vc:V4SI[0] = c // from the above scalar example > > > > or > > > > vc:V4SI[0] = c[2] > > > > is what I meant by single lane load. In this example: > > > > t = c[2] > > ... > > vb:v4si = b[0:3] > > vc:v4si = { t, t, t, t } > > va:v4si = vb:v4si vc:v4si > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be > > seen as vec_select:SI (vect_t 0) ). > > > > > I'd expect the instruction > > > pattern as quoted to just work (and I hope we expand an uniform > > > constructor { a, a, a, a } properly using vec_duplicate). > > > > As much as I went through the code, this is only done using vect_init. It is > > not expanded as vec_duplicate from, for example, store_constructor() of > > expr.c > > Do you see any issues if we expand such constructor as vec_duplicate directly > instead of going through vect_init way? Sorry, that was a bad question. But here's what I would like to propose as a first step. Please tell me if this is acceptable or if it makes sense: - Introduce standard pattern names "vmulim4" - vector muliply with second operand as indexed operand Example: (define_insn "vmuliv4si4" [set (match_operand:V4SI 0 "register_operand") (mul:V4SI (match_operand:V4SI 1 "register_operand") (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI 2 "register_operand") (match_operand:V4SI 3 "immediate_operand)] ... ) "vlmovmn3" - move where one of the operands is specific lane of a vector and other is a scalar. Example: (define_insn "vlmovv4sisi3" [set (vec_select:SI (match_operand:V4SI 0 "register_operand") (match_operand:SI 1 "immediate_operand")) (match_operand:SI 2 "memory_operand")] ... ) - Identify the following idiom and expand through the above standard patterns: t = c[m] vc[0:n] = { t, t, t, t} a[0:n] = b[0:n] * vc[0:n] as (insn (set (vec_select:SI (reg:V4SI 0) 0) (mem:SI ... ))) (insn (set (reg:V4SI 1) (mult:V4SI (reg:V4SI 2) (vec_duplicate:V4SI (vec_select:SI (reg:V4SI 0) 0) If this path is acceptable, then I can extend this to support "vmaddim4" - multiply and add (with indexed element as multiplier) "vmsubim4" - multiply and subtract (with indexed element as multiplier) Please let me know your thoughts. Cheers VP
Re: [RFC] Vectorization of indexed elements
On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote: [...] > > > I can't really insist on the single lane load.. something like: > > > > > > vc:V4SI[0] = c > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > > va:V4SI = vb:V4SI vt:V4SI > > > > > > Or is there any other way to do this? > > > > Can you elaborate on "I can't really insist on the single lane load"? > > What's the single lane load in your example? > > Loading just one lane of the vector like this: > > vc:V4SI[0] = c // from the above scalar example > > or > > vc:V4SI[0] = c[2] > > is what I meant by single lane load. In this example: > > t = c[2] > ... > vb:v4si = b[0:3] > vc:v4si = { t, t, t, t } > va:v4si = vb:v4si vc:v4si > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be > seen as vec_select:SI (vect_t 0) ). > > > I'd expect the instruction > > pattern as quoted to just work (and I hope we expand an uniform > > constructor { a, a, a, a } properly using vec_duplicate). > > As much as I went through the code, this is only done using vect_init. It is > not expanded as vec_duplicate from, for example, store_constructor() of expr.c Do you see any issues if we expand such constructor as vec_duplicate directly instead of going through vect_init way? VP
Re: [RFC] Vectorization of indexed elements
On Wed, Sep 25, 2013 at 10:24:56AM +0100, Richard Biener wrote: > On Tue, 24 Sep 2013, Vidya Praveen wrote: > > > On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote: > > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > > > > > Hello, > > > > > > > > This post details some thoughts on an enhancement to the vectorizer that > > > > could take advantage of the SIMD instructions that allows indexed > > > > element > > > > as an operand thus reducing the need for duplication and possibly > > > > improve > > > > reuse of previously loaded data. > > > > > > > > Appreciate your opinion on this. > > > > > > > > --- > > > > > > > > A phrase like this: > > > > > > > > for(i=0;i<4;i++) > > > > a[i] = b[i] c[2]; > > > > > > > > is usually vectorized as: > > > > > > > > va:V4SI = a[0:3] > > > > vb:V4SI = b[0:3] > > > > t = c[2] > > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at > > > > vec_init > > > > ... > > > > va:V4SI = vb:V4SI vc:V4SI > > > > > > > > But this could be simplified further if a target has instructions that > > > > support > > > > indexed element as a parameter. For example an instruction like this: > > > > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > > > > > can perform multiplication of each element of v2.4s with the third > > > > element of > > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > > > elements of v0.4s. > > > > > > > > For this to happen, vectorizer needs to understand this idiom and treat > > > > the > > > > operand c[2] specially (and by taking in to consideration if the machine > > > > supports indexed element as an operand for through a target hook > > > > or macro) > > > > and consider this as vectorizable statement without having to duplicate > > > > the > > > > elements explicitly. > > > > > > > > There are fews ways this could be represented at gimple: > > > > > > > > ... > > > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > > ... > > > > > > > > or by allowing a vectorizer treat an indexed element as a valid operand > > > > in a > > > > vectorizable statement: > > > > > > Might as well allow any scalar then... > > > > Yes, I had given an example below. > > > > > > > > > ... > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > > > ... > > > > > > > > For the sake of explanation, the above two representations assumes that > > > > c[0:3] is loaded in vc for some other use and reused here. But when > > > > c[2] is the > > > > only use of 'c' then it may be safer to just load one element and use > > > > it like > > > > this: > > > > > > > > vc:V4SI[0] = c[2] > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > > > This could also mean that expressions involving scalar could be treated > > > > similarly. For example, > > > > > > > > for(i=0;i<4;i++) > > > >a[i] = b[i] c > > > > > > > > could be vectorized as: > > > > > > > > vc:V4SI[0] = c > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > > > Such a change would also require new standard pattern names to be > > > > defined for > > > > each . > > > > > > > > Alternatively, having something like this: > > > > > > > > ... > > > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > > va:V4SI = vb:V4SI vt:V4SI > > > > ... > > > > > > > > would remove the need to introduce several new standard pattern names > > > > but have > > > > just one to represent vec_duplicate(vec_select()) but ofcourse this > > > > will expect > > > > the target to have combiner patterns. > > > > > > The cost estimation wouldn't be very good, but aren't combine patterns > > > enough for the whole thing? Don't you model your mul instruction as: > > > > > > (mult:V4SI > > >(match_operand:V4SI) > > >(vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI > > > > > > anyway? Seems that combine should be able to handle it. What currently > > > happens that we fail to generate the right instruction? > > > > At vec_init, I can recognize an idiom in order to generate vec_duplicate but > > I can't really insist on the single lane load.. something like: > > > > vc:V4SI[0] = c > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > > va:V4SI = vb:V4SI vt:V4SI > > > > Or is there any other way to do this? > > Can you elaborate on "I can't really insist on the single lane load"? > What's the single lane load in your example? Loading just one lane of the vector like this: vc:V4SI[0] = c // from the above scalar example or vc:V4SI[0] = c[2] is what I meant by single lane load. In this example: t = c[2] ... vb:v4si = b[0:3] vc:v4si = { t, t, t, t } va:v4si = vb:v4si vc:v4si If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I cannot insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which could be seen as vec_select:SI (vect_t 0) ). > I'd expect the instruction > pattern as quoted to just work (and I hope we expand an uniform
Re: [RFC] Vectorization of indexed elements
On Tue, 24 Sep 2013, Vidya Praveen wrote: > On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote: > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > > > Hello, > > > > > > This post details some thoughts on an enhancement to the vectorizer that > > > could take advantage of the SIMD instructions that allows indexed element > > > as an operand thus reducing the need for duplication and possibly improve > > > reuse of previously loaded data. > > > > > > Appreciate your opinion on this. > > > > > > --- > > > > > > A phrase like this: > > > > > > for(i=0;i<4;i++) > > > a[i] = b[i] c[2]; > > > > > > is usually vectorized as: > > > > > > va:V4SI = a[0:3] > > > vb:V4SI = b[0:3] > > > t = c[2] > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at > > > vec_init > > > ... > > > va:V4SI = vb:V4SI vc:V4SI > > > > > > But this could be simplified further if a target has instructions that > > > support > > > indexed element as a parameter. For example an instruction like this: > > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > > > can perform multiplication of each element of v2.4s with the third > > > element of > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > > elements of v0.4s. > > > > > > For this to happen, vectorizer needs to understand this idiom and treat > > > the > > > operand c[2] specially (and by taking in to consideration if the machine > > > supports indexed element as an operand for through a target hook or > > > macro) > > > and consider this as vectorizable statement without having to duplicate > > > the > > > elements explicitly. > > > > > > There are fews ways this could be represented at gimple: > > > > > > ... > > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > ... > > > > > > or by allowing a vectorizer treat an indexed element as a valid operand > > > in a > > > vectorizable statement: > > > > Might as well allow any scalar then... > > Yes, I had given an example below. > > > > > > ... > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > > ... > > > > > > For the sake of explanation, the above two representations assumes that > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] > > > is the > > > only use of 'c' then it may be safer to just load one element and use it > > > like > > > this: > > > > > > vc:V4SI[0] = c[2] > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > This could also mean that expressions involving scalar could be treated > > > similarly. For example, > > > > > > for(i=0;i<4;i++) > > >a[i] = b[i] c > > > > > > could be vectorized as: > > > > > > vc:V4SI[0] = c > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > Such a change would also require new standard pattern names to be defined > > > for > > > each . > > > > > > Alternatively, having something like this: > > > > > > ... > > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > va:V4SI = vb:V4SI vt:V4SI > > > ... > > > > > > would remove the need to introduce several new standard pattern names but > > > have > > > just one to represent vec_duplicate(vec_select()) but ofcourse this will > > > expect > > > the target to have combiner patterns. > > > > The cost estimation wouldn't be very good, but aren't combine patterns > > enough for the whole thing? Don't you model your mul instruction as: > > > > (mult:V4SI > >(match_operand:V4SI) > >(vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI > > > > anyway? Seems that combine should be able to handle it. What currently > > happens that we fail to generate the right instruction? > > At vec_init, I can recognize an idiom in order to generate vec_duplicate but > I can't really insist on the single lane load.. something like: > > vc:V4SI[0] = c > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) > va:V4SI = vb:V4SI vt:V4SI > > Or is there any other way to do this? Can you elaborate on "I can't really insist on the single lane load"? What's the single lane load in your example? I'd expect the instruction pattern as quoted to just work (and I hope we expand an uniform constructor { a, a, a, a } properly using vec_duplicate). Richard. > Cheers > VP > > > > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR > > for vec_duplicate, adding new nodes is always painful. > > > > > This enhancement could possibly help further optimizing larger scenarios > > > such > > > as linear systems. > > > > > > Regards > > > VP > > > > -- > > Marc Glisse > > > > > -- Richard Biener SUSE / SUSE Labs SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746 GF: Jeff Hawn, Jennifer Guild, Felix Imend
Re: [RFC] Vectorization of indexed elements
On Tue, 24 Sep 2013, Vidya Praveen wrote: > On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote: > > On Mon, 9 Sep 2013, Marc Glisse wrote: > > > > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > > > > > Hello, > > > > > > > > This post details some thoughts on an enhancement to the vectorizer that > > > > could take advantage of the SIMD instructions that allows indexed > > > > element > > > > as an operand thus reducing the need for duplication and possibly > > > > improve > > > > reuse of previously loaded data. > > > > > > > > Appreciate your opinion on this. > > > > > > > > --- > > > > > > > > A phrase like this: > > > > > > > > for(i=0;i<4;i++) > > > > a[i] = b[i] c[2]; > > > > > > > > is usually vectorized as: > > > > > > > > va:V4SI = a[0:3] > > > > vb:V4SI = b[0:3] > > > > t = c[2] > > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at > > > > vec_init > > > > ... > > > > va:V4SI = vb:V4SI vc:V4SI > > > > > > > > But this could be simplified further if a target has instructions that > > > > support > > > > indexed element as a parameter. For example an instruction like this: > > > > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > > > > > can perform multiplication of each element of v2.4s with the third > > > > element > > > > of > > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > > > elements of v0.4s. > > > > > > > > For this to happen, vectorizer needs to understand this idiom and treat > > > > the > > > > operand c[2] specially (and by taking in to consideration if the machine > > > > supports indexed element as an operand for through a target hook or > > > > macro) > > > > and consider this as vectorizable statement without having to duplicate > > > > the > > > > elements explicitly. > > > > > > > > There are fews ways this could be represented at gimple: > > > > > > > > ... > > > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > > ... > > > > > > > > or by allowing a vectorizer treat an indexed element as a valid operand > > > > in a > > > > vectorizable statement: > > > > > > Might as well allow any scalar then... > > > > I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form > > would necessarily be two extra separate statements and thus subject > > to CSE obfuscating it enough for RTL expansion to no longer notice it. > > I also thought about having a specialized expression like > > VEC_INDEXED__EXPR < arg0, arg1, arg2, index> > > to mean: > > arg0 = arg1 arg2[index] > > and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR > is handled in expr.c. But I dropped this idea since we may need to introduce > many such nodes. > > > > > That said, allowing mixed scalar/vector ops isn't very nice and > > your scheme can be simplified by just using > > > > vc:V4SI = VEC_DUPLICATE_EXPR <...> > > va:V4SI = vb:V4SI vc:V4SI > > > > where the expander only has to see that vc:V4SI is defined by > > a duplicate. > > I did try out something like this quickly before I posted this RFC, though > I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select()) > > for: > > for(i=0;i<8;i++) > a[i] = b[2] * c[i]; > > I could generate: > > ... > : > _88 = prolog_loop_adjusted_niters.6_60 * 4; > vectp_c.13_87 = c_10(D) + _88; > vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; > vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); > _96 = prolog_loop_adjusted_niters.6_60 * 4; > vectp_a.19_95 = a_6(D) + _96; > vect__12.14_115 = MEM[(int *)vectp_c.13_87]; > vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93; > MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; > vectp_c.12_118 = vectp_c.13_87 + 16; > vectp_a.18_119 = vectp_a.19_95 + 16; > ivtmp_120 = 1; > if (ivtmp_120 < bnd.8_62) > goto ; > else > goto ; > > : > # vectp_c.12_89 = PHI > # vectp_a.18_97 = PHI > # ivtmp_14 = PHI > vect__12.14_91 = MEM[(int *)vectp_c.12_89]; > vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; > MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94; > ... > > It's a crude implementation so VEC_DUP is printed as: > > (vect_ldidx_.16_92) <<< ??? >>> (0); > > > > > > ... > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > > > ... > > > > > > > > For the sake of explanation, the above two representations assumes that > > > > c[0:3] is loaded in vc for some other use and reused here. But when > > > > c[2] is > > > > the > > > > only use of 'c' then it may be safer to just load one element and use it > > > > like > > > > this: > > > > > > > > vc:V4SI[0] = c[2] > > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > > > This could also mean that expressions involving scalar could be treated > > > > similarly. For example, > >
Re: [RFC] Vectorization of indexed elements
On Mon, Sep 09, 2013 at 07:02:52PM +0100, Marc Glisse wrote: > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > Hello, > > > > This post details some thoughts on an enhancement to the vectorizer that > > could take advantage of the SIMD instructions that allows indexed element > > as an operand thus reducing the need for duplication and possibly improve > > reuse of previously loaded data. > > > > Appreciate your opinion on this. > > > > --- > > > > A phrase like this: > > > > for(i=0;i<4;i++) > > a[i] = b[i] c[2]; > > > > is usually vectorized as: > > > > va:V4SI = a[0:3] > > vb:V4SI = b[0:3] > > t = c[2] > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init > > ... > > va:V4SI = vb:V4SI vc:V4SI > > > > But this could be simplified further if a target has instructions that > > support > > indexed element as a parameter. For example an instruction like this: > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > can perform multiplication of each element of v2.4s with the third element > > of > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > elements of v0.4s. > > > > For this to happen, vectorizer needs to understand this idiom and treat the > > operand c[2] specially (and by taking in to consideration if the machine > > supports indexed element as an operand for through a target hook or > > macro) > > and consider this as vectorizable statement without having to duplicate the > > elements explicitly. > > > > There are fews ways this could be represented at gimple: > > > > ... > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > ... > > > > or by allowing a vectorizer treat an indexed element as a valid operand in a > > vectorizable statement: > > Might as well allow any scalar then... Yes, I had given an example below. > > > ... > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > ... > > > > For the sake of explanation, the above two representations assumes that > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is > > the > > only use of 'c' then it may be safer to just load one element and use it > > like > > this: > > > > vc:V4SI[0] = c[2] > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > This could also mean that expressions involving scalar could be treated > > similarly. For example, > > > > for(i=0;i<4;i++) > >a[i] = b[i] c > > > > could be vectorized as: > > > > vc:V4SI[0] = c > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > Such a change would also require new standard pattern names to be defined > > for > > each . > > > > Alternatively, having something like this: > > > > ... > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > va:V4SI = vb:V4SI vt:V4SI > > ... > > > > would remove the need to introduce several new standard pattern names but > > have > > just one to represent vec_duplicate(vec_select()) but ofcourse this will > > expect > > the target to have combiner patterns. > > The cost estimation wouldn't be very good, but aren't combine patterns > enough for the whole thing? Don't you model your mul instruction as: > > (mult:V4SI >(match_operand:V4SI) >(vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI > > anyway? Seems that combine should be able to handle it. What currently > happens that we fail to generate the right instruction? At vec_init, I can recognize an idiom in order to generate vec_duplicate but I can't really insist on the single lane load.. something like: vc:V4SI[0] = c vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0) va:V4SI = vb:V4SI vt:V4SI Or is there any other way to do this? Cheers VP > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR > for vec_duplicate, adding new nodes is always painful. > > > This enhancement could possibly help further optimizing larger scenarios > > such > > as linear systems. > > > > Regards > > VP > > -- > Marc Glisse >
Re: [RFC] Vectorization of indexed elements
On Tue, Sep 10, 2013 at 09:25:32AM +0100, Richard Biener wrote: > On Mon, 9 Sep 2013, Marc Glisse wrote: > > > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > > > Hello, > > > > > > This post details some thoughts on an enhancement to the vectorizer that > > > could take advantage of the SIMD instructions that allows indexed element > > > as an operand thus reducing the need for duplication and possibly improve > > > reuse of previously loaded data. > > > > > > Appreciate your opinion on this. > > > > > > --- > > > > > > A phrase like this: > > > > > > for(i=0;i<4;i++) > > > a[i] = b[i] c[2]; > > > > > > is usually vectorized as: > > > > > > va:V4SI = a[0:3] > > > vb:V4SI = b[0:3] > > > t = c[2] > > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at > > > vec_init > > > ... > > > va:V4SI = vb:V4SI vc:V4SI > > > > > > But this could be simplified further if a target has instructions that > > > support > > > indexed element as a parameter. For example an instruction like this: > > > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > > > can perform multiplication of each element of v2.4s with the third element > > > of > > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > > elements of v0.4s. > > > > > > For this to happen, vectorizer needs to understand this idiom and treat > > > the > > > operand c[2] specially (and by taking in to consideration if the machine > > > supports indexed element as an operand for through a target hook or > > > macro) > > > and consider this as vectorizable statement without having to duplicate > > > the > > > elements explicitly. > > > > > > There are fews ways this could be represented at gimple: > > > > > > ... > > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > > ... > > > > > > or by allowing a vectorizer treat an indexed element as a valid operand > > > in a > > > vectorizable statement: > > > > Might as well allow any scalar then... > > I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form > would necessarily be two extra separate statements and thus subject > to CSE obfuscating it enough for RTL expansion to no longer notice it. I also thought about having a specialized expression like VEC_INDEXED__EXPR < arg0, arg1, arg2, index> to mean: arg0 = arg1 arg2[index] and handle it directly in the expander, like (for eg.) how VEC_LSHIFT_EXPR is handled in expr.c. But I dropped this idea since we may need to introduce many such nodes. > > That said, allowing mixed scalar/vector ops isn't very nice and > your scheme can be simplified by just using > > vc:V4SI = VEC_DUPLICATE_EXPR <...> > va:V4SI = vb:V4SI vc:V4SI > > where the expander only has to see that vc:V4SI is defined by > a duplicate. I did try out something like this quickly before I posted this RFC, though I called it VEC_DUP to mean a equivalent of vec_duplicate(vec_select()) for: for(i=0;i<8;i++) a[i] = b[2] * c[i]; I could generate: ... : _88 = prolog_loop_adjusted_niters.6_60 * 4; vectp_c.13_87 = c_10(D) + _88; vect_ldidx_.16_92 = MEM[(int *)b_8(D) + 8B]; vect_idxed_.17_93 = (vect_ldidx_.16_92) <<< ??? >>> (0); _96 = prolog_loop_adjusted_niters.6_60 * 4; vectp_a.19_95 = a_6(D) + _96; vect__12.14_115 = MEM[(int *)vectp_c.13_87]; vect_patt_40.15_116 = vect__12.14_115 * vect_idxed_.17_93; MEM[(int *)vectp_a.19_95] = vect_patt_40.15_116; vectp_c.12_118 = vectp_c.13_87 + 16; vectp_a.18_119 = vectp_a.19_95 + 16; ivtmp_120 = 1; if (ivtmp_120 < bnd.8_62) goto ; else goto ; : # vectp_c.12_89 = PHI # vectp_a.18_97 = PHI # ivtmp_14 = PHI vect__12.14_91 = MEM[(int *)vectp_c.12_89]; vect_patt_40.15_94 = vect__12.14_91 * vect_idxed_.17_93; MEM[(int *)vectp_a.18_97] = vect_patt_40.15_94; ... It's a crude implementation so VEC_DUP is printed as: (vect_ldidx_.16_92) <<< ??? >>> (0); > > > ... > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > > ... > > > > > > For the sake of explanation, the above two representations assumes that > > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] > > > is > > > the > > > only use of 'c' then it may be safer to just load one element and use it > > > like > > > this: > > > > > > vc:V4SI[0] = c[2] > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > This could also mean that expressions involving scalar could be treated > > > similarly. For example, > > > > > > for(i=0;i<4;i++) > > >a[i] = b[i] c > > > > > > could be vectorized as: > > > > > > vc:V4SI[0] = c > > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > > > Such a change would also require new standard pattern names to be defined > > > for > > > each . > > > > > > Alternatively, having something like this: > > > > > > ... > > > vt:V4S
Re: [RFC] Vectorization of indexed elements
On Mon, 9 Sep 2013, Marc Glisse wrote: > On Mon, 9 Sep 2013, Vidya Praveen wrote: > > > Hello, > > > > This post details some thoughts on an enhancement to the vectorizer that > > could take advantage of the SIMD instructions that allows indexed element > > as an operand thus reducing the need for duplication and possibly improve > > reuse of previously loaded data. > > > > Appreciate your opinion on this. > > > > --- > > > > A phrase like this: > > > > for(i=0;i<4;i++) > > a[i] = b[i] c[2]; > > > > is usually vectorized as: > > > > va:V4SI = a[0:3] > > vb:V4SI = b[0:3] > > t = c[2] > > vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init > > ... > > va:V4SI = vb:V4SI vc:V4SI > > > > But this could be simplified further if a target has instructions that > > support > > indexed element as a parameter. For example an instruction like this: > > > > mul v0.4s, v1.4s, v2.4s[2] > > > > can perform multiplication of each element of v2.4s with the third element > > of > > v2.4s (specified as v2.4s[2]) and store the results in the corresponding > > elements of v0.4s. > > > > For this to happen, vectorizer needs to understand this idiom and treat the > > operand c[2] specially (and by taking in to consideration if the machine > > supports indexed element as an operand for through a target hook or > > macro) > > and consider this as vectorizable statement without having to duplicate the > > elements explicitly. > > > > There are fews ways this could be represented at gimple: > > > > ... > > va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > ... > > > > or by allowing a vectorizer treat an indexed element as a valid operand in a > > vectorizable statement: > > Might as well allow any scalar then... I agree. The VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) form would necessarily be two extra separate statements and thus subject to CSE obfuscating it enough for RTL expansion to no longer notice it. That said, allowing mixed scalar/vector ops isn't very nice and your scheme can be simplified by just using vc:V4SI = VEC_DUPLICATE_EXPR <...> va:V4SI = vb:V4SI vc:V4SI where the expander only has to see that vc:V4SI is defined by a duplicate. > > ... > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) > > ... > > > > For the sake of explanation, the above two representations assumes that > > c[0:3] is loaded in vc for some other use and reused here. But when c[2] is > > the > > only use of 'c' then it may be safer to just load one element and use it > > like > > this: > > > > vc:V4SI[0] = c[2] > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > This could also mean that expressions involving scalar could be treated > > similarly. For example, > > > > for(i=0;i<4;i++) > >a[i] = b[i] c > > > > could be vectorized as: > > > > vc:V4SI[0] = c > > va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) > > > > Such a change would also require new standard pattern names to be defined > > for > > each . > > > > Alternatively, having something like this: > > > > ... > > vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) > > va:V4SI = vb:V4SI vt:V4SI > > ... > > > > would remove the need to introduce several new standard pattern names but > > have > > just one to represent vec_duplicate(vec_select()) but ofcourse this will > > expect > > the target to have combiner patterns. > > The cost estimation wouldn't be very good, but aren't combine patterns enough > for the whole thing? Don't you model your mul instruction as: > > (mult:V4SI > (match_operand:V4SI) > (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI > > anyway? Seems that combine should be able to handle it. What currently happens > that we fail to generate the right instruction? > > In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for > vec_duplicate, adding new nodes is always painful. True, though CONSTRUCTOR isn't a good vec_duplicate primitive. But yes, we have it that way at the moment and indeed adding new nodes is always painful. > > This enhancement could possibly help further optimizing larger scenarios > > such > > as linear systems. Given that the vectorizer already handles all cases you quote but just the expansion doesn't use the targets special abilities - can't you just teach the expander to lookup the definition of the vectors and see if it is an uniform CONSTRUCTOR? Richard.
Re: [RFC] Vectorization of indexed elements
On Mon, 9 Sep 2013, Vidya Praveen wrote: Hello, This post details some thoughts on an enhancement to the vectorizer that could take advantage of the SIMD instructions that allows indexed element as an operand thus reducing the need for duplication and possibly improve reuse of previously loaded data. Appreciate your opinion on this. --- A phrase like this: for(i=0;i<4;i++) a[i] = b[i] c[2]; is usually vectorized as: va:V4SI = a[0:3] vb:V4SI = b[0:3] t = c[2] vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init ... va:V4SI = vb:V4SI vc:V4SI But this could be simplified further if a target has instructions that support indexed element as a parameter. For example an instruction like this: mul v0.4s, v1.4s, v2.4s[2] can perform multiplication of each element of v2.4s with the third element of v2.4s (specified as v2.4s[2]) and store the results in the corresponding elements of v0.4s. For this to happen, vectorizer needs to understand this idiom and treat the operand c[2] specially (and by taking in to consideration if the machine supports indexed element as an operand for through a target hook or macro) and consider this as vectorizable statement without having to duplicate the elements explicitly. There are fews ways this could be represented at gimple: ... va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) ... or by allowing a vectorizer treat an indexed element as a valid operand in a vectorizable statement: Might as well allow any scalar then... ... va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) ... For the sake of explanation, the above two representations assumes that c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the only use of 'c' then it may be safer to just load one element and use it like this: vc:V4SI[0] = c[2] va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) This could also mean that expressions involving scalar could be treated similarly. For example, for(i=0;i<4;i++) a[i] = b[i] c could be vectorized as: vc:V4SI[0] = c va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) Such a change would also require new standard pattern names to be defined for each . Alternatively, having something like this: ... vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) va:V4SI = vb:V4SI vt:V4SI ... would remove the need to introduce several new standard pattern names but have just one to represent vec_duplicate(vec_select()) but ofcourse this will expect the target to have combiner patterns. The cost estimation wouldn't be very good, but aren't combine patterns enough for the whole thing? Don't you model your mul instruction as: (mult:V4SI (match_operand:V4SI) (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI anyway? Seems that combine should be able to handle it. What currently happens that we fail to generate the right instruction? In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for vec_duplicate, adding new nodes is always painful. This enhancement could possibly help further optimizing larger scenarios such as linear systems. Regards VP -- Marc Glisse
[RFC] Vectorization of indexed elements
Hello, This post details some thoughts on an enhancement to the vectorizer that could take advantage of the SIMD instructions that allows indexed element as an operand thus reducing the need for duplication and possibly improve reuse of previously loaded data. Appreciate your opinion on this. --- A phrase like this: for(i=0;i<4;i++) a[i] = b[i] c[2]; is usually vectorized as: va:V4SI = a[0:3] vb:V4SI = b[0:3] t = c[2] vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init ... va:V4SI = vb:V4SI vc:V4SI But this could be simplified further if a target has instructions that support indexed element as a parameter. For example an instruction like this: mul v0.4s, v1.4s, v2.4s[2] can perform multiplication of each element of v2.4s with the third element of v2.4s (specified as v2.4s[2]) and store the results in the corresponding elements of v0.4s. For this to happen, vectorizer needs to understand this idiom and treat the operand c[2] specially (and by taking in to consideration if the machine supports indexed element as an operand for through a target hook or macro) and consider this as vectorizable statement without having to duplicate the elements explicitly. There are fews ways this could be represented at gimple: ... va:V4SI = vb:V4SI VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) ... or by allowing a vectorizer treat an indexed element as a valid operand in a vectorizable statement: ... va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 2) ... For the sake of explanation, the above two representations assumes that c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the only use of 'c' then it may be safer to just load one element and use it like this: vc:V4SI[0] = c[2] va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) This could also mean that expressions involving scalar could be treated similarly. For example, for(i=0;i<4;i++) a[i] = b[i] c could be vectorized as: vc:V4SI[0] = c va:V4SI = vb:V4SI VEC_SELECT_EXPR (vc:V4SI 0) Such a change would also require new standard pattern names to be defined for each . Alternatively, having something like this: ... vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2)) va:V4SI = vb:V4SI vt:V4SI ... would remove the need to introduce several new standard pattern names but have just one to represent vec_duplicate(vec_select()) but ofcourse this will expect the target to have combiner patterns. This enhancement could possibly help further optimizing larger scenarios such as linear systems. Regards VP