On Tue, Oct 01, 2013 at 09:26:25AM +0100, Richard Biener wrote:
> On Mon, 30 Sep 2013, Vidya Praveen wrote:
>
> > On Mon, Sep 30, 2013 at 02:19:32PM +0100, Richard Biener wrote:
> > > On Mon, 30 Sep 2013, Vidya Praveen wrote:
> > >
> > > > On Fri, Sep 27, 2013 at 04:19:45PM +0100, Vidya Praveen wrote:
> > > > > On Fri, Sep 27, 2013 at 03:50:08PM +0100, Vidya Praveen wrote:
> > > > > [...]
> > > > > > > > I can't really insist on the single lane load.. something like:
> > > > > > > >
> > > > > > > > vc:V4SI[0] = c
> > > > > > > > vt:V4SI = vec_duplicate:V4SI (vec_select:SI vc:V4SI 0)
> > > > > > > > va:V4SI = vb:V4SI <op> vt:V4SI
> > > > > > > >
> > > > > > > > Or is there any other way to do this?
> > > > > > >
> > > > > > > Can you elaborate on "I can't really insist on the single lane
> > > > > > > load"?
> > > > > > > What's the single lane load in your example?
> > > > > >
> > > > > > Loading just one lane of the vector like this:
> > > > > >
> > > > > > vc:V4SI[0] = c // from the above scalar example
> > > > > >
> > > > > > or
> > > > > >
> > > > > > vc:V4SI[0] = c[2]
> > > > > >
> > > > > > is what I meant by single lane load. In this example:
> > > > > >
> > > > > > t = c[2]
> > > > > > ...
> > > > > > vb:v4si = b[0:3]
> > > > > > vc:v4si = { t, t, t, t }
> > > > > > va:v4si = vb:v4si <op> vc:v4si
> > > > > >
> > > > > > If we are expanding the CONSTRUCTOR as vec_duplicate at vec_init, I
> > > > > > cannot
> > > > > > insist 't' to be vector and t = c[2] to be vect_t[0] = c[2] (which
> > > > > > could be
> > > > > > seen as vec_select:SI (vect_t 0) ).
> > > > > >
> > > > > > > I'd expect the instruction
> > > > > > > pattern as quoted to just work (and I hope we expand an uniform
> > > > > > > constructor { a, a, a, a } properly using vec_duplicate).
> > > > > >
> > > > > > As much as I went through the code, this is only done using
> > > > > > vect_init. It is
> > > > > > not expanded as vec_duplicate from, for example,
> > > > > > store_constructor() of expr.c
> > > > >
> > > > > Do you see any issues if we expand such constructor as vec_duplicate
> > > > > directly
> > > > > instead of going through vect_init way?
> > > >
> > > > Sorry, that was a bad question.
> > > >
> > > > But here's what I would like to propose as a first step. Please tell me
> > > > if this
> > > > is acceptable or if it makes sense:
> > > >
> > > > - Introduce standard pattern names
> > > >
> > > > "vmulim4" - vector muliply with second operand as indexed operand
> > > >
> > > > Example:
> > > >
> > > > (define_insn "vmuliv4si4"
> > > >    [set (match_operand:V4SI 0 "register_operand")
> > > >         (mul:V4SI (match_operand:V4SI 1 "register_operand")
> > > >                   (vec_duplicate:V4SI
> > > >                     (vec_select:SI
> > > >                       (match_operand:V4SI 2 "register_operand")
> > > >                       (match_operand:V4SI 3 "immediate_operand)))))]
> > > >  ...
> > > > )
> > >
> > > We could factor this with providing a standard pattern name for
> > >
> > > (define_insn "vdupi<mode>"
> > >   [set (match_operand:<mode> 0 "register_operand")
> > >        (vec_duplicate:<mode>
> > >           (vec_select:<scalarmode>
> > >              (match_operand:<mode> 1 "register_operand")
> > >              (match_operand:SI 2 "immediate_operand))))]
> >
> > This is good. I did think about this but then I thought of avoiding the need
> > for combiner patterns :-)
> >
> > But do you find the lane specific mov pattern I proposed, acceptable?
>
> The specific mul pattern? As said, consider factoring to vdupi to
> avoid an explosion in required special optabs.
>
> > > (you use V4SI for the immediate?
> >
> > Sorry typo again!! It should've been SI.
> >
> > > Ideally vdupi has another custom
> > > mode for the vector index).
> > >
> > > Note that this factored pattern is already available as vec_perm_const!
> > > It is simply (vec_perm_const:V4SI <source> <source> <immediate-selector>).
> > >
> > > Which means that on the GIMPLE level we should try to combine
> > >
> > > el_4 = BIT_FIELD_REF <v_3, ...>;
> > > v_5 = { el_4, el_4, ... };
> >
> > I don't think we reach this state at all for the scenarios in discussion.
> > what we generally have is:
> >
> >  el_4 = MEM_REF < array + index*size >
> >  v_5 = { el_4, ... }
> >
> > Or am I missing something?
>
> Well, but in that case I doubt it is profitable (or even valid!) to
> turn this into a vector lane load from the array. If it is profitable
> to perform a vector read (because we're going to use the other elements
> of the vector as well) then the vectorizer should produce a vector
> load and materialize the uniform vector from one of its elements.
>
> Maybe at this point you should show us a compilable C testcase
> with a loop that should be vectorized using your instructions in
> the end?
Here's a compilable example:

  void
  foo (int *__restrict__ a,
       int *__restrict__ b,
       int *__restrict__ c)
  {
    int i;

    for (i = 0; i < 8; i++)
      a[i] = b[i] * c[2];
  }

This is currently vectorized by duplicating c[2]. What I'm trying to do is
take advantage of target instructions that accept a vector register as the
second operand but use only one element of it (the same value is used for
all the lanes). E.g.

  mul <vec-reg>, <vec-reg>, <vec-reg>[index]
  mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add

For a loop like the one in the C example above, though, I would have to load
c[2] into a single element of the vector register and leave the remaining
elements unused. This is why I was proposing to load just one element into a
vector register (what I meant by "lane specific load"). The benefit of doing
this is that we avoid the explicit duplication. However, such a
simplification can only be done where the target supports it, which is why I
was thinking in terms of an optional standard pattern name.

Another benefit is that we will also be able to support scalars in the
expression, as in the following example:

  void
  foo (int *__restrict__ a,
       int *__restrict__ b,
       int c)
  {
    int i;

    for (i = 0; i < 8; i++)
      a[i] = b[i] * c;
  }

Another example which can take advantage of the target feature:

  void
  foo (int *__restrict__ a,
       int *__restrict__ b,
       int *__restrict__ c)
  {
    int i, j;

    for (i = 0; i < 8; i++)
      for (j = 0; j < 8; j++)
        a[j] += b[j] * c[i];
  }

We discussed this scenario earlier (you suggested handling it at TER).

Cheers
VP
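
P.S. To make the intended equivalence concrete, here is a rough scalar model in plain C of the two ways of vectorizing the first loop: duplicating c[2] into every lane and then multiplying, versus loading c[2] into a single lane and using a multiply-by-lane operation. Everything named here (v4si_model, dup_lane, mul_by_lane, foo_dup, foo_lane, check_equivalent) is made up for illustration. It is not GCC internals and not any target's intrinsics, only a sketch of the semantics I have in mind.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scalar model of a 4 x int (V4SI) vector register. */
typedef struct { int lane[4]; } v4si_model;

/* Models (vec_duplicate:V4SI (vec_select:SI v idx)):
   broadcast lane idx of v into every lane. */
static v4si_model dup_lane(v4si_model v, int idx)
{
    v4si_model r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = v.lane[idx];
    return r;
}

/* Lane-wise multiply of two vectors. */
static v4si_model mul(v4si_model x, v4si_model y)
{
    v4si_model r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] * y.lane[i];
    return r;
}

/* Models "mul <vec-reg>, <vec-reg>, <vec-reg>[index]":
   every lane of x is multiplied by the single lane idx of y. */
static v4si_model mul_by_lane(v4si_model x, v4si_model y, int idx)
{
    v4si_model r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] * y.lane[idx];
    return r;
}

/* What happens today: c[2] is loaded into lane 0, explicitly
   duplicated, and then a plain vector multiply is used. */
void foo_dup(int *__restrict__ a, const int *__restrict__ b,
             const int *__restrict__ c)
{
    v4si_model vt = {{ c[2], 0, 0, 0 }};
    v4si_model vc = dup_lane(vt, 0);
    for (int i = 0; i < 8; i += 4) {
        v4si_model vb, va;
        memcpy(vb.lane, b + i, sizeof vb.lane);
        va = mul(vb, vc);
        memcpy(a + i, va.lane, sizeof va.lane);
    }
}

/* What is proposed: load c[2] into one lane only (the other lanes
   are don't-care, zeroed here) and multiply by that lane. */
void foo_lane(int *__restrict__ a, const int *__restrict__ b,
              const int *__restrict__ c)
{
    v4si_model vc = {{ 0, 0, 0, 0 }};
    vc.lane[0] = c[2];
    for (int i = 0; i < 8; i += 4) {
        v4si_model vb, va;
        memcpy(vb.lane, b + i, sizeof vb.lane);
        va = mul_by_lane(vb, vc, 0);
        memcpy(a + i, va.lane, sizeof va.lane);
    }
}

/* Both variants must produce the same array. */
void check_equivalent(void)
{
    int b[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int c[4] = { 10, 20, 30, 40 };
    int a_dup[8], a_lane[8];

    foo_dup(a_dup, b, c);
    foo_lane(a_lane, b, c);
    assert(memcmp(a_dup, a_lane, sizeof a_dup) == 0);
}
```

Calling check_equivalent() from a main() confirms that both variants compute the same result. The only difference is that foo_lane never materializes the uniform vector, which is the step I would like to be able to skip on targets that have the by-lane forms.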