On Fri, Oct 11, 2013 at 03:54:08PM +0100, Vidya Praveen wrote:
> Here's a compilable example:
>
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int *__restrict__ c)
> {
>   int i;
>
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c[2];
> }
>
> This is vectorized by duplicating c[2] now. But I'm trying to take advantage
> of target instructions that can take a vector register as second argument but
> use only one element (by using the same value for all the lanes) of the
> vector register.
>
> Eg. mul <vec-reg>, <vec-reg>, <vec-reg>[index]
>     mla <vec-reg>, <vec-reg>, <vec-reg>[index] // multiply and add
>
> But for a loop like the one in the C example given, I will have to load the
> c[2] in one element of the vector register (leaving the remaining unused)
> rather. This is why I was proposing to load just one element in a vector
> register (what I meant as "lane specific load"). The benefit of doing this is
> that we avoid explicit duplication, however such a simplification can only
> be done where such support is available - the reason why I was thinking in
> terms of optional standard pattern name. Another benefit is we will also be
> able to support scalars in the expression like in the following example:
>
> void
> foo (int *__restrict__ a,
>      int *__restrict__ b,
>      int c)
> {
>   int i;
>
>   for (i = 0; i < 8; i++)
>     a[i] = b[i] * c;
> }

So just let the broadcast operation be combined with the arithmetic
during combine?  The Intel AVX512 ISA has a similar feature; I'm not sure
what exactly they are doing for this.

That said, the broadcast is likely going to be hoisted before the loop,
and in that case is it really cheaper to keep the value unbroadcast in a
vector register than to broadcast it before the loop and just use it
there?

	Jakub
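
To make the two alternatives concrete, here is a rough sketch of the first
example written both ways by hand.  It assumes the GCC/Clang vector_size
extension and 4 lanes of int; the names v4si, mul_by_lane, foo_broadcast
and foo_lane are made up for illustration, and mul_by_lane only models in
plain C what a "mul <vec-reg>, <vec-reg>, <vec-reg>[index]" instruction
would do.  It is not what the vectorizer emits.

```c
#include <string.h>

/* Four ints per vector.  GCC/Clang extension, not ISO C.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Variant 1: duplicate c[2] into every lane once, before the loop,
   and use an ordinary vector multiply inside it.  */
void
foo_broadcast (int *__restrict__ a, int *__restrict__ b,
	       int *__restrict__ c)
{
  v4si vc = { c[2], c[2], c[2], c[2] };	/* hoisted broadcast */
  int i;

  for (i = 0; i < 8; i += 4)
    {
      v4si vb, va;
      memcpy (&vb, b + i, sizeof vb);
      va = vb * vc;
      memcpy (a + i, &va, sizeof va);
    }
}

/* Hypothetical stand-in for a by-element multiply: every lane of X is
   multiplied by the single lane Y[INDEX].  */
static v4si
mul_by_lane (v4si x, v4si y, int index)
{
  v4si r = { 0, 0, 0, 0 };
  int k;

  for (k = 0; k < 4; k++)
    r[k] = x[k] * y[index];
  return r;
}

/* Variant 2: "lane specific load".  Only lane 0 of vc holds c[2]; the
   other lanes are unused, and no explicit duplication is done.  */
void
foo_lane (int *__restrict__ a, int *__restrict__ b,
	  int *__restrict__ c)
{
  v4si vc = { c[2], 0, 0, 0 };		/* one lane loaded before the loop */
  int i;

  for (i = 0; i < 8; i += 4)
    {
      v4si vb, va;
      memcpy (&vb, b + i, sizeof vb);
      va = mul_by_lane (vb, vc, 0);
      memcpy (a + i, &va, sizeof va);
    }
}
```

Both variants do one vector setup before the loop and one multiply per
iteration, so any difference comes down to the cost of the setup itself
(broadcast versus single-lane load) and of the by-element form of the
multiply on the given target.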