Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-16 Thread Segher Boessenkool
On Wed, Sep 16, 2020 at 10:31:54AM +0200, Richard Biener wrote:
> On Tue, Sep 15, 2020 at 6:18 PM Segher Boessenkool
>  wrote:
> >
> > On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> > > On Tue, Sep 15, 2020 at 5:56 AM luoxhu  wrote:
> > > > > u[n % 4] = i;
> > > > >
> > > > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
> >
> > (As an aside -- please use "& 3" instead: that works fine if n is signed
> > as well, but modulo doesn't.  Maybe that is in the patch already, I
> > didn't check, sorry.)
> >
> > > note this is why I asked about the actual CPU instruction - as I read
> > > Seghers mail
> > > the instruction modifies a vector register, not memory.
> >
> > But note that the builtin is not the same as the machine instruction --
> > here there shouldn't be a difference if compiling for a new enough ISA,
> > but the builtin is available on anything with at least AltiVec.
> 
> Well, given we're trying to improve instruction selection that very much
> should depend on the ISA.  Thus if the target says it cannot vec_set
> to a register in a variable position then we don't want to pretend it can.

Of course :-)  We shouldn't look at what the machine insn can do, but
also not at what the source level constructs can; we should just ask the
target code itself.


Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-16 Thread Richard Biener via Gcc-patches
On Wed, Sep 16, 2020 at 8:15 AM luoxhu  wrote:
>
>
>
> On 2020/9/15 14:51, Richard Biener wrote:
>
>
> >> I only see VAR_DECL and PARM_DECL; is there any function to check whether
> >> the tree variable is global?  I added DECL_REGISTER, but the RTL still
> >> expands to stack:
> >
> > is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
> > IFN_SET only
> > makes sense if there's the chance we can promote the variable to a
> > register.  But it
> > would be an incorrect transform (it stores the whole vector) if the
> > vector storage
> > could "escape" to another thread - which means you probably have to check
> > !TREE_ADDRESSABLE as well.
> >
>
> The tree for param "u" is marked ADDRESSABLE when generating
> "VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;"; if !TREE_ADDRESSABLE is
> checked, no IFN_SET will be produced in gimple-isel.

The TREE_ADDRESSABLE bit should be cleared later on GIMPLE during
execute_update_address_taken.  If it is not we should fix that.

> #1  0x1066c700 in convert_vector_to_array_for_subscript (loc=5307072,
> vecp=0x7fffc5d0, index=<trunc_mod_expr 0x759c73a0>)
> at ../../gcc/gcc/c-family/c-common.c:8169
> #2  0x10553b54 in build_array_ref (loc=5307072,
> array=<parm_decl 0x75ad0100 u>, index=<trunc_mod_expr 0x759c73a0>)
> at ../../gcc/gcc/c/c-typeck.c:2668
> #3  0x105c8824 in c_parser_postfix_expression_after_primary
> (parser=0x77f703f0, expr_loc=5307040, expr=...)
> at ../../gcc/gcc/c/c-parser.c:10494
> #4  0x105c7570 in c_parser_postfix_expression (parser=0x77f703f0)
> at ../../gcc/gcc/c/c-parser.c:10216
>
> >>
> >> My current implementation does:
> >>
> >> 1)  v = vec_insert (i, u, n);
> >>
> >> =>gimple:
> >> {
> >>register __vector signed int D.3190;
> >>D.3190 = u;// *new decl and copy u first.*
> >>_1 = n & 3;
> >>VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
> >>_2 = D.3190;
> >>...
> >> }
> >>
> >> =>isel:
> >> {
> >>register __vector signed int D.3190;
> >>D.3190 = u_4(D);
> >>_1 = n_6(D) & 3;
> >>.VEC_SET (&D.3190, i_7(D), _1);
> >
> > why are you passing the address of D.3190 to .VEC_SET?  That will not
> > make D.3190
> > be expanded to a pseudo.   You really need to have GIMPLE registers
> > here (SSA name)
> > and thus a return value, leaving the argument unmodified
> >
> >D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);
> >
> > note this is why I asked about the actual CPU instruction - as I read
> > Seghers mail
> > the instruction modifies a vector register, not memory.
> >
>
> Updated the code and got the expected gimple-isel output and ASM for both cases:
>
> pr79251.c.236t.isel:
>
> __attribute__((noinline))
> test (__vector signed int u, int i, size_t n)
> {
>   long unsigned int _1;
>   __vector signed int _6;
>   vector(4) int _7;
>   vector(4) int vec_set_dst_8;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = n_2(D) & 3;
>   _7 = u;
>   vec_set_dst_8 = .VEC_SET (_7, i_4(D), _1);
>   u = vec_set_dst_8;
>   _6 = u;
>   return _6;
>
> }
>
> But the tree variable "u" needs "TREE_ADDRESSABLE (view_op0) = 0;" after
> generating the IFN VEC_SET (maybe checking for IFN VEC_SET and clearing the
> bit in discover_nonconstant_array_refs would be better later); otherwise "u"
> will still be expanded to the stack by the expander:
>  "Replacing Expressions:  _7 replace with --> _7 = u;".
>
> Setting "u" to non-addressable also seems really unreasonable for cases like
> the one below, where u[n+1] needs to be ADDRESSABLE:

Why should u be addressable for u[n+1]?

> __attribute__ ((noinline)) vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>  u[n % 4] = i;
>  u[n+1] = i+1;
>  return u;
> }
>
> => gimple-isel with both VEC_SET and VIEW_CONVERT_EXPR:
>
> test (__vector signed int u, int i, size_t n)
> {
>   long unsigned int _1;
>   long unsigned int _2;
>   int _3;
>   __vector signed int _9;
>   vector(4) int _10;
>   vector(4) int vec_set_dst_11;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = n_4(D) & 3;
>   _10 = u;
>   vec_set_dst_11 = .VEC_SET (_10, i_6(D), _1);
>   u = vec_set_dst_11;
>   _2 = n_4(D) + 1;
>   _3 = i_6(D) + 1;
>   VIEW_CONVERT_EXPR<vector(4) int>(u)[_2] = _3;

why is .VEC_SET not used here?

>   _9 = u;
>   return _9;
>
> }
>
>
> The code below generates the IFN call; create_tmp_reg or
> create_tmp_reg_ssa_name does not seem to create a variable that will be
> allocated to a register?

Yes, they create decls, use make_ssa_name (TREE_TYPE (view_op0)) to create
a SSA name of the desired type.

>
> diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
> index b330cf4c20e..a699022cd09 100644
> --- a/gcc/gimple-isel.cc
> +++ b/gcc/gimple-isel.cc
> ...
> +  if (!is_global_var (view_op0)
> + && TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
> + && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
> + && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
> + && determine_value_range (pos, &minv, &maxv) == VR_RANGE
> + && wi::geu_p (minv, 0)
> + &&

Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-16 Thread Richard Biener via Gcc-patches
On Tue, Sep 15, 2020 at 6:18 PM Segher Boessenkool
 wrote:
>
> On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> > On Tue, Sep 15, 2020 at 5:56 AM luoxhu  wrote:
> > > > u[n % 4] = i;
> > > >
> > > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
>
> (As an aside -- please use "& 3" instead: that works fine if n is signed
> as well, but modulo doesn't.  Maybe that is in the patch already, I
> didn't check, sorry.)
>
> > note this is why I asked about the actual CPU instruction - as I read
> > Seghers mail
> > the instruction modifies a vector register, not memory.
>
> But note that the builtin is not the same as the machine instruction --
> here there shouldn't be a difference if compiling for a new enough ISA,
> but the builtin is available on anything with at least AltiVec.

Well, given we're trying to improve instruction selection that very much
should depend on the ISA.  Thus if the target says it cannot vec_set
to a register in a variable position then we don't want to pretend it can.

Richard.

>
> Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-15 Thread luoxhu via Gcc-patches



On 2020/9/15 14:51, Richard Biener wrote:


>> I only see VAR_DECL and PARM_DECL; is there any function to check whether
>> the tree variable is global?  I added DECL_REGISTER, but the RTL still
>> expands to stack:
> 
> is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
> IFN_SET only
> makes sense if there's the chance we can promote the variable to a
> register.  But it
> would be an incorrect transform (it stores the whole vector) if the
> vector storage
> could "escape" to another thread - which means you probably have to check
> !TREE_ADDRESSABLE as well.
> 

The tree for param "u" is marked ADDRESSABLE when generating
"VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;"; if !TREE_ADDRESSABLE is
checked, no IFN_SET will be produced in gimple-isel.


#1  0x1066c700 in convert_vector_to_array_for_subscript (loc=5307072,
vecp=0x7fffc5d0, index=<trunc_mod_expr 0x759c73a0>)
at ../../gcc/gcc/c-family/c-common.c:8169
#2  0x10553b54 in build_array_ref (loc=5307072,
array=<parm_decl 0x75ad0100 u>, index=<trunc_mod_expr 0x759c73a0>)
at ../../gcc/gcc/c/c-typeck.c:2668
#3  0x105c8824 in c_parser_postfix_expression_after_primary
(parser=0x77f703f0, expr_loc=5307040, expr=...)
at ../../gcc/gcc/c/c-parser.c:10494
#4  0x105c7570 in c_parser_postfix_expression (parser=0x77f703f0)
at ../../gcc/gcc/c/c-parser.c:10216

>>
>> My current implementation does:
>>
>> 1)  v = vec_insert (i, u, n);
>>
>> =>gimple:
>> {
>>register __vector signed int D.3190;
>>D.3190 = u;// *new decl and copy u first.*
>>_1 = n & 3;
>>VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
>>_2 = D.3190;
>>...
>> }
>>
>> =>isel:
>> {
>>register __vector signed int D.3190;
>>D.3190 = u_4(D);
>>_1 = n_6(D) & 3;
>>.VEC_SET (&D.3190, i_7(D), _1);
> 
> why are you passing the address of D.3190 to .VEC_SET?  That will not
> make D.3190
> be expanded to a pseudo.   You really need to have GIMPLE registers
> here (SSA name)
> and thus a return value, leaving the argument unmodified
> 
>D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);
> 
> note this is why I asked about the actual CPU instruction - as I read
> Seghers mail
> the instruction modifies a vector register, not memory.
> 

Updated the code and got the expected gimple-isel output and ASM for both cases:

pr79251.c.236t.isel:

__attribute__((noinline))
test (__vector signed int u, int i, size_t n)
{
  long unsigned int _1;
  __vector signed int _6;
  vector(4) int _7;
  vector(4) int vec_set_dst_8;

  <bb 2> [local count: 1073741824]:
  _1 = n_2(D) & 3;
  _7 = u;
  vec_set_dst_8 = .VEC_SET (_7, i_4(D), _1);
  u = vec_set_dst_8;
  _6 = u;
  return _6;

}

But the tree variable "u" needs "TREE_ADDRESSABLE (view_op0) = 0;" after
generating the IFN VEC_SET (maybe checking for IFN VEC_SET and clearing the
bit in discover_nonconstant_array_refs would be better later); otherwise "u"
will still be expanded to the stack by the expander:
 "Replacing Expressions:  _7 replace with --> _7 = u;".

Setting "u" to non-addressable also seems really unreasonable for cases like
the one below, where u[n+1] needs to be ADDRESSABLE:

__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
 u[n % 4] = i;
 u[n+1] = i+1;
 return u;
}

=> gimple-isel with both VEC_SET and VIEW_CONVERT_EXPR:

test (__vector signed int u, int i, size_t n)
{
  long unsigned int _1;
  long unsigned int _2;
  int _3;
  __vector signed int _9;
  vector(4) int _10;
  vector(4) int vec_set_dst_11;

  <bb 2> [local count: 1073741824]:
  _1 = n_4(D) & 3;
  _10 = u;
  vec_set_dst_11 = .VEC_SET (_10, i_6(D), _1);
  u = vec_set_dst_11;
  _2 = n_4(D) + 1;
  _3 = i_6(D) + 1;
  VIEW_CONVERT_EXPR<vector(4) int>(u)[_2] = _3;
  _9 = u;
  return _9;

}


The code below generates the IFN call; create_tmp_reg or
create_tmp_reg_ssa_name does not seem to create a variable that will be
allocated to a register?


diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index b330cf4c20e..a699022cd09 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
...
+  if (!is_global_var (view_op0)
+ && TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
+ && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
+ && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
+ && determine_value_range (pos, &minv, &maxv) == VR_RANGE
+ && wi::geu_p (minv, 0)
+ && wi::leu_p (maxv, (128 / GET_MODE_BITSIZE (innermode
+   {
+ location_t loc = gimple_location (stmt);
+ tree var_src = create_tmp_reg (TREE_TYPE (view_op0));
+ tree var_dst
+   = make_temp_ssa_name (TREE_TYPE (view_op0), NULL, "vec_set_dst");
+ TREE_ADDRESSABLE (view_op0) = 0;
+
+ ass_stmt = gimple_build_assign (var_src, view_op0);
+ gimple_set_location (ass_stmt, loc);
+ gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
+
+ new_stmt
+   = gimple_build_call_internal (IFN_VEC_SET, 3, var_src, val, pos);
+
+ gimple_call_set_lhs (new_stmt, var_dst);
+
+  

Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-15 Thread Segher Boessenkool
On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> On Tue, Sep 15, 2020 at 5:56 AM luoxhu  wrote:
> > > u[n % 4] = i;
> > >
> > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?

(As an aside -- please use "& 3" instead: that works fine if n is signed
as well, but modulo doesn't.  Maybe that is in the patch already, I
didn't check, sorry.)

> note this is why I asked about the actual CPU instruction - as I read
> Seghers mail
> the instruction modifies a vector register, not memory.

But note that the builtin is not the same as the machine instruction --
here there shouldn't be a difference if compiling for a new enough ISA,
but the builtin is available on anything with at least AltiVec.


Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Richard Biener via Gcc-patches
On Tue, Sep 15, 2020 at 5:56 AM luoxhu  wrote:
>
>
>
> On 2020/9/14 17:47, Richard Biener wrote:
> > On Mon, Sep 14, 2020 at 10:05 AM luoxhu  wrote:
>
> >> Not sure whether this reflects the issues you discussed above.
> >> I constructed the test cases below and tested with and without this patch;
> >> only for "a+c" (which means store only) does the performance get worse
> >> with this patch, due to the extra load/store (about 50% slower).
> >> For the "v = vec_insert (i, v, n);" usage, v is always loaded after the
> >> store, so this patch always gets a big benefit.
> >> But for the "v[n % 4] = i;" usage, it depends on whether "v" is used
> >> immediately inside the function or soon after outside of it.  Does this
> >> mean unifying the two usages to the same gimple code is sometimes not a
> >> good idea?  Or is it possible to detect whether the destination "v" of the
> >> generated IFN ".VEC_INSERT (&v, i_4(D), _1);" is used not far away, inside
> >> or outside of the function?
> >>
> >>
> >> #define TYPE int
> >>
> >> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> >> vector TYPE s = {4, 3, 2, 1};
> >>
> >> __attribute__ ((noinline))
> >> vector TYPE
> >> test (vector TYPE u, TYPE i, size_t n)
> >> {
> >>v[n % 4] = i;
> >
> > ^^^
> >
> > this should be
> >
> > u[n % 4] = i;
> >
> > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
>
> Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert
> (i, u, n);" is a bit different from "u[n % 4] = i;" since it returns a copy
> of u instead of modifying u.  The adjustment when lowering
> __builtin_vec_insert_xxx to gimple makes a copy of u first to meet this
> requirement.
>
> >
> > If you tested with the above error you probably need to re-measure.
>
> No, I did test with u as a local instead of a global before.  If u is not
> used very soon, the performance is almost the same whether we generate a
> single store or an IFN_SET insert with a variable index.
>
> source:
> __attribute__ ((noinline)) vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   u[n % 4] = i;
>   vector TYPE r = {0};
>   for (long k = 0; k < 100; k++)
> {
>   r += v;
> }
>   return u+r;
> }
>
> => store hit load is relieved due to long distance.
>
> ASM:
> 0:  addis 2,12,.TOC.-.LCF0@ha
> addi 2,2,.TOC.-.LCF0@l
> .localentry test,.-test
> addis 10,2,.LANCHOR0@toc@ha
> li 8,50
> xxspltib 32,0
> addi 9,1,-16
> rldic 6,6,2,60
> stxv 34,-16(1)
> addi 10,10,.LANCHOR0@toc@l
> mtctr 8
> xxlor 33,32,32
> stwx 5,9,6  // short store
> lxv 45,0(10)
> .p2align 4,,15
> .L2:
> vadduwm 0,0,13
> vadduwm 1,1,13
> bdnz .L2
> vadduwm 0,0,1
> lxv 33,-16(1)   // wide load
> vadduwm 2,0,1
> blr
>
>
> Then I intend to use "v" there as a global-memory insert test, to separate
> the store and the load into different functions ("v" is stored in the
> function but loaded outside of it; a single store vs. lxv+xxperm+xxsel+stxv
> will perform differently, and the inner function doesn't know the distance
> between the load and the store across functions).
>
> Do you mean that we don't need to generate IFN_SET if "v" is global memory?

Yes.

> I only see VAR_DECL and PARM_DECL; is there any function to check whether
> the tree variable is global?  I added DECL_REGISTER, but the RTL still
> expands to stack:

is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
IFN_SET only
makes sense if there's the chance we can promote the variable to a
register.  But it
would be an incorrect transform (it stores the whole vector) if the
vector storage
could "escape" to another thread - which means you probably have to check
!TREE_ADDRESSABLE as well.

>
> gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, 
> EXPAND_WRITE);
>
> (gdb) p view_op0
> $584 = 
> (gdb) p DECL_REGISTER(view_op0)
> $585 = 1
> (gdb) p to_rtx
> $586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 
> S16 A128])
>
>
> >
> > On gimple the above function (after fixing it) looks like
> >
> >VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
> >
> > and the IFN idea I had would - for non-global memory 'u' only - transform
> > this to
> >
> >vector_register_2 = u;
> >vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
> >u = vector_register_3;
> >
> > if vec_set can handle variable indexes.  This then becomes a
> > vec_set on a register and if that was the only variable indexing of 'u'
> > will also cause 'u' to be expanded as register rather than stack memory.
> >
> > Note we can't use the direct-optab method here since the vec_set optab
> > modifies operand 0 which isn't possible in SSA form.  That might hint
> > at that we eventually want to extend BIT_INSERT_EXPR to handle
> > a non-constant bit position but for experiments using an alternate
> > internal function is certainly easier.
> >
>
> My current implementation does

Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread luoxhu via Gcc-patches



On 2020/9/14 17:47, Richard Biener wrote:

On Mon, Sep 14, 2020 at 10:05 AM luoxhu  wrote:



Not sure whether this reflects the issues you discussed above.
I constructed the test cases below and tested with and without this patch;
only for "a+c" (which means store only) does the performance get worse with
this patch, due to the extra load/store (about 50% slower).
For the "v = vec_insert (i, v, n);" usage, v is always loaded after the store,
so this patch always gets a big benefit.
But for the "v[n % 4] = i;" usage, it depends on whether "v" is used
immediately inside the function or soon after outside of it.  Does this mean
unifying the two usages to the same gimple code is sometimes not a good idea?
Or is it possible to detect whether the destination "v" of the generated IFN
".VEC_INSERT (&v, i_4(D), _1);" is used not far away, inside or outside of
the function?


#define TYPE int

vector TYPE v = {1, 2, 3, 4};   // a. global vector.
vector TYPE s = {4, 3, 2, 1};

__attribute__ ((noinline))
vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
   v[n % 4] = i;


^^^

this should be

u[n % 4] = i;

I guess.  Is the % 4 mandated by the vec_insert semantics btw?


Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert
(i, u, n);" is a bit different from "u[n % 4] = i;" since it returns a copy
of u instead of modifying u.  The adjustment when lowering
__builtin_vec_insert_xxx to gimple makes a copy of u first to meet this
requirement.



If you tested with the above error you probably need to re-measure.


No, I did test with u as a local instead of a global before.  If u is not
used very soon, the performance is almost the same whether we generate a
single store or an IFN_SET insert with a variable index.

source:
__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
 u[n % 4] = i;
 vector TYPE r = {0};
 for (long k = 0; k < 100; k++)
   {
 r += v;
   }
 return u+r;
}

=> store hit load is relieved due to long distance.

ASM:
0:  addis 2,12,.TOC.-.LCF0@ha
   addi 2,2,.TOC.-.LCF0@l
   .localentry test,.-test
   addis 10,2,.LANCHOR0@toc@ha
   li 8,50
   xxspltib 32,0
   addi 9,1,-16
   rldic 6,6,2,60
   stxv 34,-16(1)
   addi 10,10,.LANCHOR0@toc@l
   mtctr 8
   xxlor 33,32,32
   stwx 5,9,6  // short store
   lxv 45,0(10)
   .p2align 4,,15
.L2:
   vadduwm 0,0,13
   vadduwm 1,1,13
   bdnz .L2
   vadduwm 0,0,1
   lxv 33,-16(1)   // wide load
   vadduwm 2,0,1
   blr


Then I intend to use "v" there as a global-memory insert test, to separate
the store and the load into different functions ("v" is stored in the
function but loaded outside of it; a single store vs. lxv+xxperm+xxsel+stxv
will perform differently, and the inner function doesn't know the distance
between the load and the store across functions).

Do you mean that we don't need to generate IFN_SET if "v" is global memory?
I only see VAR_DECL and PARM_DECL; is there any function to check whether the
tree variable is global?  I added DECL_REGISTER, but the RTL still expands to
stack:

gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, 
EXPAND_WRITE);

(gdb) p view_op0
$584 = 
(gdb) p DECL_REGISTER(view_op0)
$585 = 1
(gdb) p to_rtx
$586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 
S16 A128])




On gimple the above function (after fixing it) looks like

   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);

and the IFN idea I had would - for non-global memory 'u' only - transform
this to

   vector_register_2 = u;
   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
   u = vector_register_3;

if vec_set can handle variable indexes.  This then becomes a
vec_set on a register and if that was the only variable indexing of 'u'
will also cause 'u' to be expanded as register rather than stack memory.

Note we can't use the direct-optab method here since the vec_set optab
modifies operand 0 which isn't possible in SSA form.  That might hint
at that we eventually want to extend BIT_INSERT_EXPR to handle
a non-constant bit position but for experiments using an alternate
internal function is certainly easier.



My current implementation does:

1)  v = vec_insert (i, u, n);

=>gimple:
{
 register __vector signed int D.3190;
 D.3190 = u;// *new decl and copy u first.*
 _1 = n & 3;
 VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
 _2 = D.3190;
 ...
}

=>isel:
{
 register __vector signed int D.3190;
 D.3190 = u_4(D);
 _1 = n_6(D) & 3;
 .VEC_SET (&D.3190, i_7(D), _1);
 _2 = D.3190;
 ...
}


2) u[n % 4] = i;

=>gimple:
{
 __vector signed int D.3191;
 _1 = n & 3;
 VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
 D.3191 = u;   
...

}

=>isel:
{
 D.3191 = u_4(D);
 _1 = n_6(D) & 3;
 .VEC_SET (&D.3191, i_7(D), _1);
 _2 = D.3191;
 v = _2;
}

The IFN ".VEC_SET" behavior quite like the other IFN STORE functions and doesn't
require a dest operand to be set?  Both 1) and 2) are modifying operand 0 of
VIEW_CONV

Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Segher Boessenkool
On Mon, Sep 14, 2020 at 11:47:52AM +0100, Richard Sandiford wrote:
> Would it be worth changing the optab so that the input and output are
> separate?  Having a single operand would be justified if the operation
> was only supposed to touch the selected bytes, but most targets wouldn't
> guarantee that for memory operands, even as things stand.

You have my vote.

> Or maybe the idea was to force the RA's hand by making the input and
> output tied even before RA, with separate moves where necessary.
> But I'm not sure why vec_set is so special that it requires this
> treatment and other optabs don't.

Yeah.  The register allocator is normally very good in using the same
reg in both places, if that is useful.  And it also handles the case
where your machine insns require the two to be the same pretty well.
Not restricting this stuff before RA should be a win.


Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Segher Boessenkool
On Mon, Sep 14, 2020 at 11:47:56AM +0200, Richard Biener wrote:
> this should be
> 
>u[n % 4] = i;
> 
> I guess.  Is the % 4 mandated by the vec_insert semantics btw?

Yes:

  VEC_INSERT (ARG1, ARG2, ARG3)
  Purpose: Returns a copy of vector ARG2 with element ARG3 replaced by
  the value of ARG1.  Result value: A copy of vector ARG2 with element
  ARG3 replaced by the value of ARG1. This function uses modular
  arithmetic on ARG3 to determine the element number. For example, if
  ARG3 is out of range, the compiler uses ARG3 modulo the number of
  elements in the vector to determine the element position.

The builtin requires it.  The machine insns work like that, too, e.g.:

  vinswlx VRT,RA,RB

  if MSR.VEC=0 then Vector_Unavailable()
  index ← GPR[RA].bit[60:63]
  VSR[VRT+32].byte[index:index+3] ← GPR[RB].bit[32:63]

  Let index be the contents of bits 60:63 of GPR[RA].

  The contents of bits 32:63 of GPR[RB] are placed into
  byte elements index:index+3 of VSR[VRT+32].

  All other byte elements of VSR[VRT+32] are not
  modified.

  If index is greater than 12, the result is undefined.


Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Segher Boessenkool
On Thu, Sep 10, 2020 at 12:08:44PM +0200, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
>  wrote:
> > There often are problems over function calls (where the compiler cannot
> > usually *see* how something is used).
> 
> Yep.  The best way would be to use small loads and larger stores
> which is what CPUs usually tend to handle fine (with alignment
> constraints, etc.).  Of course that's not what either of the "solutions"
> can do.

Yes, and yes.

> That said, since you seem to be "first" in having an instruction
> to insert into a vector at a variable position the idea that we'd
> have to spill anyway for this to be expanded and thus we expand
> the vector to a stack location in the first place falls down.  And
> that's where I'd first try to improve things.
> 
> So what can the CPU actually do?

Both immediate and variable inserts, of 1, 2, 4, or 8 bytes.  The
inserted part is not allowed to cross the 16B boundary (all aligned
stuff never has that problem).  Variable inserts look at only the low
bits of the GPR that says where to insert (4 bits for bytes, 3 bits
for halfs, etc.)


Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Richard Sandiford
Richard Biener  writes:
> On Mon, Sep 14, 2020 at 12:47 PM Richard Sandiford
>  wrote:
>>
>> Richard Biener via Gcc-patches  writes:
>> > On gimple the above function (after fixing it) looks like
>> >
>> >   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
>> >
>> > and the IFN idea I had would - for non-global memory 'u' only - transform
>> > this to
>> >
>> >   vector_register_2 = u;
>> >   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>> >   u = vector_register_3;
>> >
>> > if vec_set can handle variable indexes.  This then becomes a
>> > vec_set on a register and if that was the only variable indexing of 'u'
>> > will also cause 'u' to be expanded as register rather than stack memory.
>> >
>> > Note we can't use the direct-optab method here since the vec_set optab
>> > modifies operand 0 which isn't possible in SSA form.
>>
>> Would it be worth changing the optab so that the input and output are
>> separate?  Having a single operand would be justified if the operation
>> was only supposed to touch the selected bytes, but most targets wouldn't
>> guarantee that for memory operands, even as things stand.
>
> I thought about this as well, just didn't want to bother Xiong Hu Luo with
> it for the experiments.
>
>> Or maybe the idea was to force the RA's hand by making the input and
>> output tied even before RA, with separate moves where necessary.
>> But I'm not sure why vec_set is so special that it requires this
>> treatment and other optabs don't.
>
> Certainly the define_expand do not have to be define_insn so the target
> can force the RAs hand just fine if it likes to.
>
> The more interesting question of course is how to query vec_set whether
> it accepts variable indices w/o building too much garbage RTL.

Probably easiest to do something like can_vcond_compare_p, i.e.:

  machine_mode mode = insn_data[icode].operand[N].mode;
  rtx reg = alloca_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1);
  … insn_operand_matches (icode, N, reg) …

Thanks,
Richard


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Richard Biener via Gcc-patches
On Mon, Sep 14, 2020 at 12:47 PM Richard Sandiford
 wrote:
>
> Richard Biener via Gcc-patches  writes:
> > On gimple the above function (after fixing it) looks like
> >
> >   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
> >
> > and the IFN idea I had would - for non-global memory 'u' only - transform
> > this to
> >
> >   vector_register_2 = u;
> >   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
> >   u = vector_register_3;
> >
> > if vec_set can handle variable indexes.  This then becomes a
> > vec_set on a register and if that was the only variable indexing of 'u'
> > will also cause 'u' to be expanded as register rather than stack memory.
> >
> > Note we can't use the direct-optab method here since the vec_set optab
> > modifies operand 0 which isn't possible in SSA form.
>
> Would it be worth changing the optab so that the input and output are
> separate?  Having a single operand would be justified if the operation
> was only supposed to touch the selected bytes, but most targets wouldn't
> guarantee that for memory operands, even as things stand.

I thought about this as well, just didn't want to bother Xiong Hu Luo with
it for the experiments.

> Or maybe the idea was to force the RA's hand by making the input and
> output tied even before RA, with separate moves where necessary.
> But I'm not sure why vec_set is so special that it requires this
> treatment and other optabs don't.

Certainly the define_expand do not have to be define_insn so the target
can force the RAs hand just fine if it likes to.

The more interesting question of course is how to query vec_set whether
it accepts variable indices w/o building too much garbage RTL.

Richard.

>
> Thanks,
> Richard
>
>
> > That might hint at that we eventually want to extend BIT_INSERT_EXPR
> > to handle a non-constant bit position but for experiments using an
> > alternate internal function is certainly easier.
> >
> > Richard.


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Richard Sandiford
Richard Biener via Gcc-patches  writes:
> On gimple the above function (after fixing it) looks like
>
>   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
>
> and the IFN idea I had would - for non-global memory 'u' only - transform
> this to
>
>   vector_register_2 = u;
>   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>   u = vector_register_3;
>
> if vec_set can handle variable indexes.  This then becomes a
> vec_set on a register and if that was the only variable indexing of 'u'
> will also cause 'u' to be expanded as register rather than stack memory.
>
> Note we can't use the direct-optab method here since the vec_set optab
> modifies operand 0 which isn't possible in SSA form.

Would it be worth changing the optab so that the input and output are
separate?  Having a single operand would be justified if the operation
was only supposed to touch the selected bytes, but most targets wouldn't
guarantee that for memory operands, even as things stand.

Or maybe the idea was to force the RA's hand by making the input and
output tied even before RA, with separate moves where necessary.
But I'm not sure why vec_set is so special that it requires this
treatment and other optabs don't.

Thanks,
Richard


> That might hint at that we eventually want to extend BIT_INSERT_EXPR
> to handle a non-constant bit position but for experiments using an
> alternate internal function is certainly easier.
>
> Richard.


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread Richard Biener via Gcc-patches
On Mon, Sep 14, 2020 at 10:05 AM luoxhu  wrote:
>
>
>
> On 2020/9/10 18:08, Richard Biener wrote:
> > On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
> >  wrote:
> >>
> >> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> >>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
> >>>  wrote:
> 
>  Hi!
> 
>  On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > Hmm, yeah - I guess that's what should be addressed first then.
> > I'm quite sure that in case 'v' is not on the stack but in memory like
> > in my case a SImode store is better than what we get from
> > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > which is prone to inserting store-data-races?
> 
>  The other way around -- if it is in memory, and was stored as vector
>  recently, then reading back something shorter from it is prone to
>  SHL/LHS problems.  There is nothing special about the stack here, except
>  of course it is more likely to have been stored recently if on the
>  stack.  So it depends how often it has been stored recently which option
>  is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
>  often, the penalty is relatively bigger, so memory does not often win.
> 
>  I.e.: it needs to be measured.  Intuition is often wrong here.
> >>>
> >>> But the current method would simply do a direct store to memory
> >>> without a preceding read of the whole vector.
> >>
> >> The problem is even worse the other way: you do a short store here, but
> >> so a full vector read later.  If the store and read are far apart, that
> >> is fine, but if they are close (that is on the order of fifty or more
> >> insns), there can be problems.
> >
> > Sure, but you can't simply load/store a whole vector when the code didn't,
> > unless you know it will not introduce data races and it will not trap (thus
> > the whole vector needs to be at least naturally aligned).
> >
> > Also if there's a preceding short store you will now load the whole vector
> > to avoid the short store ... catch-22
> >
> >> There often are problems over function calls (where the compiler cannot
> >> usually *see* how something is used).
> >
> > Yep.  The best way would be to use small loads and larger stores
> > which is what CPUs usually tend to handle fine (with alignment
> > constraints, etc.).  Of course that's not what either of the "solutions"
> > can do.
> >
> > That said, since you seem to be "first" in having an instruction
> > to insert into a vector at a variable position, the idea that we'd
> > have to spill anyway for this to be expanded, and thus that we expand
> > the vector to a stack location in the first place, falls down.  And
> > that's where I'd first try to improve things.
> >
> > So what can the CPU actually do?
> >
>
> Not sure whether this reflects the issues you discussed above.
> I constructed the test cases below and tested with and without this patch;
> only for case "a+c" (which means store only) does performance get worse with
> this patch, due to the extra load/store (about 50% slower).
> For the "v = vec_insert (i, v, n);" usage, v is always loaded after the store,
> so this patch always gets a big benefit.
> But for the "v[n % 4] = i;" usage, it depends on whether "v" is used
> immediately inside the function or outside of the function soon.  Does this
> mean that unifying the two usages into the same GIMPLE code is not always a
> good idea?  Or is it possible to detect whether the destination "v" of the
> generated IFN ".VEC_INSERT (&v, i_4(D), _1);" is used not far away, inside
> or outside of the function?
>
>
> #define TYPE int
>
> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> vector TYPE s = {4, 3, 2, 1};
>
> __attribute__ ((noinline))
> vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   v[n % 4] = i;

^^^

this should be

   u[n % 4] = i;

I guess.  Is the % 4 mandated by the vec_insert semantics btw?

If you tested with the above error you probably need to re-measure.

On gimple the above function (after fixing it) looks like

  VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);

and the IFN idea I had would - for non-global memory 'u' only - transform
this to

  vector_register_2 = u;
  vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
  u = vector_register_3;

if vec_set can handle variable indexes.  This then becomes a
vec_set on a register and if that was the only variable indexing of 'u'
will also cause 'u' to be expanded as register rather than stack memory.

Note we can't use the direct-optab method here since the vec_set optab
modifies operand 0 which isn't possible in SSA form.  That might hint
at that we eventually want to extend BIT_INSERT_EXPR to handle
a non-constant bit position but for experiments using an alternate
internal function is certainly easier.

Richard.

>   return u;
> }
>
> int main ()
> {
>   //vector TYPE v = {1, 2, 3, 4};// b. local vector.
>   //vector TYPE s = {4, 3, 2, 1};
>   vector TYPE r = {0};
>   for (

Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-14 Thread luoxhu via Gcc-patches



On 2020/9/10 18:08, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
>  wrote:
>>
>> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
>>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
>>>  wrote:

 Hi!

 On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> Hmm, yeah - I guess that's what should be addressed first then.
> I'm quite sure that in case 'v' is not on the stack but in memory like
> in my case a SImode store is better than what we get from
> vec_insert - in fact vec_insert will likely introduce a RMW cycle
> which is prone to inserting store-data-races?

 The other way around -- if it is in memory, and was stored as vector
 recently, then reading back something shorter from it is prone to
 SHL/LHS problems.  There is nothing special about the stack here, except
 of course it is more likely to have been stored recently if on the
 stack.  So it depends how often it has been stored recently which option
 is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
 often, the penalty is relatively bigger, so memory does not often win.

 I.e.: it needs to be measured.  Intuition is often wrong here.
>>>
>>> But the current method would simply do a direct store to memory
>>> without a preceding read of the whole vector.
>>
>> The problem is even worse the other way: you do a short store here, but
>> so a full vector read later.  If the store and read are far apart, that
>> is fine, but if they are close (that is on the order of fifty or more
>> insns), there can be problems.
> 
> Sure, but you can't simply load/store a whole vector when the code didn't,
> unless you know it will not introduce data races and it will not trap (thus
> the whole vector needs to be at least naturally aligned).
> 
> Also if there's a preceding short store you will now load the whole vector
> to avoid the short store ... catch-22
> 
>> There often are problems over function calls (where the compiler cannot
>> usually *see* how something is used).
> 
> Yep.  The best way would be to use small loads and larger stores
> which is what CPUs usually tend to handle fine (with alignment
> constraints, etc.).  Of course that's not what either of the "solutions"
> can do.
> 
> That said, since you seem to be "first" in having an instruction
> to insert into a vector at a variable position, the idea that we'd
> have to spill anyway for this to be expanded, and thus that we expand
> the vector to a stack location in the first place, falls down.  And
> that's where I'd first try to improve things.
> 
> So what can the CPU actually do?
> 

Not sure whether this reflects the issues you discussed above.
I constructed the test cases below and tested with and without this patch;
only for case "a+c" (which means store only) does performance get worse with
this patch, due to the extra load/store (about 50% slower).
For the "v = vec_insert (i, v, n);" usage, v is always loaded after the store,
so this patch always gets a big benefit.
But for the "v[n % 4] = i;" usage, it depends on whether "v" is used
immediately inside the function or outside of the function soon.  Does this
mean that unifying the two usages into the same GIMPLE code is not always a
good idea?  Or is it possible to detect whether the destination "v" of the
generated IFN ".VEC_INSERT (&v, i_4(D), _1);" is used not far away, inside
or outside of the function?


#define TYPE int

vector TYPE v = {1, 2, 3, 4};   // a. global vector.
vector TYPE s = {4, 3, 2, 1};

__attribute__ ((noinline))
vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
  v[n % 4] = i;
  return u;
}

int main ()
{
  //vector TYPE v = {1, 2, 3, 4};// b. local vector.
  //vector TYPE s = {4, 3, 2, 1};
  vector TYPE r = {0};
  for (long k = 0; k < 3989478945; k++)
{
r += test (s, 254.0f, k);  // c. access different vector. store only.
   //r += test (v, 254.0f, k);  // d. access same vector. load after store in callee function.
  //test (s, 254.0f, k); r += v;   // e. load after store out of function.
}
  return r[0];
}


Thanks,
Xionghu



Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-10 Thread Richard Biener via Gcc-patches
On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool
 wrote:
>
> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> > On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
> >  wrote:
> > >
> > > Hi!
> > >
> > > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > > > Hmm, yeah - I guess that's what should be addressed first then.
> > > > I'm quite sure that in case 'v' is not on the stack but in memory like
> > > > in my case a SImode store is better than what we get from
> > > > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > > > which is prone to inserting store-data-races?
> > >
> > > The other way around -- if it is in memory, and was stored as vector
> > > recently, then reading back something shorter from it is prone to
> > > SHL/LHS problems.  There is nothing special about the stack here, except
> > > of course it is more likely to have been stored recently if on the
> > > stack.  So it depends how often it has been stored recently which option
> > > is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> > > often, the penalty is relatively bigger, so memory does not often win.
> > >
> > > I.e.: it needs to be measured.  Intuition is often wrong here.
> >
> > But the current method would simply do a direct store to memory
> > without a preceding read of the whole vector.
>
> The problem is even worse the other way: you do a short store here, but
> so a full vector read later.  If the store and read are far apart, that
> is fine, but if they are close (that is on the order of fifty or more
> insns), there can be problems.

Sure, but you can't simply load/store a whole vector when the code didn't,
unless you know it will not introduce data races and it will not trap (thus
the whole vector needs to be at least naturally aligned).

Also if there's a preceding short store you will now load the whole vector
to avoid the short store ... catch-22

> There often are problems over function calls (where the compiler cannot
> usually *see* how something is used).

Yep.  The best way would be to use small loads and larger stores
which is what CPUs usually tend to handle fine (with alignment
constraints, etc.).  Of course that's not what either of the "solutions"
can do.

That said, since you seem to be "first" in having an instruction
to insert into a vector at a variable position, the idea that we'd
have to spill anyway for this to be expanded, and thus that we expand
the vector to a stack location in the first place, falls down.  And
that's where I'd first try to improve things.

So what can the CPU actually do?

Richard.

>
>
> Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-09 Thread Segher Boessenkool
On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
>  wrote:
> >
> > Hi!
> >
> > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > > Hmm, yeah - I guess that's what should be addressed first then.
> > > I'm quite sure that in case 'v' is not on the stack but in memory like
> > > in my case a SImode store is better than what we get from
> > > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > > which is prone to inserting store-data-races?
> >
> > The other way around -- if it is in memory, and was stored as vector
> > recently, then reading back something shorter from it is prone to
> > SHL/LHS problems.  There is nothing special about the stack here, except
> > of course it is more likely to have been stored recently if on the
> > stack.  So it depends how often it has been stored recently which option
> > is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> > often, the penalty is relatively bigger, so memory does not often win.
> >
> > I.e.: it needs to be measured.  Intuition is often wrong here.
> 
> But the current method would simply do a direct store to memory
> without a preceding read of the whole vector.

The problem is even worse the other way: you do a short store here, but
so a full vector read later.  If the store and read are far apart, that
is fine, but if they are close (that is on the order of fifty or more
insns), there can be problems.

There often are problems over function calls (where the compiler cannot
usually *see* how something is used).


Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-09 Thread Richard Biener via Gcc-patches
On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool
 wrote:
>
> Hi!
>
> On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> > Hmm, yeah - I guess that's what should be addressed first then.
> > I'm quite sure that in case 'v' is not on the stack but in memory like
> > in my case a SImode store is better than what we get from
> > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > which is prone to inserting store-data-races?
>
> The other way around -- if it is in memory, and was stored as vector
> recently, then reading back something shorter from it is prone to
> SHL/LHS problems.  There is nothing special about the stack here, except
> of course it is more likely to have been stored recently if on the
> stack.  So it depends how often it has been stored recently which option
> is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> often, the penalty is relatively bigger, so memory does not often win.
>
> I.e.: it needs to be measured.  Intuition is often wrong here.

But the current method would simply do a direct store to memory
without a preceding read of the whole vector.

Richard.

>
>
> Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-09 Thread Segher Boessenkool
Hi!

On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> Hmm, yeah - I guess that's what should be addressed first then.
> I'm quite sure that in case 'v' is not on the stack but in memory like
> in my case a SImode store is better than what we get from
> vec_insert - in fact vec_insert will likely introduce a RMW cycle
> which is prone to inserting store-data-races?

The other way around -- if it is in memory, and was stored as vector
recently, then reading back something shorter from it is prone to
SHL/LHS problems.  There is nothing special about the stack here, except
of course it is more likely to have been stored recently if on the
stack.  So it depends how often it has been stored recently which option
is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
often, the penalty is relatively bigger, so memory does not often win.

I.e.: it needs to be measured.  Intuition is often wrong here.


Segher


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-09 Thread Richard Biener via Gcc-patches
On Wed, Sep 9, 2020 at 3:47 AM luoxhu  wrote:
>
>
>
> On 2020/9/8 16:26, Richard Biener wrote:
> >> It seems not only pseudos; for example, with "v = vec_insert (i, v, n);"
> >> the vector variable will be stored to the stack first, and then [r112:DI]
> >> is a memory operand to be processed.  So the patch loads it from the stack
> >> (insn #10) into a temp vector register first, and stores it to the stack
> >> again (insn #24) after rs6000_vector_set_var.
> > Hmm, yeah - I guess that's what should be addressed first then.
> > I'm quite sure that in case 'v' is not on the stack but in memory like
> > in my case a SImode store is better than what we get from
> > vec_insert - in fact vec_insert will likely introduce a RMW cycle
> > which is prone to inserting store-data-races?
>
> Yes, for your case, there is no stack operation and to_rtx is expanded
> with BLKmode instead of V4SImode.  Adding a to_rtx mode check could work
> around it.  The asm doesn't show a store-hit-load issue.
>
> optimized:
>
> _1 = i_2(D) % 4;
> VIEW_CONVERT_EXPR<vector(4) int>(x.u)[_1] = a_4(D);
>
> expand:
> 2: r118:DI=%3:DI
> 3: r119:DI=%4:DI
> 4: NOTE_INSN_FUNCTION_BEG
> 7: r120:DI=unspec[`*.LANCHOR0',%2:DI] 47
>   REG_EQUAL `*.LANCHOR0'
> 8: r122:SI=r118:DI#0
> 9: {r124:SI=r122:SI/0x4;clobber ca:SI;}
>10: r125:SI=r124:SI<<0x2
>11: r123:SI=r122:SI-r125:SI
>   REG_EQUAL r122:SI%0x4
>12: r126:DI=sign_extend(r123:SI)
>13: r127:DI=r126:DI+0x4
>14: r128:DI=r127:DI<<0x2
>15: r129:DI=r120:DI+r128:DI
>16: [r129:DI]=r119:DI#0
>
>  p to_rtx
> $319 = (rtx_def *) (mem/c:BLK (reg/f:DI 120) [2 x+0 S32 A128])
>
> asm:
> addis 2,12,.TOC.-.LCF0@ha
> addi 2,2,.TOC.-.LCF0@l
> .localentry test,.-test
> srawi 9,3,2
> addze 9,9
> addis 10,2,.LANCHOR0@toc@ha
> addi 10,10,.LANCHOR0@toc@l
> slwi 9,9,2
> subf 9,9,3
> extsw 9,9
> addi 9,9,4
> sldi 9,9,2
> stwx 4,10,9
> blr
>
>
> >
> > So - what we need to "fix" is cfgexpand.c marking variably-indexed
> > decls as not to be expanded as registers (see
> > discover_nonconstant_array_refs).
> >
> > I guess one way forward would be to perform instruction
> > selection on GIMPLE here and transform
> >
> > VIEW_CONVERT_EXPR<vector(4) int>(D.3185)[_1] = i_6(D)
> >
> > to a (direct) internal function based on the vec_set optab.
>
> I don't quite understand what you mean here.  Do you mean:
> ALTIVEC_BUILTIN_VEC_INSERT -> VIEW_CONVERT_EXPR -> internal function -> vec_set

You're writing VIEW_CONVERT_EXPR here but the outermost component
is an ARRAY_REF.  But yes, this is what I meant.

> or ALTIVEC_BUILTIN_VEC_INSERT -> internal function -> vec_set?
> And in which pass is it acceptable to put the selection and transform?

Close to RTL expansion.  There's gimple-isel.cc which does instruction selection
for VEC_COND_EXPRs.

> Why call it *based on* the vec_set optab?  The VIEW_CONVERT_EXPR or internal
> function is expanded to the vec_set optab.

Based on because we have the convenient capability to represent optabs to be
used for RTL expansion as internal function calls on GIMPLE, called
"direct internal function".

> I guess you suggest adding an internal function for VIEW_CONVERT_EXPR in
> GIMPLE, and doing the transform from the internal function to the vec_set
> optab in the expander?

No, I suggest to "add" an internal function for the vec_set optab, see
DEF_INTERNAL_OPTAB_FN in internal-fn.def
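For reference, wiring the optab up as a direct internal function would be roughly a one-line entry in gcc/internal-fn.def (a sketch only; the exact flags and expander-group argument are assumptions):

```c
/* Sketch of a direct-optab internal function entry tied to vec_set;
   flags and the final argument are assumptions, not committed code.  */
DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
```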

> I doubt my understanding, as this looks really overcomplicated, since so far
> we transform from VIEW_CONVERT_EXPR to the vec_set optab directly...
> IIUC, an internal function doesn't seem to help much here, as Segher said before.

The advantage would be to circumvent GIMPLEs forcing of memory here.
But as I said here:

> > But then in GIMPLE D.3185 is also still memory (we don't have a variable
> > index partial register set operation - BIT_INSERT_EXPR is
> > currently specified to receive a constant bit position only).

it might not work out so easy.  Going down the rathole to avoid forcing
memory during RTL expansion for select cases (vector type bases
with a supported vector mode) might be something to try.

That at least would make the approach of dealing with this
in expand_assignment or siblings sensible.

> > At which point after your patch is the stack storage elided?
> >
>
> Stack storage is elided by register reload pass in RTL.
>
>
> Thanks,
> Xionghu


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-08 Thread luoxhu via Gcc-patches



On 2020/9/8 16:26, Richard Biener wrote:
>> It seems not only pseudos; for example, with "v = vec_insert (i, v, n);"
>> the vector variable will be stored to the stack first, and then [r112:DI]
>> is a memory operand to be processed.  So the patch loads it from the stack
>> (insn #10) into a temp vector register first, and stores it to the stack
>> again (insn #24) after rs6000_vector_set_var.
> Hmm, yeah - I guess that's what should be addressed first then.
> I'm quite sure that in case 'v' is not on the stack but in memory like
> in my case a SImode store is better than what we get from
> vec_insert - in fact vec_insert will likely introduce a RMW cycle
> which is prone to inserting store-data-races?

Yes, for your case, there is no stack operation and to_rtx is expanded
with BLKmode instead of V4SImode.  Adding a to_rtx mode check could work
around it.  The asm doesn't show a store-hit-load issue.

optimized:

_1 = i_2(D) % 4;
VIEW_CONVERT_EXPR<vector(4) int>(x.u)[_1] = a_4(D);

expand:
2: r118:DI=%3:DI
3: r119:DI=%4:DI
4: NOTE_INSN_FUNCTION_BEG
7: r120:DI=unspec[`*.LANCHOR0',%2:DI] 47
  REG_EQUAL `*.LANCHOR0'
8: r122:SI=r118:DI#0
9: {r124:SI=r122:SI/0x4;clobber ca:SI;}
   10: r125:SI=r124:SI<<0x2
   11: r123:SI=r122:SI-r125:SI
  REG_EQUAL r122:SI%0x4
   12: r126:DI=sign_extend(r123:SI)
   13: r127:DI=r126:DI+0x4
   14: r128:DI=r127:DI<<0x2
   15: r129:DI=r120:DI+r128:DI
   16: [r129:DI]=r119:DI#0

 p to_rtx
$319 = (rtx_def *) (mem/c:BLK (reg/f:DI 120) [2 x+0 S32 A128])

asm:
addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
.localentry test,.-test
srawi 9,3,2
addze 9,9
addis 10,2,.LANCHOR0@toc@ha
addi 10,10,.LANCHOR0@toc@l
slwi 9,9,2
subf 9,9,3
extsw 9,9
addi 9,9,4
sldi 9,9,2
stwx 4,10,9
blr


> 
> So - what we need to "fix" is cfgexpand.c marking variably-indexed
> decls as not to be expanded as registers (see
> discover_nonconstant_array_refs).
> 
> I guess one way forward would be to perform instruction
> selection on GIMPLE here and transform
> 
> VIEW_CONVERT_EXPR<vector(4) int>(D.3185)[_1] = i_6(D)
> 
> to a (direct) internal function based on the vec_set optab.  

I don't quite understand what you mean here.  Do you mean:
ALTIVEC_BUILTIN_VEC_INSERT -> VIEW_CONVERT_EXPR -> internal function -> vec_set
or ALTIVEC_BUILTIN_VEC_INSERT -> internal function -> vec_set?
And in which pass is it acceptable to put the selection and transform?
Why call it *based on* the vec_set optab?  The VIEW_CONVERT_EXPR or internal
function is expanded to the vec_set optab.

I guess you suggest adding an internal function for VIEW_CONVERT_EXPR in
GIMPLE, and doing the transform from the internal function to the vec_set
optab in the expander?
I doubt my understanding, as this looks really overcomplicated, since so far
we transform from VIEW_CONVERT_EXPR to the vec_set optab directly...
IIUC, an internal function doesn't seem to help much here, as Segher said before.


> But then in GIMPLE D.3185 is also still memory (we don't have a variable
> index partial register set operation - BIT_INSERT_EXPR is
> currently specified to receive a constant bit position only).
> 
> At which point after your patch is the stack storage elided?
> 

Stack storage is elided by register reload pass in RTL.


Thanks,
Xionghu


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-08 Thread Richard Biener via Gcc-patches
On Tue, Sep 8, 2020 at 10:11 AM luoxhu  wrote:
>
> Hi Richi,
>
> On 2020/9/7 19:57, Richard Biener wrote:
> > +  if (TREE_CODE (to) == ARRAY_REF)
> > +   {
> > + tree op0 = TREE_OPERAND (to, 0);
> > + if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
> > + && expand_view_convert_to_vec_set (to, from, to_rtx))
> > +   {
> > + pop_temp_slots ();
> > + return;
> > +   }
> > +   }
> >
> > you're placing this at an awkward spot IMHO, after to_rtx expansion
> > but disregarding parts of it and compensating just with 'to' matching.
> > Are the pieces (offset, bitpos) really too awkward to work with for
> > matching?
> >
> > Because as written you'll miscompile
> >
> > struct X { _vector signed int v; _vector signed int u; } x;
> >
> > test(int i, int a)
> > {
> >x.u[i] = a;
> > }
> >
> > as I think you'll end up assigning to x.v.
>
> Thanks for pointing this out; this case will be a problem for the patch.
> I checked optimize_bitfield_assignment_op; it returns very early,
> as mode1 is not VOIDmode, and this is actually not a "FIELD op= VAL"
> operation?
>
> To be honest, I am not quite familiar with this part of the code.  I put the
> new function expand_view_convert_to_vec_set just after the to_rtx expansion
> because adjust_address will change the V4SImode memory to SImode memory, but
> I need to keep the target to_rtx in V4SImode to save the vector after calling
> rs6000_vector_set_var, so it seems paradoxical here?
>
>  p to_rtx
> $264 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S16 A128])
>
> => to_rtx = adjust_address (to_rtx, mode1, 0);
>
> p to_rtx
> $265 = (rtx_def *) (mem/c:SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S4 A128])
>
>
> >
> > Are we just interested in the case were we store to a
> > pseudo or also when the destination is memory?  I guess
> > only when it's a pseudo - correct?  In that case
> > handling this all in optimize_bitfield_assignment_op
> > is probably the best thing to try.
> >
> > Note we possibly refrain from assigning a pseudo to
> > such vector because we see a variable array-ref to it.
>
> It seems not only pseudos; for example, with "v = vec_insert (i, v, n);"
> the vector variable will be stored to the stack first, and then [r112:DI]
> is a memory operand to be processed.  So the patch loads it from the stack
> (insn #10) into a temp vector register first, and stores it to the stack
> again (insn #24) after rs6000_vector_set_var.

Hmm, yeah - I guess that's what should be addressed first then.
I'm quite sure that in case 'v' is not on the stack but in memory like
in my case a SImode store is better than what we get from
vec_insert - in fact vec_insert will likely introduce a RMW cycle
which is prone to inserting store-data-races?

So - what we need to "fix" is cfgexpand.c marking variably-indexed
decls as not to be expanded as registers (see
discover_nonconstant_array_refs).

I guess one way forward would be to perform instruction
selection on GIMPLE here and transform

VIEW_CONVERT_EXPR<vector(4) int>(D.3185)[_1] = i_6(D)

to a (direct) internal function based on the vec_set optab.  But then
in GIMPLE D.3185 is also still memory (we don't have a variable
index partial register set operation - BIT_INSERT_EXPR is
currently specified to receive a constant bit position only).

At which point after your patch is the stack storage elided?

>
> optimized:
>
> D.3185 = v_3(D);
> _1 = n_5(D) & 3;
> VIEW_CONVERT_EXPR<vector(4) int>(D.3185)[_1] = i_6(D);
> v_8 = D.3185;
> return v_8;
>
> => expand without the patch:
>
> 2: r119:V4SI=%2:V4SI
> 3: r120:DI=%5:DI
> 4: r121:DI=%6:DI
> 5: NOTE_INSN_FUNCTION_BEG
> 8: [r112:DI]=r119:V4SI
>
> 9: r122:DI=r121:DI&0x3
>10: r123:DI=r122:DI<<0x2
>11: r124:DI=r112:DI+r123:DI
>12: [r124:DI]=r120:DI#0
>
>13: r126:V4SI=[r112:DI]
>14: r118:V4SI=r126:V4SI
>18: %2:V4SI=r118:V4SI
>19: use %2:V4SI
>
> => expand with the patch (replace #9~#12 to #10~#24):
>
> 2: r119:V4SI=%2:V4SI
> 3: r120:DI=%5:DI
> 4: r121:DI=%6:DI
> 5: NOTE_INSN_FUNCTION_BEG
> 8: [r112:DI]=r119:V4SI
> 9: r122:DI=r121:DI&0x3
>
>10: r123:V4SI=[r112:DI] // load from stack
>11: {r125:SI=0x3-r122:DI#0;clobber ca:SI;}
>12: r125:SI=r125:SI<<0x2
>13: {r125:SI=0x14-r125:SI;clobber ca:SI;}
>14: r128:DI=unspec[`*.LC0',%2:DI] 47
>   REG_EQUAL `*.LC0'
>15: r127:V2DI=[r128:DI]
>   REG_EQUAL const_vector
>16: r126:V16QI=r127:V2DI#0
>17: r129:V16QI=unspec[r120:DI#0] 61
>18: r130:V16QI=unspec[r125:SI] 151
>19: r131:V16QI=unspec[r129:V16QI,r129:V16QI,r130:V16QI] 232
>20: r132:V16QI=unspec[r126:V16QI,r126:V16QI,r130:V16QI] 232
>21: r124:V16QI=r123:V4SI#0
>22: r124:V16QI={(r132:V16QI!=const_vector)?r131:V16QI:r124:V16QI}
>23: r123:V4SI=r124:V16QI#0
>24: [r112:DI]=r123:V4SI   // store to stack.
>
>25: r134:V4SI=[r112:DI]
>26: r118:V4SI=r134:V4SI
>30: %2:V4SI=r118:V4SI
>31: use %2:V4SI
>
>
> Thanks,
> Xionghu


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-08 Thread luoxhu via Gcc-patches
Hi Richi,

On 2020/9/7 19:57, Richard Biener wrote:
> +  if (TREE_CODE (to) == ARRAY_REF)
> +   {
> + tree op0 = TREE_OPERAND (to, 0);
> + if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
> + && expand_view_convert_to_vec_set (to, from, to_rtx))
> +   {
> + pop_temp_slots ();
> + return;
> +   }
> +   }
> 
> you're placing this at an awkward spot IMHO, after to_rtx expansion
> but disregarding parts of it and compensating just with 'to' matching.
> Are the pieces (offset, bitpos) really too awkward to work with for
> matching?
> 
> Because as written you'll miscompile
> 
> struct X { _vector signed int v; _vector signed int u; } x;
> 
> test(int i, int a)
> {
>x.u[i] = a;
> }
> 
> as I think you'll end up assigning to x.v.

Thanks for pointing this out; this case will be a problem for the patch.
I checked optimize_bitfield_assignment_op; it returns very early,
as mode1 is not VOIDmode, and this is actually not a "FIELD op= VAL"
operation?

To be honest, I am not quite familiar with this part of the code.  I put the
new function expand_view_convert_to_vec_set just after the to_rtx expansion
because adjust_address will change the V4SImode memory to SImode memory, but
I need to keep the target to_rtx in V4SImode to save the vector after calling
rs6000_vector_set_var, so it seems paradoxical here?

 p to_rtx
$264 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S16 A128])

=> to_rtx = adjust_address (to_rtx, mode1, 0);

p to_rtx
$265 = (rtx_def *) (mem/c:SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S4 A128])


> 
> Are we just interested in the case were we store to a
> pseudo or also when the destination is memory?  I guess
> only when it's a pseudo - correct?  In that case
> handling this all in optimize_bitfield_assignment_op
> is probably the best thing to try.
> 
> Note we possibly refrain from assigning a pseudo to
> such vector because we see a variable array-ref to it.

It seems not only pseudos; for example, with "v = vec_insert (i, v, n);"
the vector variable will be stored to the stack first, and then [r112:DI]
is a memory operand to be processed.  So the patch loads it from the stack
(insn #10) into a temp vector register first, and stores it to the stack
again (insn #24) after rs6000_vector_set_var.

optimized:

D.3185 = v_3(D);
_1 = n_5(D) & 3;
VIEW_CONVERT_EXPR<vector(4) int>(D.3185)[_1] = i_6(D);
v_8 = D.3185;
return v_8;

=> expand without the patch:

2: r119:V4SI=%2:V4SI
3: r120:DI=%5:DI
4: r121:DI=%6:DI
5: NOTE_INSN_FUNCTION_BEG
8: [r112:DI]=r119:V4SI

9: r122:DI=r121:DI&0x3
   10: r123:DI=r122:DI<<0x2
   11: r124:DI=r112:DI+r123:DI
   12: [r124:DI]=r120:DI#0

   13: r126:V4SI=[r112:DI]
   14: r118:V4SI=r126:V4SI
   18: %2:V4SI=r118:V4SI
   19: use %2:V4SI

=> expand with the patch (replace #9~#12 to #10~#24):

2: r119:V4SI=%2:V4SI
3: r120:DI=%5:DI
4: r121:DI=%6:DI
5: NOTE_INSN_FUNCTION_BEG
8: [r112:DI]=r119:V4SI
9: r122:DI=r121:DI&0x3

   10: r123:V4SI=[r112:DI] // load from stack
   11: {r125:SI=0x3-r122:DI#0;clobber ca:SI;}
   12: r125:SI=r125:SI<<0x2
   13: {r125:SI=0x14-r125:SI;clobber ca:SI;}
   14: r128:DI=unspec[`*.LC0',%2:DI] 47
  REG_EQUAL `*.LC0'
   15: r127:V2DI=[r128:DI]
  REG_EQUAL const_vector
   16: r126:V16QI=r127:V2DI#0
   17: r129:V16QI=unspec[r120:DI#0] 61
   18: r130:V16QI=unspec[r125:SI] 151
   19: r131:V16QI=unspec[r129:V16QI,r129:V16QI,r130:V16QI] 232
   20: r132:V16QI=unspec[r126:V16QI,r126:V16QI,r130:V16QI] 232
   21: r124:V16QI=r123:V4SI#0
   22: r124:V16QI={(r132:V16QI!=const_vector)?r131:V16QI:r124:V16QI}
   23: r123:V4SI=r124:V16QI#0
   24: [r112:DI]=r123:V4SI   // store to stack.

   25: r134:V4SI=[r112:DI]
   26: r118:V4SI=r134:V4SI
   30: %2:V4SI=r118:V4SI
   31: use %2:V4SI


Thanks,
Xionghu


Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-07 Thread Richard Biener via Gcc-patches
On Mon, Sep 7, 2020 at 7:44 AM luoxhu  wrote:
>
> Hi,
>
> On 2020/9/4 18:23, Segher Boessenkool wrote:
> >> diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
> >> index 03b00738a5e..00c65311f76 100644
> >> --- a/gcc/config/rs6000/rs6000-c.c
> >> +++ b/gcc/config/rs6000/rs6000-c.c
> >> /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. 
> >> */
> >> @@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, 
> >> tree fndecl,
> >>SET_EXPR_LOCATION (stmt, loc);
> >>stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
> >>  }
> >> -
> >> -  innerptrtype = build_pointer_type (arg1_inner_type);
> >> -
> >> -  stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
> >> -  stmt = convert (innerptrtype, stmt);
> >> -  stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
> >> -  stmt = build_indirect_ref (loc, stmt, RO_NULL);
> >> -  stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
> >> -convert (TREE_TYPE (stmt), arg0));
> >> +  stmt = build_array_ref (loc, stmt, arg2);
> >> +  stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
> >> stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
> >> return stmt;
> >>   }
> > You should make a copy of the vector, not modify the original one in
> > place?  (If I read that correctly, heh.)  Looks good otherwise.
> >
>
> Segher, there is already existing code that makes a copy of the vector, as
> we discussed offline.  Thanks for the reminder.
>
> cat pr79251.c.006t.gimple
> __attribute__((noinline))
> test (__vector signed int v, int i, size_t n)
> {
>   __vector signed int D.3192;
>   __vector signed int D.3194;
>   __vector signed int v1;
>   v1 = v;
>   D.3192 = v1;
>   _1 = n & 3;
>   VIEW_CONVERT_EXPR(D.3192)[_1] = i;
>   v1 = D.3192;
>   D.3194 = v1;
>   return D.3194;
> }
>
> Attached is the v2 patch, which does:
> 1) Build a VIEW_CONVERT_EXPR for vec_insert (i, v, n), like v[n%4] = i, to
> unify the gimple code, so the expander can use vec_set_optab to expand it.
> 2) Recognize the pattern in the expander and use optabs to expand the
> VIEW_CONVERT_EXPR to vec_insert with a variable index to the fast
> instruction sequence lvsl+xxperm+xxsel.

Looking at the RTL expander side I see several issues.


@@ -5237,6 +5288,16 @@ expand_assignment (tree to, tree from, bool nontemporal)

   to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE);

+  if (TREE_CODE (to) == ARRAY_REF)
+   {
+ tree op0 = TREE_OPERAND (to, 0);
+ if (TREE_CODE (op0) == VIEW_CONVERT_EXPR
+ && expand_view_convert_to_vec_set (to, from, to_rtx))
+   {
+ pop_temp_slots ();
+ return;
+   }
+   }

you're placing this at an awkward spot IMHO, after to_rtx expansion
but disregarding parts of it and compensating just with 'to' matching.
Are the pieces (offset, bitpos) really too awkward to work with for
matching?

Because as written you'll miscompile

struct X { __vector signed int v; __vector signed int u; } x;

test(int i, int a)
{
  x.u[i] = a;
}

as I think you'll end up assigning to x.v.
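A plain scalar model of that miscompile risk (hypothetical names, not actual
GCC internals code): the byte offset of the component being stored to must be
folded into the address before the element index is applied, otherwise the
store lands in the first member:

```c
#include <stddef.h>
#include <string.h>

struct X { int v[4]; int u[4]; };

/* Correct model: base address = struct base + offsetof of the component,
   then apply the element index.  Dropping the offsetof term is the failure
   mode described above: the store would land in x->v instead of x->u.  */
static void set_u (struct X *x, unsigned i, int a)
{
  char *base = (char *) x + offsetof (struct X, u);
  memcpy (base + (i & 3) * sizeof (int), &a, sizeof (int));
}
```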

Are we just interested in the case where we store to a
pseudo or also when the destination is memory?  I guess
only when it's a pseudo - correct?  In that case
handling this all in optimize_bitfield_assignment_op
is probably the best thing to try.

Note we possibly refrain from assigning a pseudo to
such vector because we see a variable array-ref to it.

Richard.

> Thanks,
> Xionghu


[PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]

2020-09-06 Thread luoxhu via Gcc-patches

Hi,

On 2020/9/4 18:23, Segher Boessenkool wrote:

diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
index 03b00738a5e..00c65311f76 100644
--- a/gcc/config/rs6000/rs6000-c.c
+++ b/gcc/config/rs6000/rs6000-c.c
/* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. */
@@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree 
fndecl,
   SET_EXPR_LOCATION (stmt, loc);
   stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
 }
-
-  innerptrtype = build_pointer_type (arg1_inner_type);
-
-  stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
-  stmt = convert (innerptrtype, stmt);
-  stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
-  stmt = build_indirect_ref (loc, stmt, RO_NULL);
-  stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
-convert (TREE_TYPE (stmt), arg0));
+  stmt = build_array_ref (loc, stmt, arg2);
+  stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
return stmt;
  }

You should make a copy of the vector, not modify the original one in
place?  (If I read that correctly, heh.)  Looks good otherwise.



Segher, there is already existing code that makes a copy of the vector, as we
discussed offline.  Thanks for the reminder.

cat pr79251.c.006t.gimple
__attribute__((noinline))
test (__vector signed int v, int i, size_t n)
{
 __vector signed int D.3192;
 __vector signed int D.3194;
 __vector signed int v1;
 v1 = v;
 D.3192 = v1;
 _1 = n & 3;
 VIEW_CONVERT_EXPR(D.3192)[_1] = i;
 v1 = D.3192;
 D.3194 = v1;
 return D.3194;
}

Attached is the v2 patch, which does:
1) Build a VIEW_CONVERT_EXPR for vec_insert (i, v, n), like v[n%4] = i, to
unify the gimple code, so the expander can use vec_set_optab to expand it.
2) Recognize the pattern in the expander and use optabs to expand the
VIEW_CONVERT_EXPR to vec_insert with a variable index to the fast
instruction sequence lvsl+xxperm+xxsel.
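For illustration, the source-level form being unified can be written with
GCC's generic vector extension (a sketch using a hypothetical v4si typedef,
not the PowerPC builtin itself): with a variable index, the element store
goes through an array reference into the vector object, which is the pattern
the expander can now map to vec_set.

```c
/* Scalar-source model of "v[n % 4] = i" / vec_insert (i, v, n), using
   GCC's generic vector extension.  The "& 3" form of the index wrap is
   used since it is also correct when n is signed.  This only models the
   semantics; the codegen discussed in the thread applies to the real
   vec_set expansion on rs6000.  */
typedef int v4si __attribute__ ((vector_size (16)));

v4si set_elem (v4si v, int i, unsigned n)
{
  ((int *) &v)[n & 3] = i;   /* variable-index store into the vector */
  return v;
}
```

Without a variable-index vec_set pattern, a store like this is expanded
through the stack; with one, the backend can keep the whole operation in
vector registers.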


Thanks,
Xionghu
From 9a7a47b086579e26ae13f378732cea067b64a4e6 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo 
Date: Wed, 19 Aug 2020 03:54:17 -0500
Subject: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple
 [PR79251]

vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
to be inserted, and arg2 is the position at which to insert arg1 into arg0.
The current expander generates stxv+stwx+lxv if arg2 is a variable instead of
a constant, which causes a serious store-hit-load performance issue on Power.
This patch tries to:
 1) Build a VIEW_CONVERT_EXPR for vec_insert (i, v, n), like v[n%4] = i, to
unify the gimple code, so the expander can use vec_set_optab to expand it.
 2) Recognize the pattern in the expander and use optabs to expand the
VIEW_CONVERT_EXPR to vec_insert with a variable index to the fast instruction
sequence lvsl+xxperm+xxsel.
This way, "vec_insert (i, v, n)" and "v[n%4] = i" are not expanded too early
in the gimple stage when arg2 is a variable, avoiding the store-hit-load
instructions.

For Power9 V4SI:
addi 9,1,-16
rldic 6,6,2,60
stxv 34,-16(1)
stwx 5,9,6
lxv 34,-16(1)
=>
addis 9,2,.LC0@toc@ha
addi 9,9,.LC0@toc@l
mtvsrwz 33,5
lxv 32,0(9)
sradi 9,6,2
addze 9,9
sldi 9,9,2
subf 9,9,6
subfic 9,9,3
sldi 9,9,2
subfic 9,9,20
lvsl 13,0,9
xxperm 33,33,45
xxperm 32,32,45
xxsel 34,34,33,32
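As a rough scalar model of what the new sequence computes (an illustration
only: lvsl builds a permute control from the index, the two xxperm rotate a
splat of the new element and an all-ones mask into the target lane, and xxsel
merges them with the original vector; the lane arithmetic below assumes
4-byte elements and ignores endianness details):

```c
#include <string.h>

/* Model of the register-only insert: build a byte mask that is all-ones
   over the target 4-byte lane, splat the new value across all lanes,
   then select per byte (the xxsel step: mask ? ins : vec).  */
static void merge_lane (unsigned char vec[16], int val, unsigned idx)
{
  unsigned char mask[16] = {0};
  unsigned char ins[16];

  memset (mask + (idx & 3) * 4, 0xff, 4);   /* ones over the target lane */
  for (int k = 0; k < 4; k++)               /* splat val across all lanes */
    memcpy (ins + 4 * k, &val, sizeof val);
  for (int k = 0; k < 16; k++)              /* xxsel-style byte select */
    vec[k] = mask[k] ? ins[k] : vec[k];
}
```

The point of the trade described below is that all of this stays in vector
registers, avoiding the store-hit-load through the stack.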

Though the instruction count increases from 5 to 15, performance improves by
60% in typical cases.
Tested with V2DI, V2DF, V4SI, V4SF, V8HI and V16QI on Power9-LE and
Power8-BE; bootstrap tests pass.

gcc/ChangeLog:

2020-09-07  Xionghu Luo  

* config/rs6000/altivec.md (altivec_lvsl_reg): Rename to
(altivec_lvsl_reg_<mode>2) and extend to SDI mode.
* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
Adjust variable-index vec_insert to VIEW_CONVERT_EXPR.
* config/rs6000/rs6000-protos.h (rs6000_expand_vector_set_var):
New declare.
* config/rs6000/rs6000.c (rs6000_expand_vector_set_var):
New function.
* config/rs6000/rs6000.md (FQHS): New mode iterator.
(FD): New mode iterator.
(p8_mtvsrwz_v16qi2): New define_insn.
(p8_mtvsrd_v16qi2): New define_insn.
* config/rs6000/vector.md: Add register operand2 match for
vec_set index.
* config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.
* expr.c (expand_view_convert_to_vec_set): New function.
(expand_assignment): Call expand_view_convert_to_vec_set.

gcc/testsuite/ChangeLog:

2020-09-07  Xionghu Luo  

* gcc.target/powerpc/pr79251.c: New test.
* gcc.target/powerpc/pr79251-run.c: New test.
* gcc.target/powerpc/pr79251.h: New header.
---
 gcc/config/rs6000/altivec.md  |   4 +-
 gcc/config/rs6000/rs6000-c.c  |  22 ++-
 gcc/config/rs6000/rs6000-prot