Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Wed, Sep 16, 2020 at 10:31:54AM +0200, Richard Biener wrote:
> On Tue, Sep 15, 2020 at 6:18 PM Segher Boessenkool wrote:
> >
> > On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> > > On Tue, Sep 15, 2020 at 5:56 AM luoxhu wrote:
> > > > > u[n % 4] = i;
> > > > >
> > > > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
> >
> > (As an aside -- please use "& 3" instead: that works fine if n is signed
> > as well, but modulo doesn't.  Maybe that is in the patch already, I
> > didn't check, sorry.)
> >
> > > note this is why I asked about the actual CPU instruction - as I read
> > > Seghers mail the instruction modifies a vector register, not memory.
> >
> > But note that the builtin is not the same as the machine instruction --
> > here there shouldn't be a difference if compiling for a new enough ISA,
> > but the builtin is available on anything with at least AltiVec.
>
> Well, given we're trying to improve instruction selection that very much
> should depend on the ISA.  Thus if the target says it cannot vec_set
> to a register in a variable position then we don't want to pretend it can.

Of course :-)  We shouldn't look at what the machine insn can do, but
also not at what the source level constructs can; we should just ask the
target code itself.


Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Wed, Sep 16, 2020 at 8:15 AM luoxhu wrote:
>
> On 2020/9/15 14:51, Richard Biener wrote:
>
> >> I only see VAR_DECL and PARM_DECL, is there any function to check the tree
> >> variable is global?  I added DECL_REGISTER, but the RTL still expands to
> >> stack:
> >
> > is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
> > IFN_SET only makes sense if there's the chance we can promote the
> > variable to a register.  But it would be an incorrect transform (it
> > stores the whole vector) if the vector storage could "escape" to
> > another thread - which means you probably have to check
> > !TREE_ADDRESSABLE as well.
>
> The tree of param "u" will be marked ADDRESSABLE when generating
> "VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;", if check
> !TREE_ADDRESSABLE, no IFN_SET will be produced in gimple-isel.

The TREE_ADDRESSABLE bit should be cleared later on GIMPLE during
execute_update_address_taken.  If it is not we should fix that.

> #1  0x1066c700 in convert_vector_to_array_for_subscript (loc=5307072, vecp=0x7fffc5d0,
>     index=) at ../../gcc/gcc/c-family/c-common.c:8169
> #2  0x10553b54 in build_array_ref (loc=5307072, array=<parm_decl 0x75ad0100 u>,
>     index=<trunc_mod_expr 0x759c73a0>) at ../../gcc/gcc/c/c-typeck.c:2668
> #3  0x105c8824 in c_parser_postfix_expression_after_primary (parser=0x77f703f0,
>     expr_loc=5307040, expr=...) at ../../gcc/gcc/c/c-parser.c:10494
> #4  0x105c7570 in c_parser_postfix_expression (parser=0x77f703f0)
>     at ../../gcc/gcc/c/c-parser.c:10216
>
> >>
> >> My current implementation does:
> >>
> >> 1) v = vec_insert (i, u, n);
> >>
> >> =>gimple:
> >> {
> >>   register __vector signed int D.3190;
> >>   D.3190 = u;   // *new decl and copy u first.*
> >>   _1 = n & 3;
> >>   VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
> >>   _2 = D.3190;
> >>   ...
> >> }
> >>
> >> =>isel:
> >> {
> >>   register __vector signed int D.3190;
> >>   D.3190 = u_4(D);
> >>   _1 = n_6(D) & 3;
> >>   .VEC_SET (&D.3190, i_7(D), _1);
> >
> > why are you passing the address of D.3190 to .VEC_SET?  That will not
> > make D.3190 be expanded to a pseudo.  You really need to have GIMPLE
> > registers here (SSA name) and thus a return value, leaving the argument
> > unmodified
> >
> >   D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);
> >
> > note this is why I asked about the actual CPU instruction - as I read
> > Seghers mail the instruction modifies a vector register, not memory.
>
> Updated the code and got expected gimple-isel output and ASM for both 2 cases:
>
> pr79251.c.236t.isel:
>
> __attribute__((noinline))
> test (__vector signed int u, int i, size_t n)
> {
>   long unsigned int _1;
>   __vector signed int _6;
>   vector(4) int _7;
>   vector(4) int vec_set_dst_8;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = n_2(D) & 3;
>   _7 = u;
>   vec_set_dst_8 = .VEC_SET (_7, i_4(D), _1);
>   u = vec_set_dst_8;
>   _6 = u;
>   return _6;
>
> }
>
> But tree variable "u" need to be set to "TREE_ADDRESSABLE (view_op0) = 0;"
> (Maybe check IFN VEC_SET and set to 0 in discover_nonconstant_array_refs
> is better later.) after generating the IFN VEC_SET, otherwise, "u" will still
> be expanded to stack since expander:
> "Replacing Expressions: _7 replace with --> _7 = u;".
>
> Setting "u" to non-addressable also seems really unreasonable as for below
> case, as u[n+1] need be ADDRESSABLE:

Why should u be addressable for u[n+1]?

> __attribute__ ((noinline)) vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   u[n % 4] = i;
>   u[n+1] = i+1;
>   return u;
> }
>
> => gimple-isel with both VEC_SET and VIEW_CONVERT_EXPR:
>
> test (__vector signed int u, int i, size_t n)
> {
>   long unsigned int _1;
>   long unsigned int _2;
>   int _3;
>   __vector signed int _9;
>   vector(4) int _10;
>   vector(4) int vec_set_dst_11;
>
>   <bb 2> [local count: 1073741824]:
>   _1 = n_4(D) & 3;
>   _10 = u;
>   vec_set_dst_11 = .VEC_SET (_10, i_6(D), _1);
>   u = vec_set_dst_11;
>   _2 = n_4(D) + 1;
>   _3 = i_6(D) + 1;
>   VIEW_CONVERT_EXPR<vector(4) int>(u)[_2] = _3;

why is .VEC_SET not used here?

>   _9 = u;
>   return _9;
>
> }
>
> Below code are generating the IFN call, create_tmp_reg or create_tmp_reg_ssa_name
> seems not create a variable that will be allocated on register?

Yes, they create decls, use make_ssa_name (TREE_TYPE (view_op0)) to
create a SSA name of the desired type.

> diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
> index b330cf4c20e..a699022cd09 100644
> --- a/gcc/gimple-isel.cc
> +++ b/gcc/gimple-isel.cc
> ...
> +      if (!is_global_var (view_op0)
> +          && TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
> +          && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
> +          && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
> +          && determine_value_range (pos, &minv, &maxv) == VR_RANGE
> +          && wi::geu_p (minv, 0)
> +          &&
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Tue, Sep 15, 2020 at 6:18 PM Segher Boessenkool wrote:
>
> On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> > On Tue, Sep 15, 2020 at 5:56 AM luoxhu wrote:
> > > > u[n % 4] = i;
> > > >
> > > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
>
> (As an aside -- please use "& 3" instead: that works fine if n is signed
> as well, but modulo doesn't.  Maybe that is in the patch already, I
> didn't check, sorry.)
>
> > note this is why I asked about the actual CPU instruction - as I read
> > Seghers mail the instruction modifies a vector register, not memory.
>
> But note that the builtin is not the same as the machine instruction --
> here there shouldn't be a difference if compiling for a new enough ISA,
> but the builtin is available on anything with at least AltiVec.

Well, given we're trying to improve instruction selection that very much
should depend on the ISA.  Thus if the target says it cannot vec_set
to a register in a variable position then we don't want to pretend it can.

Richard.

>
> Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On 2020/9/15 14:51, Richard Biener wrote:
>> I only see VAR_DECL and PARM_DECL, is there any function to check the tree
>> variable is global?  I added DECL_REGISTER, but the RTL still expands to
>> stack:
>
> is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
> IFN_SET only makes sense if there's the chance we can promote the
> variable to a register.  But it would be an incorrect transform (it
> stores the whole vector) if the vector storage could "escape" to
> another thread - which means you probably have to check
> !TREE_ADDRESSABLE as well.
>

The tree of param "u" will be marked ADDRESSABLE when generating
"VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;", if check
!TREE_ADDRESSABLE, no IFN_SET will be produced in gimple-isel.

#1  0x1066c700 in convert_vector_to_array_for_subscript (loc=5307072, vecp=0x7fffc5d0,
    index=) at ../../gcc/gcc/c-family/c-common.c:8169
#2  0x10553b54 in build_array_ref (loc=5307072, array=<parm_decl 0x75ad0100 u>,
    index=<trunc_mod_expr 0x759c73a0>) at ../../gcc/gcc/c/c-typeck.c:2668
#3  0x105c8824 in c_parser_postfix_expression_after_primary (parser=0x77f703f0,
    expr_loc=5307040, expr=...) at ../../gcc/gcc/c/c-parser.c:10494
#4  0x105c7570 in c_parser_postfix_expression (parser=0x77f703f0)
    at ../../gcc/gcc/c/c-parser.c:10216

>>
>> My current implementation does:
>>
>> 1) v = vec_insert (i, u, n);
>>
>> =>gimple:
>> {
>>   register __vector signed int D.3190;
>>   D.3190 = u;   // *new decl and copy u first.*
>>   _1 = n & 3;
>>   VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
>>   _2 = D.3190;
>>   ...
>> }
>>
>> =>isel:
>> {
>>   register __vector signed int D.3190;
>>   D.3190 = u_4(D);
>>   _1 = n_6(D) & 3;
>>   .VEC_SET (&D.3190, i_7(D), _1);
>
> why are you passing the address of D.3190 to .VEC_SET?  That will not
> make D.3190 be expanded to a pseudo.  You really need to have GIMPLE
> registers here (SSA name) and thus a return value, leaving the argument
> unmodified
>
>   D.3190_3 = .VEC_SET (D.3190_4, i_7(D), _1);
>
> note this is why I asked about the actual CPU instruction - as I read
> Seghers mail the instruction modifies a vector register, not memory.
>

Updated the code and got expected gimple-isel output and ASM for both 2 cases:

pr79251.c.236t.isel:

__attribute__((noinline))
test (__vector signed int u, int i, size_t n)
{
  long unsigned int _1;
  __vector signed int _6;
  vector(4) int _7;
  vector(4) int vec_set_dst_8;

  <bb 2> [local count: 1073741824]:
  _1 = n_2(D) & 3;
  _7 = u;
  vec_set_dst_8 = .VEC_SET (_7, i_4(D), _1);
  u = vec_set_dst_8;
  _6 = u;
  return _6;

}

But tree variable "u" need to be set to "TREE_ADDRESSABLE (view_op0) = 0;"
(Maybe check IFN VEC_SET and set to 0 in discover_nonconstant_array_refs
is better later.) after generating the IFN VEC_SET, otherwise, "u" will still
be expanded to stack since expander:
"Replacing Expressions: _7 replace with --> _7 = u;".

Setting "u" to non-addressable also seems really unreasonable as for below
case, as u[n+1] need be ADDRESSABLE:

__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
  u[n % 4] = i;
  u[n+1] = i+1;
  return u;
}

=> gimple-isel with both VEC_SET and VIEW_CONVERT_EXPR:

test (__vector signed int u, int i, size_t n)
{
  long unsigned int _1;
  long unsigned int _2;
  int _3;
  __vector signed int _9;
  vector(4) int _10;
  vector(4) int vec_set_dst_11;

  <bb 2> [local count: 1073741824]:
  _1 = n_4(D) & 3;
  _10 = u;
  vec_set_dst_11 = .VEC_SET (_10, i_6(D), _1);
  u = vec_set_dst_11;
  _2 = n_4(D) + 1;
  _3 = i_6(D) + 1;
  VIEW_CONVERT_EXPR<vector(4) int>(u)[_2] = _3;
  _9 = u;
  return _9;

}

Below code are generating the IFN call, create_tmp_reg or create_tmp_reg_ssa_name
seems not create a variable that will be allocated on register?

diff --git a/gcc/gimple-isel.cc b/gcc/gimple-isel.cc
index b330cf4c20e..a699022cd09 100644
--- a/gcc/gimple-isel.cc
+++ b/gcc/gimple-isel.cc
...
+      if (!is_global_var (view_op0)
+          && TREE_CODE (TREE_TYPE (view_op0)) == VECTOR_TYPE
+          && tree_fits_uhwi_p (TYPE_SIZE (TREE_TYPE (view_op0)))
+          && tree_to_uhwi (TYPE_SIZE (TREE_TYPE (view_op0))) == 128
+          && determine_value_range (pos, &minv, &maxv) == VR_RANGE
+          && wi::geu_p (minv, 0)
+          && wi::leu_p (maxv, (128 / GET_MODE_BITSIZE (innermode
+        {
+          location_t loc = gimple_location (stmt);
+          tree var_src = create_tmp_reg (TREE_TYPE (view_op0));
+          tree var_dst
+            = make_temp_ssa_name (TREE_TYPE (view_op0), NULL, "vec_set_dst");
+          TREE_ADDRESSABLE (view_op0) = 0;
+
+          ass_stmt = gimple_build_assign (var_src, view_op0);
+          gimple_set_location (ass_stmt, loc);
+          gsi_insert_before (gsi, ass_stmt, GSI_SAME_STMT);
+
+          new_stmt
+            = gimple_build_call_internal (IFN_VEC_SET, 3, var_src, val, pos);
+
+          gimple_call_set_lhs (new_stmt, var_dst);
+
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Tue, Sep 15, 2020 at 08:51:09AM +0200, Richard Biener wrote:
> On Tue, Sep 15, 2020 at 5:56 AM luoxhu wrote:
> > > u[n % 4] = i;
> > >
> > > I guess.  Is the % 4 mandated by the vec_insert semantics btw?

(As an aside -- please use "& 3" instead: that works fine if n is signed
as well, but modulo doesn't.  Maybe that is in the patch already, I
didn't check, sorry.)

> note this is why I asked about the actual CPU instruction - as I read
> Seghers mail the instruction modifies a vector register, not memory.

But note that the builtin is not the same as the machine instruction --
here there shouldn't be a difference if compiling for a new enough ISA,
but the builtin is available on anything with at least AltiVec.


Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Tue, Sep 15, 2020 at 5:56 AM luoxhu wrote:
>
> On 2020/9/14 17:47, Richard Biener wrote:
> > On Mon, Sep 14, 2020 at 10:05 AM luoxhu wrote:
>
> >> Not sure whether this reflects the issues you discussed above.
> >> I constructed below test cases and tested with and without this patch,
> >> only if "a+c"(which means store only), the performance is getting bad with
> >> this patch due to extra load/store(about 50% slower).
> >> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this
> >> patch will always get big benefit.
> >> But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
> >> inside the function or out of the function soon.  Does this mean unify the two
> >> usage to same gimple code not a good idea sometimes?  Or is it possible to
> >> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
> >> used not far away inside or out side of the function?
> >>
> >> #define TYPE int
> >>
> >> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> >> vector TYPE s = {4, 3, 2, 1};
> >>
> >> __attribute__ ((noinline))
> >> vector TYPE
> >> test (vector TYPE u, TYPE i, size_t n)
> >> {
> >>   v[n % 4] = i;
> >
> > ^^^
> >
> > this should be
> >
> >    u[n % 4] = i;
> >
> > I guess.  Is the % 4 mandated by the vec_insert semantics btw?
>
> Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert (i, u, n);"
> is a bit different with "u[n % 4] = i;" since it returns a copy of u instead of modify
> u.  The adjust in lower __builtin_vec_insert_xxx gimple will make a copy of u first to
> meet the requirements.
>
> > If you tested with the above error you probably need to re-measure.
>
> No I did test for u as local instead of global before.  If u is not used very
> soon, the performance is almost the same for generating single store or IFN_SET
> of inserting with variable index.
>
> source:
> __attribute__ ((noinline)) vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   u[n % 4] = i;
>   vector TYPE r = {0};
>   for (long k = 0; k < 100; k++)
>     {
>       r += v;
>     }
>   return u+r;
> }
>
> => store hit load is relieved due to long distance.
>
> ASM:
> 0:      addis 2,12,.TOC.-.LCF0@ha
>         addi 2,2,.TOC.-.LCF0@l
>         .localentry test,.-test
>         addis 10,2,.LANCHOR0@toc@ha
>         li 8,50
>         xxspltib 32,0
>         addi 9,1,-16
>         rldic 6,6,2,60
>         stxv 34,-16(1)
>         addi 10,10,.LANCHOR0@toc@l
>         mtctr 8
>         xxlor 33,32,32
>         stwx 5,9,6     // short store
>         lxv 45,0(10)
>         .p2align 4,,15
> .L2:
>         vadduwm 0,0,13
>         vadduwm 1,1,13
>         bdnz .L2
>         vadduwm 0,0,1
>         lxv 33,-16(1)  // wide load
>         vadduwm 2,0,1
>         blr
>
> Then I intend to use "v" there for global memory insert test to separate
> the store and load into different function ("v" is stored in function,
> but loaded out of function will get different performance for single store
> or lxv+xxperm+xxsel+stxv, the inner function doesn't know the distance
> between load and store across functions).
>
> Do you mean that we don't need generate IFN_SET if "v" is global memory?

Yes.

> I only see VAR_DECL and PARM_DECL, is there any function to check the tree
> variable is global?  I added DECL_REGISTER, but the RTL still expands to stack:

is_global_var () or alternatively !auto_var_in_fn_p (), I think doing
IFN_SET only makes sense if there's the chance we can promote the
variable to a register.  But it would be an incorrect transform (it
stores the whole vector) if the vector storage could "escape" to
another thread - which means you probably have to check
!TREE_ADDRESSABLE as well.

> gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, EXPAND_WRITE);
>
> (gdb) p view_op0
> $584 =
> (gdb) p DECL_REGISTER(view_op0)
> $585 = 1
> (gdb) p to_rtx
> $586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 S16 A128])
>
> > On gimple the above function (after fixing it) looks like
> >
> >   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
> >
> > and the IFN idea I had would - for non-global memory 'u' only - transform
> > this to
> >
> >   vector_register_2 = u;
> >   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
> >   u = vector_register_3;
> >
> > if vec_set can handle variable indexes.  This then becomes a
> > vec_set on a register and if that was the only variable indexing of 'u'
> > will also cause 'u' to be expanded as register rather than stack memory.
> >
> > Note we can't use the direct-optab method here since the vec_set optab
> > modifies operand 0 which isn't possible in SSA form.  That might hint
> > at that we eventually want to extend BIT_INSERT_EXPR to handle
> > a non-constant bit position but for experiments using an alternate
> > internal function is certainly easier.
>
> My current implementation does
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On 2020/9/14 17:47, Richard Biener wrote:
> On Mon, Sep 14, 2020 at 10:05 AM luoxhu wrote:
>
>> Not sure whether this reflects the issues you discussed above.
>> I constructed below test cases and tested with and without this patch,
>> only if "a+c"(which means store only), the performance is getting bad with
>> this patch due to extra load/store(about 50% slower).
>> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this
>> patch will always get big benefit.
>> But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
>> inside the function or out of the function soon.  Does this mean unify the two
>> usage to same gimple code not a good idea sometimes?  Or is it possible to
>> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
>> used not far away inside or out side of the function?
>>
>> #define TYPE int
>>
>> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
>> vector TYPE s = {4, 3, 2, 1};
>>
>> __attribute__ ((noinline))
>> vector TYPE
>> test (vector TYPE u, TYPE i, size_t n)
>> {
>>   v[n % 4] = i;
>
>      ^^^
>
> this should be
>
>    u[n % 4] = i;
>
> I guess.  Is the % 4 mandated by the vec_insert semantics btw?

Yes.  Segher pasted the builtin description in his reply.  "v = vec_insert (i, u, n);"
is a bit different with "u[n % 4] = i;" since it returns a copy of u instead of modify
u.  The adjust in lower __builtin_vec_insert_xxx gimple will make a copy of u first to
meet the requirements.

> If you tested with the above error you probably need to re-measure.

No I did test for u as local instead of global before.  If u is not used very
soon, the performance is almost the same for generating single store or IFN_SET
of inserting with variable index.

source:
__attribute__ ((noinline)) vector TYPE
test (vector TYPE u, TYPE i, size_t n)
{
  u[n % 4] = i;
  vector TYPE r = {0};
  for (long k = 0; k < 100; k++)
    {
      r += v;
    }
  return u+r;
}

=> store hit load is relieved due to long distance.

ASM:
0:      addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        .localentry test,.-test
        addis 10,2,.LANCHOR0@toc@ha
        li 8,50
        xxspltib 32,0
        addi 9,1,-16
        rldic 6,6,2,60
        stxv 34,-16(1)
        addi 10,10,.LANCHOR0@toc@l
        mtctr 8
        xxlor 33,32,32
        stwx 5,9,6     // short store
        lxv 45,0(10)
        .p2align 4,,15
.L2:
        vadduwm 0,0,13
        vadduwm 1,1,13
        bdnz .L2
        vadduwm 0,0,1
        lxv 33,-16(1)  // wide load
        vadduwm 2,0,1
        blr

Then I intend to use "v" there for global memory insert test to separate
the store and load into different function ("v" is stored in function,
but loaded out of function will get different performance for single store
or lxv+xxperm+xxsel+stxv, the inner function doesn't know the distance
between load and store across functions).

Do you mean that we don't need generate IFN_SET if "v" is global memory?
I only see VAR_DECL and PARM_DECL, is there any function to check the tree
variable is global?  I added DECL_REGISTER, but the RTL still expands to stack:

gcc/internal-fn.c: rtx to_rtx = expand_expr (view_op0, NULL_RTX, VOIDmode, EXPAND_WRITE);

(gdb) p view_op0
$584 =
(gdb) p DECL_REGISTER(view_op0)
$585 = 1
(gdb) p to_rtx
$586 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3190+0 S16 A128])

> On gimple the above function (after fixing it) looks like
>
>   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
>
> and the IFN idea I had would - for non-global memory 'u' only - transform
> this to
>
>   vector_register_2 = u;
>   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>   u = vector_register_3;
>
> if vec_set can handle variable indexes.  This then becomes a
> vec_set on a register and if that was the only variable indexing of 'u'
> will also cause 'u' to be expanded as register rather than stack memory.
>
> Note we can't use the direct-optab method here since the vec_set optab
> modifies operand 0 which isn't possible in SSA form.  That might hint
> at that we eventually want to extend BIT_INSERT_EXPR to handle
> a non-constant bit position but for experiments using an alternate
> internal function is certainly easier.

My current implementation does:

1) v = vec_insert (i, u, n);

=>gimple:
{
  register __vector signed int D.3190;
  D.3190 = u;   // *new decl and copy u first.*
  _1 = n & 3;
  VIEW_CONVERT_EXPR<vector(4) int>(D.3190)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
  _2 = D.3190;
  ...
}

=>isel:
{
  register __vector signed int D.3190;
  D.3190 = u_4(D);
  _1 = n_6(D) & 3;
  .VEC_SET (&D.3190, i_7(D), _1);
  _2 = D.3190;
  ...
}

2) u[n % 4] = i;

=>gimple:
{
  __vector signed int D.3191;
  _1 = n & 3;
  VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i;   // *update op0 of VIEW_CONVERT_EXPR*
  D.3191 = u;
  ...
}

=>isel:
{
  D.3190 = u_4(D);
  _1 = n_6(D) & 3;
  .VEC_SET (&D.3191, i_7(D), _1);
  _2 = D.3190;
  v = _2;
}

The IFN ".VEC_SET" behavior quite like the other IFN STORE functions and
doesn't require a dest operand to be set?  Both 1) and 2) are modifying
operand 0 of VIEW_CONV
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Mon, Sep 14, 2020 at 11:47:52AM +0100, Richard Sandiford wrote:
> Would it be worth changing the optab so that the input and output are
> separate?  Having a single operand would be justified if the operation
> was only supposed to touch the selected bytes, but most targets wouldn't
> guarantee that for memory operands, even as things stand.

You have my vote.

> Or maybe the idea was to force the RA's hand by making the input and
> output tied even before RA, with separate moves where necessary.
> But I'm not sure why vec_set is so special that it requires this
> treatment and other optabs don't.

Yeah.  The register allocator is normally very good in using the same
reg in both places, if that is useful.  And it also handles the case
where your machine insns require the two to be the same pretty well.
Not restricting this stuff before RA should be a win.


Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Mon, Sep 14, 2020 at 11:47:56AM +0200, Richard Biener wrote:
> this should be
>
>    u[n % 4] = i;
>
> I guess.  Is the % 4 mandated by the vec_insert semantics btw?

Yes:

  VEC_INSERT (ARG1, ARG2, ARG3)

  Purpose: Returns a copy of vector ARG2 with element ARG3 replaced by
  the value of ARG1.

  Result value: A copy of vector ARG2 with element ARG3 replaced by the
  value of ARG1.  This function uses modular arithmetic on ARG3 to
  determine the element number.  For example, if ARG3 is out of range,
  the compiler uses ARG3 modulo the number of elements in the vector to
  determine the element position.

The builtin requires it.  The machine insns work like that, too, e.g.:

  vinswlx VRT,RA,RB

    if MSR.VEC=0 then Vector_Unavailable()
    index ← GPR[RA].bit[60:63]
    VSR[VRT+32].byte[index:index+3] ← GPR[RB].bit[32:63]

  Let index be the contents of bits 60:63 of GPR[RA].
  The contents of bits 32:63 of GPR[RB] are placed into byte elements
  index:index+3 of VSR[VRT+32].
  All other byte elements of VSR[VRT+32] are not modified.
  If index is greater than 12, the result is undefined.


Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Thu, Sep 10, 2020 at 12:08:44PM +0200, Richard Biener wrote:
> On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool wrote:
> > There often are problems over function calls (where the compiler cannot
> > usually *see* how something is used).
>
> Yep.  The best way would be to use small loads and larger stores
> which is what CPUs usually tend to handle fine (with alignment
> constraints, etc.).  Of course that's not what either of the "solutions"
> can do.

Yes, and yes.

> That said, since you seem to be "first" in having an instruction
> to insert into a vector at a variable position the idea that we'd
> have to spill anyway for this to be expanded and thus we expand
> the vector to a stack location in the first place falls down.  And
> that's where I'd first try to improve things.
>
> So what can the CPU actually do?

Both immediate and variable inserts, of 1, 2, 4, or 8 bytes.  The
inserted part is not allowed to cross the 16B boundary (all aligned
stuff never has that problem).  Variable inserts look at only the low
bits of the GPR that says where to insert (4 bits for bytes, 3 bits for
halfs, etc.)


Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
Richard Biener writes:
> On Mon, Sep 14, 2020 at 12:47 PM Richard Sandiford wrote:
>>
>> Richard Biener via Gcc-patches writes:
>> > On gimple the above function (after fixing it) looks like
>> >
>> >   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
>> >
>> > and the IFN idea I had would - for non-global memory 'u' only - transform
>> > this to
>> >
>> >   vector_register_2 = u;
>> >   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>> >   u = vector_register_3;
>> >
>> > if vec_set can handle variable indexes.  This then becomes a
>> > vec_set on a register and if that was the only variable indexing of 'u'
>> > will also cause 'u' to be expanded as register rather than stack memory.
>> >
>> > Note we can't use the direct-optab method here since the vec_set optab
>> > modifies operand 0 which isn't possible in SSA form.
>>
>> Would it be worth changing the optab so that the input and output are
>> separate?  Having a single operand would be justified if the operation
>> was only supposed to touch the selected bytes, but most targets wouldn't
>> guarantee that for memory operands, even as things stand.
>
> I thought about this as well, just didn't want to bother Xiong Hu Luo with
> it for the experiments.
>
>> Or maybe the idea was to force the RA's hand by making the input and
>> output tied even before RA, with separate moves where necessary.
>> But I'm not sure why vec_set is so special that it requires this
>> treatment and other optabs don't.
>
> Certainly the define_expand do not have to be define_insn so the target
> can force the RAs hand just fine if it likes to.
>
> The more interesting question of course is how to query vec_set whether
> it accepts variable indices w/o building too much garbage RTL.

Probably easiest to do something like can_vcond_compare_p, i.e.:

  machine_mode mode = insn_data[icode].operand[N].mode;
  rtx reg = alloca_raw_REG (mode, LAST_VIRTUAL_REGISTER + 1);
  …
  insn_operand_matches (icode, N, reg)

Thanks,
Richard
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Mon, Sep 14, 2020 at 12:47 PM Richard Sandiford wrote:
>
> Richard Biener via Gcc-patches writes:
> > On gimple the above function (after fixing it) looks like
> >
> >   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
> >
> > and the IFN idea I had would - for non-global memory 'u' only - transform
> > this to
> >
> >   vector_register_2 = u;
> >   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
> >   u = vector_register_3;
> >
> > if vec_set can handle variable indexes.  This then becomes a
> > vec_set on a register and if that was the only variable indexing of 'u'
> > will also cause 'u' to be expanded as register rather than stack memory.
> >
> > Note we can't use the direct-optab method here since the vec_set optab
> > modifies operand 0 which isn't possible in SSA form.
>
> Would it be worth changing the optab so that the input and output are
> separate?  Having a single operand would be justified if the operation
> was only supposed to touch the selected bytes, but most targets wouldn't
> guarantee that for memory operands, even as things stand.

I thought about this as well, just didn't want to bother Xiong Hu Luo with
it for the experiments.

> Or maybe the idea was to force the RA's hand by making the input and
> output tied even before RA, with separate moves where necessary.
> But I'm not sure why vec_set is so special that it requires this
> treatment and other optabs don't.

Certainly the define_expand do not have to be define_insn so the target
can force the RAs hand just fine if it likes to.

The more interesting question of course is how to query vec_set whether
it accepts variable indices w/o building too much garbage RTL.

Richard.

>
> Thanks,
> Richard
>
> > That might hint at that we eventually want to extend BIT_INSERT_EXPR
> > to handle a non-constant bit position but for experiments using an
> > alternate internal function is certainly easier.
> >
> > Richard.
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
Richard Biener via Gcc-patches writes:
> On gimple the above function (after fixing it) looks like
>
>   VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);
>
> and the IFN idea I had would - for non-global memory 'u' only - transform
> this to
>
>   vector_register_2 = u;
>   vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
>   u = vector_register_3;
>
> if vec_set can handle variable indexes.  This then becomes a
> vec_set on a register and if that was the only variable indexing of 'u'
> will also cause 'u' to be expanded as register rather than stack memory.
>
> Note we can't use the direct-optab method here since the vec_set optab
> modifies operand 0 which isn't possible in SSA form.

Would it be worth changing the optab so that the input and output are
separate?  Having a single operand would be justified if the operation
was only supposed to touch the selected bytes, but most targets wouldn't
guarantee that for memory operands, even as things stand.

Or maybe the idea was to force the RA's hand by making the input and
output tied even before RA, with separate moves where necessary.
But I'm not sure why vec_set is so special that it requires this
treatment and other optabs don't.

Thanks,
Richard

> That might hint at that we eventually want to extend BIT_INSERT_EXPR
> to handle a non-constant bit position but for experiments using an
> alternate internal function is certainly easier.
>
> Richard.
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Mon, Sep 14, 2020 at 10:05 AM luoxhu wrote:
>
> On 2020/9/10 18:08, Richard Biener wrote:
> > On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool wrote:
> >>
> >> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote:
> >>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool wrote:
> >>>> Hi!
> >>>>
> >>>> On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote:
> >>>>> Hmm, yeah - I guess that's what should be addressed first then.
> >>>>> I'm quite sure that in case 'v' is not on the stack but in memory like
> >>>>> in my case a SImode store is better than what we get from
> >>>>> vec_insert - in fact vec_insert will likely introduce a RMW cycle
> >>>>> which is prone to inserting store-data-races?
> >>>>
> >>>> The other way around -- if it is in memory, and was stored as vector
> >>>> recently, then reading back something shorter from it is prone to
> >>>> SHL/LHS problems.  There is nothing special about the stack here, except
> >>>> of course it is more likely to have been stored recently if on the
> >>>> stack.  So it depends how often it has been stored recently which option
> >>>> is best.  On newer CPUs, although they can avoid SHL/LHS flushes more
> >>>> often, the penalty is relatively bigger, so memory does not often win.
> >>>>
> >>>> I.e.: it needs to be measured.  Intuition is often wrong here.
> >>>
> >>> But the current method would simply do a direct store to memory
> >>> without a preceeding read of the whole vector.
> >>
> >> The problem is even worse the other way: you do a short store here, but
> >> so a full vector read later.  If the store and read are far apart, that
> >> is fine, but if they are close (that is on the order of fifty or more
> >> insns), there can be problems.
> >
> > Sure, but you can't simply load/store a whole vector when the code didn't
> > unless you know it will not introduce data races and it will not trap (thus
> > the whole vector needs to be at least naturally aligned).
> >
> > Also if there's a preceeding short store you will now load the whole vector
> > to avoid the short store ... catch-22
> >
> >> There often are problems over function calls (where the compiler cannot
> >> usually *see* how something is used).
> >
> > Yep.  The best way would be to use small loads and larger stores
> > which is what CPUs usually tend to handle fine (with alignment
> > constraints, etc.).  Of course that's not what either of the "solutions"
> > can do.
> >
> > That said, since you seem to be "first" in having an instruction
> > to insert into a vector at a variable position the idea that we'd
> > have to spill anyway for this to be expanded and thus we expand
> > the vector to a stack location in the first place falls down.  And
> > that's where I'd first try to improve things.
> >
> > So what can the CPU actually do?
> >
>
> Not sure whether this reflects the issues you discussed above.
> I constructed below test cases and tested with and without this patch,
> only if "a+c"(which means store only), the performance is getting bad with
> this patch due to extra load/store(about 50% slower).
> For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this
> patch will always get big benefit.
> But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately
> inside the function or out of the function soon.  Does this mean unify the two
> usage to same gimple code not a good idea sometimes?  Or is it possible to
> detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is
> used not far away inside or out side of the function?
>
>
> #define TYPE int
>
> vector TYPE v = {1, 2, 3, 4};   // a. global vector.
> vector TYPE s = {4, 3, 2, 1};
>
> __attribute__ ((noinline))
> vector TYPE
> test (vector TYPE u, TYPE i, size_t n)
> {
>   v[n % 4] = i;

     ^^^

this should be

   u[n % 4] = i;

I guess.  Is the % 4 mandated by the vec_insert semantics btw?

If you tested with the above error you probably need to re-measure.

On gimple the above function (after fixing it) looks like

  VIEW_CONVERT_EXPR<vector(4) int>(u)[_1] = i_4(D);

and the IFN idea I had would - for non-global memory 'u' only - transform
this to

  vector_register_2 = u;
  vector_register_3 = .IFN_VEC_SET (vector_register_2, _1, i_4(D));
  u = vector_register_3;

if vec_set can handle variable indexes.  This then becomes a
vec_set on a register and if that was the only variable indexing of 'u'
will also cause 'u' to be expanded as register rather than stack memory.

Note we can't use the direct-optab method here since the vec_set optab
modifies operand 0 which isn't possible in SSA form.  That might hint
at that we eventually want to extend BIT_INSERT_EXPR to handle
a non-constant bit position but for experiments using an alternate
internal function is certainly easier.

Richard.

> return u;
> }
>
> int main ()
> {
>   //vector TYPE v = {1, 2, 3, 4};   // b. local vector.
>   //vector TYPE s = {4, 3, 2, 1};
>   vector TYPE r = {0};
>   for (
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On 2020/9/10 18:08, Richard Biener wrote: > On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool > wrote: >> >> On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote: >>> On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool >>> wrote: Hi! On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote: > Hmm, yeah - I guess that's what should be addressed first then. > I'm quite sure that in case 'v' is not on the stack but in memory like > in my case a SImode store is better than what we get from > vec_insert - in fact vec_insert will likely introduce a RMW cycle > which is prone to inserting store-data-races? The other way around -- if it is in memory, and was stored as vector recently, then reading back something shorter from it is prone to SHL/LHS problems. There is nothing special about the stack here, except of course it is more likely to have been stored recently if on the stack. So it depends how often it has been stored recently which option is best. On newer CPUs, although they can avoid SHL/LHS flushes more often, the penalty is relatively bigger, so memory does not often win. I.e.: it needs to be measured. Intuition is often wrong here. >>> >>> But the current method would simply do a direct store to memory >>> without a preceeding read of the whole vector. >> >> The problem is even worse the other way: you do a short store here, but >> so a full vector read later. If the store and read are far apart, that >> is fine, but if they are close (that is on the order of fifty or more >> insns), there can be problems. > > Sure, but you can't simply load/store a whole vector when the code didn't > unless you know it will not introduce data races and it will not trap (thus > the whole vector needs to be at least naturally aligned). > > Also if there's a preceeding short store you will now load the whole vector > to avoid the short store ... 
catch-22 > >> There often are problems over function calls (where the compiler cannot >> usually *see* how something is used). > > Yep. The best way would be to use small loads and larger stores > which is what CPUs usually tend to handle fine (with alignment > constraints, etc.). Of course that's not what either of the "solutions" > can do. > > That said, since you seem to be "first" in having an instruction > to insert into a vector at a variable position the idea that we'd > have to spill anyway for this to be expanded and thus we expand > the vector to a stack location in the first place falls down. And > that's where I'd first try to improve things. > > So what can the CPU actually do? > Not sure whether this reflects the issues you discussed above. I constructed below test cases and tested with and without this patch, only if "a+c"(which means store only), the performance is getting bad with this patch due to extra load/store(about 50% slower). For "v = vec_insert (i, v, n);" usage, v is always loaded after store, so this patch will always get big benefit. But for "v[n % 4] = i;" usage, it depends on whether "v" is used immediately inside the function or out of the function soon. Does this mean unify the two usage to same gimple code not a good idea sometimes? Or is it possible to detect the generated IFN ".VEC_INSERT (&v, i_4(D), _1);" destination "v" is used not far away inside or out side of the function? #define TYPE int vector TYPE v = {1, 2, 3, 4}; // a. global vector. vector TYPE s = {4, 3, 2, 1}; __attribute__ ((noinline)) vector TYPE test (vector TYPE u, TYPE i, size_t n) { v[n % 4] = i; return u; } int main () { //vector TYPE v = {1, 2, 3, 4};// b. local vector. //vector TYPE s = {4, 3, 2, 1}; vector TYPE r = {0}; for (long k = 0; k < 3989478945; k++) { r += test (s, 254.0f, k); // c. access different vector. store only. //r += test (v, 254.0f, k); // d. access same vector. load after store in callee function. //test (s, 254.0f, k); r += v; //e. 
load after store out of function. } return r[0]; } Thanks, Xionghu
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Wed, Sep 9, 2020 at 6:03 PM Segher Boessenkool wrote: > > On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote: > > On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool > > wrote: > > > > > > Hi! > > > > > > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote: > > > > Hmm, yeah - I guess that's what should be addressed first then. > > > > I'm quite sure that in case 'v' is not on the stack but in memory like > > > > in my case a SImode store is better than what we get from > > > > vec_insert - in fact vec_insert will likely introduce a RMW cycle > > > > which is prone to inserting store-data-races? > > > > > > The other way around -- if it is in memory, and was stored as vector > > > recently, then reading back something shorter from it is prone to > > > SHL/LHS problems. There is nothing special about the stack here, except > > > of course it is more likely to have been stored recently if on the > > > stack. So it depends how often it has been stored recently which option > > > is best. On newer CPUs, although they can avoid SHL/LHS flushes more > > > often, the penalty is relatively bigger, so memory does not often win. > > > > > > I.e.: it needs to be measured. Intuition is often wrong here. > > > > But the current method would simply do a direct store to memory > > without a preceeding read of the whole vector. > > The problem is even worse the other way: you do a short store here, but > so a full vector read later. If the store and read are far apart, that > is fine, but if they are close (that is on the order of fifty or more > insns), there can be problems. Sure, but you can't simply load/store a whole vector when the code didn't unless you know it will not introduce data races and it will not trap (thus the whole vector needs to be at least naturally aligned). Also if there's a preceeding short store you will now load the whole vector to avoid the short store ... 
catch-22 > There often are problems over function calls (where the compiler cannot > usually *see* how something is used). Yep. The best way would be to use small loads and larger stores which is what CPUs usually tend to handle fine (with alignment constraints, etc.). Of course that's not what either of the "solutions" can do. That said, since you seem to be "first" in having an instruction to insert into a vector at a variable position the idea that we'd have to spill anyway for this to be expanded and thus we expand the vector to a stack location in the first place falls down. And that's where I'd first try to improve things. So what can the CPU actually do? Richard. > > > Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Wed, Sep 09, 2020 at 04:28:19PM +0200, Richard Biener wrote: > On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool > wrote: > > > > Hi! > > > > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote: > > > Hmm, yeah - I guess that's what should be addressed first then. > > > I'm quite sure that in case 'v' is not on the stack but in memory like > > > in my case a SImode store is better than what we get from > > > vec_insert - in fact vec_insert will likely introduce a RMW cycle > > > which is prone to inserting store-data-races? > > > > The other way around -- if it is in memory, and was stored as vector > > recently, then reading back something shorter from it is prone to > > SHL/LHS problems. There is nothing special about the stack here, except > > of course it is more likely to have been stored recently if on the > > stack. So it depends how often it has been stored recently which option > > is best. On newer CPUs, although they can avoid SHL/LHS flushes more > > often, the penalty is relatively bigger, so memory does not often win. > > > > I.e.: it needs to be measured. Intuition is often wrong here. > > But the current method would simply do a direct store to memory > without a preceeding read of the whole vector. The problem is even worse the other way: you do a short store here, but so a full vector read later. If the store and read are far apart, that is fine, but if they are close (that is on the order of fifty or more insns), there can be problems. There often are problems over function calls (where the compiler cannot usually *see* how something is used). Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Wed, Sep 9, 2020 at 3:49 PM Segher Boessenkool wrote: > > Hi! > > On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote: > > Hmm, yeah - I guess that's what should be addressed first then. > > I'm quite sure that in case 'v' is not on the stack but in memory like > > in my case a SImode store is better than what we get from > > vec_insert - in fact vec_insert will likely introduce a RMW cycle > > which is prone to inserting store-data-races? > > The other way around -- if it is in memory, and was stored as vector > recently, then reading back something shorter from it is prone to > SHL/LHS problems. There is nothing special about the stack here, except > of course it is more likely to have been stored recently if on the > stack. So it depends how often it has been stored recently which option > is best. On newer CPUs, although they can avoid SHL/LHS flushes more > often, the penalty is relatively bigger, so memory does not often win. > > I.e.: it needs to be measured. Intuition is often wrong here. But the current method would simply do a direct store to memory without a preceding read of the whole vector. Richard. > > > Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
Hi! On Tue, Sep 08, 2020 at 10:26:51AM +0200, Richard Biener wrote: > Hmm, yeah - I guess that's what should be addressed first then. > I'm quite sure that in case 'v' is not on the stack but in memory like > in my case a SImode store is better than what we get from > vec_insert - in fact vec_insert will likely introduce a RMW cycle > which is prone to inserting store-data-races? The other way around -- if it is in memory, and was stored as vector recently, then reading back something shorter from it is prone to SHL/LHS problems. There is nothing special about the stack here, except of course it is more likely to have been stored recently if on the stack. So it depends how often it has been stored recently which option is best. On newer CPUs, although they can avoid SHL/LHS flushes more often, the penalty is relatively bigger, so memory does not often win. I.e.: it needs to be measured. Intuition is often wrong here. Segher
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Wed, Sep 9, 2020 at 3:47 AM luoxhu wrote: > > > > On 2020/9/8 16:26, Richard Biener wrote: > >> Seems not only pseudo, for example "v = vec_insert (i, v, n);" > >> the vector variable will be store to stack first, then [r112:DI] is a > >> memory here to be processed. So the patch loads it from stack(insn #10) to > >> temp vector register first, and store to stack again(insn #24) after > >> rs6000_vector_set_var. > > Hmm, yeah - I guess that's what should be addressed first then. > > I'm quite sure that in case 'v' is not on the stack but in memory like > > in my case a SImode store is better than what we get from > > vec_insert - in fact vec_insert will likely introduce a RMW cycle > > which is prone to inserting store-data-races? > > Yes, for your case, there is no stack operation and to_rtx is expanded > with BLKmode instead of V4SImode. Add the to_rtx mode check could workaround > it. ASM doesn't show store hit load issue. > > optimized: > > _1 = i_2(D) % 4; > VIEW_CONVERT_EXPR(x.u)[_1] = a_4(D); > > expand: > 2: r118:DI=%3:DI > 3: r119:DI=%4:DI > 4: NOTE_INSN_FUNCTION_BEG > 7: r120:DI=unspec[`*.LANCHOR0',%2:DI] 47 > REG_EQUAL `*.LANCHOR0' > 8: r122:SI=r118:DI#0 > 9: {r124:SI=r122:SI/0x4;clobber ca:SI;} >10: r125:SI=r124:SI<<0x2 >11: r123:SI=r122:SI-r125:SI > REG_EQUAL r122:SI%0x4 >12: r126:DI=sign_extend(r123:SI) >13: r127:DI=r126:DI+0x4 >14: r128:DI=r127:DI<<0x2 >15: r129:DI=r120:DI+r128:DI >16: [r129:DI]=r119:DI#0 > > p to_rtx > $319 = (rtx_def *) (mem/c:BLK (reg/f:DI 120) [2 x+0 S32 A128]) > > asm: > addis 2,12,.TOC.-.LCF0@ha > addi 2,2,.TOC.-.LCF0@l > .localentry test,.-test > srawi 9,3,2 > addze 9,9 > addis 10,2,.LANCHOR0@toc@ha > addi 10,10,.LANCHOR0@toc@l > slwi 9,9,2 > subf 9,9,3 > extsw 9,9 > addi 9,9,4 > sldi 9,9,2 > stwx 4,10,9 > blr > > > > > > So - what we need to "fix" is cfgexpand.c marking variably-indexed > > decls as not to be expanded as registers (see > > discover_nonconstant_array_refs). 
> > > > I guess one way forward would be to perform instruction > > selection on GIMPLE here and transform > > > > VIEW_CONVERT_EXPR(D.3185)[_1] = i_6(D) > > > > to a (direct) internal function based on the vec_set optab. > > I don't quite understand what you mean here. Do you mean: > ALTIVEC_BUILTIN_VEC_INSERT -> VIEW_CONVERT_EXPR -> internal function -> > vec_set You're writing VIEW_CONVERT_EXPR here but the outermost component is an ARRAY_REF. But yes, this is what I meant. > or ALTIVEC_BUILTIN_VEC_INSERT -> internal function -> vec_set? > And which pass to put the selection and transform is acceptable? Close to RTL expansion. There's gimple-isel.cc which does instruction selection for VEC_COND_EXPRs. > Why call it *based on* vec_set optab? The VIEW_CONVERT_EXPR or internal > function > is expanded to vec_set optab. Based on because we have the convenient capability to represent optabs to be used for RTL expansion as internal function calls on GIMPLE, called "direct internal function". > I guess you suggest adding internal function for VIEW_CONVERT_EXPR in gimple, > and do the transform from internal function to vec_set optab in expander? No, I suggest to "add" an internal function for the vec_set optab, see DEF_INTERNAL_OPTAB_FN in internal-fn.def > I doubt my understanding as this looks really over complicated since we > transform from VIEW_CONVERT_EXPR to vec_set optab directly so far... > IIUC, Internal function seems doesn't help much here as Segher said before. The advantage would be to circumvent GIMPLEs forcing of memory here. But as I said here: > > But then in GIMPLE D.3185 is also still memory (we don't have a variable > > index partial register set operation - BIT_INSERT_EXPR is > > currently specified to receive a constant bit position only). it might not work out so easy. Going down the rathole to avoid forcing memory during RTL expansion for select cases (vector type bases with a supported vector mode) might be something to try. 
That at least would make the approach of dealing with this in expand_assignment or siblings sensible. > > At which point after your patch is the stack storage elided? > > > > Stack storage is elided by the register reload pass in RTL. > > > Thanks, > Xionghu
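For reference, the mechanism Richard points at is the DEF_INTERNAL_OPTAB_FN machinery in internal-fn.def; a direct internal function tied to the vec_set optab would be declared along these lines (a sketch of what such an entry could look like, not an existing entry at the time of this thread — the flags and type column are placeholders):

```c
/* Sketch only: NAME, flags, optab, expander type.  The exact flags and
   expander-type column would need to match what vec_set expansion
   requires.  */
DEF_INTERNAL_OPTAB_FN (VEC_SET, 0, vec_set, vec_set)
```

Declaring it this way gives GIMPLE a call `.VEC_SET (vec, idx, val)` that expands directly through the named optab, which is what "direct internal function" means below.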
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On 2020/9/8 16:26, Richard Biener wrote: >> Seems not only pseudo, for example "v = vec_insert (i, v, n);" >> the vector variable will be store to stack first, then [r112:DI] is a >> memory here to be processed. So the patch loads it from stack(insn #10) to >> temp vector register first, and store to stack again(insn #24) after >> rs6000_vector_set_var. > Hmm, yeah - I guess that's what should be addressed first then. > I'm quite sure that in case 'v' is not on the stack but in memory like > in my case a SImode store is better than what we get from > vec_insert - in fact vec_insert will likely introduce a RMW cycle > which is prone to inserting store-data-races? Yes, for your case, there is no stack operation and to_rtx is expanded with BLKmode instead of V4SImode. Add the to_rtx mode check could workaround it. ASM doesn't show store hit load issue. optimized: _1 = i_2(D) % 4; VIEW_CONVERT_EXPR(x.u)[_1] = a_4(D); expand: 2: r118:DI=%3:DI 3: r119:DI=%4:DI 4: NOTE_INSN_FUNCTION_BEG 7: r120:DI=unspec[`*.LANCHOR0',%2:DI] 47 REG_EQUAL `*.LANCHOR0' 8: r122:SI=r118:DI#0 9: {r124:SI=r122:SI/0x4;clobber ca:SI;} 10: r125:SI=r124:SI<<0x2 11: r123:SI=r122:SI-r125:SI REG_EQUAL r122:SI%0x4 12: r126:DI=sign_extend(r123:SI) 13: r127:DI=r126:DI+0x4 14: r128:DI=r127:DI<<0x2 15: r129:DI=r120:DI+r128:DI 16: [r129:DI]=r119:DI#0 p to_rtx $319 = (rtx_def *) (mem/c:BLK (reg/f:DI 120) [2 x+0 S32 A128]) asm: addis 2,12,.TOC.-.LCF0@ha addi 2,2,.TOC.-.LCF0@l .localentry test,.-test srawi 9,3,2 addze 9,9 addis 10,2,.LANCHOR0@toc@ha addi 10,10,.LANCHOR0@toc@l slwi 9,9,2 subf 9,9,3 extsw 9,9 addi 9,9,4 sldi 9,9,2 stwx 4,10,9 blr > > So - what we need to "fix" is cfgexpand.c marking variably-indexed > decls as not to be expanded as registers (see > discover_nonconstant_array_refs). > > I guess one way forward would be to perform instruction > selection on GIMPLE here and transform > > VIEW_CONVERT_EXPR(D.3185)[_1] = i_6(D) > > to a (direct) internal function based on the vec_set optab. 
I don't quite understand what you mean here. Do you mean: ALTIVEC_BUILTIN_VEC_INSERT -> VIEW_CONVERT_EXPR -> internal function -> vec_set or ALTIVEC_BUILTIN_VEC_INSERT -> internal function -> vec_set? And in which pass should the selection and transform be done? Why call it *based on* the vec_set optab? The VIEW_CONVERT_EXPR or internal function is expanded to the vec_set optab. I guess you suggest adding an internal function for VIEW_CONVERT_EXPR in gimple, and doing the transform from the internal function to the vec_set optab in the expander? I doubt my understanding, as this looks really overcomplicated, since so far we transform from VIEW_CONVERT_EXPR to the vec_set optab directly... IIUC, an internal function doesn't seem to help much here, as Segher said before. > But then in GIMPLE D.3185 is also still memory (we don't have a variable > index partial register set operation - BIT_INSERT_EXPR is > currently specified to receive a constant bit position only). > > At which point after your patch is the stack storage elided? > Stack storage is elided by the register reload pass in RTL. Thanks, Xionghu
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Tue, Sep 8, 2020 at 10:11 AM luoxhu wrote: > > Hi Richi, > > On 2020/9/7 19:57, Richard Biener wrote: > > + if (TREE_CODE (to) == ARRAY_REF) > > + { > > + tree op0 = TREE_OPERAND (to, 0); > > + if (TREE_CODE (op0) == VIEW_CONVERT_EXPR > > + && expand_view_convert_to_vec_set (to, from, to_rtx)) > > + { > > + pop_temp_slots (); > > + return; > > + } > > + } > > > > you're placing this at an awkward spot IMHO, after to_rtx expansion > > but disregading parts of it and compensating just with 'to' matching. > > Is the pieces (offset, bitpos) really too awkward to work with for > > matching? > > > > Because as written you'll miscompile > > > > struct X { _vector signed int v; _vector singed int u; } x; > > > > test(int i, int a) > > { > >x.u[i] = a; > > } > > > > as I think you'll end up assigning to x.v. > > Thanks for pointing out, this case will be a problem for the patch. > I checked with optimize_bitfield_assignment_op, it will return very early > as the mode1 is not VOIDmode, and this is actually not "FIELD op= VAL" > operation? > > To be honest, I am not quite familiar with this part of code, I put the new > function expand_view_convert_to_vec_set just after to_rtx expansion because > adjust_address will change the V4SImode memory to SImode memory, but I need > keep target to_rtx V4SImode to save the vector after calling > rs6000_vector_set_var, so seems paradoxical here? > > p to_rtx > $264 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 > S16 A128]) > > => to_rtx = adjust_address (to_rtx, mode1, 0); > > p to_rtx > $265 = (rtx_def *) (mem/c:SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S4 > A128]) > > > > > > Are we just interested in the case were we store to a > > pseudo or also when the destination is memory? I guess > > only when it's a pseudo - correct? In that case > > handling this all in optimize_bitfield_assignment_op > > is probably the best thing to try. 
> > > > Note we possibly refrain from assigning a pseudo to > > such vector because we see a variable array-ref to it. > > Seems not only pseudo, for example "v = vec_insert (i, v, n);" > the vector variable will be store to stack first, then [r112:DI] is a > memory here to be processed. So the patch loads it from stack(insn #10) to > temp vector register first, and store to stack again(insn #24) after > rs6000_vector_set_var. Hmm, yeah - I guess that's what should be addressed first then. I'm quite sure that in case 'v' is not on the stack but in memory like in my case a SImode store is better than what we get from vec_insert - in fact vec_insert will likely introduce a RMW cycle which is prone to inserting store-data-races? So - what we need to "fix" is cfgexpand.c marking variably-indexed decls as not to be expanded as registers (see discover_nonconstant_array_refs). I guess one way forward would be to perform instruction selection on GIMPLE here and transform VIEW_CONVERT_EXPR(D.3185)[_1] = i_6(D) to a (direct) internal function based on the vec_set optab. But then in GIMPLE D.3185 is also still memory (we don't have a variable index partial register set operation - BIT_INSERT_EXPR is currently specified to receive a constant bit position only). At which point after your patch is the stack storage elided? 
> > optimized: > > D.3185 = v_3(D); > _1 = n_5(D) & 3; > VIEW_CONVERT_EXPR(D.3185)[_1] = i_6(D); > v_8 = D.3185; > return v_8; > > => expand without the patch: > > 2: r119:V4SI=%2:V4SI > 3: r120:DI=%5:DI > 4: r121:DI=%6:DI > 5: NOTE_INSN_FUNCTION_BEG > 8: [r112:DI]=r119:V4SI > > 9: r122:DI=r121:DI&0x3 >10: r123:DI=r122:DI<<0x2 >11: r124:DI=r112:DI+r123:DI >12: [r124:DI]=r120:DI#0 > >13: r126:V4SI=[r112:DI] >14: r118:V4SI=r126:V4SI >18: %2:V4SI=r118:V4SI >19: use %2:V4SI > > => expand with the patch (replace #9~#12 to #10~#24): > > 2: r119:V4SI=%2:V4SI > 3: r120:DI=%5:DI > 4: r121:DI=%6:DI > 5: NOTE_INSN_FUNCTION_BEG > 8: [r112:DI]=r119:V4SI > 9: r122:DI=r121:DI&0x3 > >10: r123:V4SI=[r112:DI] // load from stack >11: {r125:SI=0x3-r122:DI#0;clobber ca:SI;} >12: r125:SI=r125:SI<<0x2 >13: {r125:SI=0x14-r125:SI;clobber ca:SI;} >14: r128:DI=unspec[`*.LC0',%2:DI] 47 > REG_EQUAL `*.LC0' >15: r127:V2DI=[r128:DI] > REG_EQUAL const_vector >16: r126:V16QI=r127:V2DI#0 >17: r129:V16QI=unspec[r120:DI#0] 61 >18: r130:V16QI=unspec[r125:SI] 151 >19: r131:V16QI=unspec[r129:V16QI,r129:V16QI,r130:V16QI] 232 >20: r132:V16QI=unspec[r126:V16QI,r126:V16QI,r130:V16QI] 232 >21: r124:V16QI=r123:V4SI#0 >22: r124:V16QI={(r132:V16QI!=const_vector)?r131:V16QI:r124:V16QI} >23: r123:V4SI=r124:V16QI#0 >24: [r112:DI]=r123:V4SI // store to stack. > >25: r134:V4SI=[r112:DI] >26: r118:V4SI=r134:V4SI >30: %2:V4SI=r118:V4SI >31: use %2:V4SI > > > Thanks, > Xionghu
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
Hi Richi, On 2020/9/7 19:57, Richard Biener wrote: > + if (TREE_CODE (to) == ARRAY_REF) > + { > + tree op0 = TREE_OPERAND (to, 0); > + if (TREE_CODE (op0) == VIEW_CONVERT_EXPR > + && expand_view_convert_to_vec_set (to, from, to_rtx)) > + { > + pop_temp_slots (); > + return; > + } > + } > > you're placing this at an awkward spot IMHO, after to_rtx expansion > but disregading parts of it and compensating just with 'to' matching. > Is the pieces (offset, bitpos) really too awkward to work with for > matching? > > Because as written you'll miscompile > > struct X { _vector signed int v; _vector singed int u; } x; > > test(int i, int a) > { >x.u[i] = a; > } > > as I think you'll end up assigning to x.v. Thanks for pointing out, this case will be a problem for the patch. I checked with optimize_bitfield_assignment_op, it will return very early as the mode1 is not VOIDmode, and this is actually not "FIELD op= VAL" operation? To be honest, I am not quite familiar with this part of code, I put the new function expand_view_convert_to_vec_set just after to_rtx expansion because adjust_address will change the V4SImode memory to SImode memory, but I need keep target to_rtx V4SImode to save the vector after calling rs6000_vector_set_var, so seems paradoxical here? p to_rtx $264 = (rtx_def *) (mem/c:V4SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S16 A128]) => to_rtx = adjust_address (to_rtx, mode1, 0); p to_rtx $265 = (rtx_def *) (mem/c:SI (reg/f:DI 112 virtual-stack-vars) [1 D.3186+0 S4 A128]) > > Are we just interested in the case were we store to a > pseudo or also when the destination is memory? I guess > only when it's a pseudo - correct? In that case > handling this all in optimize_bitfield_assignment_op > is probably the best thing to try. > > Note we possibly refrain from assigning a pseudo to > such vector because we see a variable array-ref to it. 
Seems not only pseudo, for example "v = vec_insert (i, v, n);" the vector variable will be store to stack first, then [r112:DI] is a memory here to be processed. So the patch loads it from stack(insn #10) to temp vector register first, and store to stack again(insn #24) after rs6000_vector_set_var. optimized: D.3185 = v_3(D); _1 = n_5(D) & 3; VIEW_CONVERT_EXPR(D.3185)[_1] = i_6(D); v_8 = D.3185; return v_8; => expand without the patch: 2: r119:V4SI=%2:V4SI 3: r120:DI=%5:DI 4: r121:DI=%6:DI 5: NOTE_INSN_FUNCTION_BEG 8: [r112:DI]=r119:V4SI 9: r122:DI=r121:DI&0x3 10: r123:DI=r122:DI<<0x2 11: r124:DI=r112:DI+r123:DI 12: [r124:DI]=r120:DI#0 13: r126:V4SI=[r112:DI] 14: r118:V4SI=r126:V4SI 18: %2:V4SI=r118:V4SI 19: use %2:V4SI => expand with the patch (replace #9~#12 to #10~#24): 2: r119:V4SI=%2:V4SI 3: r120:DI=%5:DI 4: r121:DI=%6:DI 5: NOTE_INSN_FUNCTION_BEG 8: [r112:DI]=r119:V4SI 9: r122:DI=r121:DI&0x3 10: r123:V4SI=[r112:DI] // load from stack 11: {r125:SI=0x3-r122:DI#0;clobber ca:SI;} 12: r125:SI=r125:SI<<0x2 13: {r125:SI=0x14-r125:SI;clobber ca:SI;} 14: r128:DI=unspec[`*.LC0',%2:DI] 47 REG_EQUAL `*.LC0' 15: r127:V2DI=[r128:DI] REG_EQUAL const_vector 16: r126:V16QI=r127:V2DI#0 17: r129:V16QI=unspec[r120:DI#0] 61 18: r130:V16QI=unspec[r125:SI] 151 19: r131:V16QI=unspec[r129:V16QI,r129:V16QI,r130:V16QI] 232 20: r132:V16QI=unspec[r126:V16QI,r126:V16QI,r130:V16QI] 232 21: r124:V16QI=r123:V4SI#0 22: r124:V16QI={(r132:V16QI!=const_vector)?r131:V16QI:r124:V16QI} 23: r123:V4SI=r124:V16QI#0 24: [r112:DI]=r123:V4SI // store to stack. 25: r134:V4SI=[r112:DI] 26: r118:V4SI=r134:V4SI 30: %2:V4SI=r118:V4SI 31: use %2:V4SI Thanks, Xionghu
Re: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
On Mon, Sep 7, 2020 at 7:44 AM luoxhu wrote: > > Hi, > > On 2020/9/4 18:23, Segher Boessenkool wrote: > >> diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c > >> index 03b00738a5e..00c65311f76 100644 > >> --- a/gcc/config/rs6000/rs6000-c.c > >> +++ b/gcc/config/rs6000/rs6000-c.c > >> /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0. > >> */ > >> @@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, > >> tree fndecl, > >>SET_EXPR_LOCATION (stmt, loc); > >>stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt); > >> } > >> - > >> - innerptrtype = build_pointer_type (arg1_inner_type); > >> - > >> - stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0); > >> - stmt = convert (innerptrtype, stmt); > >> - stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1); > >> - stmt = build_indirect_ref (loc, stmt, RO_NULL); > >> - stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt, > >> -convert (TREE_TYPE (stmt), arg0)); > >> + stmt = build_array_ref (loc, stmt, arg2); > >> + stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0); > >> stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl); > >> return stmt; > >> } > > You should make a copy of the vector, not modify the original one in > > place? (If I read that correctly, heh.) Looks good otherwise. > > > > Segher, there is already existed code to make a copy of vector as we > discussed offline. Thanks for the reminder. > > cat pr79251.c.006t.gimple > __attribute__((noinline)) > test (__vector signed int v, int i, size_t n) > { > __vector signed int D.3192; > __vector signed int D.3194; > __vector signed int v1; > v1 = v; > D.3192 = v1; > _1 = n & 3; > VIEW_CONVERT_EXPR(D.3192)[_1] = i; > v1 = D.3192; > D.3194 = v1; > return D.3194; > } > > Attached the v2 patch which does: > 1) Build VIEW_CONVERT_EXPR for vec_insert (i, v, n) like v[n%4] = i to > unify the gimple code, then expander could use vec_set_optab to expand. 
> 2) Recognize the pattern in expander and use optabs to expand > VIEW_CONVERT_EXPR to vec_insert with variable index to fast instructions: > lvsl+xxperm+xxsel. Looking at the RTL expander side I see several issues. @@ -5237,6 +5288,16 @@ expand_assignment (tree to, tree from, bool nontemporal) to_rtx = expand_expr (tem, NULL_RTX, VOIDmode, EXPAND_WRITE); + if (TREE_CODE (to) == ARRAY_REF) + { + tree op0 = TREE_OPERAND (to, 0); + if (TREE_CODE (op0) == VIEW_CONVERT_EXPR + && expand_view_convert_to_vec_set (to, from, to_rtx)) + { + pop_temp_slots (); + return; + } + } you're placing this at an awkward spot IMHO, after to_rtx expansion but disregading parts of it and compensating just with 'to' matching. Is the pieces (offset, bitpos) really too awkward to work with for matching? Because as written you'll miscompile struct X { _vector signed int v; _vector singed int u; } x; test(int i, int a) { x.u[i] = a; } as I think you'll end up assigning to x.v. Are we just interested in the case were we store to a pseudo or also when the destination is memory? I guess only when it's a pseudo - correct? In that case handling this all in optimize_bitfield_assignment_op is probably the best thing to try. Note we possibly refrain from assigning a pseudo to such vector because we see a variable array-ref to it. Richard. > Thanks, > Xionghu
[PATCH v2] rs6000: Expand vec_insert in expander instead of gimple [PR79251]
Hi,

On 2020/9/4 18:23, Segher Boessenkool wrote:
>> diff --git a/gcc/config/rs6000/rs6000-c.c b/gcc/config/rs6000/rs6000-c.c
>> index 03b00738a5e..00c65311f76 100644
>> --- a/gcc/config/rs6000/rs6000-c.c
>> +++ b/gcc/config/rs6000/rs6000-c.c
>>    /* Build *(((arg1_inner_type*)&(vector type){arg1})+arg2) = arg0.  */
>> @@ -1654,15 +1656,8 @@ altivec_resolve_overloaded_builtin (location_t loc, tree fndecl,
>>        SET_EXPR_LOCATION (stmt, loc);
>>        stmt = build1 (COMPOUND_LITERAL_EXPR, arg1_type, stmt);
>>      }
>> -
>> -  innerptrtype = build_pointer_type (arg1_inner_type);
>> -
>> -  stmt = build_unary_op (loc, ADDR_EXPR, stmt, 0);
>> -  stmt = convert (innerptrtype, stmt);
>> -  stmt = build_binary_op (loc, PLUS_EXPR, stmt, arg2, 1);
>> -  stmt = build_indirect_ref (loc, stmt, RO_NULL);
>> -  stmt = build2 (MODIFY_EXPR, TREE_TYPE (stmt), stmt,
>> -            convert (TREE_TYPE (stmt), arg0));
>> +  stmt = build_array_ref (loc, stmt, arg2);
>> +  stmt = fold_build2 (MODIFY_EXPR, TREE_TYPE (arg0), stmt, arg0);
>>    stmt = build2 (COMPOUND_EXPR, arg1_type, stmt, decl);
>>    return stmt;
>>  }
>
> You should make a copy of the vector, not modify the original one in
> place?  (If I read that correctly, heh.)  Looks good otherwise.

Segher, there is already existing code to make a copy of the vector, as we
discussed offline.  Thanks for the reminder.

cat pr79251.c.006t.gimple
__attribute__((noinline))
test (__vector signed int v, int i, size_t n)
{
  __vector signed int D.3192;
  __vector signed int D.3194;
  __vector signed int v1;

  v1 = v;
  D.3192 = v1;
  _1 = n & 3;
  VIEW_CONVERT_EXPR(D.3192)[_1] = i;
  v1 = D.3192;
  D.3194 = v1;
  return D.3194;
}

Attached the v2 patch, which does:
1) Build VIEW_CONVERT_EXPR for vec_insert (i, v, n) like v[n % 4] = i to
unify the gimple code; the expander can then use vec_set_optab to expand.
2) Recognize the pattern in the expander and use optabs to expand
VIEW_CONVERT_EXPR to vec_insert with variable index, using the fast
instruction sequence lvsl+xxperm+xxsel.
Thanks,
Xionghu

From 9a7a47b086579e26ae13f378732cea067b64a4e6 Mon Sep 17 00:00:00 2001
From: Xiong Hu Luo
Date: Wed, 19 Aug 2020 03:54:17 -0500
Subject: [PATCH v2] rs6000: Expand vec_insert in expander instead of gimple
 [PR79251]

vec_insert accepts 3 arguments: arg0 is the input vector, arg1 is the value
to be inserted, and arg2 is the position at which to insert arg1 into arg0.
The current expander generates stxv+stwx+lxv if arg2 is a variable instead
of a constant, which causes a serious store-hit-load performance issue on
Power.

This patch tries to:
1) Build VIEW_CONVERT_EXPR for vec_insert (i, v, n) like v[n % 4] = i to
unify the gimple code; the expander can then use vec_set_optab to expand.
2) Recognize the pattern in the expander and use optabs to expand
VIEW_CONVERT_EXPR to vec_insert with variable index, using the fast
instruction sequence lvsl+xxperm+xxsel.

This way, "vec_insert (i, v, n)" and "v[n % 4] = i" won't be expanded too
early in the gimple stage if arg2 is a variable, avoiding the generation of
store-hit-load instructions.

For Power9 V4SI:

	addi 9,1,-16
	rldic 6,6,2,60
	stxv 34,-16(1)
	stwx 5,9,6
	lxv 34,-16(1)

=>

	addis 9,2,.LC0@toc@ha
	addi 9,9,.LC0@toc@l
	mtvsrwz 33,5
	lxv 32,0(9)
	sradi 9,6,2
	addze 9,9
	sldi 9,9,2
	subf 9,9,6
	subfic 9,9,3
	sldi 9,9,2
	subfic 9,9,20
	lvsl 13,0,9
	xxperm 33,33,45
	xxperm 32,32,45
	xxsel 34,34,33,32

Though the instruction count increases from 5 to 15, performance improves
by 60% in typical cases.

Tested with V2DI, V2DF, V4SI, V4SF, V8HI, V16QI on Power9-LE and
Power8-BE; bootstrap test passed.

gcc/ChangeLog:

2020-09-07  Xionghu Luo

	* config/rs6000/altivec.md (altivec_lvsl_reg_2): Rename to
	(altivec_lvsl_reg_2) and extend to SDI mode.
	* config/rs6000/rs6000-c.c (altivec_resolve_overloaded_builtin):
	Adjust variable index vec_insert to VIEW_CONVERT_EXPR.
	* config/rs6000/rs6000-protos.h (rs6000_expand_vector_set_var):
	New declaration.
	* config/rs6000/rs6000.c (rs6000_expand_vector_set_var): New
	function.
	* config/rs6000/rs6000.md (FQHS): New mode iterator.
	(FD): New mode iterator.
	(p8_mtvsrwz_v16qi2): New define_insn.
	(p8_mtvsrd_v16qi2): New define_insn.
	* config/rs6000/vector.md: Add register operand2 match for
	vec_set index.
	* config/rs6000/vsx.md: Call gen_altivec_lvsl_reg_di2.
	* expr.c (expand_view_convert_to_vec_set): New function.
	(expand_assignment): Call expand_view_convert_to_vec_set.

gcc/testsuite/ChangeLog:

2020-09-07  Xionghu Luo

	* gcc.target/powerpc/pr79251.c: New test.
	* gcc.target/powerpc/pr79251-run.c: New test.
	* gcc.target/powerpc/pr79251.h: New header.
---
 gcc/config/rs6000/altivec.md  |  4 +-
 gcc/config/rs6000/rs6000-c.c  | 22 ++-
 gcc/config/rs6000/rs6000-prot