Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
Hi, Richard. I tried hard in RISC-V backend. I found to fix the case with -march=rv64gcv_zvl4096b can not be without vec_to_scalar count. Is there an approach that we can count vec_to_scalar cost without this piece code in middle-end ? /* ??? Enable for loop costing as well. */ if (!loop_vinfo) record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE, 0, vect_epilogue); Since it's stage 4, I guess we can't change this now. juzhe.zh...@rivai.ai From: Richard Biener Date: 2024-01-11 17:57 To: Robin Dapp CC: juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp wrote: > > On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote: > > Oh. I see I think I have done wrong here. > > > > I should adjust cost for VEC_EXTRACT not VEC_SET. > > > > But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec > > cost in vect.dump. > > The slidedown/vmv.x.s part is of course vec_extract but we indeed > don't seem to cost it as vec_to_scalar here. It looks like a vectorized live operation as it's not in the loop body (and thus really irrelevant for costing in practice). This has /* ??? Enable for loop costing as well. */ if (!loop_vinfo) record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE, 0, vect_epilogue); so live ops are not costed at all. I would suggest to try unconditionally enabling this? > vmv.vx correspond to scalar_to_vec and I'd say 3 seems a > bit high when a regular vector instruction is "1". > It should rather be dependent on the latency between register > files. We can't really say in general but I'd say "2" is not so bad. > > I would suggest adding special handling in builtin_vectorization_cost > like: > > /* Add register-register latency. */ > case scalar_to_vec: > return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...) > > and adjust register_move_cost accordingly. Instead of using > register_move_cost we could also use a cost structure directly. > (E.g. like aarch64's regmove tuning structures. Those don't > contain VRs but for us it could make sense to add them). > > > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize > > -fdump-tree-vect-details" } */ > With a cost of "3" we still vectorize for zvl512b and larger. > Is that intended? I don't really see why 512 should vectorized > but 256 not. Disregarding that everything should be optimized > away, 2 iterations for the whole loop with 256 bits doesn't > seem that bad. > > Regards > Robin >
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> vect_patt_108.14_273; _12 = 32872 >> _22; vect_patt_111.16_276 = VIEW_CONVERT_EXPR(vect_patt_109.15_275); b_7 = (short int) _12; _4 = a.4_25 + 1; ivtmp_20 = ivtmp_30 - 1; ivtmp_280 = ivtmp_279 + 1; if (ivtmp_280 < 1) goto ; [0.00%] else goto ; [100.00%] [local count: 118111600]: # b_5 = PHI a = 19; _14 = b_5 != 0; _15 = (int) _14; return _15; [local count: 4]: goto ; [100.00%] [local count: 118111601]: # a.4_266 = PHI <_4(8)> # ivtmp_267 = PHI # vect_patt_111.16_277 = PHI _278 = BIT_FIELD_REF ; b_264 = _278; [local count: 477576208]: # a.4_256 = PHI <17(18), _262(20)> # ivtmp_258 = PHI <2(18), ivtmp_263(20)> _259 = (int) a.4_256; _260 = 32872 >> _259; b_261 = (short int) _260; _262 = a.4_256 + 1; ivtmp_263 = ivtmp_258 - 1; if (ivtmp_263 != 0) goto ; [75.27%] else goto ; [24.73%] [local count: 359464610]: goto ; [100.00%] } Final ASM: main: lui a5,%hi(a) li a4,19 sb a4,%lo(a)(a5) li a0,0 ret juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 20:56 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > 32872 spends 2 scalar instructions + 1 scalar_to_vec cost: > > lia4,-32768 > addiwa4,a4,104 > vmv.v.xv16,a4 > > It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but > failed on -march=rv64gcv_zvl4096b. The scalar version also needs both instructions: li a0,32768 addiw a0,a0,104 Therefore I don't think we should just add them to the vectorization costs. That would only be necessary if we needed to synthesize a different constant (e.g. if a scalar constant cannot be used directly in a vector setting). Currently, scalar_outside_cost = 0 so we don't model it on the scalar side either. With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b: a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue (unsigned short) a.4_25 1 times vector_stmt costs 1 in body MIN_EXPR 1 times scalar_to_vec costs 2 in prologue MIN_EXPR 1 times vector_stmt costs 1 in body 32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue 32872 >> patt_26 1 times vector_stmt costs 1 in body 1 times scalar_stmt costs 1 in prologue 1 times scalar_stmt costs 1 in body Vector inside of loop cost: 5 Scalar iteration cost: 1 (shouldn't that be 2? but anyway) So one vector iteration costs 5 right now regardless of scalar_to_vec because there are 5 vector operations (phi, promote, min, shift, vsetvl/len adjustment). The scalar_to_vec costs are added to the prologue because it is assumed that broadcasts are hoisted out of the loop. Then vectorization is supposed to be profitable if #iterations = 18 > (body_cost * min_iters) + vector_outside_cost - scalar_outside_cost + 1 = 15. If we really don't want to vectorize, then we can either further increase the prologue cost or the body itself. The body statements are all vector_stmts, though. For the prologue we need a good argument why to increase scalar_to_vec to beyond, say 2. > Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be > more costly than normal vadd.vv since it is transferring data between > different pipeline/register class. We want it to be more expensive, yes. In one of the last messages I explained how I would model it using either register_move_cost or using (tune-specific) costs directly. I would start with scalar_to_vec = 1 and add 1 or 2 depending on the uarch/tune-specific reg-move costs. Regards Robin
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost: > > lia4,-32768 > addiwa4,a4,104 > vmv.v.xv16,a4 > > It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but > failed on -march=rv64gcv_zvl4096b. The scalar version also needs both instructions: li a0,32768 addiw a0,a0,104 Therefore I don't think we should just add them to the vectorization costs. That would only be necessary if we needed to synthesize a different constant (e.g. if a scalar constant cannot be used directly in a vector setting). Currently, scalar_outside_cost = 0 so we don't model it on the scalar side either. With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b: a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue (unsigned short) a.4_25 1 times vector_stmt costs 1 in body MIN_EXPR 1 times scalar_to_vec costs 2 in prologue MIN_EXPR 1 times vector_stmt costs 1 in body 32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue 32872 >> patt_26 1 times vector_stmt costs 1 in body 1 times scalar_stmt costs 1 in prologue 1 times scalar_stmt costs 1 in body Vector inside of loop cost: 5 Scalar iteration cost: 1 (shouldn't that be 2? but anyway) So one vector iteration costs 5 right now regardless of scalar_to_vec because there are 5 vector operations (phi, promote, min, shift, vsetvl/len adjustment). The scalar_to_vec costs are added to the prologue because it is assumed that broadcasts are hoisted out of the loop. Then vectorization is supposed to be profitable if #iterations = 18 > (body_cost * min_iters) + vector_outside_cost - scalar_outside_cost + 1 = 15. If we really don't want to vectorize, then we can either further increase the prologue cost or the body itself. The body statements are all vector_stmts, though. For the prologue we need a good argument why to increase scalar_to_vec to beyond, say 2. > Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be > more costly than normal vadd.vv since it is transferring data between > different pipeline/register class. We want it to be more expensive, yes. In one of the last messages I explained how I would model it using either register_move_cost or using (tune-specific) costs directly. I would start with scalar_to_vec = 1 and add 1 or 2 depending on the uarch/tune-specific reg-move costs. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
I see after this accurate cost adjustment, it is still vectorized but different vect dump: [local count: 118111602]: # a.4_25 = PHI <1(2), _4(11)> # ivtmp_30 = PHI <18(2), ivtmp_20(11)> # vect_vec_iv_.12_149 = PHI <{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }(2), _150(11)> # ivtmp_159 = PHI <0(2), ivtmp_160(11)> _150 = vect_vec_iv_.12_149 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 }; vect_patt_46.13_151 = (vector(16) unsigned short) vect_vec_iv_.12_149; _22 = (int) a.4_25; vect_patt_48.14_153 = MIN_EXPR ; vect_patt_49.15_155 = { 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> vect_patt_48.14_153; _12 = 32872 >> _22; vect_patt_51.16_156 = VIEW_CONVERT_EXPR(vect_patt_49.15_155); b_7 = (short int) _12; _4 = a.4_25 + 1; ivtmp_20 = ivtmp_30 - 1; ivtmp_160 = ivtmp_159 + 1; if (ivtmp_160 < 1) goto ; [0.00%] else goto ; [100.00%] [local count: 118111600]: # b_5 = PHI a = 19; _14 = b_5 != 0; _15 = (int) _14; return _15; [local count: 4]: goto ; [100.00%] [local count: 118111601]: # a.4_146 = PHI <_4(8)> # ivtmp_147 = PHI # vect_patt_51.16_157 = PHI _158 = BIT_FIELD_REF ; b_144 = _158; Purely VLS vectorized codes, then the later CSE pass is able to optimize it into the simply scalar codes. Wheras it involves some VLA vectorized codes which make the later pass failed to CSE it... juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 19:15 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > I think we shouldn't vectorize it with any vlen, since the non-vectorized > codegen is much better. > And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it. > -zvl65536b, RVV Clang also doesn't vectorize it. Of course I agree that optimizing everything to return 0 is what should happen (tree-ssa-dom or vrp do that). Unfortunately they don't anymore after vectorizing the loop. My point is cost comparison only has the scalar loop to compare against which is: li a5,1 li a3,19 .L2: mv a4,a5 addiw a5,a5,1 bne a5,a3,.L2 That's effectively 2 * 18 instructions and more than what we get when vectorizing - therefore it's kind totally outrageous to vectorize here and we need to make sure not to go overboard with costing just for this example. How does aarch64's cost comparison look like? What's, comparatively, more expensive with their tuning? I've seen scalar_to_vec = 4 and vec_to_scalar = 4 but a regular operation is 2 already. This would equal scalar_to_vec = 2 for us (and is not sufficient) so something else must come into play still. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
Ok. but with this scalar_to_vec set to 2: diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc index df9799d9c5e..a14fb36817a 100644 --- a/gcc/config/riscv/riscv.cc +++ b/gcc/config/riscv/riscv.cc @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = { 1, /* gather_load_cost */ 1, /* scatter_store_cost */ 1, /* vec_to_scalar_cost */ - 1, /* scalar_to_vec_cost */ + 2, /* scalar_to_vec_cost */ 1, /* permute_cost */ 1, /* align_load_cost */ 1, /* align_store_cost */ @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = { 1, /* gather_load_cost */ 1, /* scatter_store_cost */ 1, /* vec_to_scalar_cost */ -1, /* scalar_to_vec_cost */ +2, /* scalar_to_vec_cost */ 1, /* permute_cost */ 1, /* align_load_cost */ 1, /* align_store_cost */ Then all zvl*b can be fixed for this test. Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be more costly than normal vadd.vv since it is transferring data between different pipeline/register class. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 19:15 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > I think we shouldn't vectorize it with any vlen, since the non-vectorized > codegen is much better. > And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it. > -zvl65536b, RVV Clang also doesn't vectorize it. Of course I agree that optimizing everything to return 0 is what should happen (tree-ssa-dom or vrp do that). Unfortunately they don't anymore after vectorizing the loop. My point is cost comparison only has the scalar loop to compare against which is: li a5,1 li a3,19 .L2: mv a4,a5 addiw a5,a5,1 bne a5,a3,.L2 That's effectively 2 * 18 instructions and more than what we get when vectorizing - therefore it's kind totally outrageous to vectorize here and we need to make sure not to go overboard with costing just for this example. How does aarch64's cost comparison look like? What's, comparatively, more expensive with their tuning? I've seen scalar_to_vec = 4 and vec_to_scalar = 4 but a regular operation is 2 already. This would equal scalar_to_vec = 2 for us (and is not sufficient) so something else must come into play still. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
Hi, Robin. I model scalar value initialization accurately with following patch: +/* Adjust vectorization cost after calling + targetm.vectorize.builtin_vectorization_cost. For some statement, we would + like to further fine-grain tweak the cost on top of + targetm.vectorize.builtin_vectorization_cost handling which doesn't have any + information on statement operation codes etc. */ + +static unsigned +adjust_stmt_cost (class vec_info *vinfo, enum vect_cost_for_stmt kind, + struct _stmt_vec_info *stmt_info, tree vectype, int count, + int stmt_cost) +{ + gimple *stmt = stmt_info->stmt; + machine_mode smode = TYPE_MODE (TREE_TYPE (vectype)); + switch (kind) +{ + case scalar_to_vec: { + stmt_cost *= count; + /* Adjust cost by counting the scalar value initialization. */ + for (int i = 0; i < count; i++) + { + tree arg = gimple_arg (stmt, i); + if (poly_int_tree_p (arg)) + { + poly_int64 value = tree_to_poly_int64 (arg); + int scalar_cost + = riscv_const_insns (gen_int_mode (value, smode)); + stmt_cost += scalar_cost; + } + else + stmt_cost += 1; + } + return stmt_cost; + } +default: + break; +} + return count * stmt_cost; +} + unsigned costs::add_stmt_cost (int count, vect_cost_for_stmt kind, stmt_vec_info stmt_info, slp_tree, tree vectype, @@ -1082,9 +1122,13 @@ costs::add_stmt_cost (int count, vect_cost_for_stmt kind, as one iteration of the VLA loop. */ if (where == vect_body && m_unrolled_vls_niters) m_unrolled_vls_stmts += count * m_unrolled_vls_niters; + + if (vectype) + stmt_cost = adjust_stmt_cost (m_vinfo, kind, stmt_info, vectype, count, + stmt_cost); } - return record_stmt_cost (stmt_info, where, count * stmt_cost); + return record_stmt_cost (stmt_info, where, stmt_cost); } 32872 >> patt_126 1 times scalar_to_vec costs 3 in prologue 32872 spends 2 scalar instructions + 1 scalar_to_vec cost: li a4,-32768 addiw a4,a4,104 vmv.v.x v16,a4 It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but failed on -march=rv64gcv_zvl4096b. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 19:15 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > I think we shouldn't vectorize it with any vlen, since the non-vectorized > codegen is much better. > And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it. > -zvl65536b, RVV Clang also doesn't vectorize it. Of course I agree that optimizing everything to return 0 is what should happen (tree-ssa-dom or vrp do that). Unfortunately they don't anymore after vectorizing the loop. My point is cost comparison only has the scalar loop to compare against which is: li a5,1 li a3,19 .L2: mv a4,a5 addiw a5,a5,1 bne a5,a3,.L2 That's effectively 2 * 18 instructions and more than what we get when vectorizing - therefore it's kind totally outrageous to vectorize here and we need to make sure not to go overboard with costing just for this example. How does aarch64's cost comparison look like? What's, comparatively, more expensive with their tuning? I've seen scalar_to_vec = 4 and vec_to_scalar = 4 but a regular operation is 2 already. This would equal scalar_to_vec = 2 for us (and is not sufficient) so something else must come into play still. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
Oh, Sorry. I made a mistake. It's -mrvv-vector-bits=2048 RVV clang doesn't vectorize it. But ARM SVE GCC doesn't fix the issue if we specify -msve-vector-bits=2048, it will vectorize it. https://godbolt.org/z/x7s8Kz87a I guess LLVM has some magic in their cost model which can work well. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 19:15 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > I think we shouldn't vectorize it with any vlen, since the non-vectorized > codegen is much better. > And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it. > -zvl65536b, RVV Clang also doesn't vectorize it. Of course I agree that optimizing everything to return 0 is what should happen (tree-ssa-dom or vrp do that). Unfortunately they don't anymore after vectorizing the loop. My point is cost comparison only has the scalar loop to compare against which is: li a5,1 li a3,19 .L2: mv a4,a5 addiw a5,a5,1 bne a5,a3,.L2 That's effectively 2 * 18 instructions and more than what we get when vectorizing - therefore it's kind totally outrageous to vectorize here and we need to make sure not to go overboard with costing just for this example. How does aarch64's cost comparison look like? What's, comparatively, more expensive with their tuning? I've seen scalar_to_vec = 4 and vec_to_scalar = 4 but a regular operation is 2 already. This would equal scalar_to_vec = 2 for us (and is not sufficient) so something else must come into play still. Regards Robin
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> I think we shouldn't vectorize it with any vlen, since the non-vectorized > codegen is much better. > And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it. > -zvl65536b, RVV Clang also doesn't vectorize it. Of course I agree that optimizing everything to return 0 is what should happen (tree-ssa-dom or vrp do that). Unfortunately they don't anymore after vectorizing the loop. My point is cost comparison only has the scalar loop to compare against which is: li a5,1 li a3,19 .L2: mv a4,a5 addiw a5,a5,1 bne a5,a3,.L2 That's effectively 2 * 18 instructions and more than what we get when vectorizing - therefore it's kind totally outrageous to vectorize here and we need to make sure not to go overboard with costing just for this example. How does aarch64's cost comparison look like? What's, comparatively, more expensive with their tuning? I've seen scalar_to_vec = 4 and vec_to_scalar = 4 but a regular operation is 2 already. This would equal scalar_to_vec = 2 for us (and is not sufficient) so something else must come into play still. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
>> (My question whether why we shouldn't vectorize this at 256b >> and above still stands, though) I think we shouldn't vectorize it with any vlen, since the non-vectorized codegen is much better. And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it. -zvl65536b, RVV Clang also doesn't vectorize it. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 18:40 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 On 1/11/24 11:20, juzhe.zh...@rivai.ai wrote: > Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the > loop we have these 2 scalar_to_vec: > > 1. MIN_EXPR 1 times scalar_to_vec costs 1 in prologue > >This scalar_to_vec cost should be 0 or 1 since it only generate single > instructions: vmv.v.iv16,15 > > 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue > >This cost should be higher since it cost 3 instructions: > lia4,-32768 > addiwa4,a4,104 > vmv.v.xv16,a4 > > Am I correct ? > > I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be > good. That would be the general idea, yes. As Richard mentioned, it doesn't always work well but for this case here it could help a bit. (My question whether why we shouldn't vectorize this at 256b and above still stands, though) As mentioned before, the other thing that needs to be considered is register-move costs (or the respective cost structure). On some uarchs the vmv.v.f might be more expensive than vmv.v.x and so on - in addition to the instructions needed to synthesize the constant. Regards Robin
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On 1/11/24 11:20, juzhe.zh...@rivai.ai wrote: > Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the > loop we have these 2 scalar_to_vec: > > 1. MIN_EXPR 1 times scalar_to_vec costs 1 in prologue > > This scalar_to_vec cost should be 0 or 1 since it only generate single > instructions: vmv.v.iv16,15 > > 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue > > This cost should be higher since it cost 3 instructions: > lia4,-32768 > addiwa4,a4,104 > vmv.v.xv16,a4 > > Am I correct ? > > I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be > good. That would be the general idea, yes. As Richard mentioned, it doesn't always work well but for this case here it could help a bit. (My question whether why we shouldn't vectorize this at 256b and above still stands, though) As mentioned before, the other thing that needs to be considered is register-move costs (or the respective cost structure). On some uarchs the vmv.v.f might be more expensive than vmv.v.x and so on - in addition to the instructions needed to synthesize the constant. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 11:18 AM Richard Biener wrote: > > On Thu, Jan 11, 2024 at 11:20 AM juzhe.zh...@rivai.ai > wrote: > > > > Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside > > the loop we have these 2 scalar_to_vec: > > > > 1. MIN_EXPR 1 times scalar_to_vec costs 1 in prologue > > > >This scalar_to_vec cost should be 0 or 1 since it only generate single > > instructions: vmv.v.i v16,15 > > > > 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue > > > >This cost should be higher since it cost 3 instructions: > > li a4,-32768 > > addiw a4,a4,104 > > vmv.v.x v16,a4 > > > > Am I correct ? > > > > I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be > > good. > > The cost model interface isn't set up very well to cost different constants > differently. For scalar_to_vec we should get the backend the actual scalar > to duplicate but we don't. It's similar for permutes btw. My plan was always to defer changes to when we only have SLP since there all invariants are represented by SLP nodes which we can hand down. > > > > juzhe.zh...@rivai.ai > > > > > > From: Robin Dapp > > Date: 2024-01-11 18:14 > > To: juzhe.zh...@rivai.ai; Richard Biener > > CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw > > Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > > > Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN > > > size, > > > that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8... > > > > > > I am confused now how to fix this case. > > > > 4 is definitely too high compared to a regular instruction. > > vmv.vx could even be zero-cost for constants. > > > > To catch constants we could add handling in add_stmt_cost, inspecting > > the stmt directly. > > > > Regards > > Robin > >
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 11:20 AM juzhe.zh...@rivai.ai wrote: > > Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the > loop we have these 2 scalar_to_vec: > > 1. MIN_EXPR 1 times scalar_to_vec costs 1 in prologue > >This scalar_to_vec cost should be 0 or 1 since it only generate single > instructions: vmv.v.i v16,15 > > 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue > >This cost should be higher since it cost 3 instructions: > li a4,-32768 > addiw a4,a4,104 > vmv.v.x v16,a4 > > Am I correct ? > > I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be > good. The cost model interface isn't set up very well to cost different constants differently. For scalar_to_vec we should get the backend the actual scalar to duplicate but we don't. > > juzhe.zh...@rivai.ai > > > From: Robin Dapp > Date: 2024-01-11 18:14 > To: juzhe.zh...@rivai.ai; Richard Biener > CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw > Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > > Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN > > size, > > that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8... > > > > I am confused now how to fix this case. > > 4 is definitely too high compared to a regular instruction. > vmv.vx could even be zero-cost for constants. > > To catch constants we could add handling in add_stmt_cost, inspecting > the stmt directly. > > Regards > Robin >
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the loop we have these 2 scalar_to_vec: 1. MIN_EXPR 1 times scalar_to_vec costs 1 in prologue This scalar_to_vec cost should be 0 or 1 since it only generate single instructions: vmv.v.i v16,15 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue This cost should be higher since it cost 3 instructions: li a4,-32768 addiw a4,a4,104 vmv.v.x v16,a4 Am I correct ? I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be good. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 18:14 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 > Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN > size, > that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8... > > I am confused now how to fix this case. 4 is definitely too high compared to a regular instruction. vmv.vx could even be zero-cost for constants. To catch constants we could add handling in add_stmt_cost, inspecting the stmt directly. Regards Robin
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN > size, > that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8... > > I am confused now how to fix this case. 4 is definitely too high compared to a regular instruction. vmv.vx could even be zero-cost for constants. To catch constants we could add handling in add_stmt_cost, inspecting the stmt directly. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
And also I have investigate LLVM cost model. They don't cost vsevli in vectorization cost model. But their cost model does a good job... juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 18:09 To: Richard Biener CC: rdapp.gcc; juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 >> The slidedown/vmv.x.s part is of course vec_extract but we indeed >> don't seem to cost it as vec_to_scalar here. > > It looks like a vectorized live operation as it's not in the loop body > (and thus really irrelevant for costing in practice). This has > > /* ??? Enable for loop costing as well. */ > if (!loop_vinfo) > record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE, > 0, vect_epilogue); > > so live ops are not costed at all. I would suggest to try unconditionally > enabling this? > IMHO this example is not really ideal to start with anyway so we should maybe try another one first. I'd still argue that one or two iterations vs. potentially 16+ scalar ones is not necessarily bad. That said, we also don't really cost all our vsetvls yet (difficult...). Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
>> That said, we also don't really cost all our vsetvls yet (difficult...). If cost vsetvl, we will need to cost 1 more for each STMT. However, it is not accurate. Since our VSETVL PASS will eliminate redundancy... juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 18:09 To: Richard Biener CC: rdapp.gcc; juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 >> The slidedown/vmv.x.s part is of course vec_extract but we indeed >> don't seem to cost it as vec_to_scalar here. > > It looks like a vectorized live operation as it's not in the loop body > (and thus really irrelevant for costing in practice). This has > > /* ??? Enable for loop costing as well. */ > if (!loop_vinfo) > record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE, > 0, vect_epilogue); > > so live ops are not costed at all. I would suggest to try unconditionally > enabling this? > IMHO this example is not really ideal to start with anyway so we should maybe try another one first. I'd still argue that one or two iterations vs. potentially 16+ scalar ones is not necessarily bad. That said, we also don't really cost all our vsetvls yet (difficult...). Regards Robin
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
>> The slidedown/vmv.x.s part is of course vec_extract but we indeed >> don't seem to cost it as vec_to_scalar here. > > It looks like a vectorized live operation as it's not in the loop body > (and thus really irrelevant for costing in practice). This has > > /* ??? Enable for loop costing as well. */ > if (!loop_vinfo) > record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE, > 0, vect_epilogue); > > so live ops are not costed at all. I would suggest to try unconditionally > enabling this? > IMHO this example is not really ideal to start with anyway so we should maybe try another one first. I'd still argue that one or two iterations vs. potentially 16+ scalar ones is not necessarily bad. That said, we also don't really cost all our vsetvls yet (difficult...). Regards Robin
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp wrote: > > On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote: > > Oh. I see I think I have done wrong here. > > > > I should adjust cost for VEC_EXTRACT not VEC_SET. > > > > But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec > > cost in vect.dump. > > The slidedown/vmv.x.s part is of course vec_extract but we indeed > don't seem to cost it as vec_to_scalar here. It looks like a vectorized live operation as it's not in the loop body (and thus really irrelevant for costing in practice). This has /* ??? Enable for loop costing as well. */ if (!loop_vinfo) record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE, 0, vect_epilogue); so live ops are not costed at all. I would suggest to try unconditionally enabling this? > vmv.vx correspond to scalar_to_vec and I'd say 3 seems a > bit high when a regular vector instruction is "1". > It should rather be dependent on the latency between register > files. We can't really say in general but I'd say "2" is not so bad. > > I would suggest adding special handling in builtin_vectorization_cost > like: > > /* Add register-register latency. */ > case scalar_to_vec: > return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...) > > and adjust register_move_cost accordingly. Instead of using > register_move_cost we could also use a cost structure directly. > (E.g. like aarch64's regmove tuning structures. Those don't > contain VRs but for us it could make sense to add them). > > > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize > > -fdump-tree-vect-details" } */ > With a cost of "3" we still vectorize for zvl512b and larger. > Is that intended? I don't really see why 512 should vectorized > but 256 not. Disregarding that everything should be optimized > away, 2 iterations for the whole loop with 256 bits doesn't > seem that bad. > > Regards > Robin >
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
>> With a cost of "3" we still vectorize for zvl512b and larger. >>Is that intended? I don't really see why 512 should vectorized >>but 256 not. Disregarding that everything should be optimized >>away, 2 iterations for the whole loop with 256 bits doesn't >>seem that bad. Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN size, that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8... I am confused now how to fix this case. juzhe.zh...@rivai.ai From: Robin Dapp Date: 2024-01-11 17:52 To: juzhe.zh...@rivai.ai; Richard Biener CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote: > Oh. I see I think I have done wrong here. > > I should adjust cost for VEC_EXTRACT not VEC_SET. > > But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec > cost in vect.dump. The slidedown/vmv.x.s part is of course vec_extract but we indeed don't seem to cost it as vec_to_scalar here. vmv.vx correspond to scalar_to_vec and I'd say 3 seems a bit high when a regular vector instruction is "1". It should rather be dependent on the latency between register files. We can't really say in general but I'd say "2" is not so bad. I would suggest adding special handling in builtin_vectorization_cost like: /* Add register-register latency. */ case scalar_to_vec: return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...) and adjust register_move_cost accordingly. Instead of using register_move_cost we could also use a cost structure directly. (E.g. like aarch64's regmove tuning structures. Those don't contain VRs but for us it could make sense to add them). > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize > -fdump-tree-vect-details" } */ With a cost of "3" we still vectorize for zvl512b and larger. Is that intended? I don't really see why 512 should vectorized but 256 not. Disregarding that everything should be optimized away, 2 iterations for the whole loop with 256 bits doesn't seem that bad. Regards Robin
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote: > Oh. I see I think I have done wrong here. > > I should adjust cost for VEC_EXTRACT not VEC_SET. > > But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec > cost in vect.dump. The slidedown/vmv.x.s part is of course vec_extract but we indeed don't seem to cost it as vec_to_scalar here. vmv.vx correspond to scalar_to_vec and I'd say 3 seems a bit high when a regular vector instruction is "1". It should rather be dependent on the latency between register files. We can't really say in general but I'd say "2" is not so bad. I would suggest adding special handling in builtin_vectorization_cost like: /* Add register-register latency. */ case scalar_to_vec: return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...) and adjust register_move_cost accordingly. Instead of using register_move_cost we could also use a cost structure directly. (E.g. like aarch64's regmove tuning structures. Those don't contain VRs but for us it could make sense to add them). > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize > -fdump-tree-vect-details" } */ With a cost of "3" we still vectorize for zvl512b and larger. Is that intended? I don't really see why 512 should vectorized but 256 not. Disregarding that everything should be optimized away, 2 iterations for the whole loop with 256 bits doesn't seem that bad. Regards Robin
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
Oh. I see I think I have done wrong here. I should adjust cost for VEC_EXTRACT not VEC_SET. But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec cost in vect.dump. The vect tree: # a.4_25 = PHI <1(2), _4(11)> # ivtmp_30 = PHI <18(2), ivtmp_20(11)> # vect_vec_iv_.10_137 = PHI <{ 1, 2, 3, ... }(2), vect_vec_iv_.10_137(11)> # ivtmp_149 = PHI <0(2), ivtmp_150(11)> # loop_len_146 = PHI <18(2), _155(11)> vect_patt_28.11_139 = (vector([2048,2048]) unsigned short) vect_vec_iv_.10_137; _22 = (int) a.4_25; vect_patt_26.12_141 = MIN_EXPR ; vect_patt_10.13_143 = { 32872, ... } >> vect_patt_26.12_141; _12 = 32872 >> _22; vect_patt_27.14_144 = VIEW_CONVERT_EXPR(vect_patt_10.13_143); b_7 = (short int) _12; _4 = a.4_25 + 1; ivtmp_20 = ivtmp_30 - 1; ivtmp_150 = ivtmp_149 + POLY_INT_CST [2048, 2048]; _153 = MIN_EXPR ; _154 = 18 - _153; _155 = MIN_EXPR <_154, POLY_INT_CST [2048, 2048]>; if (_155 != 0) goto ; [0.00%] else goto ; [100.00%] [local count: 118111600]: # vect_patt_27.14_145 = PHI # loop_len_156 = PHI _147 = loop_len_156 + 18446744073709551615; _148 = .VEC_EXTRACT (vect_patt_27.14_145, _147); b_5 = _148; a = 19; _14 = b_5 != 0; _15 = (int) _14; return _15; The vect dump tree only compute cost include vector_stmt and scalar_to_vec. It seems it didn't consider VEC_EXTRACT cost ? juzhe.zh...@rivai.ai From: Richard Biener Date: 2024-01-11 17:18 To: Juzhe-Zhong CC: gcc-patches; kito.cheng; kito.cheng; jeffreyalaw; rdapp.gcc Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 On Thu, Jan 11, 2024 at 9:24 AM Juzhe-Zhong wrote: > > This patch fixes the following inefficient vectorized codes: > > vsetvli a5,zero,e8,mf2,ta,ma > li a2,17 > vid.v v1 > li a4,-32768 > vsetvli zero,zero,e16,m1,ta,ma > addiw a4,a4,104 > vmv.v.i v3,15 > lui a1,%hi(a) > li a0,19 > vsetvli zero,zero,e8,mf2,ta,ma > vadd.vx v1,v1,a2 > sb a0,%lo(a)(a1) > vsetvli zero,zero,e16,m1,ta,ma > vzext.vf2 v2,v1 > vmv.v.x v1,a4 > vminu.vvv2,v2,v3 > vsrl.vv v1,v1,v2 > vslidedown.vi v1,v1,1 > vmv.x.s a0,v1 > sneza0,a0 > ret > > The reason is scalar_to_vec_cost is too low. > > Consider in VEC_SET, we always have a slide + scalar move instruction, > scalar_to_vec_cost = 1 (current cost) is not reasonable. scalar_to_vec is supposed to model a splat of GPR/FPR to a vector register. We probably want to overhaul the cost classes, esp. vec_to_scalar, but of course not now. > I tried to set it as 2 but failed fix this case, that is, I need to > set it as 3 to fix this case. > > No matter scalar move or slide instruction, I believe they are more costly > than normal vector instructions (e.g. vadd.vv). So set it as 3 looks > reasonable > to me. > > After this patch: > > lui a5,%hi(a) > li a4,19 > sb a4,%lo(a)(a5) > li a0,0 > ret > > Tested on both RV32/RV64 no regression, Ok for trunk ? > > PR target/113281 > > gcc/ChangeLog: > > * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3. > > gcc/testsuite/ChangeLog: > > * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test. > * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test. > > --- > gcc/config/riscv/riscv.cc | 4 ++-- > .../vect/costmodel/riscv/rvv/pr113281-1.c | 18 ++ > .../gcc.target/riscv/rvv/autovec/pr113209.c| 2 +- > 3 files changed, 21 insertions(+), 3 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > > diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc > index df9799d9c5e..bcfb3c15a39 100644 > --- a/gcc/config/riscv/riscv.cc > +++ b/gcc/config/riscv/riscv.cc > @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = { >1, /* gather_load_cost */ >1, /* scatter_store_cost */ >1, /* vec_to_scalar_cost */ > - 1, /* scalar_to_vec_cost */ > + 3, /* scalar_to_vec_cost */ >1, /* permute_cost */ >1, /* align_load_cost */ >1, /* align_store_cost */ > @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = { > 1, /* gather_load_cost */ > 1, /* scatter_store_cost */ > 1, /* vec_to_scalar_cost */ > -1, /* scalar_to_vec_cost */ > +3, /* scalar_to_vec_cost */ > 1, /* permute_cost */ > 1, /* align_load_cost */ > 1, /* align_store_cost */ > diff --git a
Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
Thanks Richard. So you think increase scalar_to_vec cost is not the correct approach to fix this case? Or could you give me a suggestion to fix this case ? Thanks. juzhe.zh...@rivai.ai From: Richard Biener Date: 2024-01-11 17:18 To: Juzhe-Zhong CC: gcc-patches; kito.cheng; kito.cheng; jeffreyalaw; rdapp.gcc Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3 On Thu, Jan 11, 2024 at 9:24 AM Juzhe-Zhong wrote: > > This patch fixes the following inefficient vectorized codes: > > vsetvli a5,zero,e8,mf2,ta,ma > li a2,17 > vid.v v1 > li a4,-32768 > vsetvli zero,zero,e16,m1,ta,ma > addiw a4,a4,104 > vmv.v.i v3,15 > lui a1,%hi(a) > li a0,19 > vsetvli zero,zero,e8,mf2,ta,ma > vadd.vx v1,v1,a2 > sb a0,%lo(a)(a1) > vsetvli zero,zero,e16,m1,ta,ma > vzext.vf2 v2,v1 > vmv.v.x v1,a4 > vminu.vvv2,v2,v3 > vsrl.vv v1,v1,v2 > vslidedown.vi v1,v1,1 > vmv.x.s a0,v1 > sneza0,a0 > ret > > The reason is scalar_to_vec_cost is too low. > > Consider in VEC_SET, we always have a slide + scalar move instruction, > scalar_to_vec_cost = 1 (current cost) is not reasonable. scalar_to_vec is supposed to model a splat of GPR/FPR to a vector register. We probably want to overhaul the cost classes, esp. vec_to_scalar, but of course not now. > I tried to set it as 2 but failed fix this case, that is, I need to > set it as 3 to fix this case. > > No matter scalar move or slide instruction, I believe they are more costly > than normal vector instructions (e.g. vadd.vv). So set it as 3 looks > reasonable > to me. > > After this patch: > > lui a5,%hi(a) > li a4,19 > sb a4,%lo(a)(a5) > li a0,0 > ret > > Tested on both RV32/RV64 no regression, Ok for trunk ? > > PR target/113281 > > gcc/ChangeLog: > > * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3. > > gcc/testsuite/ChangeLog: > > * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test. > * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test. > > --- > gcc/config/riscv/riscv.cc | 4 ++-- > .../vect/costmodel/riscv/rvv/pr113281-1.c | 18 ++ > .../gcc.target/riscv/rvv/autovec/pr113209.c| 2 +- > 3 files changed, 21 insertions(+), 3 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > > diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc > index df9799d9c5e..bcfb3c15a39 100644 > --- a/gcc/config/riscv/riscv.cc > +++ b/gcc/config/riscv/riscv.cc > @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = { >1, /* gather_load_cost */ >1, /* scatter_store_cost */ >1, /* vec_to_scalar_cost */ > - 1, /* scalar_to_vec_cost */ > + 3, /* scalar_to_vec_cost */ >1, /* permute_cost */ >1, /* align_load_cost */ >1, /* align_store_cost */ > @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = { > 1, /* gather_load_cost */ > 1, /* scatter_store_cost */ > 1, /* vec_to_scalar_cost */ > -1, /* scalar_to_vec_cost */ > +3, /* scalar_to_vec_cost */ > 1, /* permute_cost */ > 1, /* align_load_cost */ > 1, /* align_store_cost */ > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > new file mode 100644 > index 000..331cf961a1f > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > @@ -0,0 +1,18 @@ > +/* { dg-do compile } */ > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize > -fdump-tree-vect-details" } */ > + > +unsigned char a; > + > +int main() { > + short b = a = 0; > + for (; a != 19; a++) > +if (a) > + b = 32872 >> a; > + > + if (b == 0) > +return 0; > + else > +return 1; > +} > + > +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */ > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > index 081ee369394..70aae151000 100644 > --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */ > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 > -fno-vect-cost-model" } */ > > int b, c, d, f, i, a; > int e[1] = {0}; > -- > 2.36.3 >
Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 9:24 AM Juzhe-Zhong wrote: > > This patch fixes the following inefficient vectorized codes: > > vsetvli a5,zero,e8,mf2,ta,ma > li a2,17 > vid.v v1 > li a4,-32768 > vsetvli zero,zero,e16,m1,ta,ma > addiw a4,a4,104 > vmv.v.i v3,15 > lui a1,%hi(a) > li a0,19 > vsetvli zero,zero,e8,mf2,ta,ma > vadd.vx v1,v1,a2 > sb a0,%lo(a)(a1) > vsetvli zero,zero,e16,m1,ta,ma > vzext.vf2 v2,v1 > vmv.v.x v1,a4 > vminu.vvv2,v2,v3 > vsrl.vv v1,v1,v2 > vslidedown.vi v1,v1,1 > vmv.x.s a0,v1 > sneza0,a0 > ret > > The reason is scalar_to_vec_cost is too low. > > Consider in VEC_SET, we always have a slide + scalar move instruction, > scalar_to_vec_cost = 1 (current cost) is not reasonable. scalar_to_vec is supposed to model a splat of GPR/FPR to a vector register. We probably want to overhaul the cost classes, esp. vec_to_scalar, but of course not now. > I tried to set it as 2 but failed fix this case, that is, I need to > set it as 3 to fix this case. > > No matter scalar move or slide instruction, I believe they are more costly > than normal vector instructions (e.g. vadd.vv). So set it as 3 looks > reasonable > to me. > > After this patch: > > lui a5,%hi(a) > li a4,19 > sb a4,%lo(a)(a5) > li a0,0 > ret > > Tested on both RV32/RV64 no regression, Ok for trunk ? > > PR target/113281 > > gcc/ChangeLog: > > * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3. > > gcc/testsuite/ChangeLog: > > * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test. > * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test. > > --- > gcc/config/riscv/riscv.cc | 4 ++-- > .../vect/costmodel/riscv/rvv/pr113281-1.c | 18 ++ > .../gcc.target/riscv/rvv/autovec/pr113209.c| 2 +- > 3 files changed, 21 insertions(+), 3 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > > diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc > index df9799d9c5e..bcfb3c15a39 100644 > --- a/gcc/config/riscv/riscv.cc > +++ b/gcc/config/riscv/riscv.cc > @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = { >1, /* gather_load_cost */ >1, /* scatter_store_cost */ >1, /* vec_to_scalar_cost */ > - 1, /* scalar_to_vec_cost */ > + 3, /* scalar_to_vec_cost */ >1, /* permute_cost */ >1, /* align_load_cost */ >1, /* align_store_cost */ > @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = { > 1, /* gather_load_cost */ > 1, /* scatter_store_cost */ > 1, /* vec_to_scalar_cost */ > -1, /* scalar_to_vec_cost */ > +3, /* scalar_to_vec_cost */ > 1, /* permute_cost */ > 1, /* align_load_cost */ > 1, /* align_store_cost */ > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > new file mode 100644 > index 000..331cf961a1f > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c > @@ -0,0 +1,18 @@ > +/* { dg-do compile } */ > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize > -fdump-tree-vect-details" } */ > + > +unsigned char a; > + > +int main() { > + short b = a = 0; > + for (; a != 19; a++) > +if (a) > + b = 32872 >> a; > + > + if (b == 0) > +return 0; > + else > +return 1; > +} > + > +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */ > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > index 081ee369394..70aae151000 100644 > --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */ > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 > -fno-vect-cost-model" } */ > > int b, c, d, f, i, a; > int e[1] = {0}; > -- > 2.36.3 >
[PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
This patch fixes the following inefficient vectorized codes: vsetvli a5,zero,e8,mf2,ta,ma li a2,17 vid.v v1 li a4,-32768 vsetvli zero,zero,e16,m1,ta,ma addiw a4,a4,104 vmv.v.i v3,15 lui a1,%hi(a) li a0,19 vsetvli zero,zero,e8,mf2,ta,ma vadd.vx v1,v1,a2 sb a0,%lo(a)(a1) vsetvli zero,zero,e16,m1,ta,ma vzext.vf2 v2,v1 vmv.v.x v1,a4 vminu.vvv2,v2,v3 vsrl.vv v1,v1,v2 vslidedown.vi v1,v1,1 vmv.x.s a0,v1 sneza0,a0 ret The reason is scalar_to_vec_cost is too low. Consider in VEC_SET, we always have a slide + scalar move instruction, scalar_to_vec_cost = 1 (current cost) is not reasonable. I tried to set it as 2 but failed fix this case, that is, I need to set it as 3 to fix this case. No matter scalar move or slide instruction, I believe they are more costly than normal vector instructions (e.g. vadd.vv). So set it as 3 looks reasonable to me. After this patch: lui a5,%hi(a) li a4,19 sb a4,%lo(a)(a5) li a0,0 ret Tested on both RV32/RV64 no regression, Ok for trunk ? PR target/113281 gcc/ChangeLog: * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test. * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test. --- gcc/config/riscv/riscv.cc | 4 ++-- .../vect/costmodel/riscv/rvv/pr113281-1.c | 18 ++ .../gcc.target/riscv/rvv/autovec/pr113209.c| 2 +- 3 files changed, 21 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc index df9799d9c5e..bcfb3c15a39 100644 --- a/gcc/config/riscv/riscv.cc +++ b/gcc/config/riscv/riscv.cc @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = { 1, /* gather_load_cost */ 1, /* scatter_store_cost */ 1, /* vec_to_scalar_cost */ - 1, /* scalar_to_vec_cost */ + 3, /* scalar_to_vec_cost */ 1, /* permute_cost */ 1, /* align_load_cost */ 1, /* align_store_cost */ @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = { 1, /* gather_load_cost */ 1, /* scatter_store_cost */ 1, /* vec_to_scalar_cost */ -1, /* scalar_to_vec_cost */ +3, /* scalar_to_vec_cost */ 1, /* permute_cost */ 1, /* align_load_cost */ 1, /* align_store_cost */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c new file mode 100644 index 000..331cf961a1f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize -fdump-tree-vect-details" } */ + +unsigned char a; + +int main() { + short b = a = 0; + for (; a != 19; a++) +if (a) + b = 32872 >> a; + + if (b == 0) +return 0; + else +return 1; +} + +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c index 081ee369394..70aae151000 100644 --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */ +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -fno-vect-cost-model" } */ int b, c, d, f, i, a; int e[1] = {0}; -- 2.36.3