Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-12 Thread juzhe.zh...@rivai.ai
Hi, Richard.

I tried hard in RISC-V backend. I found to fix the case with 
-march=rv64gcv_zvl4096b can not be without vec_to_scalar count.

Is there an approach that we can count vec_to_scalar cost without this piece 
code in middle-end ?

  /* ???  Enable for loop costing as well.  */
  if (!loop_vinfo)
record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
  0, vect_epilogue);

Since it's stage 4, I guess we can't change this now.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2024-01-11 17:57
To: Robin Dapp
CC: juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp  wrote:
>
> On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> > Oh. I see I think I have done wrong here.
> >
> > I should adjust cost for VEC_EXTRACT not VEC_SET.
> >
> > But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> > cost in vect.dump.
>
> The slidedown/vmv.x.s part is of course vec_extract but we indeed
> don't seem to cost it as vec_to_scalar here.
 
It looks like a vectorized live operation as it's not in the loop body
(and thus really irrelevant for costing in practice).  This has
 
  /* ???  Enable for loop costing as well.  */
  if (!loop_vinfo)
record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
  0, vect_epilogue);
 
so live ops are not costed at all.  I would suggest to try unconditionally
enabling this?
 
> vmv.vx correspond to scalar_to_vec and I'd say 3 seems a
> bit high when a regular vector instruction is "1".
> It should rather be dependent on the latency between register
> files.  We can't really say in general but I'd say "2" is not so bad.
>
> I would suggest adding special handling in builtin_vectorization_cost
> like:
>
> /* Add register-register latency.  */
> case scalar_to_vec:
>   return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)
>
> and adjust register_move_cost accordingly.  Instead of using
> register_move_cost we could also use a cost structure directly.
> (E.g. like aarch64's regmove tuning structures.  Those don't
> contain VRs but for us it could make sense to add them).
>
> > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> > -fdump-tree-vect-details" } */
> With a cost of "3" we still vectorize for zvl512b and larger.
> Is that intended?  I don't really see why 512 should vectorized
> but 256 not.  Disregarding that everything should be optimized
> away, 2 iterations for the whole loop with 256 bits doesn't
> seem that bad.
>
> Regards
>  Robin
>
 


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
, 32872, 
32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> 
vect_patt_108.14_273;
  _12 = 32872 >> _22;
  vect_patt_111.16_276 = VIEW_CONVERT_EXPR(vect_patt_109.15_275);
  b_7 = (short int) _12;
  _4 = a.4_25 + 1;
  ivtmp_20 = ivtmp_30 - 1;
  ivtmp_280 = ivtmp_279 + 1;
  if (ivtmp_280 < 1)
goto ; [0.00%]
  else
goto ; [100.00%]

   [local count: 118111600]:
  # b_5 = PHI 
  a = 19;
  _14 = b_5 != 0;
  _15 = (int) _14;
  return _15;

   [local count: 4]:
  goto ; [100.00%]

   [local count: 118111601]:
  # a.4_266 = PHI <_4(8)>
  # ivtmp_267 = PHI 
  # vect_patt_111.16_277 = PHI 
  _278 = BIT_FIELD_REF ;
  b_264 = _278;

   [local count: 477576208]:
  # a.4_256 = PHI <17(18), _262(20)>
  # ivtmp_258 = PHI <2(18), ivtmp_263(20)>
  _259 = (int) a.4_256;
  _260 = 32872 >> _259;
  b_261 = (short int) _260;
  _262 = a.4_256 + 1;
  ivtmp_263 = ivtmp_258 - 1;
  if (ivtmp_263 != 0)
goto ; [75.27%]
  else
goto ; [24.73%]

   [local count: 359464610]:
  goto ; [100.00%]

}

Final ASM:

main:
lui a5,%hi(a)
li a4,19
sb a4,%lo(a)(a5)
li a0,0
ret


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 20:56
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost:
> 
> lia4,-32768
> addiwa4,a4,104
> vmv.v.xv16,a4
> 
> It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but 
> failed on -march=rv64gcv_zvl4096b.
The scalar version also needs both instructions:
 
li a0,32768
addiw a0,a0,104
 
Therefore I don't think we should just add them to the
vectorization costs.  That would only be necessary if we needed
to synthesize a different constant (e.g. if a scalar constant
cannot be used directly in a vector setting).
 
Currently, scalar_outside_cost = 0 so we don't model it on the
scalar side either.
 
With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b:
 
a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body
a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue
(unsigned short) a.4_25 1 times vector_stmt costs 1 in body
MIN_EXPR  1 times scalar_to_vec costs 2 in prologue
MIN_EXPR  1 times vector_stmt costs 1 in body
32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue
32872 >> patt_26 1 times vector_stmt costs 1 in body
 1 times scalar_stmt costs 1 in prologue
 1 times scalar_stmt costs 1 in body
 
  Vector inside of loop cost: 5
  Scalar iteration cost: 1 (shouldn't that be 2? but anyway)
 
So one vector iteration costs 5 right now regardless of
scalar_to_vec because there are 5 vector operations (phi,
promote, min, shift, vsetvl/len adjustment).
The scalar_to_vec costs are added to the prologue because it is
assumed that broadcasts are hoisted out of the loop.
 
Then vectorization is supposed to be profitable if
#iterations = 18 > (body_cost * min_iters)
   + vector_outside_cost - scalar_outside_cost + 1 = 15.
 
If we really don't want to vectorize, then we can either
further increase the prologue cost or the body itself.  The
body statements are all vector_stmts, though.  For the
prologue we need a good argument why to increase scalar_to_vec
to beyond, say 2.
 
> Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be
> more costly than normal vadd.vv since it is transferring data between
> different pipeline/register class.
 
We want it to be more expensive, yes.  In one of the last
messages I explained how I would model it using either
register_move_cost or using (tune-specific) costs directly.
I would start with scalar_to_vec = 1 and add 1 or 2 depending
on the uarch/tune-specific reg-move costs.
 
Regards
Robin
 
 


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost:
> 
> lia4,-32768
> addiwa4,a4,104
> vmv.v.xv16,a4
> 
> It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but 
> failed on -march=rv64gcv_zvl4096b.
The scalar version also needs both instructions:

li  a0,32768
addiw   a0,a0,104

Therefore I don't think we should just add them to the
vectorization costs.  That would only be necessary if we needed
to synthesize a different constant (e.g. if a scalar constant
cannot be used directly in a vector setting).

Currently, scalar_outside_cost = 0 so we don't model it on the
scalar side either.

With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b:

a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body
a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue
(unsigned short) a.4_25 1 times vector_stmt costs 1 in body
MIN_EXPR  1 times scalar_to_vec costs 2 in prologue
MIN_EXPR  1 times vector_stmt costs 1 in body
32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue
32872 >> patt_26 1 times vector_stmt costs 1 in body
 1 times scalar_stmt costs 1 in prologue
 1 times scalar_stmt costs 1 in body

  Vector inside of loop cost: 5
  Scalar iteration cost: 1 (shouldn't that be 2? but anyway)

So one vector iteration costs 5 right now regardless of
scalar_to_vec because there are 5 vector operations (phi,
promote, min, shift, vsetvl/len adjustment).
The scalar_to_vec costs are added to the prologue because it is
assumed that broadcasts are hoisted out of the loop.

Then vectorization is supposed to be profitable if
#iterations = 18 > (body_cost * min_iters)
   + vector_outside_cost - scalar_outside_cost + 1 = 15.

If we really don't want to vectorize, then we can either
further increase the prologue cost or the body itself.  The
body statements are all vector_stmts, though.  For the
prologue we need a good argument why to increase scalar_to_vec
to beyond, say 2.

> Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be
> more costly than normal vadd.vv since it is transferring data between
> different pipeline/register class.

We want it to be more expensive, yes.  In one of the last
messages I explained how I would model it using either
register_move_cost or using (tune-specific) costs directly.
I would start with scalar_to_vec = 1 and add 1 or 2 depending
on the uarch/tune-specific reg-move costs.

Regards
 Robin



Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
I see after this accurate cost adjustment, it is still vectorized but different 
vect dump:

 [local count: 118111602]:
  # a.4_25 = PHI <1(2), _4(11)>
  # ivtmp_30 = PHI <18(2), ivtmp_20(11)>
  # vect_vec_iv_.12_149 = PHI <{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 
15, 16 }(2), _150(11)>
  # ivtmp_159 = PHI <0(2), ivtmp_160(11)>
  _150 = vect_vec_iv_.12_149 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16 };
  vect_patt_46.13_151 = (vector(16) unsigned short) vect_vec_iv_.12_149;
  _22 = (int) a.4_25;
  vect_patt_48.14_153 = MIN_EXPR ;
  vect_patt_49.15_155 = { 32872, 32872, 32872, 32872, 32872, 32872, 32872, 
32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> 
vect_patt_48.14_153;
  _12 = 32872 >> _22;
  vect_patt_51.16_156 = VIEW_CONVERT_EXPR(vect_patt_49.15_155);
  b_7 = (short int) _12;
  _4 = a.4_25 + 1;
  ivtmp_20 = ivtmp_30 - 1;
  ivtmp_160 = ivtmp_159 + 1;
  if (ivtmp_160 < 1)
goto ; [0.00%]
  else
goto ; [100.00%]

   [local count: 118111600]:
  # b_5 = PHI 
  a = 19;
  _14 = b_5 != 0;
  _15 = (int) _14;
  return _15;

   [local count: 4]:
  goto ; [100.00%]

   [local count: 118111601]:
  # a.4_146 = PHI <_4(8)>
  # ivtmp_147 = PHI 
  # vect_patt_51.16_157 = PHI 
  _158 = BIT_FIELD_REF ;
  b_144 = _158;

Purely VLS vectorized codes, then the later CSE pass is able to optimize it 
into the simply scalar codes.

Wheras it involves some VLA vectorized codes which make the later pass failed 
to CSE it...



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 19:15
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> I think we shouldn't vectorize it with any vlen, since the non-vectorized 
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> -zvl65536b, RVV Clang also doesn't vectorize it.
 
Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.
 
My point is cost comparison only has the scalar loop to compare
against which is:
 
li a5,1
li a3,19
.L2:
mv a4,a5
addiw a5,a5,1
bne a5,a3,.L2
 
That's effectively 2 * 18 instructions and more than what we get
when vectorizing - therefore it's kind totally outrageous to
vectorize here and we need to make sure not to go overboard with
costing just for this example.
 
How does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4 but a regular operation is 2 already.   This
would equal scalar_to_vec = 2 for us (and is not sufficient) so
something else must come into play still.
 
Regards
Robin
 


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
Ok. but with this scalar_to_vec set to 2:

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index df9799d9c5e..a14fb36817a 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
   1, /* gather_load_cost  */
   1, /* scatter_store_cost  */
   1, /* vec_to_scalar_cost  */
-  1, /* scalar_to_vec_cost  */
+  2, /* scalar_to_vec_cost  */
   1, /* permute_cost  */
   1, /* align_load_cost  */
   1, /* align_store_cost  */
@@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
 1, /* gather_load_cost  */
 1, /* scatter_store_cost  */
 1, /* vec_to_scalar_cost  */
-1, /* scalar_to_vec_cost  */
+2, /* scalar_to_vec_cost  */
 1, /* permute_cost  */
 1, /* align_load_cost  */
 1, /* align_store_cost  */

Then all zvl*b can be fixed for this test.

Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be more 
costly than normal vadd.vv since it is transferring data between different 
pipeline/register class.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 19:15
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> I think we shouldn't vectorize it with any vlen, since the non-vectorized 
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> -zvl65536b, RVV Clang also doesn't vectorize it.
 
Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.
 
My point is cost comparison only has the scalar loop to compare
against which is:
 
li a5,1
li a3,19
.L2:
mv a4,a5
addiw a5,a5,1
bne a5,a3,.L2
 
That's effectively 2 * 18 instructions and more than what we get
when vectorizing - therefore it's kind totally outrageous to
vectorize here and we need to make sure not to go overboard with
costing just for this example.
 
How does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4 but a regular operation is 2 already.   This
would equal scalar_to_vec = 2 for us (and is not sufficient) so
something else must come into play still.
 
Regards
Robin
 


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
Hi, Robin.

I model scalar value initialization accurately with following patch:

+/* Adjust vectorization cost after calling
+   targetm.vectorize.builtin_vectorization_cost. For some statement, we would
+   like to further fine-grain tweak the cost on top of
+   targetm.vectorize.builtin_vectorization_cost handling which doesn't have any
+   information on statement operation codes etc.  */
+
+static unsigned
+adjust_stmt_cost (class vec_info *vinfo, enum vect_cost_for_stmt kind,
+ struct _stmt_vec_info *stmt_info, tree vectype, int count,
+ int stmt_cost)
+{
+  gimple *stmt = stmt_info->stmt;
+  machine_mode smode = TYPE_MODE (TREE_TYPE (vectype));
+  switch (kind)
+{
+  case scalar_to_vec: {
+   stmt_cost *= count;
+   /* Adjust cost by counting the scalar value initialization.  */
+   for (int i = 0; i < count; i++)
+ {
+   tree arg = gimple_arg (stmt, i);
+   if (poly_int_tree_p (arg))
+ {
+   poly_int64 value = tree_to_poly_int64 (arg);
+   int scalar_cost
+ = riscv_const_insns (gen_int_mode (value, smode));
+   stmt_cost += scalar_cost;
+ }
+   else
+ stmt_cost += 1;
+ }
+   return stmt_cost;
+  }
+default:
+  break;
+}
+  return count * stmt_cost;
+}
+
 unsigned
 costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
  stmt_vec_info stmt_info, slp_tree, tree vectype,
@@ -1082,9 +1122,13 @@ costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 as one iteration of the VLA loop.  */
   if (where == vect_body && m_unrolled_vls_niters)
m_unrolled_vls_stmts += count * m_unrolled_vls_niters;
+
+  if (vectype)
+   stmt_cost = adjust_stmt_cost (m_vinfo, kind, stmt_info, vectype, count,
+ stmt_cost);
 }

-  return record_stmt_cost (stmt_info, where, count * stmt_cost);
+  return record_stmt_cost (stmt_info, where, stmt_cost);
 }

32872 >> patt_126 1 times scalar_to_vec costs 3 in prologue

32872 spends 2 scalar instructions + 1 scalar_to_vec cost:

li a4,-32768
addiw a4,a4,104
vmv.v.x v16,a4

It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but 
failed on -march=rv64gcv_zvl4096b.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 19:15
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> I think we shouldn't vectorize it with any vlen, since the non-vectorized 
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> -zvl65536b, RVV Clang also doesn't vectorize it.
 
Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.
 
My point is cost comparison only has the scalar loop to compare
against which is:
 
li a5,1
li a3,19
.L2:
mv a4,a5
addiw a5,a5,1
bne a5,a3,.L2
 
That's effectively 2 * 18 instructions and more than what we get
when vectorizing - therefore it's kind totally outrageous to
vectorize here and we need to make sure not to go overboard with
costing just for this example.
 
How does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4 but a regular operation is 2 already.   This
would equal scalar_to_vec = 2 for us (and is not sufficient) so
something else must come into play still.
 
Regards
Robin
 


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
Oh, Sorry. I made a mistake. It's -mrvv-vector-bits=2048 RVV clang doesn't 
vectorize it.

But ARM SVE GCC doesn't fix the issue if we specify -msve-vector-bits=2048, it 
will vectorize it.

https://godbolt.org/z/x7s8Kz87a

I guess LLVM has some magic in their cost model which can work well.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 19:15
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> I think we shouldn't vectorize it with any vlen, since the non-vectorized 
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> -zvl65536b, RVV Clang also doesn't vectorize it.
 
Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.
 
My point is cost comparison only has the scalar loop to compare
against which is:
 
li a5,1
li a3,19
.L2:
mv a4,a5
addiw a5,a5,1
bne a5,a3,.L2
 
That's effectively 2 * 18 instructions and more than what we get
when vectorizing - therefore it's kind totally outrageous to
vectorize here and we need to make sure not to go overboard with
costing just for this example.
 
How does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4 but a regular operation is 2 already.   This
would equal scalar_to_vec = 2 for us (and is not sufficient) so
something else must come into play still.
 
Regards
Robin
 


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
> I think we shouldn't vectorize it with any vlen, since the non-vectorized 
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> -zvl65536b, RVV Clang also doesn't vectorize it.

Of course I agree that optimizing everything to return 0 is
what should happen (tree-ssa-dom or vrp do that).  Unfortunately
they don't anymore after vectorizing the loop.

My point is cost comparison only has the scalar loop to compare
against which is:

li  a5,1
li  a3,19
.L2:
mv  a4,a5
addiw   a5,a5,1
bne a5,a3,.L2

That's effectively 2 * 18 instructions and more than what we get
when vectorizing - therefore it's kind totally outrageous to
vectorize here and we need to make sure not to go overboard with
costing just for this example.

How does aarch64's cost comparison look like?  What's, comparatively,
more expensive with their tuning?  I've seen scalar_to_vec = 4 and
vec_to_scalar = 4 but a regular operation is 2 already.   This
would equal scalar_to_vec = 2 for us (and is not sufficient) so
something else must come into play still.

Regards
 Robin


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
>> (My question whether why we shouldn't vectorize this at 256b
>> and above still stands, though)
I think we shouldn't vectorize it with any vlen, since the non-vectorized 
codegen is much better.
And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
-zvl65536b, RVV Clang also doesn't vectorize it.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 18:40
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On 1/11/24 11:20, juzhe.zh...@rivai.ai wrote:
> Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the 
> loop we have these 2 scalar_to_vec:
> 
> 1. MIN_EXPR  1 times scalar_to_vec costs 1 in prologue
> 
>This scalar_to_vec cost should be 0 or 1 since it only generate single 
> instructions: vmv.v.iv16,15
> 
> 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
> 
>This cost should be higher since it cost 3 instructions:
> lia4,-32768
> addiwa4,a4,104
> vmv.v.xv16,a4
> 
> Am I correct ?
> 
> I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be 
> good.
 
That would be the general idea, yes.  As Richard mentioned, it doesn't
always work well but for this case here it could help a bit.
(My question whether why we shouldn't vectorize this at 256b
and above still stands, though)
 
As mentioned before, the other thing that needs to be considered
is register-move costs (or the respective cost structure).  On
some uarchs the vmv.v.f might be more expensive than vmv.v.x and
so on - in addition to the instructions needed to synthesize the
constant.
 
Regards
Robin
 
 


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
On 1/11/24 11:20, juzhe.zh...@rivai.ai wrote:
> Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the 
> loop we have these 2 scalar_to_vec:
> 
> 1. MIN_EXPR  1 times scalar_to_vec costs 1 in prologue
> 
>    This scalar_to_vec cost should be 0 or 1 since it only generate single 
> instructions: vmv.v.iv16,15
> 
> 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
> 
>    This cost should be higher since it cost 3 instructions:
>     lia4,-32768
>     addiwa4,a4,104
>     vmv.v.xv16,a4
> 
> Am I correct ?
> 
> I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be 
> good.

That would be the general idea, yes.  As Richard mentioned, it doesn't
always work well but for this case here it could help a bit.
(My question whether why we shouldn't vectorize this at 256b
and above still stands, though)

As mentioned before, the other thing that needs to be considered
is register-move costs (or the respective cost structure).  On
some uarchs the vmv.v.f might be more expensive than vmv.v.x and
so on - in addition to the instructions needed to synthesize the
constant.

Regards
 Robin



Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Richard Biener
On Thu, Jan 11, 2024 at 11:18 AM Richard Biener
 wrote:
>
> On Thu, Jan 11, 2024 at 11:20 AM juzhe.zh...@rivai.ai
>  wrote:
> >
> > Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside 
> > the loop we have these 2 scalar_to_vec:
> >
> > 1. MIN_EXPR  1 times scalar_to_vec costs 1 in prologue
> >
> >This scalar_to_vec cost should be 0 or 1 since it only generate single 
> > instructions: vmv.v.i v16,15
> >
> > 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
> >
> >This cost should be higher since it cost 3 instructions:
> > li a4,-32768
> > addiw a4,a4,104
> > vmv.v.x v16,a4
> >
> > Am I correct ?
> >
> > I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be 
> > good.
>
> The cost model interface isn't set up very well to cost different constants
> differently.  For scalar_to_vec we should get the backend the actual scalar
> to duplicate but we don't.

It's similar for permutes btw.  My plan was always to defer changes to when
we only have SLP since there all invariants are represented by SLP nodes
which we can hand down.

> > 
> > juzhe.zh...@rivai.ai
> >
> >
> > From: Robin Dapp
> > Date: 2024-01-11 18:14
> > To: juzhe.zh...@rivai.ai; Richard Biener
> > CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
> > Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> > >  Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN 
> > > size,
> > > that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...
> > >
> > > I am confused now how to fix this case.
> >
> > 4 is definitely too high compared to a regular instruction.
> > vmv.vx could even be zero-cost for constants.
> >
> > To catch constants we could add handling in add_stmt_cost, inspecting
> > the stmt directly.
> >
> > Regards
> > Robin
> >


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Richard Biener
On Thu, Jan 11, 2024 at 11:20 AM juzhe.zh...@rivai.ai
 wrote:
>
> Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the 
> loop we have these 2 scalar_to_vec:
>
> 1. MIN_EXPR  1 times scalar_to_vec costs 1 in prologue
>
>This scalar_to_vec cost should be 0 or 1 since it only generate single 
> instructions: vmv.v.i v16,15
>
> 2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue
>
>This cost should be higher since it cost 3 instructions:
> li a4,-32768
> addiw a4,a4,104
> vmv.v.x v16,a4
>
> Am I correct ?
>
> I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be 
> good.

The cost model interface isn't set up very well to cost different constants
differently.  For scalar_to_vec we should get the backend the actual scalar
to duplicate but we don't.

> 
> juzhe.zh...@rivai.ai
>
>
> From: Robin Dapp
> Date: 2024-01-11 18:14
> To: juzhe.zh...@rivai.ai; Richard Biener
> CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
> Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
> >  Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN 
> > size,
> > that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...
> >
> > I am confused now how to fix this case.
>
> 4 is definitely too high compared to a regular instruction.
> vmv.vx could even be zero-cost for constants.
>
> To catch constants we could add handling in add_stmt_cost, inspecting
> the stmt directly.
>
> Regards
> Robin
>


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
Ok I see your idea and we need to adjust scalar_to_vec accurately. Inside the 
loop we have these 2 scalar_to_vec:

1. MIN_EXPR  1 times scalar_to_vec costs 1 in prologue

   This scalar_to_vec cost should be 0 or 1 since it only generate single 
instructions: vmv.v.i v16,15

2. 32872 >> patt_26 1 times scalar_to_vec costs 1 in prologue

   This cost should be higher since it cost 3 instructions:
li a4,-32768
addiw a4,a4,104
vmv.v.x v16,a4

Am I correct ?

I guess if we cost 1 case as 1 cost and 2 case as 3 cost. Then we will be good.



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 18:14
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
>  Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN 
> size,
> that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...
> 
> I am confused now how to fix this case.
 
4 is definitely too high compared to a regular instruction.
vmv.vx could even be zero-cost for constants.
 
To catch constants we could add handling in add_stmt_cost, inspecting
the stmt directly.
 
Regards
Robin
 


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
>  Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN 
> size,
> that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...
> 
> I am confused now how to fix this case.

4 is definitely too high compared to a regular instruction.
vmv.vx could even be zero-cost for constants.

To catch constants we could add handling in add_stmt_cost, inspecting
the stmt directly.

Regards
 Robin


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
And also I have investigate LLVM cost model. They don't cost vsevli in 
vectorization cost model.
But their cost model does a good job...



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 18:09
To: Richard Biener
CC: rdapp.gcc; juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; 
jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
>> The slidedown/vmv.x.s part is of course vec_extract but we indeed
>> don't seem to cost it as vec_to_scalar here.
> 
> It looks like a vectorized live operation as it's not in the loop body
> (and thus really irrelevant for costing in practice).  This has
> 
>   /* ???  Enable for loop costing as well.  */
>   if (!loop_vinfo)
> record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
>   0, vect_epilogue);
> 
> so live ops are not costed at all.  I would suggest to try unconditionally
> enabling this?
> 
 
IMHO this example is not really ideal to start with anyway so we should maybe
try another one first.  I'd still argue that one or two iterations vs.
potentially 16+ scalar ones is not necessarily bad.
 
That said, we also don't really cost all our vsetvls yet (difficult...).
 
Regards
Robin
 


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
>> That said, we also don't really cost all our vsetvls yet (difficult...).
If cost vsetvl, we will need to cost 1 more for each STMT.

However, it is not accurate. Since our VSETVL PASS will eliminate redundancy...


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 18:09
To: Richard Biener
CC: rdapp.gcc; juzhe.zh...@rivai.ai; gcc-patches; kito.cheng; Kito.cheng; 
jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
>> The slidedown/vmv.x.s part is of course vec_extract but we indeed
>> don't seem to cost it as vec_to_scalar here.
> 
> It looks like a vectorized live operation as it's not in the loop body
> (and thus really irrelevant for costing in practice).  This has
> 
>   /* ???  Enable for loop costing as well.  */
>   if (!loop_vinfo)
> record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
>   0, vect_epilogue);
> 
> so live ops are not costed at all.  I would suggest to try unconditionally
> enabling this?
> 
 
IMHO this example is not really ideal to start with anyway so we should maybe
try another one first.  I'd still argue that one or two iterations vs.
potentially 16+ scalar ones is not necessarily bad.
 
That said, we also don't really cost all our vsetvls yet (difficult...).
 
Regards
Robin
 


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
>> The slidedown/vmv.x.s part is of course vec_extract but we indeed
>> don't seem to cost it as vec_to_scalar here.
> 
> It looks like a vectorized live operation as it's not in the loop body
> (and thus really irrelevant for costing in practice).  This has
> 
>   /* ???  Enable for loop costing as well.  */
>   if (!loop_vinfo)
> record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
>   0, vect_epilogue);
> 
> so live ops are not costed at all.  I would suggest to try unconditionally
> enabling this?
> 

IMHO this example is not really ideal to start with anyway so we should maybe
try another one first.  I'd still argue that one or two iterations vs.
potentially 16+ scalar ones is not necessarily bad.

That said, we also don't really cost all our vsetvls yet (difficult...).

Regards
 Robin


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Richard Biener
On Thu, Jan 11, 2024 at 10:52 AM Robin Dapp  wrote:
>
> On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> > Oh. I see I think I have done wrong here.
> >
> > I should adjust cost for VEC_EXTRACT not VEC_SET.
> >
> > But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> > cost in vect.dump.
>
> The slidedown/vmv.x.s part is of course vec_extract but we indeed
> don't seem to cost it as vec_to_scalar here.

It looks like a vectorized live operation as it's not in the loop body
(and thus really irrelevant for costing in practice).  This has

  /* ???  Enable for loop costing as well.  */
  if (!loop_vinfo)
record_stmt_cost (cost_vec, 1, vec_to_scalar, stmt_info, NULL_TREE,
  0, vect_epilogue);

so live ops are not costed at all.  I would suggest to try unconditionally
enabling this?

> vmv.vx correspond to scalar_to_vec and I'd say 3 seems a
> bit high when a regular vector instruction is "1".
> It should rather be dependent on the latency between register
> files.  We can't really say in general but I'd say "2" is not so bad.
>
> I would suggest adding special handling in builtin_vectorization_cost
> like:
>
> /* Add register-register latency.  */
> case scalar_to_vec:
>   return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)
>
> and adjust register_move_cost accordingly.  Instead of using
> register_move_cost we could also use a cost structure directly.
> (E.g. like aarch64's regmove tuning structures.  Those don't
> contain VRs but for us it could make sense to add them).
>
> > +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> > -fdump-tree-vect-details" } */
> With a cost of "3" we still vectorize for zvl512b and larger.
> Is that intended?  I don't really see why 512 should vectorized
> but 256 not.  Disregarding that everything should be optimized
> away, 2 iterations for the whole loop with 256 bits doesn't
> seem that bad.
>
> Regards
>  Robin
>


Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
>> With a cost of "3" we still vectorize for zvl512b and larger.
>>Is that intended?  I don't really see why 512 should vectorized
>>but 256 not.  Disregarding that everything should be optimized
>>away, 2 iterations for the whole loop with 256 bits doesn't
>>seem that bad.
 Yeah... I just noticed. I should set it as 4 to fix it with biggest VLEN size,
that is, -march=rv64gcv_zvl4096b --param=riscv-autovec-lmul=m8...

I am confused now how to fix this case.


juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2024-01-11 17:52
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> Oh. I see I think I have done wrong here.
> 
> I should adjust cost for VEC_EXTRACT not VEC_SET.
> 
> But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> cost in vect.dump.
 
The slidedown/vmv.x.s part is of course vec_extract but we indeed
don't seem to cost it as vec_to_scalar here.
 
vmv.vx correspond to scalar_to_vec and I'd say 3 seems a
bit high when a regular vector instruction is "1".
It should rather be dependent on the latency between register
files.  We can't really say in general but I'd say "2" is not so bad.
 
I would suggest adding special handling in builtin_vectorization_cost
like:
 
/* Add register-register latency.  */
case scalar_to_vec:
  return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)
 
and adjust register_move_cost accordingly.  Instead of using
register_move_cost we could also use a cost structure directly.
(E.g. like aarch64's regmove tuning structures.  Those don't
contain VRs but for us it could make sense to add them).
 
> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> -fdump-tree-vect-details" } */
With a cost of "3" we still vectorize for zvl512b and larger.
Is that intended?  I don't really see why 512 should vectorized
but 256 not.  Disregarding that everything should be optimized
away, 2 iterations for the whole loop with 256 bits doesn't
seem that bad.
 
Regards
Robin
 
 


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Robin Dapp
On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> Oh. I see I think I have done wrong here.
> 
> I should adjust cost for VEC_EXTRACT not VEC_SET.
> 
> But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> cost in vect.dump.

The slidedown/vmv.x.s part is of course vec_extract but we indeed
don't seem to cost it as vec_to_scalar here.

vmv.vx correspond to scalar_to_vec and I'd say 3 seems a
bit high when a regular vector instruction is "1".
It should rather be dependent on the latency between register
files.  We can't really say in general but I'd say "2" is not so bad.

I would suggest adding special handling in builtin_vectorization_cost
like:

/* Add register-register latency.  */
case scalar_to_vec:
  return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)

and adjust register_move_cost accordingly.  Instead of using
register_move_cost we could also use a cost structure directly.
(E.g. like aarch64's regmove tuning structures.  Those don't
contain VRs but for us it could make sense to add them).

> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> -fdump-tree-vect-details" } */
With a cost of "3" we still vectorize for zvl512b and larger.
Is that intended?  I don't really see why 512 should vectorized
but 256 not.  Disregarding that everything should be optimized
away, 2 iterations for the whole loop with 256 bits doesn't
seem that bad.

Regards
 Robin



Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
Oh. I see I think I have done wrong here.

I should adjust cost for VEC_EXTRACT not VEC_SET.

But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
cost in vect.dump.

The vect tree:

  # a.4_25 = PHI <1(2), _4(11)>
  # ivtmp_30 = PHI <18(2), ivtmp_20(11)>
  # vect_vec_iv_.10_137 = PHI <{ 1, 2, 3, ... }(2), vect_vec_iv_.10_137(11)>
  # ivtmp_149 = PHI <0(2), ivtmp_150(11)>
  # loop_len_146 = PHI <18(2), _155(11)>
  vect_patt_28.11_139 = (vector([2048,2048]) unsigned short) 
vect_vec_iv_.10_137;
  _22 = (int) a.4_25;
  vect_patt_26.12_141 = MIN_EXPR ;
  vect_patt_10.13_143 = { 32872, ... } >> vect_patt_26.12_141;
  _12 = 32872 >> _22;
  vect_patt_27.14_144 = VIEW_CONVERT_EXPR(vect_patt_10.13_143);
  b_7 = (short int) _12;
  _4 = a.4_25 + 1;
  ivtmp_20 = ivtmp_30 - 1;
  ivtmp_150 = ivtmp_149 + POLY_INT_CST [2048, 2048];
  _153 = MIN_EXPR ;
  _154 = 18 - _153;
  _155 = MIN_EXPR <_154, POLY_INT_CST [2048, 2048]>;
  if (_155 != 0)
goto ; [0.00%]
  else
goto ; [100.00%]

   [local count: 118111600]:
  # vect_patt_27.14_145 = PHI 
  # loop_len_156 = PHI 
  _147 = loop_len_156 + 18446744073709551615;
  _148 = .VEC_EXTRACT (vect_patt_27.14_145, _147);
  b_5 = _148;
  a = 19;
  _14 = b_5 != 0;
  _15 = (int) _14;
  return _15;

The vect dump tree only compute cost include vector_stmt and scalar_to_vec.
It seems it didn't consider VEC_EXTRACT cost ?




juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2024-01-11 17:18
To: Juzhe-Zhong
CC: gcc-patches; kito.cheng; kito.cheng; jeffreyalaw; rdapp.gcc
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 9:24 AM Juzhe-Zhong  wrote:
>
> This patch fixes the following inefficient vectorized codes:
>
> vsetvli a5,zero,e8,mf2,ta,ma
> li  a2,17
> vid.v   v1
> li  a4,-32768
> vsetvli zero,zero,e16,m1,ta,ma
> addiw   a4,a4,104
> vmv.v.i v3,15
> lui a1,%hi(a)
> li  a0,19
> vsetvli zero,zero,e8,mf2,ta,ma
> vadd.vx v1,v1,a2
> sb  a0,%lo(a)(a1)
> vsetvli zero,zero,e16,m1,ta,ma
> vzext.vf2   v2,v1
> vmv.v.x v1,a4
> vminu.vvv2,v2,v3
> vsrl.vv v1,v1,v2
> vslidedown.vi   v1,v1,1
> vmv.x.s a0,v1
> sneza0,a0
> ret
>
> The reason is scalar_to_vec_cost is too low.
>
> Consider in VEC_SET, we always have a slide + scalar move instruction,
> scalar_to_vec_cost = 1 (current cost) is not reasonable.
 
scalar_to_vec is supposed to model a splat of GPR/FPR to a vector register.
 
We probably want to overhaul the cost classes, esp. vec_to_scalar, but of
course not now.
 
> I tried to set it as 2 but failed fix this case, that is, I need to
> set it as 3 to fix this case.
>
> No matter scalar move or slide instruction, I believe they are more costly
> than normal vector instructions (e.g. vadd.vv). So set it as 3 looks 
> reasonable
> to me.
>
> After this patch:
>
> lui a5,%hi(a)
> li  a4,19
> sb  a4,%lo(a)(a5)
> li  a0,0
> ret
>
> Tested on both RV32/RV64 no regression, Ok for trunk ?
>
> PR target/113281
>
> gcc/ChangeLog:
>
> * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
> * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.
>
> ---
>  gcc/config/riscv/riscv.cc  |  4 ++--
>  .../vect/costmodel/riscv/rvv/pr113281-1.c  | 18 ++
>  .../gcc.target/riscv/rvv/autovec/pr113209.c|  2 +-
>  3 files changed, 21 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index df9799d9c5e..bcfb3c15a39 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
>1, /* gather_load_cost  */
>1, /* scatter_store_cost  */
>1, /* vec_to_scalar_cost  */
> -  1, /* scalar_to_vec_cost  */
> +  3, /* scalar_to_vec_cost  */
>1, /* permute_cost  */
>1, /* align_load_cost  */
>1, /* align_store_cost  */
> @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
>  1, /* gather_load_cost  */
>  1, /* scatter_store_cost  */
>  1, /* vec_to_scalar_cost  */
> -1, /* scalar_to_vec_cost  */
> +3, /* scalar_to_vec_cost  */
>  1, /* permute_cost  */
>  1, /* align_load_cost  */
>  1, /* align_store_cost  */
> diff --git a

Re: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread juzhe.zh...@rivai.ai
Thanks Richard.

So you think increase scalar_to_vec cost is not the correct approach to fix 
this case?

Or could you give me a suggestion to fix this case ?

Thanks.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2024-01-11 17:18
To: Juzhe-Zhong
CC: gcc-patches; kito.cheng; kito.cheng; jeffreyalaw; rdapp.gcc
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3
On Thu, Jan 11, 2024 at 9:24 AM Juzhe-Zhong  wrote:
>
> This patch fixes the following inefficient vectorized codes:
>
> vsetvli a5,zero,e8,mf2,ta,ma
> li  a2,17
> vid.v   v1
> li  a4,-32768
> vsetvli zero,zero,e16,m1,ta,ma
> addiw   a4,a4,104
> vmv.v.i v3,15
> lui a1,%hi(a)
> li  a0,19
> vsetvli zero,zero,e8,mf2,ta,ma
> vadd.vx v1,v1,a2
> sb  a0,%lo(a)(a1)
> vsetvli zero,zero,e16,m1,ta,ma
> vzext.vf2   v2,v1
> vmv.v.x v1,a4
> vminu.vvv2,v2,v3
> vsrl.vv v1,v1,v2
> vslidedown.vi   v1,v1,1
> vmv.x.s a0,v1
> sneza0,a0
> ret
>
> The reason is scalar_to_vec_cost is too low.
>
> Consider in VEC_SET, we always have a slide + scalar move instruction,
> scalar_to_vec_cost = 1 (current cost) is not reasonable.
 
scalar_to_vec is supposed to model a splat of GPR/FPR to a vector register.
 
We probably want to overhaul the cost classes, esp. vec_to_scalar, but of
course not now.
 
> I tried to set it as 2 but failed fix this case, that is, I need to
> set it as 3 to fix this case.
>
> No matter scalar move or slide instruction, I believe they are more costly
> than normal vector instructions (e.g. vadd.vv). So set it as 3 looks 
> reasonable
> to me.
>
> After this patch:
>
> lui a5,%hi(a)
> li  a4,19
> sb  a4,%lo(a)(a5)
> li  a0,0
> ret
>
> Tested on both RV32/RV64 no regression, Ok for trunk ?
>
> PR target/113281
>
> gcc/ChangeLog:
>
> * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
> * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.
>
> ---
>  gcc/config/riscv/riscv.cc  |  4 ++--
>  .../vect/costmodel/riscv/rvv/pr113281-1.c  | 18 ++
>  .../gcc.target/riscv/rvv/autovec/pr113209.c|  2 +-
>  3 files changed, 21 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index df9799d9c5e..bcfb3c15a39 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
>1, /* gather_load_cost  */
>1, /* scatter_store_cost  */
>1, /* vec_to_scalar_cost  */
> -  1, /* scalar_to_vec_cost  */
> +  3, /* scalar_to_vec_cost  */
>1, /* permute_cost  */
>1, /* align_load_cost  */
>1, /* align_store_cost  */
> @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
>  1, /* gather_load_cost  */
>  1, /* scatter_store_cost  */
>  1, /* vec_to_scalar_cost  */
> -1, /* scalar_to_vec_cost  */
> +3, /* scalar_to_vec_cost  */
>  1, /* permute_cost  */
>  1, /* align_load_cost  */
>  1, /* align_store_cost  */
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c 
> b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
> new file mode 100644
> index 000..331cf961a1f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> -fdump-tree-vect-details" } */
> +
> +unsigned char a;
> +
> +int main() {
> +  short b = a = 0;
> +  for (; a != 19; a++)
> +if (a)
> +  b = 32872 >> a;
> +
> +  if (b == 0)
> +return 0;
> +  else
> +return 1;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> index 081ee369394..70aae151000 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */
> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 
> -fno-vect-cost-model" } */
>
>  int b, c, d, f, i, a;
>  int e[1] = {0};
> --
> 2.36.3
>
 


Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Richard Biener
On Thu, Jan 11, 2024 at 9:24 AM Juzhe-Zhong  wrote:
>
> This patch fixes the following inefficient vectorized codes:
>
> vsetvli a5,zero,e8,mf2,ta,ma
> li  a2,17
> vid.v   v1
> li  a4,-32768
> vsetvli zero,zero,e16,m1,ta,ma
> addiw   a4,a4,104
> vmv.v.i v3,15
> lui a1,%hi(a)
> li  a0,19
> vsetvli zero,zero,e8,mf2,ta,ma
> vadd.vx v1,v1,a2
> sb  a0,%lo(a)(a1)
> vsetvli zero,zero,e16,m1,ta,ma
> vzext.vf2   v2,v1
> vmv.v.x v1,a4
> vminu.vvv2,v2,v3
> vsrl.vv v1,v1,v2
> vslidedown.vi   v1,v1,1
> vmv.x.s a0,v1
> sneza0,a0
> ret
>
> The reason is scalar_to_vec_cost is too low.
>
> Consider in VEC_SET, we always have a slide + scalar move instruction,
> scalar_to_vec_cost = 1 (current cost) is not reasonable.

scalar_to_vec is supposed to model a splat of GPR/FPR to a vector register.

We probably want to overhaul the cost classes, esp. vec_to_scalar, but of
course not now.

> I tried to set it as 2 but failed fix this case, that is, I need to
> set it as 3 to fix this case.
>
> No matter scalar move or slide instruction, I believe they are more costly
> than normal vector instructions (e.g. vadd.vv). So set it as 3 looks 
> reasonable
> to me.
>
> After this patch:
>
> lui a5,%hi(a)
> li  a4,19
> sb  a4,%lo(a)(a5)
> li  a0,0
> ret
>
> Tested on both RV32/RV64 no regression, Ok for trunk ?
>
> PR target/113281
>
> gcc/ChangeLog:
>
> * config/riscv/riscv.cc: Set scalar_to_vec_cost as 3.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
> * gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.
>
> ---
>  gcc/config/riscv/riscv.cc  |  4 ++--
>  .../vect/costmodel/riscv/rvv/pr113281-1.c  | 18 ++
>  .../gcc.target/riscv/rvv/autovec/pr113209.c|  2 +-
>  3 files changed, 21 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
>
> diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
> index df9799d9c5e..bcfb3c15a39 100644
> --- a/gcc/config/riscv/riscv.cc
> +++ b/gcc/config/riscv/riscv.cc
> @@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
>1, /* gather_load_cost  */
>1, /* scatter_store_cost  */
>1, /* vec_to_scalar_cost  */
> -  1, /* scalar_to_vec_cost  */
> +  3, /* scalar_to_vec_cost  */
>1, /* permute_cost  */
>1, /* align_load_cost  */
>1, /* align_store_cost  */
> @@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
>  1, /* gather_load_cost  */
>  1, /* scatter_store_cost  */
>  1, /* vec_to_scalar_cost  */
> -1, /* scalar_to_vec_cost  */
> +3, /* scalar_to_vec_cost  */
>  1, /* permute_cost  */
>  1, /* align_load_cost  */
>  1, /* align_store_cost  */
> diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c 
> b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
> new file mode 100644
> index 000..331cf961a1f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> -fdump-tree-vect-details" } */
> +
> +unsigned char a;
> +
> +int main() {
> +  short b = a = 0;
> +  for (; a != 19; a++)
> +if (a)
> +  b = 32872 >> a;
> +
> +  if (b == 0)
> +return 0;
> +  else
> +return 1;
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c 
> b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> index 081ee369394..70aae151000 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */
> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 
> -fno-vect-cost-model" } */
>
>  int b, c, d, f, i, a;
>  int e[1] = {0};
> --
> 2.36.3
>


[PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

2024-01-11 Thread Juzhe-Zhong
This patch fixes the following inefficient vectorized codes:

vsetvli a5,zero,e8,mf2,ta,ma
li  a2,17
vid.v   v1
li  a4,-32768
vsetvli zero,zero,e16,m1,ta,ma
addiw   a4,a4,104
vmv.v.i v3,15
lui a1,%hi(a)
li  a0,19
vsetvli zero,zero,e8,mf2,ta,ma
vadd.vx v1,v1,a2
sb  a0,%lo(a)(a1)
vsetvli zero,zero,e16,m1,ta,ma
vzext.vf2   v2,v1
vmv.v.x v1,a4
vminu.vvv2,v2,v3
vsrl.vv v1,v1,v2
vslidedown.vi   v1,v1,1
vmv.x.s a0,v1
sneza0,a0
ret

The reason is scalar_to_vec_cost is too low.

Consider in VEC_SET, we always have a slide + scalar move instruction,
scalar_to_vec_cost = 1 (current cost) is not reasonable.

I tried to set it as 2 but failed fix this case, that is, I need to
set it as 3 to fix this case.

No matter scalar move or slide instruction, I believe they are more costly
than normal vector instructions (e.g. vadd.vv). So set it as 3 looks reasonable
to me.

After this patch:

lui a5,%hi(a)
li  a4,19
sb  a4,%lo(a)(a5)
li  a0,0
ret

Tested on both RV32/RV64 no regression, Ok for trunk ?

PR target/113281

gcc/ChangeLog:

* config/riscv/riscv.cc: Set scalar_to_vec_cost as 3.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr113209.c: Adapt test.
* gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c: New test.

---
 gcc/config/riscv/riscv.cc  |  4 ++--
 .../vect/costmodel/riscv/rvv/pr113281-1.c  | 18 ++
 .../gcc.target/riscv/rvv/autovec/pr113209.c|  2 +-
 3 files changed, 21 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index df9799d9c5e..bcfb3c15a39 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -366,7 +366,7 @@ static const common_vector_cost rvv_vls_vector_cost = {
   1, /* gather_load_cost  */
   1, /* scatter_store_cost  */
   1, /* vec_to_scalar_cost  */
-  1, /* scalar_to_vec_cost  */
+  3, /* scalar_to_vec_cost  */
   1, /* permute_cost  */
   1, /* align_load_cost  */
   1, /* align_store_cost  */
@@ -382,7 +382,7 @@ static const scalable_vector_cost rvv_vla_vector_cost = {
 1, /* gather_load_cost  */
 1, /* scatter_store_cost  */
 1, /* vec_to_scalar_cost  */
-1, /* scalar_to_vec_cost  */
+3, /* scalar_to_vec_cost  */
 1, /* permute_cost  */
 1, /* align_load_cost  */
 1, /* align_store_cost  */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c 
b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
new file mode 100644
index 000..331cf961a1f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/pr113281-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
-fdump-tree-vect-details" } */
+
+unsigned char a;
+
+int main() {
+  short b = a = 0;
+  for (; a != 19; a++)
+if (a)
+  b = 32872 >> a;
+
+  if (b == 0)
+return 0;
+  else
+return 1;
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
index 081ee369394..70aae151000 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr113209.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3" } */
+/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -fno-vect-cost-model" 
} */
 
 int b, c, d, f, i, a;
 int e[1] = {0};
-- 
2.36.3