On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> Oh. I see I think I have done wrong here.
> 
> I should adjust cost for VEC_EXTRACT not VEC_SET.
> 
> But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> cost in vect.dump.

The slidedown/vmv.x.s part is of course vec_extract but we indeed
don't seem to cost it as vec_to_scalar here.

vmv.vx corresponds to scalar_to_vec and I'd say 3 seems a
bit high when a regular vector instruction is "1".
It should rather be dependent on the latency between register
files.  We can't really say in general but I'd say "2" is not so bad.

I would suggest adding special handling in builtin_vectorization_cost
like:

/* Add register-register latency.  */
case scalar_to_vec:
  return common_costs->scalar_to_vec_cost + riscv_register_move_cost (...)

and adjust register_move_cost accordingly.  Instead of using
register_move_cost we could also use a cost structure directly.
(E.g. like aarch64's regmove tuning structures.  Those don't
contain VRs but for us it could make sense to add them).
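
To make the second variant a bit more concrete, here is a rough standalone
sketch (not actual GCC code).  All type, field, and function names below
(regmove_cost, gr2vr, vr2gr, common_vector_cost, vectorization_cost) are
made up for illustration, and the cost values are placeholders picked so
that scalar_to_vec ends up at "2"; real values would have to come from
the uarch.

```c
/* Standalone model only; none of these are GCC's real definitions.  */

/* Hypothetical subset of the vectorizer's cost kinds.  */
enum vect_cost_kind { VECTOR_STMT, SCALAR_TO_VEC, VEC_TO_SCALAR };

/* Hypothetical register-move tuning structure, extended with
   entries for moves between the GPR and VR register files.  */
struct regmove_cost
{
  int gr2vr; /* GPR -> vector register, e.g. vmv.v.x.  */
  int vr2gr; /* Vector register -> GPR, e.g. vmv.x.s.  */
};

/* Hypothetical per-statement base costs.  */
struct common_vector_cost
{
  int vector_stmt_cost;
  int scalar_to_vec_cost;
  int vec_to_scalar_cost;
};

/* Placeholder tuning values, assumed for this sketch.  */
static const struct regmove_cost generic_regmove = { 1, 1 };
static const struct common_vector_cost generic_vec = { 1, 1, 1 };

/* Return the cost of KIND, adding the register-file crossing
   latency for the kinds that move data between GPRs and VRs.  */
static int
vectorization_cost (enum vect_cost_kind kind,
                    const struct common_vector_cost *vec,
                    const struct regmove_cost *regmove)
{
  switch (kind)
    {
    case SCALAR_TO_VEC:
      /* Add register-register latency.  */
      return vec->scalar_to_vec_cost + regmove->gr2vr;
    case VEC_TO_SCALAR:
      return vec->vec_to_scalar_cost + regmove->vr2gr;
    default:
      return vec->vector_stmt_cost;
    }
}
```

With these placeholder numbers,
vectorization_cost (SCALAR_TO_VEC, &generic_vec, &generic_regmove)
gives 2 while a plain VECTOR_STMT stays at 1, so the only thing a
tuning would need to change is the register-file latency entry.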

> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize 
> -fdump-tree-vect-details" } */
With a cost of "3" we still vectorize for zvl512b and larger.
Is that intended?  I don't really see why 512 should be vectorized
but 256 not.  Disregarding that everything should be optimized
away, 2 iterations for the whole loop with 256 bits doesn't
seem that bad.

Regards
 Robin
