On 1/11/24 10:46, juzhe.zh...@rivai.ai wrote:
> Oh. I see I think I have done wrong here.
>
> I should adjust cost for VEC_EXTRACT not VEC_SET.
>
> But it's odd, I didn't see loop vectorizer is scanning scalar_to_vec
> cost in vect.dump.
The slidedown/vmv.x.s part is of course vec_extract, but we indeed don't
seem to cost it as vec_to_scalar here.

vmv.v.x corresponds to scalar_to_vec, and I'd say 3 seems a bit high when
a regular vector instruction is "1".  It should rather depend on the
latency between the register files.  We can't really say in general, but
I'd say "2" is not so bad.

I would suggest adding special handling in builtin_vectorization_cost
like:

  /* Add register-register latency.  */
  case scalar_to_vec:
    return common_costs->scalar_to_vec_cost
	   + riscv_register_move_cost (...);

and adjusting register_move_cost accordingly.  Instead of using
register_move_cost we could also use a cost structure directly (e.g. like
aarch64's regmove tuning structures; those don't contain VRs, but for us
it could make sense to add them).

> +/* { dg-options "-march=rv64gcv_zvl256b -mabi=lp64d -O3 -ftree-vectorize
> -fdump-tree-vect-details" } */

With a cost of "3" we still vectorize for zvl512b and larger.  Is that
intended?  I don't really see why 512 should be vectorized but 256 not.
Disregarding that everything should be optimized away, 2 iterations for
the whole loop with 256 bits doesn't seem that bad.

Regards
 Robin
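
P.S.: For concreteness, a minimal sketch of what I have in mind.  The
helper name and the common_vector_cost type/field names are only
placeholders following the snippet above, not necessarily the actual
riscv.cc structures:

  /* Sketch only: cost of broadcasting/inserting a scalar into a vector
     register, modeled as the base scalar_to_vec cost plus the latency of
     moving a value from the scalar (GR_REGS) to the vector (V_REGS)
     register file, as returned by riscv_register_move_cost.  */
  static int
  riscv_scalar_to_vec_cost (const common_vector_cost *common_costs,
			    machine_mode mode)
  {
    return common_costs->scalar_to_vec_cost
	   + riscv_register_move_cost (mode, GR_REGS, V_REGS);
  }

builtin_vectorization_cost's scalar_to_vec case could then simply return
this instead of the plain constant.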