> 32872 spends 2 scalar instructions + 1 scalar_to_vec cost:
>
> li a4,-32768
> addiw a4,a4,104
> vmv.v.x v16,a4
>
> It seems reasonable but only can fix test with -march=rv64gcv_zvl256b but
> failed on -march=rv64gcv_zvl4096b.

The scalar version also needs both instructions:

  li a0,32768
  addiw a0,a0,104

Therefore I don't think we should just add them to the vectorization
costs.  That would only be necessary if we needed to synthesize a
different constant (e.g. if a scalar constant cannot be used directly in
a vector setting).  Currently, scalar_outside_cost = 0, so we don't model
it on the scalar side either.

With scalar_to_vec = 2 we first try RVVMF2QI, vf = 8 at zvl256b:

  a.4_25 = PHI <1(2), _4(11)> 1 times vector_stmt costs 1 in body
  a.4_25 = PHI <1(2), _4(11)> 2 times scalar_to_vec costs 4 in prologue
  (unsigned short) a.4_25 1 times vector_stmt costs 1 in body
  MIN_EXPR <patt_28, 15> 1 times scalar_to_vec costs 2 in prologue
  MIN_EXPR <patt_28, 15> 1 times vector_stmt costs 1 in body
  32872 >> patt_26 1 times scalar_to_vec costs 2 in prologue
  32872 >> patt_26 1 times vector_stmt costs 1 in body
  <unknown> 1 times scalar_stmt costs 1 in prologue
  <unknown> 1 times scalar_stmt costs 1 in body
  Vector inside of loop cost: 5
  Scalar iteration cost: 1 (shouldn't that be 2? but anyway)

So one vector iteration costs 5 right now regardless of scalar_to_vec
because there are 5 vector operations (phi, promote, min, shift,
vsetvl/len adjustment).  The scalar_to_vec costs are added to the
prologue because it is assumed that broadcasts are hoisted out of the
loop.

Then vectorization is supposed to be profitable if

  #iterations = 18 > (body_cost * min_iters)
                     + vector_outside_cost - scalar_outside_cost + 1 = 15.

If we really don't want to vectorize, then we can either further
increase the prologue cost or the cost of the body itself.  The body
statements are all vector_stmts, though.  For the prologue we need a
good argument for increasing scalar_to_vec beyond, say, 2.

> Is it reasonable ? IMHO, scalar move (vmv.v.x or vfmv.v.f) should be
> more costly than normal vadd.vv since it is transferring data between
> different pipeline/register class.

We want it to be more expensive, yes.
In one of the last messages I explained how I would model it using
either register_move_cost or using (tune-specific) costs directly.  I
would start with scalar_to_vec = 1 and add 1 or 2 depending on the
uarch/tune-specific reg-move costs.

Regards
 Robin
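
P.S.  For reference, here is a rough sketch of the arithmetic above in
Python.  It is not GCC code and the vectorizer's real profitability
computation is more involved.  The statement counts are read off the
dump quoted above; min_iters = 1 is an assumption, inferred because it
is the value that makes the formula come out at 15.  The helper names
and the reg_move_extra parameter are hypothetical and only illustrate
the "start at 1, add 1 or 2" idea.

```python
# Counts taken from the cost dump (RVVMF2QI, vf = 8, zvl256b).
SCALAR_TO_VEC_COUNT = 2 + 1 + 1  # PHI (2 times), MIN_EXPR, 32872 >> patt_26
PROLOGUE_SCALAR_STMTS = 1        # "<unknown> 1 times scalar_stmt ... in prologue"
BODY_COST = 5                    # "Vector inside of loop cost: 5"
SCALAR_OUTSIDE_COST = 0          # not modelled on the scalar side
MIN_ITERS = 1                    # ASSUMPTION: inferred so that the result is 15
NITERS = 18                      # iteration count of the loop in the test


def scalar_to_vec_cost(reg_move_extra):
    """Hypothetical model: base cost 1 plus a tune-specific reg-move part."""
    return 1 + reg_move_extra


def vector_outside_cost(scalar_to_vec):
    """Prologue cost: all hoisted broadcasts plus the scalar prologue stmt."""
    return SCALAR_TO_VEC_COUNT * scalar_to_vec + PROLOGUE_SCALAR_STMTS


def threshold(scalar_to_vec):
    """Right-hand side of the profitability inequality from the mail."""
    return (BODY_COST * MIN_ITERS
            + vector_outside_cost(scalar_to_vec)
            - SCALAR_OUTSIDE_COST + 1)


def is_profitable(scalar_to_vec, niters=NITERS):
    return niters > threshold(scalar_to_vec)


if __name__ == "__main__":
    for extra in (0, 1, 2):
        cost = scalar_to_vec_cost(extra)
        print(f"scalar_to_vec={cost} threshold={threshold(cost)} "
              f"vectorize={is_profitable(cost)}")
```

Under these assumptions scalar_to_vec = 2 reproduces the threshold of 15
from the dump, so the loop with 18 iterations is still vectorized, while
scalar_to_vec = 3 would raise the threshold to 19 and the loop would no
longer be considered profitable.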