https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111317
--- Comment #1 from Robin Dapp <rdapp at gcc dot gnu.org> ---

I think the default cost model is not too bad for these simple cases.  Our
emitted instructions match gimple pretty well.

The thing we don't model is vsetvl.  We could ignore it under the assumption
that it is going to be rather cheap on most uarchs.

Something that needs to be fixed is the general costing used for
length-masking:

  /* Each may need two MINs and one MINUS to update lengths in body
     for next iteration.  */
  if (need_iterate_p)
    body_stmts += 3 * num_vectors;

We don't actually need min with vsetvl (they are our mins) so this would need
to be adjusted down, provided vsetvl is cheap.

This is the scalar baseline:

.L3:
        lw      a5,0(a0)
        sd      a5,0(a1)
        addi    a0,a0,4
        addi    a1,a1,8
        bne     a4,a0,.L3

While this is what zvl128b would emit:

.L3:
        vsetvli a5,a2,e8,mf8,ta,ma
        vle32.v v2,0(a0)
        vsetvli a4,zero,e64,m1,ta,ma
        vsext.vf2       v1,v2
        vsetvli zero,a2,e64,m1,ta,ma
        vse64.v v1,0(a1)
        slli    a4,a5,2
        add     a0,a0,a4
        slli    a4,a5,3
        add     a1,a1,a4
        sub     a2,a2,a5
        bne     a2,zero,.L3

With a vectorization factor of 2 (might effectively be higher of course but
possibly unknown at compile time) I'm not sure vectorization is always a win
and the costs actually reflect that.

If we disregard vsetvl for now we have 8 instructions in the vectorized loop
and 2 * 4 instructions in the scalar loop for the same amount of data.
Factoring in the vsetvls I'd say it's worse.

Once we statically know the VF is higher, we will vectorize.