I see that after this accurate cost adjustment it is still vectorized, but with a different vect dump:
<bb 8> [local count: 118111602]:
  # a.4_25 = PHI <1(2), _4(11)>
  # ivtmp_30 = PHI <18(2), ivtmp_20(11)>
  # vect_vec_iv_.12_149 = PHI <{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }(2), _150(11)>
  # ivtmp_159 = PHI <0(2), ivtmp_160(11)>
  _150 = vect_vec_iv_.12_149 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 };
  vect_patt_46.13_151 = (vector(16) unsigned short) vect_vec_iv_.12_149;
  _22 = (int) a.4_25;
  vect_patt_48.14_153 = MIN_EXPR <vect_patt_46.13_151, { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 }>;
  vect_patt_49.15_155 = { 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872, 32872 } >> vect_patt_48.14_153;
  _12 = 32872 >> _22;
  vect_patt_51.16_156 = VIEW_CONVERT_EXPR<vector(16) short int>(vect_patt_49.15_155);
  b_7 = (short int) _12;
  _4 = a.4_25 + 1;
  ivtmp_20 = ivtmp_30 - 1;
  ivtmp_160 = ivtmp_159 + 1;
  if (ivtmp_160 < 1)
    goto <bb 11>; [0.00%]
  else
    goto <bb 18>; [100.00%]

<bb 16> [local count: 118111600]:
  # b_5 = PHI <b_141(19)>
  a = 19;
  _14 = b_5 != 0;
  _15 = (int) _14;
  return _15;

<bb 11> [local count: 4]:
  goto <bb 8>; [100.00%]

<bb 18> [local count: 118111601]:
  # a.4_146 = PHI <_4(8)>
  # ivtmp_147 = PHI <ivtmp_20(8)>
  # vect_patt_51.16_157 = PHI <vect_patt_51.16_156(8)>
  _158 = BIT_FIELD_REF <vect_patt_51.16_157, 16, 240>;
  b_144 = _158;

When the code is purely VLS vectorized like this, the later CSE pass is able to optimize it into simple scalar code. Whereas when it involves some VLA vectorized code, the later pass fails to CSE it...

juzhe.zh...@rivai.ai

From: Robin Dapp
Date: 2024-01-11 19:15
To: juzhe.zh...@rivai.ai; Richard Biener
CC: rdapp.gcc; gcc-patches; kito.cheng; Kito.cheng; jeffreyalaw
Subject: Re: [PATCH] RISC-V: Increase scalar_to_vec_cost from 1 to 3

> I think we shouldn't vectorize it with any vlen, since the non-vectorized
> codegen is much better.
> And also, I have tested -msve-vector-bits=2048, ARM SVE doesn't vectorize it.
> -zvl65536b, RVV Clang also doesn't vectorize it.

Of course I agree that optimizing everything to return 0 is what should happen (tree-ssa-dom or vrp do that). Unfortunately they don't anymore after vectorizing the loop.

My point is that the cost comparison only has the scalar loop to compare against, which is:

        li      a5,1
        li      a3,19
.L2:
        mv      a4,a5
        addiw   a5,a5,1
        bne     a5,a3,.L2

That's effectively 2 * 18 instructions and more than what we get when vectorizing - therefore it's not totally outrageous to vectorize here, and we need to make sure not to go overboard with costing just for this example.

How does aarch64's cost comparison look? What's, comparatively, more expensive with their tuning? I've seen scalar_to_vec = 4 and vec_to_scalar = 4, but a regular operation is 2 already. This would equal scalar_to_vec = 2 for us (and is not sufficient), so something else must come into play still.

Regards
Robin
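For reference, a minimal test case consistent with the GIMPLE dump above might look as follows. This is a hypothetical reconstruction inferred from the dump (the variable names, their short type, and the loop bounds are assumptions), not the actual source from the thread or PR:

/* Hypothetical reconstruction of the test case, inferred from the
   GIMPLE dump above -- not the original source.  */
short a, b;

int
main (void)
{
  /* 18 iterations, a = 1 .. 18, matching ivtmp_30 = 18 and the
     final store "a = 19" in bb 16.  */
  for (a = 1; a != 19; a++)
    b = 32872 >> a;   /* last value stored: 32872 >> 18 == 0 */
  /* Matches "_14 = b_5 != 0; _15 = (int) _14; return _15;".  */
  return b != 0;
}

Since b is 0 after the loop, the whole program can fold to return 0 in the scalar case, which is the optimization (by tree-ssa-dom or vrp) that no longer happens once the loop is vectorized, as discussed above.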