Tamar Christina <tamar.christ...@arm.com> writes:
> @@ -13936,8 +13937,65 @@ cost_plus:
>                                      mode, MULT, 1, speed);
>           return true;
>         }
> +      break;
> +    case PARALLEL:
> +      /* Fall through */

Which code paths lead to getting a PARALLEL here?

> +    case CONST_VECTOR:
> +      {
> +        rtx gen_insn = aarch64_simd_make_constant (x, true);
> +        /* Not a valid const vector.  */
> +        if (!gen_insn)
> +          break;
>
> -      /* Fall through.  */
> +        switch (GET_CODE (gen_insn))
> +        {
> +        case CONST_VECTOR:
> +          /* Load using MOVI/MVNI.  */
> +          if (aarch64_simd_valid_immediate (x, NULL))
> +            *cost += extra_cost->vect.movi;
> +          else /* Load using constant pool.  */
> +            *cost += extra_cost->ldst.load;
> +          break;
> +        /* Load using a DUP.  */
> +        case VEC_DUPLICATE:
> +          *cost += extra_cost->vect.dup;
> +          break;

Does this trigger in practice?  The new check==true path (rightly)
stops the duplicated element from being forced into a register,
but then I would have expected:

rtx
gen_vec_duplicate (machine_mode mode, rtx x)
{
  if (valid_for_const_vector_p (mode, x))
    return gen_const_vec_duplicate (mode, x);
  return gen_rtx_VEC_DUPLICATE (mode, x);
}

to generate the original CONST_VECTOR again.

> +        default:
> +          *cost += extra_cost->ldst.load;
> +          break;
> +        }
> +        return true;
> +      }
> +    case VEC_CONCAT:
> +      /* depending on the operation, either DUP or INS.
> +         For now, keep default costing.  */
> +      break;
> +    case VEC_DUPLICATE:
> +      *cost += extra_cost->vect.dup;
> +      return true;
> +    case VEC_SELECT:
> +      {
> +        /* cost subreg of 0 as free, otherwise as DUP */
> +        rtx op1 = XEXP (x, 1);
> +        int nelts;
> +        if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
> +            || (BYTES_BIG_ENDIAN
> +                && GET_MODE_NUNITS (mode).is_constant (&nelts)
> +                && INTVAL (op1) == nelts - 1))
> +          ;
> +        else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
> +          ;
> +        else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
> +          /* Selecting the high part is not technically free, but we lack
> +             enough information to decide that here.  For instance selecting
> +             the high-part of a vec_dup *is* free or to feed into any _high
> +             instruction.  Both of which we can't really tell.  That said, we
> +             have a better chance to optimize a dup vs multiple constants.  */
> +          ;

Not sure about this.  We already try to detect the latter case
(_high instructions) via aarch64_strip_extend_vec_half.  We might be
missing some cases, but that still feels like the right way to go IMO.

Selecting the high part of a vec_dup should get folded into another
vec_dup.

The lowpart bits look OK, but which paths call this function without
first simplifying the select to a subreg?  The subreg is now the
canonical form (thanks to r12-2288).

Thanks,
Richard