Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Tuesday, August 31, 2021 4:14 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Marcus Shawcroft
>> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
>> and operations
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > @@ -13936,8 +13937,65 @@ cost_plus:
>> > 						 mode, MULT, 1, speed);
>> >           return true;
>> >         }
>> > +      break;
>> > +    case PARALLEL:
>> > +      /* Fall through */
>>
>> Which code paths lead to getting a PARALLEL here?

Hi,

Thanks for the review!

I added it for completeness because CSE treats a PARALLEL and a
CONST_VECTOR as equivalent when each entry in the parallel defines a
constant.
Could you test whether it ever triggers in practice though?  The code
would be much simpler without it.

>> > +    case CONST_VECTOR:
>> > +	{
>> > +	  rtx gen_insn = aarch64_simd_make_constant (x, true);
>> > +	  /* Not a valid const vector.  */
>> > +	  if (!gen_insn)
>> > +	    break;
>> >
>> > -      /* Fall through.  */
>> > +	  switch (GET_CODE (gen_insn))
>> > +	    {
>> > +	    case CONST_VECTOR:
>> > +	      /* Load using MOVI/MVNI.  */
>> > +	      if (aarch64_simd_valid_immediate (x, NULL))
>> > +		*cost += extra_cost->vect.movi;
>> > +	      else /* Load using constant pool.  */
>> > +		*cost += extra_cost->ldst.load;
>> > +	      break;
>> > +	    /* Load using a DUP.  */
>> > +	    case VEC_DUPLICATE:
>> > +	      *cost += extra_cost->vect.dup;
>> > +	      break;
>>
>> Does this trigger in practice?  The new check==true path (rightly) stops
>> the duplicated element from being forced into a register, but then I
>> would have expected:
>>
>> rtx
>> gen_vec_duplicate (machine_mode mode, rtx x)
>> {
>>   if (valid_for_const_vector_p (mode, x))
>>     return gen_const_vec_duplicate (mode, x);
>>   return gen_rtx_VEC_DUPLICATE (mode, x);
>> }
>>
>> to generate the original CONST_VECTOR again.
>
> Yes, but CSE is trying to see whether using a DUP is cheaper than
> another instruction.  Normal code won't hit this, but CSE is just
> costing all the different ways one can semantically construct a vector;
> which RTL actually comes out of it depends on how it's folded, as you
> say.

But what I mean is, you call:

  rtx gen_insn = aarch64_simd_make_constant (x, true);
  /* Not a valid const vector.  */
  if (!gen_insn)
    break;

where aarch64_simd_make_constant does:

  if (CONST_VECTOR_P (vals))
    const_vec = vals;
  else if (GET_CODE (vals) == PARALLEL)
    {
      /* A CONST_VECTOR must contain only CONST_INTs and
	 CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF).
	 Only store valid constants in a CONST_VECTOR.  */
      int n_elts = XVECLEN (vals, 0);
      for (i = 0; i < n_elts; ++i)
	{
	  rtx x = XVECEXP (vals, 0, i);
	  if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
	    n_const++;
	}
      if (n_const == n_elts)
	const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
    }
  else
    gcc_unreachable ();

  if (const_vec != NULL_RTX
      && aarch64_simd_valid_immediate (const_vec, NULL))
    /* Load using MOVI/MVNI.  */
    return const_vec;
  else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != NULL_RTX)
    /* Loaded using DUP.  */
    return const_dup;

and aarch64_simd_dup_constant does:

  machine_mode mode = GET_MODE (vals);
  machine_mode inner_mode = GET_MODE_INNER (mode);
  rtx x;

  if (!const_vec_duplicate_p (vals, &x))
    return NULL_RTX;

  /* We can load this constant by using DUP and a constant in a
     single ARM register.  This will be cheaper than a vector
     load.  */
  if (!check)
    x = copy_to_mode_reg (inner_mode, x);
  return gen_vec_duplicate (mode, x);

For the "check" case, "x" will be a constant, and so gen_vec_duplicate
will call gen_const_vec_duplicate, which will return a CONST_VECTOR.
It didn't seem to be possible for gen_insn to be a VEC_DUPLICATE.

This would be much simpler if we could call aarch64_simd_valid_immediate
and aarch64_simd_dup_constant directly from the rtx cost code, hence the
question about whether the PARALLEL stuff was really needed in practice.

>> > +	    default:
>> > +	      *cost += extra_cost->ldst.load;
>> > +	      break;
>> > +	    }
>> > +	  return true;
>> > +	}
>> > +    case VEC_CONCAT:
>> > +	/* depending on the operation, either DUP or INS.
>> > +	   For now, keep default costing.  */
>> > +	break;
>> > +    case VEC_DUPLICATE:
>> > +      *cost += extra_cost->vect.dup;
>> > +      return true;
>> > +    case VEC_SELECT:
>> > +	{
>> > +	  /* cost subreg of 0 as free, otherwise as DUP */
>> > +	  rtx op1 = XEXP (x, 1);
>> > +	  int nelts;
>> > +	  if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
>> > +	      || (BYTES_BIG_ENDIAN
>> > +		  && GET_MODE_NUNITS (mode).is_constant (&nelts)
>> > +		  && INTVAL (op1) == nelts - 1))
>> > +	    ;
>> > +	  else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
>> > +	    ;
>> > +	  else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
>> > +	    /* Selecting the high part is not technically free, but we
>> > +	       lack enough information to decide that here.  For
>> > +	       instance, selecting the high part of a vec_dup *is* free,
>> > +	       as is feeding it into any _high instruction; neither of
>> > +	       which we can really tell here.  That said, we have a
>> > +	       better chance to optimize a dup vs multiple constants.  */
>> > +	    ;
>>
>> Not sure about this.  We already try to detect the latter case (_high
>> instructions) via aarch64_strip_extend_vec_half.  We might be missing
>> some cases, but that still feels like the right way to go IMO.
>
> That's a different problem from what I understand.  What this is trying
> to say is that if you have a vector [x y a b] and you need the vector
> [x y], you can use the top part of the original vector for this.
>
> This is an approximation, because something that can be created with a
> movi is probably cheaper to keep distinct if it's not going to be
> paired with a _high operation (since you will have a dup then).
>
> The problem is that the front end has already split the two vectors
> into [x y a b] and [x y].  There's nothing else that tries to
> consolidate them back up if both survive.
>
> As a consequence of this, the testcase test0 is not handled optimally.
> It would instead create 2 vectors, both being movi 0x3, just one being
> 64-bit and one being 128-bit.
>
> So if the cost of selecting it is cheaper than the movi, CSE will not
> consolidate the vectors, and because movi's are so cheap, the only cost
> that worked was 0.  But increasing the cost of movi's requires the
> costs of everything to be increased (including loads).
>
> I preferred to zero out the cost, because the worst that can happen is
> a dup instead of a movi, and at best a dup instead of a load from a
> constant pool (if the constant is complicated).

Hmm, will need to look at this more tomorrow.

>> Selecting the high part of a vec_dup should get folded into another
>> vec_dup.
>>
>> The lowpart bits look OK, but which paths call this function without
>> first simplifying the select to a subreg?  The subreg is now the
>> canonical form (thanks to r12-2288).
>
> The simplification will happen during folding in cse or in combine.
> This costing happens before the folding, when CSE is trying to decide
> whether to undo the front end's lowering of constants.
>
> To do so it models the constants and the semantic operation required to
> extract them.  E.g. to get 2 out of [0 2 4 5] it would need a
> VEC_SELECT of 1.  And I don't treat the first element/bottom part as
> special here; costing-wise they would be the same.

But which code path creates the VEC_SELECT?  We don't need any context
to know that the VEC_SELECT is non-canonical.  It's obvious from the
operands of the VEC_SELECT in isolation.

I'd just rather tackle this at source than try to get the cost code to
handle non-canonical rtl.

Thanks,
Richard