Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Tuesday, August 31, 2021 5:07 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Marcus Shawcroft
>> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
>> and operations
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandif...@arm.com>
>> >> Sent: Tuesday, August 31, 2021 4:14 PM
>> >> To: Tamar Christina <tamar.christ...@arm.com>
>> >> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> >> <richard.earns...@arm.com>; Marcus Shawcroft
>> >> <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> >> <kyrylo.tkac...@arm.com>
>> >> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector
>> >> constants and operations
>> >>
>> >> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> > @@ -13936,8 +13937,65 @@ cost_plus:
>> >> >                       mode, MULT, 1, speed);
>> >> >           return true;
>> >> >         }
>> >> > +      break;
>> >> > +    case PARALLEL:
>> >> > +      /* Fall through */
>> >>
>> >> Which code paths lead to getting a PARALLEL here?
>> >
>> > Hi,
>> >
>> > Thanks for the review!
>> >
>> > I added it for completeness because CSE treats a parallel and
>> > CONST_VECTOR as equivalent when each entry in the parallel defines
>> > a constant.
>>
>> Could you test whether it ever triggers in practice though?
>> The code would be much simpler without it.
>
> Will check 😊
>
>> >> > +    case CONST_VECTOR:
>> >> > +      {
>> >> > +        rtx gen_insn = aarch64_simd_make_constant (x, true);
>> >> > +        /* Not a valid const vector.  */
>> >> > +        if (!gen_insn)
>> >> > +          break;
>> >> >
>> >> > -      /* Fall through.  */
>> >> > +        switch (GET_CODE (gen_insn))
>> >> > +          {
>> >> > +          case CONST_VECTOR:
>> >> > +            /* Load using MOVI/MVNI.  */
>> >> > +            if (aarch64_simd_valid_immediate (x, NULL))
>> >> > +              *cost += extra_cost->vect.movi;
>> >> > +            else /* Load using constant pool.  */
>> >> > +              *cost += extra_cost->ldst.load;
>> >> > +            break;
>> >> > +          /* Load using a DUP.  */
>> >> > +          case VEC_DUPLICATE:
>> >> > +            *cost += extra_cost->vect.dup;
>> >> > +            break;
>> >>
>> >> Does this trigger in practice?  The new check==true path (rightly)
>> >> stops the duplicated element from being forced into a register, but
>> >> then I would have expected:
>> >>
>> >> rtx
>> >> gen_vec_duplicate (machine_mode mode, rtx x)
>> >> {
>> >>   if (valid_for_const_vector_p (mode, x))
>> >>     return gen_const_vec_duplicate (mode, x);
>> >>   return gen_rtx_VEC_DUPLICATE (mode, x);
>> >> }
>> >>
>> >> to generate the original CONST_VECTOR again.
>> >
>> > Yes, but CSE is trying to see whether using a DUP is cheaper than
>> > another instruction.  Normal code won't hit this, but CSE is just
>> > costing all the different ways one can semantically construct a
>> > vector; which RTL actually comes out of it depends on how it's
>> > folded, as you say.
>>
>> But what I mean is, you call:
>>
>>   rtx gen_insn = aarch64_simd_make_constant (x, true);
>>   /* Not a valid const vector.  */
>>   if (!gen_insn)
>>     break;
>>
>> where aarch64_simd_make_constant does:
>>
>>   if (CONST_VECTOR_P (vals))
>>     const_vec = vals;
>>   else if (GET_CODE (vals) == PARALLEL)
>>     {
>>       /* A CONST_VECTOR must contain only CONST_INTs and
>>          CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF).
>>          Only store valid constants in a CONST_VECTOR.  */
>>       int n_elts = XVECLEN (vals, 0);
>>       for (i = 0; i < n_elts; ++i)
>>         {
>>           rtx x = XVECEXP (vals, 0, i);
>>           if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
>>             n_const++;
>>         }
>>       if (n_const == n_elts)
>>         const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
>>     }
>>   else
>>     gcc_unreachable ();
>>
>>   if (const_vec != NULL_RTX
>>       && aarch64_simd_valid_immediate (const_vec, NULL))
>>     /* Load using MOVI/MVNI.  */
>>     return const_vec;
>>   else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != NULL_RTX)
>>     /* Loaded using DUP.  */
>>     return const_dup;
>>
>> and aarch64_simd_dup_constant does:
>>
>>   machine_mode mode = GET_MODE (vals);
>>   machine_mode inner_mode = GET_MODE_INNER (mode);
>>   rtx x;
>>
>>   if (!const_vec_duplicate_p (vals, &x))
>>     return NULL_RTX;
>>
>>   /* We can load this constant by using DUP and a constant in a
>>      single ARM register.  This will be cheaper than a vector
>>      load.  */
>>   if (!check)
>>     x = copy_to_mode_reg (inner_mode, x);
>>   return gen_vec_duplicate (mode, x);
>>
>> For the “check” case, “x” will be a constant, and so gen_vec_duplicate
>> will call gen_const_vec_duplicate, which will return a CONST_VECTOR.
>> It didn't seem to be possible for gen_insn to be a VEC_DUPLICATE.
>>
> Yes, but CSE can ask the cost of a VEC_DUPLICATE directly on a register
> without going through gen_const_vec_duplicate, which is intended, as the
> gen_ functions can have side effects (e.g. creating new pseudos etc.).
>
> If say it sees a constant x and a vector [x x x x], it wants to know
> what the cost of keeping x and materializing [x x x x] vs doing a
> duplicate of x into [x x x x] is.
>
> In this case, since both the constant and the vector are needed, you
> won't get a constant there but a register, so you'll actually see a
> vec_dup.  If CSE pushes in the constant that would defeat the point 😊.
> Right now it's CSE that's pushing constants of vec_dup into
> vec_constants.
>
> My change is making it explicitly ask for the cost of doing this instead
> of assuming it is always cheaper, because for a large majority of cases
> it's not actually cheaper, and it is highly dependent on the target's
> ability to create said constant.
>
> So this hook will see both versions, the dup of the register and the
> vec_constant, while CSE is trying to decide which one to keep.
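For illustration, the situation being costed above can be sketched in plain C using GCC's generic vector extensions (this example is illustrative only; it is not from the patch or its testsuite, and the names are made up). Both the scalar constant and a vector splat of it are live, so CSE has to weigh materialising the vector constant directly (e.g. MOVI on AArch64) against duplicating the already-live scalar (DUP):

```c
#include <assert.h>

/* Hypothetical example: the scalar constant 3 and the vector constant
   {3, 3, 3, 3} are both needed.  CSE can either keep two independent
   constants (scalar move + MOVI) or keep the scalar and build the
   vector from it with a DUP; which is cheaper depends on the target.  */
typedef int v4si __attribute__ ((vector_size (16)));

v4si
make_vec (int *scalar_out)
{
  *scalar_out = 3;            /* the scalar constant 3 stays live */
  v4si v = { 3, 3, 3, 3 };    /* the same constant, splatted */
  return v;
}
```

The costing change discussed above is what lets the target report whether the splat (DUP of the live register) or the independent vector constant (MOVI) is the cheaper way to realise `v`.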
But the code I quoted above is from:

+      break;
+    case PARALLEL:
+      /* Fall through */
+    case CONST_VECTOR:
+      {
+        rtx gen_insn = aarch64_simd_make_constant (x, true);
+        /* Not a valid const vector.  */
+        if (!gen_insn)
+          break;

-      /* Fall through.  */
+        switch (GET_CODE (gen_insn))
+          {
+          case CONST_VECTOR:
+            /* Load using MOVI/MVNI.  */
+            if (aarch64_simd_valid_immediate (x, NULL))
+              *cost += extra_cost->vect.movi;
+            else /* Load using constant pool.  */
+              *cost += extra_cost->ldst.load;
+            break;
+          /* Load using a DUP.  */
+          case VEC_DUPLICATE:
+            *cost += extra_cost->vect.dup;
+            break;
+          default:
+            *cost += extra_cost->ldst.load;
+            break;
+          }
+        return true;
+      }

Here, CSE is passing in a PARALLEL or a CONST_VECTOR.  That rtx then gets
passed to aarch64_simd_make_constant.  We then switch based on the result
of aarch64_simd_make_constant, with a case statement for VEC_DUPLICATE.

So the code is handling a case in which aarch64_simd_make_constant
converts a PARALLEL or a CONST_VECTOR (passed by CSE) into a
VEC_DUPLICATE.  For the reasons above, that doesn't seem to be possible.
aarch64_simd_make_constant would return duplicated constants as a
CONST_VECTOR rather than a VEC_DUPLICATE.

It sounds like you're talking about the separate top-level VEC_DUPLICATE
case, which is obviously OK/needed.  Maybe it would be better to turn it
around and say: do you have a case in which the nested VEC_DUPLICATE case
above is reached?

>> This would be much simpler if we could call aarch64_simd_valid_immediate
>> and aarch64_simd_dup_constant directly from the rtx cost code,

BTW, I meant const_vec_duplicate_p here.  Sorry.

> Agreed... I tried to separate them before, but the logic was annoying to
> split and I thought not worth the effort, so instead I just changed it
> to have a checking-only mode.
>
>> hence the question about whether the PARALLEL stuff was really needed
>> in practice.
>>
>> >> > +          default:
>> >> > +            *cost += extra_cost->ldst.load;
>> >> > +            break;
>> >> > +          }
>> >> > +        return true;
>> >> > +      }
>> >> > +    case VEC_CONCAT:
>> >> > +      /* depending on the operation, either DUP or INS.
>> >> > +         For now, keep default costing.  */
>> >> > +      break;
>> >> > +    case VEC_DUPLICATE:
>> >> > +      *cost += extra_cost->vect.dup;
>> >> > +      return true;
>> >> > +    case VEC_SELECT:
>> >> > +      {
>> >> > +        /* cost subreg of 0 as free, otherwise as DUP */
>> >> > +        rtx op1 = XEXP (x, 1);
>> >> > +        int nelts;
>> >> > +        if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
>> >> > +            || (BYTES_BIG_ENDIAN
>> >> > +                && GET_MODE_NUNITS (mode).is_constant (&nelts)
>> >> > +                && INTVAL (op1) == nelts - 1))
>> >> > +          ;
>> >> > +        else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
>> >> > +          ;
>> >> > +        else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
>> >> > +          /* Selecting the high part is not technically free, but we
>> >> > +             lack enough information to decide that here.  For
>> >> > +             instance selecting the high part of a vec_dup *is* free,
>> >> > +             as is feeding it into any _high instruction.  Neither of
>> >> > +             which we can really tell.  That said, we have a better
>> >> > +             chance to optimize a dup vs multiple constants.  */
>> >> > +          ;
>> >>
>> >> Not sure about this.  We already try to detect the latter case (_high
>> >> instructions) via aarch64_strip_extend_vec_half.  We might be missing
>> >> some cases, but that still feels like the right way to go IMO.
>> >
>> > That's a different problem from what I understand.  What this is
>> > trying to say is that if you have a vector [x y a b] and you need
>> > vector [x y], you can use the top part of the original vector for
>> > this.
>> >
>> > This is an approximation, because something that can be created with a
>> > movi is probably cheaper to keep distinct if it's not going to be
>> > paired with a _high operation (since you will have a dup then).
>> >
>> > The problem is that the front end has already split the two vectors
>> > into [x y a b] and [x y].  There's nothing else that tries to
>> > consolidate them back up if both survive.
>> >
>> > As a consequence of this, the testcase test0 is not handled optimally.
>> > It would instead create 2 vectors, both of movi 0x3, just one being
>> > 64-bits and one being 128-bits.
>> >
>> > So if the cost of selecting it is cheaper than the movi, CSE will not
>> > consolidate the vectors, and because movi's are so cheap, the only
>> > cost that worked was 0.  But increasing the costs of movi's requires
>> > the costs of everything to be increased (including loads).
>> >
>> > I preferred to 0 out the cost, because the worst that can happen is a
>> > dup instead of a movi, and at best a dup instead of a load from a
>> > pool (if the constant is complicated).
>>
>> Hmm, will need to look at this more tomorrow.
>>
>> >> Selecting the high part of a vec_dup should get folded into another
>> >> vec_dup.
>> >>
>> >> The lowpart bits look OK, but which paths call this function without
>> >> first simplifying the select to a subreg?  The subreg is now the
>> >> canonical form (thanks to r12-2288).
>> >
>> > The simplification will happen during folding in cse or in combine.
>> > This costing happens before the folding, when CSE is trying to decide
>> > whether to undo the front end's lowering of constants.
>> >
>> > To do so it models the constants and the semantic operation required
>> > to extract them.  E.g. to get 2 out of [0 2 4 5] it would need a
>> > VEC_SELECT of 1.  And I don't treat the first element/bottom part
>> > special here.  Costing-wise they would be the same.
>>
>> But which code path creates the VEC_SELECT?  We don't need any context
>> to know that the VEC_SELECT is non-canonical.  It's obvious from the
>> operands of the VEC_SELECT in isolation.
>
> The non-canonical RTL is never generated.  I assume we're talking about
> the 0 case here, since subregs can't select arbitrary elements (as I
> asked before).
>
> For the 0 case it's only temporarily modelled as such to keep the CSE
> alternative costing simple.  Currently it's just a for loop for i = 0
> to vec_elems.

Ah, sorry, I see now that you're talking about the 1/2 patch.
I looked at this one first :-)

> When it comes time to generate the actual insn, fold_rtx is called,
> which will fold the VEC_SELECT into a subreg.
>
> So it's never emitted into the instruction stream in its non-canonical
> form.
>
>> I'd just rather tackle this at source than try to get the cost code to
>> handle non-canonical rtl.
>
> If that's what is preferred I can change the CSE patch to generate a
> subreg for the 0 case.  I'm not sure I agree with it, as CSE is just
> trying to ask "what is the cost of selecting element 0 in this vector".
> And as I mentioned before, it never emits the instruction unfolded.
> This representation seems a more logical representation for costing to
> me.

I think it's better to cost what we intend to generate.  Otherwise each
target needs to handle both forms: “CSE asks about this, but actually
intends to generate that instead”.

> It's however unfortunate that there's only one costing callback; as far
> as CSE is concerned the representation/form doesn't matter, it's just
> looking at the high-level operation.
>
> Or is the concern here that most targets will have costing for subreg 0
> but not VEC_SELECT?  In which case, without actually handling the costs
> of the other operations, the CSE changes won't do anything for targets
> anyway.  And it would be odd for a target to cost VEC_SELECT from 1 to
> <N> instead of just costing 0 too.

Well, even for the motivating target (aarch64), we had to make changes
to treat index 0 as especially cheap.  That's likely to be necessary on
other targets too, if they want to take advantage of this.  The for loop
exists because the index matters.
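The lowpart/highpart distinction being debated can be sketched in plain C using GCC's generic vector extensions (illustrative only; not from the patch or its testsuite, and the function names are made up). Extracting the low half of a 128-bit vector is just a subreg of the register on little-endian AArch64, so it is genuinely free; extracting the high half generally needs a real instruction:

```c
#include <assert.h>

/* Hypothetical sketch: on little-endian AArch64, low_half reads the
   bottom 64 bits of the 128-bit register (a subreg, no instruction),
   while high_half needs an extract (e.g. DUP on the high lane or EXT),
   which is why costing the two the same way is questionable.  */
typedef int v4si __attribute__ ((vector_size (16)));
typedef int v2si __attribute__ ((vector_size (8)));

v2si
low_half (v4si x)
{
  v2si r = { x[0], x[1] };   /* lowpart: just a subreg */
  return r;
}

v2si
high_half (v4si x)
{
  v2si r = { x[2], x[3] };   /* highpart: needs a real extract */
  return r;
}
```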
I'm still a bit sceptical about treating the high-part cost as lower.
ISTM that the subreg cases are the ones that are truly “free” and any
others should have a normal cost.  So if CSE handled the subreg case
itself (to model how the rtx would actually be generated) then aarch64
code would have to do less work.  I imagine that will be true for other
targets as well.

Thanks,
Richard