> -----Original Message----- > From: Richard Sandiford <richard.sandif...@arm.com> > Sent: Tuesday, July 8, 2025 10:07 AM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: Kyrylo Tkachov <ktkac...@nvidia.com>; GCC Patches <gcc- > patc...@gcc.gnu.org>; Richard Earnshaw <richard.earns...@arm.com>; Alex > Coplan <alex.cop...@arm.com>; Andrew Pinski <pins...@gmail.com> > Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations > > Tamar Christina <tamar.christ...@arm.com> writes: > >> -----Original Message----- > >> From: Richard Sandiford <richard.sandif...@arm.com> > >> Sent: Monday, July 7, 2025 12:55 PM > >> To: Kyrylo Tkachov <ktkac...@nvidia.com> > >> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Earnshaw > >> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew > >> Pinski <pins...@gmail.com> > >> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations > >> > >> Richard Sandiford <richard.sandif...@arm.com> writes: > >> > Kyrylo Tkachov <ktkac...@nvidia.com> writes: > >> >> Hi all, > >> >> > >> >> To handle DImode BCAX operations we want to do them on the SIMD side > only > >> if > >> >> the incoming arguments don't require a cross-bank move. > >> >> This means we need to split back the combination to separate GP BIC+EOR > >> >> instructions if the operands are expected to be in GP regs through > >> >> reload. > >> >> The split happens pre-reload if we already know that the destination > >> >> will be > >> >> a GP reg. Otherwise if reload descides to use the "=r,r" alternative we > >> >> ensure > >> >> operand 0 is early-clobber. > >> >> This scheme is similar to how we handle the BSL operations elsewhere in > >> >> aarch64-simd.md. > >> >> > >> >> Thus, for the functions: > >> >> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX > >> >> (a, b, > c); } > >> >> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return > >> >> BCAX > (a, > >> b, c); } > >> >> > >> >> we now generate the desired: > >> >> bcax_d_gp: > >> >> bic x1, x1, x2 > >> >> eor x0, x1, x0 > >> >> ret > >> >> > >> >> bcax_d: > >> >> bcax v0.16b, v0.16b, v1.16b, v2.16b > >> >> ret > >> >> > >> >> When the inputs are in SIMD regs we use BCAX and when they are in GP > >> >> regs > we > >> >> don't force them to SIMD with extra moves. > >> >> > >> >> Bootstrapped and tested on aarch64-none-linux-gnu. > >> >> Ok for trunk? > >> >> Thanks, > >> >> Kyrill > >> >> > >> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > >> >> > >> >> gcc/ > >> >> > >> >> * config/aarch64/aarch64-simd.md (*bcaxqdi4): New > >> >> define_insn_and_split. > >> >> > >> >> gcc/testsuite/ > >> >> > >> >> * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode > >> >> arguments. > >> >> > >> >> From 95268cff1261a7724190dd291f9fcb5a7c817917 Mon Sep 17 > 00:00:00 > >> 2001 > >> >> From: Kyrylo Tkachov <ktkac...@nvidia.com> > >> >> Date: Thu, 3 Jul 2025 09:45:02 -0700 > >> >> Subject: [PATCH 3/7] aarch64: Handle DImode BCAX operations > >> >> > >> >> To handle DImode BCAX operations we want to do them on the SIMD side > only > >> if > >> >> the incoming arguments don't require a cross-bank move. > >> >> This means we need to split back the combination to separate GP BIC+EOR > >> >> instructions if the operands are expected to be in GP regs through > >> >> reload. > >> >> The split happens pre-reload if we already know that the destination > >> >> will be > >> >> a GP reg. Otherwise if reload descides to use the "=r,r" alternative > >> >> we ensure > >> >> operand 0 is early-clobber. > >> >> This scheme is similar to how we handle the BSL operations elsewhere in > >> >> aarch64-simd.md. > >> >> > >> >> Thus, for the functions: > >> >> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX > >> >> (a, b, > c); } > >> >> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return > >> >> BCAX > (a, > >> b, c); } > >> >> > >> >> we now generate the desired: > >> >> bcax_d_gp: > >> >> bic x1, x1, x2 > >> >> eor x0, x1, x0 > >> >> ret > >> >> > >> >> bcax_d: > >> >> bcax v0.16b, v0.16b, v1.16b, v2.16b > >> >> ret > >> >> > >> >> When the inputs are in SIMD regs we use BCAX and when they are in GP > >> >> regs > we > >> >> don't force them to SIMD with extra moves. > >> >> > >> >> Bootstrapped and tested on aarch64-none-linux-gnu. > >> >> > >> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > >> >> > >> >> gcc/ > >> >> > >> >> * config/aarch64/aarch64-simd.md (*bcaxqdi4): New > >> >> define_insn_and_split. > >> >> > >> >> gcc/testsuite/ > >> >> > >> >> * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode > >> >> arguments. > >> >> --- > >> >> gcc/config/aarch64/aarch64-simd.md | 29 +++++++++++++++++++ > >> >> .../gcc.target/aarch64/simd/bcax_d.c | 6 +++- > >> >> 2 files changed, 34 insertions(+), 1 deletion(-) > >> >> > >> >> diff --git a/gcc/config/aarch64/aarch64-simd.md > >> b/gcc/config/aarch64/aarch64-simd.md > >> >> index 4493e55603d..be6a16b4be8 100644 > >> >> --- a/gcc/config/aarch64/aarch64-simd.md > >> >> +++ b/gcc/config/aarch64/aarch64-simd.md > >> >> @@ -9252,6 +9252,35 @@ > >> >> [(set_attr "type" "crypto_sha3")] > >> >> ) > >> >> > >> >> +(define_insn_and_split "*bcaxqdi4" > >> >> + [(set (match_operand:DI 0 "register_operand" "=w,&r") > >> >> + (xor:DI > >> >> + (and:DI > >> >> + (not:DI (match_operand:DI 3 "register_operand" "w,r")) > >> >> + (match_operand:DI 2 "register_operand" "w,r")) > >> >> + (match_operand:DI 1 "register_operand" "w,r")))] > >> > > >> > I think the constraint on operand 1 should be "w,r0", so that we allow > >> > operand 1 to be the same as operand 0. Without that, and with split1 > >> > disabled/sidelined, we would end up with an extra move for: > >> > > >> > uint64_t f(uint64_t x0, uint64_t x1, uint64_t x2) { > >> > return x0 ^ (x1 & ~x2); > >> > } > >> > > >> > (The only reason split1 avoids the extra move is that combine combines > >> > the hard register copy into the *bcaxqdi4, which is a bit dubious from > >> > an RA perspective.) > >> > >> Sigh. Wrong way round, of course: it's operands 2 and 3 that can be > >> "w,r0". > >> > > > > Question for my own understanding. From an RA perspective can the tie end up > > with the same cost as the r? I was wondering whether w,0r or w,r0 makes a > difference. > > Hmm, good question. It turns out that the costings for "0r" and "r0" > are different: the costing for "0r" sums both the "0" costs and the "r" > costs, whereas the costing for "r0" in the same as for "r". > (See ira-costs.cc:record_reg_classes, where '0'-'9' are handled by > an "if" statement and the other constraints are handled by a following > "while" statement.) > > "0" is costed based on the register class of operand 0, so effectively > in the same way as "r". So I think the effect of costing both "0" and > "r" in "0r" would be double-counting. > > However, the costs of allocating a GPR to "r" (or to a "0" bound to "=r") > are 0, so if the alternative is considered in isolation, the double > counting would increase the cost of non-GPR classes while not increasing > the cost of GPR classes. > > When an instruction has multiple alternatives, the final cost for each > class is the minimum cost for that class over all alternatives. So if > the "=r"/"0r" alternative is alongside a "=w"/"w" alternative, the FPR > costs would be taken purely from the "=w"/"w" alternative, and any > double counting in the other alternative would have no effect. > And I don't think there are any allocatable classes outside "r" > and "w" that can store integers. > > Still, I think that does mean that "r0" suggested above would be better > than "0r". It should give the same class costs as just "r". >
That's a surprising difference! Thanks for explaining. Yeah I agree that r0 is better. Thanks, Tamar > Thanks, > Richard