> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Tuesday, July 8, 2025 10:07 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Kyrylo Tkachov <ktkac...@nvidia.com>; GCC Patches <gcc-
> patc...@gcc.gnu.org>; Richard Earnshaw <richard.earns...@arm.com>; Alex
> Coplan <alex.cop...@arm.com>; Andrew Pinski <pins...@gmail.com>
> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <richard.sandif...@arm.com>
> >> Sent: Monday, July 7, 2025 12:55 PM
> >> To: Kyrylo Tkachov <ktkac...@nvidia.com>
> >> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Earnshaw
> >> <richard.earns...@arm.com>; Alex Coplan <alex.cop...@arm.com>; Andrew
> >> Pinski <pins...@gmail.com>
> >> Subject: Re: [PATCH 3/7] aarch64: Handle DImode BCAX operations
> >>
> >> Richard Sandiford <richard.sandif...@arm.com> writes:
> >> > Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> >> >> Hi all,
> >> >>
> >> >> To handle DImode BCAX operations we want to do them on the SIMD side
> >> >> only if the incoming arguments don't require a cross-bank move.
> >> >> This means we need to split back the combination to separate GP BIC+EOR
> >> >> instructions if the operands are expected to be in GP regs through 
> >> >> reload.
> >> >> The split happens pre-reload if we already know that the destination 
> >> >> will be
> >> >> a GP reg. Otherwise if reload decides to use the "=r,r" alternative
> >> >> we ensure operand 0 is early-clobber.
> >> >> This scheme is similar to how we handle the BSL operations elsewhere in
> >> >> aarch64-simd.md.
> >> >>
> >> >> Thus, for the functions:
> >> >> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
> >> >> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
> >> >>
> >> >> we now generate the desired:
> >> >> bcax_d_gp:
> >> >> bic x1, x1, x2
> >> >> eor x0, x1, x0
> >> >> ret
> >> >>
> >> >> bcax_d:
> >> >> bcax v0.16b, v0.16b, v1.16b, v2.16b
> >> >> ret
> >> >>
> >> >> When the inputs are in SIMD regs we use BCAX and when they are in GP
> >> >> regs we don't force them to SIMD with extra moves.
> >> >>
> >> >> Bootstrapped and tested on aarch64-none-linux-gnu.
> >> >> Ok for trunk?
> >> >> Thanks,
> >> >> Kyrill
> >> >>
> >> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> >> >>
> >> >> gcc/
> >> >>
> >> >>         * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
> >> >>         define_insn_and_split.
> >> >>
> >> >> gcc/testsuite/
> >> >>
> >> >>         * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode 
> >> >> arguments.
> >> >>
> >> >> From 95268cff1261a7724190dd291f9fcb5a7c817917 Mon Sep 17 00:00:00 2001
> >> >> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> >> >> Date: Thu, 3 Jul 2025 09:45:02 -0700
> >> >> Subject: [PATCH 3/7] aarch64: Handle DImode BCAX operations
> >> >>
> >> >> To handle DImode BCAX operations we want to do them on the SIMD side
> >> >> only if the incoming arguments don't require a cross-bank move.
> >> >> This means we need to split back the combination to separate GP BIC+EOR
> >> >> instructions if the operands are expected to be in GP regs through 
> >> >> reload.
> >> >> The split happens pre-reload if we already know that the destination 
> >> >> will be
> >> >> a GP reg.  Otherwise if reload decides to use the "=r,r" alternative
> >> >> we ensure operand 0 is early-clobber.
> >> >> This scheme is similar to how we handle the BSL operations elsewhere in
> >> >> aarch64-simd.md.
> >> >>
> >> >> Thus, for the functions:
> >> >> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
> >> >> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
> >> >>
> >> >> we now generate the desired:
> >> >> bcax_d_gp:
> >> >>         bic     x1, x1, x2
> >> >>         eor     x0, x1, x0
> >> >>         ret
> >> >>
> >> >> bcax_d:
> >> >>         bcax    v0.16b, v0.16b, v1.16b, v2.16b
> >> >>         ret
> >> >>
> >> >> When the inputs are in SIMD regs we use BCAX and when they are in GP
> >> >> regs we don't force them to SIMD with extra moves.
> >> >>
> >> >> Bootstrapped and tested on aarch64-none-linux-gnu.
> >> >>
> >> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> >> >>
> >> >> gcc/
> >> >>
> >> >>         * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
> >> >>         define_insn_and_split.
> >> >>
> >> >> gcc/testsuite/
> >> >>
> >> >>         * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode 
> >> >> arguments.
> >> >> ---
> >> >>  gcc/config/aarch64/aarch64-simd.md            | 29 +++++++++++++++++++
> >> >>  .../gcc.target/aarch64/simd/bcax_d.c          |  6 +++-
> >> >>  2 files changed, 34 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> >> >> index 4493e55603d..be6a16b4be8 100644
> >> >> --- a/gcc/config/aarch64/aarch64-simd.md
> >> >> +++ b/gcc/config/aarch64/aarch64-simd.md
> >> >> @@ -9252,6 +9252,35 @@
> >> >>    [(set_attr "type" "crypto_sha3")]
> >> >>  )
> >> >>
> >> >> +(define_insn_and_split "*bcaxqdi4"
> >> >> +  [(set (match_operand:DI 0 "register_operand" "=w,&r")
> >> >> +       (xor:DI
> >> >> +         (and:DI
> >> >> +           (not:DI (match_operand:DI 3 "register_operand" "w,r"))
> >> >> +           (match_operand:DI 2 "register_operand" "w,r"))
> >> >> +         (match_operand:DI 1 "register_operand" "w,r")))]
> >> >
> >> > I think the constraint on operand 1 should be "w,r0", so that we allow
> >> > operand 1 to be the same as operand 0.  Without that, and with split1
> >> > disabled/sidelined, we would end up with an extra move for:
> >> >
> >> >   uint64_t f(uint64_t x0, uint64_t x1, uint64_t x2) {
> >> >     return x0 ^ (x1 & ~x2);
> >> >   }
> >> >
> >> > (The only reason split1 avoids the extra move is that combine combines
> >> > the hard register copy into the *bcaxqdi4, which is a bit dubious from
> >> > an RA perspective.)
> >>
> >> Sigh.  Wrong way round, of course: it's operands 2 and 3 that can be 
> >> "w,r0".
> >>
> >
> > Question for my own understanding. From an RA perspective can the tie end up
> > with the same cost as the r? I was wondering whether w,0r or w,r0 makes a
> > difference.
> 
> Hmm, good question.  It turns out that the costings for "0r" and "r0"
> are different: the costing for "0r" sums both the "0" costs and the "r"
> costs, whereas the costing for "r0" is the same as for "r".
> (See ira-costs.cc:record_reg_classes, where '0'-'9' are handled by
> an "if" statement and the other constraints are handled by a following
> "while" statement.)
> 
> "0" is costed based on the register class of operand 0, so effectively
> in the same way as "r".  So I think the effect of costing both "0" and
> "r" in "0r" would be double-counting.
> 
> However, the costs of allocating a GPR to "r" (or to a "0" bound to "=r")
> are 0, so if the alternative is considered in isolation, the double
> counting would increase the cost of non-GPR classes while not increasing
> the cost of GPR classes.
> 
> When an instruction has multiple alternatives, the final cost for each
> class is the minimum cost for that class over all alternatives.  So if
> the "=r"/"0r" alternative is alongside a "=w"/"w" alternative, the FPR
> costs would be taken purely from the "=w"/"w" alternative, and any
> double counting in the other alternative would have no effect.
> And I don't think there are any allocatable classes outside "r"
> and "w" that can store integers.
> 
> Still, I think that does mean that "r0" suggested above would be better
> than "0r".  It should give the same class costs as just "r".
> 

That's a surprising difference! Thanks for explaining.  Yeah I agree that r0
is better.

Thanks,
Tamar

> Thanks,
> Richard
