> On 8 Jul 2025, at 17:43, Richard Sandiford <richard.sandif...@arm.com> wrote:
>
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Thanks for your comments, do you mean something like the following?
>
> Yeah, the patch LGTM, thanks.
So it turned out that doing this in the EOR3 pattern in patch 4/7 caused
wrong code in 531.deepsjeng_r:

  [(set (match_operand:DI 0 "register_operand")
	(xor:DI
	  (xor:DI
	    (match_operand:DI 2 "register_operand")
	    (match_operand:DI 3 "register_operand"))
	  (match_operand:DI 1 "register_operand")))]
  "TARGET_SHA3"
  {@ [ cons: =0, 1, 2 , 3  ; attrs: type ]
     [ w       , w, w , w  ; crypto_sha3 ] eor3\t%0.16b, %1.16b, %2.16b, %3.16b
     [ &r      , r, r0, r0 ; multiple    ] #
  }
  "&& REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
  [(set (match_dup 4) (xor:DI (match_dup 2) (match_dup 3)))
   (set (match_dup 0) (xor:DI (match_dup 4) (match_dup 1)))]
  {
    if (reload_completed)
      operands[4] = operands[0];
    else if (can_create_pseudo_p ())
      operands[4] = gen_reg_rtx (DImode);
    else
      FAIL;
  }

Using just “=&r,r,r,r” rather than “=&r,r,r0,r0” worked fine.
I’ll need to dive deeper into the details, but if something stands out to
you it’d be great to hear.
Thanks,
Kyrill

>
> Richard
>
>> Or do you mean to have separate alternatives with each one individually
>> tying one of operands 2 or 3 to r0?
>>
>> Kyrill
>>
>>
>>>
>>> Thanks,
>>> Tamar
>>>
>>>> Thanks,
>>>> Richard
>>
>>
>> From 9e67b44d9ff111b0f280d7f3fe2c197aa7dabc94 Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Date: Thu, 3 Jul 2025 09:45:02 -0700
>> Subject: [PATCH 3/7] aarch64: Handle DImode BCAX operations
>>
>> To handle DImode BCAX operations we want to do them on the SIMD side
>> only if the incoming arguments don't require a cross-bank move.
>> This means we need to split the combination back into separate GP
>> BIC+EOR instructions if the operands are expected to be in GP regs
>> through reload.  The split happens pre-reload if we already know that
>> the destination will be a GP reg.  Otherwise, if reload decides to use
>> the "=r,r" alternative, we ensure operand 0 is early-clobber.
>> This scheme is similar to how we handle the BSL operations elsewhere
>> in aarch64-simd.md.
>>
>> Thus, for the functions:
>>
>> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
>> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
>>
>> we now generate the desired:
>>
>> bcax_d_gp:
>>         bic     x1, x1, x2
>>         eor     x0, x1, x0
>>         ret
>>
>> bcax_d:
>>         bcax    v0.16b, v0.16b, v1.16b, v2.16b
>>         ret
>>
>> When the inputs are in SIMD regs we use BCAX and when they are in GP regs
>> we don't force them to SIMD with extra moves.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>>	* config/aarch64/aarch64-simd.md (*bcaxqdi4): New
>>	define_insn_and_split.
>>
>> gcc/testsuite/
>>
>>	* gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
>> ---
>>  gcc/config/aarch64/aarch64-simd.md             | 29 +++++++++++++++++++
>>  .../gcc.target/aarch64/simd/bcax_d.c           |  6 +++-
>>  2 files changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>> index 4493e55603d..270cb2ff3a1 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -9252,6 +9252,35 @@
>>    [(set_attr "type" "crypto_sha3")]
>>  )
>>
>> +(define_insn_and_split "*bcaxqdi4"
>> +  [(set (match_operand:DI 0 "register_operand")
>> +	(xor:DI
>> +	  (and:DI
>> +	    (not:DI (match_operand:DI 3 "register_operand"))
>> +	    (match_operand:DI 2 "register_operand"))
>> +	  (match_operand:DI 1 "register_operand")))]
>> +  "TARGET_SHA3"
>> +  {@ [ cons: =0, 1, 2 , 3  ; attrs: type ]
>> +     [ w       , w, w , w  ; crypto_sha3 ] bcax\t%0.16b, %1.16b, %2.16b, %3.16b
>> +     [ &r      , r, r0, r0 ; multiple    ] #
>> +  }
>> +  "&& REG_P (operands[0]) && GP_REGNUM_P (REGNO (operands[0]))"
>> +  [(set (match_dup 4)
>> +	(and:DI (not:DI (match_dup 3))
>> +		(match_dup 2)))
>> +   (set (match_dup 0)
>> +	(xor:DI (match_dup 4)
>> +		(match_dup 1)))]
>> +  {
>> +    if (reload_completed)
>> +      operands[4] = operands[0];
>> +    else if (can_create_pseudo_p ())
>> +      operands[4] = gen_reg_rtx (DImode);
>> +    else
>> +      FAIL;
>> +  }
>> +)
>> +
>>  ;; SM3
>>
>>  (define_insn "aarch64_sm3ss1qv4si"
>> diff --git a/gcc/testsuite/gcc.target/aarch64/simd/bcax_d.c b/gcc/testsuite/gcc.target/aarch64/simd/bcax_d.c
>> index d68f0e102bf..a7640c3f6f1 100644
>> --- a/gcc/testsuite/gcc.target/aarch64/simd/bcax_d.c
>> +++ b/gcc/testsuite/gcc.target/aarch64/simd/bcax_d.c
>> @@ -7,9 +7,13 @@
>>
>>  #define BCAX(x,y,z) ((x) ^ ((y) & ~(z)))
>>
>> +/* When the inputs come from GP regs don't form a BCAX.  */
>> +uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
>> +
>> +uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
>>  uint32x2_t bcax_s (uint32x2_t a, uint32x2_t b, uint32x2_t c) { return BCAX (a, b, c); }
>>  uint16x4_t bcax_h (uint16x4_t a, uint16x4_t b, uint16x4_t c) { return BCAX (a, b, c); }
>>  uint8x8_t bcax_b (uint8x8_t a, uint8x8_t b, uint8x8_t c) { return BCAX (a, b, c); }
>>
>> -/* { dg-final { scan-assembler-times {bcax\tv0.16b, v0.16b, v1.16b, v2.16b} 3 } } */
>> +/* { dg-final { scan-assembler-times {bcax\tv0.16b, v0.16b, v1.16b, v2.16b} 4 } } */