Richard Sandiford <richard.sandif...@arm.com> writes:
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hi all,
>>
>> To handle DImode BCAX operations we want to do them on the SIMD side only if
>> the incoming arguments don't require a cross-bank move.
>> This means we need to split back the combination to separate GP BIC+EOR
>> instructions if the operands are expected to be in GP regs through reload.
>> The split happens pre-reload if we already know that the destination will be
>> a GP reg.  Otherwise, if reload decides to use the "=r,r" alternative, we
>> mark operand 0 as early-clobber.
>> This scheme is similar to how we handle the BSL operations elsewhere in
>> aarch64-simd.md.
>>
>> Thus, for the functions:
>> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
>> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
>>
>> we now generate the desired:
>> bcax_d_gp:
>> bic x1, x1, x2
>> eor x0, x1, x0
>> ret
>>
>> bcax_d:
>> bcax v0.16b, v0.16b, v1.16b, v2.16b
>> ret
>>
>> When the inputs are in SIMD regs we use BCAX and when they are in GP regs we
>> don't force them to SIMD with extra moves.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for trunk?
>> Thanks,
>> Kyrill
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>>      * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
>>      define_insn_and_split.
>>
>> gcc/testsuite/
>>
>>      * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
>>
>> From 95268cff1261a7724190dd291f9fcb5a7c817917 Mon Sep 17 00:00:00 2001
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Date: Thu, 3 Jul 2025 09:45:02 -0700
>> Subject: [PATCH 3/7] aarch64: Handle DImode BCAX operations
>>
>> To handle DImode BCAX operations we want to do them on the SIMD side only if
>> the incoming arguments don't require a cross-bank move.
>> This means we need to split back the combination to separate GP BIC+EOR
>> instructions if the operands are expected to be in GP regs through reload.
>> The split happens pre-reload if we already know that the destination will be
>> a GP reg.  Otherwise, if reload decides to use the "=r,r" alternative, we
>> mark operand 0 as early-clobber.
>> This scheme is similar to how we handle the BSL operations elsewhere in
>> aarch64-simd.md.
>>
>> Thus, for the functions:
>> uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
>> uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
>>
>> we now generate the desired:
>> bcax_d_gp:
>>         bic     x1, x1, x2
>>         eor     x0, x1, x0
>>         ret
>>
>> bcax_d:
>>         bcax    v0.16b, v0.16b, v1.16b, v2.16b
>>         ret
>>
>> When the inputs are in SIMD regs we use BCAX and when they are in GP regs we
>> don't force them to SIMD with extra moves.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>>      * config/aarch64/aarch64-simd.md (*bcaxqdi4): New
>>      define_insn_and_split.
>>
>> gcc/testsuite/
>>
>>      * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
>> ---
>>  gcc/config/aarch64/aarch64-simd.md            | 29 +++++++++++++++++++
>>  .../gcc.target/aarch64/simd/bcax_d.c          |  6 +++-
>>  2 files changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 4493e55603d..be6a16b4be8 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -9252,6 +9252,35 @@
>>    [(set_attr "type" "crypto_sha3")]
>>  )
>>  
>> +(define_insn_and_split "*bcaxqdi4"
>> +  [(set (match_operand:DI 0 "register_operand" "=w,&r")
>> +    (xor:DI
>> +      (and:DI
>> +        (not:DI (match_operand:DI 3 "register_operand" "w,r"))
>> +        (match_operand:DI 2 "register_operand" "w,r"))
>> +      (match_operand:DI 1 "register_operand" "w,r")))]
>
> I think the constraint on operand 1 should be "w,r0", so that we allow
> operand 1 to be the same as operand 0.  Without that, and with split1
> disabled/sidelined, we would end up with an extra move for:
>
>   uint64_t f(uint64_t x0, uint64_t x1, uint64_t x2) {
>     return x0 ^ (x1 & ~x2);
>   }
>
> (The only reason split1 avoids the extra move is that combine combines
> the hard register copy into the *bcaxqdi4, which is a bit dubious from
> an RA perspective.)

Sigh.  Wrong way round, of course: it's operands 2 and 3 that can be "w,r0".

Richard
