Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector

2020-11-05 Thread Christophe Lyon via Gcc-patches
On Tue, 3 Nov 2020 at 12:17, Dennis Zhang via Gcc-patches wrote:
>
> Hi Richard,
>
> On 10/30/20 2:07 PM, Richard Sandiford wrote:
> > Dennis Zhang  writes:
> >> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
> >> b/gcc/config/aarch64/aarch64-simd-builtins.def
> >> index 332a0b6b1ea..39ebb776d1d 100644
> >> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> >> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> >> @@ -719,6 +719,9 @@
> >> VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf)
> >> VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf)
> >>
> >> +  /* Implemented by aarch64_vget_halfv8bf.  */
> >> +  VAR1 (GETREG, vget_half, 0, ALL, v8bf)
> >
> > This should be AUTO_FP, since it doesn't have any side-effects.
> > (As before, we should probably rename the flag, but that's separate work.)
> >
> >> +
> >> /* Implemented by aarch64_simd_mmlav16qi.  */
> >> VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi)
> >> VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi)
> >> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> >> b/gcc/config/aarch64/aarch64-simd.md
> >> index 9f0e2bd1e6f..f62c52ca327 100644
> >> --- a/gcc/config/aarch64/aarch64-simd.md
> >> +++ b/gcc/config/aarch64/aarch64-simd.md
> >> @@ -7159,6 +7159,19 @@
> >> [(set_attr "type" "neon_dot")]
> >>   )
> >>
> >> +;; vget_low/high_bf16
> >> +(define_expand "aarch64_vget_halfv8bf"
> >> +  [(match_operand:V4BF 0 "register_operand")
> >> +   (match_operand:V8BF 1 "register_operand")
> >> +   (match_operand:SI 2 "aarch64_zero_or_1")]
> >> +  "TARGET_BF16_SIMD"
> >> +{
> >> +  int hbase = INTVAL (operands[2]);
> >> +  rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1);
> >
> > I think this needs to be:
> >
> >aarch64_simd_vect_par_cnst_half
> >
> > instead.  The issue is that on big-endian targets, GCC assumes vector
> > lane 0 is in the high part of the register, whereas for AArch64 it's
> > always in the low part of the register.  So we convert from AArch64
> > numbering to GCC numbering when generating the rtx and then take
> > endianness into account when matching the rtx later.
> >
> > It would be good to have -mbig-endian tests that make sure we generate
> > the right instruction for each function (i.e. we get them the right way
> > round).  I guess it would be good to test that for little-endian too.
> >
>
> I've updated the expander using aarch64_simd_vect_par_cnst_half.
> And the expander is divided into two for getting the low and high halves
> separately.
> It's tested for aarch64-none-linux-gnu and aarch64_be-none-linux-gnu
> targets with new tests including the -mbig-endian option.
>

Hi,

When testing with an x86_64 -> aarch64-none-linux-gnu cross compiler, the
new big-endian test fails:
FAIL: gcc.target/aarch64/advsimd-intrinsics/bf16_get-be.c   -O0  (test
for excess errors)
Excess errors:
/aci-gcc-fsf/builds/gcc-fsf-gccsrc/sysroot-aarch64-none-linux-gnu/usr/include/gnu/stubs.h:11:11:
fatal error: gnu/stubs-lp64_be.h: No such file or directory
compilation terminated.

What am I missing, since it works for you?

Thanks

Christophe

> >> +  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel));
> >> +  DONE;
> >> +})
> >> +
> >>   ;; bfmmla
> >>   (define_insn "aarch64_bfmmlaqv4sf"
> >> [(set (match_operand:V4SF 0 "register_operand" "=w")
> >> diff --git a/gcc/config/aarch64/predicates.md 
> >> b/gcc/config/aarch64/predicates.md
> >> index 215fcec5955..0c8bc2b0c73 100644
> >> --- a/gcc/config/aarch64/predicates.md
> >> +++ b/gcc/config/aarch64/predicates.md
> >> @@ -84,6 +84,10 @@
> >>   (ior (match_test "op == constm1_rtx")
> >>(match_test "op == const1_rtx"))
> >>
> >> +(define_predicate "aarch64_zero_or_1"
> >> +  (and (match_code "const_int")
> >> +   (match_test "op == const0_rtx || op == const1_rtx")))
> >
> > zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one.
> > But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1,
> > so let's keep it as-is.
> >
>
> This predicate is removed since the immediate operand is no longer
> needed in the new expanders.
>
> Thanks for the reviews.
> Is it OK for trunk now?
>
> Cheers
> Dennis
>
>


Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector

2020-11-03 Thread Dennis Zhang via Gcc-patches

On 11/3/20 2:05 PM, Richard Sandiford wrote:
> Dennis Zhang  writes:
> [...]
>> Thanks for the reviews.
>> Is it OK for trunk now?
>
> Looks good.  OK for trunk and branches, thanks.
>
> Richard



Thanks for the approval, Richard!
This patch is committed at 3553c658533e430b232997bdfd97faf6606fb102

Bests
Dennis


Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector

2020-11-03 Thread Richard Sandiford via Gcc-patches
Dennis Zhang  writes:
> Hi Richard,
>
> On 10/30/20 2:07 PM, Richard Sandiford wrote:
>> Dennis Zhang  writes:
>>> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
>>> b/gcc/config/aarch64/aarch64-simd-builtins.def
>>> index 332a0b6b1ea..39ebb776d1d 100644
>>> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
>>> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
>>> @@ -719,6 +719,9 @@
>>> VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf)
>>> VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf)
>>>   
>>> +  /* Implemented by aarch64_vget_halfv8bf.  */
>>> +  VAR1 (GETREG, vget_half, 0, ALL, v8bf)
>> 
>> This should be AUTO_FP, since it doesn't have any side-effects.
>> (As before, we should probably rename the flag, but that's separate work.)
>> 
>>> +
>>> /* Implemented by aarch64_simd_mmlav16qi.  */
>>> VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi)
>>> VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi)
>>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>>> b/gcc/config/aarch64/aarch64-simd.md
>>> index 9f0e2bd1e6f..f62c52ca327 100644
>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>> @@ -7159,6 +7159,19 @@
>>> [(set_attr "type" "neon_dot")]
>>>   )
>>>   
>>> +;; vget_low/high_bf16
>>> +(define_expand "aarch64_vget_halfv8bf"
>>> +  [(match_operand:V4BF 0 "register_operand")
>>> +   (match_operand:V8BF 1 "register_operand")
>>> +   (match_operand:SI 2 "aarch64_zero_or_1")]
>>> +  "TARGET_BF16_SIMD"
>>> +{
>>> +  int hbase = INTVAL (operands[2]);
>>> +  rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1);
>> 
>> I think this needs to be:
>> 
>>aarch64_simd_vect_par_cnst_half
>> 
>> instead.  The issue is that on big-endian targets, GCC assumes vector
>> lane 0 is in the high part of the register, whereas for AArch64 it's
>> always in the low part of the register.  So we convert from AArch64
>> numbering to GCC numbering when generating the rtx and then take
>> endianness into account when matching the rtx later.
>> 
>> It would be good to have -mbig-endian tests that make sure we generate
>> the right instruction for each function (i.e. we get them the right way
>> round).  I guess it would be good to test that for little-endian too.
>> 
>
> I've updated the expander using aarch64_simd_vect_par_cnst_half.
> And the expander is divided into two for getting the low and high halves
> separately.
> It's tested for aarch64-none-linux-gnu and aarch64_be-none-linux-gnu
> targets with new tests including the -mbig-endian option.
>
>>> +  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel));
>>> +  DONE;
>>> +})
>>> +
>>>   ;; bfmmla
>>>   (define_insn "aarch64_bfmmlaqv4sf"
>>> [(set (match_operand:V4SF 0 "register_operand" "=w")
>>> diff --git a/gcc/config/aarch64/predicates.md 
>>> b/gcc/config/aarch64/predicates.md
>>> index 215fcec5955..0c8bc2b0c73 100644
>>> --- a/gcc/config/aarch64/predicates.md
>>> +++ b/gcc/config/aarch64/predicates.md
>>> @@ -84,6 +84,10 @@
>>>  (ior (match_test "op == constm1_rtx")
>>>   (match_test "op == const1_rtx"))
>>>   
>>> +(define_predicate "aarch64_zero_or_1"
>>> +  (and (match_code "const_int")
>>> +   (match_test "op == const0_rtx || op == const1_rtx")))
>> 
>> zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one.
>> But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1,
>> so let's keep it as-is.
>> 
>
> This predicate is removed since the immediate operand is no longer
> needed in the new expanders.
>
> Thanks for the reviews.
> Is it OK for trunk now?

Looks good.  OK for trunk and branches, thanks.

Richard


Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector

2020-11-03 Thread Dennis Zhang via Gcc-patches

Hi Richard,

On 10/30/20 2:07 PM, Richard Sandiford wrote:

> Dennis Zhang  writes:
>> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
>> b/gcc/config/aarch64/aarch64-simd-builtins.def
>> index 332a0b6b1ea..39ebb776d1d 100644
>> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
>> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
>> @@ -719,6 +719,9 @@
>>    VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf)
>>    VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf)
>>  
>> +  /* Implemented by aarch64_vget_halfv8bf.  */
>> +  VAR1 (GETREG, vget_half, 0, ALL, v8bf)
>
> This should be AUTO_FP, since it doesn't have any side-effects.
> (As before, we should probably rename the flag, but that's separate work.)
>
>> +
>>    /* Implemented by aarch64_simd_mmlav16qi.  */
>>    VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi)
>>    VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi)
>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 9f0e2bd1e6f..f62c52ca327 100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -7159,6 +7159,19 @@
>>    [(set_attr "type" "neon_dot")]
>>  )
>>  
>> +;; vget_low/high_bf16
>> +(define_expand "aarch64_vget_halfv8bf"
>> +  [(match_operand:V4BF 0 "register_operand")
>> +   (match_operand:V8BF 1 "register_operand")
>> +   (match_operand:SI 2 "aarch64_zero_or_1")]
>> +  "TARGET_BF16_SIMD"
>> +{
>> +  int hbase = INTVAL (operands[2]);
>> +  rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1);
>
> I think this needs to be:
>
>   aarch64_simd_vect_par_cnst_half
>
> instead.  The issue is that on big-endian targets, GCC assumes vector
> lane 0 is in the high part of the register, whereas for AArch64 it's
> always in the low part of the register.  So we convert from AArch64
> numbering to GCC numbering when generating the rtx and then take
> endianness into account when matching the rtx later.
>
> It would be good to have -mbig-endian tests that make sure we generate
> the right instruction for each function (i.e. we get them the right way
> round).  I guess it would be good to test that for little-endian too.

I've updated the expander using aarch64_simd_vect_par_cnst_half.
And the expander is divided into two for getting the low and high halves
separately.
It's tested for aarch64-none-linux-gnu and aarch64_be-none-linux-gnu
targets with new tests including the -mbig-endian option.

>> +  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel));
>> +  DONE;
>> +})
>> +
>>  ;; bfmmla
>>  (define_insn "aarch64_bfmmlaqv4sf"
>>    [(set (match_operand:V4SF 0 "register_operand" "=w")
>> diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md
>> index 215fcec5955..0c8bc2b0c73 100644
>> --- a/gcc/config/aarch64/predicates.md
>> +++ b/gcc/config/aarch64/predicates.md
>> @@ -84,6 +84,10 @@
>>    (ior (match_test "op == constm1_rtx")
>>     (match_test "op == const1_rtx"))
>>  
>> +(define_predicate "aarch64_zero_or_1"
>> +  (and (match_code "const_int")
>> +   (match_test "op == const0_rtx || op == const1_rtx")))
>
> zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one.
> But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1,
> so let's keep it as-is.

This predicate is removed since the immediate operand is no longer needed
in the new expanders.


Thanks for the reviews.
Is it OK for trunk now?

Cheers
Dennis


diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index eb8e6f7b3d8..f26a96042bc 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -722,6 +722,10 @@
   VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf)
   VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf)
 
+  /* Implemented by aarch64_vget_lo/hi_halfv8bf.  */
+  VAR1 (UNOP, vget_lo_half, 0, AUTO_FP, v8bf)
+  VAR1 (UNOP, vget_hi_half, 0, AUTO_FP, v8bf)
+
   /* Implemented by aarch64_simd_mmlav16qi.  */
   VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi)
   VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 381a702eba0..af29a2f26f5 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -7159,6 +7159,27 @@
   [(set_attr "type" "neon_dot")]
 )
 
+;; vget_low/high_bf16
+(define_expand "aarch64_vget_lo_halfv8bf"
+  [(match_operand:V4BF 0 "register_operand")
+   (match_operand:V8BF 1 "register_operand")]
+  "TARGET_BF16_SIMD"
+{
+  rtx p = aarch64_simd_vect_par_cnst_half (V8BFmode, 8, false);
+  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], p));
+  DONE;
+})
+
+(define_expand "aarch64_vget_hi_halfv8bf"
+  [(match_operand:V4BF 0 "register_operand")
+   (match_operand:V8BF 1 "register_operand")]
+  "TARGET_BF16_SIMD"
+{
+  rtx p = aarch64_simd_vect_par_cnst_half (V8BFmode, 8, true);
+  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], p));
+  DONE;
+})
+
 ;; bfmmla
 (define_insn "aarch64_bfmmlaqv4sf"
   [(set (match_operand:V4SF 0 "register_operand" "=w")
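For reference, the intrinsics these expanders back can be exercised along
these lines (a usage sketch, assuming a compiler with BF16 support
enabled, e.g. via -march=armv8.2-a+bf16):

  #include <arm_neon.h>

  /* Split a 128-bit BFloat16 vector into its 64-bit halves.  On both
     endiannesses the results must be the architecturally low (lanes 0-3)
     and high (lanes 4-7) halves.  */
  bfloat16x4_t
  get_low (bfloat16x8_t x)
  {
    return vget_low_bf16 (x);
  }

  bfloat16x4_t
  get_high (bfloat16x8_t x)
  {
    return vget_high_bf16 (x);
  }

Testing this for both little-endian and -mbig-endian, as suggested in the
review, guards against the two halves being swapped.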

Re: [PATCH][AArch64] ACLE intrinsics: get low/high half from BFloat16 vector

2020-10-30 Thread Richard Sandiford via Gcc-patches
Dennis Zhang  writes:
> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
> b/gcc/config/aarch64/aarch64-simd-builtins.def
> index 332a0b6b1ea..39ebb776d1d 100644
> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> @@ -719,6 +719,9 @@
>VAR1 (QUADOP_LANE, bfmlalb_lane_q, 0, ALL, v4sf)
>VAR1 (QUADOP_LANE, bfmlalt_lane_q, 0, ALL, v4sf)
>  
> +  /* Implemented by aarch64_vget_halfv8bf.  */
> +  VAR1 (GETREG, vget_half, 0, ALL, v8bf)

This should be AUTO_FP, since it doesn't have any side-effects.
(As before, we should probably rename the flag, but that's separate work.)

> +
>/* Implemented by aarch64_simd_mmlav16qi.  */
>VAR1 (TERNOP, simd_smmla, 0, NONE, v16qi)
>VAR1 (TERNOPU, simd_ummla, 0, NONE, v16qi)
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 9f0e2bd1e6f..f62c52ca327 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -7159,6 +7159,19 @@
>[(set_attr "type" "neon_dot")]
>  )
>  
> +;; vget_low/high_bf16
> +(define_expand "aarch64_vget_halfv8bf"
> +  [(match_operand:V4BF 0 "register_operand")
> +   (match_operand:V8BF 1 "register_operand")
> +   (match_operand:SI 2 "aarch64_zero_or_1")]
> +  "TARGET_BF16_SIMD"
> +{
> +  int hbase = INTVAL (operands[2]);
> +  rtx sel = aarch64_gen_stepped_int_parallel (4, hbase * 4, 1);

I think this needs to be:

  aarch64_simd_vect_par_cnst_half

instead.  The issue is that on big-endian targets, GCC assumes vector
lane 0 is in the high part of the register, whereas for AArch64 it's
always in the low part of the register.  So we convert from AArch64
numbering to GCC numbering when generating the rtx and then take
endianness into account when matching the rtx later.

It would be good to have -mbig-endian tests that make sure we generate
the right instruction for each function (i.e. we get them the right way
round).  I guess it would be good to test that for little-endian too.
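
To make the lane-numbering point concrete, the selector generation boils
down to something like this (a rough sketch of the logic, not the exact
implementation in aarch64.c, and it assumes GCC's usual rtl.h
environment):

  /* Build a PARALLEL selecting the architecturally low (high == false)
     or high (high == true) half of a V8BF vector, in GCC lane numbers.  */
  static rtx
  v8bf_half_selector (bool high)
  {
    /* AArch64 numbering: the low half is lanes 0..3 and the high half is
       lanes 4..7.  GCC's numbering is reversed on big-endian targets, so
       the two halves swap when building the selector.  */
    int base = (high != BYTES_BIG_ENDIAN) ? 4 : 0;
    rtvec v = rtvec_alloc (4);
    for (int i = 0; i < 4; i++)
      RTVEC_ELT (v, i) = GEN_INT (base + i);
    return gen_rtx_PARALLEL (V8BFmode, v);
  }

So vget_high would get (parallel [4 5 6 7]) on little-endian but
(parallel [0 1 2 3]) on big-endian, and the predicate that matches the
PARALLEL later applies the same endianness correction.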

> +  emit_insn (gen_aarch64_get_halfv8bf (operands[0], operands[1], sel));
> +  DONE;
> +})
> +
>  ;; bfmmla
>  (define_insn "aarch64_bfmmlaqv4sf"
>[(set (match_operand:V4SF 0 "register_operand" "=w")
> diff --git a/gcc/config/aarch64/predicates.md 
> b/gcc/config/aarch64/predicates.md
> index 215fcec5955..0c8bc2b0c73 100644
> --- a/gcc/config/aarch64/predicates.md
> +++ b/gcc/config/aarch64/predicates.md
> @@ -84,6 +84,10 @@
>(ior (match_test "op == constm1_rtx")
> (match_test "op == const1_rtx"))
>  
> +(define_predicate "aarch64_zero_or_1"
> +  (and (match_code "const_int")
> +   (match_test "op == const0_rtx || op == const1_rtx")))

zero_or_1 looked odd to me, feels like it should be 0_or_1 or zero_or_one.
But I see that it's for consistency with aarch64_reg_zero_or_m1_or_1,
so let's keep it as-is.

Thanks,
Richard