Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-05-04 Thread Richard Sandiford via Gcc-patches
Jonathan Wright via Gcc-patches writes:
> Hi Richard,
>
> I think you may be referencing an older checkout as we refactored this
> pattern in a previous change to:
>
> (define_insn "mul_lane3"
>  [(set (match_operand:VMUL 0 "register_operand" "=w")
>(mult:VMUL
>(vec_duplicate:VMUL
>  (vec_select:
>(match_operand:VMUL 2 "register_operand" "")
>(parallel [(match_operand:SI 3 "immediate_operand" "i")])))
>(match_operand:VMUL 1 "register_operand" "w")))]
>   "TARGET_SIMD"
>   {
> operands[3] = aarch64_endian_lane_rtx (mode, INTVAL (operands[3]));
> return "mul\\t%0., %1., %2.[%3]";
>   }
>   [(set_attr "type" "neon_mul__scalar")]
> )
>
> which doesn't help us with the 'laneq' intrinsics as the machine mode for
> operands 0 and 1 (of the laneq intrinsics) is narrower than the machine
> mode for operand 2.

Gah, I copied the wrong one, sorry.  The one I meant was:

(define_insn "*aarch64_mul3_elt_"
  [(set (match_operand:VMUL_CHANGE_NLANES 0 "register_operand" "=w")
 (mult:VMUL_CHANGE_NLANES
   (vec_duplicate:VMUL_CHANGE_NLANES
  (vec_select:
(match_operand: 1 "register_operand" "")
(parallel [(match_operand:SI 2 "immediate_operand")])))
  (match_operand:VMUL_CHANGE_NLANES 3 "register_operand" "w")))]
  "TARGET_SIMD"
  {
operands[2] = aarch64_endian_lane_rtx (mode, INTVAL 
(operands[2]));
return "mul\\t%0., %3., %1.[%2]";
  }
  [(set_attr "type" "neon_mul__scalar")]
)

This already provides patterns in which the indexed operand is
wider than the other operands.
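For the record, a minimal (untested, illustrative) example of the case it
covers -- the function name here is mine, not from the patch:

#include <arm_neon.h>

/* vmul_laneq_f32: the result and first multiplicand are V2SF (64-bit),
   while the indexed operand is V4SF (128-bit) -- exactly the
   wider-indexed-operand case the pattern above handles.  */
float32x2_t
mul_by_laneq (float32x2_t a, float32x4_t v)
{
  return vmul_laneq_f32 (a, v, 3);  /* expected: fmul v0.2s, v0.2s, v1.s[3] */
}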

Thanks,
Richard


Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-05-04 Thread Jonathan Wright via Gcc-patches
Hi Richard,

I think you may be referencing an older checkout as we refactored this
pattern in a previous change to:

(define_insn "mul_lane3"
 [(set (match_operand:VMUL 0 "register_operand" "=w")
   (mult:VMUL
   (vec_duplicate:VMUL
 (vec_select:
   (match_operand:VMUL 2 "register_operand" "")
   (parallel [(match_operand:SI 3 "immediate_operand" "i")])))
   (match_operand:VMUL 1 "register_operand" "w")))]
  "TARGET_SIMD"
  {
operands[3] = aarch64_endian_lane_rtx (mode, INTVAL (operands[3]));
return "mul\\t%0., %1., %2.[%3]";
  }
  [(set_attr "type" "neon_mul__scalar")]
)

which doesn't help us with the 'laneq' intrinsics as the machine mode for
operands 0 and 1 (of the laneq intrinsics) is narrower than the machine
mode for operand 2.
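For instance (illustrative function name, not part of the patch):

#include <arm_neon.h>

/* vmla_laneq_f32: operands 0 and 1 are V2SF (64 bits), but the lane
   vector is V4SF (128 bits), so a pattern whose operands all share the
   same VMUL mode cannot match it.  */
float32x2_t
mla_by_laneq (float32x2_t acc, float32x2_t a, float32x4_t v)
{
  return vmla_laneq_f32 (acc, a, v, 1);
}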

Thanks,
Jonathan

____________
From: Richard Sandiford 
Sent: 30 April 2021 19:18
To: Jonathan Wright 
Cc: gcc-patches@gcc.gnu.org 
Subject: Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq 
intrinsics

Richard Sandiford via Gcc-patches writes:
> Jonathan Wright writes:
>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 
>> bdee49f74f4725409d33af733bb55be290b3f0e7..234762960bd6df057394f753072ef65a6628a43d
>>  100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -734,6 +734,22 @@
>>    [(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")]
>>  )
>>
>> +(define_insn "mul_laneq<mode>3"
>> +  [(set (match_operand:VDQSF 0 "register_operand" "=w")
>> +    (mult:VDQSF
>> +      (vec_duplicate:VDQSF
>> +        (vec_select:<VEL>
>> +          (match_operand:V4SF 2 "register_operand" "w")
>> +          (parallel [(match_operand:SI 3 "immediate_operand" "i")])))
>> +      (match_operand:VDQSF 1 "register_operand" "w")))]
>> +  "TARGET_SIMD"
>> +  {
>> +    operands[3] = aarch64_endian_lane_rtx (V4SFmode, INTVAL (operands[3]));
>> +    return "fmul\\t%0.<Vtype>, %1.<Vtype>, %2.<Vetype>[%3]";
>> +  }
>> +  [(set_attr "type" "neon_fp_mul_s_scalar<q>")]
>> +)
>> +

Oops, sorry, I just realised that this pattern does already exist as:

(define_insn "*aarch64_mul3_elt"
 [(set (match_operand:VMUL 0 "register_operand" "=w")
(mult:VMUL
  (vec_duplicate:VMUL
  (vec_select:
(match_operand:VMUL 1 "register_operand" "")
(parallel [(match_operand:SI 2 "immediate_operand")])))
  (match_operand:VMUL 3 "register_operand" "w")))]
  "TARGET_SIMD"
  {
operands[2] = aarch64_endian_lane_rtx (mode, INTVAL (operands[2]));
return "mul\\t%0., %3., %1.[%2]";
  }
  [(set_attr "type" "neon_mul__scalar")]
)

Thanks,
Richard


Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-30 Thread Richard Sandiford via Gcc-patches
Richard Sandiford via Gcc-patches writes:
> Jonathan Wright writes:
>> diff --git a/gcc/config/aarch64/aarch64-simd.md 
>> b/gcc/config/aarch64/aarch64-simd.md
>> index 
>> bdee49f74f4725409d33af733bb55be290b3f0e7..234762960bd6df057394f753072ef65a6628a43d
>>  100644
>> --- a/gcc/config/aarch64/aarch64-simd.md
>> +++ b/gcc/config/aarch64/aarch64-simd.md
>> @@ -734,6 +734,22 @@
>>    [(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")]
>>  )
>>
>> +(define_insn "mul_laneq<mode>3"
>> +  [(set (match_operand:VDQSF 0 "register_operand" "=w")
>> +    (mult:VDQSF
>> +      (vec_duplicate:VDQSF
>> +        (vec_select:<VEL>
>> +          (match_operand:V4SF 2 "register_operand" "w")
>> +          (parallel [(match_operand:SI 3 "immediate_operand" "i")])))
>> +      (match_operand:VDQSF 1 "register_operand" "w")))]
>> +  "TARGET_SIMD"
>> +  {
>> +    operands[3] = aarch64_endian_lane_rtx (V4SFmode, INTVAL (operands[3]));
>> +    return "fmul\\t%0.<Vtype>, %1.<Vtype>, %2.<Vetype>[%3]";
>> +  }
>> +  [(set_attr "type" "neon_fp_mul_s_scalar<q>")]
>> +)
>> +

Oops, sorry, I just realised that this pattern does already exist as:

(define_insn "*aarch64_mul3_elt"
 [(set (match_operand:VMUL 0 "register_operand" "=w")
(mult:VMUL
  (vec_duplicate:VMUL
  (vec_select:
(match_operand:VMUL 1 "register_operand" "")
(parallel [(match_operand:SI 2 "immediate_operand")])))
  (match_operand:VMUL 3 "register_operand" "w")))]
  "TARGET_SIMD"
  {
operands[2] = aarch64_endian_lane_rtx (mode, INTVAL (operands[2]));
return "mul\\t%0., %3., %1.[%2]";
  }
  [(set_attr "type" "neon_mul__scalar")]
)

Thanks,
Richard


Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-30 Thread Richard Sandiford via Gcc-patches
Jonathan Wright writes:
> Updated the patch to be more consistent with the others in the series.
>
> Tested and bootstrapped on aarch64-none-linux-gnu - no issues.
>
> Ok for master?

OK, thanks.

Richard

>
> Thanks,
> Jonathan
> ---
> From: Gcc-patches on behalf of Jonathan
> Wright via Gcc-patches
> Sent: 28 April 2021 15:42
> To: gcc-patches@gcc.gnu.org 
> Subject: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq
> intrinsics
>  
> Hi,
>
> As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon
> intrinsics to use RTL builtins rather than relying on the GCC vector
> extensions. Using RTL builtins allows control over the emission of
> fmla/fmls instructions (which we don't want here).
>
> With this commit, the code generated by these intrinsics changes from
> a fused multiply-add/subtract instruction to an fmul followed by an
> fadd/fsub instruction. If the programmer really wants fmla/fmls
> instructions, they can use the vfm[as] intrinsics.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.
>
> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-02-17  Jonathan Wright  
>
> * config/aarch64/aarch64-simd-builtins.def: Add
> float_ml[as][q]_laneq builtin generator macros.
> * config/aarch64/aarch64-simd.md (mul_laneq<mode>3): Define.
> (aarch64_float_mla_laneq<mode>): Define.
> (aarch64_float_mls_laneq<mode>): Define.
> * config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin
> instead of GCC vector extensions.
> (vmlaq_laneq_f32): Likewise.
> (vmls_laneq_f32): Likewise.
> (vmlsq_laneq_f32): Likewise.
>
> diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def 
> b/gcc/config/aarch64/aarch64-simd-builtins.def
> index 
> 8e4b4edc8a46ffba777a42058f06ce7204152824..1e81bb53287e9797f3539c2c64ed11c6c26d6e4e
>  100644
> --- a/gcc/config/aarch64/aarch64-simd-builtins.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtins.def
> @@ -674,6 +674,8 @@
>BUILTIN_VDQSF (TERNOP, float_mls_n, 0, FP)
>BUILTIN_VDQSF (QUADOP_LANE, float_mla_lane, 0, FP)
>BUILTIN_VDQSF (QUADOP_LANE, float_mls_lane, 0, FP)
> +  BUILTIN_VDQSF (QUADOP_LANE, float_mla_laneq, 0, FP)
> +  BUILTIN_VDQSF (QUADOP_LANE, float_mls_laneq, 0, FP)
>  
>  /* Implemented by aarch64_simd_bsl<mode>.  */
>BUILTIN_VDQQH (BSL_P, simd_bsl, 0, NONE)
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> bdee49f74f4725409d33af733bb55be290b3f0e7..234762960bd6df057394f753072ef65a6628a43d
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -734,6 +734,22 @@
>    [(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")]
>  )
>  
> +(define_insn "mul_laneq3"
> +  [(set (match_operand:VDQSF 0 "register_operand" "=w")
> + (mult:VDQSF
> +   (vec_duplicate:VDQSF
> + (vec_select:
> +   (match_operand:V4SF 2 "register_operand" "w")
> +   (parallel [(match_operand:SI 3 "immediate_operand" "i")])))
> +   (match_operand:VDQSF 1 "register_operand" "w")))]
> +  "TARGET_SIMD"
> +  {
> +operands[3] = aarch64_endian_lane_rtx (V4SFmode, INTVAL (operands[3]));
> +return "fmul\\t%0., %1., %2.[%3]";
> +  }
> +  [(set_attr "type" "neon_fp_mul_s_scalar")]
> +)
> +
>  (define_insn "*aarch64_mul3_elt_"
>[(set (match_operand:VMUL_CHANGE_NLANES 0 "register_operand" "=w")
>   (mult:VMUL_CHANGE_NLANES
> @@ -2742,6 +2758,46 @@
>}
>  )
>  
> +(define_expand "aarch64_float_mla_laneq"
> +  [(set (match_operand:VDQSF 0 "register_operand")
> + (plus:VDQSF
> +   (mult:VDQSF
> + (vec_duplicate:VDQSF
> +   (vec_select:
> + (match_operand:V4SF 3 "register_operand")
> + (parallel [(match_operand:SI 4 "immediate_operand")])))
> + (match_operand:VDQSF 2 "register_operand"))
> +   (match_operand:VDQSF 1 "register_operand")))]
> +  "TARGET_SIMD"
> +  {
> +rtx scratch = gen_reg_rtx (mode);
> +emit_insn (gen_mul_laneq3 (scratch, operands[2],
> +  operands[3], operands[4]));
> +emit_insn (gen_add3 (operands[0], operands[1], scratch));
> +DONE;
> +  }
> +)
> +
> +(define_expand "aarch64_float_mls_laneq"
> +  [(set (m
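To make the expander's strategy concrete, a rough intrinsics-level sketch
of what aarch64_float_mla_laneq<mode> arranges to emit (illustrative names
only, not code from the patch):

#include <arm_neon.h>

/* A lane multiply into a scratch value, then a separate add, mirroring
   the gen_mul_laneq<mode>3 / gen_add<mode>3 sequence above -- no fused
   fmla is generated.  */
float32x4_t
mla_laneq_sketch (float32x4_t acc, float32x4_t a, float32x4_t v)
{
  float32x4_t scratch = vmulq_laneq_f32 (a, v, 2);  /* fmul  */
  return vaddq_f32 (acc, scratch);                  /* fadd  */
}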

Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-30 Thread Jonathan Wright via Gcc-patches
Updated the patch to be more consistent with the others in the series.

Tested and bootstrapped on aarch64-none-linux-gnu - no issues.

Ok for master?

Thanks,
Jonathan

From: Gcc-patches on behalf of Jonathan
Wright via Gcc-patches
Sent: 28 April 2021 15:42
To: gcc-patches@gcc.gnu.org 
Subject: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq 
intrinsics

Hi,

As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon
intrinsics to use RTL builtins rather than relying on the GCC vector
extensions. Using RTL builtins allows control over the emission of
fmla/fmls instructions (which we don't want here).

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-17  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as][q]_laneq builtin generator macros.
* config/aarch64/aarch64-simd.md (mul_laneq<mode>3): Define.
(aarch64_float_mla_laneq<mode>): Define.
(aarch64_float_mls_laneq<mode>): Define.
* config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin
instead of GCC vector extensions.
(vmlaq_laneq_f32): Likewise.
(vmls_laneq_f32): Likewise.
(vmlsq_laneq_f32): Likewise.


rb14213.patch
Description: rb14213.patch


[PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics

2021-04-28 Thread Jonathan Wright via Gcc-patches
Hi,

As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon
intrinsics to use RTL builtins rather than relying on the GCC vector
extensions. Using RTL builtins allows control over the emission of
fmla/fmls instructions (which we don't want here).

With this commit, the code generated by these intrinsics changes from
a fused multiply-add/subtract instruction to an fmul followed by an
fadd/fsub instruction. If the programmer really wants fmla/fmls
instructions, they can use the vfm[as] intrinsics.
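As a hedged illustration (function names are mine; the expected code
generation follows the description above):

#include <arm_neon.h>

/* With this patch, vmla_laneq_f32 should emit an unfused sequence.  */
float32x2_t
mla_unfused (float32x2_t acc, float32x2_t a, float32x4_t v)
{
  return vmla_laneq_f32 (acc, a, v, 0);   /* expected: fmul + fadd  */
}

/* vfma_laneq_f32 still guarantees the fused instruction.  */
float32x2_t
mla_fused (float32x2_t acc, float32x2_t a, float32x4_t v)
{
  return vfma_laneq_f32 (acc, a, v, 0);   /* expected: fmla  */
}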

Regression tested and bootstrapped on aarch64-none-linux-gnu - no
issues.

Ok for master?

Thanks,
Jonathan

---

gcc/ChangeLog:

2021-02-17  Jonathan Wright  

* config/aarch64/aarch64-simd-builtins.def: Add
float_ml[as][q]_laneq builtin generator macros.
* config/aarch64/aarch64-simd.md (mul_laneq<mode>3): Define.
(aarch64_float_mla_laneq<mode>): Define.
(aarch64_float_mls_laneq<mode>): Define.
* config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin
instead of GCC vector extensions.
(vmlaq_laneq_f32): Likewise.
(vmls_laneq_f32): Likewise.
(vmlsq_laneq_f32): Likewise.


rb14213.patch
Description: rb14213.patch