[PATCH] aarch64: Use type-qualified builtins for vget_low/high intrinsics
Hi, This patch declares unsigned and polynomial type-qualified builtins for vget_low_*/vget_high_* Neon intrinsics. Using these builtins removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-10 Jonathan Wright * config/aarch64/aarch64-builtins.c (TYPES_UNOPP): Define. * config/aarch64/aarch64-simd-builtins.def: Declare type- qualified builtins for vget_low/high. * config/aarch64/arm_neon.h (vget_low_p8): Use type-qualified builtin and remove casts. (vget_low_p16): Likewise. (vget_low_p64): Likewise. (vget_low_u8): Likewise. (vget_low_u16): Likewise. (vget_low_u32): Likewise. (vget_low_u64): Likewise. (vget_high_p8): Likewise. (vget_high_p16): Likewise. (vget_high_p64): Likewise. (vget_high_u8): Likewise. (vget_high_u16): Likewise. (vget_high_u32): Likewise. (vget_high_u64): Likewise. * config/aarch64/iterators.md (VQ_P): New mode iterator. rb15060.patch Description: rb15060.patch
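A hedged before/after sketch of one affected intrinsic illustrates the effect; the builtin names and the exact pre-patch body here are illustrative assumptions, not text lifted from the patch:

    /* Before: only a signed builtin exists, so the unsigned intrinsic
       casts on the way in and on the way out (illustrative).  */
    __extension__ extern __inline uint8x8_t
    __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
    vget_low_u8 (uint8x16_t __a)
    {
      return (uint8x8_t) __builtin_aarch64_get_lowv16qi ((int8x16_t) __a);
    }

    /* After: an unsigned-qualified builtin (hypothetical name) takes and
       returns the unsigned vector types directly - no casts needed.  */
    __extension__ extern __inline uint8x8_t
    __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
    vget_low_u8 (uint8x16_t __a)
    {
      return __builtin_aarch64_get_lowv16qi_uu (__a);
    }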
[PATCH] aarch64: Use type-qualified builtins for vcombine_* Neon intrinsics
Hi, This patch declares unsigned and polynomial type-qualified builtins for vcombine_* Neon intrinsics. Using these builtins removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-10 Jonathan Wright * config/aarch64/aarch64-builtins.c (TYPES_COMBINE): Delete. (TYPES_COMBINEP): Delete. * config/aarch64/aarch64-simd-builtins.def: Declare type- qualified builtins for vcombine_* intrinsics. * config/aarch64/arm_neon.h (vcombine_s8): Remove unnecessary cast. (vcombine_s16): Likewise. (vcombine_s32): Likewise. (vcombine_f32): Likewise. (vcombine_u8): Use type-qualified builtin and remove casts. (vcombine_u16): Likewise. (vcombine_u32): Likewise. (vcombine_u64): Likewise. (vcombine_p8): Likewise. (vcombine_p16): Likewise. (vcombine_p64): Likewise. (vcombine_bf16): Remove unnecessary cast. * config/aarch64/iterators.md (VDC_I): New mode iterator. (VDC_P): New mode iterator. rb15059.patch Description: rb15059.patch
[PATCH] aarch64: Use type-qualified builtins for LD1/ST1 Neon intrinsics
Hi, This patch declares unsigned and polynomial type-qualified builtins and uses them to implement the LD1/ST1 Neon intrinsics. This removes the need for many casts in arm_neon.h. The new type-qualified builtins are also lowered to gimple - as the unqualified builtins are already. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-10 Jonathan Wright * config/aarch64/aarch64-builtins.c (TYPES_LOAD1_U): Define. (TYPES_LOAD1_P): Define. (TYPES_STORE1_U): Define. (TYPES_STORE1P): Rename to... (TYPES_STORE1_P): This. (get_mem_type_for_load_store): Add unsigned and poly types. (aarch64_general_gimple_fold_builtin): Add unsigned and poly type-qualified builtin declarations. * config/aarch64/aarch64-simd-builtins.def: Declare type- qualified builtins for LD1/ST1. * config/aarch64/arm_neon.h (vld1_p8): Use type-qualified builtin and remove cast. (vld1_p16): Likewise. (vld1_u8): Likewise. (vld1_u16): Likewise. (vld1_u32): Likewise. (vld1q_p8): Likewise. (vld1q_p16): Likewise. (vld1q_p64): Likewise. (vld1q_u8): Likewise. (vld1q_u16): Likewise. (vld1q_u32): Likewise. (vld1q_u64): Likewise. (vst1_p8): Likewise. (vst1_p16): Likewise. (vst1_u8): Likewise. (vst1_u16): Likewise. (vst1_u32): Likewise. (vst1q_p8): Likewise. (vst1q_p16): Likewise. (vst1q_p64): Likewise. (vst1q_u8): Likewise. (vst1q_u16): Likewise. (vst1q_u32): Likewise. (vst1q_u64): Likewise. * config/aarch64/iterators.md (VALLP_NO_DI): New iterator. rb15058.patch Description: rb15058.patch
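The load/store case additionally drops pointer casts. A minimal sketch of the two bodies of vld1q_u8, in the same spirit as the sketch above (builtin names assumed, not taken from the patch):

    /* Before (illustrative): cast the pointer in and the result out.  */
    return (uint8x16_t) __builtin_aarch64_ld1v16qi ((const int8_t *) __a);

    /* After: a hypothetical unsigned-qualified builtin accepts
       const uint8_t * and returns uint8x16_t directly.  */
    return __builtin_aarch64_ld1v16qi_us (__a);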
[PATCH] aarch64: Use type-qualified builtins for ADDV Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement the vector reduction Neon intrinsics. This removes the need for many casts in arm_neon.h. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Declare unsigned builtins for vector reduction. * config/aarch64/arm_neon.h (vaddv_u8): Use type-qualified builtin and remove casts. (vaddv_u16): Likewise. (vaddv_u32): Likewise. (vaddvq_u8): Likewise. (vaddvq_u16): Likewise. (vaddvq_u32): Likewise. (vaddvq_u64): Likewise. rb15057.patch Description: rb15057.patch
[PATCH] aarch64: Use type-qualified builtins for ADDP Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement the pairwise addition Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Declare unsigned builtins for pairwise addition. * config/aarch64/arm_neon.h (vpaddq_u8): Use type-qualified builtin and remove casts. (vpaddq_u16): Likewise. (vpaddq_u32): Likewise. (vpaddq_u64): Likewise. (vpadd_u8): Likewise. (vpadd_u16): Likewise. (vpadd_u32): Likewise. (vpaddd_u64): Likewise. rb15039.patch Description: rb15039.patch
[PATCH] aarch64: Use type-qualified builtins for [R]SUBHN[2] Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement (rounding) halving-narrowing-subtract Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Declare unsigned builtins for [r]subhn[2]. * config/aarch64/arm_neon.h (vsubhn_s16): Remove unnecessary cast. (vsubhn_s32): Likewise. (vsubhn_s64): Likewise. (vsubhn_u16): Use type-qualified builtin and remove casts. (vsubhn_u32): Likewise. (vsubhn_u64): Likewise. (vrsubhn_s16): Remove unnecessary cast. (vrsubhn_s32): Likewise. (vrsubhn_s64): Likewise. (vrsubhn_u16): Use type-qualified builtin and remove casts. (vrsubhn_u32): Likewise. (vrsubhn_u64): Likewise. (vrsubhn_high_s16): Remove unnecessary cast. (vrsubhn_high_s32): Likewise. (vrsubhn_high_s64): Likewise. (vrsubhn_high_u16): Use type-qualified builtin and remove casts. (vrsubhn_high_u32): Likewise. (vrsubhn_high_u64): Likewise. (vsubhn_high_s16): Remove unnecessary cast. (vsubhn_high_s32): Likewise. (vsubhn_high_s64): Likewise. (vsubhn_high_u16): Use type-qualified builtin and remove casts. (vsubhn_high_u32): Likewise. (vsubhn_high_u64): Likewise. rb15038.patch Description: rb15038.patch
[PATCH] aarch64: Use type-qualified builtins for [R]ADDHN[2] Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement (rounding) halving-narrowing-add Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Declare unsigned builtins for [r]addhn[2]. * config/aarch64/arm_neon.h (vaddhn_s16): Remove unnecessary cast. (vaddhn_s32): Likewise. (vaddhn_s64): Likewise. (vaddhn_u16): Use type-qualified builtin and remove casts. (vaddhn_u32): Likewise. (vaddhn_u64): Likewise. (vraddhn_s16): Remove unnecessary cast. (vraddhn_s32): Likewise. (vraddhn_s64): Likewise. (vraddhn_u16): Use type-qualified builtin and remove casts. (vraddhn_u32): Likewise. (vraddhn_u64): Likewise. (vaddhn_high_s16): Remove unnecessary cast. (vaddhn_high_s32): Likewise. (vaddhn_high_s64): Likewise. (vaddhn_high_u16): Use type-qualified builtin and remove casts. (vaddhn_high_u32): Likewise. (vaddhn_high_u64): Likewise. (vraddhn_high_s16): Remove unnecessary cast. (vraddhn_high_s32): Likewise. (vraddhn_high_s64): Likewise. (vraddhn_high_u16): Use type-qualified builtin and remove casts. (vraddhn_high_u32): Likewise. (vraddhn_high_u64): Likewise. rb15037.patch Description: rb15037.patch
[PATCH] aarch64: Use type-qualified builtins for UHSUB Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement halving-subtract Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Use BINOPU type qualifiers in generator macros for uhsub builtins. * config/aarch64/arm_neon.h (vhsub_s8): Remove unnecessary cast. (vhsub_s16): Likewise. (vhsub_s32): Likewise. (vhsub_u8): Use type-qualified builtin and remove casts. (vhsub_u16): Likewise. (vhsub_u32): Likewise. (vhsubq_s8): Remove unnecessary cast. (vhsubq_s16): Likewise. (vhsubq_s32): Likewise. (vhsubq_u8): Use type-qualified builtin and remove casts. (vhsubq_u16): Likewise. (vhsubq_u32): Likewise. rb15036.patch Description: rb15036.patch
[PATCH] aarch64: Use type-qualified builtins for U[R]HADD Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement (rounding) halving-add Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Use BINOPU type qualifiers in generator macros for u[r]hadd builtins. * config/aarch64/arm_neon.h (vhadd_s8): Remove unnecessary cast. (vhadd_s16): Likewise. (vhadd_s32): Likewise. (vhadd_u8): Use type-qualified builtin and remove casts. (vhadd_u16): Likewise. (vhadd_u32): Likewise. (vhaddq_s8): Remove unnecessary cast. (vhaddq_s16): Likewise. (vhaddq_s32): Likewise. (vhaddq_u8): Use type-qualified builtin and remove casts. (vhaddq_u16): Likewise. (vhaddq_u32): Likewise. (vrhadd_s8): Remove unnecessary cast. (vrhadd_s16): Likewise. (vrhadd_s32): Likewise. (vrhadd_u8): Use type-qualified builtin and remove casts. (vrhadd_u16): Likewise. (vrhadd_u32): Likewise. (vrhaddq_s8): Remove unnecessary cast. (vrhaddq_s16): Likewise. (vrhaddq_s32): Likewise. (vrhaddq_u8): Use type-qualified builtin and remove casts. (vrhaddq_u16): Likewise. (vrhaddq_u32): Likewise. rb15035.patch Description: rb15035.patch
[PATCH] aarch64: Use type-qualified builtins for USUB[LW][2] Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement widening-subtract Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Use BINOPU type qualifiers in generator macros for usub[lw][2] builtins. * config/aarch64/arm_neon.h (vsubl_s8): Remove unnecessary cast. (vsubl_s16): Likewise. (vsubl_s32): Likewise. (vsubl_u8): Use type-qualified builtin and remove casts. (vsubl_u16): Likewise. (vsubl_u32): Likewise. (vsubl_high_s8): Remove unnecessary cast. (vsubl_high_s16): Likewise. (vsubl_high_s32): Likewise. (vsubl_high_u8): Use type-qualified builtin and remove casts. (vsubl_high_u16): Likewise. (vsubl_high_u32): Likewise. (vsubw_s8): Remove unnecessary casts. (vsubw_s16): Likewise. (vsubw_s32): Likewise. (vsubw_u8): Use type-qualified builtin and remove casts. (vsubw_u16): Likewise. (vsubw_u32): Likewise. (vsubw_high_s8): Remove unnecessary cast. (vsubw_high_s16): Likewise. (vsubw_high_s32): Likewise. (vsubw_high_u8): Use type-qualified builtin and remove casts. (vsubw_high_u16): Likewise. (vsubw_high_u32): Likewise. rb15034.patch Description: rb15034.patch
[PATCH] aarch64: Use type-qualified builtins for UADD[LW][2] Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them to implement widening-add Neon intrinsics. This removes the need for many casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-09 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Use BINOPU type qualifiers in generator macros for uadd[lw][2] builtins. * config/aarch64/arm_neon.h (vaddl_s8): Remove unnecessary cast. (vaddl_s16): Likewise. (vaddl_s32): Likewise. (vaddl_u8): Use type-qualified builtin and remove casts. (vaddl_u16): Likewise. (vaddl_u32): Likewise. (vaddl_high_s8): Remove unnecessary cast. (vaddl_high_s16): Likewise. (vaddl_high_s32): Likewise. (vaddl_high_u8): Use type-qualified builtin and remove casts. (vaddl_high_u16): Likewise. (vaddl_high_u32): Likewise. (vaddw_s8): Remove unnecessary cast. (vaddw_s16): Likewise. (vaddw_s32): Likewise. (vaddw_u8): Use type-qualified builtin and remove casts. (vaddw_u16): Likewise. (vaddw_u32): Likewise. (vaddw_high_s8): Remove unnecessary cast. (vaddw_high_s16): Likewise. (vaddw_high_s32): Likewise. (vaddw_high_u8): Use type-qualified builtin and remove casts. (vaddw_high_u16): Likewise. (vaddw_high_u32): Likewise. rb15033.patch Description: rb15033.patch
[PATCH] aarch64: Use type-qualified builtins for [R]SHRN[2] Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them for [R]SHRN[2] Neon intrinsics. This removes the need for casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-08 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Declare type- qualified builtins for [R]SHRN[2]. * config/aarch64/arm_neon.h (vshrn_n_u16): Use type-qualified builtin and remove casts. (vshrn_n_u32): Likewise. (vshrn_n_u64): Likewise. (vrshrn_high_n_u16): Likewise. (vrshrn_high_n_u32): Likewise. (vrshrn_high_n_u64): Likewise. (vrshrn_n_u16): Likewise. (vrshrn_n_u32): Likewise. (vrshrn_n_u64): Likewise. (vshrn_high_n_u16): Likewise. (vshrn_high_n_u32): Likewise. (vshrn_high_n_u64): Likewise. rb15032.patch Description: rb15032.patch
[PATCH] aarch64: Use type-qualified builtins for XTN[2] Neon intrinsics
Hi, This patch declares unsigned type-qualified builtins and uses them for XTN[2] Neon intrinsics. This removes the need for casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-08 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Declare unsigned type-qualified builtins for XTN[2]. * config/aarch64/arm_neon.h (vmovn_high_u16): Use type- qualified builtin and remove casts. (vmovn_high_u32): Likewise. (vmovn_high_u64): Likewise. (vmovn_u16): Likewise. (vmovn_u32): Likewise. (vmovn_u64): Likewise. rb15031.patch Description: rb15031.patch
[PATCH] aarch64: Use type-qualified builtins for PMUL[L] Neon intrinsics
Hi, This patch declares poly type-qualified builtins and uses them for PMUL[L] Neon intrinsics. This removes the need for casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-08 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Use poly type qualifier in builtin generator macros. * config/aarch64/arm_neon.h (vmul_p8): Use type-qualified builtin and remove casts. (vmulq_p8): Likewise. (vmull_high_p8): Likewise. (vmull_p8): Likewise. rb15030.patch Description: rb15030.patch
[PATCH] aarch64: Use type-qualified builtins for unsigned MLA/MLS intrinsics
Hi, This patch declares type-qualified builtins and uses them for MLA/MLS Neon intrinsics that operate on unsigned types. This eliminates lots of casts in arm_neon.h. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-11-08 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Declare type- qualified builtin generators for unsigned MLA/MLS intrinsics. * config/aarch64/arm_neon.h (vmla_n_u16): Use type-qualified builtin. (vmla_n_u32): Likewise. (vmla_u8): Likewise. (vmla_u16): Likewise. (vmla_u32): Likewise. (vmlaq_n_u16): Likewise. (vmlaq_n_u32): Likewise. (vmlaq_u8): Likewise. (vmlaq_u16): Likewise. (vmlaq_u32): Likewise. (vmls_n_u16): Likewise. (vmls_n_u32): Likewise. (vmls_u8): Likewise. (vmls_u16): Likewise. (vmls_u32): Likewise. (vmlsq_n_u16): Likewise. (vmlsq_n_u32): Likewise. (vmlsq_u8): Likewise. (vmlsq_u16): Likewise. (vmlsq_u32): Likewise. rb15027.patch Description: rb15027.patch
Re: [PATCH 4/6 V2] aarch64: Add machine modes for Neon vector-tuple types
Hi, Each of the comments on the previous version of the patch has been addressed. Ok for master? Thanks, Jonathan

From: Richard Sandiford Sent: 22 October 2021 16:13 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov Subject: Re: [PATCH 4/6] aarch64: Add machine modes for Neon vector-tuple types

Thanks a lot for doing this. Jonathan Wright writes:

> @@ -763,9 +839,16 @@ aarch64_lookup_simd_builtin_type (machine_mode mode,
>      return aarch64_simd_builtin_std_type (mode, q);
>
>    for (i = 0; i < nelts; i++)
> -    if (aarch64_simd_types[i].mode == mode
> -        && aarch64_simd_types[i].q == q)
> -      return aarch64_simd_types[i].itype;
> +    {
> +      if (aarch64_simd_types[i].mode == mode
> +          && aarch64_simd_types[i].q == q)
> +        return aarch64_simd_types[i].itype;
> +      else if (aarch64_simd_tuple_types[i][0] != NULL_TREE)

Very minor (sorry for not noticing earlier), but: the “else” is redundant here.

> +        for (int j = 0; j < 3; j++)
> +          if (TYPE_MODE (aarch64_simd_tuple_types[i][j]) == mode
> +              && aarch64_simd_types[i].q == q)
> +            return aarch64_simd_tuple_types[i][j];
> +    }
>
>    return NULL_TREE;
> }
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index 48eddf64e05afe3788abfa05141f6544a9323ea1..0aa185b67ff13d40c87db0449aec312929ff5387 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -6636,162 +6636,165 @@
>
> ;; Patterns for vector struct loads and stores.
>
> -(define_insn "aarch64_simd_ld2"
> -  [(set (match_operand:OI 0 "register_operand" "=w")
> -       (unspec:OI [(match_operand:OI 1 "aarch64_simd_struct_operand" "Utv")
> -                   (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
> -                  UNSPEC_LD2))]
> +(define_insn "aarch64_simd_ld2"
> +  [(set (match_operand:VSTRUCT_2Q 0 "register_operand" "=w")
> +       (unspec:VSTRUCT_2Q [
> +         (match_operand:VSTRUCT_2Q 1 "aarch64_simd_struct_operand" "Utv")]
> +         UNSPEC_LD2))]
>   "TARGET_SIMD"
>   "ld2\\t{%S0. - %T0.}, %1"
>   [(set_attr "type" "neon_load2_2reg")]
> )
>
> -(define_insn "aarch64_simd_ld2r"
> -  [(set (match_operand:OI 0 "register_operand" "=w")
> -       (unspec:OI [(match_operand:BLK 1 "aarch64_simd_struct_operand" "Utv")
> -                   (unspec:VALLDIF [(const_int 0)] UNSPEC_VSTRUCTDUMMY) ]
> -                  UNSPEC_LD2_DUP))]
> +(define_insn "aarch64_simd_ld2r"
> +  [(set (match_operand:VSTRUCT_2QD 0 "register_operand" "=w")
> +       (unspec:VSTRUCT_2QD [
> +         (match_operand:VSTRUCT_2QD 1 "aarch64_simd_struct_operand" "Utv")]
> +         UNSPEC_LD2_DUP))]

Sorry again for missing this, but the ld2rs, ld3rs and ld4rs should keep their BLKmode arguments, since they only access 2, 3 or 4 scalar memory elements.

> @@ -7515,10 +7605,10 @@
> )
>
> (define_insn_and_split "aarch64_combinev16qi"
> -  [(set (match_operand:OI 0 "register_operand" "=w")
> -       (unspec:OI [(match_operand:V16QI 1 "register_operand" "w")
> -                   (match_operand:V16QI 2 "register_operand" "w")]
> -                  UNSPEC_CONCAT))]
> +  [(set (match_operand:V2x16QI 0 "register_operand" "=w")
> +       (unspec:V2x16QI [(match_operand:V16QI 1 "register_operand" "w")
> +                        (match_operand:V16QI 2 "register_operand" "w")]
> +                       UNSPEC_CONCAT))]

Just realised that we can now make this a vec_concat, since the modes are finally self-consistent. No need to do that though, either way is fine. Looks good otherwise. Richard
[PATCH 4/6] aarch64: Add machine modes for Neon vector-tuple types
Hi, Until now, GCC has used large integer machine modes (OI, CI and XI) to model Neon vector-tuple types. This is suboptimal for many reasons, the most notable are: 1) Large integer modes are opaque and modifying one vector in the tuple requires a lot of inefficient set/get gymnastics. The result is a lot of superfluous move instructions. 2) Large integer modes do not map well to types that are tuples of 64-bit vectors - we need additional zero-padding which again results in superfluous move instructions. This patch adds new machine modes that better model the C-level Neon vector-tuple types. The approach is somewhat similar to that already used for SVE vector-tuple types. All of the AArch64 backend patterns and builtins that manipulate Neon vector tuples are updated to use the new machine modes. This has the effect of significantly reducing the amount of boiler-plate code in the arm_neon.h header. While this patch increases the quality of code generated in many instances, there is still room for significant improvement - which will be attempted in subsequent patches. Bootstrapped and regression tested on aarch64-none-linux-gnu and aarch64_be-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-08-09 Jonathan Wright Richard Sandiford * config/aarch64/aarch64-builtins.c (v2x8qi_UP): Define. (v2x4hi_UP): Likewise. (v2x4hf_UP): Likewise. (v2x4bf_UP): Likewise. (v2x2si_UP): Likewise. (v2x2sf_UP): Likewise. (v2x1di_UP): Likewise. (v2x1df_UP): Likewise. (v2x16qi_UP): Likewise. (v2x8hi_UP): Likewise. (v2x8hf_UP): Likewise. (v2x8bf_UP): Likewise. (v2x4si_UP): Likewise. (v2x4sf_UP): Likewise. (v2x2di_UP): Likewise. (v2x2df_UP): Likewise. (v3x8qi_UP): Likewise. (v3x4hi_UP): Likewise. (v3x4hf_UP): Likewise. (v3x4bf_UP): Likewise. (v3x2si_UP): Likewise. (v3x2sf_UP): Likewise. (v3x1di_UP): Likewise. (v3x1df_UP): Likewise. (v3x16qi_UP): Likewise. (v3x8hi_UP): Likewise. (v3x8hf_UP): Likewise. (v3x8bf_UP): Likewise. (v3x4si_UP): Likewise. (v3x4sf_UP): Likewise. (v3x2di_UP): Likewise. (v3x2df_UP): Likewise. (v4x8qi_UP): Likewise. (v4x4hi_UP): Likewise. (v4x4hf_UP): Likewise. (v4x4bf_UP): Likewise. (v4x2si_UP): Likewise. (v4x2sf_UP): Likewise. (v4x1di_UP): Likewise. (v4x1df_UP): Likewise. (v4x16qi_UP): Likewise. (v4x8hi_UP): Likewise. (v4x8hf_UP): Likewise. (v4x8bf_UP): Likewise. (v4x4si_UP): Likewise. (v4x4sf_UP): Likewise. (v4x2di_UP): Likewise. (v4x2df_UP): Likewise. (TYPES_GETREGP): Delete. (TYPES_SETREGP): Likewise. (TYPES_LOADSTRUCT_U): Define. (TYPES_LOADSTRUCT_P): Likewise. (TYPES_LOADSTRUCT_LANE_U): Likewise. (TYPES_LOADSTRUCT_LANE_P): Likewise. (TYPES_STORE1P): Move for consistency. (TYPES_STORESTRUCT_U): Define. (TYPES_STORESTRUCT_P): Likewise. (TYPES_STORESTRUCT_LANE_U): Likewise. (TYPES_STORESTRUCT_LANE_P): Likewise. (aarch64_simd_tuple_types): Define. (aarch64_lookup_simd_builtin_type): Handle tuple type lookup. (aarch64_init_simd_builtin_functions): Update frontend lookup for builtin functions after handling arm_neon.h pragma. (register_tuple_type): Manually set modes of single-integer tuple types. Record tuple types. * config/aarch64/aarch64-modes.def (ADV_SIMD_D_REG_STRUCT_MODES): Define D-register tuple modes. (ADV_SIMD_Q_REG_STRUCT_MODES): Define Q-register tuple modes. (SVE_MODES): Give single-vector modes priority over vector- tuple modes. (VECTOR_MODES_WITH_PREFIX): Set partial-vector mode order to be after all single-vector modes. * config/aarch64/aarch64-simd-builtins.def: Update builtin generator macros to reflect modifications to the backend patterns. 
* config/aarch64/aarch64-simd.md (aarch64_simd_ld2): Use vector-tuple mode iterator and rename to... (aarch64_simd_ld2): This. (aarch64_simd_ld2r): Use vector-tuple mode iterator and rename to... (aarch64_simd_ld2r): This. (aarch64_vec_load_lanesoi_lane): Use vector-tuple mode iterator and rename to... (aarch64_vec_load_lanes_lane): This. (vec_load_lanesoi): Use vector-tuple mode iterator and rename to... (vec_load_lanes): This. (aarch64_simd_st2): Use vector-tuple mode iterator and rename to... (aarch64_simd_st2): This. (aarch64_vec_store_lanesoi_lane): Use vector-tuple mode iterator and rename to... (aarch64_vec_store_lanes_lane): This.
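To make the tuple-mode idea in the patch above concrete, here is a small example of the C-level type the new modes model (vld2q_s8 is a real intrinsic; the asm comment shows the single expected instruction, though exact register allocation may vary):

    #include <arm_neon.h>

    /* int8x16x2_t is a tuple of two 128-bit vectors.  It previously had
       the opaque 256-bit integer mode OI; with this patch it gets the
       vector-tuple mode V2x16QI (two V16QI vectors), so no set/get
       gymnastics are needed between the tuple and its elements.  */
    int8x16x2_t
    deinterleave (const int8_t *p)
    {
      return vld2q_s8 (p);   /* ld2 {v0.16b - v1.16b}, [x0] */
    }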
[PATCH 6/6] aarch64: Pass and return Neon vector-tuple types without a parallel
Hi, Neon vector-tuple types can be passed in registers on function call and return - there is no need to generate a parallel rtx. This patch adds cases to detect vector-tuple modes and generates an appropriate register rtx. This change greatly improves code generated when passing Neon vector- tuple types between functions; many new test cases are added to defend these improvements. Bootstrapped and regression tested on aarch64-none-linux-gnu and aarch64_be-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-10-07 Jonathan Wright * config/aarch64/aarch64.c (aarch64_function_value): Generate a register rtx for Neon vector-tuple modes. (aarch64_layout_arg): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: New code generation tests. rb14937.patch Description: rb14937.patch
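As an example of the intended effect (a sketch; the new tests check patterns of this kind), a tuple argument can now stay in consecutive vector registers across the call boundary:

    #include <arm_neon.h>

    /* uint8x16x2_t should be passed in two consecutive Q registers, so
       this wrapper ought to compile to little more than a single tbl
       instruction - no stack traffic or superfluous moves.  */
    uint8x16_t
    table_lookup (uint8x16x2_t tab, uint8x16_t idx)
    {
      return vqtbl2q_u8 (tab, idx);
    }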
[PATCH 5/6] gcc/lower-subreg.c: Prevent decomposition if modes are not tieable
Hi, Preventing decomposition if modes are not tieable is necessary to stop AArch64 partial Neon structure modes being treated as packed in registers. This is a necessary prerequisite for a future AArch64 PCS change to maintain good code generation. Bootstrapped and regression tested on: * x86_64-pc-linux-gnu - no issues. * aarch64-none-linux-gnu - two test failures which will be fixed by the next patch in this series. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-10-14 Jonathan Wright * lower-subreg.c (simple_move): Prevent decomposition if modes are not tieable. rb14936.patch Description: rb14936.patch
[PATCH 3/6] gcc/expmed.c: Ensure vector modes are tieable before extraction
Hi, Extracting a bitfield from a vector can be achieved by casting the vector to a new type whose elements are the same size as the desired bitfield, before generating a subreg. However, this is only an optimization if the original vector can be accessed in the new machine mode without first being copied - a condition denoted by the TARGET_MODES_TIEABLE_P hook. This patch adds a check to make sure that the vector modes are tieable before attempting to generate a subreg. This is a necessary prerequisite for a subsequent patch that will introduce new machine modes for Arm Neon vector-tuple types. Bootstrapped and regression tested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-10-11 Jonathan Wright * expmed.c (extract_bit_field_1): Ensure modes are tieable. rb14926.patch Description: rb14926.patch
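The transformation being guarded can be seen at the source level with GNU C vector types (a sketch of what extract_bit_field_1 does internally, not code from the patch):

    typedef unsigned char v16qi __attribute__ ((vector_size (16)));
    typedef unsigned int v4si __attribute__ ((vector_size (16)));

    /* A 32-bit extract from a 16-byte vector can be done by viewing the
       v16qi value as a v4si and taking element 1 via a subreg.  That is
       only copy-free - and therefore only an optimization - when the
       target reports the two modes as tieable.  */
    unsigned int
    get_second_word (v16qi v)
    {
      return ((v4si) v)[1];
    }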
[PATCH 2/6] gcc/expr.c: Remove historic workaround for broken SIMD subreg
Hi, A long time ago, using a parallel to take a subreg of a SIMD register was broken. This temporary fix[1] (from 2003) spilled these registers to memory and reloaded the appropriate part to obtain the subreg. The fix initially existed for the benefit of the PowerPC E500 - a platform for which GCC removed support a number of years ago. Regardless, a proper mechanism for taking a subreg of a SIMD register exists now anyway. This patch removes the workaround thus preventing SIMD registers being dumped to memory unnecessarily - which sometimes can't be fixed by later passes. Bootstrapped and regression tested on aarch64-none-linux-gnu and x86_64-pc-linux-gnu - no issues. Ok for master? Thanks, Jonathan [1] https://gcc.gnu.org/pipermail/gcc-patches/2003-April/102099.html --- gcc/ChangeLog: 2021-10-11 Jonathan Wright * expr.c (emit_group_load_1): Remove historic workaround. rb14923.patch Description: rb14923.patch
[PATCH 1/6] aarch64: Move Neon vector-tuple type declaration into the compiler
Hi, As subject, this patch declares the Neon vector-tuple types inside the compiler instead of in the arm_neon.h header. This is a necessary first step before adding corresponding machine modes to the AArch64 backend. The vector-tuple types are implemented using a #pragma. This means initialization of builtin functions that have vector-tuple types as arguments or return values has to be delayed until the #pragma is handled. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Note that this patch series cannot be merged until the following has been accepted: https://gcc.gnu.org/pipermail/gcc-patches/2021-October/581948.html Ok for master with this proviso? Thanks, Jonathan --- gcc/ChangeLog: 2021-09-10 Jonathan Wright * config/aarch64/aarch64-builtins.c (aarch64_init_simd_builtins): Factor out main loop to... (aarch64_init_simd_builtin_functions): This new function. (register_tuple_type): Define. (aarch64_scalar_builtin_type_p): Define. (handle_arm_neon_h): Define. * config/aarch64/aarch64-c.c (aarch64_pragma_aarch64): Handle pragma for arm_neon.h. * config/aarch64/aarch64-protos.h (aarch64_advsimd_struct_mode_p): Declare. (handle_arm_neon_h): Likewise. * config/aarch64/aarch64.c (aarch64_advsimd_struct_mode_p): Remove static modifier. * config/aarch64/arm_neon.h (target): Remove Neon vector structure type definitions. rb14838.patch Description: rb14838.patch
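Roughly, the mechanism is that the header no longer spells out the tuple structs; the compiler registers them (and the builtins deferred until then) when it handles a pragma. A minimal sketch of the header side - the pragma spelling is inferred from the aarch64-c.c hook named in the ChangeLog, so treat it as an assumption:

    /* arm_neon.h (excerpt): handled by aarch64_pragma_aarch64, which
       declares int8x8x2_t and friends plus the delayed builtins.  */
    #pragma GCC aarch64 "arm_neon.h"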
[PATCH] aarch64: Remove redundant struct type definitions in arm_neon.h
Hi, As subject, this patch deletes some redundant type definitions in arm_neon.h. These vector type definitions are an artifact from the initial commit that added the AArch64 port. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-10-15 Jonathan Wright * config/aarch64/arm_neon.h (__STRUCTN): Delete function macro and all invocations. rb14942.patch Description: rb14942.patch
[PATCH] aarch64: Fix pointer parameter type in LD1 Neon intrinsics
The pointer parameter to load a vector of signed values should itself be a signed type. This patch fixes two instances of this unsigned- signed implicit conversion in arm_neon.h. Tested relevant intrinsics with -Wpointer-sign and warnings no longer present. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-10-14 Jonathan Wright * config/aarch64/arm_neon.h (vld1_s8_x3): Use signed type for pointer parameter. (vld1_s32_x3): Likewise. rb14933.patch Description: rb14933.patch
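A small reproducer for the class of warning being fixed (assumed representative; compile with -Wpointer-sign):

    #include <arm_neon.h>

    /* Before the fix, the builtin behind vld1_s8_x3 took an unsigned
       element pointer, so this perfectly valid call triggered an
       implicit unsigned/signed pointer-conversion warning.  */
    int8x8x3_t
    load_three (const int8_t *p)
    {
      return vld1_s8_x3 (p);
    }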
[PATCH] aarch64: Fix type qualifiers for qtbl1 and qtbx1 Neon builtins
Hi, This patch fixes type qualifiers for the qtbl1 and qtbx1 Neon builtins and removes the casts from the Neon intrinsic function bodies that use these builtins. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-09-23 Jonathan Wright * config/aarch64/aarch64-builtins.c (TYPES_BINOP_PPU): Define new type qualifier enum. (TYPES_TERNOP_SSSU): Likewise. (TYPES_TERNOP_PPPU): Likewise. * config/aarch64/aarch64-simd-builtins.def: Define PPU, SSU, PPPU and SSSU builtin generator macros for qtbl1 and qtbx1 Neon builtins. * config/aarch64/arm_neon.h (vqtbl1_p8): Use type-qualified builtin and remove casts. (vqtbl1_s8): Likewise. (vqtbl1q_p8): Likewise. (vqtbl1q_s8): Likewise. (vqtbx1_s8): Likewise. (vqtbx1_p8): Likewise. (vqtbx1q_s8): Likewise. (vqtbx1q_p8): Likewise. (vtbl1_p8): Likewise. (vtbl2_p8): Likewise. (vtbx2_p8): Likewise. rb14884.patch Description: rb14884.patch
[PATCH] aarch64: Fix float <-> int errors in vld4[q]_lane intrinsics
Hi, A previous commit "aarch64: Remove macros for vld4[q]_lane Neon intrinsics" introduced some float <-> int type conversion errors. This patch fixes those errors. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan

---

gcc/ChangeLog:

2021-08-18  Jonathan Wright

    * config/aarch64/arm_neon.h (vld4_lane_f32): Use float RTL pattern.
    (vld4q_lane_f64): Use float type cast.

From: Andreas Schwab Sent: 18 August 2021 13:09 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright ; Richard Sandiford Subject: Re: [PATCH 3/3] aarch64: Remove macros for vld4[q]_lane Neon intrinsics

I think this patch breaks bootstrap.

In file included from ../../libcpp/lex.c:756:
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h: In function 'float32x2x4_t vld4_lane_f32(const float32_t*, float32x2x4_t, int)':
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h:21081:11: error: cannot convert 'float*' to 'const int*'
21081 |   (__builtin_aarch64_simd_sf *) __ptr, __o, __c);
      |   ^~~
      |   |
      |   float*
: note: initializing argument 1 of '__builtin_aarch64_simd_xi __builtin_aarch64_ld4_lanev2si(const int*, __builtin_aarch64_simd_xi, int)'
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h: In function 'float64x2x4_t vld4q_lane_f64(const float64_t*, float64x2x4_t, int)':
/opt/gcc/gcc-20210818/Build/prev-gcc/include/arm_neon.h:21384:9: error: cannot convert 'long int*' to 'const double*'
21384 |   (__builtin_aarch64_simd_di *) __ptr, __o, __c);
      |   ^~~
      |   |
      |   long int*
: note: initializing argument 1 of '__builtin_aarch64_simd_xi __builtin_aarch64_ld4_lanev2df(const double*, __builtin_aarch64_simd_xi, int)'

Andreas.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."

rb14797.patch Description: rb14797.patch
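From the errors quoted above, the fix routes vld4_lane_f32 through the float builtin so the pointer types agree. A sketch of the corrected call - the v2sf builtin name is inferred from its v2si/v2df siblings in the error messages, not quoted from the patch:

    /* was: __builtin_aarch64_ld4_lanev2si ((const int *) __ptr, __o, __c) */
    __o = __builtin_aarch64_ld4_lanev2sf (
            (__builtin_aarch64_simd_sf *) __ptr, __o, __c);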
[PATCH 3/3] aarch64: Remove macros for vld4[q]_lane Neon intrinsics
Hi, This patch removes macros for vld4[q]_lane Neon intrinsics. This is a preparatory step before adding new modes for structures of Advanced SIMD vectors. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-08-16 Jonathan Wright * config/aarch64/arm_neon.h (__LD4_LANE_FUNC): Delete. (__LD4Q_LANE_FUNC): Likewise. (vld4_lane_u8): Define without macro. (vld4_lane_u16): Likewise. (vld4_lane_u32): Likewise. (vld4_lane_u64): Likewise. (vld4_lane_s8): Likewise. (vld4_lane_s16): Likewise. (vld4_lane_s32): Likewise. (vld4_lane_s64): Likewise. (vld4_lane_f16): Likewise. (vld4_lane_f32): Likewise. (vld4_lane_f64): Likewise. (vld4_lane_p8): Likewise. (vld4_lane_p16): Likewise. (vld4_lane_p64): Likewise. (vld4q_lane_u8): Likewise. (vld4q_lane_u16): Likewise. (vld4q_lane_u32): Likewise. (vld4q_lane_u64): Likewise. (vld4q_lane_s8): Likewise. (vld4q_lane_s16): Likewise. (vld4q_lane_s32): Likewise. (vld4q_lane_s64): Likewise. (vld4q_lane_f16): Likewise. (vld4q_lane_f32): Likewise. (vld4q_lane_f64): Likewise. (vld4q_lane_p8): Likewise. (vld4q_lane_p16): Likewise. (vld4q_lane_p64): Likewise. (vld4_lane_bf16): Likewise. (vld4q_lane_bf16): Likewise. rb14793.patch Description: rb14793.patch
[PATCH 2/3] aarch64: Remove macros for vld3[q]_lane Neon intrinsics
Hi, This patch removes macros for vld3[q]_lane Neon intrinsics. This is a preparatory step before adding new modes for structures of Advanced SIMD vectors. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-08-16 Jonathan Wright * config/aarch64/arm_neon.h (__LD3_LANE_FUNC): Delete. (__LD3Q_LANE_FUNC): Delete. (vld3_lane_u8): Define without macro. (vld3_lane_u16): Likewise. (vld3_lane_u32): Likewise. (vld3_lane_u64): Likewise. (vld3_lane_s8): Likewise. (vld3_lane_s16): Likewise. (vld3_lane_s32): Likewise. (vld3_lane_s64): Likewise. (vld3_lane_f16): Likewise. (vld3_lane_f32): Likewise. (vld3_lane_f64): Likewise. (vld3_lane_p8): Likewise. (vld3_lane_p16): Likewise. (vld3_lane_p64): Likewise. (vld3q_lane_u8): Likewise. (vld3q_lane_u16): Likewise. (vld3q_lane_u32): Likewise. (vld3q_lane_u64): Likewise. (vld3q_lane_s8): Likewise. (vld3q_lane_s16): Likewise. (vld3q_lane_s32): Likewise. (vld3q_lane_s64): Likewise. (vld3q_lane_f16): Likewise. (vld3q_lane_f32): Likewise. (vld3q_lane_f64): Likewise. (vld3q_lane_p8): Likewise. (vld3q_lane_p16): Likewise. (vld3q_lane_p64): Likewise. (vld3_lane_bf16): Likewise. (vld3q_lane_bf16): Likewise. rb14792.patch Description: rb14792.patch
[PATCH 1/3] aarch64: Remove macros for vld2[q]_lane Neon intrinsics
Hi, This patch removes macros for vld2[q]_lane Neon intrinsics. This is a preparatory step before adding new modes for structures of Advanced SIMD vectors. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-08-12 Jonathan Wright * config/aarch64/arm_neon.h (__LD2_LANE_FUNC): Delete. (__LD2Q_LANE_FUNC): Likewise. (vld2_lane_u8): Define without macro. (vld2_lane_u16): Likewise. (vld2_lane_u32): Likewise. (vld2_lane_u64): Likewise. (vld2_lane_s8): Likewise. (vld2_lane_s16): Likewise. (vld2_lane_s32): Likewise. (vld2_lane_s64): Likewise. (vld2_lane_f16): Likewise. (vld2_lane_f32): Likewise. (vld2_lane_f64): Likewise. (vld2_lane_p8): Likewise. (vld2_lane_p16): Likewise. (vld2_lane_p64): Likewise. (vld2q_lane_u8): Likewise. (vld2q_lane_u16): Likewise. (vld2q_lane_u32): Likewise. (vld2q_lane_u64): Likewise. (vld2q_lane_s8): Likewise. (vld2q_lane_s16): Likewise. (vld2q_lane_s32): Likewise. (vld2q_lane_s64): Likewise. (vld2q_lane_f16): Likewise. (vld2q_lane_f32): Likewise. (vld2q_lane_f64): Likewise. (vld2q_lane_p8): Likewise. (vld2q_lane_p16): Likewise. (vld2q_lane_p64): Likewise. (vld2_lane_bf16): Likewise. (vld2q_lane_bf16): Likewise. rb14791.patch Description: rb14791.patch
[PATCH] testsuite: aarch64: Fix invalid SVE tests
Hi, Some scan-assembler tests for SVE code generation were erroneously split over multiple lines - meaning they became invalid. This patch gets the tests working again by putting each test on a single line. The extract_[1234].c tests are corrected to expect that extracted 32-bit values are moved into 'w' registers rather than 'x' registers. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-08-06 Jonathan Wright * gcc.target/aarch64/sve/dup_lane_1.c: Don't split scan-assembler tests over multiple lines. Expect 32-bit result values in 'w' registers. * gcc.target/aarch64/sve/extract_1.c: Likewise. * gcc.target/aarch64/sve/extract_2.c: Likewise. * gcc.target/aarch64/sve/extract_3.c: Likewise. * gcc.target/aarch64/sve/extract_4.c: Likewise. rb14768.patch Description: rb14768.patch
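For reference, a valid directive keeps the whole test on one line, and a 32-bit extract should be matched in a 'w' register. An illustrative directive of the corrected shape (not the literal regex from the extract tests):

    /* { dg-final { scan-assembler {\tumov\tw[0-9]+, v[0-9]+\.s\[3\]\n} } } */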
Re: [PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian
Hi, I've corrected the quoting and moved everything on to one line. Ok for master? Thanks, Jonathan

---

gcc/testsuite/ChangeLog:

2021-08-04  Jonathan Wright

    * gcc.target/aarch64/vector_structure_intrinsics.c: Restrict
    tests to little-endian targets.

From: Richard Sandiford Sent: 06 August 2021 13:24 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org ; Christophe Lyon Subject: Re: [PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian

Jonathan Wright writes:

> diff --git a/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c b/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c
> index 60c53bc27f8378c78b119576ed19fde0e5743894..a8e31ab85d6fd2a045c8efaf2cbc42b5f40d2411 100644
> --- a/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c
> +++ b/gcc/testsuite/gcc.target/aarch64/vector_structure_intrinsics.c
> @@ -197,7 +197,8 @@ TEST_ST1x3 (vst1q, uint64x2x3_t, uint64_t*, u64, x3);
> TEST_ST1x3 (vst1q, poly64x2x3_t, poly64_t*, p64, x3);
> TEST_ST1x3 (vst1q, float64x2x3_t, float64_t*, f64, x3);
>
> -/* { dg-final { scan-assembler-not "mov\\t" } } */
> +/* { dg-final { scan-assembler-not {"mov\\t"} {
> +  target { aarch64_little_endian } } ) } */

I think this needs to stay on one line. We should also either keep the original quoting on the regexp or use {mov\t}. Having both forms of quote would turn it into a test for the characters: "mov\t" (including quotes and backslash).

Thanks, Richard

>
> /* { dg-final { scan-assembler-times "tbl\\t" 18} } */
> /* { dg-final { scan-assembler-times "tbx\\t" 18} } */

rb14749.patch Description: rb14749.patch
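Concretely, the two acceptable single-line spellings per that advice are (equivalent; one quoting style, not both):

    /* { dg-final { scan-assembler-not {mov\t} { target { aarch64_little_endian } } } } */
    /* { dg-final { scan-assembler-not "mov\\t" { target { aarch64_little_endian } } } } */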
[PATCH 4/4] aarch64: Use memcpy to copy structures in bfloat vst* intrinsics
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst[234][q] and vst1[q]_x[234] bfloat Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify that superfluous move instructions are not generated for the vst[234]q or vst1q_x[234] bfloat intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-30 Jonathan Wright * config/aarch64/arm_neon.h (vst1_bf16_x2): Use __builtin_memcpy instead of constructing an additional __builtin_aarch64_simd_oi one vector at a time. (vst1q_bf16_x2): Likewise. (vst1_bf16_x3): Use __builtin_memcpy instead of constructing an additional __builtin_aarch64_simd_ci one vector at a time. (vst1q_bf16_x3): Likewise. (vst1_bf16_x4): Use __builtin_memcpy instead of a union. (vst1q_bf16_x4): Likewise. (vst2_bf16): Use __builtin_memcpy instead of constructing an additional __builtin_aarch64_simd_oi one vector at a time. (vst2q_bf16): Likewise. (vst3_bf16): Use __builtin_memcpy instead of constructing an additional __builtin_aarch64_simd_ci mode one vector at a time. (vst3q_bf16): Likewise. (vst4_bf16): Use __builtin_memcpy instead of constructing an additional __builtin_aarch64_simd_xi one vector at a time. (vst4q_bf16): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14731.patch Description: rb14731.patch
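A sketch of the change for one intrinsic body (the set_qreg builtin name follows the pattern described in the ChangeLog but is an assumption here, not copied from the patch):

    /* Before: populate the opaque-mode value one vector at a time.  */
    __builtin_aarch64_simd_oi __o;
    __o = __builtin_aarch64_set_qregoiv8bf (__o, __val.val[0], 0);
    __o = __builtin_aarch64_set_qregoiv8bf (__o, __val.val[1], 1);

    /* After: one block copy, which the RTL optimizers see through.  */
    __builtin_aarch64_simd_oi __o;
    __builtin_memcpy (&__o, &__val, sizeof (__val));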
[PATCH 3/4] aarch64: Use memcpy to copy structures in vst2[q]_lane intrinsics
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst2[q]_lane Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify that superfluous move instructions are not generated for the vst2q_lane intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-30 Jonathan Wright * config/aarch64/arm_neon.h (__ST2_LANE_FUNC): Delete. (__ST2Q_LANE_FUNC): Delete. (vst2_lane_f16): Use __builtin_memcpy to copy vector structure instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vst2_lane_f32): Likewise. (vst2_lane_f64): Likewise. (vst2_lane_p8): Likewise. (vst2_lane_p16): Likewise. (vst2_lane_p64): Likewise. (vst2_lane_s8): Likewise. (vst2_lane_s16): Likewise. (vst2_lane_s32): Likewise. (vst2_lane_s64): Likewise. (vst2_lane_u8): Likewise. (vst2_lane_u16): Likewise. (vst2_lane_u32): Likewise. (vst2_lane_u64): Likewise. (vst2_lane_bf16): Likewise. (vst2q_lane_f16): Use __builtin_memcpy to copy vector structure instead of using a union. (vst2q_lane_f32): Likewise. (vst2q_lane_f64): Likewise. (vst2q_lane_p8): Likewise. (vst2q_lane_p16): Likewise. (vst2q_lane_p64): Likewise. (vst2q_lane_s8): Likewise. (vst2q_lane_s16): Likewise. (vst2q_lane_s32): Likewise. (vst2q_lane_s64): Likewise. (vst2q_lane_u8): Likewise. (vst2q_lane_u16): Likewise. (vst2q_lane_u32): Likewise. (vst2q_lane_u64): Likewise. (vst2q_lane_bf16): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14730.patch Description: rb14730.patch
[PATCH 2/4] aarch64: Use memcpy to copy structures in vst3[q]_lane intrinsics
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst3[q]_lane Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify that superfluous move instructions are not generated for the vst3q_lane intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-30 Jonathan Wright * config/aarch64/arm_neon.h (__ST3_LANE_FUNC): Delete. (__ST3Q_LANE_FUNC): Delete. (vst3_lane_f16): Use __builtin_memcpy to copy vector structure instead of constructing __builtin_aarch64_simd_ci one vector at a time. (vst3_lane_f32): Likewise. (vst3_lane_f64): Likewise. (vst3_lane_p8): Likewise. (vst3_lane_p16): Likewise. (vst3_lane_p64): Likewise. (vst3_lane_s8): Likewise. (vst3_lane_s16): Likewise. (vst3_lane_s32): Likewise. (vst3_lane_s64): Likewise. (vst3_lane_u8): Likewise. (vst3_lane_u16): Likewise. (vst3_lane_u32): Likewise. (vst3_lane_u64): Likewise. (vst3_lane_bf16): Likewise. (vst3q_lane_f16): Use __builtin_memcpy to copy vector structure instead of using a union. (vst3q_lane_f32): Likewise. (vst3q_lane_f64): Likewise. (vst3q_lane_p8): Likewise. (vst3q_lane_p16): Likewise. (vst3q_lane_p64): Likewise. (vst3q_lane_s8): Likewise. (vst3q_lane_s16): Likewise. (vst3q_lane_s32): Likewise. (vst3q_lane_s64): Likewise. (vst3q_lane_u8): Likewise. (vst3q_lane_u16): Likewise. (vst3q_lane_u32): Likewise. (vst3q_lane_u64): Likewise. (vst3q_lane_bf16): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14729.patch Description: rb14729.patch
[PATCH 1/4] aarch64: Use memcpy to copy structures in vst4[q]_lane intrinsics
Hi, As subject, this patch uses __builtin_memcpy to copy vector structures instead of using a union - or constructing a new opaque structure one vector at a time - in each of the vst4[q]_lane Neon intrinsics in arm_neon.h. It also adds new code generation tests to verify that superfluous move instructions are not generated for the vst4q_lane intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-29 Jonathan Wright * config/aarch64/arm_neon.h (__ST4_LANE_FUNC): Delete. (__ST4Q_LANE_FUNC): Delete. (vst4_lane_f16): Use __builtin_memcpy to copy vector structure instead of constructing __builtin_aarch64_simd_xi one vector at a time. (vst4_lane_f32): Likewise. (vst4_lane_f64): Likewise. (vst4_lane_p8): Likewise. (vst4_lane_p16): Likewise. (vst4_lane_p64): Likewise. (vst4_lane_s8): Likewise. (vst4_lane_s16): Likewise. (vst4_lane_s32): Likewise. (vst4_lane_s64): Likewise. (vst4_lane_u8): Likewise. (vst4_lane_u16): Likewise. (vst4_lane_u32): Likewise. (vst4_lane_u64): Likewise. (vst4_lane_bf16): Likewise. (vst4q_lane_f16): Use __builtin_memcpy to copy vector structure instead of using a union. (vst4q_lane_f32): Likewise. (vst4q_lane_f64): Likewise. (vst4q_lane_p8): Likewise. (vst4q_lane_p16): Likewise. (vst4q_lane_p64): Likewise. (vst4q_lane_s8): Likewise. (vst4q_lane_s16): Likewise. (vst4q_lane_s32): Likewise. (vst4q_lane_s64): Likewise. (vst4q_lane_u8): Likewise. (vst4q_lane_u16): Likewise. (vst4q_lane_u32): Likewise. (vst4q_lane_u64): Likewise. (vst4q_lane_bf16): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14728.patch Description: rb14728.patch
[PATCH V2] aarch64: Don't include vec_select high-half in SIMD subtract cost
Hi, V2 of this change implements the same approach as for the multiply and add-widen patches. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28 Jonathan Wright * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost of vec_select high-half from being added into Neon subtract cost. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vsubX_high_cost.c: New test. From: Jonathan Wright Sent: 29 July 2021 10:23 To: gcc-patches@gcc.gnu.org Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] aarch64: Don't include vec_select high-half in SIMD subtract cost Hi, The Neon subtract-long/subtract-widen instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon subtract cost function to match vec_select high-half of its operands. This traversal prevents the cost of the vec_select from being added into the cost of the subtract - meaning that these instructions can now be emitted in the combine pass as they are no longer deemed prohibitively expensive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28 Jonathan Wright * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost of vec_select high-half from being added into Neon subtract cost. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vsubX_high_cost.c: New test. rb14711.patch Description: rb14711.patch
[PATCH V2] aarch64: Don't include vec_select high-half in SIMD add cost
Hi, V2 of this patch uses the same approach as that just implemented for the multiply high-half cost patch. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28 Jonathan Wright * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost of vec_select high-half from being added into Neon add cost. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vaddX_high_cost.c: New test. From: Jonathan Wright Sent: 29 July 2021 10:22 To: gcc-patches@gcc.gnu.org Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] aarch64: Don't include vec_select high-half in SIMD add cost Hi, The Neon add-long/add-widen instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon add cost function to match vec_select high-half of its operands. This traversal prevents the cost of the vec_select from being added into the cost of the add - meaning that these instructions can now be emitted in the combine pass as they are no longer deemed prohibitively expensive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28 Jonathan Wright * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost of vec_select high-half from being added into Neon add cost. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vaddX_high_cost.c: New test. rb14710.patch Description: rb14710.patch
[PATCH V2] aarch64: Don't include vec_select high-half in SIMD multiply cost
Hi, Changes suggested here and those discussed off-list have been implemented in V2 of the patch. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan

---

gcc/ChangeLog:

2021-07-19  Jonathan Wright

    * config/aarch64/aarch64.c (aarch64_strip_extend_vec_half):
    Define.
    (aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
    vec_select high-half from being added into Neon multiply cost.
    * rtlanal.c (vec_series_highpart_p): Define.
    * rtlanal.h (vec_series_highpart_p): Declare.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/vmul_high_cost.c: New test.

From: Richard Sandiford Sent: 04 August 2021 10:05 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH] aarch64: Don't include vec_select high-half in SIMD multiply cost

Jonathan Wright via Gcc-patches writes:
> Hi,
>
> The Neon multiply/multiply-accumulate/multiply-subtract instructions
> can select the top or bottom half of the operand registers. This
> selection does not change the cost of the underlying instruction and
> this should be reflected by the RTL cost function.
>
> This patch adds RTL tree traversal in the Neon multiply cost function
> to match vec_select high-half of its operands. This traversal
> prevents the cost of the vec_select from being added into the cost of
> the multiply - meaning that these instructions can now be emitted in
> the combine pass as they are no longer deemed prohibitively
> expensive.
>
> Regression tested and bootstrapped on aarch64-none-linux-gnu - no
> issues.

Like you say, the instructions can handle both the low and high halves. Shouldn't we also check for the low part (as a SIGN/ZERO_EXTEND of a subreg)?

> Ok for master?
>
> Thanks,
> Jonathan
>
> ---
>
> gcc/ChangeLog:
>
> 2021-07-19  Jonathan Wright
>
> * config/aarch64/aarch64.c (aarch64_vec_select_high_operand_p):
> Define.
> (aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of
> vec_select high-half from being added into Neon multiply
> cost.
> * rtlanal.c (vec_series_highpart_p): Define.
> * rtlanal.h (vec_series_highpart_p): Declare.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/vmul_high_cost.c: New test.
>
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 5809887997305317c5a81421089db431685e2927..a49672afe785e3517250d324468edacceab5c9d3 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -76,6 +76,7 @@
>  #include "function-abi.h"
>  #include "gimple-pretty-print.h"
>  #include "tree-ssa-loop-niter.h"
> +#include "rtlanal.h"
>
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -11970,6 +11971,19 @@ aarch64_cheap_mult_shift_p (rtx x)
>    return false;
>  }
>
> +/* Return true iff X is an operand of a select-high-half vector
> +   instruction.  */
> +
> +static bool
> +aarch64_vec_select_high_operand_p (rtx x)
> +{
> +  return ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
> +          && GET_CODE (XEXP (x, 0)) == VEC_SELECT
> +          && vec_series_highpart_p (GET_MODE (XEXP (x, 0)),
> +                                    GET_MODE (XEXP (XEXP (x, 0), 0)),
> +                                    XEXP (XEXP (x, 0), 1)));
> +}
> +
>  /* Helper function for rtx cost calculation.  Calculate the cost of
>     a MULT or ASHIFT, which may be part of a compound PLUS/MINUS rtx.
>     Return the calculated cost of the expression, recursing manually in to
> @@ -11995,6 +12009,13 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, int outer, bool speed)
>    unsigned int vec_flags = aarch64_classify_vector_mode (mode);
>    if (vec_flags & VEC_ADVSIMD)
>      {
> +      /* The select-operand-high-half versions of the instruction have the
> +         same cost as the three vector version - don't add the costs of the
> +         select into the costs of the multiply.  */
> +      if (aarch64_vec_select_high_operand_p (op0))
> +        op0 = XEXP (XEXP (op0, 0), 0);
> +      if (aarch64_vec_select_high_operand_p (op1))
> +        op1 = XEXP (XEXP (op1, 0), 0);

For consistency with aarch64_strip_duplicate_vec_elt, I think this should be something like aarch64_strip_vec_extension, returning the inner rtx on success and the original one on failure.

Thanks, Richard

> /* The by-element versions of the instruction have the same costs as
> the normal 3-vector version. So don't add the costs of the
> duplicate or subsequent select i
[PATCH] testsuite: aarch64: Fix failing vector structure tests on big-endian
Hi, Recent refactoring of the arm_neon.h header enabled better code generation for intrinsics that manipulate vector structures. New tests were also added to verify the benefit of these changes. It now transpires that the code generation improvements are observed only on little-endian systems. This patch restricts the code generation tests to little-endian targets (for now.) Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-08-04 Jonathan Wright * gcc.target/aarch64/vector_structure_intrinsics.c: Restrict tests to little-endian targets. From: Christophe Lyon Sent: 03 August 2021 10:42 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org ; Richard Sandiford Subject: Re: [PATCH 1/8] aarch64: Use memcpy to copy vector tables in vqtbl[234] intrinsics On Fri, Jul 23, 2021 at 10:22 AM Jonathan Wright via Gcc-patches wrote: Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are no longer generated for the vqtbl[234] intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-08 Jonathan Wright * config/aarch64/arm_neon.h (vqtbl2_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vqtbl2_u8): Likewise. (vqtbl2_p8): Likewise. (vqtbl2q_s8): Likewise. (vqtbl2q_u8): Likewise. (vqtbl2q_p8): Likewise. (vqtbl3_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_ci one vector at a time. (vqtbl3_u8): Likewise. (vqtbl3_p8): Likewise. (vqtbl3q_s8): Likewise. (vqtbl3q_u8): Likewise. (vqtbl3q_p8): Likewise. (vqtbl4_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_xi one vector at a time. (vqtbl4_u8): Likewise. (vqtbl4_p8): Likewise. (vqtbl4q_s8): Likewise. (vqtbl4q_u8): Likewise. (vqtbl4q_p8): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: New test. Hi, This new test fails on aarch64_be: FAIL: gcc.target/aarch64/vector_structure_intrinsics.c scan-assembler-not mov\\t Can you check? Thanks Christophe rb14749.patch Description: rb14749.patch
[PATCH] aarch64: Don't include vec_select high-half in SIMD subtract cost
Hi, The Neon subtract-long/subtract-widen instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon subtract cost function to match vec_select high-half of its operands. This traversal prevents the cost of the vec_select from being added into the cost of the subtract - meaning that these instructions can now be emitted in the combine pass as they are no longer deemed prohibitively expensive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28 Jonathan Wright * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost of vec_select high-half from being added into Neon subtract cost. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vsubX_high_cost.c: New test. rb14711.patch Description: rb14711.patch
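To illustrate the kind of sequence the new test presumably exercises (my own example, not necessarily the testcase's exact code):

#include <arm_neon.h>

int32x4_t
sub_widen_high (int16x8_t a, int16x8_t b)
{
  /* With the cost fix, combine folds the high-half extractions into
     the widening subtract, giving a single SSUBL2.  */
  return vsubl_s16 (vget_high_s16 (a), vget_high_s16 (b));
}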
[PATCH] aarch64: Don't include vec_select high-half in SIMD add cost
Hi, The Neon add-long/add-widen instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon add cost function to match vec_select high-half of its operands. This traversal prevents the cost of the vec_select from being added into the cost of the add - meaning that these instructions can now be emitted in the combine pass as they are no longer deemed prohibitively expensive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-28 Jonathan Wright * config/aarch64/aarch64.c: Traverse RTL tree to prevent cost of vec_select high-half from being added into Neon add cost. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vaddX_high_cost.c: New test. rb14710.patch Description: rb14710.patch
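The add-widen case looks analogous - again an illustration rather than the committed testcase:

#include <arm_neon.h>

uint32x4_t
add_widen_high (uint32x4_t a, uint16x8_t b)
{
  /* Expected to combine into a single UADDW2 instead of a high-half
     extraction followed by UADDW.  */
  return vaddw_u16 (a, vget_high_u16 (b));
}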
[PATCH] aarch64: Don't include vec_select high-half in SIMD multiply cost
Hi, The Neon multiply/multiply-accumulate/multiply-subtract instructions can select the top or bottom half of the operand registers. This selection does not change the cost of the underlying instruction and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon multiply cost function to match vec_select high-half of its operands. This traversal prevents the cost of the vec_select from being added into the cost of the multiply - meaning that these instructions can now be emitted in the combine pass as they are no longer deemed prohibitively expensive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-19 Jonathan Wright * config/aarch64/aarch64.c (aarch64_vec_select_high_operand_p): Define. (aarch64_rtx_mult_cost): Traverse RTL tree to prevent cost of vec_select high-half from being added into Neon multiply cost. * rtlanal.c (vec_series_highpart_p): Define. * rtlanal.h (vec_series_highpart_p): Declare. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vmul_high_cost.c: New test. rb14704.patch Description: rb14704.patch
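And the multiply case (illustrative only):

#include <arm_neon.h>

int32x4_t
mul_widen_high (int16x8_t a, int16x8_t b)
{
  /* Expected to combine into a single SMULL2.  */
  return vmull_s16 (vget_high_s16 (a), vget_high_s16 (b));
}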
Re: [PATCH V2] aarch64: Don't include vec_select in SIMD multiply cost
Hi, V2 of the patch addresses the initial review comments, factors out common code (as we discussed off-list) and adds a set of unit tests to verify the code generation benefit. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-19 Jonathan Wright * config/aarch64/aarch64.c (aarch64_strip_duplicate_vec_elt): Define. (aarch64_rtx_mult_cost): Traverse RTL tree to prevent vec_select cost from being added into Neon multiply cost. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vmul_element_cost.c: New test. From: Richard Sandiford Sent: 22 July 2021 18:16 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov Subject: Re: [PATCH] aarch64: Don't include vec_select in SIMD multiply cost Jonathan Wright writes: > Hi, > > The Neon multiply/multiply-accumulate/multiply-subtract instructions > can take various forms - multiplying full vector registers of values > or multiplying one vector by a single element of another. Regardless > of the form used, these instructions have the same cost, and this > should be reflected by the RTL cost function. > > This patch adds RTL tree traversal in the Neon multiply cost function > to match the vec_select used by the lane-referencing forms of the > instructions already mentioned. This traversal prevents the cost of > the vec_select from being added into the cost of the multiply - > meaning that these instructions can now be emitted in the combine > pass as they are no longer deemed prohibitively expensive. > > Regression tested and bootstrapped on aarch64-none-linux-gnu - no > issues. > > Ok for master? > > Thanks, > Jonathan > > --- > > gcc/ChangeLog: > > 2021-07-19 Jonathan Wright > > * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse > RTL tree to prevent vec_select from being added into Neon > multiply cost. > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index > f5b25a7f7041645921e6ad85714efda73b993492..b368303b0e699229266e6d008e28179c496bf8cd > 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -11985,6 +11985,21 @@ aarch64_rtx_mult_cost (rtx x, enum rtx_code code, > int outer, bool speed) > op0 = XEXP (op0, 0); > else if (GET_CODE (op1) == VEC_DUPLICATE) > op1 = XEXP (op1, 0); > + /* The same argument applies to the VEC_SELECT when using the lane- > + referencing forms of the MUL/MLA/MLS instructions. Without the > + traversal here, the combine pass deems these patterns too > + expensive and subsequently does not emit the lane-referencing > + forms of the instructions. In addition, canonical form is for the > + VEC_SELECT to be the second argument of the multiply - thus only > + op1 is traversed. */ > + if (GET_CODE (op1) == VEC_SELECT > + && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1) > + op1 = XEXP (op1, 0); > + else if ((GET_CODE (op1) == ZERO_EXTEND > + || GET_CODE (op1) == SIGN_EXTEND) > + && GET_CODE (XEXP (op1, 0)) == VEC_SELECT > + && GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1) > + op1 = XEXP (XEXP (op1, 0), 0); I think this logically belongs in the “GET_CODE (op1) == VEC_DUPLICATE” if block, since the condition is never true otherwise. We can probably skip the GET_MODE_NUNITS tests, but if you'd prefer to keep them, I think it would be better to add them to the existing VEC_DUPLICATE tests rather than restrict them to the VEC_SELECT ones. 
Also, although this is in Advanced SIMD-specific code, I think it'd be better to use: is_a <scalar_mode> (GET_MODE (op1)) instead of: GET_MODE_NUNITS (GET_MODE (op1)).to_constant () == 1 Do you have a testcase? Thanks, Richard rb14675.patch Description: rb14675.patch
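For reference, the factored-out helper named in the V2 ChangeLog might look roughly like this, folding in the is_a <scalar_mode> suggestion (a sketch only; the attached rb14675.patch is authoritative):

static rtx
aarch64_strip_duplicate_vec_elt (rtx x)
{
  /* A scalar broadcast.  */
  if (GET_CODE (x) == VEC_DUPLICATE
      && is_a <scalar_mode> (GET_MODE (XEXP (x, 0))))
    return XEXP (x, 0);
  /* A single-lane selection, bare or wrapped in an extension.  */
  if (GET_CODE (x) == VEC_SELECT
      && is_a <scalar_mode> (GET_MODE (x)))
    return XEXP (x, 0);
  if ((GET_CODE (x) == ZERO_EXTEND || GET_CODE (x) == SIGN_EXTEND)
      && GET_CODE (XEXP (x, 0)) == VEC_SELECT
      && is_a <scalar_mode> (GET_MODE (XEXP (x, 0))))
    return XEXP (XEXP (x, 0), 0);
  return x;
}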
Re: [PATCH V2] simplify-rtx: Push sign/zero-extension inside vec_duplicate
Hi, This updated patch fixes the two-operators-per-row style issue in the aarch64-simd.md RTL patterns as well as integrating the simplify-rtx.c change as suggested. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-19 Jonathan Wright * config/aarch64/aarch64-simd.md: Push sign/zero-extension inside vec_duplicate for all patterns. * simplify-rtx.c (simplify_context::simplify_unary_operation_1): Push sign/zero-extension inside vec_duplicate. From: Richard Sandiford Sent: 22 July 2021 18:36 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org ; Kyrylo Tkachov Subject: Re: [PATCH] simplify-rtx: Push sign/zero-extension inside vec_duplicate Jonathan Wright writes: > Hi, > > As a general principle, vec_duplicate should be as close to the root > of an expression as possible. Where unary operations have > vec_duplicate as an argument, these operations should be pushed > inside the vec_duplicate. > > This patch modifies unary operation simplification to push > sign/zero-extension of a scalar inside vec_duplicate. > > This patch also updates all RTL patterns in aarch64-simd.md to use > the new canonical form. > > Regression tested and bootstrapped on aarch64-none-linux-gnu and > x86_64-none-linux-gnu - no issues. > > Ok for master? > > Thanks, > Jonathan > > --- > > gcc/ChangeLog: > > 2021-07-19 Jonathan Wright > > * config/aarch64/aarch64-simd.md: Push sign/zero-extension > inside vec_duplicate for all patterns. > * simplify-rtx.c (simplify_context::simplify_unary_operation_1): > Push sign/zero-extension inside vec_duplicate. > > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > index > 74890989cb3045798bf8d0241467eaaf72238297..99a95a54248041906b9a0ad742d3a0dca9733b35 > 100644 > --- a/gcc/config/aarch64/aarch64-simd.md > +++ b/gcc/config/aarch64/aarch64-simd.md > @@ -2092,14 +2092,14 @@ > > (define_insn "aarch64_mlal_hi_n_insn" > [(set (match_operand: 0 "register_operand" "=w") > - (plus: > - (mult: > - (ANY_EXTEND: (vec_select: > - (match_operand:VQ_HSI 2 "register_operand" "w") > - (match_operand:VQ_HSI 3 "vect_par_cnst_hi_half" ""))) > - (ANY_EXTEND: (vec_duplicate: > - (match_operand: 4 "register_operand" "" > - (match_operand: 1 "register_operand" "0")))] > + (plus: > + (mult: > + (ANY_EXTEND: (vec_select: > + (match_operand:VQ_HSI 2 "register_operand" "w") > + (match_operand:VQ_HSI 3 "vect_par_cnst_hi_half" ""))) > + (vec_duplicate: (ANY_EXTEND: > + (match_operand: 4 "register_operand" "" > + (match_operand: 1 "register_operand" "0")))] Sorry to nitpick, since this is pre-existing, but I think the pattern would be easier to read with one operation per line. I.e.: (plus: (mult: (ANY_EXTEND: (vec_select: (match_operand:VQ_HSI 2 "register_operand" "w") (match_operand:VQ_HSI 3 "vect_par_cnst_hi_half" ""))) (vec_duplicate: (ANY_EXTEND: (match_operand: 4 "register_operand" "" (match_operand: 1 "register_operand" "0")))] Same for the other patterns with similar doubling of operators. (It looks like you've fixed other indentation problems though, thanks.) 
> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c > index > 2d169d3f9f70c85d396adaed124b6c52aca98f07..f885816412f7576d2535f827562d2b425a6a553b > 100644 > --- a/gcc/simplify-rtx.c > +++ b/gcc/simplify-rtx.c > @@ -903,6 +903,18 @@ simplify_context::simplify_unary_operation_1 (rtx_code > code, machine_mode mode, > rtx temp, elt, base, step; > scalar_int_mode inner, int_mode, op_mode, op0_mode; > > + /* Extending a VEC_DUPLICATE of a scalar should be canonicalized to a > + VEC_DUPLICATE of an extended scalar. This is outside of the main switch > + as we may wish to push all unary operations inside VEC_DUPLICATE. */ > + if ((code == SIGN_EXTEND || code == ZERO_EXTEND) > + && GET_CODE (op) == VEC_DUPLICATE > + && GET_MODE_NUNITS (GET_MODE (XEXP (op, 0))).to_constant () == 1) > + { > + rtx x = simplify_gen_unary (code, GET_MODE_INNER (mode), > + XEXP (op, 0), GET_MODE (XEXP (op, 0))); > + return gen_vec_duplicate (mode, x); > + } > + > switch (code) > { > case NOT: This is really an extension of the existing code: if (VECTOR_MODE_P (mode) && vec_duplicate_p (op, &elt) && code != VEC_DUPLICATE) { /* Try applying the operator to ELT and see if that simplifies. We can duplicate the result if so. The reason we don't use simplify_gen_unary is that it isn't necessarily a win to convert things like:
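In RTL terms, the canonicalization under discussion is, with concrete modes (a hand-written illustration, not taken from the patch):

  (sign_extend:V4SI (vec_duplicate:V4HI (reg:HI x)))
    -> (vec_duplicate:V4SI (sign_extend:SI (reg:HI x)))

i.e. the scalar is extended first and only then broadcast, keeping vec_duplicate at the root of the expression.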
[PATCH] aarch64: Use memcpy to copy vector tables in vst1[q]_x2 intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst1[q]_x2 Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are not generated for the vst1q_x2 intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-23 Jonathan Wright * config/aarch64/arm_neon.h (vst1_s64_x2): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vst1_u64_x2): Likewise. (vst1_f64_x2): Likewise. (vst1_s8_x2): Likewise. (vst1_p8_x2): Likewise. (vst1_s16_x2): Likewise. (vst1_p16_x2): Likewise. (vst1_s32_x2): Likewise. (vst1_u8_x2): Likewise. (vst1_u16_x2): Likewise. (vst1_u32_x2): Likewise. (vst1_f16_x2): Likewise. (vst1_f32_x2): Likewise. (vst1_p64_x2): Likewise. (vst1q_s8_x2): Likewise. (vst1q_p8_x2): Likewise. (vst1q_s16_x2): Likewise. (vst1q_p16_x2): Likewise. (vst1q_s32_x2): Likewise. (vst1q_s64_x2): Likewise. (vst1q_u8_x2): Likewise. (vst1q_u16_x2): Likewise. (vst1q_u32_x2): Likewise. (vst1q_u64_x2): Likewise. (vst1q_f16_x2): Likewise. (vst1q_f32_x2): Likewise. (vst1q_f64_x2): Likewise. (vst1q_p64_x2): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14701.patch Description: rb14701.patch
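The shape of the change for one representative intrinsic is sketched below; the builtin and cast names are my reconstruction from the ChangeLog rather than quotations from the patch (rb14701.patch is authoritative):

__extension__ extern __inline void
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vst1q_s32_x2 (int32_t * __a, int32x4x2_t __val)
{
  __builtin_aarch64_simd_oi __o;
  /* A single block copy replaces the two per-vector register-set
     builtin calls that caused the superfluous moves.  */
  __builtin_memcpy (&__o, &__val, sizeof (__val));
  __builtin_aarch64_st1x2v4si ((__builtin_aarch64_simd_si *) __a, __o);
}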
[PATCH] aarch64: Use memcpy to copy vector tables in vst1[q]_x3 intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst1[q]_x3 Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are not generated for the vst1q_x3 intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-23 Jonathan Wright * config/aarch64/arm_neon.h (vst1_s64_x3): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_ci one vector at a time. (vst1_u64_x3): Likewise. (vst1_f64_x3): Likewise. (vst1_s8_x3): Likewise. (vst1_p8_x3): Likewise. (vst1_s16_x3): Likewise. (vst1_p16_x3): Likewise. (vst1_s32_x3): Likewise. (vst1_u8_x3): Likewise. (vst1_u16_x3): Likewise. (vst1_u32_x3): Likewise. (vst1_f16_x3): Likewise. (vst1_f32_x3): Likewise. (vst1_p64_x3): Likewise. (vst1q_s8_x3): Likewise. (vst1q_p8_x3): Likewise. (vst1q_s16_x3): Likewise. (vst1q_p16_x3): Likewise. (vst1q_s32_x3): Likewise. (vst1q_s64_x3): Likewise. (vst1q_u8_x3): Likewise. (vst1q_u16_x3): Likewise. (vst1q_u32_x3): Likewise. (vst1q_u64_x3): Likewise. (vst1q_f16_x3): Likewise. (vst1q_f32_x3): Likewise. (vst1q_f64_x3): Likewise. (vst1q_p64_x3): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14700.patch Description: rb14700.patch
Re: [PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 intrinsics
Same explanation as for patch 3/8: I haven't added test cases here because these intrinsics don't map to a single instruction (they're legacy from Armv7) and would trip the "scan-assembler not mov" that we're using for the other tests. Thanks, Jonathan From: Richard Sandiford Sent: 23 July 2021 10:31 To: Kyrylo Tkachov Cc: Jonathan Wright ; gcc-patches@gcc.gnu.org Subject: Re: [PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 intrinsics Kyrylo Tkachov writes: >> -----Original Message----- >> From: Jonathan Wright >> Sent: 23 July 2021 10:15 >> To: gcc-patches@gcc.gnu.org >> Cc: Kyrylo Tkachov ; Richard Sandiford >> >> Subject: [PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 >> intrinsics >> >> Hi, >> >> This patch uses __builtin_memcpy to copy vector structures instead of >> building a new opaque structure one vector at a time in each of the >> vtbx4 Neon intrinsics in arm_neon.h. This simplifies the header file >> and also improves code generation - superfluous move instructions >> were emitted for every register extraction/set in this additional >> structure. >> >> Regression tested and bootstrapped on aarch64-none-linux-gnu - no >> issues. >> >> Ok for master? > > Ok. Here too I think we want some testcases… Thanks, Richard
[PATCH 8/8] aarch64: Use memcpy to copy vector tables in vst1[q]_x4 intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of using a union in each of the vst1[q]_x4 Neon intrinsics in arm_neon.h. Add new code generation tests to verify that superfluous move instructions are not generated for the vst1q_x4 intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-21 Jonathan Wright * config/aarch64/arm_neon.h (vst1_s8_x4): Use __builtin_memcpy instead of using a union. (vst1q_s8_x4): Likewise. (vst1_s16_x4): Likewise. (vst1q_s16_x4): Likewise. (vst1_s32_x4): Likewise. (vst1q_s32_x4): Likewise. (vst1_u8_x4): Likewise. (vst1q_u8_x4): Likewise. (vst1_u16_x4): Likewise. (vst1q_u16_x4): Likewise. (vst1_u32_x4): Likewise. (vst1q_u32_x4): Likewise. (vst1_f16_x4): Likewise. (vst1q_f16_x4): Likewise. (vst1_f32_x4): Likewise. (vst1q_f32_x4): Likewise. (vst1_p8_x4): Likewise. (vst1q_p8_x4): Likewise. (vst1_p16_x4): Likewise. (vst1q_p16_x4): Likewise. (vst1_s64_x4): Likewise. (vst1_u64_x4): Likewise. (vst1_p64_x4): Likewise. (vst1q_s64_x4): Likewise. (vst1q_u64_x4): Likewise. (vst1q_p64_x4): Likewise. (vst1_f64_x4): Likewise. (vst1q_f64_x4): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14697.patch Description: rb14697.patch
[PATCH 7/8] aarch64: Use memcpy to copy vector tables in vst2[q] intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst2[q] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are no longer generated for the vst2q intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-21 Jonathan Wright * config/aarch64/arm_neon.h (vst2_s64): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vst2_u64): Likewise. (vst2_f64): Likewise. (vst2_s8): Likewise. (vst2_p8): Likewise. (vst2_s16): Likewise. (vst2_p16): Likewise. (vst2_s32): Likewise. (vst2_u8): Likewise. (vst2_u16): Likewise. (vst2_u32): Likewise. (vst2_f16): Likewise. (vst2_f32): Likewise. (vst2_p64): Likewise. (vst2q_s8): Likewise. (vst2q_p8): Likewise. (vst2q_s16): Likewise. (vst2q_p16): Likewise. (vst2q_s32): Likewise. (vst2q_s64): Likewise. (vst2q_u8): Likewise. (vst2q_u16): Likewise. (vst2q_u32): Likewise. (vst2q_u64): Likewise. (vst2q_f16): Likewise. (vst2q_f32): Likewise. (vst2q_f64): Likewise. (vst2q_p64): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14689.patch Description: rb14689.patch
Re: [PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] intrinsics
I haven't added test cases here because these intrinsics don't map to a single instruction (they're legacy from Armv7) and would trip the "scan-assembler not mov" that we're using for the other tests. Jonathan From: Richard Sandiford Sent: 23 July 2021 10:29 To: Kyrylo Tkachov Cc: Jonathan Wright ; gcc-patches@gcc.gnu.org Subject: Re: [PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] intrinsics Kyrylo Tkachov writes: >> -----Original Message----- >> From: Jonathan Wright >> Sent: 23 July 2021 09:30 >> To: gcc-patches@gcc.gnu.org >> Cc: Kyrylo Tkachov ; Richard Sandiford >> >> Subject: [PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] >> intrinsics >> >> Hi, >> >> This patch uses __builtin_memcpy to copy vector structures instead of >> building a new opaque structure one vector at a time in each of the >> vtbl[34] Neon intrinsics in arm_neon.h. This simplifies the header file >> and also improves code generation - superfluous move instructions >> were emitted for every register extraction/set in this additional >> structure. >> >> Regression tested and bootstrapped on aarch64-none-linux-gnu - no >> issues. >> >> Ok for master? > > Ok. Please add testcases first though. :-) Thanks, Richard
[PATCH 6/8] aarch64: Use memcpy to copy vector tables in vst3[q] intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst3[q] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are no longer generated for the vst3q intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-21 Jonathan Wright * config/aarch64/arm_neon.h (vst3_s64): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_ci one vector at a time. (vst3_u64): Likewise. (vst3_f64): Likewise. (vst3_s8): Likewise. (vst3_p8): Likewise. (vst3_s16): Likewise. (vst3_p16): Likewise. (vst3_s32): Likewise. (vst3_u8): Likewise. (vst3_u16): Likewise. (vst3_u32): Likewise. (vst3_f16): Likewise. (vst3_f32): Likewise. (vst3_p64): Likewise. (vst3q_s8): Likewise. (vst3q_p8): Likewise. (vst3q_s16): Likewise. (vst3q_p16): Likewise. (vst3q_s32): Likewise. (vst3q_s64): Likewise. (vst3q_u8): Likewise. (vst3q_u16): Likewise. (vst3q_u32): Likewise. (vst3q_u64): Likewise. (vst3q_f16): Likewise. (vst3q_f32): Likewise. (vst3q_f64): Likewise. (vst3q_p64): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14688.patch Description: rb14688.patch
[PATCH 5/8] aarch64: Use memcpy to copy vector tables in vst4[q] intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vst4[q] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are no longer generated for the vst4q intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-20 Jonathan Wright * config/aarch64/arm_neon.h (vst4_s64): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_xi one vector at a time. (vst4_u64): Likewise. (vst4_f64): Likewise. (vst4_s8): Likewise. (vst4_p8): Likewise. (vst4_s16): Likewise. (vst4_p16): Likewise. (vst4_s32): Likewise. (vst4_u8): Likewise. (vst4_u16): Likewise. (vst4_u32): Likewise. (vst4_f16): Likewise. (vst4_f32): Likewise. (vst4_p64): Likewise. (vst4q_s8): Likewise. (vst4q_p8): Likewise. (vst4q_s16): Likewise. (vst4q_p16): Likewise. (vst4q_s32): Likewise. (vst4q_s64): Likewise. (vst4q_u8): Likewise. (vst4q_u16): Likewise. (vst4q_u32): Likewise. (vst4q_u64): Likewise. (vst4q_f16): Likewise. (vst4q_f32): Likewise. (vst4q_f64): Likewise. (vst4q_p64): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: Add new tests. rb14687.patch Description: rb14687.patch
[PATCH 4/8] aarch64: Use memcpy to copy vector tables in vtbx4 intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vtbx4 Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-19 Jonathan Wright * config/aarch64/arm_neon.h (vtbx4_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vtbx4_u8): Likewise. (vtbx4_p8): Likewise. rb14674.patch Description: rb14674.patch
[PATCH 3/8] aarch64: Use memcpy to copy vector tables in vtbl[34] intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vtbl[34] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-08 Jonathan Wright * config/aarch64/arm_neon.h (vtbl3_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vtbl3_u8): Likewise. (vtbl3_p8): Likewise. (vtbl4_s8): Likewise. (vtbl4_u8): Likewise. (vtbl4_p8): Likewise. rb14673.patch Description: rb14673.patch
[PATCH 2/8] aarch64: Use memcpy to copy vector tables in vqtbx[234] intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vqtbx[234] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are no longer generated for the vqtbx[234] intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-08 Jonathan Wright * config/aarch64/arm_neon.h (vqtbx2_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vqtbx2_u8): Likewise. (vqtbx2_p8): Likewise. (vqtbx2q_s8): Likewise. (vqtbx2q_u8): Likewise. (vqtbx2q_p8): Likewise. (vqtbx3_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_ci one vector at a time. (vqtbx3_u8): Likewise. (vqtbx3_p8): Likewise. (vqtbx3q_s8): Likewise. (vqtbx3q_u8): Likewise. (vqtbx3q_p8): Likewise. (vqtbx4_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_xi one vector at a time. (vqtbx4_u8): Likewise. (vqtbx4_p8): Likewise. (vqtbx4q_s8): Likewise. (vqtbx4q_u8): Likewise. (vqtbx4q_p8): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: New tests. rb14640.patch Description: rb14640.patch
[PATCH 1/8] aarch64: Use memcpy to copy vector tables in vqtbl[234] intrinsics
Hi, This patch uses __builtin_memcpy to copy vector structures instead of building a new opaque structure one vector at a time in each of the vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. Add new code generation tests to verify that superfluous move instructions are no longer generated for the vqtbl[234] intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-08 Jonathan Wright * config/aarch64/arm_neon.h (vqtbl2_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_oi one vector at a time. (vqtbl2_u8): Likewise. (vqtbl2_p8): Likewise. (vqtbl2q_s8): Likewise. (vqtbl2q_u8): Likewise. (vqtbl2q_p8): Likewise. (vqtbl3_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_ci one vector at a time. (vqtbl3_u8): Likewise. (vqtbl3_p8): Likewise. (vqtbl3q_s8): Likewise. (vqtbl3q_u8): Likewise. (vqtbl3q_p8): Likewise. (vqtbl4_s8): Use __builtin_memcpy instead of constructing __builtin_aarch64_simd_xi one vector at a time. (vqtbl4_u8): Likewise. (vqtbl4_p8): Likewise. (vqtbl4q_s8): Likewise. (vqtbl4q_u8): Likewise. (vqtbl4q_p8): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vector_structure_intrinsics.c: New test. rb14639.patch Description: rb14639.patch
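From the user's point of view, the improvement means a lookup like the following now compiles to just the table instruction (an illustration, not the committed testcase):

#include <arm_neon.h>

uint8x16_t
table_lookup (uint8x16x2_t tab, uint8x16_t idx)
{
  /* Previously TAB was marshalled into the opaque structure with extra
     MOVs; a single TBL is now expected.  */
  return vqtbl2q_u8 (tab, idx);
}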
[PATCH] simplify-rtx: Push sign/zero-extension inside vec_duplicate
Hi, As a general principle, vec_duplicate should be as close to the root of an expression as possible. Where unary operations have vec_duplicate as an argument, these operations should be pushed inside the vec_duplicate. This patch modifies unary operation simplification to push sign/zero-extension of a scalar inside vec_duplicate. This patch also updates all RTL patterns in aarch64-simd.md to use the new canonical form. Regression tested and bootstrapped on aarch64-none-linux-gnu and x86_64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-19 Jonathan Wright * config/aarch64/aarch64-simd.md: Push sign/zero-extension inside vec_duplicate for all patterns. * simplify-rtx.c (simplify_context::simplify_unary_operation_1): Push sign/zero-extension inside vec_duplicate. rb14677.patch Description: rb14677.patch
[PATCH] aarch64: Don't include vec_select in SIMD multiply cost
Hi, The Neon multiply/multiply-accumulate/multiply-subtract instructions can take various forms - multiplying full vector registers of values or multiplying one vector by a single element of another. Regardless of the form used, these instructions have the same cost, and this should be reflected by the RTL cost function. This patch adds RTL tree traversal in the Neon multiply cost function to match the vec_select used by the lane-referencing forms of the instructions already mentioned. This traversal prevents the cost of the vec_select from being added into the cost of the multiply - meaning that these instructions can now be emitted in the combine pass as they are no longer deemed prohibitively expensive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-19 Jonathan Wright * config/aarch64/aarch64.c (aarch64_rtx_mult_cost): Traverse RTL tree to prevent vec_select from being added into Neon multiply cost. rb14675.patch Description: rb14675.patch
[PATCH] aarch64: Refactor TBL/TBX RTL patterns
Hi, As subject, this patch renames the two-source-register TBL/TBX RTL patterns so that their names better reflect what they do, rather than confusing them with tbl3 or tbx4 patterns. Also use the correct "neon_tbl2" type attribute for both patterns. Rename single-source-register TBL/TBX patterns for consistency. Bootstrapped and regression tested on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-08 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Use two variant generators for all TBL/TBX intrinsics and rename to consistent forms: qtbl[1234] or qtbx[1234]. * config/aarch64/aarch64-simd.md (aarch64_tbl1): Rename to... (aarch64_qtbl1): This. (aarch64_tbx1): Rename to... (aarch64_qtbx1): This. (aarch64_tbl2v16qi): Delete. (aarch64_tbl3): Rename to... (aarch64_qtbl2): This. (aarch64_tbx4): Rename to... (aarch64_qtbx2): This. * config/aarch64/aarch64.c (aarch64_expand_vec_perm_1): Use renamed qtbl1 and qtbl2 RTL patterns. * config/aarch64/arm_neon.h (vqtbl1_p8): Use renamed qtbl1 RTL pattern. (vqtbl1_s8): Likewise. (vqtbl1_u8): Likewise. (vqtbl1q_p8): Likewise. (vqtbl1q_s8): Likewise. (vqtbl1q_u8): Likewise. (vqtbx1_s8): Use renamed qtbx1 RTL pattern. (vqtbx1_u8): Likewise. (vqtbx1_p8): Likewise. (vqtbx1q_s8): Likewise. (vqtbx1q_u8): Likewise. (vqtbx1q_p8): Likewise. (vtbl1_s8): Use renamed qtbl1 RTL pattern. (vtbl1_u8): Likewise. (vtbl1_p8): Likewise. (vtbl2_s8): Likewise. (vtbl2_u8): Likewise. (vtbl2_p8): Likewise. (vtbl3_s8): Use renamed qtbl2 RTL pattern. (vtbl3_u8): Likewise. (vtbl3_p8): Likewise. (vtbl4_s8): Likewise. (vtbl4_u8): Likewise. (vtbl4_p8): Likewise. (vtbx2_s8): Use renamed qtbx2 RTL pattern. (vtbx2_u8): Likewise. (vtbx2_p8): Likewise. (vqtbl2_s8): Use renamed qtbl2 RTL pattern. (vqtbl2_u8): Likewise. (vqtbl2_p8): Likewise. (vqtbl2q_s8): Likewise. (vqtbl2q_u8): Likewise. (vqtbl2q_p8): Likewise. (vqtbx2_s8): Use renamed qtbx2 RTL pattern. (vqtbx2_u8): Likewise. (vqtbx2_p8): Likewise. (vqtbx2q_s8): Likewise. (vqtbx2q_u8): Likewise. (vqtbx2q_p8): Likewise. (vtbx4_s8): Likewise. (vtbx4_u8): Likewise. (vtbx4_p8): Likewise. rb14671.patch Description: rb14671.patch
Re: [PATCH V2] gcc: Add vec_select -> subreg RTL simplification
Ah, yes - those test results should have only been changed for little endian. I've submitted a patch to the list restoring the original expected results for big endian. Thanks, Jonathan From: Christophe Lyon Sent: 15 July 2021 10:09 To: Richard Sandiford ; Jonathan Wright ; gcc-patches@gcc.gnu.org ; Kyrylo Tkachov Subject: Re: [PATCH V2] gcc: Add vec_select -> subreg RTL simplification On Mon, Jul 12, 2021 at 5:31 PM Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: Jonathan Wright <jonathan.wri...@arm.com> writes: > Hi, > > Version 2 of this patch adds more code generation tests to show the > benefit of this RTL simplification as well as adding a new helper function > 'rtx_vec_series_p' to reduce code duplication. > > Patch tested as version 1 - ok for master? Sorry for the slow reply. > Regression tested and bootstrapped on aarch64-none-linux-gnu, > x86_64-unknown-linux-gnu, arm-none-linux-gnueabihf and > aarch64_be-none-linux-gnu - no issues. I've also tested this on powerpc64le-unknown-linux-gnu, no issues again. > diff --git a/gcc/combine.c b/gcc/combine.c > index > 6476812a21268e28219d1e302ee1c979d528a6ca..0ff6ca87e4432cfeff1cae1dd219ea81ea0b73e4 > 100644 > --- a/gcc/combine.c > +++ b/gcc/combine.c > @@ -6276,6 +6276,26 @@ combine_simplify_rtx (rtx x, machine_mode op0_mode, > int in_dest, > - 1, > 0)); >break; > +case VEC_SELECT: > + { > + rtx trueop0 = XEXP (x, 0); > + mode = GET_MODE (trueop0); > + rtx trueop1 = XEXP (x, 1); > + int nunits; > + /* If we select a low-part subreg, return that. */ > + if (GET_MODE_NUNITS (mode).is_constant (&nunits) > + && targetm.can_change_mode_class (mode, GET_MODE (x), ALL_REGS)) > + { > + int offset = BYTES_BIG_ENDIAN ? nunits - XVECLEN (trueop1, 0) : 0; > + > + if (rtx_vec_series_p (trueop1, offset)) > + { > + rtx new_rtx = lowpart_subreg (GET_MODE (x), trueop0, mode); > + if (new_rtx != NULL_RTX) > + return new_rtx; > + } > + } > + } Since this occurs three times, I think it would be worth having a new predicate: /* Return true if, for all OP of mode OP_MODE: (vec_select:RESULT_MODE OP SEL) is equivalent to the lowpart RESULT_MODE of OP. */ bool vec_series_lowpart_p (machine_mode result_mode, machine_mode op_mode, rtx sel) containing the GET_MODE_NUNITS (…).is_constant, can_change_mode_class and rtx_vec_series_p tests. I think the function belongs in rtlanal.[hc], even though subreg_lowpart_p is in emit-rtl.c.
So the pattern is IMO OK as-is. > diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md > index > 55b6c1ac585a4cae0789c3afc0fccfc05a6d3653..93e963696dad30f29a76025696670f8b31bf2c35 > 100644 > --- a/gcc/config/arm/vfp.md > +++ b/gcc/config/arm/vfp.md > @@ -224,7 +224,7 @@ > ;; problems because small constants get converted into adds. > (define_insn "*arm_movsi_vfp" >[(set (match_operand:SI 0 "nonimmediate_operand" "=rk,r,r,r,rk,m > ,*t,r,*t,*t, *Uv") > - (match_operand:SI 1 "general_operand" "rk, > I,K,j,mi,rk,r,*t,*t,*Uvi,*t"))] > + (match_operand:SI 1 "general_operand" "rk, > I,K,j,mi,rk,r,t,*t,*Uvi,*t"))] >"TARGET_ARM && TARGET_HARD_FLOAT > && ( s_register_operand (operands[0], SImode) > || s_register_operand (operands[1], SImode))" I'll assume that an Arm maintainer would have spoken up by now if they didn't want this for some reason. > diff --git a/gcc/rtl.c b/gcc/rtl.c > index > aaee882f5ca3e37b59c9829e41d0864070c170eb..3e8b3628b0b76b41889b77bb0019f582ee6f5aaa > 100644 > --- a/gcc/rtl.c > +++ b/gcc/rtl.c > @@ -736,6 +736,19 @@ rtvec_all_equal_p (const_r
testsuite: aarch64: Fix failing SVE tests on big endian
Hi, A recent change "gcc: Add vec_select -> subreg RTL simplification" updated the expected test results for SVE extraction tests. The expected results should only have been changed for little endian. This patch restores the old expected result for big endian. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-07-15 Jonathan Wright * gcc.target/aarch64/sve/extract_1.c: Split expected results by big/little endian targets, restoring the old expected result for big endian. * gcc.target/aarch64/sve/extract_2.c: Likewise. * gcc.target/aarch64/sve/extract_3.c: Likewise. * gcc.target/aarch64/sve/extract_4.c: Likewise. rb14655.patch Description: rb14655.patch
[PATCH] aarch64: Use unions for vector tables in vqtbl[234] intrinsics
Hi, As subject, this patch uses a union instead of constructing a new opaque vector structure for each of the vqtbl[234] Neon intrinsics in arm_neon.h. This simplifies the header file and also improves code generation - superfluous move instructions were emitted for every register extraction/set in this additional structure. This change is safe because the C-level vector structure types e.g. uint8x16x4_t already provide a tie for sequential register allocation - which is required by the TBL instructions. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-07-08 Jonathan Wright * config/aarch64/arm_neon.h (vqtbl2_s8): Use union instead of additional __builtin_aarch64_simd_oi structure. (vqtbl2_u8): Likewise. (vqtbl2_p8): Likewise. (vqtbl2q_s8): Likewise. (vqtbl2q_u8): Likewise. (vqtbl2q_p8): Likewise. (vqtbl3_s8): Use union instead of additional __builtin_aarch64_simd_ci structure. (vqtbl3_u8): Likewise. (vqtbl3_p8): Likewise. (vqtbl3q_s8): Likewise. (vqtbl3q_u8): Likewise. (vqtbl3q_p8): Likewise. (vqtbl4_s8): Use union instead of additional __builtin_aarch64_simd_xi structure. (vqtbl4_u8): Likewise. (vqtbl4_p8): Likewise. (vqtbl4q_s8): Likewise. (vqtbl4q_u8): Likewise. (vqtbl4q_p8): Likewise. rb14639.patch Description: rb14639.patch
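The union idiom for one representative intrinsic looks roughly like this (the builtin name is reconstructed, so treat this as a sketch; rb14639.patch is authoritative):

__extension__ extern __inline int8x8_t
__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
vqtbl2_s8 (int8x16x2_t __tab, uint8x8_t __idx)
{
  /* The union gives the two Q registers the sequential tie the TBL
     instruction needs, without per-vector register-set builtins.  */
  union { int8x16x2_t __i; __builtin_aarch64_simd_oi __o; } __u = { __tab };
  return __builtin_aarch64_tbl3v8qi (__u.__o, (int8x8_t) __idx);
}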
[PATCH V2] gcc: Add vec_select -> subreg RTL simplification
Hi, Version 2 of this patch adds more code generation tests to show the benefit of this RTL simplification as well as adding a new helper function 'rtx_vec_series_p' to reduce code duplication. Patch tested as version 1 - ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-08 Jonathan Wright * combine.c (combine_simplify_rtx): Add vec_select -> subreg simplification. * config/aarch64/aarch64.md (*zero_extend2_aarch64): Add Neon to general purpose register case for zero-extend pattern. * config/arm/vfp.md (*arm_movsi_vfp): Remove "*" from *t -> r case to prevent some cases opting to go through memory. * cse.c (fold_rtx): Add vec_select -> subreg simplification. * rtl.c (rtx_vec_series_p): Define helper function to determine whether RTX vector-selection indices are in series. * rtl.h (rtx_vec_series_p): Define. * simplify-rtx.c (simplify_context::simplify_binary_operation_1): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/extract_zero_extend.c: Remove dump scan for RTL pattern match. * gcc.target/aarch64/narrow_high_combine.c: Add new tests. * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Update scan-assembler regex to look for a scalar register instead of lane 0 of a vector. * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Likewise. * gcc.target/aarch64/simd/vmulxd_laneq_f64_1.c: Likewise. * gcc.target/aarch64/simd/vmulxs_lane_f32_1.c: Likewise. * gcc.target/aarch64/simd/vmulxs_laneq_f32_1.c: Likewise. * gcc.target/aarch64/simd/vqdmlalh_lane_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmlals_lane_s32.c: Likewise. * gcc.target/aarch64/simd/vqdmlslh_lane_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmlsls_lane_s32.c: Likewise. * gcc.target/aarch64/simd/vqdmullh_lane_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmullh_laneq_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmulls_lane_s32.c: Likewise. * gcc.target/aarch64/simd/vqdmulls_laneq_s32.c: Likewise. * gcc.target/aarch64/sve/dup_lane_1.c: Likewise. * gcc.target/aarch64/sve/live_1.c: Update scan-assembler regex cases to look for 'b' and 'h' registers instead of 'w'. * gcc.target/arm/mve/intrinsics/vgetq_lane_f16.c: Extract lane 1 as the moves for lane 0 now get optimized away. * gcc.target/arm/mve/intrinsics/vgetq_lane_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_s16.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_s8.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_u16.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_u8.c: Likewise. From: Jonathan Wright Sent: 02 July 2021 10:53 To: gcc-patches@gcc.gnu.org Cc: Richard Sandiford ; Kyrylo Tkachov Subject: [PATCH] gcc: Add vec_select -> subreg RTL simplification Hi, As subject, this patch adds a new RTL simplification for the case of a VEC_SELECT selecting the low part of a vector. The simplification returns a SUBREG. The primary goal of this patch is to enable better combinations of Neon RTL patterns - specifically allowing generation of 'write-to-high-half' narrowing instructions. Adding this RTL simplification means that the expected results for a number of tests need to be updated: * aarch64 Neon: Update the scan-assembler regex for intrinsics tests to expect a scalar register instead of lane 0 of a vector. * aarch64 SVE: Likewise. * arm MVE: Use lane 1 instead of lane 0 for lane-extraction intrinsics tests (as the move instructions get optimized away for lane 0.) 
Regression tested and bootstrapped on aarch64-none-linux-gnu, x86_64-unknown-linux-gnu, arm-none-linux-gnueabihf and aarch64_be-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-08 Jonathan Wright * combine.c (combine_simplify_rtx): Add vec_select -> subreg simplification. * config/aarch64/aarch64.md (*zero_extend2_aarch64): Add Neon to general purpose register case for zero-extend pattern. * config/arm/vfp.md (*arm_movsi_vfp): Remove "*" from *t -> r case to prevent some cases opting to go through memory. * cse.c (fold_rtx): Add vec_select -> subreg simplification. * simplify-rtx.c (simplify_context::simplify_binary_operation_1): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/extract_zero_extend.c: Remove dump scan for RTL pattern match. * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Update scan-assembler regex to look for a scalar register instead of lane 0 of a vector. * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Likewise.
[PATCH] gcc: Add vec_select -> subreg RTL simplification
Hi, As subject, this patch adds a new RTL simplification for the case of a VEC_SELECT selecting the low part of a vector. The simplification returns a SUBREG. The primary goal of this patch is to enable better combinations of Neon RTL patterns - specifically allowing generation of 'write-to-high-half' narrowing instructions. Adding this RTL simplification means that the expected results for a number of tests need to be updated: * aarch64 Neon: Update the scan-assembler regex for intrinsics tests to expect a scalar register instead of lane 0 of a vector. * aarch64 SVE: Likewise. * arm MVE: Use lane 1 instead of lane 0 for lane-extraction intrinsics tests (as the move instructions get optimized away for lane 0.) Regression tested and bootstrapped on aarch64-none-linux-gnu, x86_64-unknown-linux-gnu, arm-none-linux-gnueabihf and aarch64_be-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-08 Jonathan Wright * combine.c (combine_simplify_rtx): Add vec_select -> subreg simplification. * config/aarch64/aarch64.md (*zero_extend2_aarch64): Add Neon to general purpose register case for zero-extend pattern. * config/arm/vfp.md (*arm_movsi_vfp): Remove "*" from *t -> r case to prevent some cases opting to go through memory. * cse.c (fold_rtx): Add vec_select -> subreg simplification. * simplify-rtx.c (simplify_context::simplify_binary_operation_1): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/extract_zero_extend.c: Remove dump scan for RTL pattern match. * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Update scan-assembler regex to look for a scalar register instead of lane 0 of a vector. * gcc.target/aarch64/simd/vmulx_laneq_f64_1.c: Likewise. * gcc.target/aarch64/simd/vmulxd_laneq_f64_1.c: Likewise. * gcc.target/aarch64/simd/vmulxs_lane_f32_1.c: Likewise. * gcc.target/aarch64/simd/vmulxs_laneq_f32_1.c: Likewise. * gcc.target/aarch64/simd/vqdmlalh_lane_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmlals_lane_s32.c: Likewise. * gcc.target/aarch64/simd/vqdmlslh_lane_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmlsls_lane_s32.c: Likewise. * gcc.target/aarch64/simd/vqdmullh_lane_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmullh_laneq_s16.c: Likewise. * gcc.target/aarch64/simd/vqdmulls_lane_s32.c: Likewise. * gcc.target/aarch64/simd/vqdmulls_laneq_s32.c: Likewise. * gcc.target/aarch64/sve/dup_lane_1.c: Likewise. * gcc.target/aarch64/sve/live_1.c: Update scan-assembler regex cases to look for 'b' and 'h' registers instead of 'w'. * gcc.target/arm/mve/intrinsics/vgetq_lane_f16.c: Extract lane 1 as the moves for lane 0 now get optimized away. * gcc.target/arm/mve/intrinsics/vgetq_lane_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_s16.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_s8.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_u16.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vgetq_lane_u8.c: Likewise. rb14526.patch Description: rb14526.patch
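The simplification in RTL terms, for a little-endian target (a hand-written illustration with concrete modes):

  (vec_select:V2SI (reg:V4SI x) (parallel [(const_int 0) (const_int 1)]))
    -> (subreg:V2SI (reg:V4SI x) 0)

since lanes 0 and 1 form the low half of the register on little endian; on big endian the low part corresponds to the highest-numbered lanes instead.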
[PATCH V2] aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions
Hi, Version 2 of this patch adds tests to verify the benefit of this change. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_hn): Change to an expander that emits the correct instruction depending on endianness. (aarch64_hn_insn_le): Define. (aarch64_hn_insn_be): Define. gcc/testsuite/ChangeLog: * gcc.target/aarch64/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 11:02 To: gcc-patches@gcc.gnu.org Subject: [PATCH] aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions Hi, As subject, this patch models the zero-high-half semantics of the narrowing arithmetic Neon instructions in the aarch64_hn RTL pattern. Modeling these semantics allows for better RTL combinations while also removing some register allocation issues as the compiler now knows that the operation is totally destructive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_hn): Change to an expander that emits the correct instruction depending on endianness. (aarch64_hn_insn_le): Define. (aarch64_hn_insn_be): Define. rb14566.patch Description: rb14566.patch
[PATCH V2] aarch64: Model zero-high-half semantics of [SU]QXTN instructions
Hi, Version 2 of the patch adds tests to verify the benefit of this change. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Split generator for aarch64_qmovn builtins into scalar and vector variants. * config/aarch64/aarch64-simd.md (aarch64_qmovn_insn_le): Define. (aarch64_qmovn_insn_be): Define. (aarch64_qmovn): Split into scalar and vector variants. Change vector variant to an expander that emits the correct instruction depending on endianness. gcc/testsuite/ChangeLog: * gcc.target/aarch64/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 10:59 To: gcc-patches@gcc.gnu.org Subject: [PATCH] aarch64: Model zero-high-half semantics of [SU]QXTN instructions Hi, As subject, this patch first splits the aarch64_qmovn pattern into separate scalar and vector variants. It then further splits the vector RTL pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction. Modeling these semantics allows for better RTL combinations while also removing some register allocation issues as the compiler now knows that the operation is totally destructive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Split generator for aarch64_qmovn builtins into scalar and vector variants. * config/aarch64/aarch64-simd.md (aarch64_qmovn_insn_le): Define. (aarch64_qmovn_insn_be): Define. (aarch64_qmovn): Split into scalar and vector variants. Change vector variant to an expander that emits the correct instruction depending on endianness. rb14565.patch Description: rb14565.patch
[PATCH V2] aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL
Hi, Version 2 of the patch adds tests to verify the benefit of this change. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Split generator for aarch64_sqmovun builtins into scalar and vector variants. * config/aarch64/aarch64-simd.md (aarch64_sqmovun): Split into scalar and vector variants. Change vector variant to an expander that emits the correct instruction depending on endianness. (aarch64_sqmovun_insn_le): Define. (aarch64_sqmovun_insn_be): Define. gcc/testsuite/ChangeLog: * gcc.target/aarch64/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 10:52 To: gcc-patches@gcc.gnu.org Subject: [PATCH] aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL Hi, As subject, this patch first splits the aarch64_sqmovun pattern into separate scalar and vector variants. It then further splits the vector pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction. Modeling these semantics allows for better RTL combinations while also removing some register allocation issues as the compiler now knows that the operation is totally destructive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Split generator for aarch64_sqmovun builtins into scalar and vector variants. * config/aarch64/aarch64-simd.md (aarch64_sqmovun): Split into scalar and vector variants. Change vector variant to an expander that emits the correct instruction depending on endianness. (aarch64_sqmovun_insn_le): Define. (aarch64_sqmovun_insn_be): Define. rb14564.patch Description: rb14564.patch
[PATCH V2] aarch64: Model zero-high-half semantics of XTN instruction in RTL
Hi, Version 2 of this patch adds tests to verify the benefit of this change. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-11 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_xtn_insn_le): Define - modeling zero-high-half semantics. (aarch64_xtn): Change to an expander that emits the appropriate instruction depending on endianness. (aarch64_xtn_insn_be): Define - modeling zero-high-half semantics. (aarch64_xtn2_le): Rename to... (aarch64_xtn2_insn_le): This. (aarch64_xtn2_be): Rename to... (aarch64_xtn2_insn_be): This. (vec_pack_trunc_): Emit truncation instruction instead of aarch64_xtn. * config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode attribute iterator. gcc/testsuite/ChangeLog: * gcc.target/aarch64/narrow_zero_high_half.c: Add new tests. From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 15 June 2021 10:45 To: gcc-patches@gcc.gnu.org Subject: [PATCH] aarch64: Model zero-high-half semantics of XTN instruction in RTL Hi, Modeling the zero-high-half semantics of the XTN narrowing instruction in RTL indicates to the compiler that this is a totally destructive operation. This enables more RTL simplifications and also prevents some register allocation issues. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-11 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_xtn_insn_le): Define - modeling zero-high-half semantics. (aarch64_xtn): Change to an expander that emits the appropriate instruction depending on endianness. (aarch64_xtn_insn_be): Define - modeling zero-high-half semantics. (aarch64_xtn2_le): Rename to... (aarch64_xtn2_insn_le): This. (aarch64_xtn2_be): Rename to... (aarch64_xtn2_insn_be): This. (vec_pack_trunc_): Emit truncation instruction instead of aarch64_xtn. * config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode attribute iterator. rb14563.patch Description: rb14563.patch
[PATCH] testsuite: aarch64: Add zero-high-half tests for narrowing shifts
Hi, This patch adds tests to verify that Neon narrowing-shift instructions clear the top half of the result vector. It is sufficient to show that a subsequent combine with a zero-vector is optimized away - leaving just the narrowing-shift instruction. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-06-15 Jonathan Wright * gcc.target/aarch64/narrow_zero_high_half.c: New test. rb14569.patch Description: rb14569.patch
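The test idiom is roughly the following (my sketch of the approach described above):

#include <arm_neon.h>

uint8x16_t
narrow_shift_zero_high (uint16x8_t a)
{
  /* Once the compiler knows the narrowing shift zeroes the high half,
     the combine with the zero vector folds away, leaving a lone SHRN.  */
  return vcombine_u8 (vshrn_n_u16 (a, 4), vdup_n_u8 (0));
}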
[PATCH] aarch64: Model zero-high-half semantics of ADDHN/SUBHN instructions
Hi, As subject, this patch models the zero-high-half semantics of the narrowing arithmetic Neon instructions in the aarch64_hn RTL pattern. Modeling these semantics allows for better RTL combinations while also removing some register allocation issues as the compiler now knows that the operation is totally destructive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_hn): Change to an expander that emits the correct instruction depending on endianness. (aarch64_hn_insn_le): Define. (aarch64_hn_insn_be): Define. rb14566.patch Description: rb14566.patch
[PATCH] aarch64: Model zero-high-half semantics of [SU]QXTN instructions
Hi, As subject, this patch first splits the aarch64_qmovn pattern into separate scalar and vector variants. It then further splits the vector RTL pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction. Modeling these semantics allows for better RTL combinations while also removing some register allocation issues as the compiler now knows that the operation is totally destructive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Split generator for aarch64_qmovn builtins into scalar and vector variants. * config/aarch64/aarch64-simd.md (aarch64_qmovn_insn_le): Define. (aarch64_qmovn_insn_be): Define. (aarch64_qmovn): Split into scalar and vector variants. Change vector variant to an expander that emits the correct instruction depending on endianness. rb14565.patch Description: rb14565.patch
[PATCH] aarch64: Model zero-high-half semantics of SQXTUN instruction in RTL
Hi, As subject, this patch first splits the aarch64_sqmovun pattern into separate scalar and vector variants. It then further splits the vector pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction. Modeling these semantics allows for better RTL combinations while also removing some register allocation issues as the compiler now knows that the operation is totally destructive. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-14 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Split generator for aarch64_sqmovun builtins into scalar and vector variants. * config/aarch64/aarch64-simd.md (aarch64_sqmovun): Split into scalar and vector variants. Change vector variant to an expander that emits the correct instruction depending on endianness. (aarch64_sqmovun_insn_le): Define. (aarch64_sqmovun_insn_be): Define. rb14564.patch Description: rb14564.patch
[PATCH] aarch64: Model zero-high-half semantics of XTN instruction in RTL
Hi, Modeling the zero-high-half semantics of the XTN narrowing instruction in RTL indicates to the compiler that this is a totally destructive operation. This enables more RTL simplifications and also prevents some register allocation issues. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-06-11 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_xtn<mode>_insn_le): Define - modeling zero-high-half semantics. (aarch64_xtn<mode>): Change to an expander that emits the appropriate instruction depending on endianness. (aarch64_xtn<mode>_insn_be): Define - modeling zero-high-half semantics. (aarch64_xtn2<mode>_le): Rename to... (aarch64_xtn2<mode>_insn_le): This. (aarch64_xtn2<mode>_be): Rename to... (aarch64_xtn2<mode>_insn_be): This. (vec_pack_trunc_<mode>): Emit truncation instruction instead of aarch64_xtn<mode>. * config/aarch64/iterators.md (Vnarrowd): Add Vnarrowd mode attribute. rb14563.patch Description: rb14563.patch
[PATCH] aarch64: Use correct type attributes for RTL generating XTN(2)
Hi, As subject, this patch corrects the type attribute in RTL patterns that generate XTN/XTN2 instructions to be "neon_move_narrow_q". This makes a material difference because these instructions can be executed on both SIMD pipes in the Cortex-A57 core model, whereas the "neon_shift_imm_narrow_q" attribute (in use until now) would suggest to the scheduler that they could only execute on one of the two pipes. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-05-18 Jonathan Wright * config/aarch64/aarch64-simd.md: Use "neon_move_narrow_q" type attribute in patterns generating XTN(2). rb14492.patch Description: rb14492.patch
[PATCH] aarch64: Use an expander for quad-word vec_pack_trunc pattern
Hi, The existing vec_pack_trunc_<mode> RTL pattern emits an opaque two-instruction assembly code sequence that prevents proper instruction scheduling. This commit changes the pattern to an expander that emits individual xtn and xtn2 instructions. This commit also consolidates the duplicate truncation patterns. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-05-17 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_simd_vec_pack_trunc_<mode>): Remove as duplicate of... (aarch64_xtn<mode>): This. (aarch64_xtn2<mode>_le): Move position in file. (aarch64_xtn2<mode>_be): Move position in file. (aarch64_xtn2<mode>): Move position in file. (vec_pack_trunc_<mode>): Define as an expander. rb14480.patch Description: rb14480.patch
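Illustration (mine, not from the patch): vec_pack_trunc is what the vectorizer uses for a plain truncating loop such as the one below, so after this change the xtn and xtn2 halves can be scheduled independently.

/* Compiled at -O3 on aarch64, the loop body narrows with xtn + xtn2.  */
void
pack_trunc (const int *restrict src, short *restrict dst, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (short) src[i];
}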
[PATCH 5/5] testsuite: aarch64: Add tests for high-half narrowing instructions
Hi, As subject, this patch adds tests to confirm that a *2 (write to high-half) Neon instruction is generated from vcombine* of a narrowing intrinsic sequence. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-05-14 Jonathan Wright * gcc.target/aarch64/narrow_high_combine.c: New test. rb14483.patch Description: rb14483.patch
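Illustration (mine, not copied from the test file): the tests follow this shape - a vcombine of two narrowing operations should become a narrow plus a write-to-high-half narrow, e.g. addhn + addhn2, with no extra register moves.

#include <arm_neon.h>

int16x8_t
addhn_combine (int32x4_t a, int32x4_t b, int32x4_t c, int32x4_t d)
{
  return vcombine_s16 (vaddhn_s32 (a, b), vaddhn_s32 (c, d));
}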
[PATCH 4/5] aarch64: Refactor aarch64_<sur>q<r>shr<u>n_n<mode> RTL pattern
Hi, As subject, this patch splits the aarch64_<sur>q<r>shr<u>n_n<mode> pattern into separate scalar and vector variants. It further splits the vector pattern into big/little endian variants that model the zero-high-half semantics of the underlying instruction - allowing for more combinations with the write-to-high-half variant (aarch64_<sur>q<r>shr<u>n2_n<mode>.) This improvement will be confirmed by a new test in gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in this series.) Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-05-14 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Split builtin generation for aarch64_<sur>q<r>shr<u>n_n<mode> pattern into separate scalar and vector generators. * config/aarch64/aarch64-simd.md (aarch64_<sur>q<r>shr<u>n_n<mode>): Define as an expander and split into... (aarch64_<sur>q<r>shr<u>n_n<mode>_insn_le): This and... (aarch64_<sur>q<r>shr<u>n_n<mode>_insn_be): This. * config/aarch64/iterators.md: Define SD_HSDI iterator. rb14490.patch Description: rb14490.patch
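Illustration (mine, not from the patch): with the vector pattern split into endian-aware variants that model the zeroed top half, a combine of two saturating narrowing shifts should become sqshrn + sqshrn2.

#include <arm_neon.h>

int16x8_t
qshrn_combine (int32x4_t a, int32x4_t b)
{
  return vcombine_s16 (vqshrn_n_s32 (a, 4), vqshrn_n_s32 (b, 4));
}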
[PATCH 3/5] aarch64: Relax aarch64_sqxtun2<mode> RTL pattern
Hi, As subject, this patch uses UNSPEC_SQXTUN instead of UNSPEC_SQXTUN2 in the aarch64_sqxtun2<mode> patterns. This allows for more aggressive combinations and ultimately better code generation - which will be confirmed by a new set of tests in gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in this series.) The now redundant UNSPEC_SQXTUN2 is removed. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-05-14 Jonathan Wright * config/aarch64/aarch64-simd.md: Use UNSPEC_SQXTUN instead of UNSPEC_SQXTUN2. * config/aarch64/iterators.md: Remove UNSPEC_SQXTUN2. rb14481.patch Description: rb14481.patch
[PATCH 2/5] aarch64: Relax aarch64_<sur>q<r>shr<u>n2_n<mode> RTL pattern
Hi, As subject, this patch implements saturating right-shift and narrow high Neon intrinsic RTL patterns using a vec_concat of a register_operand and a VQSHRN_N unspec - instead of just a VQSHRN2_N unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better code generation - which will be confirmed by a new set of tests in gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in this series.) Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-03-04 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_<sur>q<r>shr<u>n2_n<mode>): Implement as an expand emitting a big/little endian instruction pattern. (aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_le): Define. (aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_be): Define. rb14251.patch Description: rb14251.patch
[PATCH 1/5] aarch64: Relax aarch64_<sur><addsub>hn2<mode> RTL pattern
Hi, As subject, this patch implements v[r]addhn2 and v[r]subhn2 Neon intrinsic RTL patterns using a vec_concat of a register_operand and an ADDSUBHN unspec - instead of just an ADDSUBHN2 unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better code generation - which will be confirmed by a new set of tests in gcc.target/aarch64/narrow_high_combine.c (patch 5/5 in this series). This patch also removes the now redundant [R]ADDHN2 and [R]SUBHN2 unspecs and their iterator. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-03-03 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_<sur><addsub>hn2<mode>): Implement as an expand emitting a big/little endian instruction pattern. (aarch64_<sur><addsub>hn2<mode>_insn_le): Define. (aarch64_<sur><addsub>hn2<mode>_insn_be): Define. * config/aarch64/iterators.md: Remove UNSPEC_[R]ADDHN2 and UNSPEC_[R]SUBHN2 unspecs and ADDSUBHN2 iterator. rb14250.patch Description: rb14250.patch
Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics
Hi Richard, I think you may be referencing an older checkout as we refactored this pattern in a previous change to: (define_insn "mul_lane<mode>3" [(set (match_operand:VMUL 0 "register_operand" "=w") (mult:VMUL (vec_duplicate:VMUL (vec_select:<VEL> (match_operand:VMUL 2 "register_operand" "<h_con>") (parallel [(match_operand:SI 3 "immediate_operand" "i")]))) (match_operand:VMUL 1 "register_operand" "w")))] "TARGET_SIMD" { operands[3] = aarch64_endian_lane_rtx (<MODE>mode, INTVAL (operands[3])); return "<f>mul\\t%0.<Vtype>, %1.<Vtype>, %2.<Vetype>[%3]"; } [(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")] ) which doesn't help us with the 'laneq' intrinsics as the machine mode for operands 0 and 1 (of the laneq intrinsics) is narrower than the machine mode for operand 2. Thanks, Jonathan From: Richard Sandiford Sent: 30 April 2021 19:18 To: Jonathan Wright Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics Richard Sandiford via Gcc-patches writes: > Jonathan Wright writes: >> diff --git a/gcc/config/aarch64/aarch64-simd.md >> b/gcc/config/aarch64/aarch64-simd.md >> index >> bdee49f74f4725409d33af733bb55be290b3f0e7..234762960bd6df057394f753072ef65a6628a43d >> 100644 >> --- a/gcc/config/aarch64/aarch64-simd.md >> +++ b/gcc/config/aarch64/aarch64-simd.md >> @@ -734,6 +734,22 @@ >>[(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")] >> ) >> >> +(define_insn "mul_laneq<mode>3" >> + [(set (match_operand:VDQSF 0 "register_operand" "=w") >> +(mult:VDQSF >> + (vec_duplicate:VDQSF >> +(vec_select:<VEL> >> + (match_operand:V4SF 2 "register_operand" "w") >> + (parallel [(match_operand:SI 3 "immediate_operand" "i")]))) >> + (match_operand:VDQSF 1 "register_operand" "w")))] >> + "TARGET_SIMD" >> + { >> +operands[3] = aarch64_endian_lane_rtx (V4SFmode, INTVAL (operands[3])); >> +return "fmul\\t%0.<Vtype>, %1.<Vtype>, %2.<Vetype>[%3]"; >> + } >> + [(set_attr "type" "neon_fp_mul_s_scalar<q>")] >> +) >> + Oops, sorry, I just realised that this pattern does already exist as: (define_insn "*aarch64_mul3_elt" [(set (match_operand:VMUL 0 "register_operand" "=w") (mult:VMUL (vec_duplicate:VMUL (vec_select:<VEL> (match_operand:VMUL 1 "register_operand" "<h_con>") (parallel [(match_operand:SI 2 "immediate_operand")]))) (match_operand:VMUL 3 "register_operand" "w")))] "TARGET_SIMD" { operands[2] = aarch64_endian_lane_rtx (<MODE>mode, INTVAL (operands[2])); return "<f>mul\\t%0.<Vtype>, %3.<Vtype>, %1.<Vetype>[%2]"; } [(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")] ) Thanks, Richard
Re: [PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics
Updated the patch to implement suggestions - restricting these tests to run on only aarch64 targets. Tested and all new tests pass on aarch64-none-linux-gnu. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 28 April 2021 16:46 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics Jonathan Wright via Gcc-patches writes: > Hi, > > As subject, this patch adds compilation tests to make sure that the output > of vmla/vmls floating-point Neon intrinsics (fmul, fadd/fsub) is not fused > into fmla/fmls instructions. > > Ok for master? > > Thanks, > Jonathan > > --- > > gcc/testsuite/ChangeLog: > > 2021-02-16 Jonathan Wright > >* gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c: >New test. >* gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c: >New test. >* gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c: >New test. >* gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused_A64.c: >New test. > > diff --git > a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c > b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c > new file mode 100644 > index > ..402c4ef414558767c7d7ddc21817093a80d2a06d > --- /dev/null > +++ > b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c > @@ -0,0 +1,42 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3" } */ Could you test this on an arm*-*-* target too? I'd expect the dg-finals to fail there, since the syntax is vmul.f32 etc. instead. Alternatively, we could just skip this for arm*-*-*, like you do with the by-lane tests. > + > + > +#include <arm_neon.h> > + > +float32x2_t foo_f32 (float32x2_t a, float32x2_t b, float32x2_t c) > +{ > + return vmla_f32 (a, b, c); > +} > + > +float32x4_t fooq_f32 (float32x4_t a, float32x4_t b, float32x4_t c) > +{ > + return vmlaq_f32 (a, b, c); > +} > + > +float32x2_t foo_n_f32 (float32x2_t a, float32x2_t b, float32_t c) > +{ > + return vmla_n_f32 (a, b, c); > +} > + > +float32x4_t fooq_n_f32 (float32x4_t a, float32x4_t b, float32_t c) > +{ > + return vmlaq_n_f32 (a, b, c); > +} > + > +float32x2_t foo_lane_f32 (float32x2_t a, > + float32x2_t b, > + float32x2_t v) > +{ > + return vmla_lane_f32 (a, b, v, 0); > +} > + > +float32x4_t fooq_lane_f32 (float32x4_t a, > +float32x4_t b, > +float32x2_t v) > +{ > + return vmlaq_lane_f32 (a, b, v, 0); > +} > + > +/* { dg-final { scan-assembler-times {fmul} 6} } */ > +/* { dg-final { scan-assembler-times {fadd} 6} } */ It'd be safer to match {\tfmul\t} etc. instead. Matching bare words runs the risk of picking up things like directory names that happen to contain “fmul” as a substring. 
Thanks, Richard > diff --git > a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c > > b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c > new file mode 100644 > index > ..08a9590e2572fa78c8360f09c8353a0d23678ec1 > --- /dev/null > +++ > b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c > @@ -0,0 +1,33 @@ > +/* { dg-skip-if "" { arm*-*-* } } */ > +/* { dg-do compile } */ > +/* { dg-options "-O3" } */ > + > + > +#include <arm_neon.h> > + > +float64x1_t foo_f64 (float64x1_t a, float64x1_t b, float64x1_t c) > +{ > + return vmla_f64 (a, b, c); > +} > + > +float64x2_t fooq_f64 (float64x2_t a, float64x2_t b, float64x2_t c) > +{ > + return vmlaq_f64 (a, b, c); > +} > + > +float32x2_t foo_laneq_f32 (float32x2_t a, > +float32x2_t b, > +float32x4_t v) > +{ > + return vmla_laneq_f32 (a, b, v, 0); > +} > + > +float32x4_t fooq_laneq_f32 (float32x4_t a, > + float32x4_t b, > + float32x4_t v) > +{ > + return vmlaq_laneq_f32 (a, b, v, 0); > +} > + > +/* { dg-final { scan-assembler-times {fmul} 4} } */ > +/* { dg-final { scan-assembler-times {fadd} 4} } */ > diff --git > a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c > b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c > new file mode 100644 > index > ..0846b7cf5d2c332175235c15bbe534b2558960ef
Re: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics
Updated the patch to be more consistent with the others in the series. Tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan From: Gcc-patches on behalf of Jonathan Wright via Gcc-patches Sent: 28 April 2021 15:42 To: gcc-patches@gcc.gnu.org Subject: [PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics Hi, As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon intrinsics to use RTL builtins rather than relying on the GCC vector extensions. Using RTL builtins allows control over the emission of fmla/fmls instructions (which we don't want here.) With this commit, the code generated by these intrinsics changes from a fused multiply-add/subtract instruction to an fmul followed by an fadd/fsub instruction. If the programmer really wants fmla/fmls instructions, they can use the vfm[as] intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-17 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add float_ml[as][q]_laneq builtin generator macros. * config/aarch64/aarch64-simd.md (mul_laneq<mode>3): Define. (aarch64_float_mla_laneq<mode>): Define. (aarch64_float_mls_laneq<mode>): Define. * config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin instead of GCC vector extensions. (vmlaq_laneq_f32): Likewise. (vmls_laneq_f32): Likewise. (vmlsq_laneq_f32): Likewise. rb14213.patch Description: rb14213.patch
Re: [PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics
Patch updated as per suggestion (similar to patch 10/20.) Tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 28 April 2021 16:37 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics Jonathan Wright via Gcc-patches writes: > Hi, > > As subject, this patch rewrites the floating-point vml[as][q]_lane Neon > intrinsics to use RTL builtins rather than relying on the GCC vector > extensions. Using RTL builtins allows control over the emission of > fmla/fmls instructions (which we don't want here.) > > With this commit, the code generated by these intrinsics changes from > a fused multiply-add/subtract instruction to an fmul followed by an > fadd/fsub instruction. If the programmer really wants fmla/fmls > instructions, they can use the vfm[as] intrinsics. > > Regression tested and bootstrapped on aarch64-none-linux-gnu - no > issues. > > Ok for master? > > Thanks, > Jonathan > > --- > > gcc/ChangeLog: > > 2021-02-16 Jonathan Wright > >* config/aarch64/aarch64-simd-builtins.def: Add >float_ml[as]_lane builtin generator macros. >* config/aarch64/aarch64-simd.md (mul_lane<mode>3): Define. >(aarch64_float_mla_lane<mode>): Define. >(aarch64_float_mls_lane<mode>): Define. >* config/aarch64/arm_neon.h (vmla_lane_f32): Use RTL builtin >instead of GCC vector extensions. >(vmlaq_lane_f32): Likewise. >(vmls_lane_f32): Likewise. >(vmlsq_lane_f32): Likewise. > > diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def > b/gcc/config/aarch64/aarch64-simd-builtins.def > index > 55a5682baeb13041053ef9e6eaa831182ea8b10c..b702493e1351478272bb7d26991a5673943d61ec > 100644 > --- a/gcc/config/aarch64/aarch64-simd-builtins.def > +++ b/gcc/config/aarch64/aarch64-simd-builtins.def > @@ -668,6 +668,8 @@ >BUILTIN_VDQF_DF (TERNOP, float_mls, 0, FP) >BUILTIN_VDQSF (TERNOP, float_mla_n, 0, FP) >BUILTIN_VDQSF (TERNOP, float_mls_n, 0, FP) > + BUILTIN_VDQSF (QUADOP_LANE, float_mla_lane, 0, FP) > + BUILTIN_VDQSF (QUADOP_LANE, float_mls_lane, 0, FP) > >/* Implemented by aarch64_simd_bsl<mode>. */ >BUILTIN_VDQQH (BSL_P, simd_bsl, 0, NONE) > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > index > 95363d7b5ad11f775aa03f24bbcb0b66d20abb7c..abc8b1708b86bcee2e5082cc4659a197c5821985 > 100644 > --- a/gcc/config/aarch64/aarch64-simd.md > +++ b/gcc/config/aarch64/aarch64-simd.md > @@ -2625,6 +2625,22 @@ >[(set_attr "type" "neon_fp_mul_<stype><q>")] > ) > > +(define_insn "mul_lane<mode>3" > + [(set (match_operand:VDQSF 0 "register_operand" "=w") > + (mult:VDQSF > + (vec_duplicate:VDQSF > + (vec_select:<VEL> > + (match_operand:V2SF 2 "register_operand" "w") > + (parallel [(match_operand:SI 3 "immediate_operand" "i")]))) > + (match_operand:VDQSF 1 "register_operand" "w")))] > + "TARGET_SIMD" > + { > +operands[3] = aarch64_endian_lane_rtx (V2SFmode, INTVAL (operands[3])); > +return "fmul\\t%0.<Vtype>, %1.<Vtype>, %2.<Vetype>[%3]"; > + } > + [(set_attr "type" "neon_fp_mul_s_scalar<q>")] > +) > + Similarly to the 10/20 patch (IIRC), we can instead reuse: (define_insn "*aarch64_mul3_elt" [(set (match_operand:VMUL 0 "register_operand" "=w") (mult:VMUL (vec_duplicate:VMUL (vec_select:<VEL> (match_operand:VMUL 1 "register_operand" "<h_con>") (parallel [(match_operand:SI 2 "immediate_operand")]))) (match_operand:VMUL 3 "register_operand" "w")))] "TARGET_SIMD" { operands[2] = aarch64_endian_lane_rtx (<MODE>mode, INTVAL (operands[2])); return "<f>mul\\t%0.<Vtype>, %3.<Vtype>, %1.<Vetype>[%2]"; } [(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")] ) Thanks, Richard > (define_expand "div<mode>3" > [(set (match_operand:VHSDF 0 "register_operand") > (div:VHSDF (match_operand:VHSDF 1 "register_operand") > @@ -2728,6 +2744,46 @@ >} > ) > > +(define_expand "aarch64_float_mla_lane<mode>" > + [(set (match_operand:VDQSF 0 "register_operand") > + (plus:VDQSF > + (mult:VDQSF > + (vec_duplicate:VDQSF > + (vec_select:<VEL> > + (match_operand:V2SF 3 "register_operand") > + (parallel [(match_operand:SI 4 "immediate_operand")]))) > +
Re: [PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics
Patch updated as per your suggestion. Tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 28 April 2021 16:11 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics Jonathan Wright via Gcc-patches writes: > Hi, > > As subject, this patch rewrites the floating-point vml[as][q]_n Neon > intrinsics to use RTL builtins rather than inline assembly code, allowing > for better scheduling and optimization. > > Regression tested and bootstrapped on aarch64-none-linux-gnu - no > issues. > > Ok for master? > > Thanks, > Jonathan > > --- > > gcc/ChangeLog: > > 2021-01-18 Jonathan Wright > >* config/aarch64/aarch64-simd-builtins.def: Add >float_ml[as]_n builtin generator macros. >* config/aarch64/aarch64-simd.md (mul_n<mode>3): Define. >(aarch64_float_mla_n<mode>): Define. >(aarch64_float_mls_n<mode>): Define. >* config/aarch64/arm_neon.h (vmla_n_f32): Use RTL builtin >instead of inline asm. >(vmlaq_n_f32): Likewise. >(vmls_n_f32): Likewise. >(vmlsq_n_f32): Likewise. > > diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def > b/gcc/config/aarch64/aarch64-simd-builtins.def > index > 0f44ed84ff9d08d808b1b2dfe528db5208b134f5..547509474c23daf6882ed2f8407ddb5caf1d1b91 > 100644 > --- a/gcc/config/aarch64/aarch64-simd-builtins.def > +++ b/gcc/config/aarch64/aarch64-simd-builtins.def > @@ -664,6 +664,9 @@ >BUILTIN_VHSDF (TERNOP, fnma, 4, FP) >VAR1 (TERNOP, fnma, 4, FP, hf) > > + BUILTIN_VDQSF (TERNOP, float_mla_n, 0, FP) > + BUILTIN_VDQSF (TERNOP, float_mls_n, 0, FP) > + >/* Implemented by aarch64_simd_bsl<mode>. */ >BUILTIN_VDQQH (BSL_P, simd_bsl, 0, NONE) >VAR2 (BSL_P, simd_bsl,0, NONE, di, v2di) > diff --git a/gcc/config/aarch64/aarch64-simd.md > b/gcc/config/aarch64/aarch64-simd.md > index > 5f701dd2775290156634ef8c6feccecd359e9ec9..d016970a2c278405b270a0ac745221e69f0f625e > 100644 > --- a/gcc/config/aarch64/aarch64-simd.md > +++ b/gcc/config/aarch64/aarch64-simd.md > @@ -2614,6 +2614,17 @@ >[(set_attr "type" "neon_fp_mul_<stype><q>")] > ) > > +(define_insn "mul_n<mode>3" > + [(set (match_operand:VHSDF 0 "register_operand" "=w") > + (mult:VHSDF > + (vec_duplicate:VHSDF > + (match_operand:<VEL> 2 "register_operand" "w")) > + (match_operand:VHSDF 1 "register_operand" "w")))] > + "TARGET_SIMD" > + "fmul\\t%0.<Vtype>, %1.<Vtype>, %2.<Vetype>[0]" This functionality should already be provided by: (define_insn "*aarch64_mul3_elt_from_dup<mode>" [(set (match_operand:VMUL 0 "register_operand" "=w") (mult:VMUL (vec_duplicate:VMUL (match_operand:<VEL> 1 "register_operand" "<h_con>")) (match_operand:VMUL 2 "register_operand" "w")))] "TARGET_SIMD" "<f>mul\t%0.<Vtype>, %2.<Vtype>, %1.<Vetype>[0]"; [(set_attr "type" "neon<fp>_mul_<stype>_scalar<q>")] ) so I think we should instead rename that to mul_n<mode>3 and reorder its operands. Thanks, Richard > + [(set_attr "type" "neon_fp_mul_<stype><q>")] > +) > + > (define_expand "div<mode>3" > [(set (match_operand:VHSDF 0 "register_operand") > (div:VHSDF (match_operand:VHSDF 1 "register_operand") > @@ -2651,6 +2662,40 @@ >[(set_attr "type" "neon_fp_abs_<stype><q>")] > ) > > +(define_expand "aarch64_float_mla_n<mode>" > + [(set (match_operand:VDQSF 0 "register_operand") > + (plus:VDQSF > + (mult:VDQSF > + (vec_duplicate:VDQSF > + (match_operand:<VEL> 3 "register_operand")) > + (match_operand:VDQSF 2 "register_operand")) > + (match_operand:VDQSF 1 "register_operand")))] > + "TARGET_SIMD" > + { > +rtx scratch = gen_reg_rtx (<MODE>mode); > +emit_insn (gen_mul_n<mode>3 (scratch, operands[2], operands[3])); > +emit_insn (gen_add<mode>3 (operands[0], operands[1], scratch)); > +DONE; > + } > +) > + > +(define_expand "aarch64_float_mls_n<mode>" > + [(set (match_operand:VDQSF 0 "register_operand") > + (minus:VDQSF > + (match_operand:VDQSF 1 "register_operand") > + (mult:VDQSF > + (vec_duplicate:VDQSF > + (match_operand:<VEL> 3 "register_operand")) > + (match_operand:VDQSF 2 "register_operand"))))] > + "TARGET_SIMD" > + { > +rtx scratch = gen_reg_rtx (<MODE>mode); > +emit_insn (gen_mul
Re: [PATCH 1/20] aarch64: Use RTL builtin for vmull[_high]_p8 intrinsics
Thanks for the review, I've updated the patch as per option 1. Tested and bootstrapped on aarch64-none-linux-gnu with no issues. Ok for master? Thanks, Jonathan From: Richard Sandiford Sent: 28 April 2021 15:11 To: Jonathan Wright via Gcc-patches Cc: Jonathan Wright Subject: Re: [PATCH 1/20] aarch64: Use RTL builtin for vmull[_high]_p8 intrinsics Jonathan Wright via Gcc-patches writes: > Hi, > > As subject, this patch rewrites the vmull[_high]_p8 Neon intrinsics to use RTL > builtins rather than inline assembly code, allowing for better scheduling and > optimization. > > Regression tested and bootstrapped on aarch64-none-linux-gnu and > aarch64_be-none-elf - no issues. Thanks for doing this. Mostly LGTM, but one comment about the patterns: > […] > +(define_insn "aarch64_pmull_hiv16qi_insn" > + [(set (match_operand:V8HI 0 "register_operand" "=w") > + (unspec:V8HI > + [(vec_select:V8QI > + (match_operand:V16QI 1 "register_operand" "w") > + (match_operand:V16QI 3 "vect_par_cnst_hi_half" "")) > +(vec_select:V8QI > + (match_operand:V16QI 2 "register_operand" "w") > + (match_dup 3))] > + UNSPEC_PMULL2))] > + "TARGET_SIMD" > + "pmull2\\t%0.8h, %1.16b, %2.16b" > + [(set_attr "type" "neon_mul_b_long")] > +) As things stand, UNSPEC_PMULL2 has the vec_select "built in": (define_insn "aarch64_crypto_pmullv2di" [(set (match_operand:TI 0 "register_operand" "=w") (unspec:TI [(match_operand:V2DI 1 "register_operand" "w") (match_operand:V2DI 2 "register_operand" "w")] UNSPEC_PMULL2))] "TARGET_SIMD && TARGET_AES" "pmull2\\t%0.1q, %1.2d, %2.2d" [(set_attr "type" "crypto_pmull")] ) So I think it would be more consistent to do one of the following: (1) Keep the vec_selects in the new pattern, but use UNSPEC_PMULL for the operation instead of UNSPEC_PMULL2. (2) Remove the vec_selects and keep the UNSPEC_PMULL2. (1) in principle allows more combination opportunities than (2), although I don't know how likely it is to help in practice. Thanks, Richard rb14128.patch Description: rb14128.patch
[PATCH 20/20] aarch64: Remove unspecs from [su]qmovn RTL pattern
Hi, Saturating truncation can be expressed using the RTL expressions ss_truncate and us_truncate. This patch changes the implementation of the vqmovn_* Neon intrinsics to use these RTL expressions rather than a pair of unspecs. The redundant unspecs are removed along with their code iterator. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-04-12 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Modify comment to make consistent with updated RTL pattern. * config/aarch64/aarch64-simd.md (aarch64_<su>qmovn<mode>): Implement using ss_truncate and us_truncate rather than unspecs. * config/aarch64/iterators.md: Remove redundant unspecs and iterator: UNSPEC_[SU]QXTN and SUQMOVN respectively. rb14376.patch Description: rb14376.patch
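For reference (my own sketch, not from the patch): per lane, ss_truncate expresses ordinary truncation with signed saturation, e.g. from 32 to 16 bits as sqxtn does.

#include <stdint.h>

/* Scalar model of one sqxtn lane: clamp to the narrow range, then truncate.  */
static inline int16_t
ss_truncate_si_to_hi (int32_t x)
{
  if (x > INT16_MAX)
    return INT16_MAX;
  if (x < INT16_MIN)
    return INT16_MIN;
  return (int16_t) x;
}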
[PATCH 19/20] aarch64: Update attributes of arm_acle.h intrinsics
Hi, As subject, this patch updates the attributes of all intrinsics defined in arm_acle.h to be consistent with the attributes of the intrinsics defined in arm_neon.h. Specifically, this means updating the attributes from: __extension__ static __inline __attribute__ ((__always_inline__)) to: __extension__ extern __inline __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-03-18 Jonathan Wright * config/aarch64/arm_acle.h (__attribute__): Make intrinsic attributes consistent with those defined in arm_neon.h. rb14296.patch Description: rb14296.patch
[PATCH 18/20] aarch64: Update attributes of arm_fp16.h intrinsics
Hi, As subject, this patch updates the attributes of all intrinsics defined in arm_fp16.h to be consistent with the attributes of the intrinsics defined in arm_neon.h. Specifically, this means updating the attributes from: __extension__ static __inline __attribute__ ((__always_inline__)) to: __extension__ extern __inline __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-03-18 Jonathan Wright * config/aarch64/arm_fp16.h (__attribute__): Make intrinsic attributes consistent with those defined in arm_neon.h. rb14295.patch Description: rb14295.patch
[PATCH 17/20] aarch64: Relax aarch64_<sur>q<r>shr<u>n2_n<mode> RTL pattern
Hi, As subject, this patch implements the saturating right-shift and narrow high Neon intrinsic RTL patterns using a vec_concat of a register_operand and a VQSHRN_N unspec - instead of just a VQSHRN2_N unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better code generation. Regression tested and bootstrapped on aarch64-none-linux-gnu and aarch64_be-none-elf - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-03-04 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_<sur>q<r>shr<u>n2_n<mode>): Implement as an expand emitting a big/little endian instruction pattern. (aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_le): Define. (aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_be): Define. * config/aarch64/iterators.md: Add VQSHRN2_N iterator and constituent unspecs. rb14251.patch Description: rb14251.patch
[PATCH 16/20] aarch64: Relax aarch64_<sur><addsub>hn2<mode> RTL pattern
Hi, As subject, this patch implements the v[r]addhn2 and v[r]subhn2 Neon intrinsic RTL patterns using a vec_concat of a register_operand and an ADDSUBHN unspec - instead of just an ADDSUBHN2 unspec. This more relaxed pattern allows for more aggressive combinations and ultimately better code generation. Regression tested and bootstrapped on aarch64-none-linux-gnu and aarch64_be-none-elf - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-03-03 Jonathan Wright * config/aarch64/aarch64-simd.md (aarch64_<sur><addsub>hn2<mode>): Implement as an expand emitting a big/little endian instruction pattern. (aarch64_<sur><addsub>hn2<mode>_insn_le): Define. (aarch64_<sur><addsub>hn2<mode>_insn_be): Define. rb14250.patch Description: rb14250.patch
[PATCH 15/20] aarch64: Use RTL builtins for vcvtx intrinsics
Hi, As subject, this patch rewrites the vcvtx Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu and aarch64_be-none-elf - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-18 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add float_trunc_rodd builtin generator macros. * config/aarch64/aarch64-simd.md (aarch64_float_trunc_rodd_df): Define. (aarch64_float_trunc_rodd_lo_v2sf): Define. (aarch64_float_trunc_rodd_hi_v4sf_le): Define. (aarch64_float_trunc_rodd_hi_v4sf_be): Define. (aarch64_float_trunc_rodd_hi_v4sf): Define. * config/aarch64/arm_neon.h (vcvtx_f32_f64): Use RTL builtin instead of inline asm. (vcvtx_high_f32_f64): Likewise. (vcvtxd_f32_f64): Likewise. * config/aarch64/iterators.md: Add FCVTXN unspec. rb14222.patch Description: rb14222.patch
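Usage illustration (mine, not from the patch): fcvtxn narrows double to float with round-to-odd rounding, which avoids double-rounding error when the result is rounded again later; the _high form writes the upper half of the destination vector.

#include <arm_neon.h>

float32x4_t
narrow_rodd (float32x2_t lo, float64x2_t a)
{
  return vcvtx_high_f32_f64 (lo, a);   /* fcvtxn2 */
}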
[PATCH 14/20] testsuite: aarch64: Add fusion tests for FP vml[as] intrinsics
Hi, As subject, this patch adds compilation tests to make sure that the output of vmla/vmls floating-point Neon intrinsics (fmul, fadd/fsub) is not fused into fmla/fmls instructions. Ok for master? Thanks, Jonathan --- gcc/testsuite/ChangeLog: 2021-02-16 Jonathan Wright * gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused.c: New test. * gcc.target/aarch64/advsimd-intrinsics/vmla_float_not_fused_A64.c: New test. * gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused.c: New test. * gcc.target/aarch64/advsimd-intrinsics/vmls_float_not_fused_A64.c: New test. rb14202.patch Description: rb14202.patch
[PATCH 13/20] aarch64: Use RTL builtins for FP ml[as][q]_laneq intrinsics
Hi, As subject, this patch rewrites the floating-point vml[as][q]_laneq Neon intrinsics to use RTL builtins rather than relying on the GCC vector extensions. Using RTL builtins allows control over the emission of fmla/fmls instructions (which we don't want here.) With this commit, the code generated by these intrinsics changes from a fused multiply-add/subtract instruction to an fmul followed by an fadd/fsub instruction. If the programmer really wants fmla/fmls instructions, they can use the vfm[as] intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-17 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add float_ml[as][q]_laneq builtin generator macros. * config/aarch64/aarch64-simd.md (mul_laneq<mode>3): Define. (aarch64_float_mla_laneq<mode>): Define. (aarch64_float_mls_laneq<mode>): Define. * config/aarch64/arm_neon.h (vmla_laneq_f32): Use RTL builtin instead of GCC vector extensions. (vmlaq_laneq_f32): Likewise. (vmls_laneq_f32): Likewise. (vmlsq_laneq_f32): Likewise. rb14213.patch Description: rb14213.patch
[PATCH 12/20] aarch64: Use RTL builtins for FP ml[as][q]_lane intrinsics
Hi, As subject, this patch rewrites the floating-point vml[as][q]_lane Neon intrinsics to use RTL builtins rather than relying on the GCC vector extensions. Using RTL builtins allows control over the emission of fmla/fmls instructions (which we don't want here.) With this commit, the code generated by these intrinsics changes from a fused multiply-add/subtract instruction to an fmul followed by an fadd/fsub instruction. If the programmer really wants fmla/fmls instructions, they can use the vfm[as] intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-16 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add float_ml[as]_lane builtin generator macros. * config/aarch64/aarch64-simd.md (mul_lane<mode>3): Define. (aarch64_float_mla_lane<mode>): Define. (aarch64_float_mls_lane<mode>): Define. * config/aarch64/arm_neon.h (vmla_lane_f32): Use RTL builtin instead of GCC vector extensions. (vmlaq_lane_f32): Likewise. (vmls_lane_f32): Likewise. (vmlsq_lane_f32): Likewise. rb14212.patch Description: rb14212.patch
[PATCH 11/20] aarch64: Use RTL builtins for FP ml[as] intrinsics
Hi, As subject, this patch rewrites the floating-point vml[as][q] Neon intrinsics to use RTL builtins rather than relying on the GCC vector extensions. Using RTL builtins allows control over the emission of fmla/fmls instructions (which we don't want here.) With this commit, the code generated by these intrinsics changes from a fused multiply-add/subtract instruction to an fmul followed by an fadd/fsub instruction. If the programmer really wants fmla/fmls instructions, they can use the vfm[as] intrinsics. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-16 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add float_ml[as] builtin generator macros. * config/aarch64/aarch64-simd.md (aarch64_float_mla<mode>): Define. (aarch64_float_mls<mode>): Define. * config/aarch64/arm_neon.h (vmla_f32): Use RTL builtin instead of relying on GCC vector extensions. (vmla_f64): Likewise. (vmlaq_f32): Likewise. (vmlaq_f64): Likewise. (vmls_f32): Likewise. (vmls_f64): Likewise. (vmlsq_f32): Likewise. (vmlsq_f64): Likewise. * config/aarch64/iterators.md: Define VDQF_DF mode iterator. rb14211.patch Description: rb14211.patch
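Illustration (mine, not from the patch) of the contract being enforced: vmla_f32 is a separate multiply and add, while vfma_f32 is how a programmer requests the fused operation.

#include <arm_neon.h>

float32x2_t
mla (float32x2_t a, float32x2_t b, float32x2_t c)
{
  return vmla_f32 (a, b, c);   /* fmul + fadd after this change */
}

float32x2_t
fused (float32x2_t a, float32x2_t b, float32x2_t c)
{
  return vfma_f32 (a, b, c);   /* single fmla */
}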
[PATCH 10/20] aarch64: Use RTL builtins for FP ml[as]_n intrinsics
Hi, As subject, this patch rewrites the floating-point vml[as][q]_n Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-01-18 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add float_ml[as]_n builtin generator macros. * config/aarch64/aarch64-simd.md (mul_n3): Define. (aarch64_float_mla_n): Define. (aarch64_float_mls_n): Define. * config/aarch64/arm_neon.h (vmla_n_f32): Use RTL builtin instead of inline asm. (vmlaq_n_f32): Likewise. (vmls_n_f32): Likewise. (vmlsq_n_f32): Likewise. rb14042.patch Description: rb14042.patch
[PATCH 9/20] aarch64: Use RTL builtins for v[q]tbx intrinsics
Hi, As subject, this patch rewrites the v[q]tbx Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-12 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add tbx1 builtin generator macros. * config/aarch64/aarch64-simd.md (aarch64_tbx1<mode>): Define. * config/aarch64/arm_neon.h (vqtbx1_s8): Use RTL builtin instead of inline asm. (vqtbx1_u8): Likewise. (vqtbx1_p8): Likewise. (vqtbx1q_s8): Likewise. (vqtbx1q_u8): Likewise. (vqtbx1q_p8): Likewise. (vtbx2_s8): Likewise. (vtbx2_u8): Likewise. (vtbx2_p8): Likewise. rb14188.patch Description: rb14188.patch
[PATCH 8/20] aarch64: Use RTL builtins for v[q]tbl intrinsics
Hi, As subject, this patch rewrites the v[q]tbl Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-12 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add tbl1 builtin generator macros. * config/aarch64/arm_neon.h (vqtbl1_p8): Use RTL builtin instead of inline asm. (vqtbl1_s8): Likewise. (vqtbl1_u8): Likewise. (vqtbl1q_p8): Likewise. (vqtbl1q_s8): Likewise. (vqtbl1q_u8): Likewise. (vtbl1_s8): Likewise. (vtbl1_u8): Likewise. (vtbl1_p8): Likewise. (vtbl2_s8): Likewise. (vtbl2_u8): Likewise. (vtbl2_p8): Likewise. rb14154.patch Description: rb14154.patch
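Usage illustration (mine, not from the patch): vqtbl1q_u8 is a full 16-byte table lookup - each index byte selects a byte of the table, with out-of-range indices producing zero - and maps to a single tbl instruction.

#include <arm_neon.h>

uint8x16_t
shuffle_bytes (uint8x16_t table, uint8x16_t index)
{
  return vqtbl1q_u8 (table, index);
}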
[PATCH 7/20] aarch64: Use RTL builtins for polynomial vsri[q]_n intrinsics
Hi, As subject, this patch rewrites the vsri[q]_n_p* Neon intrinsics to use RTL builtins rather than inline assembly code, allowing for better scheduling and optimization. Regression tested and bootstrapped on aarch64-none-linux-gnu - no issues. Ok for master? Thanks, Jonathan --- gcc/ChangeLog: 2021-02-10 Jonathan Wright * config/aarch64/aarch64-simd-builtins.def: Add polynomial ssri_n builtin generator macro. * config/aarch64/arm_neon.h (vsri_n_p8): Use RTL builtin instead of inline asm. (vsri_n_p16): Likewise. (vsri_n_p64): Likewise. (vsriq_n_p8): Likewise. (vsriq_n_p16): Likewise. (vsriq_n_p64): Likewise. rb14147.patch Description: rb14147.patch
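Usage illustration (mine, not from the patch): vsri_n shifts each lane of the second operand right by n and inserts the result into the first operand, preserving the first operand's top n bits per lane.

#include <arm_neon.h>

poly8x8_t
insert_shifted (poly8x8_t a, poly8x8_t b)
{
  return vsri_n_p8 (a, b, 3);   /* one sri #3 instruction */
}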