[PING][PATCH] arm: Remove unsigned variant of vcaddq_m
(Pinging since I realised that this is required for my later Low Overhead Loop patch series to work.)

Ok for trunk with the updated changelog that Christophe mentioned?

Thanks,
Stamatis/Stam Markianos-Wright

From: Stam Markianos-Wright
Sent: Tuesday, August 1, 2023 6:21 PM
To: gcc-patches@gcc.gnu.org
Cc: Richard Earnshaw ; Kyrylo Tkachov
Subject: arm: Remove unsigned variant of vcaddq_m

Hi all,

The unsigned variants of the vcaddq_m operation are not needed within the compiler, as the assembly output of the signed and unsigned versions of the ops is identical: with a `.i` suffix (as opposed to separate `.s` and `.u` suffixes).

Tested with baremetal arm-none-eabi on Arm's fastmodels. Ok for trunk?

Thanks,
Stamatis Markianos-Wright

gcc/ChangeLog:

        * config/arm/arm-mve-builtins-base.cc (vcaddq_rot90, vcaddq_rot270):
        Use common insn for signed and unsigned front-end definitions.
        * config/arm/arm_mve_builtins.def (vcaddq_rot90_m_u,
        vcaddq_rot270_m_u): Make common.
        (vcaddq_rot90_m_s, vcaddq_rot270_m_s): Remove.
        * config/arm/iterators.md (mve_insn): Merge signed and unsigned defs.
        (isu): Likewise.
        (rot): Likewise.
        (mve_rot): Likewise.
        (supf): Likewise.
        (VxCADDQ_M): Likewise.
        * config/arm/unspecs.md (unspec): Likewise.
---
 gcc/config/arm/arm-mve-builtins-base.cc |  4 ++--
 gcc/config/arm/arm_mve_builtins.def     |  6 ++---
 gcc/config/arm/iterators.md             | 30 +++--
 gcc/config/arm/mve.md                   |  4 ++--
 gcc/config/arm/unspecs.md               |  6 ++---
 5 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/gcc/config/arm/arm-mve-builtins-base.cc b/gcc/config/arm/arm-mve-builtins-base.cc
index e31095ae112..426a87e9852 100644
--- a/gcc/config/arm/arm-mve-builtins-base.cc
+++ b/gcc/config/arm/arm-mve-builtins-base.cc
@@ -260,8 +260,8 @@ FUNCTION_PRED_P_S_U (vaddvq, VADDVQ)
 FUNCTION_PRED_P_S_U (vaddvaq, VADDVAQ)
 FUNCTION_WITH_RTX_M (vandq, AND, VANDQ)
 FUNCTION_ONLY_N (vbrsrq, VBRSRQ)
-FUNCTION (vcaddq_rot90, unspec_mve_function_exact_insn_rot, (UNSPEC_VCADD90, UNSPEC_VCADD90, UNSPEC_VCADD90, VCADDQ_ROT90_M_S, VCADDQ_ROT90_M_U, VCADDQ_ROT90_M_F))
-FUNCTION (vcaddq_rot270, unspec_mve_function_exact_insn_rot, (UNSPEC_VCADD270, UNSPEC_VCADD270, UNSPEC_VCADD270, VCADDQ_ROT270_M_S, VCADDQ_ROT270_M_U, VCADDQ_ROT270_M_F))
+FUNCTION (vcaddq_rot90, unspec_mve_function_exact_insn_rot, (UNSPEC_VCADD90, UNSPEC_VCADD90, UNSPEC_VCADD90, VCADDQ_ROT90_M, VCADDQ_ROT90_M, VCADDQ_ROT90_M_F))
+FUNCTION (vcaddq_rot270, unspec_mve_function_exact_insn_rot, (UNSPEC_VCADD270, UNSPEC_VCADD270, UNSPEC_VCADD270, VCADDQ_ROT270_M, VCADDQ_ROT270_M, VCADDQ_ROT270_M_F))
 FUNCTION (vcmlaq, unspec_mve_function_exact_insn_rot, (-1, -1, UNSPEC_VCMLA, -1, -1, VCMLAQ_M_F))
 FUNCTION (vcmlaq_rot90, unspec_mve_function_exact_insn_rot, (-1, -1, UNSPEC_VCMLA90, -1, -1, VCMLAQ_ROT90_M_F))
 FUNCTION (vcmlaq_rot180, unspec_mve_function_exact_insn_rot, (-1, -1, UNSPEC_VCMLA180, -1, -1, VCMLAQ_ROT180_M_F))
diff --git a/gcc/config/arm/arm_mve_builtins.def b/gcc/config/arm/arm_mve_builtins.def
index 43dacc3dda1..6ac1812c697 100644
--- a/gcc/config/arm/arm_mve_builtins.def
+++ b/gcc/config/arm/arm_mve_builtins.def
@@ -523,8 +523,8 @@ VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vhsubq_m_n_u, v16qi, v8hi, v4si)
 VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vhaddq_m_u, v16qi, v8hi, v4si)
 VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vhaddq_m_n_u, v16qi, v8hi, v4si)
 VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, veorq_m_u, v16qi, v8hi, v4si)
-VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot90_m_u, v16qi, v8hi, v4si)
-VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot270_m_u, v16qi, v8hi, v4si)
+VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot90_m_, v16qi, v8hi, v4si)
+VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot270_m_, v16qi, v8hi, v4si)
 VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vbicq_m_u, v16qi, v8hi, v4si)
 VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vandq_m_u, v16qi, v8hi, v4si)
 VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vaddq_m_u, v16qi, v8hi, v4si)
@@ -587,8 +587,6 @@ VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vhcaddq_rot270_m_s, v16qi, v8hi, v4si)
 VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vhaddq_m_s, v16qi, v8hi, v4si)
 VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vhaddq_m_n_s, v16qi, v8hi, v4si)
 VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, veorq_m_s, v16qi, v8hi, v4si)
-VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vcaddq_rot90_m_s, v16qi, v8hi, v4si)
-VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vcaddq_rot270_m_s, v16qi, v8hi, v4si)
 VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vbrsrq_m_n_s, v16qi, v8hi, v4si)
 VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vbicq_m_s, v16qi, v8hi, v4si)
 VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vandq_m_s, v16qi, v8hi, v4si)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index b13ff53d36f..2edd0b06370 100644
---
[committed trunk 7/9] arm testsuite: Remove redundant tests
Following Andrea's overhaul of the MVE testsuite, these tests are now redundant, as equivalent checks have been added to each intrinsic's .c test.

gcc/testsuite/ChangeLog:

        * gcc.target/arm/mve/intrinsics/mve_fp_vaddq_n.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vaddq_m.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vaddq_n.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u8.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_s64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_u64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_s64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_u64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_s64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_u64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_s64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_u64.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_f16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_f16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_f16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_f16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u16.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_f32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_s32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_u32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_f32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_s32.c: Removed.
        * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_u32.c: Removed.
        *
[committed trunk 9/9] arm testsuite: Shifts and get_FPSCR ACLE optimisation fixes
These newly updated tests were rewritten by Andrea. Some of them needed further manual fixing, as follows:

* The #shift immediate value was not in the check-function-bodies as expected.
* The ACLE was specifying sub-optimal code: lsr+and instead of ubfx. In this case the test rewritten from the ACLE had the lsr+and pattern, but the compiler was able to optimise it to ubfx. Hence I've changed the test to now match on ubfx.
* Added a separate test to check that shifts on constants are optimised to movs.

gcc/testsuite/ChangeLog:

        * gcc.target/arm/mve/intrinsics/srshr.c: Update shift value.
        * gcc.target/arm/mve/intrinsics/srshrl.c: Update shift value.
        * gcc.target/arm/mve/intrinsics/uqshl.c: Update shift value.
        * gcc.target/arm/mve/intrinsics/uqshll.c: Update shift value.
        * gcc.target/arm/mve/intrinsics/urshr.c: Update shift value.
        * gcc.target/arm/mve/intrinsics/urshrl.c: Update shift value.
        * gcc.target/arm/mve/intrinsics/vadciq_m_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vadciq_m_u32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vadciq_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vadciq_u32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vadcq_m_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vadcq_m_u32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vadcq_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vadcq_u32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbciq_m_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbciq_m_u32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbciq_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbciq_u32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbcq_m_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbcq_m_u32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbcq_s32.c: Update to ubfx.
        * gcc.target/arm/mve/intrinsics/vsbcq_u32.c: Update to ubfx.
        * gcc.target/arm/mve/mve_const_shifts.c: New test.
---
 .../gcc.target/arm/mve/intrinsics/srshr.c     |  2 +-
 .../gcc.target/arm/mve/intrinsics/srshrl.c    |  2 +-
 .../gcc.target/arm/mve/intrinsics/uqshl.c     | 14 +--
 .../gcc.target/arm/mve/intrinsics/uqshll.c    | 14 +--
 .../gcc.target/arm/mve/intrinsics/urshr.c     |  4 +-
 .../gcc.target/arm/mve/intrinsics/urshrl.c    |  4 +-
 .../arm/mve/intrinsics/vadciq_m_s32.c         |  8 +---
 .../arm/mve/intrinsics/vadciq_m_u32.c         |  8 +---
 .../arm/mve/intrinsics/vadciq_s32.c           |  8 +---
 .../arm/mve/intrinsics/vadciq_u32.c           |  8 +---
 .../arm/mve/intrinsics/vadcq_m_s32.c          |  8 +---
 .../arm/mve/intrinsics/vadcq_m_u32.c          |  8 +---
 .../gcc.target/arm/mve/intrinsics/vadcq_s32.c |  8 +---
 .../gcc.target/arm/mve/intrinsics/vadcq_u32.c |  8 +---
 .../arm/mve/intrinsics/vsbciq_m_s32.c         |  8 +---
 .../arm/mve/intrinsics/vsbciq_m_u32.c         |  8 +---
 .../arm/mve/intrinsics/vsbciq_s32.c           |  8 +---
 .../arm/mve/intrinsics/vsbciq_u32.c           |  8 +---
 .../arm/mve/intrinsics/vsbcq_m_s32.c          |  8 +---
 .../arm/mve/intrinsics/vsbcq_m_u32.c          |  8 +---
 .../gcc.target/arm/mve/intrinsics/vsbcq_s32.c |  8 +---
 .../gcc.target/arm/mve/intrinsics/vsbcq_u32.c |  8 +---
 .../gcc.target/arm/mve/mve_const_shifts.c     | 41 +++
 23 files changed, 81 insertions(+), 128 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/mve_const_shifts.c

diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
index 94e3f42fd33..734375d58c0 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** srshr (?:ip|fp|r[0-9]+), #shift(?:@.*|)
+** srshr (?:ip|fp|r[0-9]+), #1(?:@.*|)
 ** ...
 */
 int32_t
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
index 65f28ccbfde..a91943c38a0 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** srshrl (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #shift(?: @.*|)
+** srshrl (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #1(?: @.*|)
 ** ...
 */
 int64_t
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
index b23c9d97ba6..462531cad54 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** uqshl (?:ip|fp|r[0-9]+), #shift(?:@.*|)
+** uqshl (?:ip|fp|r[0-9]+), #1(?:@.*|)
 ** ...
[committed trunk 2/9] arm: Fix vstrwq* backend + testsuite
From: Andrea Corallo

Hi all,

this patch fixes the vstrwq* MVE intrinsics failing to emit the correct sequence of instructions due to a missing predicate. Also the immediate range is fixed to be multiples of 2 between [-252, 252].

Best Regards

Andrea

gcc/ChangeLog:

        * config/arm/constraints.md (mve_vldrd_immediate): Move it to
        predicates.md.
        (Ri): Move constraint definition from predicates.md.
        (Rl): Define new constraint.
        * config/arm/mve.md (mve_vstrwq_scatter_base_wb_p_v4si): Add missing
        constraint.
        (mve_vstrwq_scatter_base_wb_p_fv4sf): Add missing Up constraint for
        op 1, use mve_vstrw_immediate predicate and Rl constraint for op 2.
        Fix asm output spacing.
        (mve_vstrdq_scatter_base_wb_p_v2di): Add missing constraint.
        * config/arm/predicates.md (Ri): Move constraint to constraints.md.
        (mve_vldrd_immediate): Move it from constraints.md.
        (mve_vstrw_immediate): New predicate.

gcc/testsuite/ChangeLog:

        * gcc.target/arm/mve/intrinsics/vstrwq_f32.c: Use
        check-function-bodies instead of scan-assembler checks. Use
        extern "C" for C++ testing.
        * gcc.target/arm/mve/intrinsics/vstrwq_p_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_p_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_p_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_f32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_s32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_u32.c: Likewise.
        * gcc.target/arm/mve/intrinsics/vstrwq_u32.c: Likewise.
---
 gcc/config/arm/constraints.md                 | 20 --
 gcc/config/arm/mve.md                         | 10 ++---
 gcc/config/arm/predicates.md                  | 14 +++
 .../arm/mve/intrinsics/vstrwq_f32.c           | 32 ---
 .../arm/mve/intrinsics/vstrwq_p_f32.c         | 40 ---
 .../arm/mve/intrinsics/vstrwq_p_s32.c         | 40 ---
 .../arm/mve/intrinsics/vstrwq_p_u32.c         | 40 ---
 .../arm/mve/intrinsics/vstrwq_s32.c           | 32 ---
 .../mve/intrinsics/vstrwq_scatter_base_f32.c  | 28 +++--
 .../intrinsics/vstrwq_scatter_base_p_f32.c    | 36 +++--
 .../intrinsics/vstrwq_scatter_base_p_s32.c    | 36 +++--
 .../intrinsics/vstrwq_scatter_base_p_u32.c    | 36 +++--
 .../mve/intrinsics/vstrwq_scatter_base_s32.c  | 28 +++--
 .../mve/intrinsics/vstrwq_scatter_base_u32.c  | 28 +++--
 .../intrinsics/vstrwq_scatter_base_wb_f32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_f32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_s32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_u32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_s32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_base_wb_u32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_offset_f32.c    | 32 ---
 .../intrinsics/vstrwq_scatter_offset_p_f32.c  | 40 ---
[committed trunk 8/9] arm testsuite: XFAIL or relax registers in some tests [PR109697]
Hi all,

This is a simple testsuite tidy-up patch, addressing two types of errors:

* The vcmp vector-scalar tests failing due to the compiler's preference of vector-vector comparisons over vector-scalar comparisons. This is due to the lack of a cost model for MVE and the compiler not knowing that the RTL vec_duplicate is free in those instructions. For now, we simply XFAIL these checks.
* The tests for pr108177 had strict usage of the q0 and r0 registers, meaning that they would FAIL with -mfloat-abi=softfp. The register checks have now been relaxed. A couple of these run-tests also had inconsistent use of integer MVE with floating-point vectors, so I've now changed these to use FP MVE.

gcc/testsuite/ChangeLog:

        PR target/109697
        * gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c: XFAIL check.
        * gcc.target/arm/mve/intrinsics/vcmpneq_n_u8.c: XFAIL check.
        * gcc.target/arm/mve/pr108177-1.c: Relax registers.
        * gcc.target/arm/mve/pr108177-10.c: Relax registers.
        * gcc.target/arm/mve/pr108177-11.c: Relax registers.
        * gcc.target/arm/mve/pr108177-12.c: Relax registers.
        * gcc.target/arm/mve/pr108177-13.c: Relax registers.
        * gcc.target/arm/mve/pr108177-13-run.c: Use mve_fp.
        * gcc.target/arm/mve/pr108177-14.c: Relax registers.
        * gcc.target/arm/mve/pr108177-14-run.c: Use mve_fp.
        * gcc.target/arm/mve/pr108177-2.c: Relax registers.
        * gcc.target/arm/mve/pr108177-3.c: Relax registers.
        * gcc.target/arm/mve/pr108177-4.c: Relax registers.
        * gcc.target/arm/mve/pr108177-5.c: Relax registers.
        * gcc.target/arm/mve/pr108177-6.c: Relax registers.
        * gcc.target/arm/mve/pr108177-7.c: Relax registers.
        * gcc.target/arm/mve/pr108177-8.c: Relax registers.
        * gcc.target/arm/mve/pr108177-9.c: Relax registers.
---
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c | 2 +-
[committed trunk 4/9] arm: Stop vadcq, vsbcq intrinsics from overwriting the FPSCR NZ flags
Hi all,

We noticed that calls to the vadcq and vsbcq intrinsics, both of which use __builtin_arm_set_fpscr_nzcvqc to set the Carry flag in the FPSCR, would produce the following code:

```
< r2 is the *carry input >
vmrs    r3, FPSCR_nzcvqc
bic     r3, r3, #536870912
orr     r3, r3, r2, lsl #29
vmsr    FPSCR_nzcvqc, r3
```

whereas the MVE ACLE instead gives a different instruction sequence:

```
< Rt is the *carry input >
VMRS Rs, FPSCR_nzcvqc
BFI Rs, Rt, #29, #1
VMSR FPSCR_nzcvqc, Rs
```

The bic + orr pair is slower, and it's also wrong: if the *carry input is greater than 1, then we risk overwriting the top two bits of the FPSCR register (the N and Z flags).

This turned out to be a problem in the header file, and the solution was to simply add a `& 0x1u` to the `*carry` input: then the compiler knows that we only care about the lowest bit and can optimise to a BFI.

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:

        * config/arm/arm_mve.h (__arm_vadcq_s32): Fix arithmetic.
        (__arm_vadcq_u32): Likewise.
        (__arm_vadcq_m_s32): Likewise.
        (__arm_vadcq_m_u32): Likewise.
        (__arm_vsbcq_s32): Likewise.
        (__arm_vsbcq_u32): Likewise.
        (__arm_vsbcq_m_s32): Likewise.
        (__arm_vsbcq_m_u32): Likewise.
        * config/arm/mve.md (get_fpscr_nzcvqc): Make unspec_volatile.

gcc/testsuite/ChangeLog:

        * gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: New.
---
 gcc/config/arm/arm_mve.h                      | 16 ++---
 gcc/config/arm/mve.md                         |  2 +-
 .../arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c | 67 +++
 3 files changed, 76 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 1774e6eca2b..4ad1c99c288 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -4098,7 +4098,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vadcq_sv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4108,7 +4108,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_u32 (uint32x4_t __a, uint32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29));
   uint32x4_t __res = __builtin_mve_vadcq_uv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4118,7 +4118,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_m_s32 (int32x4_t __inactive, int32x4_t __a, int32x4_t __b, unsigned * __carry, mve_pred16_t __p)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vadcq_m_sv4si (__inactive, __a, __b, __p);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4128,7 +4128,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_m_u32 (uint32x4_t __inactive, uint32x4_t __a, uint32x4_t __b, unsigned * __carry, mve_pred16_t __p)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29));
   uint32x4_t __res = __builtin_mve_vadcq_m_uv4si (__inactive, __a, __b, __p);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4174,7 +4174,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vsbcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vsbcq_sv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4184,7 +4184,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__,
[committed trunk 5/9] arm: Fix overloading of MVE scalar constant parameters on vbicq
We found this as part of the wider testsuite updates. The applicable tests were authored by Andrea earlier in this patch series.

Ok for trunk?

gcc/ChangeLog:

        * config/arm/arm_mve.h (__arm_vbicq): Change coerce on scalar
        constant.

---
 gcc/config/arm/arm_mve.h | 16
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 4ad1c99c288..30cec519791 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -10847,10 +10847,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 (__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
@@ -11699,10 +11699,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 (__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
-- 
2.25.1
[committed gcc12 backport] arm: Fix overloading of MVE scalar constant parameters on vbicq, vmvnq_m
We found this as part of the wider testsuite updates. The applicable tests were authored by Andrea earlier in this patch series.

Ok for trunk?

gcc/ChangeLog:

        * config/arm/arm_mve.h (__arm_vbicq): Change coerce on scalar
        constant.
        (__arm_vmvnq_m): Likewise.

---
 gcc/config/arm/arm_mve.h | 24
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 39b3446617d..0b35bd0eedd 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -35906,10 +35906,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 (__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
@@ -38825,10 +38825,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 (__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
@@ -40962,10 +40962,10 @@ extern void *__ARM_undef;
   int (*)[__ARM_mve_type_uint8x16_t][__ARM_mve_type_uint8x16_t]: __arm_vmvnq_m_u8 (__ARM_mve_coerce(__p0, uint8x16_t), __ARM_mve_coerce(__p1, uint8x16_t), p2), \
   int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_uint16x8_t]: __arm_vmvnq_m_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce(__p1, uint16x8_t), p2), \
   int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: __arm_vmvnq_m_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, uint32x4_t), p2), \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vmvnq_m_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1(__p1,
[committed gcc12 backport] arm testsuite: Shifts and get_FPSCR ACLE optimisation fixes
These newly updated tests were rewritten by Andrea. Some of them needed further manual fixing, as follows: * The #shift immediate value was not in the check-function-bodies as expected. * The ACLE was specifying sub-optimal code: lsr+and instead of ubfx. In this case the test rewritten from the ACLE had the lsr+and pattern, but the compiler was able to optimise to ubfx. Hence I've changed the test to now match on ubfx. * Added a separate test to check that shifts on constants are optimised to movs. gcc/testsuite/ChangeLog: * gcc.target/arm/mve/intrinsics/srshr.c: Update shift value. * gcc.target/arm/mve/intrinsics/srshrl.c: Update shift value. * gcc.target/arm/mve/intrinsics/uqshl.c: Update shift value. * gcc.target/arm/mve/intrinsics/uqshll.c: Update shift value. * gcc.target/arm/mve/intrinsics/urshr.c: Update shift value. * gcc.target/arm/mve/intrinsics/urshrl.c: Update shift value. * gcc.target/arm/mve/intrinsics/vadciq_m_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vadciq_m_u32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vadciq_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vadciq_u32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vadcq_m_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vadcq_m_u32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vadcq_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vadcq_u32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbciq_m_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbciq_m_u32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbciq_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbciq_u32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbcq_m_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbcq_m_u32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbcq_s32.c: Update to ubfx. * gcc.target/arm/mve/intrinsics/vsbcq_u32.c: Update to ubfx. * gcc.target/arm/mve/mve_const_shifts.c: New test. 
--- .../gcc.target/arm/mve/intrinsics/srshr.c | 2 +- .../gcc.target/arm/mve/intrinsics/srshrl.c| 2 +- .../gcc.target/arm/mve/intrinsics/uqshl.c | 14 +-- .../gcc.target/arm/mve/intrinsics/uqshll.c| 14 +-- .../gcc.target/arm/mve/intrinsics/urshr.c | 4 +- .../gcc.target/arm/mve/intrinsics/urshrl.c| 4 +- .../arm/mve/intrinsics/vadciq_m_s32.c | 8 +--- .../arm/mve/intrinsics/vadciq_m_u32.c | 8 +--- .../arm/mve/intrinsics/vadciq_s32.c | 8 +--- .../arm/mve/intrinsics/vadciq_u32.c | 8 +--- .../arm/mve/intrinsics/vadcq_m_s32.c | 8 +--- .../arm/mve/intrinsics/vadcq_m_u32.c | 8 +--- .../gcc.target/arm/mve/intrinsics/vadcq_s32.c | 8 +--- .../gcc.target/arm/mve/intrinsics/vadcq_u32.c | 8 +--- .../arm/mve/intrinsics/vsbciq_m_s32.c | 8 +--- .../arm/mve/intrinsics/vsbciq_m_u32.c | 8 +--- .../arm/mve/intrinsics/vsbciq_s32.c | 8 +--- .../arm/mve/intrinsics/vsbciq_u32.c | 8 +--- .../arm/mve/intrinsics/vsbcq_m_s32.c | 8 +--- .../arm/mve/intrinsics/vsbcq_m_u32.c | 8 +--- .../gcc.target/arm/mve/intrinsics/vsbcq_s32.c | 8 +--- .../gcc.target/arm/mve/intrinsics/vsbcq_u32.c | 8 +--- .../gcc.target/arm/mve/mve_const_shifts.c | 41 +++ 23 files changed, 81 insertions(+), 128 deletions(-) create mode 100644 gcc/testsuite/gcc.target/arm/mve/mve_const_shifts.c diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c index 94e3f42fd33..734375d58c0 100644 --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c @@ -12,7 +12,7 @@ extern "C" { /* **foo: ** ... -** srshr (?:ip|fp|r[0-9]+), #shift(?:@.*|) +** srshr (?:ip|fp|r[0-9]+), #1(?:@.*|) ** ... */ int32_t diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c index 65f28ccbfde..a91943c38a0 100644 --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c @@ -12,7 +12,7 @@ extern "C" { /* **foo: ** ... 
-** srshrl (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #shift(?: @.*|) +** srshrl (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #1(?: @.*|) ** ... */ int64_t diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c index b23c9d97ba6..462531cad54 100644 --- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c @@ -12,7 +12,7 @@ extern "C" { /* **foo: ** ... -** uqshl (?:ip|fp|r[0-9]+), #shift(?:@.*|) +** uqshl (?:ip|fp|r[0-9]+), #1(?:@.*|) ** ...
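The lsr+and vs. ubfx point above can be illustrated in plain C (a hedged sketch: the helper name and operand values are made up for illustration; on Arm targets compilers typically fold this shift-and-mask pattern into a single ubfx instruction, which is what the updated tests now match on):

```c
#include <assert.h>

/* A logical shift right followed by a mask, as the ACLE examples
   spelled it out (lsr + and).  Compilers for Arm generally recognise
   this as an unsigned bitfield extract and emit one ubfx.  */
static inline unsigned
bitfield_extract (unsigned x, unsigned lsb, unsigned width)
{
  return (x >> lsb) & ((1u << width) - 1u);
}
```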
[committed gcc12 backport] arm testsuite: Remove redundant tests
Following Andrea's overhaul of the MVE testsuite, these tests are now redundant, as equivalent checks have been added to each intrinsic's .c test. gcc/testsuite/ChangeLog: * gcc.target/arm/mve/intrinsics/mve_fp_vaddq_n.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vaddq_m.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vaddq_n.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u8.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_s64.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_u64.c: Removed. 
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_s64.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_u64.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_s64.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_u64.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_s64.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_u64.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_f16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_f16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_f16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_f16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s16.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u16.c: Removed. 
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_f32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_s32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_u32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_f32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_s32.c: Removed. * gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_u32.c: Removed. *
[committed gcc12 backport] arm testsuite: XFAIL or relax registers in some tests [PR109697]
Hi all, This is a simple testsuite tidy-up patch, addressing two types of errors: * The vcmp vector-scalar tests failing due to the compiler's preference of vector-vector comparisons over vector-scalar comparisons. This is due to the lack of a cost model for MVE and the compiler not knowing that the RTL vec_duplicate is free in those instructions. For now, we simply XFAIL these checks. * The tests for pr108177 had strict usage of q0 and r0 registers, meaning that they would FAIL with -mfloat-abi=softfp. The register checks have now been relaxed. A couple of these run-tests also had inconsistent use of integer MVE with floating-point vectors, so I've now changed these to use FP MVE. gcc/testsuite/ChangeLog: PR target/109697 * gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c: XFAIL check. 
* gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c: XFAIL check. * gcc.target/arm/mve/intrinsics/vcmpneq_n_u8.c: XFAIL check. * gcc.target/arm/mve/pr108177-1.c: Relax registers. * gcc.target/arm/mve/pr108177-10.c: Relax registers. * gcc.target/arm/mve/pr108177-11.c: Relax registers. * gcc.target/arm/mve/pr108177-12.c: Relax registers. * gcc.target/arm/mve/pr108177-13.c: Relax registers. * gcc.target/arm/mve/pr108177-13-run.c: use mve_fp * gcc.target/arm/mve/pr108177-14.c: Relax registers. * gcc.target/arm/mve/pr108177-14-run.c: use mve_fp * gcc.target/arm/mve/pr108177-2.c: Relax registers. * gcc.target/arm/mve/pr108177-3.c: Relax registers. * gcc.target/arm/mve/pr108177-4.c: Relax registers. * gcc.target/arm/mve/pr108177-5.c: Relax registers. * gcc.target/arm/mve/pr108177-6.c: Relax registers. * gcc.target/arm/mve/pr108177-7.c: Relax registers. * gcc.target/arm/mve/pr108177-8.c: Relax registers. * gcc.target/arm/mve/pr108177-9.c: Relax registers. 
--- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c | 2 +- gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c | 2 +-
[committed gcc12 backport] arm: Stop vadcq, vsbcq intrinsics from overwriting the FPSCR NZ flags
Hi all, We noticed that calls to the vadcq and vsbcq intrinsics, both of which use __builtin_arm_set_fpscr_nzcvqc to set the Carry flag in the FPSCR, would produce the following code:
```
< r2 is the *carry input >
vmrs r3, FPSCR_nzcvqc
bic r3, r3, #536870912
orr r3, r3, r2, lsl #29
vmsr FPSCR_nzcvqc, r3
```
when the MVE ACLE instead gives a different instruction sequence of:
```
< Rt is the *carry input >
VMRS Rs, FPSCR_nzcvqc
BFI Rs, Rt, #29, #1
VMSR FPSCR_nzcvqc, Rs
```
The bic + orr pair is slower and it's also wrong, because, if the *carry input is greater than 1, then we risk overwriting the top two bits of the FPSCR register (the N and Z flags). This turned out to be a problem in the header file and the solution was to simply add a `& 0x1u` to the `*carry` input: then the compiler knows that we only care about the lowest bit and can optimise to a BFI. Ok for trunk? Thanks, Stam Markianos-Wright gcc/ChangeLog: * config/arm/arm_mve.h (__arm_vadcq_s32): Fix arithmetic. (__arm_vadcq_u32): Likewise. (__arm_vadcq_m_s32): Likewise. (__arm_vadcq_m_u32): Likewise. (__arm_vsbcq_s32): Likewise. (__arm_vsbcq_u32): Likewise. (__arm_vsbcq_m_s32): Likewise. (__arm_vsbcq_m_u32): Likewise. * config/arm/mve.md (get_fpscr_nzcvqc): Make unspec_volatile. gcc/testsuite/ChangeLog: * gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: New. 
(cherry picked from commit f1417d051be094ffbce228e11951f3e12e8fca1c) --- gcc/config/arm/arm_mve.h | 16 ++--- gcc/config/arm/mve.md | 2 +- .../arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c | 67 +++ 3 files changed, 76 insertions(+), 9 deletions(-) create mode 100644 gcc/testsuite/gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h index 82ceec2bbfc..6bf1794d2ff 100644 --- a/gcc/config/arm/arm_mve.h +++ b/gcc/config/arm/arm_mve.h @@ -16055,7 +16055,7 @@ __extension__ extern __inline int32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) __arm_vadcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry) { - __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29)); + __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29)); int32x4_t __res = __builtin_mve_vadcq_sv4si (__a, __b); *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u; return __res; @@ -16065,7 +16065,7 @@ __extension__ extern __inline uint32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) __arm_vadcq_u32 (uint32x4_t __a, uint32x4_t __b, unsigned * __carry) { - __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29)); + __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29)); uint32x4_t __res = __builtin_mve_vadcq_uv4si (__a, __b); *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u; return __res; @@ -16075,7 +16075,7 @@ __extension__ extern __inline int32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) __arm_vadcq_m_s32 (int32x4_t __inactive, int32x4_t __a, int32x4_t __b, unsigned * __carry, mve_pred16_t __p) { - __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29)); + __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x2000u) | ((*__carry & 0x1u) << 29)); int32x4_t __res = __builtin_mve_vadcq_m_sv4si (__inactive, __a, __b, __p); *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u; return __res; @@ -16085,7 +16085,7 @@ __extension__ extern __inline uint32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) __arm_vadcq_m_u32 (uint32x4_t __inactive, uint32x4_t __a, uint32x4_t __b, unsigned * __carry, mve_pred16_t __p) { - __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29)); + __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29)); uint32x4_t __res = __builtin_mve_vadcq_m_uv4si (__inactive, __a, __b, __p); *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u; return __res; @@ -16131,7 +16131,7 @@ __extension__ extern __inline int32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) __arm_vsbcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry) { - __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | (*__carry << 29)); + __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & ~0x2000u) | ((*__carry & 0x1u) << 29)); int32x4_t __res = __builtin_mve_vsbcq_sv4si (__a, __b); *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u; return __res; @@ -16141,7
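The carry handling in the patch above can be modelled in plain C (a hedged sketch: the helper names are invented, and the real header goes through the FPSCR builtins). Masking the carry input down to its lowest bit keeps the insertion confined to bit 29, so the N and Z flags in bits 31 and 30 survive, and the compiler can see a single-bit insert and emit BFI:

```c
#include <assert.h>

/* Model of the fixed carry handling: only bit 0 of the carry input is
   meaningful, so mask it before shifting it into FPSCR bit 29.  This
   maps onto a single BFI Rs, Rt, #29, #1 instead of bic + orr.  */
static inline unsigned
set_fpscr_carry (unsigned fpscr, unsigned carry)
{
  return (fpscr & ~(1u << 29)) | ((carry & 0x1u) << 29);
}

/* Read the carry back out, as the intrinsics do after the operation.  */
static inline unsigned
get_fpscr_carry (unsigned fpscr)
{
  return (fpscr >> 29) & 0x1u;
}
```

Without the `& 0x1u`, a carry input of 2 would shift into bit 30 and corrupt the Z flag.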
[committed gcc12 backport] [arm] complete vmsr/vmrs blank and case adjustments
From: Alexandre Oliva Back in September last year, some of the vmsr and vmrs patterns had an extraneous blank removed, and the case of register names lowered, but another instance remained, and so did a testcase. for gcc/ChangeLog * config/arm/vfp.md (*thumb2_movsi_vfp): Drop blank after tab after vmsr and vmrs, and lower the case of P0. for gcc/testsuite/ChangeLog * gcc.target/arm/acle/cde-mve-full-assembly.c: Drop blank after tab after vmsr, and lower the case of P0. --- gcc/config/arm/vfp.md | 4 +- .../arm/acle/cde-mve-full-assembly.c | 264 +- 2 files changed, 134 insertions(+), 134 deletions(-) diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md index 932e4b7447e..7a430ef8d36 100644 --- a/gcc/config/arm/vfp.md +++ b/gcc/config/arm/vfp.md @@ -312,9 +312,9 @@ (define_insn "*thumb2_movsi_vfp" case 12: case 13: return output_move_vfp (operands); case 14: - return \"vmsr\\t P0, %1\"; + return \"vmsr\\tp0, %1\"; case 15: - return \"vmrs\\t %0, P0\"; + return \"vmrs\\t%0, p0\"; case 16: return \"mcr\\tp10, 7, %1, cr1, cr0, 0\\t @SET_FPSCR\"; case 17: diff --git a/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c b/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c index 501cc84da10..e3e7f7ef3e5 100644 --- a/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c +++ b/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c @@ -567,80 +567,80 @@ contain back references). 
*/ /* ** test_cde_vcx1q_mfloat16x8_tintint: -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) ** vpst ** vcx1t p0, q0, #32 ** bx lr */ /* ** test_cde_vcx1q_mfloat32x4_tintint: -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) ** vpst ** vcx1t p0, q0, #32 ** bx lr */ /* ** test_cde_vcx1q_muint8x16_tintint: -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) ** vpst ** vcx1t p0, q0, #32 ** bx lr */ /* ** test_cde_vcx1q_muint16x8_tintint: -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) ** vpst ** vcx1t p0, q0, #32 ** bx lr */ /* ** test_cde_vcx1q_muint32x4_tintint: -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) ** vpst ** vcx1t p0, q0, #32 
** bx lr */ /* ** test_cde_vcx1q_muint64x2_tintint: -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) ** vpst ** vcx1t p0, q0, #32 ** bx lr */ /* ** test_cde_vcx1q_mint8x16_tintint: -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) -** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr P0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @ movhi) +** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64 d1, \.L[0-9]*\+8|vmsr p0, r2 @
[committed gcc12 backport] arm: Add vorrq_n overloading into vorrq _Generic
We found this as part of the wider testsuite updates. The applicable tests are authored by Andrea earlier in this patch series Ok for trunk? gcc/ChangeLog: * config/arm/arm_mve.h (__arm_vorrq): Add _n variant. --- gcc/config/arm/arm_mve.h | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h index 6bf1794d2ff..39b3446617d 100644 --- a/gcc/config/arm/arm_mve.h +++ b/gcc/config/arm/arm_mve.h @@ -35852,6 +35852,10 @@ extern void *__ARM_undef; int (*)[__ARM_mve_type_uint8x16_t][__ARM_mve_type_uint8x16_t]: __arm_vorrq_u8 (__ARM_mve_coerce(__p0, uint8x16_t), __ARM_mve_coerce(__p1, uint8x16_t)), \ int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_uint16x8_t]: __arm_vorrq_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce(__p1, uint16x8_t)), \ int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: __arm_vorrq_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, uint32x4_t)), \ + int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3(p1, int)), \ + int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3(p1, int)), \ + int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3(p1, int)), \ + int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3(p1, int)), \ int (*)[__ARM_mve_type_float16x8_t][__ARM_mve_type_float16x8_t]: __arm_vorrq_f16 (__ARM_mve_coerce(__p0, float16x8_t), __ARM_mve_coerce(__p1, float16x8_t)), \ int (*)[__ARM_mve_type_float32x4_t][__ARM_mve_type_float32x4_t]: __arm_vorrq_f32 (__ARM_mve_coerce(__p0, float32x4_t), __ARM_mve_coerce(__p1, float32x4_t)));}) @@ -38637,7 +38641,11 @@ extern void *__ARM_undef; int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: 
__arm_vorrq_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \ int (*)[__ARM_mve_type_uint8x16_t][__ARM_mve_type_uint8x16_t]: __arm_vorrq_u8 (__ARM_mve_coerce(__p0, uint8x16_t), __ARM_mve_coerce(__p1, uint8x16_t)), \ int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_uint16x8_t]: __arm_vorrq_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce(__p1, uint16x8_t)), \ - int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: __arm_vorrq_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, uint32x4_t)));}) + int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: __arm_vorrq_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, uint32x4_t)), \ + int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3(p1, int)), \ + int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3(p1, int)), \ + int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s16 (__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3(p1, int)), \ + int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s32 (__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3(p1, int)));}) #define __arm_vornq(p0,p1) ({ __typeof(p0) __p0 = (p0); \ __typeof(p1) __p1 = (p1); \ -- 2.25.1
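The overloading added above relies on C11 _Generic selecting an implementation from the argument's type. A stripped-down sketch of the technique (hedged: the toy type and function names are made up and only stand in for the real __ARM_mve_typeid machinery):

```c
#include <assert.h>

/* Toy 4-lane vector type standing in for uint32x4_t.  */
typedef struct { int lane[4]; } vec4;

static int orr_vec (vec4 a, vec4 b) { return a.lane[0] | b.lane[0]; }
static int orr_n   (vec4 a, int n)  { return a.lane[0] | n; }

/* _Generic inspects the type of the second argument and picks either
   the vector-vector or the vector-scalar (_n) overload, just as the
   vorrq macro dispatches between __arm_vorrq_* and __arm_vorrq_n_*.  */
#define my_vorrq(p0, p1) _Generic ((p1), \
    vec4:    orr_vec,                    \
    default: orr_n) ((p0), (p1))
```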
[committed gcc12 backport] arm: Fix vstrwq* backend + testsuite
From: Andrea Corallo Hi all, this patch fixes the vstrwq* MVE intrinsics failing to emit the correct sequence of instructions due to a missing predicate. Also, the immediate range is fixed to multiples of 2 in the range [-252, 252]. Best Regards Andrea gcc/ChangeLog: * config/arm/constraints.md (mve_vldrd_immediate): Move it to predicates.md. (Ri): Move constraint definition from predicates.md. (Rl): Define new constraint. * config/arm/mve.md (mve_vstrwq_scatter_base_wb_p_v4si): Add missing constraint. (mve_vstrwq_scatter_base_wb_p_fv4sf): Add missing Up constraint for op 1, use mve_vstrw_immediate predicate and Rl constraint for op 2. Fix asm output spacing. (mve_vstrdq_scatter_base_wb_p_v2di): Add missing constraint. * config/arm/predicates.md (Ri): Move constraint to constraints.md. (mve_vldrd_immediate): Move it from constraints.md. (mve_vstrw_immediate): New predicate. gcc/testsuite/ChangeLog: * gcc.target/arm/mve/intrinsics/vstrwq_f32.c: Use check-function-bodies instead of scan-assembler checks. Use extern "C" for C++ testing. * gcc.target/arm/mve/intrinsics/vstrwq_p_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_p_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_p_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c: Likewise. 
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_f32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_s32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_u32.c: Likewise. * gcc.target/arm/mve/intrinsics/vstrwq_u32.c: Likewise. 
--- gcc/config/arm/constraints.md | 20 -- gcc/config/arm/mve.md | 10 ++--- gcc/config/arm/predicates.md | 14 +++ .../arm/mve/intrinsics/vstrwq_f32.c | 32 --- .../arm/mve/intrinsics/vstrwq_p_f32.c | 40 --- .../arm/mve/intrinsics/vstrwq_p_s32.c | 40 --- .../arm/mve/intrinsics/vstrwq_p_u32.c | 40 --- .../arm/mve/intrinsics/vstrwq_s32.c | 32 --- .../mve/intrinsics/vstrwq_scatter_base_f32.c | 28 +++-- .../intrinsics/vstrwq_scatter_base_p_f32.c| 36 +++-- .../intrinsics/vstrwq_scatter_base_p_s32.c| 36 +++-- .../intrinsics/vstrwq_scatter_base_p_u32.c| 36 +++-- .../mve/intrinsics/vstrwq_scatter_base_s32.c | 28 +++-- .../mve/intrinsics/vstrwq_scatter_base_u32.c | 28 +++-- .../intrinsics/vstrwq_scatter_base_wb_f32.c | 32 --- .../intrinsics/vstrwq_scatter_base_wb_p_f32.c | 40 --- .../intrinsics/vstrwq_scatter_base_wb_p_s32.c | 40 --- .../intrinsics/vstrwq_scatter_base_wb_p_u32.c | 40 --- .../intrinsics/vstrwq_scatter_base_wb_s32.c | 32 --- .../intrinsics/vstrwq_scatter_base_wb_u32.c | 32 --- .../intrinsics/vstrwq_scatter_offset_f32.c| 32 --- .../intrinsics/vstrwq_scatter_offset_p_f32.c | 40 ---
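The immediate constraint described in the patch above can be sketched as a plain C check. The step and range here are taken from the email text ("multiples of 2 in [-252, 252]"), not from the actual predicates.md source, so treat the numbers as illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative check mirroring the described mve_vstrw_immediate
   predicate: the writeback offset must be a multiple of 2 within
   [-252, 252].  Numbers are as stated in the email above.  */
static bool vstrw_immediate_ok (int imm)
{
  return imm >= -252 && imm <= 252 && (imm % 2) == 0;
}
```

Under this description, any other offset would be rejected by the new Rl constraint and would have to be handled differently (e.g. computed into the base register first).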
[PATCH 2/2 v2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
- Respin of the below patch - In this 2/2 patch, from v1 to v2 I have: * Removed the modification of the interface of the doloop_end target-insn (so I no longer need to touch any other target backends) * Added more modes to `arm_get_required_vpr_reg` to make it flexible between searching: all operands/only input arguments/only outputs. Also added helpers: `arm_get_required_vpr_reg_ret_val` and `arm_get_required_vpr_reg_param` * Added support for the use of other VPR predicate values within a dlstp/letp loop, as long as they don't originate from the vctp-generated VPR value. Also changed `arm_mve_get_loop_unique_vctp` to the simpler `arm_mve_get_loop_vctp` since now we can support other VCTP insns within the loop. * Added support for loops of the form: int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes; for (i = 0; i < num_of_iters; i++) { p = vctp (num_of_elem); num_of_elem -= num_of_lanes; } to be transformed into dlstp/letp loops. * Changed the VCTP look-ahead for SIGN_EXTEND and SUBREG insns to use df def/use chains instead of `next_nonnote_nondebug_insn_bb`. * Added support for using unpredicated (but predicable) insns within the dlstp/letp loop. These need to meet some specific conditions, because they _will_ become implicitly tail predicated by the dlstp/letp transformation. * Added a df chain check to any other instructions to make sure that they don't USE the VCTP-generated VPR value. * Added testing of all these various edge cases. Original email with updated Changelog at the end: Hi all, This is the 2/2 patch that contains the functional changes needed for MVE Tail Predicated Low Overhead Loops. See my previous email for a general introduction of MVE LOLs. This support is added through the already existing loop-doloop mechanisms that are used for non-MVE dls/le looping. 
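The loop shape above can be simulated in plain, portable C to see why tracking the element count works: the per-iteration predicate covers a full vector's worth of lanes until the final, tail-predicated iteration. This is a host-side sketch with no MVE intrinsics; `simulated_vctp` only models the number of active lanes a real vctp-generated predicate would enable, and all names are illustrative.

```c
#include <assert.h>

/* Model of vctp: the number of active lanes this iteration — either a
   full vector, or the leftover tail (possibly zero).  */
static int simulated_vctp (int num_of_elem, int num_of_lanes)
{
  if (num_of_elem >= num_of_lanes)
    return num_of_lanes;
  return num_of_elem > 0 ? num_of_elem : 0;
}

/* Process n elements in groups of `lanes`, in the shape described
   above; returns how many elements were actually touched, which must
   equal n for any n >= 0.  */
static int process_elementwise (int n, int lanes)
{
  int touched = 0;
  int num_of_iters = (n + lanes - 1) / lanes;
  for (int i = 0; i < num_of_iters; i++)
    {
      int p = simulated_vctp (n, lanes); /* predicate for this iteration */
      touched += p;
      n -= lanes; /* the non--1 decrement the relaxed doloop supports */
    }
  return touched;
}
```

The dlstp/letp transformation described above effectively moves this bookkeeping into hardware: the loop iterates on the element count directly instead of a pre-computed iteration count.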
Changes are: 1) Relax the loop-doloop mechanism in the mid-end to allow for decrement numbers other than -1 and for `count` to be an rtx containing the number of elements to be processed, rather than an expression for calculating the number of iterations. 2) Add an `allow_elementwise_doloop` target hook. This allows the target backend to manipulate the iteration count as it needs: in our case to change it from a pre-calculation of the number of iterations to the number of elements to be processed. 3) The doloop_end target-insn now has an additional parameter: the `count` (note: this is before it gets modified to just be the number of elements), so that the decrement value is extracted from that parameter. And many things in the backend to implement the above optimisation: 4) Appropriate changes to the define_expand of doloop_end and new patterns for dlstp and letp. 5) `arm_attempt_dlstp_transform`: (called from the define_expand of doloop_end) this function checks for the loop's suitability for dlstp/letp transformation and then implements it, if possible. 6) `arm_mve_get_loop_unique_vctp`: A function that loops through the loop contents and returns the vctp VPR-generating operation within the loop, if it is unique, i.e. there is exactly one vctp within the loop. 7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg` to check an insn to see if it requires the VPR or not. No regressions on arm-none-eabi with various targets and on aarch64-none-elf. Thoughts on getting this into trunk? Thank you, Stam Markianos-Wright gcc/ChangeLog: * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New. * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New. (arm_mve_get_vctp_lanes): New. (arm_get_required_vpr_reg): New. (arm_get_required_vpr_reg_ret_val): New. (arm_get_required_vpr_reg_param): New. (arm_mve_get_loop_vctp): New. (arm_attempt_dlstp_transform): New. 
(arm_allow_elementwise_doloop): New. * config/arm/iterators.md (DLSTP): New. (mode1): Add DLSTP mappings. * config/arm/mve.md (*predicated_doloop_end_internal): New. (dlstp_insn): New. * config/arm/thumb2.md (doloop_end): Update for MVE LOLs. * config/arm/unspecs.md: New unspecs. * tm.texi: Document new hook. * tm.texi.in: Likewise. * loop-doloop.cc (doloop_condition_get): Relax conditions. (doloop_optimize): Add support for elementwise LoLs. * target.def (allow_elementwise_doloop): New hook. * targhooks.cc (default_allow_elementwise_doloop): New. * targhooks.h (default_allow_elementwise_doloop): New. gcc/testsuite/ChangeLog: * gcc.target/arm/lob.h: Update framework. * gcc.target/arm/lob1.c: Likewise. * gcc.target/arm/lob6.c: Likewise. * gcc.target/arm/dlstp-int16x8.c: New test. * gcc.target/arm/dlstp-int32x4.c: New test. *
[PING][PATCH] arm: Split up MVE _Generic associations to prevent type clashes [PR107515]
Hi all, With these previous patches: https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606586.html https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606587.html we enabled the MVE overloaded _Generic associations to handle more scalar types; however, at PR 107515 we found a new regression that wasn't detected in our testing: With glibc's `posix/types.h`: ``` typedef signed int __int32_t; ... typedef __int32_t int32_t; ``` We would get an `error: '_Generic' specifies two compatible types` from `__ARM_mve_coerce3`, because the `type: param` association (where `type` is `int`) and the `int32_t: param` association name the same type under the hood. The same did not happen with Newlib's header `sys/_stdint.h`: ``` typedef long int __int32_t; ... typedef __int32_t int32_t; ``` which worked fine, because it uses `long int`. The same could feasibly happen in `__ARM_mve_coerce2` between `__fp16` and `float16_t`. The solution here is to break the _Generic down, so that the similar types don't appear at the same level, as is done in `__ARM_mve_typeid`. Ok for trunk? Thanks, Stam Markianos-Wright gcc/ChangeLog: PR target/96795 PR target/107515 * config/arm/arm_mve.h (__ARM_mve_coerce2): Split types. (__ARM_mve_coerce3): Likewise. gcc/testsuite/ChangeLog: PR target/96795 PR target/107515 * gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c: New test. * gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c: New test. 
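The clash and its fix can be reproduced in portable C with a local typedef standing in for glibc's int32_t. Putting `int` and the typedef in the same _Generic association list is the compile-time error quoted above ("two compatible types"); nesting one level, as the patch does for `__ARM_mve_coerce3`, keeps them apart. The macro and function names here are illustrative, not the actual arm_mve.h definitions:

```c
#include <assert.h>

typedef signed int my_int32; /* stand-in for glibc's int32_t */

/* Outer level holds only the caller-supplied `type`; the fixed list of
   types lives in the inner _Generic reached via `default:`, so a
   typedef compatible with one of them never collides.  */
#define CLASSIFY(param, type)     \
  _Generic ((param),              \
    type: 1,                      \
    default: _Generic ((param),  \
      long: 2,                    \
      short: 3,                   \
      default: 0))

static int classify_int (int x)     { return CLASSIFY (x, my_int32); }
static int classify_long (long x)   { return CLASSIFY (x, my_int32); }
static int classify_short (short x) { return CLASSIFY (x, my_int32); }
```

Writing `_Generic ((param), my_int32: 1, int: ..., long: 2, ...)` in one flat list would fail to compile whenever `my_int32` resolves to `int`, which is exactly the glibc case described above.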
=== Inline Ctrl+C, Ctrl+V or patch === diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h index 09167ec118ed3310c5077145e119196f29d83cac..70003653db65736fcfd019e83d9f18153be650dc 100644 --- a/gcc/config/arm/arm_mve.h +++ b/gcc/config/arm/arm_mve.h @@ -35659,9 +35659,9 @@ extern void *__ARM_undef; #define __ARM_mve_coerce1(param, type) \ _Generic(param, type: param, const type: param, default: *(type *)__ARM_undef) #define __ARM_mve_coerce2(param, type) \ -_Generic(param, type: param, float16_t: param, float32_t: param, default: *(type *)__ARM_undef) +_Generic(param, type: param, __fp16: param, default: _Generic (param, _Float16: param, float16_t: param, float32_t: param, default: *(type *)__ARM_undef)) #define __ARM_mve_coerce3(param, type) \ -_Generic(param, type: param, int8_t: param, int16_t: param, int32_t: param, int64_t: param, uint8_t: param, uint16_t: param, uint32_t: param, uint64_t: param, default: *(type *)__ARM_undef) +_Generic(param, type: param, default: _Generic (param, int8_t: param, int16_t: param, int32_t: param, int64_t: param, uint8_t: param, uint16_t: param, uint32_t: param, uint64_t: param, default: *(type *)__ARM_undef)) #if (__ARM_FEATURE_MVE & 2) /* MVE Floating point. 
*/ diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c new file mode 100644 index ..427dcacb5ff59b53d5eab1f1582ef6460da3f2f3 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c @@ -0,0 +1,65 @@ +/* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */ +/* { dg-add-options arm_v8_1m_mve_fp } */ +/* { dg-additional-options "-O2 -Wno-pedantic -Wno-long-long" } */ +#include "arm_mve.h" + +float f1; +double f2; +float16_t f3; +float32_t f4; +__fp16 f5; +_Float16 f6; + +int i1; +short i2; +long i3; +long long i4; +int8_t i5; +int16_t i6; +int32_t i7; +int64_t i8; + +const int ci1; +const short ci2; +const long ci3; +const long long ci4; +const int8_t ci5; +const int16_t ci6; +const int32_t ci7; +const int64_t ci8; + +float16x8_t floatvec; +int16x8_t intvec; + +void test(void) +{ +/* Test a few different supported ways of passing an int value. The +intrinsic vmulq was chosen arbitrarily, but it is representative of +all intrinsics that take a non-const scalar value. */ +intvec = vmulq(intvec, 2); +intvec = vmulq(intvec, (int32_t) 2); +intvec = vmulq(intvec, (short) 2); +intvec = vmulq(intvec, i1); +intvec = vmulq(intvec, i2); +intvec = vmulq(intvec, i3); +intvec = vmulq(intvec, i4); +intvec = vmulq(intvec, i5); +intvec = vmulq(intvec, i6); +intvec = vmulq(intvec, i7); +intvec = vmulq(intvec, i8); + +/* Test a few different supported ways of passing a float value. 
*/ +floatvec = vmulq(floatvec, 0.5); +floatvec = vmulq(floatvec, 0.5f); +floatvec = vmulq(floatvec, (__fp16) 0.5); +floatvec = vmulq(floatvec, f1); +floatvec = vmulq(floatvec, f2); +floatvec = vmulq(floatvec, f3); +floatvec = vmulq(floatvec, f4); +floatvec = vmulq(floatvec, f5); +floatvec = vmulq(floatvec, f6); +floatvec = vmulq(floatvec, 0.15f16); +floatvec = vmulq(floatvec, (_Float16) 0.15); +} + +/* { dg-final { scan-assembler-not "__ARM_undef" } } */ \ No newline at end of file diff --git
Re: [PATCH] Fix memory constraint on MVE v[ld/st][2/4] instructions [PR107714]
On 12/12/2022 13:42, Kyrylo Tkachov wrote: Hi Stam, -----Original Message----- From: Stam Markianos-Wright Sent: Friday, December 9, 2022 1:32 PM To: gcc-patches@gcc.gnu.org Cc: Kyrylo Tkachov ; Richard Earnshaw ; Ramana Radhakrishnan ; ni...@redhat.com Subject: [PATCH] Fix memory constraint on MVE v[ld/st][2/4] instructions [PR107714] Hi all, In the M-Class Arm-ARM: https://developer.arm.com/documentation/ddi0553/bu/?lang=en these MVE instructions only have the '!' writeback variant and at: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107714 we found that the Um constraint would also allow through a register offset writeback, resulting in an assembler error. Here I have added a new constraint and predicate for these instructions, which (uniquely, AFAICT) only support a `!` writeback increment by the data size (inside the compiler this is a POST_INC). No regressions in arm-none-eabi with MVE and MVE.FP. Ok for trunk, and backport to GCC11 and GCC12 (testing pending)? Thanks, Stam gcc/ChangeLog: PR target/107714 * config/arm/arm-protos.h (mve_struct_mem_operand): New prototype. * config/arm/arm.cc (mve_struct_mem_operand): New function. * config/arm/constraints.md (Ug): New constraint. * config/arm/mve.md (mve_vst4q): Change constraint. (mve_vst2q): Likewise. (mve_vld4q): Likewise. (mve_vld2q): Likewise. * config/arm/predicates.md (mve_struct_operand): New predicate. gcc/testsuite/ChangeLog: PR target/107714 * gcc.target/arm/mve/intrinsics/vldst24q_reg_offset.c: New test. diff --git a/gcc/config/arm/constraints.md b/gcc/config/arm/constraints.md index e5a36d29c7135943b9bb5ea396f70e2e4beb1e4a..8908b7f5b15ce150685868e78e75280bf32053f1 100644 --- a/gcc/config/arm/constraints.md +++ b/gcc/config/arm/constraints.md @@ -474,6 +474,12 @@ (and (match_code "mem") (match_test "TARGET_32BIT && arm_coproc_mem_operand (op, FALSE)"))) +(define_memory_constraint "Ug" + "@internal + In Thumb-2 state a valid MVE struct load/store address." 
+ (and (match_code "mem") + (match_test "TARGET_HAVE_MVE && mve_struct_mem_operand (op)"))) + I think you can define the constraints in terms of the new mve_struct_operand predicate directly (see how we define the "Ua" constraint, for example). Ok if that works (and testing passes of course). Done as discussed and re-tested on all branches. Pushed as: 4269a6567eb991e6838f40bda5be9e3a7972530c to trunk 25edc76f2afba0b4eaf22174d42de042a6969dbe to gcc-12 08842ad274f5e2630994f7c6e70b2d31768107ea to gcc-11 Thank you! Stam Thanks, Kyrill
[PATCH] Fix memory constraint on MVE v[ld/st][2/4] instructions [PR107714]
Hi all, In the M-Class Arm-ARM: https://developer.arm.com/documentation/ddi0553/bu/?lang=en these MVE instructions only have the '!' writeback variant and at: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107714 we found that the Um constraint would also allow through a register offset writeback, resulting in an assembler error. Here I have added a new constraint and predicate for these instructions, which (uniquely, AFAICT) only support a `!` writeback increment by the data size (inside the compiler this is a POST_INC). No regressions in arm-none-eabi with MVE and MVE.FP. Ok for trunk, and backport to GCC11 and GCC12 (testing pending)? Thanks, Stam gcc/ChangeLog: PR target/107714 * config/arm/arm-protos.h (mve_struct_mem_operand): New prototype. * config/arm/arm.cc (mve_struct_mem_operand): New function. * config/arm/constraints.md (Ug): New constraint. * config/arm/mve.md (mve_vst4q): Change constraint. (mve_vst2q): Likewise. (mve_vld4q): Likewise. (mve_vld2q): Likewise. * config/arm/predicates.md (mve_struct_operand): New predicate. 
gcc/testsuite/ChangeLog: PR target/107714 * gcc.target/arm/mve/intrinsics/vldst24q_reg_offset.c: New test.diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 550272facd12e60a49bf8a3b20f811cc13765b3a..8ea38118b05769bd6fcb1d22d902a50979cfd953 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -122,6 +122,7 @@ extern int arm_coproc_mem_operand_wb (rtx, int); extern int neon_vector_mem_operand (rtx, int, bool); extern int mve_vector_mem_operand (machine_mode, rtx, bool); extern int neon_struct_mem_operand (rtx); +extern int mve_struct_mem_operand (rtx); extern rtx *neon_vcmla_lane_prepare_operands (rtx *); diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc index b587561eebea921bdc68016922d37948e2870ce2..31f2a7b9d4688dde69d1435e24cf885e8544be71 100644 --- a/gcc/config/arm/arm.cc +++ b/gcc/config/arm/arm.cc @@ -13737,6 +13737,24 @@ neon_vector_mem_operand (rtx op, int type, bool strict) return FALSE; } +/* Return TRUE if OP is a mem suitable for loading/storing an MVE struct + type. */ +int +mve_struct_mem_operand (rtx op) +{ + rtx ind = XEXP (op, 0); + + /* Match: (mem (reg)). */ + if (REG_P (ind)) +return arm_address_register_rtx_p (ind, 0); + + /* Allow only post-increment by the mode size. */ + if (GET_CODE (ind) == POST_INC) +return arm_address_register_rtx_p (XEXP (ind, 0), 0); + + return FALSE; +} + /* Return TRUE if OP is a mem suitable for loading/storing a Neon struct type. */ int diff --git a/gcc/config/arm/constraints.md b/gcc/config/arm/constraints.md index e5a36d29c7135943b9bb5ea396f70e2e4beb1e4a..8908b7f5b15ce150685868e78e75280bf32053f1 100644 --- a/gcc/config/arm/constraints.md +++ b/gcc/config/arm/constraints.md @@ -474,6 +474,12 @@ (and (match_code "mem") (match_test "TARGET_32BIT && arm_coproc_mem_operand (op, FALSE)"))) +(define_memory_constraint "Ug" + "@internal + In Thumb-2 state a valid MVE struct load/store address." 
+ (and (match_code "mem") + (match_test "TARGET_HAVE_MVE && mve_struct_mem_operand (op)"))) + (define_memory_constraint "Uj" "@internal In ARM/Thumb-2 state a VFP load/store address that supports writeback diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md index b5e6da4b1335818a3e8815de59850e845a2d0400..847bc032afa2c3977c05725562a14940beb282d4 100644 --- a/gcc/config/arm/mve.md +++ b/gcc/config/arm/mve.md @@ -99,7 +99,7 @@ ;; [vst4q]) ;; (define_insn "mve_vst4q" - [(set (match_operand:XI 0 "neon_struct_operand" "=Um") + [(set (match_operand:XI 0 "mve_struct_operand" "=Ug") (unspec:XI [(match_operand:XI 1 "s_register_operand" "w") (unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] VST4Q)) @@ -9959,7 +9959,7 @@ ;; [vst2q]) ;; (define_insn "mve_vst2q" - [(set (match_operand:OI 0 "neon_struct_operand" "=Um") + [(set (match_operand:OI 0 "mve_struct_operand" "=Ug") (unspec:OI [(match_operand:OI 1 "s_register_operand" "w") (unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] VST2Q)) @@ -9988,7 +9988,7 @@ ;; (define_insn "mve_vld2q" [(set (match_operand:OI 0 "s_register_operand" "=w") - (unspec:OI [(match_operand:OI 1 "neon_struct_operand" "Um") + (unspec:OI [(match_operand:OI 1 "mve_struct_operand" "Ug") (unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] VLD2Q)) ] @@ -10016,7 +10016,7 @@ ;; (define_insn "mve_vld4q" [(set (match_operand:XI 0 "s_register_operand" "=w") - (unspec:XI [(match_operand:XI 1 "neon_struct_operand" "Um") + (unspec:XI [(match_operand:XI 1 "mve_struct_operand" "Ug") (unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)] VLD4Q)) ] diff --git a/gcc/config/arm/predicates.md b/gcc/config/arm/predicates.md index aab5a91ad4ddc6a7a02611d05442d6de63841a7c..67f2fdb4f8f607ceb50871e1bc17dbdb9b987c2c 100644 ---
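The address forms the new Ug constraint accepts can be summarized with a tiny host-side model. This is a portable stand-in for the RTL walk in `mve_struct_mem_operand` above; the enum and function names are illustrative, not compiler code:

```c
#include <assert.h>

/* The address shapes relevant to the v[ld/st][2/4] patterns.  */
enum addr_kind
{
  ADDR_REG,        /* (mem (reg))            — plain base register    */
  ADDR_POST_INC,   /* (mem (post_inc (reg))) — the '!' writeback form */
  ADDR_REG_OFFSET, /* base plus register offset — wrongly let through
                      by Um, causing the assembler error in PR107714  */
  ADDR_IMM_OFFSET  /* base plus immediate offset                      */
};

/* Mirror of the accept/reject decision: only a plain register base or
   a post-increment by the data size is a valid MVE struct address.  */
static int mve_struct_addr_ok (enum addr_kind kind)
{
  switch (kind)
    {
    case ADDR_REG:
    case ADDR_POST_INC:
      return 1;
    default:
      return 0;
    }
}
```

The real function additionally checks that the base is a valid address register via `arm_address_register_rtx_p`, which this sketch omits.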
[PATCH] arm: Split up MVE _Generic associations to prevent type clashes [PR107515]
Hi all, With these previous patches: https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606586.html https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606587.html we enabled the MVE overloaded _Generic associations to handle more scalar types; however, at PR 107515 we found a new regression that wasn't detected in our testing: With glibc's `posix/types.h`: ``` typedef signed int __int32_t; ... typedef __int32_t int32_t; ``` We would get an `error: '_Generic' specifies two compatible types` from `__ARM_mve_coerce3`, because the `type: param` association (where `type` is `int`) and the `int32_t: param` association name the same type under the hood. The same did not happen with Newlib's header `sys/_stdint.h`: ``` typedef long int __int32_t; ... typedef __int32_t int32_t; ``` which worked fine, because it uses `long int`. The same could feasibly happen in `__ARM_mve_coerce2` between `__fp16` and `float16_t`. The solution here is to break the _Generic down, so that the similar types don't appear at the same level, as is done in `__ARM_mve_typeid`. Ok for trunk? Thanks, Stam Markianos-Wright gcc/ChangeLog: PR target/96795 PR target/107515 * config/arm/arm_mve.h (__ARM_mve_coerce2): Split types. (__ARM_mve_coerce3): Likewise. gcc/testsuite/ChangeLog: PR target/96795 PR target/107515 * gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c: New test. * gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c: New test. 
=== Inline Ctrl+C, Ctrl+V or patch === diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h index 09167ec118ed3310c5077145e119196f29d83cac..70003653db65736fcfd019e83d9f18153be650dc 100644 --- a/gcc/config/arm/arm_mve.h +++ b/gcc/config/arm/arm_mve.h @@ -35659,9 +35659,9 @@ extern void *__ARM_undef; #define __ARM_mve_coerce1(param, type) \ _Generic(param, type: param, const type: param, default: *(type *)__ARM_undef) #define __ARM_mve_coerce2(param, type) \ - _Generic(param, type: param, float16_t: param, float32_t: param, default: *(type *)__ARM_undef) + _Generic(param, type: param, __fp16: param, default: _Generic (param, _Float16: param, float16_t: param, float32_t: param, default: *(type *)__ARM_undef)) #define __ARM_mve_coerce3(param, type) \ - _Generic(param, type: param, int8_t: param, int16_t: param, int32_t: param, int64_t: param, uint8_t: param, uint16_t: param, uint32_t: param, uint64_t: param, default: *(type *)__ARM_undef) + _Generic(param, type: param, default: _Generic (param, int8_t: param, int16_t: param, int32_t: param, int64_t: param, uint8_t: param, uint16_t: param, uint32_t: param, uint64_t: param, default: *(type *)__ARM_undef)) #if (__ARM_FEATURE_MVE & 2) /* MVE Floating point. 
*/ diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c new file mode 100644 index ..427dcacb5ff59b53d5eab1f1582ef6460da3f2f3 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c @@ -0,0 +1,65 @@ +/* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */ +/* { dg-add-options arm_v8_1m_mve_fp } */ +/* { dg-additional-options "-O2 -Wno-pedantic -Wno-long-long" } */ +#include "arm_mve.h" + +float f1; +double f2; +float16_t f3; +float32_t f4; +__fp16 f5; +_Float16 f6; + +int i1; +short i2; +long i3; +long long i4; +int8_t i5; +int16_t i6; +int32_t i7; +int64_t i8; + +const int ci1; +const short ci2; +const long ci3; +const long long ci4; +const int8_t ci5; +const int16_t ci6; +const int32_t ci7; +const int64_t ci8; + +float16x8_t floatvec; +int16x8_t intvec; + +void test(void) +{ + /* Test a few different supported ways of passing an int value. The + intrinsic vmulq was chosen arbitrarily, but it is representative of + all intrinsics that take a non-const scalar value. */ + intvec = vmulq(intvec, 2); + intvec = vmulq(intvec, (int32_t) 2); + intvec = vmulq(intvec, (short) 2); + intvec = vmulq(intvec, i1); + intvec = vmulq(intvec, i2); + intvec = vmulq(intvec, i3); + intvec = vmulq(intvec, i4); + intvec = vmulq(intvec, i5); + intvec = vmulq(intvec, i6); + intvec = vmulq(intvec, i7); + intvec = vmulq(intvec, i8); + + /* Test a few different supported ways of passing a float value. 
*/ + floatvec = vmulq(floatvec, 0.5); + floatvec = vmulq(floatvec, 0.5f); + floatvec = vmulq(floatvec, (__fp16) 0.5); + floatvec = vmulq(floatvec, f1); + floatvec = vmulq(floatvec, f2); + floatvec = vmulq(floatvec, f3); + floatvec = vmulq(floatvec, f4); + floatvec = vmulq(floatvec, f5); + floatvec = vmulq(floatvec, f6); + floatvec = vmulq(floatvec, 0.15f16); + floatvec = vmulq(floatvec, (_Float16) 0.15); +} + +/* { dg-final { scan-assembler-not "__ARM_undef" } } */ \ No newline at end of file diff --git
[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
On 11/15/22 15:51, Andre Vieira (lists) wrote: On 11/11/2022 17:40, Stam Markianos-Wright via Gcc-patches wrote: Hi all, This is the 2/2 patch that contains the functional changes needed for MVE Tail Predicated Low Overhead Loops. See my previous email for a general introduction of MVE LOLs. This support is added through the already existing loop-doloop mechanisms that are used for non-MVE dls/le looping. Changes are: 1) Relax the loop-doloop mechanism in the mid-end to allow for decrement numbers other than -1 and for `count` to be an rtx containing the number of elements to be processed, rather than an expression for calculating the number of iterations. 2) Add an `allow_elementwise_doloop` target hook. This allows the target backend to manipulate the iteration count as it needs: in our case to change it from a pre-calculation of the number of iterations to the number of elements to be processed. 3) The doloop_end target-insn now has an additional parameter: the `count` (note: this is before it gets modified to just be the number of elements), so that the decrement value is extracted from that parameter. And many things in the backend to implement the above optimisation: 4) Appropriate changes to the define_expand of doloop_end and new patterns for dlstp and letp. 5) `arm_attempt_dlstp_transform`: (called from the define_expand of doloop_end) this function checks for the loop's suitability for dlstp/letp transformation and then implements it, if possible. 6) `arm_mve_get_loop_unique_vctp`: A function that loops through the loop contents and returns the vctp VPR-generating operation within the loop, if it is unique, i.e. there is exactly one vctp within the loop. 7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg` to check an insn to see if it requires the VPR or not. No regressions on arm-none-eabi with various targets and on aarch64-none-elf. Thoughts on getting this into trunk? 
Thank you, Stam Markianos-Wright gcc/ChangeLog: * config/aarch64/aarch64.md: Add extra doloop_end arg. * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New. * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New. (arm_mve_get_vctp_lanes): New. (arm_get_required_vpr_reg): New. (arm_mve_get_loop_unique_vctp): New. (arm_attempt_dlstp_transform): New. (arm_allow_elementwise_doloop): New. * config/arm/iterators.md: * config/arm/mve.md (*predicated_doloop_end_internal): New. (dlstp_insn): New. * config/arm/thumb2.md (doloop_end): Update for MVE LOLs. * config/arm/unspecs.md: New unspecs. * config/ia64/ia64.md: Add extra doloop_end arg. * config/pru/pru.md: Add extra doloop_end arg. * config/rs6000/rs6000.md: Add extra doloop_end arg. * config/s390/s390.md: Add extra doloop_end arg. * config/v850/v850.md: Add extra doloop_end arg. * doc/tm.texi: Document new hook. * doc/tm.texi.in: Likewise. * loop-doloop.cc (doloop_condition_get): Relax conditions. (doloop_optimize): Add support for elementwise LoLs. * target-insns.def (doloop_end): Add extra arg. * target.def (allow_elementwise_doloop): New hook. * targhooks.cc (default_allow_elementwise_doloop): New. * targhooks.h (default_allow_elementwise_doloop): New. gcc/testsuite/ChangeLog: * gcc.target/arm/lob.h: Update framework. * gcc.target/arm/lob1.c: Likewise. * gcc.target/arm/lob6.c: Likewise. * gcc.target/arm/dlstp-int16x8.c: New test. * gcc.target/arm/dlstp-int32x4.c: New test. * gcc.target/arm/dlstp-int64x2.c: New test. * gcc.target/arm/dlstp-int8x16.c: New test. ### Inline copy of patch ### diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -7366,7 +7366,8 @@ ;; knows what to generate. 
(define_expand "doloop_end" [(use (match_operand 0 "" "")) ; loop pseudo - (use (match_operand 1 "" ""))] ; label + (use (match_operand 1 "" "")) ; label + (use (match_operand 2 "" ""))] ; decrement constant "optimize > 0 && flag_modulo_sched" { rtx s0; diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *); extern bool arm_q_bit_acce
[PATCH 15/35] arm: Explicitly specify other float types for _Generic overloading [PR107515]
On 11/20/22 22:49, Ramana Radhakrishnan wrote: On Fri, Nov 18, 2022 at 4:59 PM Kyrylo Tkachov via Gcc-patches wrote: -----Original Message----- From: Andrea Corallo Sent: Thursday, November 17, 2022 4:38 PM To: gcc-patches@gcc.gnu.org Cc: Kyrylo Tkachov ; Richard Earnshaw ; Stam Markianos-Wright Subject: [PATCH 15/35] arm: Explicitly specify other float types for _Generic overloading [PR107515] From: Stam Markianos-Wright This patch adds explicit references to other float types to __ARM_mve_typeid in arm_mve.h. Resolves PR 107515: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107515 gcc/ChangeLog: PR 107515 * config/arm/arm_mve.h (__ARM_mve_typeid): Add float types. Argh, I'm looking forward to when we move away from this _Generic business, but for now ok. The ChangeLog should say "PR target/107515" for the git hook to recognize it IIRC. and the PR is against 11.x - is there a plan to backport this and dependent patches to relevant branches? Hi Ramana! Assuming maintainer approval, we do hope to backport. And yes, it would have to be the whole patch series, so that we carry over all the improved testing, as well (and we'll have to run it ofc). Does that sound Ok? Thank you, Stam Ramana Thanks, Kyrill --- gcc/config/arm/arm_mve.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h index fd1876b57a0..f6b42dc3fab 100644 --- a/gcc/config/arm/arm_mve.h +++ b/gcc/config/arm/arm_mve.h @@ -35582,6 +35582,9 @@ enum { short: __ARM_mve_type_int_n, \ int: __ARM_mve_type_int_n, \ long: __ARM_mve_type_int_n, \ + _Float16: __ARM_mve_type_fp_n, \ + __fp16: __ARM_mve_type_fp_n, \ + float: __ARM_mve_type_fp_n, \ double: __ARM_mve_type_fp_n, \ long long: __ARM_mve_type_int_n, \ unsigned char: __ARM_mve_type_int_n, \ -- 2.25.1
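A portable sketch of what the added associations do: each scalar floating type gets its own explicit entry so it classifies as a floating-point scalar rather than falling through to the default. `__fp16` and `_Float16` are omitted here so the example compiles on any host; in arm_mve.h they get the same `__ARM_mve_type_fp_n` mapping. The macro name and return codes are illustrative:

```c
#include <assert.h>

/* 1 = fp scalar, 2 = int scalar, 0 = unrecognized — a simplified
   analogue of __ARM_mve_typeid's fp_n/int_n classification.  */
#define SCALAR_TYPEID(x)  \
  _Generic ((x),          \
    float: 1,             \
    double: 1,            \
    int: 2,               \
    long: 2,              \
    default: 0)

static int typeid_float (float f) { return SCALAR_TYPEID (f); }
static int typeid_double (double d) { return SCALAR_TYPEID (d); }
static int typeid_int (int i) { return SCALAR_TYPEID (i); }
```

Without an explicit `float:` association, a `float` argument would hit `default:` and the overload machinery would lose the information that it is a floating-point scalar, which is the failure mode the patch addresses.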
Re: [PATCH 15/35] arm: Explicitly specify other float types for _Generic overloading [PR107515]
On 11/18/22 16:58, Kyrylo Tkachov wrote: -----Original Message----- From: Andrea Corallo Sent: Thursday, November 17, 2022 4:38 PM To: gcc-patches@gcc.gnu.org Cc: Kyrylo Tkachov ; Richard Earnshaw ; Stam Markianos-Wright Subject: [PATCH 15/35] arm: Explicitly specify other float types for _Generic overloading [PR107515] From: Stam Markianos-Wright This patch adds explicit references to other float types to __ARM_mve_typeid in arm_mve.h. Resolves PR 107515: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107515 gcc/ChangeLog: PR 107515 * config/arm/arm_mve.h (__ARM_mve_typeid): Add float types. Argh, I'm looking forward to when we move away from this _Generic business, but for now ok. Oh we all are ;) The ChangeLog should say "PR target/107515" for the git hook to recognize it IIRC. Agh, thanks for spotting this! Will change and push it with the rest of the patch series when ready. Thank you, Stam Thanks, Kyrill --- gcc/config/arm/arm_mve.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h index fd1876b57a0..f6b42dc3fab 100644 --- a/gcc/config/arm/arm_mve.h +++ b/gcc/config/arm/arm_mve.h @@ -35582,6 +35582,9 @@ enum { short: __ARM_mve_type_int_n, \ int: __ARM_mve_type_int_n, \ long: __ARM_mve_type_int_n, \ + _Float16: __ARM_mve_type_fp_n, \ + __fp16: __ARM_mve_type_fp_n, \ + float: __ARM_mve_type_fp_n, \ double: __ARM_mve_type_fp_n, \ long long: __ARM_mve_type_int_n, \ unsigned char: __ARM_mve_type_int_n, \ -- 2.25.1
Re: [PATCH 13/35] arm: further fix overloading of MVE vaddq[_m]_n intrinsic
On 11/18/22 16:49, Kyrylo Tkachov wrote: -----Original Message----- From: Andrea Corallo Sent: Thursday, November 17, 2022 4:38 PM To: gcc-patches@gcc.gnu.org Cc: Kyrylo Tkachov ; Richard Earnshaw ; Stam Markianos-Wright Subject: [PATCH 13/35] arm: further fix overloading of MVE vaddq[_m]_n intrinsic From: Stam Markianos-Wright It was observed that in tests `vaddq_m_n_[s/u][8/16/32].c`, the _Generic resolution would fall back to the `__ARM_undef` failure state. This is a regression since `dc39db873670bea8d8e655444387ceaa53a01a79` and `6bd4ce64eb48a72eca300cb52773e6101d646004`, but it previously wasn't identified, because the tests were not checking for this kind of failure. The above commits changed the definitions of the intrinsics from using `[u]int[8/16/32]_t` types for the scalar argument to using `int`. This allowed `int` to be supported in user code through the overloaded `#defines`, but seems to have broken the `[u]int[8/16/32]_t` types. The solution implemented by this patch is to explicitly use a new _Generic mapping from all the `[u]int[8/16/32]_t` types as well as `int`. With this change, both `int` and `[u]int[8/16/32]_t` parameters are supported from user code and are handled by the overloading mechanism correctly. gcc/ChangeLog: * config/arm/arm_mve.h (__arm_vaddq_m_n_s8): Change types. (__arm_vaddq_m_n_s32): Likewise. (__arm_vaddq_m_n_s16): Likewise. (__arm_vaddq_m_n_u8): Likewise. (__arm_vaddq_m_n_u32): Likewise. (__arm_vaddq_m_n_u16): Likewise. (__arm_vaddq_m): Fix Overloading. (__ARM_mve_coerce3): New. Ok. Wasn't there a PR in Bugzilla about this that we can cite in the commit message? Thanks, Kyrill Thanks for the review! Ah yes, there was this one: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96795 which was closed last time around. It does make sense to add it, though, so we'll do that. Thanks! 
--- gcc/config/arm/arm_mve.h | 78 1 file changed, 40 insertions(+), 38 deletions(-) diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h index 684f997520f..951dc25374b 100644 --- a/gcc/config/arm/arm_mve.h +++ b/gcc/config/arm/arm_mve.h @@ -9675,42 +9675,42 @@ __arm_vabdq_m_u16 (uint16x8_t __inactive, uint16x8_t __a, uint16x8_t __b, mve_pr __extension__ extern __inline int8x16_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) -__arm_vaddq_m_n_s8 (int8x16_t __inactive, int8x16_t __a, int __b, mve_pred16_t __p) +__arm_vaddq_m_n_s8 (int8x16_t __inactive, int8x16_t __a, int8_t __b, mve_pred16_t __p) { return __builtin_mve_vaddq_m_n_sv16qi (__inactive, __a, __b, __p); } __extension__ extern __inline int32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) -__arm_vaddq_m_n_s32 (int32x4_t __inactive, int32x4_t __a, int __b, mve_pred16_t __p) +__arm_vaddq_m_n_s32 (int32x4_t __inactive, int32x4_t __a, int32_t __b, mve_pred16_t __p) { return __builtin_mve_vaddq_m_n_sv4si (__inactive, __a, __b, __p); } __extension__ extern __inline int16x8_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) -__arm_vaddq_m_n_s16 (int16x8_t __inactive, int16x8_t __a, int __b, mve_pred16_t __p) +__arm_vaddq_m_n_s16 (int16x8_t __inactive, int16x8_t __a, int16_t __b, mve_pred16_t __p) { return __builtin_mve_vaddq_m_n_sv8hi (__inactive, __a, __b, __p); } __extension__ extern __inline uint8x16_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) -__arm_vaddq_m_n_u8 (uint8x16_t __inactive, uint8x16_t __a, int __b, mve_pred16_t __p) +__arm_vaddq_m_n_u8 (uint8x16_t __inactive, uint8x16_t __a, uint8_t __b, mve_pred16_t __p) { return __builtin_mve_vaddq_m_n_uv16qi (__inactive, __a, __b, __p); } __extension__ extern __inline uint32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) -__arm_vaddq_m_n_u32 (uint32x4_t __inactive, uint32x4_t __a, int __b, mve_pred16_t __p) +__arm_vaddq_m_n_u32 
(uint32x4_t __inactive, uint32x4_t __a, uint32_t __b, mve_pred16_t __p) { return __builtin_mve_vaddq_m_n_uv4si (__inactive, __a, __b, __p); } __extension__ extern __inline uint16x8_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) -__arm_vaddq_m_n_u16 (uint16x8_t __inactive, uint16x8_t __a, int __b, mve_pred16_t __p) +__arm_vaddq_m_n_u16 (uint16x8_t __inactive, uint16x8_t __a, uint16_t __b, mve_pred16_t __p) { return __builtin_mve_vaddq_m_n_uv8hi (__inactive, __a, __b, __p); } @@ -26417,42 +26417,42 @@ __arm_vabdq_m (uint16x8_t __inactive, uint16x8_t __a, uint16x8_t __b, mve_pred16 __extension__ extern __inline int8x16_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) -__arm_vaddq_m (int8x16_t __inactive, int8x16_t __a, int __b, mve_pred16_t __p) +__arm_vaddq_m (int8x16_t __inactive, int8x16_t __a, int8_t __b, mve_pred16_t __p) { return __arm_vaddq_m_n_s8 (__inactive, __a, __b, __p);
[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops
Hi all,

This is the 2/2 patch that contains the functional changes needed for MVE Tail-Predicated Low Overhead Loops. See my previous email for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop mechanisms that are used for non-MVE dls/le looping. Changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for decrement numbers other than -1 and for `count` to be an rtx containing the number of elements to be processed, rather than an expression for calculating the number of iterations.
2) Add an `allow_elementwise_doloop` target hook. This allows the target backend to manipulate the iteration count as it needs: in our case to change it from a pre-calculation of the number of iterations to the number of elements to be processed.
3) The doloop_end target-insn now has an additional parameter: the `count` (note: this is before it gets modified to just be the number of elements), so that the decrement value is extracted from that parameter.

And many things in the backend to implement the above optimisation:

4) Appropriate changes to the define_expand of doloop_end and new patterns for dlstp and letp.
5) `arm_attempt_dlstp_transform`: (called from the define_expand of doloop_end) this function checks the loop's suitability for dlstp/letp transformation and then implements it, if possible.
6) `arm_mve_get_loop_unique_vctp`: a function that loops through the loop contents and returns the VPR-generating vctp operation within the loop, if there is exactly one.
7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg` to check an insn to see if it requires the VPR or not.

No regressions on arm-none-eabi with various targets and on aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

	* config/aarch64/aarch64.md: Add extra doloop_end arg.
	* config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
	* config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
	(arm_mve_get_vctp_lanes): New.
	(arm_get_required_vpr_reg): New.
	(arm_mve_get_loop_unique_vctp): New.
	(arm_attempt_dlstp_transform): New.
	(arm_allow_elementwise_doloop): New.
	* config/arm/iterators.md:
	* config/arm/mve.md (*predicated_doloop_end_internal): New.
	(dlstp_insn): New.
	* config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
	* config/arm/unspecs.md: New unspecs.
	* config/ia64/ia64.md: Add extra doloop_end arg.
	* config/pru/pru.md: Add extra doloop_end arg.
	* config/rs6000/rs6000.md: Add extra doloop_end arg.
	* config/s390/s390.md: Add extra doloop_end arg.
	* config/v850/v850.md: Add extra doloop_end arg.
	* doc/tm.texi: Document new hook.
	* doc/tm.texi.in: Likewise.
	* loop-doloop.cc (doloop_condition_get): Relax conditions.
	(doloop_optimize): Add support for elementwise LoLs.
	* target-insns.def (doloop_end): Add extra arg.
	* target.def (allow_elementwise_doloop): New hook.
	* targhooks.cc (default_allow_elementwise_doloop): New.
	* targhooks.h (default_allow_elementwise_doloop): New.

gcc/testsuite/ChangeLog:

	* gcc.target/arm/lob.h: Update framework.
	* gcc.target/arm/lob1.c: Likewise.
	* gcc.target/arm/lob6.c: Likewise.
	* gcc.target/arm/dlstp-int16x8.c: New test.
	* gcc.target/arm/dlstp-int32x4.c: New test.
	* gcc.target/arm/dlstp-int64x2.c: New test.
	* gcc.target/arm/dlstp-int8x16.c: New test.

### Inline copy of patch ###

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -7366,7 +7366,8 @@
 ;; knows what to generate.
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))	; loop pseudo
-   (use (match_operand 1 "" ""))]	; label
+   (use (match_operand 1 "" ""))	; label
+   (use (match_operand 2 "" ""))]	; decrement constant
   "optimize > 0 && flag_modulo_sched"
 {
   rtx s0;
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);
 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
 extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern rtx arm_attempt_dlstp_transform (rtx, rtx);
 #ifdef RTX_CODE
 enum reg_class arm_mode_base_reg_class (machine_mode);
diff --git
[PATCH] slp tree vectorizer: Re-calculate vectorization factor in the case of invalid choices [PR96974]
On 29/03/2021 10:20, Richard Biener wrote: On Fri, 26 Mar 2021, Richard Sandiford wrote: Richard Biener writes: On Wed, 24 Mar 2021, Stam Markianos-Wright wrote: Hi all, This patch resolves bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974 This is achieved by forcing a re-calculation of *stmt_vectype_out if an incompatible combination of TYPE_VECTOR_SUBPARTS is detected, but with an extra introduced max_nunits ceiling. I am not 100% sure if this is the best way to go about fixing this, because this is my first look at the vectorizer and I lack knowledge of the wider context, so do let me know if you see a better way to do this! I have added the previously ICE-ing reproducer as a new test. This is compiled as "g++ -Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" for GCC11 and "g++ -Ofast -march=armv8.2-a+sve" for GCC10. (the non-fdisable-tree-fre4 version has gone latent on GCC11) Bootstrapped and reg-tested on aarch64-linux-gnu. Also reg-tested on aarch64-none-elf. I don't think this is going to work well given uses will expect a vector type that's consistent here. I think giving up is for the moment the best choice, thus replacing the assert with vectorization failure. In the end we shouldn't require those nunits vectypes to be separately computed - we compute the vector type of the defs anyway and in case they're invariant the vectorizable_* function either can deal with the type mix or not anyway. I agree this area needs simplification, but I think the direction of travel should be to make the assert valid. I agree this is probably the pragmatic fix for GCC 11 and earlier though. The issue is that we compute a vector type for a use that may differ from what we'd compute for it in the context of its definition (or in the context of another use). 
Any such "local" decision is likely flawed and I'd rather simplify further doing the only decision on the definition side - if there's a disconnect between the number of lanes (and thus altering the VF won't help) then we have to give up anyway. Richard. Thank you both for the further info! Would it be fair to close the initial PR regarding the ICE (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974) and then open a second one at a lower priority level to address these further improvements? Also Christophe has kindly found out that the test FAILs in ILP32, so it would be great to get that one in asap, too! https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567431.html Cheers, Stam
Re: [PATCH] slp tree vectorizer: Re-calculate vectorization factor in the case of invalid choices [PR96974]
On 24/03/2021 13:46, Richard Biener wrote:

On Wed, 24 Mar 2021, Stam Markianos-Wright wrote:

Hi all, This patch resolves bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974 This is achieved by forcing a re-calculation of *stmt_vectype_out if an incompatible combination of TYPE_VECTOR_SUBPARTS is detected, but with an extra introduced max_nunits ceiling. I am not 100% sure if this is the best way to go about fixing this, because this is my first look at the vectorizer and I lack knowledge of the wider context, so do let me know if you see a better way to do this! I have added the previously ICE-ing reproducer as a new test. This is compiled as "g++ -Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" for GCC11 and "g++ -Ofast -march=armv8.2-a+sve" for GCC10. (the non-fdisable-tree-fre4 version has gone latent on GCC11) Bootstrapped and reg-tested on aarch64-linux-gnu. Also reg-tested on aarch64-none-elf.

I don't think this is going to work well given uses will expect a vector type that's consistent here. I think giving up is for the moment the best choice, thus replacing the assert with vectorization failure. In the end we shouldn't require those nunits vectypes to be separately computed - we compute the vector type of the defs anyway and in case they're invariant the vectorizable_* function either can deal with the type mix or not anyway.

Yea good point! I agree and after all we are very close to releases now ;) I've attached the patch that just does the graceful vectorization failure and adds a slightly better test now. Re-tested as previously with no issues ofc.

gcc-10.patch is what I'd backport to GCC10 (the only difference between that and gcc-11.patch is that one compiles with `-fdisable-tree-fre4` and the other without it).

Ok to push this to the GCC11 branch and backport to the GCC10 branch?

Cheers :D
Stam

That said, the goal should be to simplify things here.

Richard.
gcc/ChangeLog: * tree-vect-stmts.c (get_vectype_for_scalar_type): Add new parameter to core function and add new function overload. (vect_get_vector_types_for_stmt): Add re-calculation logic. gcc/testsuite/ChangeLog: * g++.target/aarch64/sve/pr96974.C: New test. diff --git a/gcc/testsuite/g++.target/aarch64/sve/pr96974.C b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C new file mode 100644 index 000..363241d18df --- /dev/null +++ b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8.2-a+sve -fdisable-tree-fre4 -fdump-tree-slp-details" } */ + +float a; +int +b () +{ return __builtin_lrintf(a); } + +struct c { + float d; +c() { + for (int e = 0; e < 9; e++) + coeffs[e] = d ? b() : 0; +} +int coeffs[10]; +} f; + +/* { dg-final { scan-tree-dump "Not vectorized: Incompatible number of vector subparts between" "slp1" } } */ diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index d791d3a4720..4c01e82ff39 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -12148,8 +12148,12 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info, } } - gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype), - TYPE_VECTOR_SUBPARTS (*stmt_vectype_out))); + if (!multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype), + TYPE_VECTOR_SUBPARTS (*stmt_vectype_out))) +return opt_result::failure_at (stmt, + "Not vectorized: Incompatible number " + "of vector subparts between %T and %T\n", + nunits_vectype, *stmt_vectype_out); if (dump_enabled_p ()) { diff --git a/gcc/testsuite/g++.target/aarch64/sve/pr96974.C b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C new file mode 100644 index 000..2023c55e3e6 --- /dev/null +++ b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8.2-a+sve -fdump-tree-slp-details" } */ + +float a; +int +b () +{ return __builtin_lrintf(a); } + +struct c { + float d; +c() { + for (int e 
= 0; e < 9; e++) + coeffs[e] = d ? b() : 0; +} +int coeffs[10]; +} f; + +/* { dg-final { scan-tree-dump "Not vectorized: Incompatible number of vector subparts between" "slp1" } } */ diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index c2d1f39fe0f..6418edb5204 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -12249,8 +12249,12 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info, } } - gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype), - TYPE_VECTOR_SUBPARTS (*stmt_vectype_out))); + if (!multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype), + TYPE_VECTOR_SUBPARTS (*stmt_vectype_out))) +return opt_result::failure_at (stmt, + "Not vectorized: Incompatible number " + "of vector subparts between %T and %T\n", + nunits_vectype, *stmt_vectype_out); if (dump_enabled_p ()) {
[PATCH] slp tree vectorizer: Re-calculate vectorization factor in the case of invalid choices [PR96974]
Hi all, This patch resolves bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974 This is achieved by forcing a re-calculation of *stmt_vectype_out if an incompatible combination of TYPE_VECTOR_SUBPARTS is detected, but with an extra introduced max_nunits ceiling. I am not 100% sure if this is the best way to go about fixing this, because this is my first look at the vectorizer and I lack knowledge of the wider context, so do let me know if you see a better way to do this! I have added the previously ICE-ing reproducer as a new test. This is compiled as "g++ -Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" for GCC11 and "g++ -Ofast -march=armv8.2-a+sve" for GCC10. (the non-fdisable-tree-fre4 version has gone latent on GCC11) Bootstrapped and reg-tested on aarch64-linux-gnu. Also reg-tested on aarch64-none-elf. gcc/ChangeLog: * tree-vect-stmts.c (get_vectype_for_scalar_type): Add new parameter to core function and add new function overload. (vect_get_vector_types_for_stmt): Add re-calculation logic. gcc/testsuite/ChangeLog: * g++.target/aarch64/sve/pr96974.C: New test. diff --git a/gcc/testsuite/g++.target/aarch64/sve/pr96974.C b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C new file mode 100644 index ..2f6ebd6ce3dd8626f5e666edba77d2c925739b7d --- /dev/null +++ b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C @@ -0,0 +1,16 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" } */ + +float a; +int +b () +{ return __builtin_lrintf(a); } + +struct c { + float d; +c() { + for (int e = 0; e < 9; e++) + coeffs[e] = d ? 
b() : 0; +} +int coeffs[10]; +} f; diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index c2d1f39fe0f4bbc90ffa079cb6a8fcf87b76b3af..f8d3eac38718e18bf957b85109cccbc03e21c041 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -11342,7 +11342,7 @@ get_related_vectype_for_scalar_type (machine_mode prevailing_mode, tree get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type, - unsigned int group_size) + unsigned int group_size, unsigned int max_nunits) { /* For BB vectorization, we should always have a group size once we've constructed the SLP tree; the only valid uses of zero GROUP_SIZEs @@ -11375,13 +11375,16 @@ get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type, fail (in the latter case because GROUP_SIZE is too small for the target), but it's possible that a target could have a hole between supported vector types. + There is also the option to artificially pass a max_nunits, + which is smaller than GROUP_SIZE, if the use of GROUP_SIZE + would result in an incompatible mode for the target. If GROUP_SIZE is not a power of 2, this has the effect of trying the largest power of 2 that fits within the group, even though the group is not a multiple of that vector size. The BB vectorizer will then try to carve up the group into smaller pieces. */ - unsigned int nunits = 1 << floor_log2 (group_size); + unsigned int nunits = 1 << floor_log2 (max_nunits); do { vectype = get_related_vectype_for_scalar_type (vinfo->vector_mode, @@ -11394,6 +11397,14 @@ get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type, return vectype; } +tree +get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type, + unsigned int group_size) +{ + return get_vectype_for_scalar_type (vinfo, scalar_type, + group_size, group_size); +} + /* Return the vector type corresponding to SCALAR_TYPE as supported by the target. NODE, if nonnull, is the SLP tree node that will use the returned vector type. 
*/ @@ -12172,6 +12183,8 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info, tree vectype; tree scalar_type = NULL_TREE; + tree scalar_type_orig = NULL_TREE; + if (group_size == 0 && STMT_VINFO_VECTYPE (stmt_info)) { vectype = STMT_VINFO_VECTYPE (stmt_info); @@ -12210,6 +12223,7 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info, "get vectype for scalar type: %T\n", scalar_type); } vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size); + scalar_type_orig = scalar_type; if (!vectype) return opt_result::failure_at (stmt, "not vectorized:" @@ -12249,6 +12263,36 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info, } } + /* In rare cases with different types and sizes we may reach an invalid + combination where nunits_vectype has fewer TYPE_VECTOR_SUBPARTS than + *stmt_vectype_out. In that case attempt to re-calculate + *stmt_vectype_out with an imposed max taken from nunits_vectype. */ + unsigned int max_nunits; + if (known_lt (TYPE_VECTOR_SUBPARTS (nunits_vectype), + TYPE_VECTOR_SUBPARTS (*stmt_vectype_out))) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, +
Re: [committed obvious][arm] Add test that was missing from old commit [PR91816]
On 26/11/2020 09:01, Christophe Lyon wrote:

On Wed, 25 Nov 2020 at 14:24, Stam Markianos-Wright via Gcc-patches wrote:

Hi all, A while back I submitted GCC10 commit: 44f77a6dea2f312ee1743f3dde465c1b8453ee13 for PR91816. Turns out I was an idiot and forgot to include the test in the actual git commit, even though my entire patch had been approved. Tested that the test still passes on a cross arm-none-eabi and also in a Cortex A-15 bootstrap with no regressions. Submitting this as Obvious to gcc-11 and backporting to gcc-10.

Hi, This new test fails when forcing -mcpu=cortex-m3/4/5/7/33:

FAIL: gcc.target/arm/pr91816.c scan-assembler-times beq\\t.L[0-9] 2
FAIL: gcc.target/arm/pr91816.c scan-assembler-times beq\\t.Lbcond[0-9] 1
FAIL: gcc.target/arm/pr91816.c scan-assembler-times bne\\t.L[0-9] 2
FAIL: gcc.target/arm/pr91816.c scan-assembler-times bne\\t.Lbcond[0-9] 1

I didn't check manually what is generated, can you have a look?

Oh wow thank you for spotting this! It looks like the A class target that I had tested had a tendency to emit a movw/movt pair, whereas these M class targets would emit a single ldr. This resulted in an overall shorter jump for these targets that did not trigger the new far-branch code.

The test passes after... doubling its own size:

 #define HW3 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2
 #define HW4 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3
 #define HW5 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4
+#define HW6 HW5 HW5

 __attribute__((noinline,noclone)) void f1 (int a)
 {
@@ -25,7 +26,7 @@ __attribute__((noinline,noclone)) void f2 (int a)
 __attribute__((noinline,noclone)) void f3 (int a)
 {
-  if (a) { HW5 }
+  if (a) { HW6 }
 }

 __attribute__((noinline,noclone)) void f4 (int a)
@@ -41,7 +42,7 @@ __attribute__((noinline,noclone)) void f5 (int a)
 __attribute__((noinline,noclone)) void f6 (int a)
 {
-  if (a == 1) { HW5 }
+  if (a == 1) { HW6 }
 }

But this does effectively double the compilation time of an already quite large test. Would that be ok?
Overall this is the edge case testing that the compiler behaves correctly with a branch in a huge compilation unit, so it would be nice to have test coverage of it on as many targets as possible... but also kinda rare.

Hope this helps!
Cheers,
Stam

Thanks,
Christophe

Thanks,
Stam Markianos-Wright

gcc/testsuite/ChangeLog:

	PR target/91816
	* gcc.target/arm/pr91816.c: New test.
[backport gcc-8,9][arm] Thumb2 out of range conditional branch fix [PR91816]
Hi all, Now that I have pushed the entirety of this patch to gcc-10 and gcc-11, I would like to backport it to gcc-8 and gcc-9. PR link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91816 This patch had originally been approved here: https://gcc.gnu.org/legacy-ml/gcc-patches/2020-01/msg02010.html See the attached diffs that have been rebased and apply cleanly. Tested on a cross arm-none-eabi and also in a Cortex A-15 bootstrap with no regressions. Ok to backport? Thanks, Stam Markianos-Wright diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 9d0acde7a39..87e01e35221 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -553,4 +553,6 @@ void arm_parse_option_features (sbitmap, const cpu_arch_option *, void arm_initialize_isa (sbitmap, const enum isa_feature *); +const char * arm_gen_far_branch (rtx *, int, const char * , const char *); + #endif /* ! GCC_ARM_PROTOS_H */ diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index f990ca11bcb..eefe3d99548 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -31629,6 +31629,39 @@ arm_constant_alignment (const_tree exp, HOST_WIDE_INT align) return align; } +/* Generate code to enable conditional branches in functions over 1 MiB. + Parameters are: + operands: is the operands list of the asm insn (see arm_cond_branch or + arm_cond_branch_reversed). + pos_label: is an index into the operands array where operands[pos_label] is + the asm label of the final jump destination. + dest: is a string which is used to generate the asm label of the intermediate + destination + branch_format: is a string denoting the intermediate branch format, e.g. + "beq", "bne", etc. 
*/ + +const char * +arm_gen_far_branch (rtx * operands, int pos_label, const char * dest, + const char * branch_format) +{ + rtx_code_label * tmp_label = gen_label_rtx (); + char label_buf[256]; + char buffer[128]; + ASM_GENERATE_INTERNAL_LABEL (label_buf, dest , \ + CODE_LABEL_NUMBER (tmp_label)); + const char *label_ptr = arm_strip_name_encoding (label_buf); + rtx dest_label = operands[pos_label]; + operands[pos_label] = tmp_label; + + snprintf (buffer, sizeof (buffer), "%s%s", branch_format , label_ptr); + output_asm_insn (buffer, operands); + + snprintf (buffer, sizeof (buffer), "b\t%%l0%d\n%s:", pos_label, label_ptr); + operands[pos_label] = dest_label; + output_asm_insn (buffer, operands); + return ""; +} + #if CHECKING_P namespace selftest { diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md index 6d6b37719e0..81c96658d95 100644 --- a/gcc/config/arm/arm.md +++ b/gcc/config/arm/arm.md @@ -7187,9 +7187,15 @@ ;; And for backward branches we have ;; (neg_range - neg_base_offs + pc_offs) = (neg_range - (-2 or -4) + 4). ;; +;; In 16-bit Thumb these ranges are: ;; For a 'b' pos_range = 2046, neg_range = -2048 giving (-2040->2048). ;; For a 'b' pos_range = 254, neg_range = -256 giving (-250 ->256). +;; In 32-bit Thumb these ranges are: +;; For a 'b' +/- 16MB is not checked for. +;; For a 'b' pos_range = 1048574, neg_range = -1048576 giving +;; (-1048568 -> 1048576). + (define_expand "cbranchsi4" [(set (pc) (if_then_else (match_operator 0 "expandable_comparison_operator" @@ -7444,23 +7450,50 @@ (label_ref (match_operand 0 "" "")) (pc)))] "TARGET_32BIT" - "* - if (arm_ccfsm_state == 1 || arm_ccfsm_state == 2) + { +if (arm_ccfsm_state == 1 || arm_ccfsm_state == 2) { arm_ccfsm_state += 2; - return \"\"; + return ""; } - return \"b%d1\\t%l0\"; - " +switch (get_attr_length (insn)) + { + case 2: /* Thumb2 16-bit b{cond}. */ + case 4: /* Thumb2 32-bit b{cond} or A32 b{cond}. */ + return "b%d1\t%l0"; + break; + + /* Thumb2 b{cond} out of range. 
Use 16-bit b{cond} and + unconditional branch b. */ + default: return arm_gen_far_branch (operands, 0, "Lbcond", "b%D1\t"); + } + } [(set_attr "conds" "use") (set_attr "type" "branch") (set (attr "length") - (if_then_else - (and (match_test "TARGET_THUMB2") - (and (ge (minus (match_dup 0) (pc)) (const_int -250)) -(le (minus (match_dup 0) (pc)) (const_int 256 - (const_int 2) - (const_int 4)))] +(if_then_else (match_test "!TARGET_THUMB2") + + ;;Target is not Thumb2, therefore is A32. Generate b{cond}. + (const_int 4) + + ;; Check if target is within 16-bit Thumb2 b{cond} range. + (if_then_else (and (ge (minus (match_dup 0) (pc)) (const_int -250)) +(le (minus (match_dup 0) (pc)) (const_int 256))) + + ;; Target is Thumb2, within narrow range. + ;; Generate b{cond}. + (const_int 2) + + ;; Check if target is within 32-bit Thumb2 b{cond} range. +
[committed obvious][arm] Add test that was missing from old commit [PR91816]
Hi all,

A while back I submitted GCC10 commit: 44f77a6dea2f312ee1743f3dde465c1b8453ee13 for PR91816. Turns out I was an idiot and forgot to include the test in the actual git commit, even though my entire patch had been approved.

Tested that the test still passes on a cross arm-none-eabi and also in a Cortex A-15 bootstrap with no regressions. Submitting this as Obvious to gcc-11 and backporting to gcc-10.

Thanks,
Stam Markianos-Wright

gcc/testsuite/ChangeLog:

	PR target/91816
	* gcc.target/arm/pr91816.c: New test.

diff --git a/gcc/testsuite/gcc.target/arm/pr91816.c b/gcc/testsuite/gcc.target/arm/pr91816.c
new file mode 100644
index 000..75b938a6aad
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr91816.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_thumb2_ok } */
+/* { dg-additional-options "-mthumb" } */
+/* { dg-timeout-factor 4.0 } */
+
+int printf(const char *, ...);
+
+#define HW0 printf("Hello World!\n");
+#define HW1 HW0 HW0 HW0 HW0 HW0 HW0 HW0 HW0 HW0 HW0
+#define HW2 HW1 HW1 HW1 HW1 HW1 HW1 HW1 HW1 HW1 HW1
+#define HW3 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2
+#define HW4 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3
+#define HW5 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4
+
+__attribute__((noinline,noclone)) void f1 (int a)
+{
+  if (a) { HW0 }
+}
+
+__attribute__((noinline,noclone)) void f2 (int a)
+{
+  if (a) { HW3 }
+}
+
+
+__attribute__((noinline,noclone)) void f3 (int a)
+{
+  if (a) { HW5 }
+}
+
+__attribute__((noinline,noclone)) void f4 (int a)
+{
+  if (a == 1) { HW0 }
+}
+
+__attribute__((noinline,noclone)) void f5 (int a)
+{
+  if (a == 1) { HW3 }
+}
+
+
+__attribute__((noinline,noclone)) void f6 (int a)
+{
+  if (a == 1) { HW5 }
+}
+
+
+int main(void)
+{
+  f1(0);
+  f2(0);
+  f3(0);
+  f4(0);
+  f5(0);
+  f6(0);
+  return 0;
+}
+
+
+/* { dg-final { scan-assembler-times "beq\\t.L\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "beq\\t.Lbcond\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "bne\\t.L\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "bne\\t.Lbcond\[0-9\]" 1 } } */