[PING][PATCH] arm: Remove unsigned variant of vcaddq_m

2023-08-19 Thread Stam Markianos-Wright via Gcc-patches


(Pinging since I realised that this is required for my later Low Overhead Loop 
patch series to work)

Ok for trunk with the updated changelog that Christophe mentioned?

Thanks,
Stamatis/Stam Markianos-Wright


From: Stam Markianos-Wright
Sent: Tuesday, August 1, 2023 6:21 PM
To: gcc-patches@gcc.gnu.org 
Cc: Richard Earnshaw ; Kyrylo Tkachov 

Subject: arm: Remove unsigned variant of vcaddq_m

Hi all,

The unsigned variants of the vcaddq_m operation are not needed within the
compiler, as the assembly output of the signed and unsigned versions of the
ops is identical: with a `.i` suffix (as opposed to separate `.s` and `.u`
suffixes).
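
As a rough illustration of why one pattern is enough, here is a scalar sketch in portable C (not the MVE intrinsics or the compiler pattern itself). It assumes the usual rot90 complex-add semantics on (even, odd) lane pairs, and the helper names are hypothetical. The point is only that wrapping unsigned arithmetic and signed arithmetic truncated to 32 bits give the same bit patterns:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a rot90 complex add on 32-bit lanes, using
   wrapping (modulo 2^32) unsigned arithmetic.  Lanes are (real, imaginary)
   pairs:
     res[2i]     = a[2i]     - b[2i + 1]
     res[2i + 1] = a[2i + 1] + b[2i]  */
static void
vcadd_rot90_wrap (const uint32_t *a, const uint32_t *b, uint32_t *res,
                  size_t n)
{
  for (size_t i = 0; i + 1 < n; i += 2)
    {
      res[i] = a[i] - b[i + 1];
      res[i + 1] = a[i + 1] + b[i];
    }
}

/* Same operation with the lanes read as signed values: compute in 64 bits
   so nothing overflows, then truncate back to 32 bits.  The uint32_t ->
   int32_t casts rely on the usual two's complement conversion (as GCC
   implements it).  */
static void
vcadd_rot90_signed (const uint32_t *a, const uint32_t *b, uint32_t *res,
                    size_t n)
{
  for (size_t i = 0; i + 1 < n; i += 2)
    {
      int64_t re = (int64_t) (int32_t) a[i] - (int64_t) (int32_t) b[i + 1];
      int64_t im = (int64_t) (int32_t) a[i + 1] + (int64_t) (int32_t) b[i];
      res[i] = (uint32_t) re;
      res[i + 1] = (uint32_t) im;
    }
}

/* Returns 1 when both interpretations produce identical bit patterns.  */
static int
vcadd_rot90_same_bits (const uint32_t a[4], const uint32_t b[4])
{
  uint32_t u[4], s[4];
  vcadd_rot90_wrap (a, b, u, 4);
  vcadd_rot90_signed (a, b, s, 4);
  return memcmp (u, s, sizeof u) == 0;
}
```

Calling vcadd_rot90_same_bits on inputs that overflow in the signed reading (for example lanes holding 0x7fffffff and 0x80000000) should still return 1, which is the property that lets the signed and unsigned builtins share an insn.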

Tested with baremetal arm-none-eabi on Arm's fastmodels.

Ok for trunk?

Thanks,
Stamatis Markianos-Wright

gcc/ChangeLog:

 * config/arm/arm-mve-builtins-base.cc (vcaddq_rot90, vcaddq_rot270):
   Use common insn for signed and unsigned front-end definitions.
 * config/arm/arm_mve_builtins.def
   (vcaddq_rot90_m_u, vcaddq_rot270_m_u): Make common.
   (vcaddq_rot90_m_s, vcaddq_rot270_m_s): Remove.
 * config/arm/iterators.md (mve_insn): Merge signed and unsigned defs.
   (isu): Likewise.
   (rot): Likewise.
   (mve_rot): Likewise.
   (supf): Likewise.
   (VxCADDQ_M): Likewise.
 * config/arm/unspecs.md (unspec): Likewise.
---
  gcc/config/arm/arm-mve-builtins-base.cc |  4 ++--
  gcc/config/arm/arm_mve_builtins.def |  6 ++---
  gcc/config/arm/iterators.md | 30 +++--
  gcc/config/arm/mve.md   |  4 ++--
  gcc/config/arm/unspecs.md   |  6 ++---
  5 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/gcc/config/arm/arm-mve-builtins-base.cc
b/gcc/config/arm/arm-mve-builtins-base.cc
index e31095ae112..426a87e9852 100644
--- a/gcc/config/arm/arm-mve-builtins-base.cc
+++ b/gcc/config/arm/arm-mve-builtins-base.cc
@@ -260,8 +260,8 @@ FUNCTION_PRED_P_S_U (vaddvq, VADDVQ)
  FUNCTION_PRED_P_S_U (vaddvaq, VADDVAQ)
  FUNCTION_WITH_RTX_M (vandq, AND, VANDQ)
  FUNCTION_ONLY_N (vbrsrq, VBRSRQ)
-FUNCTION (vcaddq_rot90, unspec_mve_function_exact_insn_rot,
(UNSPEC_VCADD90, UNSPEC_VCADD90, UNSPEC_VCADD90, VCADDQ_ROT90_M_S,
VCADDQ_ROT90_M_U, VCADDQ_ROT90_M_F))
-FUNCTION (vcaddq_rot270, unspec_mve_function_exact_insn_rot,
(UNSPEC_VCADD270, UNSPEC_VCADD270, UNSPEC_VCADD270, VCADDQ_ROT270_M_S,
VCADDQ_ROT270_M_U, VCADDQ_ROT270_M_F))
+FUNCTION (vcaddq_rot90, unspec_mve_function_exact_insn_rot,
(UNSPEC_VCADD90, UNSPEC_VCADD90, UNSPEC_VCADD90, VCADDQ_ROT90_M,
VCADDQ_ROT90_M, VCADDQ_ROT90_M_F))
+FUNCTION (vcaddq_rot270, unspec_mve_function_exact_insn_rot,
(UNSPEC_VCADD270, UNSPEC_VCADD270, UNSPEC_VCADD270, VCADDQ_ROT270_M,
VCADDQ_ROT270_M, VCADDQ_ROT270_M_F))
  FUNCTION (vcmlaq, unspec_mve_function_exact_insn_rot, (-1, -1,
UNSPEC_VCMLA, -1, -1, VCMLAQ_M_F))
  FUNCTION (vcmlaq_rot90, unspec_mve_function_exact_insn_rot, (-1, -1,
UNSPEC_VCMLA90, -1, -1, VCMLAQ_ROT90_M_F))
  FUNCTION (vcmlaq_rot180, unspec_mve_function_exact_insn_rot, (-1, -1,
UNSPEC_VCMLA180, -1, -1, VCMLAQ_ROT180_M_F))
diff --git a/gcc/config/arm/arm_mve_builtins.def
b/gcc/config/arm/arm_mve_builtins.def
index 43dacc3dda1..6ac1812c697 100644
--- a/gcc/config/arm/arm_mve_builtins.def
+++ b/gcc/config/arm/arm_mve_builtins.def
@@ -523,8 +523,8 @@ VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED,
vhsubq_m_n_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vhaddq_m_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vhaddq_m_n_u, v16qi, v8hi,
v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, veorq_m_u, v16qi, v8hi, v4si)
-VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot90_m_u, v16qi,
v8hi, v4si)
-VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot270_m_u, v16qi,
v8hi, v4si)
+VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot90_m_, v16qi,
v8hi, v4si)
+VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vcaddq_rot270_m_, v16qi,
v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vbicq_m_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vandq_m_u, v16qi, v8hi, v4si)
  VAR3 (QUADOP_UNONE_UNONE_UNONE_UNONE_PRED, vaddq_m_u, v16qi, v8hi, v4si)
@@ -587,8 +587,6 @@ VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED,
vhcaddq_rot270_m_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vhaddq_m_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vhaddq_m_n_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, veorq_m_s, v16qi, v8hi, v4si)
-VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vcaddq_rot90_m_s, v16qi, v8hi, v4si)
-VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vcaddq_rot270_m_s, v16qi, v8hi,
v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vbrsrq_m_n_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vbicq_m_s, v16qi, v8hi, v4si)
  VAR3 (QUADOP_NONE_NONE_NONE_NONE_PRED, vandq_m_s, v16qi, v8hi, v4si)
diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index b13ff53d36f..2edd0b06370 100644
--- 

[committed trunk 7/9] arm testsuite: Remove redundant tests

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
Following Andrea's overhaul of the MVE testsuite, these tests are now
redundant, as equivalent checks have been added to each intrinsic's
.c test.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/intrinsics/mve_fp_vaddq_n.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vaddq_m.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vaddq_n.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_s64.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_u64.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_s64.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_u64.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_s64.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_u64.c: 
Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_s64.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_u64.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_f16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_f16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_f16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u32.c: 
Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_f16.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s16.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s32.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u16.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_f32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_s32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_f32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_s32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_u32.c: 
Removed.
* 

[committed trunk 9/9] arm testsuite: Shifts and get_FPSCR ACLE optimisation fixes

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
These newly updated tests were rewritten by Andrea. Some of them
needed further manual fixing as follows:

* The #shift immediate value was not in the check-function-bodies as expected.
* The ACLE was specifying sub-optimal code: lsr+and instead of ubfx. In
  this case the test rewritten from the ACLE had the lsr+and pattern,
  but the compiler was able to optimise to ubfx. Hence I've changed the
  test to now match on ubfx.
* Added a separate test to check shift on constants being optimised to
  movs.
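
For reference, the lsr+and versus ubfx point can be sketched in plain C. This is only an illustration: it assumes the field being read is the carry flag at bit 29 (as in the vadcq/vsbcq header code), and ubfx_model is a hypothetical helper, not a compiler builtin:

```c
#include <stdint.h>

/* Two-step form: logical shift right, then mask (the "lsr + and" pattern).  */
static uint32_t
carry_lsr_and (uint32_t fpscr)
{
  return (fpscr >> 29) & 0x1u;
}

/* Hypothetical C model of an unsigned bit-field extract: take `width` bits
   starting at bit `lsb`.  Valid for 1 <= width <= 31 and
   lsb + width <= 32.  */
static uint32_t
ubfx_model (uint32_t value, unsigned lsb, unsigned width)
{
  return (value >> lsb) & ((1u << width) - 1u);
}
```

Both functions compute the same value for lsb 29 and width 1, so a compiler is free to emit a single ubfx for the first form, and the test has to match whatever is actually emitted.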

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/intrinsics/srshr.c: Update shift value.
* gcc.target/arm/mve/intrinsics/srshrl.c: Update shift value.
* gcc.target/arm/mve/intrinsics/uqshl.c: Update shift value.
* gcc.target/arm/mve/intrinsics/uqshll.c: Update shift value.
* gcc.target/arm/mve/intrinsics/urshr.c: Update shift value.
* gcc.target/arm/mve/intrinsics/urshrl.c: Update shift value.
* gcc.target/arm/mve/intrinsics/vadciq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadciq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadciq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadciq_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_u32.c: Update to ubfx.
* gcc.target/arm/mve/mve_const_shifts.c: New test.
---
 .../gcc.target/arm/mve/intrinsics/srshr.c |  2 +-
 .../gcc.target/arm/mve/intrinsics/srshrl.c|  2 +-
 .../gcc.target/arm/mve/intrinsics/uqshl.c | 14 +--
 .../gcc.target/arm/mve/intrinsics/uqshll.c| 14 +--
 .../gcc.target/arm/mve/intrinsics/urshr.c |  4 +-
 .../gcc.target/arm/mve/intrinsics/urshrl.c|  4 +-
 .../arm/mve/intrinsics/vadciq_m_s32.c |  8 +---
 .../arm/mve/intrinsics/vadciq_m_u32.c |  8 +---
 .../arm/mve/intrinsics/vadciq_s32.c   |  8 +---
 .../arm/mve/intrinsics/vadciq_u32.c   |  8 +---
 .../arm/mve/intrinsics/vadcq_m_s32.c  |  8 +---
 .../arm/mve/intrinsics/vadcq_m_u32.c  |  8 +---
 .../gcc.target/arm/mve/intrinsics/vadcq_s32.c |  8 +---
 .../gcc.target/arm/mve/intrinsics/vadcq_u32.c |  8 +---
 .../arm/mve/intrinsics/vsbciq_m_s32.c |  8 +---
 .../arm/mve/intrinsics/vsbciq_m_u32.c |  8 +---
 .../arm/mve/intrinsics/vsbciq_s32.c   |  8 +---
 .../arm/mve/intrinsics/vsbciq_u32.c   |  8 +---
 .../arm/mve/intrinsics/vsbcq_m_s32.c  |  8 +---
 .../arm/mve/intrinsics/vsbcq_m_u32.c  |  8 +---
 .../gcc.target/arm/mve/intrinsics/vsbcq_s32.c |  8 +---
 .../gcc.target/arm/mve/intrinsics/vsbcq_u32.c |  8 +---
 .../gcc.target/arm/mve/mve_const_shifts.c | 41 +++
 23 files changed, 81 insertions(+), 128 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/mve_const_shifts.c

diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
index 94e3f42fd33..734375d58c0 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** srshr   (?:ip|fp|r[0-9]+), #shift(?:@.*|)
+** srshr   (?:ip|fp|r[0-9]+), #1(?:@.*|)
 ** ...
 */
 int32_t
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
index 65f28ccbfde..a91943c38a0 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** srshrl  (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #shift(?: @.*|)
+** srshrl  (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #1(?: @.*|)
 ** ...
 */
 int64_t
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
index b23c9d97ba6..462531cad54 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** uqshl   (?:ip|fp|r[0-9]+), #shift(?:@.*|)
+** uqshl   (?:ip|fp|r[0-9]+), #1(?:@.*|)
 ** ...
 

[committed trunk 2/9] arm: Fix vstrwq* backend + testsuite

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
From: Andrea Corallo 

Hi all,

this patch fixes the vstrwq* MVE intrinsics failing to emit the
correct sequence of instructions due to a missing predicate. Also, the
immediate range is fixed to be multiples of 2 in [-252, 252].
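
The immediate range described above can be written as a small check. This is only a sketch of the stated range in plain C; vstrw_imm_ok is a hypothetical helper and not the actual predicate or constraint definition from the patch:

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the range stated in this message: an
   offset is accepted when it is a multiple of 2 within [-252, 252].  */
static bool
vstrw_imm_ok (int offset)
{
  return offset >= -252 && offset <= 252 && offset % 2 == 0;
}
```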

Best Regards

  Andrea

gcc/ChangeLog:

* config/arm/constraints.md (mve_vldrd_immediate): Move it to
predicates.md.
(Ri): Move constraint definition from predicates.md.
(Rl): Define new constraint.
* config/arm/mve.md (mve_vstrwq_scatter_base_wb_p_v4si): Add
missing constraint.
(mve_vstrwq_scatter_base_wb_p_fv4sf): Add missing Up constraint
for op 1, use mve_vstrw_immediate predicate and Rl constraint for
op 2. Fix asm output spacing.
(mve_vstrdq_scatter_base_wb_p_v2di): Add missing constraint.
* config/arm/predicates.md (Ri): Move constraint to constraints.md.
(mve_vldrd_immediate): Move it from
constraints.md.
(mve_vstrw_immediate): New predicate.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/intrinsics/vstrwq_f32.c: Use
check-function-bodies instead of scan-assembler checks.  Use
extern "C" for C++ testing.
* gcc.target/arm/mve/intrinsics/vstrwq_p_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_p_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_p_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_f32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_f32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_s32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_u32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_s32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_u32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_u32.c: Likewise.
---
 gcc/config/arm/constraints.md | 20 --
 gcc/config/arm/mve.md | 10 ++---
 gcc/config/arm/predicates.md  | 14 +++
 .../arm/mve/intrinsics/vstrwq_f32.c   | 32 ---
 .../arm/mve/intrinsics/vstrwq_p_f32.c | 40 ---
 .../arm/mve/intrinsics/vstrwq_p_s32.c | 40 ---
 .../arm/mve/intrinsics/vstrwq_p_u32.c | 40 ---
 .../arm/mve/intrinsics/vstrwq_s32.c   | 32 ---
 .../mve/intrinsics/vstrwq_scatter_base_f32.c  | 28 +++--
 .../intrinsics/vstrwq_scatter_base_p_f32.c| 36 +++--
 .../intrinsics/vstrwq_scatter_base_p_s32.c| 36 +++--
 .../intrinsics/vstrwq_scatter_base_p_u32.c| 36 +++--
 .../mve/intrinsics/vstrwq_scatter_base_s32.c  | 28 +++--
 .../mve/intrinsics/vstrwq_scatter_base_u32.c  | 28 +++--
 .../intrinsics/vstrwq_scatter_base_wb_f32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_f32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_s32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_u32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_s32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_base_wb_u32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_offset_f32.c| 32 ---
 .../intrinsics/vstrwq_scatter_offset_p_f32.c  | 40 ---
 

[committed trunk 8/9] arm testsuite: XFAIL or relax registers in some tests [PR109697]

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
Hi all,

This is a simple testsuite tidy-up patch, addressing two types of errors:

* The vcmp vector-scalar tests failing due to the compiler's preference
of vector-vector comparisons, over vector-scalar comparisons. This is
due to the lack of cost model for MVE and the compiler not knowing that
the RTL vec_duplicate is free in those instructions. For now, we simply
XFAIL these checks.
* The tests for pr108177 had strict usage of q0 and r0 registers,
meaning that they would FAIL with -mfloat-abi=softfp. The register checks
have now been relaxed. A couple of these run-tests also had inconsistent
use of integer MVE with floating point vectors, so I've now changed these
to use FP MVE.

gcc/testsuite/ChangeLog:
PR target/109697
* gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/pr108177-1.c: Relax registers.
* gcc.target/arm/mve/pr108177-10.c: Relax registers.
* gcc.target/arm/mve/pr108177-11.c: Relax registers.
* gcc.target/arm/mve/pr108177-12.c: Relax registers.
* gcc.target/arm/mve/pr108177-13.c: Relax registers.
* gcc.target/arm/mve/pr108177-13-run.c: Use mve_fp.
* gcc.target/arm/mve/pr108177-14.c: Relax registers.
* gcc.target/arm/mve/pr108177-14-run.c: Use mve_fp.
* gcc.target/arm/mve/pr108177-2.c: Relax registers.
* gcc.target/arm/mve/pr108177-3.c: Relax registers.
* gcc.target/arm/mve/pr108177-4.c: Relax registers.
* gcc.target/arm/mve/pr108177-5.c: Relax registers.
* gcc.target/arm/mve/pr108177-6.c: Relax registers.
* gcc.target/arm/mve/pr108177-7.c: Relax registers.
* gcc.target/arm/mve/pr108177-8.c: Relax registers.
* gcc.target/arm/mve/pr108177-9.c: Relax registers.
---
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c | 2 +-
 

[committed trunk 4/9] arm: Stop vadcq, vsbcq intrinsics from overwriting the FPSCR NZ flags

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
Hi all,

We noticed that calls to the vadcq and vsbcq intrinsics, both of
which use __builtin_arm_set_fpscr_nzcvqc to set the Carry flag in
the FPSCR, would produce the following code:

```
< r2 is the *carry input >
vmrs r3, FPSCR_nzcvqc
bic r3, r3, #536870912
orr r3, r3, r2, lsl #29
vmsr FPSCR_nzcvqc, r3
```

when the MVE ACLE instead gives a different instruction sequence of:
```
< Rt is the *carry input >
VMRS Rs,FPSCR_nzcvqc
BFI Rs,Rt,#29,#1
VMSR FPSCR_nzcvqc,Rs
```

The bic + orr pair is slower, and it is also wrong: if the
*carry input is greater than 1, then we risk overwriting the top two
bits of the FPSCR register (the N and Z flags).

This turned out to be a problem in the header file and the solution was
to simply add a `& 0x1u` to the `*carry` input: then the compiler knows
that we only care about the lowest bit and can optimise to a BFI.
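
To make the failure mode concrete, here is a minimal model of the two header expressions in plain C, operating on an ordinary 32-bit value instead of the real FPSCR. The function and macro names are hypothetical; the bit position (29, i.e. 536870912) is taken from the assembly above:

```c
#include <stdint.h>

#define FPSCR_C_BIT 29u
#define FPSCR_C_MASK (1u << FPSCR_C_BIT) /* 0x20000000 == 536870912 */

/* Old expression: an unmasked carry input can spill into bits 30 (Z)
   and 31 (N).  */
static uint32_t
set_carry_unmasked (uint32_t fpscr, uint32_t carry)
{
  return (fpscr & ~FPSCR_C_MASK) | (carry << FPSCR_C_BIT);
}

/* Fixed expression: only bit 0 of the carry input is used, which is the
   same effect as inserting a 1-bit field at bit 29.  */
static uint32_t
set_carry_masked (uint32_t fpscr, uint32_t carry)
{
  return (fpscr & ~FPSCR_C_MASK) | ((carry & 0x1u) << FPSCR_C_BIT);
}
```

With a carry input of 2, the unmasked form sets bit 30 (Z) and leaves the carry bit clear, whereas the masked form only ever touches bit 29.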

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:

* config/arm/arm_mve.h (__arm_vadcq_s32): Fix arithmetic.
(__arm_vadcq_u32): Likewise.
(__arm_vadcq_m_s32): Likewise.
(__arm_vadcq_m_u32): Likewise.
(__arm_vsbcq_s32): Likewise.
(__arm_vsbcq_u32): Likewise.
(__arm_vsbcq_m_s32): Likewise.
(__arm_vsbcq_m_u32): Likewise.
* config/arm/mve.md (get_fpscr_nzcvqc): Make unspec_volatile.

gcc/testsuite/ChangeLog:
* gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: New.
---
 gcc/config/arm/arm_mve.h  | 16 ++---
 gcc/config/arm/mve.md |  2 +-
 .../arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c | 67 +++
 3 files changed, 76 insertions(+), 9 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 1774e6eca2b..4ad1c99c288 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -4098,7 +4098,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vadcq_sv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4108,7 +4108,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_u32 (uint32x4_t __a, uint32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   uint32x4_t __res = __builtin_mve_vadcq_uv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4118,7 +4118,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_m_s32 (int32x4_t __inactive, int32x4_t __a, int32x4_t __b, 
unsigned * __carry, mve_pred16_t __p)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vadcq_m_sv4si (__inactive, __a, __b, __p);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4128,7 +4128,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_m_u32 (uint32x4_t __inactive, uint32x4_t __a, uint32x4_t __b, 
unsigned * __carry, mve_pred16_t __p)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   uint32x4_t __res =  __builtin_mve_vadcq_m_uv4si (__inactive, __a, __b, __p);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4174,7 +4174,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vsbcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vsbcq_sv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -4184,7 +4184,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, 

[committed trunk 5/9] arm: Fix overloading of MVE scalar constant parameters on vbicq

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
We found this as part of the wider testsuite updates.

The applicable tests were authored by Andrea earlier in this patch series.

Ok for trunk?

gcc/ChangeLog:

* config/arm/arm_mve.h (__arm_vbicq): Change coerce on
scalar constant.
---
 gcc/config/arm/arm_mve.h | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 4ad1c99c288..30cec519791 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -10847,10 +10847,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 
(__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
@@ -11699,10 +11699,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 
(__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
-- 
2.25.1



[committed gcc12 backport] arm: Fix overloading of MVE scalar constant parameters on vbicq, vmvnq_m

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
We found this as part of the wider testsuite updates.

The applicable tests were authored by Andrea earlier in this patch series.

Ok for trunk?

gcc/ChangeLog:

* config/arm/arm_mve.h (__arm_vbicq): Change coerce on
scalar constant.
(__arm_vmvnq_m): Likewise.
---
 gcc/config/arm/arm_mve.h | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 39b3446617d..0b35bd0eedd 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -35906,10 +35906,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 
(__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
@@ -38825,10 +38825,10 @@ extern void *__ARM_undef;
 #define __arm_vbicq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
   _Generic( (int (*)[__ARM_mve_typeid(__p0)][__ARM_mve_typeid(__p1)])0, \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce1 (__p1, int)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce1 (__p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3 (p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vbicq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3 (p1, int)), \
   int (*)[__ARM_mve_type_int8x16_t][__ARM_mve_type_int8x16_t]: __arm_vbicq_s8 
(__ARM_mve_coerce(__p0, int8x16_t), __ARM_mve_coerce(__p1, int8x16_t)), \
   int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int16x8_t]: __arm_vbicq_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce(__p1, int16x8_t)), \
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vbicq_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
@@ -40962,10 +40962,10 @@ extern void *__ARM_undef;
   int (*)[__ARM_mve_type_uint8x16_t][__ARM_mve_type_uint8x16_t]: 
__arm_vmvnq_m_u8 (__ARM_mve_coerce(__p0, uint8x16_t), __ARM_mve_coerce(__p1, 
uint8x16_t), p2), \
   int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_uint16x8_t]: 
__arm_vmvnq_m_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce(__p1, 
uint16x8_t), p2), \
   int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: 
__arm_vmvnq_m_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, 
uint32x4_t), p2), \
-  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vmvnq_m_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce1(__p1, 

[committed gcc12 backport] arm testsuite: Shifts and get_FPSCR ACLE optimisation fixes

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
These newly updated tests were rewritten by Andrea. Some of them
needed further manual fixing as follows:

* The #shift immediate value was not in the check-function-bodies as expected.
* The ACLE was specifying sub-optimal code: lsr+and instead of ubfx. In
  this case the test rewritten from the ACLE had the lsr+and pattern,
  but the compiler was able to optimise to ubfx. Hence I've changed the
  test to now match on ubfx.
* Added a separate test to check shift on constants being optimised to
  movs.
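
To illustrate the lsr+and versus ubfx point above, here is a minimal sketch in
portable C. It is not taken from the tests: `extract_lsr_and` and
`extract_ubfx` are hypothetical helper names, and `extract_ubfx` only models
what an unsigned bit-field extract computes. Both forms return the same value,
which is why the compiler is free to pick the single-instruction form.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-then-mask form, mirroring an lsr followed by an and with #1.  */
static uint32_t
extract_lsr_and (uint32_t value, unsigned lsb)
{
  return (value >> lsb) & 0x1u;
}

/* Hypothetical model of an unsigned bit-field extract: WIDTH bits starting
   at bit LSB.  Assumes 1 <= width and lsb + width <= 32.  */
static uint32_t
extract_ubfx (uint32_t value, unsigned lsb, unsigned width)
{
  assert (width >= 1 && lsb < 32 && lsb + width <= 32);
  uint32_t mask = (width >= 32) ? UINT32_MAX
				: ((UINT32_C (1) << width) - 1u);
  return (value >> lsb) & mask;
}
```

For a one-bit field at bit 29 (where the carry flag sits in the FPSCR),
`extract_ubfx (x, 29, 1)` and `extract_lsr_and (x, 29)` agree for every `x`.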

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/intrinsics/srshr.c: Update shift value.
* gcc.target/arm/mve/intrinsics/srshrl.c: Update shift value.
* gcc.target/arm/mve/intrinsics/uqshl.c: Update shift value.
* gcc.target/arm/mve/intrinsics/uqshll.c: Update shift value.
* gcc.target/arm/mve/intrinsics/urshr.c: Update shift value.
* gcc.target/arm/mve/intrinsics/urshrl.c: Update shift value.
* gcc.target/arm/mve/intrinsics/vadciq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadciq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadciq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadciq_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vadcq_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbciq_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_m_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_m_u32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_s32.c: Update to ubfx.
* gcc.target/arm/mve/intrinsics/vsbcq_u32.c: Update to ubfx.
* gcc.target/arm/mve/mve_const_shifts.c: New test.
---
 .../gcc.target/arm/mve/intrinsics/srshr.c |  2 +-
 .../gcc.target/arm/mve/intrinsics/srshrl.c|  2 +-
 .../gcc.target/arm/mve/intrinsics/uqshl.c | 14 +--
 .../gcc.target/arm/mve/intrinsics/uqshll.c| 14 +--
 .../gcc.target/arm/mve/intrinsics/urshr.c |  4 +-
 .../gcc.target/arm/mve/intrinsics/urshrl.c|  4 +-
 .../arm/mve/intrinsics/vadciq_m_s32.c |  8 +---
 .../arm/mve/intrinsics/vadciq_m_u32.c |  8 +---
 .../arm/mve/intrinsics/vadciq_s32.c   |  8 +---
 .../arm/mve/intrinsics/vadciq_u32.c   |  8 +---
 .../arm/mve/intrinsics/vadcq_m_s32.c  |  8 +---
 .../arm/mve/intrinsics/vadcq_m_u32.c  |  8 +---
 .../gcc.target/arm/mve/intrinsics/vadcq_s32.c |  8 +---
 .../gcc.target/arm/mve/intrinsics/vadcq_u32.c |  8 +---
 .../arm/mve/intrinsics/vsbciq_m_s32.c |  8 +---
 .../arm/mve/intrinsics/vsbciq_m_u32.c |  8 +---
 .../arm/mve/intrinsics/vsbciq_s32.c   |  8 +---
 .../arm/mve/intrinsics/vsbciq_u32.c   |  8 +---
 .../arm/mve/intrinsics/vsbcq_m_s32.c  |  8 +---
 .../arm/mve/intrinsics/vsbcq_m_u32.c  |  8 +---
 .../gcc.target/arm/mve/intrinsics/vsbcq_s32.c |  8 +---
 .../gcc.target/arm/mve/intrinsics/vsbcq_u32.c |  8 +---
 .../gcc.target/arm/mve/mve_const_shifts.c | 41 +++
 23 files changed, 81 insertions(+), 128 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/arm/mve/mve_const_shifts.c

diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
index 94e3f42fd33..734375d58c0 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshr.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** srshr   (?:ip|fp|r[0-9]+), #shift(?:@.*|)
+** srshr   (?:ip|fp|r[0-9]+), #1(?:@.*|)
 ** ...
 */
 int32_t
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
index 65f28ccbfde..a91943c38a0 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/srshrl.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** srshrl  (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #shift(?: @.*|)
+** srshrl  (?:ip|fp|r[0-9]+), (?:ip|fp|r[0-9]+), #1(?: @.*|)
 ** ...
 */
 int64_t
diff --git a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
index b23c9d97ba6..462531cad54 100644
--- a/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
+++ b/gcc/testsuite/gcc.target/arm/mve/intrinsics/uqshl.c
@@ -12,7 +12,7 @@ extern "C" {
 /*
 **foo:
 ** ...
-** uqshl   (?:ip|fp|r[0-9]+), #shift(?:@.*|)
+** uqshl   (?:ip|fp|r[0-9]+), #1(?:@.*|)
 ** ...
 

[committed gcc12 backport] arm testsuite: Remove reduntant tests

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
Following Andrea's overhaul of the MVE testsuite, these tests are now
redundant, as equivalent checks have been added to each intrinsic's
.c test.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/intrinsics/mve_fp_vaddq_n.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vaddq_m.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vaddq_n.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_m_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vddupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vdwdupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_m_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vidupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_viwdupq_x_n_u8.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_s64.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_u64.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_s64.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_offset_z_u64.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_s64.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_u64.c: 
Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_s64.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrdq_gather_shifted_offset_z_u64.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_f16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_s32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u16.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_f16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_s32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_offset_z_u32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_f16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_s32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u16.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_u32.c: 
Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_f16.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s16.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_s32.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u16.c: Removed.
* 
gcc.target/arm/mve/intrinsics/mve_vldrhq_gather_shifted_offset_z_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_f32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_s32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_u32.c: Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_f32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_s32.c: 
Removed.
* gcc.target/arm/mve/intrinsics/mve_vldrwq_gather_offset_z_u32.c: 
Removed.
* 

[committed gcc12 backport] arm testsuite: XFAIL or relax registers in some tests [PR109697]

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
Hi all,

This is a simple testsuite tidy-up patch, addressing two types of errors:

* The vcmp vector-scalar tests failing due to the compiler's preference
for vector-vector comparisons over vector-scalar comparisons. This is
due to the lack of a cost model for MVE and the compiler not knowing that
the RTL vec_duplicate is free in those instructions. For now, we simply
XFAIL these checks.
* The tests for pr108177 had strict usage of q0 and r0 registers,
meaning that they would FAIL with -mfloat-abi=softfp. The register checks
have now been relaxed. A couple of these run-tests also had inconsistent
use of integer MVE with floating point vectors, so I've now changed
these to use FP MVE.
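
As a rough illustration of the first point, the sketch below models the two
comparison forms in portable C. Everything in it is an assumption made for the
example: `u8x16_model` is a stand-in struct, not the real `uint8x16_t`, and
`dup_n`, `cmpeq_vv` and `cmpeq_vs` are hypothetical names. It only shows that
"duplicate the scalar, then compare vector-vector" yields the same predicate
as a direct vector-scalar compare, which is the equivalence the compiler
currently exploits in the less profitable direction.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16

static_assert (LANES <= 16, "one predicate bit per lane must fit in uint16_t");

/* Hypothetical stand-in for a 16-lane byte vector.  */
typedef struct { uint8_t lane[LANES]; } u8x16_model;

/* Broadcast one scalar to every lane (the role vec_duplicate plays).  */
static u8x16_model
dup_n (uint8_t s)
{
  u8x16_model v;
  for (int i = 0; i < LANES; i++)
    v.lane[i] = s;
  return v;
}

/* Vector-vector compare: one result bit per lane.  */
static uint16_t
cmpeq_vv (u8x16_model a, u8x16_model b)
{
  uint16_t pred = 0;
  for (int i = 0; i < LANES; i++)
    if (a.lane[i] == b.lane[i])
      pred |= (uint16_t) (1u << i);
  return pred;
}

/* Vector-scalar compare: no explicit duplicate is needed.  */
static uint16_t
cmpeq_vs (u8x16_model a, uint8_t s)
{
  uint16_t pred = 0;
  for (int i = 0; i < LANES; i++)
    if (a.lane[i] == s)
      pred |= (uint16_t) (1u << i);
  return pred;
}
```

For any input, `cmpeq_vs (a, s)` equals `cmpeq_vv (a, dup_n (s))`.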

gcc/testsuite/ChangeLog:
PR target/109697
* gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c: XFAIL check.
* gcc.target/arm/mve/intrinsics/vcmpneq_n_u8.c: XFAIL check.
* gcc.target/arm/mve/pr108177-1.c: Relax registers.
* gcc.target/arm/mve/pr108177-10.c: Relax registers.
* gcc.target/arm/mve/pr108177-11.c: Relax registers.
* gcc.target/arm/mve/pr108177-12.c: Relax registers.
* gcc.target/arm/mve/pr108177-13.c: Relax registers.
* gcc.target/arm/mve/pr108177-13-run.c: Use mve_fp.
* gcc.target/arm/mve/pr108177-14.c: Relax registers.
* gcc.target/arm/mve/pr108177-14-run.c: Use mve_fp.
* gcc.target/arm/mve/pr108177-2.c: Relax registers.
* gcc.target/arm/mve/pr108177-3.c: Relax registers.
* gcc.target/arm/mve/pr108177-4.c: Relax registers.
* gcc.target/arm/mve/pr108177-5.c: Relax registers.
* gcc.target/arm/mve/pr108177-6.c: Relax registers.
* gcc.target/arm/mve/pr108177-7.c: Relax registers.
* gcc.target/arm/mve/pr108177-8.c: Relax registers.
* gcc.target/arm/mve/pr108177-9.c: Relax registers.
---
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpcsq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpeqq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgeq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpgtq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmphiq_n_u8.c  | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpleq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpltq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_f32.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u16.c | 2 +-
 gcc/testsuite/gcc.target/arm/mve/intrinsics/vcmpneq_n_u32.c | 2 +-
 

[committed gcc12 backport] arm: Stop vadcq, vsbcq intrinsics from overwriting the FPSCR NZ flags

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
Hi all,

We noticed that calls to the vadcq and vsbcq intrinsics, both of
which use __builtin_arm_set_fpscr_nzcvqc to set the Carry flag in
the FPSCR, would produce the following code:

```
< r2 is the *carry input >
vmrs r3, FPSCR_nzcvqc
bic r3, r3, #536870912
orr r3, r3, r2, lsl #29
vmsr FPSCR_nzcvqc, r3
```

when the MVE ACLE instead gives a different instruction sequence of:
```
< Rt is the *carry input >
VMRS Rs,FPSCR_nzcvqc
BFI Rs,Rt,#29,#1
VMSR FPSCR_nzcvqc,Rs
```

The bic + orr pair is slower and it's also wrong, because if the
*carry input is greater than 1, then we risk overwriting the top two
bits of the FPSCR register (the N and Z flags).

This turned out to be a problem in the header file and the solution was
to simply add a `& 0x1u` to the `*carry` input: then the compiler knows
that we only care about the lowest bit and can optimise to a BFI.
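
The effect of the missing mask can be sketched in portable C. This is only a
model of the arithmetic, not the header code itself: `set_carry_unmasked` and
`set_carry_masked` are hypothetical names, and the FPSCR value is passed in as
a plain integer instead of being read through the builtins.

```c
#include <assert.h>
#include <stdint.h>

/* Flag positions in FPSCR_nzcvqc as referred to in the text above.  */
#define FPSCR_N_BIT (1u << 31)
#define FPSCR_Z_BIT (1u << 30)
#define FPSCR_C_BIT (1u << 29)

static_assert (FPSCR_C_BIT == 536870912u,
	       "carry mask matches the bic immediate");

/* Old arithmetic: the carry input is shifted in without masking.  */
static uint32_t
set_carry_unmasked (uint32_t fpscr, uint32_t carry)
{
  return (fpscr & ~FPSCR_C_BIT) | (carry << 29);
}

/* Fixed arithmetic: only bit 0 of the carry input is inserted.  */
static uint32_t
set_carry_masked (uint32_t fpscr, uint32_t carry)
{
  return (fpscr & ~FPSCR_C_BIT) | ((carry & 0x1u) << 29);
}
```

With a carry input of 2, the unmasked form sets bit 30 (Z) and leaves C clear,
and with 4 it sets bit 31 (N). The masked form only ever touches bit 29.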

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:

* config/arm/arm_mve.h (__arm_vadcq_s32): Fix arithmetic.
(__arm_vadcq_u32): Likewise.
(__arm_vadcq_m_s32): Likewise.
(__arm_vadcq_m_u32): Likewise.
(__arm_vsbcq_s32): Likewise.
(__arm_vsbcq_u32): Likewise.
(__arm_vsbcq_m_s32): Likewise.
(__arm_vsbcq_m_u32): Likewise.
* config/arm/mve.md (get_fpscr_nzcvqc): Make unspec_volatile.

gcc/testsuite/ChangeLog:
* gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c: New.

(cherry picked from commit f1417d051be094ffbce228e11951f3e12e8fca1c)
---
 gcc/config/arm/arm_mve.h  | 16 ++---
 gcc/config/arm/mve.md |  2 +-
 .../arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c | 67 +++
 3 files changed, 76 insertions(+), 9 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/arm/mve/mve_vadcq_vsbcq_fpscr_overwrite.c

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 82ceec2bbfc..6bf1794d2ff 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -16055,7 +16055,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vadcq_sv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -16065,7 +16065,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_u32 (uint32x4_t __a, uint32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   uint32x4_t __res = __builtin_mve_vadcq_uv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -16075,7 +16075,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_m_s32 (int32x4_t __inactive, int32x4_t __a, int32x4_t __b, 
unsigned * __carry, mve_pred16_t __p)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vadcq_m_sv4si (__inactive, __a, __b, __p);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -16085,7 +16085,7 @@ __extension__ extern __inline uint32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vadcq_m_u32 (uint32x4_t __inactive, uint32x4_t __a, uint32x4_t __b, 
unsigned * __carry, mve_pred16_t __p)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   uint32x4_t __res =  __builtin_mve_vadcq_m_uv4si (__inactive, __a, __b, __p);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -16131,7 +16131,7 @@ __extension__ extern __inline int32x4_t
 __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
 __arm_vsbcq_s32 (int32x4_t __a, int32x4_t __b, unsigned * __carry)
 {
-  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | (*__carry << 29));
+  __builtin_arm_set_fpscr_nzcvqc((__builtin_arm_get_fpscr_nzcvqc () & 
~0x20000000u) | ((*__carry & 0x1u) << 29));
   int32x4_t __res = __builtin_mve_vsbcq_sv4si (__a, __b);
   *__carry = (__builtin_arm_get_fpscr_nzcvqc () >> 29) & 0x1u;
   return __res;
@@ -16141,7 

[committed gcc12 backport] [arm] complete vmsr/vmrs blank and case adjustments

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
From: Alexandre Oliva 

Back in September last year, some of the vmsr and vmrs patterns had an
extraneous blank removed, and the case of register names lowered, but
another instance remained, and so did a testcase.

for  gcc/ChangeLog

* config/arm/vfp.md (*thumb2_movsi_vfp): Drop blank after tab
after vmsr and vmrs, and lower the case of P0.

for  gcc/testsuite/ChangeLog

* gcc.target/arm/acle/cde-mve-full-assembly.c: Drop blank
after tab after vmsr, and lower the case of P0.
---
 gcc/config/arm/vfp.md |   4 +-
 .../arm/acle/cde-mve-full-assembly.c  | 264 +-
 2 files changed, 134 insertions(+), 134 deletions(-)

diff --git a/gcc/config/arm/vfp.md b/gcc/config/arm/vfp.md
index 932e4b7447e..7a430ef8d36 100644
--- a/gcc/config/arm/vfp.md
+++ b/gcc/config/arm/vfp.md
@@ -312,9 +312,9 @@ (define_insn "*thumb2_movsi_vfp"
 case 12: case 13:
   return output_move_vfp (operands);
 case 14:
-  return \"vmsr\\t P0, %1\";
+  return \"vmsr\\tp0, %1\";
 case 15:
-  return \"vmrs\\t %0, P0\";
+  return \"vmrs\\t%0, p0\";
 case 16:
   return \"mcr\\tp10, 7, %1, cr1, cr0, 0\\t @SET_FPSCR\";
 case 17:
diff --git a/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c 
b/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c
index 501cc84da10..e3e7f7ef3e5 100644
--- a/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c
+++ b/gcc/testsuite/gcc.target/arm/acle/cde-mve-full-assembly.c
@@ -567,80 +567,80 @@
contain back references).  */
 /*
 ** test_cde_vcx1q_mfloat16x8_tintint:
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
 ** vpst
 ** vcx1t   p0, q0, #32
 ** bx  lr
 */
 /*
 ** test_cde_vcx1q_mfloat32x4_tintint:
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
 ** vpst
 ** vcx1t   p0, q0, #32
 ** bx  lr
 */
 /*
 ** test_cde_vcx1q_muint8x16_tintint:
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
 ** vpst
 ** vcx1t   p0, q0, #32
 ** bx  lr
 */
 /*
 ** test_cde_vcx1q_muint16x8_tintint:
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
 ** vpst
 ** vcx1t   p0, q0, #32
 ** bx  lr
 */
 /*
 ** test_cde_vcx1q_muint32x4_tintint:
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
 ** vpst
 ** vcx1t   p0, q0, #32
 ** bx  lr
 */
 /*
 ** test_cde_vcx1q_muint64x2_tintint:
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
 ** vpst
 ** vcx1t   p0, q0, #32
 ** bx  lr
 */
 /*
 ** test_cde_vcx1q_mint8x16_tintint:
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
-** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
 P0, r2 @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ movhi)
+** (?:vldr\.64 d0, \.L[0-9]*\n\tvldr\.64   d1, \.L[0-9]*\+8|vmsr   
p0, r2  @ 

[committed gcc12 backport] arm: Add vorrq_n overloading into vorrq _Generic

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
We found this as part of the wider testsuite updates.

The applicable tests were authored by Andrea earlier in this patch series.

Ok for trunk?

gcc/ChangeLog:

* config/arm/arm_mve.h (__arm_vorrq): Add _n variant.
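
For readers unfamiliar with the dispatch scheme, here is a much reduced sketch
of two-argument `_Generic` overloading with an added scalar ("_n") association.
All names are hypothetical stand-ins (`vec16_model`, `TYPEID`, `COERCE`, `ORR`,
`orr_vec`, `orr_n`); it does not reproduce the real `__ARM_mve_typeid` or
`__ARM_mve_coerce*` macros, and it assumes a compiler that accepts a generic
selection as an array bound, as GCC does.

```c
#include <assert.h>

/* Hypothetical type ids, standing in for the __ARM_mve_type_* values.  */
enum { TYPE_VEC16 = 1, TYPE_INT_N = 2, TYPE_UNKNOWN = 3 };

/* Hypothetical stand-in for a vector type.  */
typedef struct { short lane[8]; } vec16_model;

#define TYPEID(x) \
  _Generic ((x), vec16_model: TYPE_VEC16, int: TYPE_INT_N, \
	    default: TYPE_UNKNOWN)

static_assert (TYPEID (0) == TYPE_INT_N, "integer constants get the scalar id");

/* Unselected associations must still type-check, so each argument is coerced
   to the expected type.  The default arm is never evaluated when the
   enclosing association is the one selected.  */
static void *const undef_model = 0;
#define COERCE(param, type) \
  _Generic ((param), type: (param), default: *(type *) undef_model)

enum { PICKED_VECTOR = 10, PICKED_SCALAR = 20 };

static int
orr_vec (vec16_model a, vec16_model b)
{
  (void) a; (void) b;
  return PICKED_VECTOR;
}

static int
orr_n (vec16_model a, int imm)
{
  (void) a; (void) imm;
  return PICKED_SCALAR;
}

/* Dispatch on both argument types at once through a pointer-to-2D-array
   type.  The second association plays the role of the added _n variant.  */
#define ORR(p0, p1) \
  _Generic ((int (*)[TYPEID (p0)][TYPEID (p1)]) 0, \
    int (*)[TYPE_VEC16][TYPE_VEC16]: \
      orr_vec (COERCE (p0, vec16_model), COERCE (p1, vec16_model)), \
    int (*)[TYPE_VEC16][TYPE_INT_N]: \
      orr_n (COERCE (p0, vec16_model), COERCE (p1, int)))
```

Without the second association, a call such as `ORR (v, 5)` would fail to
compile because no association matches the `[TYPE_VEC16][TYPE_INT_N]` pair.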
---
 gcc/config/arm/arm_mve.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 6bf1794d2ff..39b3446617d 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -35852,6 +35852,10 @@ extern void *__ARM_undef;
   int (*)[__ARM_mve_type_uint8x16_t][__ARM_mve_type_uint8x16_t]: 
__arm_vorrq_u8 (__ARM_mve_coerce(__p0, uint8x16_t), __ARM_mve_coerce(__p1, 
uint8x16_t)), \
   int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_uint16x8_t]: 
__arm_vorrq_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce(__p1, 
uint16x8_t)), \
   int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: 
__arm_vorrq_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, 
uint32x4_t)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3(p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3(p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3(p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3(p1, int)), \
   int (*)[__ARM_mve_type_float16x8_t][__ARM_mve_type_float16x8_t]: 
__arm_vorrq_f16 (__ARM_mve_coerce(__p0, float16x8_t), __ARM_mve_coerce(__p1, 
float16x8_t)), \
   int (*)[__ARM_mve_type_float32x4_t][__ARM_mve_type_float32x4_t]: 
__arm_vorrq_f32 (__ARM_mve_coerce(__p0, float32x4_t), __ARM_mve_coerce(__p1, 
float32x4_t)));})
 
@@ -38637,7 +38641,11 @@ extern void *__ARM_undef;
   int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int32x4_t]: __arm_vorrq_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce(__p1, int32x4_t)), \
   int (*)[__ARM_mve_type_uint8x16_t][__ARM_mve_type_uint8x16_t]: 
__arm_vorrq_u8 (__ARM_mve_coerce(__p0, uint8x16_t), __ARM_mve_coerce(__p1, 
uint8x16_t)), \
   int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_uint16x8_t]: 
__arm_vorrq_u16 (__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce(__p1, 
uint16x8_t)), \
-  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: 
__arm_vorrq_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, 
uint32x4_t)));})
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_uint32x4_t]: 
__arm_vorrq_u32 (__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce(__p1, 
uint32x4_t)), \
+  int (*)[__ARM_mve_type_uint16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u16 
(__ARM_mve_coerce(__p0, uint16x8_t), __ARM_mve_coerce3(p1, int)), \
+  int (*)[__ARM_mve_type_uint32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_u32 
(__ARM_mve_coerce(__p0, uint32x4_t), __ARM_mve_coerce3(p1, int)), \
+  int (*)[__ARM_mve_type_int16x8_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s16 
(__ARM_mve_coerce(__p0, int16x8_t), __ARM_mve_coerce3(p1, int)), \
+  int (*)[__ARM_mve_type_int32x4_t][__ARM_mve_type_int_n]: __arm_vorrq_n_s32 
(__ARM_mve_coerce(__p0, int32x4_t), __ARM_mve_coerce3(p1, int)));})
 
 #define __arm_vornq(p0,p1) ({ __typeof(p0) __p0 = (p0); \
   __typeof(p1) __p1 = (p1); \
-- 
2.25.1



[committed gcc12 backport] arm: Fix vstrwq* backend + testsuite

2023-05-18 Thread Stam Markianos-Wright via Gcc-patches
From: Andrea Corallo 

Hi all,

This patch fixes the vstrwq* MVE intrinsics failing to emit the
correct sequence of instructions due to a missing predicate. The
immediate range is also fixed to be multiples of 2 in the range [-252, 252].
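
As a rough illustration of the immediate rule stated above, the check could
be sketched in C as below. `vstrw_imm_ok` is a hypothetical helper name and
is not the predicate or constraint added by this patch; it only encodes the
range exactly as this message words it.

```c
#include <stdbool.h>

/* Hypothetical helper (not GCC code): return true if IMM is an even
   value inside [-252, 252], the range stated in this message.  */
static bool
vstrw_imm_ok (int imm)
{
  return imm >= -252 && imm <= 252 && imm % 2 == 0;
}
```

For example, `vstrw_imm_ok (252)` holds, while `vstrw_imm_ok (253)` and
`vstrw_imm_ok (-254)` do not.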

Best Regards

  Andrea

gcc/ChangeLog:

* config/arm/constraints.md (mve_vldrd_immediate): Move it to
predicates.md.
(Ri): Move constraint definition from predicates.md.
(Rl): Define new constraint.
* config/arm/mve.md (mve_vstrwq_scatter_base_wb_p_v4si): Add
missing constraint.
(mve_vstrwq_scatter_base_wb_p_fv4sf): Add missing Up constraint
for op 1, use mve_vstrw_immediate predicate and Rl constraint for
op 2. Fix asm output spacing.
(mve_vstrdq_scatter_base_wb_p_v2di): Add missing constraint.
* config/arm/predicates.md (Ri): Move constraint to constraints.md.
(mve_vldrd_immediate): Move it from
constraints.md.
(mve_vstrw_immediate): New predicate.

gcc/testsuite/ChangeLog:

* gcc.target/arm/mve/intrinsics/vstrwq_f32.c: Use
check-function-bodies instead of scan-assembler checks.  Use
extern "C" for C++ testing.
* gcc.target/arm/mve/intrinsics/vstrwq_p_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_p_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_p_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_p_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_f32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_s32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_p_u32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_base_wb_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_f32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_p_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_s32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_offset_u32.c: Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_f32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_f32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_s32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_p_u32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_s32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_scatter_shifted_offset_u32.c: 
Likewise.
* gcc.target/arm/mve/intrinsics/vstrwq_u32.c: Likewise.
---
 gcc/config/arm/constraints.md | 20 --
 gcc/config/arm/mve.md | 10 ++---
 gcc/config/arm/predicates.md  | 14 +++
 .../arm/mve/intrinsics/vstrwq_f32.c   | 32 ---
 .../arm/mve/intrinsics/vstrwq_p_f32.c | 40 ---
 .../arm/mve/intrinsics/vstrwq_p_s32.c | 40 ---
 .../arm/mve/intrinsics/vstrwq_p_u32.c | 40 ---
 .../arm/mve/intrinsics/vstrwq_s32.c   | 32 ---
 .../mve/intrinsics/vstrwq_scatter_base_f32.c  | 28 +++--
 .../intrinsics/vstrwq_scatter_base_p_f32.c| 36 +++--
 .../intrinsics/vstrwq_scatter_base_p_s32.c| 36 +++--
 .../intrinsics/vstrwq_scatter_base_p_u32.c| 36 +++--
 .../mve/intrinsics/vstrwq_scatter_base_s32.c  | 28 +++--
 .../mve/intrinsics/vstrwq_scatter_base_u32.c  | 28 +++--
 .../intrinsics/vstrwq_scatter_base_wb_f32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_f32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_s32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_p_u32.c | 40 ---
 .../intrinsics/vstrwq_scatter_base_wb_s32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_base_wb_u32.c   | 32 ---
 .../intrinsics/vstrwq_scatter_offset_f32.c| 32 ---
 .../intrinsics/vstrwq_scatter_offset_p_f32.c  | 40 ---
 

[PATCH 2/2 v2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2023-01-11 Thread Stam Markianos-Wright via Gcc-patches

-  Respin of the below patch -

In this 2/2 patch, from v1 to v2 I have:

* Removed the modification of the interface of the doloop_end target-insn
(so I no longer need to touch any other target backends)


* Added more modes to `arm_get_required_vpr_reg` to make it flexible
between searching: all operands/only input arguments/only outputs. Also
added helpers:
`arm_get_required_vpr_reg_ret_val`
`arm_get_required_vpr_reg_param`

* Added support for the use of other VPR predicate values within
a dlstp/letp loop, as long as they don't originate from the vctp-generated
VPR value. Also changed `arm_mve_get_loop_unique_vctp` to the simpler
`arm_mve_get_loop_vctp` since now we can support other VCTP insns
within the loop.

* Added support for loops of the form:
     int num_of_iters = (num_of_elem + num_of_lanes - 1) / num_of_lanes;
     for (i = 0; i < num_of_iters; i++)
       {
         p = vctp (num_of_elem);
         num_of_elem -= num_of_lanes;
       }
   to be transformed into dlstp/letp loops.

* Changed the VCTP look-ahead for SIGN_EXTEND and SUBREG insns to
use df def/use chains instead of `next_nonnote_nondebug_insn_bb`.

* Added support for using unpredicated (but predicable) insns
within the dlstp/letp loop. These need to meet some specific conditions,
because they _will_ become implicitly tail predicated by the dlstp/letp
transformation.

* Added a df chain check to any other instructions to make sure that they
don't USE the VCTP-generated VPR value.

* Added testing of all these various edge cases.
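
To make the element-counting loop shape discussed in the list above more
concrete, here is a minimal sketch in portable C. It does not use MVE
intrinsics and it is not code from the patch: `LANES`, `model_vctp` and
`add_one_tail_predicated` are hypothetical names, and the lane mask is only a
software model of what a vctp-generated VPR value does in a tail-predicated
loop.

```c
#include <stdint.h>

#define LANES 4 /* Assumption: 32-bit elements in a 128-bit vector.  */

/* Software model of vctp: lane I is active while I < REMAINING.  */
static unsigned
model_vctp (int remaining)
{
  unsigned mask = 0;
  for (int lane = 0; lane < LANES; lane++)
    if (lane < remaining)
      mask |= 1u << lane;
  return mask;
}

/* Add 1 to the first N elements of DATA, LANES elements per iteration.
   The counter holds the number of elements left and is decremented by
   LANES rather than by 1.  Returns the number of iterations executed.  */
static int
add_one_tail_predicated (int32_t *data, int n)
{
  int iterations = 0;
  int done = 0;

  for (int remaining = n; remaining > 0; remaining -= LANES)
    {
      unsigned p = model_vctp (remaining);

      /* Inactive lanes are skipped, which models tail predication.  */
      for (int lane = 0; lane < LANES; lane++)
        if (p & (1u << lane))
          data[done + lane] += 1;

      done += LANES;
      iterations++;
    }
  return iterations;
}
```

With `n = 10` this runs three iterations, matching
`(num_of_elem + num_of_lanes - 1) / num_of_lanes`, and only two lanes are
active in the last one.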


Original email with updated Changelog at the end:



Hi all,

This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
   decrement numbers other than -1 and for `count` to be an
   rtx containing the number of elements to be processed, rather
   than an expression for calculating the number of iterations.
2) Add an `allow_elementwise_doloop` target hook. This allows the
   target backend to manipulate the iteration count as it needs:
   in our case to change it from a pre-calculation of the number
   of iterations to the number of elements to be processed.
3) The doloop_end target-insn now has an additional parameter:
   the `count` (note: this is before it gets modified to just be
   the number of elements), so that the decrement value is
   extracted from that parameter.

And many things in the backend to implement the above optimisation:

4)  Appropriate changes to the define_expand of doloop_end and new
    patterns for dlstp and letp.
5) `arm_attempt_dlstp_transform`: (called from the define_expand of
    doloop_end) this function checks for the loop's suitability for
    dlstp/letp transformation and then implements it, if possible.
6) `arm_mve_get_loop_unique_vctp`: A function that loops through
    the loop contents and returns the vctp VPR-generating operation
    within the loop, if it is unique and there is exclusively one
    vctp within the loop.
7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
   from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
   to check an insn to see if it requires the VPR or not.
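
As a small illustration of the lane mapping mentioned in item 7, the
relationship between element width and lane count can be sketched as below.
This is not the patch's `arm_mve_get_vctp_lanes`, which maps from vctp
unspecs; `vctp_lanes` is a hypothetical stand-in that takes the element width
in bits and relies only on MVE vectors being 128 bits wide.

```c
/* Hypothetical helper: number of lanes for a vctp of the given element
   width in bits (8, 16, 32 or 64), or 0 for an unsupported width.
   MVE vector registers are 128 bits wide.  */
static int
vctp_lanes (int element_bits)
{
  switch (element_bits)
    {
    case 8:
      return 16;
    case 16:
      return 8;
    case 32:
      return 4;
    case 64:
      return 2;
    default:
      return 0;
    }
}
```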

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

    * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
    * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
    (arm_mve_get_vctp_lanes): New.
    (arm_get_required_vpr_reg): New.
    (arm_get_required_vpr_reg_ret_val): New.
    (arm_get_required_vpr_reg_param): New.
    (arm_mve_get_loop_vctp): New.
    (arm_attempt_dlstp_transform): New.
    (arm_allow_elementwise_doloop): New.
    * config/arm/iterators.md (DLSTP): New.
    (mode1): Add DLSTP mappings.
    * config/arm/mve.md (*predicated_doloop_end_internal): New.
    (dlstp_insn): New.
    * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
    * config/arm/unspecs.md: New unspecs.
    * tm.texi: Document new hook.
    * tm.texi.in: Likewise.
    * loop-doloop.cc (doloop_condition_get): Relax conditions.
    (doloop_optimize): Add support for elementwise LoLs.
    * target.def (allow_elementwise_doloop): New hook.
    * targhooks.cc (default_allow_elementwise_doloop): New.
    * targhooks.h (default_allow_elementwise_doloop): New.

gcc/testsuite/ChangeLog:

    * gcc.target/arm/lob.h: Update framework.
    * gcc.target/arm/lob1.c: Likewise.
    * gcc.target/arm/lob6.c: Likewise.
    * gcc.target/arm/dlstp-int16x8.c: New test.
    * gcc.target/arm/dlstp-int32x4.c: New test.
    * 

[PING][PATCH] arm: Split up MVE _Generic associations to prevent type clashes [PR107515]

2023-01-10 Thread Stam Markianos-Wright via Gcc-patches

Hi all,

With these previous patches:
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606586.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606587.html
we enabled the MVE overloaded _Generic associations to handle more
scalar types, however at PR 107515 we found a new regression that
wasn't detected in our testing:

With glibc's `posix/types.h`:
```
typedef signed int __int32_t;
...
typedef __int32_t int32_t;
```
We would get an `error: '_Generic' specifies two compatible types`
from `__ARM_mve_coerce3`, because when `type` is `int` the associations
`type: param` and `int32_t: param` name the same underlying type.

The same did not happen with Newlib's header `sys/_stdint.h`:
```
typedef long int __int32_t;
...
typedef __int32_t int32_t ;
```
which worked fine, because it uses `long int`.

The same could feasibly happen in `__ARM_mve_coerce2` between
`__fp16` and `float16_t`.

The solution here is to break the _Generic down, so that the similar
types don't appear at the same level, as is done in `__ARM_mve_typeid`.
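
The effect of the split can be reproduced outside `arm_mve.h` with a small
standalone sketch. `COERCE_KIND` is a hypothetical macro written for this
illustration only; it returns an integer tag instead of the parameter, but it
nests its associations in the same way as the proposed `__ARM_mve_coerce3`.

```c
#include <stdint.h>

/* The flat form is shown only in a comment, because it fails to compile
   on targets where int32_t is a typedef of int:
     _Generic ((x), int: 1, int32_t: 3, default: 0)  */

/* Nested form: `type` is tried first, and the fixed-width typedefs are
   only consulted in the inner selection, so they never share a level
   with `type`.  */
#define COERCE_KIND(x, type)                                          \
  _Generic ((x), type: 1,                                             \
            default: _Generic ((x), int16_t: 2, int32_t: 3,           \
                               int64_t: 4, default: 0))
```

Where `int32_t` is `int`, `COERCE_KIND (i, int)` with an `int32_t` argument
selects the outer association; where `int32_t` is `long int`, the same
argument falls through to the inner selection. Either way the expression
compiles.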

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:
 PR target/96795
 PR target/107515
 * config/arm/arm_mve.h (__ARM_mve_coerce2): Split types.
 (__ARM_mve_coerce3): Likewise.

gcc/testsuite/ChangeLog:
 PR target/96795
 PR target/107515
 *
gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c: New test.
 *
gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c: New test.


=== Inline Ctrl+C, Ctrl+V or patch ===

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index
09167ec118ed3310c5077145e119196f29d83cac..70003653db65736fcfd019e83d9f18153be650dc
100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -35659,9 +35659,9 @@ extern void *__ARM_undef;
  #define __ARM_mve_coerce1(param, type) \
  _Generic(param, type: param, const type: param, default: *(type
*)__ARM_undef)
  #define __ARM_mve_coerce2(param, type) \
-_Generic(param, type: param, float16_t: param, float32_t: param,
default: *(type *)__ARM_undef)
+_Generic(param, type: param, __fp16: param, default: _Generic
(param, _Float16: param, float16_t: param, float32_t: param, default:
*(type *)__ARM_undef))
  #define __ARM_mve_coerce3(param, type) \
-_Generic(param, type: param, int8_t: param, int16_t: param,
int32_t: param, int64_t: param, uint8_t: param, uint16_t: param,
uint32_t: param, uint64_t: param, default: *(type *)__ARM_undef)
+_Generic(param, type: param, default: _Generic (param, int8_t:
param, int16_t: param, int32_t: param, int64_t: param, uint8_t: param,
uint16_t: param, uint32_t: param, uint64_t: param, default: *(type
*)__ARM_undef))

  #if (__ARM_FEATURE_MVE & 2) /* MVE Floating point.  */

diff --git
a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c
new file mode 100644
index
..427dcacb5ff59b53d5eab1f1582ef6460da3f2f3
--- /dev/null
+++
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c
@@ -0,0 +1,65 @@
+/* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
+/* { dg-additional-options "-O2 -Wno-pedantic -Wno-long-long" } */
+#include "arm_mve.h"
+
+float f1;
+double f2;
+float16_t f3;
+float32_t f4;
+__fp16 f5;
+_Float16 f6;
+
+int i1;
+short i2;
+long i3;
+long long i4;
+int8_t i5;
+int16_t i6;
+int32_t i7;
+int64_t i8;
+
+const int ci1;
+const short ci2;
+const long ci3;
+const long long ci4;
+const int8_t ci5;
+const int16_t ci6;
+const int32_t ci7;
+const int64_t ci8;
+
+float16x8_t floatvec;
+int16x8_t intvec;
+
+void test(void)
+{
+/* Test a few different supported ways of passing an int value.  The
+intrinsic vmulq was chosen arbitrarily, but it is representative of
+all intrinsics that take a non-const scalar value.  */
+intvec = vmulq(intvec, 2);
+intvec = vmulq(intvec, (int32_t) 2);
+intvec = vmulq(intvec, (short) 2);
+intvec = vmulq(intvec, i1);
+intvec = vmulq(intvec, i2);
+intvec = vmulq(intvec, i3);
+intvec = vmulq(intvec, i4);
+intvec = vmulq(intvec, i5);
+intvec = vmulq(intvec, i6);
+intvec = vmulq(intvec, i7);
+intvec = vmulq(intvec, i8);
+
+/* Test a few different supported ways of passing a float value.  */
+floatvec = vmulq(floatvec, 0.5);
+floatvec = vmulq(floatvec, 0.5f);
+floatvec = vmulq(floatvec, (__fp16) 0.5);
+floatvec = vmulq(floatvec, f1);
+floatvec = vmulq(floatvec, f2);
+floatvec = vmulq(floatvec, f3);
+floatvec = vmulq(floatvec, f4);
+floatvec = vmulq(floatvec, f5);
+floatvec = vmulq(floatvec, f6);
+floatvec = vmulq(floatvec, 0.15f16);
+floatvec = vmulq(floatvec, (_Float16) 0.15);
+}
+
+/* { dg-final { scan-assembler-not "__ARM_undef" } } */
\ No newline at end of file
diff --git

Re: [PATCH] Fix memory constraint on MVE v[ld/st][2/4] instructions [PR107714]

2023-01-10 Thread Stam Markianos-Wright via Gcc-patches



On 12/12/2022 13:42, Kyrylo Tkachov wrote:

Hi Stam,


-Original Message-
From: Stam Markianos-Wright 
Sent: Friday, December 9, 2022 1:32 PM
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov ; Richard Earnshaw
; Ramana Radhakrishnan
; ni...@redhat.com
Subject: [PATCH] Fix memory constraint on MVE v[ld/st][2/4] instructions
[PR107714]

Hi all,

In the M-Class Arm-ARM:

https://developer.arm.com/documentation/ddi0553/bu/?lang=en

these MVE instructions only have a '!' writeback variant and at:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107714

we found that the Um constraint would also allow through a
register offset writeback, resulting in an assembler error.

Here I have added a new constraint and predicate for these
instructions, which (uniquely, AFAICT), only support a `!` writeback
increment by the data size (inside the compiler this is a POST_INC).

No regressions in arm-none-eabi with MVE and MVE.FP.

Ok for trunk, and backport to GCC11 and GCC12 (testing pending)?

Thanks,
Stam

gcc/ChangeLog:
      PR target/107714
      * config/arm/arm-protos.h (mve_struct_mem_operand): New
prototype.
      * config/arm/arm.cc (mve_struct_mem_operand): New function.
      * config/arm/constraints.md (Ug): New constraint.
      * config/arm/mve.md (mve_vst4q): Change constraint.
      (mve_vst2q): Likewise.
      (mve_vld4q): Likewise.
      (mve_vld2q): Likewise.
      * config/arm/predicates.md (mve_struct_operand): New predicate.

gcc/testsuite/ChangeLog:
      PR target/107714
      * gcc.target/arm/mve/intrinsics/vldst24q_reg_offset.c: New test.


diff --git a/gcc/config/arm/constraints.md b/gcc/config/arm/constraints.md
index 
e5a36d29c7135943b9bb5ea396f70e2e4beb1e4a..8908b7f5b15ce150685868e78e75280bf32053f1
 100644
--- a/gcc/config/arm/constraints.md
+++ b/gcc/config/arm/constraints.md
@@ -474,6 +474,12 @@
   (and (match_code "mem")
(match_test "TARGET_32BIT && arm_coproc_mem_operand (op, FALSE)")))
  
+(define_memory_constraint "Ug"

+ "@internal
+  In Thumb-2 state a valid MVE struct load/store address."
+ (and (match_code "mem")
+  (match_test "TARGET_HAVE_MVE && mve_struct_mem_operand (op)")))
+

I think you can define the constraints in terms of the new mve_struct_operand predicate 
directly (see how we define the "Ua" constraint, for example).
Ok if that works (and testing passes of course).


Done as discussed and re-tested on all branches. Pushed as:

4269a6567eb991e6838f40bda5be9e3a7972530c to trunk

25edc76f2afba0b4eaf22174d42de042a6969dbe to gcc-12

08842ad274f5e2630994f7c6e70b2d31768107ea to gcc-11

Thank you!
Stam



Thanks,
Kyrill



[PATCH] Fix memory constraint on MVE v[ld/st][2/4] instructions [PR107714]

2022-12-09 Thread Stam Markianos-Wright via Gcc-patches

Hi all,

In the M-Class Arm-ARM:

https://developer.arm.com/documentation/ddi0553/bu/?lang=en

these MVE instructions only have a '!' writeback variant and at:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107714

we found that the Um constraint would also allow through a
register offset writeback, resulting in an assembler error.

Here I have added a new constraint and predicate for these
instructions, which (uniquely, AFAICT), only support a `!` writeback
increment by the data size (inside the compiler this is a POST_INC).

No regressions in arm-none-eabi with MVE and MVE.FP.

Ok for trunk, and backport to GCC11 and GCC12 (testing pending)?

Thanks,
Stam

gcc/ChangeLog:
    PR target/107714
    * config/arm/arm-protos.h (mve_struct_mem_operand): New prototype.
    * config/arm/arm.cc (mve_struct_mem_operand): New function.
    * config/arm/constraints.md (Ug): New constraint.
    * config/arm/mve.md (mve_vst4q): Change constraint.
    (mve_vst2q): Likewise.
    (mve_vld4q): Likewise.
    (mve_vld2q): Likewise.
    * config/arm/predicates.md (mve_struct_operand): New predicate.

gcc/testsuite/ChangeLog:
    PR target/107714
    * gcc.target/arm/mve/intrinsics/vldst24q_reg_offset.c: New test.

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 550272facd12e60a49bf8a3b20f811cc13765b3a..8ea38118b05769bd6fcb1d22d902a50979cfd953 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -122,6 +122,7 @@ extern int arm_coproc_mem_operand_wb (rtx, int);
 extern int neon_vector_mem_operand (rtx, int, bool);
 extern int mve_vector_mem_operand (machine_mode, rtx, bool);
 extern int neon_struct_mem_operand (rtx);
+extern int mve_struct_mem_operand (rtx);
 
 extern rtx *neon_vcmla_lane_prepare_operands (rtx *);
 
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index b587561eebea921bdc68016922d37948e2870ce2..31f2a7b9d4688dde69d1435e24cf885e8544be71 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -13737,6 +13737,24 @@ neon_vector_mem_operand (rtx op, int type, bool strict)
   return FALSE;
 }
 
+/* Return TRUE if OP is a mem suitable for loading/storing an MVE struct
+   type.  */
+int
+mve_struct_mem_operand (rtx op)
+{
+  rtx ind = XEXP (op, 0);
+
+  /* Match: (mem (reg)).  */
+  if (REG_P (ind))
+    return arm_address_register_rtx_p (ind, 0);
+
+  /* Allow only post-increment by the mode size.  */
+  if (GET_CODE (ind) == POST_INC)
+    return arm_address_register_rtx_p (XEXP (ind, 0), 0);
+
+  return FALSE;
+}
+
 /* Return TRUE if OP is a mem suitable for loading/storing a Neon struct
type.  */
 int
diff --git a/gcc/config/arm/constraints.md b/gcc/config/arm/constraints.md
index e5a36d29c7135943b9bb5ea396f70e2e4beb1e4a..8908b7f5b15ce150685868e78e75280bf32053f1 100644
--- a/gcc/config/arm/constraints.md
+++ b/gcc/config/arm/constraints.md
@@ -474,6 +474,12 @@
  (and (match_code "mem")
   (match_test "TARGET_32BIT && arm_coproc_mem_operand (op, FALSE)")))
 
+(define_memory_constraint "Ug"
+ "@internal
+  In Thumb-2 state a valid MVE struct load/store address."
+ (and (match_code "mem")
+  (match_test "TARGET_HAVE_MVE && mve_struct_mem_operand (op)")))
+
 (define_memory_constraint "Uj"
  "@internal
   In ARM/Thumb-2 state a VFP load/store address that supports writeback
diff --git a/gcc/config/arm/mve.md b/gcc/config/arm/mve.md
index b5e6da4b1335818a3e8815de59850e845a2d0400..847bc032afa2c3977c05725562a14940beb282d4 100644
--- a/gcc/config/arm/mve.md
+++ b/gcc/config/arm/mve.md
@@ -99,7 +99,7 @@
 ;; [vst4q])
 ;;
 (define_insn "mve_vst4q"
-  [(set (match_operand:XI 0 "neon_struct_operand" "=Um")
+  [(set (match_operand:XI 0 "mve_struct_operand" "=Ug")
 	(unspec:XI [(match_operand:XI 1 "s_register_operand" "w")
 		(unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 	 VST4Q))
@@ -9959,7 +9959,7 @@
 ;; [vst2q])
 ;;
 (define_insn "mve_vst2q"
-  [(set (match_operand:OI 0 "neon_struct_operand" "=Um")
+  [(set (match_operand:OI 0 "mve_struct_operand" "=Ug")
 	(unspec:OI [(match_operand:OI 1 "s_register_operand" "w")
 		(unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 	 VST2Q))
@@ -9988,7 +9988,7 @@
 ;;
 (define_insn "mve_vld2q"
   [(set (match_operand:OI 0 "s_register_operand" "=w")
-	(unspec:OI [(match_operand:OI 1 "neon_struct_operand" "Um")
+	(unspec:OI [(match_operand:OI 1 "mve_struct_operand" "Ug")
 		(unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 	 VLD2Q))
   ]
@@ -10016,7 +10016,7 @@
 ;;
 (define_insn "mve_vld4q"
   [(set (match_operand:XI 0 "s_register_operand" "=w")
-	(unspec:XI [(match_operand:XI 1 "neon_struct_operand" "Um")
+	(unspec:XI [(match_operand:XI 1 "mve_struct_operand" "Ug")
 		(unspec:MVE_VLD_ST [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
 	 VLD4Q))
   ]
diff --git a/gcc/config/arm/predicates.md b/gcc/config/arm/predicates.md
index aab5a91ad4ddc6a7a02611d05442d6de63841a7c..67f2fdb4f8f607ceb50871e1bc17dbdb9b987c2c 100644
--- 

[PATCH] arm: Split up MVE _Generic associations to prevent type clashes [PR107515]

2022-12-01 Thread Stam Markianos-Wright via Gcc-patches

Hi all,

With these previous patches:
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606586.html
https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606587.html
we enabled the MVE overloaded _Generic associations to handle more
scalar types, however at PR 107515 we found a new regression that
wasn't detected in our testing:

With glibc's `posix/types.h`:
```
typedef signed int __int32_t;
...
typedef __int32_t int32_t;
```
We would get an `error: '_Generic' specifies two compatible types`
from `__ARM_mve_coerce3`, because when `type` is `int` the associations
`type: param` and `int32_t: param` name the same underlying type.

The same did not happen with Newlib's header `sys/_stdint.h`:
```
typedef long int __int32_t;
...
typedef __int32_t int32_t ;
```
which worked fine, because it uses `long int`.

The same could feasibly happen in `__ARM_mve_coerce2` between
`__fp16` and `float16_t`.

The solution here is to break the _Generic down, so that the similar
types don't appear at the same level, as is done in `__ARM_mve_typeid`.

Ok for trunk?

Thanks,
Stam Markianos-Wright

gcc/ChangeLog:
    PR target/96795
    PR target/107515
    * config/arm/arm_mve.h (__ARM_mve_coerce2): Split types.
    (__ARM_mve_coerce3): Likewise.

gcc/testsuite/ChangeLog:
    PR target/96795
    PR target/107515
    * 
gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c: New test.
    * 
gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-int.c: New test.



=== Inline Ctrl+C, Ctrl+V or patch ===

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 
09167ec118ed3310c5077145e119196f29d83cac..70003653db65736fcfd019e83d9f18153be650dc 
100644

--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -35659,9 +35659,9 @@ extern void *__ARM_undef;
 #define __ARM_mve_coerce1(param, type) \
 _Generic(param, type: param, const type: param, default: *(type 
*)__ARM_undef)

 #define __ARM_mve_coerce2(param, type) \
-    _Generic(param, type: param, float16_t: param, float32_t: param, 
default: *(type *)__ARM_undef)
+    _Generic(param, type: param, __fp16: param, default: _Generic 
(param, _Float16: param, float16_t: param, float32_t: param, default: 
*(type *)__ARM_undef))

 #define __ARM_mve_coerce3(param, type) \
-    _Generic(param, type: param, int8_t: param, int16_t: param, 
int32_t: param, int64_t: param, uint8_t: param, uint16_t: param, 
uint32_t: param, uint64_t: param, default: *(type *)__ARM_undef)
+    _Generic(param, type: param, default: _Generic (param, int8_t: 
param, int16_t: param, int32_t: param, int64_t: param, uint8_t: param, 
uint16_t: param, uint32_t: param, uint64_t: param, default: *(type 
*)__ARM_undef))


 #if (__ARM_FEATURE_MVE & 2) /* MVE Floating point.  */

diff --git 
a/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c

new file mode 100644
index 
..427dcacb5ff59b53d5eab1f1582ef6460da3f2f3

--- /dev/null
+++ 
b/gcc/testsuite/gcc.target/arm/mve/intrinsics/mve_intrinsic_type_overloads-fp.c

@@ -0,0 +1,65 @@
+/* { dg-require-effective-target arm_v8_1m_mve_fp_ok } */
+/* { dg-add-options arm_v8_1m_mve_fp } */
+/* { dg-additional-options "-O2 -Wno-pedantic -Wno-long-long" } */
+#include "arm_mve.h"
+
+float f1;
+double f2;
+float16_t f3;
+float32_t f4;
+__fp16 f5;
+_Float16 f6;
+
+int i1;
+short i2;
+long i3;
+long long i4;
+int8_t i5;
+int16_t i6;
+int32_t i7;
+int64_t i8;
+
+const int ci1;
+const short ci2;
+const long ci3;
+const long long ci4;
+const int8_t ci5;
+const int16_t ci6;
+const int32_t ci7;
+const int64_t ci8;
+
+float16x8_t floatvec;
+int16x8_t intvec;
+
+void test(void)
+{
+    /* Test a few different supported ways of passing an int value.  The
+    intrinsic vmulq was chosen arbitrarily, but it is representative of
+    all intrinsics that take a non-const scalar value.  */
+    intvec = vmulq(intvec, 2);
+    intvec = vmulq(intvec, (int32_t) 2);
+    intvec = vmulq(intvec, (short) 2);
+    intvec = vmulq(intvec, i1);
+    intvec = vmulq(intvec, i2);
+    intvec = vmulq(intvec, i3);
+    intvec = vmulq(intvec, i4);
+    intvec = vmulq(intvec, i5);
+    intvec = vmulq(intvec, i6);
+    intvec = vmulq(intvec, i7);
+    intvec = vmulq(intvec, i8);
+
+    /* Test a few different supported ways of passing a float value.  */
+    floatvec = vmulq(floatvec, 0.5);
+    floatvec = vmulq(floatvec, 0.5f);
+    floatvec = vmulq(floatvec, (__fp16) 0.5);
+    floatvec = vmulq(floatvec, f1);
+    floatvec = vmulq(floatvec, f2);
+    floatvec = vmulq(floatvec, f3);
+    floatvec = vmulq(floatvec, f4);
+    floatvec = vmulq(floatvec, f5);
+    floatvec = vmulq(floatvec, f6);
+    floatvec = vmulq(floatvec, 0.15f16);
+    floatvec = vmulq(floatvec, (_Float16) 0.15);
+}
+
+/* { dg-final { scan-assembler-not "__ARM_undef" } } */
\ No newline at end of file
diff --git 

[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2022-11-28 Thread Stam Markianos-Wright via Gcc-patches


On 11/15/22 15:51, Andre Vieira (lists) wrote:


On 11/11/2022 17:40, Stam Markianos-Wright via Gcc-patches wrote:

Hi all,

This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
   decrement numbers other than -1 and for `count` to be an
   rtx containing the number of elements to be processed, rather
   than an expression for calculating the number of iterations.
2) Add an `allow_elementwise_doloop` target hook. This allows the
   target backend to manipulate the iteration count as it needs:
   in our case to change it from a pre-calculation of the number
   of iterations to the number of elements to be processed.
3) The doloop_end target-insn now has an additional parameter:
   the `count` (note: this is before it gets modified to just be
   the number of elements), so that the decrement value is
   extracted from that parameter.

And many things in the backend to implement the above optimisation:

4)  Appropriate changes to the define_expand of doloop_end and new
    patterns for dlstp and letp.
5) `arm_attempt_dlstp_transform`: (called from the define_expand of
    doloop_end) this function checks for the loop's suitability for
    dlstp/letp transformation and then implements it, if possible.
6) `arm_mve_get_loop_unique_vctp`: A function that loops through
    the loop contents and returns the vctp VPR-generating operation
    within the loop, if it is unique and there is exclusively one
    vctp within the loop.
7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
   from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
   to check an insn to see if it requires the VPR or not.

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

    * config/aarch64/aarch64.md: Add extra doloop_end arg.
    * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
    * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
    (arm_mve_get_vctp_lanes): New.
    (arm_get_required_vpr_reg): New.
    (arm_mve_get_loop_unique_vctp): New.
    (arm_attempt_dlstp_transform): New.
    (arm_allow_elementwise_doloop): New.
    * config/arm/iterators.md:
    * config/arm/mve.md (*predicated_doloop_end_internal): New.
    (dlstp_insn): New.
    * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
    * config/arm/unspecs.md: New unspecs.
    * config/ia64/ia64.md: Add extra doloop_end arg.
    * config/pru/pru.md: Add extra doloop_end arg.
    * config/rs6000/rs6000.md: Add extra doloop_end arg.
    * config/s390/s390.md: Add extra doloop_end arg.
    * config/v850/v850.md: Add extra doloop_end arg.
    * doc/tm.texi: Document new hook.
    * doc/tm.texi.in: Likewise.
    * loop-doloop.cc (doloop_condition_get): Relax conditions.
    (doloop_optimize): Add support for elementwise LoLs.
    * target-insns.def (doloop_end): Add extra arg.
    * target.def (allow_elementwise_doloop): New hook.
    * targhooks.cc (default_allow_elementwise_doloop): New.
    * targhooks.h (default_allow_elementwise_doloop): New.

gcc/testsuite/ChangeLog:

    * gcc.target/arm/lob.h: Update framework.
    * gcc.target/arm/lob1.c: Likewise.
    * gcc.target/arm/lob6.c: Likewise.
    * gcc.target/arm/dlstp-int16x8.c: New test.
    * gcc.target/arm/dlstp-int32x4.c: New test.
    * gcc.target/arm/dlstp-int64x2.c: New test.
    * gcc.target/arm/dlstp-int8x16.c: New test.


### Inline copy of patch ###

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -7366,7 +7366,8 @@
 ;; knows what to generate.
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))  ; loop pseudo
-   (use (match_operand 1 "" ""))] ; label
+   (use (match_operand 1 "" ""))  ; label
+   (use (match_operand 2 "" ""))] ; decrement constant
   "optimize > 0 && flag_modulo_sched"
 {
   rtx s0;
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);

 extern bool arm_q_bit_acce

[PATCH 15/35] arm: Explicitly specify other float types for _Generic overloading [PR107515]

2022-11-21 Thread Stam Markianos-Wright via Gcc-patches



On 11/20/22 22:49, Ramana Radhakrishnan wrote:

On Fri, Nov 18, 2022 at 4:59 PM Kyrylo Tkachov via Gcc-patches
 wrote:




-Original Message-
From: Andrea Corallo 
Sent: Thursday, November 17, 2022 4:38 PM
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov; Richard Earnshaw; Stam Markianos-Wright
Subject: [PATCH 15/35] arm: Explicitly specify other float types for _Generic
overloading [PR107515]

From: Stam Markianos-Wright 

This patch adds explicit references to other float types
to __ARM_mve_typeid in arm_mve.h.  Resolves PR 107515:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107515
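
For readers unfamiliar with the mechanism, here is a minimal sketch of why a missing `_Generic` association matters. It is not the arm_mve.h code: the enum and macro names are hypothetical, and `_Float16`/`__fp16` are left out because they are not available on every host. It only shows that a `float` argument falls through to the default association until `float` is listed explicitly.

```c
/* Hypothetical type classes, loosely modelled on the
   __ARM_mve_type_*_n enumerators; the names are illustrative only.  */
enum type_class { CLASS_INT_N, CLASS_FP_N, CLASS_UNDEF };

/* Before: double is the only floating-point scalar listed, so a
   float argument selects the default association.  */
#define TYPEID_BEFORE(x) _Generic ((x), \
    short: CLASS_INT_N,                 \
    int: CLASS_INT_N,                   \
    long: CLASS_INT_N,                  \
    double: CLASS_FP_N,                 \
    default: CLASS_UNDEF)

/* After: float has its own association.  */
#define TYPEID_AFTER(x) _Generic ((x),  \
    short: CLASS_INT_N,                 \
    int: CLASS_INT_N,                   \
    long: CLASS_INT_N,                  \
    float: CLASS_FP_N,                  \
    double: CLASS_FP_N,                 \
    default: CLASS_UNDEF)

/* Small wrappers so the selection can be checked at run time.  */
static int
classify_float_before (float f)
{
  return TYPEID_BEFORE (f);
}

static int
classify_float_after (float f)
{
  return TYPEID_AFTER (f);
}
```

No implicit conversion from `float` to `double` takes place in the controlling expression, which is why each type has to be spelled out.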

gcc/ChangeLog:
 PR 107515
 * config/arm/arm_mve.h (__ARM_mve_typeid): Add float types.

Argh, I'm looking forward to when we move away from this _Generic business, but 
for now ok.
The ChangeLog should say "PR target/107515" for the git hook to recognize it 
IIRC.

and the PR is against 11.x - is there a plan to back port this and
dependent patches to relevant branches ?


Hi Ramana!


Assuming maintainer approval, we do hope to backport.

And yes, it would have to be the whole patch series, so that we carry
over all the improved testing, as well (and we'll have to run it ofc).


Does that sound Ok?

Thank you,

Stam




Ramana


Thanks,
Kyrill


---
  gcc/config/arm/arm_mve.h | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index fd1876b57a0..f6b42dc3fab 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -35582,6 +35582,9 @@ enum {
   short: __ARM_mve_type_int_n, \
   int: __ARM_mve_type_int_n, \
   long: __ARM_mve_type_int_n, \
+ _Float16: __ARM_mve_type_fp_n, \
+ __fp16: __ARM_mve_type_fp_n, \
+ float: __ARM_mve_type_fp_n, \
   double: __ARM_mve_type_fp_n, \
   long long: __ARM_mve_type_int_n, \
   unsigned char: __ARM_mve_type_int_n, \
--
2.25.1


Re: [PATCH 15/35] arm: Explicitly specify other float types for _Generic overloading [PR107515]

2022-11-21 Thread Stam Markianos-Wright via Gcc-patches



On 11/18/22 16:58, Kyrylo Tkachov wrote:



-Original Message-
From: Andrea Corallo 
Sent: Thursday, November 17, 2022 4:38 PM
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov; Richard Earnshaw; Stam Markianos-Wright
Subject: [PATCH 15/35] arm: Explicitly specify other float types for _Generic
overloading [PR107515]

From: Stam Markianos-Wright 

This patch adds explicit references to other float types
to __ARM_mve_typeid in arm_mve.h.  Resolves PR 107515:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107515

gcc/ChangeLog:
 PR 107515
 * config/arm/arm_mve.h (__ARM_mve_typeid): Add float types.

Argh, I'm looking forward to when we move away from this _Generic business, but 
for now ok.

Oh we all are ;)

The ChangeLog should say "PR target/107515" for the git hook to recognize it 
IIRC.


Agh, thanks for spotting this! Will change and push it with the rest of 
the patch series when ready.


Thank you,

Stam



Thanks,
Kyrill


---
  gcc/config/arm/arm_mve.h | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index fd1876b57a0..f6b42dc3fab 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -35582,6 +35582,9 @@ enum {
   short: __ARM_mve_type_int_n, \
   int: __ARM_mve_type_int_n, \
   long: __ARM_mve_type_int_n, \
+ _Float16: __ARM_mve_type_fp_n, \
+ __fp16: __ARM_mve_type_fp_n, \
+ float: __ARM_mve_type_fp_n, \
   double: __ARM_mve_type_fp_n, \
   long long: __ARM_mve_type_int_n, \
   unsigned char: __ARM_mve_type_int_n, \
--
2.25.1


Re: [PATCH 13/35] arm: further fix overloading of MVE vaddq[_m]_n intrinsic

2022-11-21 Thread Stam Markianos-Wright via Gcc-patches



On 11/18/22 16:49, Kyrylo Tkachov wrote:



-Original Message-
From: Andrea Corallo 
Sent: Thursday, November 17, 2022 4:38 PM
To: gcc-patches@gcc.gnu.org
Cc: Kyrylo Tkachov; Richard Earnshaw; Stam Markianos-Wright
Subject: [PATCH 13/35] arm: further fix overloading of MVE vaddq[_m]_n
intrinsic

From: Stam Markianos-Wright 

It was observed that in tests `vaddq_m_n_[s/u][8/16/32].c`, the _Generic
resolution would fall back to the `__ARM_undef` failure state.

This is a regression since `dc39db873670bea8d8e655444387ceaa53a01a79`
and
`6bd4ce64eb48a72eca300cb52773e6101d646004`, but it previously wasn't
identified, because the tests were not checking for this kind of failure.

The above commits changed the definitions of the intrinsics from using
`[u]int[8/16/32]_t` types for the scalar argument to using `int`. This
allowed `int` to be supported in user code through the overloaded
`#defines`, but seems to have broken the `[u]int[8/16/32]_t` types.

The solution implemented by this patch is to explicitly use a new
_Generic mapping from all the `[u]int[8/16/32]_t` types for int. With this
change, both `int` and `[u]int[8/16/32]_t` parameters are supported from
user code and are handled by the overloading mechanism correctly.
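
As a rough illustration of the failure mode and of the fix (a sketch only, not the `__ARM_mve_coerce3` definition; both macro names below are made up), a `_Generic` that lists only `int` rejects an `int8_t` variable, because the controlling expression is not integer-promoted. Listing the narrower types explicitly accepts both:

```c
#include <stdint.h>

/* Only `int` is associated, so an int8_t or uint16_t variable selects
   the default branch (the "undef" failure state in this sketch).  */
#define ACCEPTS_INT_ONLY(x) _Generic ((x), int: 1, default: 0)

/* Every standard integer type up to long is mapped explicitly, so
   both `int` literals and fixed-width variables are accepted.  */
#define ACCEPTS_INT_LIKE(x) _Generic ((x),        \
    char: 1, signed char: 1, unsigned char: 1,    \
    short: 1, unsigned short: 1,                  \
    int: 1, unsigned int: 1,                      \
    long: 1, unsigned long: 1,                    \
    default: 0)

/* Wrappers taking a fixed-width scalar, as user code would pass it.  */
static int
int8_accepted_before (int8_t b)
{
  return ACCEPTS_INT_ONLY (b);
}

static int
int8_accepted_after (int8_t b)
{
  return ACCEPTS_INT_LIKE (b);
}
```

Fixed-width typedefs are deliberately not used as association labels here: on some hosts `int32_t` is `int` and on others `long`, and listing the same type twice in one `_Generic` is a constraint violation.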

gcc/ChangeLog:

 * config/arm/arm_mve.h (__arm_vaddq_m_n_s8): Change types.
 (__arm_vaddq_m_n_s32): Likewise.
 (__arm_vaddq_m_n_s16): Likewise.
 (__arm_vaddq_m_n_u8): Likewise.
 (__arm_vaddq_m_n_u32): Likewise.
 (__arm_vaddq_m_n_u16): Likewise.
 (__arm_vaddq_m): Fix Overloading.
 (__ARM_mve_coerce3): New.

Ok. Wasn't there a PR in Bugzilla about this that we can cite in the commit 
message?
Thanks,
Kyrill


Thanks for the review! Ah yes, there was this one:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96795

which was closed last time around.
It does make sense to add it, though, so we'll do that.

Thanks!




---
  gcc/config/arm/arm_mve.h | 78 
  1 file changed, 40 insertions(+), 38 deletions(-)

diff --git a/gcc/config/arm/arm_mve.h b/gcc/config/arm/arm_mve.h
index 684f997520f..951dc25374b 100644
--- a/gcc/config/arm/arm_mve.h
+++ b/gcc/config/arm/arm_mve.h
@@ -9675,42 +9675,42 @@ __arm_vabdq_m_u16 (uint16x8_t __inactive,
uint16x8_t __a, uint16x8_t __b, mve_pr

  __extension__ extern __inline int8x16_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-__arm_vaddq_m_n_s8 (int8x16_t __inactive, int8x16_t __a, int __b,
mve_pred16_t __p)
+__arm_vaddq_m_n_s8 (int8x16_t __inactive, int8x16_t __a, int8_t __b,
mve_pred16_t __p)
  {
return __builtin_mve_vaddq_m_n_sv16qi (__inactive, __a, __b, __p);
  }

  __extension__ extern __inline int32x4_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-__arm_vaddq_m_n_s32 (int32x4_t __inactive, int32x4_t __a, int __b,
mve_pred16_t __p)
+__arm_vaddq_m_n_s32 (int32x4_t __inactive, int32x4_t __a, int32_t __b,
mve_pred16_t __p)
  {
return __builtin_mve_vaddq_m_n_sv4si (__inactive, __a, __b, __p);
  }

  __extension__ extern __inline int16x8_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-__arm_vaddq_m_n_s16 (int16x8_t __inactive, int16x8_t __a, int __b,
mve_pred16_t __p)
+__arm_vaddq_m_n_s16 (int16x8_t __inactive, int16x8_t __a, int16_t __b,
mve_pred16_t __p)
  {
return __builtin_mve_vaddq_m_n_sv8hi (__inactive, __a, __b, __p);
  }

  __extension__ extern __inline uint8x16_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-__arm_vaddq_m_n_u8 (uint8x16_t __inactive, uint8x16_t __a, int __b,
mve_pred16_t __p)
+__arm_vaddq_m_n_u8 (uint8x16_t __inactive, uint8x16_t __a, uint8_t __b,
mve_pred16_t __p)
  {
return __builtin_mve_vaddq_m_n_uv16qi (__inactive, __a, __b, __p);
  }

  __extension__ extern __inline uint32x4_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-__arm_vaddq_m_n_u32 (uint32x4_t __inactive, uint32x4_t __a, int __b,
mve_pred16_t __p)
+__arm_vaddq_m_n_u32 (uint32x4_t __inactive, uint32x4_t __a, uint32_t
__b, mve_pred16_t __p)
  {
return __builtin_mve_vaddq_m_n_uv4si (__inactive, __a, __b, __p);
  }

  __extension__ extern __inline uint16x8_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-__arm_vaddq_m_n_u16 (uint16x8_t __inactive, uint16x8_t __a, int __b,
mve_pred16_t __p)
+__arm_vaddq_m_n_u16 (uint16x8_t __inactive, uint16x8_t __a, uint16_t
__b, mve_pred16_t __p)
  {
return __builtin_mve_vaddq_m_n_uv8hi (__inactive, __a, __b, __p);
  }
@@ -26417,42 +26417,42 @@ __arm_vabdq_m (uint16x8_t __inactive,
uint16x8_t __a, uint16x8_t __b, mve_pred16

  __extension__ extern __inline int8x16_t
  __attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
-__arm_vaddq_m (int8x16_t __inactive, int8x16_t __a, int __b,
mve_pred16_t __p)
+__arm_vaddq_m (int8x16_t __inactive, int8x16_t __a, int8_t __b,
mve_pred16_t __p)
  {
   return __arm_vaddq_m_n_s8 (__inactive, __a, __b, __p);
  

[PATCH 2/2] arm: Add support for MVE Tail-Predicated Low Overhead Loops

2022-11-11 Thread Stam Markianos-Wright via Gcc-patches

Hi all,

This is the 2/2 patch that contains the functional changes needed
for MVE Tail Predicated Low Overhead Loops.  See my previous email
for a general introduction of MVE LOLs.

This support is added through the already existing loop-doloop
mechanisms that are used for non-MVE dls/le looping.

Changes are:

1) Relax the loop-doloop mechanism in the mid-end to allow for
   decrement numbers other than -1 and for `count` to be an
   rtx containing the number of elements to be processed, rather
   than an expression for calculating the number of iterations.
2) Add an `allow_elementwise_doloop` target hook. This allows the
   target backend to manipulate the iteration count as it needs:
   in our case to change it from a pre-calculation of the number
   of iterations to the number of elements to be processed.
3) The doloop_end target-insn now has an additional parameter:
   the `count` (note: this is before it gets modified to just be
   the number of elements), so that the decrement value is
   extracted from that parameter.

And many things in the backend to implement the above optimisation:

4) Appropriate changes to the define_expand of doloop_end and new
   patterns for dlstp and letp.
5) `arm_attempt_dlstp_transform`: (called from the define_expand of
   doloop_end) this function checks for the loop's suitability for
   dlstp/letp transformation and then implements it, if possible.
6) `arm_mve_get_loop_unique_vctp`: A function that loops through
   the loop contents and returns the vctp VPR-generating operation
   within the loop, if it is unique and there is exclusively one
   vctp within the loop.
7) A couple of utility functions: `arm_mve_get_vctp_lanes` to map
   from vctp unspecs to number of lanes, and `arm_get_required_vpr_reg`
   to check an insn to see if it requires the VPR or not.
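
To make the difference between the two counting schemes in points 1-3 concrete, here is a scalar model in plain C. It is only a sketch: the function names are invented, `lanes` stands in for the vctp lane count and must be non-zero, and the inner `for` loop stands in for one predicated vector operation.

```c
#include <stddef.h>
#include <stdint.h>

/* Conventional low-overhead loop: the counter holds the number of
   iterations, pre-computed as ceil (n / lanes), and is decremented
   by 1.  The tail has to be predicated separately (modelled by
   clamping ACTIVE).  */
static int32_t
sum_iteration_counted (const int32_t *a, size_t n, size_t lanes)
{
  int32_t sum = 0;
  size_t done = 0;
  for (size_t count = (n + lanes - 1) / lanes; count != 0; count -= 1)
    {
      size_t active = (n - done < lanes) ? n - done : lanes;
      for (size_t i = 0; i < active; i++)
        sum += a[done + i];
      done += active;
    }
  return sum;
}

/* Tail-predicated form: the counter holds the number of elements
   left and goes down by the lane count each iteration.  The clamp
   below models the implicit predication of the last iteration.  */
static int32_t
sum_element_counted (const int32_t *a, size_t n, size_t lanes)
{
  int32_t sum = 0;
  size_t count = n;
  while (count != 0)
    {
      size_t active = count < lanes ? count : lanes;
      for (size_t i = 0; i < active; i++)
        sum += a[i];
      a += active;
      count -= active;
    }
  return sum;
}
```

Both functions return the same result for any `n`; the second one never needs the iteration count to be computed up front, which is the property the relaxed doloop conditions are meant to allow.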

No regressions on arm-none-eabi with various targets and on
aarch64-none-elf. Thoughts on getting this into trunk?

Thank you,
Stam Markianos-Wright

gcc/ChangeLog:

    * config/aarch64/aarch64.md: Add extra doloop_end arg.
    * config/arm/arm-protos.h (arm_attempt_dlstp_transform): New.
    * config/arm/arm.cc (TARGET_ALLOW_ELEMENTWISE_DOLOOP): New.
    (arm_mve_get_vctp_lanes): New.
    (arm_get_required_vpr_reg): New.
    (arm_mve_get_loop_unique_vctp): New.
    (arm_attempt_dlstp_transform): New.
    (arm_allow_elementwise_doloop): New.
    * config/arm/iterators.md:
    * config/arm/mve.md (*predicated_doloop_end_internal): New.
    (dlstp_insn): New.
    * config/arm/thumb2.md (doloop_end): Update for MVE LOLs.
    * config/arm/unspecs.md: New unspecs.
    * config/ia64/ia64.md: Add extra doloop_end arg.
    * config/pru/pru.md: Add extra doloop_end arg.
    * config/rs6000/rs6000.md: Add extra doloop_end arg.
    * config/s390/s390.md: Add extra doloop_end arg.
    * config/v850/v850.md: Add extra doloop_end arg.
    * doc/tm.texi: Document new hook.
    * doc/tm.texi.in: Likewise.
    * loop-doloop.cc (doloop_condition_get): Relax conditions.
    (doloop_optimize): Add support for elementwise LoLs.
    * target-insns.def (doloop_end): Add extra arg.
    * target.def (allow_elementwise_doloop): New hook.
    * targhooks.cc (default_allow_elementwise_doloop): New.
    * targhooks.h (default_allow_elementwise_doloop): New.

gcc/testsuite/ChangeLog:

    * gcc.target/arm/lob.h: Update framework.
    * gcc.target/arm/lob1.c: Likewise.
    * gcc.target/arm/lob6.c: Likewise.
    * gcc.target/arm/dlstp-int16x8.c: New test.
    * gcc.target/arm/dlstp-int32x4.c: New test.
    * gcc.target/arm/dlstp-int64x2.c: New test.
    * gcc.target/arm/dlstp-int8x16.c: New test.


### Inline copy of patch ###

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f2e3d905dbbeb2949f2947f5cfd68208c94c9272..7a6d24a80060b4a704a481ccd1a32d96e7b0f369 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -7366,7 +7366,8 @@
 ;; knows what to generate.
 (define_expand "doloop_end"
   [(use (match_operand 0 "" ""))  ; loop pseudo
-   (use (match_operand 1 "" ""))] ; label
+   (use (match_operand 1 "" ""))  ; label
+   (use (match_operand 2 "" ""))] ; decrement constant
   "optimize > 0 && flag_modulo_sched"
 {
   rtx s0;
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 550272facd12e60a49bf8a3b20f811cc13765b3a..7684620f0f4d161dd9e9ad2d70308021ec3d3d34 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -63,7 +63,7 @@ extern void arm_decompose_di_binop (rtx, rtx, rtx *, rtx *, rtx *, rtx *);

 extern bool arm_q_bit_access (void);
 extern bool arm_ge_bits_access (void);
 extern bool arm_target_insn_ok_for_lob (rtx);
-
+extern rtx arm_attempt_dlstp_transform (rtx, rtx);
 #ifdef RTX_CODE
 enum reg_class
 arm_mode_base_reg_class (machine_mode);
diff --git 

[PATCH] slp tree vectorizer: Re-calculate vectorization factor in the case of invalid choices [PR96974]

2021-03-31 Thread Stam Markianos-Wright via Gcc-patches

On 29/03/2021 10:20, Richard Biener wrote:

On Fri, 26 Mar 2021, Richard Sandiford wrote:


Richard Biener  writes:

On Wed, 24 Mar 2021, Stam Markianos-Wright wrote:


Hi all,

This patch resolves bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974

This is achieved by forcing a re-calculation of *stmt_vectype_out if an
incompatible combination of TYPE_VECTOR_SUBPARTS is detected, but with an
extra introduced max_nunits ceiling.

I am not 100% sure if this is the best way to go about fixing this, because
this is my first look at the vectorizer and I lack knowledge of the wider
context, so do let me know if you see a better way to do this!

I have added the previously ICE-ing reproducer as a new test.

This is compiled as "g++ -Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" for
GCC11 and "g++ -Ofast -march=armv8.2-a+sve" for GCC10.

(the non-fdisable-tree-fre4 version has gone latent on GCC11)

Bootstrapped and reg-tested on aarch64-linux-gnu.
Also reg-tested on aarch64-none-elf.


I don't think this is going to work well given uses will expect
a vector type that's consistent here.

I think giving up is for the moment the best choice, thus replacing
the assert with vectorization failure.

In the end we shouldn't require those nunits vectypes to be
separately computed - we compute the vector type of the defs
anyway and in case they're invariant the vectorizable_* function
either can deal with the type mix or not anyway.


I agree this area needs simplification, but I think the direction of
travel should be to make the assert valid.  I agree this is probably
the pragmatic fix for GCC 11 and earlier though.


The issue is that we compute a vector type for a use that may differ
from what we'd compute for it in the context of its definition (or
in the context of another use).  Any such "local" decision is likely
flawed and I'd rather simplify further doing the only decision on
the definition side - if there's a disconnect between the number
of lanes (and thus altering the VF won't help) then we have to give
up anyway.

Richard.



Thank you both for the further info! Would it be fair to close the 
initial PR regarding the ICE 
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974) and then open a 
second one at a lower priority level to address these further improvements?


Also Christophe has kindly found out that the test FAILs in ILP32, so it 
would be great to get that one in asap, too! 
https://gcc.gnu.org/pipermail/gcc-patches/2021-March/567431.html


Cheers,
Stam



Re: [PATCH] slp tree vectorizer: Re-calculate vectorization factor in the case of invalid choices [PR96974]

2021-03-25 Thread Stam Markianos-Wright via Gcc-patches

On 24/03/2021 13:46, Richard Biener wrote:

On Wed, 24 Mar 2021, Stam Markianos-Wright wrote:


Hi all,

This patch resolves bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974

This is achieved by forcing a re-calculation of *stmt_vectype_out if an
incompatible combination of TYPE_VECTOR_SUBPARTS is detected, but with an
extra introduced max_nunits ceiling.

I am not 100% sure if this is the best way to go about fixing this, because
this is my first look at the vectorizer and I lack knowledge of the wider
context, so do let me know if you see a better way to do this!

I have added the previously ICE-ing reproducer as a new test.

This is compiled as "g++ -Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" for
GCC11 and "g++ -Ofast -march=armv8.2-a+sve" for GCC10.

(the non-fdisable-tree-fre4 version has gone latent on GCC11)

Bootstrapped and reg-tested on aarch64-linux-gnu.
Also reg-tested on aarch64-none-elf.


I don't think this is going to work well given uses will expect
a vector type that's consistent here.

I think giving up is for the moment the best choice, thus replacing
the assert with vectorization failure.

In the end we shouldn't require those nunits vectypes to be
separately computed - we compute the vector type of the defs
anyway and in case they're invariant the vectorizable_* function
either can deal with the type mix or not anyway.



Yea good point! I agree and after all we are very close to releases now ;)

I've attached the patch that just does the graceful vectorization failure 
and adds a slightly better test now. Re-tested as previously with no 
issues ofc.
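
In outline, the change turns a hard assertion into a reported failure. A plain-integer sketch of that shape follows; the real code uses `multiple_p` on poly-int lane counts and `opt_result::failure_at`, while the helper names and the message buffer here are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdio.h>

/* Integer stand-in for multiple_p: true if A is a multiple of B.  */
static bool
is_multiple (unsigned a, unsigned b)
{
  return b != 0 && a % b == 0;
}

/* Return true if the two lane counts are compatible.  Otherwise write
   a diagnostic into MSG and return false, instead of asserting.  */
static bool
check_subparts (unsigned nunits_lanes, unsigned stmt_lanes,
                char *msg, size_t msg_size)
{
  if (!is_multiple (nunits_lanes, stmt_lanes))
    {
      snprintf (msg, msg_size,
                "Not vectorized: Incompatible number of vector "
                "subparts between %u and %u",
                nunits_lanes, stmt_lanes);
      return false;
    }
  return true;
}
```

The caller can then abandon vectorization of the statement rather than hitting an ICE.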


gcc-10.patch is what I'd backport to GCC10 (the only difference between 
that and gcc-11.patch is that one compiles with `-fdisable-tree-fre4` 
and the other without it).


Ok to push this to the GCC11 branch and backport to the GCC10 branch?

Cheers :D
Stam


That said, the goal should be to simplify things here.

Richard.



gcc/ChangeLog:

 * tree-vect-stmts.c (get_vectype_for_scalar_type): Add new
 parameter to core function and add new function overload.
 (vect_get_vector_types_for_stmt): Add re-calculation logic.

gcc/testsuite/ChangeLog:

 * g++.target/aarch64/sve/pr96974.C: New test.





diff --git a/gcc/testsuite/g++.target/aarch64/sve/pr96974.C b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C
new file mode 100644
index 000..363241d18df
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=armv8.2-a+sve -fdisable-tree-fre4 -fdump-tree-slp-details" } */
+
+float a;
+int
+b ()
+{ return __builtin_lrintf(a); }
+
+struct c {
+  float d;
+c() {
+  for (int e = 0; e < 9; e++)
+	coeffs[e] = d ? b() : 0;
+}
+int coeffs[10];
+} f;
+
+/* { dg-final { scan-tree-dump "Not vectorized: Incompatible number of vector subparts between" "slp1" } } */
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index d791d3a4720..4c01e82ff39 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -12148,8 +12148,12 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
 	}
 }
 
-  gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype),
-			  TYPE_VECTOR_SUBPARTS (*stmt_vectype_out)));
+  if (!multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype),
+		   TYPE_VECTOR_SUBPARTS (*stmt_vectype_out)))
+return opt_result::failure_at (stmt,
+   "Not vectorized: Incompatible number "
+   "of vector subparts between %T and %T\n",
+   nunits_vectype, *stmt_vectype_out);
 
   if (dump_enabled_p ())
 {
diff --git a/gcc/testsuite/g++.target/aarch64/sve/pr96974.C b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C
new file mode 100644
index 000..2023c55e3e6
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=armv8.2-a+sve -fdump-tree-slp-details" } */
+
+float a;
+int
+b ()
+{ return __builtin_lrintf(a); }
+
+struct c {
+  float d;
+c() {
+  for (int e = 0; e < 9; e++)
+	coeffs[e] = d ? b() : 0;
+}
+int coeffs[10];
+} f;
+
+/* { dg-final { scan-tree-dump "Not vectorized: Incompatible number of vector subparts between" "slp1" } } */
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index c2d1f39fe0f..6418edb5204 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -12249,8 +12249,12 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info,
 	}
 }
 
-  gcc_assert (multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype),
-			  TYPE_VECTOR_SUBPARTS (*stmt_vectype_out)));
+  if (!multiple_p (TYPE_VECTOR_SUBPARTS (nunits_vectype),
+		   TYPE_VECTOR_SUBPARTS (*stmt_vectype_out)))
+return opt_result::failure_at (stmt,
+   "Not vectorized: Incompatible number "
+   "of vector subparts between %T and %T\n",
+   nunits_vectype, *stmt_vectype_out);
 
   if (dump_enabled_p ())
 {


[PATCH] slp tree vectorizer: Re-calculate vectorization factor in the case of invalid choices [PR96974]

2021-03-24 Thread Stam Markianos-Wright via Gcc-patches

Hi all,

This patch resolves bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96974

This is achieved by forcing a re-calculation of *stmt_vectype_out if an
incompatible combination of TYPE_VECTOR_SUBPARTS is detected, but with 
an extra introduced max_nunits ceiling.
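
A simplified picture of what the ceiling does: the search starts from the largest power of two that fits within the ceiling and halves until the target accepts a lane count. This is a sketch under assumptions, not the tree-vect-stmts.c code: `supports_up_to_4` is a made-up target predicate, and the loop bounds only approximate the real do/while loop.

```c
#include <stdbool.h>

/* floor (log2 (x)) for x > 0; returns 0 for x == 0 as well.  */
static unsigned
floor_log2_u (unsigned x)
{
  unsigned r = 0;
  while (x >>= 1)
    r++;
  return r;
}

/* Hypothetical target: vectors of at most 4 lanes are supported.  */
static bool
supports_up_to_4 (unsigned nunits)
{
  return nunits <= 4;
}

/* Start at the largest power of two not exceeding MAX_NUNITS and
   halve until SUPPORTED accepts the lane count.  Returns 0 if no
   lane count greater than 1 is accepted.  */
static unsigned
pick_nunits (unsigned max_nunits, bool (*supported) (unsigned))
{
  unsigned nunits = 1u << floor_log2_u (max_nunits);
  while (nunits > 1)
    {
      if (supported (nunits))
        return nunits;
      nunits >>= 1;
    }
  return 0;
}
```

Passing the group size as the ceiling reproduces the behaviour before the patch; passing a smaller ceiling taken from `nunits_vectype` is the re-calculation described above.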


I am not 100% sure if this is the best way to go about fixing this, 
because this is my first look at the vectorizer and I lack knowledge of 
the wider context, so do let me know if you see a better way to do this!


I have added the previously ICE-ing reproducer as a new test.

This is compiled as "g++ -Ofast -march=armv8.2-a+sve 
-fdisable-tree-fre4" for GCC11 and "g++ -Ofast -march=armv8.2-a+sve" for 
GCC10.


(the non-fdisable-tree-fre4 version has gone latent on GCC11)

Bootstrapped and reg-tested on aarch64-linux-gnu.
Also reg-tested on aarch64-none-elf.


gcc/ChangeLog:

* tree-vect-stmts.c (get_vectype_for_scalar_type): Add new
parameter to core function and add new function overload.
(vect_get_vector_types_for_stmt): Add re-calculation logic.

gcc/testsuite/ChangeLog:

* g++.target/aarch64/sve/pr96974.C: New test.
diff --git a/gcc/testsuite/g++.target/aarch64/sve/pr96974.C b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C
new file mode 100644
index ..2f6ebd6ce3dd8626f5e666edba77d2c925739b7d
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/sve/pr96974.C
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=armv8.2-a+sve -fdisable-tree-fre4" } */
+
+float a;
+int
+b ()
+{ return __builtin_lrintf(a); }
+
+struct c {
+  float d;
+c() {
+  for (int e = 0; e < 9; e++)
+	coeffs[e] = d ? b() : 0;
+}
+int coeffs[10];
+} f;
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index c2d1f39fe0f4bbc90ffa079cb6a8fcf87b76b3af..f8d3eac38718e18bf957b85109cccbc03e21c041 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -11342,7 +11342,7 @@ get_related_vectype_for_scalar_type (machine_mode prevailing_mode,
 
 tree
 get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
-			 unsigned int group_size)
+			 unsigned int group_size, unsigned int max_nunits)
 {
   /* For BB vectorization, we should always have a group size once we've
  constructed the SLP tree; the only valid uses of zero GROUP_SIZEs
@@ -11375,13 +11375,16 @@ get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
 	 fail (in the latter case because GROUP_SIZE is too small
 	 for the target), but it's possible that a target could have
 	 a hole between supported vector types.
+	 There is also the option to artificially pass a max_nunits,
+	 which is smaller than GROUP_SIZE, if the use of GROUP_SIZE
+	 would result in an incompatible mode for the target.
 
 	 If GROUP_SIZE is not a power of 2, this has the effect of
 	 trying the largest power of 2 that fits within the group,
 	 even though the group is not a multiple of that vector size.
 	 The BB vectorizer will then try to carve up the group into
 	 smaller pieces.  */
-  unsigned int nunits = 1 << floor_log2 (group_size);
+  unsigned int nunits = 1 << floor_log2 (max_nunits);
   do
 	{
 	  vectype = get_related_vectype_for_scalar_type (vinfo->vector_mode,
@@ -11394,6 +11397,14 @@ get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
   return vectype;
 }
 
+tree
+get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
+			 unsigned int group_size)
+{
+  return get_vectype_for_scalar_type (vinfo, scalar_type,
+ group_size, group_size);
+}
+
 /* Return the vector type corresponding to SCALAR_TYPE as supported
by the target.  NODE, if nonnull, is the SLP tree node that will
use the returned vector type.  */
@@ -12172,6 +12183,8 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info,
 
   tree vectype;
   tree scalar_type = NULL_TREE;
+  tree scalar_type_orig = NULL_TREE;
+
   if (group_size == 0 && STMT_VINFO_VECTYPE (stmt_info))
 {
   vectype = STMT_VINFO_VECTYPE (stmt_info);
@@ -12210,6 +12223,7 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info,
 			 "get vectype for scalar type: %T\n", scalar_type);
 	}
   vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size);
+  scalar_type_orig = scalar_type;
   if (!vectype)
 	return opt_result::failure_at (stmt,
    "not vectorized:"
@@ -12249,6 +12263,36 @@ vect_get_vector_types_for_stmt (stmt_vec_info stmt_info,
 	}
 }
 
+  /* In rare cases with different types and sizes we may reach an invalid
+ combination where nunits_vectype has fewer TYPE_VECTOR_SUBPARTS than
+ *stmt_vectype_out.  In that case attempt to re-calculate
+ *stmt_vectype_out with an imposed max taken from nunits_vectype.  */
+  unsigned int max_nunits;
+  if (known_lt (TYPE_VECTOR_SUBPARTS (nunits_vectype),
+		TYPE_VECTOR_SUBPARTS (*stmt_vectype_out)))
+{
+  if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+	   

Re: [committed obvious][arm] Add test that was missing from old commit [PR91816]

2020-11-26 Thread Stam Markianos-Wright via Gcc-patches

On 26/11/2020 09:01, Christophe Lyon wrote:

On Wed, 25 Nov 2020 at 14:24, Stam Markianos-Wright via Gcc-patches
 wrote:


Hi all,

A while back I submitted GCC10 commit:

   44f77a6dea2f312ee1743f3dde465c1b8453ee13

for PR91816.

Turns out I was an idiot and forgot to include the test in the actual
git commit, even though my entire patch had been approved.

Tested that the test still passes on a cross arm-none-eabi and also in a
Cortex A-15 bootstrap with no regressions.

Submitting this as Obvious to gcc-11 and backporting to gcc-10.



Hi,

This new test fails when forcing -mcpu=cortex-m3/4/5/7/33:
FAIL: gcc.target/arm/pr91816.c scan-assembler-times beq\\t.L[0-9] 2
FAIL: gcc.target/arm/pr91816.c scan-assembler-times beq\\t.Lbcond[0-9] 1
FAIL: gcc.target/arm/pr91816.c scan-assembler-times bne\\t.L[0-9] 2
FAIL: gcc.target/arm/pr91816.c scan-assembler-times bne\\t.Lbcond[0-9] 1

I didn't check manually what is generated, can you have a look?



Oh wow thank you for spotting this!

It looks like the A class target that I had tested had a tendency to 
emit a movw/movt pair, whereas these M class targets would emit a single 
ldr. This resulted in an overall shorter jump for these targets that did 
not trigger the new far-branch code.
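
For context, the decision the test is trying to exercise can be modelled as follows. This is a hedged sketch, not the arm.md logic: the two limits are the pos_range/neg_range figures quoted in the patch's arm.md comment for a 32-bit Thumb conditional branch, the pc-offset adjustments are ignored, and the function name and its parameters are invented for the example.

```c
#include <stdio.h>

/* Illustrative limits for a 32-bit Thumb conditional branch.  */
#define BCOND_POS_RANGE 1048574L
#define BCOND_NEG_RANGE (-1048576L)

/* Write the branch sequence for a conditional jump to DEST into BUF.
   In range: a single b<cond>.  Out of range: branch on the inverse
   condition over an unconditional b, which reaches much further.  */
static int
emit_cond_branch (char *buf, size_t size, const char *cond,
                  const char *inv_cond, const char *dest,
                  long offset, int label_no)
{
  if (offset >= BCOND_NEG_RANGE && offset <= BCOND_POS_RANGE)
    return snprintf (buf, size, "b%s\t%s", cond, dest);

  return snprintf (buf, size, "b%s\t.Lbcond%d\nb\t%s\n.Lbcond%d:",
                   inv_cond, label_no, dest, label_no);
}
```

A function body that stays under the limit only ever takes the first path, which is why the smaller M-class code did not produce the `.Lbcond` labels the test scans for.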


The test passes after... doubling its own size:



 #define HW3 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2
 #define HW4 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3
 #define HW5 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4
+#define HW6 HW5 HW5

 __attribute__((noinline,noclone)) void f1 (int a)
 {
@@ -25,7 +26,7 @@ __attribute__((noinline,noclone)) void f2 (int a)

 __attribute__((noinline,noclone)) void f3 (int a)
 {
-  if (a) { HW5 }
+  if (a) { HW6 }
 }

 __attribute__((noinline,noclone)) void f4 (int a)
@@ -41,7 +42,7 @@ __attribute__((noinline,noclone)) void f5 (int a)

 __attribute__((noinline,noclone)) void f6 (int a)
 {
-  if (a == 1) { HW5 }
+  if (a == 1) { HW6 }
 }

But this does effectively double the compilation time of an already 
quite large test. Would that be ok?


Overall this is the edge case testing that the compiler behaves 
correctly with a branch in a huge compilation unit, so it would be nice to 
have test coverage of it on as many targets as possible... but also 
kinda rare.


Hope this helps!

Cheers,
Stam




Thanks,

Christophe





Thanks,
Stam Markianos-Wright

gcc/testsuite/ChangeLog:
 PR target/91816
 * gcc.target/arm/pr91816.c: New test.




[backport gcc-8,9][arm] Thumb2 out of range conditional branch fix [PR91816]

2020-11-25 Thread Stam Markianos-Wright via Gcc-patches

Hi all,

Now that I have pushed the entirety of this patch to gcc-10 and gcc-11, 
I would like to backport it to gcc-8 and gcc-9.


PR link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91816

This patch had originally been approved here:

https://gcc.gnu.org/legacy-ml/gcc-patches/2020-01/msg02010.html

See the attached diffs that have been rebased and apply cleanly.

Tested on a cross arm-none-eabi and also in a Cortex A-15 bootstrap with 
no regressions.


Ok to backport?

Thanks,
Stam Markianos-Wright
diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 9d0acde7a39..87e01e35221 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -553,4 +553,6 @@ void arm_parse_option_features (sbitmap, const cpu_arch_option *,
 
 void arm_initialize_isa (sbitmap, const enum isa_feature *);
 
+const char * arm_gen_far_branch (rtx *, int, const char * , const char *);
+
 #endif /* ! GCC_ARM_PROTOS_H */
diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
index f990ca11bcb..eefe3d99548 100644
--- a/gcc/config/arm/arm.c
+++ b/gcc/config/arm/arm.c
@@ -31629,6 +31629,39 @@ arm_constant_alignment (const_tree exp, HOST_WIDE_INT align)
   return align;
 }
 
+/* Generate code to enable conditional branches in functions over 1 MiB.
+   Parameters are:
+ operands: is the operands list of the asm insn (see arm_cond_branch or
+   arm_cond_branch_reversed).
+ pos_label: is an index into the operands array where operands[pos_label] is
+   the asm label of the final jump destination.
+ dest: is a string which is used to generate the asm label of the intermediate
+   destination
+   branch_format: is a string denoting the intermediate branch format, e.g.
+ "beq", "bne", etc.  */
+
+const char *
+arm_gen_far_branch (rtx * operands, int pos_label, const char * dest,
+   const char * branch_format)
+{
+  rtx_code_label * tmp_label = gen_label_rtx ();
+  char label_buf[256];
+  char buffer[128];
+  ASM_GENERATE_INTERNAL_LABEL (label_buf, dest , \
+   CODE_LABEL_NUMBER (tmp_label));
+  const char *label_ptr = arm_strip_name_encoding (label_buf);
+  rtx dest_label = operands[pos_label];
+  operands[pos_label] = tmp_label;
+
+  snprintf (buffer, sizeof (buffer), "%s%s", branch_format , label_ptr);
+  output_asm_insn (buffer, operands);
+
+  snprintf (buffer, sizeof (buffer), "b\t%%l0%d\n%s:", pos_label, label_ptr);
+  operands[pos_label] = dest_label;
+  output_asm_insn (buffer, operands);
+  return "";
+}
+
 #if CHECKING_P
 namespace selftest {
 
diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 6d6b37719e0..81c96658d95 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -7187,9 +7187,15 @@
 ;; And for backward branches we have 
 ;;   (neg_range - neg_base_offs + pc_offs) = (neg_range - (-2 or -4) + 4).
 ;;
+;; In 16-bit Thumb these ranges are:
 ;; For a 'b'       pos_range = 2046, neg_range = -2048 giving (-2040->2048).
 ;; For a 'b<cond>' pos_range = 254,  neg_range = -256  giving (-250 ->256).
 
+;; In 32-bit Thumb these ranges are:
+;; For a 'b'       +/- 16MB is not checked for.
+;; For a 'b<cond>' pos_range = 1048574,  neg_range = -1048576  giving
+;; (-1048568 -> 1048576).
+
 (define_expand "cbranchsi4"
   [(set (pc) (if_then_else
  (match_operator 0 "expandable_comparison_operator"
@@ -7444,23 +7450,50 @@
  (label_ref (match_operand 0 "" ""))
  (pc)))]
   "TARGET_32BIT"
-  "*
-  if (arm_ccfsm_state == 1 || arm_ccfsm_state == 2)
+  {
+if (arm_ccfsm_state == 1 || arm_ccfsm_state == 2)
 {
   arm_ccfsm_state += 2;
-  return \"\";
+  return "";
 }
-  return \"b%d1\\t%l0\";
-  "
+switch (get_attr_length (insn))
+  {
+   case 2: /* Thumb2 16-bit b{cond}.  */
+   case 4: /* Thumb2 32-bit b{cond} or A32 b{cond}.  */
+ return "b%d1\t%l0";
+ break;
+
+   /* Thumb2 b{cond} out of range.  Use 16-bit b{cond} and
+  unconditional branch b.  */
+   default: return arm_gen_far_branch (operands, 0, "Lbcond", "b%D1\t");
+  }
+  }
   [(set_attr "conds" "use")
(set_attr "type" "branch")
(set (attr "length")
-   (if_then_else
-  (and (match_test "TARGET_THUMB2")
-   (and (ge (minus (match_dup 0) (pc)) (const_int -250))
-(le (minus (match_dup 0) (pc)) (const_int 256))))
-  (const_int 2)
-  (const_int 4)))]
+(if_then_else (match_test "!TARGET_THUMB2")
+
+  ;;Target is not Thumb2, therefore is A32.  Generate b{cond}.
+  (const_int 4)
+
+  ;; Check if target is within 16-bit Thumb2 b{cond} range.
+  (if_then_else (and (ge (minus (match_dup 0) (pc)) (const_int -250))
+(le (minus (match_dup 0) (pc)) (const_int 256)))
+
+   ;; Target is Thumb2, within narrow range.
+   ;; Generate b{cond}.
+   (const_int 2)
+
+   ;; Check if target is within 32-bit Thumb2 b{cond} range.
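
For readers following the "length" attribute above, the nested if_then_else
can be read as a plain decision function. What follows is only a minimal
sketch in C, not GCC code: cond_branch_length is a hypothetical name, OFFSET
stands for (minus (match_dup 0) (pc)), and the 6-byte result for the
out-of-range case is an assumption (a 2-byte inverted b{cond} plus a 4-byte b),
since the hunk is cut off before that arm.

```c
#include <assert.h>
#include <stdbool.h>

/* Byte lengths of the emitted branch sequence.  NARROW_LEN and WIDE_LEN
   appear in the patch; FAR_LEN is an assumption (2-byte inverted b{cond}
   plus 4-byte unconditional b), because the hunk is truncated.  */
enum { NARROW_LEN = 2, WIDE_LEN = 4, FAR_LEN = 6 };

/* Hypothetical helper, not part of GCC: mirrors the nested if_then_else
   of the "length" attribute.  OFFSET is the branch target minus pc.  */
int
cond_branch_length (bool thumb2, long offset)
{
  if (!thumb2)
    return WIDE_LEN;		/* A32 b{cond}.  */
  if (offset >= -250 && offset <= 256)
    return NARROW_LEN;		/* 16-bit Thumb2 b{cond}.  */
  if (offset >= -1048568 && offset <= 1048576)
    return WIDE_LEN;		/* 32-bit Thumb2 b{cond}.  */
  return FAR_LEN;		/* Out of range: far-branch sequence.  */
}
```

Under these assumptions, a Thumb2 offset of 258 falls just outside the narrow
range and gets length 4, while an offset of 1048578 takes the far-branch path.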
+

[committed obvious][arm] Add test that was missing from old commit [PR91816]

2020-11-25 Thread Stam Markianos-Wright via Gcc-patches

Hi all,

A while back I submitted GCC10 commit:

 44f77a6dea2f312ee1743f3dde465c1b8453ee13

for PR91816.

Turns out I was an idiot and forgot to include the test in the actual 
git commit, even though the entire patch had been approved.


Tested that the test still passes on a cross arm-none-eabi toolchain and also
in a Cortex-A15 bootstrap, with no regressions.

Submitting this as Obvious to gcc-11 and backporting to gcc-10.

Thanks,
Stam Markianos-Wright

gcc/testsuite/ChangeLog:
PR target/91816
* gcc.target/arm/pr91816.c: New test.
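
As a rough guide to the sizes involved: each HWn macro in the test below
repeats the previous level ten times, so HW3 and HW5 expand to 1000 and 100000
printf calls. The sketch below just works through that arithmetic in C. The
helper names are hypothetical, and the per-call byte figure is derived from the
branch ranges, not measured from compiler output.

```c
#include <assert.h>

/* Hypothetical helper: number of printf calls produced by HW<level>,
   i.e. 10 to the power LEVEL.  */
long
hw_calls (int level)
{
  long n = 1;

  assert (level >= 0);
  while (level-- > 0)
    n *= 10;
  return n;
}

/* Hypothetical helper: smallest whole number of bytes per call (averaged
   over the body, literal pools included) for which an HW<level> body is
   larger than RANGE bytes, the forward reach of a branch.  */
long
min_bytes_per_call (int level, long range)
{
  return range / hw_calls (level) + 1;
}
```

With the forward ranges quoted in arm.md, min_bytes_per_call (3, 256) is 1 and
min_bytes_per_call (5, 1048576) is 11, so an HW5 body only needs to average 11
bytes per call to fall outside the 32-bit b<cond> range.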
diff --git a/gcc/testsuite/gcc.target/arm/pr91816.c b/gcc/testsuite/gcc.target/arm/pr91816.c
new file mode 100644
index 000..75b938a6aad
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/pr91816.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_thumb2_ok } */
+/* { dg-additional-options "-mthumb" }  */
+/* { dg-timeout-factor 4.0 } */
+
+int printf(const char *, ...);
+
+#define HW0	printf("Hello World!\n");
+#define HW1	HW0 HW0 HW0 HW0 HW0 HW0 HW0 HW0 HW0 HW0
+#define HW2	HW1 HW1 HW1 HW1 HW1 HW1 HW1 HW1 HW1 HW1
+#define HW3	HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2 HW2
+#define HW4	HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3 HW3
+#define HW5	HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4 HW4
+
+__attribute__((noinline,noclone)) void f1 (int a)
+{
+  if (a) { HW0 }
+}
+
+__attribute__((noinline,noclone)) void f2 (int a)
+{
+  if (a) { HW3 }
+}
+
+
+__attribute__((noinline,noclone)) void f3 (int a)
+{
+  if (a) { HW5 }
+}
+
+__attribute__((noinline,noclone)) void f4 (int a)
+{
+  if (a == 1) { HW0 }
+}
+
+__attribute__((noinline,noclone)) void f5 (int a)
+{
+  if (a == 1) { HW3 }
+}
+
+
+__attribute__((noinline,noclone)) void f6 (int a)
+{
+  if (a == 1) { HW5 }
+}
+
+
+int main(void)
+{
+	f1(0);
+	f2(0);
+	f3(0);
+	f4(0);
+	f5(0);
+	f6(0);
+	return 0;
+}
+
+
+/* { dg-final { scan-assembler-times "beq\\t.L\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "beq\\t.Lbcond\[0-9\]" 1 } } */
+/* { dg-final { scan-assembler-times "bne\\t.L\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "bne\\t.Lbcond\[0-9\]" 1 } } */
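
The ".Lbcond" patterns in the dg-final directives correspond to the
intermediate label that arm_gen_far_branch generates from the "Lbcond" prefix.
As I read it, the sequence is an inverted conditional branch over an
unconditional branch to the real target. Below is a rough sketch of that text
shape in C. format_far_branch is a hypothetical helper, not GCC code, and the
exact whitespace the compiler emits may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not GCC code: writes the text shape of the
   far-branch sequence into OUT.  INV_COND is the inverted two-letter
   condition ("eq", "ne", ...), TMP_LABEL the intermediate label and
   DEST_LABEL the original branch target.  Returns the snprintf result.  */
int
format_far_branch (char *out, size_t size, const char *inv_cond,
		   const char *tmp_label, const char *dest_label)
{
  assert (out != NULL && size > 0);
  assert (strlen (inv_cond) == 2);

  /* Short inverted branch over the long jump, then the long jump,
     then the intermediate label the short branch lands on.  */
  return snprintf (out, size, "b%s\t%s\n\tb\t%s\n%s:",
		   inv_cond, tmp_label, dest_label, tmp_label);
}
```

For instance, calling it with "ne", ".Lbcond7" and ".L2" (made-up label names)
produces a "bne\t.Lbcond7" line, which is the kind of text the
"bne\\t.Lbcond\[0-9\]" directive looks for.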