[PATCH 15/15] aarch64: Conditionally define __ARM_FEATURE_SVE2p1

2024-11-06  Richard Sandiford
Previous patches are supposed to add full support for SVE2.1,
so this patch advertises that through __ARM_FEATURE_SVE2p1.

pragma_cpp_predefs_3.c had one fewer pop than push, so the patch also
adds the missing pop.  The final test is triple-nested:

- armv8-a (to start with a clean slate, untainted by command-line flags)
- the maximal SVE set
- general-regs-only
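
For illustration, user code could key off the newly advertised macro
like this (a minimal hypothetical example, not part of the patch):

  #ifdef __ARM_FEATURE_SVE2p1
  /* Paths that rely on SVE2.1 instructions or intrinsics.  */
  #else
  /* Fallback for targets without SVE2.1.  */
  #endif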

gcc/
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Handle
__ARM_FEATURE_SVE2p1.

gcc/testsuite/
* gcc.target/aarch64/pragma_cpp_predefs_3.c: Add SVE2p1 tests.
---
 gcc/config/aarch64/aarch64-c.cc   |  1 +
 .../gcc.target/aarch64/pragma_cpp_predefs_3.c | 84 +++
 2 files changed, 85 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-c.cc b/gcc/config/aarch64/aarch64-c.cc
index f9b9e379375..d1ae80c0bb3 100644
--- a/gcc/config/aarch64/aarch64-c.cc
+++ b/gcc/config/aarch64/aarch64-c.cc
@@ -214,6 +214,7 @@ aarch64_update_cpp_builtins (cpp_reader *pfile)
"__ARM_FEATURE_SVE2_BITPERM", pfile);
   aarch64_def_or_undef (TARGET_SVE2_SHA3, "__ARM_FEATURE_SVE2_SHA3", pfile);
   aarch64_def_or_undef (TARGET_SVE2_SM4, "__ARM_FEATURE_SVE2_SM4", pfile);
+  aarch64_def_or_undef (TARGET_SVE2p1, "__ARM_FEATURE_SVE2p1", pfile);
 
   aarch64_def_or_undef (TARGET_LSE, "__ARM_FEATURE_ATOMICS", pfile);
   aarch64_def_or_undef (TARGET_AES, "__ARM_FEATURE_AES", pfile);
diff --git a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_3.c b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_3.c
index 39128528600..f1f70ed7b5c 100644
--- a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_3.c
@@ -28,6 +28,10 @@
 #error "__ARM_FEATURE_SVE2_SM4 is defined but should not be!"
 #endif
 
+#ifdef __ARM_FEATURE_SVE2p1
+#error "__ARM_FEATURE_SVE2p1 is defined but should not be!"
+#endif
+
 #pragma GCC push_options
 #pragma GCC target ("arch=armv8.2-a+sve")
 
@@ -55,6 +59,10 @@
 #error "__ARM_FEATURE_SVE2_SM4 is defined but should not be!"
 #endif
 
+#ifdef __ARM_FEATURE_SVE2p1
+#error "__ARM_FEATURE_SVE2p1 is defined but should not be!"
+#endif
+
 #pragma GCC pop_options
 
 #pragma GCC push_options
@@ -84,6 +92,10 @@
 #error "__ARM_FEATURE_SVE2_SM4 is defined but should not be!"
 #endif
 
+#ifdef __ARM_FEATURE_SVE2p1
+#error "__ARM_FEATURE_SVE2p1 is defined but should not be!"
+#endif
+
 #pragma GCC pop_options
 
 #pragma GCC push_options
@@ -242,6 +254,72 @@
 #error "__ARM_FEATURE_SVE2_SM4 is not defined but should be!"
 #endif
 
+#pragma GCC pop_options
+
+#pragma GCC push_options
+#pragma GCC target ("arch=armv9-a+sve2p1")
+
+#ifndef __ARM_FEATURE_SVE
+#error "__ARM_FEATURE_SVE is not defined but should be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2
+#error "__ARM_FEATURE_SVE2 is not defined but should be!"
+#endif
+
+#ifdef __ARM_FEATURE_SVE2_AES
+#error "__ARM_FEATURE_SVE2_AES is defined but should not be!"
+#endif
+
+#ifdef __ARM_FEATURE_SVE2_BITPERM
+#error "__ARM_FEATURE_SVE2_BITPERM is defined but should not be!"
+#endif
+
+#ifdef __ARM_FEATURE_SVE2_SHA3
+#error "__ARM_FEATURE_SVE2_SHA3 is defined but should not be!"
+#endif
+
+#ifdef __ARM_FEATURE_SVE2_SM4
+#error "__ARM_FEATURE_SVE2_SM4 is defined but should not be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2p1
+#error "__ARM_FEATURE_SVE2p1 is not defined but should be!"
+#endif
+
+#pragma GCC pop_options
+
+#pragma GCC push_options
+#pragma GCC target ("arch=armv9-a+sve2-aes+sve2-bitperm+sve2-sha3+sve2-sm4+sve2p1")
+
+#ifndef __ARM_FEATURE_SVE
+#error "__ARM_FEATURE_SVE is not defined but should be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2
+#error "__ARM_FEATURE_SVE2 is not defined but should be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2_AES
+#error "__ARM_FEATURE_SVE2_AES is not defined but should be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2_BITPERM
+#error "__ARM_FEATURE_SVE2_BITPERM is not defined but should be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2_SHA3
+#error "__ARM_FEATURE_SVE2_SHA3 is not defined but should be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2_SM4
+#error "__ARM_FEATURE_SVE2_SM4 is not defined but should be!"
+#endif
+
+#ifndef __ARM_FEATURE_SVE2p1
+#error "__ARM_FEATURE_SVE2p1 is not defined but should be!"
+#endif
+
 #pragma GCC push_options
 #pragma GCC target ("general-regs-only")
 
@@ -269,6 +347,12 @@
 #error "__ARM_FEATURE_SVE2_SM4 is defined but should not be!"
 #endif
 
+#ifdef __ARM_FEATURE_SVE2p1
+#error "__ARM_FEATURE_SVE2p1 is defined but should not be!"
+#endif
+
+#pragma GCC pop_options
+
 #pragma GCC pop_options
 
 #pragma GCC pop_options
-- 
2.25.1



[PATCH 13/15] aarch64: Add common subset of SVE2p1 and SME2

2024-11-06  Richard Sandiford
This patch handles the SVE2p1 instructions that are shared
with SME2.  This includes the consecutive-register forms of
the 2-register and 4-register loads and stores, but not the
strided-register forms.

gcc/
* config/aarch64/aarch64.h (TARGET_SVE2p1_OR_SME2): New macro.
* config/aarch64/aarch64-early-ra.cc
(is_stride_candidate): Require TARGET_STREAMING_SME2.
(early_ra::maybe_convert_to_strided_access): Likewise.
* config/aarch64/aarch64-sve-builtins-sve2.def: Mark instructions
that are common to both SVE2p1 and SME2.
* config/aarch64/aarch64-sve.md
(@aarch64_dot_prod_lane):
Test TARGET_SVE2p1_OR_SME2 instead of TARGET_STREAMING_SME2.
(@aarch64_sve_vnx4sf): Move TARGET_SVE_BF16 condition
into SVE_BFLOAT_TERNARY_LONG.
(@aarch64_sve__lanevnx4sf): Likewise
SVE_BFLOAT_TERNARY_LONG_LANE.
* config/aarch64/aarch64-sve2.md
(@aarch64_): Require TARGET_SVE2p1_OR_SME2
instead of TARGET_STREAMING_SME2.
(@aarch64_): Likewise.
(@aarch64_sve_ptrue_c): Likewise.
(@aarch64_sve_pext): Likewise.
(@aarch64_sve_pextx2): Likewise.
(@aarch64_sve_cntp_c): Likewise.
(@aarch64_sve_fclamp): Likewise.
(*aarch64_sve_fclamp_x): Likewise.
(dot_prodvnx4sivnx8hi): Likewise.
(aarch64_sve_fdotvnx4sfvnx8hf): Likewise.
(aarch64_fdot_prod_lanevnx4sfvnx8hf): Likewise.
(@aarch64_sve_while_b_x2): Likewise.
(@aarch64_sve_while_c): Likewise.
(@aarch64_sve_): Move
TARGET_STREAMING_SME2 condition into SVE_QCVTxN.
(@aarch64_sve_): Likewise
SVE2_INT_SHIFT_IMM_NARROWxN, but also require TARGET_STREAMING_SME2
for the 4-register forms.
* config/aarch64/iterators.md (SVE_BFLOAT_TERNARY_LONG): Require
TARGET_SVE2p1_OR_SME2 rather than TARGET_STREAMING_SME2 for
UNSPEC_BFMLSLB and UNSPEC_BFMLSLT.  Require TARGET_SVE_BF16
for the others.
(SVE_BFLOAT_TERNARY_LONG_LANE): Likewise.
(SVE2_INT_SHIFT_IMM_NARROWxN): Require TARGET_SVE2p1_OR_SME2 for
the interleaving forms and TARGET_STREAMING_SME2 for the rest.
(SVE_QCVTxN): Likewise.

gcc/testsuite/
* gcc.target/aarch64/sve/clamp_3.c: New test.
* gcc.target/aarch64/sve/clamp_4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/bfmlslb_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/bfmlslb_lane_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/bfmlslt_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/bfmlslt_lane_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_f16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_f64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/cntp_c16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/cntp_c32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/cntp_c64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/cntp_c8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/dot_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/dot_lane_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/dot_lane_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/dot_lane_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/dot_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/dot_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_bf16_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_bf16_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_f16_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_f16_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_f32_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_f32_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_f64_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_f64_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s16_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s16_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s32_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s32_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s64_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s64_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s8_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_s8_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_u16_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_u16_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_u32_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_u32_x4.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_u64_x2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/ld1_u64_x4.c: Likewise.
* gcc.target/aarch64

[PATCH 12/15] aarch64: Add common subset of SVE2p1 and SME

2024-11-06  Richard Sandiford
Some instructions that were previously restricted to streaming mode
can also be used in non-streaming mode with SVE2.1.  This patch adds
support for those, as well as the usual new-extension boilerplate.
A later patch will add the feature macro.
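
For example (a sketch assuming the usual ACLE spellings), svclamp becomes
valid in a plain non-streaming function once +sve2p1 is enabled:

  #include <arm_sve.h>

  /* Previously this needed streaming mode (+sme); with +sve2p1 it can
     also be used in non-streaming code.  */
  svint32_t
  clamp (svint32_t x, svint32_t lo, svint32_t hi)
  {
    return svclamp_s32 (x, lo, hi);
  }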

gcc/
* config/aarch64/aarch64-option-extensions.def (sve2p1): New extension.
* config/aarch64/aarch64-sve-builtins-sve2.def: Mark instructions
that are common to both SVE2p1 and SME.
* config/aarch64/aarch64.h (TARGET_SVE2p1): New macro.
(TARGET_SVE2p1_OR_SME): Likewise.
* config/aarch64/aarch64-sve2.md
(@aarch64_sve_psel): Require TARGET_SVE2p1_OR_SME
instead of TARGET_STREAMING.
(*aarch64_sve_psel_plus): Likewise.
(@aarch64_sve_clamp): Likewise.
(*aarch64_sve_clamp_x): Likewise.
(@aarch64_pred_): Likewise.
(@cond_): Likewise.

gcc/testsuite/
* lib/target-supports.exp
(check_effective_target_aarch64_asm_sve2p1_ok): New procedure.
* gcc.target/aarch64/sve/clamp_1.c: New test.
* gcc.target/aarch64/sve/clamp_2.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_s16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_s8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_u16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/clamp_u8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_b16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_b32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_b64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_b8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_c16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_c32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_c64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/psel_lane_c8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_bf16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_f16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_f64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_s16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_s32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_s64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_s8.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_u16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_u32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_u64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/revd_u8.c: Likewise.
---
 .../aarch64/aarch64-option-extensions.def |  2 +
 .../aarch64/aarch64-sve-builtins-sve2.def |  2 +-
 gcc/config/aarch64/aarch64-sve2.md| 12 +--
 gcc/config/aarch64/aarch64.h  |  9 ++
 .../gcc.target/aarch64/sve/clamp_1.c  | 40 
 .../gcc.target/aarch64/sve/clamp_2.c  | 34 +++
 .../aarch64/sve2/acle/asm/clamp_s16.c | 46 +
 .../aarch64/sve2/acle/asm/clamp_s32.c | 46 +
 .../aarch64/sve2/acle/asm/clamp_s64.c | 46 +
 .../aarch64/sve2/acle/asm/clamp_s8.c  | 46 +
 .../aarch64/sve2/acle/asm/clamp_u16.c | 46 +
 .../aarch64/sve2/acle/asm/clamp_u32.c | 46 +
 .../aarch64/sve2/acle/asm/clamp_u64.c | 46 +
 .../aarch64/sve2/acle/asm/clamp_u8.c  | 46 +
 .../aarch64/sve2/acle/asm/psel_lane_b16.c | 93 +++
 .../aarch64/sve2/acle/asm/psel_lane_b32.c | 93 +++
 .../aarch64/sve2/acle/asm/psel_lane_b64.c | 84 +
 .../aarch64/sve2/acle/asm/psel_lane_b8.c  | 93 +++
 .../aarch64/sve2/acle/asm/psel_lane_c16.c | 93 +++
 .../aarch64/sve2/acle/asm/psel_lane_c32.c | 93 +++
 .../aarch64/sve2/acle/asm/psel_lane_c64.c | 84 +
 .../aarch64/sve2/acle/asm/psel_lane_c8.c  | 93 +++
 .../aarch64/sve2/acle/asm/revd_bf16.c | 80 
 .../aarch64/sve2/acle/asm/revd_f16.c  | 80 
 .../aarch64/sve2/acle/asm/revd_f32.c  | 80 
 .../aarch64/sve2/acle/asm/revd_f64.c  | 80 
 .../aarch64/sve2/acle/asm/revd_s16.c  | 80 
 .../aarch64/sve2/acle/asm/revd_s32.c  | 80 
 .../aarch64/sve2/acle/asm/revd_s64.c  | 80 
 .../aarch64/sve2/acle/asm/revd_s8.c   | 80 
 .../aarch64/sve2/acle/asm/revd_u16.c  | 80 +

[PATCH 11/15] aarch64: Define arm_neon.h types in arm_sve.h too

2024-11-06  Richard Sandiford
This patch moves the scalar and single-vector Advanced SIMD types
from arm_neon.h into a private header, so that they can be defined
by arm_sve.h as well.  This is needed for the upcoming SVE2.1
hybrid-VLA reductions, which return 128-bit Advanced SIMD vectors.

The approach follows Claudio's patch for FP8.
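
The practical effect (sketched here as an assumed usage) is that a
translation unit which includes only arm_sve.h can now name the
single-vector Advanced SIMD types:

  #include <arm_sve.h>

  /* int32x4_t now comes from the shared private header; previously it
     was only declared by arm_neon.h.  */
  int32x4_t
  pass_through (int32x4_t x)
  {
    return x;
  }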

gcc/
* config.gcc (extra_headers): Add arm_private_neon_types.h.
* config/aarch64/arm_private_neon_types.h: New file, split out
from...
* config/aarch64/arm_neon.h: ...here.
* config/aarch64/arm_sve.h: Include arm_private_neon_types.h.
---
 gcc/config.gcc  |  2 +-
 gcc/config/aarch64/arm_neon.h   | 49 +
 gcc/config/aarch64/arm_private_neon_types.h | 79 +
 gcc/config/aarch64/arm_sve.h|  5 +-
 4 files changed, 84 insertions(+), 51 deletions(-)
 create mode 100644 gcc/config/aarch64/arm_private_neon_types.h

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 1b0637d7ff8..7e0108e2154 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -347,7 +347,7 @@ m32c*-*-*)
 ;;
 aarch64*-*-*)
cpu_type=aarch64
-   extra_headers="arm_fp16.h arm_neon.h arm_bf16.h arm_acle.h arm_sve.h arm_sme.h arm_neon_sve_bridge.h arm_private_fp8.h"
+   extra_headers="arm_fp16.h arm_neon.h arm_bf16.h arm_acle.h arm_sve.h arm_sme.h arm_neon_sve_bridge.h arm_private_fp8.h arm_private_neon_types.h"
c_target_objs="aarch64-c.o"
cxx_target_objs="aarch64-c.o"
d_target_objs="aarch64-d.o"
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index d3533f3ee6f..c727302ac75 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -30,58 +30,15 @@
 #pragma GCC push_options
 #pragma GCC target ("+nothing+simd")
 
+#include 
 #include 
-#pragma GCC aarch64 "arm_neon.h"
+#include 
 
-#include 
+#pragma GCC aarch64 "arm_neon.h"
 
 #define __AARCH64_UINT64_C(__C) ((uint64_t) __C)
 #define __AARCH64_INT64_C(__C) ((int64_t) __C)
 
-typedef __Int8x8_t int8x8_t;
-typedef __Int16x4_t int16x4_t;
-typedef __Int32x2_t int32x2_t;
-typedef __Int64x1_t int64x1_t;
-typedef __Float16x4_t float16x4_t;
-typedef __Float32x2_t float32x2_t;
-typedef __Poly8x8_t poly8x8_t;
-typedef __Poly16x4_t poly16x4_t;
-typedef __Uint8x8_t uint8x8_t;
-typedef __Uint16x4_t uint16x4_t;
-typedef __Uint32x2_t uint32x2_t;
-typedef __Float64x1_t float64x1_t;
-typedef __Uint64x1_t uint64x1_t;
-typedef __Int8x16_t int8x16_t;
-typedef __Int16x8_t int16x8_t;
-typedef __Int32x4_t int32x4_t;
-typedef __Int64x2_t int64x2_t;
-typedef __Float16x8_t float16x8_t;
-typedef __Float32x4_t float32x4_t;
-typedef __Float64x2_t float64x2_t;
-typedef __Poly8x16_t poly8x16_t;
-typedef __Poly16x8_t poly16x8_t;
-typedef __Poly64x2_t poly64x2_t;
-typedef __Poly64x1_t poly64x1_t;
-typedef __Uint8x16_t uint8x16_t;
-typedef __Uint16x8_t uint16x8_t;
-typedef __Uint32x4_t uint32x4_t;
-typedef __Uint64x2_t uint64x2_t;
-
-typedef __Poly8_t poly8_t;
-typedef __Poly16_t poly16_t;
-typedef __Poly64_t poly64_t;
-typedef __Poly128_t poly128_t;
-
-typedef __Mfloat8x8_t mfloat8x8_t;
-typedef __Mfloat8x16_t mfloat8x16_t;
-
-typedef __fp16 float16_t;
-typedef float float32_t;
-typedef double float64_t;
-
-typedef __Bfloat16x4_t bfloat16x4_t;
-typedef __Bfloat16x8_t bfloat16x8_t;
-
 /* __aarch64_vdup_lane internal macros.  */
 #define __aarch64_vdup_lane_any(__size, __q, __a, __b) \
   vdup##__q##_n_##__size (__aarch64_vget_lane_any (__a, __b))
diff --git a/gcc/config/aarch64/arm_private_neon_types.h b/gcc/config/aarch64/arm_private_neon_types.h
new file mode 100644
index 000..0f588f026b7
--- /dev/null
+++ b/gcc/config/aarch64/arm_private_neon_types.h
@@ -0,0 +1,79 @@
+/* AArch64 type definitions for arm_neon.h
+   Do not include this file directly. Use one of arm_neon.h, arm_sme.h,
+   or arm_sve.h instead.
+
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _GCC_ARM_PRIVATE_NEON_TYPES_H
+#define _GCC_ARM_PRIVATE_NEON_TYPES_H
+
+#if !defin

[PATCH 10/15] aarch64: Add svboolx4_t

2024-11-06  Richard Sandiford
This patch adds an svboolx4_t type, to go alongside the existing
svboolx2_t type.  It doesn't require any special ISA support beyond
SVE itself and it currently has no associated instructions.
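
A sketch of the intended ACLE-level usage (the _b spellings follow the
existing svboolx2_t pattern and are an assumption of this example):

  #include <arm_sve.h>

  svboolx4_t
  make_x4 (svbool_t p0, svbool_t p1, svbool_t p2, svbool_t p3)
  {
    return svcreate4_b (p0, p1, p2, p3);
  }

  svbool_t
  third (svboolx4_t p)
  {
    return svget4_b (p, 2);
  }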

gcc/
* config/aarch64/aarch64-modes.def (VNx64BI): New mode.
* config/aarch64/aarch64-protos.h
(aarch64_split_double_move): Generalize to...
(aarch64_split_move): ...this.
* config/aarch64/aarch64-sve-builtins-base.def (svcreate4, svget4)
(svset4, svundef4): Add bool variants.
* config/aarch64/aarch64-sve-builtins.cc (handle_arm_sve_h): Add
svboolx4_t.
* config/aarch64/iterators.md (SVE_STRUCT_BI): New mode iterator.
* config/aarch64/aarch64-sve.md (movvnx32bi): Generalize to...
(mov): ...this.
* config/aarch64/aarch64.cc
(pure_scalable_type_info::piece::get_rtx): Allow num_prs to be 4.
(aarch64_classify_vector_mode): Handle VNx64BI.
(aarch64_hard_regno_nregs): Likewise.
(aarch64_class_max_nregs): Likewise.
(aarch64_array_mode): Use VNx64BI for arrays of 4 svbool_ts.
(aarch64_split_double_move): Generalize to...
(aarch64_split_move): ...this.
(aarch64_split_128bit_move): Update call accordingly.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/general-c/create_5.c: Expect svcreate4
to succeed for svbool_ts.
* gcc.target/aarch64/sve/acle/asm/test_sve_acle.h
(TEST_UNDEF_B): New macro.
* gcc.target/aarch64/sve/acle/asm/create4_1.c: Test _b form.
* gcc.target/aarch64/sve/acle/asm/undef2_1.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/undef4_1.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/get4_b.c: New test.
* gcc.target/aarch64/sve/acle/asm/set4_b.c: Likewise.
* gcc.target/aarch64/sve/acle/general-c/svboolx4_1.c: Likewise.
---
 gcc/config/aarch64/aarch64-modes.def  |   3 +
 gcc/config/aarch64/aarch64-protos.h   |   2 +-
 .../aarch64/aarch64-sve-builtins-base.def |   4 +
 gcc/config/aarch64/aarch64-sve-builtins.cc|   2 +-
 gcc/config/aarch64/aarch64-sve.md |   8 +-
 gcc/config/aarch64/aarch64.cc |  50 
 gcc/config/aarch64/iterators.md   |   2 +
 .../aarch64/sve/acle/asm/create4_1.c  |  10 ++
 .../gcc.target/aarch64/sve/acle/asm/get4_b.c  |  73 +++
 .../gcc.target/aarch64/sve/acle/asm/set4_b.c  |  87 +
 .../aarch64/sve/acle/asm/test_sve_acle.h  |   8 ++
 .../aarch64/sve/acle/asm/undef2_1.c   |   7 ++
 .../aarch64/sve/acle/asm/undef4_1.c   |   7 ++
 .../aarch64/sve/acle/general-c/create_5.c |   2 +-
 .../aarch64/sve/acle/general-c/svboolx4_1.c   | 117 ++
 15 files changed, 351 insertions(+), 31 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/get4_b.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/asm/set4_b.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general-c/svboolx4_1.c

diff --git a/gcc/config/aarch64/aarch64-modes.def b/gcc/config/aarch64/aarch64-modes.def
index 25a22c1195e..813421e1e39 100644
--- a/gcc/config/aarch64/aarch64-modes.def
+++ b/gcc/config/aarch64/aarch64-modes.def
@@ -48,18 +48,21 @@ ADJUST_FLOAT_FORMAT (HF, &ieee_half_format);
 
 /* Vector modes.  */
 
+VECTOR_BOOL_MODE (VNx64BI, 64, BI, 8);
 VECTOR_BOOL_MODE (VNx32BI, 32, BI, 4);
 VECTOR_BOOL_MODE (VNx16BI, 16, BI, 2);
 VECTOR_BOOL_MODE (VNx8BI, 8, BI, 2);
 VECTOR_BOOL_MODE (VNx4BI, 4, BI, 2);
 VECTOR_BOOL_MODE (VNx2BI, 2, BI, 2);
 
+ADJUST_NUNITS (VNx64BI, aarch64_sve_vg * 32);
 ADJUST_NUNITS (VNx32BI, aarch64_sve_vg * 16);
 ADJUST_NUNITS (VNx16BI, aarch64_sve_vg * 8);
 ADJUST_NUNITS (VNx8BI, aarch64_sve_vg * 4);
 ADJUST_NUNITS (VNx4BI, aarch64_sve_vg * 2);
 ADJUST_NUNITS (VNx2BI, aarch64_sve_vg);
 
+ADJUST_ALIGNMENT (VNx64BI, 2);
 ADJUST_ALIGNMENT (VNx32BI, 2);
 ADJUST_ALIGNMENT (VNx16BI, 2);
 ADJUST_ALIGNMENT (VNx8BI, 2);
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index e8588e1cb17..660e335bf34 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -1045,7 +1045,7 @@ rtx aarch64_simd_expand_builtin (int, tree, rtx);
 void aarch64_simd_lane_bounds (rtx, HOST_WIDE_INT, HOST_WIDE_INT, const_tree);
 rtx aarch64_endian_lane_rtx (machine_mode, unsigned int);
 
-void aarch64_split_double_move (rtx, rtx, machine_mode);
+void aarch64_split_move (rtx, rtx, machine_mode);
 void aarch64_split_128bit_move (rtx, rtx);
 
 bool aarch64_split_128bit_move_p (rtx, rtx);
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.def b/gcc/config/aarch64/aarch64-sve-builtins-base.def
index da2a0e41aa5..0353f56e705 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.def
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.def
@@ -74,6 +74,7 @@ DEF_SVE_FUNCTION (svcreate2, create, all_data, none)
 DEF_SVE_FUNCTION (svcreate2, create, b, none)
 DEF

[PATCH 03/15] aarch64: Tweak definition of all_data & co

2024-11-06  Richard Sandiford
Past extensions to SVE have required new subsets of all_data; the
SVE2.1 patches will add another.  This patch tries to make this more
scalable by defining the multi-size *_data macros to be unions of
single-size *_data macros.
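
As an illustration (derived from the definitions added below), the new
scheme expands for example as:

  TYPES_hs_data (S, D)
    -> TYPES_h_data (S, D), TYPES_s_data (S, D)
    -> S (bf16), S (f16), S (s16), S (u16),    /* 16-bit elements */
       S (f32), S (s32), S (u32)               /* 32-bit elements */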

gcc/
* config/aarch64/aarch64-sve-builtins.cc (TYPES_all_data): Redefine
in terms of single-size *_data definitions.
(TYPES_bhs_data, TYPES_hs_data, TYPES_sd_data): Likewise.
(TYPES_b_data, TYPES_h_data, TYPES_s_data): New macros.
---
 gcc/config/aarch64/aarch64-sve-builtins.cc | 51 +-
 1 file changed, 31 insertions(+), 20 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index 44b7f6edae5..c0b5115fdeb 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -231,12 +231,11 @@ CONSTEXPR const group_suffix_info group_suffixes[] = {
 #define TYPES_all_arith(S, D) \
   TYPES_all_float (S, D), TYPES_all_integer (S, D)
 
-/* _bf16
-   _f16 _f32 _f64
-   _s8  _s16 _s32 _s64
-   _u8  _u16 _u32 _u64.  */
 #define TYPES_all_data(S, D) \
-  S (bf16), TYPES_all_arith (S, D)
+  TYPES_b_data (S, D), \
+  TYPES_h_data (S, D), \
+  TYPES_s_data (S, D), \
+  TYPES_d_data (S, D)
 
 /* _b only.  */
 #define TYPES_b(S, D) \
@@ -255,6 +254,11 @@ CONSTEXPR const group_suffix_info group_suffixes[] = {
 #define TYPES_b_integer(S, D) \
   S (s8), TYPES_b_unsigned (S, D)
 
+/* _s8
+   _u8.  */
+#define TYPES_b_data(S, D) \
+  TYPES_b_integer (S, D)
+
 /* _s8 _s16
_u8 _u16.  */
 #define TYPES_bh_integer(S, D) \
@@ -277,12 +281,10 @@ CONSTEXPR const group_suffix_info group_suffixes[] = {
 #define TYPES_bhs_integer(S, D) \
   TYPES_bhs_signed (S, D), TYPES_bhs_unsigned (S, D)
 
-/*  _bf16
-_f16  _f32
-_s8  _s16  _s32
-_u8  _u16  _u32.  */
 #define TYPES_bhs_data(S, D) \
-  S (bf16), S (f16), S (f32), TYPES_bhs_integer (S, D)
+  TYPES_b_data (S, D), \
+  TYPES_h_data (S, D), \
+  TYPES_s_data (S, D)
 
 /* _s16_s8  _s32_s16  _s64_s32
_u16_u8  _u32_u16  _u64_u32.  */
@@ -295,6 +297,13 @@ CONSTEXPR const group_suffix_info group_suffixes[] = {
 #define TYPES_h_integer(S, D) \
   S (s16), S (u16)
 
+/* _bf16
+   _f16
+   _s16
+   _u16.  */
+#define TYPES_h_data(S, D) \
+  S (bf16), S (f16), TYPES_h_integer (S, D)
+
 /* _s16 _s32.  */
 #define TYPES_hs_signed(S, D) \
   S (s16), S (s32)
@@ -308,12 +317,9 @@ CONSTEXPR const group_suffix_info group_suffixes[] = {
 #define TYPES_hs_float(S, D) \
   S (f16), S (f32)
 
-/* _bf16
-_f16  _f32
-_s16  _s32
-_u16  _u32.  */
 #define TYPES_hs_data(S, D) \
-  S (bf16), S (f16), S (f32), TYPES_hs_integer (S, D)
+  TYPES_h_data (S, D), \
+  TYPES_s_data (S, D)
 
 /* _u16 _u64.  */
 #define TYPES_hd_unsigned(S, D) \
@@ -352,10 +358,17 @@ CONSTEXPR const group_suffix_info group_suffixes[] = {
 #define TYPES_s_unsigned(S, D) \
   S (u32)
 
-/* _s32 _u32.  */
+/* _s32
+   _u32.  */
 #define TYPES_s_integer(S, D) \
   TYPES_s_signed (S, D), TYPES_s_unsigned (S, D)
 
+/* _f32
+   _s32
+   _u32.  */
+#define TYPES_s_data(S, D) \
+  TYPES_s_float (S, D), TYPES_s_integer (S, D)
+
 /* _s32 _s64.  */
 #define TYPES_sd_signed(S, D) \
   S (s32), S (s64)
@@ -369,11 +382,9 @@ CONSTEXPR const group_suffix_info group_suffixes[] = {
 #define TYPES_sd_integer(S, D) \
   TYPES_sd_signed (S, D), TYPES_sd_unsigned (S, D)
 
-/* _f32 _f64
-   _s32 _s64
-   _u32 _u64.  */
 #define TYPES_sd_data(S, D) \
-  S (f32), S (f64), TYPES_sd_integer (S, D)
+  TYPES_s_data (S, D), \
+  TYPES_d_data (S, D)
 
 /* _f16 _f32 _f64
_s32 _s64
-- 
2.25.1



[PATCH 09/15] aarch64: Sort some SVE2 lists alphabetically

2024-11-06  Richard Sandiford
gcc/
* config/aarch64/aarch64-sve-builtins-sve2.def: Sort entries
alphabetically.
* config/aarch64/aarch64-sve-builtins-sve2.h: Likewise.
* config/aarch64/aarch64-sve-builtins-sve2.cc: Likewise.
---
 .../aarch64/aarch64-sve-builtins-sve2.cc  | 24 +++---
 .../aarch64/aarch64-sve-builtins-sve2.def | 32 +--
 .../aarch64/aarch64-sve-builtins-sve2.h   | 14 
 3 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
index f0ab7400ef5..24e95afd6eb 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
@@ -589,20 +589,20 @@ FUNCTION (svabalb, unspec_based_add_function, 
(UNSPEC_SABDLB,
   UNSPEC_UABDLB, -1))
 FUNCTION (svabalt, unspec_based_add_function, (UNSPEC_SABDLT,
   UNSPEC_UABDLT, -1))
+FUNCTION (svabdlb, unspec_based_function, (UNSPEC_SABDLB, UNSPEC_UABDLB, -1))
+FUNCTION (svabdlt, unspec_based_function, (UNSPEC_SABDLT, UNSPEC_UABDLT, -1))
+FUNCTION (svadalp, unspec_based_function, (UNSPEC_SADALP, UNSPEC_UADALP, -1))
 FUNCTION (svadclb, unspec_based_function, (-1, UNSPEC_ADCLB, -1))
 FUNCTION (svadclt, unspec_based_function, (-1, UNSPEC_ADCLT, -1))
 FUNCTION (svaddhnb, unspec_based_function, (UNSPEC_ADDHNB, UNSPEC_ADDHNB, -1))
 FUNCTION (svaddhnt, unspec_based_function, (UNSPEC_ADDHNT, UNSPEC_ADDHNT, -1))
-FUNCTION (svabdlb, unspec_based_function, (UNSPEC_SABDLB, UNSPEC_UABDLB, -1))
-FUNCTION (svabdlt, unspec_based_function, (UNSPEC_SABDLT, UNSPEC_UABDLT, -1))
-FUNCTION (svadalp, unspec_based_function, (UNSPEC_SADALP, UNSPEC_UADALP, -1))
 FUNCTION (svaddlb, unspec_based_function, (UNSPEC_SADDLB, UNSPEC_UADDLB, -1))
 FUNCTION (svaddlbt, unspec_based_function, (UNSPEC_SADDLBT, -1, -1))
 FUNCTION (svaddlt, unspec_based_function, (UNSPEC_SADDLT, UNSPEC_UADDLT, -1))
-FUNCTION (svaddwb, unspec_based_function, (UNSPEC_SADDWB, UNSPEC_UADDWB, -1))
-FUNCTION (svaddwt, unspec_based_function, (UNSPEC_SADDWT, UNSPEC_UADDWT, -1))
 FUNCTION (svaddp, unspec_based_pred_function, (UNSPEC_ADDP, UNSPEC_ADDP,
   UNSPEC_FADDP))
+FUNCTION (svaddwb, unspec_based_function, (UNSPEC_SADDWB, UNSPEC_UADDWB, -1))
+FUNCTION (svaddwt, unspec_based_function, (UNSPEC_SADDWT, UNSPEC_UADDWT, -1))
 FUNCTION (svaesd, fixed_insn_function, (CODE_FOR_aarch64_sve2_aesd))
 FUNCTION (svaese, fixed_insn_function, (CODE_FOR_aarch64_sve2_aese))
 FUNCTION (svaesimc, fixed_insn_function, (CODE_FOR_aarch64_sve2_aesimc))
@@ -649,12 +649,12 @@ FUNCTION (svldnt1uh_gather, svldnt1_gather_extend_impl, 
(TYPE_SUFFIX_u16))
 FUNCTION (svldnt1uw_gather, svldnt1_gather_extend_impl, (TYPE_SUFFIX_u32))
 FUNCTION (svlogb, unspec_based_function, (-1, -1, UNSPEC_COND_FLOGB))
 FUNCTION (svmatch, svmatch_svnmatch_impl, (UNSPEC_MATCH))
+FUNCTION (svmaxnmp, unspec_based_pred_function, (-1, -1, UNSPEC_FMAXNMP))
 FUNCTION (svmaxp, unspec_based_pred_function, (UNSPEC_SMAXP, UNSPEC_UMAXP,
   UNSPEC_FMAXP))
-FUNCTION (svmaxnmp, unspec_based_pred_function, (-1, -1, UNSPEC_FMAXNMP))
+FUNCTION (svminnmp, unspec_based_pred_function, (-1, -1, UNSPEC_FMINNMP))
 FUNCTION (svminp, unspec_based_pred_function, (UNSPEC_SMINP, UNSPEC_UMINP,
   UNSPEC_FMINP))
-FUNCTION (svminnmp, unspec_based_pred_function, (-1, -1, UNSPEC_FMINNMP))
 FUNCTION (svmlalb, unspec_based_mla_function, (UNSPEC_SMULLB,
   UNSPEC_UMULLB, UNSPEC_FMLALB))
 FUNCTION (svmlalb_lane, unspec_based_mla_lane_function, (UNSPEC_SMULLB,
@@ -723,15 +723,15 @@ FUNCTION (svqdmullt_lane, unspec_based_lane_function, 
(UNSPEC_SQDMULLT,
 FUNCTION (svqneg, rtx_code_function, (SS_NEG, UNKNOWN, UNKNOWN))
 FUNCTION (svqrdcmlah, svqrdcmlah_impl,)
 FUNCTION (svqrdcmlah_lane, svqrdcmlah_lane_impl,)
-FUNCTION (svqrdmulh, unspec_based_function, (UNSPEC_SQRDMULH, -1, -1))
-FUNCTION (svqrdmulh_lane, unspec_based_lane_function, (UNSPEC_SQRDMULH,
-  -1, -1))
 FUNCTION (svqrdmlah, unspec_based_function, (UNSPEC_SQRDMLAH, -1, -1))
 FUNCTION (svqrdmlah_lane, unspec_based_lane_function, (UNSPEC_SQRDMLAH,
   -1, -1))
 FUNCTION (svqrdmlsh, unspec_based_function, (UNSPEC_SQRDMLSH, -1, -1))
 FUNCTION (svqrdmlsh_lane, unspec_based_lane_function, (UNSPEC_SQRDMLSH,
   -1, -1))
+FUNCTION (svqrdmulh, unspec_based_function, (UNSPEC_SQRDMULH, -1, -1))
+FUNCTION (svqrdmulh_lane, unspec_based_lane_function, (UNSPEC_SQRDMULH,
+  -1, -1))
 FUNCTION (svqrshl, svqrshl_impl,)
 FUNCTION (svqrshr, unspec_based_uncond_function, (UNSPEC_SQRSHR,
 

[PATCH 08/15] aarch64: Factor out part of the SVE ext_def class

2024-11-06  Richard Sandiford
This patch factors out some of ext_def into a base class,
so that it can be reused for the SVE2.1 svextq intrinsic.
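
A sketch of how such a derived shape might look (hypothetical; the
immediate bound shown here is an assumption, not part of this patch):

  struct extq_def : public ext_base
  {
    bool
    check (function_checker &c) const override
    {
      /* Hypothetical range for a quadword-relative byte index.  */
      return c.require_immediate_range (2, 0, 15);
    }
  };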

gcc/
* config/aarch64/aarch64-sve-builtins-shapes.cc (ext_base): New base
class, extracted from...
(ext_def): ...here.
---
 .../aarch64/aarch64-sve-builtins-shapes.cc| 32 +++
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc b/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
index cf321540b60..62277afaeff 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
@@ -735,6 +735,23 @@ struct binary_za_slice_opt_single_base : public 
overloaded_base<1>
   }
 };
 
+/* Base class for ext.  */
+struct ext_base : public overloaded_base<0>
+{
+  void
+  build (function_builder &b, const function_group_info &group) const override
+  {
+b.add_overloaded_functions (group, MODE_none);
+build_all (b, "v0,v0,v0,su64", group, MODE_none);
+  }
+
+  tree
+  resolve (function_resolver &r) const override
+  {
+return r.resolve_uniform (2, 1);
+  }
+};
+
 /* Base class for inc_dec and inc_dec_pat.  */
 struct inc_dec_base : public overloaded_base<0>
 {
@@ -2413,21 +2430,8 @@ SHAPE (dupq)
 
where the final argument is an integer constant expression that when
multiplied by the number of bytes in t0 is in the range [0, 255].  */
-struct ext_def : public overloaded_base<0>
+struct ext_def : public ext_base
 {
-  void
-  build (function_builder &b, const function_group_info &group) const override
-  {
-b.add_overloaded_functions (group, MODE_none);
-build_all (b, "v0,v0,v0,su64", group, MODE_none);
-  }
-
-  tree
-  resolve (function_resolver &r) const override
-  {
-return r.resolve_uniform (2, 1);
-  }
-
   bool
   check (function_checker &c) const override
   {
-- 
2.25.1



[PATCH 05/15] aarch64: Add an abstraction for vector base addresses

2024-11-06  Richard Sandiford
In the upcoming SVE2.1 svld1q and svst1q intrinsics, the relationship
between the base vector and the data vector differs from existing
gather/scatter intrinsics.  This patch adds a new abstraction to
handle the difference.
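
For example (an assumed override, not code from this patch), a shape
whose base vectors are always 64-bit regardless of the data element
size could specialise the new hook in its shape class as:

  /* Inside a hypothetical shape class derived from function_shape.  */
  type_suffix_index
  vector_base_type (type_suffix_index) const override
  {
    return TYPE_SUFFIX_u64;
  }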

gcc/
* config/aarch64/aarch64-sve-builtins.h
(function_shape::vector_base_type): New member function.
* config/aarch64/aarch64-sve-builtins.cc
(function_shape::vector_base_type): Likewise.
(function_resolver::resolve_sv_displacement): Use it.
(function_resolver::resolve_gather_address): Likewise.
---
 gcc/config/aarch64/aarch64-sve-builtins.cc | 24 --
 gcc/config/aarch64/aarch64-sve-builtins.h  |  2 ++
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index c0b5115fdeb..a259f637a29 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1176,6 +1176,21 @@ aarch64_const_binop (enum tree_code code, tree arg1, 
tree arg2)
   return NULL_TREE;
 }
 
+/* Return the type that a vector base should have in a gather load or
+   scatter store involving vectors of type TYPE.  In an extending load,
+   TYPE is the result of the extension; in a truncating store, it is the
+   input to the truncation.
+
+   Index vectors have the same width as base vectors, but can be either
+   signed or unsigned.  */
+type_suffix_index
+function_shape::vector_base_type (type_suffix_index type) const
+{
+  unsigned int required_bits = type_suffixes[type].element_bits;
+  gcc_assert (required_bits == 32 || required_bits == 64);
+  return required_bits == 32 ? TYPE_SUFFIX_u32 : TYPE_SUFFIX_u64;
+}
+
 /* Return a hash code for a function_instance.  */
 hashval_t
 function_instance::hash () const
@@ -2750,7 +2765,8 @@ function_resolver::resolve_sv_displacement (unsigned int 
argno,
   return mode;
 }
 
-  unsigned int required_bits = type_suffixes[type].element_bits;
+  auto base_type = shape->vector_base_type (type);
+  unsigned int required_bits = type_suffixes[base_type].element_bits;
   if (required_bits == 32
   && displacement_units () == UNITS_elements
   && !lookup_form (MODE_s32index, type)
@@ -2900,11 +2916,7 @@ function_resolver::resolve_gather_address (unsigned int 
argno,
return MODE_none;
 
   /* Check whether the type is the right one.  */
-  unsigned int required_bits = type_suffixes[type].element_bits;
-  gcc_assert (required_bits == 32 || required_bits == 64);
-  type_suffix_index required_type = (required_bits == 32
-? TYPE_SUFFIX_u32
-: TYPE_SUFFIX_u64);
+  auto required_type = shape->vector_base_type (type);
   if (required_type != base_type)
{
  error_at (location, "passing %qT to argument %d of %qE,"
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h b/gcc/config/aarch64/aarch64-sve-builtins.h
index d5cc6e0a40d..1fb7abe132f 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -784,6 +784,8 @@ public:
  more common than false, so provide a default definition.  */
   virtual bool explicit_group_suffix_p () const { return true; }
 
+  virtual type_suffix_index vector_base_type (type_suffix_index) const;
+
   /* Define all functions associated with the given group.  */
   virtual void build (function_builder &,
  const function_group_info &) const = 0;
-- 
2.25.1



[PATCH 07/15] aarch64: Parameterise SVE pointer type inference

2024-11-06  Richard Sandiford
All extending gather load intrinsics encode the source type in
their name (e.g. svld1sb for an extending load from signed bytes).
The type of the extension result has to be specified using an
explicit type suffix; it isn't something that can be inferred
from the arguments, since there are multiple valid choices for
the same arguments.

This meant that type inference for gather loads was only needed for
non-extending loads, in which case the pointer target had to be a
32-bit or 64-bit element type.  The gather_scatter_p argument to
function_resolver::infer_pointer_type therefore controlled two things:
how we should react to vector base addresses, and whether we should
require a minimum element size of 32.

The element size restriction doesn't apply to the upcoming SVE2.1
svld1q intrinsic, so this patch adds a separate argument for the minimum
element size requirement.
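
A sketch of how a shape without the 32-bit/64-bit requirement might use
the new parameter (assumed usage, based on the enum added below):

  /* Inside a shape derived from load_gather_sv_base.  */
  function_resolver::target_type_restrictions
  get_target_type_restrictions (const function_instance &) const override
  {
    return function_resolver::TARGET_ANY;
  }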

gcc/
* config/aarch64/aarch64-sve-builtins.h
(function_resolver::target_type_restrictions): New enum.
(function_resolver::infer_pointer_type): Add an extra argument
that specifies what the target type can be.
* config/aarch64/aarch64-sve-builtins.cc
(function_resolver::infer_pointer_type): Likewise.
* config/aarch64/aarch64-sve-builtins-shapes.cc
(load_gather_sv_base::get_target_type_restrictions): New virtual
member function.
(load_gather_sv_base::resolve): Use it.  Update call to
infer_pointer_type.
---
 gcc/config/aarch64/aarch64-sve-builtins-shapes.cc | 10 +-
 gcc/config/aarch64/aarch64-sve-builtins.cc|  8 +---
 gcc/config/aarch64/aarch64-sve-builtins.h |  4 +++-
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc b/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
index e1204c283b6..cf321540b60 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
@@ -815,14 +815,22 @@ struct load_gather_sv_base : public overloaded_base<0>
 unsigned int i, nargs;
 mode_suffix_index mode;
 type_suffix_index type;
+auto restrictions = get_target_type_restrictions (r);
 if (!r.check_gp_argument (2, i, nargs)
-   || (type = r.infer_pointer_type (i, true)) == NUM_TYPE_SUFFIXES
+   || (type = r.infer_pointer_type (i, true,
+restrictions)) == NUM_TYPE_SUFFIXES
|| (mode = r.resolve_sv_displacement (i + 1, type, true),
mode == MODE_none))
   return error_mark_node;
 
 return r.resolve_to (mode, type);
   }
+
+  virtual function_resolver::target_type_restrictions
+  get_target_type_restrictions (const function_instance &) const
+  {
+return function_resolver::TARGET_32_64;
+  }
 };
 
 /* Base class for load_ext_gather_index and load_ext_gather_offset,
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc b/gcc/config/aarch64/aarch64-sve-builtins.cc
index a259f637a29..9fb0d6fd416 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
@@ -1998,10 +1998,12 @@ function_resolver::infer_64bit_scalar_integer_pair 
(unsigned int argno)
corresponding type suffix.  Return that type suffix on success,
otherwise report an error and return NUM_TYPE_SUFFIXES.
GATHER_SCATTER_P is true if the function is a gather/scatter
-   operation, and so requires a pointer to 32-bit or 64-bit data.  */
+   operation.  RESTRICTIONS describes any additional restrictions
+   on the target type.  */
 type_suffix_index
 function_resolver::infer_pointer_type (unsigned int argno,
-  bool gather_scatter_p)
+  bool gather_scatter_p,
+  target_type_restrictions restrictions)
 {
   tree actual = get_argument_type (argno);
   if (actual == error_mark_node)
@@ -2027,7 +2029,7 @@ function_resolver::infer_pointer_type (unsigned int argno,
   return NUM_TYPE_SUFFIXES;
 }
   unsigned int bits = type_suffixes[type].element_bits;
-  if (gather_scatter_p && bits != 32 && bits != 64)
+  if (restrictions == TARGET_32_64 && bits != 32 && bits != 64)
 {
   error_at (location, "passing %qT to argument %d of %qE, which"
" expects a pointer to 32-bit or 64-bit elements",
diff --git a/gcc/config/aarch64/aarch64-sve-builtins.h b/gcc/config/aarch64/aarch64-sve-builtins.h
index 1fb7abe132f..5bd9b88d117 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins.h
+++ b/gcc/config/aarch64/aarch64-sve-builtins.h
@@ -488,6 +488,7 @@ public:
 class function_resolver : public function_call_info
 {
 public:
+  enum target_type_restrictions { TARGET_ANY, TARGET_32_64 };
   enum { SAME_SIZE = 256, HALF_SIZE, QUARTER_SIZE };
   static const type_class_index SAME_TYPE_CLASS = NUM_TYPE_CLASSES;
 
@@ -518,7 +519,8 @@ public:
   vector_type_index infer_predicate_type (unsigned int);
   type_suffix_in

[PATCH 06/15] aarch64: Add an abstraction for scatter store type inference

2024-11-06  Richard Sandiford
Until now, all data arguments to a scatter store needed to have
32-bit or 64-bit elements.  This isn't true for the upcoming SVE2.1
svst1q scatter intrinsic, so this patch adds an abstraction around the
restriction.
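
A sketch of how a derived shape could relax the restriction (assumed
usage; the unrestricted infer_vector_type call is this example's guess
at what the follow-on svst1q shape will do):

  /* Inside a shape derived from store_scatter_base.  */
  type_suffix_index
  infer_vector_type (function_resolver &r, unsigned int argno) const override
  {
    return r.infer_vector_type (argno);
  }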

gcc/
* config/aarch64/aarch64-sve-builtins-shapes.cc
(store_scatter_base::infer_vector_type): New virtual member function.
(store_scatter_base::resolve): Use it.
---
 gcc/config/aarch64/aarch64-sve-builtins-shapes.cc | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc b/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
index f190770250f..e1204c283b6 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-shapes.cc
@@ -994,12 +994,18 @@ struct store_scatter_base : public overloaded_base<0>
 mode_suffix_index mode;
 type_suffix_index type;
 if (!r.check_gp_argument (has_displacement_p ? 3 : 2, i, nargs)
-   || (type = r.infer_sd_vector_type (nargs - 1)) == NUM_TYPE_SUFFIXES
+   || (type = infer_vector_type (r, nargs - 1)) == NUM_TYPE_SUFFIXES
|| (mode = r.resolve_gather_address (i, type, false)) == MODE_none)
   return error_mark_node;
 
 return r.resolve_to (mode, type);
   }
+
+  virtual type_suffix_index
+  infer_vector_type (function_resolver &r, unsigned int argno) const
+  {
+return r.infer_sd_vector_type (argno);
+  }
 };
 
 /* Base class for ternary operations in which the final argument is an
-- 
2.25.1



[PATCH 04/15] aarch64: Use braces in SVE TBL instructions

2024-11-06  Richard Sandiford
GCC previously used the older assembly syntax for SVE TBL, with no
braces around the second operand.  This patch switches to the newer,
official syntax, with braces around the operand.

The initial SVE binutils submission supported both syntaxes, so there
should be no issues with backwards compatibility.
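
Concretely, only the bracing of the table operand changes, e.g.:

  tbl     z0.b, z1.b, z2.b        // older syntax, previously emitted
  tbl     z0.b, { z1.b }, z2.b    // newer, official syntax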

gcc/
* config/aarch64/aarch64-sve.md (@aarch64_sve_tbl): Wrap
the second operand in braces.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/asm/dup_lane_bf16.c: Wrap the second
TBL operand in braces.
* gcc.target/aarch64/sve/acle/asm/dup_lane_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/dup_lane_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/tbl_u8.c: Likewise.
* gcc.target/aarch64/sve/slp_perm_6.c: Likewise.
* gcc.target/aarch64/sve/slp_perm_7.c: Likewise.
* gcc.target/aarch64/sve/vec_perm_1.c: Likewise.
* gcc.target/aarch64/sve/vec_perm_const_1.c: Likewise.
* gcc.target/aarch64/sve/vec_perm_const_1_overrun.c: Likewise.
* gcc.target/aarch64/sve/vec_perm_const_single_1.c: Likewise.
* gcc.target/aarch64/sve/vec_perm_single_1.c: Likewise.
* gcc.target/aarch64/sve/uzp1_1.c: Shorten the scan-assembler-nots
to just "\ttbl\".
* gcc.target/aarch64/sve/uzp2_1.c: Likewise.
---
 gcc/config/aarch64/aarch64-sve.md |  2 +-
 .../aarch64/sve/acle/asm/dup_lane_bf16.c  | 12 +--
 .../aarch64/sve/acle/asm/dup_lane_f16.c   | 12 +--
 .../aarch64/sve/acle/asm/dup_lane_f32.c   | 16 +++
 .../aarch64/sve/acle/asm/dup_lane_f64.c   | 18 -
 .../aarch64/sve/acle/asm/dup_lane_s16.c   | 12 +--
 .../aarch64/sve/acle/asm/dup_lane_s32.c   | 16 +++
 .../aarch64/sve/acle/asm/dup_lane_s64.c   | 20 +--
 .../aarch64/sve/acle/asm/dup_lane_s8.c|  8 
 .../aarch64/sve/acle/asm/dup_lane_u16.c   | 12 +--
 .../aarch64/sve/acle/asm/dup_lane_u32.c   | 16 +++
 .../aarch64/sve/acle/asm/dup_lane_u64.c   | 20 +--
 .../aarch64/sve/acle/asm/dup_lane_u8.c|  8 
 .../aarch64/sve/acle/asm/tbl_bf16.c   |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_f16.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_f32.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_f64.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_s16.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_s32.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_s64.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_s8.c  |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_u16.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_u32.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_u64.c |  6 +++---
 .../gcc.target/aarch64/sve/acle/asm/tbl_u8.c  |  6 +++---
 .../gcc.target/aarch64/sve/slp_perm_6.c   |  2 +-
 .../gcc.target/aarch64/sve/slp_perm_7.c   |  2 +-
 gcc/testsuite/gcc.target/aarch64/sve/uzp1_1.c |  8 
 gcc/testsuite/gcc.target/aarch64/sve/uzp2_1.c |  8 
 .../gcc.target/aarch64/sve/vec_perm_1.c   |  8 
 .../gcc.target/aarch64/sve/vec_perm_const_1.c |  8 
 .../aarch64/sve/vec_perm_const_1_overrun.c|  8 
 .../aarch64/sve/vec_perm_const_single_1.c |  8 
 .../aarch64/sve/vec_perm_single_1.c   |  8 
 34 files changed, 152 insertions(+), 152 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 06bd3e4bb2c..0955a697680 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -9040,7 

[PATCH 02/15] aarch64: Test TARGET_STREAMING instead of TARGET_STREAMING_SME

2024-11-06  Richard Sandiford
g:ede97598e2c recorded separate ISA requirements for streaming
and non-streaming mode.  The premise there was that AARCH64_FL_SME
should not be included in the streaming mode requirements, since:

(a) an __arm_streaming_compatible function wouldn't be in streaming
mode if SME wasn't available.

(b) __arm_streaming_compatible functions only allow things that are
possible in non-streaming mode, so the non-streaming architecture
is enough to assemble the code, even if +sme isn't enabled.

(c) we reject __arm_streaming if +sme isn't enabled, so don't need
to test it for individual intrinsics as well.

Later patches lean into this further.

This patch applies the same reasoning to the .md constructs for
base streaming-only SME instructions, guarding them with
TARGET_STREAMING rather than TARGET_STREAMING_SME.
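
For context, the deleted macro is assumed to have been the simple
conjunction below; given the reasoning above, TARGET_STREAMING alone is
enough for these patterns:

  /* Assumed previous definition in aarch64.h (now removed):  */
  #define TARGET_STREAMING_SME (TARGET_STREAMING && TARGET_SME)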

gcc/
* config/aarch64/aarch64.h (TARGET_SME): Expand comment.
(TARGET_STREAMING_SME): Delete.
* config/aarch64/aarch64-sme.md: Use TARGET_STREAMING instead of
TARGET_STREAMING_SME.
* config/aarch64/aarch64-sve2.md: Likewise.
---
 gcc/config/aarch64/aarch64-sme.md  | 28 ++--
 gcc/config/aarch64/aarch64-sve2.md |  8 
 gcc/config/aarch64/aarch64.h   |  6 ++
 3 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sme.md b/gcc/config/aarch64/aarch64-sme.md
index 9215f51b01f..8fca138314c 100644
--- a/gcc/config/aarch64/aarch64-sme.md
+++ b/gcc/config/aarch64/aarch64-sme.md
@@ -481,7 +481,7 @@ (define_insn "@aarch64_sme_"
   (match_operand: 2 "register_operand" "Upl")
   (match_operand:SME_ZA_I 3 "aarch64_sve_ldff1_operand" "Utf")]
  SME_LD1))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING"
   "ld1\t{ za%0.[%w1, 0] }, %2/z, %3"
 )
 
@@ -496,7 +496,7 @@ (define_insn "@aarch64_sme__plus"
   (match_operand: 3 "register_operand" "Upl")
   (match_operand:SME_ZA_I 4 "aarch64_sve_ldff1_operand" "Utf")]
  SME_LD1))]
-  "TARGET_STREAMING_SME
+  "TARGET_STREAMING
&& UINTVAL (operands[2]) < 128 / "
   "ld1\t{ za%0.[%w1, %2] }, %3/z, %4"
 )
@@ -583,7 +583,7 @@ (define_insn "@aarch64_sme_"
   (match_operand:SI 2 "register_operand" "Ucj")
   (match_operand: 3 "register_operand" "Upl")]
  SME_ST1))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING"
   "st1\t{ za%1.[%w2, 0] }, %3, %0"
 )
 
@@ -598,7 +598,7 @@ (define_insn "@aarch64_sme__plus"
(match_operand:SI 3 "const_int_operand"))
   (match_operand: 4 "register_operand" "Upl")]
  SME_ST1))]
-  "TARGET_STREAMING_SME
+  "TARGET_STREAMING
&& UINTVAL (operands[3]) < 128 / "
   "st1\t{ za%1.[%w2, %3] }, %4, %0"
 )
@@ -663,7 +663,7 @@ (define_insn "@aarch64_sme_"
   (match_operand:DI 3 "const_int_operand")
   (match_operand:SI 4 "register_operand" "Ucj")]
  SME_READ))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING"
   "mova\t%0., %2/m, za%3.[%w4, 0]"
 )
 
@@ -678,7 +678,7 @@ (define_insn 
"*aarch64_sme__plus"
   (plus:SI (match_operand:SI 4 "register_operand" "Ucj")
(match_operand:SI 5 "const_int_operand"))]
  SME_READ))]
-  "TARGET_STREAMING_SME
+  "TARGET_STREAMING
&& UINTVAL (operands[5]) < 128 / "
   "mova\t%0., %2/m, za%3.[%w4, %5]"
 )
@@ -693,7 +693,7 @@ (define_insn 
"@aarch64_sme_"
   (match_operand:DI 3 "const_int_operand")
   (match_operand:SI 4 "register_operand" "Ucj")]
  SME_READ))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING"
   "mova\t%0.q, %2/m, za%3.q[%w4, 0]"
 )
 
@@ -707,7 +707,7 @@ (define_insn "@aarch64_sme_"
   (match_operand: 2 "register_operand" "Upl")
   (match_operand:SVE_FULL 3 "register_operand" "w")]
  SME_WRITE))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING"
   "mova\tza%0.[%w1, 0], %2/m, %3."
 )
 
@@ -722,7 +722,7 @@ (define_insn 
"*aarch64_sme__plus"
   (match_operand: 3 "register_operand" "Upl")
   (match_operand:SVE_FULL 4 "register_operand" "w")]
  SME_WRITE))]
-  "TARGET_STREAMING_SME
+  "TARGET_STREAMING
&& UINTVAL (operands[2]) < 128 / "
   "mova\tza%0.[%w1, %2], %3/m, %4."
 )
@@ -737,7 +737,7 @@ (define_insn 
"@aarch64_sme_"
   (match_operand:VNx2BI 2 "register_operand" "Upl")
   (match_operand:SVE_FULL 3 "register_operand" "w")]
  SME_WRITE))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING"
   "mova\tza%0.q[%w1, 0], %2/m, %3.q"
 )
 
@@ -917,7 +917,7 @@ (define_insn "@aarch64_sme_"
   (match_operand: 2 "register_operand" "Upl")
   (match_operand:SME_ZA_SDI 3 "register_operand" "w")]
  SME_BINARY_SDI))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING"
   "\tza%0., %1/m, %2/m, %3."
 )
 
@@ -1479,7 +1479,7 @@ (define_insn 
"@aarch64_sme_"
   (match_operand:VNx16QI_ONLY 3 "register_operand" "w")
   (match_operand:VNx16QI_ONLY 4 "register_operand" "w")]
  SME_INT_MOP))]
-  "TARGET_STREAMIN

[PATCH 01/15] aarch64: Make more use of TARGET_STREAMING_SME2

2024-11-06  Richard Sandiford
Some code was checking TARGET_STREAMING and TARGET_SME2 separately,
but we now have a macro to test both at once.
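
For context, the combined macro is assumed to be the straightforward
conjunction (a sketch, not taken from this patch):

  #define TARGET_STREAMING_SME2 (TARGET_STREAMING && TARGET_SME2)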

gcc/
* config/aarch64/aarch64-sme.md: Use TARGET_STREAMING_SME2
instead of separate TARGET_STREAMING and TARGET_SME2 tests.
* config/aarch64/aarch64-sve2.md: Likewise.
* config/aarch64/iterators.md: Likewise.
---
 gcc/config/aarch64/aarch64-sme.md  | 34 --
 gcc/config/aarch64/aarch64-sve2.md |  6 +++---
 gcc/config/aarch64/iterators.md|  8 +++
 3 files changed, 21 insertions(+), 27 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-sme.md b/gcc/config/aarch64/aarch64-sme.md
index 78ad2fc699f..9215f51b01f 100644
--- a/gcc/config/aarch64/aarch64-sme.md
+++ b/gcc/config/aarch64/aarch64-sme.md
@@ -1334,7 +1334,7 @@ (define_insn 
"@aarch64_sme_"
   (match_operand:VNx8HI_ONLY 1 "register_operand" "w")
   (match_operand:VNx8HI_ONLY 2 "register_operand" "x")]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   "ll\tza.d[%w0, 0:3], %1.h, %2.h"
 )
 
@@ -1348,7 +1348,7 @@ (define_insn 
"*aarch64_sme__plus"
   (match_operand:VNx8HI_ONLY 2 "register_operand" "w")
   (match_operand:VNx8HI_ONLY 3 "register_operand" "x")]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   {
 operands[4] = GEN_INT (INTVAL (operands[1]) + 3);
 return "ll\tza.d[%w0, %1:%4], %2.h, %3.h";
@@ -1364,7 +1364,7 @@ (define_insn 
"@aarch64_sme_"
   (match_operand:SME_ZA_HIx24 1 "aligned_register_operand" 
"Uw")
   (match_operand:SME_ZA_HIx24 2 "aligned_register_operand" 
"Uw")]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   "ll\tza.d[%w0, 0:3, vgx], %1, %2"
 )
 
@@ -1378,7 +1378,7 @@ (define_insn 
"*aarch64_sme__plus"
   (match_operand:SME_ZA_HIx24 2 "aligned_register_operand" 
"Uw")
   (match_operand:SME_ZA_HIx24 3 "aligned_register_operand" 
"Uw")]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   {
 operands[4] = GEN_INT (INTVAL (operands[1]) + 3);
 return "ll\tza.d[%w0, %1:%4, vgx], %2, %3";
@@ -1395,7 +1395,7 @@ (define_insn 
"@aarch64_sme_single_"
   (vec_duplicate:SME_ZA_HIx24
 (match_operand: 2 "register_operand" "x"))]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   "ll\tza.d[%w0, 0:3, vgx], %1, %2.h"
 )
 
@@ -1410,7 +1410,7 @@ (define_insn 
"*aarch64_sme_single__p
   (vec_duplicate:SME_ZA_HIx24
 (match_operand: 3 "register_operand" "x"))]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   {
 operands[4] = GEN_INT (INTVAL (operands[1]) + 3);
 return "ll\tza.d[%w0, %1:%4, vgx], %2, %3.h";
@@ -1429,7 +1429,7 @@ (define_insn 
"@aarch64_sme_lane_"
  (match_operand:SI 3 "const_int_operand")]
 UNSPEC_SVE_LANE_SELECT)]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   "ll\tza.d[%w0, 0:3], %1, %2.h[%3]"
 )
 
@@ -1446,7 +1446,7 @@ (define_insn 
"*aarch64_sme_lane_"
  (match_operand:SI 4 "const_int_operand")]
 UNSPEC_SVE_LANE_SELECT)]
  SME_INT_TERNARY_SLICE))]
-  "TARGET_SME2 && TARGET_SME_I16I64 && TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2 && TARGET_SME_I16I64"
   {
 operands[5] = GEN_INT (INTVAL (operands[1]) + 3);
 return "ll\tza.d[%w0, %1:%5], %2, %3.h[%4]";
@@ -1642,8 +1642,7 @@ (define_insn 
"@aarch64_sme_"
   (match_operand:SME_ZA_SDFx24 1 "aligned_register_operand" 
"Uw")
   (match_operand:SME_ZA_SDFx24 2 "aligned_register_operand" 
"Uw")]
  SME_FP_TERNARY_SLICE))]
-  "TARGET_SME2
-   && TARGET_STREAMING_SME
+  "TARGET_STREAMING_SME2
&&  == "
   "\tza.[%w0, 0, vgx], %1, %2"
 )
@@ -1658,8 +1657,7 @@ (define_insn 
"*aarch64_sme__plus"
   (match_operand:SME_ZA_SDFx24 2 "aligned_register_operand" 
"Uw")
   (match_operand:SME_ZA_SDFx24 3 "aligned_register_operand" 
"Uw")]
  SME_FP_TERNARY_SLICE))]
-  "TARGET_SME2
-   && TARGET_STREAMING_SME
+  "TARGET_STREAMING_SME2
&&  == "
   "\tza.[%w0, %1, vgx], %2, %3"
 )
@@ -1674,8 +1672,7 @@ (define_insn 
"@aarch64_sme_single_
   (vec_duplicate:SME_ZA_SDFx24
 (match_operand: 2 "register_operand" "x"))]
  SME_FP_TERNARY_SLICE))]
-  "TARGET_SME2
-   && TARGET_STREAMING_SME
+  "TARGET_STREAMING_SME2
&&  == "
   "\tza.[%w0, 0, vgx], %1, 
%2."
 )
@@ -1691,8 +1688,7 @@ (defi

[PATCH 00/15] aarch64: Add support for SVE2.1

2024-11-06  Richard Sandiford
This series adds support for FEAT_SVE2p1 (-march=...+sve2p1).
One thing that the extension does is make some SME and SME2 instructions
available outside of streaming mode.  It also adds quite a few new
instructions.  Some of those new instructions are shared with SME2.1,
which will be added by a later patch.
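
For example, the extension would be enabled with a command line along
the lines of (illustrative only):

  gcc -O2 -march=armv9-a+sve2p1 -c foo.c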

Tested on aarch64-linux-gnu.  GNU binutils doesn't yet have full
support for SVE2.1, meaning that the aarch64_asm_sve2p1_ok target
selector fails and that the new aarch64-sve2-acle-asm.exp tests fall
back to "dg-do compile" instead of "dg-do assemble".  However, I also
tested aarch64-sve2-acle-asm.exp against LLVM's assembler using a
hacked-up script.

I also tried to cross-check GCC's implementation against LLVM's SVE2.1
ACLE tests.  There were some failures due to missing B16B16 support
(part of a separate follow-on series) and the fact that LLVM's stores
take pointers to const (raised separately), but otherwise things
seemed ok.

I'll commit this on Monday if there are no comments before then,
but please let me know if you'd like me to wait longer.  It will
likely need some minor updates due to conflicts with other
in-flight patches.

Richard


[PATCH 3/3] aarch64: Fix gcc.target/aarch64/sme2/acle-asm/bfmlslb_f32.c

2024-11-06 Thread Richard Sandiford
I missed a search-and-replace on this test, meaning that it was
duplicating bfmlalb_f32.c.

gcc/testsuite/
* gcc.target/aarch64/sme2/acle-asm/bfmlslb_f32.c: Replace bfmla*
with bfmls*.
---
 .../aarch64/sme2/acle-asm/bfmlslb_f32.c   | 60 +--
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/bfmlslb_f32.c 
b/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/bfmlslb_f32.c
index f67316cd33c..946af545141 100644
--- a/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/bfmlslb_f32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/bfmlslb_f32.c
@@ -3,63 +3,63 @@
 #include "test_sme2_acle.h"
 
 /*
-** bfmlalb_f32_tied1:
-** bfmlalb z0\.s, z4\.h, z5\.h
+** bfmlslb_f32_tied1:
+** bfmlslb z0\.s, z4\.h, z5\.h
 ** ret
 */
-TEST_DUAL_Z (bfmlalb_f32_tied1, svfloat32_t, svbfloat16_t,
-z0 = svbfmlalb_f32 (z0, z4, z5),
-z0 = svbfmlalb (z0, z4, z5))
+TEST_DUAL_Z (bfmlslb_f32_tied1, svfloat32_t, svbfloat16_t,
+z0 = svbfmlslb_f32 (z0, z4, z5),
+z0 = svbfmlslb (z0, z4, z5))
 
 /*
-** bfmlalb_f32_tied2:
+** bfmlslb_f32_tied2:
 ** mov (z[0-9]+)\.d, z0\.d
 ** movprfx z0, z4
-** bfmlalb z0\.s, \1\.h, z1\.h
+** bfmlslb z0\.s, \1\.h, z1\.h
 ** ret
 */
-TEST_DUAL_Z_REV (bfmlalb_f32_tied2, svfloat32_t, svbfloat16_t,
-z0_res = svbfmlalb_f32 (z4, z0, z1),
-z0_res = svbfmlalb (z4, z0, z1))
+TEST_DUAL_Z_REV (bfmlslb_f32_tied2, svfloat32_t, svbfloat16_t,
+z0_res = svbfmlslb_f32 (z4, z0, z1),
+z0_res = svbfmlslb (z4, z0, z1))
 
 /*
-** bfmlalb_f32_tied3:
+** bfmlslb_f32_tied3:
 ** mov (z[0-9]+)\.d, z0\.d
 ** movprfx z0, z4
-** bfmlalb z0\.s, z1\.h, \1\.h
+** bfmlslb z0\.s, z1\.h, \1\.h
 ** ret
 */
-TEST_DUAL_Z_REV (bfmlalb_f32_tied3, svfloat32_t, svbfloat16_t,
-z0_res = svbfmlalb_f32 (z4, z1, z0),
-z0_res = svbfmlalb (z4, z1, z0))
+TEST_DUAL_Z_REV (bfmlslb_f32_tied3, svfloat32_t, svbfloat16_t,
+z0_res = svbfmlslb_f32 (z4, z1, z0),
+z0_res = svbfmlslb (z4, z1, z0))
 
 /*
-** bfmlalb_f32_untied:
+** bfmlslb_f32_untied:
 ** movprfx z0, z1
-** bfmlalb z0\.s, z4\.h, z5\.h
+** bfmlslb z0\.s, z4\.h, z5\.h
 ** ret
 */
-TEST_DUAL_Z (bfmlalb_f32_untied, svfloat32_t, svbfloat16_t,
-z0 = svbfmlalb_f32 (z1, z4, z5),
-z0 = svbfmlalb (z1, z4, z5))
+TEST_DUAL_Z (bfmlslb_f32_untied, svfloat32_t, svbfloat16_t,
+z0 = svbfmlslb_f32 (z1, z4, z5),
+z0 = svbfmlslb (z1, z4, z5))
 
 /*
-** bfmlalb_h7_f32_tied1:
+** bfmlslb_h7_f32_tied1:
 ** mov (z[0-9]+\.h), h7
-** bfmlalb z0\.s, z4\.h, \1
+** bfmlslb z0\.s, z4\.h, \1
 ** ret
 */
-TEST_DUAL_ZD (bfmlalb_h7_f32_tied1, svfloat32_t, svbfloat16_t, bfloat16_t,
- z0 = svbfmlalb_n_f32 (z0, z4, d7),
- z0 = svbfmlalb (z0, z4, d7))
+TEST_DUAL_ZD (bfmlslb_h7_f32_tied1, svfloat32_t, svbfloat16_t, bfloat16_t,
+ z0 = svbfmlslb_n_f32 (z0, z4, d7),
+ z0 = svbfmlslb (z0, z4, d7))
 
 /*
-** bfmlalb_h7_f32_untied:
+** bfmlslb_h7_f32_untied:
 ** mov (z[0-9]+\.h), h7
 ** movprfx z0, z1
-** bfmlalb z0\.s, z4\.h, \1
+** bfmlslb z0\.s, z4\.h, \1
 ** ret
 */
-TEST_DUAL_ZD (bfmlalb_h7_f32_untied, svfloat32_t, svbfloat16_t, bfloat16_t,
- z0 = svbfmlalb_n_f32 (z1, z4, d7),
- z0 = svbfmlalb (z1, z4, d7))
+TEST_DUAL_ZD (bfmlslb_h7_f32_untied, svfloat32_t, svbfloat16_t, bfloat16_t,
+ z0 = svbfmlslb_n_f32 (z1, z4, d7),
+ z0 = svbfmlslb (z1, z4, d7))
-- 
2.25.1



[PATCH 0/3] aarch64: Fix various issues with the SME support

2024-11-06 Thread Richard Sandiford
While adding support for SVE2.1 and SME2.1, I found several
embarrassing mistakes in my earlier SME and SME2 patches. :(
This series tries to fix them.

Tested on aarch64-linux-gnu.  I'm planning to commit to trunk on
Thursday evening UTC if there are no comments before then, but please
let me know if you'd like me to hold off.  I'll backport to GCC 14 after
a grace period.

Richard Sandiford (3):
  aarch64: Restrict FCLAMP to SME2
  aarch64: Make PSEL dependent on SME rather than SME2
  aarch64: Fix gcc.target/aarch64/sme2/acle-asm/bfmlslb_f32.c

 gcc/config/aarch64/aarch64-sve2.md|  8 +--
 .../{sme2 => sme}/acle-asm/psel_lane_b16.c|  2 +-
 .../{sme2 => sme}/acle-asm/psel_lane_b32.c|  2 +-
 .../{sme2 => sme}/acle-asm/psel_lane_b64.c|  2 +-
 .../{sme2 => sme}/acle-asm/psel_lane_b8.c |  2 +-
 .../{sme2 => sme}/acle-asm/psel_lane_c16.c|  2 +-
 .../{sme2 => sme}/acle-asm/psel_lane_c32.c|  2 +-
 .../{sme2 => sme}/acle-asm/psel_lane_c64.c|  2 +-
 .../{sme2 => sme}/acle-asm/psel_lane_c8.c |  2 +-
 .../gcc.target/aarch64/sme/clamp_3.c  |  2 +
 .../gcc.target/aarch64/sme/clamp_4.c  |  2 +
 .../gcc.target/aarch64/sme/clamp_5.c  | 24 
 .../aarch64/sme2/acle-asm/bfmlslb_f32.c   | 60 +--
 13 files changed, 70 insertions(+), 42 deletions(-)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b16.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b32.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b64.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b8.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c16.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c32.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c64.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c8.c 
(98%)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sme/clamp_5.c

-- 
2.25.1



[PATCH 1/3] aarch64: Restrict FCLAMP to SME2

2024-11-06 Thread Richard Sandiford
There are two sets of patterns for FCLAMP: one set for single registers
and one set for multiple registers.  The multiple-register set was
correctly gated on SME2, but the single-register set only required SME.
This doesn't matter for ACLE usage, since the intrinsic definitions
are correctly gated.  But it does matter for automatic generation of
FCLAMP from separate minimum and maximum operations (either ACLE
intrinsics or autovectorised code).
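
To make the problem concrete, the kind of source that triggers the
automatic FCLAMP formation is sketched below.  It mirrors the tied1
function in the clamp tests; the header and exact predicate spelling
are assumptions made for the sketch, and it is assumed to be compiled
with SME enabled:

  #include <arm_sve.h>

  svfloat32_t
  clamp (svfloat32_t x, svfloat32_t lo, svfloat32_t hi) __arm_streaming
  {
    /* minnm (maxnm (x, lo), hi) may only be fused into FCLAMP
       when SME2 is available.  */
    return svminnm_x (svptrue_b8 (),
                      svmaxnm_x (svptrue_b8 (), x, lo), hi);
  }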

gcc/
* config/aarch64/aarch64-sve2.md (@aarch64_sve_fclamp)
(*aarch64_sve_fclamp_x): Require TARGET_STREAMING_SME2
rather than TARGET_STREAMING_SME.

gcc/testsuite/
* gcc.target/aarch64/sme/clamp_3.c: Force sme2.
* gcc.target/aarch64/sme/clamp_4.c: Likewise.
* gcc.target/aarch64/sme/clamp_5.c: New test.
---
 gcc/config/aarch64/aarch64-sve2.md|  4 ++--
 .../gcc.target/aarch64/sme/clamp_3.c  |  2 ++
 .../gcc.target/aarch64/sme/clamp_4.c  |  2 ++
 .../gcc.target/aarch64/sme/clamp_5.c  | 24 +++
 4 files changed, 30 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sme/clamp_5.c

diff --git a/gcc/config/aarch64/aarch64-sve2.md 
b/gcc/config/aarch64/aarch64-sve2.md
index 8047f405a17..08f83fc7ca0 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -1117,7 +1117,7 @@ (define_insn "@aarch64_sve_fclamp"
 UNSPEC_FMAXNM)
   (match_operand:SVE_FULL_F 3 "register_operand")]
  UNSPEC_FMINNM))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2"
   {@ [cons: =0,  1, 2, 3; attrs: movprfx]
  [   w, %0, w, w; * ] fclamp\t%0., %2., 
%3.
  [ ?&w,  w, w, w; yes   ] movprfx\t%0, 
%1\;fclamp\t%0., %2., %3.
@@ -1137,7 +1137,7 @@ (define_insn_and_split "*aarch64_sve_fclamp_x"
 UNSPEC_COND_FMAXNM)
   (match_operand:SVE_FULL_F 3 "register_operand")]
  UNSPEC_COND_FMINNM))]
-  "TARGET_STREAMING_SME"
+  "TARGET_STREAMING_SME2"
   {@ [cons: =0,  1, 2, 3; attrs: movprfx]
  [   w, %0, w, w; * ] #
  [ ?&w,  w, w, w; yes   ] #
diff --git a/gcc/testsuite/gcc.target/aarch64/sme/clamp_3.c 
b/gcc/testsuite/gcc.target/aarch64/sme/clamp_3.c
index 44959f79490..162de6224d5 100644
--- a/gcc/testsuite/gcc.target/aarch64/sme/clamp_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sme/clamp_3.c
@@ -2,6 +2,8 @@
 
 #include 
 
+#pragma GCC target "+sme2"
+
 #define TEST(TYPE) \
   TYPE \
   tied1_##TYPE(TYPE a, TYPE b, TYPE c) __arm_streaming \
diff --git a/gcc/testsuite/gcc.target/aarch64/sme/clamp_4.c 
b/gcc/testsuite/gcc.target/aarch64/sme/clamp_4.c
index 643b2635b90..453c82cd860 100644
--- a/gcc/testsuite/gcc.target/aarch64/sme/clamp_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sme/clamp_4.c
@@ -2,6 +2,8 @@
 
 #include 
 
+#pragma GCC target "+sme2"
+
 #define TEST(TYPE) \
   TYPE \
   untied_##TYPE(TYPE a, TYPE b, TYPE c, TYPE d) __arm_streaming
\
diff --git a/gcc/testsuite/gcc.target/aarch64/sme/clamp_5.c 
b/gcc/testsuite/gcc.target/aarch64/sme/clamp_5.c
new file mode 100644
index 000..7c5464bdc36
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sme/clamp_5.c
@@ -0,0 +1,24 @@
+// { dg-options "-O" }
+
+#include 
+
+#pragma GCC target "+nosme2"
+
+#define TEST(TYPE) \
+  TYPE \
+  tied1_##TYPE(TYPE a, TYPE b, TYPE c) __arm_streaming \
+  {\
+return svminnm_x(svptrue_b8(), svmaxnm_x(svptrue_b8(), a, b), c);  \
+  }\
+   \
+  TYPE \
+  tied2_##TYPE(TYPE a, TYPE b, TYPE c) __arm_streaming \
+  {\
+return svminnm_x(svptrue_b8(), svmaxnm_x(svptrue_b8(), b, a), c);  \
+  }
+
+TEST(svfloat16_t)
+TEST(svfloat32_t)
+TEST(svfloat64_t)
+
+/* { dg-final { scan-assembler-not {\tfclamp\t} } } */
-- 
2.25.1



[PATCH 2/3] aarch64: Make PSEL dependent on SME rather than SME2

2024-11-06 Thread Richard Sandiford
The svpsel_lane intrinsics were wrongly classified as SME2+ only,
rather than as base SME intrinsics.  They should always be available
in streaming mode.
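
For illustration, a call such as the one below should therefore be
accepted in a streaming function with just +sme enabled.  The prototype
follows the ACLE; the header name and the wrapper function are
assumptions made for this sketch, not something the patch adds:

  #include <arm_sve.h>

  /* PSEL: return PN if element IDX of PM is active, otherwise an
     all-false predicate.  */
  svbool_t
  pick (svbool_t pn, svbool_t pm, uint32_t idx) __arm_streaming
  {
    return svpsel_lane_b32 (pn, pm, idx);
  }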

gcc/
* config/aarch64/aarch64-sve2.md (@aarch64_sve_psel)
(*aarch64_sve_psel_plus): Require TARGET_STREAMING
rather than TARGET_STREAMING_SME2.

gcc/testsuite/
* gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_b16.c: ...here.
* gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_b32.c: ...here.
* gcc.target/aarch64/sme2/acle-asm/psel_lane_b64.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_b64.c: ...here.
* gcc.target/aarch64/sme2/acle-asm/psel_lane_b8.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_b8.c: ...here.
* gcc.target/aarch64/sme2/acle-asm/psel_lane_c16.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_c16.c: ...here.
* gcc.target/aarch64/sme2/acle-asm/psel_lane_c32.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_c32.c: ...here.
* gcc.target/aarch64/sme2/acle-asm/psel_lane_c64.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_c64.c: ...here.
* gcc.target/aarch64/sme2/acle-asm/psel_lane_c8.c: Move to...
* gcc.target/aarch64/sme/acle-asm/psel_lane_c8.c: ...here.
---
 gcc/config/aarch64/aarch64-sve2.md| 4 ++--
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b16.c | 2 +-
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b32.c | 2 +-
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b64.c | 2 +-
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b8.c  | 2 +-
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c16.c | 2 +-
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c32.c | 2 +-
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c64.c | 2 +-
 .../gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c8.c  | 2 +-
 9 files changed, 10 insertions(+), 10 deletions(-)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b16.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b32.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b64.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_b8.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c16.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c32.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c64.c 
(98%)
 rename gcc/testsuite/gcc.target/aarch64/{sme2 => sme}/acle-asm/psel_lane_c8.c 
(98%)

diff --git a/gcc/config/aarch64/aarch64-sve2.md 
b/gcc/config/aarch64/aarch64-sve2.md
index 08f83fc7ca0..ac27124fb74 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -418,7 +418,7 @@ (define_insn "@aarch64_sve_psel"
   (match_operand:SI 3 "register_operand" "Ucj")
   (const_int BHSD_BITS)]
  UNSPEC_PSEL))]
-  "TARGET_STREAMING_SME2"
+  "TARGET_STREAMING"
   "psel\t%0, %1, %2.[%w3, 0]"
 )
 
@@ -432,7 +432,7 @@ (define_insn "*aarch64_sve_psel_plus"
 (match_operand:SI 4 "const_int_operand"))
   (const_int BHSD_BITS)]
  UNSPEC_PSEL))]
-  "TARGET_STREAMING_SME2
+  "TARGET_STREAMING
&& UINTVAL (operands[4]) < 128 / "
   "psel\t%0, %1, %2.[%w3, %4]"
 )
diff --git a/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c 
b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/psel_lane_b16.c
similarity index 98%
rename from gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c
rename to gcc/testsuite/gcc.target/aarch64/sme/acle-asm/psel_lane_b16.c
index 704e9e375f5..45dda808d2a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/psel_lane_b16.c
+++ b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/psel_lane_b16.c
@@ -1,6 +1,6 @@
 /* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
 
-#include "test_sme2_acle.h"
+#include "test_sme_acle.h"
 
 /*
 ** psel_lane_p0_p2_p7_0:
diff --git a/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c 
b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/psel_lane_b32.c
similarity index 98%
rename from gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c
rename to gcc/testsuite/gcc.target/aarch64/sme/acle-asm/psel_lane_b32.c
index 7d9c7a129ea..d3d1b7b42ca 100644
--- a/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/psel_lane_b32.c
+++ b/gcc/testsuite/gcc.target/aarch64/sme/acle-asm/psel_lane_b32.c
@@ -1,6 +1,6 @@
 /* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
 
-#include "test_sme2_acle.h"
+#include "test_sme_acle.h"
 
 /*
 ** psel_lane_p0_p2_p7_0:
diff --git a/gcc/testsuite/gcc.target/aarch64/sme2/acle-asm/psel_lane_b64.c 
b/gcc/testsuite/gcc.t

Re: [PATCH 3/4] sched1: model: only promote true dependencies in predecessor promotion

2024-11-05 Thread Richard Sandiford
Sorry, still haven't found time to look at the patch properly
(hopefully after stage 1 closes, if not before), but:

Jeff Law  writes:
> [...]
> On 10/31/24 1:35 PM, Vineet Gupta wrote:
>>> And if it doesn't strictly need to be a valid schedule are we giving an
>>> overly-optimistic view of the best that can be done from a pressure
>>> standpoint with this change?  And if so, is that wise?
>> 
>> As I mentioned above, the design goal of model schedule is to keep pressure 
>> to min.
>> So indeed we might be a bit more optimistic than reality here. But main list 
>> scheduler fixes
>> that if that leads to undesired outcomes. What we are trying to do here is 
>> not pessimize
>>   in certain cases, especially when that's not by design but just an outcome 
>> of the
>>   implementation subtlety.
> And my point is that if I'm allowed to generate a minimal register 
> pressure schedule without needing it to generate the same semantics as 
> the original code, then I could drastically reduce the register pressure 
> :-) But I'd also claim that doing so isn't actually useful.
>
> The mental model I'd work from is we want to know the minimal pressure 
> while still preserving the original code semantics -- unless someone who 
> knows this better (ie Richard S) were to say otherwise.

Yeah, that's right.  The "model" schedule was supposed to be a correct
schedule that completely ignored the target pipeline and instead tried
to minimise register pressure.  This was in contrast to the list
scheduler without any pressure heuristics, which tried to fill pipeline
bubbles while completely ignoring register pressure.

The idea was then to strike a balance between the list scheduler's
natural tendency to do things as soon as the pipeline model allowed
vs the model scheduler's tendency to delay things that would increase
pressure.  But all three schedules (the two extremes, and the compromise)
are intended to be correct in isolation.

Richard


[PATCH] aarch64: Fix incorrect LS64 documentation

2024-11-04 Thread Richard Sandiford
As Yuta Mukai pointed out, the manual wrongly said that LS64 is
enabled by default for Armv8.7-A and above, and for Armv9.2-A
and above.  LS64 is not mandatory at any architecture level
(and the code correctly implemented that).

I think this was a leftover from an early version of the spec.

gcc/
* doc/invoke.texi: Fix documentation of LS64 so that it's
not implied by Armv8.7-A or Armv9.2-A.
---
Tested on aarch64-linux-gnu & pushed to trunk.  I'll backport
to branches soon.

Thanks to Yuta for the spot.

Richard

 gcc/doc/invoke.texi | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 28ef2cde43d..7146163d66d 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -21443,12 +21443,12 @@ and the features that they enable by default:
 @item @samp{armv8.4-a} @tab Armv8.4-A @tab @samp{armv8.3-a}, @samp{+flagm}, 
@samp{+fp16fml}, @samp{+dotprod}
 @item @samp{armv8.5-a} @tab Armv8.5-A @tab @samp{armv8.4-a}, @samp{+sb}, 
@samp{+ssbs}, @samp{+predres}
 @item @samp{armv8.6-a} @tab Armv8.6-A @tab @samp{armv8.5-a}, @samp{+bf16}, 
@samp{+i8mm}
-@item @samp{armv8.7-a} @tab Armv8.7-A @tab @samp{armv8.6-a}, @samp{+ls64}
+@item @samp{armv8.7-a} @tab Armv8.7-A @tab @samp{armv8.6-a}
 @item @samp{armv8.8-a} @tab Armv8.8-a @tab @samp{armv8.7-a}, @samp{+mops}
 @item @samp{armv8.9-a} @tab Armv8.9-a @tab @samp{armv8.8-a}
 @item @samp{armv9-a} @tab Armv9-A @tab @samp{armv8.5-a}, @samp{+sve}, 
@samp{+sve2}
 @item @samp{armv9.1-a} @tab Armv9.1-A @tab @samp{armv9-a}, @samp{+bf16}, 
@samp{+i8mm}
-@item @samp{armv9.2-a} @tab Armv9.2-A @tab @samp{armv9.1-a}, @samp{+ls64}
+@item @samp{armv9.2-a} @tab Armv9.2-A @tab @samp{armv9.1-a}
 @item @samp{armv9.3-a} @tab Armv9.3-A @tab @samp{armv9.2-a}, @samp{+mops}
 @item @samp{armv9.4-a} @tab Armv9.4-A @tab @samp{armv9.3-a}
 @item @samp{armv8-r} @tab Armv8-R @tab @samp{armv8-r}
@@ -21773,7 +21773,6 @@ default for @option{-march=armv8.6-a}.  Use of this 
option with architectures
 prior to Armv8.2-A is not supported.
 @item ls64
 Enable the 64-byte atomic load and store instructions for accelerators.
-This option is enabled by default for @option{-march=armv8.7-a}.
 @item mops
 Enable the instructions to accelerate memory operations like @code{memcpy},
 @code{memmove}, @code{memset}.  This option is enabled by default for
-- 
2.25.1



Re: [PATCH v2] aarch64: Add support for FUJITSU-MONAKA (-mcpu=fujitsu-monaka) CPU

2024-11-04 Thread Richard Sandiford
"Yuta Mukai (Fujitsu)"  writes:
> Thank you for the reviews! I attached a patch that fixes the problems.
>
>>> On 31 Oct 2024, at 11:50, Richard Sandiford  
>>> wrote:
>>> 
>>> "Yuta Mukai (Fujitsu)"  writes:
>>>> Hello,
>>>> 
>>>> This patch adds initial support for FUJITSU-MONAKA CPU, which we are 
>>>> developing.
>>>> This is the slides for the CPU: 
>>>> https://www.fujitsu.com/downloads/SUPER/topics/isc24/next-arm-based-processor-fujitsu-monaka-and-its-software-ecosystem.pdf
>>>> 
>>>> Bootstrapped/regtested on aarch64-unknown-linux-gnu.
>>>> 
>>>> We will post a patch for backporting to GCC 14 later.
>>>> 
>>>> We would be grateful if someone could push this on our behalf, as we do 
>>>> not have write access.
>>> 
>>> Thanks for the patch, it looks good.  I just have a couple of minor 
>>> comments:
>>> 
>>>> @@ -132,6 +132,7 @@ AARCH64_CORE("octeontx2f95mm", octeontx2f95mm, 
>>>> cortexa57, V8_2A,  (CRYPTO, PROFI
>>>> 
>>>> /* Fujitsu ('F') cores. */
>>>> AARCH64_CORE("a64fx", a64fx, a64fx, V8_2A,  (F16, SVE), a64fx, 0x46, 
>>>> 0x001, -1)
>>>> +AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (AES, 
>>>> CRYPTO, F16, F16FML, FP8, LS64, RCPC, RNG, SHA2, SHA3, SM4, SVE2_AES, 
>>>> SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), fujitsu_monaka, 0x46, 0x003, -1)
>>> 
>>> Usually this file omits listing a feature if it is already implied by the
>>> architecture level.  In this case, I think V9_3A should enable F16FML and
>>> RCPC automatically, and so we could drop those features from the list.
>>> 
>>> Also, we should be able to rely on transitive dependencies for the
>>> SVE2 crypto extensions.  So I think it should be enough to list:
>>> 
>>> AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (F16, FP8, 
>>> LS64, RNG, SVE2_AES, SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), fujitsu_monaka, 
>>> 0x46, 0x003, -1)
>>> 
>>> which should have the same effect.
>>> 
>>> Could you check whether that works?
>
> Thanks for the list.
> CRYPTO was found not to be implied by SHA2, so I left only it there.
>
> Incidentally, the manual says that LS64 is automatically enabled for V9_2A, 
> but it is not.
> Should the manual be corrected?
>
> https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#index-march
>> ‘armv9.2-a’  Armv9.2-A   ‘armv9.1-a’, ‘+ls64’

Oops, yes!  Thanks for pointing that out.  I'll push a patch separately.

>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> new file mode 100644
>>>> index 0..8d6f297b8
>>>> --- /dev/null
>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>> @@ -0,0 +1,65 @@
>>>> +/* Tuning model description for AArch64 architecture.
>>> 
>>> It's probably worth changing "AArch64 architecture" to "FUJITSU-MONAKA".
>
> Fixed.
>
>>> 
>>> The patch looks good to me otherwise.
>>
>>Looks ok to me modulo those comments as well.
>>The ChangeLog should be improved a little bit too.
>>
>>* config/aarch64/aarch64-cores.def (AARCH64_CORE): Add fujitsu-monaka
>>* config/aarch64/aarch64-tune.md: Regenerate
>>* config/aarch64/aarch64.cc: Include fujitsu-monaka tuning model
>>* doc/invoke.texi: Document -mcpu=fujitsu-monaka
>>* config/aarch64/tuning_models/fujitsu_monaka.h: New file.
>>
>>The sentences should end in full stop “.”
>
> Fixed.

Thanks for the patch.  I've pushed it to trunk.

Richard


Re: [PATCH 1/2] aarch64: Remove scheduling models for falkor and saphira

2024-10-31 Thread Richard Sandiford
Andrew Pinski  writes:
> These 2 Qualcomm cores are long gone; Qualcomm has not supported them
> since at least 2019.  Removing them will, I think, make it easier to
> change the insn type attributes instead of keeping the models up to date.
>
> Note this does not remove the cores, just the schedule models.
>
> Bootstrapped and tested on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-cores.def (falkor): Use cortex-a57 scheduler.
>   (saphira): Likewise.
>   * config/aarch64/aarch64.md: Don't include falkor.md and saphira.md.
>   * config/aarch64/falkor.md: Removed.
>   * config/aarch64/saphira.md: Removed.

Thanks for doing this.  You're in a much better position than me to say
whether it's the right thing to do or not.  So OK from my POV, but please
give others a day or so to comment.

Same goes for patch 2.

Thanks,
Richard

>
> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64-cores.def |   8 +-
>  gcc/config/aarch64/aarch64.md|   2 -
>  gcc/config/aarch64/falkor.md | 687 ---
>  gcc/config/aarch64/saphira.md| 560 --
>  4 files changed, 4 insertions(+), 1253 deletions(-)
>  delete mode 100644 gcc/config/aarch64/falkor.md
>  delete mode 100644 gcc/config/aarch64/saphira.md
>
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index cc226003688..c9bdf3dcf2e 100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -83,14 +83,14 @@ AARCH64_CORE("emag",emag,  xgene1,V8A,  
> (CRC, CRYPTO), emag, 0x5
>  AARCH64_CORE("xgene1",  xgene1,xgene1,V8A,  (), xgene1, 0x50, 
> 0x000, -1)
>  
>  /* Qualcomm ('Q') cores. */
> -AARCH64_CORE("falkor",  falkor,falkor,V8A,  (CRC, CRYPTO, RDMA), 
> qdf24xx,   0x51, 0xC00, -1)
> -AARCH64_CORE("qdf24xx", qdf24xx,   falkor,V8A,  (CRC, CRYPTO, RDMA), 
> qdf24xx,   0x51, 0xC00, -1)
> +AARCH64_CORE("falkor",  falkor,cortexa57,V8A,  (CRC, CRYPTO, 
> RDMA), qdf24xx,   0x51, 0xC00, -1)
> +AARCH64_CORE("qdf24xx", qdf24xx,   cortexa57,V8A,  (CRC, CRYPTO, 
> RDMA), qdf24xx,   0x51, 0xC00, -1)
>  
>  /* Samsung ('S') cores. */
>  AARCH64_CORE("exynos-m1",   exynosm1,  exynosm1,  V8A,  (CRC, CRYPTO), 
> exynosm1,  0x53, 0x001, -1)
>  
>  /* HXT ('h') cores. */
> -AARCH64_CORE("phecda",  phecda,falkor,V8A,  (CRC, CRYPTO), 
> qdf24xx,   0x68, 0x000, -1)
> +AARCH64_CORE("phecda",  phecda,cortexa57,V8A,  (CRC, CRYPTO), 
> qdf24xx,   0x68, 0x000, -1)
>  
>  /* ARMv8.1-A Architecture Processors.  */
>  
> @@ -149,7 +149,7 @@ AARCH64_CORE("zeus", zeus, cortexa57, V8_4A,  (SVE, I8MM, 
> BF16, PROFILE, SSBS, R
>  AARCH64_CORE("neoverse-512tvb", neoverse512tvb, cortexa57, V8_4A,  (SVE, 
> I8MM, BF16, PROFILE, SSBS, RNG), neoverse512tvb, INVALID_IMP, INVALID_CORE, 
> -1)
>  
>  /* Qualcomm ('Q') cores. */
> -AARCH64_CORE("saphira", saphira,saphira,V8_4A,  (CRYPTO), 
> saphira,   0x51, 0xC01, -1)
> +AARCH64_CORE("saphira", saphira,cortexa57,V8_4A,  (CRYPTO), 
> saphira,   0x51, 0xC01, -1)
>  
>  /* ARMv8.6-A Architecture Processors.  */
>  
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 20956fc49d8..8d10197c9e8 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -589,8 +589,6 @@ (define_attr "ldpstp" "ldp,stp,none" (const_string 
> "none"))
>  (include "../arm/cortex-a53.md")
>  (include "../arm/cortex-a57.md")
>  (include "../arm/exynos-m1.md")
> -(include "falkor.md")
> -(include "saphira.md")
>  (include "thunderx.md")
>  (include "../arm/xgene1.md")
>  (include "thunderx2t99.md")
> diff --git a/gcc/config/aarch64/falkor.md b/gcc/config/aarch64/falkor.md
> deleted file mode 100644
> index 0c5cf930e89..000
> --- a/gcc/config/aarch64/falkor.md
> +++ /dev/null
> @@ -1,687 +0,0 @@
> -;; Falkor pipeline description
> -;; Copyright (C) 2017-2024 Free Software Foundation, Inc.
> -;;
> -;; This file is part of GCC.
> -;;
> -;; GCC is free software; you can redistribute it and/or modify it
> -;; under the terms of the GNU General Public License as published by
> -;; the Free Software Foundation; either version 3, or (at your option)
> -;; any later version.
> -;;
> -;; GCC is distributed in the hope that it will be useful, but
> -;; WITHOUT ANY WARRANTY; without even the implied warranty of
> -;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> -;; General Public License for more details.
> -;;
> -;; You should have received a copy of the GNU General Public License
> -;; along with GCC; see the file COPYING3.  If not see
> -;; .
> -
> -(define_automaton "falkor")
> -
> -;; Complex int instructions (e.g. multiply and divide) execute in the X
> -;; pipeline.  Simple int instructions execute in the X, Y, and Z pipelines.
> -
> -(define_cpu_unit "falkor_x" "fal

Re: [PATCH] aarch64: Recognize vector permute patterns suitable for FMOV [PR100165]

2024-10-31 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This patch optimizes certain vector permute expansion with the FMOV 
> instruction
> when one of the input vectors is a vector of all zeros and the result of the
> vector permute is as if the upper lane of the non-zero input vector is set to
> zero and the lower lane remains unchanged.
>
> Note that the patch also propagates zero_op0_p and zero_op1_p during re-encode
> now.  They will be used by aarch64_evpc_fmov to check if the input vectors are
> valid candidates.
>
>   PR target/100165
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-simd.md (aarch64_simd_vec_set_zero_fmov):
>   New define_insn.
>   * config/aarch64/aarch64.cc (aarch64_evpc_reencode): Copy zero_op0_p and
>   zero_op1_p.
>   (aarch64_evpc_fmov): New function.
>   (aarch64_expand_vec_perm_const_1): Add call to aarch64_evpc_fmov.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/vec-set-zero.c: Update test accordingly.
>   * gcc.target/aarch64/fmov.c: New test.
>   * gcc.target/aarch64/fmov-be.c: New test.

Nice!  Thanks for doing this.  Some comments on the patch below.
>
> Signed-off-by: Pengxuan Zheng 
> ---
>  gcc/config/aarch64/aarch64-simd.md|  14 +++
>  gcc/config/aarch64/aarch64.cc |  74 +++-
>  gcc/testsuite/gcc.target/aarch64/fmov-be.c|  74 
>  gcc/testsuite/gcc.target/aarch64/fmov.c   | 110 ++
>  .../gcc.target/aarch64/vec-set-zero.c |   6 +-
>  5 files changed, 275 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov-be.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fmov.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index e456f693d2f..543126948e7 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1190,6 +1190,20 @@ (define_insn "aarch64_simd_vec_set"
>[(set_attr "type" "neon_ins, neon_from_gp, neon_load1_one_lane")]
>  )
>  
> +(define_insn "aarch64_simd_vec_set_zero_fmov"
> +  [(set (match_operand:VP_2E 0 "register_operand" "=w")
> + (vec_merge:VP_2E
> + (match_operand:VP_2E 1 "aarch64_simd_imm_zero" "Dz")
> + (match_operand:VP_2E 3 "register_operand" "w")
> + (match_operand:SI 2 "immediate_operand" "i")))]
> +  "TARGET_SIMD
> +   && (ENDIAN_LANE_N (, exact_log2 (INTVAL (operands[2]))) == 1)"
> +  {
> +return "fmov\\t%0, %3";
> +  }
> +  [(set_attr "type" "fmov")]
> +)
> +

I think this shows that target-independent code is missing some
canonicalisation of vec_merge.  combine has:

  unsigned n_elts = 0;
  if (GET_CODE (x) == VEC_MERGE
      && CONST_INT_P (XEXP (x, 2))
      && GET_MODE_NUNITS (GET_MODE (x)).is_constant (&n_elts)
      && (swap_commutative_operands_p (XEXP (x, 0), XEXP (x, 1))
          /* Two operands have same precedence, then
             first bit of mask select first operand.  */
          || (!swap_commutative_operands_p (XEXP (x, 1), XEXP (x, 0))
              && !(UINTVAL (XEXP (x, 2)) & 1))))
    {
      rtx temp = XEXP (x, 0);
      unsigned HOST_WIDE_INT sel = UINTVAL (XEXP (x, 2));
      unsigned HOST_WIDE_INT mask = HOST_WIDE_INT_1U;
      if (n_elts == HOST_BITS_PER_WIDE_INT)
        mask = -1;
      else
        mask = (HOST_WIDE_INT_1U << n_elts) - 1;
      SUBST (XEXP (x, 0), XEXP (x, 1));
      SUBST (XEXP (x, 1), temp);
      SUBST (XEXP (x, 2), GEN_INT (~sel & mask));
    }

which AFAICT would prefer to put the immediate second, not first.  I think
we should be doing the same canonicalisation in simplify_ternary_operation,
and possibly elsewhere, so that the .md pattern only needs to match the
canonical form (i.e. register, immediate, mask).

On:

> +   && (ENDIAN_LANE_N (, exact_log2 (INTVAL (operands[2]))) == 1)"

it seems dangerous to pass exact_log2 to ENDIAN_LANE_N when we haven't
checked whether it is a power of 2.  (0b00 or 0b11 ought to get simplified,
but I don't think we can ignore the possibility.)

Rather than restrict the pattern to pairs, could we instead handle
VALL_F16 minus the QI elements, with the 16-bit elements restricted
to TARGET_F16?  E.g. we should be able to handle V4SI using an FMOV
of S registers if only the low element is nonzero.

Part of me thinks that this should just be described as a plain old AND,
but I suppose that doesn't work well for FP modes.  Still, handling ANDs
might be an interesting follow-up :)
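
For the record, here is a minimal sketch of the equivalence being
discussed, using generic vector extensions; the function names are
purely illustrative:

  typedef int v4si __attribute__ ((vector_size (16)));

  /* Keep only the low element of X, written as a permute with a zero
     vector...  */
  v4si
  zero_upper_perm (v4si x)
  {
    return __builtin_shufflevector (x, (v4si) { 0, 0, 0, 0 }, 0, 4, 5, 6);
  }

  /* ...or, for integer elements, as a plain AND with a constant mask.  */
  v4si
  zero_upper_and (v4si x)
  {
    return x & (v4si) { -1, 0, 0, 0 };
  }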

Thanks,
Richard

>  (define_insn "aarch64_simd_vec_set_zero"
>[(set (match_operand:VALL_F16 0 "register_operand" "=w")
>   (vec_merge:VALL_F16
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index a6cc00e74ab..64756920eda 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -25950,6 +25950,8 @@ aarch64_evpc_reencode (struct expand_vec_perm_d *d)
>newd.target = d->target ? gen_lowpart (new_mode, d->target) : NULL;
>newd.op0 = d->op0 ? gen_lowpart 

Re: [PATCH v3 5/8] aarch64: Add masked-load else operands.

2024-10-31 Thread Richard Sandiford
rdapp@gmail.com writes:
> From: Robin Dapp 
>
> This adds zero else operands to masked loads and their intrinsics.
> I needed to adjust more than initially thought because we rely on
> combine for several instructions and a change in a "base" pattern
> needs to propagate to all those.
>
> For the lack of a better idea I used a function call property to specify
> whether a builtin needs an else operand or not.  Somebody with better
> knowledge of the aarch64 target can surely improve that.

This part no longer holds :)

> [...]
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index fe16d93adcd..406ceb13a4c 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> [...]
> @@ -1537,11 +1538,14 @@ public:
>{
>  insn_code icode;
>  if (e.vectors_per_tuple () == 1)
> -  icode = convert_optab_handler (maskload_optab,
> -  e.vector_mode (0), e.gp_mode (0));
> +  {
> + icode = convert_optab_handler (maskload_optab,
> +e.vector_mode (0), e.gp_mode (0));
> + e.args.quick_push (CONST0_RTX (e.vector_mode (0)));

It looks like this should no longer be necessary, since
use_contiguous_load_insn should add it instead.  (Let me know if that's
wrong though.)

> +  }
>  else
>icode = code_for_aarch64 (UNSPEC_LD1_COUNT, e.tuple_mode (0));
> -return e.use_contiguous_load_insn (icode);
> +return e.use_contiguous_load_insn (icode, true);
>}
>  };
>  
> @@ -1551,13 +1555,19 @@ class svld1_extend_impl : public extending_load
>  public:
>using extending_load::extending_load;
>  
> +  unsigned int
> +  call_properties (const function_instance &) const override
> +  {
> +return CP_READ_MEMORY;
> +  }
> +

It looks like this is a left-over from the previous version and
could be removed.

>rtx
>expand (function_expander &e) const override
>{
> -insn_code icode = code_for_aarch64_load (UNSPEC_LD1_SVE, extend_rtx_code 
> (),
> +insn_code icode = code_for_aarch64_load (extend_rtx_code (),
>e.vector_mode (0),
>e.memory_vector_mode ());
> -return e.use_contiguous_load_insn (icode);
> +return e.use_contiguous_load_insn (icode, true);
>}
>  };
>  
> [...]
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index ef14f8cd39d..84c0a0caa50 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -4229,7 +4229,7 @@ function_expander::use_vcond_mask_insn (insn_code icode,
> Extending loads have a further predicate (operand 3) that nominally
> controls the extension.  */
>  rtx
> -function_expander::use_contiguous_load_insn (insn_code icode)
> +function_expander::use_contiguous_load_insn (insn_code icode, bool has_else)

The comment should describe the new parameter.  Maybe add a new paragraph:

   HAS_ELSE is true if the pattern has an additional operand that specifies
   the values of inactive lanes.  This exists to match the general maskload
   interface and is always zero for AArch64.  */

>  {
>machine_mode mem_mode = memory_vector_mode ();
>  
> [...]
> diff --git a/gcc/config/aarch64/aarch64-sve.md 
> b/gcc/config/aarch64/aarch64-sve.md
> index 06bd3e4bb2c..a2e9f52d024 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> [...]
> @@ -1302,11 +1303,14 @@ (define_expand "vec_load_lanes"
>[(set (match_operand:SVE_STRUCT 0 "register_operand")
>   (unspec:SVE_STRUCT
> [(match_dup 2)
> -(match_operand:SVE_STRUCT 1 "memory_operand")]
> +(match_operand:SVE_STRUCT 1 "memory_operand")
> +(match_dup 3)
> +   ]

Formatting nit, sorry, but: local style is to put the closing ]
on the previous line.

OK with those changes.  Thanks a lot for doing this!

Richard

> UNSPEC_LDN))]
>"TARGET_SVE"
>{
>  operands[2] = aarch64_ptrue_reg (mode);
> +operands[3] = CONST0_RTX (mode);
>}
>  )
>  
> [...]


Re: [PATCH] AArch64: Switch off early scheduling

2024-10-31 Thread Richard Sandiford
Wilco Dijkstra  writes:
> The early scheduler takes up ~33% of the total build time, however it doesn't
> provide a meaningful performance gain.  This is partly because modern OoO 
> cores
> need far less scheduling, partly because the scheduler tends to create many
> unnecessary spills by increasing register pressure.  Building applications
> 56% faster is far more useful than ~0.1% improvement on SPEC, so switch off
> early scheduling on AArch64.  Codesize reduces by ~0.2%.
>
> The combine_and_move pass runs if the scheduler is disabled and aggressively
> combines moves.  The movsf/df patterns allow all FP immediates since they
> rely on a split pattern, however splits do not happen this late.  To fix this,
> use a more accurate check that blocks creation of literal loads during
> combine_and_move.  Fix various tests that depend on scheduling by explicitly
> adding -fschedule-insns.
>
> Passes bootstrap & regress, OK for commit?

I'm in favour of this.  Obviously the numbers are what count, but
also from first principles:

- I can't remember the last time a scheduling model was added to the port.

- We've (consciously) never added scheduling types for SVE.

- It doesn't make logical sense to schedule for Neoverse V3 (say)
  as though it were a Cortex A57.

So at this point, it seems better for scheduling to be opt-in rather
than opt-out.  (That is, we can switch to a tune-based default if
anyone does add a new scheduling model in future.)

Let's see what others think.

Please split the md changes out into a separate pre-patch though.

What do you think about disabling late scheduling as well?

Thanks,
Richard

> gcc/ChangeLog:
> * common/config/aarch64/aarch64-common.cc: Switch off fschedule_insns.
> * config/aarch64/aarch64.md (movhf_aarch64): Use 
> aarch64_valid_fp_move.
> (movsf_aarch64): Likewise.
> (movdf_aarch64): Likewise.
> * config/aarch64/aarch64.cc (aarch64_valid_fp_move): New function.
> * config/aarch64/aarch64-protos.h (aarch64_valid_fp_move): Likewise.
>
> gcc/testsuite/ChangeLog:
> * testsuite/gcc.target/aarch64/ldp_aligned.c: Fix test.
> * testsuite/gcc.target/aarch64/ldp_always.c: Likewise.
> * testsuite/gcc.target/aarch64/ldp_stp_10.c: Add -fschedule-insns.
> * testsuite/gcc.target/aarch64/ldp_stp_12.c: Likewise.
> * testsuite/gcc.target/aarch64/ldp_stp_13.c: Remove test.
> * testsuite/gcc.target/aarch64/ldp_stp_21.c: Add -fschedule-insns.
> * testsuite/gcc.target/aarch64/ldp_stp_8.c: Likewise.
> * testsuite/gcc.target/aarch64/ldp_vec_v2sf.c: Likewise.
> * testsuite/gcc.target/aarch64/ldp_vec_v2si.c: Likewise.
> * testsuite/gcc.target/aarch64/test_frame_16.c: Fix test.
> * testsuite/gcc.target/aarch64/sve/vcond_12.c: Add -fschedule-insns.
> * testsuite/gcc.target/aarch64/sve/acle/general/ldff1_3.c: Likewise.
>
> ---
>
> diff --git a/gcc/common/config/aarch64/aarch64-common.cc 
> b/gcc/common/config/aarch64/aarch64-common.cc
> index 
> 2bfc597e333b6018970a9ee6e370a66b6d0960ef..845747e31e821c2f3970fd39ea70f046eddbe920
>  100644
> --- a/gcc/common/config/aarch64/aarch64-common.cc
> +++ b/gcc/common/config/aarch64/aarch64-common.cc
> @@ -54,6 +54,8 @@ static const struct default_options 
> aarch_option_optimization_table[] =
>  { OPT_LEVELS_ALL, OPT_fomit_frame_pointer, NULL, 0 },
>  /* Enable -fsched-pressure by default when optimizing.  */
>  { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
> +/* Disable early scheduling due to high compile-time overheads.  */
> +{ OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
>  /* Enable redundant extension instructions removal at -O2 and higher.  */
>  { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
>  { OPT_LEVELS_2_PLUS, OPT_mearly_ra_, NULL, AARCH64_EARLY_RA_ALL },
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> 250c5b96a21ea1c969a0e77e420525eec90e4de4..b30329d7f85f5b962dca43cf12ca938898425874
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -758,6 +758,7 @@ bool aarch64_advsimd_struct_mode_p (machine_mode mode);
>  opt_machine_mode aarch64_vq_mode (scalar_mode);
>  opt_machine_mode aarch64_full_sve_mode (scalar_mode);
>  bool aarch64_can_const_movi_rtx_p (rtx x, machine_mode mode);
> +bool aarch64_valid_fp_move (rtx, rtx, machine_mode);
>  bool aarch64_const_vec_all_same_int_p (rtx, HOST_WIDE_INT);
>  bool aarch64_const_vec_all_same_in_range_p (rtx, HOST_WIDE_INT,
> HOST_WIDE_INT);
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 2647293f7cf020378dacc37b7bfbccc856573e44..965ec18412a6486e6ac4ff2e4a7d742bf61e5d75
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -11223,6 +11223,36 @@ aarch64_can_const_movi_rtx_p (rtx x, machine_mode 
> mode)
>   

Re: [PATCH] aarch64: Add support for FUJITSU-MONAKA (-mcpu=fujitsu-monaka) CPU

2024-10-31 Thread Richard Sandiford
"Yuta Mukai (Fujitsu)"  writes:
> Hello,
>
> This patch adds initial support for FUJITSU-MONAKA CPU, which we are 
> developing.
> This is the slides for the CPU: 
> https://www.fujitsu.com/downloads/SUPER/topics/isc24/next-arm-based-processor-fujitsu-monaka-and-its-software-ecosystem.pdf
>
> Bootstrapped/regtested on aarch64-unknown-linux-gnu.
>
> We will post a patch for backporting to GCC 14 later.
>
> We would be grateful if someone could push this on our behalf, as we do not 
> have write access.

Thanks for the patch, it looks good.  I just have a couple of minor comments:

> @@ -132,6 +132,7 @@ AARCH64_CORE("octeontx2f95mm", octeontx2f95mm, cortexa57, 
> V8_2A,  (CRYPTO, PROFI
>  
>  /* Fujitsu ('F') cores. */
>  AARCH64_CORE("a64fx", a64fx, a64fx, V8_2A,  (F16, SVE), a64fx, 0x46, 0x001, 
> -1)
> +AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (AES, 
> CRYPTO, F16, F16FML, FP8, LS64, RCPC, RNG, SHA2, SHA3, SM4, SVE2_AES, 
> SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), fujitsu_monaka, 0x46, 0x003, -1)

Usually this file omits listing a feature if it is already implied by the
architecture level.  In this case, I think V9_3A should enable F16FML and
RCPC automatically, and so we could drop those features from the list.

Also, we should be able to rely on transitive dependencies for the
SVE2 crypto extensions.  So I think it should be enough to list:

AARCH64_CORE("fujitsu-monaka", fujitsu_monaka, cortexa57, V9_3A, (F16, FP8, 
LS64, RNG, SVE2_AES, SVE2_BITPERM, SVE2_SHA3, SVE2_SM4), fujitsu_monaka, 0x46, 
0x003, -1)

which should have the same effect.

Could you check whether that works?

> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h 
> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> new file mode 100644
> index 0..8d6f297b8
> --- /dev/null
> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> @@ -0,0 +1,65 @@
> +/* Tuning model description for AArch64 architecture.

It's probably worth changing "AArch64 architecture" to "FUJITSU-MONAKA".

The patch looks good to me otherwise.

Thanks,
Richard


[PATCH 3/3] aarch64: Require SVE2 and/or SME2 for SVE FAMINMAX intrinsics

2024-10-30 Thread Richard Sandiford
After the previous patch, we can now accurately model the ISA
requirements for the SVE FAMINMAX intrinsics.  They can be used
in non-streaming mode if TARGET_SVE2 and in streaming mode if
TARGET_SME2 (with both cases also requiring TARGET_FAMINMAX).
They can be used in streaming-compatible mode if TARGET_SVE2
&& TARGET_SME2.

Also, Kyrill pointed out in the original review of the FAMINMAX
support that it would be more consistent to define the rtl patterns
in aarch64-sve2.md rather than aarch64-sve.md, so the pushed patch
did that.  This patch moves the definitions of the intrinsics to
the sve2 files too, for consistency.
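
As an illustration of the new requirements (the target pragma and the
overloaded intrinsic spelling below are assumptions for the sketch, not
taken from the patch), non-streaming uses now need both sve2 and
faminmax:

  #include <arm_sve.h>

  #pragma GCC target "+sve2+faminmax"

  svfloat32_t
  absmax (svbool_t pg, svfloat32_t a, svfloat32_t b)
  {
    /* FAMAX: lane-wise maximum of the absolute values.  */
    return svamax_x (pg, a, b);
  }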

gcc/
* config/aarch64/aarch64-sve-builtins-base.cc (svmax, svamin): Move
definitions to...
* config/aarch64/aarch64-sve-builtins-sve2.cc: ...here.
* config/aarch64/aarch64-sve-builtins-base.def (svmax, svamin): Move
definitions to...
* config/aarch64/aarch64-sve-builtins-sve2.def: ...here.  Require
SME2 in streaming mode.

gcc/testsuite/
* gcc.target/aarch64/sve/acle/general/amin_1.c: New test.
* gcc.target/aarch64/sve2/acle/asm/amax_f16.c: Enabled sve2 and
(for streaming mode) sme2.
* gcc.target/aarch64/sve2/acle/asm/amax_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/amax_f64.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/amin_f16.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/amin_f32.c: Likewise.
* gcc.target/aarch64/sve2/acle/asm/amin_f64.c: Likewise.
---
 gcc/config/aarch64/aarch64-sve-builtins-base.cc  | 4 
 gcc/config/aarch64/aarch64-sve-builtins-base.def | 5 -
 gcc/config/aarch64/aarch64-sve-builtins-sve2.cc  | 4 
 gcc/config/aarch64/aarch64-sve-builtins-sve2.def | 7 +++
 .../gcc.target/aarch64/sve/acle/general/amin_1.c | 9 +
 .../gcc.target/aarch64/sve2/acle/asm/amax_f16.c  | 5 -
 .../gcc.target/aarch64/sve2/acle/asm/amax_f32.c  | 5 -
 .../gcc.target/aarch64/sve2/acle/asm/amax_f64.c  | 5 -
 .../gcc.target/aarch64/sve2/acle/asm/amin_f16.c  | 5 -
 .../gcc.target/aarch64/sve2/acle/asm/amin_f32.c  | 5 -
 .../gcc.target/aarch64/sve2/acle/asm/amin_f64.c  | 5 -
 11 files changed, 44 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/amin_1.c

diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
index fe16d93adcd..1c9f515a52c 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
@@ -3184,10 +3184,6 @@ FUNCTION (svadrb, svadr_bhwd_impl, (0))
 FUNCTION (svadrd, svadr_bhwd_impl, (3))
 FUNCTION (svadrh, svadr_bhwd_impl, (1))
 FUNCTION (svadrw, svadr_bhwd_impl, (2))
-FUNCTION (svamax, cond_or_uncond_unspec_function,
- (UNSPEC_COND_FAMAX, UNSPEC_FAMAX))
-FUNCTION (svamin, cond_or_uncond_unspec_function,
- (UNSPEC_COND_FAMIN, UNSPEC_FAMIN))
 FUNCTION (svand, rtx_code_function, (AND, AND))
 FUNCTION (svandv, reduction, (UNSPEC_ANDV))
 FUNCTION (svasr, rtx_code_function, (ASHIFTRT, ASHIFTRT))
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.def 
b/gcc/config/aarch64/aarch64-sve-builtins-base.def
index edfe2574507..da2a0e41aa5 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-base.def
+++ b/gcc/config/aarch64/aarch64-sve-builtins-base.def
@@ -368,8 +368,3 @@ DEF_SVE_FUNCTION (svuzp2q, binary, all_data, none)
 DEF_SVE_FUNCTION (svzip1q, binary, all_data, none)
 DEF_SVE_FUNCTION (svzip2q, binary, all_data, none)
 #undef REQUIRED_EXTENSIONS
-
-#define REQUIRED_EXTENSIONS ssve (AARCH64_FL_FAMINMAX)
-DEF_SVE_FUNCTION (svamax, binary_opt_single_n, all_float, mxz)
-DEF_SVE_FUNCTION (svamin, binary_opt_single_n, all_float, mxz)
-#undef REQUIRED_EXTENSIONS
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc 
b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
index d29c2209fdf..64f86035c30 100644
--- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
+++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
@@ -591,6 +591,10 @@ FUNCTION (svaesd, fixed_insn_function, 
(CODE_FOR_aarch64_sve2_aesd))
 FUNCTION (svaese, fixed_insn_function, (CODE_FOR_aarch64_sve2_aese))
 FUNCTION (svaesimc, fixed_insn_function, (CODE_FOR_aarch64_sve2_aesimc))
 FUNCTION (svaesmc, fixed_insn_function, (CODE_FOR_aarch64_sve2_aesmc))
+FUNCTION (svamax, cond_or_uncond_unspec_function,
+ (UNSPEC_COND_FAMAX, UNSPEC_FAMAX))
+FUNCTION (svamin, cond_or_uncond_unspec_function,
+ (UNSPEC_COND_FAMIN, UNSPEC_FAMIN))
 FUNCTION (svbcax, CODE_FOR_MODE0 (aarch64_sve2_bcax),)
 FUNCTION (svbdep, unspec_based_function, (UNSPEC_BDEP, UNSPEC_BDEP, -1))
 FUNCTION (svbext, unspec_based_function, (UNSPEC_BEXT, UNSPEC_BEXT, -1))
diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.def 
b/gcc/config/aarch64/aarch64-sve-builtins-sve2.def
index 345a7621b6f..e4021559f36 100644
--- a

[PATCH 2/3] aarch64: Record separate streaming and non-streaming ISA requirements

2024-10-30 Thread Richard Sandiford
For some upcoming extensions, we need to add intrinsics whose
ISA requirements differ between streaming mode and non-streaming mode.
This patch tries to generalise the infrastructure to support that:

- Rather than have a single set of feature flags, the patch uses a
  separate set for sm_off (non-streaming, PSTATE.SM==0) and sm_on
  (streaming, PSTATE.SM==1).

- The sm_off set is zero if the intrinsic is streaming-only.
  Otherwise it is AARCH64_FL_SM_OFF | .

- Similarly, the sm_on set is zero if the intrinsic is non-streaming-only.
  Otherwise it is AARCH64_FL_SM_ON | .  AARCH64_FL_SME is
  taken as given in streaming mode.

- Streaming-compatible code must satisfy both sets of requirements.

There should be no functional change.
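
Conceptually, the selection of the required flags for a given call site
becomes something like the sketch below.  This is only an illustration
of the scheme; the stand-in typedefs are not the real definitions, which
live in aarch64-protos.h:

  #include <stdbool.h>

  /* Stand-in declarations, for the sketch only.  */
  typedef unsigned long long aarch64_feature_flags;
  struct aarch64_required_extensions
  {
    aarch64_feature_flags sm_off;   /* requirements when PSTATE.SM==0 */
    aarch64_feature_flags sm_on;    /* requirements when PSTATE.SM==1 */
  };

  /* A zero set means "not available in that mode" and is rejected up
     front; otherwise the call must satisfy the returned flags.  */
  static aarch64_feature_flags
  required_flags (struct aarch64_required_extensions reqs,
                  bool streaming, bool streaming_compatible)
  {
    if (streaming_compatible)
      /* Must be valid both with PSTATE.SM==0 and PSTATE.SM==1.  */
      return reqs.sm_off | reqs.sm_on;
    return streaming ? reqs.sm_on : reqs.sm_off;
  }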

gcc/
* config.gcc (aarch64*-*-*): Add aarch64-protos.h to target_gtfiles.
* config/aarch64/aarch64-protos.h
(aarch64_required_extensions): New structure.
(aarch64_check_required_extensions): Change the type of the
required_extensions parameter from aarch64_feature_flags to
aarch64_required_extensions.
* config/aarch64/aarch64-sve-builtins.h
(function_builder::add_unique_function): Likewise.
(function_builder::add_overloaded_function): Likewise.
(function_builder::get_attributes): Likewise.
(function_builder::add_function): Likewise.
(function_group_info): Change the type of required_extensions
in the same way.
* config/aarch64/aarch64-builtins.cc
(aarch64_pragma_builtins_data::required_extensions): Change the type
from aarch64_feature_flags to aarch64_required_extensions.
(aarch64_check_required_extensions): Likewise change the type
of the required_extensions parameter.  Separate the requirements
for non-streaming mode and streaming mode, ORing them together
for streaming-compatible mode.
(aarch64_general_required_extensions): New function.
(aarch64_general_check_builtin_call): Use it.
* config/aarch64/aarch64-sve-builtins.cc
(registered_function::required_extensions): Change the type
from aarch64_feature_flags to aarch64_required_extensions.
(DEF_NEON_SVE_FUNCTION, DEF_SME_ZA_FUNCTION_GS): Update accordingly.
(function_builder::get_attributes): Change the type of the
required_extensions parameter from aarch64_feature_flags to
aarch64_required_extensions.
(function_builder::add_function): Likewise.
(function_builder::add_unique_function): Likewise.
(function_builder::add_overloaded_function): Likewise.
* config/aarch64/aarch64-simd-pragma-builtins.def: Update
REQUIRED_EXTENSIONS definitions to use aarch64_required_extensions.
* config/aarch64/aarch64-sve-builtins-base.def: Likewise.
* config/aarch64/aarch64-sve-builtins-sme.def: Likewise.
* config/aarch64/aarch64-sve-builtins-sve2.def: Likewise.
---
 gcc/config.gcc|   2 +-
 gcc/config/aarch64/aarch64-builtins.cc| 122 ++
 gcc/config/aarch64/aarch64-protos.h   |  87 -
 .../aarch64/aarch64-simd-pragma-builtins.def  |   2 +-
 .../aarch64/aarch64-sve-builtins-base.def |  26 ++--
 .../aarch64/aarch64-sve-builtins-sme.def  |  30 ++---
 .../aarch64/aarch64-sve-builtins-sve2.def |  41 ++
 gcc/config/aarch64/aarch64-sve-builtins.cc|  51 +---
 gcc/config/aarch64/aarch64-sve-builtins.h |  13 +-
 9 files changed, 226 insertions(+), 148 deletions(-)

diff --git a/gcc/config.gcc b/gcc/config.gcc
index e2ed3b309cc..c3531e56c9d 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -352,7 +352,7 @@ aarch64*-*-*)
cxx_target_objs="aarch64-c.o"
d_target_objs="aarch64-d.o"
extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o 
aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o 
aarch64-sve-builtins-sve2.o aarch64-sve-builtins-sme.o 
cortex-a57-fma-steering.o aarch64-speculation.o 
falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o 
aarch64-early-ra.o aarch64-ldp-fusion.o"
-   target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.h 
\$(srcdir)/config/aarch64/aarch64-builtins.cc 
\$(srcdir)/config/aarch64/aarch64-sve-builtins.h 
\$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
+   target_gtfiles="\$(srcdir)/config/aarch64/aarch64-protos.h 
\$(srcdir)/config/aarch64/aarch64-builtins.h 
\$(srcdir)/config/aarch64/aarch64-builtins.cc 
\$(srcdir)/config/aarch64/aarch64-sve-builtins.h 
\$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
target_has_targetm_common=yes
;;
 alpha*-*-*)
diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
b/gcc/config/aarch64/aarch64-builtins.cc
index 480ac223d86..97bde7c15d3 100644
--- a/gcc/config/aarch64/aarch64-builtins.cc
+++ b/gcc/config/aarch64/aarch64-builtins.cc
@@ -1595,7 +1595,8 @@ enum class aarch64_builtin_signatures
 
 #undef EN

[PATCH 1/3] aarch64: Move ENTRY_VHSDF to aarch64-simd-pragma-builtins.def

2024-10-30 Thread Richard Sandiford
It's more convenient for later patches if we only define ENTRY_VHSDF
once, in the .def file.  Then the only macro that needs to be defined
before including the file is ENTRY itself.

The patch also moves the architecture requirements out of the
individual ENTRY invocations into a block-level definition of
REQUIRED_EXTENSIONS.  This reduces cut-&-paste a little and makes
things more consistent with aarch64-sve-builtins*.def.
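
In other words, an includer now only needs to define ENTRY before
including the .def file.  The schematic sketch below shows the shape of
the enum expansion; it is illustrative only, not the exact code (or enum
name) in aarch64-builtins.cc:

  /* Expand each entry into an enumerator.  REQUIRED_EXTENSIONS is
     defined and undefined by the .def file itself, around each group
     of entries, so this expansion can simply ignore it.  */
  #undef ENTRY
  #define ENTRY(N, S, M, U) AARCH64_##N,

  enum aarch64_pragma_builtin
  {
  #include "aarch64-simd-pragma-builtins.def"
  };

  #undef ENTRY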

gcc/
* config/aarch64/aarch64-builtins.cc (ENTRY): Remove the features
argument and get the features from REQUIRED_EXTENSIONS instead.
(ENTRY_VHSDF): Move definition to...
* config/aarch64/aarch64-simd-pragma-builtins.def: ...here.
Move the architecture requirements to REQUIRED_EXTENSIONS.
---
 gcc/config/aarch64/aarch64-builtins.cc| 22 +++
 .../aarch64/aarch64-simd-pragma-builtins.def  | 14 ++--
 2 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
b/gcc/config/aarch64/aarch64-builtins.cc
index 86d96e47f01..480ac223d86 100644
--- a/gcc/config/aarch64/aarch64-builtins.cc
+++ b/gcc/config/aarch64/aarch64-builtins.cc
@@ -780,17 +780,9 @@ typedef struct
   AARCH64_SIMD_BUILTIN_##T##_##N##A,
 
 #undef ENTRY
-#define ENTRY(N, S, M, U, F) \
+#define ENTRY(N, S, M, U) \
   AARCH64_##N,
 
-#undef ENTRY_VHSDF
-#define ENTRY_VHSDF(NAME, SIGNATURE, UNSPEC, EXTENSIONS) \
-  AARCH64_##NAME##_f16, \
-  AARCH64_##NAME##q_f16, \
-  AARCH64_##NAME##_f32, \
-  AARCH64_##NAME##q_f32, \
-  AARCH64_##NAME##q_f64,
-
 enum aarch64_builtins
 {
   AARCH64_BUILTIN_MIN,
@@ -1602,16 +1594,8 @@ enum class aarch64_builtin_signatures
 };
 
 #undef ENTRY
-#define ENTRY(N, S, M, U, F) \
-  {#N, aarch64_builtin_signatures::S, E_##M##mode, U, F},
-
-#undef ENTRY_VHSDF
-#define ENTRY_VHSDF(NAME, SIGNATURE, UNSPEC, EXTENSIONS) \
-  ENTRY (NAME##_f16, SIGNATURE, V4HF, UNSPEC, EXTENSIONS) \
-  ENTRY (NAME##q_f16, SIGNATURE, V8HF, UNSPEC, EXTENSIONS) \
-  ENTRY (NAME##_f32, SIGNATURE, V2SF, UNSPEC, EXTENSIONS) \
-  ENTRY (NAME##q_f32, SIGNATURE, V4SF, UNSPEC, EXTENSIONS) \
-  ENTRY (NAME##q_f64, SIGNATURE, V2DF, UNSPEC, EXTENSIONS)
+#define ENTRY(N, S, M, U) \
+  {#N, aarch64_builtin_signatures::S, E_##M##mode, U, REQUIRED_EXTENSIONS},
 
 /* Initialize pragma builtins.  */
 
diff --git a/gcc/config/aarch64/aarch64-simd-pragma-builtins.def 
b/gcc/config/aarch64/aarch64-simd-pragma-builtins.def
index f432185be46..9d530fc45d4 100644
--- a/gcc/config/aarch64/aarch64-simd-pragma-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-pragma-builtins.def
@@ -18,6 +18,16 @@
along with GCC; see the file COPYING3.  If not see
.  */
 
+#undef ENTRY_VHSDF
+#define ENTRY_VHSDF(NAME, SIGNATURE, UNSPEC) \
+  ENTRY (NAME##_f16, SIGNATURE, V4HF, UNSPEC) \
+  ENTRY (NAME##q_f16, SIGNATURE, V8HF, UNSPEC) \
+  ENTRY (NAME##_f32, SIGNATURE, V2SF, UNSPEC) \
+  ENTRY (NAME##q_f32, SIGNATURE, V4SF, UNSPEC) \
+  ENTRY (NAME##q_f64, SIGNATURE, V2DF, UNSPEC)
+
 // faminmax
-ENTRY_VHSDF (vamax, binary, UNSPEC_FAMAX, AARCH64_FL_FAMINMAX)
-ENTRY_VHSDF (vamin, binary, UNSPEC_FAMIN, AARCH64_FL_FAMINMAX)
+#define REQUIRED_EXTENSIONS AARCH64_FL_FAMINMAX
+ENTRY_VHSDF (vamax, binary, UNSPEC_FAMAX)
+ENTRY_VHSDF (vamin, binary, UNSPEC_FAMIN)
+#undef REQUIRED_EXTENSIONS
-- 
2.25.1



[PATCH 0/3] aarch64: Allow separate SVE and SME feature requirements

2024-10-30 Thread Richard Sandiford
Currently we represent architecture requirements using a single bitmask
of features.  However, some of the new extensions have different
requirements in non-streaming mode compared to streaming mode.
This series adds support for that and applies it to FAMINMAX.

Tested on aarch64-linux-gnu.  Since we have quite a bit of work gated
behind this, I'm planning to commit tomorrow evening (UTC) if there are
no comments before then, but please let me know if you'd like more time
to review.

Richard

Richard Sandiford (3):
  aarch64: Move ENTRY_VHSDF to aarch64-simd-pragma-builtins.def
  aarch64: Record separate streaming and non-streaming ISA requirements
  aarch64: Require SVE2 and/or SME2 for SVE FAMINMAX intrinsics

 gcc/config.gcc|   2 +-
 gcc/config/aarch64/aarch64-builtins.cc| 142 +-
 gcc/config/aarch64/aarch64-protos.h   |  87 ++-
 .../aarch64/aarch64-simd-pragma-builtins.def  |  14 +-
 .../aarch64/aarch64-sve-builtins-base.cc  |   4 -
 .../aarch64/aarch64-sve-builtins-base.def |  29 +---
 .../aarch64/aarch64-sve-builtins-sme.def  |  30 ++--
 .../aarch64/aarch64-sve-builtins-sve2.cc  |   4 +
 .../aarch64/aarch64-sve-builtins-sve2.def |  48 +++---
 gcc/config/aarch64/aarch64-sve-builtins.cc|  51 ---
 gcc/config/aarch64/aarch64-sve-builtins.h |  13 +-
 .../aarch64/sve/acle/general/amin_1.c |   9 ++
 .../aarch64/sve2/acle/asm/amax_f16.c  |   5 +-
 .../aarch64/sve2/acle/asm/amax_f32.c  |   5 +-
 .../aarch64/sve2/acle/asm/amax_f64.c  |   5 +-
 .../aarch64/sve2/acle/asm/amin_f16.c  |   5 +-
 .../aarch64/sve2/acle/asm/amin_f32.c  |   5 +-
 .../aarch64/sve2/acle/asm/amin_f64.c  |   5 +-
 18 files changed, 282 insertions(+), 181 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/acle/general/amin_1.c

-- 
2.25.1



Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling agressiveness

2024-10-30 Thread Richard Sandiford
Vineet Gupta  writes:
> On 10/30/24 10:25, Jeff Law wrote:
>> On 10/30/24 9:31 AM, Richard Sandiford wrote:
>>> That might need some finessing of the name.  But I think the concept
>>> is right.  I'd rather base the hook (or param) on a general concept
>>> like that rather than a specific "wide vs narrow" thing.
>> Agreed.  Naming was my real only concern about the first patch.
>
> We are leaning towards
>   - TARGET_SCHED_PRESSURE_SPILL_AGGRESSIVE
>   - targetm.sched.pressure_spill_aggressive
>
> Targets could wire them up however they like
>
>>>> I still see Vineet's data as compelling, even with GIGO concern.
>>> Do you mean the reduction in dynamic instruction counts?  If so,
>>> that isn't what the algorithm is aiming to reduce.  Like I mentioned
>>> in the previous thread, trying to minimise dynamic instruction counts
>>> was also harmful for the core & benchmarks I was looking at.
>>> We just ended up with lots of pipeline bubbles that could be
>>> alleviated by judicious spilling.
>> Vineet showed significant cycle and icount improvements.  I'm much more 
>> interested in the former :-)
>
> The initial premise indeed was icounts but with recent access to some 
> credible hardware I'm all for perf measurement now.
>
> Please look at patch 2/4 [1] for actual perf data both cycles and 
> instructions.
> I kept 1/4 introducing the hook separate from 2/4 which implements the hook for 
> RISC-V.
>
>     [1] https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665945.html

Ah, sorry, I was indeed going only from the description in 1/4.
I've not had time to look at the rest of the series yet.

> As Jeff mentioned, on an in-order RISC-V core we are seeing 6% cycle 
> improvement from the hook and another 6% cycle improvement from patch 3/4

Sounds good!

> Also Wilco gave this a spin on high end OoO Neoverse and seems to be seeing 
> 20% improvement which I gather is cycles.

Yeah, it's common ground that we should change this for OoO cores.

>>> I'm not saying that the algorithm gets the decision right for cactu
>>> when tuning for in-order CPU X and running on that same CPU X.
>>> But it seems like that combination hasn't been tried, and that,
>>> even on the combinations that the patch has been tried on, the cactu
>>> justification is based on static properties of the binary rather than
>>> a particular runtime improvement (Y% faster).
>
> I'd requested Wilco to possibly try this on some in-order arm cores.

OK.  FWIW, I think the original testing was on Cortex-A9 or Cortex-A15.
It was also heavy on filters, such as yiq.

But is this about making the argument in favour of an unconditional change?
If so, I don't think it's necessary to front-load this testing.  Like I said
in my reply to Jeff, that can happen naturally if all major targets move
to the new behaviour.  And for a hook/param approach, we already have
enough data to justify the patch.

Thanks,
Richard


Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling agressiveness

2024-10-30 Thread Richard Sandiford
Jeff Law  writes:
> On 10/30/24 9:31 AM, Richard Sandiford wrote:
>
>> 
>> OK (and yeah, I can sympathise).  But I think there's an argument that,
>> if you're scheduling for one in-order core using the pipeline of an
>> unrelated core, that's effectively scheduling for the core as though
>> it were out-of-order.  In other words, the property we care about
>> isn't so much whether the processor itself is in-order (a statement
>> about the uarch), but whether we are trying to schedule for a particular
>> in-order pipeline (a statement about what GCC is doing or knows about).
>> I'd argue that in the case you describe, we're not trying to schedule
>> for a particular in-order pipeline.
> I can see that point.
>
>> 
>> That might need some finessing of the name.  But I think the concept
>> is right.  I'd rather base the hook (or param) on a general concept
>> like that rather than a specific "wide vs narrow" thing.
> Agreed.  Naming was my real only concern about the first patch.
>
>> 
>>> I still see Vineet's data as compelling, even with GIGO concern.
>> 
>> Do you mean the reduction in dynamic instruction counts?  If so,
>> that isn't what the algorithm is aiming to reduce.  Like I mentioned
>> in the previous thread, trying to minimise dynamic instruction counts
>> was also harmful for the core & benchmarks I was looking at.
>> We just ended up with lots of pipeline bubbles that could be
>> alleviated by judicious spilling.
> Vineet showed significant cycle and icount improvements.  I'm much more 
> interested in the former :-)
>
> I'm planning to run it on our internal design, but it's not the top of 
> the priority list and it's a scarce resource right now...  I fully 
> expect it'll show a cycle improvement there too, though probably much 
> smaller than the improvement seen on that spacemit k1 design.
>
>> 
>> I'm not saying that the algorithm gets the decision right for cactu
>> when tuning for in-order CPU X and running on that same CPU X.
>> But it seems like that combination hasn't been tried, and that,
>> even on the combinations that the patch has been tried on, the cactu
>> justification is based on static properties of the binary rather than
>> a particular runtime improvement (Y% faster).
>> 
>> To be clear, the two paragraphs above are trying to explain why I think
>> this should be behind a hook or param rather than unconditional.  The
>> changes themselves look fine, and incorporate the suggestions from the
>> previous thread (thanks!).
> Thanks for that clarifying statement.  I actually think we're broadly in 
> agreement here -- keep it as a hook/param rather than making it 
> unconditional.

Yeah, agreed.

> Assuming we keep it as a hook/param, opt-in & come up with better 
> name/docs, any objections from your side?

No, seems fine to me.

I'm kind-of leaning towards a --param.  The hook definition would logically
be determined by -mtune (at least on targets like aarch32 that do have
meaningful in-order scheduling descriptions -- I can imagine that for
aarch64 we'd set it unconditionally).  But that wouldn't capture the
case above, where you're tuning for a different core from the one
that will actually be used.

How about:

--param cycle-accurate-model

but with the description:

  Whether GCC should assume that the scheduling description is mostly
  a cycle-accurate model of the target processor, in the absence of cache
  misses.  Nonzero usually means that the selected scheduling model
  describes an in-order processor, that the scheduling model accurately
  predicts pipeline bubbles in the absence of cache misses, and that GCC
  should assume that the scheduling model matches the target that the code
  is intended to run on.

(with better word-smithing)?

I suppose it should initially default to 1, but we could flip that later
if all major targets set it to 0.  (Or we could take that as proof that
the old approach isn't needed and just remove the --param.)

A param would also be cheaper to test.
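
For concreteness, a minimal sketch of how the pressure-sensitive
scheduler might consult such a knob; param_cycle_accurate_model is an
assumed spelling, not an existing option:

  /* Sketch only.  Nonzero means the DFA description is trusted to
     predict pipeline bubbles for the processor the code will actually
     run on, so extra spills can be worth it to avoid those bubbles.  */
  static bool
  trust_pipeline_model_p (void)
  {
    return param_cycle_accurate_model != 0;
  }

A target default could then be overridden on the command line with
--param=cycle-accurate-model=0 without rebuilding the compiler.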

Thanks,
Richard


[PATCH] aarch64: Forbid F64MM permutes in streaming mode

2024-10-30 Thread Richard Sandiford
The current code was based on an early version of the SME spec,
which allowed the .Q forms of TRN1, TRN2, UZP1, UZP2, ZIP1, and ZIP2
to be used in streaming mode.  We should now forbid them instead;
see 
https://developer.arm.com/documentation/ddi0602/2024-09/SVE-Instructions/TRN1--TRN2--vectors---Interleave-even-or-odd-elements-from-two-vectors-?lang=en
and the corresponding entries for the others.

Tested on aarch64-linux-gnu.  I'm planning to push to trunk and gcc-14
branch tomorrow evening if there are no comments before then.

Richard


gcc/
* config/aarch64/aarch64-sve-builtins-base.def (svtrn1q, svtrn2q)
(svuzp1q, svuzp2q, svzip1q, svzip2q): Require SM_OFF.

gcc/testsuite/
* g++.target/aarch64/sve/aarch64-ssve.exp: Add tests for trn[12]q,
uzp[12].c, and zip[12]q.
* gcc.target/aarch64/sve/acle/asm/trn1q_bf16.c: Skip for
STREAMING_COMPATIBLE.
* gcc.target/aarch64/sve/acle/asm/trn1q_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn1q_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/trn2q_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp1q_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/uzp2q_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_f64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_s16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_s32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_s64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_s8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_u16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_u32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_u64.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip1q_u8.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip2q_bf16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip2q_f16.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip2q_f32.c: Likewise.
* gcc.target/aarch64/sve/acle/asm/zip2q_f64.c: Likewise.
 

Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling agressiveness

2024-10-30 Thread Richard Sandiford
Jeff Law  writes:
> On 10/30/24 8:44 AM, Richard Sandiford wrote:
>
>>> But the data from the BPI (spacemit k1 chip) is an in-order core.
>>> Granted we don't have a good model of its pipeline, but it's definitely
>>> in-order.
>> 
>> Damn :)  (I did try to clarify what was being tested earlier, but the
>> response wasn't clear.)
>> 
>> So how representative is the DFA model being used for the BPI?
>> Is it more "pretty close, but maybe different in a few minor details"?
>> Or is it more "we're just using an existing DFA model for a different
>> core and hoping for the best"?  Is the issue width accurate?
>> 
>> If we're scheduling for an in-order core without an accurate pipeline
>> model then that feels like the first thing to fix.  Otherwise we're
>> in danger of GIGO.
> GIGO is a risk here -- there really isn't good data on the pipeline for 
> that chip, especially on the FP side.  I don't really have a good way to 
> test this on an in-order RISC-V target where there is a reasonable DFA 
> model.

OK (and yeah, I can sympathise).  But I think there's an argument that,
if you're scheduling for one in-order core using the pipeline of an
unrelated core, that's effectively scheduling for the core as though
it were out-of-order.  In other words, the property we care about
isn't so much whether the processor itself is in-order (a statement
about the uarch), but whether we are trying to schedule for a particular
in-order pipeline (a statement about what GCC is doing or knows about).
I'd argue that in the case you describe, we're not trying to schedule
for a particular in-order pipeline.

That might need some finessing of the name.  But I think the concept
is right.  I'd rather base the hook (or param) on a general concept
like that rather than a specific "wide vs narrow" thing.

> I still see Vineet's data as compelling, even with GIGO concern.

Do you mean the reduction in dynamic instruction counts?  If so,
that isn't what the algorithm is aiming to reduce.  Like I mentioned
in the previous thread, trying to minimise dynamic instruction counts
was also harmful for the core & benchmarks I was looking at.
We just ended up with lots of pipeline bubbles that could be
alleviated by judicious spilling.

I'm not saying that the algorithm gets the decision right for cactu
when tuning for in-order CPU X and running on that same CPU X.
But it seems like that combination hasn't been tried, and that,
even on the combinations that the patch has been tried on, the cactu
justification is based on static properties of the binary rather than
a particular runtime improvement (Y% faster).

To be clear, the two paragraphs above are trying to explain why I think
this should be behind a hook or param rather than unconditional.  The
changes themselves look fine, and incorporate the suggestions from the
previous thread (thanks!).

Richard


Re: [PATCH v2 9/9] aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT

2024-10-30 Thread Richard Sandiford
Evgeny Karpov  writes:
> Tuesday, October 29, 2024
> Richard Sandiford  wrote:
>
>> Hmm, I see.  I think this is surprising enough that it would be worth
>> a comment.  How about:
>>
>>  /* Since the assembly directive only specifies a size, and not an
>> alignment, we need to follow the default ASM_OUTPUT_LOCAL behavior
>> and round the size up to at least a multiple of BIGGEST_ALIGNMENT bits,
>> so that each uninitialized object starts on such a boundary.
>> However, we also want to allow the alignment (and thus minimum size)
>> to exceed BIGGEST_ALIGNMENT.  */
>
> Thanks for the suggestion. It will be included in the next version of the 
> patch.
>
>> But how does using a larger size force the linker to assign a larger
>> alignment than BIGGEST_ALIGNMENT?  Is there a second limit in play?
>> 
>> Or does this patch not guarantee that the ffmpeg variable gets the
>> alignment it wants?  Is it just about suppressing the error?
>> 
>> If it's just about suppressing the error without guaranteeing the
>> requested alignment, then, yeah, I think patching ffmpeg would
>> be better.  If the patch does guarantee the alignment, then the
>> patch seems ok, but I think the comment should explain how, and
>> explain why BIGGEST_ALIGNMENT isn't larger.
>
> It looks like it generates the expected assembly code for the alignments
> and the correct object file, and it should be the expected code for FFmpeg.
>
> The alignment cannot be larger than 8192, otherwise, it will generate an 
> error.
>
> error: requested alignment ‘16384’ exceeds object file maximum 8192
>16 | float __attribute__((aligned (1 << 14))) large_aligned_array10[3];

OK, thanks.  But...

> Here an example:
>
> float large_aligned_array[3];
> float __attribute__((aligned (8))) large_aligned_array2[3];
> float __attribute__((aligned (16))) large_aligned_array3[3];
> float __attribute__((aligned (32))) large_aligned_array4[3];
> float __attribute__((aligned (64))) large_aligned_array5[3];
> float __attribute__((aligned (128))) large_aligned_array6[3];
> float __attribute__((aligned (256))) large_aligned_array7[3];
> float __attribute__((aligned (512))) large_aligned_array8[3];
> float __attribute__((aligned (1024))) large_aligned_array9[3];
>
>
>   .align  3
>   .def large_aligned_array; .scl 3; .type 0; .endef
> large_aligned_array:
>   .space  12  // skip

Is this the construct used by the patch?  The original patch was:

+#define ASM_OUTPUT_ALIGNED_LOCAL(FILE, NAME, SIZE, ALIGNMENT)  \
+  { \
+unsigned HOST_WIDE_INT rounded = MAX ((SIZE), 1); \
+unsigned HOST_WIDE_INT alignment = MAX ((ALIGNMENT), BIGGEST_ALIGNMENT); \
+rounded += (alignment / BITS_PER_UNIT) - 1; \
+rounded = (rounded / (alignment / BITS_PER_UNIT) \
+  * (alignment / BITS_PER_UNIT)); \
+ASM_OUTPUT_LOCAL (FILE, NAME, SIZE, rounded); \
+  }

with patch 5 defining ASM_OUTPUT_LOCAL as:

+#define ASM_OUTPUT_LOCAL(FILE, NAME, SIZE, ROUNDED)  \
+( fputs (".lcomm ", (FILE)),   \
+  assemble_name ((FILE), (NAME)),  \
+  fprintf ((FILE), ",%u\n", (int)(ROUNDED)))

So for the change in question, I was expecting, say, a 1024-byte-aligned
float[3] to be defined using:

.lcomm  array, 1024

If we have access to .align, couldn't we define ASM_OUTPUT_ALIGNED_LOCAL
to use that, using the style you quoted above?  Or is the .lcomm approach
needed to work with -fcommon?
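
For illustration, here is a hedged sketch of that alternative; whether
a preceding .align is honoured for this kind of definition, and whether
the .lcomm form is still needed for -fcommon, are exactly the open
questions above:

  /* Sketch only, not the posted patch: emit the object directly with
     .align/.space, in the style of the assembly quoted earlier, instead
     of relying on a rounded-up .lcomm size.  */
  #define ASM_OUTPUT_ALIGNED_LOCAL(FILE, NAME, SIZE, ALIGNMENT) \
    do \
      { \
        switch_to_section (bss_section); \
        fprintf ((FILE), "\t.align\t%d\n", \
                 floor_log2 ((ALIGNMENT) / BITS_PER_UNIT)); \
        ASM_OUTPUT_LABEL ((FILE), (NAME)); \
        fprintf ((FILE), "\t.space\t" HOST_WIDE_INT_PRINT_UNSIGNED "\n", \
                 (unsigned HOST_WIDE_INT) MAX ((SIZE), 1)); \
      } \
    while (0)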

Thanks,
Richard

>   .global large_aligned_array2
>   .align  3
>   .def large_aligned_array2; .scl 3; .type 0; .endef
> large_aligned_array2:
>   .space  12  // skip
>
>   .global large_aligned_array3
>   .align  4
>   .def large_aligned_array3; .scl 3; .type 0; .endef
> large_aligned_array3:
>   .space  12  // skip

>
>   .global large_aligned_array4
>   .align  5
>   .def large_aligned_array4; .scl 3; .type 0; .endef
> large_aligned_array4:
>   .space  12  // skip
>
>   .global large_aligned_array5
>   .align  6
>   .def large_aligned_array5; .scl 3; .type 0; .endef
> large_aligned_array5:
>   .space  12  // skip
>
>   .global large_aligned_array6
>   .align  7
>   .def large_aligned_array6; .scl 3; .type 0; .endef
> large_aligned_array6:
>   .space  12  // skip
>
>   .global large_aligned_array7
>   .align  8
>   .def large_aligned_array7; .scl 3; .type 0; .endef
> large_aligned_array7:
>   .space  12  // skip
>
>   .global large_aligned_a

Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling agressiveness

2024-10-30 Thread Richard Sandiford
Jeff Law  writes:
> On 10/30/24 4:05 AM, Richard Sandiford wrote:
>> Vineet Gupta  writes:
>>> On 10/29/24 11:51, Wilco Dijkstra wrote:
>>>> Hi Vineet,
>>>>> I agree the NARROW/WIDE stuff is obfuscating things in technicalities.
>>>> Is there evidence this change would make things significantly worse for
>>>> some targets?
>>>
>>> Honestly I don't think this needs to be behind any toggle or made optional 
>>> at all. The old algorithm was overly eager in spilling. But per last
>>> discussion with Richard [1] at least back in 2012 for some in-order arm32 
>>> core this was better. And also that's where the wide vs. narrow discussions
>>> came up and that it really mattered, as far as I understood.
>> 
>> Right, that's the key.  The current algorithm was tuned on an in-order
>> core for which GCC already had a relatively accurate pipeline model.
>> The question is whether this is better on a core like that: that is,
>> on an in-order core for which GCC has a relatively accurate pipeline model.
>> No amount of benchmarking on out-of-order cores will answer that.
>> 
>> Somewhat surprisingly, we don't AFAIK have a target hook for "is the
>> current target out-of-order?".  Why not make the target hook that
>> instead?  I think everyone agrees (including me in the previous
>> thread) that the current behaviour isn't right for OoO cores.
>> 
>> If someone has an OoO core that for some reason prefers the current
>> approach (unlikely), we can decide what to do then.  But in the meantime,
>> keying off OoO-ness seems simpler and easier to document.
> But the data from the BPI (spacemit k1 chip) is an in-order core. 
> Granted we don't have a good model of its pipeline, but it's definitely 
> in-order.

Damn :)  (I did try to clarify what was being tested earlier, but the
response wasn't clear.)

So how representative is the DFA model being used for the BPI?
Is it more "pretty close, but maybe different in a few minor details"?
Or is it more "we're just using an existing DFA model for a different
core and hoping for the best"?  Is the issue width accurate?

If we're scheduling for an in-order core without an accurate pipeline
model then that feels like the first thing to fix.  Otherwise we're
in danger of GIGO.

Thanks,
Richard




Re: [PATCH v4] [aarch64] Fix function multiversioning dispatcher link error with LTO

2024-10-30 Thread Richard Sandiford
Yangyu Chen  writes:
> We forgot to apply DECL_EXTERNAL to __init_cpu_features_resolver decl. When
> building with LTO, the linker cannot find the
> __init_cpu_features_resolver.lto_priv* symbol, causing the link error.
>
> This patch gets this fixed by adding DECL_EXTERNAL to the decl. To avoid used
> but never defined warning for this symbol, we also mark TREE_PUBLIC to the 
> decl.
> We should also mark the decl having hidden visibility. And fix the attribute 
> in
> the same way for __aarch64_cpu_features identifier.
>
> Minimal steps to reproduce the bug:
>
> echo '__attribute__((target_clones("default", "aes"))) void func1() { }' > 1.c
> echo '__attribute__((target_clones("default", "aes"))) void func2() { }' > 2.c
> echo 'void func1();void func2();int main(){func1();func2();return 0;}' > main.c
> gcc -flto -c 1.c 2.c
> gcc -flto main.c 1.o 2.o
>
> Fixes: 0cfde688e213 ("[aarch64] Add function multiversioning support")
> Signed-off-by: Yangyu Chen 
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (dispatch_function_versions): Adding
>   DECL_EXTERNAL, TREE_PUBLIC and hidden DECL_VISIBILITY to
>   __init_cpu_features_resolver and __aarch64_cpu_features.

Thanks, pushed to trunk.  I'll push to GCC 14 branch tomorrow when
testing & pushing another patch.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5770491b30c..2b2d5b9e390 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -20437,6 +20437,10 @@ dispatch_function_versions (tree dispatch_decl,
>tree init_fn_id = get_identifier ("__init_cpu_features_resolver");
>tree init_fn_decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL,
> init_fn_id, init_fn_type);
> +  DECL_EXTERNAL (init_fn_decl) = 1;
> +  TREE_PUBLIC (init_fn_decl) = 1;
> +  DECL_VISIBILITY (init_fn_decl) = VISIBILITY_HIDDEN;
> +  DECL_VISIBILITY_SPECIFIED (init_fn_decl) = 1;
>tree arg1 = DECL_ARGUMENTS (dispatch_decl);
>tree arg2 = TREE_CHAIN (arg1);
>ifunc_cpu_init_stmt = gimple_build_call (init_fn_decl, 2, arg1, arg2);
> @@ -20456,6 +20460,9 @@ dispatch_function_versions (tree dispatch_decl,
>   get_identifier ("__aarch64_cpu_features"),
>   global_type);
>DECL_EXTERNAL (global_var) = 1;
> +  TREE_PUBLIC (global_var) = 1;
> +  DECL_VISIBILITY (global_var) = VISIBILITY_HIDDEN;
> +  DECL_VISIBILITY_SPECIFIED (global_var) = 1;
>tree mask_var = create_tmp_var (long_long_unsigned_type_node);
>  
>tree component_expr = build3 (COMPONENT_REF, long_long_unsigned_type_node,


Re: [PATCH v3] [aarch64] Fix function multiversioning dispatcher link error with LTO

2024-10-30 Thread Richard Sandiford
Yangyu Chen  writes:
> We forgot to apply DECL_EXTERNAL to __init_cpu_features_resolver decl. When
> building with LTO, the linker cannot find the
> __init_cpu_features_resolver.lto_priv* symbol, causing the link error.
>
> This patch gets this fixed by adding DECL_EXTERNAL to the decl. To avoid used
> but never defined warning for this symbol, we also mark TREE_PUBLIC to the 
> decl.
> We should also mark the decl having hidden visibility. And fix the attribute 
> in
> the same way for __aarch64_cpu_features identifier.
>
> Minimal steps to reproduce the bug:
>
> echo '__attribute__((target_clones("default", "aes"))) void func1() { }' > 1.c
> echo '__attribute__((target_clones("default", "aes"))) void func2() { }' > 2.c
> echo 'void func1();void func2();int main(){func1();func2();return 0;}' > main.c
> gcc -flto -c 1.c 2.c
> gcc -flto main.c 1.o 2.o
>
> Fixes: 0cfde688e213 ("[aarch64] Add function multiversioning support")
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (dispatch_function_versions): Adding
>   DECL_EXTERNAL, TREE_PUBLIC and hidden DECL_VISIBILITY to
>   __init_cpu_features_resolver and __aarch64_cpu_features.

Thanks, LGTM.  I've tested this locally and was about to push, but then
realised: since you've already contributed changes (great!), it probably
wouldn't be acceptable to treat it as trivial for copyright purposes.
Could you confirm that you're contributing under the DCO:
https://gcc.gnu.org/dco.html ?  If so, could you repost with a
Signed-off-by?

Sorry for the administrivia.

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5770491b30c..2b2d5b9e390 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -20437,6 +20437,10 @@ dispatch_function_versions (tree dispatch_decl,
>tree init_fn_id = get_identifier ("__init_cpu_features_resolver");
>tree init_fn_decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL,
> init_fn_id, init_fn_type);
> +  DECL_EXTERNAL (init_fn_decl) = 1;
> +  TREE_PUBLIC (init_fn_decl) = 1;
> +  DECL_VISIBILITY (init_fn_decl) = VISIBILITY_HIDDEN;
> +  DECL_VISIBILITY_SPECIFIED (init_fn_decl) = 1;
>tree arg1 = DECL_ARGUMENTS (dispatch_decl);
>tree arg2 = TREE_CHAIN (arg1);
>ifunc_cpu_init_stmt = gimple_build_call (init_fn_decl, 2, arg1, arg2);
> @@ -20456,6 +20460,9 @@ dispatch_function_versions (tree dispatch_decl,
>   get_identifier ("__aarch64_cpu_features"),
>   global_type);
>DECL_EXTERNAL (global_var) = 1;
> +  TREE_PUBLIC (global_var) = 1;
> +  DECL_VISIBILITY (global_var) = VISIBILITY_HIDDEN;
> +  DECL_VISIBILITY_SPECIFIED (global_var) = 1;
>tree mask_var = create_tmp_var (long_long_unsigned_type_node);
>  
>tree component_expr = build3 (COMPONENT_REF, long_long_unsigned_type_node,


Re: [PATCH 1/4] sched1: hookize pressure scheduling spilling agressiveness

2024-10-30 Thread Richard Sandiford
Vineet Gupta  writes:
> On 10/29/24 11:51, Wilco Dijkstra wrote:
>> Hi Vineet,
>>> I agree the NARROW/WIDE stuff is obfuscating things in technicalities.
>> Is there evidence this change would make things significantly worse for
>> some targets? 
>
> Honestly I don't think this needs to be behind any toggle or made optional at 
> all. The old algorithm was overly eager in spilling. But per last
> discussion with Richard [1] at least back in 2012 for some in-order arm32 
> core this was better. And also that's where the wide vs. narrow discussions
> came up and that it really mattered, as far as I understood.

Right, that's the key.  The current algorithm was tuned on an in-order
core for which GCC already had a relatively accurate pipeline model.
The question is whether this is better on a core like that: that is,
on an in-order core for which GCC has a relatively accurate pipeline model.
No amount of benchmarking on out-of-order cores will answer that.

Somewhat surprisingly, we don't AFAIK have a target hook for "is the
current target out-of-order?".  Why not make the target hook that
instead?  I think everyone agrees (including me in the previous
thread) that the current behaviour isn't right for OoO cores.

If someone has an OoO core that for some reason prefers the current
approach (unlikely), we can decide what to do then.  But in the meantime,
keying off OoO-ness seems simpler and easier to document.
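
If we went that way, the declaration might look something like the
sketch below (purely illustrative: no such hook exists today and the
name is made up):

  /* Sketch of a hypothetical target.def entry.  */
  DEFHOOK
  (sched_is_out_of_order_p,
   "Return true if GCC should treat the current scheduling target as\n\
  out-of-order, i.e. it is not scheduling for one particular in-order\n\
  pipeline that the DFA description models accurately.",
   bool, (void),
   hook_bool_void_false)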

Thanks,
Richard


Re: [PATCH v2 9/9] aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT

2024-10-29 Thread Richard Sandiford
Evgeny Karpov  writes:
>> Wednesday, October 23, 2024
>> Richard Sandiford  wrote:
>> 
>>> Or, even if that does work, it isn't clear to me why patching
>>> ASM_OUTPUT_ALIGNED_LOCAL is a complete solution to the problem.
>>
>> This patch reproduces the same code as it was done without declaring 
>> ASM_OUTPUT_ALIGNED_LOCAL.
>> ASM_OUTPUT_ALIGNED_LOCAL is needed to get the alignment value and handle it 
>> when it is bigger than BIGGEST_ALIGNMENT.
>> In all other cases, the code is the same.
>> 
>> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/varasm.cc;h=c2540055421641caed08113d92dbeff7ffc09f49;hb=refs/heads/master#l2137
>> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/varasm.cc;h=c2540055421641caed08113d92dbeff7ffc09f49;hb=refs/heads/master#l2233
>
> Does this information provide more clarity on ASM_OUTPUT_ALIGNED_LOCAL usage?
> If not, this patch will be dropped as a low priority, and FFmpeg, which 
> requires this change, will be patched 
> to avoid using alignment higher than 16 bytes on AArch64.

Hmm, I see.  I think this is surprising enough that it would be worth
a comment.  How about:

  /* Since the assembly directive only specifies a size, and not an
 alignment, we need to follow the default ASM_OUTPUT_LOCAL behavior
 and round the size up to at least a multiple of BIGGEST_ALIGNMENT bits,
 so that each uninitialized object starts on such a boundary.
 However, we also want to allow the alignment (and thus minimum size)
 to exceed BIGGEST_ALIGNMENT.  */

But how does using a larger size force the linker to assign a larger
alignment than BIGGEST_ALIGNMENT?  Is there a second limit in play?

Or does this patch not guarantee that the ffmpeg variable gets the
alignment it wants?  Is it just about suppressing the error?

If it's just about suppressing the error without guaranteeing the
requested alignment, then, yeah, I think patching ffmpeg would
be better.  If the patch does guarantee the alignment, then the
patch seems ok, but I think the comment should explain how, and
explain why BIGGEST_ALIGNMENT isn't larger.

Thanks,
Richard


Re: [PATCH] aarch64: Use canonicalize_comparison in ccmp expansion [PR117346]

2024-10-29 Thread Richard Sandiford
Andrew Pinski  writes:
> While testing the patch for PR 85605 on aarch64, it was noticed that
> imm_choice_comparison.c test failed. This was because canonicalize_comparison
> was not being called in the ccmp case. This can be noticed without the patch
> for PR 85605 as evidence of the new testcase.
>
> Bootstrapped and tested on aarch64-linux-gnu.
>
>   PR target/117346
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_gen_ccmp_first): Call
>   canonicalize_comparison before figuring out the cmp_mode/cc_mode.
>   (aarch64_gen_ccmp_next): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/imm_choice_comparison-1.c: New test.

OK, thanks.

Richard

> Signed-off-by: Andrew Pinski 
> ---
>  gcc/config/aarch64/aarch64.cc |  6 +++
>  .../aarch64/imm_choice_comparison-1.c | 42 +++
>  2 files changed, 48 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index a6cc00e74ab..cbb7ef13315 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -27353,6 +27353,9 @@ aarch64_gen_ccmp_first (rtx_insn **prep_seq, rtx_insn **gen_seq,
>if (op_mode == VOIDmode)
>  op_mode = GET_MODE (op1);
>  
> +  if (CONST_SCALAR_INT_P (op1))
> +canonicalize_comparison (op_mode, &code, &op1);
> +
>switch (op_mode)
>  {
>  case E_QImode:
> @@ -27429,6 +27432,9 @@ aarch64_gen_ccmp_next (rtx_insn **prep_seq, rtx_insn **gen_seq, rtx prev,
>if (op_mode == VOIDmode)
>  op_mode = GET_MODE (op1);
>  
> +  if (CONST_SCALAR_INT_P (op1))
> +canonicalize_comparison (op_mode, &cmp_code, &op1);
> +
>switch (op_mode)
>  {
>  case E_QImode:
> diff --git a/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
> new file mode 100644
> index 000..2afebe1a349
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/imm_choice_comparison-1.c
> @@ -0,0 +1,42 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +/* PR target/117346 */
> +/* Make sure going through ccmp uses similar to non ccmp-case. */
> +/* This is similar to imm_choice_comparison.c's check except to force
> +   the use of ccmp by reording the comparison and putting the cast before. */
> +
> +/*
> +** check:
> +**   ...
> +**   mov w[0-9]+, -16777217
> +**   ...
> +*/
> +
> +int
> +check (int x, int y)
> +{
> +  unsigned xu = x;
> +  if (xu > 0xfefe && x > y)
> +return 100;
> +
> +  return x;
> +}
> +
> +/*
> +** check1:
> +**   ...
> +**   mov w[0-9]+, -16777217
> +**   ...
> +*/
> +
> +int
> +check1 (int x, int y)
> +{
> +  unsigned xu = x;
> +  if (x > y && xu > 0xfefe)
> +return 100;
> +
> +  return x;
> +}


Re: [PATCH][AARCH64][PR115258]Fix excess moves

2024-10-29 Thread Richard Sandiford
Kugan Vivekanandarajah  writes:
> Hi,
>
> Fix for PR115258 causes a performance regression in some of the TSVC kernels 
> by adding additional mov instructions.
> This patch fixes this. 
>
> i.e., When operands are equal, it is likely that all of them get the same 
> register similar to:
> (insn 19 15 20 3 (set (reg:V2x16QI 62 v30 [117])
> (unspec:V2x16QI [
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> ] UNSPEC_CONCAT)) "tsvc.c":11:12 4871 {aarch64_combinev16qi}
>  (nil))
>
> In this case, aarch64_split_combinev16qi would split it with one insn. Hence, 
> when the operands are equal, split after reload.
>
> Bootstrapped and regression tested on aarch64-linux-gnu.  Is this ok for trunk?

Thanks for the patch.  I'm not sure this is the right fix though.
I'm planning to have a look at the PR once stage 1 closes.

Richard

>
> Thanks,
> Kugan
>
> From ace50a5eb5d459901325ff17ada83791cef0a354 Mon Sep 17 00:00:00 2001
> From: Kugan 
> Date: Wed, 23 Oct 2024 05:03:02 +0530
> Subject: [PATCH] [PATCH][AARCH64][PR115258]Fix excess moves
>
> When operands are equal, it is likely that all of them get the same register
> similar to:
> (insn 19 15 20 3 (set (reg:V2x16QI 62 v30 [117])
> (unspec:V2x16QI [
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> (reg:V16QI 62 v30 [orig:102 vect__1.7 ] [102])
> ] UNSPEC_CONCAT)) "tsvc.c":11:12 4871 {aarch64_combinev16qi}
>  (nil))
>
> In this case, aarch64_split_combinev16qi would split it with one insn. Hence,
> when the operands are equal, prefer splitting after reload.
>
>   PR target/115258
>
> gcc/ChangeLog:
>
>   PR target/115258
>   * config/aarch64/aarch64-simd.md (aarch64_combinev16qi): Restrict
>   the split before reload if operands are equal.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/pr115258-2.c: New test.
>
> Signed-off-by: Kugan Vivekanandarajah 
> ---
>  gcc/config/aarch64/aarch64-simd.md|  2 +-
>  gcc/testsuite/gcc.target/aarch64/pr115258-2.c | 18 ++
>  2 files changed, 19 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr115258-2.c
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 2a44aa3fcc3..e56100b3766 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -8525,7 +8525,7 @@
>   UNSPEC_CONCAT))]
>"TARGET_SIMD"
>"#"
> -  "&& 1"
> +  "&& reload_completed || !rtx_equal_p (operands[1], operands[2])"
>[(const_int 0)]
>  {
>aarch64_split_combinev16qi (operands);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258-2.c b/gcc/testsuite/gcc.target/aarch64/pr115258-2.c
> new file mode 100644
> index 000..f28190cef32
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr115258-2.c
> @@ -0,0 +1,18 @@
> +
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mcpu=neoverse-v2" } */
> +
> +extern __attribute__((aligned(64))) float a[32000], b[32000];
> +int dummy(float[32000], float[32000], float);
> +
> +void s1112() {
> +
> +  for (int nl = 0; nl < 10 * 3; nl++) {
> +for (int i = 32000 - 1; i >= 0; i--) {
> +  a[i] = b[i] + (float)1.;
> +}
> +dummy(a, b, 0.);
> +  }
> +}
> +
> /* { dg-final { scan-assembler-times "mov\\tv\[0-9\]+\.16b, v\[0-9\]+\.16b" 2 } } */


Re: [PATCH v2 5/8] aarch64: Add masked-load else operands.

2024-10-29 Thread Richard Sandiford
"Robin Dapp"  writes:
>>> For the lack of a better idea I used a function call property to specify
>>> whether a builtin needs an else operand or not.  Somebody with better
>>> knowledge of the aarch64 target can surely improve that.
>>
>> Yeah, those flags are really for source-level/gimple-level attributes.
>> Would it work to pass a new parameter to use_contiguous_load instead?
>
> I tried this first (before adding the call property) and immediate fallout
> from it was the direct expansion of sve intrinsics failing.  I didn't touch
> those.  Should we amend them with a zero else value or is there another
> way?

Could you give an example of what you mean?  In the patch, it seemed
like whether a class's call_properties returned CP_HAS_ELSE or not was
a static property of the class.  So rather than doing:

  unsigned int
  call_properties (const function_instance &) const override
  {
return ... | CP_HAS_ELSE;
  }

...
/* Add the else operand.  */
e.args.quick_push (CONST0_RTX (e.vector_mode (1)));
return e.use_contiguous_load_insn (icode);

I thought we could instead make the interface:

rtx
function_expander::use_contiguous_load_insn (insn_code icode, bool has_else)

with has_else being declared default-false.  Then use_contiguous_load_insn
could use:

  if (has_else)
add_input_operand (icode, const0_rtx);

(add_input_operand should take care of broadcasting the zero to the
right vector mode.)

The caller would then just be:

return e.use_contiguous_load_insn (icode, true);

without any changes to e.args.

Is that what you tried?

Thanks,
Richard


Re: [PATCH v2] [aarch64] Fix function multiversioning dispatcher link error with LTO

2024-10-29 Thread Richard Sandiford
Yangyu Chen  writes:
> We forgot to apply DECL_EXTERNAL to __init_cpu_features_resolver decl. When
> building with LTO, the linker cannot find the
> __init_cpu_features_resolver.lto_priv* symbol, causing the link error.
>
> This patch gets this fixed by adding DECL_EXTERNAL to the decl. To avoid used 
> but
> never defined warning for this symbol, we also mark TREE_PUBLIC to the decl.
>
> Minimal steps to reproduce the bug:
>
> echo '__attribute__((target_clones("default", "aes"))) void func1() { }' > 1.c
> echo '__attribute__((target_clones("default", "aes"))) void func2() { }' > 2.c
> echo 'void func1();void func2();int main(){func1();func2();return 0;}' > main.c
> gcc -flto -c 1.c 2.c
> gcc -flto main.c 1.o 2.o
>
> Fixes: 0cfde688e213 ("[aarch64] Add function multiversioning support")
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (dispatch_function_versions): Adding
>   DECL_EXTERNAL and TREE_PUBLIC to __init_cpu_features_resolver decl.

Thanks for doing this.  I suppose at the same time, we should also
mark __aarch64_cpu_features as TREE_PUBLIC.  We could also mark both
of them as having hidden visibility, via:

  DECL_VISIBILITY (...) = VISIBILITY_HIDDEN;
  DECL_VISIBILITY_SPECIFIED (...) = 1;

Richard

> ---
>  gcc/config/aarch64/aarch64.cc | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 5770491b30c..37123befeaf 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -20437,6 +20437,8 @@ dispatch_function_versions (tree dispatch_decl,
>tree init_fn_id = get_identifier ("__init_cpu_features_resolver");
>tree init_fn_decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL,
> init_fn_id, init_fn_type);
> +  DECL_EXTERNAL (init_fn_decl) = 1;
> +  TREE_PUBLIC (init_fn_decl) = 1;
>tree arg1 = DECL_ARGUMENTS (dispatch_decl);
>tree arg2 = TREE_CHAIN (arg1);
>ifunc_cpu_init_stmt = gimple_build_call (init_fn_decl, 2, arg1, arg2);


Re: [PATCH 2/6] aarch64: Use canonical RTL representation for SVE2 XAR and extend it to fixed-width modes

2024-10-28 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Hi all,
>
> The MD pattern for the XAR instruction in SVE2 is currently expressed with
> non-canonical RTL by using a ROTATERT code with a constant rotate amount.
> Fix it by using the left ROTATE code.  This necessitates adjusting the rotate
> amount during expand. 
>
> Additionally, as the SVE2 XAR instruction is unpredicated and can handle all
> element sizes from .b to .d, it is a good fit for implementing the XOR+ROTATE
> operation for Advanced SIMD modes where the TARGET_SHA3 cannot be used
> (that can only handle V2DImode operands).  Therefore let's extend the accepted
> modes of the SVE2 patternt to include the Advanced SIMD integer modes.
>
> This causes some tests for the svxar* intrinsics to fail because they now
> simplify to a plain EOR when the rotate amount is the width of the element.
> This simplification is desirable (EOR instructions have better or equal
> throughput than XAR, and they are non-destructive of their input) so the
> tests are adjusted.
>
> For V2DImode XAR operations we should prefer the Advanced SIMD version when
> it is available (TARGET_SHA3) because it is non-destructive, so restrict the
> SVE2 pattern accordingly.  Tests are added to confirm this.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for mainline?
> Thanks,
> Kyrill
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>   * config/aarch64/iterators.md (SVE_ASIMD_FULL_I): New mode iterator.
>   * config/aarch64/aarch64-sve2.md (@aarch64_sve2_xar):
>   Use SVE_ASIMD_FULL_I modes.  Use ROTATE code for the rotate step.
>   Adjust output logic.
>   * config/aarch64/aarch64-sve-builtins-sve2.cc (svxar_impl): Define.
>   (svxar): Use the above.
>
> gcc/testsuite/
>
>   * gcc.target/aarch64/xar_neon_modes.c: New test.
>   * gcc.target/aarch64/xar_v2di_nonsve.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s16.c: Scan for EOR rather than
>   XAR.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s8.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u16.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u8.c: Likewise.

Looks great to me.  Just one very minor nit:

> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> index ddd6e466ee3..62c17281ec7 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> @@ -90,6 +90,23 @@ public:
>}
>  };
>  
> +class svxar_impl : public function_base
> +{
> +public:
> +  rtx
> +  expand (function_expander &e) const override
> +  {
> +/* aarch64_sve2_xar represents this operation with a left-rotate RTX.
> +   Convert the right-rotate amount from the intrinsic to fit this.  */
> +machine_mode mode = e.vector_mode (0);
> +HOST_WIDE_INT rot = GET_MODE_UNIT_BITSIZE (mode)
> + - INTVAL (e.args[2]);
> +e.args[2]
> +  = aarch64_simd_gen_const_vector_dup (mode, rot);

The split line seems unnecessary.

OK with that change as far as I'm concerned.

Thanks,
Richard

> +return e.use_exact_insn (code_for_aarch64_sve2_xar (mode));
> +  }
> +};
> +
>  class svcdot_impl : public function_base
>  {
>  public:
> @@ -773,6 +790,6 @@ FUNCTION (svwhilege, while_comparison, (UNSPEC_WHILEGE, 
> UNSPEC_WHILEHS))
>  FUNCTION (svwhilegt, while_comparison, (UNSPEC_WHILEGT, UNSPEC_WHILEHI))
>  FUNCTION (svwhilerw, svwhilerw_svwhilewr_impl, (UNSPEC_WHILERW))
>  FUNCTION (svwhilewr, svwhilerw_svwhilewr_impl, (UNSPEC_WHILEWR))
> -FUNCTION (svxar, CODE_FOR_MODE0 (aarch64_sve2_xar),)
> +FUNCTION (svxar, svxar_impl,)
>  
>  } /* end namespace aarch64_sve */
> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 5f2697c3179..8047f405a17 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -1266,18 +1266,28 @@
>  ;; - XAR
>  ;; -
>  
> +;; Also allow the Advanced SIMD modes as the SVE2 XAR instruction
> +;; can handle more element sizes than the TARGET_SHA3 one from Advanced SIMD.
> +;; Don't allow the V2DImode use here unless !TARGET_SHA3 as the Advanced SIMD
> +;; version should be preferred when available as it is non-destructive on its
> +;; input.
>  (define_insn "@aarch64_sve2_xar"
> -  [(set (match_operand:SVE_FULL_I 0 "register_operand")
> - (rotatert:SVE_FULL_I
> -   (xor:SVE_FULL_I
> - (match_operand:SVE_FULL_I 1 "register_operand")
> - (match_operand:SVE_FULL_I 2 "register_operand"))
> -   (match_operand:SVE_FULL_I 3 "aarch64_simd_rshift_imm")))]
> -  "TARGET_SVE2"
> -  {@ [ cons: =0 , 1  , 2 ; at

Re: [PATCH] AArch64: Add more accurate constraint [PR117292]

2024-10-25 Thread Richard Sandiford
Wilco Dijkstra  writes:
> As shown in the PR, reload may only check the constraint in some cases and
> and not check the predicate is still valid for the resulting instruction.

Yeah, that's by design.  constraints have to accept a subset of the
predicates.

> To fix the issue, add a new constraint which matches the predicate exactly.
>
> Passes regress & bootstrap, OK for commit?
>
> gcc/ChangeLog:
> PR target/117292
> * config/aarch64/aarch64-simd.md (xor3): Use 'De' 
> constraint.
> * config/aarch64/constraints.md (De): Add new constraint.
>
> gcc/testsuite/ChangeLog:
> PR target/117292
> * testsuite/gcc.target/aarch64/sve/single_5.c: Remove xfails.
> * testsuite/gcc.target/aarch64/pr117292.c: New test.

OK, thanks.

Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index eabfd5f324fce3bd5b8f676ab9c13827b00baa30..c61195f55cca3bbe5b3b34d9c4364fdc7830e6ea 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1151,7 +1151,7 @@ (define_insn "xor3"
>"TARGET_SIMD"
>{@ [ cons: =0 , 1 , 2  ]
>   [ w, w , w  ] eor\t%0., %1., %2.
> - [ w, 0 , Do ] << aarch64_output_simd_xor_imm (operands[2], 
> );
> + [ w, 0 , De ] << aarch64_output_simd_xor_imm (operands[2], 
> );
>}
>[(set_attr "type" "neon_logic")]
>  )
> diff --git a/gcc/config/aarch64/constraints.md b/gcc/config/aarch64/constraints.md
> index 3f9fd92a1911e0b18a163dc6c4c2c97c871458e0..647941c3c5a37d8411f931cf00440c0240c91c0a 100644
> --- a/gcc/config/aarch64/constraints.md
> +++ b/gcc/config/aarch64/constraints.md
> @@ -472,6 +472,12 @@ (define_constraint "Db"
>   (and (match_code "const_vector")
>(match_test "aarch64_simd_valid_and_imm (op)")))
>
> +(define_constraint "De"
> +  "@internal
> +   A constraint that matches vector of immediates for xor."
> + (and (match_code "const_vector")
> +  (match_test "aarch64_simd_valid_xor_imm (op)")))
> +
>  (define_constraint "Dn"
>"@internal
>   A constraint that matches vector of immediates."
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr117292.c 
> b/gcc/testsuite/gcc.target/aarch64/pr117292.c
> new file mode 100644
> index 
> ..86816266c681a099c46d2246469edc600a1352e3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr117292.c
> @@ -0,0 +1,41 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Os" } */
> +
> +#pragma GCC target "+nosve"
> +
> +typedef char v8u8;
> +typedef __attribute__((__vector_size__ (2))) char v16u8;
> +typedef __attribute__((__vector_size__ (4))) char v32u8;
> +typedef __attribute__((__vector_size__ (8))) char v64u8;
> +typedef short v128u8;
> +typedef __attribute__((__vector_size__ (32))) char v256u8;
> +typedef __attribute__((__vector_size__ (64))) char v512u8;
> +v16u8 foo0_v16s16_0;
> +__attribute__((__vector_size__ (16))) int foo0_v512u32_0;
> +v8u8 foo0_ret;
> +
> +static __attribute__((__noinline__)) __attribute__((__noclone__)) void
> +foo0 (signed char s8_0, v512u8 v512u16_0, v512u8 v512s16_0,
> +v512u8 v512s32_0, v512u8 v512u64_0, v512u8 v512s64_0,
> +v512u8 v512u128_0, v512u8 v512s128_0)
> +{
> +  char v8s8_0;
> +  v8s8_0 ^= s8_0;
> +  foo0_v512u32_0 /= foo0_v512u32_0 ^ s8_0;
> +  v512u8 v512u8_r = v512u16_0 + v512s16_0 + v512s32_0 +
> +v512u64_0 + v512s64_0 + v512u128_0 + v512s128_0;
> +  v16u8 v16u8_r = ((union { v64u8 a; v16u8 b;})
> + ((union { v128u8 a; v64u8 b;})
> + ((union { v256u8 a; v128u8 b;})
> + ((union { v512u8 a; v256u8 b;})  v512u8_r).b).b).b).b +
> +foo0_v16s16_0;
> +  v8u8 v8u8_r = ((union { v16u8 a; v8u8 b;}) v16u8_r).b + v8s8_0;
> +  foo0_ret = v8u8_r;
> +}
> +
> +void
> +main ()
> +{
> +  foo0 (3, (v512u8){}, (v512u8){}, (v512u8){}, (v512u8){},
> +(v512u8){}, (v512u8){}, (v512u8){});
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/single_5.c b/gcc/testsuite/gcc.target/aarch64/sve/single_5.c
> index 233118bbb383bbdf2d342d057ead024f92804221..ac81194733f07eff5d2e9d2309c29e815a52deea 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/single_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/single_5.c
> @@ -11,8 +11,8 @@
>  /* { dg-final { scan-assembler-times {\tmovi\tv[0-9]+\.8h, 0x4\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tmovi\tv[0-9]+\.4s, 0x5\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tmovi\tv[0-9]+\.4s, 0x6\n} 1 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #7\n} 1 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #8\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #7\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #8\n} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfmov\tv[0-9]+\.8h, 1\.0e\+0\n} 1 } } */
>  /* { dg-final

Re: [PATCH] Replace 8 consecutive spaces in leading whitespace by tab

2024-10-25 Thread Richard Sandiford
Arsen Arsenović  writes:
> Hi!
>
> Jakub Jelinek  writes:
>
>> Hi!
>>
>> On top of the previously posted leading whitespace patch, this change
>> just replaces 8 consecutive spaces in leading whitespace by tab.
>> The patch is too large (1MB xz -9e compressed), so I'm not even trying to
>> split it up into 4+ pieces to fit under the mailing list limits.
>> But the change was done purely by a script,
>> for i in `find gcc -name \*.h -o -name \*.cc -o -name \*.c | grep -v 
>> testsuite/
>> | grep -v gofrontend/`; do grep -l '^[ ]* ' $i; done > /tmp/2
>> grep -L 'do not edit' `cat /tmp/2` > /tmp/3
>> for i in `find include lib{iberty,gcc,cpp,stdc++-v3} -name \*.h -o -name 
>> \*.cc
>> -o -name \*.c | grep -v testsuite/ | grep -v gofrontend/`; do grep -l '^[ ]* 
>> '
>> $i; done >> /tmp/3
>> for j in `seq 32`; do for i in `cat /tmp/3`; do sed -i -e 's/^\(\t*\)
>> /\1\t/g' $i; done; done
>> diff -upb yields nothing.
>>
>> Ok for trunk if this passes bootstrap/regtest?
>
> Maybe we should go the other way around?

FWIW, I strongly agree.  Tab indentation is an anachronism that I would
love to drop.

Richard

> Compressing eight spaces into
> a tab leads to strange artifacts in diffs (where lines appear
> misindented because some were aligned by tabs and some by spaces), and
> nowadays editor authors seem to have forgotten tabs are eight spaces and
> instead default to (or, worse, hard-code) four, obviously making the
> codebase quite unreadable.  We also don't get the benefit of being able
> to adjust tabstop locally to our preferences when we use two-column
> indentation, so I don't see an advantage to keeping 'indent-tabs-mode
> (or equivalent in other editors) enabled.
>
> The only two possible advantages I see currently are:
>
> 1. Emacs behaves this way OOTB; this can be addressed via .dir-locals.el
> 2. Tabs take up less disk space; I do not think this is a real issue
>nowadays
>
> WDYT?  Am I missing something?
>
> TIA, have a lovely day.


Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

2024-10-25 Thread Richard Sandiford
Kyrylo Tkachov  writes:
>> On 25 Oct 2024, at 13:46, Richard Sandiford  
>> wrote:
>> 
>> Kyrylo Tkachov  writes:
>>> Thank you for the suggestions! I’m trying them out now.
>>> 
>>>>> +  if (rotamnt % BITS_PER_UNIT != 0)
>>>>> +return NULL_RTX;
>>>>> +  machine_mode qimode;
>>>>> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
>>>>> +return NULL_RTX;
>>>>> +
>>>>> +  vec_perm_builder builder;
>>>>> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>>>> 
>>>> simpler as GET_MODE_UNIT_SIZE
>>>> 
>>>>> +  unsigned total_units;
>>>>> +  /* TODO: Handle VLA vector rotates?  */
>>>>> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>>>>> +return NULL_RTX;
>>>> 
>>>> Yeah.  I think we can do that by changing:
>>>> 
>>>>> +  builder.new_vector (total_units, 1, total_units);
>>>> 
>>>> to:
>>>> 
>>>> builder.new_vector (total_units, 3, units);
>>> 
>>> I think units here is the size in units of the fixed-width component of the 
>>> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI?
>> 
>> Ah, no, sorry, I meant "nunits" rather than "units", with "nunits"
>> being the same as for your code.  So for V4SI and VNx4SI we'd push
>> 12 elements total, as 4 (nunits) "patterns" of 3 elements each.
>> The first argument (total_units) is just GET_MODE_SIZE (mode)
>> in all its poly_int glory.
>
> Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how 
> can 12 indices be enough to describe the permute?
> With this scheme we do end up pushing 12 elements, in the order: 
> 2,3,0,1,6,7,4,5,10,11,8,9 .
> In the final RTX emitted in the instruction stream this seems to end up as:
> (const_vector:V16QI [
> (const_int 2 [0x2])
> (const_int 3 [0x3])
> (const_int 0 [0])
> (const_int 1 [0x1])
> (const_int 6 [0x6])
> (const_int 7 [0x7])
> (const_int 4 [0x4])
> (const_int 5 [0x5])
> (const_int 10 [0xa])
> (const_int 11 [0xb])
> (const_int 8 [0x8])
> (const_int 9 [0x9]) repeated x2
> (const_int 14 [0xe])
> (const_int 7 [0x7])
> (const_int 0 [0])
> ])
>
> So the first 12 elements are indeed correct, but the last 4 elements are not.

Gah, sorry, I got the arguments the wrong way around.  It should be:

   builder.new_vector (GET_MODE_SIZE (mode), nunits, 3);

(4 patterns, 3 elements per pattern)
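
Putting the pieces together, the selector construction might end up
looking roughly like the sketch below, assuming nunits is
GET_MODE_UNIT_SIZE (mode) and rotamnt is a multiple of BITS_PER_UNIT,
as in your patch:

  /* Build a byte permute selector that rotates each element of MODE
     left by ROTAMNT bits, encoded as nunits patterns of 3 elements
     each so that it also extends to variable-length modes.  */
  vec_perm_builder builder;
  builder.new_vector (GET_MODE_SIZE (mode), nunits, 3);
  unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
  unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
  for (unsigned j = 0; j < 3 * nunits; j += nunits)
    for (unsigned i = 0; i < nunits; i++)
      builder.quick_push ((rot_to_perm + i) % nunits + j);

The indices would then feed vec_perm_indices and the QImode permute
expansion as in the rest of your function.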

Thanks,
Richard


Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

2024-10-25 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Thank you for the suggestions! I’m trying them out now.
>
>>> +  if (rotamnt % BITS_PER_UNIT != 0)
>>> +return NULL_RTX;
>>> +  machine_mode qimode;
>>> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
>>> +return NULL_RTX;
>>> +
>>> +  vec_perm_builder builder;
>>> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>> 
>> simpler as GET_MODE_UNIT_SIZE
>> 
>>> +  unsigned total_units;
>>> +  /* TODO: Handle VLA vector rotates?  */
>>> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>>> +return NULL_RTX;
>> 
>> Yeah.  I think we can do that by changing:
>> 
>>> +  builder.new_vector (total_units, 1, total_units);
>> 
>> to:
>> 
>>  builder.new_vector (total_units, 3, units);
>
> I think units here is the size in units of the fixed-width component of the 
> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VNx4HI?

Ah, no, sorry, I meant "nunits" rather than "units", with "nunits"
being the same as for your code.  So for V4SI and VNx4SI we'd push
12 elements total, as 4 (nunits) "patterns" of 3 elements each.
The first argument (total_units) is just GET_MODE_SIZE (mode)
in all its poly_int glory.

For V4HI and VNx4HI it would be (..., 3, 2), so 6 elements total.

> What is the recommended API for getting that number out of the poly_uint64 
> mode size?. Is it just accessing coeffs[0]?
>
>> 
>> unconditionally and making the outer loop below iterate exactly
>> three times (i.e. to nunits * 3).  It's ok if we generate more
>> indices than needed.
>> 
>>> +  int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
>>> +  for (unsigned j = 0; j < total_units; j += nunits)
>>> +for (unsigned i = 0; i < nunits; i++)
>>> +  {
>>> + unsigned idx = (rot_to_perm + i) % nunits + j;
>>> + if (BYTES_BIG_ENDIAN)
>>> +   idx = total_units - idx - 1;
>> 
>> I think the endian adjustment should be local to the inner loop/vector
>> element only.  Since this would mean undoing the "nunits - " adjustment
>> above, how about something like:
>> 
>>  unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
>>  unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
>>  ...
>>  builder.quick_push ((rot_to_perm + i) % nunits + j);
>> 
>> or whatever variation you prefer.
>> 
>> Hope I've got that right...
>
>
> Hmm, I’m getting some test failures and wrong indices when I try this.
> I think I can work out the indices and the loops for them but I’d like to work
> through an example. So say we are rotating a V4SImode vector by 16 (a REV32 
> instruction).
> The indices pushed into the byte permute vector with my original patch are:
> [2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d]
> What sequence do we want to push for V4SImode now that we have 3 patterns in 
> vector_builder?
> Is it repeating the above 3 times or is it interleaving each SImode entry 
> somehow?

I think the calculation should be the same as in your original code,
adding "i" to each element.  It's just that we limit the outer loop
to 3 iterations instead of total_units / nunits.

So the end result should be that the pushed elements are the same
as in your original patch, just longer or shorter, depending on
whether we're pushing more elements (e.g. V2DI) or fewer (e.g. V4SI).

In general, the VLA constant scheme works by creating the leading
elements as normal and then describing how to extend that sequence
using the (npatterns, nelts_per_pattern) pair.  The first
npatterns * nelts_per_pattern elements are always explicitly pushed,
but are automatically extended or truncated as necessary when creating
fixed-length tree and rtl constants.
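
So for the rotate case the shape would be something like this
(just a sketch of what I mean, not tested):

  vec_perm_builder builder;
  unsigned int nunits = GET_MODE_UNIT_SIZE (mode);
  unsigned int rot_bytes = rotamnt / BITS_PER_UNIT;
  unsigned int rot_to_perm
    = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
  /* nunits patterns, 3 elements per pattern; the encoding is extended
     or truncated to the real number of vector bytes when the constant
     is materialised.  */
  builder.new_vector (GET_MODE_SIZE (mode), nunits, 3);
  for (unsigned int j = 0; j < 3 * nunits; j += nunits)
    for (unsigned int i = 0; i < nunits; i++)
      builder.quick_push ((rot_to_perm + i) % nunits + j);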

Thanks,
Richard


[pushed] testsuite: Generalise tree-ssa/shifts-3.c regexp

2024-10-25 Thread Richard Sandiford
My recent gcc.dg/tree-ssa/shifts-3.c test failed on arm-linux-gnueabihf
because it used widen_mult_expr to do a multiplication on chars.
This patch generalises the regexp in the same way as for f3.

Tested on arm-linux-gnueabihf and aarch64-linux-gnu, pushed as obvious.

Richard


gcc/testsuite/
* gcc.dg/tree-ssa/shifts-3.c: Accept widen_mult for f2 too.
---
 gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c 
b/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
index dcff518e630..2b1cf703b4a 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/shifts-3.c
@@ -58,7 +58,7 @@ f6 (unsigned int x)
 /* { dg-final { scan-tree-dump-not {<[a-z]*_div_expr,} "optimized" } } */
 /* { dg-final { scan-tree-dump-not {

Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

2024-10-25 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Hi Richard,
>
>> On 23 Oct 2024, at 11:30, Richard Sandiford  
>> wrote:
>>
>> Kyrylo Tkachov  writes:
>>> Hi all,
>>>
>>> Some vector rotate operations can be implemented in a single instruction
>>> rather than using the fallback SHL+USRA sequence.
>>> In particular, when the rotate amount is half the bitwidth of the element
>>> we can use a REV64,REV32,REV16 instruction.
>>> This patch adds this transformation in the recently added splitter for 
>>> vector
>>> rotates.
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Signed-off-by: Kyrylo Tkachov 
>>>
>>> gcc/
>>>
>>> * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>>> Declare prototype.
>>> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>>> * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
>>> Call the above.
>>>
>>> gcc/testsuite/
>>>
>>> * gcc.target/aarch64/simd/pr117048_2.c: New test.
>>
>> Sorry to be awkward, but I still think at least part of this should be
>> target-independent.  Any rotate by a byte amount can be expressed as a
>> vector permutation in a target-independent way.  Target-independent code
>> can then use the usual optab routines to query whether the permutation
>> is possible and/or try to generate it.
>
> Thank you for elaborating. I had already prototyped the permute 
> index-computing code in my tree
> but was reluctant to using it during expand as I wanted the rotate RTX to be 
> available for combining
> into XAR so I felt a bit stuck. Having the code in a generic place but called 
> from the backend at a
> time of its choosing makes sense to me.
>
>>
>> I can see that it probably makes sense to leave target code to make
>> the decision about when to use the permutation strategy vs. other
>> approaches.  But the code to implement that strategy shouldn't need
>> to be target-specific.
>>
>> E.g. we could have a routine:
>>
>>  expand_rotate_as_vec_perm
>>
>> which checks whether the rotation amount is suitable and tries to
>> generate the permutation if so.
>
> I’ve implemented something like that in the attached patch.
> It seems to work on AArch64 but as mentioned in the commit message I’d like a 
> check on
> the big-endian logic, and perhaps some pointers on how/whether it should be 
> extended to
> VLA vectors.

Great!  Thanks for doing this.  Some comments on those aspects below,
but otherwise it LGTM.

>
> I’m updating the other patches in the series according to your feedback so 
> I’ll repost them once I’m done,
> just wanted to get this out for further iteration in the meantime.
> Thanks,
> Kyrill
>
>
>
>
>
>
> From 6c1794a574b5525b3b495ed505621a8af029e825 Mon Sep 17 00:00:00 2001
> From: Kyrylo Tkachov 
> Date: Wed, 16 Oct 2024 04:10:08 -0700
> Subject: [PATCH] aarch64: Optimize vector rotates as vector permutes where
>  possible
>
> Some vector rotate operations can be implemented in a single instruction
> rather than using the fallback SHL+USRA sequence.
> In particular, when the rotate amount is half the bitwidth of the element
> we can use a REV64,REV32,REV16 instruction.
> More generally, rotates by a byte amount can be implemented using vector
> permutes.
> This patch adds such a generic routine in expmed.cc called
> expand_rotate_as_vec_perm that calculates the required permute indices
> and uses the expand_vec_perm_const interface.
>
> On aarch64 this ends up generating the single-instruction sequences above
> where possible and can use LDR+TBL sequences too, which are a good choice.
>
> For now, I have restricted the expand_rotate_as_vec_perm to fixed-width modes
> as I don't have much experience with using it for VLA modes, but I imagine
> it's extendable there.  In any case, the only use of expand_rotate_as_vec_perm
> is in aarch64-specific code that for now only handles fixed-width modes.
>
> A runtime aarch64 test is added to ensure the permute indices are not messed
> up.
> I'd appreciate a review of the BYTES_BIG_ENDIAN logic.  I've adjusted the
> permute vector indices in RTL for it and in the final AArch64 assembly the
> final vector loaded from .LC is identical for little and big-endian, which
> I *think* is the correct behaviour.  For rotates by the half-width that should
> generate single REV64, REV32 instructions aarch64 does not seem to recognise
> them and falls back to an LDR+TBL for big-endian.  I'm not sure if that's
> simply missing logic i

Re: [PATCH 18/22] aarch64: libitm: Add GCS support

2024-10-25 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> Transaction begin and abort use setjmp/longjmp like operations that
> need to be updated for GCS compatibility. We use similar logic to
> libc setjmp/longjmp that support switching stack and thus switching
> GCS (e.g. due to longjmp out of a makecontext stack), this is kept
> even though it is likely not required for transaction aborts.
>
> The gtm_jmpbuf is internal to libitm so we can change its layout
> without breaking ABI.
>
> libitm/ChangeLog:
>
>   * config/aarch64/sjlj.S: Add GCS support and mark GCS compatible.
>   * config/aarch64/target.h: Add gcs field to gtm_jmpbuf.
> ---
>  libitm/config/aarch64/sjlj.S   | 60 --
>  libitm/config/aarch64/target.h |  1 +
>  2 files changed, 58 insertions(+), 3 deletions(-)
>
> diff --git a/libitm/config/aarch64/sjlj.S b/libitm/config/aarch64/sjlj.S
> index aeffd4d1070..cf1d8af2c96 100644
> --- a/libitm/config/aarch64/sjlj.S
> +++ b/libitm/config/aarch64/sjlj.S
> @@ -29,6 +29,13 @@
>  #define AUTIASP      hint 29
>  #define PACIBSP      hint 27
>  #define AUTIBSP      hint 31
> +#define CHKFEAT_X16  hint 40
> +#define MRS_GCSPR(x) mrs  x, s3_3_c2_c5_1
> +#define GCSPOPM(x)   sysl x, #3, c7, c7, #1
> +#define GCSSS1(x)    sys  #3, c7, c7, #2, x
> +#define GCSSS2(x)    sysl x, #3, c7, c7, #3
> +
> +#define L(name) .L##name
>  
>  #if defined(HAVE_AS_CFI_PSEUDO_OP) && defined(__GCC_HAVE_DWARF2_CFI_ASM)
>  # define cfi_negate_ra_state .cfi_negate_ra_state
> @@ -80,7 +87,16 @@ _ITM_beginTransaction:
>   stp d10, d11, [sp, 7*16]
>   stp d12, d13, [sp, 8*16]
>   stp d14, d15, [sp, 9*16]
> - str x1, [sp, 10*16]
> +
> + /* GCS support.  */
> + mov x2, 0
> + mov x16, 1
> + CHKFEAT_X16
> + tbnz x16, 0, L(gcs_done_sj)
> + MRS_GCSPR (x2)
> + add x2, x2, 8 /* GCS after _ITM_beginTransaction returns.  */
> +L(gcs_done_sj):
> + stp x2, x1, [sp, 10*16]
>  
>   /* Invoke GTM_begin_transaction with the struct we just built.  */
>   mov x1, sp
> @@ -117,7 +133,38 @@ GTM_longjmp:
>   ldp d10, d11, [x1, 7*16]
>   ldp d12, d13, [x1, 8*16]
>   ldp d14, d15, [x1, 9*16]
> +
> + /* GCS support.  */
> + mov x16, 1
> + CHKFEAT_X16
> + tbnz x16, 0, L(gcs_done_lj)
> + MRS_GCSPR (x7)
>   ldr x3, [x1, 10*16]
> + mov x4, x3
> + /* x7: GCSPR now.  x3, x4: target GCSPR.  x5, x6: tmp regs.  */
> +L(gcs_scan):
> + cmp x7, x4
> + b.eqL(gcs_pop)
> + sub x4, x4, 8
> + /* Check for a cap token.  */
> + ldr x5, [x4]
> + and x6, x4, 0xf000
> + orr x6, x6, 1
> + cmp x5, x6
> + b.neL(gcs_scan)
> +L(gcs_switch):
> + add x7, x4, 8
> + GCSSS1 (x4)
> + GCSSS2 (xzr)

Don't we still need to pop from the current stack up to the switch point,
in case something further up the call frame wants to switch back to it?

If so, don't we also need to handle multiple switches, and similarly
pop from intermediate stacks?

E.g. if we have stacks S1-S3 and functions f1-f4, and a call stack:

  f1 initially uses S1, switches to S2, calls f2
  f2 initially uses S2, switches to S3, calls f3
  f3 initially uses S3, switches to S1, calls f4
  f4 initially uses S1, triggers a longjmp to f2

then wouldn't the longjmp need to unwind S1 to the switch point;
unwind S3 through f3's entry to the switch point; and then unwind
S2 in the way that the routine currently does?  Or is that kind of
situation not supported?

Thanks,
Richard

> +L(gcs_pop):
> + cmp x7, x3
> + b.eq L(gcs_done_lj)
> + GCSPOPM (xzr)
> + add x7, x7, 8
> + b   L(gcs_pop)
> +L(gcs_done_lj):
> +
> + ldr x3, [x1, 10*16 + 8]
>   ldp x29, x30, [x1]
>   cfi_def_cfa(x1, 0)
>   CFI_PAC_TOGGLE
> @@ -132,6 +179,7 @@ GTM_longjmp:
>  #define FEATURE_1_AND 0xc000
>  #define FEATURE_1_BTI 1
>  #define FEATURE_1_PAC 2
> +#define FEATURE_1_GCS 4
>  
>  /* Supported features based on the code generation options.  */
>  #if defined(__ARM_FEATURE_BTI_DEFAULT)
> @@ -146,6 +194,12 @@ GTM_longjmp:
>  # define PAC_FLAG 0
>  #endif
>  
> +#if __ARM_FEATURE_GCS_DEFAULT
> +# define GCS_FLAG FEATURE_1_GCS
> +#else
> +# define GCS_FLAG 0
> +#endif
> +
>  /* Add a NT_GNU_PROPERTY_TYPE_0 note.  */
>  #define GNU_PROPERTY(type, value)\
>.section .note.gnu.property, "a";  \
> @@ -163,7 +217,7 @@ GTM_longjmp:
>  .section .note.GNU-stack, "", %progbits
>  
>  /* Add GNU property note if built with branch protection.  */
> -# if (BTI_FLAG|PAC_FLAG) != 0
> -GNU_PROPERTY (FEATURE_1_AND, BTI_FLAG|PAC_FLAG)
> +# if (BTI_FLAG|PAC_FLAG|GCS_FLAG) != 0
> +GNU_PROPERTY (FEATURE_1_AND, BTI_FLAG|PAC_FLAG|GCS_FLAG)
>  # endif
>  #endif
> diff --git a/libitm/config/aarch64/target.h b/libitm/config/aarch64/target.h
> index 3d99197bfab..a1f39b4bf7a 100644
> --- a/libitm/config/aarc

Re: [PATCH 07/22] aarch64: Add GCS builtins

2024-10-25 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> Add new builtins for GCS:
>
>   void *__builtin_aarch64_gcspr (void)
>   uint64_t __builtin_aarch64_gcspopm (void)
>   void *__builtin_aarch64_gcsss (void *)
>
> The builtins are always enabled, but should be used behind runtime
> checks in case the target does not support GCS. They are thin
> wrappers around the corresponding instructions.
>
> The GCS pointer is modelled with void * type (normal stores do not
> work on GCS memory, but it is writable via the gcsss operation or
> via GCSSTR if enabled so not const) and an entry on the GCS is
> modelled with uint64_t (since it has fixed size and can be a token
> that's not a pointer).
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (enum aarch64_builtins): Add
>   AARCH64_BUILTIN_GCSPR, AARCH64_BUILTIN_GCSPOPM, AARCH64_BUILTIN_GCSSS.
>   (aarch64_init_gcs_builtins): New.
>   (aarch64_general_init_builtins): Call aarch64_init_gcs_builtins.
>   (aarch64_expand_gcs_builtin): New.
>   (aarch64_general_expand_builtin): Call aarch64_expand_gcs_builtin.
> ---
>  gcc/config/aarch64/aarch64-builtins.cc | 70 ++
>  1 file changed, 70 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 765f2091504..a42a2b9e67f 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -877,6 +877,9 @@ enum aarch64_builtins
>AARCH64_PLIX,
>/* Armv8.9-A / Armv9.4-A builtins.  */
>AARCH64_BUILTIN_CHKFEAT,
> +  AARCH64_BUILTIN_GCSPR,
> +  AARCH64_BUILTIN_GCSPOPM,
> +  AARCH64_BUILTIN_GCSSS,
>AARCH64_BUILTIN_MAX
>  };
>  
> @@ -2241,6 +2244,29 @@ aarch64_init_fpsr_fpcr_builtins (void)
>  AARCH64_BUILTIN_SET_FPSR64);
>  }
>  
> +/* Add builtins for Guarded Control Stack instructions.  */
> +
> +static void
> +aarch64_init_gcs_builtins (void)
> +{
> +  tree ftype;
> +
> +  ftype = build_function_type_list (ptr_type_node, NULL);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_GCSPR]
> += aarch64_general_add_builtin ("__builtin_aarch64_gcspr", ftype,
> +AARCH64_BUILTIN_GCSPR);
> +
> +  ftype = build_function_type_list (uint64_type_node, NULL);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_GCSPOPM]
> += aarch64_general_add_builtin ("__builtin_aarch64_gcspopm", ftype,
> +AARCH64_BUILTIN_GCSPOPM);
> +
> +  ftype = build_function_type_list (ptr_type_node, ptr_type_node, NULL);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_GCSSS]
> += aarch64_general_add_builtin ("__builtin_aarch64_gcsss", ftype,
> +AARCH64_BUILTIN_GCSSS);
> +}
> +
>  /* Initialize all builtins in the AARCH64_BUILTIN_GENERAL group.  */
>  
>  void
> @@ -2288,6 +2314,8 @@ aarch64_general_init_builtins (void)
>  = aarch64_general_add_builtin ("__builtin_aarch64_chkfeat", 
> ftype_chkfeat,
>  AARCH64_BUILTIN_CHKFEAT);
>  
> +  aarch64_init_gcs_builtins ();
> +
>if (in_lto_p)
>  handle_arm_acle_h ();
>  }
> @@ -3367,6 +3395,43 @@ aarch64_expand_fpsr_fpcr_getter (enum insn_code icode, 
> machine_mode mode,
>return op.value;
>  }
>  
> +/* Expand GCS builtin EXP with code FCODE, putting the result
> +   int TARGET.  If IGNORE is true the return value is ignored.  */

into

This would need updating for the comment on patch 6, but otherwise
it looks good.

Thanks,
Richard

> +
> +rtx
> +aarch64_expand_gcs_builtin (tree exp, rtx target, int fcode, int ignore)
> +{
> +  if (fcode == AARCH64_BUILTIN_GCSPR)
> +{
> +  expand_operand op;
> +  create_output_operand (&op, target, DImode);
> +  expand_insn (CODE_FOR_aarch64_load_gcspr, 1, &op);
> +  return op.value;
> +}
> +  if (fcode == AARCH64_BUILTIN_GCSPOPM && ignore)
> +{
> +  expand_insn (CODE_FOR_aarch64_gcspopm_xzr, 0, 0);
> +  return target;
> +}
> +  if (fcode == AARCH64_BUILTIN_GCSPOPM)
> +{
> +  expand_operand op;
> +  create_output_operand (&op, target, Pmode);
> +  expand_insn (CODE_FOR_aarch64_gcspopm, 1, &op);
> +  return op.value;
> +}
> +  if (fcode == AARCH64_BUILTIN_GCSSS)
> +{
> +  expand_operand ops[2];
> +  rtx op1 = expand_normal (CALL_EXPR_ARG (exp, 0));
> +  create_output_operand (&ops[0], target, Pmode);
> +  create_input_operand (&ops[1], op1, Pmode);
> +  expand_insn (CODE_FOR_aarch64_gcsss, 2, ops);
> +  return ops[0].value;
> +}
> +  gcc_unreachable ();
> +}
> +
>  /* Expand an expression EXP that calls built-in function FCODE,
> with result going to TARGET if that's convenient.  IGNORE is true
> if the result of the builtin is ignored.  */
> @@ -3502,6 +3567,11 @@ aarch64_general_expand_builtin (unsigned int fcode, 
> tree exp, rtx target,
>   emit_move_insn (target, x16_reg);
>   return target;
>}
> +
> +case AARCH64_BUILTIN_GCSP

Re: [PATCH 14/22] aarch64: Add GCS support to the unwinder

2024-10-25 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> Follows the current linux ABI that uses single signal entry token
> and shared shadow stack between thread and alt stack.
> Could be behind __ARM_FEATURE_GCS_DEFAULT ifdef (only do anything
> special with gcs compat codegen) but there is a runtime check anyway.
>
> Change affected tests to be compatible with -mbranch-protection=standard
>
> gcc/testsuite/ChangeLog:
>
>   * g++.target/aarch64/pr94515-1.C (f1_no_pac_ret): Update.
>   (main): Update.
>   Co-authored-by: Matthieu Longo 
>
>   * gcc.target/aarch64/pr104689.c (unwind): Update.
>   Co-authored-by: Matthieu Longo 

Too many Co-authored-bys :)  The one below is indented correctly.

> libgcc/ChangeLog:
>
>   * config/aarch64/aarch64-unwind.h (_Unwind_Frames_Extra): Update.
>   (_Unwind_Frames_Increment): Define.
>
> Co-authored-by: Matthieu Longo 
> ---
>  gcc/testsuite/g++.target/aarch64/pr94515-1.C |  6 +-
>  gcc/testsuite/gcc.target/aarch64/pr104689.c  |  3 +-
>  libgcc/config/aarch64/aarch64-unwind.h   | 59 +++-
>  3 files changed, 63 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/testsuite/g++.target/aarch64/pr94515-1.C 
> b/gcc/testsuite/g++.target/aarch64/pr94515-1.C
> index 359039e1753..8175ea50c32 100644
> --- a/gcc/testsuite/g++.target/aarch64/pr94515-1.C
> +++ b/gcc/testsuite/g++.target/aarch64/pr94515-1.C
> @@ -5,7 +5,7 @@
>  
>  volatile int zero = 0;
>  
> -__attribute__((noinline, target("branch-protection=none")))
> +__attribute__((noinline, target("branch-protection=bti")))
>  void unwind (void)
>  {
>if (zero == 0)
> @@ -22,7 +22,7 @@ int test (int z)
>  // autiasp -> cfi_negate_ra_state: RA_signing_SP -> RA_no_signing
>  return 1;
>} else {
> -// 2nd cfi_negate_ra_state because the CFI directives are processed 
> linearily.
> +// 2nd cfi_negate_ra_state because the CFI directives are processed 
> linearly.
>  // At this point, the unwinder would believe that the address is not 
> signed
>  // due to the previous return. That's why the compiler has to emit second
>  // cfi_negate_ra_state to mean that the return address is still signed.
> @@ -33,7 +33,7 @@ int test (int z)
>}
>  }
>  
> -__attribute__((target("branch-protection=none")))
> +__attribute__((target("branch-protection=bti")))
>  int main ()
>  {
>try {
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr104689.c 
> b/gcc/testsuite/gcc.target/aarch64/pr104689.c
> index 3b7adbdfe7d..9688ecc85f9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/pr104689.c
> +++ b/gcc/testsuite/gcc.target/aarch64/pr104689.c
> @@ -98,6 +98,7 @@ asm(""
>  "unusual_no_pac_ret:\n"
>  ".cfi_startproc\n"
>  "" SET_RA_STATE_0 "\n"
> +"bti c\n"
>  "stp x29, x30, [sp, -16]!\n"
>  ".cfi_def_cfa_offset 16\n"
>  ".cfi_offset 29, -16\n"
> @@ -121,7 +122,7 @@ static void f2_pac_ret (void)
>die ();
>  }
>  
> -__attribute__((target("branch-protection=none")))
> +__attribute__((target("branch-protection=bti")))
>  static void f1_no_pac_ret (void)
>  {
>unusual_pac_ret (f2_pac_ret);

Could you explain these testsuite changes in more detail?  It seems
on the face of it that they're changing the tests to test something
other than the original intention.

Having new tests alongside the same lines would be fine though.

> diff --git a/libgcc/config/aarch64/aarch64-unwind.h 
> b/libgcc/config/aarch64/aarch64-unwind.h
> index 4d36f0b26f7..cf4ec749c05 100644
> --- a/libgcc/config/aarch64/aarch64-unwind.h
> +++ b/libgcc/config/aarch64/aarch64-unwind.h
> @@ -178,6 +178,9 @@ aarch64_demangle_return_addr (struct _Unwind_Context 
> *context,
>return addr;
>  }
>  
> +/* GCS enable flag for chkfeat instruction.  */
> +#define CHKFEAT_GCS 1
> +
>  /* SME runtime function local to libgcc, streaming compatible
> and preserves more registers than the base PCS requires, but
> we don't rely on that here.  */
> @@ -185,12 +188,66 @@ __attribute__ ((visibility ("hidden")))
>  void __libgcc_arm_za_disable (void);
>  
>  /* Disable the SME ZA state in case an unwound frame used the ZA
> -   lazy saving scheme.  */
> +   lazy saving scheme. And unwind the GCS for EH.  */
>  #undef _Unwind_Frames_Extra
>  #define _Unwind_Frames_Extra(x)  \
>do \
>  {\
>__libgcc_arm_za_disable ();\
> +  if (__builtin_aarch64_chkfeat (CHKFEAT_GCS) == 0)  \
> + {   \
> +   for (_Unwind_Word n = (x); n != 0; n--)   \
> + __builtin_aarch64_gcspopm ();   \
> + }   \
> +}\
> +  while (0)
> +
> +/* On signal entry the OS places a token on the GCS that can be used to
> +   verify the integrity of

Re: [PATCH 08/22] aarch64: Add __builtin_aarch64_gcs* tests

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/gcspopm-1.c: New test.
>   * gcc.target/aarch64/gcspr-1.c: New test.
>   * gcc.target/aarch64/gcsss-1.c: New test.
> ---
>  gcc/testsuite/gcc.target/aarch64/gcspopm-1.c | 69 
>  gcc/testsuite/gcc.target/aarch64/gcspr-1.c   | 31 +
>  gcc/testsuite/gcc.target/aarch64/gcsss-1.c   | 49 ++
>  3 files changed, 149 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/gcspopm-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/gcspr-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/gcsss-1.c
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/gcspopm-1.c 
> b/gcc/testsuite/gcc.target/aarch64/gcspopm-1.c
> new file mode 100644
> index 000..6e6add39cf7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/gcspopm-1.c
> @@ -0,0 +1,69 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mbranch-protection=none" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/*
> +**foo1:
> +**   sysl xzr, #3, c7, c7, #1 // gcspopm
> +**   ret
> +*/
> +void
> +foo1 (void)
> +{
> +  __builtin_aarch64_gcspopm ();
> +}
> +
> +/*
> +**foo2:
> +**   mov x0, 0
> +**   sysl x0, #3, c7, c7, #1 // gcspopm
> +**   ret
> +*/
> +unsigned long long
> +foo2 (void)
> +{
> +  return __builtin_aarch64_gcspopm ();
> +}
> +
> +/*
> +**foo3:
> +**   mov x16, 1
> +** (
> +**   mov x0, 0
> +**   hint 40 // chkfeat x16
> +** |
> +**   hint 40 // chkfeat x16
> +**   mov x0, 0
> +** )

The mov could also happen first, before the mov x16, 1.  It would
probably be easier to use...

> +**   cbz x16, .*
> +**   ret
> +**   mov x0, 0
> +**   sysl x0, #3, c7, c7, #1 // gcspopm
> +**   ret
> +*/
> +unsigned long long
> +foo3 (void)
> +{
> +  if (__builtin_aarch64_chkfeat (1) == 0)
> +return __builtin_aarch64_gcspopm ();
> +  return 0;
> +}

unsigned long long
foo3 (unsigned long long x)
{
  if (__builtin_aarch64_chkfeat (1) == 0)
return __builtin_aarch64_gcspopm ();
  return x;
}

so that x0 is returned unmodified if the chkfeat returns nonzero.

FWIW, if we do remove the embedded moves from the .md define_insns,
we should also be able to get rid of the redundant zeroing of x0
on the gcspopm path.

> +
> +/*
> +**foo4:
> +**   sysl xzr, #3, c7, c7, #1 // gcspopm
> +**   mov x0, 0
> +**   sysl x0, #3, c7, c7, #1 // gcspopm
> +**   sysl xzr, #3, c7, c7, #1 // gcspopm
> +**   ret
> +*/
> +unsigned long long
> +foo4 (void)
> +{
> +  unsigned long long a = __builtin_aarch64_gcspopm ();
> +  unsigned long long b = __builtin_aarch64_gcspopm ();
> +  unsigned long long c = __builtin_aarch64_gcspopm ();
> +  (void) a;
> +  (void) c;
> +  return b;
> +}

Nice test :)

> diff --git a/gcc/testsuite/gcc.target/aarch64/gcspr-1.c 
> b/gcc/testsuite/gcc.target/aarch64/gcspr-1.c
> new file mode 100644
> index 000..0e651979551
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/gcspr-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mbranch-protection=none" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/*
> +**foo1:
> +**   mrs x0, s3_3_c2_c5_1 // gcspr_el0
> +**   ret
> +*/
> +void *
> +foo1 (void)
> +{
> +  return __builtin_aarch64_gcspr ();
> +}
> +
> +/*
> +**foo2:
> +**   mrs x[0-9]*, s3_3_c2_c5_1 // gcspr_el0
> +**   sysl xzr, #3, c7, c7, #1 // gcspopm
> +**   mrs x[0-9]*, s3_3_c2_c5_1 // gcspr_el0
> +**   sub x0, x[0-9]*, x[0-9]*
> +**   ret
> +*/
> +long
> +foo2 (void)
> +{
> +  const char *p = __builtin_aarch64_gcspr ();
> +  __builtin_aarch64_gcspopm ();
> +  const char *q = __builtin_aarch64_gcspr ();
> +  return p - q;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/gcsss-1.c 
> b/gcc/testsuite/gcc.target/aarch64/gcsss-1.c
> new file mode 100644
> index 000..025c7fee647
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/gcsss-1.c
> @@ -0,0 +1,49 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mbranch-protection=none" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/*
> +**foo1:
> +**   sys #3, c7, c7, #2, x0 // gcsss1
> +**   mov x[0-9]*, 0
> +**   sysl x[0-9]*, #3, c7, c7, #3 // gcsss2

Might as well make this:

**  mov (x[0-9]+), 0
**  sysl \1, #3, c7, c7, #3 // gcsss2


> +**   ret
> +*/
> +void
> +foo1 (void *p)
> +{
> +  __builtin_aarch64_gcsss (p);
> +}
> +
> +/*
> +**foo2:
> +**   sys #3, c7, c7, #2, x0 // gcsss1
> +**   mov x0, 0
> +**   sysl x0, #3, c7, c7, #3 // gcsss2
> +**   ret
> +*/
> +void *
> +foo2 (void *p)
> +{
> +  return __builtin_aarch64_gcsss (p);
> +}
> +
> +/*
> +**foo3:
> +**   mov x16, 1
> +**   hint 40 // chkfeat x16
> +**   cbnz x16, .*
> +**   sys #3, c7, c7, #2, x0 // gcsss1
> +**   mov x0, 0
> +**   sysl x0, #3, c7, c7, #3 // gcsss2
> +**   ret
> +**   mov x0, 0
> +**   ret
>

Re: [PATCH 06/22] aarch64: Add GCS instructions

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> Add instructions for the Guarded Control Stack extension.
>
> GCSSS1 and GCSSS2 are modelled as a single GCSSS unspec, because they
> are always used together in the compiler.
>
> Before GCSPOPM and GCSSS2 an extra "mov xn, 0" is added to clear the
> output register, this is needed to get reasonable result when GCS is
> disabled, when the instructions are NOPs. Since the instructions are
> expected to be used behind runtime feature checks, this is mainly
> relevant if GCS can be disabled asynchronously.
>
> The output of GCSPOPM is usually not needed, so a separate gcspopm_xzr
> was added to model that. Did not do the same for GCSSS as it is a less
> common operation.
>
> The used mnemonics do not depend on updated assembler since these
> instructions can be used without new -march setting behind a runtime
> check.
>
> Reading the GCSPR is modelled as unspec_volatile so it does not get
> reordered wrt the other instructions changing the GCSPR.

Sorry to be awkward, but I think we should still use one define_insn
per instruction, with no embedded moves.  E.g.:

>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.md (aarch64_load_gcspr): New.
>   (aarch64_gcspopm): New.
>   (aarch64_gcspopm_xzr): New.
>   (aarch64_gcsss): New.
> ---
>  gcc/config/aarch64/aarch64.md | 35 +++
>  1 file changed, 35 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 43bed0ce10f..e4e11e35b5b 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -382,6 +382,9 @@ (define_c_enum "unspecv" [
>  UNSPECV_BTI_J; Represent BTI j.
>  UNSPECV_BTI_JC   ; Represent BTI jc.
>  UNSPECV_CHKFEAT  ; Represent CHKFEAT X16.
> +UNSPECV_GCSPR; Represent MRS Xn, GCSPR_EL0
> +UNSPECV_GCSPOPM  ; Represent GCSPOPM.
> +UNSPECV_GCSSS; Represent GCSSS1 and GCSSS2.
>  UNSPECV_TSTART   ; Represent transaction start.
>  UNSPECV_TCOMMIT  ; Represent transaction commit.
>  UNSPECV_TCANCEL  ; Represent transaction cancel.
> @@ -8321,6 +8324,38 @@ (define_insn "aarch64_chkfeat"
>"hint\\t40 // chkfeat x16"
>  )
>  
> +;; Guarded Control Stack (GCS) instructions
> +(define_insn "aarch64_load_gcspr"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (unspec_volatile:DI [(const_int 0)] UNSPECV_GCSPR))]
> +  ""
> +  "mrs\\t%0, s3_3_c2_c5_1 // gcspr_el0"
> +  [(set_attr "type" "mrs")]
> +)
> +
> +(define_insn "aarch64_gcspopm"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (unspec_volatile:DI [(const_int 0)] UNSPECV_GCSPOPM))]
> +  ""
> +  "mov\\t%0, 0\;sysl\\t%0, #3, c7, c7, #1 // gcspopm"
> +  [(set_attr "length" "8")]
> +)

...this would be:

(define_insn "aarch64_gcspopm"
  [(set (match_operand:DI 0 "register_operand" "=r")
(unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0")] 
UNSPECV_GCSPOPM))]
  ""
  "sysl\\t%0, #3, c7, c7, #1 // gcspopm"
)

with the code that emits the instruction first emitting a zeroing
of operand 1.
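
That is, the builtin expander would do something like this (a sketch,
reusing the operand helpers that the GCS expander already uses):

  expand_operand ops[2];
  rtx zero = force_reg (DImode, const0_rtx);
  create_output_operand (&ops[0], target, DImode);
  create_input_operand (&ops[1], zero, DImode);
  expand_insn (CODE_FOR_aarch64_gcspopm, 2, ops);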

Thanks,
Richard

> +
> +(define_insn "aarch64_gcspopm_xzr"
> +  [(unspec_volatile [(const_int 0)] UNSPECV_GCSPOPM)]
> +  ""
> +  "sysl\\txzr, #3, c7, c7, #1 // gcspopm"
> +)
> +
> +(define_insn "aarch64_gcsss"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (unspec_volatile:DI [(match_operand:DI 1 "register_operand" "r")]
> +   UNSPECV_GCSSS))]
> +  ""
> +  "sys\\t#3, c7, c7, #2, %1 // gcsss1\;mov\\t%0, 0\;sysl\\t%0, #3, c7, c7, 
> #3 // gcsss2"
> +  [(set_attr "length" "12")]
> +)
> +
>  ;; AdvSIMD Stuff
>  (include "aarch64-simd.md")


Re: [PATCH] SVE intrinsics: Fold svaba with op1 all zeros to svabd.

2024-10-24 Thread Richard Sandiford
Jennifer Schmitz  writes:
> Similar to
> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665780.html,
> this patch implements folding of svaba to svabd if op1 is all zeros,
> resulting in the use of UABD/SABD instructions instead of UABA/SABA.
> Tests were added to check the produced assembly for use of UABD/SABD,
> also for the _n case.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-sve2.cc
>   (svaba_impl::fold): Fold svaba to svabd if op1 is all zeros.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve2/acle/asm/aba_s32.c: New tests.
>   * gcc.target/aarch64/sve2/acle/asm/aba_s64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/aba_u32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/aba_u64.c: Likewise.

OK, thanks.

Richard

> ---
>  .../aarch64/aarch64-sve-builtins-sve2.cc  | 18 +++
>  .../aarch64/sve2/acle/asm/aba_s32.c   | 23 +++
>  .../aarch64/sve2/acle/asm/aba_s64.c   | 22 ++
>  .../aarch64/sve2/acle/asm/aba_u32.c   | 22 ++
>  .../aarch64/sve2/acle/asm/aba_u64.c   | 22 ++
>  5 files changed, 107 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> index 6a20a613f83..107b299d068 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> @@ -80,6 +80,24 @@ unspec_sqrdcmlah (int rot)
>  
>  class svaba_impl : public function_base
>  {
> +public:
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +/* Fold to svabd if op1 is all zeros.  */
> +tree op1 = gimple_call_arg (f.call, 0);
> +if (!integer_zerop (op1))
> +  return NULL;
> +function_instance instance ("svabd", functions::svabd,
> + shapes::binary_opt_n, f.mode_suffix_id,
> + f.type_suffix_ids, GROUP_none, PRED_x);
> +gcall *call = f.redirect_call (instance);
> +/* Add a ptrue as predicate, because unlike svaba, svabd is
> +   predicated.  */
> +gimple_call_set_arg (call, 0, build_all_ones_cst (f.gp_type ()));
> +return call;
> +  }
> +
>  public:
>rtx
>expand (function_expander &e) const override
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s32.c 
> b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s32.c
> index 73c00282526..655ad630241 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s32.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s32.c
> @@ -108,3 +108,26 @@ TEST_UNIFORM_Z (aba_11_s32_tied2, svint32_t,
>  TEST_UNIFORM_Z (aba_11_s32_untied, svint32_t,
>   z0 = svaba_n_s32 (z1, z2, 11),
>   z0 = svaba (z1, z2, 11))
> +
> +/*
> +** aba_11_s32_zeroop1n:
> +**   ptrue   (p[0-7])\.b, all
> +**   mov z0\.s, #11
> +**   sabd z0\.s, \1/m, z0\.s, z1\.s
> +**   ret
> +*/
> +TEST_UNIFORM_Z (aba_11_s32_zeroop1n, svint32_t,
> + z0 = svaba_n_s32 (svdup_s32 (0), z1, 11),
> + z0 = svaba (svdup_s32 (0), z1, 11))
> +
> +
> +/*
> +** aba_11_s32_zeroop1:
> +**   ptrue   (p[0-7])\.b, all
> +**   mov z0\.s, #11
> +**   sabd z0\.s, \1/m, z0\.s, z1\.s
> +**   ret
> +*/
> +TEST_UNIFORM_Z (aba_11_s32_zeroop1, svint32_t,
> + z0 = svaba_s32 (svdup_s32 (0), z1, svdup_s32 (11)),
> + z0 = svaba (svdup_s32 (0), z1, svdup_s32 (11)))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s64.c 
> b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s64.c
> index 0c169dbf613..8b1eb7d2f4e 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s64.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_s64.c
> @@ -108,3 +108,25 @@ TEST_UNIFORM_Z (aba_11_s64_tied2, svint64_t,
>  TEST_UNIFORM_Z (aba_11_s64_untied, svint64_t,
>   z0 = svaba_n_s64 (z1, z2, 11),
>   z0 = svaba (z1, z2, 11))
> +
> +/*
> +** aba_11_s64_zeroop1n:
> +**   ptrue   (p[0-7])\.b, all
> +**   mov z0\.d, #11
> +**   sabd z0\.d, \1/m, z0\.d, z1\.d
> +**   ret
> +*/
> +TEST_UNIFORM_Z (aba_11_s64_zeroop1n, svint64_t,
> + z0 = svaba_n_s64 (svdup_s64 (0), z1, 11),
> + z0 = svaba (svdup_s64 (0), z1, 11))
> +
> +/*
> +** aba_11_s64_zeroop1:
> +**   ptrue   (p[0-7])\.b, all
> +**   mov z0\.d, #11
> +**   sabd z0\.d, \1/m, z0\.d, z1\.d
> +**   ret
> +*/
> +TEST_UNIFORM_Z (aba_11_s64_zeroop1, svint64_t,
> + z0 = svaba_s64 (svdup_s64 (0), z1, svdup_s64 (11)),
> + z0 = svaba (svdup_s64 (0), z1, svdup_s64 (11)))
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_u32.c 
> b/gcc/testsuite/gcc.target/aarch64/sve2/acle/asm/aba_u32.c
> index 2ba8f419567..fc2fed28e02 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve2/acl

Re: [PATCH 22/22] aarch64: Fix nonlocal goto tests incompatible with GCS

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> gcc/testsuite/ChangeLog:
>   * gcc.target/aarch64/gcs-nonlocal-3.c: New test.
>   * gcc.target/aarch64/sme/nonlocal_goto_4.c: Update.
>   * gcc.target/aarch64/sme/nonlocal_goto_5.c: Update.
>   * gcc.target/aarch64/sme/nonlocal_goto_6.c: Update.
> ---
>  .../gcc.target/aarch64/gcs-nonlocal-3.c   | 33 +++
>  .../gcc.target/aarch64/sme/nonlocal_goto_4.c  |  2 +-
>  .../gcc.target/aarch64/sme/nonlocal_goto_5.c  |  2 +-
>  .../gcc.target/aarch64/sme/nonlocal_goto_6.c  |  2 +-
>  4 files changed, 36 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/gcs-nonlocal-3.c
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/gcs-nonlocal-3.c 
> b/gcc/testsuite/gcc.target/aarch64/gcs-nonlocal-3.c
> new file mode 100644
> index 000..8511f66f66e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/gcs-nonlocal-3.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2 
> -mbranch-protection=gcs" } */
> +/* { dg-final { check-function-bodies "**" "" "" { target "*-*-*" } 
> {\.L[0-9]+\:} } } */
> +
> +void run(void (*)());
> +
> +/*
> +** bar.0:
> +**   ...
> +**   hint 40 // chkfeat x16
> +**   tbnz w16, 0, (\.L[0-9]+)
> +**   ...
> +**   mrs x1, s3_3_c2_c5_1 // gcspr_el0
> +**   subs x1, x3, x1

It doesn't look like this choice of registers is guaranteed.
Probably safer as:

**  mrs (x[0-9]+), s3_3_c2_c5_1 // gcspr_el0
**  subs x[0-9]+, x[0-9]+, \1

(which will throw off the later captures, sorry!)

OK with that change, thanks.

Richard

> +**   bne (\.L[0-9]+)\n\1\:
> +**   ...
> +**   br  x[0-9]+\n\2\:
> +**   ...
> +**   sysl xzr, #3, c7, c7, #1 // gcspopm
> +**   ...
> +**   b   \1
> +*/
> +int
> +foo (int *ptr)
> +{
> +  __label__ failure;
> +
> +  void bar () { *ptr += 1; goto failure; }
> +  run (bar);
> +  return 1;
> +
> +failure:
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_4.c 
> b/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_4.c
> index 0446076286b..aed04bb495c 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_4.c
> @@ -1,4 +1,4 @@
> -/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2 
> -mbranch-protection=none" } */
>  /* { dg-final { check-function-bodies "**" "" } } */
>  
>  void run(void (*)());
> diff --git a/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_5.c 
> b/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_5.c
> index 4246aec8b2f..e4a31c5c600 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_5.c
> @@ -1,4 +1,4 @@
> -/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2 
> -mbranch-protection=none" } */
>  /* { dg-final { check-function-bodies "**" "" } } */
>  
>  void run(void (*)() __arm_streaming);
> diff --git a/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_6.c 
> b/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_6.c
> index 151e2f22dc7..38f6c139f6d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sme/nonlocal_goto_6.c
> @@ -1,4 +1,4 @@
> -/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2 
> -mbranch-protection=none" } */
>  /* { dg-final { check-function-bodies "**" "" } } */
>  
>  void run(void (*)() __arm_streaming_compatible);


Re: [PATCH 00/22] aarch64: Add support for Guarded Control Stack extension

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> This patch series adds support for the Guarded Control Stack extension [1].
>
> GCS marking for binaries is specified in [2].
>
> Regression tested on AArch64 and no regressions have been found.
>
> Is this OK for trunk?
>
> Sources and branches:
>  - binutils-gdb: sourceware.org/git/binutils-gdb.git users/ARM/gcs
>  - gcc: this patch series, or
>gcc.gnu.org/git/gcc.git vendors/ARM/gcs-v3
>see https://gcc.gnu.org/gitwrite.html#vendor for setup details
>  - glibc: sourceware.org/git/glibc.git arm/gcs-v2
>  - kernel: git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git 
> for-next/gcs
>
> Cross-building the toolchain for target aarch64-none-linux-gnu:
>  - build and install binutils-gdb
>  - build and install GCC stage 1
>  - install kernel headers
>  - install glibc headers
>  - build and install GCC stage 2 configuring with 
> --enable-standard-branch-protection
>  - build and install glibc
>  - build and install GCC stage 3 along with target libraries configuring with 
> --enable-standard-branch-protection
>
> FVP model provided by the Shrinkwrap tool [3] can be used for testing.
>
> Run tests with environment var
>
>   GLIBC_TUNABLES=glibc.cpu.aarch64_gcs=1:glibc.cpu.aarch64_gcs_policy=2
>
> See details about Glibc tunables in corresponding Glibc patch [4].
>
> Corresponding binutils patch [5].
>
> [1] https://developer.arm.com/documentation/ddi0487/ka/ (chapter D11)
> [2] https://github.com/ARM-software/abi-aa/blob/main/sysvabi64/sysvabi64.rst
> [3] https://git.gitlab.arm.com/tooling/shrinkwrap.git
> [4] 
> https://inbox.sourceware.org/libc-alpha/20241023083920.466015-1-yury.khrusta...@arm.com/
> [5] 
> https://inbox.sourceware.org/binutils/20241014101743.346-1-yury.khrusta...@arm.com/

Thanks for this.  I've replied to some individual patches, but the
ones I didn't reply to look ok as-is.

Richard

>
> ---
>
> Matthieu Longo (1):
>   aarch64: Fix tests incompatible with GCS
>
> Richard Ball (1):
>   aarch64: Add tests and docs for indirect_return attribute
>
> Szabolcs Nagy (19):
>   aarch64: Add -mbranch-protection=gcs option
>   aarch64: Add branch-protection target pragma tests
>   aarch64: Add support for chkfeat insn
>   aarch64: Add __builtin_aarch64_chkfeat
>   aarch64: Add __builtin_aarch64_chkfeat tests
>   aarch64: Add GCS instructions
>   aarch64: Add GCS builtins
>   aarch64: Add __builtin_aarch64_gcs* tests
>   aarch64: Add GCS support for nonlocal stack save
>   aarch64: Add non-local goto and jump tests for GCS
>   aarch64: Add ACLE feature macros for GCS
>   aarch64: Add test for GCS ACLE defs
>   aarch64: Add target pragma tests for gcs
>   aarch64: Add GCS support to the unwinder
>   aarch64: Emit GNU property NOTE for GCS
>   aarch64: libgcc: add GCS marking to asm
>   aarch64: libatomic: add GCS marking to asm
>   aarch64: libitm: Add GCS support
>   aarch64: Introduce indirect_return attribute
>
> Yury Khrustalev (1):
>   aarch64: Fix nonlocal goto tests incompatible with GCS
>
>  gcc/config/aarch64/aarch64-builtins.cc|  88 
>  gcc/config/aarch64/aarch64-c.cc   |   3 +
>  gcc/config/aarch64/aarch64-protos.h   |   2 +
>  gcc/config/aarch64/aarch64.cc |  40 ++
>  gcc/config/aarch64/aarch64.h  |   7 +
>  gcc/config/aarch64/aarch64.md | 126 ++
>  gcc/config/aarch64/aarch64.opt|   3 +
>  gcc/config/arm/aarch-bti-insert.cc|  36 -
>  gcc/configure |   2 +-
>  gcc/configure.ac  |   6 +-
>  gcc/doc/extend.texi   |   5 +
>  gcc/doc/invoke.texi   |   5 +-
>  gcc/testsuite/g++.target/aarch64/pr94515-1.C  |   6 +-
>  .../return_address_sign_ab_exception.C|  19 ++-
>  gcc/testsuite/gcc.target/aarch64/chkfeat-1.c  |  75 +++
>  gcc/testsuite/gcc.target/aarch64/chkfeat-2.c  |  15 +++
>  gcc/testsuite/gcc.target/aarch64/eh_return.c  |  13 ++
>  .../gcc.target/aarch64/gcs-nonlocal-1.c   |  25 
>  .../gcc.target/aarch64/gcs-nonlocal-2.c   |  21 +++
>  .../gcc.target/aarch64/gcs-nonlocal-3.c   |  33 +
>  gcc/testsuite/gcc.target/aarch64/gcspopm-1.c  |  69 ++
>  gcc/testsuite/gcc.target/aarch64/gcspr-1.c|  31 +
>  gcc/testsuite/gcc.target/aarch64/gcsss-1.c|  49 +++
>  .../gcc.target/aarch64/indirect_return.c  |  25 
>  gcc/testsuite/gcc.target/aarch64/pr104689.c   |   3 +-
>  .../gcc.target/aarch64/pragma_cpp_predefs_1.c |  30 +
>  .../gcc.target/aarch64/pragma_cpp_predefs_4.c |  85 
>  .../gcc.target/aarch64/sme/nonlocal_goto_4.c  |   2 +-
>  .../gcc.target/aarch64/sme/nonlocal_goto_5.c  |   2 +-
>  .../gcc.target/aarch64/sme/nonlocal_goto_6.c  |   2 +-
>  libatomic/config/linux/aarch64/atomic_16.S|  11 +-
>  libgcc/config/aarch64/aarch64-asm.h   |  16 ++-
>  libgcc/config/aarch64/aarch64-unwind.h|  59 +++-
>  l

Re: [PATCH 21/22] aarch64: Fix tests incompatible with GCS

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Matthieu Longo 
>
> gcc/testsuite/ChangeLog:
>
>   * g++.target/aarch64/return_address_sign_ab_exception.C: Update.
>   * gcc.target/aarch64/eh_return.c: Update.

OK, thanks.

Richard

> ---
>  .../return_address_sign_ab_exception.C| 19 +--
>  gcc/testsuite/gcc.target/aarch64/eh_return.c  | 13 +
>  2 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git 
> a/gcc/testsuite/g++.target/aarch64/return_address_sign_ab_exception.C 
> b/gcc/testsuite/g++.target/aarch64/return_address_sign_ab_exception.C
> index ead11de7b15..6c79ebf03eb 100644
> --- a/gcc/testsuite/g++.target/aarch64/return_address_sign_ab_exception.C
> +++ b/gcc/testsuite/g++.target/aarch64/return_address_sign_ab_exception.C
> @@ -1,16 +1,28 @@
>  /* { dg-do run } */
>  /* { dg-options "--save-temps" } */
>  /* { dg-require-effective-target arm_v8_3a_bkey_directive } */
> +/* { dg-final { check-function-bodies "**" "" } } */
>  
> +/*
> +** _Z5foo_av:
> +**   hint25 // paciasp
> +** ...
> +*/
>  __attribute__((target("branch-protection=pac-ret+leaf")))
>  int foo_a () {
>throw 22;
>  }
>  
> +/*
> +** _Z5foo_bv:
> +**   hint27 // pacibsp
> +** ...
> +*/
>  __attribute__((target("branch-protection=pac-ret+leaf+b-key")))
>  int foo_b () {
>throw 22;
>  }
> +/* { dg-final { scan-assembler-times ".cfi_b_key_frame" 1 } } */
>  
>  int main (int argc, char** argv) {
>try {
> @@ -23,9 +35,4 @@ int main (int argc, char** argv) {
>  }
>}
>return 1;
> -}
> -
> -/* { dg-final { scan-assembler-times "paciasp" 1 } } */
> -/* { dg-final { scan-assembler-times "pacibsp" 1 } } */
> -/* { dg-final { scan-assembler-times ".cfi_b_key_frame" 1 } } */
> -
> +}
> \ No newline at end of file
> diff --git a/gcc/testsuite/gcc.target/aarch64/eh_return.c 
> b/gcc/testsuite/gcc.target/aarch64/eh_return.c
> index 32179488085..51b20f784b3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/eh_return.c
> +++ b/gcc/testsuite/gcc.target/aarch64/eh_return.c
> @@ -1,6 +1,19 @@
>  /* { dg-do run } */
>  /* { dg-options "-O2 -fno-inline" } */
>  
> +/* With BTI enabled, this test would crash with SIGILL, Illegal instruction.
> +   The 2nd argument of __builtin_eh_return is expected to be an EH handler
> +   within a function, rather than a separate function.
> +   The current implementation of __builtin_eh_return in AArch64 backend 
> emits a
> +   jump instead of branching with LR.
> +   The prologue of the handler (i.e. continuation) starts with "bti c" (vs.
> +   "bti jc") which is a landing pad type prohibiting jumps, hence the 
> exception
> +   at runtime.
> +   The current behavior of __builtin_eh_return is considered correct.
> +   Consequently, the default option -mbranch-protection=standard needs to be
> +   overridden to remove BTI.  */
> +/* { dg-additional-options "-mbranch-protection=pac-ret+leaf+gcs" { target { 
> default_branch_protection } } } */
> +
>  #include 
>  #include 


Re: [PATCH 20/22] aarch64: Add tests and docs for indirect_return attribute

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Richard Ball 
>
> This patch adds a new testcase and docs
> for the indirect_return attribute.
>
> gcc/ChangeLog:
>
>   * doc/extend.texi: Add AArch64 docs for indirect_return
>   attribute.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/indirect_return.c: New test.
>   Co-authored-by: Yury Khrustalev 
> ---
>  gcc/doc/extend.texi   |  5 
>  .../gcc.target/aarch64/indirect_return.c  | 25 +++
>  2 files changed, 30 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/indirect_return.c
>
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index 42bd567119d..45e2b3ec569 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -4760,6 +4760,11 @@ Enable or disable calls to out-of-line helpers to 
> implement atomic operations.
>  This corresponds to the behavior of the command-line options
>  @option{-moutline-atomics} and @option{-mno-outline-atomics}.
>  
> +@cindex @code{indirect_return} function attribute, AArch64
> +@item indirect_return
> +Used to inform the compiler that a function may return via
> +an indirect return. Adds a BTI J instruction under 
> @option{mbranch-protection=} bti.

"return via an indirect return" doesn't really add much information,
especially since the "indirect" might sound related to "indirect
branch", and all returns are indirect in that sense.

How about going with a variation of the x86 documentation:

The @code{indirect_return} attribute can be applied to a function type
to indicate that the function may return via an indirect branch instead
of via a normal return instruction.  For example, this can be true of
functions that implement manual context switching between user space
threads, such as POSIX's @code{swapcontext} function.

>  @end table
>  
>  The above target attributes can be specified as follows:
> diff --git a/gcc/testsuite/gcc.target/aarch64/indirect_return.c 
> b/gcc/testsuite/gcc.target/aarch64/indirect_return.c
> new file mode 100644
> index 000..f1ef56d5557
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/indirect_return.c
> @@ -0,0 +1,25 @@
> +/* { dg-do compile } */
> +/* { dg-options "-mbranch-protection=bti" } */
> +
> +int __attribute((indirect_return))
> +foo (int a)
> +{
> +  return a;
> +}
> +
> +/*
> +**func1:
> +**   hint 34 // bti c
> +**   ...
> +**   bl  foo
> +**   hint 36 // bti j
> +**   ...
> +**   ret
> +*/
> +int
> +func1 (int a, int b)
> +{
> +  return foo (a + b);
> +}
> +
> +/* { dg-final { check-function-bodies "**" "" "" } } */

I think we should also check the case of a sibling call from an
indirect_return function to an indirect_return function,
since patch 19/22 specifically optimises that case.
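
Something along these lines, say (untested, and the expected body
would need checking against what the compiler actually emits):

  int __attribute((indirect_return))
  bar (int a);

  /*
  ** func2:
  **   ...
  **   b   bar
  */
  int __attribute((indirect_return))
  func2 (int a)
  {
    return bar (a);
  }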

Thanks,
Richard


Re: [PATCH 7/9] Handle POLY_INT_CSTs in get_nonzero_bits

2024-10-24 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, 18 Oct 2024, Richard Sandiford wrote:
>
>> This patch extends get_nonzero_bits to handle POLY_INT_CSTs.
>> The easiest (but also most useful) case is that the number
>> of trailing zeros in the runtime value is at least the number
>> of trailing zeros in each individual component.
>> 
>> In principle, we could do this for coeffs 1 and above only,
>> and then OR in ceoff 0.  This would give ~0x11 for [14, 32], say.
>> But that's future work.
>> 
>> gcc/
>>  * tree-ssanames.cc (get_nonzero_bits): Handle POLY_INT_CSTs.
>>  * match.pd (with_possible_nonzero_bits): Likewise.
>> 
>> gcc/testsuite/
>>  * gcc.target/aarch64/sve/cnt_fold_4.c: New test.
>> ---
>>  gcc/match.pd  |  2 +
>>  .../gcc.target/aarch64/sve/cnt_fold_4.c   | 61 +++
>>  gcc/tree-ssanames.cc  |  3 +
>>  3 files changed, 66 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
>> 
>> diff --git a/gcc/match.pd b/gcc/match.pd
>> index 540582dc984..41903554478 100644
>> --- a/gcc/match.pd
>> +++ b/gcc/match.pd
>> @@ -2893,6 +2893,8 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>> possibly set.  */
>>  (match with_possible_nonzero_bits
>>   INTEGER_CST@0)
>> +(match with_possible_nonzero_bits
>> + POLY_INT_CST@0)
>>  (match with_possible_nonzero_bits
>>   SSA_NAME@0
>>   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0)) || POINTER_TYPE_P (TREE_TYPE (@0)
>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c 
>> b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
>> new file mode 100644
>> index 000..b7a53701993
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_4.c
>> @@ -0,0 +1,61 @@
>> +/* { dg-do compile } */
>> +/* { dg-options "-O2" } */
>> +/* { dg-final { check-function-bodies "**" "" } } */
>> +
>> +#include 
>> +
>> +/*
>> +** f1:
>> +**  cnth x0
>> +**  ret
>> +*/
>> +uint64_t
>> +f1 ()
>> +{
>> +  uint64_t x = svcntw ();
>> +  x >>= 2;
>> +  return x << 3;
>> +}
>> +
>> +/*
>> +** f2:
>> +**  [^\n]+
>> +**  [^\n]+
>> +**  ...
>> +**  ret
>> +*/
>> +uint64_t
>> +f2 ()
>> +{
>> +  uint64_t x = svcntd ();
>> +  x >>= 2;
>> +  return x << 3;
>> +}
>> +
>> +/*
>> +** f3:
>> +**  cntb x0, all, mul #4
>> +**  ret
>> +*/
>> +uint64_t
>> +f3 ()
>> +{
>> +  uint64_t x = svcntd ();
>> +  x >>= 1;
>> +  return x << 6;
>> +}
>> +
>> +/*
>> +** f4:
>> +**  [^\n]+
>> +**  [^\n]+
>> +**  ...
>> +**  ret
>> +*/
>> +uint64_t
>> +f4 ()
>> +{
>> +  uint64_t x = svcntd ();
>> +  x >>= 2;
>> +  return x << 2;
>> +}
>> diff --git a/gcc/tree-ssanames.cc b/gcc/tree-ssanames.cc
>> index 4f83fcbb517..d2d1ec18797 100644
>> --- a/gcc/tree-ssanames.cc
>> +++ b/gcc/tree-ssanames.cc
>> @@ -505,6 +505,9 @@ get_nonzero_bits (const_tree name)
>>/* Use element_precision instead of TYPE_PRECISION so complex and
>>   vector types get a non-zero precision.  */
>>unsigned int precision = element_precision (TREE_TYPE (name));
>> +  if (POLY_INT_CST_P (name))
>> +return -known_alignment (wi::to_poly_wide (name));
>> +
>
> Since you don't need precision can you move this right after the
> INTEGER_CST handling?

Oops, yes.  An earlier cut did use the precision, but I forgot to
move it to a more sensible place when changing it.

Thanks for the reviews.  I've pushed parts 1, 2, and 4-9 with the
changes suggested.  Part 3 needs more work, so I'll do that separately.

Richard


Re: [PATCH 19/22] aarch64: Introduce indirect_return attribute

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> Tail calls of indirect_return functions from non-indirect_return
> functions are disallowed even if BTI is disabled, since the call
> site may have BTI enabled.
>
> Following x86, mismatching attribute on function pointers is not
> a type error even though this can lead to bugs.

Is that still true?  I would have expected the aarch64_comp_type_attributes
part of the patch to reject mismatches.

> Needed for swapcontext within the same function when GCS is enabled.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.cc (aarch64_gnu_attributes): Add
>   indirect_return.
>   (aarch64_function_ok_for_sibcall): Disallow tail calls if caller
>   is non-indirect_return but callee is indirect_return.
>   (aarch64_comp_type_attributes): Check indirect_return attribute.
>   * config/arm/aarch-bti-insert.cc (call_needs_bti_j): New.
>   (rest_of_insert_bti): Use call_needs_bti_j.
>
> ---
>  gcc/config/aarch64/aarch64.cc  | 11 +
>  gcc/config/arm/aarch-bti-insert.cc | 36 ++
>  2 files changed, 43 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index a89a30113b9..9bfc9a1dbba 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -853,6 +853,7 @@ static const attribute_spec aarch64_gnu_attributes[] =
> affects_type_identity, handler, exclude } */
>{ "aarch64_vector_pcs", 0, 0, false, true,  true,  true,
> handle_aarch64_vector_pcs_attribute, NULL },
> +  { "indirect_return",0, 0, false, true, true, false, NULL, NULL },
>{ "arm_sve_vector_bits", 1, 1, false, true,  false, true,
> aarch64_sve::handle_arm_sve_vector_bits_attribute,
> NULL },
> @@ -6429,6 +6430,14 @@ aarch64_function_ok_for_sibcall (tree, tree exp)
>  if (bool (aarch64_cfun_shared_flags (state))
>   != bool (aarch64_fntype_shared_flags (fntype, state)))
>return false;
> +
> +  /* BTI J is needed where indirect_return functions may return
> + if bti is enabled there.  */
> +  if (lookup_attribute ("indirect_return", TYPE_ATTRIBUTES (fntype))
> +  && !lookup_attribute ("indirect_return",
> + TYPE_ATTRIBUTES (TREE_TYPE (cfun->decl
> +return false;
> +
>return true;
>  }
>  
> @@ -29118,6 +29127,8 @@ aarch64_comp_type_attributes (const_tree type1, 
> const_tree type2)
>  
>if (!check_attr ("gnu", "aarch64_vector_pcs"))
>  return 0;
> +  if (!check_attr ("gnu", "indirect_return"))
> +return 0;
>if (!check_attr ("gnu", "Advanced SIMD type"))
>  return 0;
>if (!check_attr ("gnu", "SVE type"))
> diff --git a/gcc/config/arm/aarch-bti-insert.cc 
> b/gcc/config/arm/aarch-bti-insert.cc
> index 14d36971cd4..403afff9120 100644
> --- a/gcc/config/arm/aarch-bti-insert.cc
> +++ b/gcc/config/arm/aarch-bti-insert.cc
> @@ -92,6 +92,35 @@ const pass_data pass_data_insert_bti =
>0, /* todo_flags_finish.  */
>  };
>  
> +/* Decide if BTI J is needed after a call instruction.  */
> +static bool
> +call_needs_bti_j (rtx_insn *insn)
> +{
> +  /* Call returns twice, one of which may be indirect.  */
> +  if (find_reg_note (insn, REG_SETJMP, NULL))
> +return true;
> +
> +  /* Tail call does not return.  */
> +  if (SIBLING_CALL_P (insn))
> +return false;
> +
> +  /* Check if the function is marked to return indirectly.  */
> +  rtx call = get_call_rtx_from (insn);
> +  rtx fnaddr = XEXP (call, 0);
> +  tree fndecl = NULL_TREE;
> +  if (GET_CODE (XEXP (fnaddr, 0)) == SYMBOL_REF)
> +fndecl = SYMBOL_REF_DECL (XEXP (fnaddr, 0));
> +  if (fndecl == NULL_TREE)
> +fndecl = MEM_EXPR (fnaddr);
> +  if (!fndecl)
> +return false;
> +  if (TREE_CODE (TREE_TYPE (fndecl)) != FUNCTION_TYPE
> +  && TREE_CODE (TREE_TYPE (fndecl)) != METHOD_TYPE)
> +return false;
> +  tree fntype = TREE_TYPE (fndecl);
> +  return lookup_attribute ("indirect_return", TYPE_ATTRIBUTES (fntype));

I think it would be safer/more robust to encode the indirect_return status
in the call insn "cookie", like we do for some other ABI properties.
The information would be recorded in CUMULATIVE_ARGS by
aarch64_init_cumulative_args, then aarch64_function_arg would
add it to the cookie.
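
Very roughly, and with made-up names (neither the CUMULATIVE_ARGS field
nor the flag below exists today, they are just for illustration):

  /* In aarch64_init_cumulative_args: record whether the callee's type
     carries the attribute.  */
  pcum->indirect_return_p
    = (fntype
       && lookup_attribute ("indirect_return", TYPE_ATTRIBUTES (fntype)));

  /* In aarch64_function_arg: fold that bit into the cookie, next to the
     existing ABI information.  */
  if (pcum->indirect_return_p)
    cookie_flags |= AARCH64_CALL_INDIRECT_RETURN;

The BTI pass could then read the flag back off the call's cookie operand
instead of digging through SYMBOL_REF_DECLs and MEM_EXPRs.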

Thanks,
Richard

> +}
> +
>  /* Insert the BTI instruction.  */
>  /* This is implemented as a late RTL pass that runs before branch
> shortening and does the following.  */
> @@ -147,10 +176,9 @@ rest_of_insert_bti (void)
>   }
>   }
>  
> -   /* Also look for calls to setjmp () which would be marked with
> -  REG_SETJMP note and put a BTI J after.  This is where longjump ()
> -  will return.  */
> -   if (CALL_P (insn) && (find_reg_note (insn, REG_SETJMP, NULL)))
> +   /* Also look for calls that may return indirectly, such as setjmp,
> +  and put a BTI J after them.  */
> +  

Re: [PATCH 5/9] Generalise ((X /[ex] A) +- B) * A -> X +- A * B rule

2024-10-24 Thread Richard Sandiford
Richard Biener  writes:
> On Fri, 18 Oct 2024, Richard Sandiford wrote:
>
>> match.pd had a rule to simplify ((X /[ex] A) +- B) * A -> X +- A * B
>> when A and B are INTEGER_CSTs.  This patch extends it to handle the
>> case where the outer multiplication is by a factor of A, not just
>> A itself.  It also handles addition and multiplication of poly_ints.
>> (Exact division by a poly_int seems unlikely.)
>> 
>> I'm not sure why minus is handled here.  Wouldn't minus of a constant be
>> canonicalised to a plus?
>
> All but A - INT_MIN, yes.  For A - INT_MIN we'd know A == INT_MIN.
> For unsigned we canonicalize all constants IIRC.  So I agree the
> minus case can go away.

Ah, right, thanks.  Hadn't thought about that special case.  Given that..

> OK unchanged or with the minus removed.

...I ended up leaving it unchanged. :)

Richard

>
> Thanks,
> Richard.
>
>> gcc/
>>  * match.pd: Generalise ((X /[ex] A) +- B) * A -> X +- A * B rule
>>  to ((X /[ex] C1) +- C2) * (C1 * C3) -> (X * C3) +- (C1 * C2 * C3).
>> 
>> gcc/testsuite/
>>  * gcc.dg/tree-ssa/mulexactdiv-5.c: New test.
>>  * gcc.dg/tree-ssa/mulexactdiv-6.c: Likewise.
>>  * gcc.dg/tree-ssa/mulexactdiv-7.c: Likewise.
>>  * gcc.dg/tree-ssa/mulexactdiv-8.c: Likewise.
>>  * gcc.target/aarch64/sve/cnt_fold_3.c: Likewise.
>> ---
>>  gcc/match.pd  | 38 +++-
>>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c | 29 +
>>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c | 59 +++
>>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c | 22 +++
>>  gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c | 20 +++
>>  .../gcc.target/aarch64/sve/cnt_fold_3.c   | 40 +
>>  6 files changed, 194 insertions(+), 14 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-5.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-6.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-7.c
>>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/mulexactdiv-8.c
>>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cnt_fold_3.c
>> 
>> diff --git a/gcc/match.pd b/gcc/match.pd
>> index 6677bc06d80..268316456c3 100644
>> --- a/gcc/match.pd
>> +++ b/gcc/match.pd
>> @@ -5493,24 +5493,34 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>  optab_vector)))
>> (eq (trunc_mod @0 @1) { build_zero_cst (TREE_TYPE (@0)); })))
>>  
>> -/* ((X /[ex] A) +- B) * A  -->  X +- A * B.  */
>> +/* ((X /[ex] C1) +- C2) * (C1 * C3)  -->  (X * C3) +- (C1 * C2 * C3).  */
>>  (for op (plus minus)
>>   (simplify
>> -  (mult (convert1? (op (convert2? (exact_div @0 INTEGER_CST@@1)) 
>> INTEGER_CST@2)) @1)
>> -  (if (tree_nop_conversion_p (type, TREE_TYPE (@2))
>> -   && tree_nop_conversion_p (TREE_TYPE (@0), TREE_TYPE (@2)))
>> -   (with
>> - {
>> -   wi::overflow_type overflow;
>> -   wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
>> -   TYPE_SIGN (type), &overflow);
>> - }
>> +  (mult (convert1? (op (convert2? (exact_div @0 INTEGER_CST@1))
>> +   poly_int_tree_p@2))
>> +poly_int_tree_p@3)
>> +  (with { poly_widest_int factor; }
>> +   (if (tree_nop_conversion_p (type, TREE_TYPE (@2))
>> +&& tree_nop_conversion_p (TREE_TYPE (@0), TREE_TYPE (@2))
>> +&& multiple_p (wi::to_poly_widest (@3), wi::to_widest (@1), &factor))
>> +(with
>> +  {
>> +wi::overflow_type overflow;
>> +wide_int mul;
>> +  }
>>   (if (types_match (type, TREE_TYPE (@2))
>> - && types_match (TREE_TYPE (@0), TREE_TYPE (@2)) && !overflow)
>> -  (op @0 { wide_int_to_tree (type, mul); })
>> +  && types_match (TREE_TYPE (@0), TREE_TYPE (@2))
>> +  && TREE_CODE (@2) == INTEGER_CST
>> +  && TREE_CODE (@3) == INTEGER_CST
>> +  && (mul = wi::mul (wi::to_wide (@2), wi::to_wide (@3),
>> + TYPE_SIGN (type), &overflow),
>> +  !overflow))
>> +  (op (mult @0 { wide_int_to_tree (type, factor); })
>> +  { wide_int_to_tree (type, mul); })
>>(with { tree utype = unsigned_type_for (type); }
>> -   (convert (op (convert:utype @0)
>> -(mult (convert:utype @1) (convert:utype @2))
>> +   (convert (op (mult (convert:utype @0)
>> +

Re: [PATCH 16/22] aarch64: libgcc: add GCS marking to asm

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> libgcc/ChangeLog:
>
>   * config/aarch64/aarch64-asm.h (FEATURE_1_GCS): Define.
>   (GCS_FLAG): Define if GCS is enabled.
>   (GNU_PROPERTY): Add GCS_FLAG.

This might be a daft question, but don't we also want to use the
new build attributes, where supported?  Or is that handled separately?

Same question for the other libraries.

Thanks,
Richard

> ---
>  libgcc/config/aarch64/aarch64-asm.h | 16 ++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/libgcc/config/aarch64/aarch64-asm.h 
> b/libgcc/config/aarch64/aarch64-asm.h
> index d8ab91d52f1..f7bd225f7a4 100644
> --- a/libgcc/config/aarch64/aarch64-asm.h
> +++ b/libgcc/config/aarch64/aarch64-asm.h
> @@ -22,6 +22,9 @@
> see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
> .  */
>  
> +#ifndef AARCH64_ASM_H
> +#define AARCH64_ASM_H
> +
>  #include "auto-target.h"
>  
>  #define L(label) .L ## label
> @@ -38,6 +41,7 @@
>  #define FEATURE_1_AND 0xc0000000
>  #define FEATURE_1_BTI 1
>  #define FEATURE_1_PAC 2
> +#define FEATURE_1_GCS 4
>  
>  /* Supported features based on the code generation options.  */
>  #if defined(__ARM_FEATURE_BTI_DEFAULT)
> @@ -58,6 +62,12 @@
>  # define AUTIASP
>  #endif
>  
> +#if __ARM_FEATURE_GCS_DEFAULT
> +# define GCS_FLAG FEATURE_1_GCS
> +#else
> +# define GCS_FLAG 0
> +#endif
> +
>  #ifdef __ELF__
>  #define HIDDEN(name) .hidden name
>  #define SYMBOL_SIZE(name) .size name, .-name
> @@ -88,8 +98,8 @@
>  .previous
>  
>  /* Add GNU property note if built with branch protection.  */
> -# if (BTI_FLAG|PAC_FLAG) != 0
> -GNU_PROPERTY (FEATURE_1_AND, BTI_FLAG|PAC_FLAG)
> +# if (BTI_FLAG|PAC_FLAG|GCS_FLAG) != 0
> +GNU_PROPERTY (FEATURE_1_AND, BTI_FLAG|PAC_FLAG|GCS_FLAG)
>  # endif
>  #endif
>  
> @@ -106,3 +116,5 @@ GNU_PROPERTY (FEATURE_1_AND, BTI_FLAG|PAC_FLAG)
>  #define END(name) \
>.cfi_endproc;  \
>SYMBOL_SIZE(name)
> +
> +#endif


Re: [PATCH 09/22] aarch64: Add GCS support for nonlocal stack save

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> Nonlocal stack save and restore has to also save and restore the GCS
> pointer. This is used in __builtin_setjmp/longjmp and nonlocal goto.
>
> The GCS specific code is only emitted if GCS branch-protection is
> enabled and the code always checks at runtime if GCS is enabled.
>
> The new -mbranch-protection=gcs and old -mbranch-protection=none code
> are ABI compatible: jmpbuf for __builtin_setjmp has space for 5
> pointers, the layout is
>
>   old layout: fp, pc, sp, unused, unused
>   new layout: fp, pc, sp, gcsp, unused
>
> Note: the ILP32 code generation is wrong as it saves the pointers with
> Pmode (i.e. 8 bytes per pointer), but the user supplied buffer size is
> for 5 pointers (4 bytes per pointer), this is not fixed.
>
> The nonlocal goto has no ABI compatibility issues as the goto and its
> destination are in the same translation unit.
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64.h (STACK_SAVEAREA_MODE): Make space for gcs.
>   * config/aarch64/aarch64.md (save_stack_nonlocal): New.
>   (restore_stack_nonlocal): New.
> ---
>  gcc/config/aarch64/aarch64.h  |  7 +++
>  gcc/config/aarch64/aarch64.md | 82 +++
>  2 files changed, 89 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index 593319fd472..43a92e85780 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -1297,6 +1297,13 @@ typedef struct
>  #define CTZ_DEFINED_VALUE_AT_ZERO(MODE, VALUE) \
>((VALUE) = GET_MODE_UNIT_BITSIZE (MODE), 2)
>  
> +/* Have space for both SP and GCSPR in the NONLOCAL case in
> +   emit_stack_save as well as in __builtin_setjmp, __builtin_longjmp
> +   and __builtin_nonlocal_goto.
> +   Note: On ILP32 the documented buf size is not enough PR84150.  */
> +#define STACK_SAVEAREA_MODE(LEVEL)   \
> +  ((LEVEL) == SAVE_NONLOCAL ? TImode : Pmode)

It might be better to use CDImode, so that we don't claim 16-byte alignment
for -mstrict-align.
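
I.e. (only the mode changes in the quoted macro):

  #define STACK_SAVEAREA_MODE(LEVEL) \
    ((LEVEL) == SAVE_NONLOCAL ? CDImode : Pmode)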

> +
>  #define INCOMING_RETURN_ADDR_RTX gen_rtx_REG (Pmode, LR_REGNUM)
>  
>  #define RETURN_ADDR_RTX aarch64_return_addr
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index e4e11e35b5b..6e1646387d8 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -1200,6 +1200,88 @@ (define_insn "*cb1"
> (const_int 1)))]
>  )
>  
> +(define_expand "save_stack_nonlocal"
> +  [(set (match_operand 0 "memory_operand")
> +(match_operand 1 "register_operand"))]
> +  ""
> +{
> +  rtx stack_slot = adjust_address (operands[0], Pmode, 0);
> +  emit_move_insn (stack_slot, operands[1]);
> +
> +  if (aarch64_gcs_enabled ())
> +{
> +  /* Save GCS with code like
> + mov x16, 1
> + chkfeat x16
> + tbnz x16, 0, .L_done
> + mrs tmp, gcspr_el0
> + str tmp, [%0, 8]
> + .L_done:  */
> +
> +  rtx done_label = gen_label_rtx ();
> +  rtx r16 = gen_rtx_REG (DImode, R16_REGNUM);
> +  emit_move_insn (r16, const1_rtx);
> +  emit_insn (gen_aarch64_chkfeat ());
> +  emit_insn (gen_tbranch_neqi3 (r16, const0_rtx, done_label));
> +  rtx gcs_slot = adjust_address (operands[0], Pmode, GET_MODE_SIZE 
> (Pmode));
> +  rtx gcs = force_reg (Pmode, const0_rtx);

The code seems to use force_reg (Pmode, const0_rtx) to get a fresh
register, but that should be done using gen_reg_rtx (Pmode) instead.
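
I.e., keeping the rest of the sequence as posted:

      rtx gcs = gen_reg_rtx (Pmode);
      emit_insn (gen_aarch64_load_gcspr (gcs));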

Looks good otherwise.  In particular, it avoids one mistake I made
in the past, in that it uses the generic optabs to generate branches,
and so should work with -mtrack-speculation.  (It would be good to have
a test of nonlocal goto and -mtrack-speculation though, if the later
patches don't have one already.)

Thanks,
Richard

> +  emit_insn (gen_aarch64_load_gcspr (gcs));
> +  emit_move_insn (gcs_slot, gcs);
> +  emit_label (done_label);
> +}
> +  DONE;
> +})
> +
> +(define_expand "restore_stack_nonlocal"
> +  [(set (match_operand 0 "register_operand" "")
> + (match_operand 1 "memory_operand" ""))]
> +  ""
> +{
> +  rtx stack_slot = adjust_address (operands[1], Pmode, 0);
> +  emit_move_insn (operands[0], stack_slot);
> +
> +  if (aarch64_gcs_enabled ())
> +{
> +  /* Restore GCS with code like
> + mov x16, 1
> + chkfeat x16
> + tbnz x16, 0, .L_done
> + ldr tmp1, [%1, 8]
> + mrs tmp2, gcspr_el0
> + subs tmp2, tmp1, tmp2
> + b.eq .L_done
> + .L_loop:
> + gcspopm
> + subs tmp2, tmp2, 8
> + b.ne .L_loop
> + .L_done:  */
> +
> +  rtx loop_label = gen_label_rtx ();
> +  rtx done_label = gen_label_rtx ();
> +  rtx r16 = gen_rtx_REG (DImode, R16_REGNUM);
> +  emit_move_insn (r16, const1_rtx);
> +  emit_insn (gen_aarch64_chkfeat ());
> +  emit_insn

Re: [PATCH] SVE intrinsics: Fold svsra with op1 all zeros to svlsr/svasr.

2024-10-24 Thread Richard Sandiford
Jennifer Schmitz  writes:
>> On 22 Oct 2024, at 18:21, Richard Sandiford  
>> wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>> Jennifer Schmitz  writes:
>>> A common idiom in intrinsics loops is to have accumulator intrinsics
>>> in an unrolled loop with an accumulator initialized to zero at the 
>>> beginning.
>>> Propagating the initial zero accumulator into the first iteration
>>> of the loop and simplifying the first accumulate instruction is a
>>> desirable transformation that we should teach GCC.
>>> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
>>> producing the lower latency instructions LSR/ASR instead of USRA/SSRA.
>>> We implemented this optimization in svsra_impl::fold.
>>> Because svlsr/svasr are predicated intrinsics, we added a ptrue
>>> predicate. Additionally, the width of the shift amount (imm3) was
>>> adjusted to fit the function type.
>>> In order to create the ptrue predicate, a new helper function
>>> build_ptrue was added. We also refactored gimple_folder::fold_to_ptrue
>>> to use the new helper function.
>>> 
>>> Tests were added to check the produced assembly for use of LSR/ASR.
>>> 
>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>> regression.
>>> OK for mainline?
>>> 
>>> Signed-off-by: Jennifer Schmitz 
>>> 
>>> gcc/
>>>  * config/aarch64/aarch64-sve-builtins-sve2.cc
>>>  (svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>>>  * config/aarch64/aarch64-sve-builtins.cc (build_ptrue): New
>>>  function that returns a ptrue tree.
>>>  (gimple_folder::fold_to_ptrue): Refactor to use build_ptrue.
>>>  * config/aarch64/aarch64-sve-builtins.h: Declare build_ptrue.
>>> 
>>> gcc/testsuite/
>>>  * gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
>>>  * gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
>>>  * gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
>>>  * gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
>>> ---
>>> .../aarch64/aarch64-sve-builtins-sve2.cc  | 29 +++
>>> gcc/config/aarch64/aarch64-sve-builtins.cc| 28 +++---
>>> gcc/config/aarch64/aarch64-sve-builtins.h |  1 +
>>> .../aarch64/sve2/acle/asm/sra_s32.c   |  9 ++
>>> .../aarch64/sve2/acle/asm/sra_s64.c   |  9 ++
>>> .../aarch64/sve2/acle/asm/sra_u32.c   |  9 ++
>>> .../aarch64/sve2/acle/asm/sra_u64.c   |  9 ++
>>> 7 files changed, 83 insertions(+), 11 deletions(-)
>>> 
>>> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc 
>>> b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> index 6a20a613f83..0990918cc45 100644
>>> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
>>> @@ -417,6 +417,35 @@ public:
>>> 
>>> class svsra_impl : public function_base
>>> {
>>> +public:
>>> +  gimple *
>>> +  fold (gimple_folder &f) const override
>>> +  {
>>> +/* Fold to svlsr/svasr if op1 is all zeros.  */
>>> +tree op1 = gimple_call_arg (f.call, 0);
>>> +if (!integer_zerop (op1))
>>> +  return NULL;
>>> +function_instance instance ("svlsr", functions::svlsr,
>>> + shapes::binary_uint_opt_n, MODE_n,
>>> + f.type_suffix_ids, GROUP_none, PRED_x);
>>> +if (!f.type_suffix (0).unsigned_p)
>>> +  {
>>> + instance.base_name = "svasr";
>>> + instance.base = functions::svasr;
>>> +  }
>>> +gcall *call = f.redirect_call (instance);
>>> +unsigned int element_bytes = f.type_suffix (0).element_bytes;
>>> +/* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
>>> +   predicated intrinsics.  */
>>> +gimple_call_set_arg (call, 0, build_ptrue (element_bytes));
>> 
>> Maybe it would be simpler to use build_all_ones_cst (f.gp_type ()).
>> Unlike for fold_to_ptrue (which produces output predicates),
>> we don't need the upper bits of each predicate element to be zero.
>> 
>>> +/* For svsra, the shift amount (imm3) is uint64_t for all function 
>>> types,
>>> +   but for svlsr/svasr, imm3 has the sam

Re: [PATCH 04/22] aarch64: Add __builtin_aarch64_chkfeat

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> Builtin for chkfeat: the input argument is used to initialize x16 then
> execute chkfeat and return the updated x16.
>
> Note: ACLE __chkfeat(x) plans to flip the bits to be more intuitive
> (xor the input to output), but for the builtin that seems unnecessary
> complication.

Sounds good.

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (enum aarch64_builtins):
>   Define AARCH64_BUILTIN_CHKFEAT.
>   (aarch64_general_init_builtins): Handle chkfeat.
>   (aarch64_general_expand_builtin): Handle chkfeat.
> ---
>  gcc/config/aarch64/aarch64-builtins.cc | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 7d737877e0b..765f2091504 100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -875,6 +875,8 @@ enum aarch64_builtins
>AARCH64_PLDX,
>AARCH64_PLI,
>AARCH64_PLIX,
> +  /* Armv8.9-A / Armv9.4-A builtins.  */
> +  AARCH64_BUILTIN_CHKFEAT,
>AARCH64_BUILTIN_MAX
>  };
>  
> @@ -2280,6 +2282,12 @@ aarch64_general_init_builtins (void)
>if (!TARGET_ILP32)
>  aarch64_init_pauth_hint_builtins ();
>  
> +  tree ftype_chkfeat
> += build_function_type_list (uint64_type_node, uint64_type_node, NULL);
> +  aarch64_builtin_decls[AARCH64_BUILTIN_CHKFEAT]
> += aarch64_general_add_builtin ("__builtin_aarch64_chkfeat", 
> ftype_chkfeat,
> +AARCH64_BUILTIN_CHKFEAT);
> +
>if (in_lto_p)
>  handle_arm_acle_h ();
>  }
> @@ -3484,6 +3492,16 @@ aarch64_general_expand_builtin (unsigned int fcode, 
> tree exp, rtx target,
>  case AARCH64_PLIX:
>aarch64_expand_prefetch_builtin (exp, fcode);
>return target;
> +
> +case AARCH64_BUILTIN_CHKFEAT:
> +  {
> + rtx x16_reg = gen_rtx_REG (DImode, R16_REGNUM);
> + op0 = expand_normal (CALL_EXPR_ARG (exp, 0));
> + emit_move_insn (x16_reg, op0);
> + expand_insn (CODE_FOR_aarch64_chkfeat, 0, 0);
> + emit_move_insn (target, x16_reg);
> + return target;

target isn't required to be nonnull, so this would be safer as:

  return copy_to_reg (x16_reg);

(I don't think it's worth complicating things by trying to reuse target,
since this code isn't going to be performance/memory critical.)

Looks good otherwise.

Thanks,
Richard

> +  }
>  }
>  
>if (fcode >= AARCH64_SIMD_BUILTIN_BASE && fcode <= 
> AARCH64_SIMD_BUILTIN_MAX)


Re: [PATCH 05/22] aarch64: Add __builtin_aarch64_chkfeat tests

2024-10-24 Thread Richard Sandiford
Yury Khrustalev  writes:
> From: Szabolcs Nagy 
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/chkfeat-1.c: New test.
>   * gcc.target/aarch64/chkfeat-2.c: New test.
> ---
>  gcc/testsuite/gcc.target/aarch64/chkfeat-1.c | 75 
>  gcc/testsuite/gcc.target/aarch64/chkfeat-2.c | 15 
>  2 files changed, 90 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/chkfeat-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/chkfeat-2.c
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/chkfeat-1.c 
> b/gcc/testsuite/gcc.target/aarch64/chkfeat-1.c
> new file mode 100644
> index 000..2fae81e740f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/chkfeat-1.c
> @@ -0,0 +1,75 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mbranch-protection=none" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +/*
> +**foo1:
> +**   mov x16, 1
> +**   hint 40 // chkfeat x16
> +**   mov x0, x16
> +**   ret
> +*/
> +unsigned long long
> +foo1 (void)
> +{
> +  return __builtin_aarch64_chkfeat (1);
> +}
> +
> +/*
> +**foo2:
> +**   mov x16, 1
> +**   movk x16, 0x5678, lsl 32
> +**   movk x16, 0x1234, lsl 48
> +**   hint 40 // chkfeat x16
> +**   mov x0, x16
> +**   ret
> +*/
> +unsigned long long
> +foo2 (void)
> +{
> +  return __builtin_aarch64_chkfeat (0x1234567800000001);
> +}
> +
> +/*
> +**foo3:
> +**   mov x16, x0
> +**   hint 40 // chkfeat x16
> +**   mov x0, x16
> +**   ret
> +*/
> +unsigned long long
> +foo3 (unsigned long long x)
> +{
> +  return __builtin_aarch64_chkfeat (x);
> +}
> +
> +/*
> +**foo4:
> +**   ldr x16, \[x0\]
> +**   hint 40 // chkfeat x16
> +**   str x16, \[x0\]
> +**   ret
> +*/
> +void
> +foo4 (unsigned long long *p)
> +{
> +  *p = __builtin_aarch64_chkfeat (*p);
> +}
> +
> +/*
> +**foo5:
> +**   mov x16, 1
> +**   hint 40 // chkfeat x16
> +**   cmp x16, 0
> +**(
> +**   csel w0, w1, w0, eq
> +**|
> +**   csel w0, w0, w1, ne
> +**)
> +**   ret
> +*/
> +int
> +foo5 (int x, int y)
> +{
> +  return __builtin_aarch64_chkfeat (1) ? x : y;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/chkfeat-2.c 
> b/gcc/testsuite/gcc.target/aarch64/chkfeat-2.c
> new file mode 100644
> index 000..682524e244f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/chkfeat-2.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +/* { dg-final { scan-assembler-times {hint\t40 // chkfeat x16} 2 } } */
> +
> +void bar (void);
> +
> +/* Extern call may change enabled HW features.  */
> +unsigned long long
> +foo (void)
> +{
> +  unsigned long long a = __builtin_aarch64_chkfeat (1);
> +  bar ();
> +  unsigned long long b = __builtin_aarch64_chkfeat (1);
> +  return a + b;
> +}

This doesn't in itself check that the chkfeats are correctly ordered
wrt the call.  It might be better to use a check-function-bodies test:

/*
** foo:
**  ...
**  hint 40 // chkfeat x16
**  ...
**  bl  bar
**  ...
**  hint 40 // chkfeat x16
**  ...
*/

Looks good otherwise.

Thanks,
Richard


Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-24 Thread Richard Sandiford
Kyrylo Tkachov  writes:
>> On 24 Oct 2024, at 10:39, Soumya AR  wrote:
>> 
>> Hi Richard,
>> 
>> > On 23 Oct 2024, at 5:58 PM, Richard Sandiford  
>> > wrote:
>> > 
>> > External email: Use caution opening links or attachments
>> > 
>> > 
>> > Soumya AR  writes:
>> >> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
>> >> b/gcc/config/aarch64/aarch64-sve-builtins.cc
>> >> index 41673745cfe..aa556859d2e 100644
>> >> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
>> >> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
>> >> @@ -1143,11 +1143,14 @@ aarch64_const_binop (enum tree_code code, tree 
>> >> arg1, tree arg2)
>> >>   tree type = TREE_TYPE (arg1);
>> >>   signop sign = TYPE_SIGN (type);
>> >>   wi::overflow_type overflow = wi::OVF_NONE;
>> >> -
>> >> +  unsigned int element_bytes = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>> >>   /* Return 0 for division by 0, like SDIV and UDIV do.  */
>> >>   if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
>> >>  return arg2;
>> >> -
>> >> +  /* Return 0 if shift amount is out of range. */
>> >> +  if (code == LSHIFT_EXPR
>> >> +   && tree_to_uhwi (arg2) >= (element_bytes * BITS_PER_UNIT))
>> > 
>> > tree_to_uhwi is dangerous because a general shift might be negative
>> > (even if these particular shift amounts are unsigned).  We should
>> > probably also key off TYPE_PRECISION rather than TYPE_SIZE_UNIT.  So:
>> > 
>> >if (code == LSHIFT_EXPR
>> >&& wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))
>> > 
>> > without the element_bytes variable.  Also: the indentation looks a bit off;
>> > it should be tabs only followed by spaces only.
>> 
>> Thanks for the feedback, posting an updated patch with the suggested changes.
>
> Thanks Soumya, I’ve pushed this patch to trunk as commit 3e7549ece7c after 
> adjusting
> the ChangeLog slightly to start the lines with tabs instead of spaces.

Sorry Soumya, I forgot that you didn't have commit access yet.
It's time you did though.  Could you follow the instructions
on https://gcc.gnu.org/gitwrite.html ?  I'm happy to sponsor
(and I'm sure Kyrill would be too).

Thanks,
Richard


Re: [PATCH v2 3/4] aarch64: improve assembly debug comments for AEABI build attributes

2024-10-23 Thread Richard Sandiford
Matthieu Longo  writes:
> The previous implementation to emit AEABI build attributes did not
> support string values (asciz) in aeabi_subsection, and was not
> emitting values associated to tags in the assembly comments.
>
> This new approach provides a more user-friendly interface relying on
> typing, and improves the emitted assembly comments:
>   * aeabi_attribute:
> ** Adds the interpreted value next to the tag in the assembly
> comment.
> ** Supports asciz values.
>   * aeabi_subsection:
> ** Adds debug information for its parameters.
> ** Auto-detects the attribute types when declaring the subsection.
>
> Additionally, it is also interesting to note that the code was moved
> to a separate file to improve modularity and "releave" the 1000-lines

I think you dropped a 0.  I wish it was only 1000 :-)

> long aarch64.cc file from a few lines. Finally, it introduces a new
> namespace "aarch64::" for AArch64 backend which reduce the length of
> function names by not prepending 'aarch64_' to each of them.
> [...]
> diff --git a/gcc/config/aarch64/aarch64-dwarf-metadata.h 
> b/gcc/config/aarch64/aarch64-dwarf-metadata.h
> new file mode 100644
> index 000..01f08ad073e
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-dwarf-metadata.h
> @@ -0,0 +1,226 @@
> +/* DWARF metadata for AArch64 architecture.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +   Contributed by ARM Ltd.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   .  */
> +
> +#ifndef GCC_AARCH64_DWARF_METADATA_H
> +#define GCC_AARCH64_DWARF_METADATA_H
> +
> +#include "system.h"

We should drop this line.  It's the .cc file's responsibility to
include system.h.

> +#include "vec.h"
> +
> +namespace aarch64 {
> +
> +enum attr_val_type : uint8_t
> +{
> +  uleb128 = 0x0,
> +  asciz = 0x1,
> +};
> +
> +enum BA_TagFeature_t : uint8_t
> +{
> +  Tag_Feature_BTI = 1,
> +  Tag_Feature_PAC = 2,
> +  Tag_Feature_GCS = 3,
> +};
> +
> +template <typename T_tag, typename T_val>
> +struct aeabi_attribute
> +{
> +  T_tag tag;
> +  T_val value;
> +};
> +
> +template <typename T_tag, typename T_val>
> +aeabi_attribute<T_tag, T_val>
> +make_aeabi_attribute (T_tag tag, T_val val)
> +{
> +  return aeabi_attribute{tag, val};
> +}
> +
> +namespace details {
> +
> +constexpr const char *
> +to_c_str (bool b)
> +{
> +  return b ? "true" : "false";
> +}
> +
> +constexpr const char *
> +to_c_str (const char *s)
> +{
> +  return s;
> +}
> +
> +constexpr const char *
> +to_c_str (attr_val_type t)
> +{
> +  return (t == uleb128 ? "ULEB128"
> +   : t == asciz ? "asciz"
> +   : nullptr);
> +}
> +
> +constexpr const char *
> +to_c_str (BA_TagFeature_t feature)
> +{
> +  return (feature == Tag_Feature_BTI ? "Tag_Feature_BTI"
> +   : feature == Tag_Feature_PAC ? "Tag_Feature_PAC"
> +   : feature == Tag_Feature_GCS ? "Tag_Feature_GCS"
> +   : nullptr);
> +}
> +
> +template <
> +  typename T,
> +  typename = typename std::enable_if::value, T>::type
> +>
> +constexpr const char *
> +aeabi_attr_str_fmt (T phantom __attribute__((unused)))

FWIW, it would be ok to drop the parameter name and the attribute.
But it's ok as-is too, if you think it makes the intention clearer.

> +{
> +  return "\t.aeabi_attribute %u, %u";
> +}
> +
> +constexpr const char *
> +aeabi_attr_str_fmt (const char *phantom __attribute__((unused)))
> +{
> +  return "\t.aeabi_attribute %u, \"%s\"";
> +}
> [...]
> @@ -24834,17 +24808,21 @@ aarch64_start_file (void)
> asm_fprintf (asm_out_file, "\t.arch %s\n",
>   aarch64_last_printed_arch_string.c_str ());
>  
> -  /* Check whether the current assembly supports gcs build attributes, if not
> - fallback to .note.gnu.property section.  */
> +  /* Check whether the current assembler supports AEABI build attributes, if
> + not fallback to .note.gnu.property section.  */
>  #if (HAVE_AS_AEABI_BUILD_ATTRIBUTES)

Just to note that, as with patch 2, I hope this could be:

  if (HAVE_AS_AEABI_BUILD_ATTRIBUTES)
{
  ...
}

instead.

OK with those changes, thanks.

Richard


Re: [PATCH v2 2/2] aarch64: Add mfloat vreinterpret intrinsics

2024-10-23 Thread Richard Sandiford
Andrew Carlotti  writes:
> This patch splits out some of the qualifier handling from the v1 patch, and
> adjusts the VREINTERPRET* macros to include support for mf8 intrinsics.
>
> Bootstrapped and regression tested on aarch64; ok for master?
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc (MODE_d_mf8): New.
>   (MODE_q_mf8): New.
>   (QUAL_mf8): New.
>   (VREINTERPRET_BUILTINS1): Add mf8 entry.
>   (VREINTERPRET_BUILTINS): Ditto.
>   (VREINTERPRETQ_BUILTINS1): Ditto.
>   (VREINTERPRETQ_BUILTINS): Ditto.
>   (aarch64_lookup_simd_type_in_table): Match modal_float bit
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/advsimd-intrinsics/mf8-reinterpret.c: New test.

OK, thanks.

Richard

> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 
> 432131c3b2d7cf4f788b79ce3d84c9e7554dc750..31231c9e66ee8307cb86e181fc51ea2622c5f82c
>  100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -133,6 +133,7 @@
>  #define MODE_d_f16 E_V4HFmode
>  #define MODE_d_f32 E_V2SFmode
>  #define MODE_d_f64 E_V1DFmode
> +#define MODE_d_mf8 E_V8QImode
>  #define MODE_d_s8 E_V8QImode
>  #define MODE_d_s16 E_V4HImode
>  #define MODE_d_s32 E_V2SImode
> @@ -148,6 +149,7 @@
>  #define MODE_q_f16 E_V8HFmode
>  #define MODE_q_f32 E_V4SFmode
>  #define MODE_q_f64 E_V2DFmode
> +#define MODE_q_mf8 E_V16QImode
>  #define MODE_q_s8 E_V16QImode
>  #define MODE_q_s16 E_V8HImode
>  #define MODE_q_s32 E_V4SImode
> @@ -177,6 +179,7 @@
>  #define QUAL_p16 qualifier_poly
>  #define QUAL_p64 qualifier_poly
>  #define QUAL_p128 qualifier_poly
> +#define QUAL_mf8 qualifier_modal_float
>  
>  #define LENGTH_d ""
>  #define LENGTH_q "q"
> @@ -598,6 +601,7 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>  /* vreinterpret intrinsics are defined for any pair of element types.
> { _bf16   }   { _bf16   }
> {  _f16 _f32 _f64 }   {  _f16 _f32 _f64 }
> +   { _mf8}   { _mf8}
> { _s8  _s16 _s32 _s64 } x { _s8  _s16 _s32 _s64 }
> { _u8  _u16 _u32 _u64 }   { _u8  _u16 _u32 _u64 }
> { _p8  _p16  _p64 }   { _p8  _p16  _p64 }.  */
> @@ -609,6 +613,7 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VREINTERPRET_BUILTIN2 (A, f16) \
>VREINTERPRET_BUILTIN2 (A, f32) \
>VREINTERPRET_BUILTIN2 (A, f64) \
> +  VREINTERPRET_BUILTIN2 (A, mf8) \
>VREINTERPRET_BUILTIN2 (A, s8) \
>VREINTERPRET_BUILTIN2 (A, s16) \
>VREINTERPRET_BUILTIN2 (A, s32) \
> @@ -626,6 +631,7 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VREINTERPRET_BUILTINS1 (f16) \
>VREINTERPRET_BUILTINS1 (f32) \
>VREINTERPRET_BUILTINS1 (f64) \
> +  VREINTERPRET_BUILTINS1 (mf8) \
>VREINTERPRET_BUILTINS1 (s8) \
>VREINTERPRET_BUILTINS1 (s16) \
>VREINTERPRET_BUILTINS1 (s32) \
> @@ -641,6 +647,7 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>  /* vreinterpretq intrinsics are additionally defined for p128.
> { _bf16 }   { _bf16 }
> {  _f16 _f32 _f64   }   {  _f16 _f32 _f64   }
> +   { _mf8  }   { _mf8  }
> { _s8  _s16 _s32 _s64   } x { _s8  _s16 _s32 _s64   }
> { _u8  _u16 _u32 _u64   }   { _u8  _u16 _u32 _u64   }
> { _p8  _p16  _p64 _p128 }   { _p8  _p16  _p64 _p128 }.  */
> @@ -652,6 +659,7 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VREINTERPRETQ_BUILTIN2 (A, f16) \
>VREINTERPRETQ_BUILTIN2 (A, f32) \
>VREINTERPRETQ_BUILTIN2 (A, f64) \
> +  VREINTERPRETQ_BUILTIN2 (A, mf8) \
>VREINTERPRETQ_BUILTIN2 (A, s8) \
>VREINTERPRETQ_BUILTIN2 (A, s16) \
>VREINTERPRETQ_BUILTIN2 (A, s32) \
> @@ -670,6 +678,7 @@ static aarch64_simd_builtin_datum 
> aarch64_simd_builtin_data[] = {
>VREINTERPRETQ_BUILTINS1 (f16) \
>VREINTERPRETQ_BUILTINS1 (f32) \
>VREINTERPRETQ_BUILTINS1 (f64) \
> +  VREINTERPRETQ_BUILTINS1 (mf8) \
>VREINTERPRETQ_BUILTINS1 (s8) \
>VREINTERPRETQ_BUILTINS1 (s16) \
>VREINTERPRETQ_BUILTINS1 (s32) \
> @@ -1117,7 +1126,8 @@ aarch64_lookup_simd_type_in_table (machine_mode mode,
>  {
>int i;
>int nelts = ARRAY_SIZE (aarch64_simd_types);
> -  int q = qualifiers & (qualifier_poly | qualifier_unsigned);
> +  int q = qualifiers
> +& (qualifier_poly | qualifier_unsigned | qualifier_modal_float);
>  
>for (i = 0; i < nelts; i++)
>  {
> diff --git 
> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/mf8-reinterpret.c 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/mf8-reinterpret.c
> new file mode 100644
> index 
> ..5e5921746036bbfbf20d2a77697760efd1f71cc2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/mf8-reinterpret.c
> @@ -0,0 +1,46

Re: [PATCH v2 1/2] aarch64: Add support for mfloat8x{8|16}_t types

2024-10-23 Thread Richard Sandiford
Andrew Carlotti  writes:
> Compared to v1, I've split changes that aren't used for the type definitions
> into a separate patch.  I've also added some tests, mostly along the lines
> suggested by Richard S.
>
> Bootstrapped and regression tested on aarch64; ok for master?
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc
>   (aarch64_init_simd_builtin_types): Initialise FP8 simd types.
>   * config/aarch64/aarch64-builtins.h
>   (enum aarch64_type_qualifiers): Add qualifier_modal_float bit.
>   * config/aarch64/aarch64-simd-builtin-types.def:
>   Add Mfloat8x{8|16}_t types.
>   * config/aarch64/arm_neon.h: Add mfloat8x{8|16}_t typedefs.
>
> gcc/testsuite/ChangeLog:
>
>   * gcc.target/aarch64/movv16qi_2.c: Test mfloat as well.
>   * gcc.target/aarch64/movv16qi_3.c: Ditto.
>   * gcc.target/aarch64/movv2x16qi_1.c: Ditto.
>   * gcc.target/aarch64/movv3x16qi_1.c: Ditto.
>   * gcc.target/aarch64/movv4x16qi_1.c: Ditto.
>   * gcc.target/aarch64/movv8qi_2.c: Ditto.
>   * gcc.target/aarch64/movv8qi_3.c: Ditto.
>   * gcc.target/aarch64/mfloat-init-1.c: New test.

OK, thanks.

Richard

> diff --git a/gcc/config/aarch64/aarch64-builtins.h 
> b/gcc/config/aarch64/aarch64-builtins.h
> index 
> e326fe666769cedd6c06d0752ed30b9359745ac9..00db7a74885db4d97ed365e8e3e2d7cf7d8410a4
>  100644
> --- a/gcc/config/aarch64/aarch64-builtins.h
> +++ b/gcc/config/aarch64/aarch64-builtins.h
> @@ -54,6 +54,8 @@ enum aarch64_type_qualifiers
>/* Lane indices selected in quadtuplets. - must be in range, and flipped 
> for
>   bigendian.  */
>qualifier_lane_quadtup_index = 0x1000,
> +  /* Modal FP types.  */
> +  qualifier_modal_float = 0x2000,
>  };
>  
>  #define ENTRY(E, M, Q, G) E,
> diff --git a/gcc/config/aarch64/aarch64-builtins.cc 
> b/gcc/config/aarch64/aarch64-builtins.cc
> index 
> 7d737877e0bf6c1f9eb53351a6085b0db16a04d6..432131c3b2d7cf4f788b79ce3d84c9e7554dc750
>  100644
> --- a/gcc/config/aarch64/aarch64-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-builtins.cc
> @@ -1220,6 +1220,10 @@ aarch64_init_simd_builtin_types (void)
>aarch64_simd_types[Bfloat16x4_t].eltype = bfloat16_type_node;
>aarch64_simd_types[Bfloat16x8_t].eltype = bfloat16_type_node;
>  
> +  /* Init FP8 element types.  */
> +  aarch64_simd_types[Mfloat8x8_t].eltype = aarch64_mfp8_type_node;
> +  aarch64_simd_types[Mfloat8x16_t].eltype = aarch64_mfp8_type_node;
> +
>for (i = 0; i < nelts; i++)
>  {
>tree eltype = aarch64_simd_types[i].eltype;
> diff --git a/gcc/config/aarch64/aarch64-simd-builtin-types.def 
> b/gcc/config/aarch64/aarch64-simd-builtin-types.def
> index 
> 6111cd0d4fe1136feabb36a4077cf86d13b835e2..83b2da2e7dc0962c1e5957e25c8f6232c2148fe5
>  100644
> --- a/gcc/config/aarch64/aarch64-simd-builtin-types.def
> +++ b/gcc/config/aarch64/aarch64-simd-builtin-types.def
> @@ -52,3 +52,5 @@
>ENTRY (Float64x2_t, V2DF, none, 13)
>ENTRY (Bfloat16x4_t, V4BF, none, 14)
>ENTRY (Bfloat16x8_t, V8BF, none, 14)
> +  ENTRY (Mfloat8x8_t, V8QI, modal_float, 13)
> +  ENTRY (Mfloat8x16_t, V16QI, modal_float, 14)
> diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
> index 
> e376685489da055029def6b661132b5154886b57..730d9d3fa8158ef2d1d13c0f629e306e774145a0
>  100644
> --- a/gcc/config/aarch64/arm_neon.h
> +++ b/gcc/config/aarch64/arm_neon.h
> @@ -72,6 +72,9 @@ typedef __Poly16_t poly16_t;
>  typedef __Poly64_t poly64_t;
>  typedef __Poly128_t poly128_t;
>  
> +typedef __Mfloat8x8_t mfloat8x8_t;
> +typedef __Mfloat8x16_t mfloat8x16_t;
> +
>  typedef __fp16 float16_t;
>  typedef float float32_t;
>  typedef double float64_t;
> diff --git a/gcc/testsuite/gcc.target/aarch64/mfloat-init-1.c 
> b/gcc/testsuite/gcc.target/aarch64/mfloat-init-1.c
> new file mode 100644
> index 
> ..15a6b331fd3986476950e799d11bdef710193f1d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/mfloat-init-1.c
> @@ -0,0 +1,5 @@
> +/* { dg-do assemble } */
> +/* { dg-options "-O --save-temps" } */
> +
> +/* { dg-error "invalid conversion to type 'mfloat8_t" "" {target *-*-*} 0 } 
> */
> +__Mfloat8x8_t const_mf8x8 () { return (__Mfloat8x8_t) { 1, 1, 1, 1, 1, 1, 1, 
> 1 }; }
> diff --git a/gcc/testsuite/gcc.target/aarch64/movv16qi_2.c 
> b/gcc/testsuite/gcc.target/aarch64/movv16qi_2.c
> index 
> 08a0a19b515134742fcb121e8cf6a19600f86075..39a06db0707538996fb5a3990ef53589d0210b17
>  100644
> --- a/gcc/testsuite/gcc.target/aarch64/movv16qi_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/movv16qi_2.c
> @@ -17,6 +17,7 @@ TEST_GENERAL (__Bfloat16x8_t)
>  TEST_GENERAL (__Float16x8_t)
>  TEST_GENERAL (__Float32x4_t)
>  TEST_GENERAL (__Float64x2_t)
> +TEST_GENERAL (__Mfloat8x16_t)
>  
>  __Int8x16_t const_s8x8 () { return (__Int8x16_t) { 1, 1, 1, 1, 1, 1, 1, 1, 
> 1, 1, 1, 1, 1, 1, 1, 1 }; }
>  __Int16x8_t const_s16x4 () { return (__Int16x8_t) { 1, 0, 1, 0, 1, 0, 1, 0 
> }; }
> diff --git a/gcc/testsuite/gcc.target/aarch64/movv16qi_3

Re: [PATCH v2 0/4] aarch64: add minimal support of AEABI build attributes for GCS

2024-10-23 Thread Richard Sandiford
Matthieu Longo  writes:
> The primary focus of this patch series is to add support for build attributes 
> in the context of GCS (Guarded Control Stack, an Armv9.4-a extension) to the 
> AArch64 backend.
> It addresses comments from revision 1 [2] and 2 [3], and proposes a different 
> approach compared to the previous implementation of the build attributes.
>
> The series is composed of the following 4 patches:
> 1. Patch adding assembly debug comments (-dA) to the existing GNU properties, 
> to improve testing and check the correctness of values.
> 2. The minimal patch adding support for build attributes in the context of 
> GCS.
> 3. A refactoring of (2) to make things less error-prone and more modular, add 
> support for asciz attributes and more debug information.
> 4. A refactoring of (1) relying partly on (3).
> The targeted final state of this series would consist in squashing (2) + (3), 
> and (1) + (4).
>
> **Special note regarding (2):** If Gas has support for build attributes, both 
> build attributes and GNU properties will be emitted. This behavior is still 
> open for discussion. Please, let me know your thoughts regarding this 
> behavior.

I don't have a strong opinion.  But emitting both seems like the safe
and conservatively correct behaviour, so I think the onus would be
on anyone who wants to drop the old information to make the case
for doing that.

> This patch series needs to be applied on top of the patch series for GCS [1].
>
> Bootstrapped on aarch64-none-linux-gnu, and no regression found.
>
> [1]: 
> https://gcc.gnu.org/git/?p=gcc.git;a=shortlog;h=refs/vendors/ARM/heads/gcs
> [2]: https://gcc.gnu.org/pipermail/gcc-patches/2024-September/662825.html
> [3]: https://gcc.gnu.org/pipermail/gcc-patches/2024-September/664004.html
>
> Regards,
> Matthieu
>
> Diff with revision 1 [2]:
> - update the description of (2)
> - address the comments related to the tests in (2)
> - add new commits (1), (3) and (4)
>
> Diff with revision 2 [3]:
> - address comments of Richard Sandiford in revision 2.
> - fix several formatting mistakes.
> - remove RFC tag.
>
>
> Matthieu Longo (3):
>   aarch64: add debug comments to feature properties in .note.gnu.property
>   aarch64: improve assembly debug comments for AEABI build attributes
>   aarch64: encapsulate note.gnu.property emission into a class
>
> Srinath Parvathaneni (1):
>   aarch64: add minimal support of AEABI build attributes for GCS.

Looks good, thanks.  OK for trunk with the suggested changes for
patches 2 and 3.

Richard

>  gcc/config.gcc|   2 +-
>  gcc/config.in |   6 +
>  gcc/config/aarch64/aarch64-dwarf-metadata.cc  | 145 +++
>  gcc/config/aarch64/aarch64-dwarf-metadata.h   | 245 ++
>  gcc/config/aarch64/aarch64.cc |  69 ++---
>  gcc/config/aarch64/t-aarch64  |  10 +
>  gcc/configure |  38 +++
>  gcc/configure.ac  |  10 +
>  gcc/testsuite/gcc.target/aarch64/bti-1.c  |  13 +-
>  .../aarch64-build-attributes.exp  |  35 +++
>  .../build-attributes/build-attribute-gcs.c|  12 +
>  .../build-attribute-standard.c|  12 +
>  .../build-attributes/no-build-attribute-bti.c |  12 +
>  .../build-attributes/no-build-attribute-gcs.c |  12 +
>  .../build-attributes/no-build-attribute-pac.c |  12 +
>  .../no-build-attribute-standard.c |  12 +
>  gcc/testsuite/lib/target-supports.exp |  16 ++
>  17 files changed, 611 insertions(+), 50 deletions(-)
>  create mode 100644 gcc/config/aarch64/aarch64-dwarf-metadata.cc
>  create mode 100644 gcc/config/aarch64/aarch64-dwarf-metadata.h
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/build-attributes/aarch64-build-attributes.exp
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/build-attributes/build-attribute-gcs.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/build-attributes/build-attribute-standard.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/build-attributes/no-build-attribute-bti.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/build-attributes/no-build-attribute-gcs.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/build-attributes/no-build-attribute-pac.c
>  create mode 100644 
> gcc/testsuite/gcc.target/aarch64/build-attributes/no-build-attribute-standard.c


Re: [PATCH v2 2/4] aarch64: add minimal support of AEABI build attributes for GCS.

2024-10-23 Thread Richard Sandiford
Matthieu Longo  writes:
> @@ -24803,6 +24834,16 @@ aarch64_start_file (void)
> asm_fprintf (asm_out_file, "\t.arch %s\n",
>   aarch64_last_printed_arch_string.c_str ());
>  
> +  /* Check whether the current assembly supports gcs build attributes, if not
> + fallback to .note.gnu.property section.  */
> +#if (HAVE_AS_AEABI_BUILD_ATTRIBUTES)
> +  if (aarch64_gcs_enabled ())

I was hoping we could instead use:

  if (HAVE_AS_AEABI_BUILD_ATTRIBUTES && aarch64_gcs_enabled ())

so that the code is parsed but compiled out when the new syntax is not
supported.  This avoids cases where a patch that works with a new
assembler breaks bootstrap when using an older assembler, or vice
versa.
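
I.e., keeping the quoted body below as-is but folding the #if into the
condition:

  if (HAVE_AS_AEABI_BUILD_ATTRIBUTES && aarch64_gcs_enabled ())
    {
      aarch64_emit_aeabi_subsection (".aeabi-feature-and-bits", 1, 0);
      aarch64_emit_aeabi_attribute ("Tag_Feature_GCS", 3, 1);
    }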

> +{
> +  aarch64_emit_aeabi_subsection (".aeabi-feature-and-bits", 1, 0);
> +  aarch64_emit_aeabi_attribute ("Tag_Feature_GCS", 3, 1);
> +}
> +#endif
> +
> default_file_start ();
>  }
>  
> [...]
> diff --git a/gcc/configure.ac b/gcc/configure.ac
> index 8a5fed516b3..f4b1343ca70 100644
> --- a/gcc/configure.ac
> +++ b/gcc/configure.ac
> @@ -4387,6 +4387,16 @@ case "$target" in
>   ldr x0, [[x2, #:gotpage_lo15:globalsym]]
>  ],,[AC_DEFINE(HAVE_AS_SMALL_PIC_RELOCS, 1,
>   [Define if your assembler supports relocs needed by -fpic.])])
> +# Check if we have binutils support for AEABI build attributes.
> +gcc_GAS_CHECK_FEATURE([support of AEABI build attributes], 
> gcc_cv_as_aarch64_aeabi_build_attributes,,
> +[
> + .set ATTR_TYPE_uleb128,   0
> + .set ATTR_TYPE_asciz, 1

Very minor, but: we can drop this line, since it isn't used in the test.
Same for the corresponding Tcl test.

OK with those changes, thanks.

Richard

> + .set Tag_Feature_foo, 1
> + .aeabi_subsection .aeabi-feature-and-bits, 1, ATTR_TYPE_uleb128
> + .aeabi_attribute Tag_Feature_foo, 1
> +],,[AC_DEFINE(HAVE_AS_AEABI_BUILD_ATTRIBUTES, 1,
> + [Define if your assembler supports AEABI build attributes.])])
>  # Enable Branch Target Identification Mechanism and Return Address
>  # Signing by default.
>  AC_ARG_ENABLE(standard-branch-protection,


Re: [PATCH] SVE intrinsics: Fold division and multiplication by -1 to neg.

2024-10-23 Thread Richard Sandiford
Jennifer Schmitz  writes:
> Because a neg instruction has lower latency and higher throughput than
> sdiv and mul, svdiv and svmul by -1 can be folded to svneg. For svdiv,
> this is already implemented on the RTL level; for svmul, the
> optimization was still missing.
> This patch implements folding to svneg for both operations using the
> gimple_folder. For svdiv, the transform is applied if the divisor is -1.
> Svmul is folded if either of the operands is -1. A case distinction of
> the predication is made to account for the fact that svneg_m has 3 arguments
> (argument 0 holds the values for the inactive lanes), while svneg_x and
> svneg_z have only 2 arguments.
> Tests were added or adjusted to check the produced assembly and runtime
> tests were added to check correctness.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 

Sorry for the slow review.

> [...]
> @@ -2033,12 +2054,37 @@ public:
>  if (integer_zerop (op1) || integer_zerop (op2))
>return f.fold_active_lanes_to (build_zero_cst (TREE_TYPE (f.lhs)));
>  
> +/* If one of the operands is all integer -1, fold to svneg.  */
> +tree pg = gimple_call_arg (f.call, 0);
> +tree negated_op = NULL;
> +if (integer_minus_onep (op2))
> +  negated_op = op1;
> +else if (integer_minus_onep (op1))
> +  negated_op = op2;
> +if (!f.type_suffix (0).unsigned_p && negated_op)

This is definitely ok, but it would be nice to handle the unsigned_p
case too at some point.  This would mean bit-casting to the equivalent
signed type, doing the negation, and casting back.  It would be good
to have a helper for doing that (maybe with a lambda callback that
emits the actual call) since I can imagine it will be needed elsewhere
too.

> [...]
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s32.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s32.c
> index 13009d88619..1d605dbdd8d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s32.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s32.c
> @@ -183,14 +183,25 @@ TEST_UNIFORM_Z (mul_3_s32_m_untied, svint32_t,
>  
>  /*
>  ** mul_m1_s32_m:
> -**   mov (z[0-9]+)\.b, #-1
> -**   mul z0\.s, p0/m, z0\.s, \1\.s
> +**   neg z0\.s, p0/m, z0\.s
>  **   ret
>  */
>  TEST_UNIFORM_Z (mul_m1_s32_m, svint32_t,
>   z0 = svmul_n_s32_m (p0, z0, -1),
>   z0 = svmul_m (p0, z0, -1))
>  
> +/*
> +** mul_m1r_s32_m:
> +**   mov (z[0-9]+)\.b, #-1
> +**   mov (z[0-9]+)\.d, z0\.d
> +**   movprfx z0, \1
> +**   neg z0\.s, p0/m, \2\.s
> +**   ret
> +*/
> +TEST_UNIFORM_Z (mul_m1r_s32_m, svint32_t,
> + z0 = svmul_s32_m (p0, svdup_s32 (-1), z0),
> + z0 = svmul_m (p0, svdup_s32 (-1), z0))

Maybe it would be better to test the untied case instead, by passing
z1 rather than z0 as the final argument.  Hopefully that would leave
us with just the first and last instructions.  (I think the existing
tests already cover the awkward tied2 case well enough.)

Same for the later similar tests.

OK with that change, thanks.

Richard

> +
>  /*
>  ** mul_s32_z_tied1:
>  **   movprfx z0\.s, p0/z, z0\.s
> @@ -597,13 +608,44 @@ TEST_UNIFORM_Z (mul_255_s32_x, svint32_t,
>  
>  /*
>  ** mul_m1_s32_x:
> -**   mul z0\.s, z0\.s, #-1
> +**   neg z0\.s, p0/m, z0\.s
>  **   ret
>  */
>  TEST_UNIFORM_Z (mul_m1_s32_x, svint32_t,
>   z0 = svmul_n_s32_x (p0, z0, -1),
>   z0 = svmul_x (p0, z0, -1))
>  
> +/*
> +** mul_m1r_s32_x:
> +**   neg z0\.s, p0/m, z0\.s
> +**   ret
> +*/
> +TEST_UNIFORM_Z (mul_m1r_s32_x, svint32_t,
> + z0 = svmul_s32_x (p0, svdup_s32 (-1), z0),
> + z0 = svmul_x (p0, svdup_s32 (-1), z0))
> +
> +/*
> +** mul_m1_s32_z:
> +**   mov (z[0-9]+)\.d, z0\.d
> +**   movprfx z0\.s, p0/z, \1\.s
> +**   neg z0\.s, p0/m, \1\.s
> +**   ret
> +*/
> +TEST_UNIFORM_Z (mul_m1_s32_z, svint32_t,
> + z0 = svmul_n_s32_z (p0, z0, -1),
> + z0 = svmul_z (p0, z0, -1))
> +
> +/*
> +** mul_m1r_s32_z:
> +**   mov (z[0-9]+)\.d, z0\.d
> +**   movprfx z0\.s, p0/z, \1\.s
> +**   neg z0\.s, p0/m, \1\.s
> +**   ret
> +*/
> +TEST_UNIFORM_Z (mul_m1r_s32_z, svint32_t,
> + z0 = svmul_s32_z (p0, svdup_s32 (-1),  z0),
> + z0 = svmul_z (p0, svdup_s32 (-1), z0))
> +
>  /*
>  ** mul_m127_s32_x:
>  **   mul z0\.s, z0\.s, #-127
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s64.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s64.c
> index 530d9fc84a5..c05d184f2fe 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s64.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/acle/asm/mul_s64.c
> @@ -192,8 +192,7 @@ TEST_UNIFORM_Z (mul_3_s64_m_untied, svint64_t,
>  
>  /*
>  ** mul_m1_s64_m:
> -**   mov (z[0-9]+)\.b, #-1
> -**   mul z0\.d, p0/m, z0\.d, \1\.d
> +**   neg z0\.d, p0/m, z0\.d
>  **   ret

Re: [PATCH 1/2] aarch64: Use standard names for saturating arithmetic

2024-10-23 Thread Richard Sandiford
Akram Ahmad  writes:
> On 23/10/2024 12:20, Richard Sandiford wrote:
>> Thanks for doing this.  The approach looks good.  My main question is:
>> are we sure that we want to use the Advanced SIMD instructions for
>> signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
>> we only saturate at the negative limit if both operands are negative,
>> and only saturate at the positive limit if both operands are positive.
>> So for 32-bit values we can use:
>>
>>  asr tmp, x or y, #31
>>  eor tmp, tmp, #0x80000000
>>
>> to calculate the saturation value and:
>>
>>  adds res, x, y
>>  csel res, tmp, res, vs
>>
>> to calculate the full result.  That's the same number of instructions
>> as two fmovs for the inputs, the sqadd, and the fmov for the result,
>> but it should be more efficient.
>>
>> The reason for asking now, rather than treating it as a potential
>> future improvement, is that it would also avoid splitting the patterns
>> for signed and unsigned ops.  (The length of the split alternative can be
>> conservatively set to 16 even for the unsigned version, since nothing
>> should care in practice.  The split will have happened before
>> shorten_branches.)
>
> Hi Richard, thanks for looking over this.
>
> I might be misunderstanding your suggestion, but is there a way to
> efficiently check the signedness of the second operand (let's say 'y')
> if it is stored in a register? This is a problem we considered and
> couldn't solve post-reload, as we only have three registers (including
> two operands) to work with. (I might be wrong in terms of how many
> registers we have available). AFAIK that's why we only use adds, csinv
> / subs, csel in the unsigned case.

Ah, ok.  For post-reload splits, we would need to add:

  (clobber (match_operand:GPI 3 "scratch_operand"))

then use "X" as the constraint for the Advanced SIMD alternative and
"&r" as the constraint in the GPR alternative.  But I suppose that
also sinks my dream of a unified pattern, since the unsigned case
wouldn't need the extra operand.

In both cases (signed and unsigned), the pattern should clobber CC_REGNUM,
since the split changes the flags.
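
I.e. the pattern's parallel would also contain:

  (clobber (reg:CC CC_REGNUM))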

> [...]
>>> diff --git 
>>> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>>>  
>>> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>>> new file mode 100644
>>> index 000..63eb21e438b
>>> --- /dev/null
>>> +++ 
>>> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>>> @@ -0,0 +1,79 @@
>>> +/* { dg-do assemble { target { aarch64*-*-* } } } */
>>> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
>>> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
>>> +
>>> +/*
>>> +** uadd_lane: { xfail *-*-* }
>> Just curious: why does this fail?  Is it a vector costing issue?
> This is due to a missing pattern from match.pd- I've sent another patch
> upstream to rectify this. In essence, this function exposes a commutative
> form of an existing addition pattern, but that form isn't currently 
> commutative
> when it should be. It's a similar reason for why the uqsubs are also 
> marked as
> xfail, so that same patch series contains a fix for the uqsub case too.

Ah, ok, thanks.

Richard


Re: [PATCH] SVE intrinsics: Add constant folding for svindex.

2024-10-23 Thread Richard Sandiford
Jennifer Schmitz  writes:
> This patch folds svindex with constant arguments into a vector series.
> We implemented this in svindex_impl::fold using the function build_vec_series.
> For example,
> svuint64_t f1 ()
> {
>   return svindex_u642 (10, 3);
> }
> compiled with -O2 -march=armv8.2-a+sve, is folded to {10, 13, 16, ...}
> in the gimple pass lower.
> This optimization benefits cases where svindex is used in combination with
> other gimple-level optimizations.
> For example,
> svuint64_t f2 ()
> {
> return svmul_x (svptrue_b64 (), svindex_u64 (10, 3), 5);
> }
> has previously been compiled to
> f2:
> index   z0.d, #10, #3
> mul z0.d, z0.d, #5
> ret
> Now, it is compiled to
> f2:
> mov x0, 50
> index   z0.d, x0, #15
> ret

Nice!  Thanks for doing this.

> For non-constant arguments, build_vec_series produces a VEC_SERIES_EXPR,
> which is translated back at RTL level to an index instruction without codegen
> changes.
>
> We added test cases checking
> - the application of the transform during gimple for constant arguments,
> - the interaction with another gimple-level optimization.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-base.cc
>   (svindex_impl::fold): Add constant folding.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve/index_const_fold.c: New test.
> ---
>  .../aarch64/aarch64-sve-builtins-base.cc  | 12 +++
>  .../gcc.target/aarch64/sve/index_const_fold.c | 35 +++
>  2 files changed, 47 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/index_const_fold.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-base.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> index 1c17149e1f0..f6b1657ecbb 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-base.cc
> @@ -1304,6 +1304,18 @@ public:
>  
>  class svindex_impl : public function_base
>  {
> +public:
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +tree vec_type = TREE_TYPE (f.lhs);
> +tree base = gimple_call_arg (f.call, 0);
> +tree step = gimple_call_arg (f.call, 1);

Could we restrict this to:

  if (TREE_CODE (base) != INTEGER_CST || TREE_CODE (step) != INTEGER_CST)
return nullptr;

for now?  This goes back to the previous discussion about how "creative"
the compiler is allowed to be in replacing the user's original instruction
selection.  IIRC, it'd also be somewhat novel to use VEC_SERIES_EXPR for
constant-length vectors.

We can always relax this later if we find a compelling use case.
But it looks like the tests would still pass with the guard above.

OK with that change, thanks.

Richard

> +
> +return gimple_build_assign (f.lhs,
> + build_vec_series (vec_type, base, step));
> +  }
> +
>  public:
>rtx
>expand (function_expander &e) const override
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/index_const_fold.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/index_const_fold.c
> new file mode 100644
> index 000..7abb803f58b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/index_const_fold.c
> @@ -0,0 +1,35 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +#include 
> +#include 
> +
> +#define INDEX_CONST(TYPE, TY)\
> +  sv##TYPE f_##TY##_index_const ()   \
> +  {  \
> +return svindex_##TY (10, 3); \
> +  }
> +
> +#define MULT_INDEX(TYPE, TY) \
> +  sv##TYPE f_##TY##_mult_index ()\
> +  {  \
> +return svmul_x (svptrue_b8 (),   \
> + svindex_##TY (10, 3),   \
> + 5); \
> +  }
> +
> +#define ALL_TESTS(TYPE, TY)  \
> +  INDEX_CONST (TYPE, TY) \
> +  MULT_INDEX (TYPE, TY)
> +
> +ALL_TESTS (uint8_t, u8)
> +ALL_TESTS (uint16_t, u16)
> +ALL_TESTS (uint32_t, u32)
> +ALL_TESTS (uint64_t, u64)
> +ALL_TESTS (int8_t, s8)
> +ALL_TESTS (int16_t, s16)
> +ALL_TESTS (int32_t, s32)
> +ALL_TESTS (int64_t, s64)
> +
> +/* { dg-final { scan-tree-dump-times "return \\{ 10, 13, 16, ... \\}" 8 
> "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "return \\{ 50, 65, 80, ... \\}" 8 
> "optimized" } } */


Re: [PATCH 2/2] tree-optimization/116575 - SLP masked load-lanes discovery

2024-10-23 Thread Richard Sandiford
Richard Biener  writes:
> The following implements masked load-lane discovery for SLP.  The
> challenge here is that a masked load has a full-width mask with
> group-size number of elements; when this becomes a masked load-lanes
> instruction, one mask element gates all group members.  We already
> have some discovery hints in place, namely STMT_VINFO_SLP_VECT_ONLY
> to guard non-uniform masks, but we need to choose a way for SLP
> discovery to handle possible masked load-lanes SLP trees.
>
> I have this time chosen to handle load-lanes discovery where we
> have performed permute optimization already and conveniently got
> the graph with predecessor edges built.  This is because unlike
> non-masked loads masked loads with a load_permutation are never
> produced by SLP discovery (because load permutation handling doesn't
> handle un-permuting the mask) and thus the load-permutation lowering
> which handles non-masked load-lanes discovery doesn't trigger.
>
> With this, SLP discovery for a possible masked load-lanes, thus
> a masked load with uniform mask, produces a splat of a single-lane
> sub-graph as the mask SLP operand.  This is a representation that
> shouldn't pessimize the mask load case and allows the masked load-lanes
> transform to simply elide this splat.

It's been too long since I did significant work on the vectoriser for
me to make a sensible comment on this, but FWIW, I agree the representation
of a splatted mask sounds good.

> This fixes the aarch64-sve.exp mask_struct_load*.c testcases with
> --param vect-force-slp=1
>
> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>
> I realize we are still quite inconsistent in how we do SLP
> discovery - mainly because of my idea to only apply minimal
> changes at this point.  I would expect that permuted masked loads
> miss the interleaving lowering performed by load permutation
> lowering.  And if we fix that we again have to decide whether
> to interleave or load-lane at the same time.  I'm also not sure
> how much good the optimize_slp passes to do VEC_PERMs in the
> SLP graph and what stops working when there are no longer any
> load_permutations in there.

Yeah, I'm also not sure about that.  The code only considers candidate
layouts that would undo a load permutation or a bijective single-input
VEC_PERM_EXPR.  It won't do anything for 2-to-1 permutes or single-input
packs.  The current layout selection is probably quite outdated at
this point.

Thanks,
Richard

> Richard.
>
>   PR tree-optimization/116575
>   * tree-vect-slp.cc (vect_get_and_check_slp_defs): Handle
>   gaps, aka NULL scalar stmt.
>   (vect_build_slp_tree_2): Allow gaps in the middle of a
>   grouped mask load.  When the mask of a grouped mask load
>   is uniform do single-lane discovery for the mask and
>   insert a splat VEC_PERM_EXPR node.
>   (vect_optimize_slp_pass::decide_masked_load_lanes): New
>   function.
>   (vect_optimize_slp_pass::run): Call it.
> ---
>  gcc/tree-vect-slp.cc | 138 ++-
>  1 file changed, 135 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index fca9ae86d2e..037098a96cb 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -641,6 +641,16 @@ vect_get_and_check_slp_defs (vec_info *vinfo, unsigned 
> char swap,
>unsigned int commutative_op = -1U;
>bool first = stmt_num == 0;
>  
> +  if (!stmt_info)
> +{
> +  for (auto oi : *oprnds_info)
> + {
> +   oi->def_stmts.quick_push (NULL);
> +   oi->ops.quick_push (NULL_TREE);
> + }
> +  return 0;
> +}
> +
>if (!is_a (stmt_info->stmt)
>&& !is_a (stmt_info->stmt)
>&& !is_a (stmt_info->stmt))
> @@ -2029,9 +2039,11 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>   has_gaps = true;
> /* We cannot handle permuted masked loads directly, see
>PR114375.  We cannot handle strided masked loads or masked
> -  loads with gaps.  */
> +  loads with gaps unless the mask is uniform.  */
> if ((STMT_VINFO_GROUPED_ACCESS (stmt_info)
> -&& (DR_GROUP_GAP (first_stmt_info) != 0 || has_gaps))
> +&& (DR_GROUP_GAP (first_stmt_info) != 0
> +|| (has_gaps
> +&& STMT_VINFO_SLP_VECT_ONLY (first_stmt_info
> || STMT_VINFO_STRIDED_P (stmt_info))
>   {
> load_permutation.release ();
> @@ -2054,7 +2066,12 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
> unsigned i = 0;
> for (stmt_vec_info si = first_stmt_info;
>  si; si = DR_GROUP_NEXT_ELEMENT (si))
> - stmts2[i++] = si;
> + {
> +   if (si != first_stmt_info)
> + for (unsigned k = 1; k < DR_GROUP_GAP (si); ++k)
> +   stmts2[i+

Re: [PATCH v2 9/9] aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT

2024-10-23 Thread Richard Sandiford
Evgeny Karpov  writes:
> Tuesday, October 22, 2024
> Richard Sandiford  wrote:
>
>>> If ASM_OUTPUT_ALIGNED_LOCAL uses an alignment less than BIGGEST_ALIGNMENT,
>>> it might trigger a relocation issue.
>>>
>>> relocation truncated to fit: IMAGE_REL_ARM64_PAGEOFFSET_12L
>>
>> Sorry to press the issue, but: why does that happen?
>
> #define IMAGE_REL_ARM64_PAGEOFFSET_12L  0x0007  /* The 12-bit page offset of 
> the target, for instruction LDR (indexed, unsigned immediate). */
>
> Based on the documentation for LDR
> https://developer.arm.com/documentation/ddi0596/2020-12/Base-Instructions/LDR--immediate---Load-Register--immediate--
> "For the 64-bit variant: is the optional positive immediate byte offset, a 
> multiple of 8 in the range 0 to 32760, defaulting to 0 and encoded in the 
> "imm12" field as /8"

This in itself is relatively standard though.  We can't assume
without checking that any given offset will be "nicely" aligned.
So...

> This means BIGGEST_ALIGNMENT (128) could be replaced with 64.
>
> auto rounded = ROUND_UP (MAX ((SIZE), 1),       \
>     MAX ((ALIGNMENT), 64) / BITS_PER_UNIT);
>
> It works for most cases, however, not for all of them.

...although this will work for, say, loading all of:

unsigned char x[8];

using a single LDR, it doesn't look like it would cope with:

  struct __attribute__((packed)) {
char x;
void *y;
  } foo;

  void *f() { return foo.y; }

Or, even if that does work, it isn't clear to me why patching
ASM_OUTPUT_ALIGNED_LOCAL is a complete solution to the problem.

ISTM that we should be checking the known alignment during code generation,
and only using relocations if their alignment requirements are known to
be met.
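As a very rough sketch of the kind of check meant here (MEM_ALIGN and
GET_MODE_ALIGNMENT are the real macros, but the surrounding context is
purely illustrative, not a concrete patch):

  /* Only fold the offset into an LDR/STR :lo12: immediate when the
     access is known to be aligned to the access size; otherwise form
     the address separately.  */
  if (MEM_ALIGN (mem) < GET_MODE_ALIGNMENT (mode))
    return false;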

Once that's done, it would make sense to increase the default alignment
if that improves code quality.  But it would be good to fix the correctness
issue first, while the problem is still easily visible.

If we do want to increase the default alignment to improve code quality,
the normal way would be via macros like DATA_ALIGNMENT or LOCAL_ALIGNMENT.
The advantage of those macros is that the increased alignment is visible
during code generation, rather than something that is only applied at
output time.

Thanks,
Richard


Re: [PATCH v3] aarch64: Improve scalar mode popcount expansion by using SVE [PR113860]

2024-10-23 Thread Richard Sandiford
Pengxuan Zheng  writes:
> This is similar to the recent improvements to the Advanced SIMD popcount
> expansion by using SVE. We can utilize SVE to generate more efficient code for
> scalar mode popcount too.
>
> Changes since v1:
> * v2: Add a new VNx1BI mode and a new test case for V1DI.
> * v3: Abandon VNx1BI changes and add a new variant of aarch64_ptrue_reg.

Sorry for the slow review.

The patch looks good though.  OK with the changes below:

> diff --git a/gcc/testsuite/gcc.target/aarch64/popcnt12.c 
> b/gcc/testsuite/gcc.target/aarch64/popcnt12.c
> new file mode 100644
> index 000..f086cae55a2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/popcnt12.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fgimple" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +

It's probably safer to add:

#pragma GCC target "+nosve"

here, so that we don't try to use the SVE instructions.

> +/*
> +** foo:
> +**   cnt v0.8b, v0.8b
> +**   addv b0, v0.8b

Nothing requires the temporary register to be v0, so this should be
something like:

cnt (v[0-9]+\.8b), v0\.8b
addv b0, \1

Thanks,
Richard

> +**   ret
> +*/
> +__Uint64x1_t __GIMPLE
> +foo (__Uint64x1_t x)
> +{
> +  __Uint64x1_t z;
> +
> +  z = .POPCOUNT (x);
> +  return z;
> +}


Re: [PATCH 1/2] aarch64: Use standard names for saturating arithmetic

2024-10-23 Thread Richard Sandiford
Richard Sandiford  writes:
> Akram Ahmad  writes:
>> This renames the existing {s,u}q{add,sub} instructions to use the
>> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
>> IFN_SAT_SUB.
>>
>> The NEON intrinsics for saturating arithmetic and their corresponding
>> builtins are changed to use these standard names too.
>>
>> Using the standard names for the instructions causes 32 and 64-bit
>> unsigned scalar saturating arithmetic to use the NEON instructions,
>> resulting in an additional (and inefficient) FMOV to be generated when
>> the original operands are in GP registers. This patch therefore also
>> restores the original behaviour of using the adds/subs instructions
>> in this circumstance.
>>
>> Additional tests are written for the scalar and Adv. SIMD cases to
>> ensure that the correct instructions are used. The NEON intrinsics are
>> already tested elsewhere.
>
> Thanks for doing this.  The approach looks good.  My main question is:
> are we sure that we want to use the Advanced SIMD instructions for
> signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
> we only saturate at the negative limit if both operands are negative,
> and only saturate at the positive limit if both operands are positive.
> So for 32-bit values we can use:
>
>   asr tmp, x or y, #31
> >   eor tmp, tmp, #0x80000000
>
> to calculate the saturation value and:
>
> >   adds res, x, y
> >   csel res, tmp, res, vs

Bah, knew I should have sat on this before sending.  tmp is the
inverse of the saturation value, so we want:

csinv   res, res, tmp, vc

instead of the csel above.
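
Putting the corrected sequence together, the logic is (a standalone C
illustration of the same thing, not GCC code; __builtin_add_overflow
stands in for the ADDS and the overflow check):

  #include <stdint.h>

  int32_t
  sat_add_s32 (int32_t x, int32_t y)
  {
    /* Inverse of the saturation value: ~INT_MAX if x >= 0, ~INT_MIN if
       x < 0, i.e. "asr tmp, x, #31; eor tmp, tmp, #0x80000000".  */
    uint32_t tmp = (uint32_t) (x >> 31) ^ 0x80000000u;
    int32_t res;
    /* "adds res, x, y; csinv res, res, tmp, vc": on overflow the result
       is ~tmp, i.e. the saturation value.  */
    if (__builtin_add_overflow (x, y, &res))
      return (int32_t) ~tmp;
    return res;
  }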

> to calculate the full result.  That's the same number of instructions
> as two fmovs for the inputs, the sqadd, and the fmov for the result,
> but it should be more efficient.
>
> The reason for asking now, rather than treating it as a potential
> future improvement, is that it would also avoid splitting the patterns
> for signed and unsigned ops.  (The length of the split alternative can be
> conservatively set to 16 even for the unsigned version, since nothing
> should care in practice.  The split will have happened before
> shorten_branches.)
>
>> gcc/ChangeLog:
>>
>>  * config/aarch64/aarch64-builtins.cc: Expand iterators.
>>  * config/aarch64/aarch64-simd-builtins.def: Use standard names
>>  * config/aarch64/aarch64-simd.md: Use standard names, split insn
>>  definitions on signedness of operator and type of operands.
>>  * config/aarch64/arm_neon.h: Use standard builtin names.
>>  * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
>>  simplify splitting of insn for unsigned scalar arithmetic.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * 
>> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
>>  Template file for unsigned vector saturating arithmetic tests.
>>  * 
>> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
>>  8-bit vector type tests.
>>  * 
>> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
>>  16-bit vector type tests.
>>  * 
>> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
>>  32-bit vector type tests.
>>  * 
>> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
>>  64-bit vector type tests.
>>  * gcc.target/aarch64/saturating_arithmetic.inc: Template file
>>  for scalar saturating arithmetic tests.
>>  * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
>>  * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
>>  * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
>>  * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
>> diff --git 
>> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>>  
>> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>> new file mode 100644
>> index 000..63eb21e438b
>> --- /dev/null
>> +++ 
>> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>> @@ -0,0 +1,79 @@
>> +/* { dg-do assemble { target { aarch64*-*-* } } } */
>> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
>> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
>> +
>> +/*
>> +** uadd_lane: { xfail *-*-* }
>
> Just curious: why does this fail?  Is it a vector costing issue?

Re: SVE intrinsics: Fold constant operands for svlsl.

2024-10-23 Thread Richard Sandiford
Soumya AR  writes:
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index 41673745cfe..aa556859d2e 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -1143,11 +1143,14 @@ aarch64_const_binop (enum tree_code code, tree arg1, 
> tree arg2)
>tree type = TREE_TYPE (arg1);
>signop sign = TYPE_SIGN (type);
>wi::overflow_type overflow = wi::OVF_NONE;
> -
> +  unsigned int element_bytes = tree_to_uhwi (TYPE_SIZE_UNIT (type));
>/* Return 0 for division by 0, like SDIV and UDIV do.  */
>if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
>   return arg2;
> -
> +  /* Return 0 if shift amount is out of range. */
> +  if (code == LSHIFT_EXPR
> +   && tree_to_uhwi (arg2) >= (element_bytes * BITS_PER_UNIT))

tree_to_uhwi is dangerous because a general shift might be negative
(even if these particular shift amounts are unsigned).  We should
probably also key off TYPE_PRECISION rather than TYPE_SIZE_UNIT.  So:

if (code == LSHIFT_EXPR
&& wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))

without the element_bytes variable.  Also: the indentation looks a bit off;
it should be tabs only followed by spaces only.
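
Putting both of those together, the hunk would end up looking something
like (sketch only):

  /* Return 0 for division by 0, like SDIV and UDIV do.  */
  if (code == TRUNC_DIV_EXPR && integer_zerop (arg2))
    return arg2;

  /* Return 0 if the shift amount is out of range.  */
  if (code == LSHIFT_EXPR
      && wi::geu_p (wi::to_wide (arg2), TYPE_PRECISION (type)))
    return build_int_cst (type, 0);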

OK with those changes, thanks.

Richard


> + return build_int_cst (type, 0);
>if (!poly_int_binop (poly_res, code, arg1, arg2, sign, &overflow))
>   return NULL_TREE;
>return force_fit_type (type, poly_res, false,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
> new file mode 100644
> index 000..6109558001a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/const_fold_lsl_1.c
> @@ -0,0 +1,142 @@
> +/* { dg-final { check-function-bodies "**" "" } } */
> +/* { dg-options "-O2" } */
> +
> +#include "arm_sve.h"
> +
> +/*
> +** s64_x:
> +**   mov z[0-9]+\.d, #20
> +**   ret
> +*/
> +svint64_t s64_x (svbool_t pg) {
> +return svlsl_n_s64_x (pg, svdup_s64 (5), 2);  
> +}
> +
> +/*
> +** s64_x_vect:
> +**   mov z[0-9]+\.d, #20
> +**   ret
> +*/
> +svint64_t s64_x_vect (svbool_t pg) {
> +return svlsl_s64_x (pg, svdup_s64 (5), svdup_u64 (2));  
> +}
> +
> +/*
> +** s64_z:
> +**   mov z[0-9]+\.d, p[0-7]/z, #20
> +**   ret
> +*/
> +svint64_t s64_z (svbool_t pg) {
> +return svlsl_n_s64_z (pg, svdup_s64 (5), 2);  
> +}
> +
> +/*
> +** s64_z_vect:
> +**   mov z[0-9]+\.d, p[0-7]/z, #20
> +**   ret
> +*/
> +svint64_t s64_z_vect (svbool_t pg) {
> +return svlsl_s64_z (pg, svdup_s64 (5), svdup_u64 (2));  
> +}
> +
> +/*
> +** s64_m_ptrue:
> +**   mov z[0-9]+\.d, #20
> +**   ret
> +*/
> +svint64_t s64_m_ptrue () {
> +return svlsl_n_s64_m (svptrue_b64 (), svdup_s64 (5), 2);  
> +}
> +
> +/*
> +** s64_m_ptrue_vect:
> +**   mov z[0-9]+\.d, #20
> +**   ret
> +*/
> +svint64_t s64_m_ptrue_vect () {
> +return svlsl_s64_m (svptrue_b64 (), svdup_s64 (5), svdup_u64 (2));  
> +}
> +
> +/*
> +** s64_m_pg:
> +**   mov z[0-9]+\.d, #5
> +**   lsl z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, #2
> +**   ret
> +*/
> +svint64_t s64_m_pg (svbool_t pg) {
> +return svlsl_n_s64_m (pg, svdup_s64 (5), 2);
> +} 
> +
> +/*
> +** s64_m_pg_vect:
> +**   mov z[0-9]+\.d, #5
> +**   lsl z[0-9]+\.d, p[0-7]/m, z[0-9]+\.d, #2
> +**   ret
> +*/
> +svint64_t s64_m_pg_vect (svbool_t pg) {
> +return svlsl_s64_m (pg, svdup_s64 (5), svdup_u64 (2));
> +}
> +
> +/*
> +** s64_x_0:
> +**   mov z[0-9]+\.d, #5
> +**   ret
> +*/
> +svint64_t s64_x_0 (svbool_t pg) {
> +return svlsl_n_s64_x (pg, svdup_s64 (5), 0);  
> +}
> +
> +/*
> +** s64_x_bit_width:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   ret
> +*/
> +svint64_t s64_x_bit_width (svbool_t pg) {
> +return svlsl_n_s64_x (pg, svdup_s64 (5), 64); 
> +}
> +
> +/*
> +** s64_x_out_of_range:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   ret
> +*/
> +svint64_t s64_x_out_of_range (svbool_t pg) {
> +return svlsl_n_s64_x (pg, svdup_s64 (5), 68); 
> +}
> +
> +/*
> +** u8_x_unsigned_overflow:
> +**   mov z[0-9]+\.b, #-2
> +**   ret
> +*/
> +svuint8_t u8_x_unsigned_overflow (svbool_t pg) {
> +return svlsl_n_u8_x (pg, svdup_u8 (255), 1);  
> +}
> +
> +/*
> +** s8_x_signed_overflow:
> +**   mov z[0-9]+\.b, #-2
> +**   ret
> +*/
> +svint8_t s8_x_signed_overflow (svbool_t pg) {
> +return svlsl_n_s8_x (pg, svdup_s8 (255), 1);  
> +}
> +
> +/*
> +** s8_x_neg_shift:
> +**   mov z[0-9]+\.b, #-2
> +**   ret
> +*/
> +svint8_t s8_x_neg_shift (svbool_t pg) {
> +return svlsl_n_s8_x (pg, svdup_s8 (-1), 1);  
> +}
> +
> +/*
> +** s8_x_max_shift:
> +**   mov z[0-9]+\.b, #-128
> +**   ret
> +*/
> +svint8_t s8_x_max_shift (svbool_t pg) {
> +return svlsl_n_s8_x (pg, svdup_s8 (1), 7);  
> +}
> +


Re: [PATCH v3] AArch64: Fix copysign patterns

2024-10-23 Thread Richard Sandiford
Wilco Dijkstra  writes:
> The current copysign pattern has a mismatch in the predicates and constraints 
> -
> operand[2] is a register_operand but also has an alternative X which allows 
> any
> operand.  Since it is a floating point operation, having an integer 
> alternative
> makes no sense.  Change the expander to always use vector immediates which 
> results
> in better code and sharing of immediates between copysign and xorsign.
>
> Passes bootstrap and regress, OK for commit?
>
> gcc/Changelog:
> * config/aarch64/aarch64.md (copysign3): Widen immediate to 
> vector.
> (copysign3_insn): Use VQ_INT_EQUIV in operand 3.
> * config/aarch64/iterators.md (VQ_INT_EQUIV): New iterator.
> (vq_int_equiv): Likewise.
>
> testsuite/Changelog:
> * gcc.target/aarch64/copysign_3.c: New test.
> * gcc.target/aarch64/copysign_4.c: New test.
> * gcc.target/aarch64/fneg-abs_2.c: Fixup test.
> * gcc.target/aarch64/sve/fneg-abs_2.c: Likewise.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index 
> c54b29cd64b9e0dc6c6d12735049386ccedc5408..71f9743df671b70e6a2d189f49de58995398abee
>  100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -7218,20 +7218,11 @@ (define_expand "lrint2"
>  }
>  )
>  
> -;; For copysign (x, y), we want to generate:
> +;; For copysignf (x, y), we want to generate:
>  ;;
> -;;   LDR d2, #(1 << 63)
> -;;   BSL v2.8b, [y], [x]
> +;;   moviv31.4s, 0x80, lsl 24
> +;;   bit v0.16b, v1.16b, v31.16b
>  ;;
> -;; or another, equivalent, sequence using one of BSL/BIT/BIF.  Because
> -;; we expect these operations to nearly always operate on
> -;; floating-point values, we do not want the operation to be
> -;; simplified into a bit-field insert operation that operates on the
> -;; integer side, since typically that would involve three inter-bank
> -;; register copies.  As we do not expect copysign to be followed by
> -;; other logical operations on the result, it seems preferable to keep
> -;; this as an unspec operation, rather than exposing the underlying
> -;; logic to the compiler.

I think the comment starting "Because we expect..." is worth keeping.
It explains why we use an unspec for something that could be expressed
in generic rtl.

OK with that change, thanks.

Richard

>  (define_expand "copysign3"
>[(match_operand:GPF 0 "register_operand")
> @@ -7239,32 +7230,25 @@ (define_expand "copysign3"
> (match_operand:GPF 2 "nonmemory_operand")]
>"TARGET_SIMD"
>  {
> -  rtx signbit_const = GEN_INT (HOST_WIDE_INT_M1U
> -<< (GET_MODE_BITSIZE (mode) - 1));
> -  /* copysign (x, -1) should instead be expanded as orr with the sign
> - bit.  */
> +  rtx sign = GEN_INT (HOST_WIDE_INT_M1U << (GET_MODE_BITSIZE (mode) - 
> 1));
> +  rtx v_bitmask = gen_const_vec_duplicate (mode, sign);
> +  v_bitmask = force_reg (mode, v_bitmask);
> +
> +  /* copysign (x, -1) should instead be expanded as orr with the signbit.  */
>rtx op2_elt = unwrap_const_vec_duplicate (operands[2]);
> +
>if (GET_CODE (op2_elt) == CONST_DOUBLE
>&& real_isneg (CONST_DOUBLE_REAL_VALUE (op2_elt)))
>  {
> -  rtx v_bitmask
> - = force_reg (V2mode,
> -  gen_const_vec_duplicate (V2mode,
> -   signbit_const));
> -
> -  emit_insn (gen_iorv23 (
> - lowpart_subreg (V2mode, operands[0], mode),
> - lowpart_subreg (V2mode, operands[1], mode),
> +  emit_insn (gen_ior3 (
> + lowpart_subreg (mode, operands[0], mode),
> + lowpart_subreg (mode, operands[1], mode),
>   v_bitmask));
>DONE;
>  }
> -
> -  machine_mode int_mode = mode;
> -  rtx bitmask = gen_reg_rtx (int_mode);
> -  emit_move_insn (bitmask, signbit_const);
>operands[2] = force_reg (mode, operands[2]);
>emit_insn (gen_copysign3_insn (operands[0], operands[1], operands[2],
> -bitmask));
> +v_bitmask));
>DONE;
>  }
>  )
> @@ -7273,23 +7257,21 @@ (define_insn "copysign3_insn"
>[(set (match_operand:GPF 0 "register_operand")
>   (unspec:GPF [(match_operand:GPF 1 "register_operand")
>(match_operand:GPF 2 "register_operand")
> -  (match_operand: 3 "register_operand")]
> +  (match_operand: 3 "register_operand")]
>UNSPEC_COPYSIGN))]
>"TARGET_SIMD"
>{@ [ cons: =0 , 1 , 2 , 3 ; attrs: type  ]
>   [ w, w , w , 0 ; neon_bsl  ] bsl\t%0., %2., 
> %1.
>   [ w, 0 , w , w ; neon_bsl  ] bit\t%0., %2., 
> %3.
>   [ w, w , 0 , w ; neon_bsl  ] bif\t%0., %1., 
> %3.
> - [ r, r , 0 , X ; bfm  ] bfxil\t%0, %1, #0, 
> 
>}
>  )
>  
> -
> -;; For xorsign (x, y), we want to generate:
> +;; For xorsignf (x, y), we want to generate:
>  ;;
> -;; LDR   d2, #1<<63
> -;; AND   v3.8B, v1.8B, v2.8B
> -;; EOR   v0.8B, 

Re: [PATCH 1/2] aarch64: Use standard names for saturating arithmetic

2024-10-23 Thread Richard Sandiford
Akram Ahmad  writes:
> This renames the existing {s,u}q{add,sub} instructions to use the
> standard names {s,u}s{add,sub}3 which are used by IFN_SAT_ADD and
> IFN_SAT_SUB.
>
> The NEON intrinsics for saturating arithmetic and their corresponding
> builtins are changed to use these standard names too.
>
> Using the standard names for the instructions causes 32 and 64-bit
> unsigned scalar saturating arithmetic to use the NEON instructions,
> resulting in an additional (and inefficient) FMOV to be generated when
> the original operands are in GP registers. This patch therefore also
> restores the original behaviour of using the adds/subs instructions
> in this circumstance.
>
> Additional tests are written for the scalar and Adv. SIMD cases to
> ensure that the correct instructions are used. The NEON intrinsics are
> already tested elsewhere.

Thanks for doing this.  The approach looks good.  My main question is:
are we sure that we want to use the Advanced SIMD instructions for
signed saturating SI and DI arithmetic on GPRs?  E.g. for addition,
we only saturate at the negative limit if both operands are negative,
and only saturate at the positive limit if both operands are positive.
So for 32-bit values we can use:

asr tmp, x or y, #31
	eor tmp, tmp, #0x80000000

to calculate the saturation value and:

	adds res, x, y
	csel res, tmp, res, vs

to calculate the full result.  That's the same number of instructions
as two fmovs for the inputs, the sqadd, and the fmov for the result,
but it should be more efficient.

The reason for asking now, rather than treating it as a potential
future improvement, is that it would also avoid splitting the patterns
for signed and unsigned ops.  (The length of the split alternative can be
conservatively set to 16 even for the unsigned version, since nothing
should care in practice.  The split will have happened before
shorten_branches.)

> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-builtins.cc: Expand iterators.
>   * config/aarch64/aarch64-simd-builtins.def: Use standard names
>   * config/aarch64/aarch64-simd.md: Use standard names, split insn
>   definitions on signedness of operator and type of operands.
>   * config/aarch64/arm_neon.h: Use standard builtin names.
>   * config/aarch64/iterators.md: Add VSDQ_I_QI_HI iterator to
>   simplify splitting of insn for unsigned scalar arithmetic.
>
> gcc/testsuite/ChangeLog:
>
>   * 
> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect.inc:
>   Template file for unsigned vector saturating arithmetic tests.
>   * 
> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c:
>   8-bit vector type tests.
>   * 
> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_2.c:
>   16-bit vector type tests.
>   * 
> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_3.c:
>   32-bit vector type tests.
>   * 
> gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_4.c:
>   64-bit vector type tests.
>   * gcc.target/aarch64/saturating_arithmetic.inc: Template file
>   for scalar saturating arithmetic tests.
>   * gcc.target/aarch64/saturating_arithmetic_1.c: 8-bit tests.
>   * gcc.target/aarch64/saturating_arithmetic_2.c: 16-bit tests.
>   * gcc.target/aarch64/saturating_arithmetic_3.c: 32-bit tests.
>   * gcc.target/aarch64/saturating_arithmetic_4.c: 64-bit tests.
> diff --git 
> a/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
>  
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> new file mode 100644
> index 000..63eb21e438b
> --- /dev/null
> +++ 
> b/gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/saturating_arithmetic_autovect_1.c
> @@ -0,0 +1,79 @@
> +/* { dg-do assemble { target { aarch64*-*-* } } } */
> +/* { dg-options "-O2 --save-temps -ftree-vectorize" } */
> +/* { dg-final { check-function-bodies "**" "" "-DCHECK_ASM" } } */
> +
> +/*
> +** uadd_lane: { xfail *-*-* }

Just curious: why does this fail?  Is it a vector costing issue?

> +**   dup\tv([0-9]+).8b, w0
> +**   uqadd\tb([0-9]+), b\1, b0
> +**   umov\tw0, v\2.b\[0]
> +**   ret
> +*/
> +/*
> +** uaddq:
> +** ...
> +**   ldr\tq([0-9]+), .*
> +**   ldr\tq([0-9]+), .*
> +**   uqadd\tv\2.16b, v\1.16b, v\2.16b

Since the operands are commutative, and since there's no restriction
on the choice of destination register, it's probably safer to use:

> +**   uqadd\tv[0-9].16b, (?:v\1.16b, v\2.16b|v\2.16b, v\1.16b)

Similarly for the other qadds.  The qsubs do of course have a fixed
order, but the destination is similarly not restricted, so should use
[0-9]+ rather than \n.

Thanks,
Richard


Re: [PATCH 2/6] aarch64: Use canonical RTL representation for SVE2 XAR and extend it to fixed-width modes

2024-10-23 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Hi all,
>
> The MD pattern for the XAR instruction in SVE2 is currently expressed with
> non-canonical RTL by using a ROTATERT code with a constant rotate amount.
> Fix it by using the left ROTATE code.  This necessitates splitting out the
> expander separately to translate the immediate coming from the intrinsic
> from a right-rotate to a left-rotate immediate.

Could we instead do the translation in aarch64-sve-builtins-sve2.cc?
It should be simpler to adjust there, by modifying the function_expander's
args array.
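
I.e. something along these lines in the intrinsic's expand hook (a sketch
only; the argument index, the use of DImode for the immediate, and the way
of getting the element width are all illustrative):

  /* The intrinsic takes a right-rotate amount, but the canonical pattern
     wants a left rotate, so rewrite the immediate in place.  */
  unsigned int bits = GET_MODE_UNIT_BITSIZE (e.vector_mode (0));
  e.args[2] = gen_int_mode (bits - INTVAL (e.args[2]), DImode);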

> Additionally, as the SVE2 XAR instruction is unpredicated and can handle all
> element sizes from .b to .d, it is a good fit for implementing the XOR+ROTATE
> operation for Advanced SIMD modes where the TARGET_SHA3 cannot be used
> (that can only handle V2DImode operands).  Therefore let's extend the accepted
> modes of the SVE2 pattern to include the 128-bit Advanced SIMD integer modes.

As mentioned in other reply that I sent out-of-order, I think we could
also include the 64-bit modes.

LGTM otherwise FWIW.

Thanks,
Richard

>
> This leads to some tests for the svxar* intrinsics to fail because they now
> simplify to a plain EOR when the rotate amount is the width of the element.
> This simplification is desirable (EOR instructions have better or equal
> throughput than XAR, and they are non-destructive of their input) so the
> tests are adjusted.
>
> For V2DImode XAR operations we should prefer the Advanced SIMD version when
> it is available (TARGET_SHA3) because it is non-destructive, so restrict the
> SVE2 pattern accordingly.  Tests are added to confirm this.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for mainline?
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>   * config/aarch64/iterators.md (SVE_ASIMD_FULL_I): New mode iterator.
>   * config/aarch64/aarch64-sve2.md (@aarch64_sve2_xar): Rename
>   to...
>   (*aarch64_sve2_xar_insn): ... This.  Use SVE_ASIMD_FULL_I
>   iterator and adjust output logic.
>   (@aarch64_sve2_xar): New define_expand.
>
> gcc/testsuite/
>
>   * gcc.target/aarch64/xar_neon_modes.c: New test.
>   * gcc.target/aarch64/xar_v2di_nonsve.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s16.c: Scan for EOR rather than
>   XAR.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s8.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u16.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_u8.c: Likewise.
>
> From 41a7b2bfe69d7fc716b5da969d19185885c6b2bf Mon Sep 17 00:00:00 2001
> From: Kyrylo Tkachov 
> Date: Tue, 22 Oct 2024 03:27:47 -0700
> Subject: [PATCH 2/6] aarch64: Use canonical RTL representation for SVE2 XAR
>  and extend it to fixed-width modes
>
> The MD pattern for the XAR instruction in SVE2 is currently expressed with
> non-canonical RTL by using a ROTATERT code with a constant rotate amount.
> Fix it by using the left ROTATE code.  This necessitates splitting out the
> expander separately to translate the immediate coming from the intrinsic
> from a right-rotate to a left-rotate immediate.
>
> Additionally, as the SVE2 XAR instruction is unpredicated and can handle all
> element sizes from .b to .d, it is a good fit for implementing the XOR+ROTATE
> operation for Advanced SIMD modes where the TARGET_SHA3 cannot be used
> (that can only handle V2DImode operands).  Therefore let's extend the accepted
> modes of the SVE2 pattern to include the 128-bit Advanced SIMD integer modes.
>
> This leads to some tests for the svxar* intrinsics to fail because they now
> simplify to a plain EOR when the rotate amount is the width of the element.
> This simplification is desirable (EOR instructions have better or equal
> throughput than XAR, and they are non-destructive of their input) so the
> tests are adjusted.
>
> For V2DImode XAR operations we should prefer the Advanced SIMD version when
> it is available (TARGET_SHA3) because it is non-destructive, so restrict the
> SVE2 pattern accordingly.  Tests are added to confirm this.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for mainline?
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>   * config/aarch64/iterators.md (SVE_ASIMD_FULL_I): New mode iterator.
>   * config/aarch64/aarch64-sve2.md (@aarch64_sve2_xar): Rename
>   to...
>   (*aarch64_sve2_xar_insn): ... This.  Use SVE_ASIMD_FULL_I
>   iterator and adjust output logic.
>   (@aarch64_sve2_xar): New define_expand.
>
> gcc/testsuite/
>
>   * gcc.target/aarch64/xar_neon_modes.c: New test.
>   * gcc.target/aarch64/xar_v2di_nonsve.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/xar_s16.c: Scan for EOR rather than
>   XAR.
>   * gcc.target/aarch64/sv

Re: [PATCH 5/6] aarch64: Emit XAR for vector rotates where possible

2024-10-23 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Hi all,
>
> We can make use of the integrated rotate step of the XAR instruction
> to implement most vector integer rotates, as long as we zero out one
> of the input registers for it.  This allows for a lower-latency sequence
> than the fallback SHL+USRA, especially when we can hoist the zeroing operation
> away from loops and hot parts.
> We can also use it for 64-bit vectors as long
> as we zero the top half of the vector to be rotated.  That should still be
> preferable to the default sequence.

Is the zeroing necessary?  We don't expect/require that 64-bit vector
modes are maintained in zero-extended form, or that 64-bit ops act as
strict_lowparts, so it should be OK to take a paradoxical subreg.
Or we could just extend the patterns to 64-bit modes, to avoid the
punning.

> With this patch we can generate for the input:
> v4si
> G1 (v4si r)
> {
> return (r >> 23) | (r << 9);
> }
>
> v8qi
> G2 (v8qi r)
> {
>   return (r << 3) | (r >> 5);
> }
> the assembly for +sve2:
> G1:
> movi v31.4s, 0
> xar z0.s, z0.s, z31.s, #23
> ret
>
> G2:
> movi v31.4s, 0
> fmov d0, d0
> xar z0.b, z0.b, z31.b, #5
> ret
>
> instead of the current:
> G1:
> shl v31.4s, v0.4s, 9
> usra v31.4s, v0.4s, 23
> mov v0.16b, v31.16b
> ret
> G2:
> shl v31.8b, v0.8b, 3
> usra v31.8b, v0.8b, 5
> mov v0.8b, v31.8b
> ret
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>   * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
>   generation of XAR sequences when possible.
>
> gcc/testsuite/
>
>   * gcc.target/aarch64/rotate_xar_1.c: New test.
> [...]
> +/*
> +** G1:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 39

FWIW, the (...) captures aren't necessary, since we never use backslash
references to them later.
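
I.e. just:

**   xar v0\.2d, v[0-9]+\.2d, v[0-9]+\.2d, 39

and similarly for the later functions.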

Thanks,
Richard

> +**  ret
> +*/
> +v2di
> +G1 (v2di r) {
> +return (r >> 39) | (r << 25);
> +}
> +
> +/*
> +** G2:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #23
> +**  ret
> +*/
> +v4si
> +G2 (v4si r) {
> +return (r >> 23) | (r << 9);
> +}
> +
> +/*
> +** G3:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #5
> +**  ret
> +*/
> +v8hi
> +G3 (v8hi r) {
> +return (r >> 5) | (r << 11);
> +}
> +
> +/*
> +** G4:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   xar z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #6
> +**  ret
> +*/
> +v16qi
> +G4 (v16qi r)
> +{
> +  return (r << 2) | (r >> 6);
> +}
> +
> +/*
> +** G5:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   fmovd[0-9]+, d[0-9]+
> +**   xar z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #22
> +**  ret
> +*/
> +v2si
> +G5 (v2si r) {
> +return (r >> 22) | (r << 10);
> +}
> +
> +/*
> +** G6:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   fmovd[0-9]+, d[0-9]+
> +**   xar z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #7
> +**  ret
> +*/
> +v4hi
> +G6 (v4hi r) {
> +return (r >> 7) | (r << 9);
> +}
> +
> +/*
> +** G7:
> +**   movi?   [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +**   fmovd[0-9]+, d[0-9]+
> +**   xar z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #5
> +**  ret
> +*/
> +v8qi
> +G7 (v8qi r)
> +{
> +  return (r << 3) | (r >> 5);
> +}
> +


Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

2024-10-23 Thread Richard Sandiford
Kyrylo Tkachov  writes:
> Hi all,
>
> Some vector rotate operations can be implemented in a single instruction
> rather than using the fallback SHL+USRA sequence.
> In particular, when the rotate amount is half the bitwidth of the element
> we can use a REV64,REV32,REV16 instruction.
> This patch adds this transformation in the recently added splitter for vector
> rotates.
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>   * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>   Declare prototype.
>   * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>   * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
>   Call the above.
>
> gcc/testsuite/
>
>   * gcc.target/aarch64/simd/pr117048_2.c: New test.

Sorry to be awkward, but I still think at least part of this should be
target-independent.  Any rotate by a byte amount can be expressed as a
vector permutation in a target-independent way.  Target-independent code
can then use the usual optab routines to query whether the permutation
is possible and/or try to generate it.

I can see that it probably makes sense to leave target code to make
the decision about when to use the permutation strategy vs. other
approaches.  But the code to implement that strategy shouldn't need
to be target-specific.

E.g. we could have a routine:

  expand_rotate_as_vec_perm

which checks whether the rotation amount is suitable and tries to
generate the permutation if so.
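
(To make the equivalence concrete: for a little-endian lane of S bytes,
rotating left by 8*r bits just moves input byte (k - r) mod S to output
byte k, which is exactly a constant VEC_PERM selector.  A standalone
illustration, not GCC code:

  #include <stdint.h>
  #include <stdio.h>

  int
  main (void)
  {
    uint32_t x = 0x44332211;	/* bytes 11 22 33 44, LSB first */
    unsigned r = 1;		/* rotate left by one byte */

    uint8_t in[4], out[4];
    for (int k = 0; k < 4; k++)
      in[k] = (x >> (8 * k)) & 0xff;
    for (int k = 0; k < 4; k++)
      out[k] = in[(k + 4 - r) % 4];	/* the byte permutation */

    uint32_t perm = 0;
    for (int k = 0; k < 4; k++)
      perm |= (uint32_t) out[k] << (8 * k);

    /* Both print 33221144.  */
    printf ("%08x %08x\n", (x << 8 * r) | (x >> (32 - 8 * r)), perm);
    return 0;
  }

so the routine would only need to build that selector and hand it to the
target's existing vec_perm support.)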

Thanks,
Richard

> From e97509382b6bb755336ec4aa220fabd968e69502 Mon Sep 17 00:00:00 2001
> From: Kyrylo Tkachov 
> Date: Wed, 16 Oct 2024 04:10:08 -0700
> Subject: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions
>  where possible
>
> Some vector rotate operations can be implemented in a single instruction
> rather than using the fallback SHL+USRA sequence.
> In particular, when the rotate amount is half the bitwidth of the element
> we can use a REV64,REV32,REV16 instruction.
> This patch adds this transformation in the recently added splitter for vector
> rotates.
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Signed-off-by: Kyrylo Tkachov 
>
> gcc/
>
>   * config/aarch64/aarch64-protos.h (aarch64_emit_opt_vec_rotate):
>   Declare prototype.
>   * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Implement.
>   * config/aarch64/aarch64-simd.md (*aarch64_simd_rotate_imm):
>   Call the above.
>
> gcc/testsuite/
>
>   * gcc.target/aarch64/simd/pr117048_2.c: New test.
> ---
>  gcc/config/aarch64/aarch64-protos.h   |  1 +
>  gcc/config/aarch64/aarch64-simd.md|  3 +
>  gcc/config/aarch64/aarch64.cc | 49 ++
>  .../gcc.target/aarch64/simd/pr117048_2.c  | 66 +++
>  4 files changed, 119 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/pr117048_2.c
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index d03c1fe798b..da0e657a513 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -776,6 +776,7 @@ bool aarch64_rnd_imm_p (rtx);
>  bool aarch64_constant_address_p (rtx);
>  bool aarch64_emit_approx_div (rtx, rtx, rtx);
>  bool aarch64_emit_approx_sqrt (rtx, rtx, bool);
> +bool aarch64_emit_opt_vec_rotate (rtx, rtx, rtx);
>  tree aarch64_vector_load_decl (tree);
>  rtx aarch64_gen_callee_cookie (aarch64_isa_mode, arm_pcs);
>  void aarch64_expand_call (rtx, rtx, rtx, bool);
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 543179d9fce..44c40512f30 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1313,6 +1313,9 @@
>   (match_dup 4))
> (match_dup 3)))]
>{
> +if (aarch64_emit_opt_vec_rotate (operands[0], operands[1], operands[2]))
> +  DONE;
> +
>  operands[3] = reload_completed ? operands[0] : gen_reg_rtx (mode);
>  rtx shft_amnt = unwrap_const_vec_duplicate (operands[2]);
>  int bitwidth = GET_MODE_UNIT_SIZE (mode) * BITS_PER_UNIT;
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 21d9a6b5a20..47859c4e31b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -15998,6 +15998,55 @@ aarch64_emit_approx_div (rtx quo, rtx num, rtx den)
>return true;
>  }
>  
> +/* Emit an optimized sequence to perform a vector rotate
> +   of REG by the vector constant amount AMNT and place the result
> +   in DST.  Return true iff successful.  */
> +
> +bool
> +aarch64_emit_opt_vec_rotate (rtx dst, rtx reg, rtx amnt)
> +{
> +  amnt = unwrap_const_vec_duplicate (amnt);
> +  gcc_assert (CONST_INT_P (amnt));
> +  HOST_WIDE_INT rotamnt = UINTVAL (amnt);
> +  machine_mode mode = GET_MODE (reg);
> +  /* Rotates by half the element width map down to REV* instructions.  */
> +  if (rotamnt == G

Re: [PATCH 3/3] AArch64: Add support for SIMD xor immediate

2024-10-22 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Add support for SVE xor immediate when generating AdvSIMD code and SVE is 
> available.
>
> Passes bootstrap & regress, OK for commit?
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64.cc (enum simd_immediate_check): Add 
> AARCH64_CHECK_XOR.
> (aarch64_simd_valid_xor_imm): New function.
> (aarch64_output_simd_imm): Add AARCH64_CHECK_XOR support.
> (aarch64_output_simd_xor_imm): New function.
> * config/aarch64/aarch64-protos.h (aarch64_output_simd_xor_imm): New 
> prototype.
> (aarch64_simd_valid_xor_imm): New prototype.
> * config/aarch64/aarch64-simd.md (xor3):
> Use aarch64_reg_or_xor_imm predicate and add an immediate alternative.
> * config/aarch64/predicates.md (aarch64_reg_or_xor_imm): Add new 
> predicate.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/sve/simd_imm.c: New test.

OK, thanks.

Richard

> ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> 3f2d40603426a590a0a14ba4792fe9b325d1e585..16ab79c02da62c1a8aa03309708dfe401d1ffb7e
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -827,6 +827,7 @@ char *aarch64_output_scalar_simd_mov_immediate (rtx, 
> scalar_int_mode);
>  char *aarch64_output_simd_mov_imm (rtx, unsigned);
>  char *aarch64_output_simd_orr_imm (rtx, unsigned);
>  char *aarch64_output_simd_and_imm (rtx, unsigned);
> +char *aarch64_output_simd_xor_imm (rtx, unsigned);
>  
>  char *aarch64_output_sve_mov_immediate (rtx);
>  char *aarch64_output_sve_ptrues (rtx);
> @@ -844,6 +845,7 @@ bool aarch64_sve_ptrue_svpattern_p (rtx, struct 
> simd_immediate_info *);
>  bool aarch64_simd_valid_and_imm (rtx);
>  bool aarch64_simd_valid_mov_imm (rtx);
>  bool aarch64_simd_valid_orr_imm (rtx);
> +bool aarch64_simd_valid_xor_imm (rtx);
>  bool aarch64_valid_sysreg_name_p (const char *);
>  const char *aarch64_retrieve_sysreg (const char *, bool, bool);
>  rtx aarch64_check_zero_based_sve_index_immediate (rtx);
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 5c1de57ce6c3f2064d8be25f903a6a8d949685ef..18795a08b61da874a9e811822ed82e7eb9350bb4
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1144,12 +1144,16 @@ (define_insn "ior3"
>[(set_attr "type" "neon_logic")]
>  )
>  
> +;; For EOR (vector, register) and SVE EOR (vector, immediate)
>  (define_insn "xor3"
> -  [(set (match_operand:VDQ_I 0 "register_operand" "=w")
> -(xor:VDQ_I (match_operand:VDQ_I 1 "register_operand" "w")
> -  (match_operand:VDQ_I 2 "register_operand" "w")))]
> +  [(set (match_operand:VDQ_I 0 "register_operand")
> +(xor:VDQ_I (match_operand:VDQ_I 1 "register_operand")
> +   (match_operand:VDQ_I 2 "aarch64_reg_or_xor_imm")))]
>"TARGET_SIMD"
> -  "eor\t%0., %1., %2."
> +  {@ [ cons: =0 , 1 , 2  ]
> + [ w, w , w  ] eor\t%0., %1., %2.
> + [ w, 0 , Do ] << aarch64_output_simd_xor_imm (operands[2], 
> );
> +  }
>[(set_attr "type" "neon_logic")]
>  )
>  
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 1a228147e6f945772edbd5540c44167e3a876a74..c019f21e39d9773746792d5885fa0f6805f9bb44
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -134,7 +134,8 @@ constexpr auto AARCH64_STATE_OUT = 1U << 2;
>  enum simd_immediate_check {
>AARCH64_CHECK_MOV,
>AARCH64_CHECK_ORR,
> -  AARCH64_CHECK_AND
> +  AARCH64_CHECK_AND,
> +  AARCH64_CHECK_XOR
>  };
>  
>  /* Information about a legitimate vector immediate operand.  */
> @@ -23320,6 +23321,13 @@ aarch64_simd_valid_and_imm (rtx op)
>return aarch64_simd_valid_imm (op, NULL, AARCH64_CHECK_AND);
>  }
>  
> +/* Return true if OP is a valid SIMD xor immediate for SVE.  */
> +bool
> +aarch64_simd_valid_xor_imm (rtx op)
> +{
> +  return aarch64_simd_valid_imm (op, NULL, AARCH64_CHECK_XOR);
> +}
> +
>  /* Check whether X is a VEC_SERIES-like constant that starts at 0 and
> has a step in the range of INDEX.  Return the index expression if so,
> otherwise return null.  */
> @@ -25503,10 +25511,12 @@ aarch64_output_simd_imm (rtx const_vector, unsigned 
> width,
>  }
>else
>  {
> -  /* AARCH64_CHECK_ORR or AARCH64_CHECK_AND.  */
> +  /* AARCH64_CHECK_ORR, AARCH64_CHECK_AND or AARCH64_CHECK_XOR.  */
>mnemonic = "orr";
>if (which == AARCH64_CHECK_AND)
>   mnemonic = info.insn == simd_immediate_info::MVN ? "bic" : "and";
> +  else if (which == AARCH64_CHECK_XOR)
> + mnemonic = "eor";
>  
>if (info.insn == simd_immediate_info::SVE_MOV)
>   {
> @@ -25544,6 +25554,14 @@ aarch64_output_simd_and_imm (rtx const_vector, 
> unsigned width)
>return aarch64_output_simd_imm (const_vector, width, AARCH64_CHECK_AND);
>  }
>  
> +/* Returns the string with the EOR ins

Re: [PATCH 2/2] AArch64: Improve SIMD immediate generation

2024-10-22 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Allow use of SVE immediates when generating AdvSIMD code and SVE is available.
> First check for a valid AdvSIMD immediate, and if SVE is available, try using
> an SVE move or bitmask immediate.
>
> Passes bootstrap & regress, OK for commit?
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-simd.md (ior3):
> Use aarch64_reg_or_orr_imm predicate.  Combine SVE/AdvSIMD immediates
> and use aarch64_output_simd_orr_imm.
> * config/aarch64/aarch64.cc (struct simd_immediate_info): Add SVE_MOV 
> enum.
> (aarch64_sve_valid_immediate): Use SVE_MOV for SVE move immediates.
> (aarch64_simd_valid_imm): Enable SVE SIMD immediates when possible.
> (aarch64_output_simd_imm): Support emitting SVE SIMD immediates. 
> * config/aarch64/predicates.md (aarch64_orr_imm_sve_advsimd): Remove.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/sve/acle/asm/insr_s64.c: Allow SVE MOV imm.
> * gcc.target/aarch64/sve/acle/asm/insr_u64.c: Likewise.

Previously we allowed a move into a GPR and an INSR from there, but I agree
that we shouldn't continue to allow that now that it isn't used.  It's
better to "defend" the lack of a cross-file transfer.

The patch also has the effect of turning things like:

typedef int v4si __attribute__((vector_size(16)));
v4si f() { return (v4si) { 0xffc, 0xffc, 0xffc, 0xffc }; }

from:

adrpx0, .LC0
ldr q0, [x0, #:lo12:.LC0]
ret
...
.LC0:
.word   4092
.word   4092
.word   4092
.word   4092

to:

mov z0.s, #4092
ret

I think we should have some tests for that too, again to the "defend"
the improvement.

OK with a test along those lines (for a few different variations).
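
Something like this would do as a template (the dg-options and the exact
set of variants are illustrative):

/* { dg-do compile } */
/* { dg-options "-O2 -march=armv8.2-a+sve" } */
/* { dg-final { check-function-bodies "**" "" } } */

typedef int v4si __attribute__((vector_size(16)));

/*
** f:
**	mov	z0\.s, #4092
**	ret
*/
v4si f() { return (v4si) { 0xffc, 0xffc, 0xffc, 0xffc }; }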

Thanks,
Richard

> * gcc.target/aarch64/sve/fneg-abs_1.c: Update to check for ORRI.
> * gcc.target/aarch64/sve/fneg-abs_2.c: Likewise.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 6eeb5aa4871eceabb8e46e52bd63f0aa634b9f3d..2e9f30b9bf50eec7a575f4e5037d3350f7ebc95a
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1135,13 +1135,11 @@ (define_insn "and3"
>  (define_insn "ior3"
>[(set (match_operand:VDQ_I 0 "register_operand")
>   (ior:VDQ_I (match_operand:VDQ_I 1 "register_operand")
> -(match_operand:VDQ_I 2 "aarch64_orr_imm_sve_advsimd")))]
> +(match_operand:VDQ_I 2 "aarch64_reg_or_orr_imm")))]
>"TARGET_SIMD"
> -  {@ [ cons: =0 , 1 , 2; attrs: arch ]
> - [ w, w , w  ; simd  ] orr\t%0., %1., 
> %2.
> - [ w, 0 , vsl; sve   ] orr\t%Z0., %Z0., #%2
> - [ w, 0 , Do ; simd  ] \
> -   << aarch64_output_simd_orr_imm (operands[2], );
> +  {@ [ cons: =0 , 1 , 2  ]
> + [ w, w , w  ] orr\t%0., %1., %2.
> + [ w, 0 , Do ] << aarch64_output_simd_orr_imm (operands[2], 
> );
>}
>[(set_attr "type" "neon_logic")]
>  )
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> d38345770ebab15cf872c24b3ec8ab8cc5cce3e7..7c656476c4974529ae71a6d73328a0cd68dd5ef8
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -140,7 +140,7 @@ enum simd_immediate_check {
>  /* Information about a legitimate vector immediate operand.  */
>  struct simd_immediate_info
>  {
> -  enum insn_type { MOV, MVN, INDEX, PTRUE };
> +  enum insn_type { MOV, MVN, INDEX, PTRUE, SVE_MOV };
>enum modifier_type { LSL, MSL };
>  
>simd_immediate_info () {}
> @@ -22987,14 +22987,16 @@ aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT 
> val64,
>   {
> /* DUP with no shift.  */
> if (info)
> - *info = simd_immediate_info (mode, val);
> + *info = simd_immediate_info (mode, val,
> +  simd_immediate_info::SVE_MOV);
> return true;
>   }
>if ((val & 0xff) == 0 && IN_RANGE (val, -0x8000, 0x7f00))
>   {
> /* DUP with LSL #8.  */
> if (info)
> - *info = simd_immediate_info (mode, val);
> + *info = simd_immediate_info (mode, val,
> +  simd_immediate_info::SVE_MOV);
> return true;
>   }
>  }
> @@ -23002,7 +23004,7 @@ aarch64_sve_valid_immediate (unsigned HOST_WIDE_INT 
> val64,
>  {
>/* DUPM.  */
>if (info)
> - *info = simd_immediate_info (mode, val);
> + *info = simd_immediate_info (mode, val, simd_immediate_info::SVE_MOV);
>return true;
>  }
>return false;
> @@ -23209,8 +23211,13 @@ aarch64_simd_valid_imm (rtx op, simd_immediate_info 
> *info,
>  
>if (vec_flags & VEC_SVE_DATA)
>  return aarch64_sve_valid_immediate (val64, info, which);
> -  else
> -return aarch64_advsimd_valid_immediate (val64, info, which);
> +
> +  if (aarch64_advsimd_valid_immedi

Re: [PATCH 1/2] AArch64: Improve SIMD immediate generation

2024-10-22 Thread Richard Sandiford
Wilco Dijkstra  writes:
> Cleanup the various interfaces related to SIMD immediate generation.  
> Introduce new functions
> that make it clear which operation (AND, OR, MOV) we are testing for rather 
> than guessing the
> final instruction.  Reduce the use of overly long names, unused and default 
> parameters for
> clarity.  No changes to internals or generated code.
>
> Passes regress & bootstrap, OK for commit?

Nice cleanup!  OK with the obvious fix-ups after Tamar's patch.

Thanks,
Richard

>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-protos.h (enum simd_immediate_check): Move 
> to aarch64.cc.
> (aarch64_output_simd_mov_immediate): Remove.
> (aarch64_output_simd_mov_imm): New prototype.
> (aarch64_output_simd_orr_imm): Likewise.
> (aarch64_output_simd_and_imm): Likewise.
> (aarch64_simd_valid_immediate): Remove.
> (aarch64_simd_valid_and_imm): New prototype.
> (aarch64_simd_valid_mov_imm): Likewise.
> (aarch64_simd_valid_orr_imm): Likewise.
> * config/aarch64/aarch64-simd.md: Use aarch64_output_simd_mov_imm.
> * config/aarch64/aarch64.cc (enum simd_immediate_check): Moved from 
> aarch64-protos.h.
> Use AARCH64_CHECK_AND rather than AARCH64_CHECK_BIC.
> (aarch64_expand_sve_const_vector): Use aarch64_simd_valid_mov_imm.
> (aarch64_expand_mov_immediate): Likewise.
> (aarch64_can_const_movi_rtx_p): Likewise.
> (aarch64_secondary_reload): Likewise.
> (aarch64_legitimate_constant_p): Likewise.
> (aarch64_advsimd_valid_immediate): Simplify checks on 'which' param.
> (aarch64_sve_valid_immediate): Add extra param for move vs logical.
> (aarch64_simd_valid_immediate): Rename to aarch64_simd_valid_imm.
> (aarch64_simd_valid_mov_imm): New function.
> (aarch64_simd_valid_orr_imm): Likewise.
> (aarch64_simd_valid_and_imm): Likewise.
> (aarch64_mov_operand_p): Use aarch64_simd_valid_mov_imm.
> (aarch64_simd_scalar_immediate_valid_for_move): Likewise.
> (aarch64_simd_make_constant): Likewise.
> (aarch64_expand_vector_init_fallback): Likewise.
> (aarch64_output_simd_mov_immediate): Rename to 
> aarch64_output_simd_imm.
> (aarch64_output_simd_orr_imm): New function.
> (aarch64_output_simd_and_imm): Likewise.
> (aarch64_output_simd_mov_imm): Likewise.
> (aarch64_output_scalar_simd_mov_immediate): Use 
> aarch64_output_simd_mov_imm.
> (aarch64_output_sve_mov_immediate): Use aarch64_simd_valid_imm.
> (aarch64_output_sve_ptrues): Likewise.
> * config/aarch64/constraints.md (Do): Use aarch64_simd_valid_orr_imm.
> (Db): Use aarch64_simd_valid_and_imm.
> * config/aarch64/predicates.md (aarch64_reg_or_bic_imm): Use 
> aarch64_simd_valid_orr_imm.
> (aarch64_reg_or_and_imm): Use aarch64_simd_valid_and_imm.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h 
> b/gcc/config/aarch64/aarch64-protos.h
> index 
> d03c1fe798b2ccc2258b8581473a6eb7dc4af850..e789ca9358341363b976988f01d7c7c7aa88cfe4
>  100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -665,16 +665,6 @@ enum aarch64_extra_tuning_flags
>AARCH64_EXTRA_TUNE_ALL = (1u << AARCH64_EXTRA_TUNE_index_END) - 1
>  };
>  
> -/* Enum to distinguish which type of check is to be done in
> -   aarch64_simd_valid_immediate.  This is used as a bitmask where
> -   AARCH64_CHECK_MOV has both bits set.  Thus AARCH64_CHECK_MOV will
> -   perform all checks.  Adding new types would require changes accordingly.  
> */
> -enum simd_immediate_check {
> -  AARCH64_CHECK_ORR  = 1 << 0,
> -  AARCH64_CHECK_BIC  = 1 << 1,
> -  AARCH64_CHECK_MOV  = AARCH64_CHECK_ORR | AARCH64_CHECK_BIC
> -};
> -
>  extern struct tune_params aarch64_tune_params;
>  
>  /* The available SVE predicate patterns, known in the ACLE as "svpattern".  
> */
> @@ -834,8 +824,10 @@ char *aarch64_output_sve_rdvl (rtx);
>  char *aarch64_output_sve_addvl_addpl (rtx);
>  char *aarch64_output_sve_vector_inc_dec (const char *, rtx);
>  char *aarch64_output_scalar_simd_mov_immediate (rtx, scalar_int_mode);
> -char *aarch64_output_simd_mov_immediate (rtx, unsigned,
> - enum simd_immediate_check w = AARCH64_CHECK_MOV);
> +char *aarch64_output_simd_mov_imm (rtx, unsigned);
> +char *aarch64_output_simd_orr_imm (rtx, unsigned);
> +char *aarch64_output_simd_and_imm (rtx, unsigned);
> +
>  char *aarch64_output_sve_mov_immediate (rtx);
>  char *aarch64_output_sve_ptrues (rtx);
>  bool aarch64_pad_reg_upward (machine_mode, const_tree, bool);
> @@ -849,8 +841,9 @@ bool aarch64_pars_overlap_p (rtx, rtx);
>  bool aarch64_simd_scalar_immediate_valid_for_move (rtx, scalar_int_mode);
>  bool aarch64_simd_shift_imm_p (rtx, machine_mode, bool);
>  bool aarch64_sve_ptrue_svpattern_p (rtx, struct simd_immediate_info *);
> -bool aarch64_simd_valid_immediate (rtx, struct 

Re: [PATCH v2] aarch64: Add support for Ampere-1B (-mcpu=ampere1b) CPU

2024-10-22 Thread Richard Sandiford
Philipp Tomsich  writes:
> We just noticed that we didn't request to backport this one…
> OK for backport?

OK for gcc 13.  I'm nervous about backporting to the most stable
branch after the gcc 11 experience. :)

Thanks,
Richard

>
> On Thu, 30 Nov 2023 at 00:55, Philipp Tomsich 
> wrote:
>
>> Applied to master, thanks!
>> Philipp.
>>
>> On Tue, 28 Nov 2023 at 12:57, Richard Sandiford
>>  wrote:
>> >
>> > Philipp Tomsich  writes:
>> > > On Tue, 28 Nov 2023 at 12:21, Richard Sandiford
>> > >  wrote:
>> > >>
>> > >> Philipp Tomsich  writes:
>> > >> > This patch adds initial support for Ampere-1B core.
>> > >> >
>> > >> > The Ampere-1B core implements ARMv8.7 with the following (compiler
>> > >> > visible) extensions:
>> > >> >  - CSSC (Common Short Sequence Compression instructions),
>> > >> >  - MTE (Memory Tagging Extension)
>> > >> >  - SM3/SM4
>> > >> >
>> > >> > gcc/ChangeLog:
>> > >> >
>> > >> >   * config/aarch64/aarch64-cores.def (AARCH64_CORE): Add
>> ampere-1b
>> > >> >   * config/aarch64/aarch64-cost-tables.h: Add
>> ampere1b_extra_costs
>> > >> >   * config/aarch64/aarch64-tune.md: Regenerate
>> > >> >   * config/aarch64/aarch64.cc: Include ampere1b tuning model
>> > >> >   * doc/invoke.texi: Document -mcpu=ampere1b
>> > >> >   * config/aarch64/tuning_models/ampere1b.h: New file.
>> > >>
>> > >> OK, thanks, but:
>> > >>
>> > >> >
>> > >> > Signed-off-by: Philipp Tomsich 
>> > >> > ---
>> > >> >
>> > >> > Changes in v2:
>> > >> > - moved ampere1b model to a separated file
>> > >> > - regenerated aarch64-tune.md after rebase
>> > >> >
>> > >> >  gcc/config/aarch64/aarch64-cores.def|   1 +
>> > >> >  gcc/config/aarch64/aarch64-cost-tables.h| 107 ++
>> > >> >  gcc/config/aarch64/aarch64-tune.md  |   2 +-
>> > >> >  gcc/config/aarch64/aarch64.cc   |   1 +
>> > >> >  gcc/config/aarch64/tuning_models/ampere1b.h | 114
>> > >> >  gcc/doc/invoke.texi |   2 +-
>> > >> >  6 files changed, 225 insertions(+), 2 deletions(-)
>> > >> >  create mode 100644 gcc/config/aarch64/tuning_models/ampere1b.h
>> > >> >
>> > >> > diff --git a/gcc/config/aarch64/aarch64-cores.def b/gcc/config/aarch64/aarch64-cores.def
>> > >> > index 16752b77f4b..ad896a80f1f 100644
>> > >> > --- a/gcc/config/aarch64/aarch64-cores.def
>> > >> > +++ b/gcc/config/aarch64/aarch64-cores.def
>> > >> > @@ -74,6 +74,7 @@ AARCH64_CORE("thunderxt83",   thunderxt83,  thunderx,  V8A,  (CRC, CRYPTO), thu
>> > >> >  /* Ampere Computing ('\xC0') cores. */
>> > >> >  AARCH64_CORE("ampere1", ampere1, cortexa57, V8_6A, (F16, RNG, AES, SHA3), ampere1, 0xC0, 0xac3, -1)
>> > >> >  AARCH64_CORE("ampere1a", ampere1a, cortexa57, V8_6A, (F16, RNG, AES, SHA3, SM4, MEMTAG), ampere1a, 0xC0, 0xac4, -1)
>> > >> > +AARCH64_CORE("ampere1b", ampere1b, cortexa57, V8_7A, (F16, RNG, AES, SHA3, SM4, MEMTAG, CSSC), ampere1b, 0xC0, 0xac5, -1)
>> > >> >  /* Do not swap around "emag" and "xgene1",
>> > >> > this order is required to handle variant correctly. */
>> > >> >  AARCH64_CORE("emag",emag,  xgene1,V8A,  (CRC, CRYPTO), emag, 0x50, 0x000, 3)
>> > >> > diff --git a/gcc/config/aarch64/aarch64-cost-tables.h b/gcc/config/aarch64/aarch64-cost-tables.h
>> > >> > index 0cb638f3a13..4c8da7f119b 100644
>> > >> > --- a/gcc/config/aarch64/aarch64-cost-tables.h
>> > >> > +++ b/gcc/config/aarch64/aarch64-cost-tables.h
>> > >> > @@ -882,4 +882,111 @@ const struct cpu_cost_table ampere1a_extra_costs =
>> > >> >}
>> > >> >  };
>> > >> >
>> > >> > +const struct cpu_cost_table ampere1b_extra_costs =
>> > >> > +{
>> > &

Re: [PATCH v2 5/8] aarch64: Add masked-load else operands.

2024-10-22 Thread Richard Sandiford
Robin Dapp  writes:
> This adds zero else operands to masked loads and their intrinsics.
> I needed to adjust more than initially thought because we rely on
> combine for several instructions and a change in a "base" pattern
> needs to propagate to all those.

Looks less invasive than I'd feared though -- nice!

> For the lack of a better idea I used a function call property to specify
> whether a builtin needs an else operand or not.  Somebody with better
> knowledge of the aarch64 target can surely improve that.

Yeah, those flags are really for source-level/gimple-level attributes.
Would it work to pass a new parameter to use_contiguous_load instead?

> [...]
> @@ -1505,10 +1506,16 @@ public:
>{
>  insn_code icode;
>  if (e.vectors_per_tuple () == 1)
> -  icode = convert_optab_handler (maskload_optab,
> -  e.vector_mode (0), e.gp_mode (0));
> +  {
> + icode = convert_optab_handler (maskload_optab,
> +e.vector_mode (0), e.gp_mode (0));
> + e.args.quick_push (CONST0_RTX (e.vector_mode (0)));
> +  }
>  else
> -  icode = code_for_aarch64 (UNSPEC_LD1_COUNT, e.tuple_mode (0));
> +  {
> + icode = code_for_aarch64 (UNSPEC_LD1_COUNT, e.tuple_mode (0));
> + e.args.quick_push (CONST0_RTX (e.tuple_mode (0)));
> +  }
>  return e.use_contiguous_load_insn (icode);
>}
>  };

For the record, I don't think we strictly need the zeros on LD1_COUNT
and LD1NT_COUNT.  But I agree it's probably better to add them anyway,
for consistency.

(So please keep this part of the patch.  Just saying the above to show
that I'd thought about it.)

> @@ -1335,6 +1340,27 @@ (define_insn "vec_mask_load_lanes"
>  
>  ;; Predicated load and extend, with 8 elements per 128-bit block.
>  (define_insn_and_rewrite "@aarch64_load_"
> +  [(set (match_operand:SVE_HSDI 0 "register_operand" "=w")
> + (unspec:SVE_HSDI
> +   [(match_operand: 3 "general_operand" "UplDnm")
> +(ANY_EXTEND:SVE_HSDI
> +  (unspec:SVE_PARTIAL_I
> +[(match_operand: 2 "register_operand" "Upl")
> + (match_operand:SVE_PARTIAL_I 1 "memory_operand" "m")
> + (match_operand:SVE_PARTIAL_I 4 "aarch64_maskload_else_operand")]
> +SVE_PRED_LOAD))]
> +   UNSPEC_PRED_X))]
> +  "TARGET_SVE && (~ & ) == 
> 0"
> +  "ld1\t%0., %2/z, %1"
> +  "&& !CONSTANT_P (operands[3])"
> +  {
> +operands[3] = CONSTM1_RTX (mode);
> +  }
> +)
> +
> +;; Same as above without the maskload_else_operand to still allow combine to
> +;; match a sign-extended pred_mov pattern.
> +(define_insn_and_rewrite "*aarch64_load__mov"
>[(set (match_operand:SVE_HSDI 0 "register_operand" "=w")
>   (unspec:SVE_HSDI
> [(match_operand: 3 "general_operand" "UplDnm")

Splitting the patterns is the right thing to do, but it also makes
SVE_PRED_LOAD redundant.  The pattern above with the else operand
should use UNSPEC_LD1_SVE in place of SVE_PRED_LOAD.  The version
without should use UNSPEC_PRED_X (and I think can be an unnamed pattern,
starting with "*").

This would make SVE_PRED_LOAD and pred_load redundant, so they can
be removed.  The caller in svld1_extend_impl would no longer pass
UNSPEC_LD1_SVE.

Sorry about the churn.  Matching the load and move patterns in one go
seemed like a nice bit of factoring at the time, but this patch makes
it look like a factoring too far.

Otherwise it looks good.  Thanks for doing this.

Richard


Re: [PATCH v2 2/8] ifn: Add else-operand handling.

2024-10-22 Thread Richard Sandiford
I agree with Richard's comments, but a couple more:

Robin Dapp  writes:
> @@ -362,6 +363,23 @@ add_mask_and_len_args (expand_operand *ops, unsigned int opno, gcall *stmt)
>  
>create_input_operand (&ops[opno++], mask_rtx,
>   TYPE_MODE (TREE_TYPE (mask)));
> +

Nit: unnecessary blank line.

> [...]
> +/* Return true if the else value ELSE_VAL (one of MASK_LOAD_ELSE_ZERO,
> +   MASK_LOAD_ELSE_M1, and MASK_LOAD_ELSE_UNDEFINED) is valid for the optab
> +   referred to by ICODE.  The index of the else operand must be specified
> +   in ELSE_INDEX.  */
> +
> +bool
> +supported_else_val_p (enum insn_code icode, unsigned else_index, int else_val)
> +{
> +  if (else_val != MASK_LOAD_ELSE_ZERO && else_val != MASK_LOAD_ELSE_M1
> +  && else_val != MASK_LOAD_ELSE_UNDEFINED)
> +__builtin_unreachable ();

gcc_unreachable (), so that it's a noisy failure when checking is enabled
(and so that it works on host compilers that don't provide
__builtin_unreachable).
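
That is, the guard would become something like (same condition as in the
patch, just using the internal macro):

  if (else_val != MASK_LOAD_ELSE_ZERO
      && else_val != MASK_LOAD_ELSE_M1
      && else_val != MASK_LOAD_ELSE_UNDEFINED)
    gcc_unreachable ();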

Thanks,
Richard


Re: [PATCH v2 1/8] docs: Document maskload else operand and behavior.

2024-10-22 Thread Richard Sandiford
Robin Dapp  writes:
> This patch amends the documentation for masked loads (maskload,
> vec_mask_load_lanes, and mask_gather_load as well as their len
> counterparts) with an else operand.
>
> gcc/ChangeLog:
>
>   * doc/md.texi: Document masked load else operand.
> ---
>  gcc/doc/md.texi | 63 -
>  1 file changed, 41 insertions(+), 22 deletions(-)
>

Looks good, just noticed one texi-ism:

> [...]
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 603f74a78c0..632b036b36c 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5368,8 +5381,13 @@ Operands 4 and 5 have a target-dependent scalar integer mode.
>  @cindex @code{maskload@var{m}@var{n}} instruction pattern
>  @item @samp{maskload@var{m}@var{n}}
>  Perform a masked load of vector from memory operand 1 of mode @var{m}
> -into register operand 0.  Mask is provided in register operand 2 of
> -mode @var{n}.
> +into register operand 0.  The mask is provided in register operand 2 of
> +mode @var{n}.  Operand 3 (the "else value") is of mode @var{m} and

``else value''

OK with that change, thanks.

Richard


Re: [PATCH] SVE intrinsics: Fold svsra with op1 all zeros to svlsr/svasr.

2024-10-22 Thread Richard Sandiford
Jennifer Schmitz  writes:
> A common idiom in intrinsics loops is to have accumulator intrinsics
> in an unrolled loop with an accumulator initialized to zero at the beginning.
> Propagating the initial zero accumulator into the first iteration
> of the loop and simplifying the first accumulate instruction is a
> desirable transformation that we should teach GCC.
> Therefore, this patch folds svsra to svlsr/svasr if op1 is all zeros,
> producing the lower latency instructions LSR/ASR instead of USRA/SSRA.
> We implemented this optimization in svsra_impl::fold.
> Because svlsr/svasr are predicated intrinsics, we added a ptrue
> predicate. Additionally, the width of the shift amount (imm3) was
> adjusted to fit the function type.
> In order to create the ptrue predicate, a new helper function
> build_ptrue was added. We also refactored gimple_folder::fold_to_ptrue
> to use the new helper function.
>
> Tests were added to check the produced assembly for use of LSR/ASR.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz 
>
> gcc/
>   * config/aarch64/aarch64-sve-builtins-sve2.cc
>   (svsra_impl::fold): Fold svsra to svlsr/svasr if op1 is all zeros.
>   * config/aarch64/aarch64-sve-builtins.cc (build_ptrue): New
>   function that returns a ptrue tree.
>   (gimple_folder::fold_to_ptrue): Refactor to use build_ptrue.
>   * config/aarch64/aarch64-sve-builtins.h: Declare build_ptrue.
>
> gcc/testsuite/
>   * gcc.target/aarch64/sve2/acle/asm/sra_s32.c: New test.
>   * gcc.target/aarch64/sve2/acle/asm/sra_s64.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/sra_u32.c: Likewise.
>   * gcc.target/aarch64/sve2/acle/asm/sra_u64.c: Likewise.
> ---
>  .../aarch64/aarch64-sve-builtins-sve2.cc  | 29 +++
>  gcc/config/aarch64/aarch64-sve-builtins.cc| 28 +++---
>  gcc/config/aarch64/aarch64-sve-builtins.h |  1 +
>  .../aarch64/sve2/acle/asm/sra_s32.c   |  9 ++
>  .../aarch64/sve2/acle/asm/sra_s64.c   |  9 ++
>  .../aarch64/sve2/acle/asm/sra_u32.c   |  9 ++
>  .../aarch64/sve2/acle/asm/sra_u64.c   |  9 ++
>  7 files changed, 83 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> index 6a20a613f83..0990918cc45 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins-sve2.cc
> @@ -417,6 +417,35 @@ public:
>  
>  class svsra_impl : public function_base
>  {
> +public:
> +  gimple *
> +  fold (gimple_folder &f) const override
> +  {
> +/* Fold to svlsr/svasr if op1 is all zeros.  */
> +tree op1 = gimple_call_arg (f.call, 0);
> +if (!integer_zerop (op1))
> +  return NULL;
> +function_instance instance ("svlsr", functions::svlsr,
> + shapes::binary_uint_opt_n, MODE_n,
> + f.type_suffix_ids, GROUP_none, PRED_x);
> +if (!f.type_suffix (0).unsigned_p)
> +  {
> + instance.base_name = "svasr";
> + instance.base = functions::svasr;
> +  }
> +gcall *call = f.redirect_call (instance);
> +unsigned int element_bytes = f.type_suffix (0).element_bytes;
> +/* Add a ptrue as predicate, because unlike svsra, svlsr/svasr are
> +   predicated intrinsics.  */
> +gimple_call_set_arg (call, 0, build_ptrue (element_bytes));

Maybe it would be simpler to use build_all_ones_cst (f.gp_type ()).
Unlike for fold_to_ptrue (which produces output predicates),
we don't need the upper bits of each predicate element to be zero.
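
That is, roughly (just a sketch of the suggestion, untested):

  gimple_call_set_arg (call, 0, build_all_ones_cst (f.gp_type ()));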

> +/* For svsra, the shift amount (imm3) is uint64_t for all function types,
> +   but for svlsr/svasr, imm3 has the same width as the function type.  */
> +tree imm3 = gimple_call_arg (f.call, 2);
> +tree imm3_prec = wide_int_to_tree (scalar_types[f.type_suffix (0).vector_type],

Nit: long line.  The easiest way of avoiding it would be to use
f.scalar_type (0) instead.

> +wi::to_wide (imm3, element_bytes));

This works correctly, but FWIW, it's a little simpler to use
wi::to_widest (imm3) instead.  No need to change though.
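
With both of those tweaks the line would read something like (sketch only):

  tree imm3_prec = wide_int_to_tree (f.scalar_type (0), wi::to_widest (imm3));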

Thanks,
Richard

> +gimple_call_set_arg (call, 2, imm3_prec);
> +return call;
> +  }
>  public:
>rtx
>expand (function_expander &e) const override
> diff --git a/gcc/config/aarch64/aarch64-sve-builtins.cc 
> b/gcc/config/aarch64/aarch64-sve-builtins.cc
> index e7c703c987e..945e9818f4e 100644
> --- a/gcc/config/aarch64/aarch64-sve-builtins.cc
> +++ b/gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3456,6 +3456,21 @@ is_ptrue (tree v, unsigned int step)
> && vector_cst_all_same (v, step));
>  }
>  
> +/* Return a ptrue tree (type svbool_t) where the element width
> +   is given by ELEMENT_BYTES.
> +   For example, for ELEMENT_BYTES = 2, we get { 1, 0, 1, 0, ... }. 

Re: [PATCH v2 9/9] aarch64: Handle alignment when it is bigger than BIGGEST_ALIGNMENT

2024-10-22 Thread Richard Sandiford
Evgeny Karpov  writes:
> Thursday, October 17, 2024
> Richard Sandiford  wrote:
>
>>>>> For instance:
>>>>> float __attribute__((aligned (32))) large_aligned_array[3];
>>>>>
>>>>> BIGGEST_ALIGNMENT could be up to 512 bits on x64.
>>>>> This patch has been added to cover this case without needing to
>>>>> change the FFmpeg code.
>>>>
>>>> What goes wrong if we don't do this?  I'm not sure from the description
>>>> whether it's a correctness fix, a performance fix, or whether it's about
>>>> avoiding wasted space.
>>>
>>> It is a correctness fix.
>>
>> But you could you explain what goes wrong if you don't do this?
>> (I realise it might be very obvious when you've seen it happen :)
>> But I'm genuinely unsure.)
>
> It will generate an error if ASM_OUTPUT_ALIGNED_LOCAL is not declared.
>
> error: requested alignment for ‘large_aligned_array’ is greater than implemented alignment of 16
>     7 | float __attribute__((aligned (32))) large_aligned_array[3];

Ah, ok, thanks.

>> Why do we ignore the alignment if it is less than BIGGEST_ALIGNMENT?
>
> If ASM_OUTPUT_ALIGNED_LOCAL uses an alignment less than BIGGEST_ALIGNMENT,
> it might trigger a relocation issue.
>
> relocation truncated to fit: IMAGE_REL_ARM64_PAGEOFFSET_12L

Sorry to press the issue, but: why does that happen?

>> Better to use "auto" rather than "unsigned".
> It looks like "auto" cannot be used there.

What goes wrong if you use it?

The reason for asking for "auto" was to avoid silent truncation.

Thanks,
Richard


[PATCH] testsuite: Skip pr112305.c for -O[01] on simulators

2024-10-22 Thread Richard Sandiford
gcc.dg/torture/pr112305.c contains an inner loop that executes
0x8000_0014 times and an outer loop that executes 5 times, giving about
10 billion total executions of the inner loop body.  At -O2 and above we
are able to remove the inner loop, but at -O1 we keep a no-op loop:

        dls     lr, r3
.L3:
        subs    r3, r3, #1
        le      lr, .L3

and at -O0 we of course don't optimise.

This can lead to long execution times on simulators, possibly
triggering a timeout.
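
For reference, the shape being described is roughly this (a simplified
sketch, not the actual contents of pr112305.c):

  unsigned long long sink;

  void
  f (void)
  {
    /* Outer loop: 5 iterations; inner loop: 0x8000_0014 iterations,
       giving roughly 10.7 billion executions of the loop body.  */
    for (int i = 0; i < 5; i++)
      for (unsigned int j = 0; j < 0x80000014u; j++)
        sink++;
  }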

---

Tested on arm-eabi, where the problem was originally seen, and where we
now skip as expected.  Also tested on native aarch64-linux-gnu, where we
continue to execute all variations.  OK for trunk and backports (so far
to GCC 14)?

Richard


gcc/testsuite
* gcc.dg/torture/pr112305.c: Skip at -O0 and -O1 for simulators.
---
 gcc/testsuite/gcc.dg/torture/pr112305.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/testsuite/gcc.dg/torture/pr112305.c b/gcc/testsuite/gcc.dg/torture/pr112305.c
index 9d363aaac9d..ea6e044529d 100644
--- a/gcc/testsuite/gcc.dg/torture/pr112305.c
+++ b/gcc/testsuite/gcc.dg/torture/pr112305.c
@@ -1,5 +1,6 @@
 /* { dg-do run } */
 /* { dg-require-effective-target int32plus } */
+/* { dg-skip-if "long-running loop" { simulator } { "-O0" "-O1" } } */
 
 int a;
 void b()
-- 
2.25.1



Re: pair-fusion: Assume alias conflict if common address reg changes [PR116783]

2024-10-18 Thread Richard Sandiford
Alex Coplan  writes:
> On 11/10/2024 14:30, Richard Biener wrote:
>> On Fri, 11 Oct 2024, Richard Sandiford wrote:
>> 
>> > Alex Coplan  writes:
>> > > Hi,
>> > >
>> > > As the PR shows, pair-fusion was tricking memory_modified_in_insn_p into
>> > > returning false when a common base register (in this case, x1) was
>> > > modified between the mem and the store insn.  This lead to wrong code as
>> > > the accesses really did alias.
>> > >
>> > > To avoid this sort of problem, this patch avoids invoking RTL alias
>> > > analysis altogether (and assume an alias conflict) if the two insns to
>> > > be compared share a common address register R, and the insns see different
>> > > definitions of R (i.e. it was modified in between).
>> > >
>> > > Bootstrapped/regtested on aarch64-linux-gnu (all languages, both regular
>> > > bootstrap and LTO+PGO bootstrap).  OK for trunk?
>> > 
>> > Sorry for the slow review.  The patch looks good to me, but...
>
> Thanks for the review.  I'd missed that you'd sent this, sorry for not
> responding sooner.
>
>> > 
>> > > @@ -2544,11 +2624,37 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>> > > && bitmap_bit_p (&m_tombstone_bitmap, insn->uid ());
>> > >};
>> > >  
>> > > +  // Maximum number of distinct regnos we expect to appear in a single
>> > > +  // MEM (and thus in a candidate insn).
>> > > +  static constexpr int max_mem_regs = 2;
>> > > +  auto_vec addr_use_vec[2];
>> > > +  use_array addr_uses[2];
>> > > +
>> > > +  // Collect the lists of register uses that occur in the candidate MEMs.
>> > > +  for (int i = 0; i < 2; i++)
>> > > +{
>> > > +  // N.B. it's safe for us to ignore uses that only occur in notes
>> > > +  // here (e.g. in a REG_EQUIV expression) since we only pass the
>> > > +  // MEM down to the alias machinery, so it can't see any insn-level
>> > > +  // notes.
>> > > +  for (auto use : insns[i]->uses ())
>> > > +if (use->is_reg ()
>> > > +&& use->includes_address_uses ()
>> > > +&& !use->only_occurs_in_notes ())
>> > > +  {
>> > > +gcc_checking_assert (addr_use_vec[i].length () < max_mem_regs);
>> > > +addr_use_vec[i].quick_push (use);
>> > 
>> > ...if possible, I think it would be better to just use safe_push here,
>> > without the assert.  There'd then be no need to split max_mem_regs out;
>> > it could just be hard-coded in the addr_use_vec declaration.
>
> I hadn't realised at the time that quick_push () already does a
> gcc_checking_assert to make sure that we don't overflow.  It does:
>
>   template<typename T, typename A>
>   inline T *
>   vec<T, A, vl_embed>::quick_push (const T &obj)
>   {
>     gcc_checking_assert (space (1));
>     T *slot = &address ()[m_vecpfx.m_num++];
>     ::new (static_cast<void *> (slot)) T (obj);
>     return slot;
>   }
>
> (I checked the behaviour by writing a quick selftest in vec.cc, and it
> indeed aborts as expected with quick_push on overflow for a
> stack-allocated auto_vec with N = 2.)
>
> This means that the assert above is indeed redundant, so I agree that
> we should be able to drop the assert and drop the max_mem_regs constant,
> using a literal inside the auto_vec template instead (all while still
> using quick_push).
>
> Does that sound OK to you, or did you have another reason to prefer
> safe_push?  AIUI the behaviour of safe_push on overflow would be to
> allocate a new (heap-allocated) vector instead of asserting.

I just thought it looked odd/unexpected.  Normally the intent of:

  auto_vec bar;

is to reserve a sensible amount of stack space for the common case,
but still support the general case of arbitrarily many elements.
The common on-stack case will be fast with both quick_push and
safe_push[*].  The difference is just whether growing beyond the
static space would abort the compiler or work as expected.

quick_push makes sense if an earlier loop has calculated the runtime
length of the vector and if we've already reserved that amount, or if
there is a static property that guarantees a static limit.  But the limit
of 2 looked more like a general assumption, rather than something that
had been definitively checked by earlier code.
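
As a concrete (untested) illustration of the difference, using GCC's
internal vec.h API -- so it only builds inside the GCC tree -- where x, y
and z stand for arbitrary rtx values:

  /* Two elements of stack space; more can still be added via safe_push.  */
  auto_vec<rtx, 2> v;

  /* quick_push assumes space is already available: with checking enabled
     it aborts if the static (or previously reserved) capacity is exceeded.  */
  v.quick_push (x);
  v.quick_push (y);

  /* safe_push grows the vector as needed, moving the elements to a heap
     allocation once the two-element stack buffer is full.  */
  v.safe_push (z);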

I was also wondering whether using safe_push on an array of auto_vecs
caused issues, and so you were having to work around that.  (I remember
sometimes hitting a warning about attempts to delete an on-stack buffer,
presumably due to code duplication creating contradictory paths that
jump threading couldn't optimise away as dead.)

No real objection though.  Just wanted to clarify what I meant. :)

Thanks,
Richard

[*] well, ok, quick_push will be slightly faster in release builds,
since quick_push won't do a bounds check in that case.  But the
check in safe_push would be highly predictable.

