Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> On 14/02/2024 13:43, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >
> >> On 14/02/2024 13:27, Richard Biener wrote:
> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> >>>
> >>> > On 13/02/2024 08:26, Richard Biener wrote:
> > On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >
> >> Hi!
> >>
> >> On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote:
> >>> I've committed this patch
> >>
> >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>
> >> The RDNA2 ISA variant doesn't support certain instructions previously
> >> implemented in GCC/GCN, so a number of patterns etc. had to be
> >> disabled:
> >>
> >>> [...] Vector reductions will need to be reworked for RDNA2. [...]
> >>
> >>>	* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
> >>>	(addc3): Add RDNA2 syntax variant.
> >>>	(subc3): Likewise.
> >>>	(2_exec): Add RDNA2 alternatives.
> >>>	(vec_cmpdi): Likewise.
> >>>	(vec_cmpdi): Likewise.
> >>>	(vec_cmpdi_exec): Likewise.
> >>>	(vec_cmpdi_exec): Likewise.
> >>>	(vec_cmpdi_dup): Likewise.
> >>>	(vec_cmpdi_dup_exec): Likewise.
> >>>	(reduc__scal_): Disable for RDNA2.
> >>>	(*_dpp_shr_): Likewise.
> >>>	(*plus_carry_dpp_shr_): Likewise.
> >>>	(*plus_carry_in_dpp_shr_): Likewise.
> >>
> >> Etc.  The expectation being that the GCC middle end copes with this,
> >> and synthesizes some less ideal yet still functional vector code, I
> >> presume.
> >>
> >> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >> I'm currently working on getting proper GCC/GCN target (not offloading)
> >> results for.
> >>
> >> I'm seeing a good number of execution test FAILs (regressions compared
> >> to my earlier non-gfx1100 testing), and I've now tracked down where one
> >> large class of those comes into existence -- not yet how to resolve,
> >> unfortunately.
> >> But maybe, with you guys' combined vectorizer and back end experience,
> >> the latter will be done quickly?
> >>
> >> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >> offloading) testing; let me know if you have any questions about that.
> >
> > I've only done offload testing - in the x86_64 build tree run
> > check-target-libgomp.  If you can tell me how to do GCN target testing
> > (maybe document it on the wiki even!) I can try to do that as well.
> >
> >> Given that (at least largely?) the same patterns etc. are disabled as
> >> in my gfx1100 configuration, I suppose your gfx1030 one would exhibit
> >> the same issues.  You can build GCC/GCN target like you build the
> >> offloading one, just remove '--enable-as-accelerator-for=[...]'.
> >> Likely, you can even use an offloading GCC/GCN build to reproduce the
> >> issue below.
> >>
> >> One example is the attached 'builtin-bitops-1.c', reduced from
> >> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >> miscompiled as soon as '-ftree-vectorize' is effective:
> >>
> >>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>     -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>     -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>     -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> >>     -O1 -ftree-vectorize
> >>
> >> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>
> >>     +builtin-bitops-1.c:7:17: missed: reduc op not supported by target.
> >>
> >> ..., and therefore:
> >>
> >>     -builtin-bitops-1.c:7:17: note: Reduce using direct vector reduction.
> >>     +builtin-bitops-1.c:7:17: note: Reduce using vector shifts
> >>     +builtin-bitops-1.c:7:17: note: extract scalar result
> >>
> >> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build
> >> a chain of summation of 'VEC_PERM_EXPR's.
> >> However, there's wrong code generated:
> >>
> >>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>     i=1, ints[i]=0x1 a=1, b=2
> >>     i=2, ints[i]=0x8000 a=1, b=2
> >>     i=3, ints[i]=0x2 a=1, b=2
> >>     i=4, ints[i]=0x4000 a=1, b=2
> >>     i=5, ints[i]=0x1 a=1, b=2
> >>     i=6, ints[i]=0x8000 a=1, b=2
> >>     i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>     i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>     i=9, ints[i]=0xcafe a=11, b=22
> >>     i=10, ints[i]=0xcafe00 a=11, b=22
> >>     i=11, ints[i]=0xcafe a=
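The two reduction strategies named in the dump above can be modeled in scalar code. The following is an illustrative sketch only, not GCC code: `lanes` stands in for a hardware vector register, `reduc_direct` models a single `.REDUC_PLUS` operation, and `reduc_shifts` models the fallback that folds the upper half of the vector onto the lower half with a `VEC_PERM_EXPR` plus addition per round, then extracts lane 0.

```cpp
#include <cassert>

// Model of ".REDUC_PLUS": the target sums all lanes in one operation.
int reduc_direct (const int *lanes, int n)
{
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += lanes[i];
  return sum;
}

// Model of "Reduce using vector shifts": log2(n) rounds of
// permute-and-add, each adding the upper half onto the lower half,
// followed by "extract scalar result" from lane 0.  n must be a
// power of two; the array is modified in place.
int reduc_shifts (int *lanes, int n)
{
  for (int half = n / 2; half >= 1; half /= 2)
    for (int i = 0; i < half; ++i)
      lanes[i] += lanes[i + half];
  return lanes[0];
}
```

Both strategies must agree on every input; the report above is that on gfx1100 the synthesized permute/add chain yields different (wrong) values than the expected popcount results.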
[PATCH] lower-bitint: Ensure we don't get coalescing ICEs for (ab) SSA_NAMEs used in mul/div/mod [PR113567]
Hi!

The build_bitint_stmt_ssa_conflicts hook has a special case for
multiplication, division and modulo, where, to ensure there is no overlap
between the lhs and rhs1/rhs2 arrays, we make the lhs conflict with the
operands.  On the following testcase, we have

  # a_1(ab) = PHI
  lab:
  a_3(ab) = a_1(ab) % 3;

before lowering, and this special case causes a_3(ab) and a_1(ab) to
conflict, but the PHI requires them not to conflict, so we ICE because we
can't find any partitioning that will work.

The following patch fixes this by special-casing such statements before the
partitioning: it forces the inputs of a multiplication/division which has a
large/huge _BitInt (ab) lhs into new non-(ab) SSA_NAMEs initialized right
before the multiplication/division.  This allows the partitioning to work,
as it then has the possibility to use a different partition for the */%
operands.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2024-02-15  Jakub Jelinek

	PR tree-optimization/113567
	* gimple-lower-bitint.cc (gimple_lower_bitint): For large/huge
	_BitInt multiplication, division or modulo with
	SSA_NAME_OCCURS_IN_ABNORMAL_PHI lhs and at least one of rhs1 and
	rhs2 force the affected inputs into a new SSA_NAME.

	* gcc.dg/bitint-90.c: New test.

--- gcc/gimple-lower-bitint.cc.jj	2024-02-12 20:45:50.156275452 +0100
+++ gcc/gimple-lower-bitint.cc	2024-02-14 18:17:36.630664828 +0100
@@ -5973,6 +5973,47 @@ gimple_lower_bitint (void)
 	{
 	default:
 	  break;
+	case MULT_EXPR:
+	case TRUNC_DIV_EXPR:
+	case TRUNC_MOD_EXPR:
+	  if (SSA_NAME_OCCURS_IN_ABNORMAL_PHI (s))
+	    {
+	      location_t loc = gimple_location (stmt);
+	      gsi = gsi_for_stmt (stmt);
+	      tree rhs1 = gimple_assign_rhs1 (stmt);
+	      tree rhs2 = gimple_assign_rhs2 (stmt);
+	      /* For multiplication and division with (ab)
+		 lhs and one or both operands force the operands
+		 into new SSA_NAMEs to avoid coalescing failures.  */
+	      if (TREE_CODE (rhs1) == SSA_NAME
+		  && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (rhs1))
+		{
+		  first_large_huge = 0;
+		  tree t = make_ssa_name (TREE_TYPE (rhs1));
+		  g = gimple_build_assign (t, SSA_NAME, rhs1);
+		  gsi_insert_before (&gsi, g, GSI_SAME_STMT);
+		  gimple_set_location (g, loc);
+		  gimple_assign_set_rhs1 (stmt, t);
+		  if (rhs1 == rhs2)
+		    {
+		      gimple_assign_set_rhs2 (stmt, t);
+		      rhs2 = t;
+		    }
+		  update_stmt (stmt);
+		}
+	      if (TREE_CODE (rhs2) == SSA_NAME
+		  && SSA_NAME_OCCURS_IN_ABNORMAL_PHI (rhs2))
+		{
+		  first_large_huge = 0;
+		  tree t = make_ssa_name (TREE_TYPE (rhs2));
+		  g = gimple_build_assign (t, SSA_NAME, rhs2);
+		  gsi_insert_before (&gsi, g, GSI_SAME_STMT);
+		  gimple_set_location (g, loc);
+		  gimple_assign_set_rhs2 (stmt, t);
+		  update_stmt (stmt);
+		}
+	    }
+	  break;
 	case LROTATE_EXPR:
 	case RROTATE_EXPR:
 	  {

--- gcc/testsuite/gcc.dg/bitint-90.c.jj	2024-02-14 18:24:20.546018881 +0100
+++ gcc/testsuite/gcc.dg/bitint-90.c	2024-02-14 18:24:09.900167668 +0100
@@ -0,0 +1,23 @@
+/* PR tree-optimization/113567 */
+/* { dg-do compile { target bitint } } */
+/* { dg-options "-O2" } */
+
+#if __BITINT_MAXWIDTH__ >= 129
+_BitInt(129) v;
+
+void
+foo (_BitInt(129) a, int i)
+{
+  __label__ l1, l2;
+  i &= 1;
+  void *p[] = { &&l1, &&l2 };
+l1:
+  a %= 3;
+  v = a;
+  i = !i;
+  goto *(p[i]);
+l2:;
+}
+#else
+int i;
+#endif

	Jakub
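The constraint problem the patch resolves can be shown with a toy model. This is an illustrative sketch, not GCC code: abnormal PHIs require their result and arguments to share a coalescing partition, while lowered `*`/`/`/`%` statements on large `_BitInt` forbid the lhs from sharing a partition with its operands. Before the patch both constraints fall on the same pair of names; inserting the copy `t = a_1` moves the "must differ" constraint onto the fresh name.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model of the partitioning constraints described above.
// Each SSA name is an index; a candidate solution assigns each
// name a partition number.
struct Constraints {
  std::vector<std::pair<int, int>> must_share;   // from abnormal PHIs
  std::vector<std::pair<int, int>> must_differ;  // from mul/div/mod lowering

  bool feasible (const std::vector<int> &part) const {
    for (const auto &p : must_share)
      if (part[p.first] != part[p.second])
        return false;
    for (const auto &p : must_differ)
      if (part[p.first] == part[p.second])
        return false;
    return true;
  }
};
```

With names 0 = `a_1(ab)`, 1 = `a_3(ab)`, 2 = the new copy `t`: before the patch the same pair must both share and differ, so no assignment works (the ICE); after the patch only `t` conflicts with `a_3`, so e.g. `{a_1, a_3} -> P0, t -> P1` satisfies everything.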
[PATCH] icf: Reset SSA_NAME_{PTR,RANGE}_INFO in successfully merged functions [PR113907]
Hi!

AFAIK we have no code in LTO streaming to stream out or in
SSA_NAME_{RANGE,PTR}_INFO, so LTO effectively throws it all away and lets
vrp1 and alias analysis after IPA recompute that.  There is just one spot,
for IPA VRP and IPA bit CCP, where we save/restore ranges and set
SSA_NAME_{PTR,RANGE}_INFO e.g. on parameters depending on what we saved and
propagated, but that is after streaming in bodies for the post-IPA
optimizations.

Now, without LTO, SSA_NAME_{RANGE,PTR}_INFO is already computed from
earlier in many cases (e.g. evrp and early alias analysis, but other spots
too), but IPA ICF is ignoring the ranges and points-to details when
comparing the bodies.  I think ignoring that is just fine; that is
effectively what we do for LTO, where we throw that information away before
the analysis, and not ignoring it could lead to fewer ICF merging
possibilities.

So, the following patch instead verifies that for LTO
SSA_NAME_{PTR,RANGE}_INFO just isn't there on SSA_NAMEs in functions into
which other functions have been ICFed, and for non-LTO throws that
information away (which matches the LTO behavior).

Another possibility would be to remember the SSA_NAME <-> SSA_NAME mapping
vector (just one of the 2) on successful sem_function::equals on the
sem_function which is not the chosen leader (e.g. how SSA_NAMEs in the
leader map to SSA_NAMEs in the other function) and use that vector to union
the ranges in sem_function::merge.  I can implement that for comparison,
but wanted to post this first, in case there is an agreement on doing that
or if Honza thinks we should take SSA_NAME_{RANGE,PTR}_INFO into account.
I think we can compare SSA_NAME_RANGE_INFO, but have no idea how to try to
compare points-to info.  And I think it will result in less effective ICF
for non-LTO vs. LTO unnecessarily.

Bootstrapped/regtested on x86_64-linux and i686-linux.
2024-02-15 Jakub Jelinek PR middle-end/113907 * ipa-icf.cc (sem_item_optimizer::merge_classes): Reset SSA_NAME_RANGE_INFO and SSA_NAME_PTR_INFO on successfully ICF merged functions. * gcc.dg/pr113907.c: New test. --- gcc/ipa-icf.cc.jj 2024-02-14 14:26:11.101933914 +0100 +++ gcc/ipa-icf.cc 2024-02-14 16:49:35.141518117 +0100 @@ -3396,6 +3397,7 @@ sem_item_optimizer::merge_classes (unsig continue; sem_item *source = c->members[0]; + bool this_merged_p = false; if (DECL_NAME (source->decl) && MAIN_NAME_P (DECL_NAME (source->decl))) @@ -3443,7 +3445,7 @@ sem_item_optimizer::merge_classes (unsig if (dbg_cnt (merged_ipa_icf)) { bool merged = source->merge (alias); - merged_p |= merged; + this_merged_p |= merged; if (merged && alias->type == VAR) { @@ -3452,6 +3454,35 @@ sem_item_optimizer::merge_classes (unsig } } } + + merged_p |= this_merged_p; + if (this_merged_p + && source->type == FUNC + && (!flag_wpa || flag_checking)) + { + unsigned i; + tree name; + FOR_EACH_SSA_NAME (i, name, DECL_STRUCT_FUNCTION (source->decl)) + { + /* We need to either merge or reset SSA_NAME_*_INFO. + For merging we don't preserve the mapping between + original and alias SSA_NAMEs from successful equals + calls. 
*/ + if (POINTER_TYPE_P (TREE_TYPE (name))) + { + if (SSA_NAME_PTR_INFO (name)) + { + gcc_checking_assert (!flag_wpa); + SSA_NAME_PTR_INFO (name) = NULL; + } + } + else if (SSA_NAME_RANGE_INFO (name)) + { + gcc_checking_assert (!flag_wpa); + SSA_NAME_RANGE_INFO (name) = NULL; + } + } + } } if (!m_merged_variables.is_empty ()) --- gcc/testsuite/gcc.dg/pr113907.c.jj 2024-02-14 16:13:48.486555159 +0100 +++ gcc/testsuite/gcc.dg/pr113907.c 2024-02-14 16:13:29.198825045 +0100 @@ -0,0 +1,49 @@ +/* PR middle-end/113907 */ +/* { dg-do run } */ +/* { dg-options "-O2" } */ +/* { dg-additional-options "-minline-all-stringops" { target i?86-*-* x86_64-*-* } } */ + +static inline int +foo (int len, void *indata, void *outdata) +{ + if (len < 0 || (len & 7) != 0) +return 0; + if (len != 0 && indata != outdata) +__builtin_memcpy (outdata, indata, len); + return len; +} + +static inline int +bar (int len, void *indata, void *outdata) +{ + if (len < 0 || (len & 1) != 0) +return 0; + if (len != 0 && indata != outdata) +__builtin_memcpy (outdata, indata, len); + return len; +} + +int (*volatile p1) (int, void *, void *) = foo; +int (*volatile p2) (int, void *, void *) = bar; + +__attribute__((noipa)) int +baz (int len, void *indata, void *outdat
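The kind of per-function knowledge at stake can be seen in a minimal sketch (hypothetical code, not from the patch, merely illustrating what SSA_NAME_RANGE_INFO records): past the guard below, the compiler knows `len` is non-negative with its low three bits clear, i.e. a multiple of 8, and with `-minline-all-stringops` that granularity can feed the inline `memcpy` expansion. A body-identical function guarded by `(len & 1)` would only establish a multiple of 2, which is precisely the information that must not silently survive an ICF merge.

```cpp
#include <cassert>

// Hypothetical helper mirroring the guard in the pr113907.c testcase's
// foo().  After the early return, value-range analysis records
// len >= 0 and (len & 7) == 0 on len's SSA name; the analogous bar()
// variant with (len & 1) records a weaker fact.  If ICF merges the two
// bodies, keeping either function's recorded range would be wrong for
// code reached via the other, hence the reset in the patch above.
int masked_len (int len)
{
  if (len < 0 || (len & 7) != 0)
    return 0;
  return len;  // here: len >= 0 and len % 8 == 0
}
```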
Re: [PATCH] Skip gnat.dg/div_zero.adb on RISC-V
LGTM, thanks :)

On Wed, Feb 14, 2024 at 10:11 PM Andreas Schwab wrote:
>
> Like AArch64 and POWER, RISC-V does not support trap on zero divide.
>
> gcc/testsuite/
> 	* gnat.dg/div_zero.adb: Skip on RISC-V.
> ---
>  gcc/testsuite/gnat.dg/div_zero.adb | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/testsuite/gnat.dg/div_zero.adb b/gcc/testsuite/gnat.dg/div_zero.adb
> index dedf3928db7..fb1c98caeff 100644
> --- a/gcc/testsuite/gnat.dg/div_zero.adb
> +++ b/gcc/testsuite/gnat.dg/div_zero.adb
> @@ -1,5 +1,5 @@
>  -- { dg-do run }
> --- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* } }
> +-- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* riscv*-*-* } }
>
>  -- This test requires architecture- and OS-specific support code for unwinding
>  -- through signal frames (typically located in *-unwind.h) to pass.  Feel free
> --
> 2.43.1
>
>
> --
> Andreas Schwab, SUSE Labs, sch...@suse.de
> GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
> "And now for something completely different."
PING: [PATCH v3 0/8] Optimize more type traits
IIRC, all libstdc++ patches were already reviewed.  It would be great if the
gcc patches were reviewed as well.  Thank you for your time.

Sincerely,
Ken Matsui

On Fri, Jan 5, 2024 at 9:08 PM Ken Matsui wrote:
>
> Changes in v3:
>
> - Rebased on top of master.
> - Fixed __is_pointer in cpp_type_traits.h.
>
> Changes in v2:
>
> - Removed testsuite_tr1.h includes from the testcases.
>
> ---
>
> This patch series implements the __is_const, __is_volatile, __is_pointer,
> and __is_unbounded_array built-in traits, which were isolated from my
> previous patch series "Optimize type traits compilation performance"
> because they appeared to contain a performance regression.  I confirmed
> that this patch series does not cause any performance regression.  The
> main causes of the apparent regression were the insufficient
> exhaustiveness of the benchmarks and the instability of the benchmark
> results.  Here are new benchmark results:
>
> is_const:
> https://github.com/ken-matsui/gcc-bench/blob/main/is_const.md#sat-dec-23-090605-am-pst-2023
>
> time: -4.36603%, peak memory: -0.300891%, total memory: -0.247934%
>
> is_volatile_v:
> https://github.com/ken-matsui/gcc-bench/blob/main/is_volatile_v.md#sat-dec-23-091518-am-pst-2023
>
> time: -4.06816%, peak memory: -0.609298%, total memory: -0.659134%
>
> is_pointer:
> https://github.com/ken-matsui/gcc-bench/blob/main/is_pointer.md#sat-dec-23-124903-pm-pst-2023
>
> time: -2.47124%, peak memory: -2.98207%, total memory: -4.0811%
>
> is_unbounded_array_v:
> https://github.com/ken-matsui/gcc-bench/blob/main/is_unbounded_array_v.md#sat-dec-23-010046-pm-pst-2023
>
> time: -1.50025%, peak memory: -1.07386%, total memory: -2.32394%
>
> Ken Matsui (8):
>   c++: Implement __is_const built-in trait
>   libstdc++: Optimize std::is_const compilation performance
>   c++: Implement __is_volatile built-in trait
>   libstdc++: Optimize std::is_volatile compilation performance
>   c++: Implement __is_pointer built-in trait
>   libstdc++: Optimize std::is_pointer compilation performance
>   c++: Implement __is_unbounded_array built-in trait
>   libstdc++: Optimize std::is_unbounded_array compilation performance
>
>  gcc/cp/constraint.cc                          | 12 +++
>  gcc/cp/cp-trait.def                           |  4 +
>  gcc/cp/semantics.cc                           | 16 ++++
>  gcc/testsuite/g++.dg/ext/has-builtin-1.C      | 12 +++
>  gcc/testsuite/g++.dg/ext/is_const.C           | 20 +++++
>  gcc/testsuite/g++.dg/ext/is_pointer.C         | 51 +++++
>  gcc/testsuite/g++.dg/ext/is_unbounded_array.C | 37 ++++++
>  gcc/testsuite/g++.dg/ext/is_volatile.C        | 20 +++++
>  libstdc++-v3/include/bits/cpp_type_traits.h   | 31 +++-
>  libstdc++-v3/include/std/type_traits          | 73 +++++++---
>  10 files changed, 267 insertions(+), 9 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_const.C
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_pointer.C
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_unbounded_array.C
>  create mode 100644 gcc/testsuite/g++.dg/ext/is_volatile.C
>
> --
> 2.43.0
>
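To illustrate where the time and memory savings in the series come from (a simplified sketch; `my_is_pointer` is a made-up name, not the libstdc++ implementation): the pure-library approach needs a class-template instantiation per queried type, whereas a compiler built-in such as `__is_pointer` answers the query directly, with no partial-specialization matching or instantiation.

```cpp
// Library-only fallback: each distinct T instantiates a class template.
template<class T> struct my_is_pointer      { static const bool value = false; };
template<class T> struct my_is_pointer<T*>  { static const bool value = true;  };

// With the built-in (guarded, since not all compilers provide it), the
// same query needs only a single wrapper and no template machinery:
#ifdef __has_builtin
# if __has_builtin(__is_pointer)
template<class T> struct my_is_pointer_fast { static const bool value = __is_pointer(T); };
# endif
#endif
```

Note this sketch deliberately omits the cv-qualified pointer cases (`int* const` etc.) that the real `std::is_pointer` must also handle.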
[PATCH v3 1/4] c++: Implement __add_pointer built-in trait
This patch implements built-in trait for std::add_pointer. gcc/cp/ChangeLog: * cp-trait.def: Define __add_pointer. * semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER. gcc/testsuite/ChangeLog: * g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer. * g++.dg/ext/add_pointer.C: New test. Signed-off-by: Ken Matsui --- gcc/cp/cp-trait.def | 1 + gcc/cp/semantics.cc | 9 ++ gcc/testsuite/g++.dg/ext/add_pointer.C | 39 gcc/testsuite/g++.dg/ext/has-builtin-1.C | 3 ++ 4 files changed, 52 insertions(+) create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def index 394f006f20f..cec385ee501 100644 --- a/gcc/cp/cp-trait.def +++ b/gcc/cp/cp-trait.def @@ -48,6 +48,7 @@ #define DEFTRAIT_TYPE_DEFAULTED #endif +DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1) DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1) DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1) DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index 57840176863..8dc975495a8 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, tree type2, switch (kind) { +case CPTK_ADD_POINTER: + if (FUNC_OR_METHOD_TYPE_P (type1) + && (type_memfn_quals (type1) != TYPE_UNQUALIFIED + || type_memfn_rqual (type1) != REF_QUAL_NONE)) + return type1; + if (TYPE_REF_P (type1)) + type1 = TREE_TYPE (type1); + return build_pointer_type (type1); + case CPTK_REMOVE_CV: return cv_unqualified (type1); diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C b/gcc/testsuite/g++.dg/ext/add_pointer.C new file mode 100644 index 000..c405cdd0feb --- /dev/null +++ b/gcc/testsuite/g++.dg/ext/add_pointer.C @@ -0,0 +1,39 @@ +// { dg-do compile { target c++11 } } + +#define SA(X) static_assert((X),#X) + +class ClassType { }; + +SA(__is_same(__add_pointer(int), int*)); +SA(__is_same(__add_pointer(int*), int**)); 
+SA(__is_same(__add_pointer(const int), const int*)); +SA(__is_same(__add_pointer(int&), int*)); +SA(__is_same(__add_pointer(ClassType*), ClassType**)); +SA(__is_same(__add_pointer(ClassType), ClassType*)); +SA(__is_same(__add_pointer(void), void*)); +SA(__is_same(__add_pointer(const void), const void*)); +SA(__is_same(__add_pointer(volatile void), volatile void*)); +SA(__is_same(__add_pointer(const volatile void), const volatile void*)); + +void f1(); +using f1_type = decltype(f1); +using pf1_type = decltype(&f1); +SA(__is_same(__add_pointer(f1_type), pf1_type)); + +void f2() noexcept; // PR libstdc++/78361 +using f2_type = decltype(f2); +using pf2_type = decltype(&f2); +SA(__is_same(__add_pointer(f2_type), pf2_type)); + +using fn_type = void(); +using pfn_type = void(*)(); +SA(__is_same(__add_pointer(fn_type), pfn_type)); + +SA(__is_same(__add_pointer(void() &), void() &)); +SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept)); +SA(__is_same(__add_pointer(void() const), void() const)); +SA(__is_same(__add_pointer(void(...) &), void(...) &)); +SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept)); +SA(__is_same(__add_pointer(void(...) const), void(...) const)); + +SA(__is_same(__add_pointer(void() __restrict), void() __restrict)); diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C b/gcc/testsuite/g++.dg/ext/has-builtin-1.C index 02b4b4d745d..56e8db7ac32 100644 --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C @@ -2,6 +2,9 @@ // { dg-do compile } // Verify that __has_builtin gives the correct answer for C++ built-ins. +#if !__has_builtin (__add_pointer) +# error "__has_builtin (__add_pointer) failed" +#endif #if !__has_builtin (__builtin_addressof) # error "__has_builtin (__builtin_addressof) failed" #endif -- 2.43.0
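The mapping the new built-in implements mirrors `std::add_pointer`, so it can be checked portably with the standard trait (assuming a standard library implementing LWG 2101 / the PR libstdc++/78361 fix referenced in the test): references decay to a pointer to the referred-to type, object and `void` types gain a `*`, and cv- or ref-qualified ("abominable") function types pass through unchanged because no pointer to them can exist.

```cpp
#include <type_traits>

// Portable restatement of the cases exercised in add_pointer.C above.
static_assert(std::is_same<std::add_pointer<int>::type, int*>::value,
              "object type gains *");
static_assert(std::is_same<std::add_pointer<int&>::type, int*>::value,
              "reference decays to pointer to referee");
static_assert(std::is_same<std::add_pointer<const void>::type, const void*>::value,
              "cv void gains *");
static_assert(std::is_same<std::add_pointer<void()>::type, void(*)()>::value,
              "plain function type becomes function pointer");
static_assert(std::is_same<std::add_pointer<void() const>::type, void() const>::value,
              "abominable function type is unchanged");
static_assert(std::is_same<std::add_pointer<void() &>::type, void() &>::value,
              "ref-qualified function type is unchanged");
```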
Re: [PATCH v2 1/4] c++: Implement __add_pointer built-in trait
On Wed, Feb 14, 2024 at 12:19 PM Patrick Palka wrote: > > On Wed, 14 Feb 2024, Ken Matsui wrote: > > > This patch implements built-in trait for std::add_pointer. > > > > gcc/cp/ChangeLog: > > > > * cp-trait.def: Define __add_pointer. > > * semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER. > > > > gcc/testsuite/ChangeLog: > > > > * g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer. > > * g++.dg/ext/add_pointer.C: New test. > > > > Signed-off-by: Ken Matsui > > --- > > gcc/cp/cp-trait.def | 1 + > > gcc/cp/semantics.cc | 9 ++ > > gcc/testsuite/g++.dg/ext/add_pointer.C | 37 > > gcc/testsuite/g++.dg/ext/has-builtin-1.C | 3 ++ > > 4 files changed, 50 insertions(+) > > create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C > > > > diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def > > index 394f006f20f..cec385ee501 100644 > > --- a/gcc/cp/cp-trait.def > > +++ b/gcc/cp/cp-trait.def > > @@ -48,6 +48,7 @@ > > #define DEFTRAIT_TYPE_DEFAULTED > > #endif > > > > +DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1) > > DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1) > > DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1) > > DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1) > > diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc > > index 57840176863..e23693ab57f 100644 > > --- a/gcc/cp/semantics.cc > > +++ b/gcc/cp/semantics.cc > > @@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, > > tree type2, > > > >switch (kind) > > { > > +case CPTK_ADD_POINTER: > > + if (TREE_CODE (type1) == FUNCTION_TYPE > > + && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE)) > > +|| FUNCTION_REF_QUALIFIED (type1))) > > In other parts of the front end, e.g. 
the POINTER_TYPE case of tsubst, in > build_trait_object, grokdeclarator and get_typeid, it seems we check for > an unqualified function type with > > (type_memfn_quals (type) != TYPE_UNQUALIFIED >&& type_mem_rqual (type) != REF_QUAL_NONE) > > which should be equivalent to your formulation except it also checks > for non-standard qualifiers such as __restrict. > > I'm not sure what a __restrict-qualified function type means or if we > care about the semantics of __add_pointer(void () __restrict), but I > reckon we might as well be consistent and use the type_mem_quals/rqual > formulation in new code too? > I see and agree. Thank you for your review! I will update this patch. > > + return type1; > > + if (TYPE_REF_P (type1)) > > + type1 = TREE_TYPE (type1); > > + return build_pointer_type (type1); > > + > > case CPTK_REMOVE_CV: > >return cv_unqualified (type1); > > > > diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C > > b/gcc/testsuite/g++.dg/ext/add_pointer.C > > new file mode 100644 > > index 000..3091510f3b5 > > --- /dev/null > > +++ b/gcc/testsuite/g++.dg/ext/add_pointer.C > > @@ -0,0 +1,37 @@ > > +// { dg-do compile { target c++11 } } > > + > > +#define SA(X) static_assert((X),#X) > > + > > +class ClassType { }; > > + > > +SA(__is_same(__add_pointer(int), int*)); > > +SA(__is_same(__add_pointer(int*), int**)); > > +SA(__is_same(__add_pointer(const int), const int*)); > > +SA(__is_same(__add_pointer(int&), int*)); > > +SA(__is_same(__add_pointer(ClassType*), ClassType**)); > > +SA(__is_same(__add_pointer(ClassType), ClassType*)); > > +SA(__is_same(__add_pointer(void), void*)); > > +SA(__is_same(__add_pointer(const void), const void*)); > > +SA(__is_same(__add_pointer(volatile void), volatile void*)); > > +SA(__is_same(__add_pointer(const volatile void), const volatile void*)); > > + > > +void f1(); > > +using f1_type = decltype(f1); > > +using pf1_type = decltype(&f1); > > +SA(__is_same(__add_pointer(f1_type), pf1_type)); > > + > > +void f2() noexcept; // PR 
libstdc++/78361 > > +using f2_type = decltype(f2); > > +using pf2_type = decltype(&f2); > > +SA(__is_same(__add_pointer(f2_type), pf2_type)); > > + > > +using fn_type = void(); > > +using pfn_type = void(*)(); > > +SA(__is_same(__add_pointer(fn_type), pfn_type)); > > + > > +SA(__is_same(__add_pointer(void() &), void() &)); > > +SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept)); > > +SA(__is_same(__add_pointer(void() const), void() const)); > > +SA(__is_same(__add_pointer(void(...) &), void(...) &)); > > +SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept)); > > +SA(__is_same(__add_pointer(void(...) const), void(...) const)); > > diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C > > b/gcc/testsuite/g++.dg/ext/has-builtin-1.C > > index 02b4b4d745d..56e8db7ac32 100644 > > --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C > > +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C > > @@ -2,6 +2,9 @@ > > // { dg-do compile } > > // Verify that __has_builtin gives the correct answer for C++ built-ins.
[PATCH V4 4/5] RISC-V: Quick and simple fixes to testcases that break due to reordering
The following test cases are easily fixed with small updates to the expected assembly order. Additionally make calling-convention testcases more robust PR target/113249 gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c: update * gcc.target/riscv/rvv/autovec/vls/calling-convention-2.c: ditto * gcc.target/riscv/rvv/autovec/vls/calling-convention-3.c: ditto * gcc.target/riscv/rvv/autovec/vls/calling-convention-4.c: ditto * gcc.target/riscv/rvv/autovec/vls/calling-convention-5.c: ditto * gcc.target/riscv/rvv/autovec/vls/calling-convention-6.c: ditto * gcc.target/riscv/rvv/autovec/vls/calling-convention-7.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-12.c: reorder assembly * gcc.target/riscv/rvv/base/binop_vx_constraint-16.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-17.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-19.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-21.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-23.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-25.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-27.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-29.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-31.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-33.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-35.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-4.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-40.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-44.c: ditto * gcc.target/riscv/rvv/base/binop_vx_constraint-8.c: ditto * gcc.target/riscv/rvv/base/shift_vx_constraint-1.c: ditto * gcc.target/riscv/rvv/vsetvl/avl_single-107.c: change expected vsetvl Signed-off-by: Edwin Lu --- V1-3: - Patch did not exist V4: - New patch - improve calling-convention testcases (calling-conventions) - reorder expected function body assembly (binop/shift_vx_constraint) - change expected value (avl_single) --- 
.../rvv/autovec/vls/calling-convention-1.c| 27 --- .../rvv/autovec/vls/calling-convention-2.c| 23 ++-- .../rvv/autovec/vls/calling-convention-3.c| 18 - .../rvv/autovec/vls/calling-convention-4.c| 12 - .../rvv/autovec/vls/calling-convention-5.c| 22 ++- .../rvv/autovec/vls/calling-convention-6.c| 17 .../rvv/autovec/vls/calling-convention-7.c| 12 - .../riscv/rvv/base/binop_vx_constraint-12.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-16.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-17.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-19.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-21.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-23.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-25.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-27.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-29.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-31.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-33.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-35.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-4.c| 4 +-- .../riscv/rvv/base/binop_vx_constraint-40.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-44.c | 4 +-- .../riscv/rvv/base/binop_vx_constraint-8.c| 4 +-- .../riscv/rvv/base/shift_vx_constraint-1.c| 5 +--- .../riscv/rvv/vsetvl/avl_single-107.c | 2 +- 25 files changed, 140 insertions(+), 62 deletions(-) diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c index 41e31c258f8..217885c2d67 100644 --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/vls/calling-convention-1.c @@ -143,12 +143,33 @@ DEF_RET1_ARG9 (v1024qi) DEF_RET1_ARG9 (v2048qi) DEF_RET1_ARG9 (v4096qi) +// RET1_ARG0 tests /* { dg-final { scan-assembler-times {li\s+a[0-1],\s*0} 9 } } */ +/* { dg-final { scan-assembler-times {mv\s+s0,a0\s+call\s+memset\s+mv\s+a0,s0} 3 } } */ + +// v1qi tests: return value (lbu) and function prologue 
(sb) +// 1 lbu per test, argnum sb's when args > 1 /* { dg-final { scan-assembler-times {lbu\s+a0,\s*[0-9]+\(sp\)} 8 } } */ -/* { dg-final { scan-assembler-times {lhu\s+a0,\s*[0-9]+\(sp\)} 8 } } */ -/* { dg-final { scan-assembler-times {lw\s+a0,\s*[0-9]+\(sp\)} 8 } } */ -/* { dg-final { scan-assembler-times {ld\s+a[0-1],\s*[0-9]+\(sp\)} 35 } } */ /* { dg-final { scan-assembler-times {sb\s+a[0-7],\s*[0-9]+\(sp\)} 43 } } */ + +// v2qi test: return value (lhu) and function prologue (sh) +// 1 lhu per test, argnum sh's when args > 1 +/* { dg-final { scan-assembler-times {lhu\s+a0,
[PATCH V4 3/5] RISC-V: Use default cost model for insn scheduling
Use the default cost model scheduling on these test cases. All of these
tests introduce scan dump failures with -mtune=generic-ooo. Since the
vector cost models are the same across all three tunes, some of the
tests in PR113249 will be fixed by this patch series.

	PR target/113249

gcc/testsuite/ChangeLog:

	* g++.target/riscv/rvv/base/bug-1.C: use default scheduling
	* gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-102.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-108.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-114.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-119.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-12.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-16.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-17.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-19.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-21.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-23.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-25.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-27.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-29.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-31.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-33.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-35.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-4.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-40.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-44.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-50.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-56.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-62.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-68.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-74.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-79.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-8.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-84.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-90.c: ditto
	* gcc.target/riscv/rvv/base/binop_vx_constraint-96.c: ditto
	* gcc.target/riscv/rvv/base/float-point-dynamic-frm-30.c: ditto
	* gcc.target/riscv/rvv/base/pr108185-1.c: ditto
	* gcc.target/riscv/rvv/base/pr108185-2.c: ditto
	* gcc.target/riscv/rvv/base/pr108185-3.c: ditto
	* gcc.target/riscv/rvv/base/pr108185-4.c: ditto
	* gcc.target/riscv/rvv/base/pr108185-5.c: ditto
	* gcc.target/riscv/rvv/base/pr108185-6.c: ditto
	* gcc.target/riscv/rvv/base/pr108185-7.c: ditto
	* gcc.target/riscv/rvv/base/shift_vx_constraint-1.c: ditto
	* gcc.target/riscv/rvv/vsetvl/pr111037-3.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-28.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-29.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-32.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_back_prop-33.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_single_block-17.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_single_block-18.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_single_block-19.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-10.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-11.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-12.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-4.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-5.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-6.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-7.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-8.c: ditto
	* gcc.target/riscv/rvv/vsetvl/vlmax_switch_vtype-9.c: ditto
	* gfortran.dg/vect/vect-8.f90: ditto

Signed-off-by: Edwin Lu
---
V2:
- New patch
V3/V4:
- No change
---
 gcc/testsuite/g++.target/riscv/rvv/base/bug-1.C                 | 2 ++
 gcc/testsuite/gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-102.c         | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-108.c         | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-114.c         | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-119.c         | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-12.c          | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-16.c          | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-17.c          | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-19.c          | 2 ++
 .../gcc.target/riscv/rvv/base/binop_vx_constraint-21.c          | 2
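The per-test change summarized in the diffstat is small in each case; a
hypothetical sketch (this is not one of the files listed above, the file
contents, function, -march/-mabi options, and the flag spelling
-fno-schedule-insns/-fno-schedule-insns2 are all illustrative
assumptions) of what such a two-line dg-options tweak looks like:

```cpp
/* Hypothetical RVV scan-dump test sketch: disabling instruction
   scheduling pins the emitted instruction order, so the expected
   scan-assembler sequences no longer depend on which -mtune pipeline
   model (e.g. generic-ooo) is in effect.  */
/* { dg-do compile } */
/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-schedule-insns -fno-schedule-insns2" } */

void
vadd (int *a, int *b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] += b[i];
}
```

The dg directives are ordinary comments, so the fragment compiles as-is;
only the options line differs from test to test.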
[PATCH V4 2/5] RISC-V: Add vector related pipelines
Creates new generic vector pipeline file common to all cpu tunes. Moves
all vector related pipelines from generic-ooo to generic-vector-ooo.
Creates new vector crypto related insn reservations.

gcc/ChangeLog:

	* config/riscv/generic-ooo.md (generic_ooo): Move reservation
	(generic_ooo_vec_load): ditto
	(generic_ooo_vec_store): ditto
	(generic_ooo_vec_loadstore_seg): ditto
	(generic_ooo_vec_alu): ditto
	(generic_ooo_vec_fcmp): ditto
	(generic_ooo_vec_imul): ditto
	(generic_ooo_vec_fadd): ditto
	(generic_ooo_vec_fmul): ditto
	(generic_ooo_crypto): ditto
	(generic_ooo_perm): ditto
	(generic_ooo_vec_reduction): ditto
	(generic_ooo_vec_ordered_reduction): ditto
	(generic_ooo_vec_idiv): ditto
	(generic_ooo_vec_float_divsqrt): ditto
	(generic_ooo_vec_mask): ditto
	(generic_ooo_vec_vesetvl): ditto
	(generic_ooo_vec_setrm): ditto
	(generic_ooo_vec_readlen): ditto
	* config/riscv/riscv.md: include generic-vector-ooo
	* config/riscv/generic-vector-ooo.md: New file. to here

Signed-off-by: Edwin Lu
Co-authored-by: Robin Dapp
---
V2:
- Remove unnecessary syntax changes in generic-ooo
- Add new vector crypto reservations and types to pipelines
V3:
- Move all vector pipelines into separate file which defines all ooo
  vector reservations.
- Add temporary attribute while cost model changes.
V4:
- No change
---
 gcc/config/riscv/generic-ooo.md        | 127 +-
 gcc/config/riscv/generic-vector-ooo.md | 143 +
 gcc/config/riscv/riscv.md              |   1 +
 3 files changed, 145 insertions(+), 126 deletions(-)
 create mode 100644 gcc/config/riscv/generic-vector-ooo.md

diff --git a/gcc/config/riscv/generic-ooo.md b/gcc/config/riscv/generic-ooo.md
index 83cd06234b3..e70df63d91f 100644
--- a/gcc/config/riscv/generic-ooo.md
+++ b/gcc/config/riscv/generic-ooo.md
@@ -1,5 +1,5 @@
 ;; RISC-V generic out-of-order core scheduling model.
-;; Copyright (C) 2017-2024 Free Software Foundation, Inc.
+;; Copyright (C) 2023-2024 Free Software Foundation, Inc.
 ;;
 ;; This file is part of GCC.
 ;;
@@ -48,9 +48,6 @@ (define_automaton "generic_ooo")
 ;; Integer/float issue queues.
 (define_cpu_unit "issue0,issue1,issue2,issue3,issue4" "generic_ooo")

-;; Separate issue queue for vector instructions.
-(define_cpu_unit "generic_ooo_vxu_issue" "generic_ooo")
-
 ;; Integer/float execution units.
 (define_cpu_unit "ixu0,ixu1,ixu2,ixu3" "generic_ooo")
 (define_cpu_unit "fxu0,fxu1" "generic_ooo")
@@ -58,12 +55,6 @@ (define_cpu_unit "fxu0,fxu1" "generic_ooo")
 ;; Integer subunit for division.
 (define_cpu_unit "generic_ooo_div" "generic_ooo")

-;; Vector execution unit.
-(define_cpu_unit "generic_ooo_vxu_alu" "generic_ooo")
-
-;; Vector subunit that does mult/div/sqrt.
-(define_cpu_unit "generic_ooo_vxu_multicycle" "generic_ooo")
-
 ;; Shortcuts
 (define_reservation "generic_ooo_issue" "issue0|issue1|issue2|issue3|issue4")
 (define_reservation "generic_ooo_ixu_alu" "ixu0|ixu1|ixu2|ixu3")
@@ -92,25 +83,6 @@ (define_insn_reservation "generic_ooo_float_store" 6
       (eq_attr "type" "fpstore"))
  "generic_ooo_issue,generic_ooo_fxu")

-;; Vector load/store
-(define_insn_reservation "generic_ooo_vec_load" 6
-  (and (eq_attr "tune" "generic_ooo")
-       (eq_attr "type" "vlde,vldm,vlds,vldux,vldox,vldff,vldr"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-(define_insn_reservation "generic_ooo_vec_store" 6
-  (and (eq_attr "tune" "generic_ooo")
-       (eq_attr "type" "vste,vstm,vsts,vstux,vstox,vstr"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-;; Vector segment loads/stores.
-(define_insn_reservation "generic_ooo_vec_loadstore_seg" 10
-  (and (eq_attr "tune" "generic_ooo")
-       (eq_attr "type" "vlsegde,vlsegds,vlsegdux,vlsegdox,vlsegdff,\
-			vssegte,vssegts,vssegtux,vssegtox"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-
 ;; Generic integer instructions.
 (define_insn_reservation "generic_ooo_alu" 1
   (and (eq_attr "tune" "generic_ooo")
@@ -191,103 +163,6 @@ (define_insn_reservation "generic_ooo_popcount" 2
       (eq_attr "type" "cpop,clmul"))
  "generic_ooo_issue,generic_ooo_ixu_alu")

-;; Regular vector operations and integer comparisons.
-(define_insn_reservation "generic_ooo_vec_alu" 3
-  (and (eq_attr "tune" "generic_ooo")
-       (eq_attr "type" "vialu,viwalu,vext,vicalu,vshift,vnshift,viminmax,vicmp,\
-			vimov,vsalu,vaalu,vsshift,vnclip,vmov,vfmov,vector"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_alu")
-
-;; Vector float comparison, conversion etc.
-(define_insn_reservation "generic_ooo_vec_fcmp" 3
-  (and (eq_attr "tune" "generic_ooo")
-       (eq_attr "type" "vfrecp,vfminmax,vfcmp,vfsgnj,vfclass,vfcvtitof,\
-			vfcvtftoi,vfwcvtitof,vfwcvtftoi,vfwcvtftof,vfncvtitof,\
-			vfncvtftoi,vfncvtftof"))
-  "generic_ooo_vxu_issue,generic_ooo_vxu_a
[PATCH V4 5/5] RISC-V: Enable assert for insn_has_dfa_reservation
Enables assert that every typed instruction is associated with a dfa
reservation.

gcc/ChangeLog:

	* config/riscv/riscv.cc (riscv_sched_variable_issue): enable assert

Signed-off-by: Edwin Lu
---
V2:
- No changes
V3:
- Remove debug statements
V4:
- no changes
---
 gcc/config/riscv/riscv.cc | 2 --
 1 file changed, 2 deletions(-)

diff --git a/gcc/config/riscv/riscv.cc b/gcc/config/riscv/riscv.cc
index 4100abc9dd1..5e984ee2a55 100644
--- a/gcc/config/riscv/riscv.cc
+++ b/gcc/config/riscv/riscv.cc
@@ -8269,9 +8269,7 @@ riscv_sched_variable_issue (FILE *, int, rtx_insn *insn, int more)
   /* If we ever encounter an insn without an insn reservation, trip
      an assert so we can find and fix this problem.  */
-#if 0
   gcc_assert (insn_has_dfa_reservation_p (insn));
-#endif

   return more - 1;
 }
--
2.34.1
[PATCH V4 1/5] RISC-V: Add non-vector types to dfa pipelines
This patch adds non-vector related insn reservations and updates/creates new insn reservations so all non-vector typed instructions have a reservation. gcc/ChangeLog: * config/riscv/generic-ooo.md (generic_ooo_sfb_alu): Add reservation (generic_ooo_branch): ditto * config/riscv/generic.md (generic_sfb_alu): ditto (generic_fmul_half): ditto * config/riscv/riscv.md: Remove cbo, pushpop, and rdfrm types * config/riscv/sifive-7.md (sifive_7_hfma): Add reservation (sifive_7_popcount): ditto * config/riscv/sifive-p400.md (sifive_p400_clmul): ditto * config/riscv/sifive-p600.md (sifive_p600_clmul): ditto * config/riscv/vector.md: change rdfrm to fmove * config/riscv/zc.md: change pushpop to load/store Signed-off-by: Edwin Lu --- V2: - Add insn reservations for HF fmul - Remove/adjust insn types V3: - No changes V4: - Update sifive-p400 and sifive-p600 series --- gcc/config/riscv/generic-ooo.md | 15 +- gcc/config/riscv/generic.md | 20 +-- gcc/config/riscv/riscv.md | 16 +++--- gcc/config/riscv/sifive-7.md| 17 +- gcc/config/riscv/sifive-p400.md | 10 +++- gcc/config/riscv/sifive-p600.md | 10 +++- gcc/config/riscv/vector.md | 2 +- gcc/config/riscv/zc.md | 96 - 8 files changed, 117 insertions(+), 69 deletions(-) diff --git a/gcc/config/riscv/generic-ooo.md b/gcc/config/riscv/generic-ooo.md index a22f8a3e079..83cd06234b3 100644 --- a/gcc/config/riscv/generic-ooo.md +++ b/gcc/config/riscv/generic-ooo.md @@ -115,9 +115,20 @@ (define_insn_reservation "generic_ooo_vec_loadstore_seg" 10 (define_insn_reservation "generic_ooo_alu" 1 (and (eq_attr "tune" "generic_ooo") (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,\ - move,bitmanip,min,max,minu,maxu,clz,ctz")) + move,bitmanip,rotate,min,max,minu,maxu,clz,ctz,atomic,\ + condmove,mvpair,zicond")) "generic_ooo_issue,generic_ooo_ixu_alu") +(define_insn_reservation "generic_ooo_sfb_alu" 2 + (and (eq_attr "tune" "generic_ooo") + (eq_attr "type" "sfb_alu")) + "generic_ooo_issue,generic_ooo_ixu_alu") + +;; Branch 
instructions +(define_insn_reservation "generic_ooo_branch" 1 + (and (eq_attr "tune" "generic_ooo") + (eq_attr "type" "branch,jump,call,jalr,ret,trap")) + "generic_ooo_issue,generic_ooo_ixu_alu") ;; Float move, convert and compare. (define_insn_reservation "generic_ooo_float_move" 3 @@ -184,7 +195,7 @@ (define_insn_reservation "generic_ooo_popcount" 2 (define_insn_reservation "generic_ooo_vec_alu" 3 (and (eq_attr "tune" "generic_ooo") (eq_attr "type" "vialu,viwalu,vext,vicalu,vshift,vnshift,viminmax,vicmp,\ - vimov,vsalu,vaalu,vsshift,vnclip,vmov,vfmov")) + vimov,vsalu,vaalu,vsshift,vnclip,vmov,vfmov,vector")) "generic_ooo_vxu_issue,generic_ooo_vxu_alu") ;; Vector float comparison, conversion etc. diff --git a/gcc/config/riscv/generic.md b/gcc/config/riscv/generic.md index 3f0eaa2ea08..4f6e63bff57 100644 --- a/gcc/config/riscv/generic.md +++ b/gcc/config/riscv/generic.md @@ -27,7 +27,9 @@ (define_cpu_unit "fdivsqrt" "pipe0") (define_insn_reservation "generic_alu" 1 (and (eq_attr "tune" "generic") - (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,move,bitmanip,min,max,minu,maxu,clz,ctz,cpop")) + (eq_attr "type" "unknown,const,arith,shift,slt,multi,auipc,nop,logical,\ + move,bitmanip,min,max,minu,maxu,clz,ctz,rotate,atomic,\ + condmove,crypto,mvpair,zicond")) "alu") (define_insn_reservation "generic_load" 3 @@ -47,12 +49,17 @@ (define_insn_reservation "generic_xfer" 3 (define_insn_reservation "generic_branch" 1 (and (eq_attr "tune" "generic") - (eq_attr "type" "branch,jump,call,jalr")) + (eq_attr "type" "branch,jump,call,jalr,ret,trap")) + "alu") + +(define_insn_reservation "generic_sfb_alu" 2 + (and (eq_attr "tune" "generic") + (eq_attr "type" "sfb_alu")) "alu") (define_insn_reservation "generic_imul" 10 (and (eq_attr "tune" "generic") - (eq_attr "type" "imul,clmul")) + (eq_attr "type" "imul,clmul,cpop")) "imuldiv*10") (define_insn_reservation "generic_idivsi" 34 @@ -67,6 +74,12 @@ (define_insn_reservation "generic_idivdi" 66 (eq_attr "mode" 
"DI"))) "imuldiv*66") +(define_insn_reservation "generic_fmul_half" 5 + (and (eq_attr "tune" "generic") + (and (eq_attr "type" "fadd,fmul,fmadd") + (eq_attr "mode" "HF"))) + "alu") + (define_insn_reservation "generic_fmul_single" 5 (and (eq_attr "tune" "generic") (and (eq_attr "type" "fadd,fmul,fmadd") @@ -88,3 +101,4 @@ (define_insn_reservation "generic_fsqrt" 25 (and (eq_attr "tune" "generic") (eq_attr "type" "fsqrt")) "fdivsqrt*25") + diff --git a/gcc/config/riscv/riscv.md b/g
[PATCH V4 0/5] RISC-V: Associate typed insns to dfa reservation
Previous version (V3, 23cd2961bd2ff63583f46e3499a07bd54491d45c) was
reverted.

Updates all tune insn reservation pipelines to cover all types defined
by define_attr "type" in riscv.md.

Creates new vector insn reservation pipelines in the new file
generic-vector-ooo.md, which has a separate automaton vector_ooo to
which all reservations are mapped. This allows all tunes to share a
common vector model for now while we make large changes to the vector
cost model
(https://gcc.gnu.org/pipermail/gcc-patches/2024-January/642511.html).

Disables pipeline scheduling for some tests with scan dump failures
when using -mtune=generic-ooo.

Updates test cases that were failing due to simple insn reordering to
match the new code generation.

Enables assert that all insn types must be associated with a dfa
pipeline reservation.

---
V2:
- Update non-vector insn types and add new pipelines
- Add -fno-schedule-insns -fno-schedule-insns2 to some test cases
V3:
- Separate vector pipelines into a separate file which all tunes have
  access to
V4:
- Add insn reservations to sifive-p400 and sifive-p600 series
- Update test cases with new code generation
---
Edwin Lu (5):
  RISC-V: Add non-vector types to dfa pipelines
  RISC-V: Add vector related pipelines
  RISC-V: Use default cost model for insn scheduling
  RISC-V: Quick and simple fixes to testcases that break due to
    reordering
  RISC-V: Enable assert for insn_has_dfa_reservation

 gcc/config/riscv/generic-ooo.md               | 140 ++---
 gcc/config/riscv/generic-vector-ooo.md        | 143 ++
 gcc/config/riscv/generic.md                   |  20 ++-
 gcc/config/riscv/riscv.cc                     |   2 -
 gcc/config/riscv/riscv.md                     |  17 +--
 gcc/config/riscv/sifive-7.md                  |  17 ++-
 gcc/config/riscv/sifive-p400.md               |  10 +-
 gcc/config/riscv/sifive-p600.md               |  10 +-
 gcc/config/riscv/vector.md                    |   2 +-
 gcc/config/riscv/zc.md                        |  96 ++--
 .../g++.target/riscv/rvv/base/bug-1.C         |   2 +
 .../riscv/rvv/autovec/reduc/reduc_call-2.c    |   2 +
 .../rvv/autovec/vls/calling-convention-1.c    |  27 +++-
 .../rvv/autovec/vls/calling-convention-2.c    |  23 ++-
.../rvv/autovec/vls/calling-convention-3.c| 18 ++- .../rvv/autovec/vls/calling-convention-4.c| 12 +- .../rvv/autovec/vls/calling-convention-5.c| 22 ++- .../rvv/autovec/vls/calling-convention-6.c| 17 +++ .../rvv/autovec/vls/calling-convention-7.c| 12 +- .../riscv/rvv/base/binop_vx_constraint-102.c | 2 + .../riscv/rvv/base/binop_vx_constraint-108.c | 2 + .../riscv/rvv/base/binop_vx_constraint-114.c | 2 + .../riscv/rvv/base/binop_vx_constraint-119.c | 2 + .../riscv/rvv/base/binop_vx_constraint-12.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-16.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-17.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-19.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-21.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-23.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-25.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-27.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-29.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-31.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-33.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-35.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-4.c| 2 +- .../riscv/rvv/base/binop_vx_constraint-40.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-44.c | 2 +- .../riscv/rvv/base/binop_vx_constraint-50.c | 2 + .../riscv/rvv/base/binop_vx_constraint-56.c | 2 + .../riscv/rvv/base/binop_vx_constraint-62.c | 2 + .../riscv/rvv/base/binop_vx_constraint-68.c | 2 + .../riscv/rvv/base/binop_vx_constraint-74.c | 2 + .../riscv/rvv/base/binop_vx_constraint-79.c | 2 + .../riscv/rvv/base/binop_vx_constraint-8.c| 2 +- .../riscv/rvv/base/binop_vx_constraint-84.c | 2 + .../riscv/rvv/base/binop_vx_constraint-90.c | 2 + .../riscv/rvv/base/binop_vx_constraint-96.c | 2 + .../rvv/base/float-point-dynamic-frm-30.c | 2 + .../gcc.target/riscv/rvv/base/pr108185-1.c| 2 + .../gcc.target/riscv/rvv/base/pr108185-2.c| 2 + .../gcc.target/riscv/rvv/base/pr108185-3.c| 2 + .../gcc.target/riscv/rvv/base/pr108185-4.c| 2 + 
.../gcc.target/riscv/rvv/base/pr108185-5.c| 2 + .../gcc.target/riscv/rvv/base/pr108185-6.c| 2 + .../gcc.target/riscv/rvv/base/pr108185-7.c| 2 + .../riscv/rvv/base/shift_vx_constraint-1.c| 3 +- .../riscv/rvv/vsetvl/avl_single-107.c | 2 +- .../gcc.target/riscv/rvv/vsetvl/pr111037-3.c | 2 + .../riscv/rvv/vsetvl/vlmax_back_prop-28.c | 2 + .../riscv/rvv/vsetvl/vlmax_back_prop-29.c | 2 + .../riscv/rvv/vsetvl/vlmax_back_prop-32.c | 2 + .../riscv/rvv/vsetvl/vlmax_back_prop-33.c | 2 + .../riscv/rvv/vsetvl/vlmax
Re: [PATCH RFA] build: drop target libs from LD_LIBRARY_PATH [PR105688]
> On 14 Feb 2024, at 22:59, Iain Sandoe wrote: >> On 12 Feb 2024, at 19:59, Jason Merrill wrote: >> >> On 2/10/24 07:30, Iain Sandoe wrote: On 10 Feb 2024, at 12:07, Jason Merrill wrote: On 2/10/24 05:46, Iain Sandoe wrote: >> On 9 Feb 2024, at 23:21, Iain Sandoe wrote: >> >> >> >>> On 9 Feb 2024, at 10:56, Iain Sandoe wrote: On 8 Feb 2024, at 21:44, Jason Merrill wrote: On 2/8/24 12:55, Paolo Bonzini wrote: > On 2/8/24 18:16, Jason Merrill wrote: >>> >>> Hmm. In stage 1, when we build with the system gcc, I'd think we >>> want the just-built gnat1 to find the system libgcc. >>> >>> In stage 2, when we build with the stage 1 gcc, we want the >>> just-built gnat1 to find the stage 1 libgcc. >>> >>> In neither case do we want it to find the libgcc from the current >>> stage. >>> >>> So it seems to me that what we want is for stage2+ LD_LIBRARY_PATH >>> to include the TARGET_LIB_PATH from the previous stage. Something >>> like the below, on top of the earlier patch. >>> >>> Does this make sense? Does it work on Darwin? >> >> Oops, that was broken, please consider this one instead: > Yes, this one makes sense (and the current code would not work since > it lacks the prev- prefix on TARGET_LIB_PATH). Indeed, that seems like evidence that the only element of TARGET_LIB_PATH that has been useful in HOST_EXPORTS is the prev- part of HOST_LIB_PATH_gcc. So, here's another patch that just includes that for post-stage1: <0001-build-drop-target-libs-from-LD_LIBRARY_PATH-PR105688.patch> >>> >>> Hmm this still fails for me with gnat1 being unable to find libgcc_s. >>> It seems I have to add the PREV_HOST_LIB_PATH_gcc to HOST_LIB_PATH for >>> it to succeed so, >>> presumably, the post stage1 exports are not being forwarded to that >>> build. I’ll try to analyze what >>> exactly is failing. 
>> >> The fail is occurring in the target libada build; so, I suppose, one >> might say it’s reasonable that it >> requires this host path to be added to the target exports since it’s a >> host library used during target >> builds (or do folks expect the host exports to be made for target lib >> builds as well?) >> >> Appending the prev-gcc dirctory to the HOST_LIB_PATH fixes this > Hmm this is still not right, in this case, I think it should actually be > the “just built” directory; > - if we have a tool that depends on host libraries (that happen to be > also target ones), > then those libraries have to be built before the tool so that they can > be linked to it. > (we specially copy libgcc* and the CRTs to gcc/ to allow for this case) > - there is no prev-gcc in cross and —disable-bootstrap builds, but the > tool will still be > linked to the just-built host libraries (which will also be installed). > So, I think we have to add HOST_LIB_PATH_gcc to HOST_LIB_PATH > and HOST_PREV_LIB_PATH_gcc to POSTSTAGE1_HOST_EXPORTS (as per this patch). I don't follow. In a cross build, host libraries are a different architecture from target libraries, and certainly can't be linked into host binaries. In a disable-bootstrap build, even before my change TARGET_LIB_PATH isn't added to RPATH_ENVVAR, since that has been guarded with @if gcc-bootstrap. So in a bootstrap build, it shouldn't be needed for stage1 either. And for stage2, the one we need is from stage1, that matches the compiler we're building host tools with. What am I missing? >>> nothing, I was off on a tangent about the cross/non-bootstrap, sorry about >>> that. >>> However, when doing target builds (the previous point) it seems we do have >>> to make provision for gnat1 to find libgcc_s, and, at present, it seems >>> that only the target exports are active. 
>> >> Ah, I see: When building target libraries in stage2, we run the stage2 >> compiler that needs the stage1 libgcc_s, but we don't have the HOST_EXPORTS >> because we're building target code, so we also need to get the libgcc path >> into TARGET_EXPORTS. >> >> Since TARGET_LIB_PATH is only added when gcc-bootstrap, I guess the previous >> libgcc is the only piece needed in TARGET_EXPORTS as well. So, how about >> this version of the patch? > > I tested this one on an affected platform version with and without > —enable-host-shared and for all languages (less go which is not yet > supported). It works for me, thanks, > Iain Incidentally, during my investigations I was looking into various parts of this and it seems that actually TARGET_LIB_PATH might well be effectively dead code now. O
[PATCH] bpf: fix zero_extendqidi2 ldx template
Commit 77d0f9ec3809b4d2e32c36069b6b9239d301c030 inadvertently changed
the normal asm dialect instruction template for zero_extendqidi2 from
ldxb to ldxh. Fix that.

Tested for bpf-unknown-none on x86_64-linux-gnu host.

gcc/
	* config/bpf/bpf.md (zero_extendqidi2): Correct asm template to
	use ldxb instead of ldxh.
---
 gcc/config/bpf/bpf.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/bpf/bpf.md b/gcc/config/bpf/bpf.md
index 080a63cd970..50df1aaa3e2 100644
--- a/gcc/config/bpf/bpf.md
+++ b/gcc/config/bpf/bpf.md
@@ -292,7 +292,7 @@ (define_insn "zero_extendqidi2"
   "@
    {and\t%0,0xff|%0 &= 0xff}
    {mov\t%0,%1\;and\t%0,0xff|%0 = %1;%0 &= 0xff}
-   {ldxh\t%0,%1|%0 = *(u8 *) (%1)}"
+   {ldxb\t%0,%1|%0 = *(u8 *) (%1)}"
  [(set_attr "type" "alu,alu,ldx")])

 (define_insn "zero_extendsidi2"
--
2.43.0
[PATCH 1/2] doc: Fix some standard named pattern documentation modes
Currently these use `@var{m3}` but the 3 here is a literal 3 and not part of the mode itself so it should not be inside the var. Fixed as such. Built the documentation to make sure it looks correct now. gcc/ChangeLog: * doc/md.texi (widen_ssum, widen_usum, smulhs, umulhs, smulhrs, umulhrs, sdiv_pow2): Move the 3 outside of the var. Signed-off-by: Andrew Pinski --- gcc/doc/md.texi | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index b0c61925120..274dd03d419 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -5798,19 +5798,19 @@ is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the absolute difference. The result is placed in operand 0, which is of the same mode as operand 3. -@cindex @code{widen_ssum@var{m3}} instruction pattern -@cindex @code{widen_usum@var{m3}} instruction pattern -@item @samp{widen_ssum@var{m3}} -@itemx @samp{widen_usum@var{m3}} +@cindex @code{widen_ssum@var{m}3} instruction pattern +@cindex @code{widen_usum@var{m}3} instruction pattern +@item @samp{widen_ssum@var{m}3} +@itemx @samp{widen_usum@var{m}3} Operands 0 and 2 are of the same mode, which is wider than the mode of operand 1. Add operand 1 to operand 2 and place the widened result in operand 0. (This is used express accumulation of elements into an accumulator of a wider mode.) -@cindex @code{smulhs@var{m3}} instruction pattern -@cindex @code{umulhs@var{m3}} instruction pattern -@item @samp{smulhs@var{m3}} -@itemx @samp{umulhs@var{m3}} +@cindex @code{smulhs@var{m}3} instruction pattern +@cindex @code{umulhs@var{m}3} instruction pattern +@item @samp{smulhs@var{m}3} +@itemx @samp{umulhs@var{m}3} Signed/unsigned multiply high with scale. 
This is equivalent to the C code: @smallexample narrow op0, op1, op2; @@ -5820,10 +5820,10 @@ op0 = (narrow) (((wide) op1 * (wide) op2) >> (N / 2 - 1)); where the sign of @samp{narrow} determines whether this is a signed or unsigned operation, and @var{N} is the size of @samp{wide} in bits. -@cindex @code{smulhrs@var{m3}} instruction pattern -@cindex @code{umulhrs@var{m3}} instruction pattern -@item @samp{smulhrs@var{m3}} -@itemx @samp{umulhrs@var{m3}} +@cindex @code{smulhrs@var{m}3} instruction pattern +@cindex @code{umulhrs@var{m}3} instruction pattern +@item @samp{smulhrs@var{m}3} +@itemx @samp{umulhrs@var{m}3} Signed/unsigned multiply high with round and scale. This is equivalent to the C code: @smallexample @@ -5834,10 +5834,10 @@ op0 = (narrow) (wide) op1 * (wide) op2) >> (N / 2 - 2)) + 1) >> 1); where the sign of @samp{narrow} determines whether this is a signed or unsigned operation, and @var{N} is the size of @samp{wide} in bits. -@cindex @code{sdiv_pow2@var{m3}} instruction pattern -@cindex @code{sdiv_pow2@var{m3}} instruction pattern -@item @samp{sdiv_pow2@var{m3}} -@itemx @samp{sdiv_pow2@var{m3}} +@cindex @code{sdiv_pow2@var{m}3} instruction pattern +@cindex @code{sdiv_pow2@var{m}3} instruction pattern +@item @samp{sdiv_pow2@var{m}3} +@itemx @samp{sdiv_pow2@var{m}3} Signed division by power-of-2 immediate. Equivalent to: @smallexample signed op0, op1; -- 2.43.0
[PATCH 0/2] Some minor internal optabs related fixes
While working on adding some new vector code to the aarch64 backend, I
was confused about which mode was supposed to be used for the
widen_ssum pattern, so I decided to improve the documentation so that
the next person won't be confused.

Andrew Pinski (2):
  doc: Fix some standard named pattern documentation modes
  doc: Add documentation of which operand matches the mode of the
    standard pattern name [PR113508]

 gcc/doc/md.texi | 41 +
 1 file changed, 25 insertions(+), 16 deletions(-)

--
2.43.0
[PATCH 2/2] doc: Add documentation of which operand matches the mode of the standard pattern name [PR113508]
In some of the standard pattern names, it is not obvious which mode is being used in the pattern name. Is it operand 0, 1, or 2? Is it the wider mode or the narrower mode? This fixes that so there is no confusion by adding a sentence to some of them. Built the documentation to make sure that it builds. gcc/ChangeLog: * doc/md.texi (sdot_prod@var{m}, udot_prod@var{m}, usdot_prod@var{m}, ssad@var{m}, usad@var{m}, widen_usum@var{m}3, smulhs@var{m}3, umulhs@var{m}3, smulhrs@var{m}3, umulhrs@var{m}3): Add sentence about what the mode m is. Signed-off-by: Andrew Pinski --- gcc/doc/md.texi | 9 + 1 file changed, 9 insertions(+) diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi index 274dd03d419..33b37e79cd4 100644 --- a/gcc/doc/md.texi +++ b/gcc/doc/md.texi @@ -5746,6 +5746,7 @@ Operand 1 and operand 2 are of the same mode. Their product, which is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the product. The result is placed in operand 0, which is of the same mode as operand 3. +@var{m} is the mode of operand 1 and operand 2. Semantically the expressions perform the multiplication in the following signs @@ -5763,6 +5764,7 @@ Operand 1 and operand 2 are of the same mode. Their product, which is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the product. The result is placed in operand 0, which is of the same mode as operand 3. +@var{m} is the mode of operand 1 and operand 2. Semantically the expressions perform the multiplication in the following signs @@ -5779,6 +5781,7 @@ Operand 1 must be unsigned and operand 2 signed. Their product, which is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the product. The result is placed in operand 0, which is of the same mode as operand 3. +@var{m} is the mode of operand 1 and operand 2. 
Semantically the expressions perform the multiplication in the following signs @@ -5797,6 +5800,7 @@ Operand 1 and operand 2 are of the same mode. Their absolute difference, which is of a wider mode, is computed and added to operand 3. Operand 3 is of a mode equal or wider than the mode of the absolute difference. The result is placed in operand 0, which is of the same mode as operand 3. +@var{m} is the mode of operand 1 and operand 2. @cindex @code{widen_ssum@var{m}3} instruction pattern @cindex @code{widen_usum@var{m}3} instruction pattern @@ -5806,6 +5810,7 @@ Operands 0 and 2 are of the same mode, which is wider than the mode of operand 1. Add operand 1 to operand 2 and place the widened result in operand 0. (This is used express accumulation of elements into an accumulator of a wider mode.) +@var{m} is the mode of operand 1. @cindex @code{smulhs@var{m}3} instruction pattern @cindex @code{umulhs@var{m}3} instruction pattern @@ -5819,6 +5824,8 @@ op0 = (narrow) (((wide) op1 * (wide) op2) >> (N / 2 - 1)); @end smallexample where the sign of @samp{narrow} determines whether this is a signed or unsigned operation, and @var{N} is the size of @samp{wide} in bits. +@var{m} is the mode for all 3 operands (narrow). The wide mode is not specified +and is defined to fit the whole multiply. @cindex @code{smulhrs@var{m}3} instruction pattern @cindex @code{umulhrs@var{m}3} instruction pattern @@ -5833,6 +5840,8 @@ op0 = (narrow) (wide) op1 * (wide) op2) >> (N / 2 - 2)) + 1) >> 1); @end smallexample where the sign of @samp{narrow} determines whether this is a signed or unsigned operation, and @var{N} is the size of @samp{wide} in bits. +@var{m} is the mode for all 3 operands (narrow). The wide mode is not specified +and is defined to fit the whole multiply. @cindex @code{sdiv_pow2@var{m}3} instruction pattern @cindex @code{sdiv_pow2@var{m}3} instruction pattern -- 2.43.0
Re: [PATCH RFA] build: drop target libs from LD_LIBRARY_PATH [PR105688]
> On 12 Feb 2024, at 19:59, Jason Merrill wrote: > > On 2/10/24 07:30, Iain Sandoe wrote: >>> On 10 Feb 2024, at 12:07, Jason Merrill wrote: >>> >>> On 2/10/24 05:46, Iain Sandoe wrote: > On 9 Feb 2024, at 23:21, Iain Sandoe wrote: > > > >> On 9 Feb 2024, at 10:56, Iain Sandoe wrote: >>> On 8 Feb 2024, at 21:44, Jason Merrill wrote: >>> >>> On 2/8/24 12:55, Paolo Bonzini wrote: On 2/8/24 18:16, Jason Merrill wrote: >>> >> >> Hmm. In stage 1, when we build with the system gcc, I'd think we >> want the just-built gnat1 to find the system libgcc. >> >> In stage 2, when we build with the stage 1 gcc, we want the >> just-built gnat1 to find the stage 1 libgcc. >> >> In neither case do we want it to find the libgcc from the current >> stage. >> >> So it seems to me that what we want is for stage2+ LD_LIBRARY_PATH >> to include the TARGET_LIB_PATH from the previous stage. Something >> like the below, on top of the earlier patch. >> >> Does this make sense? Does it work on Darwin? > > Oops, that was broken, please consider this one instead: Yes, this one makes sense (and the current code would not work since it lacks the prev- prefix on TARGET_LIB_PATH). >>> >>> Indeed, that seems like evidence that the only element of >>> TARGET_LIB_PATH that has been useful in HOST_EXPORTS is the prev- part >>> of HOST_LIB_PATH_gcc. >>> >>> So, here's another patch that just includes that for post-stage1: >>> <0001-build-drop-target-libs-from-LD_LIBRARY_PATH-PR105688.patch> >> >> Hmm this still fails for me with gnat1 being unable to find libgcc_s. >> It seems I have to add the PREV_HOST_LIB_PATH_gcc to HOST_LIB_PATH for >> it to succeed so, >> presumably, the post stage1 exports are not being forwarded to that >> build. I’ll try to analyze what >> exactly is failing. 
> > The fail is occurring in the target libada build; so, I suppose, one > might say it’s reasonable that it > requires this host path to be added to the target exports since it’s a > host library used during target > builds (or do folks expect the host exports to be made for target lib > builds as well?) > > Appending the prev-gcc dirctory to the HOST_LIB_PATH fixes this Hmm this is still not right, in this case, I think it should actually be the “just built” directory; - if we have a tool that depends on host libraries (that happen to be also target ones), then those libraries have to be built before the tool so that they can be linked to it. (we specially copy libgcc* and the CRTs to gcc/ to allow for this case) - there is no prev-gcc in cross and —disable-bootstrap builds, but the tool will still be linked to the just-built host libraries (which will also be installed). So, I think we have to add HOST_LIB_PATH_gcc to HOST_LIB_PATH and HOST_PREV_LIB_PATH_gcc to POSTSTAGE1_HOST_EXPORTS (as per this patch). >>> >>> I don't follow. In a cross build, host libraries are a different >>> architecture from target libraries, and certainly can't be linked into host >>> binaries. >>> >>> In a disable-bootstrap build, even before my change TARGET_LIB_PATH isn't >>> added to RPATH_ENVVAR, since that has been guarded with @if gcc-bootstrap. >>> >>> So in a bootstrap build, it shouldn't be needed for stage1 either. And for >>> stage2, the one we need is from stage1, that matches the compiler we're >>> building host tools with. >>> >>> What am I missing? >> nothing, I was off on a tangent about the cross/non-bootstrap, sorry about >> that. >> However, when doing target builds (the previous point) it seems we do have >> to make provision for gnat1 to find libgcc_s, and, at present, it seems that >> only the target exports are active. 
> > Ah, I see: When building target libraries in stage2, we run the stage2 > compiler that needs the stage1 libgcc_s, but we don't have the HOST_EXPORTS > because we're building target code, so we also need to get the libgcc path > into TARGET_EXPORTS. > > Since TARGET_LIB_PATH is only added when gcc-bootstrap, I guess the previous > libgcc is the only piece needed in TARGET_EXPORTS as well. So, how about > this version of the patch? I tested this one on an affected platform version with and without —enable-host-shared and for all languages (less go which is not yet supported). It works for me, thanks, Iain > > Jason<0001-build-drop-target-libs-from-LD_LIBRARY_PATH-PR105688.patch>
[patch, fortran] Bug 105847 - namelist-object-name can be a renamed host associated entity
Pushed as simple and obvious. Regards, Jerry

commit 8221201cc59870579b9dc451b173f94b8d8b0993 (HEAD -> master, origin/master, origin/HEAD)
Author: Steve Kargl
Date:   Wed Feb 14 14:40:16 2024 -0800

    Fortran: namelist-object-name renaming.

    PR fortran/105847

    gcc/fortran/ChangeLog:

    * trans-io.cc (transfer_namelist_element): When building the
      namelist object name, if the use rename attribute is set, use
      the local name specified in the use statement.

    gcc/testsuite/ChangeLog:

    * gfortran.dg/pr105847.f90: New test.
Re: [PATCH][_GLIBCXX_DEBUG] Fix std::__niter_base behavior
On 14/02/2024 20:44, Jonathan Wakely wrote: On Wed, 14 Feb 2024 at 18:39, François Dumont wrote: libstdc++: [_GLIBCXX_DEBUG] Fix std::__niter_base behavior std::__niter_base is used in _GLIBCXX_DEBUG mode to remove the _Safe_iterator<> wrapper on random access iterators. But in doing so it should also preserve the original behavior of removing the __normal_iterator wrapper. libstdc++-v3/ChangeLog: * include/bits/stl_algobase.h (std::__niter_base): Redefine the overload definitions for __gnu_debug::_Safe_iterator. * include/debug/safe_iterator.tcc (std::__niter_base): Adapt declarations. Ok to commit once all tests are completed (still need to check pre-C++11)? The declaration in include/bits/stl_algobase.h has a noexcept-specifier but the definition in include/debug/safe_iterator.tcc does not have one - that seems wrong (I'm surprised it even compiles). It does! I thought it was only necessary at the declaration, and I also had trouble doing it right at the definition because of the interaction with the auto and ->. Now simplified and consistent in this new proposal. Just using std::is_nothrow_copy_constructible<_Ite> seems simpler, that will be true for __normal_iterator if is_nothrow_copy_constructible is true. Ok The definition in include/debug/safe_iterator.tcc should use std::declval<_Ite>() not declval<_Ite>(). Is there any reason why the definition uses a late-specified-return-type (i.e. auto and ->) when the declaration doesn't? I initially planned to use '-> decltype(std::__niter_base(__it.base()))' but this did not compile due to an ambiguity issue. So I resorted to using std::declval, and I could then have done it the same way as the declaration; done now. Attached is what I'm testing, ok to commit once fully tested? 
François

diff --git a/libstdc++-v3/include/bits/stl_algobase.h b/libstdc++-v3/include/bits/stl_algobase.h
index e7207f67266..0f73da13172 100644
--- a/libstdc++-v3/include/bits/stl_algobase.h
+++ b/libstdc++-v3/include/bits/stl_algobase.h
@@ -317,12 +317,26 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _GLIBCXX_NOEXCEPT_IF(std::is_nothrow_copy_constructible<_Iterator>::value)
     { return __it; }
 
+#if __cplusplus < 201103L
   template<typename _Ite, typename _Seq>
-    _GLIBCXX20_CONSTEXPR
     _Ite
     __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
                  std::random_access_iterator_tag>&);
+
+  template<typename _Ite, typename _Cont, typename _Seq>
+    _Ite
+    __niter_base(const ::__gnu_debug::_Safe_iterator<
+                 ::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _Seq,
+                 std::random_access_iterator_tag>&);
+#else
+  template<typename _Ite, typename _Seq>
+    _GLIBCXX20_CONSTEXPR
+    decltype(std::__niter_base(std::declval<_Ite>()))
+    __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
+                 std::random_access_iterator_tag>&)
+    noexcept(std::is_nothrow_copy_constructible<_Ite>::value);
+#endif
+
   // Reverse the __niter_base transformation to get a
   // __normal_iterator back again (this assumes that __normal_iterator
   // is only used to wrap random access iterators, like pointers).
diff --git a/libstdc++-v3/include/debug/safe_iterator.tcc b/libstdc++-v3/include/debug/safe_iterator.tcc
index 6eb70cbda04..a8b24233e85 100644
--- a/libstdc++-v3/include/debug/safe_iterator.tcc
+++ b/libstdc++-v3/include/debug/safe_iterator.tcc
@@ -235,13 +235,29 @@ namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
 
+#if __cplusplus < 201103L
   template<typename _Ite, typename _Seq>
-    _GLIBCXX20_CONSTEXPR
     _Ite
     __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
                  std::random_access_iterator_tag>& __it)
     { return __it.base(); }
+
+  template<typename _Ite, typename _Cont, typename _DbgSeq>
+    _Ite
+    __niter_base(const ::__gnu_debug::_Safe_iterator<
+                 ::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _DbgSeq,
+                 std::random_access_iterator_tag>& __it)
+    { return __it.base().base(); }
+#else
+  template<typename _Ite, typename _Seq>
+    _GLIBCXX20_CONSTEXPR
+    decltype(std::__niter_base(std::declval<_Ite>()))
+    __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq,
+                 std::random_access_iterator_tag>& __it)
+    noexcept(std::is_nothrow_copy_constructible<_Ite>::value)
+    { return std::__niter_base(__it.base()); }
+#endif
+
   template<typename _Ite, typename _Seq>
     _GLIBCXX20_CONSTEXPR
Re: [PATCH] aarch64: Reword error message for mismatch guard size and probing interval [PR90155]
Andrew Pinski writes: > The error message is not clear what options are being talked about when it > says the values > need to match; plus there is a wrong quotation dealing with the diagnostic. > So this changes the error message to be exactly talking about the param > options that > are being talked about and now with the options, it needs the quoting. > > OK? Built and tested for aarch64-linux-gnu. > > gcc/ChangeLog: > > * config/aarch64/aarch64.cc (aarch64_override_options_internal): Fix > error message for mismatch guard size and probing interval. > > Signed-off-by: Andrew Pinski > --- > gcc/config/aarch64/aarch64.cc | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 32eae49d4e9..2da743469ae 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -18334,7 +18334,7 @@ aarch64_override_options_internal (struct gcc_options > *opts) > "size. Given value %d (%llu KB) is out of range", > guard_size, (1ULL << guard_size) / 1024ULL); > > - /* Enforce that interval is the same size as size so the mid-end does the > + /* Enforce that interval is the same size as size so the middle-end does > the > right thing. */ >SET_OPTION_IF_UNSET (opts, &global_options_set, > param_stack_clash_protection_probe_interval, Not sure about this. Aren't both valid? > @@ -18346,8 +18346,8 @@ aarch64_override_options_internal (struct gcc_options > *opts) >int probe_interval > = param_stack_clash_protection_probe_interval; >if (guard_size != probe_interval) > -error ("stack clash guard size %<%d%> must be equal to probing interval " > -"%<%d%>", guard_size, probe_interval); > +error ("%<--param stack-clash-protection-probe-interval=%d%> needs to > match " > +"%<--param stack-clash-protection-guard-size=%d%>", probe_interval, > guard_size); I suppose both versions are still saying something like "4096 must equal 16384". 
So since you've brought up the bike shed, how about: "%<--param stack-clash-protection-probe-interval%> value %d does not " "match %<--param stack-clash-protection-guard-size%> value %d" or s/does not match/is not equal to/ OK for this hunk with either of those suggestions if you agree, but I'm open to other suggestions too... Thanks, Richard
Re: [PATCH] RISC-V: Set require-effective-target rv64 for PR113742
On 2/14/2024 12:09 PM, Robin Dapp wrote: On 2/14/24 20:46, Edwin Lu wrote: The testcase pr113742.c is failing for 32 bit targets due to the following cc1 error: cc1: error: ABI requries '-march=rv64' I think we usually just add exactly this to the test options (so it is always run rather than just on a 64-bit target). Regards Robin Ah oops, I glanced over the /* { dg-do compile } */ part. It should be fine to add '-march=rv64gc' instead then? Edwin
Re: [PATCH] aarch64: Use vec_perm_indices::new_shrunk_vector in aarch64_evpc_reencode
Andrew Pinski writes: > While working on PERM related stuff, I came across that aarch64_evpc_reencode > was manually figuring out if we shrink the perm indices instead of > using vec_perm_indices::new_shrunk_vector; shrunk was added after reencode > was added. > > Built and tested for aarch64-linux-gnu with no regressions. > > gcc/ChangeLog: > > PR target/113822 > * config/aarch64/aarch64.cc (aarch64_evpc_reencode): Use > vec_perm_indices::new_shrunk_vector instead of manually > going through the indices. Good spot! OK for stage 1, thanks. Richard > > Signed-off-by: Andrew Pinski > --- > gcc/config/aarch64/aarch64.cc | 24 +--- > 1 file changed, 5 insertions(+), 19 deletions(-) > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 32eae49d4e9..f4ed8b86532 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -25431,7 +25431,6 @@ static bool > aarch64_evpc_reencode (struct expand_vec_perm_d *d) > { >expand_vec_perm_d newd; > - unsigned HOST_WIDE_INT nelt; > >if (d->vec_flags != VEC_ADVSIMD) > return false; > @@ -25446,24 +25445,10 @@ aarch64_evpc_reencode (struct expand_vec_perm_d *d) >if (new_mode == word_mode) > return false; > > - /* to_constant is safe since this routine is specific to Advanced SIMD > - vectors. */ > - nelt = d->perm.length ().to_constant (); > - > - vec_perm_builder newpermconst; > - newpermconst.new_vector (nelt / 2, nelt / 2, 1); > + vec_perm_indices newpermindices; > > - /* Convert the perm constant if we can. Require even, odd as the pairs. 
> */ > - for (unsigned int i = 0; i < nelt; i += 2) > -{ > - poly_int64 elt0 = d->perm[i]; > - poly_int64 elt1 = d->perm[i + 1]; > - poly_int64 newelt; > - if (!multiple_p (elt0, 2, &newelt) || maybe_ne (elt0 + 1, elt1)) > - return false; > - newpermconst.quick_push (newelt.to_constant ()); > -} > - newpermconst.finalize (); > + if (!newpermindices.new_shrunk_vector (d->perm, 2)) > +return false; > >newd.vmode = new_mode; >newd.vec_flags = VEC_ADVSIMD; > @@ -25475,7 +25460,8 @@ aarch64_evpc_reencode (struct expand_vec_perm_d *d) >newd.testing_p = d->testing_p; >newd.one_vector_p = d->one_vector_p; > > - newd.perm.new_vector (newpermconst, newd.one_vector_p ? 1 : 2, nelt / 2); > + newd.perm.new_vector (newpermindices.encoding (), newd.one_vector_p ? 1 : > 2, > + newpermindices.nelts_per_input ()); >return aarch64_expand_vec_perm_const_1 (&newd); > }
Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).
Hello Richard: On 15/02/24 2:21 am, Richard Sandiford wrote: > Ajit Agarwal writes: >> Hello Richard: >> >> >> On 14/02/24 10:45 pm, Richard Sandiford wrote: >>> Ajit Agarwal writes: >> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc >> index 1856fa4884f..ffc47a6eaa0 100644 >> --- a/gcc/emit-rtl.cc >> +++ b/gcc/emit-rtl.cc >> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode >> imode, >> return false; >> >>/* The subreg offset cannot be outside the inner object. */ >> - if (maybe_ge (offset, isize)) >> + if (maybe_gt (offset, isize)) >> return false; > > Can you explain why this change is needed? > This is required in rs6000 target where we generate the subreg with offset 16 from OO mode (256 bit) to 128 bit vector modes. Otherwise it segfaults. >>> >>> Could you go into more detail? Why does that subreg lead to a segfault? >>> >>> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty >>> standard. AArch64 uses this too for its vector load/store pairs (and for >>> structure pairs more generally). >>> >> >> If we want to create (subreg V16QI (reg OO R) 16), imode is V16QI (isize = 16) >> and offset >> is 16. maybe_ge (offset, isize) returns true and validate_subreg returns >> false; > > isize is supposed to be the size of the "inner mode", which in this > case is OO. Since OO is a 32-byte mode, I would expect isize to be 32 > rather than 16. Is that not the case? > > Or is the problem that something is trying to take a subreg of a subreg? > If so, that is only valid in certain cases. It isn't for example valid > to use a subreg operation to move between (subreg:V16QI (reg:OO X) 16) > and (subreg:V16QI (reg:OO X) 0). > The above changes are not required. emit-rtl.cc changes are not required anymore as I have fixed this in the rs6000 target fusion code by making the modes of src and dest the same for the SET rtx, as you suggested for the REG_UNUSED issues. Thanks a lot for your help. Thanks & Regards Ajit > Thanks, > Richard
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
Hello Richard: On 15/02/24 1:14 am, Richard Sandiford wrote: > Ajit Agarwal writes: >> On 14/02/24 10:56 pm, Richard Sandiford wrote: >>> Ajit Agarwal writes: >> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc >> index 88ee0dd67fc..a8d0ee7c4db 100644 >> --- a/gcc/df-problems.cc >> +++ b/gcc/df-problems.cc >> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >> df_mw_hardreg *mws, >>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) >> { >>unsigned int regno = mws->start_regno; >> - df_set_note (REG_UNUSED, insn, mws->mw_reg); >> + //df_set_note (REG_UNUSED, insn, mws->mw_reg); >>dead_debug_insert_temp (debug, regno, insn, >> DEBUG_TEMP_AFTER_WITH_REG); >> >>if (REG_DEAD_DEBUGGING) >> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >> df_mw_hardreg *mws, >> if (!bitmap_bit_p (live, r) >> && !bitmap_bit_p (artificial_uses, r)) >>{ >> -df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >> + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >> dead_debug_insert_temp (debug, r, insn, >> DEBUG_TEMP_AFTER_WITH_REG); >> if (REG_DEAD_DEBUGGING) >>df_print_note ("adding 2: ", insn, REG_NOTES (insn)); >> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def, >> || bitmap_bit_p (artificial_uses, dregno) >> || df_ignore_stack_reg (dregno))) >> { >> - rtx reg = (DF_REF_LOC (def)) >> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >> - df_set_note (REG_UNUSED, insn, reg); >> + //rtx reg = (DF_REF_LOC (def)) >> + // ? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >> + //df_set_note (REG_UNUSED, insn, reg); >>dead_debug_insert_temp (debug, dregno, insn, >> DEBUG_TEMP_AFTER_WITH_REG); >>if (REG_DEAD_DEBUGGING) >> df_print_note ("adding 3: ", insn, REG_NOTES (insn)); > > I don't think this can be right. The last hunk of the var-tracking.cc > patch also seems to be reverting a correct change. 
> We generate sequential registers using (subreg V16QI (reg 00mode) 16) and (reg OOmode 0) where OOmode is 256 bit and V16QI is 128 bits in order to generate sequential register pair. >>> >>> OK. As I mentioned in the other message I just sent, it seems pretty >>> standard to use a 256-bit mode to represent a pair of 128-bit values. >>> In that case: >>> >>> - (reg:OO R) always refers to both registers in the pair, and any assignment >>> to it modifies both registers in the pair >>> >>> - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can >>> be modified independently of the second register >>> >>> - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can >>> be modified independently of the first register >>> >>> Is that how you're using it? >>> >> >> This is how I use it. >> (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ]) >> >> (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16) >> >> to generate sequential registers. With the above sequential registers >> are generated by RA. >> >> >>> One thing to be wary of is that it isn't possible to assign to two >>> subregs of the same reg in a single instruction (at least AFAIK). >>> So any operation that wants to store to both registers in the pair >>> must store to (reg:OO R) itself, not to the two subregs. >>> If I keep the above REG_UNUSED notes ira generates REG_UNUSED and in cprop_harreg pass and dce pass deletes store pairs and we get incorrect code. By commenting REG_UNUSED notes it is not generated and we get the correct store pair fusion and cprop_hardreg and dce doesn't deletes them. Please let me know is there are better ways to address this instead of commenting above generation of REG_UNUSED notes. >>> >>> Could you quote an example rtl sequence that includes incorrect notes? >>> It might help to understand the problem a bit more. 
>>> >> >> Here is the rtl code: >> >> (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16) >> (plus:V2DI (reg:V2DI 153 [ vect__4.10_72 ]) >> (reg:V2DI 154 [ _63 ]))) >> "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18 >> 1706 {addv2di3} >> (expr_list:REG_DEAD (reg:V2DI 154 [ _63 ]) >> (expr_list:REG_DEAD (reg:V2DI 153 [ vect__4.10_72 ]) >> (expr_list:REG_UNUSED (reg:OO 157 [ vect__5.11_76 ]) >> (nil) >> (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ]) >> (plus:V2DI (reg:V2DI 158 [ vect__4.10_73 ]) >> (reg:V2DI 159 [ _60 ]))) >> "/home/aagarwa/gcc-sources-fusion/gcc/tests
Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).
Ajit Agarwal writes: > Hello Richard: > > > On 14/02/24 10:45 pm, Richard Sandiford wrote: >> Ajit Agarwal writes: > diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc > index 1856fa4884f..ffc47a6eaa0 100644 > --- a/gcc/emit-rtl.cc > +++ b/gcc/emit-rtl.cc > @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode > imode, > return false; > >/* The subreg offset cannot be outside the inner object. */ > - if (maybe_ge (offset, isize)) > + if (maybe_gt (offset, isize)) > return false; Can you explain why this change is needed? >>> >>> This is required in rs6000 target where we generate the subreg >>> with offset 16 from OO mode (256 bit) to 128 bit vector modes. >>> Otherwise it segfaults. >> >> Could you go into more detail? Why does that subreg lead to a segfault? >> >> In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty >> standard. AArch64 uses this too for its vector load/store pairs (and for >> structure pairs more generally). >> > > If we want to create (subreg V16QI (reg OO R) 16), imode is V16QI (isize = 16) > and offset > is 16. maybe_ge (offset, isize) returns true and validate_subreg returns false; isize is supposed to be the size of the "inner mode", which in this case is OO. Since OO is a 32-byte mode, I would expect isize to be 32 rather than 16. Is that not the case? Or is the problem that something is trying to take a subreg of a subreg? If so, that is only valid in certain cases. It isn't for example valid to use a subreg operation to move between (subreg:V16QI (reg:OO X) 16) and (subreg:V16QI (reg:OO X) 0). Thanks, Richard
Re: [PATCH v2 4/4] libstdc++: Optimize std::remove_extent compilation performance
On Wed, 14 Feb 2024, Ken Matsui wrote: > This patch optimizes the compilation performance of std::remove_extent > by dispatching to the new __remove_extent built-in trait. > > libstdc++-v3/ChangeLog: > > * include/std/type_traits (remove_extent): Use __remove_extent > built-in trait. LGTM > > Signed-off-by: Ken Matsui > --- > libstdc++-v3/include/std/type_traits | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/libstdc++-v3/include/std/type_traits > b/libstdc++-v3/include/std/type_traits > index 3bde7cb8ba3..0fb1762186c 100644 > --- a/libstdc++-v3/include/std/type_traits > +++ b/libstdc++-v3/include/std/type_traits > @@ -2064,6 +2064,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION >// Array modifications. > >/// remove_extent > +#if _GLIBCXX_USE_BUILTIN_TRAIT(__remove_extent) > + template > +struct remove_extent > +{ using type = __remove_extent(_Tp); }; > +#else >template > struct remove_extent > { using type = _Tp; }; > @@ -2075,6 +2080,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION >template > struct remove_extent<_Tp[]> > { using type = _Tp; }; > +#endif > >/// remove_all_extents >template > -- > 2.43.0 > >
Re: [PATCH v2 3/4] c++: Implement __remove_extent built-in trait
On Wed, 14 Feb 2024, Ken Matsui wrote: > This patch implements built-in trait for std::remove_extent. > > gcc/cp/ChangeLog: > > * cp-trait.def: Define __remove_extent. > * semantics.cc (finish_trait_type): Handle CPTK_REMOVE_EXTENT. > > gcc/testsuite/ChangeLog: > > * g++.dg/ext/has-builtin-1.C: Test existence of __remove_extent. > * g++.dg/ext/remove_extent.C: New test. LGTM > > Signed-off-by: Ken Matsui > --- > gcc/cp/cp-trait.def | 1 + > gcc/cp/semantics.cc | 5 + > gcc/testsuite/g++.dg/ext/has-builtin-1.C | 3 +++ > gcc/testsuite/g++.dg/ext/remove_extent.C | 16 > 4 files changed, 25 insertions(+) > create mode 100644 gcc/testsuite/g++.dg/ext/remove_extent.C > > diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def > index cec385ee501..3ff5611b60e 100644 > --- a/gcc/cp/cp-trait.def > +++ b/gcc/cp/cp-trait.def > @@ -96,6 +96,7 @@ DEFTRAIT_EXPR (REF_CONSTRUCTS_FROM_TEMPORARY, > "__reference_constructs_from_tempo > DEFTRAIT_EXPR (REF_CONVERTS_FROM_TEMPORARY, > "__reference_converts_from_temporary", 2) > DEFTRAIT_TYPE (REMOVE_CV, "__remove_cv", 1) > DEFTRAIT_TYPE (REMOVE_CVREF, "__remove_cvref", 1) > +DEFTRAIT_TYPE (REMOVE_EXTENT, "__remove_extent", 1) > DEFTRAIT_TYPE (REMOVE_POINTER, "__remove_pointer", 1) > DEFTRAIT_TYPE (REMOVE_REFERENCE, "__remove_reference", 1) > DEFTRAIT_TYPE (TYPE_PACK_ELEMENT, "__type_pack_element", -1) > diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc > index e23693ab57f..bf998377c88 100644 > --- a/gcc/cp/semantics.cc > +++ b/gcc/cp/semantics.cc > @@ -12777,6 +12777,11 @@ finish_trait_type (cp_trait_kind kind, tree type1, > tree type2, > type1 = TREE_TYPE (type1); >return cv_unqualified (type1); > > +case CPTK_REMOVE_EXTENT: > + if (TREE_CODE (type1) == ARRAY_TYPE) > + type1 = TREE_TYPE (type1); > + return type1; > + > case CPTK_REMOVE_POINTER: >if (TYPE_PTR_P (type1)) > type1 = TREE_TYPE (type1); > diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C > b/gcc/testsuite/g++.dg/ext/has-builtin-1.C > index 56e8db7ac32..4f1094befb9 100644 
> --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C > +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C > @@ -170,6 +170,9 @@ > #if !__has_builtin (__remove_cvref) > # error "__has_builtin (__remove_cvref) failed" > #endif > +#if !__has_builtin (__remove_extent) > +# error "__has_builtin (__remove_extent) failed" > +#endif > #if !__has_builtin (__remove_pointer) > # error "__has_builtin (__remove_pointer) failed" > #endif > diff --git a/gcc/testsuite/g++.dg/ext/remove_extent.C > b/gcc/testsuite/g++.dg/ext/remove_extent.C > new file mode 100644 > index 000..6183aca5a48 > --- /dev/null > +++ b/gcc/testsuite/g++.dg/ext/remove_extent.C > @@ -0,0 +1,16 @@ > +// { dg-do compile { target c++11 } } > + > +#define SA(X) static_assert((X),#X) > + > +class ClassType { }; > + > +SA(__is_same(__remove_extent(int), int)); > +SA(__is_same(__remove_extent(int[2]), int)); > +SA(__is_same(__remove_extent(int[2][3]), int[3])); > +SA(__is_same(__remove_extent(int[][3]), int[3])); > +SA(__is_same(__remove_extent(const int[2]), const int)); > +SA(__is_same(__remove_extent(ClassType), ClassType)); > +SA(__is_same(__remove_extent(ClassType[2]), ClassType)); > +SA(__is_same(__remove_extent(ClassType[2][3]), ClassType[3])); > +SA(__is_same(__remove_extent(ClassType[][3]), ClassType[3])); > +SA(__is_same(__remove_extent(const ClassType[2]), const ClassType)); > -- > 2.43.0 > >
Re: [PATCH v2 2/4] libstdc++: Optimize std::add_pointer compilation performance
On Wed, 14 Feb 2024, Ken Matsui wrote: > This patch optimizes the compilation performance of std::add_pointer > by dispatching to the new __add_pointer built-in trait. > > libstdc++-v3/ChangeLog: > > * include/std/type_traits (add_pointer): Use __add_pointer > built-in trait. LGTM > > Signed-off-by: Ken Matsui > --- > libstdc++-v3/include/std/type_traits | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/libstdc++-v3/include/std/type_traits > b/libstdc++-v3/include/std/type_traits > index 21402fd8c13..3bde7cb8ba3 100644 > --- a/libstdc++-v3/include/std/type_traits > +++ b/libstdc++-v3/include/std/type_traits > @@ -2121,6 +2121,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > { }; > #endif > > + /// add_pointer > +#if _GLIBCXX_USE_BUILTIN_TRAIT(__add_pointer) > + template > +struct add_pointer > +{ using type = __add_pointer(_Tp); }; > +#else >template > struct __add_pointer_helper > { using type = _Tp; }; > @@ -2129,7 +2135,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > struct __add_pointer_helper<_Tp, __void_t<_Tp*>> > { using type = _Tp*; }; > > - /// add_pointer >template > struct add_pointer > : public __add_pointer_helper<_Tp> > @@ -2142,6 +2147,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION >template > struct add_pointer<_Tp&&> > { using type = _Tp*; }; > +#endif > > #if __cplusplus > 201103L >/// Alias template for remove_pointer > -- > 2.43.0 > >
Re: [PATCH v2 1/4] c++: Implement __add_pointer built-in trait
On Wed, 14 Feb 2024, Ken Matsui wrote: > This patch implements built-in trait for std::add_pointer. > > gcc/cp/ChangeLog: > > * cp-trait.def: Define __add_pointer. > * semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER. > > gcc/testsuite/ChangeLog: > > * g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer. > * g++.dg/ext/add_pointer.C: New test. > > Signed-off-by: Ken Matsui > --- > gcc/cp/cp-trait.def | 1 + > gcc/cp/semantics.cc | 9 ++ > gcc/testsuite/g++.dg/ext/add_pointer.C | 37 > gcc/testsuite/g++.dg/ext/has-builtin-1.C | 3 ++ > 4 files changed, 50 insertions(+) > create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C > > diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def > index 394f006f20f..cec385ee501 100644 > --- a/gcc/cp/cp-trait.def > +++ b/gcc/cp/cp-trait.def > @@ -48,6 +48,7 @@ > #define DEFTRAIT_TYPE_DEFAULTED > #endif > > +DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1) > DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1) > DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1) > DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1) > diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc > index 57840176863..e23693ab57f 100644 > --- a/gcc/cp/semantics.cc > +++ b/gcc/cp/semantics.cc > @@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, > tree type2, > >switch (kind) > { > +case CPTK_ADD_POINTER: > + if (TREE_CODE (type1) == FUNCTION_TYPE > + && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE)) > +|| FUNCTION_REF_QUALIFIED (type1))) In other parts of the front end, e.g. the POINTER_TYPE case of tsubst, in build_trait_object, grokdeclarator and get_typeid, it seems we check for an unqualified function type with (type_memfn_quals (type) != TYPE_UNQUALIFIED && type_mem_rqual (type) != REF_QUAL_NONE) which should be equivalent to your formulation except it also checks for non-standard qualifiers such as __restrict. 
I'm not sure what a __restrict-qualified function type means or if we care about the semantics of __add_pointer(void () __restrict), but I reckon we might as well be consistent and use the type_memfn_quals/rqual formulation in new code too? > + return type1; > + if (TYPE_REF_P (type1)) > + type1 = TREE_TYPE (type1); > + return build_pointer_type (type1); > + > case CPTK_REMOVE_CV: >return cv_unqualified (type1); > > diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C > b/gcc/testsuite/g++.dg/ext/add_pointer.C > new file mode 100644 > index 000..3091510f3b5 > --- /dev/null > +++ b/gcc/testsuite/g++.dg/ext/add_pointer.C > @@ -0,0 +1,37 @@ > +// { dg-do compile { target c++11 } } > + > +#define SA(X) static_assert((X),#X) > + > +class ClassType { }; > + > +SA(__is_same(__add_pointer(int), int*)); > +SA(__is_same(__add_pointer(int*), int**)); > +SA(__is_same(__add_pointer(const int), const int*)); > +SA(__is_same(__add_pointer(int&), int*)); > +SA(__is_same(__add_pointer(ClassType*), ClassType**)); > +SA(__is_same(__add_pointer(ClassType), ClassType*)); > +SA(__is_same(__add_pointer(void), void*)); > +SA(__is_same(__add_pointer(const void), const void*)); > +SA(__is_same(__add_pointer(volatile void), volatile void*)); > +SA(__is_same(__add_pointer(const volatile void), const volatile void*)); > + > +void f1(); > +using f1_type = decltype(f1); > +using pf1_type = decltype(&f1); > +SA(__is_same(__add_pointer(f1_type), pf1_type)); > + > +void f2() noexcept; // PR libstdc++/78361 > +using f2_type = decltype(f2); > +using pf2_type = decltype(&f2); > +SA(__is_same(__add_pointer(f2_type), pf2_type)); > + > +using fn_type = void(); > +using pfn_type = void(*)(); > +SA(__is_same(__add_pointer(fn_type), pfn_type)); > + > +SA(__is_same(__add_pointer(void() &), void() &)); > +SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept)); > +SA(__is_same(__add_pointer(void() const), void() const)); > +SA(__is_same(__add_pointer(void(...) &), void(...) 
&)); > +SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept)); > +SA(__is_same(__add_pointer(void(...) const), void(...) const)); > diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C > b/gcc/testsuite/g++.dg/ext/has-builtin-1.C > index 02b4b4d745d..56e8db7ac32 100644 > --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C > +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C > @@ -2,6 +2,9 @@ > // { dg-do compile } > // Verify that __has_builtin gives the correct answer for C++ built-ins. > > +#if !__has_builtin (__add_pointer) > +# error "__has_builtin (__add_pointer) failed" > +#endif > #if !__has_builtin (__builtin_addressof) > # error "__has_builtin (__builtin_addressof) failed" > #endif > -- > 2.43.0 > >
[committed] testsuite: Fix a couple of x86 issues in gcc.dg/vect testsuite
A compile-time test can use -march=skylake-avx512 for all x86 targets, but a runtime test needs to check avx512f effective target if the instructions can be assembled. The runtime test also needs to check if the target machine supports instruction set we have been compiled for. The testsuite uses check_vect infrastructure, but handling of AVX512F+ ISAs was missing there. Add detection of __AVX512F__ and __AVX512VL__, which is enough to handle all currently mentioned target processors in the gcc.dg/vect testsuite. gcc/testsuite/ChangeLog: * gcc.dg/vect/pr113576.c (dg-additional-options): Use -march=skylake-avx512 for avx512f effective target. * gcc.dg/vect/pr98308.c (dg-additional-options): Use -march=skylake-avx512 for all x86 targets. * gcc.dg/vect/tree-vect.h (check_vect): Handle __AVX512F__ and __AVX512VL__. Tested on x86_64-linux-gnu on AVX2 target where the patch prevents pr113576 runtime failure due to unsupported avx512f instruction. Uros. diff --git a/gcc/testsuite/gcc.dg/vect/pr113576.c b/gcc/testsuite/gcc.dg/vect/pr113576.c index decb7abe2f7..b6edde6f8e2 100644 --- a/gcc/testsuite/gcc.dg/vect/pr113576.c +++ b/gcc/testsuite/gcc.dg/vect/pr113576.c @@ -1,6 +1,6 @@ /* { dg-do run } */ /* { dg-options "-O3" } */ -/* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-* i?86-*-* } } } */ +/* { dg-additional-options "-march=skylake-avx512" { target avx512f } } */ #include "tree-vect.h" diff --git a/gcc/testsuite/gcc.dg/vect/pr98308.c b/gcc/testsuite/gcc.dg/vect/pr98308.c index aeec9771c55..d74431200c7 100644 --- a/gcc/testsuite/gcc.dg/vect/pr98308.c +++ b/gcc/testsuite/gcc.dg/vect/pr98308.c @@ -1,6 +1,6 @@ /* { dg-do compile } */ /* { dg-additional-options "-O3" } */ -/* { dg-additional-options "-march=skylake-avx512" { target avx512f } } */ +/* { dg-additional-options "-march=skylake-avx512" { target x86_64-*-* i?86-*-* } } */ /* { dg-additional-options "-fdump-tree-optimized-details-blocks" } */ extern unsigned long long int arr_86[]; diff 
--git a/gcc/testsuite/gcc.dg/vect/tree-vect.h b/gcc/testsuite/gcc.dg/vect/tree-vect.h index c4b81441216..1e4b56ee0e1 100644 --- a/gcc/testsuite/gcc.dg/vect/tree-vect.h +++ b/gcc/testsuite/gcc.dg/vect/tree-vect.h @@ -38,7 +38,11 @@ check_vect (void) /* Determine what instruction set we've been compiled for, and detect that we're running with it. This allows us to at least do a compile check for, e.g. SSE4.1 when the machine only supports SSE2. */ -# if defined(__AVX2__) +# if defined(__AVX512VL__) +want_level = 7, want_b = bit_AVX512VL; +# elif defined(__AVX512F__) +want_level = 7, want_b = bit_AVX512F; +# elif defined(__AVX2__) want_level = 7, want_b = bit_AVX2; # elif defined(__AVX__) want_level = 1, want_c = bit_AVX;
Re: [PATCH] RISC-V: Set require-effective-target rv64 for PR113742
On 2/14/24 20:46, Edwin Lu wrote: > The testcase pr113742.c is failing for 32 bit targets due to the following cc1 > error: > cc1: error: ABI requries '-march=rv64' I think we usually just add exactly this to the test options (so it is always run rather than just on a 64-bit target). Regards Robin
Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline
On Wed, Feb 14, 2024 at 11:59 AM Iain Sandoe wrote: > > > > > On 14 Feb 2024, at 18:12, H.J. Lu wrote: > > > > On Tue, Feb 13, 2024 at 8:46 AM Jakub Jelinek wrote: > >> > >> On Tue, Feb 13, 2024 at 08:40:52AM -0800, H.J. Lu wrote: > >>> Add x32 and IBT support to x86 heap trampoline implementation with a > >>> testcase. > >>> > >>> 2024-02-13 Jakub Jelinek > >>> H.J. Lu > >>> > >>> libgcc/ > >>> > >>> PR target/113855 > >>> * config/i386/heap-trampoline.c (trampoline_insns): Add IBT > >>> support and pad to the multiple of 4 bytes. Use movabsq > >>> instead of movabs in comments. Add -mx32 variant. > >>> > >>> gcc/testsuite/ > >>> > >>> PR target/113855 > >>> * gcc.dg/heap-trampoline-1.c: New test. > >>> * lib/target-supports.exp (check_effective_target_heap_trampoline): > >>> New. > >> > >> LGTM, but please give Iain a day or two to chime in. > >> > >>Jakub > >> > > > > I am checking it in today. > > I have just one question; > > from your patch the use of endbr* seems to be unconditionally based on the > flags used to build libgcc. > > However, I was expecting that the use of extended trampolines like this would > depend on command line flags used to compile the end-user’s code. We only ship ONE libgcc binary. You get the same libgcc binary regardless of what options one uses to compile an application. Since ENDBR64 is a NOP if IBT isn't enabled, it isn't an issue. > As per the discussion in > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113855#c4 > I was expecting that we would need to extend this implementation to cover > more > cases (i.e. the GCC-14 implementation is “base”). > > any comments? > Iain > > > > > > -- > > H.J. > -- H.J.
Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline
On Wed, Feb 14, 2024 at 07:59:26PM +, Iain Sandoe wrote: > I have just one question; > > from your patch the use of endbr* seems to be unconditionally based on the > flags used to build libgcc. > > However, I was expecting that the use of extended trampolines like this would > depend on command line flags used to compile the end-user’s code. I think for CET the rule is that you need everything to be compiled with the CET options, including libgcc; trying to mix and match objects built one way and another is not going to work when enforcing, unless one is lucky and there are no indirect calls to something that isn't marked. And, the endbr* insn acts as a nop on older CPUs (ok, except for VIA or something similar or pre-i686?) or when not enforcing. So, if CET is enabled while building libgcc, the insns in there don't hurt, and if the gcc libraries aren't built with CET, one really can't use it. Jakub
Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline
> On 14 Feb 2024, at 18:12, H.J. Lu wrote: > > On Tue, Feb 13, 2024 at 8:46 AM Jakub Jelinek wrote: >> >> On Tue, Feb 13, 2024 at 08:40:52AM -0800, H.J. Lu wrote: >>> Add x32 and IBT support to x86 heap trampoline implementation with a >>> testcase. >>> >>> 2024-02-13 Jakub Jelinek >>> H.J. Lu >>> >>> libgcc/ >>> >>> PR target/113855 >>> * config/i386/heap-trampoline.c (trampoline_insns): Add IBT >>> support and pad to the multiple of 4 bytes. Use movabsq >>> instead of movabs in comments. Add -mx32 variant. >>> >>> gcc/testsuite/ >>> >>> PR target/113855 >>> * gcc.dg/heap-trampoline-1.c: New test. >>> * lib/target-supports.exp (check_effective_target_heap_trampoline): >>> New. >> >> LGTM, but please give Iain a day or two to chime in. >> >>Jakub >> > > I am checking it in today. I have just one question; from your patch the use of endbr* seems to be unconditionally based on the flags used to build libgcc. However, I was expecting that the use of extended trampolines like this would depend on command line flags used to compile the end-user’s code. As per the discussion in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113855#c4 I was expecting that we would need to extend this implementation to cover more cases (i.e. the GCC-14 implementation is “base”). any comments? Iain > > -- > H.J.
[committed] i386: psrlq is not used for PERM [PR113871]
Introduce vec_shl_ and vec_shr_ expanders to improve '*a = __builtin_shufflevector(*a, (vect64){0}, 1, 2, 3, 4);' and '*a = __builtin_shufflevector((vect64){0}, *a, 3, 4, 5, 6);' shuffles. The generated code improves from: movzwl 6(%rdi), %eax movzwl 4(%rdi), %edx salq $16, %rax orq %rdx, %rax movzwl 2(%rdi), %edx salq $16, %rax orq %rdx, %rax movq %rax, (%rdi) to: movq (%rdi), %xmm0 psrlq $16, %xmm0 movq %xmm0, (%rdi) and to: movq (%rdi), %xmm0 psllq $16, %xmm0 movq %xmm0, (%rdi) in the second case. The patch handles 32-bit vectors as well and improves generated code from: movd (%rdi), %xmm0 pxor %xmm1, %xmm1 punpcklwd %xmm1, %xmm0 pshuflw $230, %xmm0, %xmm0 movd %xmm0, (%rdi) to: movd (%rdi), %xmm0 psrld $16, %xmm0 movd %xmm0, (%rdi) and to: movd (%rdi), %xmm0 pslld $16, %xmm0 movd %xmm0, (%rdi) PR target/113871 gcc/ChangeLog: * config/i386/mmx.md (V248FI): New mode iterator. (V24FI_32): Ditto. (vec_shl_): New expander. (vec_shl_): Ditto. (vec_shr_): Ditto. (vec_shr_): Ditto. * config/i386/sse.md (vec_shl_): Simplify expander. (vec_shr_): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/pr113871-1a.c: New test. * gcc.target/i386/pr113871-1b.c: New test. * gcc.target/i386/pr113871-2a.c: New test. * gcc.target/i386/pr113871-2b.c: New test. * gcc.target/i386/pr113871-3a.c: New test. * gcc.target/i386/pr113871-3b.c: New test. * gcc.target/i386/pr113871-4a.c: New test. Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}. Uros.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md index 6215b12f05f..075309cca9f 100644 --- a/gcc/config/i386/mmx.md +++ b/gcc/config/i386/mmx.md @@ -84,6 +84,11 @@ (define_mode_iterator V_16_32_64 (define_mode_iterator V2FI [V2SF V2SI]) (define_mode_iterator V24FI [V2SF V2SI V4HF V4HI]) + +(define_mode_iterator V248FI [V2SF V2SI V4HF V4HI V8QI]) + +(define_mode_iterator V24FI_32 [V2HF V2HI V4QI]) + ;; Mapping from integer vector mode to mnemonic suffix (define_mode_attr mmxvecsize [(V8QI "b") (V4QI "b") (V2QI "b") @@ -3729,6 +3734,70 @@ (define_expand "vv4qi3" DONE; }) +(define_expand "vec_shl_" + [(set (match_operand:V248FI 0 "register_operand") + (ashift:V1DI + (match_operand:V248FI 1 "nonimmediate_operand") + (match_operand:DI 2 "nonmemory_operand")))] + "TARGET_MMX_WITH_SSE" +{ + rtx op0 = gen_reg_rtx (V1DImode); + rtx op1 = force_reg (mode, operands[1]); + + emit_insn (gen_mmx_ashlv1di3 + (op0, gen_lowpart (V1DImode, op1), operands[2])); + emit_move_insn (operands[0], gen_lowpart (mode, op0)); + DONE; +}) + +(define_expand "vec_shl_" + [(set (match_operand:V24FI_32 0 "register_operand") + (ashift:V1SI + (match_operand:V24FI_32 1 "nonimmediate_operand") + (match_operand:DI 2 "nonmemory_operand")))] + "TARGET_SSE2" +{ + rtx op0 = gen_reg_rtx (V1SImode); + rtx op1 = force_reg (mode, operands[1]); + + emit_insn (gen_mmx_ashlv1si3 + (op0, gen_lowpart (V1SImode, op1), operands[2])); + emit_move_insn (operands[0], gen_lowpart (mode, op0)); + DONE; +}) + +(define_expand "vec_shr_" + [(set (match_operand:V248FI 0 "register_operand") + (lshiftrt:V1DI + (match_operand:V248FI 1 "nonimmediate_operand") + (match_operand:DI 2 "nonmemory_operand")))] + "TARGET_MMX_WITH_SSE" +{ + rtx op0 = gen_reg_rtx (V1DImode); + rtx op1 = force_reg (mode, operands[1]); + + emit_insn (gen_mmx_lshrv1di3 + (op0, gen_lowpart (V1DImode, op1), operands[2])); + emit_move_insn (operands[0], gen_lowpart (mode, op0)); + DONE; +}) + +(define_expand "vec_shr_" + [(set 
(match_operand:V24FI_32 0 "register_operand") + (lshiftrt:V1SI + (match_operand:V24FI_32 1 "nonimmediate_operand") + (match_operand:DI 2 "nonmemory_operand")))] + "TARGET_SSE2" +{ + rtx op0 = gen_reg_rtx (V1SImode); + rtx op1 = force_reg (mode, operands[1]); + + emit_insn (gen_mmx_lshrv1si3 + (op0, gen_lowpart (V1SImode, op1), operands[2])); + emit_move_insn (operands[0], gen_lowpart (mode, op0)); + DONE; +}) + ; ;; ;; Parallel integral comparisons diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index acd10908d76..1bc614ab702 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -16498,29 +16498,35 @@ (define_split "operands[3] = XVECEXP (operands[2], 0, 0);") (define_expand "vec_shl_" - [(set (match_dup 3) + [(set (match_operand:V_128 0 "register_operand") (ashift:V1TI -(match_operand:V_128 1 "register_operand") -(match_operand:SI 2 "const_0_to_255_mul_8_operand"))) - (set (match_operand:V_128 0 "register_operand") (match_dup 4))] +(match_operand:V_128 1 "nonimmediate_o
[PATCH] RISC-V: Set require-effective-target rv64 for PR113742
The testcase pr113742.c is failing for 32 bit targets due to the following cc1 error: cc1: error: ABI requries '-march=rv64' Disable testing on rv32 targets PR target/113742 gcc/testsuite/ChangeLog: * gcc.target/riscv/pr113742.c: add require-effective-target Signed-off-by: Edwin Lu --- gcc/testsuite/gcc.target/riscv/pr113742.c | 1 + 1 file changed, 1 insertion(+) diff --git a/gcc/testsuite/gcc.target/riscv/pr113742.c b/gcc/testsuite/gcc.target/riscv/pr113742.c index ab8934c2a8a..9cea92ed97c 100644 --- a/gcc/testsuite/gcc.target/riscv/pr113742.c +++ b/gcc/testsuite/gcc.target/riscv/pr113742.c @@ -1,4 +1,5 @@ //* { dg-do compile } */ /* { dg-options "-O2 -finstrument-functions -mabi=lp64d -mcpu=sifive-p670" } */ +/* { dg-require-effective-target rv64 } */ void foo(void) {} -- 2.34.1
Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).
Hello Richard: On 14/02/24 10:45 pm, Richard Sandiford wrote: > Ajit Agarwal writes: diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc index 1856fa4884f..ffc47a6eaa0 100644 --- a/gcc/emit-rtl.cc +++ b/gcc/emit-rtl.cc @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode, return false; /* The subreg offset cannot be outside the inner object. */ - if (maybe_ge (offset, isize)) + if (maybe_gt (offset, isize)) return false; >>> >>> Can you explain why this change is needed? >>> >> >> This is required in rs6000 target where we generate the subreg >> with offset 16 from OO mode (256 bit) to 128 bit vector modes. >> Otherwise it segfaults. > > Could you go into more detail? Why does that subreg lead to a segfault? > > In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty > standard. AArch64 uses this too for its vector load/store pairs (and for > structure pairs more generally). > If we want to create (subreg:V16QI (reg:OO R) 16), imode is V16QI (isize = 16) and offset is 16. maybe_ge (offset, isize) returns true and validate_subreg returns false. Hence the above subreg is not generated and we generate incorrect code. That's why I have changed it to maybe_gt (offset, isize). Thanks & Regards Ajit > Thanks, > Richard
Re: [PATCH][_GLIBCXX_DEBUG] Fix std::__niter_base behavior
On Wed, 14 Feb 2024 at 18:39, François Dumont wrote: > libstdc++: [_GLIBCXX_DEBUG] Fix std::__niter_base behavior > > std::__niter_base is used in _GLIBCXX_DEBUG mode to remove _Safe_iterator<> > wrapper on random access iterators. But doing so it should also preserve > original > behavior to remove __normal_iterator wrapper. > > libstdc++-v3/ChangeLog: > > * include/bits/stl_algobase.h (std::__niter_base): Redefine the > overload > definitions for __gnu_debug::_Safe_iterator. > * include/debug/safe_iterator.tcc (std::__niter_base): Adapt > declarations. > > Ok to commit once all tests completed (still need to check pre-c++11) ? > The declaration in include/bits/stl_algobase.h has a noexcept-specifier but the definition in include/debug/safe_iterator.tcc does not have one - that seems wrong (I'm surprised it even compiles). Just using std::is_nothrow_copy_constructible<_Ite> seems simpler, that will be true for __normal_iterator if is_nothrow_copy_constructible is true. The definition in include/debug/safe_iterator.tcc should use std::declval<_Ite>() not declval<_Ite>(). Is there any reason why the definition uses a late-specified-return-type (i.e. auto and ->) when the declaration doesn't?
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
Ajit Agarwal writes: > On 14/02/24 10:56 pm, Richard Sandiford wrote: >> Ajit Agarwal writes: > diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc > index 88ee0dd67fc..a8d0ee7c4db 100644 > --- a/gcc/df-problems.cc > +++ b/gcc/df-problems.cc > @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct > df_mw_hardreg *mws, >if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) > { >unsigned int regno = mws->start_regno; > - df_set_note (REG_UNUSED, insn, mws->mw_reg); > + //df_set_note (REG_UNUSED, insn, mws->mw_reg); >dead_debug_insert_temp (debug, regno, insn, > DEBUG_TEMP_AFTER_WITH_REG); > >if (REG_DEAD_DEBUGGING) > @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct > df_mw_hardreg *mws, > if (!bitmap_bit_p (live, r) > && !bitmap_bit_p (artificial_uses, r)) > { > - df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); > +// df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); > dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG); > if (REG_DEAD_DEBUGGING) > df_print_note ("adding 2: ", insn, REG_NOTES (insn)); > @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def, > || bitmap_bit_p (artificial_uses, dregno) > || df_ignore_stack_reg (dregno))) > { > - rtx reg = (DF_REF_LOC (def)) > -? *DF_REF_REAL_LOC (def): DF_REF_REG (def); > - df_set_note (REG_UNUSED, insn, reg); > + //rtx reg = (DF_REF_LOC (def)) > + // ? *DF_REF_REAL_LOC (def): DF_REF_REG (def); > + //df_set_note (REG_UNUSED, insn, reg); >dead_debug_insert_temp (debug, dregno, insn, > DEBUG_TEMP_AFTER_WITH_REG); >if (REG_DEAD_DEBUGGING) > df_print_note ("adding 3: ", insn, REG_NOTES (insn)); I don't think this can be right. The last hunk of the var-tracking.cc patch also seems to be reverting a correct change. >>> >>> We generate sequential registers using (subreg V16QI (reg 00mode) 16) >>> and (reg OOmode 0) >>> where OOmode is 256 bit and V16QI is 128 bits in order to generate >>> sequential register pair. >> >> OK. 
As I mentioned in the other message I just sent, it seems pretty >> standard to use a 256-bit mode to represent a pair of 128-bit values. >> In that case: >> >> - (reg:OO R) always refers to both registers in the pair, and any assignment >> to it modifies both registers in the pair >> >> - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can >> be modified independently of the second register >> >> - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can >> be modified independently of the first register >> >> Is that how you're using it? >> > > This is how I use it. > (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ]) > > (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16) > > to generate sequential registers. With the above sequential registers > are generated by RA. > > >> One thing to be wary of is that it isn't possible to assign to two >> subregs of the same reg in a single instruction (at least AFAIK). >> So any operation that wants to store to both registers in the pair >> must store to (reg:OO R) itself, not to the two subregs. >> >>> If I keep the above REG_UNUSED notes ira generates >>> REG_UNUSED and in cprop_harreg pass and dce pass deletes store pairs and >>> we get incorrect code. >>> >>> By commenting REG_UNUSED notes it is not generated and we get the correct >>> store >>> pair fusion and cprop_hardreg and dce doesn't deletes them. >>> >>> Please let me know is there are better ways to address this instead of >>> commenting >>> above generation of REG_UNUSED notes. >> >> Could you quote an example rtl sequence that includes incorrect notes? >> It might help to understand the problem a bit more. 
>> > > Here is the rtl code: > > (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16) > (plus:V2DI (reg:V2DI 153 [ vect__4.10_72 ]) > (reg:V2DI 154 [ _63 ]))) > "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18 > 1706 {addv2di3} > (expr_list:REG_DEAD (reg:V2DI 154 [ _63 ]) > (expr_list:REG_DEAD (reg:V2DI 153 [ vect__4.10_72 ]) > (expr_list:REG_UNUSED (reg:OO 157 [ vect__5.11_76 ]) > (nil) > (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ]) > (plus:V2DI (reg:V2DI 158 [ vect__4.10_73 ]) > (reg:V2DI 159 [ _60 ]))) > "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18 > 1706 {addv2di3} > (expr_list:REG_DEAD (reg:V2DI 159 [ _60 ]) > (expr_list:REG_DEAD (reg:V2DI 158 [ vect__4.10_73 ]) > (nil > (insn 33 27 39 2 (set (subreg:V2DI (reg:OO 167 [
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
On 14/02/24 10:56 pm, Richard Sandiford wrote: > Ajit Agarwal writes: diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc index 88ee0dd67fc..a8d0ee7c4db 100644 --- a/gcc/df-problems.cc +++ b/gcc/df-problems.cc @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct df_mw_hardreg *mws, if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) { unsigned int regno = mws->start_regno; - df_set_note (REG_UNUSED, insn, mws->mw_reg); + //df_set_note (REG_UNUSED, insn, mws->mw_reg); dead_debug_insert_temp (debug, regno, insn, DEBUG_TEMP_AFTER_WITH_REG); if (REG_DEAD_DEBUGGING) @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct df_mw_hardreg *mws, if (!bitmap_bit_p (live, r) && !bitmap_bit_p (artificial_uses, r)) { - df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG); if (REG_DEAD_DEBUGGING) df_print_note ("adding 2: ", insn, REG_NOTES (insn)); @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def, || bitmap_bit_p (artificial_uses, dregno) || df_ignore_stack_reg (dregno))) { - rtx reg = (DF_REF_LOC (def)) -? *DF_REF_REAL_LOC (def): DF_REF_REG (def); - df_set_note (REG_UNUSED, insn, reg); + //rtx reg = (DF_REF_LOC (def)) + // ? *DF_REF_REAL_LOC (def): DF_REF_REG (def); + //df_set_note (REG_UNUSED, insn, reg); dead_debug_insert_temp (debug, dregno, insn, DEBUG_TEMP_AFTER_WITH_REG); if (REG_DEAD_DEBUGGING) df_print_note ("adding 3: ", insn, REG_NOTES (insn)); >>> >>> I don't think this can be right. The last hunk of the var-tracking.cc >>> patch also seems to be reverting a correct change. >>> >> >> We generate sequential registers using (subreg V16QI (reg 00mode) 16) >> and (reg OOmode 0) >> where OOmode is 256 bit and V16QI is 128 bits in order to generate >> sequential register pair. > > OK. 
As I mentioned in the other message I just sent, it seems pretty > standard to use a 256-bit mode to represent a pair of 128-bit values. > In that case: > > - (reg:OO R) always refers to both registers in the pair, and any assignment > to it modifies both registers in the pair > > - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can > be modified independently of the second register > > - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can > be modified independently of the first register > > Is that how you're using it? > This is how I use it. (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ]) (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16) to generate sequential registers. With the above sequential registers are generated by RA. > One thing to be wary of is that it isn't possible to assign to two > subregs of the same reg in a single instruction (at least AFAIK). > So any operation that wants to store to both registers in the pair > must store to (reg:OO R) itself, not to the two subregs. > >> If I keep the above REG_UNUSED notes ira generates >> REG_UNUSED and in cprop_harreg pass and dce pass deletes store pairs and >> we get incorrect code. >> >> By commenting REG_UNUSED notes it is not generated and we get the correct >> store >> pair fusion and cprop_hardreg and dce doesn't deletes them. >> >> Please let me know is there are better ways to address this instead of >> commenting >> above generation of REG_UNUSED notes. > > Could you quote an example rtl sequence that includes incorrect notes? > It might help to understand the problem a bit more. 
> Here is the rtl code: (insn 21 35 27 2 (set (subreg:V2DI (reg:OO 157 [ vect__5.11_76 ]) 16) (plus:V2DI (reg:V2DI 153 [ vect__4.10_72 ]) (reg:V2DI 154 [ _63 ]))) "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18 1706 {addv2di3} (expr_list:REG_DEAD (reg:V2DI 154 [ _63 ]) (expr_list:REG_DEAD (reg:V2DI 153 [ vect__4.10_72 ]) (expr_list:REG_UNUSED (reg:OO 157 [ vect__5.11_76 ]) (nil) (insn 27 21 33 2 (set (reg:OO 157 [ vect__5.11_76 ]) (plus:V2DI (reg:V2DI 158 [ vect__4.10_73 ]) (reg:V2DI 159 [ _60 ]))) "/home/aagarwa/gcc-sources-fusion/gcc/testsuite/gcc.c-torture/execute/20030928-1.c":11:18 1706 {addv2di3} (expr_list:REG_DEAD (reg:V2DI 159 [ _60 ]) (expr_list:REG_DEAD (reg:V2DI 158 [ vect__4.10_73 ]) (nil (insn 33 27 39 2 (set (subreg:V2DI (reg:OO 167 [ vect__5.11_78 ]) 16) (plus:V2DI (reg:V2DI 163 [ vect__4.10_74 ]) (reg:V2DI 164 [ _57 ]))) "/home/aagarwa/gcc-sources-fusion/gcc/t
[PATCH][_GLIBCXX_DEBUG] Fix std::__niter_base behavior
libstdc++: [_GLIBCXX_DEBUG] Fix std::__niter_base behavior std::__niter_base is used in _GLIBCXX_DEBUG mode to remove _Safe_iterator<> wrapper on random access iterators. But doing so it should also preserve original behavior to remove __normal_iterator wrapper. libstdc++-v3/ChangeLog: * include/bits/stl_algobase.h (std::__niter_base): Redefine the overload definitions for __gnu_debug::_Safe_iterator. * include/debug/safe_iterator.tcc (std::__niter_base): Adapt declarations. Ok to commit once all tests completed (still need to check pre-c++11) ? François diff --git a/libstdc++-v3/include/bits/stl_algobase.h b/libstdc++-v3/include/bits/stl_algobase.h index e7207f67266..056fa0c4173 100644 --- a/libstdc++-v3/include/bits/stl_algobase.h +++ b/libstdc++-v3/include/bits/stl_algobase.h @@ -317,12 +317,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _GLIBCXX_NOEXCEPT_IF(std::is_nothrow_copy_constructible<_Iterator>::value) { return __it; } +#if __cplusplus < 201103L template -_GLIBCXX20_CONSTEXPR _Ite __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq, std::random_access_iterator_tag>&); + template +_Ite +__niter_base(const ::__gnu_debug::_Safe_iterator< +::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _Seq, +std::random_access_iterator_tag>&); +#else + template +_GLIBCXX20_CONSTEXPR +decltype(std::__niter_base(std::declval<_Ite>())) +__niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq, +std::random_access_iterator_tag>&) +noexcept( noexcept(std::is_nothrow_copy_constructible< + decltype(std::__niter_base(std::declval<_Ite>()))>::value) ); +#endif + // Reverse the __niter_base transformation to get a // __normal_iterator back again (this assumes that __normal_iterator // is only used to wrap random access iterators, like pointers). 
diff --git a/libstdc++-v3/include/debug/safe_iterator.tcc b/libstdc++-v3/include/debug/safe_iterator.tcc index 6eb70cbda04..d6cfe24cc83 100644 --- a/libstdc++-v3/include/debug/safe_iterator.tcc +++ b/libstdc++-v3/include/debug/safe_iterator.tcc @@ -235,13 +235,29 @@ namespace std _GLIBCXX_VISIBILITY(default) { _GLIBCXX_BEGIN_NAMESPACE_VERSION +#if __cplusplus < 201103L template -_GLIBCXX20_CONSTEXPR _Ite __niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq, std::random_access_iterator_tag>& __it) { return __it.base(); } + template +_Ite +__niter_base(const ::__gnu_debug::_Safe_iterator< +::__gnu_cxx::__normal_iterator<_Ite, _Cont>, _DbgSeq, +std::random_access_iterator_tag>& __it) +{ return __it.base().base(); } +#else + template +_GLIBCXX20_CONSTEXPR +auto +__niter_base(const ::__gnu_debug::_Safe_iterator<_Ite, _Seq, +std::random_access_iterator_tag>& __it) +-> decltype(std::__niter_base(declval<_Ite>())) +{ return std::__niter_base(__it.base()); } +#endif + template _GLIBCXX20_CONSTEXPR
RE: [libatomic PATCH] PR other/113336: Fix libatomic testsuite regressions on ARM.
> -Original Message- > From: Victor Do Nascimento > Sent: Wednesday, February 14, 2024 5:06 PM > To: Roger Sayle ; gcc-patches@gcc.gnu.org; > Richard Earnshaw > Subject: Re: [libatomic PATCH] PR other/113336: Fix libatomic testsuite > regressions on ARM. > > Though I'm not in a position to approve the patch, I'm happy to confirm > the proposed changes look good to me. > > Thanks for the updated version, > Victor > This is ok from me too. Thanks Victor for helping with the review. Kyrill > > On 1/28/24 16:24, Roger Sayle wrote: > > > > This patch is a revised version of the fix for PR other/113336. > > > > This patch has been tested on arm-linux-gnueabihf with --with-arch=armv6 > > with make bootstrap and make -k check where it fixes all of the FAILs in > > libatomic. Ok for mainline? > > > > > > 2024-01-28 Roger Sayle > > Victor Do Nascimento > > > > libatomic/ChangeLog > > PR other/113336 > > * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX > > * Makefile.in: Regenerate. > > > > Thanks in advance. > > Roger > > -- > >
Re: [PATCH v2] x86: Support x32 and IBT in heap trampoline
On Tue, Feb 13, 2024 at 8:46 AM Jakub Jelinek wrote: > > On Tue, Feb 13, 2024 at 08:40:52AM -0800, H.J. Lu wrote: > > Add x32 and IBT support to x86 heap trampoline implementation with a > > testcase. > > > > 2024-02-13 Jakub Jelinek > > H.J. Lu > > > > libgcc/ > > > > PR target/113855 > > * config/i386/heap-trampoline.c (trampoline_insns): Add IBT > > support and pad to the multiple of 4 bytes. Use movabsq > > instead of movabs in comments. Add -mx32 variant. > > > > gcc/testsuite/ > > > > PR target/113855 > > * gcc.dg/heap-trampoline-1.c: New test. > > * lib/target-supports.exp (check_effective_target_heap_trampoline): > > New. > > LGTM, but please give Iain a day or two to chime in. > > Jakub > I am checking it in today. -- H.J.
[COMMITTED] aarch64/testsuite: Remove dg-excess-errors from c-c++-common/gomp/pr63328.c and gcc.dg/gomp/pr87895-2.c [PR113861]
These now pass after r14-6416-gf5fc001a84a7db so let's remove the dg-excess-errors from them. Committed as obvious after a test for aarch64-linux-gnu. gcc/testsuite/ChangeLog: PR testsuite/113861 * c-c++-common/gomp/pr63328.c: Remove dg-excess-errors. * gcc.dg/gomp/pr87895-2.c: Likewise. Signed-off-by: Andrew Pinski --- gcc/testsuite/c-c++-common/gomp/pr63328.c | 2 -- gcc/testsuite/gcc.dg/gomp/pr87895-2.c | 1 - 2 files changed, 3 deletions(-) diff --git a/gcc/testsuite/c-c++-common/gomp/pr63328.c b/gcc/testsuite/c-c++-common/gomp/pr63328.c index 54efacea49a..3958abe166b 100644 --- a/gcc/testsuite/c-c++-common/gomp/pr63328.c +++ b/gcc/testsuite/c-c++-common/gomp/pr63328.c @@ -3,5 +3,3 @@ /* { dg-options "-O2 -fopenmp-simd -fno-strict-aliasing -fcompare-debug" } */ #include "pr60823-3.c" -/* { dg-excess-errors "partial simd clone support" { target { aarch64*-*-* } } } */ - diff --git a/gcc/testsuite/gcc.dg/gomp/pr87895-2.c b/gcc/testsuite/gcc.dg/gomp/pr87895-2.c index 26827ac8264..3d27715428e 100644 --- a/gcc/testsuite/gcc.dg/gomp/pr87895-2.c +++ b/gcc/testsuite/gcc.dg/gomp/pr87895-2.c @@ -3,4 +3,3 @@ /* { dg-additional-options "-O1" } */ #include "pr87895-1.c" -/* { dg-excess-errors "partial simd clone support" { target { aarch64*-*-* } } } */ -- 2.43.0
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
Hello Sam: On 14/02/24 10:50 pm, Sam James wrote: > > Ajit Agarwal writes: > >> Hello Richard: >> >> >> On 14/02/24 4:03 pm, Richard Sandiford wrote: >>> Hi, >>> >>> Thanks for working on this. >>> >>> You posted a version of this patch on Sunday too. If you need to repost >>> to fix bugs or make other improvements, could you describe the changes >>> that you've made since the previous version? It makes things easier >>> to follow. >> >> Sure. Sorry for that I forgot to add that. >> >>> >>> Also, sorry for starting with a meta discussion about reviews, but >>> there are multiple types of review comment, including: >>> >>> (1) Suggestions for changes that are worded as suggestions. >>> >>> (2) Suggestions for changes that are worded as questions ("Wouldn't it be >>> better to do X?", etc). >>> >>> (3) Questions asking for an explanation or for more information. >>> >>> Just sending a new patch makes sense when the previous review comments >>> were all like (1), and arguably also (1)+(2). But Alex's previous review >>> included (3) as well. Could you go back and respond to his questions there? >>> It would help understand some of the design choices. >>> >> >> I have responded to Alex comments for the previous patches. >> I have incorporated all of his comments in this patch. >> >> >>> A natural starting point when reviewing a patch like this is to diff >>> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc. This shows >>> many of the kind of changes that I'd expect. But it also seems to include >>> some code reordering, such as putting fuse_pair after try_fuse_pair. >>> If some reordering is necessary, could you try to organise the patch as >>> a series in which the reordering is a separate step? It's a bit hard >>> to review at the moment. (Reordering for cosmetic reasons is also OK, >>> but again please separate it out for ease of review.) 
>>> >>> Maybe one way of making the review easier would be to split the aarch64 >>> pass into the "target-dependent" and "target-independent" pieces >>> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then >>> (as separate patches) move the target-independent pieces outside >>> config/aarch64. >>> >> Sure I will do that. >> >>> The patch includes: >>> * emit-rtl.cc: Modify ge with gt on PolyINT data structure. * dce.cc: Add changes not to delete the load store pair. * rtl-ssa/changes.cc: Modified assert code. * var-tracking.cc: Modified assert code. * df-problems.cc: Not to generate REG_UNUSED for multi word registers that is requied for rs6000 target. >>> >>> Please submit these separately, as independent preparatory patches, >>> with an explanation for why they're needed & correct. But: >>> >> Sure I will do that. >> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc index 88ee0dd67fc..a8d0ee7c4db 100644 --- a/gcc/df-problems.cc +++ b/gcc/df-problems.cc @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct df_mw_hardreg *mws, if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) { unsigned int regno = mws->start_regno; - df_set_note (REG_UNUSED, insn, mws->mw_reg); + //df_set_note (REG_UNUSED, insn, mws->mw_reg); dead_debug_insert_temp (debug, regno, insn, DEBUG_TEMP_AFTER_WITH_REG); if (REG_DEAD_DEBUGGING) @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct df_mw_hardreg *mws, if (!bitmap_bit_p (live, r) && !bitmap_bit_p (artificial_uses, r)) { - df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); > > I just want to emphasise here: > a) adding out commented code is very unusual (I know a reviewer picked > up on that already); > > b) if you are going to comment something out as a hack / you need help, > please *clearly flag that* (apologies if I missed it), and possibly add > a comment above it saying "// TODO: Need to figure out " or similar, > 
otherwise it just looks like it was forgotten about. > > In this case, your question about how to handle REG_UNUSED should've > been made clear in a summary at the top where you mention the > outstanding items. Again, sorry if I missed it. > The question is not about how to handle REG_UNUSED; I am afraid that is not what I meant. I wanted to convey the following. The REG_UNUSED notes generated by ira with the above code are used by cprop_hardreg to remove the code with REG_UNUSED notes. We can modify these passes to handle REG_UNUSED differently, or not generate REG_UNUSED for the multi-word case as above. What we do is as follows: We generate sequential registers using (subreg V16QI (reg OOmode) 16) and (reg OOmode 0) where OOmode is 256 bit and V16QI is 128 bits in order to generate a sequential register pair. If I keep the above REG_UNUSED notes ira
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
Ajit Agarwal writes: >>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc >>> index 88ee0dd67fc..a8d0ee7c4db 100644 >>> --- a/gcc/df-problems.cc >>> +++ b/gcc/df-problems.cc >>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >>> df_mw_hardreg *mws, >>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) >>> { >>>unsigned int regno = mws->start_regno; >>> - df_set_note (REG_UNUSED, insn, mws->mw_reg); >>> + //df_set_note (REG_UNUSED, insn, mws->mw_reg); >>>dead_debug_insert_temp (debug, regno, insn, >>> DEBUG_TEMP_AFTER_WITH_REG); >>> >>>if (REG_DEAD_DEBUGGING) >>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >>> df_mw_hardreg *mws, >>> if (!bitmap_bit_p (live, r) >>> && !bitmap_bit_p (artificial_uses, r)) >>> { >>> - df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >>> + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >>> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG); >>> if (REG_DEAD_DEBUGGING) >>> df_print_note ("adding 2: ", insn, REG_NOTES (insn)); >>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def, >>> || bitmap_bit_p (artificial_uses, dregno) >>> || df_ignore_stack_reg (dregno))) >>> { >>> - rtx reg = (DF_REF_LOC (def)) >>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >>> - df_set_note (REG_UNUSED, insn, reg); >>> + //rtx reg = (DF_REF_LOC (def)) >>> + // ? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >>> + //df_set_note (REG_UNUSED, insn, reg); >>>dead_debug_insert_temp (debug, dregno, insn, >>> DEBUG_TEMP_AFTER_WITH_REG); >>>if (REG_DEAD_DEBUGGING) >>> df_print_note ("adding 3: ", insn, REG_NOTES (insn)); >> >> I don't think this can be right. The last hunk of the var-tracking.cc >> patch also seems to be reverting a correct change. >> > > We generate sequential registers using (subreg V16QI (reg 00mode) 16) > and (reg OOmode 0) > where OOmode is 256 bit and V16QI is 128 bits in order to generate > sequential register pair. OK. 
As I mentioned in the other message I just sent, it seems pretty standard to use a 256-bit mode to represent a pair of 128-bit values. In that case: - (reg:OO R) always refers to both registers in the pair, and any assignment to it modifies both registers in the pair - (subreg:V16QI (reg:OO R) 0) refers to the first register only, and can be modified independently of the second register - (subreg:V16QI (reg:OO R) 16) refers to the second register only, and can be modified independently of the first register Is that how you're using it? One thing to be wary of is that it isn't possible to assign to two subregs of the same reg in a single instruction (at least AFAIK). So any operation that wants to store to both registers in the pair must store to (reg:OO R) itself, not to the two subregs. > If I keep the above REG_UNUSED notes ira generates > REG_UNUSED and in cprop_harreg pass and dce pass deletes store pairs and > we get incorrect code. > > By commenting REG_UNUSED notes it is not generated and we get the correct > store > pair fusion and cprop_hardreg and dce doesn't deletes them. > > Please let me know is there are better ways to address this instead of > commenting > above generation of REG_UNUSED notes. Could you quote an example rtl sequence that includes incorrect notes? It might help to understand the problem a bit more. Thanks, Richard
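For illustration, the three forms described above would look roughly like this in RTL (a hypothetical sketch using pseudo register 100, not taken from an actual dump):

```
;; Writes both 128-bit registers of the pair.
(set (reg:OO 100) (...))

;; Writes only the first register; the second is untouched.
(set (subreg:V16QI (reg:OO 100) 0) (...))

;; Writes only the second register; the first is untouched.
(set (subreg:V16QI (reg:OO 100) 16) (...))
```

Per the caveat above, an instruction that produces both halves must use the first form; it cannot set both subregs in one parallel.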
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
Ajit Agarwal writes: > Hello Richard: > > > On 14/02/24 4:03 pm, Richard Sandiford wrote: >> Hi, >> >> Thanks for working on this. >> >> You posted a version of this patch on Sunday too. If you need to repost >> to fix bugs or make other improvements, could you describe the changes >> that you've made since the previous version? It makes things easier >> to follow. > > Sure. Sorry for that I forgot to add that. > >> >> Also, sorry for starting with a meta discussion about reviews, but >> there are multiple types of review comment, including: >> >> (1) Suggestions for changes that are worded as suggestions. >> >> (2) Suggestions for changes that are worded as questions ("Wouldn't it be >> better to do X?", etc). >> >> (3) Questions asking for an explanation or for more information. >> >> Just sending a new patch makes sense when the previous review comments >> were all like (1), and arguably also (1)+(2). But Alex's previous review >> included (3) as well. Could you go back and respond to his questions there? >> It would help understand some of the design choices. >> > > I have responded to Alex comments for the previous patches. > I have incorporated all of his comments in this patch. > > >> A natural starting point when reviewing a patch like this is to diff >> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc. This shows >> many of the kind of changes that I'd expect. But it also seems to include >> some code reordering, such as putting fuse_pair after try_fuse_pair. >> If some reordering is necessary, could you try to organise the patch as >> a series in which the reordering is a separate step? It's a bit hard >> to review at the moment. (Reordering for cosmetic reasons is also OK, >> but again please separate it out for ease of review.) >> >> Maybe one way of making the review easier would be to split the aarch64 >> pass into the "target-dependent" and "target-independent" pieces >> in-place, i.e. 
keeping everything within aarch64-ldp-fusion.cc, and then >> (as separate patches) move the target-independent pieces outside >> config/aarch64. >> > Sure I will do that. > >> The patch includes: >> >>> * emit-rtl.cc: Modify ge with gt on PolyINT data structure. >>> * dce.cc: Add changes not to delete the load store pair. >>> * rtl-ssa/changes.cc: Modified assert code. >>> * var-tracking.cc: Modified assert code. >>> * df-problems.cc: Not to generate REG_UNUSED for multi >>> word registers that is requied for rs6000 target. >> >> Please submit these separately, as independent preparatory patches, >> with an explanation for why they're needed & correct. But: >> > Sure I will do that. > >>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc >>> index 88ee0dd67fc..a8d0ee7c4db 100644 >>> --- a/gcc/df-problems.cc >>> +++ b/gcc/df-problems.cc >>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >>> df_mw_hardreg *mws, >>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) >>> { >>>unsigned int regno = mws->start_regno; >>> - df_set_note (REG_UNUSED, insn, mws->mw_reg); >>> + //df_set_note (REG_UNUSED, insn, mws->mw_reg); >>>dead_debug_insert_temp (debug, regno, insn, >>> DEBUG_TEMP_AFTER_WITH_REG); >>> >>>if (REG_DEAD_DEBUGGING) >>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >>> df_mw_hardreg *mws, >>> if (!bitmap_bit_p (live, r) >>> && !bitmap_bit_p (artificial_uses, r)) >>> { >>> - df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >>> + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); I just want to emphasise here: a) adding out commented code is very unusual (I know a reviewer picked up on that already); b) if you are going to comment something out as a hack / you need help, please *clearly flag that* (apologies if I missed it), and possibly add a comment above it saying "// TODO: Need to figure out " or similar, otherwise it just looks like it was forgotten about. 
In this case, your question about how to handle REG_UNUSED should've been made clear in a summary at the top where you mention the outstanding items. Again, sorry if I missed it. >>> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG); >>> if (REG_DEAD_DEBUGGING) >>> df_print_note ("adding 2: ", insn, REG_NOTES (insn)); >>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def, >>> || bitmap_bit_p (artificial_uses, dregno) >>> || df_ignore_stack_reg (dregno))) >>> { >>> - rtx reg = (DF_REF_LOC (def)) >>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >>> - df_set_note (REG_UNUSED, insn, reg); >>> + //rtx reg = (DF_REF_LOC (def)) >>> + // ? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >>> + //df_set_note (REG_UNUSED, insn, reg); >>>dead_debug_insert_temp (debug, dregno, insn, >>> DEBUG_TEMP_AFTER_WITH_REG); >>>
Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).
Ajit Agarwal writes: >>> diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc >>> index 1856fa4884f..ffc47a6eaa0 100644 >>> --- a/gcc/emit-rtl.cc >>> +++ b/gcc/emit-rtl.cc >>> @@ -921,7 +921,7 @@ validate_subreg (machine_mode omode, machine_mode imode, >>> return false; >>> >>>/* The subreg offset cannot be outside the inner object. */ >>> - if (maybe_ge (offset, isize)) >>> + if (maybe_gt (offset, isize)) >>> return false; >> >> Can you explain why this change is needed? >> > > This is required in rs6000 target where we generate the subreg > with offset 16 from OO mode (256 bit) to 128 bit vector modes. > Otherwise it segfaults. Could you go into more detail? Why does that subreg lead to a segfault? In itself, a 16-byte subreg at byte offset 16 into a 32-byte pair is pretty standard. AArch64 uses this too for its vector load/store pairs (and for structure pairs more generally). Thanks, Richard
Re: [libatomic PATCH] PR other/113336: Fix libatomic testsuite regressions on ARM.
Though I'm not in a position to approve the patch, I'm happy to confirm the proposed changes look good to me. Thanks for the updated version, Victor On 1/28/24 16:24, Roger Sayle wrote: This patch is a revised version of the fix for PR other/113336. This patch has been tested on arm-linux-gnueabihf with --with-arch=armv6 with make bootstrap and make -k check where it fixes all of the FAILs in libatomic. Ok for mainline? 2024-01-28 Roger Sayle Victor Do Nascimento libatomic/ChangeLog PR other/113336 * Makefile.am: Build tas_1_2_.o on ARCH_ARM_LINUX * Makefile.in: Regenerate. Thanks in advance. Roger --
Fix ICE in loop splitting
Hi,
as demonstrated in the testcase, I forgot to check that the profile is
present in tree-ssa-loop-split.

Bootstrapped and regtested x86_64-linux, committed.

	PR tree-optimization/111054

gcc/ChangeLog:

	* tree-ssa-loop-split.cc (split_loop): Check for profile being
	present.

gcc/testsuite/ChangeLog:

	* gcc.c-torture/compile/pr111054.c: New test.

diff --git a/gcc/testsuite/gcc.c-torture/compile/pr111054.c b/gcc/testsuite/gcc.c-torture/compile/pr111054.c
new file mode 100644
index 000..3c0d6e816b9
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/compile/pr111054.c
@@ -0,0 +1,11 @@
+/* { dg-additional-options "-fno-guess-branch-probability" } */
+void *p, *q;
+int i, j;
+
+void
+foo (void)
+{
+  for (i = 0; i < 20; i++)
+    if (i < j)
+      p = q;
+}
diff --git a/gcc/tree-ssa-loop-split.cc b/gcc/tree-ssa-loop-split.cc
index 04215fe7937..c0bb1b71d17 100644
--- a/gcc/tree-ssa-loop-split.cc
+++ b/gcc/tree-ssa-loop-split.cc
@@ -712,7 +712,8 @@ split_loop (class loop *loop1)
 		 ? true_edge->probability.to_sreal () : (sreal)1;
   sreal scale2 = false_edge->probability.reliable_p ()
 		 ? false_edge->probability.to_sreal () : (sreal)1;
-  sreal div1 = loop1_prob.to_sreal ();
+  sreal div1 = loop1_prob.initialized_p ()
+	       ? loop1_prob.to_sreal () : (sreal)1/(sreal)2;
   /* +1 to get header iterations rather than latch iterations and then -1
      to convert back.  */
   if (div1 != 0)
Re: [PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen
On Wed, Feb 14, 2024 at 05:09:39PM +0100, Richard Biener wrote: > > > > Am 14.02.2024 um 16:22 schrieb Jakub Jelinek : > > > > On Wed, Feb 14, 2024 at 04:13:51PM +0100, Richard Biener wrote: > >> The following removes the TBAA violation present in iterative_hash. > >> As we eventually LTO that it's important to fix. This also improves > >> code generation for the >= 12 bytes loop by using | to compose the > >> 4 byte words as at least GCC 7 and up can recognize that pattern > >> and perform a 4 byte load while the variant with a + is not > >> recognized (not on trunk either), I think we have an enhancement bug > >> for this somewhere. > >> > >> Given we reliably merge and the bogus "optimized" path might be > >> only relevant for archs that cannot do misaligned loads efficiently > >> I've chosen to keep a specialization for aligned accesses. > >> > >> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk? > >> > >> Thanks, > >> Richard. > >> > >> libiberty/ > >>* hashtab.c (iterative_hash): Remove TBAA violating handling > >>of aligned little-endian case in favor of just keeping the > >>aligned case special-cased. Use | for composing a larger word. > > > > Have you tried using memcpy into a hashval_t temporary? > > Just wonder whether you get better or worse code with that compared to > > the shifts. > > I didn’t but I verified I get a single movd on x84-64 when using | instead of > + with GCC 7 and trunk. Ok then. Jakub
Re: [PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen
> Am 14.02.2024 um 16:22 schrieb Jakub Jelinek : > > On Wed, Feb 14, 2024 at 04:13:51PM +0100, Richard Biener wrote: >> The following removes the TBAA violation present in iterative_hash. >> As we eventually LTO that it's important to fix. This also improves >> code generation for the >= 12 bytes loop by using | to compose the >> 4 byte words as at least GCC 7 and up can recognize that pattern >> and perform a 4 byte load while the variant with a + is not >> recognized (not on trunk either), I think we have an enhancement bug >> for this somewhere. >> >> Given we reliably merge and the bogus "optimized" path might be >> only relevant for archs that cannot do misaligned loads efficiently >> I've chosen to keep a specialization for aligned accesses. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk? >> >> Thanks, >> Richard. >> >> libiberty/ >>* hashtab.c (iterative_hash): Remove TBAA violating handling >>of aligned little-endian case in favor of just keeping the >>aligned case special-cased. Use | for composing a larger word. > > Have you tried using memcpy into a hashval_t temporary? > Just wonder whether you get better or worse code with that compared to > the shifts. I didn’t but I verified I get a single movd on x84-64 when using | instead of + with GCC 7 and trunk. Richard >Jakub >
Re: [PATCH] coreutils-sum-pr108666.c: fix spurious LLP64 warnings
On 2/14/24 13:55, David Malcolm wrote: On Fri, 2024-02-02 at 23:55 +, Jonathan Yong wrote: Attached patch OK? Fixes the following warnings: Thanks; looks good to me. Dave Thanks, pushed to master branch.
Re: [PATCH] middle-end/113576 - avoid out-of-bound vector element access
Richard Biener writes: > On Wed, 14 Feb 2024, Richard Sandiford wrote: > >> Richard Biener writes: >> > The following avoids accessing out-of-bound vector elements when >> > native encoding a boolean vector with sub-BITS_PER_UNIT precision >> > elements. The error was basing the number of elements to extract >> > on the rounded up total byte size involved and the patch bases >> > everything on the total number of elements to extract instead. >> >> It's too long ago to be certain, but I think this was a deliberate choice. >> The point of the new vector constant encoding is that it can give an >> allegedly sensible value for any given index, even out-of-range ones. >> >> Since the padding bits are undefined, we should in principle have a free >> choice of what to use. And for VLA, it's often better to continue the >> existing pattern rather than force to zero. >> >> I don't strongly object to changing it. I think we should be careful >> about relying on zeroing for correctness though. The bits are in principle >> undefined and we can't rely on reading zeros from equivalent memory or >> register values. > > The main motivation for a change here is to allow catching out-of-bound > indices again for VECTOR_CST_ELT, at least for constant nunits because > it might be a programming error like fat-fingering the index. I do > think it's a regression that we no longer catch those. > > It's probably also a bit non-obvious how an encoding continues and > there might be DImode masks that can be represented by a > zero-extended QImode immediate but "continued" it would require > a larger immediate. > > The change also effectively only changes something for 1 byte > encodings since nunits is a power of two and so is the element > size in bits. Yeah, but even there, there's an argument that all-1s (0xff) is a more obvious value for an all-1s mask. 
> A patch restoring the VECTOR_CST_ELT checking might be the
> following
>
> diff --git a/gcc/tree.cc b/gcc/tree.cc
> index 046a558d1b0..4c9b05167fd 100644
> --- a/gcc/tree.cc
> +++ b/gcc/tree.cc
> @@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i)
>    if (i < encoded_nelts)
>      return VECTOR_CST_ENCODED_ELT (t, i);
>
> +  /* Catch out-of-bound element accesses. */
> +  gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i));
> +
>    /* If there are no steps, the final encoded value is the right one. */
>    if (!VECTOR_CST_STEPPED_P (t))
>      {
>
> but it triggers quite a bit via const_binop for, for example
>
> #2  0x011c1506 in const_binop (code=PLUS_EXPR, arg1=, arg2=)
> (gdb) p debug_generic_expr (arg1)
> { 12, 13, 14, 15 }
> $5 = void
> (gdb) p debug_generic_expr (arg2)
> { -2, -2, -2, -3 }
> (gdb) p count
> $4 = 6
> (gdb) l
> 1711	      if (!elts.new_binary_operation (type, arg1, arg2, step_ok_p))
> 1712	        return NULL_TREE;
> 1713	      unsigned int count = elts.encoded_nelts ();
> 1714	      for (unsigned int i = 0; i < count; ++i)
> 1715	        {
> 1716	          tree elem1 = VECTOR_CST_ELT (arg1, i);
> 1717	          tree elem2 = VECTOR_CST_ELT (arg2, i);
> 1718
> 1719	          tree elt = const_binop (code, elem1, elem2);
>
> this seems like an error to me - why would we, for fixed-size
> vectors and for PLUS ever create a vector encoding with 6 elements?!
> That seems at least inefficient to me?

It's a case of picking your poison. On the other side, operating
individually on each element of a V64QI is inefficient when the
representation says up-front that all elements are equal.

Fundamentally, operations on VLA vectors are treated as functions that
map patterns to patterns. The number of elements that are consumed
isn't really relevant to the function itself. The VLA folders
therefore rely on being able to read an element from a pattern even if
the index is outside TREE_VECTOR_SUBPARTS.

There were two reasons for using VLA paths for VLS vectors.
One I mentioned above: it saves time when all elements are equal, or have a similarly compact representation. The other is that it makes VLA less special and ensures that the code gets more testing. Maybe one compromise between that and the assert would be: (1) enforce the assert only for VLS and (2) add new checks to ensure that a VLA-friendly operation will never read out-of-bounds for VLS vectors But I think this would be awkward. E.g. we now build reversed vectors as a 3-element series N-1, N-2, N-3. It would be nice not to have to special-case N==2 by suppressing N-3. And the condition for (2) might not always be obvious. Another option would be to have special accessors that are allowed to read out of bounds, and add the assert (for both VLA & VLS) to VECTOR_CST_ELT. It might take a while to root out all the places that need to change though. Thanks, Richard
Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
On 14/02/2024 13:43, Richard Biener wrote: On Wed, 14 Feb 2024, Andrew Stubbs wrote: On 14/02/2024 13:27, Richard Biener wrote: On Wed, 14 Feb 2024, Andrew Stubbs wrote: On 13/02/2024 08:26, Richard Biener wrote: On Mon, 12 Feb 2024, Thomas Schwinge wrote: Hi! On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote: I've committed this patch ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691 "amdgcn: add -march=gfx1030 EXPERIMENTAL". The RDNA2 ISA variant doesn't support certain instructions previous implemented in GCC/GCN, so a number of patterns etc. had to be disabled: [...] Vector reductions will need to be reworked for RDNA2. [...] * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2. (addc3): Add RDNA2 syntax variant. (subc3): Likewise. (2_exec): Add RDNA2 alternatives. (vec_cmpdi): Likewise. (vec_cmpdi): Likewise. (vec_cmpdi_exec): Likewise. (vec_cmpdi_exec): Likewise. (vec_cmpdi_dup): Likewise. (vec_cmpdi_dup_exec): Likewise. (reduc__scal_): Disable for RDNA2. (*_dpp_shr_): Likewise. (*plus_carry_dpp_shr_): Likewise. (*plus_carry_in_dpp_shr_): Likewise. Etc. The expectation being that GCC middle end copes with this, and synthesizes some less ideal yet still functional vector code, I presume. The later RDNA3/gfx1100 support builds on top of this, and that's what I'm currently working on getting proper GCC/GCN target (not offloading) results for. I'm seeing a good number of execution test FAILs (regressions compared to my earlier non-gfx1100 testing), and I've now tracked down where one large class of those comes into existance -- not yet how to resolve, unfortunately. But maybe, with you guys' combined vectorizer and back end experience, the latter will be done quickly? Richard, I don't know if you've ever run actual GCC/GCN target (not offloading) testing; let me know if you have any questions about that. I've only done offload testing - in the x86_64 build tree run check-target-libgomp. 
If you can tell me how to do GCN target testing (maybe document it on the wiki even!) I can try do that as well. Given that (at least largely?) the same patterns etc. are disabled as in my gfx1100 configuration, I suppose your gfx1030 one would exhibit the same issues. You can build GCC/GCN target like you build the offloading one, just remove '--enable-as-accelerator-for=[...]'. Likely, you can even use a offloading GCC/GCN build to reproduce the issue below. One example is the attached 'builtin-bitops-1.c', reduced from 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is miscompiled as soon as '-ftree-vectorize' is effective: $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for '-march=gfx90a' vs. '-march=gfx1100', we see: +builtin-bitops-1.c:7:17: missed: reduc op not supported by target. ..., and therefore: -builtin-bitops-1.c:7:17: note: Reduce using direct vector reduction. +builtin-bitops-1.c:7:17: note: Reduce using vector shifts +builtin-bitops-1.c:7:17: note: extract scalar result That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a chain of summation of 'VEC_PERM_EXPR's. However, there's wrong code generated: $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out i=1, ints[i]=0x1 a=1, b=2 i=2, ints[i]=0x8000 a=1, b=2 i=3, ints[i]=0x2 a=1, b=2 i=4, ints[i]=0x4000 a=1, b=2 i=5, ints[i]=0x1 a=1, b=2 i=6, ints[i]=0x8000 a=1, b=2 i=7, ints[i]=0xa5a5a5a5 a=16, b=32 i=8, ints[i]=0x5a5a5a5a a=16, b=32 i=9, ints[i]=0xcafe a=11, b=22 i=10, ints[i]=0xcafe00 a=11, b=22 i=11, ints[i]=0xcafe a=11, b=22 i=12, ints[i]=0x a=32, b=64 (I can't tell if the 'b = 2 * a' pattern is purely coincidental?) 
I don't speak enough "vectorization" to fully understand the generic vectorized algorithm and its implementation. It appears that the "Reduce using vector shifts" code has been around for a very long time, but also has gone through a number of changes. I can't tell which GCC targets/configurations it's actually used for (in the same way as for GCN gfx1100), and thus whether there's an issue in that vectorizer code, or rather in the GCN back end, or GCN back end parameterizing the generic code? The "shift" reduction is basically doing reduction by repeatedly adding the upper to the lower half of the vector (each time halving the vector size). Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code: int my_popcount (unsigned int x) { int stmp__12.12; vector(64) int vect__12.11; vector(64) unsig
Re: [PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen
On Wed, Feb 14, 2024 at 04:13:51PM +0100, Richard Biener wrote: > The following removes the TBAA violation present in iterative_hash. > As we eventually LTO that it's important to fix. This also improves > code generation for the >= 12 bytes loop by using | to compose the > 4 byte words as at least GCC 7 and up can recognize that pattern > and perform a 4 byte load while the variant with a + is not > recognized (not on trunk either), I think we have an enhancement bug > for this somewhere. > > Given we reliably merge and the bogus "optimized" path might be > only relevant for archs that cannot do misaligned loads efficiently > I've chosen to keep a specialization for aligned accesses. > > Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk? > > Thanks, > Richard. > > libiberty/ > * hashtab.c (iterative_hash): Remove TBAA violating handling > of aligned little-endian case in favor of just keeping the > aligned case special-cased. Use | for composing a larger word. Have you tried using memcpy into a hashval_t temporary? Just wonder whether you get better or worse code with that compared to the shifts. Jakub
Re: [PATCH]middle-end: inspect all exits for additional annotations for loop.
> Am 14.02.2024 um 16:16 schrieb Tamar Christina : > > >> >> >> I think this isn't entirely good. For simple cases for >> do {} while the condition ends up in the latch while for while () {} >> loops it ends up in the header. In your case the latch isn't empty >> so it doesn't end up with the conditional. >> >> I think your patch is OK to the point of looking at all loop exit >> sources but you should elide the special-casing of header and >> latch since it's really only exit conditionals that matter. >> > > That makes sense, since in both cases the edges are in the respective > blocks. Should have thought about it more. > > So how about this one. > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > Ok for master? Ok Richard > Thanks, > Tamar > > gcc/ChangeLog: > >* tree-cfg.cc (replace_loop_annotate): Inspect loop edges for annotations. > > gcc/testsuite/ChangeLog: > >* gcc.dg/vect/vect-novect_gcond.c: New test. > > --- inline copy of patch --- > > diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c > b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c > new file mode 100644 > index > ..01e69cbef9d51b234c08a400c78dc078d53252f1 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c > @@ -0,0 +1,39 @@ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break_hw } */ > +/* { dg-require-effective-target vect_int } */ > +/* { dg-additional-options "-O3" } */ > + > +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ > + > +#include "tree-vect.h" > + > +#define N 306 > +#define NEEDLE 136 > + > +int table[N]; > + > +__attribute__ ((noipa)) > +int foo (int i, unsigned short parse_tables_n) > +{ > + parse_tables_n >>= 9; > + parse_tables_n += 11; > +#pragma GCC novector > + while (i < N && parse_tables_n--) > +table[i++] = 0; > + > + return table[NEEDLE]; > +} > + > +int main () > +{ > + check_vect (); > + > +#pragma GCC novector > + for (int j = 0; j < N; j++) > +table[j] = -1; > 
+ > + if (foo (0, 0x) != 0) > +__builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc > index > cdd439fe7506e7bc33654ffa027b493f23d278ac..bdffc3b4ed277724e81b7dd67fe7966e8ece0c13 > 100644 > --- a/gcc/tree-cfg.cc > +++ b/gcc/tree-cfg.cc > @@ -320,12 +320,9 @@ replace_loop_annotate (void) > > for (auto loop : loops_list (cfun, 0)) > { > - /* First look into the header. */ > - replace_loop_annotate_in_block (loop->header, loop); > - > - /* Then look into the latch, if any. */ > - if (loop->latch) > -replace_loop_annotate_in_block (loop->latch, loop); > + /* Check all exit source blocks for annotations. */ > + for (auto e : get_loop_exit_edges (loop)) > +replace_loop_annotate_in_block (e->src, loop); > > /* Push the global flag_finite_loops state down to individual loops. */ > loop->finite_p = flag_finite_loops; >
[PATCH]AArch64: remove ls64 from being mandatory on armv8.7-a..
Hi All, The Arm Architectural Reference Manual (Version J.a, section A2.9 on FEAT_LS64) shows that ls64 is an optional extensions and should not be enabled by default for Armv8.7-a. This drops it from the mandatory bits for the architecture and brings GCC inline with LLVM and the achitecture. Note that we will not be changing binutils to preserve compatibility with older released compilers. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? and backport to GCC 13,12,11? Thanks, Tamar gcc/ChangeLog: * config/aarch64/aarch64-arches.def (AARCH64_ARCH): Remove LS64 from Armv8.7-a. gcc/testsuite/ChangeLog: * g++.target/aarch64/acle/ls64.C: Add +ls64. * gcc.target/aarch64/acle/pr110100.c: Likewise. * gcc.target/aarch64/acle/pr110132.c: Likewise. * gcc.target/aarch64/options_set_28.c: Drop check for nols64. * gcc.target/aarch64/pragma_cpp_predefs_2.c: Correct header checks. --- inline copy of patch -- diff --git a/gcc/config/aarch64/aarch64-arches.def b/gcc/config/aarch64/aarch64-arches.def index b7115ff7c3d4a7ee7abbedcb091ef15a7efacc79..9bec30e9203bac01155281ef3474846c402bb29e 100644 --- a/gcc/config/aarch64/aarch64-arches.def +++ b/gcc/config/aarch64/aarch64-arches.def @@ -37,7 +37,7 @@ AARCH64_ARCH("armv8.3-a", generic_armv8_a, V8_3A, 8, (V8_2A, PAUTH, R AARCH64_ARCH("armv8.4-a", generic_armv8_a, V8_4A, 8, (V8_3A, F16FML, DOTPROD, FLAGM)) AARCH64_ARCH("armv8.5-a", generic_armv8_a, V8_5A, 8, (V8_4A, SB, SSBS, PREDRES)) AARCH64_ARCH("armv8.6-a", generic_armv8_a, V8_6A, 8, (V8_5A, I8MM, BF16)) -AARCH64_ARCH("armv8.7-a", generic_armv8_a, V8_7A, 8, (V8_6A, LS64)) +AARCH64_ARCH("armv8.7-a", generic_armv8_a, V8_7A, 8, (V8_6A)) AARCH64_ARCH("armv8.8-a", generic_armv8_a, V8_8A, 8, (V8_7A, MOPS)) AARCH64_ARCH("armv8.9-a", generic_armv8_a, V8_9A, 8, (V8_8A)) AARCH64_ARCH("armv8-r", generic_armv8_a, V8R , 8, (V8_4A)) diff --git a/gcc/testsuite/g++.target/aarch64/acle/ls64.C b/gcc/testsuite/g++.target/aarch64/acle/ls64.C index 
d9002785b578741bde1202761f0881dc3d47e608..dcfe6f1af6711a7f3ec2562f6aabf56baecf417d 100644 --- a/gcc/testsuite/g++.target/aarch64/acle/ls64.C +++ b/gcc/testsuite/g++.target/aarch64/acle/ls64.C @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv8.7-a" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64" } */ #include int main() { diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c index f56d5e619e8ac23cdf720574bd6ee08fbfd36423..62a82b97c56debad092cc8fd1ed48f0219109cd7 100644 --- a/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c +++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110100.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-march=armv8.7-a -O2" } */ +/* { dg-options "-march=armv8.7-a+ls64 -O2" } */ #include void do_st64b(data512_t data) { __arm_st64b((void*)0x1000, data); diff --git a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c index fb88d633dd20772fd96e976a400fe52ae0bc3647..423d91b9a99f269d01d07428414ade7cc518c711 100644 --- a/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c +++ b/gcc/testsuite/gcc.target/aarch64/acle/pr110132.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv8.7-a" } */ +/* { dg-additional-options "-march=armv8.7-a+ls64" } */ /* Check that ls64 builtins can be invoked using a preprocesed testcase without triggering bogus builtin warnings, see PR110132. 
diff --git a/gcc/testsuite/gcc.target/aarch64/options_set_28.c b/gcc/testsuite/gcc.target/aarch64/options_set_28.c index 9e63768581e9d429e9408863942051b1b04761ac..d5b15f8bc5831de56fe667179d83d9c853529aaf 100644 --- a/gcc/testsuite/gcc.target/aarch64/options_set_28.c +++ b/gcc/testsuite/gcc.target/aarch64/options_set_28.c @@ -1,9 +1,9 @@ /* { dg-do compile } */ -/* { dg-additional-options "-march=armv9.3-a+nopredres+nols64+nomops" } */ +/* { dg-additional-options "-march=armv9.3-a+nopredres+nomops" } */ int main () { return 0; } -/* { dg-final { scan-assembler-times {\.arch armv9\.3\-a\+crc\+nopredres\+nols64\+nomops\n} 1 } } */ +/* { dg-final { scan-assembler-times {\.arch armv9\.3\-a\+crc\+nopredres\+nomops\n} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c index 2d76bfc23dfdcd78a74ec0e4845a3bd8d110b010..d8fc86d1557895f91ffe8be2f65d6581abe51568 100644 --- a/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c +++ b/gcc/testsuite/gcc.target/aarch64/pragma_cpp_predefs_2.c @@ -242,8 +242,8 @@ #pragma GCC push_options #pragma GCC target ("arch=armv8.7-a") -#ifndef __ARM_FEATURE_LS64 -#error "__ARM_FEATURE_LS64 is not defined but should be!" +#ifdef __ARM_FEATURE_LS64 +#error "__ARM_FEATURE_LS
RE: [PATCH]middle-end: inspect all exits for additional annotations for loop.
> > I think this isn't entirely good. For simple cases for > do {} while the condition ends up in the latch while for while () {} > loops it ends up in the header. In your case the latch isn't empty > so it doesn't end up with the conditional. > > I think your patch is OK to the point of looking at all loop exit > sources but you should elide the special-casing of header and > latch since it's really only exit conditionals that matter. > That makes sense, since in both cases the edges are in the respective blocks. Should have thought about it more. So how about this one. Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * tree-cfg.cc (replace_loop_annotate): Inspect loop edges for annotations. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-novect_gcond.c: New test. --- inline copy of patch --- diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c new file mode 100644 index ..01e69cbef9d51b234c08a400c78dc078d53252f1 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c @@ -0,0 +1,39 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break_hw } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include "tree-vect.h" + +#define N 306 +#define NEEDLE 136 + +int table[N]; + +__attribute__ ((noipa)) +int foo (int i, unsigned short parse_tables_n) +{ + parse_tables_n >>= 9; + parse_tables_n += 11; +#pragma GCC novector + while (i < N && parse_tables_n--) +table[i++] = 0; + + return table[NEEDLE]; +} + +int main () +{ + check_vect (); + +#pragma GCC novector + for (int j = 0; j < N; j++) +table[j] = -1; + + if (foo (0, 0x) != 0) +__builtin_abort (); + + return 0; +} diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index cdd439fe7506e7bc33654ffa027b493f23d278ac..bdffc3b4ed277724e81b7dd67fe7966e8ece0c13 
100644 --- a/gcc/tree-cfg.cc +++ b/gcc/tree-cfg.cc @@ -320,12 +320,9 @@ replace_loop_annotate (void) for (auto loop : loops_list (cfun, 0)) { - /* First look into the header. */ - replace_loop_annotate_in_block (loop->header, loop); - - /* Then look into the latch, if any. */ - if (loop->latch) - replace_loop_annotate_in_block (loop->latch, loop); + /* Check all exit source blocks for annotations. */ + for (auto e : get_loop_exit_edges (loop)) + replace_loop_annotate_in_block (e->src, loop); /* Push the global flag_finite_loops state down to individual loops. */ loop->finite_p = flag_finite_loops;
[PATCH][RFC] tree-optimization/113910 - improve bitmap_hash
The following tries to improve the actual hash function for hashing bitmaps. We're still getting collision rates as high as 23 for the testcase in the PR. The following improves this by properly mixing in the bitmap element starting bit number. This brings down the collision rate below 1.4, improving compile-time by 25% for the testcase but at the expense of bringing bitmap_hash into the profile at around 5% of the samples as collected by perf. When you actually mix each set bit number collisions are virtually non-existent but hashing is then taking 35% of the compile time. Any better ideas? PR tree-optimization/113910 * bitmap.cc (bitmap_hash): Improve hash function by mixing the bitmap element index rather than XORing it. XOR individual elements into the hash. --- gcc/bitmap.cc | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc index 459e32c1ad1..80e185d5146 100644 --- a/gcc/bitmap.cc +++ b/gcc/bitmap.cc @@ -2695,18 +2695,22 @@ hashval_t bitmap_hash (const_bitmap head) { const bitmap_element *ptr; - BITMAP_WORD hash = 0; + hashval_t hash = 0; int ix; gcc_checking_assert (!head->tree_form); for (ptr = head->first; ptr; ptr = ptr->next) { - hash ^= ptr->indx; + hash = iterative_hash_hashval_t (ptr->indx, hash); + BITMAP_WORD bits = 0; for (ix = 0; ix != BITMAP_ELEMENT_WORDS; ix++) - hash ^= ptr->bits[ix]; + bits ^= ptr->bits[ix]; + if (sizeof (bits) == 8 && sizeof (hashval_t) == 4) + bits ^= bits >> 32; + hash ^= (hashval_t)bits; } - return iterative_hash (&hash, sizeof (hash), 0); + return hash; } -- 2.35.3
[PATCH] [libiberty] remove TBAA violation in iterative_hash, improve code-gen
The following removes the TBAA violation present in iterative_hash. As we eventually LTO that it's important to fix. This also improves code generation for the >= 12 bytes loop by using | to compose the 4 byte words as at least GCC 7 and up can recognize that pattern and perform a 4 byte load while the variant with a + is not recognized (not on trunk either), I think we have an enhancement bug for this somewhere. Given we reliably merge and the bogus "optimized" path might be only relevant for archs that cannot do misaligned loads efficiently I've chosen to keep a specialization for aligned accesses. Bootstrapped and tested on x86_64-unknown-linux-gnu, OK for trunk? Thanks, Richard. libiberty/ * hashtab.c (iterative_hash): Remove TBAA violating handling of aligned little-endian case in favor of just keeping the aligned case special-cased. Use | for composing a larger word. --- libiberty/hashtab.c | 23 ++- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/libiberty/hashtab.c b/libiberty/hashtab.c index 48f28078114..e3a07256a30 100644 --- a/libiberty/hashtab.c +++ b/libiberty/hashtab.c @@ -940,26 +940,23 @@ iterative_hash (const void *k_in /* the key */, c = initval; /* the previous hash value */ /* handle most of the key */ -#ifndef WORDS_BIGENDIAN - /* On a little-endian machine, if the data is 4-byte aligned we can hash - by word for better speed. This gives nondeterministic results on - big-endian machines. */ - if (sizeof (hashval_t) == 4 && (((size_t)k)&3) == 0) -while (len >= 12)/* aligned */ + /* Provide specialization for the aligned case for targets that cannot + efficiently perform misaligned loads of a merged access. 
*/ + if ((((size_t)k)&3) == 0) +while (len >= 12) { - a += *(hashval_t *)(k+0); - b += *(hashval_t *)(k+4); - c += *(hashval_t *)(k+8); + a += (k[0] | ((hashval_t)k[1]<<8) | ((hashval_t)k[2]<<16) | ((hashval_t)k[3]<<24)); + b += (k[4] | ((hashval_t)k[5]<<8) | ((hashval_t)k[6]<<16) | ((hashval_t)k[7]<<24)); + c += (k[8] | ((hashval_t)k[9]<<8) | ((hashval_t)k[10]<<16)| ((hashval_t)k[11]<<24)); mix(a,b,c); k += 12; len -= 12; } else /* unaligned */ -#endif while (len >= 12) { - a += (k[0] +((hashval_t)k[1]<<8) +((hashval_t)k[2]<<16) +((hashval_t)k[3]<<24)); - b += (k[4] +((hashval_t)k[5]<<8) +((hashval_t)k[6]<<16) +((hashval_t)k[7]<<24)); - c += (k[8] +((hashval_t)k[9]<<8) +((hashval_t)k[10]<<16)+((hashval_t)k[11]<<24)); + a += (k[0] | ((hashval_t)k[1]<<8) | ((hashval_t)k[2]<<16) | ((hashval_t)k[3]<<24)); + b += (k[4] | ((hashval_t)k[5]<<8) | ((hashval_t)k[6]<<16) | ((hashval_t)k[7]<<24)); + c += (k[8] | ((hashval_t)k[9]<<8) | ((hashval_t)k[10]<<16)| ((hashval_t)k[11]<<24)); mix(a,b,c); k += 12; len -= 12; } -- 2.35.3
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
On 14/02/24 7:22 pm, Ajit Agarwal wrote: > Hello Richard: > > > On 14/02/24 4:03 pm, Richard Sandiford wrote: >> Hi, >> >> Thanks for working on this. >> >> You posted a version of this patch on Sunday too. If you need to repost >> to fix bugs or make other improvements, could you describe the changes >> that you've made since the previous version? It makes things easier >> to follow. > > Sure. Sorry for that I forgot to add that. There were certain asserts that I have removed it in the earlier patch that I have sent on Sunday and forgot to keep them. I have addressed them in this patch. I have done rtl_dce changes and they were not deleting some of the unwanted moves and hence I changed the code to address this in this patch. Thanks & Regards Ajit > >> >> Also, sorry for starting with a meta discussion about reviews, but >> there are multiple types of review comment, including: >> >> (1) Suggestions for changes that are worded as suggestions. >> >> (2) Suggestions for changes that are worded as questions ("Wouldn't it be >> better to do X?", etc). >> >> (3) Questions asking for an explanation or for more information. >> >> Just sending a new patch makes sense when the previous review comments >> were all like (1), and arguably also (1)+(2). But Alex's previous review >> included (3) as well. Could you go back and respond to his questions there? >> It would help understand some of the design choices. >> > > I have responded to Alex comments for the previous patches. > I have incorporated all of his comments in this patch. > > >> A natural starting point when reviewing a patch like this is to diff >> the current aarch64-ldp-fusion.cc with the new pair-fusion.cc. This shows >> many of the kind of changes that I'd expect. But it also seems to include >> some code reordering, such as putting fuse_pair after try_fuse_pair. >> If some reordering is necessary, could you try to organise the patch as >> a series in which the reordering is a separate step? 
It's a bit hard >> to review at the moment. (Reordering for cosmetic reasons is also OK, >> but again please separate it out for ease of review.) >> >> Maybe one way of making the review easier would be to split the aarch64 >> pass into the "target-dependent" and "target-independent" pieces >> in-place, i.e. keeping everything within aarch64-ldp-fusion.cc, and then >> (as separate patches) move the target-independent pieces outside >> config/aarch64. >> > Sure I will do that. > >> The patch includes: >> >>> * emit-rtl.cc: Modify ge with gt on PolyINT data structure. >>> * dce.cc: Add changes not to delete the load store pair. >>> * rtl-ssa/changes.cc: Modified assert code. >>> * var-tracking.cc: Modified assert code. >>> * df-problems.cc: Not to generate REG_UNUSED for multi >>> word registers that is required for rs6000 target. >> >> Please submit these separately, as independent preparatory patches, >> with an explanation for why they're needed & correct. But: >> > Sure I will do that.
> >>> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc >>> index 88ee0dd67fc..a8d0ee7c4db 100644 >>> --- a/gcc/df-problems.cc >>> +++ b/gcc/df-problems.cc >>> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >>> df_mw_hardreg *mws, >>>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) >>> { >>>unsigned int regno = mws->start_regno; >>> - df_set_note (REG_UNUSED, insn, mws->mw_reg); >>> + //df_set_note (REG_UNUSED, insn, mws->mw_reg); >>>dead_debug_insert_temp (debug, regno, insn, >>> DEBUG_TEMP_AFTER_WITH_REG); >>> >>>if (REG_DEAD_DEBUGGING) >>> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >>> df_mw_hardreg *mws, >>> if (!bitmap_bit_p (live, r) >>> && !bitmap_bit_p (artificial_uses, r)) >>> { >>> - df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >>> + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >>> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG); >>> if (REG_DEAD_DEBUGGING) >>> df_print_note ("adding 2: ", insn, REG_NOTES (insn)); >>> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def, >>> || bitmap_bit_p (artificial_uses, dregno) >>> || df_ignore_stack_reg (dregno))) >>> { >>> - rtx reg = (DF_REF_LOC (def)) >>> -? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >>> - df_set_note (REG_UNUSED, insn, reg); >>> + //rtx reg = (DF_REF_LOC (def)) >>> + // ? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >>> + //df_set_note (REG_UNUSED, insn, reg); >>>dead_debug_insert_temp (debug, dregno, insn, >>> DEBUG_TEMP_AFTER_WITH_REG); >>>if (REG_DEAD_DEBUGGING) >>> df_print_note ("adding 3: ", insn, REG_NOTES (insn)); >> >> I don't think this can be right. The last hunk of the var-tracking.cc >> patch also seems to be reverting a correct change. >> > > We ge
[PATCH] Skip gnat.dg/div_zero.adb on RISC-V
Like AArch64 and POWER, RISC-V does not support trap on zero divide. gcc/testsuite/ * gnat.dg/div_zero.adb: Skip on RISC-V. --- gcc/testsuite/gnat.dg/div_zero.adb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gnat.dg/div_zero.adb b/gcc/testsuite/gnat.dg/div_zero.adb index dedf3928db7..fb1c98caeff 100644 --- a/gcc/testsuite/gnat.dg/div_zero.adb +++ b/gcc/testsuite/gnat.dg/div_zero.adb @@ -1,5 +1,5 @@ -- { dg-do run } --- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* } } +-- { dg-skip-if "divide does not trap" { aarch64*-*-* powerpc*-*-* riscv*-*-* } } -- This test requires architecture- and OS-specific support code for unwinding -- through signal frames (typically located in *-unwind.h) to pass. Feel free -- 2.43.1 -- Andreas Schwab, SUSE Labs, sch...@suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different."
Re: [PATCH] analyzer/pr104308.c: Avoid optimizing away the copies
On Tue, 2022-05-03 at 17:29 -0700, Palmer Dabbelt wrote: > The test cases in analyzer/pr104308.c use uninitialized values in a > way > that doesn't plumb through to the return value of the function. This > allows the accesses to be deleted, which can result in the diagnostic > not firing. Thanks; LGTM for trunk. Dave > > gcc/testsuite/ChangeLog > > * gcc.dg/analyzer/pr104308.c (test_memmove_within_uninit): > Return the result of the copy. > (test_memcpy_from_uninit): Likewise. > --- > I was worried this had something to do with this test failing on > RISC-V. > I don't think that's actually the case (IIUC we're just not inlining > the > memmove, which elides the diagnostic), but I'd already written it so > I > figured I'd send it along. > --- > gcc/testsuite/gcc.dg/analyzer/pr104308.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/gcc/testsuite/gcc.dg/analyzer/pr104308.c > b/gcc/testsuite/gcc.dg/analyzer/pr104308.c > index a3a0cbb7317..ae40e59c41c 100644 > --- a/gcc/testsuite/gcc.dg/analyzer/pr104308.c > +++ b/gcc/testsuite/gcc.dg/analyzer/pr104308.c > @@ -8,12 +8,13 @@ int test_memmove_within_uninit (void) > { > char s[5]; /* { dg-message "region created on stack here" } */ > memmove(s, s + 1, 2); /* { dg-warning "use of uninitialized value" > } */ > - return 0; > + return s[0]; > } > > int test_memcpy_from_uninit (void) > { > char a1[5]; > char a2[5]; /* { dg-message "region created on stack here" } */ > - return (memcpy(a1, a2, 5) == a1); /* { dg-warning "use of > uninitialized value" } */ > + memcpy(a1, a2, 5); /* { dg-warning "use of uninitialized value" } > */ > + return a1[0]; > }
Re: [PATCH] coreutils-sum-pr108666.c: fix spurious LLP64 warnings
On Fri, 2024-02-02 at 23:55 +, Jonathan Yong wrote: > Attached patch OK? Fixes the following warnings: Thanks; looks good to me. Dave > coreutils-sum-pr108666.c:17:1: warning: conflicting types for built-in function ‘memcpy’; expected ‘void *(void *, const void *, long long unsigned int)’ [-Wbuiltin-declaration-mismatch] > 17 | memcpy(void* __restrict __dest, const void* __restrict > __src, size_t __n) > | ^~ > > coreutils-sum-pr108666.c:25:1: warning: conflicting types for built-in function ‘malloc’; expected ‘void *(long long unsigned int)’ [-Wbuiltin-declaration-mismatch] > 25 | malloc(size_t __size) __attribute__((__nothrow__, __leaf__)) > | ^~ > > Copied for review convenience: > diff --git a/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-pr108666.c b/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-pr108666.c > index 5684d1b02d4..dadd27eaf41 100644 > --- a/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-pr108666.c > +++ b/gcc/testsuite/c-c++-common/analyzer/coreutils-sum-pr108666.c > @@ -1,6 +1,6 @@ > /* Reduced from coreutils's sum.c: bsd_sum_stream */ > > -typedef long unsigned int size_t; > +typedef __SIZE_TYPE__ size_t; > typedef unsigned char __uint8_t; > typedef unsigned long int __uintmax_t; > typedef struct _IO_FILE FILE;
[PATCH v2 4/4] libstdc++: Optimize std::remove_extent compilation performance
This patch optimizes the compilation performance of std::remove_extent by dispatching to the new __remove_extent built-in trait. libstdc++-v3/ChangeLog: * include/std/type_traits (remove_extent): Use __remove_extent built-in trait. Signed-off-by: Ken Matsui --- libstdc++-v3/include/std/type_traits | 6 ++ 1 file changed, 6 insertions(+) diff --git a/libstdc++-v3/include/std/type_traits b/libstdc++-v3/include/std/type_traits index 3bde7cb8ba3..0fb1762186c 100644 --- a/libstdc++-v3/include/std/type_traits +++ b/libstdc++-v3/include/std/type_traits @@ -2064,6 +2064,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION // Array modifications. /// remove_extent +#if _GLIBCXX_USE_BUILTIN_TRAIT(__remove_extent) + template +struct remove_extent +{ using type = __remove_extent(_Tp); }; +#else template struct remove_extent { using type = _Tp; }; @@ -2075,6 +2080,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template struct remove_extent<_Tp[]> { using type = _Tp; }; +#endif /// remove_all_extents template -- 2.43.0
[PATCH v2 2/4] libstdc++: Optimize std::add_pointer compilation performance
This patch optimizes the compilation performance of std::add_pointer by dispatching to the new __add_pointer built-in trait. libstdc++-v3/ChangeLog: * include/std/type_traits (add_pointer): Use __add_pointer built-in trait. Signed-off-by: Ken Matsui --- libstdc++-v3/include/std/type_traits | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/libstdc++-v3/include/std/type_traits b/libstdc++-v3/include/std/type_traits index 21402fd8c13..3bde7cb8ba3 100644 --- a/libstdc++-v3/include/std/type_traits +++ b/libstdc++-v3/include/std/type_traits @@ -2121,6 +2121,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION { }; #endif + /// add_pointer +#if _GLIBCXX_USE_BUILTIN_TRAIT(__add_pointer) + template +struct add_pointer +{ using type = __add_pointer(_Tp); }; +#else template struct __add_pointer_helper { using type = _Tp; }; @@ -2129,7 +2135,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION struct __add_pointer_helper<_Tp, __void_t<_Tp*>> { using type = _Tp*; }; - /// add_pointer template struct add_pointer : public __add_pointer_helper<_Tp> @@ -2142,6 +2147,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template struct add_pointer<_Tp&&> { using type = _Tp*; }; +#endif #if __cplusplus > 201103L /// Alias template for remove_pointer -- 2.43.0
[PATCH v2 3/4] c++: Implement __remove_extent built-in trait
This patch implements built-in trait for std::remove_extent. gcc/cp/ChangeLog: * cp-trait.def: Define __remove_extent. * semantics.cc (finish_trait_type): Handle CPTK_REMOVE_EXTENT. gcc/testsuite/ChangeLog: * g++.dg/ext/has-builtin-1.C: Test existence of __remove_extent. * g++.dg/ext/remove_extent.C: New test. Signed-off-by: Ken Matsui --- gcc/cp/cp-trait.def | 1 + gcc/cp/semantics.cc | 5 + gcc/testsuite/g++.dg/ext/has-builtin-1.C | 3 +++ gcc/testsuite/g++.dg/ext/remove_extent.C | 16 4 files changed, 25 insertions(+) create mode 100644 gcc/testsuite/g++.dg/ext/remove_extent.C diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def index cec385ee501..3ff5611b60e 100644 --- a/gcc/cp/cp-trait.def +++ b/gcc/cp/cp-trait.def @@ -96,6 +96,7 @@ DEFTRAIT_EXPR (REF_CONSTRUCTS_FROM_TEMPORARY, "__reference_constructs_from_tempo DEFTRAIT_EXPR (REF_CONVERTS_FROM_TEMPORARY, "__reference_converts_from_temporary", 2) DEFTRAIT_TYPE (REMOVE_CV, "__remove_cv", 1) DEFTRAIT_TYPE (REMOVE_CVREF, "__remove_cvref", 1) +DEFTRAIT_TYPE (REMOVE_EXTENT, "__remove_extent", 1) DEFTRAIT_TYPE (REMOVE_POINTER, "__remove_pointer", 1) DEFTRAIT_TYPE (REMOVE_REFERENCE, "__remove_reference", 1) DEFTRAIT_TYPE (TYPE_PACK_ELEMENT, "__type_pack_element", -1) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index e23693ab57f..bf998377c88 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -12777,6 +12777,11 @@ finish_trait_type (cp_trait_kind kind, tree type1, tree type2, type1 = TREE_TYPE (type1); return cv_unqualified (type1); +case CPTK_REMOVE_EXTENT: + if (TREE_CODE (type1) == ARRAY_TYPE) + type1 = TREE_TYPE (type1); + return type1; + case CPTK_REMOVE_POINTER: if (TYPE_PTR_P (type1)) type1 = TREE_TYPE (type1); diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C b/gcc/testsuite/g++.dg/ext/has-builtin-1.C index 56e8db7ac32..4f1094befb9 100644 --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C @@ -170,6 +170,9 @@ #if !__has_builtin (__remove_cvref) # 
error "__has_builtin (__remove_cvref) failed" #endif +#if !__has_builtin (__remove_extent) +# error "__has_builtin (__remove_extent) failed" +#endif #if !__has_builtin (__remove_pointer) # error "__has_builtin (__remove_pointer) failed" #endif diff --git a/gcc/testsuite/g++.dg/ext/remove_extent.C b/gcc/testsuite/g++.dg/ext/remove_extent.C new file mode 100644 index 000..6183aca5a48 --- /dev/null +++ b/gcc/testsuite/g++.dg/ext/remove_extent.C @@ -0,0 +1,16 @@ +// { dg-do compile { target c++11 } } + +#define SA(X) static_assert((X),#X) + +class ClassType { }; + +SA(__is_same(__remove_extent(int), int)); +SA(__is_same(__remove_extent(int[2]), int)); +SA(__is_same(__remove_extent(int[2][3]), int[3])); +SA(__is_same(__remove_extent(int[][3]), int[3])); +SA(__is_same(__remove_extent(const int[2]), const int)); +SA(__is_same(__remove_extent(ClassType), ClassType)); +SA(__is_same(__remove_extent(ClassType[2]), ClassType)); +SA(__is_same(__remove_extent(ClassType[2][3]), ClassType[3])); +SA(__is_same(__remove_extent(ClassType[][3]), ClassType[3])); +SA(__is_same(__remove_extent(const ClassType[2]), const ClassType)); -- 2.43.0
[PATCH v2 1/4] c++: Implement __add_pointer built-in trait
This patch implements built-in trait for std::add_pointer. gcc/cp/ChangeLog: * cp-trait.def: Define __add_pointer. * semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER. gcc/testsuite/ChangeLog: * g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer. * g++.dg/ext/add_pointer.C: New test. Signed-off-by: Ken Matsui --- gcc/cp/cp-trait.def | 1 + gcc/cp/semantics.cc | 9 ++ gcc/testsuite/g++.dg/ext/add_pointer.C | 37 gcc/testsuite/g++.dg/ext/has-builtin-1.C | 3 ++ 4 files changed, 50 insertions(+) create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def index 394f006f20f..cec385ee501 100644 --- a/gcc/cp/cp-trait.def +++ b/gcc/cp/cp-trait.def @@ -48,6 +48,7 @@ #define DEFTRAIT_TYPE_DEFAULTED #endif +DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1) DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1) DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1) DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index 57840176863..e23693ab57f 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, tree type2, switch (kind) { +case CPTK_ADD_POINTER: + if (TREE_CODE (type1) == FUNCTION_TYPE + && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE)) + || FUNCTION_REF_QUALIFIED (type1))) + return type1; + if (TYPE_REF_P (type1)) + type1 = TREE_TYPE (type1); + return build_pointer_type (type1); + case CPTK_REMOVE_CV: return cv_unqualified (type1); diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C b/gcc/testsuite/g++.dg/ext/add_pointer.C new file mode 100644 index 000..3091510f3b5 --- /dev/null +++ b/gcc/testsuite/g++.dg/ext/add_pointer.C @@ -0,0 +1,37 @@ +// { dg-do compile { target c++11 } } + +#define SA(X) static_assert((X),#X) + +class ClassType { }; + +SA(__is_same(__add_pointer(int), int*)); +SA(__is_same(__add_pointer(int*), int**)); 
+SA(__is_same(__add_pointer(const int), const int*)); +SA(__is_same(__add_pointer(int&), int*)); +SA(__is_same(__add_pointer(ClassType*), ClassType**)); +SA(__is_same(__add_pointer(ClassType), ClassType*)); +SA(__is_same(__add_pointer(void), void*)); +SA(__is_same(__add_pointer(const void), const void*)); +SA(__is_same(__add_pointer(volatile void), volatile void*)); +SA(__is_same(__add_pointer(const volatile void), const volatile void*)); + +void f1(); +using f1_type = decltype(f1); +using pf1_type = decltype(&f1); +SA(__is_same(__add_pointer(f1_type), pf1_type)); + +void f2() noexcept; // PR libstdc++/78361 +using f2_type = decltype(f2); +using pf2_type = decltype(&f2); +SA(__is_same(__add_pointer(f2_type), pf2_type)); + +using fn_type = void(); +using pfn_type = void(*)(); +SA(__is_same(__add_pointer(fn_type), pfn_type)); + +SA(__is_same(__add_pointer(void() &), void() &)); +SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept)); +SA(__is_same(__add_pointer(void() const), void() const)); +SA(__is_same(__add_pointer(void(...) &), void(...) &)); +SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept)); +SA(__is_same(__add_pointer(void(...) const), void(...) const)); diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C b/gcc/testsuite/g++.dg/ext/has-builtin-1.C index 02b4b4d745d..56e8db7ac32 100644 --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C @@ -2,6 +2,9 @@ // { dg-do compile } // Verify that __has_builtin gives the correct answer for C++ built-ins. +#if !__has_builtin (__add_pointer) +# error "__has_builtin (__add_pointer) failed" +#endif #if !__has_builtin (__builtin_addressof) # error "__has_builtin (__builtin_addressof) failed" #endif -- 2.43.0
Re: [PATCH] c++: implicitly_declare_fn and access checks [PR113908]
On 2/14/24 08:46, Patrick Palka wrote: On Tue, 13 Feb 2024, Jason Merrill wrote: On 2/13/24 11:49, Patrick Palka wrote: Bootstrapped and regtested on x86_64-pc-linux-gnu, are one of both of these fixes OK for trunk? -- >8 -- Here during ahead of time checking of the non-dependent new-expr we synthesize B's copy constructor, which should be defined as deleted due to A's inaccessible copy constructor. But enforce_access incorrectly decides to defer the (silent) access check for A::A(const A&) during synthesization since current_template_parms is still set (before r14-557 it checked processing_template_decl which got cleared from implicitly_declare_fn), which leads to the access check leaking out to the template context that needed the synthesization. This patch narrowly fixes this regression in two sufficient ways: 1. Clear current_template_parms alongside processing_template_decl in implicitly_declare_fn so that it's more independent of context. Hmm, perhaps it or synthesized_method_walk should use maybe_push_to_top_level? That works nicely, and also fixes the other regression PR113332. There the lambda context triggering synthesization of a default ctor was causing maybe_dummy_object to misbehave during overload resolution of one of its member's default ctors, and now synthesization is context independent. 2. Don't defer a silent access check when in a template context, since such deferred checks will be replayed noisily at instantiation time which may not be what the caller intended. True, but returning a possibly incorrect 'false' is probably also not what the caller intended. It would be better to see that we never call enforce_access with tf_none in a template. If that's not feasible, I think we should still conservatively return true. Makes sense, I can experiment with that enforce_access access change as a follow-up. Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for trunk? OK. 
-- >8 -- Subject: [PATCH] c++: synthesized_method_walk context independence [PR113908] PR c++/113908 PR c++/113332 gcc/cp/ChangeLog: * method.cc (synthesized_method_walk): Use maybe_push_to_top_level. gcc/testsuite/ChangeLog: * g++.dg/template/non-dependent31.C: New test. * g++.dg/template/non-dependent32.C: New test. --- gcc/cp/method.cc | 2 ++ .../g++.dg/template/non-dependent31.C | 18 + .../g++.dg/template/non-dependent32.C | 20 +++ 3 files changed, 40 insertions(+) create mode 100644 gcc/testsuite/g++.dg/template/non-dependent31.C create mode 100644 gcc/testsuite/g++.dg/template/non-dependent32.C diff --git a/gcc/cp/method.cc b/gcc/cp/method.cc index 957496d3e18..98c10e6a8b5 100644 --- a/gcc/cp/method.cc +++ b/gcc/cp/method.cc @@ -2760,6 +2760,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p, return; } + bool push_to_top = maybe_push_to_top_level (TYPE_NAME (ctype)); ++cp_unevaluated_operand; ++c_inhibit_evaluation_warnings; push_deferring_access_checks (dk_no_deferred); @@ -2857,6 +2858,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p, pop_deferring_access_checks (); --cp_unevaluated_operand; --c_inhibit_evaluation_warnings; + maybe_pop_from_top_level (push_to_top); } /* DECL is a defaulted function whose exception specification is now diff --git a/gcc/testsuite/g++.dg/template/non-dependent31.C b/gcc/testsuite/g++.dg/template/non-dependent31.C new file mode 100644 index 000..3fa68f40fe1 --- /dev/null +++ b/gcc/testsuite/g++.dg/template/non-dependent31.C @@ -0,0 +1,18 @@ +// PR c++/113908 +// { dg-do compile { target c++11 } } + +struct A { + A(); +private: + A(const A&); +}; + +struct B { + A a; + + template + static void f() { new B(); } +}; + +template void B::f(); +static_assert(!__is_constructible(B, const B&), ""); diff --git a/gcc/testsuite/g++.dg/template/non-dependent32.C b/gcc/testsuite/g++.dg/template/non-dependent32.C new file mode 100644 index 000..246654c5b50 --- /dev/null +++ 
b/gcc/testsuite/g++.dg/template/non-dependent32.C @@ -0,0 +1,20 @@ +// PR c++/113332 +// { dg-do compile { target c++11 } } + +struct tuple { + template + static constexpr bool __is_implicitly_default_constructible() { return true; } + + template()> + tuple(); +}; + +struct DBusStruct { +private: + tuple data_; +}; + +struct IBusService { + int m = [] { DBusStruct{}; return 42; }(); +};
Re: [PATCH V1] Common infrastructure for load-store fusion for aarch64 and rs6000 target
Hello Richard: On 14/02/24 4:03 pm, Richard Sandiford wrote: > Hi, > > Thanks for working on this. > > You posted a version of this patch on Sunday too. If you need to repost > to fix bugs or make other improvements, could you describe the changes > that you've made since the previous version? It makes things easier > to follow. Sure. Sorry for that I forgot to add that. > > Also, sorry for starting with a meta discussion about reviews, but > there are multiple types of review comment, including: > > (1) Suggestions for changes that are worded as suggestions. > > (2) Suggestions for changes that are worded as questions ("Wouldn't it be > better to do X?", etc). > > (3) Questions asking for an explanation or for more information. > > Just sending a new patch makes sense when the previous review comments > were all like (1), and arguably also (1)+(2). But Alex's previous review > included (3) as well. Could you go back and respond to his questions there? > It would help understand some of the design choices. > I have responded to Alex comments for the previous patches. I have incorporated all of his comments in this patch. > A natural starting point when reviewing a patch like this is to diff > the current aarch64-ldp-fusion.cc with the new pair-fusion.cc. This shows > many of the kind of changes that I'd expect. But it also seems to include > some code reordering, such as putting fuse_pair after try_fuse_pair. > If some reordering is necessary, could you try to organise the patch as > a series in which the reordering is a separate step? It's a bit hard > to review at the moment. (Reordering for cosmetic reasons is also OK, > but again please separate it out for ease of review.) > > Maybe one way of making the review easier would be to split the aarch64 > pass into the "target-dependent" and "target-independent" pieces > in-place, i.e. 
keeping everything within aarch64-ldp-fusion.cc, and then > (as separate patches) move the target-independent pieces outside > config/aarch64. > Sure I will do that. > The patch includes: > >> * emit-rtl.cc: Modify ge with gt on PolyINT data structure. >> * dce.cc: Add changes not to delete the load store pair. >> * rtl-ssa/changes.cc: Modified assert code. >> * var-tracking.cc: Modified assert code. >> * df-problems.cc: Not to generate REG_UNUSED for multi >> word registers that is requied for rs6000 target. > > Please submit these separately, as independent preparatory patches, > with an explanation for why they're needed & correct. But: > Sure I will do that. >> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc >> index 88ee0dd67fc..a8d0ee7c4db 100644 >> --- a/gcc/df-problems.cc >> +++ b/gcc/df-problems.cc >> @@ -3360,7 +3360,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >> df_mw_hardreg *mws, >>if (df_whole_mw_reg_unused_p (mws, live, artificial_uses)) >> { >>unsigned int regno = mws->start_regno; >> - df_set_note (REG_UNUSED, insn, mws->mw_reg); >> + //df_set_note (REG_UNUSED, insn, mws->mw_reg); >>dead_debug_insert_temp (debug, regno, insn, >> DEBUG_TEMP_AFTER_WITH_REG); >> >>if (REG_DEAD_DEBUGGING) >> @@ -3375,7 +3375,7 @@ df_set_unused_notes_for_mw (rtx_insn *insn, struct >> df_mw_hardreg *mws, >> if (!bitmap_bit_p (live, r) >> && !bitmap_bit_p (artificial_uses, r)) >>{ >> -df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >> + // df_set_note (REG_UNUSED, insn, regno_reg_rtx[r]); >> dead_debug_insert_temp (debug, r, insn, DEBUG_TEMP_AFTER_WITH_REG); >> if (REG_DEAD_DEBUGGING) >>df_print_note ("adding 2: ", insn, REG_NOTES (insn)); >> @@ -3493,9 +3493,9 @@ df_create_unused_note (rtx_insn *insn, df_ref def, >> || bitmap_bit_p (artificial_uses, dregno) >> || df_ignore_stack_reg (dregno))) >> { >> - rtx reg = (DF_REF_LOC (def)) >> -? 
*DF_REF_REAL_LOC (def): DF_REF_REG (def); >> - df_set_note (REG_UNUSED, insn, reg); >> + //rtx reg = (DF_REF_LOC (def)) >> + // ? *DF_REF_REAL_LOC (def): DF_REF_REG (def); >> + //df_set_note (REG_UNUSED, insn, reg); >>dead_debug_insert_temp (debug, dregno, insn, >> DEBUG_TEMP_AFTER_WITH_REG); >>if (REG_DEAD_DEBUGGING) >> df_print_note ("adding 3: ", insn, REG_NOTES (insn)); > > I don't think this can be right. The last hunk of the var-tracking.cc > patch also seems to be reverting a correct change. > We generate sequential registers using (subreg V16QI (reg OOmode) 16) and (reg OOmode 0), where OOmode is 256 bits and V16QI is 128 bits, in order to generate a sequential register pair. If I keep the above REG_UNUSED notes, ira generates REG_UNUSED, the cprop_hardreg and dce passes delete the store pairs, and we get incorrect code. With the REG_UNUSED notes commented out they are not generated, we get the correct store pair fusion, and cprop_hardreg and dce don't delete the pairs. Ple
Re: [PATCH] testsuite: gdc: Require ucn in gdc.test/runnable/mangle.d etc. [PR104739]
Excerpts from Rainer Orth's message of Februar 14, 2024 11:51 am: > gdc.test/runnable/mangle.d and two other tests come out UNRESOLVED on > Solaris with the native assembler: > > UNRESOLVED: gdc.test/runnable/mangle.d compilation failed to produce > executable > UNRESOLVED: gdc.test/runnable/mangle.d -shared-libphobos compilation failed > to produce executable > UNRESOLVED: gdc.test/runnable/testmodule.d compilation failed to produce > executable > UNRESOLVED: gdc.test/runnable/testmodule.d -shared-libphobos compilation > failed to produce executable > UNRESOLVED: gdc.test/runnable/ufcs.d compilation failed to produce > executable > UNRESOLVED: gdc.test/runnable/ufcs.d -shared-libphobos compilation failed > to produce executable > > Assembler: mangle.d > "/var/tmp//cci9q2Sc.s", line 115 : Syntax error > Near line: "movzbl test_эльфийские_письмена_9, %eax" > "/var/tmp//cci9q2Sc.s", line 115 : Syntax error > Near line: "movzbl test_эльфийские_письмена_9, %eax" > "/var/tmp//cci9q2Sc.s", line 115 : Syntax error > Near line: "movzbl test_эльфийские_письмена_9, %eax" > "/var/tmp//cci9q2Sc.s", line 115 : Syntax error > Near line: "movzbl test_эльфийские_письмена_9, %eax" > "/var/tmp//cci9q2Sc.s", line 115 : Syntax error > [...] > > since /bin/as lacks UCN support. > > Iain recently added UNICODE_NAMES: annotations to the affected tests and > those recently were imported into trunk. > > This patch handles the DejaGnu side of things, adding > > { dg-require-effective-target ucn } > > to those tests on the fly. > > Tested on i386-pc-solaris2.11, sparc-sun-solaris2.11 (as and gas each), > and x86_64-pc-linux-gnu. > > Ok for trunk. > OK. Thanks! Iain.
Re: [PATCH] tree-optimization/113910 - huge compile time during PTA
On Wed, 14 Feb 2024, Richard Biener wrote: > For the testcase in PR113910 we spend a lot of time in PTA comparing > bitmaps for looking up equivalence class members. This points to > the very weak bitmap_hash function which effectively hashes set > and a subset of not set bits. The following improves it by mixing > that weak result with the population count of the bitmap, reducing > the number of collisions significantly. It's still by no means > a good hash function. > > One major problem with it was that it simply truncated the > BITMAP_WORD sized intermediate hash to hashval_t which is > unsigned int, effectively not hashing half of the bits. That solves > most of the slowness. Mixing in the population count improves > compile-time by another 30% though. > > This reduces the compile-time for the testcase from tens of minutes > to 30 seconds and PTA time from 99% to 25%. bitmap_equal_p is gone > from the profile. > > Bootstrap and regtest running on x86_64-unknown-linux-gnu, will > push to trunk and branches. Ha, and it breaks bootstrap because I misunderstood bitmap_count_bits_in_word (should be word_s_). Fixing this turns out that hashing the population count doesn't help anything so I'm re-testing the following simpler variant, giving up on the cheap last 25% but solving the regression as well. Richard. >From a76aebfdc4b6247db6a061e6395fd088a5694122 Mon Sep 17 00:00:00 2001 From: Richard Biener Date: Wed, 14 Feb 2024 12:33:13 +0100 Subject: [PATCH] tree-optimization/113910 - huge compile time during PTA To: gcc-patches@gcc.gnu.org For the testcase in PR113910 we spend a lot of time in PTA comparing bitmaps for looking up equivalence class members. This points to the very weak bitmap_hash function which effectively hashes set and a subset of not set bits. The major problem with it is that it simply truncates the BITMAP_WORD sized intermediate hash to hashval_t which is unsigned int, effectively not hashing half of the bits. 
This reduces the compile-time for the testcase from tens of minutes to 42 seconds and PTA time from 99% to 46%. PR tree-optimization/113910 * bitmap.cc (bitmap_hash): Mix the full element "hash" to the hashval_t hash. --- gcc/bitmap.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc index 6cf326bca5a..459e32c1ad1 100644 --- a/gcc/bitmap.cc +++ b/gcc/bitmap.cc @@ -2706,7 +2706,7 @@ bitmap_hash (const_bitmap head) for (ix = 0; ix != BITMAP_ELEMENT_WORDS; ix++) hash ^= ptr->bits[ix]; } - return (hashval_t)hash; + return iterative_hash (&hash, sizeof (hash), 0); } -- 2.35.3
Re: [PATCH] c++: implicitly_declare_fn and access checks [PR113908]
On Tue, 13 Feb 2024, Jason Merrill wrote: > On 2/13/24 11:49, Patrick Palka wrote: > > Bootstrapped and regtested on x86_64-pc-linux-gnu, are one of > > both of these fixes OK for trunk? > > > > -- >8 -- > > > > Here during ahead of time checking of the non-dependent new-expr we > > synthesize B's copy constructor, which should be defined as deleted > > due to A's inaccessible copy constructor. But enforce_access incorrectly > > decides to defer the (silent) access check for A::A(const A&) during > > synthesization since current_template_parms is still set (before r14-557 > > it checked processing_template_decl which got cleared from > > implicitly_declare_fn), which leads to the access check leaking out to > > the template context that needed the synthesization. > > > > This patch narrowly fixes this regression in two sufficient ways: > > > > 1. Clear current_template_parms alongside processing_template_decl > > in implicitly_declare_fn so that it's more independent of context. > > Hmm, perhaps it or synthesized_method_walk should use maybe_push_to_top_level? That works nicely, and also fixes the other regression PR113332. There the lambda context triggering synthesization of a default ctor was causing maybe_dummy_object to misbehave during overload resolution of one of its member's default ctors, and now synthesization is context independent. > > > 2. Don't defer a silent access check when in a template context, > > since such deferred checks will be replayed noisily at instantiation > > time which may not be what the caller intended. > > True, but returning a possibly incorrect 'false' is probably also not what the > caller intended. It would be better to see that we never call enforce_access > with tf_none in a template. If that's not feasible, I think we should still > conservatively return true. Makes sense, I can experiment with that enforce_access access change as a follow-up. Bootstrapped and regtested on x86_64-pc-linux-gnu, does this look OK for trunk? 
-- >8 -- Subject: [PATCH] c++: synthesized_method_walk context independence [PR113908] PR c++/113908 PR c++/113332 gcc/cp/ChangeLog: * method.cc (synthesized_method_walk): Use maybe_push_to_top_level. gcc/testsuite/ChangeLog: * g++.dg/template/non-dependent31.C: New test. * g++.dg/template/non-dependent32.C: New test. --- gcc/cp/method.cc | 2 ++ .../g++.dg/template/non-dependent31.C | 18 + .../g++.dg/template/non-dependent32.C | 20 +++ 3 files changed, 40 insertions(+) create mode 100644 gcc/testsuite/g++.dg/template/non-dependent31.C create mode 100644 gcc/testsuite/g++.dg/template/non-dependent32.C diff --git a/gcc/cp/method.cc b/gcc/cp/method.cc index 957496d3e18..98c10e6a8b5 100644 --- a/gcc/cp/method.cc +++ b/gcc/cp/method.cc @@ -2760,6 +2760,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p, return; } + bool push_to_top = maybe_push_to_top_level (TYPE_NAME (ctype)); ++cp_unevaluated_operand; ++c_inhibit_evaluation_warnings; push_deferring_access_checks (dk_no_deferred); @@ -2857,6 +2858,7 @@ synthesized_method_walk (tree ctype, special_function_kind sfk, bool const_p, pop_deferring_access_checks (); --cp_unevaluated_operand; --c_inhibit_evaluation_warnings; + maybe_pop_from_top_level (push_to_top); } /* DECL is a defaulted function whose exception specification is now diff --git a/gcc/testsuite/g++.dg/template/non-dependent31.C b/gcc/testsuite/g++.dg/template/non-dependent31.C new file mode 100644 index 000..3fa68f40fe1 --- /dev/null +++ b/gcc/testsuite/g++.dg/template/non-dependent31.C @@ -0,0 +1,18 @@ +// PR c++/113908 +// { dg-do compile { target c++11 } } + +struct A { + A(); +private: + A(const A&); +}; + +struct B { + A a; + + template + static void f() { new B(); } +}; + +template void B::f(); +static_assert(!__is_constructible(B, const B&), ""); diff --git a/gcc/testsuite/g++.dg/template/non-dependent32.C b/gcc/testsuite/g++.dg/template/non-dependent32.C new file mode 100644 index 000..246654c5b50 --- /dev/null +++ 
b/gcc/testsuite/g++.dg/template/non-dependent32.C @@ -0,0 +1,20 @@ +// PR c++/113332 +// { dg-do compile { target c++11 } } + +struct tuple { + template + static constexpr bool __is_implicitly_default_constructible() { return true; } + + template()> + tuple(); +}; + +struct DBusStruct { +private: + tuple data_; +}; + +struct IBusService { + int m = [] { DBusStruct{}; return 42; }(); +}; -- 2.44.0.rc0.46.g2996f11c1d
Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
On Wed, 14 Feb 2024, Andrew Stubbs wrote: > On 14/02/2024 13:27, Richard Biener wrote: > > On Wed, 14 Feb 2024, Andrew Stubbs wrote: > > > >> On 13/02/2024 08:26, Richard Biener wrote: > >>> On Mon, 12 Feb 2024, Thomas Schwinge wrote: > >>> > Hi! > > On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote: > > I've committed this patch > > ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691 > "amdgcn: add -march=gfx1030 EXPERIMENTAL". > > The RDNA2 ISA variant doesn't support certain instructions previous > implemented in GCC/GCN, so a number of patterns etc. had to be disabled: > > > [...] Vector > > reductions will need to be reworked for RDNA2. [...] > > > * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2. > > (addc3): Add RDNA2 syntax variant. > > (subc3): Likewise. > > (2_exec): Add RDNA2 alternatives. > > (vec_cmpdi): Likewise. > > (vec_cmpdi): Likewise. > > (vec_cmpdi_exec): Likewise. > > (vec_cmpdi_exec): Likewise. > > (vec_cmpdi_dup): Likewise. > > (vec_cmpdi_dup_exec): Likewise. > > (reduc__scal_): Disable for RDNA2. > > (*_dpp_shr_): Likewise. > > (*plus_carry_dpp_shr_): Likewise. > > (*plus_carry_in_dpp_shr_): Likewise. > > Etc. The expectation being that GCC middle end copes with this, and > synthesizes some less ideal yet still functional vector code, I presume. > > The later RDNA3/gfx1100 support builds on top of this, and that's what > I'm currently working on getting proper GCC/GCN target (not offloading) > results for. > > I'm seeing a good number of execution test FAILs (regressions compared to > my earlier non-gfx1100 testing), and I've now tracked down where one > large class of those comes into existance -- not yet how to resolve, > unfortunately. But maybe, with you guys' combined vectorizer and back > end experience, the latter will be done quickly? > > Richard, I don't know if you've ever run actual GCC/GCN target (not > offloading) testing; let me know if you have any questions about that. 
> >>> > >>> I've only done offload testing - in the x86_64 build tree run > >>> check-target-libgomp. If you can tell me how to do GCN target testing > >>> (maybe document it on the wiki even!) I can try do that as well. > >>> > Given that (at least largely?) the same patterns etc. are disabled as in > my gfx1100 configuration, I suppose your gfx1030 one would exhibit the > same issues. You can build GCC/GCN target like you build the offloading > one, just remove '--enable-as-accelerator-for=[...]'. Likely, you can > even use a offloading GCC/GCN build to reproduce the issue below. > > One example is the attached 'builtin-bitops-1.c', reduced from > 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is > miscompiled as soon as '-ftree-vectorize' is effective: > > $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c > -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ > -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all > -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 > -O1 > -ftree-vectorize > > In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for > '-march=gfx90a' vs. '-march=gfx1100', we see: > > +builtin-bitops-1.c:7:17: missed: reduc op not supported by > target. > > ..., and therefore: > > -builtin-bitops-1.c:7:17: note: Reduce using direct vector > reduction. > +builtin-bitops-1.c:7:17: note: Reduce using vector shifts > +builtin-bitops-1.c:7:17: note: extract scalar result > > That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a > chain of summation of 'VEC_PERM_EXPR's. 
However, there's wrong code > generated: > > $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out > i=1, ints[i]=0x1 a=1, b=2 > i=2, ints[i]=0x8000 a=1, b=2 > i=3, ints[i]=0x2 a=1, b=2 > i=4, ints[i]=0x4000 a=1, b=2 > i=5, ints[i]=0x1 a=1, b=2 > i=6, ints[i]=0x8000 a=1, b=2 > i=7, ints[i]=0xa5a5a5a5 a=16, b=32 > i=8, ints[i]=0x5a5a5a5a a=16, b=32 > i=9, ints[i]=0xcafe a=11, b=22 > i=10, ints[i]=0xcafe00 a=11, b=22 > i=11, ints[i]=0xcafe a=11, b=22 > i=12, ints[i]=0x a=32, b=64 > > (I can't tell if the 'b = 2 * a' pattern is purely coincidental?) > > I don't speak enough "vectorization" to fully understand the generic > vectorized algorithm and its implementation. It appears that the > "Reduce using vector shifts" code has been around for a very long time, > but also has gone t
Re: [PATCH V2] rs6000: New pass for replacement of adjacent loads fusion (lxv).
Hello Alex: On 24/01/24 10:13 pm, Alex Coplan wrote: > Hi Ajit, > > On 21/01/2024 19:57, Ajit Agarwal wrote: >> >> Hello All: >> >> New pass to replace adjacent memory addresses lxv with lxvp. >> Added common infrastructure for load store fusion for >> different targets. > > Thanks for this, it would be nice to see the load/store pair pass > generalized to multiple targets. > > I assume you are targeting GCC 15 for this, as we are in stage 4 at > the moment? > >> >> Common routines are refactored in fusion-common.h. >> >> AARCH64 load/store fusion pass is not changed with the >> common infrastructure. > > I think any patch to generalize the load/store pair fusion pass should > update the aarch64 code at the same time to use the generic > infrastructure, instead of duplicating the code. > > As a general comment, I think we should move as much of the code as > possible to target-independent code, with only the bits that are truly > target-specific (e.g. deciding which modes to allow for a load/store > pair operand) in target code. > > In terms of structuring the interface between generic code and target > code, I think it would be pragmatic to use a class with (in some cases, > pure) virtual functions that can be overriden by targets to implement > any target-specific behaviour. > > IMO the generic class should be implemented in its own .cc instead of > using a header-only approach. The target code would then define a > derived class which overrides the virtual functions (where necessary) > declared in the generic class, and then instantiate the derived class to > create a target-customized instance of the pass. Incorporated the above comments in the recent patch sent. 
> > A more traditional GCC approach would be to use optabs and target hooks > to customize the behaviour of the pass to handle target-specific > aspects, but: > - Target hooks are quite heavyweight, and we'd potentially have to add >quite a few hooks just for one pass that (at least initially) will >only be used by a couple of targets. > - Using classes allows both sides to easily maintain their own state >and share that state where appropriate. > > Nit on naming: I understand you want to move away from ldp_fusion, but > how about pair_fusion or mem_pair_fusion instead of just "fusion" as a > base name? IMO just "fusion" isn't very clear as to what the pass is > trying to achieve. > I have made it pair_fusion. > In general the code could do with a lot more commentary to explain the > rationale for various things / explain the high-level intent of the > code. > > Unfortunately I'm not familiar with the DF framework (I've only really > worked with RTL-SSA for the aarch64 pass), so I haven't commented on the > use of that framework, but it would be nice if what you're trying to do > could be done using RTL-SSA instead of using DF directly. > I have used RTL-SSA def-use chains in many places in the recent patch. But the DF framework is still useful, as it gives us a pointer to the rtx through DF_REF_LOC, which we can then modify easily. This is missing from RTL-SSA, so wherever a LOC needs to be changed I have used the DF framework in the recent patch. > Hopefully Richard S can chime in on those aspects. > > My main concerns with the patch at the moment (apart from the code > duplication) is that it looks like: > > - The patch removes alias analysis from try_fuse_pair, which is unsafe. > - The patch tries to make its own RTL changes inside >rs6000_gen_load_pair, but it should let fuse_pair make those changes >using RTL-SSA instead. > My mistake: I had removed the alias analysis from try_fuse_pair. 
In the recent patch I kept all the code in aarch64-ldp-fusion intact, except for organizing the generic and target-dependent code through pure virtual functions. > I've left some more specific (but still mostly high-level) comments below. > >> >> For AARCH64 architectures just include "fusion-common.h" >> and target dependent code can be added to that. >> >> >> Alex/Richard: >> >> If you would like me to add for AARCH64 I can do that for AARCH64. >> >> If you would like to do that is fine with me. >> >> Bootstrapped and regtested with powerpc64-linux-gnu. >> >> Improvement in performance is seen with Spec 2017 spec FP benchmarks. >> >> Thanks & Regards >> Ajit >> >> rs6000: New pass for replacement of adjacent lxv with lxvp. > > Are you looking to handle stores eventually, out of interest? Looking > at rs6000-vecload-opt.cc:fusion_bb it looks like you're just handling > loads at the moment. > >> I have included store fusion also in the recent patch. >> New pass to replace adjacent memory addresses lxv with lxvp. >> Added common infrastructure for load store fusion for >> different targets. >> >> Common routines are refactored in fusion-common.h. > I've just done a very quick scan through this file as it mostly just > looks to be identical to existing code in aarch64-ldp-fusion.cc. > >> >> 2024-01-21 Ajit Kumar Agarwal >>
Re: [PATCH]middle-end: inspect all exits for additional annotations for loop.
On Wed, 14 Feb 2024, Tamar Christina wrote: > Hi All, > > Attaching a pragma to a loop which has a complex condition often gets the > pragma > dropped. e.g. > > #pragma GCC novector > while (i < N && parse_tables_n--) > > before lowering this is represented as: > > if (ANNOTATE_EXPR ) ... > > But after lowering the condition is broken appart and attached to the final > component of the expression: > > if (parse_tables_n.2_2 != 0) goto ; else goto ; > : > iftmp.1D.4452 = 1; > goto ; > : > iftmp.1D.4452 = 0; > : > D.4451 = .ANNOTATE (iftmp.1D.4452, 2, 0); > if (D.4451 != 0) goto ; else goto ; > : > > and it's never heard from again because during replace_loop_annotate we only > inspect the loop header and latch for annotations. > > Since annotations were supposed to apply to the loop as a whole this fixes it > by also checking the loop exit src blocks for annotations. > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > Ok for master? I think this isn't entirely good. For simple cases for do {} while the condition ends up in the latch while for while () {} loops it ends up in the header. In your case the latch isn't empty so it doesn't end up with the conditional. I think your patch is OK to the point of looking at all loop exit sources but you should elide the special-casing of header and latch since it's really only exit conditionals that matter. Richard. > Thanks, > Tamar > > gcc/ChangeLog: > > * tree-cfg.cc (replace_loop_annotate): Inspect loop edges for > annotations. > > gcc/testsuite/ChangeLog: > > * gcc.dg/vect/vect-novect_gcond.c: New test. 
> > --- inline copy of patch -- > diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c > b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c > new file mode 100644 > index > ..01e69cbef9d51b234c08a400c78dc078d53252f1 > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c > @@ -0,0 +1,39 @@ > +/* { dg-add-options vect_early_break } */ > +/* { dg-require-effective-target vect_early_break_hw } */ > +/* { dg-require-effective-target vect_int } */ > +/* { dg-additional-options "-O3" } */ > + > +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ > + > +#include "tree-vect.h" > + > +#define N 306 > +#define NEEDLE 136 > + > +int table[N]; > + > +__attribute__ ((noipa)) > +int foo (int i, unsigned short parse_tables_n) > +{ > + parse_tables_n >>= 9; > + parse_tables_n += 11; > +#pragma GCC novector > + while (i < N && parse_tables_n--) > +table[i++] = 0; > + > + return table[NEEDLE]; > +} > + > +int main () > +{ > + check_vect (); > + > +#pragma GCC novector > + for (int j = 0; j < N; j++) > +table[j] = -1; > + > + if (foo (0, 0x) != 0) > +__builtin_abort (); > + > + return 0; > +} > diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc > index > cdd439fe7506e7bc33654ffa027b493f23d278ac..a29681bffb902d2d05e3f18764ab519aacb3c5bc > 100644 > --- a/gcc/tree-cfg.cc > +++ b/gcc/tree-cfg.cc > @@ -327,6 +327,10 @@ replace_loop_annotate (void) >if (loop->latch) > replace_loop_annotate_in_block (loop->latch, loop); > > + /* Then also check all other exits. */ > + for (auto e : get_loop_exit_edges (loop)) > + replace_loop_annotate_in_block (e->src, loop); > + >/* Push the global flag_finite_loops state down to individual loops. > */ >loop->finite_p = flag_finite_loops; > } > > > > > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
On 14/02/2024 13:27, Richard Biener wrote: On Wed, 14 Feb 2024, Andrew Stubbs wrote: On 13/02/2024 08:26, Richard Biener wrote: On Mon, 12 Feb 2024, Thomas Schwinge wrote: Hi! On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote: I've committed this patch ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691 "amdgcn: add -march=gfx1030 EXPERIMENTAL". The RDNA2 ISA variant doesn't support certain instructions previous implemented in GCC/GCN, so a number of patterns etc. had to be disabled: [...] Vector reductions will need to be reworked for RDNA2. [...] * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2. (addc3): Add RDNA2 syntax variant. (subc3): Likewise. (2_exec): Add RDNA2 alternatives. (vec_cmpdi): Likewise. (vec_cmpdi): Likewise. (vec_cmpdi_exec): Likewise. (vec_cmpdi_exec): Likewise. (vec_cmpdi_dup): Likewise. (vec_cmpdi_dup_exec): Likewise. (reduc__scal_): Disable for RDNA2. (*_dpp_shr_): Likewise. (*plus_carry_dpp_shr_): Likewise. (*plus_carry_in_dpp_shr_): Likewise. Etc. The expectation being that GCC middle end copes with this, and synthesizes some less ideal yet still functional vector code, I presume. The later RDNA3/gfx1100 support builds on top of this, and that's what I'm currently working on getting proper GCC/GCN target (not offloading) results for. I'm seeing a good number of execution test FAILs (regressions compared to my earlier non-gfx1100 testing), and I've now tracked down where one large class of those comes into existance -- not yet how to resolve, unfortunately. But maybe, with you guys' combined vectorizer and back end experience, the latter will be done quickly? Richard, I don't know if you've ever run actual GCC/GCN target (not offloading) testing; let me know if you have any questions about that. I've only done offload testing - in the x86_64 build tree run check-target-libgomp. If you can tell me how to do GCN target testing (maybe document it on the wiki even!) I can try do that as well. Given that (at least largely?) 
the same patterns etc. are disabled as in my gfx1100 configuration, I suppose your gfx1030 one would exhibit the same issues. You can build GCC/GCN target like you build the offloading one, just remove '--enable-as-accelerator-for=[...]'. Likely, you can even use an offloading GCC/GCN build to reproduce the issue below. One example is the attached 'builtin-bitops-1.c', reduced from 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is miscompiled as soon as '-ftree-vectorize' is effective: $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for '-march=gfx90a' vs. '-march=gfx1100', we see: +builtin-bitops-1.c:7:17: missed: reduc op not supported by target. ..., and therefore: -builtin-bitops-1.c:7:17: note: Reduce using direct vector reduction. +builtin-bitops-1.c:7:17: note: Reduce using vector shifts +builtin-bitops-1.c:7:17: note: extract scalar result That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a chain of summation of 'VEC_PERM_EXPR's. However, there's wrong code generated: $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out i=1, ints[i]=0x1 a=1, b=2 i=2, ints[i]=0x8000 a=1, b=2 i=3, ints[i]=0x2 a=1, b=2 i=4, ints[i]=0x4000 a=1, b=2 i=5, ints[i]=0x1 a=1, b=2 i=6, ints[i]=0x8000 a=1, b=2 i=7, ints[i]=0xa5a5a5a5 a=16, b=32 i=8, ints[i]=0x5a5a5a5a a=16, b=32 i=9, ints[i]=0xcafe a=11, b=22 i=10, ints[i]=0xcafe00 a=11, b=22 i=11, ints[i]=0xcafe a=11, b=22 i=12, ints[i]=0xffffffff a=32, b=64 (I can't tell if the 'b = 2 * a' pattern is purely coincidental?) I don't speak enough "vectorization" to fully understand the generic vectorized algorithm and its implementation. 
It appears that the "Reduce using vector shifts" code has been around for a very long time, but also has gone through a number of changes. I can't tell which GCC targets/configurations it's actually used for (in the same way as for GCN gfx1100), and thus whether there's an issue in that vectorizer code, or rather in the GCN back end, or GCN back end parameterizing the generic code? The "shift" reduction is basically doing reduction by repeatedly adding the upper to the lower half of the vector (each time halving the vector size). Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code: int my_popcount (unsigned int x) { int stmp__12.12; vector(64) int vect__12.11; vector(64) unsigned int vect__1.8; vector(64) unsigned int _13; vector(64) unsigned int vect_cst__18; vector(64) int [all others];
Re: [PATCH v2] c++: Defer emitting inline variables [PR113708]
On 2/14/24 06:03, Nathaniel Shead wrote: On Tue, Feb 13, 2024 at 09:47:27PM -0500, Jason Merrill wrote: On 2/13/24 20:34, Nathaniel Shead wrote: On Tue, Feb 13, 2024 at 06:08:42PM -0500, Jason Merrill wrote: On 2/11/24 08:26, Nathaniel Shead wrote: Currently inline vars imported from modules aren't correctly finalised, which means that import_export_decl gets called at the end of TU processing despite not being meant to for these kinds of declarations. I disagree that it's not meant to; inline variables are vague linkage just like template instantiations, so the bug seems to be that import_export_decl doesn't accept them. And on the other side, that make_rtl_for_nonlocal_decl doesn't defer them like instantations. Jason True, that's a good point. I think I confused myself here. Here's a fixed patch that looks a lot cleaner. Bootstrapped and regtested (so far just dg.exp and modules.exp) on x86_64-pc-linux-gnu, OK for trunk if full regtest succeeds? OK. A full bootstrap failed two tests in dwarf2.exp, which seem to be caused by an unreferenced 'inline' variable not being emitted into the debug info and thus causing the checks for its existence to fail. Adding a reference to the vars cause the tests to pass. Now fully bootstrapped and regtested on x86_64-pc-linux-gnu, still OK for trunk? (Only change is the two adjusted testcases.) OK. -- >8 -- Inline variables are vague-linkage, and may or may not need to be emitted in any TU that they are part of, similarly to e.g. template instantiations. Currently 'import_export_decl' assumes that inline variables have already been emitted when it comes to end-of-TU processing, and so crashes when importing non-trivially-initialised variables from a module, as they have not yet been finalised. This patch fixes this by ensuring that inline variables are always deferred till end-of-TU processing, unifying the behaviour for module and non-module code. 
PR c++/113708 gcc/cp/ChangeLog: * decl.cc (make_rtl_for_nonlocal_decl): Defer inline variables. * decl2.cc (import_export_decl): Support inline variables. gcc/testsuite/ChangeLog: * g++.dg/debug/dwarf2/inline-var-1.C: Reference 'a' to ensure it is emitted. * g++.dg/debug/dwarf2/inline-var-3.C: Likewise. * g++.dg/modules/init-7_a.H: New test. * g++.dg/modules/init-7_b.C: New test. Signed-off-by: Nathaniel Shead --- gcc/cp/decl.cc | 4 gcc/cp/decl2.cc | 7 +-- gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C | 2 ++ gcc/testsuite/g++.dg/debug/dwarf2/inline-var-3.C | 2 ++ gcc/testsuite/g++.dg/modules/init-7_a.H | 6 ++ gcc/testsuite/g++.dg/modules/init-7_b.C | 6 ++ 6 files changed, 25 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/g++.dg/modules/init-7_a.H create mode 100644 gcc/testsuite/g++.dg/modules/init-7_b.C diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc index 3e41fd4fa31..969513c069a 100644 --- a/gcc/cp/decl.cc +++ b/gcc/cp/decl.cc @@ -7954,6 +7954,10 @@ make_rtl_for_nonlocal_decl (tree decl, tree init, const char* asmspec) && DECL_IMPLICIT_INSTANTIATION (decl)) defer_p = 1; + /* Defer vague-linkage variables. */ + if (DECL_INLINE_VAR_P (decl)) +defer_p = 1; + /* If we're not deferring, go ahead and assemble the variable. 
*/ if (!defer_p) rest_of_decl_compilation (decl, toplev, at_eof); diff --git a/gcc/cp/decl2.cc b/gcc/cp/decl2.cc index f569d4045ec..1dddbaab38b 100644 --- a/gcc/cp/decl2.cc +++ b/gcc/cp/decl2.cc @@ -3360,7 +3360,9 @@ import_export_decl (tree decl) * implicit instantiations of function templates - * inline function + * inline functions + + * inline variables * implicit instantiations of static data members of class templates @@ -3383,6 +3385,7 @@ import_export_decl (tree decl) || DECL_DECLARED_INLINE_P (decl)); else gcc_assert (DECL_IMPLICIT_INSTANTIATION (decl) + || DECL_INLINE_VAR_P (decl) || DECL_VTABLE_OR_VTT_P (decl) || DECL_TINFO_P (decl)); /* Check that a definition of DECL is available in this translation @@ -3511,7 +3514,7 @@ import_export_decl (tree decl) this entity as undefined in this translation unit. */ import_p = true; } - else if (DECL_FUNCTION_MEMBER_P (decl)) + else if (TREE_CODE (decl) == FUNCTION_DECL && DECL_FUNCTION_MEMBER_P (decl)) { if (!DECL_DECLARED_INLINE_P (decl)) { diff --git a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C index 85f74a91521..7ec20afc065 100644 --- a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C +++ b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C @@ -8,6 +8,8 @@ // { dg-final { scan-assembler-times " DW_AT_\[^\n\r]*linkage_name" 7 } } inline int a; +int& ar =
Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model
> [Public] > > Hi, > > >>I assume the znver5 costs are smae as znver4 so far? > > Costing table updated for below entries. > + {COSTS_N_INSNS (10), /* cost of a divide/mod for QI. */ > + COSTS_N_INSNS (11), /* HI. */ > + COSTS_N_INSNS (16), /* DI. */ > + COSTS_N_INSNS (16)},/* > other. */ > + COSTS_N_INSNS (10), /* cost of DIVSS instruction. > */ > + COSTS_N_INSNS (14), /* cost of SQRTSS > instruction. */ > + COSTS_N_INSNS (20), /* cost of SQRTSD > instruction. */ I see, that looks good. > > > >> we can just change znver4.md to also work for znver5? > We will combine znver4 and znver5 scheduler descriptions into one Thanks! Honza > > Thanks and Regards > Karthiban > > -Original Message- > From: Jan Hubicka > Sent: Monday, February 12, 2024 9:30 PM > To: Anbazhagan, Karthiban > Cc: gcc-patches@gcc.gnu.org; Kumar, Venkataramanan > ; Joshi, Tejas Sanjay > ; Nagarajan, Muthu kumar raj > ; Gopalasubramanian, Ganesh > > Subject: Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 > CPU with znver5 scheduler Model > > Caution: This message originated from an External Source. Use proper caution > when opening attachments, clicking links, or responding. > > > Hi, > > gcc/ChangeLog: > > * common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver5. > > * common/config/i386/i386-common.cc (processor_names): Add znver5. > > (processor_alias_table): Likewise. > > * common/config/i386/i386-cpuinfo.h (processor_types): Add new zen > > family. > > (processor_subtypes): Add znver5. > > * config.gcc (x86_64-*-* |...): Likewise. > > * config/i386/driver-i386.cc (host_detect_local_cpu): Let > > march=native detect znver5 cpu's. > > * config/i386/i386-c.cc (ix86_target_macros_internal): Add znver5. > > * config/i386/i386-options.cc (m_ZNVER5): New definition > > (processor_cost_table): Add znver5. > > * config/i386/i386.cc (ix86_reassociation_width): Likewise. > > * config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER5 > > (PTA_ZNVER5): New definition. 
> > * config/i386/i386.md (define_attr "cpu"): Add znver5. > > (Scheduling descriptions) Add znver5.md. > > * config/i386/x86-tune-costs.h (znver5_cost): New definition. > > * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver5. > > (ix86_adjust_cost): Likewise. > > * config/i386/x86-tune.def (avx512_move_by_pieces): Add m_ZNVER5. > > (avx512_store_by_pieces): Add m_ZNVER5. > > * doc/extend.texi: Add znver5. > > * doc/invoke.texi: Likewise. > > * config/i386/znver5.md: New. > > > > gcc/testsuite/ChangeLog: > > * g++.target/i386/mv29.C: Handle znver5 arch. > > * gcc.target/i386/funcspec-56.inc:Likewise. > > +/* This table currently replicates znver4_cost table. */ struct > > +processor_costs znver5_cost = { > > I assume the znver5 costs are smae as znver4 so far? > > > +;; AMD znver5 Scheduling > > +;; Modeling automatons for zen decoders, integer execution pipes, ;; > > +AGU pipes, branch, floating point execution and fp store units. > > +(define_automaton "znver5, znver5_ieu, znver5_idiv, znver5_fdiv, > > +znver5_agu, znver5_fpu, znver5_fp_store") > > + > > +;; Decoders unit has 4 decoders and all of them can decode fast path > > +;; and vector type instructions. > > +(define_cpu_unit "znver5-decode0" "znver5") (define_cpu_unit > > +"znver5-decode1" "znver5") (define_cpu_unit "znver5-decode2" > > +"znver5") (define_cpu_unit "znver5-decode3" "znver5") > > Duplicating znver4 description to znver5 before scheduler description is > tuned is basically just leads to increasing compiler binary size (scheduler > models are quite large). > > Depending on changes between generations, I think we should try to share CPU > unit DFAs where it makes sense (i.e. shared DFA is smaller than two DFAs). > So perhaps unit scheduler is tuned, we can just change znver4.md to also work > for znver5? > > Honza
Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
On Wed, 14 Feb 2024, Andrew Stubbs wrote: > On 13/02/2024 08:26, Richard Biener wrote: > > On Mon, 12 Feb 2024, Thomas Schwinge wrote: > > > >> Hi! > >> > >> On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote: > >>> I've committed this patch > >> > >> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691 > >> "amdgcn: add -march=gfx1030 EXPERIMENTAL". > >> > >> The RDNA2 ISA variant doesn't support certain instructions previous > >> implemented in GCC/GCN, so a number of patterns etc. had to be disabled: > >> > >>> [...] Vector > >>> reductions will need to be reworked for RDNA2. [...] > >> > >>> * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2. > >>> (addc3): Add RDNA2 syntax variant. > >>> (subc3): Likewise. > >>> (2_exec): Add RDNA2 alternatives. > >>> (vec_cmpdi): Likewise. > >>> (vec_cmpdi): Likewise. > >>> (vec_cmpdi_exec): Likewise. > >>> (vec_cmpdi_exec): Likewise. > >>> (vec_cmpdi_dup): Likewise. > >>> (vec_cmpdi_dup_exec): Likewise. > >>> (reduc__scal_): Disable for RDNA2. > >>> (*_dpp_shr_): Likewise. > >>> (*plus_carry_dpp_shr_): Likewise. > >>> (*plus_carry_in_dpp_shr_): Likewise. > >> > >> Etc. The expectation being that GCC middle end copes with this, and > >> synthesizes some less ideal yet still functional vector code, I presume. > >> > >> The later RDNA3/gfx1100 support builds on top of this, and that's what > >> I'm currently working on getting proper GCC/GCN target (not offloading) > >> results for. > >> > >> I'm seeing a good number of execution test FAILs (regressions compared to > >> my earlier non-gfx1100 testing), and I've now tracked down where one > >> large class of those comes into existance -- not yet how to resolve, > >> unfortunately. But maybe, with you guys' combined vectorizer and back > >> end experience, the latter will be done quickly? > >> > >> Richard, I don't know if you've ever run actual GCC/GCN target (not > >> offloading) testing; let me know if you have any questions about that. 
> > > > I've only done offload testing - in the x86_64 build tree run > > check-target-libgomp. If you can tell me how to do GCN target testing > > (maybe document it on the wiki even!) I can try do that as well. > > > >> Given that (at least largely?) the same patterns etc. are disabled as in > >> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the > >> same issues. You can build GCC/GCN target like you build the offloading > >> one, just remove '--enable-as-accelerator-for=[...]'. Likely, you can > >> even use a offloading GCC/GCN build to reproduce the issue below. > >> > >> One example is the attached 'builtin-bitops-1.c', reduced from > >> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is > >> miscompiled as soon as '-ftree-vectorize' is effective: > >> > >> $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c > >> -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ > >> -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all > >> -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 > >> -ftree-vectorize > >> > >> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for > >> '-march=gfx90a' vs. '-march=gfx1100', we see: > >> > >> +builtin-bitops-1.c:7:17: missed: reduc op not supported by target. > >> > >> ..., and therefore: > >> > >> -builtin-bitops-1.c:7:17: note: Reduce using direct vector reduction. > >> +builtin-bitops-1.c:7:17: note: Reduce using vector shifts > >> +builtin-bitops-1.c:7:17: note: extract scalar result > >> > >> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a > >> chain of summation of 'VEC_PERM_EXPR's. 
However, there's wrong code > >> generated: > >> > >> $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out > >> i=1, ints[i]=0x1 a=1, b=2 > >> i=2, ints[i]=0x8000 a=1, b=2 > >> i=3, ints[i]=0x2 a=1, b=2 > >> i=4, ints[i]=0x4000 a=1, b=2 > >> i=5, ints[i]=0x1 a=1, b=2 > >> i=6, ints[i]=0x8000 a=1, b=2 > >> i=7, ints[i]=0xa5a5a5a5 a=16, b=32 > >> i=8, ints[i]=0x5a5a5a5a a=16, b=32 > >> i=9, ints[i]=0xcafe a=11, b=22 > >> i=10, ints[i]=0xcafe00 a=11, b=22 > >> i=11, ints[i]=0xcafe a=11, b=22 > >> i=12, ints[i]=0x a=32, b=64 > >> > >> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?) > >> > >> I don't speak enough "vectorization" to fully understand the generic > >> vectorized algorithm and its implementation. It appears that the > >> "Reduce using vector shifts" code has been around for a very long time, > >> but also has gone through a number of changes. I can't tell which GCC > >> targets/configurations it's actually used for (in the same way as for > >> GCN gfx1100), and thus whether there's an issue in that vectorizer code, > >> or rather in the GCN back end, or GCN back end parameterizing the generic > >> code? > > > > The "shift" reduction is basically doing reduction by repeatedly > > ad
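The halving scheme Richard starts to describe here can be modeled in scalar C++ (an illustrative 64-lane "vector" as a plain array; this is a sketch of the idea, not the vectorizer's actual output):

```cpp
#include <cassert>

#define LANES 64

// "Reduce using vector shifts": repeatedly add the upper half of the
// vector to the lower half, halving the active width each step, so a
// 64-lane sum takes log2(64) = 6 vector adds; the scalar result ends
// up in lane 0.
static int
reduce_by_shifts (int v[LANES])
{
  for (int half = LANES / 2; half >= 1; half /= 2)
    for (int i = 0; i < half; i++)
      v[i] += v[i + half];
  return v[0];
}
```

If the permutes feeding these adds read lanes beyond the intended width, or inactive lanes hold stale copies of live values, lanes get accumulated twice — which would be consistent with the `b = 2 * a` results in the log above, though that is speculation, not a confirmed diagnosis.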
RE: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model
[Public] Hi, >>I assume the znver5 costs are smae as znver4 so far? Costing table updated for below entries. + {COSTS_N_INSNS (10), /* cost of a divide/mod for QI. */ + COSTS_N_INSNS (11), /* HI. */ + COSTS_N_INSNS (16), /* DI. */ + COSTS_N_INSNS (16)},/* other. */ + COSTS_N_INSNS (10), /* cost of DIVSS instruction. */ + COSTS_N_INSNS (14), /* cost of SQRTSS instruction. */ + COSTS_N_INSNS (20), /* cost of SQRTSD instruction. */ >> we can just change znver4.md to also work for znver5? We will combine znver4 and znver5 scheduler descriptions into one Thanks and Regards Karthiban -Original Message- From: Jan Hubicka Sent: Monday, February 12, 2024 9:30 PM To: Anbazhagan, Karthiban Cc: gcc-patches@gcc.gnu.org; Kumar, Venkataramanan ; Joshi, Tejas Sanjay ; Nagarajan, Muthu kumar raj ; Gopalasubramanian, Ganesh Subject: Re: [PATCH] [X86_64]: Enable support for next generation AMD Zen5 CPU with znver5 scheduler Model Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding. Hi, > gcc/ChangeLog: > * common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver5. > * common/config/i386/i386-common.cc (processor_names): Add znver5. > (processor_alias_table): Likewise. > * common/config/i386/i386-cpuinfo.h (processor_types): Add new zen > family. > (processor_subtypes): Add znver5. > * config.gcc (x86_64-*-* |...): Likewise. > * config/i386/driver-i386.cc (host_detect_local_cpu): Let > march=native detect znver5 cpu's. > * config/i386/i386-c.cc (ix86_target_macros_internal): Add znver5. > * config/i386/i386-options.cc (m_ZNVER5): New definition > (processor_cost_table): Add znver5. > * config/i386/i386.cc (ix86_reassociation_width): Likewise. > * config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER5 > (PTA_ZNVER5): New definition. > * config/i386/i386.md (define_attr "cpu"): Add znver5. > (Scheduling descriptions) Add znver5.md. > * config/i386/x86-tune-costs.h (znver5_cost): New definition. 
> * config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver5. > (ix86_adjust_cost): Likewise. > * config/i386/x86-tune.def (avx512_move_by_pieces): Add m_ZNVER5. > (avx512_store_by_pieces): Add m_ZNVER5. > * doc/extend.texi: Add znver5. > * doc/invoke.texi: Likewise. > * config/i386/znver5.md: New. > > gcc/testsuite/ChangeLog: > * g++.target/i386/mv29.C: Handle znver5 arch. > * gcc.target/i386/funcspec-56.inc:Likewise. > +/* This table currently replicates znver4_cost table. */ struct > +processor_costs znver5_cost = { I assume the znver5 costs are smae as znver4 so far? > +;; AMD znver5 Scheduling > +;; Modeling automatons for zen decoders, integer execution pipes, ;; > +AGU pipes, branch, floating point execution and fp store units. > +(define_automaton "znver5, znver5_ieu, znver5_idiv, znver5_fdiv, > +znver5_agu, znver5_fpu, znver5_fp_store") > + > +;; Decoders unit has 4 decoders and all of them can decode fast path > +;; and vector type instructions. > +(define_cpu_unit "znver5-decode0" "znver5") (define_cpu_unit > +"znver5-decode1" "znver5") (define_cpu_unit "znver5-decode2" > +"znver5") (define_cpu_unit "znver5-decode3" "znver5") Duplicating znver4 description to znver5 before scheduler description is tuned is basically just leads to increasing compiler binary size (scheduler models are quite large). Depending on changes between generations, I think we should try to share CPU unit DFAs where it makes sense (i.e. shared DFA is smaller than two DFAs). So perhaps unit scheduler is tuned, we can just change znver4.md to also work for znver5? Honza
Re: [patch, libgfortran] PR99210 X editing for reading file with encoding='utf-8'
> Regression tested on x86_64 and new test case. > OK for trunk? OK, and thanks! FX
[PATCH]middle-end: inspect all exits for additional annotations for loop.
Hi All, Attaching a pragma to a loop which has a complex condition often gets the pragma dropped. e.g. #pragma GCC novector while (i < N && parse_tables_n--) before lowering this is represented as: if (ANNOTATE_EXPR ) ... But after lowering the condition is broken apart and attached to the final component of the expression: if (parse_tables_n.2_2 != 0) goto ; else goto ; : iftmp.1D.4452 = 1; goto ; : iftmp.1D.4452 = 0; : D.4451 = .ANNOTATE (iftmp.1D.4452, 2, 0); if (D.4451 != 0) goto ; else goto ; : and it's never heard from again, because during replace_loop_annotate we only inspect the loop header and latch for annotations. Since annotations were supposed to apply to the loop as a whole, this fixes it by also checking the loop exit src blocks for annotations. Bootstrapped and regtested on aarch64-none-linux-gnu with no issues. Ok for master? Thanks, Tamar gcc/ChangeLog: * tree-cfg.cc (replace_loop_annotate): Inspect loop edges for annotations. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-novect_gcond.c: New test. 
--- inline copy of patch -- diff --git a/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c new file mode 100644 index ..01e69cbef9d51b234c08a400c78dc078d53252f1 --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-novect_gcond.c @@ -0,0 +1,39 @@ +/* { dg-add-options vect_early_break } */ +/* { dg-require-effective-target vect_early_break_hw } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-additional-options "-O3" } */ + +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */ + +#include "tree-vect.h" + +#define N 306 +#define NEEDLE 136 + +int table[N]; + +__attribute__ ((noipa)) +int foo (int i, unsigned short parse_tables_n) +{ + parse_tables_n >>= 9; + parse_tables_n += 11; +#pragma GCC novector + while (i < N && parse_tables_n--) +table[i++] = 0; + + return table[NEEDLE]; +} + +int main () +{ + check_vect (); + +#pragma GCC novector + for (int j = 0; j < N; j++) +table[j] = -1; + + if (foo (0, 0x) != 0) +__builtin_abort (); + + return 0; +} diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index cdd439fe7506e7bc33654ffa027b493f23d278ac..a29681bffb902d2d05e3f18764ab519aacb3c5bc 100644 --- a/gcc/tree-cfg.cc +++ b/gcc/tree-cfg.cc @@ -327,6 +327,10 @@ replace_loop_annotate (void) if (loop->latch) replace_loop_annotate_in_block (loop->latch, loop); + /* Then also check all other exits. */ + for (auto e : get_loop_exit_edges (loop)) + replace_loop_annotate_in_block (e->src, loop); + /* Push the global flag_finite_loops state down to individual loops. 
*/ loop->finite_p = flag_finite_loops; }
[PATCH] testsuite: Fix guality/ipa-sra-1.c to work with return IPA-VRP
Hi, the test guality/ipa-sra-1.c stopped working after r14-5628-g53ba8d669550d3 because the variable from which the values of removed parameters could be calculated is also removed with it. Fixed with this patch which stops a function from returning a constant. I have also noticed that the XFAILed test passes at -O0, -O1 and -Og on all (three) targets I have tried, not just aarch64, so I extended the xfail exception accordingly. Tested by running make -k check-gcc RUNTESTFLAGS="guality.exp=ipa-sra-1.c" on x86_64-linux, aarch64-linux and ppc64le-linux. I hope it is an obvious change for me to commit without approval, which I will do later today. Thanks, Martin gcc/testsuite/ChangeLog: 2024-02-14 Martin Jambor * gcc.dg/guality/ipa-sra-1.c (get_val1): Move up in the file. (get_val2): Likewise. (bar): Do not return a constant. Extend xfail exception for all targets. --- gcc/testsuite/gcc.dg/guality/ipa-sra-1.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c b/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c index 9ef4eac93a7..55267c6f838 100644 --- a/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c +++ b/gcc/testsuite/gcc.dg/guality/ipa-sra-1.c @@ -1,6 +1,10 @@ /* { dg-do run } */ /* { dg-options "-g -fno-ipa-icf" } */ +int __attribute__((noipa)) +get_val1 (void) {return 20;} +int __attribute__((noipa)) +get_val2 (void) {return 7;} void __attribute__((noipa)) use (int x) @@ -12,8 +16,8 @@ static int __attribute__((noinline)) bar (int i, int k) { asm ("" : "+r" (i)); - use (i); /* { dg-final { gdb-test . "k" "3" { xfail { ! { aarch64*-*-* && { any-opts "-O0" "-O1" "-Og" } } } } } } */ - return 6; + use (i); /* { dg-final { gdb-test . "k" "3" { xfail { ! 
{ *-*-*-* && { any-opts "-O0" "-O1" "-Og" } } } } } } */ + return 6 + get_val1(); } volatile int v; @@ -30,11 +34,6 @@ foo (int i, int k) volatile int v; -int __attribute__((noipa)) -get_val1 (void) {return 20;} -int __attribute__((noipa)) -get_val2 (void) {return 7;} - int main (void) { -- 2.43.0
Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"
On 13/02/2024 08:26, Richard Biener wrote: On Mon, 12 Feb 2024, Thomas Schwinge wrote: Hi! On 2023-10-20T12:51:03+0100, Andrew Stubbs wrote: I've committed this patch ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691 "amdgcn: add -march=gfx1030 EXPERIMENTAL". The RDNA2 ISA variant doesn't support certain instructions previous implemented in GCC/GCN, so a number of patterns etc. had to be disabled: [...] Vector reductions will need to be reworked for RDNA2. [...] * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2. (addc3): Add RDNA2 syntax variant. (subc3): Likewise. (2_exec): Add RDNA2 alternatives. (vec_cmpdi): Likewise. (vec_cmpdi): Likewise. (vec_cmpdi_exec): Likewise. (vec_cmpdi_exec): Likewise. (vec_cmpdi_dup): Likewise. (vec_cmpdi_dup_exec): Likewise. (reduc__scal_): Disable for RDNA2. (*_dpp_shr_): Likewise. (*plus_carry_dpp_shr_): Likewise. (*plus_carry_in_dpp_shr_): Likewise. Etc. The expectation being that GCC middle end copes with this, and synthesizes some less ideal yet still functional vector code, I presume. The later RDNA3/gfx1100 support builds on top of this, and that's what I'm currently working on getting proper GCC/GCN target (not offloading) results for. I'm seeing a good number of execution test FAILs (regressions compared to my earlier non-gfx1100 testing), and I've now tracked down where one large class of those comes into existance -- not yet how to resolve, unfortunately. But maybe, with you guys' combined vectorizer and back end experience, the latter will be done quickly? Richard, I don't know if you've ever run actual GCC/GCN target (not offloading) testing; let me know if you have any questions about that. I've only done offload testing - in the x86_64 build tree run check-target-libgomp. If you can tell me how to do GCN target testing (maybe document it on the wiki even!) I can try do that as well. Given that (at least largely?) the same patterns etc. 
are disabled as in my gfx1100 configuration, I suppose your gfx1030 one would exhibit the same issues. You can build GCC/GCN target like you build the offloading one, just remove '--enable-as-accelerator-for=[...]'. Likely, you can even use a offloading GCC/GCN build to reproduce the issue below. One example is the attached 'builtin-bitops-1.c', reduced from 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is miscompiled as soon as '-ftree-vectorize' is effective: $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for '-march=gfx90a' vs. '-march=gfx1100', we see: +builtin-bitops-1.c:7:17: missed: reduc op not supported by target. ..., and therefore: -builtin-bitops-1.c:7:17: note: Reduce using direct vector reduction. +builtin-bitops-1.c:7:17: note: Reduce using vector shifts +builtin-bitops-1.c:7:17: note: extract scalar result That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a chain of summation of 'VEC_PERM_EXPR's. However, there's wrong code generated: $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out i=1, ints[i]=0x1 a=1, b=2 i=2, ints[i]=0x8000 a=1, b=2 i=3, ints[i]=0x2 a=1, b=2 i=4, ints[i]=0x4000 a=1, b=2 i=5, ints[i]=0x1 a=1, b=2 i=6, ints[i]=0x8000 a=1, b=2 i=7, ints[i]=0xa5a5a5a5 a=16, b=32 i=8, ints[i]=0x5a5a5a5a a=16, b=32 i=9, ints[i]=0xcafe a=11, b=22 i=10, ints[i]=0xcafe00 a=11, b=22 i=11, ints[i]=0xcafe a=11, b=22 i=12, ints[i]=0x a=32, b=64 (I can't tell if the 'b = 2 * a' pattern is purely coincidental?) I don't speak enough "vectorization" to fully understand the generic vectorized algorithm and its implementation. It appears that the "Reduce using vector shifts" code has been around for a very long time, but also has gone through a number of changes. 
I can't tell which GCC targets/configurations it's actually used for (in the same way as for GCN gfx1100), and thus whether there's an issue in that vectorizer code, or rather in the GCN back end, or GCN back end parameterizing the generic code? The "shift" reduction is basically doing reduction by repeatedly adding the upper to the lower half of the vector (each time halving the vector size). Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code: int my_popcount (unsigned int x) { int stmp__12.12; vector(64) int vect__12.11; vector(64) unsigned int vect__1.8; vector(64) unsigned int _13; vector(64) unsigned int vect_cst__18; vector(64) int [all others]; [local count: 32534376]: vect_cst__18 =
Re: [PATCH] middle-end/113576 - avoid out-of-bound vector element access
On Wed, 14 Feb 2024, Richard Sandiford wrote: > Richard Biener writes: > > The following avoids accessing out-of-bound vector elements when > > native encoding a boolean vector with sub-BITS_PER_UNIT precision > > elements. The error was basing the number of elements to extract > > on the rounded up total byte size involved and the patch bases > > everything on the total number of elements to extract instead. > > It's too long ago to be certain, but I think this was a deliberate choice. > The point of the new vector constant encoding is that it can give an > allegedly sensible value for any given index, even out-of-range ones. > > Since the padding bits are undefined, we should in principle have a free > choice of what to use. And for VLA, it's often better to continue the > existing pattern rather than force to zero. > > I don't strongly object to changing it. I think we should be careful > about relying on zeroing for correctness though. The bits are in principle > undefined and we can't rely on reading zeros from equivalent memory or > register values. The main motivation for a change here is to allow catching out-of-bound indices again for VECTOR_CST_ELT, at least for constant nunits because it might be a programming error like fat-fingering the index. I do think it's a regression that we no longer catch those. It's probably also a bit non-obvious how an encoding continues and there might be DImode masks that can be represented by a zero-extended QImode immediate but "continued" it would require a larger immediate. The change also effectively only changes something for 1 byte encodings since nunits is a power of two and so is the element size in bits. 
A patch restoring the VECTOR_CST_ELT checking might be the following diff --git a/gcc/tree.cc b/gcc/tree.cc index 046a558d1b0..4c9b05167fd 100644 --- a/gcc/tree.cc +++ b/gcc/tree.cc @@ -10325,6 +10325,9 @@ vector_cst_elt (const_tree t, unsigned int i) if (i < encoded_nelts) return VECTOR_CST_ENCODED_ELT (t, i); + /* Catch out-of-bound element accesses. */ + gcc_checking_assert (maybe_gt (VECTOR_CST_NELTS (t), i)); + /* If there are no steps, the final encoded value is the right one. */ if (!VECTOR_CST_STEPPED_P (t)) { but it triggers quite a bit via const_binop for, for example #2 0x011c1506 in const_binop (code=PLUS_EXPR, arg1=, arg2=) (gdb) p debug_generic_expr (arg1) { 12, 13, 14, 15 } $5 = void (gdb) p debug_generic_expr (arg2) { -2, -2, -2, -3 } (gdb) p count $4 = 6 (gdb) l 1711 if (!elts.new_binary_operation (type, arg1, arg2, step_ok_p)) 1712return NULL_TREE; 1713 unsigned int count = elts.encoded_nelts (); 1714 for (unsigned int i = 0; i < count; ++i) 1715{ 1716 tree elem1 = VECTOR_CST_ELT (arg1, i); 1717 tree elem2 = VECTOR_CST_ELT (arg2, i); 1718 1719 tree elt = const_binop (code, elem1, elem2); this seems like an error to me - why would we, for fixed-size vectors and for PLUS ever create a vector encoding with 6 elements?! That seems at least inefficient to me? Richard. > Thanks, > Richard > > > > As a side-effect this now consistently results in zeros in the > > padding of the last encoded byte which also avoids the failure > > mode seen in PR113576. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > OK? > > > > Thanks, > > Richard. > > > > PR middle-end/113576 > > * fold-const.cc (native_encode_vector_part): Avoid accessing > > out-of-bound elements. 
> > --- > > gcc/fold-const.cc | 8 > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > > index 80e211e18c0..8638757312b 100644 > > --- a/gcc/fold-const.cc > > +++ b/gcc/fold-const.cc > > @@ -8057,13 +8057,13 @@ native_encode_vector_part (const_tree expr, > > unsigned char *ptr, int len, > > off = 0; > > > >/* Zero the buffer and then set bits later where necessary. */ > > - int extract_bytes = MIN (len, total_bytes - off); > > + unsigned elts_per_byte = BITS_PER_UNIT / elt_bits; > > + unsigned first_elt = off * elts_per_byte; > > + unsigned extract_elts = MIN (len * elts_per_byte, count - first_elt); > > + unsigned extract_bytes = CEIL (elt_bits * extract_elts, > > BITS_PER_UNIT); > >if (ptr) > > memset (ptr, 0, extract_bytes); > > > > - unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits; > > - unsigned int first_elt = off * elts_per_byte; > > - unsigned int extract_elts = extract_bytes * elts_per_byte; > >for (unsigned int i = 0; i < extract_elts; ++i) > > { > > tree elt = VECTOR_CST_ELT (expr, first_elt + i); > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
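For context on the encoding being debated, here is a toy model of how a VECTOR_CST element is extrapolated past the explicitly encoded ones (simplified from the scheme documented in GCC's tree.h; the function name and signature are invented for illustration, not the real code). The out-of-bound concern above is exactly that this extrapolation happily "answers" for any index, even past VECTOR_CST_NELTS:

```cpp
#include <cassert>

// Toy model of the VECTOR_CST encoding: NPATTERNS interleaved
// patterns, each represented by up to NELTS_PER_PATTERN (<= 3)
// leading elements.  Elements past the encoded ones repeat the last
// value of their pattern (no steps) or continue its arithmetic
// series (stepped).
static long
vector_cst_elt_model (const long *encoded, unsigned npatterns,
                      unsigned nelts_per_pattern, unsigned i)
{
  unsigned encoded_nelts = npatterns * nelts_per_pattern;
  if (i < encoded_nelts)
    return encoded[i];

  unsigned pattern = i % npatterns;
  unsigned last = (nelts_per_pattern - 1) * npatterns + pattern;
  long final_elt = encoded[last];
  if (nelts_per_pattern < 3)
    return final_elt;               // duplicated, not stepped

  long prev = encoded[last - npatterns];
  long step = final_elt - prev;     // per-pattern stride
  unsigned steps = (i - last) / npatterns;
  return final_elt + (long) steps * step;
}
```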
Re: [PATCH] middle-end/113576 - zero padding of vector bools when expanding compares
On Wed, 14 Feb 2024, Richard Sandiford wrote: > Richard Biener writes: > > The following zeros paddings of vector bools when expanding compares > > and the mode used for the compare is an integer mode. In that case > > targets cannot distinguish between a 4 element and 8 element vector > > compare (both get to the QImode compare optab) so we have to do the > > job in the middle-end. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > > > OK? > > > > Thanks, > > Richard. > > > > PR middle-end/113576 > > * expr.cc (do_store_flag): For vector bool compares of vectors > > with padding zero that. > > * dojump.cc (do_compare_and_jump): Likewise. > > --- > > gcc/dojump.cc | 16 > > gcc/expr.cc | 17 + > > 2 files changed, 33 insertions(+) > > > > diff --git a/gcc/dojump.cc b/gcc/dojump.cc > > index e2d2b3cb111..ec2a365e488 100644 > > --- a/gcc/dojump.cc > > +++ b/gcc/dojump.cc > > @@ -1266,6 +1266,7 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum > > rtx_code signed_code, > >machine_mode mode; > >int unsignedp; > >enum rtx_code code; > > + unsigned HOST_WIDE_INT nunits; > > > >/* Don't crash if the comparison was erroneous. */ > >op0 = expand_normal (treeop0); > > @@ -1308,6 +1309,21 @@ do_compare_and_jump (tree treeop0, tree treeop1, > > enum rtx_code signed_code, > >emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, > > op1)); > >op1 = new_op1; > > } > > + /* For boolean vectors with less than mode precision precision > > Too many precisions. Fixed. > LGTM otherwise, but could we put this in a shared helper, rather than > duplicating the code? I'd be surprised if these are the only places > we need to do something. Let's think of this when we get to more places. I guess you are thinking of the if condition here, right? Pushed with the comment fix for now. Thanks, Richard. > Thanks, and sorry for the slow response (here and elsewhere). > > Richard > > > + make sure to fill padding with consistent values. 
*/ > > + else if (VECTOR_BOOLEAN_TYPE_P (type) > > + && SCALAR_INT_MODE_P (mode) > > + && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits) > > + && maybe_ne (GET_MODE_PRECISION (mode), nunits)) > > +{ > > + gcc_assert (code == EQ || code == NE); > > + op0 = expand_binop (mode, and_optab, op0, > > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > > + true, OPTAB_WIDEN); > > + op1 = expand_binop (mode, and_optab, op1, > > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > > + true, OPTAB_WIDEN); > > +} > > > >do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode, > >((mode == BLKmode) > > diff --git a/gcc/expr.cc b/gcc/expr.cc > > index fc5e998e329..096081fdc53 100644 > > --- a/gcc/expr.cc > > +++ b/gcc/expr.cc > > @@ -13502,6 +13502,7 @@ do_store_flag (sepops ops, rtx target, machine_mode > > mode) > >rtx op0, op1; > >rtx subtarget = target; > >location_t loc = ops->location; > > + unsigned HOST_WIDE_INT nunits; > > > >arg0 = ops->op0; > >arg1 = ops->op1; > > @@ -13694,6 +13695,22 @@ do_store_flag (sepops ops, rtx target, > > machine_mode mode) > > > >expand_operands (arg0, arg1, subtarget, &op0, &op1, EXPAND_NORMAL); > > > > + /* For boolean vectors with less than mode precision precision > > + make sure to fill padding with consistent values. 
*/ > > + if (VECTOR_BOOLEAN_TYPE_P (type) > > + && SCALAR_INT_MODE_P (operand_mode) > > + && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits) > > + && maybe_ne (GET_MODE_PRECISION (operand_mode), nunits)) > > +{ > > + gcc_assert (code == EQ || code == NE); > > + op0 = expand_binop (mode, and_optab, op0, > > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > > + true, OPTAB_WIDEN); > > + op1 = expand_binop (mode, and_optab, op1, > > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > > + true, OPTAB_WIDEN); > > +} > > + > >if (target == 0) > > target = gen_reg_rtx (mode); > -- Richard Biener SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
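Why the masking matters can be seen with a scalar model: a 4-lane boolean vector carried in an 8-bit integer mode occupies only bits 0..3, and bits 4..7 are padding with undefined contents, so comparing the raw bytes can report "not equal" for masks whose lanes all agree. A minimal sketch mirroring the `(1 << nunits) - 1` masking in the patch (the helper itself is illustrative, not GCC code):

```cpp
#include <cassert>

// EQ/NE on sub-mode boolean vectors must mask the padding on both
// operands first; otherwise equal masks with different junk bits
// in the padding miscompare.
static bool
mask_eq (unsigned char op0, unsigned char op1, unsigned nunits)
{
  unsigned char pad_mask = (1u << nunits) - 1;
  return (op0 & pad_mask) == (op1 & pad_mask);
}
```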
[PATCH] tree-optimization/113910 - huge compile time during PTA
For the testcase in PR113910 we spend a lot of time in PTA comparing bitmaps for looking up equivalence class members. This points to the very weak bitmap_hash function which effectively hashes set and a subset of not set bits. The following improves it by mixing that weak result with the population count of the bitmap, reducing the number of collisions significantly. It's still by no means a good hash function. One major problem with it was that it simply truncated the BITMAP_WORD sized intermediate hash to hashval_t which is unsigned int, effectively not hashing half of the bits. That solves most of the slowness. Mixing in the population count improves compile-time by another 30% though. This reduces the compile-time for the testcase from tens of minutes to 30 seconds and PTA time from 99% to 25%. bitmap_equal_p is gone from the profile. Bootstrap and regtest running on x86_64-unknown-linux-gnu, will push to trunk and branches. PR tree-optimization/113910 * bitmap.cc (bitmap_hash): Mix the full element "hash" with the bitmap population count. --- gcc/bitmap.cc | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/gcc/bitmap.cc b/gcc/bitmap.cc index 6cf326bca5a..33aa0beb2b0 100644 --- a/gcc/bitmap.cc +++ b/gcc/bitmap.cc @@ -2696,6 +2696,7 @@ bitmap_hash (const_bitmap head) { const bitmap_element *ptr; BITMAP_WORD hash = 0; + unsigned long count = 0; int ix; gcc_checking_assert (!head->tree_form); @@ -2704,9 +2705,12 @@ bitmap_hash (const_bitmap head) { hash ^= ptr->indx; for (ix = 0; ix != BITMAP_ELEMENT_WORDS; ix++) - hash ^= ptr->bits[ix]; + { + hash ^= ptr->bits[ix]; + count += bitmap_count_bits_in_word (&ptr->bits[ix]); + } } - return (hashval_t)hash; + return iterative_hash (&hash, sizeof (hash), count); } -- 2.35.3
[PATCH 2/2] libstdc++: Optimize std::add_pointer compilation performance
This patch optimizes the compilation performance of std::add_pointer by dispatching to the new __add_pointer built-in trait.

libstdc++-v3/ChangeLog:

	* include/std/type_traits (add_pointer): Use __add_pointer
	built-in trait.

Signed-off-by: Ken Matsui
---
 libstdc++-v3/include/std/type_traits | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/libstdc++-v3/include/std/type_traits b/libstdc++-v3/include/std/type_traits
index 21402fd8c13..3bde7cb8ba3 100644
--- a/libstdc++-v3/include/std/type_traits
+++ b/libstdc++-v3/include/std/type_traits
@@ -2121,6 +2121,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     { };
 #endif

+  /// add_pointer
+#if _GLIBCXX_USE_BUILTIN_TRAIT(__add_pointer)
+  template<typename _Tp>
+    struct add_pointer
+    { using type = __add_pointer(_Tp); };
+#else
   template<typename _Tp, typename = void>
     struct __add_pointer_helper
     { using type = _Tp; };
@@ -2129,7 +2135,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     struct __add_pointer_helper<_Tp, __void_t<_Tp*>>
     { using type = _Tp*; };

-  /// add_pointer
   template<typename _Tp>
     struct add_pointer
     : public __add_pointer_helper<_Tp>
@@ -2142,6 +2147,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   template<typename _Tp>
     struct add_pointer<_Tp&&>
     { using type = _Tp*; };
+#endif

 #if __cplusplus > 201103L
   /// Alias template for remove_pointer
-- 
2.43.0
[PATCH 1/2] c++: Implement __add_pointer built-in trait
This patch implements built-in trait for std::add_pointer. gcc/cp/ChangeLog: * cp-trait.def: Define __add_pointer. * semantics.cc (finish_trait_type): Handle CPTK_ADD_POINTER. gcc/testsuite/ChangeLog: * g++.dg/ext/has-builtin-1.C: Test existence of __add_pointer. * g++.dg/ext/add_pointer.C: New test. Signed-off-by: Ken Matsui --- gcc/cp/cp-trait.def | 1 + gcc/cp/semantics.cc | 9 ++ gcc/testsuite/g++.dg/ext/add_pointer.C | 37 gcc/testsuite/g++.dg/ext/has-builtin-1.C | 3 ++ 4 files changed, 50 insertions(+) create mode 100644 gcc/testsuite/g++.dg/ext/add_pointer.C diff --git a/gcc/cp/cp-trait.def b/gcc/cp/cp-trait.def index 394f006f20f..cec385ee501 100644 --- a/gcc/cp/cp-trait.def +++ b/gcc/cp/cp-trait.def @@ -48,6 +48,7 @@ #define DEFTRAIT_TYPE_DEFAULTED #endif +DEFTRAIT_TYPE (ADD_POINTER, "__add_pointer", 1) DEFTRAIT_EXPR (HAS_NOTHROW_ASSIGN, "__has_nothrow_assign", 1) DEFTRAIT_EXPR (HAS_NOTHROW_CONSTRUCTOR, "__has_nothrow_constructor", 1) DEFTRAIT_EXPR (HAS_NOTHROW_COPY, "__has_nothrow_copy", 1) diff --git a/gcc/cp/semantics.cc b/gcc/cp/semantics.cc index 57840176863..e23693ab57f 100644 --- a/gcc/cp/semantics.cc +++ b/gcc/cp/semantics.cc @@ -12760,6 +12760,15 @@ finish_trait_type (cp_trait_kind kind, tree type1, tree type2, switch (kind) { +case CPTK_ADD_POINTER: + if (TREE_CODE (type1) == FUNCTION_TYPE + && ((TYPE_QUALS (type1) & (TYPE_QUAL_CONST | TYPE_QUAL_VOLATILE)) + || FUNCTION_REF_QUALIFIED (type1))) + return type1; + if (TYPE_REF_P (type1)) + type1 = TREE_TYPE (type1); + return build_pointer_type (type1); + case CPTK_REMOVE_CV: return cv_unqualified (type1); diff --git a/gcc/testsuite/g++.dg/ext/add_pointer.C b/gcc/testsuite/g++.dg/ext/add_pointer.C new file mode 100644 index 000..3091510f3b5 --- /dev/null +++ b/gcc/testsuite/g++.dg/ext/add_pointer.C @@ -0,0 +1,37 @@ +// { dg-do compile { target c++11 } } + +#define SA(X) static_assert((X),#X) + +class ClassType { }; + +SA(__is_same(__add_pointer(int), int*)); +SA(__is_same(__add_pointer(int*), int**)); 
+SA(__is_same(__add_pointer(const int), const int*)); +SA(__is_same(__add_pointer(int&), int*)); +SA(__is_same(__add_pointer(ClassType*), ClassType**)); +SA(__is_same(__add_pointer(ClassType), ClassType*)); +SA(__is_same(__add_pointer(void), void*)); +SA(__is_same(__add_pointer(const void), const void*)); +SA(__is_same(__add_pointer(volatile void), volatile void*)); +SA(__is_same(__add_pointer(const volatile void), const volatile void*)); + +void f1(); +using f1_type = decltype(f1); +using pf1_type = decltype(&f1); +SA(__is_same(__add_pointer(f1_type), pf1_type)); + +void f2() noexcept; // PR libstdc++/78361 +using f2_type = decltype(f2); +using pf2_type = decltype(&f2); +SA(__is_same(__add_pointer(f2_type), pf2_type)); + +using fn_type = void(); +using pfn_type = void(*)(); +SA(__is_same(__add_pointer(fn_type), pfn_type)); + +SA(__is_same(__add_pointer(void() &), void() &)); +SA(__is_same(__add_pointer(void() & noexcept), void() & noexcept)); +SA(__is_same(__add_pointer(void() const), void() const)); +SA(__is_same(__add_pointer(void(...) &), void(...) &)); +SA(__is_same(__add_pointer(void(...) & noexcept), void(...) & noexcept)); +SA(__is_same(__add_pointer(void(...) const), void(...) const)); diff --git a/gcc/testsuite/g++.dg/ext/has-builtin-1.C b/gcc/testsuite/g++.dg/ext/has-builtin-1.C index 02b4b4d745d..56e8db7ac32 100644 --- a/gcc/testsuite/g++.dg/ext/has-builtin-1.C +++ b/gcc/testsuite/g++.dg/ext/has-builtin-1.C @@ -2,6 +2,9 @@ // { dg-do compile } // Verify that __has_builtin gives the correct answer for C++ built-ins. +#if !__has_builtin (__add_pointer) +# error "__has_builtin (__add_pointer) failed" +#endif #if !__has_builtin (__builtin_addressof) # error "__has_builtin (__builtin_addressof) failed" #endif -- 2.43.0
[PATCH][GCC 12] tree-optimization/113896 - reduction of permuted external vector
The following fixes eliding of the permutation of a BB reduction of an existing vector which breaks materialization of live lanes as we fail to permute the SLP_TREE_SCALAR_STMTS vector.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

	PR tree-optimization/113896
	* tree-vect-slp.cc (vect_optimize_slp): Permute SLP_TREE_SCALAR_STMTS
	when eliding a permutation in a VEC_PERM node we need to preserve
	because it wraps an extern vector.
	* g++.dg/torture/pr113896.C: New testcase.
---
 gcc/testsuite/g++.dg/torture/pr113896.C | 35 +
 gcc/tree-vect-slp.cc                    |  9 +++
 2 files changed, 44 insertions(+)
 create mode 100644 gcc/testsuite/g++.dg/torture/pr113896.C

diff --git a/gcc/testsuite/g++.dg/torture/pr113896.C b/gcc/testsuite/g++.dg/torture/pr113896.C
new file mode 100644
index 000..534c1c2e1cc
--- /dev/null
+++ b/gcc/testsuite/g++.dg/torture/pr113896.C
@@ -0,0 +1,35 @@
+// { dg-do run }
+// { dg-additional-options "-ffast-math" }
+
+double a1 = 1.0;
+double a2 = 1.0;
+
+void __attribute__((noipa))
+f(double K[2], bool b)
+{
+    double A[] = {
+        b ? a1 : a2,
+        0,
+        0,
+        0
+    };
+
+    double sum{};
+    for(double a : A) sum += a;
+    for(double& a : A) a /= sum;
+
+    if (b) {
+        K[0] = A[0]; // 1.0
+        K[1] = A[1]; // 0.0
+    } else {
+        K[0] = A[0] + A[1];
+    }
+}
+
+int main()
+{
+  double K[2]{};
+  f(K, true);
+  if (K[0] != 1. || K[1] != 0.)
+    __builtin_abort ();
+}

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index af477c31aa3..b3e3d9e7009 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -4058,6 +4058,15 @@ vect_optimize_slp (vec_info *vinfo)
 	{
 	  /* Preserve the special VEC_PERM we use to shield existing
 	     vector defs from the rest.  But make it a no-op.
	     */
+	  auto_vec<stmt_vec_info> saved;
+	  saved.create (SLP_TREE_SCALAR_STMTS (old).length ());
+	  for (unsigned i = 0;
+	       i < SLP_TREE_SCALAR_STMTS (old).length (); ++i)
+	    saved.quick_push (SLP_TREE_SCALAR_STMTS (old)[i]);
+	  for (unsigned i = 0;
+	       i < SLP_TREE_SCALAR_STMTS (old).length (); ++i)
+	    SLP_TREE_SCALAR_STMTS (old)[i]
+	      = saved[SLP_TREE_LANE_PERMUTATION (old)[i].second];
 	  unsigned i = 0;
 	  for (std::pair<unsigned, unsigned> &p : SLP_TREE_LANE_PERMUTATION (old))
-- 
2.35.3
Re: [PATCH] middle-end/113576 - avoid out-of-bound vector element access
Richard Biener writes: > The following avoids accessing out-of-bound vector elements when > native encoding a boolean vector with sub-BITS_PER_UNIT precision > elements. The error was basing the number of elements to extract > on the rounded up total byte size involved and the patch bases > everything on the total number of elements to extract instead. It's too long ago to be certain, but I think this was a deliberate choice. The point of the new vector constant encoding is that it can give an allegedly sensible value for any given index, even out-of-range ones. Since the padding bits are undefined, we should in principle have a free choice of what to use. And for VLA, it's often better to continue the existing pattern rather than force to zero. I don't strongly object to changing it. I think we should be careful about relying on zeroing for correctness though. The bits are in principle undefined and we can't rely on reading zeros from equivalent memory or register values. Thanks, Richard > > As a side-effect this now consistently results in zeros in the > padding of the last encoded byte which also avoids the failure > mode seen in PR113576. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > OK? > > Thanks, > Richard. > > PR middle-end/113576 > * fold-const.cc (native_encode_vector_part): Avoid accessing > out-of-bound elements. > --- > gcc/fold-const.cc | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc > index 80e211e18c0..8638757312b 100644 > --- a/gcc/fold-const.cc > +++ b/gcc/fold-const.cc > @@ -8057,13 +8057,13 @@ native_encode_vector_part (const_tree expr, unsigned > char *ptr, int len, > off = 0; > >/* Zero the buffer and then set bits later where necessary. 
*/ > - int extract_bytes = MIN (len, total_bytes - off); > + unsigned elts_per_byte = BITS_PER_UNIT / elt_bits; > + unsigned first_elt = off * elts_per_byte; > + unsigned extract_elts = MIN (len * elts_per_byte, count - first_elt); > + unsigned extract_bytes = CEIL (elt_bits * extract_elts, BITS_PER_UNIT); >if (ptr) > memset (ptr, 0, extract_bytes); > > - unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits; > - unsigned int first_elt = off * elts_per_byte; > - unsigned int extract_elts = extract_bytes * elts_per_byte; >for (unsigned int i = 0; i < extract_elts; ++i) > { > tree elt = VECTOR_CST_ELT (expr, first_elt + i);
Re: [PATCH][GCC 12] aarch64: Avoid out-of-range shrink-wrapped saves [PR111677]
Alex Coplan writes: > This is a backport of the GCC 13 fix for PR111677 to the GCC 12 branch. > The only part of the patch that isn't a straight cherry-pick is due to > the TX iterator lacking TDmode for GCC 12, so this version adjusts > TX_V16QI accordingly. > > Bootstrapped/regtested on aarch64-linux-gnu, the only changes in the > testsuite I saw were in > gcc/testsuite/c-c++-common/hwasan/large-aligned-1.c where the dg-output > "READ of size 4 [...]" check appears to be flaky on the GCC 12 branch > since libhwasan gained the short granule tag feature, I've requested a > backport of the following patch (committed as > r13-100-g3771486daa1e904ceae6f3e135b28e58af33849f) which should fix that > (independent) issue for GCC 12: > https://gcc.gnu.org/pipermail/gcc-patches/2024-February/645278.html > > OK for the GCC 12 branch? OK, thanks. Richard > Thanks, > Alex > > -- >8 -- > > The PR shows us ICEing due to an unrecognizable TFmode save emitted by > aarch64_process_components. The problem is that for T{I,F,D}mode we > conservatively require mems to be in range for x-register ldp/stp. That > is because (at least for TImode) it can be allocated to both GPRs and > FPRs, and in the GPR case that is an x-reg ldp/stp, and the FPR case is > a q-register load/store. > > As Richard pointed out in the PR, aarch64_get_separate_components > already checks that the offsets are suitable for a single load, so we > just need to choose a mode in aarch64_reg_save_mode that gives the full > q-register range. In this patch, we choose V16QImode as an alternative > 16-byte "bag-of-bits" mode that doesn't have the artificial range > restrictions imposed on T{I,F,D}mode. > > Unlike for GCC 14 we need additional handling in the load/store pair > code as various cases are not expecting to see V16QImode (particularly > the writeback patterns, but also aarch64_gen_load_pair). 
> > gcc/ChangeLog: > > PR target/111677 > * config/aarch64/aarch64.cc (aarch64_reg_save_mode): Use > V16QImode for the full 16-byte FPR saves in the vector PCS case. > (aarch64_gen_storewb_pair): Handle V16QImode. > (aarch64_gen_loadwb_pair): Likewise. > (aarch64_gen_load_pair): Likewise. > * config/aarch64/aarch64.md (loadwb_pair_): > Rename to ... > (loadwb_pair_): ... this, extending to > V16QImode. > (storewb_pair_): Rename to ... > (storewb_pair_): ... this, extending to > V16QImode. > * config/aarch64/iterators.md (TX_V16QI): New. > > gcc/testsuite/ChangeLog: > > PR target/111677 > * gcc.target/aarch64/torture/pr111677.c: New test. > > (cherry picked from commit 2bd8264a131ee1215d3bc6181722f9d30f5569c3) > --- > gcc/config/aarch64/aarch64.cc | 13 ++- > gcc/config/aarch64/aarch64.md | 35 ++- > gcc/config/aarch64/iterators.md | 3 ++ > .../gcc.target/aarch64/torture/pr111677.c | 28 +++ > 4 files changed, 61 insertions(+), 18 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/torture/pr111677.c > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 3bccd96a23d..2bbba323770 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -4135,7 +4135,7 @@ aarch64_reg_save_mode (unsigned int regno) >case ARM_PCS_SIMD: > /* The vector PCS saves the low 128 bits (which is the full > register on non-SVE targets). 
*/ > - return TFmode; > + return V16QImode; > >case ARM_PCS_SVE: > /* Use vectors of DImode for registers that need frame > @@ -8602,6 +8602,10 @@ aarch64_gen_storewb_pair (machine_mode mode, rtx base, > rtx reg, rtx reg2, >return gen_storewb_pairtf_di (base, base, reg, reg2, > GEN_INT (-adjustment), > GEN_INT (UNITS_PER_VREG - adjustment)); > +case E_V16QImode: > + return gen_storewb_pairv16qi_di (base, base, reg, reg2, > +GEN_INT (-adjustment), > +GEN_INT (UNITS_PER_VREG - adjustment)); > default: >gcc_unreachable (); > } > @@ -8647,6 +8651,10 @@ aarch64_gen_loadwb_pair (machine_mode mode, rtx base, > rtx reg, rtx reg2, > case E_TFmode: >return gen_loadwb_pairtf_di (base, base, reg, reg2, GEN_INT > (adjustment), > GEN_INT (UNITS_PER_VREG)); > +case E_V16QImode: > + return gen_loadwb_pairv16qi_di (base, base, reg, reg2, > + GEN_INT (adjustment), > + GEN_INT (UNITS_PER_VREG)); > default: >gcc_unreachable (); > } > @@ -8730,6 +8738,9 @@ aarch64_gen_load_pair (machine_mode mode, rtx reg1, rtx > mem1, rtx reg2, > case E_V4SImode: >return gen_load_pairv4siv4si (reg1, mem1, reg2, mem2); > > +case E_V16QImode: > + return
Re: [PATCH] middle-end/113576 - zero padding of vector bools when expanding compares
Richard Biener writes: > The following zeros paddings of vector bools when expanding compares > and the mode used for the compare is an integer mode. In that case > targets cannot distinguish between a 4 element and 8 element vector > compare (both get to the QImode compare optab) so we have to do the > job in the middle-end. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. > > OK? > > Thanks, > Richard. > > PR middle-end/113576 > * expr.cc (do_store_flag): For vector bool compares of vectors > with padding zero that. > * dojump.cc (do_compare_and_jump): Likewise. > --- > gcc/dojump.cc | 16 > gcc/expr.cc | 17 + > 2 files changed, 33 insertions(+) > > diff --git a/gcc/dojump.cc b/gcc/dojump.cc > index e2d2b3cb111..ec2a365e488 100644 > --- a/gcc/dojump.cc > +++ b/gcc/dojump.cc > @@ -1266,6 +1266,7 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum > rtx_code signed_code, >machine_mode mode; >int unsignedp; >enum rtx_code code; > + unsigned HOST_WIDE_INT nunits; > >/* Don't crash if the comparison was erroneous. */ >op0 = expand_normal (treeop0); > @@ -1308,6 +1309,21 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum > rtx_code signed_code, >emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, > op1)); >op1 = new_op1; > } > + /* For boolean vectors with less than mode precision precision Too many precisions. LGTM otherwise, but could we put this in a shared helper, rather than duplicating the code? I'd be surprised if these are the only places we need to do something. Thanks, and sorry for the slow response (here and elsewhere). Richard > + make sure to fill padding with consistent values. 
*/ > + else if (VECTOR_BOOLEAN_TYPE_P (type) > +&& SCALAR_INT_MODE_P (mode) > +&& TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits) > +&& maybe_ne (GET_MODE_PRECISION (mode), nunits)) > +{ > + gcc_assert (code == EQ || code == NE); > + op0 = expand_binop (mode, and_optab, op0, > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > + true, OPTAB_WIDEN); > + op1 = expand_binop (mode, and_optab, op1, > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > + true, OPTAB_WIDEN); > +} > >do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode, > ((mode == BLKmode) > diff --git a/gcc/expr.cc b/gcc/expr.cc > index fc5e998e329..096081fdc53 100644 > --- a/gcc/expr.cc > +++ b/gcc/expr.cc > @@ -13502,6 +13502,7 @@ do_store_flag (sepops ops, rtx target, machine_mode > mode) >rtx op0, op1; >rtx subtarget = target; >location_t loc = ops->location; > + unsigned HOST_WIDE_INT nunits; > >arg0 = ops->op0; >arg1 = ops->op1; > @@ -13694,6 +13695,22 @@ do_store_flag (sepops ops, rtx target, machine_mode > mode) > >expand_operands (arg0, arg1, subtarget, &op0, &op1, EXPAND_NORMAL); > > + /* For boolean vectors with less than mode precision precision > + make sure to fill padding with consistent values. */ > + if (VECTOR_BOOLEAN_TYPE_P (type) > + && SCALAR_INT_MODE_P (operand_mode) > + && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits) > + && maybe_ne (GET_MODE_PRECISION (operand_mode), nunits)) > +{ > + gcc_assert (code == EQ || code == NE); > + op0 = expand_binop (mode, and_optab, op0, > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > + true, OPTAB_WIDEN); > + op1 = expand_binop (mode, and_optab, op1, > + GEN_INT ((1 << nunits) - 1), NULL_RTX, > + true, OPTAB_WIDEN); > +} > + >if (target == 0) > target = gen_reg_rtx (mode);
Re: [PATCH v2] c++: Defer emitting inline variables [PR113708]
On Tue, Feb 13, 2024 at 09:47:27PM -0500, Jason Merrill wrote: > On 2/13/24 20:34, Nathaniel Shead wrote: > > On Tue, Feb 13, 2024 at 06:08:42PM -0500, Jason Merrill wrote: > > > On 2/11/24 08:26, Nathaniel Shead wrote: > > > > > > > > Currently inline vars imported from modules aren't correctly finalised, > > > > which means that import_export_decl gets called at the end of TU > > > > processing despite not being meant to for these kinds of declarations. > > > > > > I disagree that it's not meant to; inline variables are vague linkage just > > > like template instantiations, so the bug seems to be that > > > import_export_decl > > > doesn't accept them. And on the other side, that > > > make_rtl_for_nonlocal_decl > > > doesn't defer them like instantations. > > > > > > Jason > > > > > > > True, that's a good point. I think I confused myself here. > > > > Here's a fixed patch that looks a lot cleaner. Bootstrapped and > > regtested (so far just dg.exp and modules.exp) on x86_64-pc-linux-gnu, > > OK for trunk if full regtest succeeds? > > OK. > A full bootstrap failed two tests in dwarf2.exp, which seem to be caused by an unreferenced 'inline' variable not being emitted into the debug info and thus causing the checks for its existence to fail. Adding a reference to the vars cause the tests to pass. Now fully bootstrapped and regtested on x86_64-pc-linux-gnu, still OK for trunk? (Only change is the two adjusted testcases.) -- >8 -- Inline variables are vague-linkage, and may or may not need to be emitted in any TU that they are part of, similarly to e.g. template instantiations. Currently 'import_export_decl' assumes that inline variables have already been emitted when it comes to end-of-TU processing, and so crashes when importing non-trivially-initialised variables from a module, as they have not yet been finalised. 
This patch fixes this by ensuring that inline variables are always deferred till end-of-TU processing, unifying the behaviour for module and non-module code. PR c++/113708 gcc/cp/ChangeLog: * decl.cc (make_rtl_for_nonlocal_decl): Defer inline variables. * decl2.cc (import_export_decl): Support inline variables. gcc/testsuite/ChangeLog: * g++.dg/debug/dwarf2/inline-var-1.C: Reference 'a' to ensure it is emitted. * g++.dg/debug/dwarf2/inline-var-3.C: Likewise. * g++.dg/modules/init-7_a.H: New test. * g++.dg/modules/init-7_b.C: New test. Signed-off-by: Nathaniel Shead --- gcc/cp/decl.cc | 4 gcc/cp/decl2.cc | 7 +-- gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C | 2 ++ gcc/testsuite/g++.dg/debug/dwarf2/inline-var-3.C | 2 ++ gcc/testsuite/g++.dg/modules/init-7_a.H | 6 ++ gcc/testsuite/g++.dg/modules/init-7_b.C | 6 ++ 6 files changed, 25 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/g++.dg/modules/init-7_a.H create mode 100644 gcc/testsuite/g++.dg/modules/init-7_b.C diff --git a/gcc/cp/decl.cc b/gcc/cp/decl.cc index 3e41fd4fa31..969513c069a 100644 --- a/gcc/cp/decl.cc +++ b/gcc/cp/decl.cc @@ -7954,6 +7954,10 @@ make_rtl_for_nonlocal_decl (tree decl, tree init, const char* asmspec) && DECL_IMPLICIT_INSTANTIATION (decl)) defer_p = 1; + /* Defer vague-linkage variables. */ + if (DECL_INLINE_VAR_P (decl)) +defer_p = 1; + /* If we're not deferring, go ahead and assemble the variable. 
*/ if (!defer_p) rest_of_decl_compilation (decl, toplev, at_eof); diff --git a/gcc/cp/decl2.cc b/gcc/cp/decl2.cc index f569d4045ec..1dddbaab38b 100644 --- a/gcc/cp/decl2.cc +++ b/gcc/cp/decl2.cc @@ -3360,7 +3360,9 @@ import_export_decl (tree decl) * implicit instantiations of function templates - * inline function + * inline functions + + * inline variables * implicit instantiations of static data members of class templates @@ -3383,6 +3385,7 @@ import_export_decl (tree decl) || DECL_DECLARED_INLINE_P (decl)); else gcc_assert (DECL_IMPLICIT_INSTANTIATION (decl) + || DECL_INLINE_VAR_P (decl) || DECL_VTABLE_OR_VTT_P (decl) || DECL_TINFO_P (decl)); /* Check that a definition of DECL is available in this translation @@ -3511,7 +3514,7 @@ import_export_decl (tree decl) this entity as undefined in this translation unit. */ import_p = true; } - else if (DECL_FUNCTION_MEMBER_P (decl)) + else if (TREE_CODE (decl) == FUNCTION_DECL && DECL_FUNCTION_MEMBER_P (decl)) { if (!DECL_DECLARED_INLINE_P (decl)) { diff --git a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C index 85f74a91521..7ec20afc065 100644 --- a/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C +++ b/gcc/testsuite/g++.dg/debug/dwarf2/inline-var-1.C @@ -8,6 +8,8 @@ // { dg-final { scan-assembler-times " DW_AT_\[^\n\r]*lin
RE: [PATCH] arm/aarch64: Add bti for all functions [PR106671]
Hi Feng, > -Original Message- > From: Gcc-patches bounces+kyrylo.tkachov=arm@gcc.gnu.org> On Behalf Of Feng Xue OS > via Gcc-patches > Sent: Wednesday, August 2, 2023 4:49 PM > To: gcc-patches@gcc.gnu.org > Subject: [PATCH] arm/aarch64: Add bti for all functions [PR106671] > > This patch extends option -mbranch-protection=bti with an optional > argument > as bti[+all] to force compiler to unconditionally insert bti for all > functions. Because a direct function call at the stage of compiling might be > rewritten to an indirect call with some kind of linker-generated thunk stub > as invocation relay for some reasons. One instance is if a direct callee is > placed far from its caller, direct BL {imm} instruction could not represent > the distance, so indirect BLR {reg} should be used. For this case, a bti is > required at the beginning of the callee. > >caller() { >bl callee >} > > => > >caller() { >adrp reg, >addreg, reg, #constant >blrreg >} > > Although the issue could be fixed with a pretty new version of ld, here we > provide another means for user who has to rely on the old ld or other non-ld > linker. I also checked LLVM, by default, it implements bti just as the > proposed > -mbranch-protection=bti+all. Apologies for the delay, we had discussed this on and off internally over time. I don't think adding extra complexity in the compiler going forward for the sake of older linkers is a good tradeoffs. So I'd like to avoid this. 
Thanks, Kyrill > > Feng > > --- > gcc/config/aarch64/aarch64.cc| 12 +++- > gcc/config/aarch64/aarch64.opt | 2 +- > gcc/config/arm/aarch-bti-insert.cc | 3 ++- > gcc/config/arm/aarch-common.cc | 22 ++ > gcc/config/arm/aarch-common.h| 18 ++ > gcc/config/arm/arm.cc| 4 ++-- > gcc/config/arm/arm.opt | 2 +- > gcc/doc/invoke.texi | 16 ++-- > gcc/testsuite/gcc.target/aarch64/bti-5.c | 17 + > 9 files changed, 76 insertions(+), 20 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/aarch64/bti-5.c > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index 71215ef9fee..a404447c8d0 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -8997,7 +8997,8 @@ void aarch_bti_arch_check (void) > bool > aarch_bti_enabled (void) > { > - return (aarch_enable_bti == 1); > + gcc_checking_assert (aarch_enable_bti != AARCH_BTI_FUNCTION_UNSET); > + return (aarch_enable_bti != AARCH_BTI_FUNCTION_NONE); > } > > /* Check if INSN is a BTI J insn. */ > @@ -18454,12 +18455,12 @@ aarch64_override_options (void) > >selected_tune = tune ? tune->ident : cpu->ident; > > - if (aarch_enable_bti == 2) > + if (aarch_enable_bti == AARCH_BTI_FUNCTION_UNSET) > { > #ifdef TARGET_ENABLE_BTI > - aarch_enable_bti = 1; > + aarch_enable_bti = AARCH_BTI_FUNCTION; > #else > - aarch_enable_bti = 0; > + aarch_enable_bti = AARCH_BTI_FUNCTION_NONE; > #endif > } > > @@ -22881,7 +22882,8 @@ aarch64_print_patchable_function_entry (FILE > *file, >basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb; > >if (!aarch_bti_enabled () > - || cgraph_node::get (cfun->decl)->only_called_directly_p ()) > + || (aarch_enable_bti != AARCH_BTI_FUNCTION_ALL > + && cgraph_node::get (cfun->decl)->only_called_directly_p ())) > { >/* Emit the patchable_area at the beginning of the function. 
*/ >rtx_insn *insn = emit_insn_before (pa, BB_HEAD (bb)); > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt > index 025e52d40e5..5571f7e916d 100644 > --- a/gcc/config/aarch64/aarch64.opt > +++ b/gcc/config/aarch64/aarch64.opt > @@ -37,7 +37,7 @@ TargetVariable > aarch64_feature_flags aarch64_isa_flags = 0 > > TargetVariable > -unsigned aarch_enable_bti = 2 > +enum aarch_bti_function_type aarch_enable_bti = > AARCH_BTI_FUNCTION_UNSET > > TargetVariable > enum aarch_key_type aarch_ra_sign_key = AARCH_KEY_A > diff --git a/gcc/config/arm/aarch-bti-insert.cc b/gcc/config/arm/aarch-bti- > insert.cc > index 71a77e29406..babd2490c9f 100644 > --- a/gcc/config/arm/aarch-bti-insert.cc > +++ b/gcc/config/arm/aarch-bti-insert.cc > @@ -164,7 +164,8 @@ rest_of_insert_bti (void) > functions that are already protected by Return Address Signing (PACIASP/ > PACIBSP). For all other cases insert a BTI C at the beginning of the > function. */ > - if (!cgraph_node::get (cfun->decl)->only_called_directly_p ()) > + if (aarch_enable_bti == AARCH_BTI_FUNCTION_ALL > + || !cgraph_node::get (cfun->decl)->only_called_directly_p ()) > { >bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb; >insn = BB_HEAD (bb); > diff --git a/gcc/config/arm/aarch-common.cc b/gcc/config/arm/aarch- > com
[PATCH] testsuite: gdc: Require ucn in gdc.test/runnable/mangle.d etc. [PR104739]
gdc.test/runnable/mangle.d and two other tests come out UNRESOLVED on Solaris with the native assembler: UNRESOLVED: gdc.test/runnable/mangle.d compilation failed to produce executable UNRESOLVED: gdc.test/runnable/mangle.d -shared-libphobos compilation failed to produce executable UNRESOLVED: gdc.test/runnable/testmodule.d compilation failed to produce executable UNRESOLVED: gdc.test/runnable/testmodule.d -shared-libphobos compilation failed to produce executable UNRESOLVED: gdc.test/runnable/ufcs.d compilation failed to produce executable UNRESOLVED: gdc.test/runnable/ufcs.d -shared-libphobos compilation failed to produce executable Assembler: mangle.d "/var/tmp//cci9q2Sc.s", line 115 : Syntax error Near line: "movzbl test_эльфийские_письмена_9, %eax" "/var/tmp//cci9q2Sc.s", line 115 : Syntax error Near line: "movzbl test_эльфийские_письмена_9, %eax" "/var/tmp//cci9q2Sc.s", line 115 : Syntax error Near line: "movzbl test_эльфийские_письмена_9, %eax" "/var/tmp//cci9q2Sc.s", line 115 : Syntax error Near line: "movzbl test_эльфийские_письмена_9, %eax" "/var/tmp//cci9q2Sc.s", line 115 : Syntax error [...] since /bin/as lacks UCN support. Iain recently added UNICODE_NAMES: annotations to the affected tests and those recently were imported into trunk. This patch handles the DejaGnu side of things, adding { dg-require-effective-target ucn } to those tests on the fly. Tested on i386-pc-solaris2.11, sparc-sun-solaris2.11 (as and gas each), and x86_64-pc-linux-gnu. Ok for trunk. Rainer -- - Rainer Orth, Center for Biotechnology, Bielefeld University 2024-02-03 Rainer Orth gcc/testsuite: PR d/104739 * lib/gdc-utils.exp (gdc-convert-test) : Require ucn support. # HG changeset patch # Parent 5072a8062cf1eac00205b715f4c1af31c9fc45ca testsuite: gdc: Require ucn in gdc.test/runnable/mangle.d etc. 
[PR104739] diff --git a/gcc/testsuite/lib/gdc-utils.exp b/gcc/testsuite/lib/gdc-utils.exp --- a/gcc/testsuite/lib/gdc-utils.exp +++ b/gcc/testsuite/lib/gdc-utils.exp @@ -244,6 +244,7 @@ proc gdc-copy-file { srcdir filename } { # POST_SCRIPT: Not handled. # REQUIRED_ARGS: Arguments to add to the compiler command line. # DISABLED: Not handled. +# UNICODE_NAMES: Requires ucn support. # proc gdc-convert-test { base test } { @@ -365,6 +366,10 @@ proc gdc-convert-test { base test } { # COMPILABLE_MATH_TEST annotates tests that import the std.math # module. Which will need skipping if not available on the target. set needs_phobos 1 + } elseif [regexp -- {UNICODE_NAMES} $copy_line] { + # Require ucn support. + puts $fdout "// { dg-require-effective-target ucn }" + } }