Re: [PATCH v7] Add condition coverage (MC/DC)

2023-12-29 Thread Jørgen Kvalsvik

On 29/12/2023 22:14, Jan Hubicka wrote:

gcc/ChangeLog:

* builtins.cc (expand_builtin_fork_or_exec): Check
  condition_coverage_flag.
* collect2.cc (main): Add -fno-condition-coverage to OBSTACK.
* common.opt: Add new options -fcondition-coverage and
  -Wcoverage-too-many-conditions.
* doc/gcov.texi: Add --conditions documentation.
* doc/invoke.texi: Add -fcondition-coverage documentation.
* function.cc (free_after_compilation): Clear conditions.
(allocate_struct_function): Allocate conditions.
(basic_condition_uid): New.
* function.h (struct function): Add conditions.
(basic_condition_uid): New declaration.
* gcc.cc: Link gcov on -fcondition-coverage.
* gcov-counter.def (GCOV_COUNTER_CONDS): New.
* gcov-dump.cc (tag_conditions): New.
* gcov-io.h (GCOV_TAG_CONDS): New.
(GCOV_TAG_CONDS_LENGTH): New.
(GCOV_TAG_CONDS_NUM): New.
* gcov.cc (class condition_info): New.
(condition_info::condition_info): New.
(condition_info::popcount): New.
(struct coverage_info): New.
(add_condition_counts): New.
(output_conditions): New.
(print_usage): Add -g, --conditions.
(process_args): Likewise.
(output_intermediate_json_line): Output conditions.
(read_graph_file): Read condition counters.
(read_count_file): Likewise.
(file_summary): Print conditions.
(accumulate_line_info): Accumulate conditions.
(output_line_details): Print conditions.
* gimplify.cc (next_cond_uid): New.
(reset_cond_uid): New.
(shortcut_cond_r): Set condition discriminator.
(tag_shortcut_cond): New.
(shortcut_cond_expr): Set condition discriminator.
(gimplify_cond_expr): Likewise.
(gimplify_function_tree): Call reset_cond_uid.
* ipa-inline.cc (can_early_inline_edge_p): Check
  condition_coverage_flag.
* ipa-split.cc (pass_split_functions::gate): Likewise.
* passes.cc (finish_optimization_passes): Likewise.
* profile.cc (struct condcov): New declaration.
(cov_length): Likewise.
(cov_blocks): Likewise.
(cov_masks): Likewise.
(cov_maps): Likewise.
(cov_free): Likewise.
(instrument_decisions): New.
(read_thunk_profile): Control output to file.
(branch_prob): Call find_conditions, instrument_decisions.
(init_branch_prob): Add total_num_conds.
(end_branch_prob): Likewise.
* tree-core.h (struct tree_exp): Add condition_uid.
* tree-profile.cc (struct conds_ctx): New.
(CONDITIONS_MAX_TERMS): New.
(EDGE_CONDITION): New.
(topological_cmp): New.
(index_of): New.
(single_p): New.
(single_edge): New.
(contract_edge_up): New.
(struct outcomes): New.
(conditional_succs): New.
(condition_index): New.
(masking_vectors): New.
(emit_assign): New.
(emit_bitwise_op): New.
(make_top_index_visit): New.
(make_top_index): New.
(paths_between): New.
(struct condcov): New.
(cov_length): New.
(cov_blocks): New.
(cov_masks): New.
(cov_maps): New.
(cov_free): New.
(gimple_cond_uid): New.
(find_conditions): New.
(struct counters): New.
(find_counters): New.
(resolve_counter): New.
(resolve_counters): New.
(instrument_decisions): New.
(tree_profiling): Check condition_coverage_flag.
(pass_ipa_tree_profile::gate): Likewise.
* tree.h (SET_EXPR_UID): New.
(EXPR_COND_UID): New.

libgcc/ChangeLog:

* libgcov-merge.c (__gcov_merge_ior): New.

gcc/testsuite/ChangeLog:

* lib/gcov.exp: Add condition coverage test function.
* g++.dg/gcov/gcov-18.C: New test.
* gcc.misc-tests/gcov-19.c: New test.
* gcc.misc-tests/gcov-20.c: New test.
* gcc.misc-tests/gcov-21.c: New test.
* gcc.misc-tests/gcov-22.c: New test.
* gcc.misc-tests/gcov-23.c: New test.


Sorry for taking so long on this - I needed some time to actually try
the patch, since generally we will need more changes in the frontend to
preserve conditionals intact till gimple.

This revision brings quite a few changes, some of which warrant a more
careful review.

1. Basic conditions are tied to the Boolean expression during
gimplification, not through CFG analysis. The CFG analysis seemed to
work well up until constructs like a && fn (b && c) && d, where
fn(...) seems indistinguishable from then-blocks (see the snippet
below). This wipes out much of the implementation in tree-profile.cc.
2. I changed the flag from -fprofile-conditions to -fcondition-coverage.
-fprofile-conditions was chosen because of its symmetry with
-fprofile-arcs, but -fcondition-coverage does feel more appropriate.
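
For reference, a minimal hypothetical snippet with the shape from point 1
(not taken from the testsuite); a, b, c and d are the basic conditions of
the outer decision, while b && c is a nested decision hiding inside a call
operand, which is what the CFG-based analysis could not tell apart from a
then-block:

  extern int fn (int);

  int
  decide (int a, int b, int c, int d)
  {
    /* Outer decision: a && fn (b && c) && d; nested decision: b && c.  */
    if (a && fn (b && c) && d)
      return 1;
    return 0;
  }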



Re: [PATCH 1/2] RTX_COST: Count instructions

2023-12-29 Thread Jeff Law




On 12/29/23 10:46, YunQiang Su wrote:

When we try to combine RTLs, the result may be very complex,
and `rtx_cost` may think that it costs a lot.  But in
fact, it may match a pattern in the machine description, which
may emit only 1 or 2 hardware instructions.  This combination
may be refused due to a cost comparison failure.

Then that's a problem with the backend's implementation of RTX_COST.



Since the high cost may be due to a more expensive operation,
to get the real reason we also need information about the instruction
count.
Then cost the *operations*, not the number of instructions.  Also note 
that a single insn may generate multiple assembler instructions.


Even with all its warts, the real solution here is to fix the port's RTX 
costs.
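
To illustrate, a hedged sketch (not from any in-tree port; the hook name and
the ZERO_EXTRACT case are made up for the example) of a TARGET_RTX_COSTS hook
reporting the real cost of a shape the port knows is matched by a single
instruction, instead of letting the generic walk sum up the operands:

  /* Hedged sketch only: claim a single-insn cost for an RTX shape that the
     port knows is matched by one machine instruction.  */
  static bool
  example_rtx_costs (rtx x, machine_mode, int, int, int *total, bool)
  {
    if (GET_CODE (x) == ZERO_EXTRACT)  /* assumed to match e.g. an INS insn */
      {
        *total = COSTS_N_INSNS (1);
        return true;                   /* cost is final, do not recurse */
      }
    return false;                      /* fall back to generic costing */
  }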


jeff




Re: [PATCH] Improved RTL expansion of field assignments into promoted registers.

2023-12-29 Thread Jeff Law




On 12/28/23 19:07, YunQiang Su wrote:

In general, I agree with this change.
With gcc12 on RV64, more than one `sext.w` will be produced with our test.
(Note: use -O1.)



There are two things that help here.  The first is that the most significant
bit never appears in the middle of a field, so we don't have to worry about
overlapping, nor writes to the paradoxical bits of the SUBREG.  And secondly,
bits are numbered from zero for least significant, to MODE_BITSIZE (mode) - 1
for most significant, irrespective of the endian-ness.  So the code only needs


I am worried that bits higher than MODE_BITSIZE (mode) - 1 are also
modified.  In that case, we also need to do a truncate/sign_extend,
though I cannot produce C code for this yet.


to check that the highest position, bitpos + bitsize, is the maximum value for the mode.
The above logic stays the same, but which byte insert requires extension will
change between mips64be and mips64le.  I.e. we test that the most significant
bit of the field/byte being written is the most significant bit of the SUBREG
target.  [That's my understanding/rationalization, I could be wrong.]



The bits higher than MODE_BITSIZE (mode) - 1 also matter,
since the MIPS ISA claims that the src register of SImode instructions should
be sign_extended, otherwise the behaviour is UNPREDICTABLE.
It means,
li $r2, 0xfff0   0001
#  ^
addu $r1, $r0, $r2
is not allowed.
Right.  But that's the whole point behind avoiding the narrowing subreg 
and forcing use of a truncate operation.


So basically the question becomes: is there a way to modify those bits in 
a way that GCC doesn't know that it needs to truncate/extend?


The most obvious concern would be bitfield insertions that modify those 
bits.  But in that case the destination must have been DImode and we 
must truncate it to SImode before we can do anything with the SImode 
object.  But that's all supposed to work as long as 
TRULY_NOOP_TRUNCATION is defined properly.
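
As a concrete, purely hypothetical example of the kind of field write under
discussion - the inserted byte's top bit lands in bit 31, so after the
insertion the SImode value must still be kept sign-extended in the DImode
register, which is exactly what a 32-bit INS guarantees:

  /* Hypothetical illustration only: insert B into bits 24..31 of VAL.  */
  int
  set_top_byte (int val, unsigned char b)
  {
    unsigned int u = ((unsigned int) val & 0x00ffffffu)
                     | ((unsigned int) b << 24);
    return (int) u;  /* bit 31 of the inserted field is the SImode sign bit */
  }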


Jeff


Re: [PATCH v2] RISC-V: XFAIL pr30957-1.c when loop vectorized with variable factor

2023-12-29 Thread Jeff Law




On 12/28/23 22:56, Li, Pan2 wrote:

Thanks Jeff.

I think I have located where aarch64 performs the trick here.

1. In the .final we have rtl like

(insn:TI 6 8 29 (set (reg:SF 32 v0)
 (const_double:SF -0.0 [-0x0.0p+0])) 
"/home/box/panli/gnu-toolchain/gcc/gcc/testsuite/gcc.dg/pr30957-1.c":31:7 79 
{*movsf_aarch64}
  (nil))

2. The movsf_aarch64 pattern comes from the aarch64.md file, similar to the rtl
below.  I.e., it will generate movi\t%0.2s, #0 if
aarch64_reg_or_fp_zero is true.

1640 (define_insn "*mov<mode>_aarch64"
1641   [(set (match_operand:SFD 0 "nonimmediate_operand")
1642   (match_operand:SFD 1 "general_operand"))]
1643   "TARGET_FLOAT && (register_operand (operands[0], <MODE>mode)
1644 || aarch64_reg_or_fp_zero (operands[1], <MODE>mode))"
1645   {@ [ cons: =0 , 1   ; attrs: type , arch  ]
1646  [ w, Y   ; neon_move   , simd  ] movi\t%0.2s, #0

3. Then we will have aarch64_float_const_zero_rtx_p here, and the -0.0 input
rtl will return true at line 10873 because no-signed-zeros is given.

10863 bool
10864 aarch64_float_const_zero_rtx_p (rtx x)
10865 {
10866   /* 0.0 in Decimal Floating Point cannot be represented by #0 or
10867  zr as our callers expect, so no need to check the actual
10868  value if X is of Decimal Floating Point type.  */
10869   if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
10870 return false;
10871
10872   if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
10873 return !HONOR_SIGNED_ZEROS (GET_MODE (x));
10874   return real_equal (CONST_DOUBLE_REAL_VALUE (x), &dconst0);
10875 }

I think that explains why we have +0.0 on aarch64 here.
Yup.  Thanks a ton for diving into this.  So I think that points us to 
the right fix, specifically we should be turning -0.0 into 0.0 when 
!HONOR_SIGNED_ZEROS rather than xfailing the test.


I think we'd need to adjust reg_or_0_operand and riscv_output_move, 
probably the G constraint as well.   We might also need to adjust 
move_operand and perhaps riscv_legitimize_move.
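
A minimal sketch of the kind of canonicalization being suggested, assuming it
would sit somewhere around riscv_legitimize_move (this is not actual riscv.cc
code, just the shape of the idea):

  /* Hedged sketch only: fold a -0.0 source constant to +0.0 when signed
     zeros are not honored, so the move can use the zero register.  */
  static rtx
  canonicalize_fp_zero (rtx src, machine_mode mode)
  {
    if (CONST_DOUBLE_P (src)
        && real_isnegzero (CONST_DOUBLE_REAL_VALUE (src))
        && !HONOR_SIGNED_ZEROS (mode))
      return CONST0_RTX (mode);
    return src;
  }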


jeff


Re: [PATCH v1 1/8] LoongArch: testsuite:Add detection procedures supported by the target.

2023-12-29 Thread chenxiaolong
At 14:28 +0800 on 2023-12-29, Chenghua Xu wrote:
> chenxiaolong writes:
> 
> > In order to improve and check the vectorization functionality of the
> > LoongArch architecture, tests for the vector instruction sets are
> > provided in target-supports.exp.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > * lib/target-supports.exp:Add LoongArch to the list of
> > supported
> > targets.
>  ^ Should be a space after ":".
> > ---
> >  gcc/testsuite/lib/target-supports.exp | 219 +++---
> > 
> >  1 file changed, 161 insertions(+), 58 deletions(-)
> > 
> > diff --git a/gcc/testsuite/lib/target-supports.exp
> > b/gcc/testsuite/lib/target-supports.exp
> > index 14e3e119792..b90aaf8cabe 100644
> > --- a/gcc/testsuite/lib/target-supports.exp
> > +++ b/gcc/testsuite/lib/target-supports.exp
> > @@ -3811,7 +3811,11 @@ proc add_options_for_bfloat16 { flags } {
> >  # (fma, fms, fnma, and fnms) for both float and double.
> >  
> >  proc check_effective_target_scalar_all_fma { } {
> > -return [istarget aarch64*-*-*]
> > +if { [istarget aarch64*-*-*] 
> 
> Trailing whitespace.
> 
> > +|| [istarget loongarch*-*-*]} {
> > +   return 1
> > +}
> > +return 0
> >  }
> >  
> >  # Return 1 if the target supports compiling fixed-point,
> > @@ -4017,7 +4021,7 @@ proc
> > check_effective_target_vect_cmdline_needed { } {
> >  || ([istarget arm*-*-*] &&
> > [check_effective_target_arm_neon])
> >  || [istarget aarch64*-*-*]
> >  || [istarget amdgcn*-*-*]
> > -|| [istarget riscv*-*-*]} {
> > +|| [istarget riscv*-*-*] } {
> 
> Misses something ?
> 
> > return 0
> > } else {
> > return 1
> > @@ -4047,6 +4051,8 @@ proc check_effective_target_vect_int { } {
> >  && [check_effective_target_s390_vx])
> >  || ([istarget riscv*-*-*]
> >  && [check_effective_target_riscv_v])
> > +|| ([istarget loongarch*-*-*]
> > +&& [check_effective_target_loongarch_sx])
> > }}]
> >  }
> >  
> > @@ -4176,7 +4182,9 @@ proc check_effective_target_vect_intfloat_cvt
> > { } {
> >  || ([istarget s390*-*-*]
> >  && [check_effective_target_s390_vxe2])
> >  || ([istarget riscv*-*-*]
> > -&& [check_effective_target_riscv_v]) }}]
> > +&& [check_effective_target_riscv_v])
> > +|| ([istarget loongarch*-*-*]
> > +&& [check_effective_target_loongarch_sx]) }}]
> >  }
> >  
> >  # Return 1 if the target supports signed double->int conversion
> > @@ -4197,7 +4205,9 @@ proc
> > check_effective_target_vect_doubleint_cvt { } {
> >  || ([istarget s390*-*-*]
> >  && [check_effective_target_s390_vx])
> >  || ([istarget riscv*-*-*]
> > -&& [check_effective_target_riscv_v]) }}]
> > +&& [check_effective_target_riscv_v])
> > +|| ([istarget loongarch*-*-*]
> > +&& [check_effective_target_loongarch_sx]) }}]
> >  }
> >  
> >  # Return 1 if the target supports signed int->double conversion
> > @@ -4218,7 +4228,9 @@ proc
> > check_effective_target_vect_intdouble_cvt { } {
> >  || ([istarget s390*-*-*]
> >  && [check_effective_target_s390_vx])
> >  || ([istarget riscv*-*-*]
> > -&& [check_effective_target_riscv_v]) }}]
> > +&& [check_effective_target_riscv_v])
> > +|| ([istarget loongarch*-*-*]
> > +&& [check_effective_target_loongarch_sx]) }}]
> >  }
> >  
> >  #Return 1 if we're supporting __int128 for target, 0 otherwise.
> > @@ -4251,7 +4263,9 @@ proc
> > check_effective_target_vect_uintfloat_cvt { } {
> >  || ([istarget s390*-*-*]
> >  && [check_effective_target_s390_vxe2])
> >  || ([istarget riscv*-*-*]
> > -&& [check_effective_target_riscv_v]) }}]
> > +&& [check_effective_target_riscv_v])
> > +|| ([istarget loongarch*-*-*]
> > +&& [check_effective_target_loongarch_sx]) }}]
> >  }
> >  
> >  
> > @@ -4270,7 +4284,9 @@ proc check_effective_target_vect_floatint_cvt
> > { } {
> >  || ([istarget s390*-*-*]
> >  && [check_effective_target_s390_vxe2])
> >  || ([istarget riscv*-*-*]
> > -&& [check_effective_target_riscv_v]) }}]
> > +&& [check_effective_target_riscv_v])
> > +|| ([istarget loongarch*-*-*]
> > +&& [check_effective_target_loongarch_sx]) }}]
> >  }
> >  
> >  # Return 1 if the target supports unsigned float->int conversion
> > @@ -4287,7 +4303,9 @@ proc
> > check_effective_target_vect_floatuint_cvt { } {
> > || ([istarget s390*-*-*]
> > && [check_effective_target_s390_vxe2])
> > || ([istarget riscv*-*-*]
> > -   && [check_effective_target_riscv_v]) }}]
> > +   && [check_effective_target_riscv_v])
> > +   || ([istarget loongarch*-*-*]
> > +   && [check_effective_target_loongarch_sx]) }}]
> >  }
> >  
> >  # 

[PATCH] libstdc++ testsuite/std/ranges/iota/max_size_type.cc: Reduce /10 for simulators

2023-12-29 Thread Hans-Peter Nilsson
I'm not completely sure I got the intent of the "log2_limit",
or whether "limit" is sane to decrease like this; it just
looked like an obvious and safe reduction.  Also, I verified
the 10+ minute runtime, on this same host (clocked at 11:43.61
elapsed time) for a r12-2797-g307e0d40367996 build that I
happened to have kept around; likely the build that led up
to that commit.  Now it's 58:45.78 elapsed time for a
successful run.  Looks like a 5x performance regression.
Worrisome; PR mentioned below.

Incidentally, a parallel build and a serial test-run take 9
hours on that laptop, so that's almost 2 hours just for one 
test, if just updating the timeout to fit.  IOW, currently 48
minutes out of 9 hours for one test that just times out.

(That was just mentioned for comparison purposes: when suitable,
I test with `nprocs`-1 in parallel.)

I'll put it on the back-burner to investigate.  I think I'll
try to graft that version of libstdc++-v3 to this version
and see if I can shift the blame away from MMIX code
generation onto libstdc++-v3.  ;)
Or perhaps the cause is known?

With this, the test successfully completes in ~34 seconds.

Ok to commit?

-- >8 --
Looks like the MMIX port code quality and/or libstdc++
performance of this test has regressed since
r12-2799-ge9b639c4b53221 by a factor of 5.  Anyway, what was an 11+
minute runtime then is now, at r14-6859-gd1eacedc6d9ba9,
close to 60 minutes.  Better prune the test, not just
increase timeouts.  Also of course, investigate the
performance regression, logged as PR113175.

* testsuite/std/ranges/iota/max_size_type.cc: Adjust
limits from -1000..1000 to -100..100 for simulators.
---
 .../std/ranges/iota/max_size_type.cc  | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/libstdc++-v3/testsuite/std/ranges/iota/max_size_type.cc 
b/libstdc++-v3/testsuite/std/ranges/iota/max_size_type.cc
index a1fbc3241dca..38fa6323d47e 100644
--- a/libstdc++-v3/testsuite/std/ranges/iota/max_size_type.cc
+++ b/libstdc++-v3/testsuite/std/ranges/iota/max_size_type.cc
@@ -16,6 +16,7 @@
 // .
 
 // { dg-do run { target c++20 } }
+// { dg-additional-options "-DSIMULATOR_TEST" { target simulator } }
 // { dg-timeout-factor 4 }
 
 #include 
@@ -31,6 +32,14 @@ using signed_rep_t = __int128;
 using signed_rep_t = long long;
 #endif
 
+#ifdef SIMULATOR_TEST
+#define LIMIT 100
+#define LOG2_CEIL_LIMIT 7
+#else
+#define LIMIT 1000
+#define LOG2_CEIL_LIMIT 10
+#endif
+
 static_assert(sizeof(max_size_t) == sizeof(max_diff_t));
 static_assert(sizeof(rep_t) == sizeof(signed_rep_t));
 
@@ -199,8 +208,8 @@ test02()
   using max_type = std::conditional_t;
   using shorten_type = std::conditional_t;
   const int hw_type_bit_size = sizeof(hw_type) * __CHAR_BIT__;
-  const int limit = 1000;
-  const int log2_limit = 10;
+  const int limit = LIMIT;
+  const int log2_limit = LOG2_CEIL_LIMIT;
   static_assert((1 << log2_limit) >= limit);
   const int min = (signed_p ? -limit : 0);
   const int max = limit;
@@ -257,8 +266,8 @@ test03()
   using max_type = std::conditional_t;
   using base_type = std::conditional_t;
   constexpr int hw_type_bit_size = sizeof(hw_type) * __CHAR_BIT__;
-  constexpr int limit = 1000;
-  constexpr int log2_limit = 10;
+  constexpr int limit = LIMIT;
+  constexpr int log2_limit = LOG2_CEIL_LIMIT;
   static_assert((1 << log2_limit) >= limit);
   const int min = (signed_p ? -limit : 0);
   const int max = limit;
@@ -312,7 +321,7 @@ test03()
 void
 test04()
 {
-  constexpr int limit = 1000;
+  constexpr int limit = LIMIT;
   for (int i = -limit; i <= limit; i++)
 {
   VERIFY( -max_size_t(-i) == i );
-- 
2.30.2



[PATCH] libstdc++ testsuite/20_util/hash/quality.cc: Increase timeout 3x

2023-12-29 Thread Hans-Peter Nilsson
Tested for mmix and observing the increased timeout in the .log 
file - and the test passing.

Ok to commit?  Or better suggestions?

-- >8 --
Testing for mmix (a 64-bit target using Knuth's simulator).  The test
is largely pruned for simulators, but still needs 5m57s on my laptop
from 3.5 years ago to run to successful completion.  Perhaps slow
hosted targets could also have problems so increasing the timeout
limit, not just for simulators but for everyone, and by more than a
factor 2.

* testsuite/20_util/hash/quality.cc: Increase timeout by a factor 3.
---
 libstdc++-v3/testsuite/20_util/hash/quality.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/libstdc++-v3/testsuite/20_util/hash/quality.cc 
b/libstdc++-v3/testsuite/20_util/hash/quality.cc
index 7d4208ed6d21..80efc026 100644
--- a/libstdc++-v3/testsuite/20_util/hash/quality.cc
+++ b/libstdc++-v3/testsuite/20_util/hash/quality.cc
@@ -1,5 +1,6 @@
 // { dg-options "-DNTESTS=1 -DNSTRINGS=100 -DSTRSIZE=21" { target simulator } }
 // { dg-do run { target c++11 } }
+// { dg-timeout-factor 3 }
 
 // Copyright (C) 2010-2023 Free Software Foundation, Inc.
 //
-- 
2.30.2



[committed] MAINTAINERS: Update my email address

2023-12-29 Thread Joseph Myers
There will be another update in January.

* MAINTAINERS: Update my email address.

diff --git a/MAINTAINERS b/MAINTAINERS
index 343560c5b84..fe5d95ae970 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -34,7 +34,7 @@ Jeff Law  

 Michael Meissner   
 Jason Merrill  
 David S. Miller
-Joseph Myers   
+Joseph Myers   
 Richard Sandiford  
 Bernd Schmidt  
 Ian Lance Taylor   
@@ -155,7 +155,7 @@ cygwin, mingw-w64   Jonathan Yong   
<10wa...@gmail.com>
 
Language Front Ends Maintainers
 
-C front end/ISO C99Joseph Myers
+C front end/ISO C99Joseph Myers
 Ada front end  Arnaud Charlet  
 Ada front end  Eric Botcazou   
 Ada front end  Marc Poulhiès   
@@ -192,7 +192,7 @@ libquadmath Jakub Jelinek   

 libvtv Caroline Tice   
 libphobos  Iain Buclaw 
 line map   Dodji Seketeli  
-soft-fpJoseph Myers

+soft-fpJoseph Myers
 scheduler (+ haifa)Jim Wilson  
 scheduler (+ haifa)Michael Meissner
 scheduler (+ haifa)Jeff Law
@@ -219,7 +219,7 @@ jump.cc David S. Miller 

 web pages  Gerald Pfeifer  
 config.sub/config.guessBen Elliston
 i18n   Philipp Thomas  
-i18n   Joseph Myers
+i18n   Joseph Myers
 diagnostic messagesDodji Seketeli  
 diagnostic messagesDavid Malcolm   
 build machinery (*.in) Paolo Bonzini   
@@ -227,14 +227,14 @@ build machinery (*.in)Nathanael Nerode

 build machinery (*.in) Alexandre Oliva 
 build machinery (*.in) Ralf Wildenhues 
 docs co-maintainer Gerald Pfeifer  
-docs co-maintainer Joseph Myers
+docs co-maintainer Joseph Myers
 docs co-maintainer Sandra Loosemore
 docstring relicensing  Gerald Pfeifer  
-docstring relicensing  Joseph Myers
+docstring relicensing  Joseph Myers
 predict.defJan Hubicka 
 gcov   Jan Hubicka 
 gcov   Nathan Sidwell  
-option handlingJoseph Myers

+option handlingJoseph Myers
 middle-end Jeff Law
 middle-end Ian Lance Taylor
 middle-end Richard Biener  
@@ -278,7 +278,7 @@ CTF, BTF, bpf port  David Faust 

 dataflow   Paolo Bonzini   
 dataflow   Seongbae Park   
 dataflow   Kenneth Zadeck  
-driver Joseph Myers
+driver Joseph Myers
 FortranHarald Anlauf   
 FortranJanne Blomqvist 
 FortranTobias Burnus   


-- 
Joseph S. Myers
j...@polyomino.org.uk

Re: skip vector profiles multiple exits

2023-12-29 Thread Jan Hubicka
> Hi Honza,
Hi,
> 
> I wasn't sure what to do here so I figured I'd ask.
> 
> In adding support for multiple exits to the vectorizer I didn't know how to 
> update this bit:
> 
> https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-vect-loop-manip.cc#L3363
> 
> Essentially, if skip_vector (i.e. not enough iterations to enter the vector 
> loop) then the
> previous code would update the new probability to be the same as that of the
> exit edge.  This made sense because that's the only edge which could bring 
> you to
> the next loop preheader.
> 
> With multiple exits this is no longer the case since any exit can bring you 
> to the
> preheader node.  I figured the new counts should simply be the sum of all 
> exit
> edges.  But that gives quite large count values compared to the rest of the 
> loop.
The sum of all exit counts (not probabilities) relative to the header count should
give you the estimated probability that the loop iterates at any given
iteration.  I am not sure how good an estimate this is for the loop
preconditioning to be true (without profile histograms it is really hard
to tell).
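
A hedged sketch of that computation (a fragment only; the helpers named here
are assumed from the current profile_count/loop API and may need adjusting):

  /* Sum the counts of all exit edges and express them relative to the
     header count, i.e. the estimated per-iteration exit probability.  */
  profile_count exit_sum = profile_count::zero ();
  for (edge e : get_loop_exit_edges (loop))
    exit_sum += e->count ();
  profile_probability exit_prob
    = exit_sum.probability_in (loop->header->count);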
> 
> I then thought I would need to scale the counts by the probability of the edge
> being taken.  The problem here is that the probabilities don't add up to 100%

So you are summing exit_edge->count ()?
I am not sure how useful summing the probabilities would be, since they are
conditional (relative to the probability of entering the BB you go to).
How complicated a CFG do we now handle with vectorization?

Honza
> 
> so the scaled counts also looked kinda wonky.   Any suggestions?
> 
> If you want some small examples to look at, testcases
> ./gcc/testsuite/gcc.dg/vect/vect-early-break_90.c to 
> ./gcc/testsuite/gcc.dg/vect/vect-early-break_93.c
> should be relevant here.
> 
> Thanks,
> Tamar


skip vector profiles multiple exits

2023-12-29 Thread Tamar Christina
Hi Honza,

I wasn't sure what to do here so I figured I'd ask.

In adding support for multiple exits to the vectorizer I didn't know how to 
update this bit:

https://github.com/gcc-mirror/gcc/blob/master/gcc/tree-vect-loop-manip.cc#L3363

Essentially, if skip_vector (i.e. not enough iterations to enter the vector 
loop) then the
previous code would update the new probability to be the same as that of the
exit edge.  This made sense because that's the only edge which could bring you 
to
the next loop preheader.

With multiple exits this is no longer the case since any exit can bring you to
the preheader node.  I figured the new counts should simply be the sum of all exit
edges.  But that gives quite large count values compared to the rest of the 
loop.

I then thought I would need to scale the counts by the probability of the edge
being taken.  The problem here is that the probabilities don't add up to 100%

so the scaled counts also looked kinda wonky.   Any suggestions?

If you want some small examples to look at, testcases
./gcc/testsuite/gcc.dg/vect/vect-early-break_90.c to 
./gcc/testsuite/gcc.dg/vect/vect-early-break_93.c
should be relevant here.

Thanks,
Tamar


Re: [PATCH 3/7] Lockfile.

2023-12-29 Thread Jan Hubicka
Hi,
> This patch implements lockfile used for incremental LTO.
> 
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * Makefile.in: Add lockfile.o.
>   * lockfile.cc: New file.
>   * lockfile.h: New file.

I can't approve it, but overall it looks good to me.
We also have locking in gcov-io, but it is probably not that practical
to keep these shared, since gcov-io is also built into the runtime.

You do not implement the GCOV_LINKED_WITH_LOCKING part; does locking work
with mingw?  Or do we only build gcc with the cygwin emulation layer these days?
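
For reference, a usage sketch of the new class as it appears from this patch
alone (the constructor is hypothetical, since the class body in lockfile.h is
not quoted here; note that without HAVE_FCNTL_H the methods just open the
file and provide no real locking, which is presumably where the mingw
question comes in):

  /* Hypothetical usage sketch, not part of the patch.  */
  lockfile lock ("ltrans.cache.lock");  /* hypothetical ctor taking a path */
  if (lock.lock_write () == 0)          /* blocks until the write lock is held */
    {
      /* ... update the incremental LTO cache ... */
      lock.unlock ();
    }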

Honza
> ---
>  gcc/Makefile.in |   5 +-
>  gcc/lockfile.cc | 136 
>  gcc/lockfile.h  |  85 ++
>  3 files changed, 224 insertions(+), 2 deletions(-)
>  create mode 100644 gcc/lockfile.cc
>  create mode 100644 gcc/lockfile.h
> 
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 7b7a4ff789a..2c527245c81 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1831,7 +1831,7 @@ ALL_HOST_BACKEND_OBJS = $(GCC_OBJS) $(OBJS) 
> $(OBJS-libcommon) \
>$(OBJS-libcommon-target) main.o c-family/cppspec.o \
>$(COLLECT2_OBJS) $(EXTRA_GCC_OBJS) $(GCOV_OBJS) $(GCOV_DUMP_OBJS) \
>$(GCOV_TOOL_OBJS) $(GENGTYPE_OBJS) gcc-ar.o gcc-nm.o gcc-ranlib.o \
> -  lto-wrapper.o collect-utils.o
> +  lto-wrapper.o collect-utils.o lockfile.o
>  
>  # for anything that is shared use the cc1plus profile data, as that
>  # is likely the most exercised during the build
> @@ -2359,7 +2359,8 @@ collect2$(exeext): $(COLLECT2_OBJS) $(LIBDEPS)
>  CFLAGS-collect2.o += -DTARGET_MACHINE=\"$(target_noncanonical)\" \
>   @TARGET_SYSTEM_ROOT_DEFINE@
>  
> -LTO_WRAPPER_OBJS = lto-wrapper.o collect-utils.o ggc-none.o
> +LTO_WRAPPER_OBJS = lto-wrapper.o collect-utils.o ggc-none.o lockfile.o
> +
>  lto-wrapper$(exeext): $(LTO_WRAPPER_OBJS) libcommon-target.a $(LIBDEPS)
>   +$(LINKER) $(ALL_LINKERFLAGS) $(LDFLAGS) -o T$@ \
>  $(LTO_WRAPPER_OBJS) libcommon-target.a $(LIBS)
> diff --git a/gcc/lockfile.cc b/gcc/lockfile.cc
> new file mode 100644
> index 000..9440e8938f3
> --- /dev/null
> +++ b/gcc/lockfile.cc
> @@ -0,0 +1,136 @@
> +/* File locking.
> +   Copyright (C) 2009-2023 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +.  */
> +
> +#include "config.h"
> +#include "system.h"
> +
> +#include "lockfile.h"
> +
> +
> +/* Unique write lock.  No other lock can be held on this lockfile.
> +   Blocking call.  */
> +int
> +lockfile::lock_write ()
> +{
> +  fd = open (filename.c_str (), O_RDWR | O_CREAT, 0666);
> +  if (fd < 0)
> +return -1;
> +
> +#if HAVE_FCNTL_H
> +  struct flock s_flock;
> +
> +  s_flock.l_whence = SEEK_SET;
> +  s_flock.l_start = 0;
> +  s_flock.l_len = 0;
> +  s_flock.l_pid = getpid ();
> +  s_flock.l_type = F_WRLCK;
> +
> +  while (fcntl (fd, F_SETLKW, &s_flock) && errno == EINTR)
> +continue;
> +#endif
> +  return 0;
> +}
> +
> +/* Unique write lock.  No other lock can be held on this lockfile.
> +   Only locks if this filelock is not locked by any other process.
> +   Return whether locking was successful.  */
> +int
> +lockfile::try_lock_write ()
> +{
> +  fd = open (filename.c_str (), O_RDWR | O_CREAT, 0666);
> +  if (fd < 0)
> +return -1;
> +
> +#if HAVE_FCNTL_H
> +  struct flock s_flock;
> +
> +  s_flock.l_whence = SEEK_SET;
> +  s_flock.l_start = 0;
> +  s_flock.l_len = 0;
> +  s_flock.l_pid = getpid ();
> +  s_flock.l_type = F_WRLCK;
> +
> +  if (fcntl (fd, F_SETLK, &s_flock) == -1)
> +{
> +  close (fd);
> +  fd = -1;
> +  return 1;
> +}
> +#endif
> +  return 0;
> +}
> +
> +/* Shared read lock.  Only read lock can be held concurrently.
> +   If write lock is already held by this process, it will be
> +   changed to read lock.
> +   Blocking call.  */
> +int
> +lockfile::lock_read ()
> +{
> +  fd = open (filename.c_str (), O_RDWR | O_CREAT, 0666);
> +  if (fd < 0)
> +return -1;
> +
> +#if HAVE_FCNTL_H
> +  struct flock s_flock;
> +
> +  s_flock.l_whence = SEEK_SET;
> +  s_flock.l_start = 0;
> +  s_flock.l_len = 0;
> +  s_flock.l_pid = getpid ();
> +  s_flock.l_type = F_RDLCK;
> +
> +  while (fcntl (fd, F_SETLKW, &s_flock) && errno == EINTR)
> +continue;
> +#endif
> +  return 0;
> +}
> +
> +/* Unlock all previously placed locks.  */
> +void
> +lockfile::unlock ()
> +{
> +  if (fd < 

Re: [PATCH 2/7] lto: Remove random_seed from section name.

2023-12-29 Thread Jan Hubicka
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * lto-streamer.cc (lto_get_section_name): Remove random_seed in WPA.

This is also OK. (since it lacks explanation - the random suffixes are
added for ld -r to work.  This never happens between WPA and ltrans, so
they only consume extra space and confuse the ltrans cache).
> ---
>  gcc/lto-streamer.cc | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/gcc/lto-streamer.cc b/gcc/lto-streamer.cc
> index 4968fd13413..53275e32618 100644
> --- a/gcc/lto-streamer.cc
> +++ b/gcc/lto-streamer.cc
> @@ -132,11 +132,17 @@ lto_get_section_name (int section_type, const char 
> *name,
>   doesn't confuse the reader with merged sections.
>  
>   For options don't add a ID, the option reader cannot deal with them
> - and merging should be ok here. */
> + and merging should be ok here.
> +
> + WPA output is sent to LTRANS directly inside of lto-wrapper, so name
> + uniqueness for external tools is not needed.
> + Randomness would inhibit incremental LTO.  */
>if (section_type == LTO_section_opts)
>  strcpy (post, "");
>else if (f != NULL) 
>  sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, f->id);
> +  else if (flag_wpa)
> +strcpy (post, ".0");
Can't post be just an empty string?
>else
>  sprintf (post, "." HOST_WIDE_INT_PRINT_HEX_PURE, get_random_seed 
> (false)); 
>char *res = concat (section_name_prefix, sep, add, post, NULL);
> -- 
> 2.42.1
> 


Re: [PATCH 1/7] lto: Skip flag OPT_fltrans_output_list_.

2023-12-29 Thread Jan Hubicka
Hi,
> Bootstrapped/regtested on x86_64-pc-linux-gnu
> 
> gcc/ChangeLog:
> 
>   * lto-opts.cc (lto_write_options): Skip OPT_fltrans_output_list_.
OK,
thanks,
Honza
> ---
>  gcc/lto-opts.cc | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/gcc/lto-opts.cc b/gcc/lto-opts.cc
> index c9bee9d4197..0451e290c75 100644
> --- a/gcc/lto-opts.cc
> +++ b/gcc/lto-opts.cc
> @@ -152,6 +152,7 @@ lto_write_options (void)
>   case OPT_fprofile_prefix_map_:
>   case OPT_fcanon_prefix_map:
>   case OPT_fwhole_program:
> + case OPT_fltrans_output_list_:
> continue;
>  
>   default:
> -- 
> 2.42.1
> 


Re: [PATCH v7] Add condition coverage (MC/DC)

2023-12-29 Thread Jan Hubicka
> gcc/ChangeLog:
> 
>   * builtins.cc (expand_builtin_fork_or_exec): Check
> condition_coverage_flag.
>   * collect2.cc (main): Add -fno-condition-coverage to OBSTACK.
>   * common.opt: Add new options -fcondition-coverage and
> -Wcoverage-too-many-conditions.
>   * doc/gcov.texi: Add --conditions documentation.
>   * doc/invoke.texi: Add -fcondition-coverage documentation.
>   * function.cc (free_after_compilation): Clear conditions.
>   (allocate_struct_function): Allocate conditions.
>   (basic_condition_uid): New.
>   * function.h (struct function): Add conditions.
>   (basic_condition_uid): New declaration.
>   * gcc.cc: Link gcov on -fcondition-coverage.
>   * gcov-counter.def (GCOV_COUNTER_CONDS): New.
>   * gcov-dump.cc (tag_conditions): New.
>   * gcov-io.h (GCOV_TAG_CONDS): New.
>   (GCOV_TAG_CONDS_LENGTH): New.
>   (GCOV_TAG_CONDS_NUM): New.
>   * gcov.cc (class condition_info): New.
>   (condition_info::condition_info): New.
>   (condition_info::popcount): New.
>   (struct coverage_info): New.
>   (add_condition_counts): New.
>   (output_conditions): New.
>   (print_usage): Add -g, --conditions.
>   (process_args): Likewise.
>   (output_intermediate_json_line): Output conditions.
>   (read_graph_file): Read condition counters.
>   (read_count_file): Likewise.
>   (file_summary): Print conditions.
>   (accumulate_line_info): Accumulate conditions.
>   (output_line_details): Print conditions.
>   * gimplify.cc (next_cond_uid): New.
>   (reset_cond_uid): New.
>   (shortcut_cond_r): Set condition discriminator.
>   (tag_shortcut_cond): New.
>   (shortcut_cond_expr): Set condition discriminator.
>   (gimplify_cond_expr): Likewise.
>   (gimplify_function_tree): Call reset_cond_uid.
>   * ipa-inline.cc (can_early_inline_edge_p): Check
> condition_coverage_flag.
>   * ipa-split.cc (pass_split_functions::gate): Likewise.
>   * passes.cc (finish_optimization_passes): Likewise.
>   * profile.cc (struct condcov): New declaration.
>   (cov_length): Likewise.
>   (cov_blocks): Likewise.
>   (cov_masks): Likewise.
>   (cov_maps): Likewise.
>   (cov_free): Likewise.
>   (instrument_decisions): New.
>   (read_thunk_profile): Control output to file.
>   (branch_prob): Call find_conditions, instrument_decisions.
>   (init_branch_prob): Add total_num_conds.
>   (end_branch_prob): Likewise.
>   * tree-core.h (struct tree_exp): Add condition_uid.
>   * tree-profile.cc (struct conds_ctx): New.
>   (CONDITIONS_MAX_TERMS): New.
>   (EDGE_CONDITION): New.
>   (topological_cmp): New.
>   (index_of): New.
>   (single_p): New.
>   (single_edge): New.
>   (contract_edge_up): New.
>   (struct outcomes): New.
>   (conditional_succs): New.
>   (condition_index): New.
>   (masking_vectors): New.
>   (emit_assign): New.
>   (emit_bitwise_op): New.
>   (make_top_index_visit): New.
>   (make_top_index): New.
>   (paths_between): New.
>   (struct condcov): New.
>   (cov_length): New.
>   (cov_blocks): New.
>   (cov_masks): New.
>   (cov_maps): New.
>   (cov_free): New.
>   (gimple_cond_uid): New.
>   (find_conditions): New.
>   (struct counters): New.
>   (find_counters): New.
>   (resolve_counter): New.
>   (resolve_counters): New.
>   (instrument_decisions): New.
>   (tree_profiling): Check condition_coverage_flag.
>   (pass_ipa_tree_profile::gate): Likewise.
>   * tree.h (SET_EXPR_UID): New.
>   (EXPR_COND_UID): New.
> 
> libgcc/ChangeLog:
> 
>   * libgcov-merge.c (__gcov_merge_ior): New.
> 
> gcc/testsuite/ChangeLog:
> 
>   * lib/gcov.exp: Add condition coverage test function.
>   * g++.dg/gcov/gcov-18.C: New test.
>   * gcc.misc-tests/gcov-19.c: New test.
>   * gcc.misc-tests/gcov-20.c: New test.
>   * gcc.misc-tests/gcov-21.c: New test.
>   * gcc.misc-tests/gcov-22.c: New test.
>   * gcc.misc-tests/gcov-23.c: New test.

Sorry for taking so long on this - I needed some time to actually try
the patch, since generally we will need more changes in the frontend to
preserve conditionals intact till gimple.
> This revision brings quite a few changes, some of which warrant a more
> careful review.
> 
> 1. Basic conditions are tied to the Boolean expression during
>gimplification, not through CFG analysis. The CFG analysis seemed to
>work well up until constructs like a && fn (b && c) && d where
>fn(...) seems indistinguishable from then-blocks. This wipes out much
>of the implementation in tree-profile.cc.
> 2. I changed the flag from -fprofile-conditions to -fcondition-coverage.
>-fprofile-conditions was chosen because of its symmetry with
>-fprofile-arcs, but -fcondition-coverage does feel more appropriate.

This seems good. 

[PATCH]middle-end: maintain LCSSA form when peeled vector iterations have virtual operands

2023-12-29 Thread Tamar Christina
Hi All,

This patch fixes several interconnected issues.

1. When picking an exit we wanted to check that niter_desc.may_be_zero is not
   true, i.e. we want to pick an exit which we know will iterate at least once.
   However niter_desc.may_be_zero is not a boolean.  It is a tree that encodes
   a boolean value.  !niter_desc.may_be_zero is just checking whether we have
   some information, not what the information is (see the sketch after this
   list).  This leads us to pick a more difficult-to-vectorize exit more often
   than we should.

2. Because we had this bug, we used to pick an alternative exit much more often,
   which exposed one issue: when the loop accesses memory and we "invert it" we
   would corrupt the VUSE chain.  This is because on a peeled vector iteration
   every exit restarts the loop (i.e. they're all early) BUT since we may have
   performed a store, the VUSE would need to be updated.  This version maintains
   virtual PHIs correctly in these cases.   Note that we can't simply remove all
   of them and recreate them because we need the PHI nodes still in the right
   order for if skip_vector.

3. Since we're moving the stores to a safe location I don't think we actually
   need to analyze whether the store is in range of the memref,  because if we
   ever get there, we know that the loads must be in range, and if the loads are
   in range and we get to the store we know the early breaks were not taken and
   so the scalar loop would have done the VF stores too.

4. Instead of searching for where to move stores to, they should always be in
   the exit belonging to the latch.  We can only ever delay stores and even if we
   pick a different exit than the latch one as the main one, effects still
   happen in program order when vectorized.  If we don't move the stores to the
   latch exit but instead to wherever we pick as the "main" exit then we can
   perform incorrect memory accesses (luckily these are trapped by verify_ssa).

5. We only used to analyze loads inside the same BB as an early break, and also
   we'd never analyze the ones inside the block where we'd be moving memory
   references to.  This is obviously bogus and to fix it this patch splits apart
   the two constraints.  We first validate that all load memory references are
   in bounds and only after that do we perform the alias checks for the writes.
   This makes the code simpler to understand and more trivially correct.
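
As promised above, a minimal sketch of the check from point 1 (assuming a
tree_niter_desc named niter_desc in scope, as in vec_init_loop_exit_info):

  /* may_be_zero is a tree encoding a condition; the exit is known to run at
     least one iteration only when that condition is literally false.  */
  if (integer_zerop (niter_desc.may_be_zero))
    {
      /* Safe to prefer this exit: it iterates at least once.  */
    }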

Bootstrapped Regtested on aarch64-none-linux-gnu, x86_64-pc-linux-gnu
and no issues with --enable-checking=release --enable-lto
--with-build-config=bootstrap-O3 --enable-checking=yes,rtl,extra.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
(vect_do_peeling): Maintain virtual PHIs on inverted loops.
* tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closest to
latch.
(vect_create_loop_vinfo): Record all conds instead of only alt ones.
* tree-vectorizer.h: Fix comment.

gcc/testsuite/ChangeLog:

PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
* g++.dg/vect/vect-early-break_4-pr113137.cc: New test.
* g++.dg/vect/vect-early-break_5-pr113137.cc: New test.
* gcc.dg/vect/vect-early-break_95-pr113137.c: New test.
* gcc.dg/vect/vect-early-break_96-pr113136.c: New test.
* gcc.dg/vect/vect-early-break_97-pr113172.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc 
b/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc
new file mode 100644
index 
..f78db8669dcc65f1b45ea78f4433d175e1138332
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/vect-early-break_4-pr113137.cc
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+int b;
+void a() __attribute__((__noreturn__));
+void c() {
+  char *buf;
+  int bufsz = 64;
+  while (b) {
+!bufsz ? a(), 0 : *buf++ = bufsz--;
+b -= 4;
+  }
+}
diff --git a/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc 
b/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc
new file mode 100644
index 
..dcd19fa2d2145e09de18279479b3f20fc27336ba
--- /dev/null
+++ b/gcc/testsuite/g++.dg/vect/vect-early-break_5-pr113137.cc
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+char UnpackReadTables_BitLength[20];
+int UnpackReadTables_ZeroCount;
+void UnpackReadTables() {
+  for (unsigned I = 0; I < 20;)
+while 

[PATCH 1/2] RTX_COST: Count instructions

2023-12-29 Thread YunQiang Su
When we try to combine RTLs, the result may be very complex,
and `rtx_cost` may think that it costs a lot.  But in
fact, it may match a pattern in the machine description, which
may emit only 1 or 2 hardware instructions.  This combination
may be refused due to a cost comparison failure.

Since the high cost may be due to a more expensive operation,
to get the real reason we also need information about the instruction
count.

gcc

* rtl.h (struct full_rtx_costs): Add new members,
speed_count and size_count.
(init_costs_to_zero): Ditto.
(costs_add_n_insns): Add new argument, expensive.
(rtx_cost_and_count): New function.
* rtlanal.cc (rtx_cost): Call rtx_cost_and_count now.
(rtx_cost_and_count): New function.
(get_full_rtx_cost): Call rtx_cost_and_count now.
* hooks.cc (hook_bool_rtx_mode_int_int_intp_intp_bool_false):
New fallback hook function.
* hooks.h (hook_bool_rtx_mode_int_int_intp_intp_bool_false):
New fallback hook function.
* target.def (insn_costs): add new argument, count.
* doc/tm.texi (TARGET_RTX_COSTS): Ditto.
* config/aarch64/aarch64.cc (aarch64_rtx_costs_wrapper): Ditto.
* config/alpha/alpha.cc (alpha_rtx_costs): Ditto.
* config/arc/arc.cc (arc_rtx_costs): Ditto.
* config/arm/arm.cc (arm_rtx_costs): Ditto.
* config/avr/avr.cc (avr_rtx_costs): Ditto.
* config/bfin/bfin.cc (bfin_rtx_costs): Ditto.
* config/bpf/bpf.cc (bpf_rtx_costs): Ditto.
* config/c6x/c6x.cc (c6x_rtx_costs): Ditto.
* config/cris/cris.cc (cris_rtx_costs): Ditto.
* config/csky/csky.cc (csky_rtx_costs): Ditto.
* config/epiphany/epiphany.cc (epiphany_rtx_costs): Ditto.
* config/frv/frv.cc (frv_rtx_costs): Ditto.
* config/gcn/gcn.cc (gcn_rtx_costs): Ditto.
* config/h8300/h8300.cc (h8300_rtx_costs): Ditto.
* config/i386/i386.cc (i386_rtx_costs): Ditto.
* config/ia64/ia64.cc (ia64_rtx_costs): Ditto.
* config/iq2000/iq2000.cc (iq2000_rtx_costs): Ditto.
* config/lm32/lm32.cc (lm32_rtx_costs): Ditto.
* config/loongarch/loongarch.cc (loongarch_rtx_costs): Ditto.
* config/m32c/m32c.cc (m32c_rtx_costs): Ditto.
* config/m32c/m32r.cc (m32r_rtx_costs): Ditto.
* config/m68k/m68k.cc (m68k_rtx_costs): Ditto.
* config/mcore/mcore.cc (mcore_rtx_costs): Ditto.
* config/microblaze/microblaze.cc (microblaze_rtx_costs): Ditto.
* config/mips/mips.cc (mips_rtx_costs): Ditto.
* config/mmix/mmix.cc (mmix_rtx_costs): Ditto.
* config/mn10300/mn10300.cc (mn10300_rtx_costs): Ditto.
* config/msp430/msp430.cc (msp430_rtx_costs): Ditto.
* config/nds32/nds32.cc (nds32_rtx_costs): Ditto.
* config/nios2/nios2.cc (nios2_rtx_costs): Ditto.
* config/or1k/or1k.cc (or1k_rtx_costs): Ditto.
* config/pa/pa.cc (hppa_rtx_costs): Ditto.
* config/pdp11/pdp11.cc (pdp11_rtx_costs): Ditto.
* config/pru/pru.cc (pru_rtx_costs): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/rl78/rl78.cc (rl78_rtx_costs): Ditto.
* config/rs6000/rs6000.cc (rs6000_rtx_costs): Ditto.
(rs6000_debug_rtx_costs): Ditto.
* config/rx/rx.cc (rx_rtx_costs): Ditto.
* config/s390/s390.cc (s390_rtx_costs): Ditto.
* config/sh/sh.cc (sh_rtx_costs): Ditto.
* config/sparc/sparc.cc (sparc_rtx_costs): Ditto.
* config/stormy16/stormy16.cc (xstormy16_rtx_costs): Ditto.
* config/v850/v850.cc (v850_rtx_costs): Ditto.
* config/vax/vax.cc (vax_rtx_costs): Ditto.
* config/visium/visium.cc (visium_rtx_costs): Ditto.
* config/xtensa/xtensa.cc (xtensa_rtx_costs): Ditto.
---
 gcc/config/aarch64/aarch64.cc   |  3 +-
 gcc/config/alpha/alpha.cc   |  6 ++-
 gcc/config/arc/arc.cc   |  4 +-
 gcc/config/arm/arm.cc   |  7 +++-
 gcc/config/avr/avr.cc   | 10 +++--
 gcc/config/bfin/bfin.cc |  4 +-
 gcc/config/bpf/bpf.cc   |  1 +
 gcc/config/c6x/c6x.cc   |  6 ++-
 gcc/config/cris/cris.cc |  6 ++-
 gcc/config/csky/csky.cc |  9 -
 gcc/config/epiphany/epiphany.cc |  5 ++-
 gcc/config/frv/frv.cc   |  5 ++-
 gcc/config/gcn/gcn.cc   |  4 +-
 gcc/config/h8300/h8300.cc   |  5 ++-
 gcc/config/i386/i386.cc |  4 +-
 gcc/config/ia64/ia64.cc |  6 ++-
 gcc/config/iq2000/iq2000.cc |  7 +++-
 gcc/config/lm32/lm32.cc |  7 +++-
 gcc/config/loongarch/loongarch.cc   |  5 ++-
 gcc/config/m32c/m32c.cc |  4 +-
 gcc/config/m32r/m32r.cc |  8 +++-
 gcc/config/m68k/m68k.cc |  6 ++-
 gcc/config/mcore/mcore.cc   |  6 ++-
 gcc/config/microblaze/microblaze.cc |  4 +-
 gcc/config/mips/mips.cc |  5 ++-
 gcc/config/mmix/mmix.cc 

[PATCH 2/2] MIPS: Implement TARGET_INSN_COSTS

2023-12-29 Thread YunQiang Su
When combining some instructions, the generic `rtx_cost`
may overestimate the cost of the resulting RTL, because
the RTL may be quite complex and `rtx_cost` has no
information that this RTL can be converted to simple
hardware instruction(s).

In this case, let's use `get_attr_insn_count` to estimate
the cost.

gcc
* config/mips/mips.cc (mips_insn_cost): New function.
(mips_rtx_costs): Count instructions also.
---
 gcc/config/mips/mips.cc | 132 
 1 file changed, 120 insertions(+), 12 deletions(-)

diff --git a/gcc/config/mips/mips.cc b/gcc/config/mips/mips.cc
index 647095b6c81..d4251aca80e 100644
--- a/gcc/config/mips/mips.cc
+++ b/gcc/config/mips/mips.cc
@@ -4170,6 +4170,58 @@ mips_set_reg_reg_cost (machine_mode mode)
 }
 }
 
+/* Implement TARGET_INSN_COSTS.  */
+
+static int
+mips_insn_cost (rtx_insn *x, bool speed)
+{
+  int cost;
+  int count;
+  int attr_count;
+  int ratio;
+
+  rtx set = single_set (x);
+  if (!set
+  || (recog_memoized (x) < 0
+  && GET_CODE (PATTERN (x)) != ASM_INPUT
+  && asm_noperands (PATTERN (x)) < 0))
+{
+  if (set)
+   cost = set_rtx_cost (set, speed);
+  else
+   cost = pattern_cost (PATTERN (x), speed);
+  /* If the cost is zero, then it's likely a complex insn.
+We don't want the cost of these to be less than
+something we know about.  */
+  return cost ? cost : COSTS_N_INSNS (2);
+}
+
+  if (!speed)
+return get_attr_length (x);
+
+  cost = rtx_cost_and_count (SET_SRC (set), GET_MODE (SET_DEST (set)),
+ SET, 1, true, &count);
+  cost = cost ? cost : COSTS_N_INSNS (2);
+  count = count ? count : cost / COSTS_N_INSNS (1);
+  attr_count = get_attr_insn_count (x);
+  ratio = get_attr_perf_ratio (x);
+
+  /* The estimating of rtx_cost_and_count seems good.
+ If we have ratio, we trust it more.  If we don't have ratio,
+ we trust rtx_cost more: so x2.  */
+  if (ratio == 0 && count < attr_count * 2)
+return cost;
+
+  /* Over estimate the count of instructions.  It normally means that
+ we can combine some INSNs, but rtx_cost have no idea about it.  */
+  if (ratio > 0)
+return get_attr_insn_count (x) * COSTS_N_INSNS (1) * ratio;
+  else if (cost > count && count > 0)
+return get_attr_insn_count (x) * cost / count;
+  else
+return get_attr_insn_count (x) * COSTS_N_INSNS (1);
+}
+
 /* Implement TARGET_RTX_COSTS.  */
 
 static bool
@@ -4195,6 +4247,7 @@ mips_rtx_costs (rtx x, machine_mode mode, int outer_code,
 {
   gcc_assert (CONSTANT_P (x));
   *total = 0;
+  *count = 0;
   return true;
 }
 
@@ -4214,6 +4267,7 @@ mips_rtx_costs (rtx x, machine_mode mode, int outer_code,
  && UINTVAL (x) == 0x)
{
  *total = 0;
+ *count = 0;
  return true;
}
 
@@ -4223,6 +4277,7 @@ mips_rtx_costs (rtx x, machine_mode mode, int outer_code,
  if (cost >= 0)
{
  *total = cost;
+ *count = cost / COSTS_N_INSNS (1);
  return true;
}
}
@@ -4236,6 +4291,7 @@ mips_rtx_costs (rtx x, machine_mode mode, int outer_code,
  if (speed || mips_immediate_operand_p (outer_code, INTVAL (x)))
{
  *total = 0;
+ *count = 0;
  return true;
}
}
@@ -4248,6 +4304,7 @@ mips_rtx_costs (rtx x, machine_mode mode, int outer_code,
   if (force_to_mem_operand (x, VOIDmode))
{
  *total = COSTS_N_INSNS (1);
+ *count = 1;
  return true;
}
   cost = mips_const_insns (x);
@@ -4281,10 +4338,12 @@ mips_rtx_costs (rtx x, machine_mode mode, int 
outer_code,
   && (outer_code == SET || GET_MODE (x) == VOIDmode))
cost = 1;
  *total = COSTS_N_INSNS (cost);
+ *count = cost;
  return true;
}
   /* The value will need to be fetched from the constant pool.  */
   *total = CONSTANT_POOL_COST;
+  *count = CONSTANT_POOL_COST / COSTS_N_INSNS (1);
   return true;
 
 case MEM:
@@ -4295,6 +4354,7 @@ mips_rtx_costs (rtx x, machine_mode mode, int outer_code,
   if (cost > 0)
{
  *total = COSTS_N_INSNS (cost + 1);
+ *count = cost;
  return true;
}
   /* Check for a scaled indexed address.  */
@@ -4302,6 +4362,7 @@ mips_rtx_costs (rtx x, machine_mode mode, int outer_code,
  || mips_lx_address_p (addr, mode))
{
  *total = COSTS_N_INSNS (2);
+ *count = 1;
  return true;
}
   /* Otherwise use the default handling.  */
@@ -4309,10 +4370,12 @@ mips_rtx_costs (rtx x, machine_mode mode, int 
outer_code,
 
 case FFS:
   *total = COSTS_N_INSNS (6);
+  *count = 1;
   return false;
 
 case NOT:
-  *total = COSTS_N_INSNS (GET_MODE_SIZE (mode) > UNITS_PER_WORD ? 2 : 1);
+  *count = (GET_MODE_SIZE (mode) > UNITS_PER_WORD ? 2 : 

[PATCH v2 2/2] MIPS: define_attr perf_ratio in mips.md

2023-12-29 Thread YunQiang Su
The accurate cost of a pattern can be obtained with
 insn_count * perf_ratio

The default value is set to 0 instead of 1, since we need
to distinguish the default value from a value that has
really been set for a pattern.  Since it is not set for most
patterns yet, to use it, we will need to be sure that its
value is greater than 0.

This attr will be used in `mips_insn_cost`.

gcc

* config/mips/mips.md (perf_ratio): New attribute.
---
 gcc/config/mips/mips.md | 4 
 1 file changed, 4 insertions(+)

diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
index a4c6d630aeb..7db341c694c 100644
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -312,6 +312,10 @@ (define_attr "sync_insn2" "nop,and,xor,not"
 ;; "11" specifies MEMMODEL_ACQUIRE.
 (define_attr "sync_memmodel" "" (const_int 10))
 
+;; Performance ratio.  Add this attr to the slow INSNs.
+;; Used by mips_insn_cost.
+(define_attr "perf_ratio" "" (const_int 0))
+
 ;; Accumulator operand for madd patterns.
 (define_attr "accum_in" "none,0,1,2,3,4,5" (const_string "none"))
 
-- 
2.39.2



[PATCH v2 1/2] MIPS: add pattern insqisi_extended and inshisi_extended

2023-12-29 Thread YunQiang Su
This match pattern allows combining (zero_extract:DI 8, 24, QI)
with a sign-extend into a 32-bit INS instruction on TARGET_64BIT.

The problem is that, for SI mode, if the sign bit is modified by
bitops, we will need a sign-extend operation.
The 32-bit INS instruction guarantees that the result is sign-extended,
and the QImode src register is safe for INS, too.

(insn 19 18 20 2 (set (zero_extract:DI (reg/v:DI 200 [ val ])
(const_int 8 [0x8])
(const_int 24 [0x18]))
(subreg:DI (reg:QI 205) 0)) "../xx.c":7:29 -1
 (nil))
(insn 20 19 23 2 (set (reg/v:DI 200 [ val ])
(sign_extend:DI (subreg:SI (reg/v:DI 200 [ val ]) 0))) "../xx.c":7:29 -1
 (nil))

Combine try to merge them to:

(insn 20 19 23 2 (set (reg/v:DI 200 [ val ])
(sign_extend:DI (ior:SI (and:SI (subreg:SI (reg/v:DI 200 [ val ]) 0)
(const_int 16777215 [0xff]))
(ashift:SI (subreg:SI (reg:QI 205 [ MEM[(const unsigned char 
*)buf_8(D) + 3B] ]) 0)
(const_int 24 [0x18]) "../xx.c":7:29 18 {*insv_extended}
 (expr_list:REG_DEAD (reg:QI 205 [ MEM[(const unsigned char *)buf_8(D) + 
3B] ])
(nil)))

Let's accept this pattern.
Note: with this patch, we cannot get INS yet: rtx_cost considers
the latter more expensive than the previous two.

And do similarly for the 16/16 pair:
(insn 13 12 14 2 (set (zero_extract:DI (reg/v:DI 198 [ val ])
(const_int 16 [0x10])
(const_int 16 [0x10]))
(subreg:DI (reg:HI 201 [ MEM[(const short unsigned int *)buf_6(D) + 2B] 
]) 0)) "xx.c":5:30 286 {*insvdi}
 (expr_list:REG_DEAD (reg:HI 201 [ MEM[(const short unsigned int *)buf_6(D) 
+ 2B] ])
(nil)))
(insn 14 13 17 2 (set (reg/v:DI 198 [ val ])
(sign_extend:DI (subreg:SI (reg/v:DI 198 [ val ]) 0))) "xx.c":5:30 241 
{extendsidi2}
 (nil))
>
(insn 14 13 17 2 (set (reg/v:DI 198 [ val ])
(sign_extend:DI (ior:SI (ashift:SI (subreg:SI (reg:HI 201 [ MEM[(const 
short unsigned int *)buf_6(D) + 2B] ]) 0)
(const_int 16 [0x10]))
(zero_extend:SI (subreg:HI (reg/v:DI 198 [ val ]) 0) 
"xx.c":5:30 284 {*inshisi_extended}
 (expr_list:REG_DEAD (reg:HI 201 [ MEM[(const short unsigned int *)buf_6(D) 
+ 2B] ])
(nil)))

gcc

* config/mips/mips.md (insqisi_extended): New pattern.
(inshisi_extended): Ditto.
---
 gcc/config/mips/mips.md | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
index 0666310734e..a4c6d630aeb 100644
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -4415,6 +4415,28 @@ (define_insn "*extzv_truncsi_exts"
   [(set_attr "type" "arith")
(set_attr "mode" "SI")])
 
+(define_insn "*insqisi_extended"
+  [(set (match_operand:DI 0 "register_operand" "=d")
+(sign_extend:DI
+  (ior:SI (and:SI (subreg:SI (match_dup 0) 0)
+   (const_int 16777215))
+ (ashift:SI
+   (subreg:SI (match_operand:QI 1 "register_operand" "d") 0)
+   (const_int 24)]
+  "TARGET_64BIT && !TARGET_MIPS16 && ISA_HAS_EXT_INS"
+  "ins\t%0,%1,24,8"
+  [(set_attr "mode" "SI")])
+
+(define_insn "*inshisi_extended"
+  [(set (match_operand:DI 0 "register_operand" "=d")
+(sign_extend:DI
+  (ior:SI
+   (ashift:SI (subreg:SI (match_operand:HI 1 "register_operand" "d") 0)
+ (const_int 16))
+   (zero_extend:SI (subreg:HI (match_dup 0) 0)]
+  "TARGET_64BIT && !TARGET_MIPS16 && ISA_HAS_EXT_INS"
+  "ins\t%0,%1,16,16"
+  [(set_attr "mode" "SI")])
 
 (define_expand "insvmisalign"
   [(set (zero_extract:GPR (match_operand:BLK 0 "memory_operand")
-- 
2.39.2



Re: [C PATCH] C: Fix type compatibility for structs with variable sized fields.

2023-12-29 Thread Joseph Myers
On Wed, 27 Dec 2023, Martin Uecker wrote:

> This patch hopefully fixes the test failure we see with gnu23-tag-4.c.
> It does for me locally with -march=native (which otherwise reproduces
> the problem).
> 
> Bootstrapped and regession tested on x86_64
> 
> 
> C: Fix type compatibility for structs with variable sized fields.
> 
> This fixes the test gcc.dg/gnu23-tag-4.c introduced by commit 23fee88f
> which fails for -march=... because the DECL_FIELD_BIT_OFFSET are set
> inconsistently for types with and without variable-sized field.  This
> is fixed by testing for DECL_ALIGN instead.  The code is further
> simplified by removing some unnecessary conditions, i.e. anon_field is
> set unconditionaly and all fields are assumed to be DECL_FIELDs.
> 
> gcc/c:
>   * c-typeck.c (tagged_types_tu_compatible_p): Revise.
> 
> gcc/testsuite:
>   * gcc.dg./c23-tag-9.c: New test.

OK.

-- 
Joseph S. Myers
jos...@codesourcery.com


Re: [PATCH]AArch64 Update costing for vector conversions [PR110625]

2023-12-29 Thread Richard Sandiford
Tamar Christina  writes:
> Hi All,
>
> In gimple the operation
>
> short _8;
> double _9;
> _9 = (double) _8;
>
> denotes two operations.  First we have to widen from short to long and then
> convert this integer to a double.

Think it's worth saying "two operations on AArch64".  Some targets
can do int->double directly.

Saying that would explain...

> Currently however we only count the widen/truncate operations:
>
> (double) _5 6 times vec_promote_demote costs 12 in body
> (double) _5 12 times vec_promote_demote costs 24 in body
>
> but not the actual conversion operation, which needs an additional 12
> instructions in the attached testcase.   Without this the attached testcase 
> ends
> up incorrectly thinking that it's beneficial to vectorize the loop at a very
> high VF = 8 (4x unrolled).
>
> Because we can't change the mid-end to account for this the costing code in 
> the

...why we can't do this.

> backend now keeps track of whether the previous operation was a
> promotion/demotion and adjusts the expected number of instructions to:
>
> 1. If it's the first FLOAT_EXPR and the precision of the lhs and rhs are
>different, double it, since we need to convert and promote.
> 2. If the previous operation was a demotion/promotion then reduce the
>cost of the current operation by the amount we added extra in the last.
>
> with the patch we get:
>
> (double) _5 6 times vec_promote_demote costs 24 in body
> (double) _5 12 times vec_promote_demote costs 36 in body
>
> which correctly accounts for 30 operations.
>
> This fixes the regression reported on Neoverse N2 and using the new generic
> Armv9-a cost model.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR target/110625
>   * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
>   Adjust throughput and latency calculations for vector conversions.
>   (class aarch64_vector_costs): Add m_num_last_promote_demote.
>
> gcc/testsuite/ChangeLog:
>
>   PR target/110625
>   * gcc.target/aarch64/pr110625_4.c: New test.
>   * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Add
>   --param aarch64-sve-compare-costs=0.
>   * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> f9850320f61c5ddccf47e6583d304e5f405a484f..561413e52717974b96f79cc83008f237c536
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16077,6 +16077,15 @@ private:
>   leaving a vectorization of { elts }.  */
>bool m_stores_to_vector_load_decl = false;
>  
> +  /* Non-zero if the last operation we costed is a vector promotion or 
> demotion.
> + In this case the value is the number of insn in the last operation.

s/insn/insns/

OK with those changes.  Thanks for tracking this down and working
out what was missing.

Richard

> +
> + On AArch64 vector promotion and demotions require us to first widen or
> + narrow the input and only after that emit conversion instructions.  For
> + costing this means we need to emit the cost of the final conversions as
> + well.  */
> +  unsigned int m_num_last_promote_demote = 0;
> +
>/* - If M_VEC_FLAGS is zero then we're costing the original scalar code.
>   - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
> SIMD code.
> @@ -17132,6 +17141,29 @@ aarch64_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>  stmt_cost = aarch64_sve_adjust_stmt_cost (m_vinfo, kind, stmt_info,
> vectype, stmt_cost);
>  
> +  /*  Vector promotion and demotion requires us to widen the operation first
> +  and only after that perform the conversion.  Unfortunately the mid-end
> +  expects this to be doable as a single operation and doesn't pass on
> +  enough context here for us to tell which operation is happening.  To
> +  account for this we count every promote-demote operation twice and if
> +  the previously costed operation was also a promote-demote we reduce
> +  the cost of the currently being costed operation to simulate the final
> +  conversion cost.  Note that for SVE we can do better here if the 
> converted
> +  value comes from a load since the widening load would consume the 
> widening
> +  operations.  However since we're in stage 3 we can't change the helper
> +  vect_is_extending_load and duplicating the code seems not useful.  */
> +  gassign *assign = NULL;
> +  if (kind == vec_promote_demote
> +  && (assign = dyn_cast  (STMT_VINFO_STMT (stmt_info)))
> +  && gimple_assign_rhs_code (assign) == FLOAT_EXPR)
> +{
> +  auto new_count = count * 2 - m_num_last_promote_demote;
> +  m_num_last_promote_demote = count;
> +  count = new_count;
> +}
> +  else
> +

[PATCH]middle-end: Fix dominators updates when peeling with multiple exits [PR113144]

2023-12-29 Thread Tamar Christina
Hi All,

Only trying to update certain dominators doesn't seem to work very well,
because as the loop gets versioned, peeled, or takes the skip_vector path we
end up with very complicated control flow.  This means that the final merge blocks for the
loop exit are not easy to find or update.

Instead of trying to pick which exits to update, this changes it to update all
the blocks reachable from the new exits.  This is because they'll contain common
blocks with e.g. the versioned loop.  It's these blocks that need an update
most of the time.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR middle-end/113144
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Update all dominators reachable from exit.

gcc/testsuite/ChangeLog:

PR middle-end/113144
* gcc.dg/vect/vect-early-break_94-pr113144.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
new file mode 100644
index 
..903fe7be6621e81db6f29441e4309fa213d027c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_94-pr113144.c
@@ -0,0 +1,41 @@
+/* { dg-do compile } */
+/* { dg-add-options vect_early_break } */
+/* { dg-require-effective-target vect_early_break } */
+/* { dg-require-effective-target vect_int } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+
+long tar_atol256_max, tar_atol256_size, tar_atosl_min;
+char tar_atol256_s;
+void __errno_location();
+
+
+inline static long tar_atol256(long min) {
+  char c;
+  int sign;
+  c = tar_atol256_s;
+  sign = c;
+  while (tar_atol256_size) {
+if (c != sign)
+  return sign ? min : tar_atol256_max;
+c = tar_atol256_size--;
+  }
+  if ((c & 128) != (sign & 128))
+return sign ? min : tar_atol256_max;
+  return 0;
+}
+
+inline static long tar_atol(long min) {
+  return tar_atol256(min);
+}
+
+long tar_atosl() {
+  long n = tar_atol(-1);
+  if (tar_atosl_min) {
+__errno_location();
+return 0;
+  }
+  if (n > 0)
+return 0;
+  return n;
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
1066ea17c5674e03412b3dcd8a62ddf4dd54cf31..3810983a80c8b989be9fd9a9993642069fd39b99
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -1716,8 +1716,6 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop *loop, 
edge loop_exit,
  /* Now link the alternative exits.  */
  if (multiple_exits_p)
{
- set_immediate_dominator (CDI_DOMINATORS, new_preheader,
-  main_loop_exit_block);
  for (auto gsi_from = gsi_start_phis (loop->header),
   gsi_to = gsi_start_phis (new_preheader);
   !gsi_end_p (gsi_from) && !gsi_end_p (gsi_to);
@@ -1751,12 +1749,26 @@ slpeel_tree_duplicate_loop_to_edge_cfg (class loop 
*loop, edge loop_exit,
 
   /* Finally after wiring the new epilogue we need to update its main exit
 to the original function exit we recorded.  Other exits are already
-correct.  */
+correct.  Because of versioning, skip vectors and others we must update
+the dominators of every node reachable by the new exits.  */
   if (multiple_exits_p)
{
  update_loop = new_loop;
- for (edge e : get_loop_exit_edges (loop))
-   doms.safe_push (e->dest);
+ hash_set  visited;
+ auto_vec  workset;
+ edge ev;
+ edge_iterator ei;
+ workset.safe_splice (get_loop_exit_edges (loop));
+ while (!workset.is_empty ())
+   {
+ auto bb = workset.pop ()->dest;
+ if (visited.add (bb))
+   continue;
+ doms.safe_push (bb);
+ FOR_EACH_EDGE (ev, ei, bb->succs)
+   workset.safe_push (ev);
+   }
+ visited.empty ();
  doms.safe_push (exit_dest);
 
  /* Likely a fall-through edge, so update if needed.  */





[PATCH]middle-end: rejects loops with nonlinear inductions and early breaks [PR113163]

2023-12-29 Thread Tamar Christina
Hi All,

We can't support nonlinear inductions other than neg when vectorizing
early breaks and the iteration count is known.

For early breaks we currently require a peeled epilogue, but for these
inductions we can't compute the remaining values.
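
For concreteness, a self-contained sketch in the spirit of the attached
testcase (my own reduction, not part of the patch): the shift makes the
induction nonlinear, so with the early break the epilogue values cannot
be recomputed.

/* tmp >>= 6 is a nonlinear (shift) induction; combined with the early
   break, the remaining iterations for the epilogue can't be computed,
   so such loops are now rejected.  */
unsigned long
encode (unsigned long tmp, char *out)
{
  for (int i = 0; i < 6; ++i)
    {
      if (tmp == 0)
        break;            /* early break */
      out[i] = (char) (tmp & 63);
      tmp >>= 6;          /* nonlinear induction step */
    }
  return tmp;
}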

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Tested with a cross cc1 for amdgcn-amdhsa and the issue is fixed.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR middle-end/113163
* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):

gcc/testsuite/ChangeLog:

PR middle-end/113163
* gcc.target/gcn/pr113163.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/testsuite/gcc.target/gcn/pr113163.c 
b/gcc/testsuite/gcc.target/gcn/pr113163.c
new file mode 100644
index 
..99b0fdbaf3a3152ca008b5109abf6e80d8cb3d6a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/gcn/pr113163.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -ftree-vectorize" } */ 
+
+struct _reent { union { struct { char _l64a_buf[8]; } _reent; } _new; };
+static const char R64_ARRAY[] = 
"./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
+char *
+_l64a_r (struct _reent *rptr,
+ long value)
+{
+  char *ptr;
+  char *result;
+  int i, index;
+  unsigned long tmp = (unsigned long)value & 0x;
+  result = 
+  ((
+  rptr
+  )->_new._reent._l64a_buf)
+   ;
+  ptr = result;
+  for (i = 0; i < 6; ++i)
+{
+  if (tmp == 0)
+ {
+   *ptr = '\0';
+   break;
+ }
+  *ptr++ = R64_ARRAY[index];
+  tmp >>= 6;
+}
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index 
3810983a80c8b989be9fd9a9993642069fd39b99..f1bf43b3731868e7b053c186302fbeaf515be8cf
 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -2075,6 +2075,22 @@ vect_can_peel_nonlinear_iv_p (loop_vec_info loop_vinfo,
   return false;
 }
 
+  /* We can't support partial vectors and early breaks with an induction
+ type other than add or neg since we require the epilog and can't
+ perform the peeling.  PR113163.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+  && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()
+  && LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+  && induction_type != vect_step_op_neg)
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"Peeling for epilogue is not supported"
+" for nonlinear induction except neg"
+" when iteration count is known and early breaks.\n");
+  return false;
+}
+
   return true;
 }
 





[PATCH 20/21]Arm: Add Advanced SIMD cbranch implementation

2023-12-29 Thread Tamar Christina
Hi All,

This adds an implementation of the conditional branch optab for AArch32.
The previous version only allowed operand 0 but it looks like cbranch
expansion does not check with the target and so we have to implement all.

I therefore did not commit it.  This is a larger version. 

For e.g.

void f1 ()
{
  for (int i = 0; i < N; i++)
{
  b[i] += a[i];
  if (a[i] > 0)
break;
}
}

For 128-bit vectors we generate:

vcgt.s32   q8, q9, #0
vpmax.u32   d7, d16, d17
vpmax.u32   d7, d7, d7
vmov   r3, s14 @ int
cmp r3, #0

and for 64-bit vectors we can omit one vpmax, as we still need to compress to
32 bits.
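
For completeness, a self-contained variant of the f1 example above; the
array and size declarations are my own assumptions, not taken from the
testsuite:

#define N 1024
int a[N], b[N];

void
f1 (void)
{
  for (int i = 0; i < N; i++)
    {
      b[i] += a[i];
      if (a[i] > 0)
        break;
    }
}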

Bootstrapped Regtested on arm-none-linux-gnueabihf and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

* config/arm/neon.md (cbranch4): New.

gcc/testsuite/ChangeLog:

* gcc.dg/vect/vect-early-break_2.c: Skip Arm.
* gcc.dg/vect/vect-early-break_7.c: Likewise.
* gcc.dg/vect/vect-early-break_75.c: Likewise.
* gcc.dg/vect/vect-early-break_77.c: Likewise.
* gcc.dg/vect/vect-early-break_82.c: Likewise.
* gcc.dg/vect/vect-early-break_88.c: Likewise.
* lib/target-supports.exp (add_options_for_vect_early_break,
check_effective_target_vect_early_break_hw,
check_effective_target_vect_early_break): Support AArch32.
* gcc.target/arm/vect-early-break-cbranch.c: New test.

--- inline copy of patch -- 
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 
d213369ffc38fb88ad0357d848cc7da5af73bab7..0f088a51d31e6882bc0fabbad99862b8b465dd22
 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -408,6 +408,54 @@ (define_insn "vec_extract"
   [(set_attr "type" "neon_store1_one_lane,neon_to_gp")]
 )
 
+;; Patterns comparing two vectors and conditionally jump.
+;; Advanced SIMD lacks a vector != comparison, but this is a quite common
+;; operation.  To not pay the penalty for inverting == we can map our any
+;; comparisons to all i.e. any(~x) => all(x).
+;;
+;; However unlike the AArch64 version, we can't optimize this further as the
+;; chain is too long for combine due to these being unspecs so it doesn't fold
+;; the operation to something simpler.
+(define_expand "cbranch4"
+  [(set (pc) (if_then_else
+ (match_operator 0 "expandable_comparison_operator"
+  [(match_operand:VDQI 1 "register_operand")
+   (match_operand:VDQI 2 "reg_or_zero_operand")])
+ (label_ref (match_operand 3 "" ""))
+ (pc)))]
+  "TARGET_NEON"
+{
+  rtx mask = operands[1];
+
+  /* If comparing against a non-zero vector we have to do a comparison first
+ so we can have a != 0 comparison with the result.  */
+  if (operands[2] != CONST0_RTX (mode))
+{
+  mask = gen_reg_rtx (mode);
+  emit_insn (gen_xor3 (mask, operands[1], operands[2]));
+}
+
+  /* For 128-bit vectors we need an additional reduction.  */
+  if (known_eq (128, GET_MODE_BITSIZE (mode)))
+{
+  /* Always reduce using a V4SI.  */
+  mask = gen_reg_rtx (V2SImode);
+  rtx low = gen_reg_rtx (V2SImode);
+  rtx high = gen_reg_rtx (V2SImode);
+  rtx op1 = simplify_gen_subreg (V4SImode, operands[1], mode, 0);
+  emit_insn (gen_neon_vget_lowv4si (low, op1));
+  emit_insn (gen_neon_vget_highv4si (high, op1));
+  emit_insn (gen_neon_vpumaxv2si (mask, low, high));
+}
+
+  emit_insn (gen_neon_vpumaxv2si (mask, mask, mask));
+
+  rtx val = gen_reg_rtx (SImode);
+  emit_move_insn (val, gen_lowpart (SImode, mask));
+  emit_jump_insn (gen_cbranch_cc (operands[0], val, const0_rtx, operands[3]));
+  DONE;
+})
+
 ;; This pattern is renamed from "vec_extract" to
 ;; "neon_vec_extract" and this pattern is called
 ;; by define_expand in vec-common.md file.
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
index 
5c32bf94409e9743e72429985ab3bf13aab8f2c1..dec0b492ab883de6e02944a95fd554a109a68a39
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_2.c
@@ -5,7 +5,7 @@
 
 /* { dg-additional-options "-Ofast" } */
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! 
"arm*-*-*" } } } } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c 
b/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c
index 
8c86c5034d7522b3733543fb384a23c5d6ed0fcf..d218a0686719fee4c167684dcf26402851b53260
 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_7.c
@@ -5,7 +5,7 @@
 
 /* { dg-additional-options "-Ofast" } */
 
-/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" { target { ! 
"arm*-*-*" } } } } */
 
 #include 
 
diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_75.c 

[PATCH]AArch64 Update costing for vector conversions [PR110625]

2023-12-29 Thread Tamar Christina
Hi All,

In gimple the operation

short _8;
double _9;
_9 = (double) _8;

denotes two operations.  First we have to widen from short to long and then
convert this integer to a double.

Currently however we only count the widen/truncate operations:

(double) _5 6 times vec_promote_demote costs 12 in body
(double) _5 12 times vec_promote_demote costs 24 in body

but not the actual conversion operation, which needs an additional 12
instructions in the attached testcase.   Without this the attached testcase ends
up incorrectly thinking that it's beneficial to vectorize the loop at a very
high VF = 8 (4x unrolled).

Because we can't change the mid-end to account for this the costing code in the
backend now keeps track of whether the previous operation was a
promotion/demotion and adjusts the expected number of instructions to:

1. If it's the first FLOAT_EXPR and the precision of the lhs and rhs are
   different, double it, since we need to convert and promote.
2. If the previous operation was a demotion/promotion then reduce the
   cost of the current operation by the extra amount we added in the last one.

with the patch we get:

(double) _5 6 times vec_promote_demote costs 24 in body
(double) _5 12 times vec_promote_demote costs 36 in body

which correctly accounts for 30 operations.
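
For illustration, here is the bookkeeping in isolation (a sketch of my
own, not the patch itself; the member name follows the hunk below and
the per-insn cost of 2 is taken from the dumps above):

/* Each promote/demote group is counted twice (widen + convert), minus
   whatever the previous promote/demote group already accounted for.  */
static unsigned int m_num_last_promote_demote = 0;

static unsigned int
adjusted_count (unsigned int count)
{
  unsigned int new_count = count * 2 - m_num_last_promote_demote;
  m_num_last_promote_demote = count;
  return new_count;
}

/* adjusted_count (6)  == 12, i.e. 12 * 2 = 24 in the first dump line;
   adjusted_count (12) == 18, i.e. 18 * 2 = 36 in the second;
   12 + 18 = 30 operations in total.  */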

This fixes the regression reported on Neoverse N2 and using the new generic
Armv9-a cost model.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR target/110625
* config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
Adjust throughput and latency calculations for vector conversions.
(class aarch64_vector_costs): Add m_num_last_promote_demote.

gcc/testsuite/ChangeLog:

PR target/110625
* gcc.target/aarch64/pr110625_4.c: New test.
* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Add
--param aarch64-sve-compare-costs=0.
* gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
f9850320f61c5ddccf47e6583d304e5f405a484f..561413e52717974b96f79cc83008f237c536
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16077,6 +16077,15 @@ private:
  leaving a vectorization of { elts }.  */
   bool m_stores_to_vector_load_decl = false;
 
+  /* Non-zero if the last operation we costed is a vector promotion or 
demotion.
+ In this case the value is the number of insn in the last operation.
+
+ On AArch64 vector promotion and demotions require us to first widen or
+ narrow the input and only after that emit conversion instructions.  For
+ costing this means we need to emit the cost of the final conversions as
+ well.  */
+  unsigned int m_num_last_promote_demote = 0;
+
   /* - If M_VEC_FLAGS is zero then we're costing the original scalar code.
  - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
SIMD code.
@@ -17132,6 +17141,29 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
 stmt_cost = aarch64_sve_adjust_stmt_cost (m_vinfo, kind, stmt_info,
  vectype, stmt_cost);
 
+  /*  Vector promotion and demotion requires us to widen the operation first
+  and only after that perform the conversion.  Unfortunately the mid-end
+  expects this to be doable as a single operation and doesn't pass on
+  enough context here for us to tell which operation is happening.  To
+  account for this we count every promote-demote operation twice and if
+  the previously costed operation was also a promote-demote we reduce
+  the cost of the currently being costed operation to simulate the final
+  conversion cost.  Note that for SVE we can do better here if the 
converted
+  value comes from a load since the widening load would consume the 
widening
+  operations.  However since we're in stage 3 we can't change the helper
+  vect_is_extending_load and duplicating the code seems not useful.  */
+  gassign *assign = NULL;
+  if (kind == vec_promote_demote
+  && (assign = dyn_cast  (STMT_VINFO_STMT (stmt_info)))
+  && gimple_assign_rhs_code (assign) == FLOAT_EXPR)
+{
+  auto new_count = count * 2 - m_num_last_promote_demote;
+  m_num_last_promote_demote = count;
+  count = new_count;
+}
+  else
+m_num_last_promote_demote = 0;
+
   if (stmt_info && aarch64_use_new_vector_costs_p ())
 {
   /* Account for any extra "embedded" costs that apply additively
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_4.c 
b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
new file mode 100644
index 
..34dac19d81a85d63706d54f4cb0c738ce592d5d7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } 

[PATCH pushed] LoongArch: Fix the format of bstrins__for_ior_mask condition (NFC)

2023-12-29 Thread Xi Ruoyao
gcc/ChangeLog:

* config/loongarch/loongarch.md (bstrins__for_ior_mask):
For the condition, remove unneeded trailing "\" and move "&&" to
follow GNU coding style.  NFC.
---

Pushed as obvious.

 gcc/config/loongarch/loongarch.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/loongarch/loongarch.md 
b/gcc/config/loongarch/loongarch.md
index d705717b5fa..47c1c5603c1 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -1486,8 +1486,8 @@ (define_insn_and_split "*bstrins__for_ior_mask"
   (match_operand:GPR 2 "const_int_operand"))
 (and:GPR (match_operand:GPR 3 "register_operand")
  (match_operand:GPR 4 "const_int_operand"]
-  "loongarch_pre_reload_split () && \
-   loongarch_use_bstrins_for_ior_with_mask (mode, operands)"
+  "loongarch_pre_reload_split ()
+   && loongarch_use_bstrins_for_ior_with_mask (mode, operands)"
   "#"
   "&& true"
   [(set (match_dup 0) (match_dup 1))
-- 
2.43.0



Pushed: [PATCH v4] LoongArch: Replace -mexplicit-relocs=auto simple-used address peephole2 with combine

2023-12-29 Thread Xi Ruoyao
Pushed v4 as attached, with the format issues fixed and a minor
adjustment in the commit message ("define_insn_and_split" is changed to
"define_insn_and_rewrite" to match the actual change).

On Fri, 2023-12-29 at 19:55 +0800, Xi Ruoyao wrote:
> On Fri, 2023-12-29 at 15:57 +0800, chenglulu wrote:
> 
> /* snip */
> 
> > > diff --git a/gcc/config/loongarch/loongarch.md 
> > > b/gcc/config/loongarch/loongarch.md
> > /* snip */
> > > +(define_insn_and_rewrite "simple_load"
> > > +  [(set (match_operand:LD_AT_LEAST_32_BIT 0 "register_operand" "=r,f")
> > > + (match_operand:LD_AT_LEAST_32_BIT 1 "mem_simple_ldst_operand" ""))]
> > > +  "loongarch_pre_reload_split () \
> > > +   && la_opt_explicit_relocs == EXPLICIT_RELOCS_AUTO \
> > Is the '\' here dispensable? I don't seem to have added it when I wrote 
> > the conditions.
> 
> It seems '\' is not needed, I'll drop them.
> 
> /* snip */
> 
> > 
> > > +(define_predicate "mem_simple_ldst_operand"
> > > +  (match_code "mem")
> > > +{
> > > +  op = XEXP (op, 0);
> > > +  return symbolic_pcrel_operand (op, Pmode) ||
> > > +  symbolic_pcrel_offset_operand (op, Pmode);
> > > +})
> > > +
> > >   
> > The '||' symbol shouldn't be at the end of the line.
> 
> Indeed.
> 
> > 
> > +  return symbolic_pcrel_operand (op, Pmode)
> > +    || symbolic_pcrel_offset_operand (op, Pmode);
> > 
> > Others LGTM.
> > Thanks!
> > 
> > /* snip */

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University
From 8b61d109b130f0e6551803cc30f3c607d4fde81c Mon Sep 17 00:00:00 2001
From: Xi Ruoyao 
Date: Tue, 12 Dec 2023 04:54:21 +0800
Subject: [PATCH v4] LoongArch: Replace -mexplicit-relocs=auto simple-used
 address peephole2 with combine

The problem with peephole2 is it uses a naive sliding-window algorithm
and misses many cases.  For example:

float a[1];
float t() { return a[0] + a[8000]; }

is compiled to:

la.local   $r13,a
la.local   $r12,a+32768
fld.s   $f1,$r13,0
fld.s   $f0,$r12,-768
fadd.s  $f0,$f1,$f0

by trunk.  But as we've explained in r14-4851, the following would be
better with -mexplicit-relocs=auto:

pcalau12i   $r13,%pc_hi20(a)
pcalau12i   $r12,%pc_hi20(a+32000)
fld.s   $f1,$r13,%pc_lo12(a)
fld.s   $f0,$r12,%pc_lo12(a+32000)
fadd.s  $f0,$f1,$f0

However the sliding-window algorithm just won't detect the pcalau12i/fld
pair to be optimized.  Using a define_insn_and_rewrite in the combine pass
works around the issue.

gcc/ChangeLog:

	* config/loongarch/predicates.md
	(symbolic_pcrel_offset_operand): New define_predicate.
	(mem_simple_ldst_operand): Likewise.
	* config/loongarch/loongarch-protos.h
	(loongarch_rewrite_mem_for_simple_ldst): Declare.
	* config/loongarch/loongarch.cc
	(loongarch_rewrite_mem_for_simple_ldst): Implement.
	* config/loongarch/loongarch.md (simple_load): New
	define_insn_and_rewrite.
	(simple_load_ext): Likewise.
	(simple_store): Likewise.
	(define_peephole2): Remove la.local/[f]ld peepholes.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/explicit-relocs-auto-single-load-store-2.c:
	New test.
	* gcc.target/loongarch/explicit-relocs-auto-single-load-store-3.c:
	New test.
---
 gcc/config/loongarch/loongarch-protos.h   |   1 +
 gcc/config/loongarch/loongarch.cc |  16 +++
 gcc/config/loongarch/loongarch.md | 114 +-
 gcc/config/loongarch/predicates.md|  13 ++
 ...explicit-relocs-auto-single-load-store-2.c |  11 ++
 ...explicit-relocs-auto-single-load-store-3.c |  18 +++
 6 files changed, 86 insertions(+), 87 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/loongarch/explicit-relocs-auto-single-load-store-2.c
 create mode 100644 gcc/testsuite/gcc.target/loongarch/explicit-relocs-auto-single-load-store-3.c

diff --git a/gcc/config/loongarch/loongarch-protos.h b/gcc/config/loongarch/loongarch-protos.h
index 7bf21a45c69..024f3117604 100644
--- a/gcc/config/loongarch/loongarch-protos.h
+++ b/gcc/config/loongarch/loongarch-protos.h
@@ -163,6 +163,7 @@ extern bool loongarch_use_ins_ext_p (rtx, HOST_WIDE_INT, HOST_WIDE_INT);
 extern bool loongarch_check_zero_div_p (void);
 extern bool loongarch_pre_reload_split (void);
 extern int loongarch_use_bstrins_for_ior_with_mask (machine_mode, rtx *);
+extern rtx loongarch_rewrite_mem_for_simple_ldst (rtx);
 
 union loongarch_gen_fn_ptrs
 {
diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index 1d4d8f0b256..9f2b3e98bf0 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -5717,6 +5717,22 @@ loongarch_use_bstrins_for_ior_with_mask (machine_mode mode, rtx *op)
   return 0;
 }
 
+/* Rewrite a MEM for simple load/store under -mexplicit-relocs=auto
+   -mcmodel={normal/medium}.  */
+rtx
+loongarch_rewrite_mem_for_simple_ldst (rtx mem)
+{
+  rtx addr = XEXP (mem, 0);
+  rtx hi = gen_rtx_UNSPEC (Pmode, gen_rtvec (1, addr),
+			   UNSPEC_PCALAU12I_GR);
+  rtx new_mem;
+
+  addr = gen_rtx_LO_SUM 

Re: [PATCH v3] LoongArch: Replace -mexplicit-relocs=auto simple-used address peephole2 with combine

2023-12-29 Thread Xi Ruoyao
On Fri, 2023-12-29 at 15:57 +0800, chenglulu wrote:

/* snip */

> > diff --git a/gcc/config/loongarch/loongarch.md 
> > b/gcc/config/loongarch/loongarch.md
> /* snip */
> > +(define_insn_and_rewrite "simple_load"
> > +  [(set (match_operand:LD_AT_LEAST_32_BIT 0 "register_operand" "=r,f")
> > +   (match_operand:LD_AT_LEAST_32_BIT 1 "mem_simple_ldst_operand" ""))]
> > +  "loongarch_pre_reload_split () \
> > +   && la_opt_explicit_relocs == EXPLICIT_RELOCS_AUTO \
> Is the '\' here dispensable? I don't seem to have added it when I wrote 
> the conditions.

It seems '\' is not needed, I'll drop them.

/* snip */

> 
> > +(define_predicate "mem_simple_ldst_operand"
> > +  (match_code "mem")
> > +{
> > +  op = XEXP (op, 0);
> > +  return symbolic_pcrel_operand (op, Pmode) ||
> > +symbolic_pcrel_offset_operand (op, Pmode);
> > +})
> > +
> >   
> The '||' symbol shouldn't be at the end of the line.

Indeed.

> 
> +  return symbolic_pcrel_operand (op, Pmode)
> +    || symbolic_pcrel_offset_operand (op, Pmode);
> 
> Others LGTM.
> Thanks!
> 
> /* snip */
> 

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


[PATCH 2/2] MIPS: define_attr perf_ratio in mips.md

2023-12-29 Thread YunQiang Su
The accurate cost of a pattern can be computed as
 insn_count * perf_ratio

The default value is set to 0 instead of 1, since we need to
distinguish the default value from a value that was really set
for a pattern.  Since it is not set for most patterns yet, to use
it we need to be sure that its value is greater than 0.

This attr will be used in `mips_insn_cost`.
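
As an illustration only (not part of this patch; the function and its
name are hypothetical), the intended use boils down to:

/* Cost of a pattern: insn_count * perf_ratio when the attribute was set
   for the pattern, otherwise fall back to the plain insn count, since
   the default value 0 means "not set".  */
static int
pattern_cost (int insn_count, int perf_ratio)
{
  if (perf_ratio > 0)
    return insn_count * perf_ratio;
  return insn_count;
}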

gcc

* config/mips/mips.md (perf_ratio): New attribute.
---
 gcc/config/mips/mips.md | 4 
 1 file changed, 4 insertions(+)

diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
index 6bc56b0d3da..5abaa7a3a20 100644
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -312,6 +312,10 @@ (define_attr "sync_insn2" "nop,and,xor,not"
 ;; "11" specifies MEMMODEL_ACQUIRE.
 (define_attr "sync_memmodel" "" (const_int 10))
 
+;; Performance ratio.  Add this attr to the slow INSNs.
+;; Used by mips_insn_cost.
+(define_attr "perf_ratio" "" (const_int 0))
+
 ;; Accumulator operand for madd patterns.
 (define_attr "accum_in" "none,0,1,2,3,4,5" (const_string "none"))
 
-- 
2.39.2



[PATCH 1/2] MIPS: add pattern insqisi_extended

2023-12-29 Thread YunQiang Su
This match pattern allows combining a (zero_extract:DI 8, 24, QI)
with a sign-extend into a 32-bit INS instruction on TARGET_64BIT.

The problem is that, for SImode, if the sign bit is modified by
bit operations, we need a sign-extend operation.  The 32-bit INS
instruction guarantees that the result is sign-extended, and the
QImode source register is safe for INS, too.

(insn 19 18 20 2 (set (zero_extract:DI (reg/v:DI 200 [ val ])
(const_int 8 [0x8])
(const_int 24 [0x18]))
(subreg:DI (reg:QI 205) 0)) "../xx.c":7:29 -1
 (nil))
(insn 20 19 23 2 (set (reg/v:DI 200 [ val ])
(sign_extend:DI (subreg:SI (reg/v:DI 200 [ val ]) 0))) "../xx.c":7:29 -1
 (nil))

Combine tries to merge them into:

(insn 20 19 23 2 (set (reg/v:DI 200 [ val ])
(sign_extend:DI (ior:SI (and:SI (subreg:SI (reg/v:DI 200 [ val ]) 0)
(const_int 16777215 [0xffffff]))
(ashift:SI (subreg:SI (reg:QI 205 [ MEM[(const unsigned char 
*)buf_8(D) + 3B] ]) 0)
(const_int 24 [0x18]) "../xx.c":7:29 18 {*insv_extended}
 (expr_list:REG_DEAD (reg:QI 205 [ MEM[(const unsigned char *)buf_8(D) + 
3B] ])
(nil)))

Let's accept this pattern.
Note: with this patch alone we still cannot get INS: rtx_cost considers
the combined form more expensive than the previous two insns.
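
For reference, a hypothetical C reduction (my own, inferred from the RTL
dump above; `val` and `buf_8` mirror the names in the register notes)
that leads to this combination on a 64-bit target:

/* Insert byte 3 of buf_8 into bits 24..31 of a 32-bit value; the SImode
   result must stay sign-extended in its DImode register.  */
unsigned int
insert_top_byte (unsigned int val, const unsigned char *buf_8)
{
  return (val & 0x00ffffff) | ((unsigned int) buf_8[3] << 24);
}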

gcc

* config/mips/mips.md (insqisi_extended): New pattern.
---
 gcc/config/mips/mips.md | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/gcc/config/mips/mips.md b/gcc/config/mips/mips.md
index 0666310734e..6bc56b0d3da 100644
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -4415,6 +4415,16 @@ (define_insn "*extzv_truncsi_exts"
   [(set_attr "type" "arith")
(set_attr "mode" "SI")])
 
+(define_insn "*insqisi_extended"
+  [(set (match_operand:DI 0 "register_operand" "=d")
+   (sign_extend:DI
+(ior:SI (and:SI (subreg:SI (match_dup 0) 0)
+(const_int 16777215))
+   (ashift:SI (subreg:SI (match_operand:QI 1 
"register_operand" "d") 0)
+(const_int 24)]
+  "TARGET_64BIT && !TARGET_MIPS16 && ISA_HAS_EXT_INS"
+  "ins\t%0,%1,24,8"
+  [(set_attr "mode" "SI")])
 
 (define_expand "insvmisalign"
   [(set (zero_extract:GPR (match_operand:BLK 0 "memory_operand")
-- 
2.39.2



[PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091]

2023-12-29 Thread Feng Xue OS
This patch is meant to fix over-estimation of the SLP vector-to-scalar cost
for STMT_VINFO_LIVE_P statements.  When pattern recognition is involved, a
statement whose definition is consumed in some pattern may not be
included in the final replacement pattern statements, and would be skipped
when building the SLP graph.

 * Original
  char a_c = *(char *) a;
  char b_c = *(char *) b;
  unsigned short a_s = (unsigned short) a_c;
  int a_i = (int) a_s;
  int b_i = (int) b_c;
  int r_i = a_i - b_i;

 * After pattern replacement
  a_s = (unsigned short) a_c;
  a_i = (int) a_s;

  patt_b_s = (unsigned short) b_c;// b_i = (int) b_c
  patt_b_i = (int) patt_b_s;  // b_i = (int) b_c

  patt_r_s = widen_minus(a_c, b_c);   // r_i = a_i - b_i
  patt_r_i = (int) patt_r_s;  // r_i = a_i - b_i

The definitions of a_i(original statement) and b_i(pattern statement)
are related to, but actually not part of widen_minus pattern.
Vectorizing the pattern does not cause these definition statements to
be marked as PURE_SLP.  For this case, we need to recursively check
whether their uses are all absorbed into vectorized code.  But there
is an exception that some use may participate in an vectorized
operation via an external SLP node containing that use as an element.

Feng

---
 .../gcc.target/aarch64/bb-slp-pr113091.c  |  22 ++
 gcc/tree-vect-slp.cc  | 189 ++
 2 files changed, 172 insertions(+), 39 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c

diff --git a/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c 
b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
new file mode 100644
index 000..ff822e90b4a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-slp-details -ftree-slp-vectorize" 
} */
+
+int test(unsigned array[8]);
+
+int foo(char *a, char *b)
+{
+  unsigned array[8];
+
+  array[0] = (a[0] - b[0]);
+  array[1] = (a[1] - b[1]);
+  array[2] = (a[2] - b[2]);
+  array[3] = (a[3] - b[3]);
+  array[4] = (a[4] - b[4]);
+  array[5] = (a[5] - b[5]);
+  array[6] = (a[6] - b[6]);
+  array[7] = (a[7] - b[7]);
+
+  return test(array);
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using 
SLP" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index a82fca45161..d36ff37114e 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6418,6 +6418,84 @@ vect_slp_analyze_node_operations (vec_info *vinfo, 
slp_tree node,
   return res;
 }
 
+/* Given a definition DEF, analyze if it will have any live scalar use after
+   performing SLP vectorization whose information is represented by BB_VINFO,
+   and record result into hash map SCALAR_USE_MAP as cache for later fast
+   check.  */
+
+static bool
+vec_slp_has_scalar_use (bb_vec_info bb_vinfo, tree def,
+   hash_map _use_map)
+{
+  imm_use_iterator use_iter;
+  gimple *use_stmt;
+
+  if (bool *res = scalar_use_map.get (def))
+return *res;
+
+  FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def)
+{
+  if (is_gimple_debug (use_stmt))
+   continue;
+
+  stmt_vec_info use_stmt_info = bb_vinfo->lookup_stmt (use_stmt);
+
+  if (!use_stmt_info)
+   break;
+
+  if (PURE_SLP_STMT (vect_stmt_to_vectorize (use_stmt_info)))
+   continue;
+
+  /* Do not step forward when encounter PHI statement, since it may
+involve cyclic reference and cause infinite recursive invocation.  */
+  if (gimple_code (use_stmt) == GIMPLE_PHI)
+   break;
+
+  /* When pattern recognition is involved, a statement whose definition is
+consumed in some pattern, may not be included in the final replacement
+pattern statements, so would be skipped when building SLP graph.
+
+* Original
+ char a_c = *(char *) a;
+ char b_c = *(char *) b;
+ unsigned short a_s = (unsigned short) a_c;
+ int a_i = (int) a_s;
+ int b_i = (int) b_c;
+ int r_i = a_i - b_i;
+
+* After pattern replacement
+ a_s = (unsigned short) a_c;
+ a_i = (int) a_s;
+
+ patt_b_s = (unsigned short) b_c;// b_i = (int) b_c
+ patt_b_i = (int) patt_b_s;  // b_i = (int) b_c
+
+ patt_r_s = widen_minus(a_c, b_c);   // r_i = a_i - b_i
+ patt_r_i = (int) patt_r_s;  // r_i = a_i - b_i
+
+The definitions of a_i(original statement) and b_i(pattern statement)
+are related to, but actually not part of widen_minus pattern.
+Vectorizing the pattern does not cause these definition statements to
+be marked as PURE_SLP.  For this case, we need to recursively check
+whether their uses are all absorbed into vectorized code.  But there
+is an exception that some use may participate in an vectorized
+

Re: [PATCH] aarch64: fortran: Adjust vect-8.f90 for libmvec

2023-12-29 Thread Richard Sandiford
Szabolcs Nagy  writes:
> With new glibc one more loop can be vectorized via simd exp in libmvec.
>
> Found by the Linaro TCWG CI.
>
> gcc/testsuite/ChangeLog:
>
>   * gfortran/vect/vect-8.f90: Accept more vectorized loops.

OK.  At first I thought it would be good to "defend" the increase when
it's supposed to apply, but it would need a relatively complicated check,
and there should be plenty of test coverage elsewhere.

Thanks,
Richard

> ---
>  gcc/testsuite/gfortran.dg/vect/vect-8.f90 | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/testsuite/gfortran.dg/vect/vect-8.f90 
> b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> index ca72ddcffca..938dfc29754 100644
> --- a/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> +++ b/gcc/testsuite/gfortran.dg/vect/vect-8.f90
> @@ -704,7 +704,7 @@ CALL track('KERNEL  ')
>  RETURN
>  END SUBROUTINE kernel
>  
> -! { dg-final { scan-tree-dump-times "vectorized 25 loops" 1 "vect" { target 
> aarch64_sve } } }
> -! { dg-final { scan-tree-dump-times "vectorized 24 loops" 1 "vect" { target 
> { aarch64*-*-* && { ! aarch64_sve } } } } }
> +! { dg-final { scan-tree-dump-times "vectorized 2\[56\] loops" 1 "vect" { 
> target aarch64_sve } } }
> +! { dg-final { scan-tree-dump-times "vectorized 2\[45\] loops" 1 "vect" { 
> target { aarch64*-*-* && { ! aarch64_sve } } } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 2\[234\] loops" 1 "vect" { 
> target { vect_intdouble_cvt && { ! aarch64*-*-* } } } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target 
> { { ! vect_intdouble_cvt } && { ! aarch64*-*-* } } } } }


Re: [PATCH] aarch64: add 'AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA'

2023-12-29 Thread Richard Sandiford
Di Zhao OS  writes:
> This patch adds a new tuning option 'AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA',
> to consider fully pipelined FMAs in reassociation. Also, set this option
> by default for Ampere CPUs.
>
> Tested on aarch64-unknown-linux-gnu. Is this OK for trunk?
>
> Thanks,
> Di Zhao
>
> gcc/ChangeLog:
>
>   * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
>   New tuning option AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.
>   * config/aarch64/aarch64.cc (aarch64_override_options_internal): Set
>   param_fully_pipelined_fma according to tuning option.
>   * config/aarch64/tuning_models/ampere1.h: Add
>   AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA to tune_flags.
>   * config/aarch64/tuning_models/ampere1a.h: Likewise.
>   * config/aarch64/tuning_models/ampere1b.h: Likewise.
>
> ---
>  gcc/config/aarch64/aarch64-tuning-flags.def | 2 ++
>  gcc/config/aarch64/aarch64.cc   | 6 ++
>  gcc/config/aarch64/tuning_models/ampere1.h  | 3 ++-
>  gcc/config/aarch64/tuning_models/ampere1a.h | 3 ++-
>  gcc/config/aarch64/tuning_models/ampere1b.h | 3 ++-
>  5 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index f28a73839a6..256f17bad60 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -49,4 +49,6 @@ AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", 
> MATCHED_VECTOR_THROUGH
>  
>  AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
>  
> +AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_FMA", FULLY_PIPELINED_FMA)

Could you change this to all-lowercase, i.e. fully_pipelined_fma,
for consistency with avoid_cross_loop_fma above?

> +
>  #undef AARCH64_EXTRA_TUNING_OPTION
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index f9850320f61..1b3b288cdf9 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -18289,6 +18289,12 @@ aarch64_override_options_internal (struct 
> gcc_options *opts)
>  SET_OPTION_IF_UNSET (opts, _options_set, param_avoid_fma_max_bits,
>512);
>  
> +  /* Consider fully pipelined FMA in reassociation.  */
> +  if (aarch64_tune_params.extra_tuning_flags
> +  & AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA)
> +SET_OPTION_IF_UNSET (opts, _options_set, 
> param_fully_pipelined_fma,
> +  1);
> +
>aarch64_override_options_after_change_1 (opts);
>  }
>  
> diff --git a/gcc/config/aarch64/tuning_models/ampere1.h 
> b/gcc/config/aarch64/tuning_models/ampere1.h
> index a144e8f94b3..d63788528a7 100644
> --- a/gcc/config/aarch64/tuning_models/ampere1.h
> +++ b/gcc/config/aarch64/tuning_models/ampere1.h
> @@ -104,7 +104,8 @@ static const struct tune_params ampere1_tunings =
>2, /* min_div_recip_mul_df.  */
>0, /* max_case_values.  */
>tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_AVOID_CROSS_LOOP_FMA), /* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_AVOID_CROSS_LOOP_FMA |
> +   AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */

Formatting nit, but GCC style is to put the "|" at the start of the
following line:

  (AARCH64_EXTRA_TUNE_AVOID_CROSS_LOOP_FMA
   | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),   /* tune_flags.  */

Same for the others.

OK with those changes, thanks.

Richard

>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALIGNED/* stp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/ampere1a.h 
> b/gcc/config/aarch64/tuning_models/ampere1a.h
> index f688ed08a79..63506e1d1c6 100644
> --- a/gcc/config/aarch64/tuning_models/ampere1a.h
> +++ b/gcc/config/aarch64/tuning_models/ampere1a.h
> @@ -56,7 +56,8 @@ static const struct tune_params ampere1a_tunings =
>2, /* min_div_recip_mul_df.  */
>0, /* max_case_values.  */
>tune_params::AUTOPREFETCHER_WEAK,  /* autoprefetcher_model.  */
> -  (AARCH64_EXTRA_TUNE_AVOID_CROSS_LOOP_FMA), /* tune_flags.  */
> +  (AARCH64_EXTRA_TUNE_AVOID_CROSS_LOOP_FMA |
> +   AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),  /* tune_flags.  */
>_prefetch_tune,
>AARCH64_LDP_STP_POLICY_ALIGNED,   /* ldp_policy_model.  */
>AARCH64_LDP_STP_POLICY_ALIGNED/* stp_policy_model.  */
> diff --git a/gcc/config/aarch64/tuning_models/ampere1b.h 
> b/gcc/config/aarch64/tuning_models/ampere1b.h
> index a98b6a980f7..7894e730174 100644
> --- a/gcc/config/aarch64/tuning_models/ampere1b.h
> +++ b/gcc/config/aarch64/tuning_models/ampere1b.h
> @@ -106,7 +106,8 @@ static const struct tune_params ampere1b_tunings =
>0, /* max_case_values.  */
>tune_params::AUTOPREFETCHER_STRONG,/* autoprefetcher_model.  */
>(AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND |
> -   AARCH64_EXTRA_TUNE_AVOID_CROSS_LOOP_FMA), /* tune_flags.  */
> +   

[committed] i386: Fix TARGET_USE_VECTOR_FP_CONVERTS SF->DF float_extend splitter [PR113133]

2023-12-29 Thread Uros Bizjak
The post-reload splitter currently allows xmm16+ registers with TARGET_EVEX512.
The splitter changes SFmode of the output operand to V4SFmode, but the vector
mode is currently unsupported in xmm16+ without TARGET_AVX512VL. lowpart_subreg
returns NULL_RTX in this case and the compilation fails with invalid RTX.

The patch removes support for x/ymm16+ registers with TARGET_EVEX512.  The
support should be restored once ix86_hard_regno_mode_ok is fixed to allow
16-byte modes in x/ymm16+ with TARGET_EVEX512.

PR target/113133

gcc/ChangeLog:

* config/i386/i386.md
(TARGET_USE_VECTOR_FP_CONVERTS SF->DF float_extend splitter):
Do not handle xmm16+ with TARGET_EVEX512.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr113133-1.c: New test.
* gcc.target/i386/pr113133-2.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index ca6dbf42a6d..cdb9ddc4eb3 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -5210,7 +5210,7 @@ (define_split
&& optimize_insn_for_speed_p ()
&& reload_completed
&& (!EXT_REX_SSE_REG_P (operands[0])
-   || TARGET_AVX512VL || TARGET_EVEX512)"
+   || TARGET_AVX512VL)"
[(set (match_dup 2)
 (float_extend:V2DF
   (vec_select:V2SF
diff --git a/gcc/testsuite/gcc.target/i386/pr113133-1.c 
b/gcc/testsuite/gcc.target/i386/pr113133-1.c
new file mode 100644
index 000..63a1a413bba
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr113133-1.c
@@ -0,0 +1,21 @@
+/* PR target/113133 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -mavx512f -mtune=barcelona" } */
+
+void
+foo1 (double *d, float f)
+{
+  register float x __asm ("xmm16") = f;
+  asm volatile ("" : "+v" (x));
+
+  *d = x;
+}
+
+void
+foo2 (float *f, double d)
+{
+  register double x __asm ("xmm16") = d;
+  asm volatile ("" : "+v" (x));
+
+  *f = x;
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr113133-2.c 
b/gcc/testsuite/gcc.target/i386/pr113133-2.c
new file mode 100644
index 000..8974d8ced7f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr113133-2.c
@@ -0,0 +1,72 @@
+/* PR target/113133 */
+/* { dg-do compile { target lp64 } } */
+/* { dg-options "-O -fno-tree-ter -mavx512f -mtune=barcelona" } */
+
+typedef char v8u8;
+typedef unsigned char __attribute__((__vector_size__(2))) v16u8;
+typedef signed char __attribute__((__vector_size__(2))) v16s8;
+typedef char __attribute__((__vector_size__(4))) v32u8;
+typedef unsigned char __attribute__((__vector_size__(8))) v64u8;
+typedef char __attribute__((__vector_size__(16))) v128u8;
+typedef signed char __attribute__((__vector_size__(16))) v128s8;
+typedef short __attribute__((__vector_size__(8))) v64u16;
+typedef int __attribute__((__vector_size__(16))) v128u32;
+typedef _Float16 __attribute__((__vector_size__(8))) v64f16;
+typedef _Float32 f32;
+char foo0_u8_0, foo0_ret;
+v16s8 foo0_v16s8_0;
+v64u8 foo0_v64u8_0;
+v128u8 foo0_v128u8_0;
+v128s8 foo0_v128s8_0;
+__attribute__((__vector_size__(2 * sizeof(int int foo0_v64s32_0;
+v128u32 foo0_v128u32_0, foo0_v128f32_0;
+f32 foo0_f32_0, foo0_f128_0;
+v16u8 foo0_v16u8_0;
+v64u16 foo0_v64u16_1;
+void foo0(__attribute__((__vector_size__(4 * sizeof(int int v128s32_0,
+  __attribute__((__vector_size__(sizeof(long long v64s64_0,
+  __attribute__((__vector_size__(2 * sizeof(long long v128u64_0,
+  __attribute__((__vector_size__(2 * sizeof(long long v128s64_0,
+  _Float16 f16_0) {
+  v64f16 v64f16_1 = __builtin_convertvector(foo0_v128f32_0, v64f16);
+  v128u32 v128u32_1 = 0 != foo0_v128u32_0;
+  v16s8 v16s8_1 = __builtin_shufflevector(
+  __builtin_convertvector(foo0_v128s8_0, v128s8), foo0_v16s8_0, 2, 3);
+  v128u8 v128u8_1 = foo0_v128u8_0;
+  v64f16 v64f16_2 = __builtin_convertvector(v128s32_0, v64f16);
+  __attribute__((__vector_size__(2 * sizeof(int int v64u32_1 =
+  -foo0_v64s32_0;
+  __attribute__((__vector_size__(4))) signed char v32s8_1 =
+  __builtin_shufflevector((v16s8){}, v16s8_1, 2, 2, 3, 0);
+  v64u16 v64u16_2 = foo0_v64u16_1 ^ foo0_u8_0;
+  v64u8 v64u8_1 = __builtin_shufflevector(foo0_v64u8_0, foo0_v16u8_0, 6, 7, 4,
+  7, 0, 2, 6, 0);
+  foo0_f32_0 *= __builtin_asinh(foo0_f128_0);
+  v128u8 v128u8_r = foo0_v128u8_0 + v128u8_1 + foo0_v128s8_0 +
+(v128u8)foo0_v128u32_0 + (v128u8)v128u32_1 +
+(v128u8)v128s32_0 + (v128u8)v128u64_0 + (v128u8)v128s64_0 +
+(v128u8)foo0_v128f32_0;
+  v64u8 v64u8_r = ((union {
+v128u8 a;
+v64u8 b;
+  })v128u8_r)
+  .b +
+  foo0_v64u8_0 + v64u8_1 + (v64u8)v64u16_2 + (v64u8)v64u32_1 +
+  (v64u8)v64s64_0 + (v64u8)v64f16_1 + (v64u8)v64f16_2;
+  v32u8 v32u8_r = ((union {
+v64u8 a;
+v32u8 b;
+