[pushed] wwwdocs: readings: Drop 1750a section

2024-06-16 Thread Gerald Pfeifer
We dropped support for 1750a back in 2002.

Pushed.

Gerald
---
 htdocs/readings.html | 6 --
 1 file changed, 6 deletions(-)

diff --git a/htdocs/readings.html b/htdocs/readings.html
index 0f6032c2..784a3bd7 100644
--- a/htdocs/readings.html
+++ b/htdocs/readings.html
@@ -632,12 +632,6 @@ Below is the list of ports that GCC used to support.
 
 
 
-  1750a
-  Exact chip name: MIL-STD-1750A processor
-  Manufacturers: various
-  <a href="http://legacy.cleanscape.net/stdprod/xtc1750a/resources/mil-std-1750.zip">Specification</a>
-  
-
   a29k
   Manufacturer: AMD
   
-- 
2.45.2


Re: [COMMITTED] Do not assume LHS of call is an ssa-name.

2024-06-16 Thread Richard Biener
On Fri, Jun 14, 2024 at 9:20 PM Andrew MacLeod  wrote:
>
> gimple_range_fold assumes that if a call has a LHS, it is an ssa_name.
> Especially later in compilation that may not be true.

It's always true if the LHS is of register type (is_gimple_reg_type) and
never true when the LHS is of aggregate type.  So either range_of_call
should not have been invoked here, or it should indeed be defensive
as in your patch.
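
A small C illustration of the distinction (mine, not from the patch): a
register-type LHS gets an SSA name in GIMPLE, while an aggregate LHS stays
a bare decl:

  struct S { int a[4]; };
  struct S bar (void);
  int foo (void);

  void
  example (void)
  {
    int x = foo ();      /* register type: GIMPLE gives x_1 = foo () */
    struct S s = bar (); /* aggregate type: LHS stays the decl 's',
                            never an SSA name */
    (void) x; (void) s;
  }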

> This patch merely avoids calling routines that assume an ssa-name is
> being passed in.
>
> Bootstraps on  x86_64-pc-linux-gnu with no regressions.   Pushed.
>
> Andrew
>


Re: [PATCH] middle-end/114189 - drop uses of vcond{,u,eq}_optab

2024-06-16 Thread Richard Biener
On Mon, 17 Jun 2024, Kewen.Lin wrote:

> Hi Richi,
> 
> on 2024/6/14 18:31, Richard Biener wrote:
> > The following retires the vcond{,u,eq} optabs by no longer using them
> > from the middle-end.  Targets instead (should) implement vcond_mask
> > and vec_cmp{,u,eq} optabs.  The PR this change refers to lists
> > possibly affected targets - those implementing these patterns,
> > and in particular it lists mips, sparc and ia64 as targets that
> > most definitely will regress while others might simply remove
> > their vcond{,u,eq} patterns.
> > 
> > I'd appreciate testing, I do not expect fallout for x86 or arm/aarch64.
> > I know riscv doesn't implement any of the legacy optabs.  But less
> > maintained vector targets might need adjustments.
> 
> Thanks for making this change.  This patch can be bootstrapped on ppc64{,le},
> but both have one failure on gcc/testsuite/gcc.target/powerpc/pr66144-3.c.
> Looking into it, I found it just exposed an oversight in the current
> rs6000 vcond_mask support (the condition mask location is wrong), so I think
> this change is fine for the rs6000 port.  I'll also test SPEC2017 for this
> (with the rs6000 vcond_mask change) soon.

Btw, for those targets where the patch works out fine it would be nice
to delete their vcond{,u,eq} expanders (and double-check that doesn't
cause issues on its own).

Can target maintainers note whether their targets support all condition
codes for their vector comparisons (including FP variants)?  And 
whether they choose to implement all condition codes in vec_cmp
and adjust with inversion / operand swapping for not supported cases?
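
As one concrete instance of that kind of adjustment (illustrative, not
from the patch): SSE2 only provides greater-than and equality for integer
vectors, so a LE comparison is typically synthesized by inverting the GT
mask:

  #include <emmintrin.h>

  /* a <= b computed as ~(a > b), since SSE2 has no direct "pcmple".  */
  static __m128i
  cmple_epi32 (__m128i a, __m128i b)
  {
    __m128i gt = _mm_cmpgt_epi32 (a, b);
    return _mm_xor_si128 (gt, _mm_set1_epi32 (-1));
  }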

Thanks,
Richard.

> BR,
> Kewen
> 
> > 
> > I want to get rid of those optabs for GCC 15.  If I don't hear from
> > you I will assume your target is fine.
> > 
> > Thanks,
> > Richard.
> > 
> > PR middle-end/114189
> > * optabs-query.h (get_vcond_icode): Always return CODE_FOR_nothing.
> > (get_vcond_eq_icode): Likewise.
> > ---
> >  gcc/optabs-query.h | 13 -
> >  1 file changed, 4 insertions(+), 9 deletions(-)
> > 
> > diff --git a/gcc/optabs-query.h b/gcc/optabs-query.h
> > index 0cb2c21ba85..31fbce80175 100644
> > --- a/gcc/optabs-query.h
> > +++ b/gcc/optabs-query.h
> > @@ -112,14 +112,9 @@ get_vec_cmp_eq_icode (machine_mode vmode, machine_mode mask_mode)
> >     mode CMODE, unsigned if UNS is true, resulting in a value of mode VMODE.  */
> >  
> >  inline enum insn_code
> > -get_vcond_icode (machine_mode vmode, machine_mode cmode, bool uns)
> > +get_vcond_icode (machine_mode, machine_mode, bool)
> >  {
> > -  enum insn_code icode = CODE_FOR_nothing;
> > -  if (uns)
> > -    icode = convert_optab_handler (vcondu_optab, vmode, cmode);
> > -  else
> > -    icode = convert_optab_handler (vcond_optab, vmode, cmode);
> > -  return icode;
> > +  return CODE_FOR_nothing;
> >  }
> >  
> >  /* Return insn code for a conditional operator with a mask mode
> > @@ -135,9 +130,9 @@ get_vcond_mask_icode (machine_mode vmode, machine_mode mmode)
> >     mode CMODE (only EQ/NE), resulting in a value of mode VMODE.  */
> >  
> >  inline enum insn_code
> > -get_vcond_eq_icode (machine_mode vmode, machine_mode cmode)
> > +get_vcond_eq_icode (machine_mode, machine_mode)
> >  {
> > -  return convert_optab_handler (vcondeq_optab, vmode, cmode);
> > +  return CODE_FOR_nothing;
> >  }
> >  
> >  /* Enumerates the possible extraction_insn operations.  */
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


Re: [PATCH] Enhance if-conversion for automatic arrays

2024-06-16 Thread Richard Biener
On Fri, 14 Jun 2024, Andrew Pinski wrote:

> On Fri, Jun 14, 2024 at 5:54 AM Richard Biener  wrote:
> >
> > Automatic arrays that are not address-taken should not be subject to
> > store data races.
> 
> That seems conservative enough.  Though I would think that if the array
> never escaped, the function would still be correct and allow for more
> arrays (but maybe that is not tracked).

points-to should have that computed (conservatively).  I'll see about
splitting out a ref_can_have_store_data_races () function.
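
To make the store-data-race criterion concrete, a minimal sketch (mine,
not from the patch):

  extern int g[256];

  void f (const int *c, const int *v, int n)  /* assume n <= 256 */
  {
    int buf[256];               /* automatic, address not taken */
    for (int i = 0; i < n; i++)
      {
        if (c[i])
          buf[i] = v[i];        /* safe to make unconditional: no other
                                   thread can observe buf */
        if (c[i])
          g[i] = v[i];          /* an unconditional store here would be
                                   a data race unless -fstore-data-races */
      }
  }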

Richard.

> Thanks,
> Andrew Pinski
> 
> > This applies to OMP SIMD in-branch lowered
> > functions result array which for the testcase otherwise prevents
> > vectorization with SSE and for AVX and AVX512 ends up with spurious
> > .MASK_STORE to the stack surviving.
> >
> > This inefficiency was noted in PR111793.
> >
> > Bootstrap and regtest running on x86_64-unknown-linux-gnu.  Is my
> > idea of store data races correct?  At least phiopt uses the same
> > check but for example LIM doesn't special-case locals.
> >
> > PR tree-optimization/111793
> > * tree-if-conv.cc (ifcvt_memrefs_wont_trap): For stores
> > that do not trap only consider -fstore-data-races when
> > the underlying object is not automatic or has its address
> > taken.
> >
> > * gcc.dg/vect/vect-simd-clone-21.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/vect-simd-clone-21.c | 16 
> >  gcc/tree-if-conv.cc| 13 +++--
> >  2 files changed, 27 insertions(+), 2 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/vect-simd-clone-21.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/vect-simd-clone-21.c b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-21.c
> > new file mode 100644
> > index 000..49c52fb59bd
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/vect-simd-clone-21.c
> > @@ -0,0 +1,16 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target vect_simd_clones } */
> > +/* { dg-additional-options "-fopenmp-simd" } */
> > +
> > +#pragma omp declare simd simdlen(4) inbranch
> > +__attribute__((noinline)) int
> > +foo (int a, int b)
> > +{
> > +  return a + b;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 4 "vect" { target i?86-*-* x86_64-*-* } } } */
> > +/* if-conversion shouldn't need to resort to masked stores for the result
> > +   array created by OMP lowering since that's automatic and does not have
> > +   its address taken.  */
> > +/* { dg-final { scan-tree-dump-not "MASK_STORE" "vect" } } */
> > diff --git a/gcc/tree-if-conv.cc b/gcc/tree-if-conv.cc
> > index c4c3ed41a44..974c614edf3 100644
> > --- a/gcc/tree-if-conv.cc
> > +++ b/gcc/tree-if-conv.cc
> > @@ -934,14 +934,23 @@ ifcvt_memrefs_wont_trap (gimple *stmt, vec<data_reference_p> drs)
> >if (DR_IS_READ (a))
> > return true;
> >
> > +  bool ok = flag_store_data_races;
> > +  base = get_base_address (base);
> > +  if (DECL_P (base)
> > + && auto_var_in_fn_p (base, cfun->decl)
> > + && ! may_be_aliased (base))
> > +   /* Automatic variables not aliased are not subject to
> > +  data races.  */
> > +   ok = true;
> > +
> >/* an unconditionaly write won't trap if the base is written
> >   to unconditionally.  */
> >if (base_master_dr
> >   && DR_BASE_W_UNCONDITIONALLY (*base_master_dr))
> > -   return flag_store_data_races;
> > +   return ok;
> >/* or the base is known to be not readonly.  */
> >else if (base_object_writable (DR_REF (a)))
> > -   return flag_store_data_races;
> > +   return ok;
> >  }
> >
> >return false;
> > --
> > 2.35.3
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [Patch, Fortran, 90076] 1/3 Fix Polymorphic Allocate on Assignment Memory Leak

2024-06-16 Thread Paul Richard Thomas
Hi Andre,

The patch is OK for mainline. Please change the subject line to have
[PR90076] at the end. I am not sure that the contents of the first square
brackets are especially useful in the commit.

Thanks for the fix

Paul


On Tue, 11 Jun 2024 at 13:57, Andre Vehreschild  wrote:

> Hi all,
>
> the attached patch fixes the last case in the bug report.  The initial
> example code is already fixed by the combination of PR90068 and PR90072.
> The issue was that the _vptr was not (re)set correctly, i.e. the
> __vtab_...-structure was not created.  This made the compiler ICE.
>
> Regtests fine on x86_64 Fedora 39. Ok for mainline?
>
> Regards,
> Andre
> --
> Andre Vehreschild * Email: vehre ad gmx dot de
>


[PATCH] fsra: gimple final sra pass for paramters and returns

2024-06-16 Thread Jiufu Guo
Hi,

There are a few PRs (meta-bug PR101926) about accessing aggregate
params/returns that are passed through registers.

We could use the current SRA pass in a special mode right before
RTL expansion for the incoming/outgoing part, as discussed at:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637935.html

This patch uses the IFNs ARG_PARTS and SET_RET_PARTS for parameters
and returns, and expands these IFNs according to the incoming/outgoing
registers.
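
A minimal example of the kind of code this targets (my illustration,
assuming an ABI where small aggregates travel in registers):

  struct pair { long a; long b; };

  long
  sum (struct pair p)     /* p arrives in two registers */
  {
    return p.a + p.b;     /* without fsra, p may be spilled to the stack
                             and read back; ARG_PARTS lets these accesses
                             map directly onto the incoming registers */
  }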

Again, there are a few things that could be enhanced in this patch:
* Multi-register access
* Parameter access across calls
* Optimizing access to parameters that live in memory
* More cases/targets checking

Bootstrapped/regtested on ppc64{,le}, x86_64.
Is this ok for trunk?

BR,
Jeff (Jiufu Guo)

PR target/108073
PR target/69143

gcc/ChangeLog:

* cfgexpand.cc (expand_value_return): Update for rtx eq checking.
(expand_return): Update for scalarized returns.
* internal-fn.cc (query_position_in_parallel): New function.
(construct_reg_seq): New function.
(get_incoming_element): New function.
(reference_alias_ptr_type): Extern declare.
(expand_ARG_PARTS): New IFN expand.
(store_outgoing_element): New function.
(expand_SET_RET_PARTS): New IFN expand.
(expand_SET_RET_LAST_PARTS): New IFN expand.
* internal-fn.def (ARG_PARTS): New IFN.
(SET_RET_PARTS): New IFN.
(SET_RET_LAST_PARTS): New IFN.
* passes.def (pass_sra_final): Add new pass.
* tree-pass.h (make_pass_sra_final): New function.
* tree-sra.cc (enum sra_mode): New enum item SRA_MODE_FINAL_INTRA.
(build_accesses_from_assign): Accept SRA_MODE_FINAL_INTRA.
(scan_function): Update for argument in fsra.
(find_var_candidates): Collect candidates for SRA_MODE_FINAL_INTRA.
(analyze_access_subtree): Update analyze for fsra.
(generate_subtree_copies): Update to generate new IFNs.
(final_intra_sra): New function.
(class pass_sra_final): New class.
(make_pass_sra_final): New function.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/pr102024.C: Update instructions.
* gcc.target/powerpc/pr108073-1.c: New test.
* gcc.target/powerpc/pr108073.c: New test.
* gcc.target/powerpc/pr69143.c: New test.
---
 gcc/cfgexpand.cc  |   6 +-
 gcc/internal-fn.cc| 255 ++
 gcc/internal-fn.def   |   9 +
 gcc/passes.def|   2 +
 gcc/tree-pass.h   |   1 +
 gcc/tree-sra.cc   | 156 ++-
 gcc/testsuite/g++.target/powerpc/pr102024.C   |   3 +-
 gcc/testsuite/gcc.target/powerpc/pr108073-1.c |  76 ++
 gcc/testsuite/gcc.target/powerpc/pr108073.c   |  74 +
 gcc/testsuite/gcc.target/powerpc/pr69143.c|  23 ++
 11 files changed, 599 insertions(+), 16 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr69143.c

diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index 8de5f2ba58b..04195a9772b 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -3789,7 +3789,7 @@ expand_value_return (rtx val)
 
   tree decl = DECL_RESULT (current_function_decl);
   rtx return_reg = DECL_RTL (decl);
-  if (return_reg != val)
+  if (!rtx_equal_p (return_reg, val))
 {
   tree funtype = TREE_TYPE (current_function_decl);
   tree type = TREE_TYPE (decl);
@@ -3862,6 +3862,10 @@ expand_return (tree retval)
  been stored into it, so we don't have to do anything special.  */
   if (TREE_CODE (retval_rhs) == RESULT_DECL)
 expand_value_return (result_rtl);
+  /* return is scalarized by fsra: TODO use FLAG. */
+  else if (VAR_P (retval_rhs)
+  && rtx_equal_p (result_rtl, DECL_RTL (retval_rhs)))
+expand_null_return_1 ();
 
   /* If the result is an aggregate that is being returned in one (or more)
  registers, load the registers here.  */
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 4948b48bde8..42a79809a2a 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3474,6 +3474,261 @@ expand_ACCESS_WITH_SIZE (internal_fn, gcall *stmt)
 expand_assignment (lhs, ref_to_obj, false);
 }
 
+/* In the parallel rtx register series REGS, compute the register position for
+   given {BITPOS, BITSIZE}.  The results are stored into START_INDEX,
+   END_INDEX, LEFT_BITS and RIGHT_BITS.  */
+
+void
+query_position_in_parallel (HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
+                            rtx regs, int &start_index, int &end_index,
+                            HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
+{
+  int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
+  for (; cur_index < XVECLEN (regs, 0); cur_index++)
+{
+  rtx slot = XVECEXP (reg

Re: [PATCH] x86: Emit cvtne2ps2bf16 for odd increasing perm in __builtin_shufflevector

2024-06-16 Thread Hongtao Liu
On Fri, Jun 14, 2024 at 9:35 AM Levy Hsu  wrote:
>
> This patch updates the GCC x86 backend to efficiently handle
> odd, incrementally increasing permutations of BF16 vectors
> using the cvtne2ps2bf16 instruction.
> It modifies ix86_vectorize_vec_perm_const to support these operations
> and adds a specific predicate to ensure proper sequence handling.
>
> Bootstrapped and tested on x86_64-linux-gnu, OK for trunk?
Ok.
>
> gcc/ChangeLog:
>
> * config/i386/i386-expand.cc
> (ix86_vectorize_vec_perm_const): Convert BF to HI using subreg.
> * config/i386/predicates.md
> (vcvtne2ps2bf_parallel): New predicate matching odd
> increasing perms.
> * config/i386/sse.md
> (vpermt2_sepcial_bf16_shuffle_<mode>): New define_insn_and_split
> matching odd increasing perms.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/vpermt2-special-bf16-shufflue.c: New test.
> ---
>  gcc/config/i386/i386-expand.cc|  4 +--
>  gcc/config/i386/predicates.md | 11 ++
>  gcc/config/i386/sse.md| 35 +++
>  .../i386/vpermt2-special-bf16-shufflue.c  | 27 ++
>  4 files changed, 75 insertions(+), 2 deletions(-)
>  create mode 100755 gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
>
> diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
> index 312329e550b..3d599c0651a 100644
> --- a/gcc/config/i386/i386-expand.cc
> +++ b/gcc/config/i386/i386-expand.cc
> @@ -23657,8 +23657,8 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
>if (GET_MODE_SIZE (vmode) == 64 && !TARGET_EVEX512)
>  return false;
>
> -  /* For HF mode vector, convert it to HI using subreg.  */
> -  if (GET_MODE_INNER (vmode) == HFmode)
> +  /* For HF and BF mode vector, convert it to HI using subreg.  */
> +  if (GET_MODE_INNER (vmode) == HFmode || GET_MODE_INNER (vmode) == BFmode)
>  {
>machine_mode orig_mode = vmode;
>vmode = mode_for_vector (HImode,
> diff --git a/gcc/config/i386/predicates.md b/gcc/config/i386/predicates.md
> index 7afe3100cb7..1676c50de71 100644
> --- a/gcc/config/i386/predicates.md
> +++ b/gcc/config/i386/predicates.md
> @@ -2322,3 +2322,14 @@
>
>return true;
>  })
> +
> +;; Check that each element is odd and incrementally increasing from 1
> +(define_predicate "vcvtne2ps2bf_parallel"
> +  (and (match_code "const_vector")
> +   (match_code "const_int" "a"))
> +{
> +  for (int i = 0; i < XVECLEN (op, 0); ++i)
> +if (INTVAL (XVECEXP (op, 0, i)) != (2 * i + 1))
> +  return false;
> +  return true;
> +})
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 680a46a0b08..5ddd1c0a778 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -30698,3 +30698,38 @@
>"TARGET_AVXVNNIINT16"
>"vpdp\t{%3, %2, %0|%0, %2, %3}"
> [(set_attr "prefix" "vex")])
> +
> +(define_mode_attr hi_cvt_bf
> +  [(V8HI "v8bf") (V16HI "v16bf") (V32HI "v32bf")])
> +
> +(define_mode_attr HI_CVT_BF
> +  [(V8HI "V8BF") (V16HI "V16BF") (V32HI "V32BF")])
> +
> +(define_insn_and_split "vpermt2_sepcial_bf16_shuffle_<mode>"
> +  [(set (match_operand:VI2_AVX512F 0 "register_operand")
> +	(unspec:VI2_AVX512F
> +	  [(match_operand:VI2_AVX512F 1 "vcvtne2ps2bf_parallel")
> +	   (match_operand:VI2_AVX512F 2 "register_operand")
> +	   (match_operand:VI2_AVX512F 3 "nonimmediate_operand")]
> +	  UNSPEC_VPERMT2))]
> +  "TARGET_AVX512VL && TARGET_AVX512BF16 && ix86_pre_reload_split ()"
> +  "#"
> +  "&& 1"
> +  [(const_int 0)]
> +{
> +  rtx op0 = gen_reg_rtx (<HI_CVT_BF>mode);
> +  operands[2] = lowpart_subreg (<HI_CVT_BF>mode,
> +				force_reg (<MODE>mode, operands[2]),
> +				<MODE>mode);
> +  operands[3] = lowpart_subreg (<HI_CVT_BF>mode,
> +				force_reg (<MODE>mode, operands[3]),
> +				<MODE>mode);
> +
> +  emit_insn (gen_avx512f_cvtne2ps2bf16_<hi_cvt_bf> (op0,
> +						    operands[3],
> +						    operands[2]));
> +  emit_move_insn (operands[0], lowpart_subreg (<MODE>mode, op0,
> +					       <HI_CVT_BF>mode));
> +  DONE;
> +}
> +[(set_attr "mode" "<sseinsnmode>")])
> diff --git a/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> new file mode 100755
> index 000..5c65f2a9884
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/vpermt2-special-bf16-shufflue.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx512bf16 -mavx512vl" } */
> +/* { dg-final { scan-assembler-not "vpermi2b" } } */
> +/* { dg-final { scan-assembler-times "vcvtne2ps2bf16" 3 } } */
> +
> +typedef __bf16 v8bf __attribute__((vector_size(16)));
> +typedef __bf16 v16bf __attribute__((vector_size(32)));
> +typedef __bf16 v32bf __attribute__((vector_size(64)));
> +
> +v8bf foo0(v8bf a, v8bf b)
> +{
> +  return __builtin

Re: [PATCH] i386: Refine all cvtt* instructions with UNSPEC instead of FIX/UNSIGNED_FIX.

2024-06-16 Thread Hongtao Liu
On Thu, Jun 13, 2024 at 3:13 PM Hu, Lin1  wrote:
>
> Hi, all
>
> This patch aims to refine all cvtt* instructions with UNSPEC instead of
> FIX/UNSIGNED_FIX, because the intrinsics should behave as documented.
>
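For context, the documented behavior at stake here (my illustration,
assuming SSE): on out-of-range input the truncating converts return the
x86 "integer indefinite" value, whereas a (fix:SI ...) RTX is undefined
and may be folded away:

  #include <immintrin.h>
  #include <stdio.h>

  int
  main (void)
  {
    __m128 big = _mm_set_ss (3e9f);          /* not representable in int */
    printf ("%d\n", _mm_cvttss_si32 (big));  /* prints -2147483648 */
    return 0;
  }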
> Bootstrapped and regtested on x86_64-pc-linux-gnu, OK for trunk?
Ok.
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> PR target/115161
> * config/i386/i386-builtin.def: Change CODE_FOR_* for cvtt*'s builtins.
> * config/i386/sse.md
> (unspec_avx512fp16_fix_trunc2): Use UNSPEC instead of FIX/UNSIGNED_FIX.
> (unspec_avx512fp16_fix_trunc2): Ditto.
> (unspec_avx512fp16_fix_truncv2di2): Ditto.
> (unspec_avx512fp16_fix_trunc2): Ditto.
> (unspec_sse_cvttps2pi): Ditto.
> (unspec_sse_cvttss2si): Ditto.
> (unspec_fix_truncv16sfv16si2): Ditto.
> (unspec_fix_truncv8sfv8si2): Ditto.
> (unspec_fix_truncv4sfv4si2): Ditto.
> (unspec_sse2_cvttpd2pi): Ditto.
> (unspec_fixuns_truncv2dfv2si2): Ditto.
> (unspec_avx512f_vcvttss2usi): Ditto.
> (unspec_avx512f_vcvttsd2usi): Ditto.
> (unspec_sse2_cvttsd2si): Ditto.
> (unspec_fix_truncv8dfv8si2): Ditto.
> (*unspec_fixuns_truncv2dfv2si2): Ditto.
> (unspec_fixuns_truncv2dfv2si2_mask): Ditto.
> (unspec_fix_truncv4dfv4si2): Ditto.
> (unspec_fixuns_truncv4dfv4si2): Ditto.
> (unspec_fix_trunc2): Ditto.
> (unspec_fix_trunc2): Ditto.
> (unspec_avx512dq_fix_truncv2sfv2di2): Ditto.
> (unspec_fixuns_trunc2): Ditto.
> (unspec_sse2_cvttpd2dq): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> PR target/115161
> * gcc.target/i386/pr115161-1.c: New test.
> ---
>  gcc/config/i386/i386-builtin.def   | 128 
>  gcc/config/i386/sse.md | 335 +
>  gcc/testsuite/gcc.target/i386/pr115161-1.c |  65 
>  3 files changed, 464 insertions(+), 64 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr115161-1.c
>
> diff --git a/gcc/config/i386/i386-builtin.def b/gcc/config/i386/i386-builtin.def
> index 729355230b8..893e2baa006 100644
> --- a/gcc/config/i386/i386-builtin.def
> +++ b/gcc/config/i386/i386-builtin.def
> @@ -631,9 +631,9 @@ BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_rcpv4sf2, "__builtin_ia32_rcpps", IX
>  BDESC (OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_MMX, 0, CODE_FOR_sse_cvtps2pi, "__builtin_ia32_cvtps2pi", IX86_BUILTIN_CVTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF)
>  BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_cvtss2si, "__builtin_ia32_cvtss2si", IX86_BUILTIN_CVTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF)
>  BDESC (OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, 0, CODE_FOR_sse_cvtss2siq, "__builtin_ia32_cvtss2si64", IX86_BUILTIN_CVTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF)
> -BDESC (OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_MMX, 0, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF)
> -BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF)
> -BDESC (OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, 0, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF)
> +BDESC (OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_MMX, 0, CODE_FOR_unspec_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF)
> +BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_unspec_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF)
> +BDESC (OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, 0, CODE_FOR_unspec_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF)
>
>  BDESC (OPTION_MASK_ISA_SSE, 0, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT)
>
> @@ -725,19 +725,19 @@ BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_floatv4siv4sf2, "__builtin_ia32_cvtdq2p
>  BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_sse2_cvtpd2dq, "__builtin_ia32_cvtpd2dq", IX86_BUILTIN_CVTPD2DQ, UNKNOWN, (int) V4SI_FTYPE_V2DF)
>  BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_sse2_cvtpd2pi, "__builtin_ia32_cvtpd2pi", IX86_BUILTIN_CVTPD2PI, UNKNOWN, (int) V2SI_FTYPE_V2DF)
>  BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_sse2_cvtpd2ps, "__builtin_ia32_cvtpd2ps", IX86_BUILTIN_CVTPD2PS, UNKNOWN, (int) V4SF_FTYPE_V2DF)
> -BDESC (OPTION_MASK_ISA_SSE2, 0, CODE_FOR_sse2_cvttpd2dq, "__builtin_ia32_cvttpd2dq", IX86_BUILTIN_CVTTPD2DQ, UNKNOWN, (int) V4SI_FTYPE_V2DF)
> -BDESC (OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_MMX, 0, CODE_FOR_sse2_cvttpd2pi, "__builtin_ia32_cvttpd2pi", IX86_BUILTIN_CVTTPD2PI, UNKNOWN, (

Re: [PATCH] middle-end/114189 - drop uses of vcond{,u,eq}_optab

2024-06-16 Thread Hongtao Liu
On Fri, Jun 14, 2024 at 10:53 PM Hongtao Liu  wrote:
>
> On Fri, Jun 14, 2024 at 6:31 PM Richard Biener  wrote:
> >
> > The following retires the vcond{,u,eq} optabs by no longer using them
> > from the middle-end.  Targets instead (should) implement vcond_mask
> > and vec_cmp{,u,eq} optabs.  The PR this change refers to lists
> > possibly affected targets - those implementing these patterns,
> > and in particular it lists mips, sparc and ia64 as targets that
> > most definitely will regress while others might simply remove
> > their vcond{,u,eq} patterns.
> >
> > I'd appreciate testing, I do not expect fallout for x86 or arm/aarch64.
> > I know riscv doesn't implement any of the legacy optabs.  But less
> > maintained vector targets might need adjustments.
> >
> In GCC 14, I tried to remove these expanders in the x86 backend, and it
> regressed some testcases, mainly because of the optimizations we did
> in ix86_expand_{int,fp}_vcond.
> I've started testing your patch; it's possible that we still need to
> move the ix86_expand_{int,fp}_vcond optimizations to the middle-end
> (isel or match.pd) or add extra patterns to handle it in the RTL
> combine pass.
These are the new failures I got:

g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-not vpcmpgt[bdq]
g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-times vblendvpd 4
g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-times vblendvps 4
g++: g++.target/i386/avx-pr54700-1.C   scan-assembler-times vpblendvb 2
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-not vpcmpgt[bdq]
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-times vblendvpd 4
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-times vblendvps 4
g++: g++.target/i386/avx2-pr54700-1.C   scan-assembler-times vpblendvb 2
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++14  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++14  scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++17  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++17  scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++20  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++20  scan-assembler-times vminph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++98  scan-assembler-times vmaxph 3
g++: g++.target/i386/avx512fp16-vcondmn-minmax.C  -std=gnu++98  scan-assembler-times vminph 3
g++: g++.target/i386/pr100637-1b.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1b.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++14  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++17  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++20  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100637-1w.C  -std=gnu++98  scan-assembler-times pcmpeqw 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++14  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++14  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++14  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++17  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++17  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++17  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++20  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++20  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++20  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr100738-1.C  -std=gnu++98  scan-assembler-not vpcmpeqd[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++98  scan-assembler-not vpxor[ \\t]
g++: g++.target/i386/pr100738-1.C  -std=gnu++98  scan-assembler-times vblendvps[ \\t] 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++14  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++17  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++20  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr103861-1.C  -std=gnu++98  scan-assembler-times pcmpeqb 2
g++: g++.target/i386/pr61747.C  -std=gnu++14  scan-assembler-times max 4
g++: g++.target/i386/pr61747.C  -std=gnu++14  scan-assembler-times min 4
g++: g++.target/i386/pr61747.C  -std=gnu++17  scan-assembler-times max 4
g++: g++.target/i386/pr61747.C  -std=gnu++17  scan-assembler-times min 4
g++: g++.target/i386/pr61747.C  -std=gnu++20 

Re: [PATCH 0/3] [APX CFCMOV] Support APX CFCMOV

2024-06-16 Thread Hongtao Liu
On Sat, Jun 15, 2024 at 1:22 AM Jeff Law  wrote:
>
>
>
> On 6/14/24 11:10 AM, Alexander Monakov wrote:
> >
> > On Fri, 14 Jun 2024, Kong, Lingling wrote:
> >
> >> APX CFCMOV[1] implements conditional faulting, which means that all
> >> memory faults are suppressed when the condition code evaluates to false
> >> for a load or store of a memory operand.  So we can now conditionally
> >> load or store a memory operand that may trap or fault.
> >>
> >> In the middle-end, we currently don't support a conditional move if we
> >> know that a load from A or B could trap or fault.
> >
> > Predicated loads&stores on Itanium don't trap either. They are modeled via
> > COND_EXEC on RTL. The late if-conversion pass (the instance that runs after
> > reload) is capable of introducing them.
> >
> >> To enable CFCMOV, we add a target hook TARGET_HAVE_CONDITIONAL_MOVE_MEM_NOTRAP
> >> in the if-conversion pass to allow converting to cmov.
> >
> > Considering the above, is the new hook really necessary? Can you model the 
> > new
> > instructions via (cond_exec () (set ...)) instead of (set (if_then_else 
> > ...)) ?
> Note that turning on cond_exec will turn off some of the cmove support.
Yes, cfcmov looks more like a cmov than a cond_exec.
>
> But the general suggesting of trying to avoid a hook for this is a good
> one.  In fact, my first reaction to this thread was "do we really need a
> hook for this".
Maybe a new optab, i.e. cfmov<mode>cc, which differs from mov<mode>cc in
conditional faulting?
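A tiny example of the situation in question (my sketch): without a
conditionally-faulting move, the load below cannot be if-converted,
because *p may fault when cond is false:

  int
  pick (int cond, const int *p, int fallback)
  {
    return cond ? *p : fallback;  /* cfcmov could load *p with faults
                                     suppressed when cond is false */
  }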
>
> jeff



-- 
BR,
Hongtao


RE: [PATCH 1/3 v3] vect: generate suitable convert insn for int -> int, float -> float and int <-> float.

2024-06-16 Thread Hu, Lin1
Ping this thread.

BRs,
Lin

-Original Message-
From: Hu, Lin1  
Sent: Tuesday, June 11, 2024 2:49 PM
To: gcc-patches@gcc.gnu.org
Cc: Liu, Hongtao ; ubiz...@gmail.com; rguent...@suse.de
Subject: [PATCH 1/3 v3] vect: generate suitable convert insn for int -> int, 
float -> float and int <-> float.

I wrapped a part of the code about indirect conversion.  The API mirrors
supportable_narrowing/widening_operations.
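
For readers, the kind of "indirect" conversion meant here (my sketch,
using GNU vector extensions): narrowing v2di to v2qi has no direct
instruction on most subtargets (AVX512VL provides vpmovqb), so the
vectorizer chains steps such as v2di -> v2si -> v2hi -> v2qi:

  typedef long long v2di __attribute__ ((vector_size (16)));
  typedef signed char v2qi __attribute__ ((vector_size (2)));

  v2qi
  narrow (v2di x)
  {
    /* Lowered either to a direct narrowing or to a chain of steps.  */
    return __builtin_convertvector (x, v2qi);
  }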

BRs,
Lin

gcc/ChangeLog:

PR target/107432
* tree-vect-generic.cc
(expand_vector_conversion): Support convert for int -> int,
float -> float and int <-> float.
* tree-vect-stmts.cc (vectorizable_conversion): Wrap the
indirect convert part.
(supportable_indirect_convert_operation): New function.
* tree-vectorizer.h (supportable_indirect_convert_operation):
Define the new function.

gcc/testsuite/ChangeLog:

PR target/107432
* gcc.target/i386/pr107432-1.c: New test.
* gcc.target/i386/pr107432-2.c: Ditto.
* gcc.target/i386/pr107432-3.c: Ditto.
* gcc.target/i386/pr107432-4.c: Ditto.
* gcc.target/i386/pr107432-5.c: Ditto.
* gcc.target/i386/pr107432-6.c: Ditto.
* gcc.target/i386/pr107432-7.c: Ditto.
---
 gcc/testsuite/gcc.target/i386/pr107432-1.c | 234
 gcc/testsuite/gcc.target/i386/pr107432-2.c | 105 +
 gcc/testsuite/gcc.target/i386/pr107432-3.c |  55 +
 gcc/testsuite/gcc.target/i386/pr107432-4.c |  56 +
 gcc/testsuite/gcc.target/i386/pr107432-5.c |  72 ++
 gcc/testsuite/gcc.target/i386/pr107432-6.c | 139
 gcc/testsuite/gcc.target/i386/pr107432-7.c | 156 +
 gcc/tree-vect-generic.cc                   |  33 ++-
 gcc/tree-vect-stmts.cc                     | 244 +
 gcc/tree-vectorizer.h                      |   9 +
 10 files changed, 1011 insertions(+), 92 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-4.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-5.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-6.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr107432-7.c

diff --git a/gcc/testsuite/gcc.target/i386/pr107432-1.c 
b/gcc/testsuite/gcc.target/i386/pr107432-1.c
new file mode 100644
index 000..a4f37447eb4
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr107432-1.c
@@ -0,0 +1,234 @@
+/* { dg-do compile } */
+/* { dg-options "-march=x86-64 -mavx512bw -mavx512vl -O3" } */
+/* { dg-final { scan-assembler-times "vpmovqd" 6 } } */
+/* { dg-final { scan-assembler-times "vpmovqw" 6 } } */
+/* { dg-final { scan-assembler-times "vpmovqb" 6 } } */
+/* { dg-final { scan-assembler-times "vpmovdw" 6 { target { ia32 } } } 
+} */
+/* { dg-final { scan-assembler-times "vpmovdw" 8 { target { ! ia32 } } 
+} } */
+/* { dg-final { scan-assembler-times "vpmovdb" 6 { target { ia32 } } } 
+} */
+/* { dg-final { scan-assembler-times "vpmovdb" 8 { target { ! ia32 } } 
+} } */
+/* { dg-final { scan-assembler-times "vpmovwb" 8 } } */
+
+#include <immintrin.h>
+
+typedef short __v2hi __attribute__ ((__vector_size__ (4)));
+typedef char __v2qi __attribute__ ((__vector_size__ (2)));
+typedef char __v4qi __attribute__ ((__vector_size__ (4)));
+typedef char __v8qi __attribute__ ((__vector_size__ (8)));
+
+typedef unsigned short __v2hu __attribute__ ((__vector_size__ (4)));
+typedef unsigned short __v4hu __attribute__ ((__vector_size__ (8)));
+typedef unsigned char __v2qu __attribute__ ((__vector_size__ (2)));
+typedef unsigned char __v4qu __attribute__ ((__vector_size__ (4)));
+typedef unsigned char __v8qu __attribute__ ((__vector_size__ (8)));
+typedef unsigned int __v2su __attribute__ ((__vector_size__ (8)));
+
+__v2si mm_cvtepi64_epi32_builtin_convertvector(__m128i a)
+{
+  return __builtin_convertvector((__v2di)a, __v2si);
+}
+
+__m128i mm256_cvtepi64_epi32_builtin_convertvector(__m256i a)
+{
+  return (__m128i)__builtin_convertvector((__v4di)a, __v4si);
+}
+
+__m256i mm512_cvtepi64_epi32_builtin_convertvector(__m512i a)
+{
+  return (__m256i)__builtin_convertvector((__v8di)a, __v8si);
+}
+
+__v2hi mm_cvtepi64_epi16_builtin_convertvector(__m128i a)
+{
+  return __builtin_convertvector((__v2di)a, __v2hi);
+}
+
+__v4hi mm256_cvtepi64_epi16_builtin_convertvector(__m256i a)
+{
+  return __builtin_convertvector((__v4di)a, __v4hi);
+}
+
+__m128i mm512_cvtepi64_epi16_builtin_convertvector(__m512i a)
+{
+  return (__m128i)__builtin_convertvector((__v8di)a, __v8hi);
+}
+
+__v2qi mm_cvtepi64_epi8_builtin_convertvector(__m128i a)
+{
+  return __builtin_convertvector((__v2di)a, __v2qi);
+}
+
+__v4qi mm256_cvtepi64_epi8_builtin_convertvector(__m256i a)
+{
+  return __builtin_convertvector((__v4di)a, __v4qi);
+}
+
+__v8qi mm512_cvtepi64_epi8_builtin_convertvector(

[PATCH v1] Match: Support forms 7 and 8 for the unsigned .SAT_ADD

2024-06-16 Thread pan2 . li
From: Pan Li 

While investigating the vectorization of .SAT_ADD, we noticed there
are two additional forms, aka forms 7 and 8, of .SAT_ADD.

Form 7:
  #define DEF_SAT_U_ADD_FMT_7(T)  \
  T __attribute__((noinline)) \
  sat_u_add_##T##_fmt_7 (T x, T y)\
  {   \
return x > (T)(x + y) ? -1 : (x + y); \
  }

Form 8:
  #define DEF_SAT_U_ADD_FMT_8(T)   \
  T __attribute__((noinline))  \
  sat_u_add_##T##_fmt_8 (T x, T y) \
  {\
return x <= (T)(x + y) ? (x + y) : -1; \
  }

Thus, add the above two forms to the match gimple_unsigned_integer_sat_add,
so that the vectorizer can try to recognize patterns like form 7 and
form 8.
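
Concretely, form 7 instantiated for uint8_t (illustrative):

  #include <stdint.h>

  uint8_t
  sat_u_add_u8 (uint8_t x, uint8_t y)
  {
    uint8_t sum = x + y;          /* wraps modulo 256 on overflow */
    return x > sum ? 0xff : sum;  /* overflow iff the wrapped sum < x */
  }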

The below test suites are passed for this patch:
1. The rv64gcv fully regression test with newlib.
2. The rv64gcv build with glibc.
3. The x86 bootstrap test.
4. The x86 fully regression test.

gcc/ChangeLog:

* match.pd: Add form 7 and 8 for the unsigned .SAT_ADD match.

Signed-off-by: Pan Li 
---
 gcc/match.pd | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/gcc/match.pd b/gcc/match.pd
index 99968d316ed..aae6d30a5e4 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -3144,6 +3144,16 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (cond^ (ne (imagpart (IFN_ADD_OVERFLOW:c @0 @1)) integer_zerop)
   integer_minus_onep (usadd_left_part_2 @0 @1)))
 
+/* Unsigned saturation add, case 7 (branch with le):
+   SAT_ADD = x <= (X + Y) ? (X + Y) : -1.  */
+(match (unsigned_integer_sat_add @0 @1)
+ (cond^ (le @0 (usadd_left_part_1@2 @0 @1)) @2 integer_minus_onep))
+
+/* Unsigned saturation add, case 8 (branch with gt):
+   SAT_ADD = x > (X + Y) ? -1 : (X + Y).  */
+(match (unsigned_integer_sat_add @0 @1)
+ (cond^ (gt @0 (usadd_left_part_1@2 @0 @1)) integer_minus_onep @2))
+
 /* Unsigned saturation sub, case 1 (branch with gt):
SAT_U_SUB = X > Y ? X - Y : 0  */
 (match (unsigned_integer_sat_sub @0 @1)
-- 
2.34.1



Re: [PATCH] tree-optimization/115254 - don't account single-lane SLP against discovery limit

2024-06-16 Thread YunQiang Su
Richard Biener wrote on Thu, Jun 6, 2024 at 14:20:
>
> On Thu, 6 Jun 2024, YunQiang Su wrote:
>
> > Richard Biener wrote on Tue, May 28, 2024 at 17:47:
> > >
> > > The following avoids accounting single-lane SLP to the discovery
> > > limit.  As the two testcases show this makes discovery fail,
> > > unfortunately even not the same across targets.  The following
> > > should fix two FAILs for GCN as a side-effect.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
> > >
> > > PR tree-optimization/115254
> > > * tree-vect-slp.cc (vect_build_slp_tree): Only account
> > > multi-lane SLP to limit.
> > >
> > > * gcc.dg/vect/slp-cond-2-big-array.c: Expect 4 times SLP.
> > > * gcc.dg/vect/slp-cond-2.c: Likewise.
> >
> > With this patch, MIPS/MSA still has only 3 times SLP.
> > I am digging into the problem.
>
> I bet it's an issue with missed permutes.  f3() requires interleaving
> of two VnQImode vectors.
>

Thanks. This problem disappears when I try to implement vcond_mask.


libbacktrace patch committed: OK if zero backward bits

2024-06-16 Thread Ian Lance Taylor
I've committed this libbacktrace patch to not fail on the case where
there are no bits available when looking backward.  This can happen at
the very end of the frame if no bits are actually required.  The test
case is long and may be proprietary, so not including it.
Bootstrapped and ran libbacktrace and Go testsuite.  Committed to
mainline.

Ian

* elf.c (elf_fetch_bits_backward): Don't fail if no bits are
available.
dda0996e11dbc07f63d3456e36dc5eaec7361004
diff --git a/libbacktrace/elf.c b/libbacktrace/elf.c
index 3cd87020b03..735f8752500 100644
--- a/libbacktrace/elf.c
+++ b/libbacktrace/elf.c
@@ -1182,14 +1182,7 @@ elf_fetch_bits_backward (const unsigned char **ppin,
   val = *pval;
 
   if (unlikely (pin <= pinend))
-    {
-      if (bits == 0)
-        {
-          elf_uncompress_failed ();
-          return 0;
-        }
-      return 1;
-    }
+    return 1;
 
   pin -= 4;
 


Re: [Fortran, Patch, PR 96992] Fix Class arrays of different ranks are rejected as storage association argument

2024-06-16 Thread Harald Anlauf

Hi Andre,

On 14.06.24 at 17:05, Andre Vehreschild wrote:

Hi all,

I somehow got assigned to this PR, so I fixed it.  GFortran was ICEing
because of the ASSUME_RANK in a derived-to-class conversion.  After fixing
this, storage association was producing segfaults.  The "shape conversion"
of the class array as dummy argument was not initializing the dim 0 stride,
and with that it was grabbing into memory somewhere.  This is now fixed and

regtests fine on x86_64 Fedora 39. Ok for mainline?


the patch fixes the testcase in your submission, but not the following
slight variation of the main program:

module foo_mod
  implicit none
  type foo
 integer :: i
  end type foo
contains
  subroutine d1(x,n)
integer, intent(in) :: n
integer :: i
class (foo), intent(out) :: x(n)
select type(x)
class is(foo)
   x(:)%i = (/ (42 + i, i = 1, n ) /)
class default
   stop 1
end select
  end subroutine d1
  subroutine d2(x,n)
integer, intent(in) :: n
integer :: i
class (foo), intent(in) :: x(n,n,n)
select type (x)
class is (foo)
   print *,x%i
   if ( any( x%i /= reshape((/ (42 + i, i = 1, n ** 3 ) /), [n, n, n] ))) stop 2

class default
   stop 3
end select
  end subroutine d2
end module foo_mod
program main
  use foo_mod
  implicit none
  type (foo), dimension(:), allocatable :: f
  integer :: n
  n = 2
  allocate (f(n*n*n))
  ! Original testcase:
  call d1(f,n*n*n)
  call d2(f,n)  ! OK
  call d1(f(1:n*n*n),n*n*n)
  print *, "After call d1(f(1:n*n*n:1),n*n*n):"
  print *, f%i
  call d2(f(1:n*n*n),n) ! OK
  ! Using stride -1:
  call d1(f(n*n*n:1:-1),n*n*n)
  print *, "After call d1(f(n*n*n:1:-1),n*n*n):"
  print *, f%i
  call d2(f(n*n*n:1:-1),n)  ! not OK
  deallocate (f)
end program main

While this runs fine with the latest Intel compiler, gfortran including
your patch prints:

  43  44  45  46  47  48  49  50

 After call d1(f(1:n*n*n:1),n*n*n):
  43  44  45  46  47  48  49  50
  43  44  45  46  47  48  49  50

 After call d1(f(n*n*n:1:-1),n*n*n):
  50  49  48  47  46  45  44  43
  43   0   0  49   0  34244976   0  34238480

STOP 2

So while the negative stride (-1) in the call to d1 appears to
work as it should, it does not work properly for the call to d2.
The first array element is fine in d2, but anything else isn't.

Do you see what goes wrong here?

(This may be a more general, pre-existing issue in a different place.)

Thanks,
Harald

P.S.: Regarding your commit message, I think the PR reference in brackets
should be moved to the end of the summary line, i.e. for

Fortran: [PR96992] Fix rejecting class arrays of different ranks as
storage association argument.

the "[PR96992]" should be moved.  That also makes it easier to read.


I assume this patch could be fixing some other PRs with class array's parameter
passing, too. If that sounds familiar, feel free to point me to them.

Regards,
Andre
--
Andre Vehreschild * Email: vehre ad gmx dot de





[PATCH] aarch64: Fix reg_is_wrapped_separately array size [PR100211]

2024-06-16 Thread Andrew Pinski
Currently the size of the array reg_is_wrapped_separately is LAST_SAVED_REGNUM.
But LAST_SAVED_REGNUM could be a regno that is being saved, so the size needs
to be `LAST_SAVED_REGNUM + 1`, like aarch64_frame->reg_offset is.

Committed as obvious after a bootstrap/test for aarch64-linux-gnu.

gcc/ChangeLog:

* config/aarch64/aarch64.h (machine_function): Fix the size
of reg_is_wrapped_separately.

Signed-off-by: Andrew Pinski 
---
 gcc/config/aarch64/aarch64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 0997b82dbc0..2b89f6f88ef 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -1059,7 +1059,7 @@ typedef struct GTY (()) machine_function
 {
   struct aarch64_frame frame;
   /* One entry for each hard register.  */
-  bool reg_is_wrapped_separately[LAST_SAVED_REGNUM];
+  bool reg_is_wrapped_separately[LAST_SAVED_REGNUM + 1];
   /* One entry for each general purpose register.  */
   rtx call_via[SP_REGNUM];
 
-- 
2.43.0



[to-be-committed][RISC-V] Improve variable bit set for rv64

2024-06-16 Thread Jeff Law


Another case of being able to safely use bset for 1 << n.  In this case 
the (1 << n)  is explicitly zero extended from SI to DI.  Two things to 
keep in mind.  The (1 << n) is done in SImode.  So it doesn't directly 
define bits 32..63 and those bits are cleared by the explicit zero 
extension.  Second, if N is out of SImode's range, then the original 
source level construct was undefined.


Thus we can use bset with x0 as our source input.

I think this testcase was from the RAU team.  It doesn't immediately 
look like something from SPEC, but that's where they were primarily focused.


This has been through Ventana's CI system in the past.  I've also 
recently added zbs testing to my own tester and naturally this passed 
there as well.  I'll wait for the pre-commit CI to do its thing before 
moving forward.  The plan would be to commit after passing.




Jeff

diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 0d35fb786e1..311f0d373c0 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -597,6 +597,18 @@ (define_insn "*bset_1"
   "bset\t%0,x0,%1"
   [(set_attr "type" "bitmanip")])
 
+;; The result will always have bits 32..63 clear, so the zero-extend
+;; is redundant.  We could split it to bset_1, but it seems
+;; unnecessary.
+(define_insn "*bsetdi_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (zero_extend:DI
+          (ashift:SI (const_int 1)
+                     (match_operand:QI 1 "register_operand" "r"))))]
+  "TARGET_64BIT && TARGET_ZBS"
+  "bset\t%0,x0,%1"
+  [(set_attr "type" "bitmanip")])
+
 (define_insn "*bset_1_mask"
   [(set (match_operand:X 0 "register_operand" "=r")
(ashift:X (const_int 1)
diff --git a/gcc/testsuite/gcc.target/riscv/zbs-zext-2.c b/gcc/testsuite/gcc.target/riscv/zbs-zext-2.c
new file mode 100644
index 000..ebd269d1695
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/zbs-zext-2.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gc_zbs -mabi=lp64" } */
+/* { dg-skip-if "" { *-*-* } { "-O0" "-Og" "-O1" } } */
+unsigned long long foo(long long symbol)
+{
+return 1u << symbol;
+}
+
+/* { dg-final { scan-assembler-times "bset\t" 1 } } */
+/* { dg-final { scan-assembler-not "li\t"} } */
+/* { dg-final { scan-assembler-not "sllw\t"} } */
+/* { dg-final { scan-assembler-not "zext.w\t"} } */


[PATCH] LoongArch: NFC: Dedup and sort the comment in loongarch_print_operand_reloc

2024-06-16 Thread Xi Ruoyao
gcc/ChangeLog:

* config/loongarch/loongarch.cc (loongarch_print_operand_reloc):
Dedup and sort the comment describing modifiers.
---

It's a non-functional change, thus I've not tested it.  Ok for trunk?

 gcc/config/loongarch/loongarch.cc | 10 +-
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc
index 256b76d044b..dcb32a96577 100644
--- a/gcc/config/loongarch/loongarch.cc
+++ b/gcc/config/loongarch/loongarch.cc
@@ -6132,21 +6132,13 @@ loongarch_print_operand_reloc (FILE *file, rtx op, bool hi64_part,
'T' Print 'f' for (eq:CC ...), 't' for (ne:CC ...),
  'z' for (eq:?I ...), 'n' for (ne:?I ...).
't' Like 'T', but with the EQ/NE cases reversed
-   'F' Print the FPU branch condition for comparison OP.
-   'W' Print the inverse of the FPU branch condition for comparison OP.
-   'w' Print a LSX register.
'u' Print a LASX register.
-   'T' Print 'f' for (eq:CC ...), 't' for (ne:CC ...),
- 'z' for (eq:?I ...), 'n' for (ne:?I ...).
-   't' Like 'T', but with the EQ/NE cases reversed
-   'Y' Print loongarch_fp_conditions[INTVAL (OP)]
-   'Z' Print OP and a comma for 8CC, otherwise print nothing.
-   'z' Print $0 if OP is zero, otherwise print OP normally.
'v' Print the insn size suffix b, h, w or d for vector modes V16QI, V8HI,
  V4SI, V2SI, and w, d for vector modes V4SF, V2DF respectively.
'V' Print exact log2 of CONST_INT OP element 0 of a replicated
  CONST_VECTOR in decimal.
'W' Print the inverse of the FPU branch condition for comparison OP.
+   'w' Print a LSX register.
'X' Print CONST_INT OP in hexadecimal format.
'x' Print the low 16 bits of CONST_INT OP in hexadecimal format.
'Y' Print loongarch_fp_conditions[INTVAL (OP)]
-- 
2.45.2



[RFC PATCH] ARM: thumb1: Use LDMIA/STMIA for DI/DF loads/stores

2024-06-16 Thread Siarhei Volkau
If the address register is dead after the load/store operation, it looks
beneficial to use LDMIA/STMIA instead of a pair of LDR/STR instructions,
at least when optimizing for size.

E.g.
 ldr r0, [r3, #0]
 ldr r1, [r3, #4]  @ r3 is dead after
will be replaced by
 ldmia r3!, {r0, r1}

it is also legal for a reused register:
 ldr r2, [r3, #0]
 ldr r3, [r3, #4] @ r3 reused
will be replaced by
 ldmia r3, {r2, r3}

However, I know little about thumb CPUs other than Cortex M0/M0+.
1. Are there any drawbacks when optimizing for speed?
2. Might it be profitable for thumb2?

Regarding code size with the patch gives for v6-m/nofp:
   libgcc:  -52 bytes / -0.10%
Newlib's libc:  -68 bytes / -0.03%
 libm:  -96 bytes / -0.10%
libstdc++: -140 bytes / -0.02%

Also, I have questions regarding testing the patch.
It's unclear to me how to do it properly: for now I compile
for the arm-none-eabi target, and make check seems to fail
on any compilable test due to missing symbols from libnosys.
I guess that arm-gnu-elf is the correct triple, but advice on
the proper commands to build & run the testsuite would be welcome.

Signed-off-by: Siarhei Volkau 
---
 gcc/config/arm/arm-protos.h |  2 +-
 gcc/config/arm/arm.cc   |  7 ++-
 gcc/config/arm/thumb1.md| 10 --
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
index 2cd560c9925..548bfbaccdc 100644
--- a/gcc/config/arm/arm-protos.h
+++ b/gcc/config/arm/arm-protos.h
@@ -254,7 +254,7 @@ extern int thumb_shiftable_const (unsigned HOST_WIDE_INT);
 extern enum arm_cond_code maybe_get_arm_condition_code (rtx);
 extern void thumb1_final_prescan_insn (rtx_insn *);
 extern void thumb2_final_prescan_insn (rtx_insn *);
-extern const char *thumb_load_double_from_address (rtx *);
+extern const char *thumb_load_double_from_address (rtx *, rtx_insn *);
 extern const char *thumb_output_move_mem_multiple (int, rtx *);
 extern const char *thumb_call_via_reg (rtx);
 extern void thumb_expand_cpymemqi (rtx *);
diff --git a/gcc/config/arm/arm.cc b/gcc/config/arm/arm.cc
index b8c32db0a1d..73c2478ed77 100644
--- a/gcc/config/arm/arm.cc
+++ b/gcc/config/arm/arm.cc
@@ -28350,7 +28350,7 @@ thumb1_output_interwork (void)
a computed memory address.  The computed address may involve a
register which is overwritten by the load.  */
 const char *
-thumb_load_double_from_address (rtx *operands)
+thumb_load_double_from_address (rtx *operands, rtx_insn *insn)
 {
   rtx addr;
   rtx base;
@@ -28368,6 +28368,11 @@ thumb_load_double_from_address (rtx *operands)
   switch (GET_CODE (addr))
 {
 case REG:
+  if (find_reg_note (insn, REG_DEAD, addr))
+return "ldmia\t%m1!, {%0, %H0}";
+  else if (REGNO (addr) == REGNO (operands[0]) + 1)
+return "ldmia\t%m1, {%0, %H0}";
+
   operands[2] = adjust_address (operands[1], SImode, 4);
 
   if (REGNO (operands[0]) == REGNO (addr))
diff --git a/gcc/config/arm/thumb1.md b/gcc/config/arm/thumb1.md
index d7074b43f60..8da6887b560 100644
--- a/gcc/config/arm/thumb1.md
+++ b/gcc/config/arm/thumb1.md
@@ -637,8 +637,11 @@
 case 5:
   return \"stmia\\t%0, {%1, %H1}\";
 case 6:
-  return thumb_load_double_from_address (operands);
+  return thumb_load_double_from_address (operands, insn);
 case 7:
+  if (MEM_P (operands[0]) && REG_P (XEXP (operands[0], 0))
+  && find_reg_note (insn, REG_DEAD, XEXP (operands[0], 0)))
+return \"stmia\\t%m0!, {%1, %H1}\";
   operands[2] = gen_rtx_MEM (SImode,
 plus_constant (Pmode, XEXP (operands[0], 0), 4));
   output_asm_insn (\"str\\t%1, %0\;str\\t%H1, %2\", operands);
@@ -970,8 +973,11 @@
 case 2:
   return \"stmia\\t%0, {%1, %H1}\";
 case 3:
-  return thumb_load_double_from_address (operands);
+  return thumb_load_double_from_address (operands, insn);
 case 4:
+  if (MEM_P (operands[0]) && REG_P (XEXP (operands[0], 0))
+  && find_reg_note (insn, REG_DEAD, XEXP (operands[0], 0)))
+return \"stmia\\t%m0!, {%1, %H1}\";
   operands[2] = gen_rtx_MEM (SImode,
 plus_constant (Pmode,
XEXP (operands[0], 0), 4));
-- 
2.45.2



[pushed] wwwdocs: news: Update link to our ACM SIGPLAN award

2024-06-16 Thread Gerald Pfeifer
This isn't just http to https; the anchor has also changed.

(Not sure why anyone would go for #2014_The_GNU_Compiler_Collection_(GCC)
- but so be it.)

Pushed.

Gerald
---
 htdocs/news.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/htdocs/news.html b/htdocs/news.html
index 7d793add..4a6c2ab3 100644
--- a/htdocs/news.html
+++ b/htdocs/news.html
@@ -314,7 +314,7 @@
 [2014-06-12] wwwdocs:
 
 
-<a href="http://www.sigplan.org/Awards/Software/#2014">ACM SIGPLAN Programming Languages Software Award</a>
+<a href="https://www.sigplan.org/Awards/Software/#2014_The_GNU_Compiler_Collection_(GCC)">ACM SIGPLAN Programming Languages Software Award</a>
 [2014-06-10] wwwdocs:
 
 
-- 
2.45.2


[PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles

2024-06-16 Thread Feng Xue OS
When transforming multiple lane-reducing operations in a loop reduction chain,
originally, corresponding vectorized statements are generated into def-use
cycles starting from 0.  The def-use cycle with smaller index would contain
more statements, which means more instruction dependency.  For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

For higher instruction parallelism in the final vectorized loop, an optimal
means is to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles.  Transformed that way (see the updated
comment in the patch below), DOT_PROD, WIDEN_SUM and the SADs are generated
into disparate cycles, and the instruction dependency can be eliminated.
Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 39 +++
 gcc/tree-vectorizer.h |  6 ++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d91665a341..c7e13d655d8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 
-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy
 
   sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
-   */
+
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vectorized lane-
+reducing statements be distributed evenly among all def-use cycles.
+In the above example, SADs are generated into other cycles rather
+than that of DOT_PROD.  */
   unsigned using_ncopies = vec_oprnds[0].length ();
   unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+  unsigned result_pos = reduc_info->reduc_result_pos;
+
+  reduc_info->reduc_result_pos
+   = (result_pos + using_ncopies) % reduc_ncopies;
+  gcc_assert (result_pos < reduc_ncopies);
 
   for (unsigned i = 0; i < op.num_ops - 1; i++)
{
  gcc_assert (vec_oprnds[i].length () == using_ncopies);
  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+
+ /* Find suitable def-use cycles to generate vectorized statements
+into, and reorder operands based on the selection.  */
+ if (result_pos)
+   {
+ unsigned count = reduc_ncopies - using_ncopies;
+ unsigned start = result_pos - count;
+
+ if ((int) start < 0)
+   {
+ count = result_pos;
+ start = 0;
+   }
+
+ for (unsigned j = using_ncopies; j > start; j--)
+   {
+ unsigned k = j - 1;
+ std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+ gcc_assert (!vec_oprnds[i][k]);
+   }
+   }
}
 }
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+ lane-reducing operation participating in it may not use all of those
+ results, this field specifies result index starting from which any
+ following land-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_RED

[PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-16 Thread Feng Xue OS
For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
the current vectorizer can only handle the pattern if the reduction chain
contains no other operation, whether normal or lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of a loop reduction chain with mixed input vectypes.
Since the number of lanes in a vectype may vary with the operation, the
effective ncopies of the vectorized statements may also differ from one
operation to another, which causes a mismatch between the vectorized
def-use cycles. A simple way out is to align all operations with the one
that has the most ncopies; the gap can be filled by generating extra
trivial pass-through copies. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
   sum += n[i];   // normal 
 }

The vector size is 128-bit and the vectorization factor is 16. The
reduction statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
   sum_v1 += n_v1[i: 4  ~ 7 ];
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
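
For reference, the copy counts above fall out of the 128-bit vector size
and the vectorization factor of 16 as below (element types assumed from
the lane ranges: char inputs for dot-prod/widen-sum, short inputs for sad,
int for the normal add); the largest count dictates reduc_ncopies = 4:

   dot-prod:  char inputs,  16 lanes/vector -> 16/16 = 1 copy
   widen-sum: char inputs,  16 lanes/vector -> 16/16 = 1 copy
   sad:       short inputs,  8 lanes/vector -> 16/8  = 2 copies
   normal:    int inputs,    4 lanes/vector -> 16/4  = 4 copies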

Thanks,
Feng

---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (vectorizable_lane_reducing): New function
declaration.
* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
vectorizable_lane_reducing to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
code related to emulated_mixed_dot_prod.
(vect_reduction_update_partial_vector_usage): Compute ncopies as the
original means for single-lane slp node.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing.
(vect_transform_reduction): Extend transformation to support reduction
statements with mixed input vectypes.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c
* gcc.dg/vect/vect-reduc-chain-2.c
* gcc.dg/vect/vect-reduc-chain-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  62 
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  77 +
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  66 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  35 ++
 gcc/tree-vect-loop.cc | 324 ++
 gcc/tree-vect-stmts.cc|   2 +
 gcc/tree-vectorizer.h |   2 +
 11 files changed, 802 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require

[PATCH 6/8] vect: Tighten an assertion for lane-reducing in transform

2024-06-16 Thread Feng Xue OS
According to the logic of the code near the assertion, no lane-reducing
operation should appear at this point, not just DOT_PROD_EXPR. Since
"use_mask_by_cond_expr_p" treats SAD_EXPR the same as DOT_PROD_EXPR, and
WIDEN_SUM_EXPR would not get past the following assertion "gcc_assert
(commutative_binary_op_p (...))", tighten the assertion to cover all
lane-reducing ops.
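
For context, the properties relied on above can be tabulated as below
(a sketch inferred from the surrounding code, not authoritative):

   op              conditional ifn   maskable via COND_EXPR
   DOT_PROD_EXPR   no                yes (use_mask_by_cond_expr_p)
   SAD_EXPR        no                yes (use_mask_by_cond_expr_p)
   WIDEN_SUM_EXPR  no                no

So on the masked_loop_p && !mask_by_cond_expr path, none of the three may
appear, which is exactly what the widened assertion states.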

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vect_transform_reduction): Change assertion to
cover all lane-reducing ops.
---
 gcc/tree-vect-loop.cc | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 7909d63d4df..e0561feddce 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8643,7 +8643,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 }
 
   bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
-  gcc_assert (single_defuse_cycle || lane_reducing_op_p (code));
+  bool lane_reducing = lane_reducing_op_p (code);
+  gcc_assert (single_defuse_cycle || lane_reducing);
 
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
@@ -8698,8 +8699,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
   if (masked_loop_p && !mask_by_cond_expr)
{
- /* No conditional ifns have been defined for dot-product yet.  */
- gcc_assert (code != DOT_PROD_EXPR);
+ /* No conditional ifns have been defined for lane-reducing op
+yet.  */
+ gcc_assert (!lane_reducing);
 
  /* Make sure that the reduction accumulator is vop[0].  */
  if (reduc_index == 1)
-- 
2.17.1



[PATCH 5/8] vect: Use an array to replace 3 relevant variables

2024-06-16 Thread Feng Xue OS
It is better to place the 3 related independent variables into an array,
since a subsequent patch needs to access them via an index. This change
also makes some duplicated code more compact.
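
As a minimal sketch of the payoff, condensed from the last hunk below:
the three-way branch on reduc_index collapses into a single indexed push.

   /* Before: */
   if (reduc_index == 0)
     vec_oprnds0.safe_push (gimple_get_lhs (new_stmt));
   else if (reduc_index == 1)
     vec_oprnds1.safe_push (gimple_get_lhs (new_stmt));
   else if (reduc_index == 2)
     vec_oprnds2.safe_push (gimple_get_lhs (new_stmt));

   /* After: */
   vec_oprnds[reduc_index].safe_push (gimple_get_lhs (new_stmt));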

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vect_transform_reduction): Replace vec_oprnds0/1/2
with one new array variable vec_oprnds[3].
---
 gcc/tree-vect-loop.cc | 42 +-
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 39aa5cb1197..7909d63d4df 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8605,9 +8605,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 
   /* Transform.  */
   tree new_temp = NULL_TREE;
-  auto_vec<tree> vec_oprnds0;
-  auto_vec<tree> vec_oprnds1;
-  auto_vec<tree> vec_oprnds2;
+  auto_vec<tree> vec_oprnds[3];
 
   if (dump_enabled_p ())
 dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n");
@@ -8657,12 +8655,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 {
   vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
 single_defuse_cycle && reduc_index == 0
-? NULL_TREE : op.ops[0], &vec_oprnds0,
+? NULL_TREE : op.ops[0], &vec_oprnds[0],
 single_defuse_cycle && reduc_index == 1
-? NULL_TREE : op.ops[1], &vec_oprnds1,
+? NULL_TREE : op.ops[1], &vec_oprnds[1],
 op.num_ops == 3
 && !(single_defuse_cycle && reduc_index == 2)
-? op.ops[2] : NULL_TREE, &vec_oprnds2);
+? op.ops[2] : NULL_TREE, &vec_oprnds[2]);
 }
   else
 {
@@ -8670,12 +8668,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 vectype.  */
   gcc_assert (single_defuse_cycle
  && (reduc_index == 1 || reduc_index == 2));
-  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
-op.ops[0], truth_type_for (vectype_in), &vec_oprnds0,
+  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, op.ops[0],
+truth_type_for (vectype_in), &vec_oprnds[0],
 reduc_index == 1 ? NULL_TREE : op.ops[1],
-NULL_TREE, &vec_oprnds1,
+NULL_TREE, &vec_oprnds[1],
 reduc_index == 2 ? NULL_TREE : op.ops[2],
-NULL_TREE, &vec_oprnds2);
+NULL_TREE, &vec_oprnds[2]);
 }
 
   /* For single def-use cycles get one copy of the vectorized reduction
@@ -8683,20 +8681,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   if (single_defuse_cycle)
 {
   vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, 1,
-reduc_index == 0 ? op.ops[0] : NULL_TREE, &vec_oprnds0,
-reduc_index == 1 ? op.ops[1] : NULL_TREE, &vec_oprnds1,
+reduc_index == 0 ? op.ops[0] : NULL_TREE,
+&vec_oprnds[0],
+reduc_index == 1 ? op.ops[1] : NULL_TREE,
+&vec_oprnds[1],
 reduc_index == 2 ? op.ops[2] : NULL_TREE,
-&vec_oprnds2);
+&vec_oprnds[2]);
 }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
+  unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
 
-  unsigned num = (reduc_index == 0
- ? vec_oprnds1.length () : vec_oprnds0.length ());
   for (unsigned i = 0; i < num; ++i)
 {
   gimple *new_stmt;
-  tree vop[3] = { vec_oprnds0[i], vec_oprnds1[i], NULL_TREE };
+  tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
   if (masked_loop_p && !mask_by_cond_expr)
{
  /* No conditional ifns have been defined for dot-product yet.  */
@@ -8721,7 +8720,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   else
{
  if (op.num_ops >= 3)
-   vop[2] = vec_oprnds2[i];
+   vop[2] = vec_oprnds[2][i];
 
  if (masked_loop_p && mask_by_cond_expr)
{
@@ -8752,14 +8751,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
}
 
   if (single_defuse_cycle && i < num - 1)
-   {
- if (reduc_index == 0)
-   vec_oprnds0.safe_push (gimple_get_lhs (new_stmt));
- else if (reduc_index == 1)
-   vec_oprnds1.safe_push (gimple_get_lhs (new_stmt));
- else if (reduc_index == 2)
-   vec_oprnds2.safe_push (gimple_get_lhs (new_stmt));
-   }
+   vec_oprnds[reduc_index].safe_push (gimple_get_lhs (new_stmt));
   else if (slp_node)
slp_node->push_vec_def (new_stmt);
   else
-- 
2.17.1

[PATCH 4/8] vect: Determine input vectype for multiple lane-reducing

2024-06-16 Thread Feng Xue OS
The input vectype of the reduction PHI statement must be determined before
vect cost computation for the reduction. Since a lane-reducing operation
has a different input vectype from a normal one, we need to traverse all
reduction statements to find the input vectype with the least lanes, and
set that on the PHI statement.
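
A worked example of the selection rule (types hypothetical): with VF = 16
and 128-bit vectors, suppose one dot-prod takes char inputs and one sad
takes short inputs.

   dot-prod: input vectype vector(16) char -> ncopies = 16/16 = 1
   sad:      input vectype vector(8) short -> ncopies = 16/8  = 2

The short input has the wider scalar mode and thus the fewer lanes, so
vector(8) short is recorded on the reduction PHI, giving the most ncopies
(2) for the vectorized reduction results.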

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Determine input vectype
during traversal of reduction statements.
---
 gcc/tree-vect-loop.cc | 72 +--
 1 file changed, 49 insertions(+), 23 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 0f7b125e72d..39aa5cb1197 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 {
   stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
   stmt_vec_info vdef = vect_stmt_to_vectorize (def);
-  if (STMT_VINFO_REDUC_IDX (vdef) == -1)
+  int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
+
+  if (reduc_idx == -1)
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -7686,10 +7688,50 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  return false;
}
}
-  else if (!stmt_info)
-   /* First non-conversion stmt.  */
-   stmt_info = vdef;
-  reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
+  else
+   {
+ /* First non-conversion stmt.  */
+ if (!stmt_info)
+   stmt_info = vdef;
+
+ if (lane_reducing_op_p (op.code))
+   {
+ unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
+ tree op_type = TREE_TYPE (op.ops[0]);
+ tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
+op_type,
+group_size);
+
+ /* The last operand of lane-reducing operation is for
+reduction.  */
+ gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
+
+ /* For lane-reducing operation vectorizable analysis needs the
+reduction PHI information */
+ STMT_VINFO_REDUC_DEF (def) = phi_info;
+
+ if (!new_vectype_in)
+   return false;
+
+ /* Each lane-reducing operation has its own input vectype, while
+reduction PHI will record the input vectype with the least
+lanes.  */
+ STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in;
+
+ /* To accommodate lane-reducing operations of mixed input
+vectypes, choose input vectype with the least lanes for the
+reduction PHI statement, which would result in the most
+ncopies for vectorized reduction results.  */
+ if (!vectype_in
+ || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
+  < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type
+   vectype_in = new_vectype_in;
+   }
+ else
+   vectype_in = STMT_VINFO_VECTYPE (phi_info);
+   }
+
+  reduc_def = op.ops[reduc_idx];
   reduc_chain_length++;
   if (!stmt_info && slp_node)
slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
@@ -7747,6 +7789,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 
   tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
   STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
+  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
+
   gimple_match_op op;
   if (!gimple_extract_op (stmt_info->stmt, &op))
 gcc_unreachable ();
@@ -7831,16 +7875,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  = get_vectype_for_scalar_type (loop_vinfo,
 TREE_TYPE (op.ops[i]), slp_op[i]);
 
-  /* To properly compute ncopies we are interested in the widest
-non-reduction input type in case we're looking at a widening
-accumulation that we later handle in vect_transform_reduction.  */
-  if (lane_reducing
- && vectype_op[i]
- && (!vectype_in
- || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
- < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_op[i]))
-   vectype_in = vectype_op[i];
-
   /* Record how the non-reduction-def value of COND_EXPR is defined.
 ???  For a chain of multiple CONDs we'd have to match them up all.  */
   if (op.code == COND_EXPR && reduc_chain_length == 1)
@@ -7859,14 +7893,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
}
}
 }
-  if (!vectype_in)
-vectype_in = STMT_VINFO_VECTYPE (phi_info);
-  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
-
-  /* Each lane-reducing operation has 

[PATCH 3/8] vect: Use one reduction_type local variable

2024-06-16 Thread Feng Xue OS
Two local variables were defined to refer to the same
STMT_VINFO_REDUC_TYPE; it is better to keep only one.

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Remove v_reduc_type, and
replace it to another local variable reduction_type.
---
 gcc/tree-vect-loop.cc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6e8b3639daf..0f7b125e72d 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7868,10 +7868,10 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (lane_reducing)
 STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
 
-  enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info);
-  STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type;
+  enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
+  STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
   /* If we have a condition reduction, see if we can simplify it further.  */
-  if (v_reduc_type == COND_REDUCTION)
+  if (reduction_type == COND_REDUCTION)
 {
   if (slp_node && SLP_TREE_LANES (slp_node) != 1)
return false;
@@ -8038,7 +8038,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 
   STMT_VINFO_REDUC_CODE (reduc_info) = orig_code;
 
-  vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
+  reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
   if (reduction_type == TREE_CODE_REDUCTION)
 {
   /* Check whether it's ok to change the order of the computation.
-- 
2.17.1



[PATCH 2/8] vect: Remove duplicated check on reduction operand

2024-06-16 Thread Feng Xue OS
In vectorizable_reduction, one check on a reduction operand via its index
is subsumed by another check via pointer equality, so remove the former.
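
A minimal illustration of the containment (a sketch, not the surrounding
GCC code verbatim):

   int reduc_idx = STMT_VINFO_REDUC_IDX (stmt_info);
   /* Removed check: */
   if ((int) i == reduc_idx)
     continue;
   /* Retained check; when i == reduc_idx it holds by identity, so the
      removed one is subsumed: */
   if (op.ops[i] == op.ops[reduc_idx])
     continue;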

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Remove the duplicated
check.
---
 gcc/tree-vect-loop.cc | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index d9a2ad69484..6e8b3639daf 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7815,11 +7815,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 "use not simple.\n");
  return false;
}
-  if (i == STMT_VINFO_REDUC_IDX (stmt_info))
-   continue;
 
-  /* For an IFN_COND_OP we might hit the reduction definition operand
-twice (once as definition, once as else).  */
+  /* Skip reduction operands, and for an IFN_COND_OP we might hit the
+reduction operand twice (once as definition, once as else).  */
   if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)])
continue;
 
-- 
2.17.1



[PATCH 1/8] vect: Add a function to check lane-reducing stmt

2024-06-16 Thread Feng Xue OS
This series of patches is meant to support multiple lane-reducing
reduction statements. Since the original ones conflicted with the new
single-lane slp node patches, I have reworked most of the patches and
split them as small as possible, which may make code review easier.

In this first one, I add a utility function to check whether a statement
is a lane-reducing operation, which simplifies some existing code.
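
As a usage sketch (the loop below is hypothetical; only
lane_reducing_stmt_p itself comes from the patch):

   for (stmt_vec_info info : loop_vinfo->reductions)
     {
       if (lane_reducing_stmt_p (STMT_VINFO_STMT (info)))
         ;  /* dot-prod / widen-sum / sad: needs special handling */
       else
         ;  /* ordinary reduction statement */
     }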

Thanks,
Feng

---
gcc/
* tree-vectorizer.h (lane_reducing_stmt_p): New function.
* tree-vect-slp.cc (vect_analyze_slp): Use new function
lane_reducing_stmt_p to check statement.
---
 gcc/tree-vect-slp.cc  |  4 +---
 gcc/tree-vectorizer.h | 12 
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 7e3d0107b4e..b4ea2e18f00 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3919,7 +3919,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
  scalar_stmts.create (loop_vinfo->reductions.length ());
  for (auto next_info : loop_vinfo->reductions)
{
- gassign *g;
  next_info = vect_stmt_to_vectorize (next_info);
  if ((STMT_VINFO_RELEVANT_P (next_info)
   || STMT_VINFO_LIVE_P (next_info))
@@ -3931,8 +3930,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
{
  /* Do not discover SLP reductions combining lane-reducing
 ops, that will fail later.  */
- if (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info)))
- || !lane_reducing_op_p (gimple_assign_rhs_code (g)))
+ if (!lane_reducing_stmt_p (STMT_VINFO_STMT (next_info)))
scalar_stmts.quick_push (next_info);
  else
{
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6bb0f5c3a56..60224f4e284 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2169,12 +2169,24 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
  && th >= vect_vf_for_cost (loop_vinfo));
 }
 
+/* Return true if CODE is a lane-reducing opcode.  */
+
 inline bool
 lane_reducing_op_p (code_helper code)
 {
   return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR;
 }
 
+/* Return true if STMT is a lane-reducing statement.  */
+
+inline bool
+lane_reducing_stmt_p (gimple *stmt)
+{
+  if (auto *assign = dyn_cast <gassign *> (stmt))
+    return lane_reducing_op_p (gimple_assign_rhs_code (assign));
+  return false;
+}
+
 /* Source location + hotness information. */
 extern dump_user_location_t vect_location;
 
-- 
2.17.1