Re: [PATCH-1v2, rs6000] Enable SImode in FP registers on P7 [PR88558]

2023-09-18 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/9/14 16:35, HAO CHEN GUI wrote:
> Hi Kewen,
> 
> On 2023/9/12 17:33, Kewen.Lin wrote:
>> Ok, at least regression testing doesn't expose any needs to do disparaging
>> for this.  Could you also test this patch with SPEC2017 for P7 and P8
>> separately at options like -O2 or -O3, to see if there is any assembly
>> change, and if yes filtering out some typical to check it's expected or
>> not?  I think it can help us to better evaluate the impact.  Thanks!
> 
> Just compared the object files of SPEC2017 for P7 and P8. There is no
> difference between P7s'. For P8, some different object files are found.
> All differences are the same. Patched object files replace xxlor with fmr.
> It's expected, as fmr is added ahead of xxlor in "*movsi_internal1".

Thanks for checking!  So for P7, this patch looks neutral, but for P8 and
later, it may cause a few differences in code gen.  I'm curious how many
object files in total were checked and how many differed on P8?  The fmr
vs. xxlor preference can be further considered along with the existing:
https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612821.html
I also wonder if it's easy to reduce some of them further into small test cases.

Since xxlor is better than fmr at least on Power10, could you also evaluate
the affected benchmarks on P10 (even P8/P9) to ensure no performance degradation?

Thanks!

BR,
Kewen



[PATCH] rs6000: Skip empty inline asm in rs6000_update_ipa_fn_target_info [PR111366]

2023-09-18 Thread Kewen.Lin via Gcc-patches
Hi,

PR111366 exposes one thing that can be improved in function
rs6000_update_ipa_fn_target_info: it should skip an empty
inline asm string, since an empty asm cannot adopt any
hardware features (so far only HTM is tracked).

Since this rs6000_update_ipa_fn_target_info related approach
exists in GCC 12 and later, the affected project highway has
updated its target pragma with ",htm", see the link:
https://github.com/google/highway/commit/15e63d61eb535f478bc
I'd rather not consider an inline asm parser for now, but
will file a separate PR for further enhancement.
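The gist of the fix in the hunk below can be sketched as a plain C predicate (the mask value here is made up for illustration; the real bit is RS6000_FN_TARGET_INFO_HTM in the rs6000 backend):

```c
#include <string.h>
#include <stdbool.h>

/* Assumed value, for illustration only.  */
#define RS6000_FN_TARGET_INFO_HTM 1u

/* Sketch of the revised logic: only a non-empty inline asm string makes
   us conservatively assume the function may use HTM instructions; an
   empty asm("") no longer flips the bit.  */
static bool
update_info_for_asm (const char *asm_str, unsigned int *info)
{
  /* Ignore empty inline asm string.  */
  if (strlen (asm_str) > 0)
    *info |= RS6000_FN_TARGET_INFO_HTM;
  /* Mirrors the 'return false' in the hunk: no need to scan further.  */
  return false;
}
```

With this, a function whose body is only `asm("")` keeps target info 0 and no longer blocks inlining across target attributes.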

Bootstrapped and regtested on powerpc64-linux-gnu P7/P8/P9
and powerpc64le-linux-gnu P9 and P10.

I'm going to push this soon.

BR,
Kewen
-
PR target/111366

gcc/ChangeLog:

* config/rs6000/rs6000.cc (rs6000_update_ipa_fn_target_info): Skip
empty inline asm.

gcc/testsuite/ChangeLog:

* g++.target/powerpc/pr111366.C: New test.
---
 gcc/config/rs6000/rs6000.cc |  9 ++--
 gcc/testsuite/g++.target/powerpc/pr111366.C | 48 +
 2 files changed, 54 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/pr111366.C

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index d48134b35f8..40925407a99 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -25475,9 +25475,12 @@ rs6000_update_ipa_fn_target_info (unsigned int &info,
const gimple *stmt)
   /* Assume inline asm can use any instruction features.  */
   if (gimple_code (stmt) == GIMPLE_ASM)
 {
-  /* Should set any bits we concerned, for now OPTION_MASK_HTM is
-the only bit we care about.  */
-  info |= RS6000_FN_TARGET_INFO_HTM;
+  const char *asm_str = gimple_asm_string (as_a <const gasm *> (stmt));
+  /* Ignore empty inline asm string.  */
+  if (strlen (asm_str) > 0)
+   /* Should set any bits we concerned, for now OPTION_MASK_HTM is
+  the only bit we care about.  */
+   info |= RS6000_FN_TARGET_INFO_HTM;
   return false;
 }
   else if (gimple_code (stmt) == GIMPLE_CALL)
diff --git a/gcc/testsuite/g++.target/powerpc/pr111366.C 
b/gcc/testsuite/g++.target/powerpc/pr111366.C
new file mode 100644
index 000..6d3d8ebc552
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/pr111366.C
@@ -0,0 +1,48 @@
+/* { dg-do compile } */
+/* Use -Wno-attributes to suppress the possible warning on always_inline.  */
+/* { dg-options "-O2 -mdejagnu-cpu=power9 -Wno-attributes" } */
+
+/* Verify it doesn't emit any error messages.  */
+
+#include <stddef.h>
+#define HWY_PRAGMA(tokens) _Pragma (#tokens)
+#define HWY_PUSH_ATTRIBUTES(targets_str) HWY_PRAGMA (GCC target targets_str)
+__attribute__ ((always_inline)) void
+PreventElision ()
+{
+  asm("");
+}
+#define HWY_BEFORE_NAMESPACE() HWY_PUSH_ATTRIBUTES (",cpu=power10")
+HWY_BEFORE_NAMESPACE () namespace detail
+{
+  template  struct CappedTagChecker
+  {
+  };
+}
+template 
+using CappedTag = detail::CappedTagChecker;
+template  struct ForeachCappedR
+{
+  static void Do (size_t, size_t)
+  {
+CappedTag d;
+Test () (int(), d);
+  }
+};
+template  struct ForPartialVectors
+{
+  template  void operator() (T)
+  {
+ForeachCappedR::Do (1, 1);
+  }
+};
+struct TestFloorLog2
+{
+  template  void operator() (T, DF) { PreventElision (); }
+};
+void
+TestAllFloorLog2 ()
+{
+  ForPartialVectors () (float());
+}
+
--
2.31.1


[PATCH] rs6000: Use default target option node for callee by default [PR111380]

2023-09-18 Thread Kewen.Lin via Gcc-patches
Hi,

As PR111380 (and the discussion in related PRs) shows, the
way function rs6000_can_inline_p currently treats a callee
without any target option node is wrong.  It considers it
always safe to inline such a callee, but the callee's
target flags actually come from the command line options
(target_option_default_node), so it's possible that the
callee's flags don't satisfy the conditions for inlining;
it is inlined anyway, resulting in unexpected consequences.

As the associated test case pr111380-1.c shows, the caller
main is attributed with power8, while the callee foo is
compiled with power9 from the command line; main should not
inline foo, since foo can contain something that requires
power9 capability.  Without this patch, for LTO (with
-flto) we get an error message (as LTO forces the callee
to have a target option node), but for non-LTO the callee
is inlined unexpectedly.

This patch makes the callee adopt target_option_default_node
when it doesn't have a target option node; this avoids wrong
inlining decisions and fixes the inconsistency between LTO and
non-LTO.  It also aligns with what the other ports do.
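The subset rule this check relies on can be sketched in plain C (the flag bits are invented for illustration; the real code also masks out the fusion flags and HTM as the hunks show):

```c
#include <stdbool.h>

typedef unsigned long long isa_flags_t;

/* Invented flag bits, for illustration only.  */
#define ISA_FLAG_P8 0x1ULL
#define ISA_FLAG_P9 0x2ULL

/* The callee's ISA options must be a subset of the caller's options
   for inlining to be considered safe.  */
static bool
can_inline_p (isa_flags_t caller_isa, isa_flags_t callee_isa)
{
  return (caller_isa & callee_isa) == callee_isa;
}
```

With the fix, a callee without option attributes gets the command-line flags (e.g. power9), so a power8-attributed caller no longer passes this subset check for it.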

Bootstrapped and regtested on powerpc64-linux-gnu P7/P8/P9
and powerpc64le-linux-gnu P9 and P10.

I'm going to push this soon if no objections.

BR,
Kewen
-

PR target/111380

gcc/ChangeLog:

* config/rs6000/rs6000.cc (rs6000_can_inline_p): Adopt
target_option_default_node when the callee has no option
attributes, also simplify the existing code accordingly.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr111380-1.c: New test.
* gcc.target/powerpc/pr111380-2.c: New test.
---
 gcc/config/rs6000/rs6000.cc   | 65 +--
 gcc/testsuite/gcc.target/powerpc/pr111380-1.c | 20 ++
 gcc/testsuite/gcc.target/powerpc/pr111380-2.c | 20 ++
 3 files changed, 70 insertions(+), 35 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr111380-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr111380-2.c

diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index efe9adce1f8..d48134b35f8 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -25508,49 +25508,44 @@ rs6000_can_inline_p (tree caller, tree callee)
   tree caller_tree = DECL_FUNCTION_SPECIFIC_TARGET (caller);
   tree callee_tree = DECL_FUNCTION_SPECIFIC_TARGET (callee);

-  /* If the callee has no option attributes, then it is ok to inline.  */
+  /* If the caller/callee has option attributes, then use them.
+ Otherwise, use the command line options.  */
   if (!callee_tree)
-ret = true;
+callee_tree = target_option_default_node;
+  if (!caller_tree)
+caller_tree = target_option_default_node;

-  else
-{
-  HOST_WIDE_INT caller_isa;
-  struct cl_target_option *callee_opts = TREE_TARGET_OPTION (callee_tree);
-  HOST_WIDE_INT callee_isa = callee_opts->x_rs6000_isa_flags;
-  HOST_WIDE_INT explicit_isa = callee_opts->x_rs6000_isa_flags_explicit;
+  struct cl_target_option *callee_opts = TREE_TARGET_OPTION (callee_tree);
+  struct cl_target_option *caller_opts = TREE_TARGET_OPTION (caller_tree);

-  /* If the caller has option attributes, then use them.
-Otherwise, use the command line options.  */
-  if (caller_tree)
-   caller_isa = TREE_TARGET_OPTION (caller_tree)->x_rs6000_isa_flags;
-  else
-   caller_isa = rs6000_isa_flags;
+  HOST_WIDE_INT callee_isa = callee_opts->x_rs6000_isa_flags;
+  HOST_WIDE_INT caller_isa = caller_opts->x_rs6000_isa_flags;
+  HOST_WIDE_INT explicit_isa = callee_opts->x_rs6000_isa_flags_explicit;

-  cgraph_node *callee_node = cgraph_node::get (callee);
-  if (ipa_fn_summaries && ipa_fn_summaries->get (callee_node) != NULL)
+  cgraph_node *callee_node = cgraph_node::get (callee);
+  if (ipa_fn_summaries && ipa_fn_summaries->get (callee_node) != NULL)
+{
+  unsigned int info = ipa_fn_summaries->get (callee_node)->target_info;
+  if ((info & RS6000_FN_TARGET_INFO_HTM) == 0)
{
- unsigned int info = ipa_fn_summaries->get (callee_node)->target_info;
- if ((info & RS6000_FN_TARGET_INFO_HTM) == 0)
-   {
- callee_isa &= ~OPTION_MASK_HTM;
- explicit_isa &= ~OPTION_MASK_HTM;
-   }
+ callee_isa &= ~OPTION_MASK_HTM;
+ explicit_isa &= ~OPTION_MASK_HTM;
}
+}

-  /* Ignore -mpower8-fusion and -mpower10-fusion options for inlining
-purposes.  */
-  callee_isa &= ~(OPTION_MASK_P8_FUSION | OPTION_MASK_P10_FUSION);
-  explicit_isa &= ~(OPTION_MASK_P8_FUSION | OPTION_MASK_P10_FUSION);
+  /* Ignore -mpower8-fusion and -mpower10-fusion options for inlining
+ purposes.  */
+  callee_isa &= ~(OPTION_MASK_P8_FUSION | OPTION_MASK_P10_FUSION);
+  explicit_isa &= ~(OPTION_MASK_P8_FUSION | OPTION_MASK_P10_FUSION);

-  /* The callee's options must be a subset of the caller's options, i.e.
-a vsx function 

Re: [PATCH v1] rs6000: unnecessary clear after vctzlsbb in vec_first_match_or_eos_index

2023-09-13 Thread Kewen.Lin via Gcc-patches
Hi,

on 2023/9/13 00:39, Ajit Agarwal wrote:
> This patch removes zero extension from vctzlsbb as it already zero extends.
> Bootstrapped and regtested on powerpc64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> rs6000: unnecessary clear after vctzlsbb in vec_first_match_or_eos_index
> 
> For the rs6000 target we don't need a zero_extend after vctzlsbb, as
> vctzlsbb already zero extends.
> 
> 2023-09-12  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/vsx.md (vctzlsbb_zext_<mode>): New define_insn.
> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.target/powerpc/altivec-19.C: New testcase.
> ---
>  gcc/config/rs6000/vsx.md  | 17 ++---
>  gcc/testsuite/g++.target/powerpc/altivec-19.C | 10 ++
>  2 files changed, 24 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/powerpc/altivec-19.C
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 19abfeb565a..42379409e5f 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -5846,11 +5846,22 @@
>[(set_attr "type" "vecsimple")])
>  
>  ;; Vector Count Trailing Zero Least-Significant Bits Byte
> -(define_insn "vctzlsbb_<mode>"
> -  [(set (match_operand:SI 0 "register_operand" "=r")
> +(define_insn "vctzlsbb_zext_<mode>"

Sorry, I didn't note this in the previous review: this
define_insn name can be "*vctzlsbb_zext_<mode>", as we don't need
any gen_vctzlsbb_zext_* for uses.

btw, once the name changes, the ChangeLog should be updated
accordingly.

> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (zero_extend:DI
>   (unspec:SI
>[(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
> -  UNSPEC_VCTZLSBB))]
> +  UNSPEC_VCTZLSBB)))]
> +  "TARGET_P9_VECTOR"
> +  "vctzlsbb %0,%1"
> +  [(set_attr "type" "vecsimple")])
> +
> +;; Vector Count Trailing Zero Least-Significant Bits Byte
> +(define_insn "vctzlsbb_<mode>"
> +  [(set (match_operand:SI 0 "register_operand" "=r")
> +(unspec:SI
> + [(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
> + UNSPEC_VCTZLSBB))]
>"TARGET_P9_VECTOR"
>"vctzlsbb %0,%1"
>[(set_attr "type" "vecsimple")])
> diff --git a/gcc/testsuite/g++.target/powerpc/altivec-19.C 
> b/gcc/testsuite/g++.target/powerpc/altivec-19.C
> new file mode 100644
> index 000..e49e5076af8
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/powerpc/altivec-19.C
> @@ -0,0 +1,10 @@
> +/* { dg-do compile { target { powerpc*-*-* } } } */

As the previous review comment, this line can be:

/* { dg-do compile } */

Okay for trunk with these two nits above fixed.  Thanks.

BR,
Kewen

> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-mdejagnu-cpu=power9 -O2 " } */ 
> +
> +#include <altivec.h>
> +
> +unsigned int foo (vector unsigned char a, vector unsigned char b) {
> +  return vec_first_match_or_eos_index (a, b);
> +}
> +/* { dg-final { scan-assembler-not {\mrldicl\M} } } */


Re: [PATCH] rs6000: unnecessary clear after vctzlsbb in vec_first_match_or_eos_index

2023-09-12 Thread Kewen.Lin via Gcc-patches
Hi Ajit,

on 2023/8/31 18:44, Ajit Agarwal via Gcc-patches wrote:
> 
> This patch removes zero extension from vctzlsbb as it already zero extends.
> Bootstrapped and regtested on powerpc64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> rs6000: unnecessary clear after vctzlsbb in vec_first_match_or_eos_index
> 
> For the rs6000 target we don't need a zero_extend after vctzlsbb, as
> vctzlsbb already zero extends.
> 
> 2023-08-31  Ajit Kumar Agarwal  
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/vsx.md: Add new pattern.

Nit: we can offer the pattern name, such as:

* config/rs6000/vsx.md (vctzlsbb_zext_<mode>): New define_insn.

> 
> gcc/testsuite/ChangeLog:
> 
>   * g++.target/powerpc/altivec-19.C: New testcase.
> ---
>  gcc/config/rs6000/vsx.md  | 17 ++---
>  gcc/testsuite/g++.target/powerpc/altivec-19.C | 11 +++
>  2 files changed, 25 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/powerpc/altivec-19.C
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 19abfeb565a..09d21a6d00a 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -5846,11 +5846,22 @@
>[(set_attr "type" "vecsimple")])
>  
>  ;; Vector Count Trailing Zero Least-Significant Bits Byte
> -(define_insn "vctzlsbb_<mode>"
> -  [(set (match_operand:SI 0 "register_operand" "=r")
> +(define_insn "vctzlsbbzext_<mode>"

Nit: s/vctzlsbbzext_<mode>/vctzlsbb_zext_<mode>/, it reads better to separate
the mnemonic (vctzlsbb).

> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (zero_extend:DI
>   (unspec:SI
>[(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
> -  UNSPEC_VCTZLSBB))]
> +  UNSPEC_VCTZLSBB)))]
> +  "TARGET_P9_VECTOR"
> +  "vctzlsbb %0,%1"
> +  [(set_attr "type" "vecsimple")])
> +
> +;; Vector Count Trailing Zero Least-Significant Bits Byte
> +(define_insn "vctzlsbb_<mode>"
> +  [(set (match_operand:SI 0 "register_operand" "=r")
> +(unspec:SI
> + [(match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")]
> + UNSPEC_VCTZLSBB))]
>"TARGET_P9_VECTOR"
>"vctzlsbb %0,%1"
>[(set_attr "type" "vecsimple")])
> diff --git a/gcc/testsuite/g++.target/powerpc/altivec-19.C 
> b/gcc/testsuite/g++.target/powerpc/altivec-19.C
> new file mode 100644
> index 000..2d630b2fc1f
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/powerpc/altivec-19.C
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target { powerpc*-*-* } } } */

Nit: Can be simpler with /* { dg-do compile } */
as the target requirement is always satisfied in the powerpc
test suite.

> +/* { dg-require-effective-target lp64 } */

This line can be removed, as this case and its checking don't
require a 64-bit environment.

> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-mcpu=power9 -O2 " } */

Use -mdejagnu-cpu=power9 instead of -mcpu=power9; it always
takes effect even when mixed with some other -mcpu=* from the RUN FLAGS.

> +
> +#include <altivec.h>
> +
> +unsigned int foo (vector unsigned char a, vector unsigned char b) {
> +  return vec_first_match_or_eos_index (a, b);
> +}
> +/* { dg-final { scan-assembler-not "rldicl" } } */

Nit: Maybe with \m and \M like:

/* { dg-final { scan-assembler-not {\mrldicl\M} } } */
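A scalar model may help show why the zero extension is redundant (this is my reading of the instruction's one-line comment, with byte 0 taken as the least-significant element — treat the exact byte ordering as an assumption):

```c
#include <stdint.h>

/* Model of vctzlsbb: count trailing bytes (least-significant first)
   whose bit 0 is zero, stopping at the first byte with bit 0 set.
   The result is always in [0, 16], so the upper bits of the 32-bit
   output are already zero -- a following rldicl/zero_extend adds
   nothing, which is what the new zero_extend pattern encodes.  */
static unsigned int
vctzlsbb_model (const uint8_t v[16])
{
  unsigned int count = 0;
  for (int i = 0; i < 16; i++)
    {
      if (v[i] & 1)
        break;
      count++;
    }
  return count;
}
```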


BR,
Kewen


Re: [PATCH-2v2, rs6000] Implement 32bit inline lrint [PR88558]

2023-09-12 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/9/4 13:33, HAO CHEN GUI wrote:
> Hi,
>   This patch implements 32-bit inline lrint with "fctiw".  It depends on
> patch 1, which enables the SImode move from FP registers on P7.
> 
>   Compared to the last version, the main change is to add tests for "lrintf"
> and adjust the counts of the corresponding instructions.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628436.html
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: support 32bit inline lrint
> 
> gcc/
>   PR target/88558
>   * config/rs6000/rs6000.md (lrint<mode>di2): Remove TARGET_FPRND
>   from insn condition.
>   (lrint<mode>si2): New insn pattern for 32bit lrint.
> 
> gcc/testsuite/
>   PR target/106769
>   * gcc.target/powerpc/pr88558.h: New.
>   * gcc.target/powerpc/pr88558-p7.c: New.
>   * gcc.target/powerpc/pr88558-p8.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index edf49bd74e3..a41898e0e08 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -6655,10 +6655,18 @@ (define_insn "lrint<mode>di2"
>[(set (match_operand:DI 0 "gpc_reg_operand" "=d")
>   (unspec:DI [(match_operand:SFDF 1 "gpc_reg_operand" "")]
>  UNSPEC_FCTID))]
> -  "TARGET_HARD_FLOAT && TARGET_FPRND"
> +  "TARGET_HARD_FLOAT"
>"fctid %0,%1"
>[(set_attr "type" "fp")])
> 
> +(define_insn "lrint<mode>si2"
> +  [(set (match_operand:SI 0 "gpc_reg_operand" "=d")
> + (unspec:SI [(match_operand:SFDF 1 "gpc_reg_operand" "")]
> +UNSPEC_FCTIW))]
> +  "TARGET_HARD_FLOAT && TARGET_POPCNTD"
> +  "fctiw %0,%1"
> +  [(set_attr "type" "fp")])
> +
>  (define_insn "btrunc<mode>2"
>[(set (match_operand:SFDF 0 "gpc_reg_operand" "=d,wa")
>   (unspec:SFDF [(match_operand:SFDF 1 "gpc_reg_operand" "d,wa")]
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88558-p7.c 
> b/gcc/testsuite/gcc.target/powerpc/pr88558-p7.c
> new file mode 100644
> index 000..f302491c4d0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr88558-p7.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-math-errno -mdejagnu-cpu=power7" } */
> +
> +/* -fno-math-errno is required to make {i,l,ll}rint inlined */

Nit: Comment is a bit out of date since now we have irintf.

> +
> +#include "pr88558.h"
> +
> +/* { dg-final { scan-assembler-times {\mfctid\M} 3 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctid\M} 1 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 3 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mstfiwx\M} 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mstfiwx\M} 3 { target ilp32 } } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88558-p8.c 
> b/gcc/testsuite/gcc.target/powerpc/pr88558-p8.c
> new file mode 100644
> index 000..33398aa74c2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr88558-p8.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-O2 -fno-math-errno -mdejagnu-cpu=power8" } */
> +
> +/* -fno-math-errno is required to make {i,l,ll}rint inlined */

Ditto.

> +
> +#include "pr88558.h"
> +
> +/* { dg-final { scan-assembler-times {\mfctid\M} 3 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctid\M} 1 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 3 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mmfvsrwz\M} 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mmfvsrwz\M} 3 { target ilp32 } } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88558.h 
> b/gcc/testsuite/gcc.target/powerpc/pr88558.h
> new file mode 100644
> index 000..698640c0ef7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr88558.h
> @@ -0,0 +1,19 @@
> +long int test1 (double a)
> +{
> +  return __builtin_lrint (a);
> +}
> +
> +long long test2 (double a)
> +{
> +  return __builtin_llrint (a);
> +}
> +
> +int test3 (double a)
> +{
> +  return __builtin_irint (a);
> +}
> +
> +long int test4 (float a)
> +{
> +  return __builtin_lrintf (a);
> +}

As you added extra coverage for irint and llrint in addition to lrint,
I'd expect you to also add coverage for llrintf and irintf, to make
them consistent.

The others look good to me.  Thanks!

BR,
Kewen


Re: [PATCH-1v2, rs6000] Enable SImode in FP registers on P7 [PR88558]

2023-09-12 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/9/4 13:33, HAO CHEN GUI wrote:
> Hi,
>   This patch enables SImode in FP registers on P7.  The instruction "fctiw"
> stores its integer output in an FP register, so SImode in FP registers
> needs to be enabled on P7 if we want to support "fctiw" on P7.
> 
>   The test case is in the second patch which implements 32bit inline
> lrint.
> 
>   Compared to the last version, the main change is to remove the
> disparaging on the "fmr" alternatives.  Testing shows it doesn't cause regressions.

Ok, at least regression testing doesn't expose any needs to do disparaging
for this.  Could you also test this patch with SPEC2017 for P7 and P8
separately at options like -O2 or -O3, to see if there is any assembly
change, and if yes filtering out some typical to check it's expected or
not?  I think it can help us to better evaluate the impact.  Thanks!

BR,
Kewen

> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628435.html
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> 
> ChangeLog
> rs6000: enable SImode in FP register on P7
> 
> gcc/
>   PR target/88558
>   * config/rs6000/rs6000.cc (rs6000_hard_regno_mode_ok_uncached):
>   Enable SImode in FP registers on P7.
>   * config/rs6000/rs6000.md (*movsi_internal1): Add fmr for SImode
>   move between FP registers.  Set attribute isa of stfiwx to "*"
>   and attribute of stxsiwx to "p7".
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 44b448d2ba6..99085c2cdd7 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1903,7 +1903,7 @@ rs6000_hard_regno_mode_ok_uncached (int regno, 
> machine_mode mode)
> if(GET_MODE_SIZE (mode) == UNITS_PER_FP_WORD)
>   return 1;
> 
> -   if (TARGET_P8_VECTOR && (mode == SImode))
> +   if (TARGET_POPCNTD && mode == SImode)
>   return 1;
> 
> if (TARGET_P9_VECTOR && (mode == QImode || mode == HImode))
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index cdab49fbb91..edf49bd74e3 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -7566,7 +7566,7 @@ (define_split
> 
>  (define_insn "*movsi_internal1"
>[(set (match_operand:SI 0 "nonimmediate_operand"
> -   "=r, r,
> +   "=r, r,  d,
>  r,  d,  v,
>  m,  ?Z, ?Z,
>  r,  r,  r,  r,
> @@ -7575,7 +7575,7 @@ (define_insn "*movsi_internal1"
>  wa, r,
>  r,  *h, *h")
>   (match_operand:SI 1 "input_operand"
> -   "r,  U,
> +   "r,  U,  d,
>  m,  ?Z, ?Z,
>  r,  d,  v,
>  I,  L,  eI, n,
> @@ -7588,6 +7588,7 @@ (define_insn "*movsi_internal1"
>"@
> mr %0,%1
> la %0,%a1
> +   fmr %0,%1
> lwz%U1%X1 %0,%1
> lfiwzx %0,%y1
> lxsiwzx %x0,%y1
> @@ -7611,7 +7612,7 @@ (define_insn "*movsi_internal1"
> mt%0 %1
> nop"
>[(set_attr "type"
> -   "*,  *,
> +   "*,  *,  fpsimple,
>  load,   fpload, fpload,
>  store,  fpstore,fpstore,
>  *,  *,  *,  *,
> @@ -7620,7 +7621,7 @@ (define_insn "*movsi_internal1"
>  mtvsr,  mfvsr,
>  *,  *,  *")
> (set_attr "length"
> -   "*,  *,
> +   "*,  *,  *,
>  *,  *,  *,
>  *,  *,  *,
>  *,  *,  *,  8,
> @@ -7629,9 +7630,9 @@ (define_insn "*movsi_internal1"
>  *,  *,
>  *,  *,  *")
> (set_attr "isa"
> -   "*,  *,
> -*,  p8v,p8v,
> -*,  p8v,p8v,
> +   "*,  *,  *,
> +*,  p7, p8v,
> +*,  *,  p8v,
>  *,  *,  p10,*,
>  p8v,p9v,p9v,p8v,
>  p9v,p8v,p9v,
> 


Re: [PATCH] rs6000: Update instruction counts to match vec_* calls [PR111228]

2023-08-31 Thread Kewen.Lin via Gcc-patches
Hi Peter,

on 2023/8/31 06:42, Peter Bergner wrote:
> Commit r14-3258-ge7a36e4715c716 increased the amount of folding we perform,
> leading to better code.  Update the expected instruction counts to match
> the number of associated vec_* built-in calls.
> 
> Tested on powerpc64le-linux with no regressions.  Ok for mainline?

OK for trunk, thanks!

> 
> Peter
> 
> gcc/testsuite/
>   PR testsuite/111228
>   * gcc.target/powerpc/fold-vec-logical-ors-char.c: Update instruction
>   counts to match the number of associated vec_* built-in calls.
>   * gcc.target/powerpc/fold-vec-logical-ors-int.c: Likewise.
>   * gcc.target/powerpc/fold-vec-logical-ors-longlong.c: Likewise.
>   * gcc.target/powerpc/fold-vec-logical-ors-short.c: Likewise.
>   * gcc.target/powerpc/fold-vec-logical-other-char.c: Likewise.
>   * gcc.target/powerpc/fold-vec-logical-other-int.c: Likewise.
>   * gcc.target/powerpc/fold-vec-logical-other-longlong.c: Likewise.
>   * gcc.target/powerpc/fold-vec-logical-other-short.c: Likewise.
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-char.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-char.c
> index 713fed7824a..7406039d054 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-char.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-char.c
> @@ -120,6 +120,6 @@ test6_nor (vector unsigned char x, vector unsigned char y)
>return *foo;
>  }
>  
> -/* { dg-final { scan-assembler-times {\mxxlor\M} 7 } } */
> +/* { dg-final { scan-assembler-times {\mxxlor\M} 6 } } */
>  /* { dg-final { scan-assembler-times {\mxxlxor\M} 6 } } */
> -/* { dg-final { scan-assembler-times {\mxxlnor\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxxlnor\M} 2 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-int.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-int.c
> index 4d1c78f40ec..a7c6366b938 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-int.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-int.c
> @@ -119,6 +119,6 @@ test6_nor (vector unsigned int x, vector unsigned int y)
>return *foo;
>  }
>  
> -/* { dg-final { scan-assembler-times {\mxxlor\M} 7 } } */
> +/* { dg-final { scan-assembler-times {\mxxlor\M} 6 } } */
>  /* { dg-final { scan-assembler-times {\mxxlxor\M} 6 } } */
> -/* { dg-final { scan-assembler-times {\mxxlnor\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxxlnor\M} 2 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-longlong.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-longlong.c
> index 27ef09ada80..10c69d3d87b 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-longlong.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-longlong.c
> @@ -156,6 +156,6 @@ test6_nor (vector unsigned long long x, vector unsigned 
> long long y)
>  // For simplicity, this test now only targets "powerpc_p8vector_ok" 
> environments
>  // where the answer is expected to be 6.
>  
> -/* { dg-final { scan-assembler-times {\mxxlor\M} 9 } } */
> +/* { dg-final { scan-assembler-times {\mxxlor\M} 6 } } */
>  /* { dg-final { scan-assembler-times {\mxxlxor\M} 6 } } */
> -/* { dg-final { scan-assembler-times {\mxxlnor\M} 3 } } */
> +/* { dg-final { scan-assembler-times {\mxxlnor\M} 6 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-short.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-short.c
> index f796c5b33a9..8352a7f4dc5 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-short.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-ors-short.c
> @@ -119,6 +119,6 @@ test6_nor (vector unsigned short x, vector unsigned short 
> y)
>return *foo;
>  }
>  
> -/* { dg-final { scan-assembler-times {\mxxlor\M} 7 } } */
> +/* { dg-final { scan-assembler-times {\mxxlor\M} 6 } } */
>  /* { dg-final { scan-assembler-times {\mxxlxor\M} 6 } } */
> -/* { dg-final { scan-assembler-times {\mxxlnor\M} 1 } } */
> +/* { dg-final { scan-assembler-times {\mxxlnor\M} 2 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-other-char.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-other-char.c
> index e74308ccda2..7fe3e0b8e0e 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-other-char.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-logical-other-char.c
> @@ -104,5 +104,5 @@ test6_nand (vector unsigned char x, vector unsigned char 
> y)
>return *foo;
>  }
>  
> -/* { dg-final { scan-assembler-times {\mxxlnand\M} 3 } } */
> +/* { dg-final { scan-assembler-times {\mxxlnand\M} 6 } } */

At first glance, this seems to become worse, but looking into the
generated asm, I found it used to have xxland, which isn't scanned and
counted, and also a similar sub and splat, so it's actually improved,
similar to the xxlnor case above. :)

BR,
Kewen

>  /* { dg-final { 

Re: [PATCH, rs6000] Call vector load/store with length expand only on 64-bit Power10 [PR96762]

2023-08-31 Thread Kewen.Lin via Gcc-patches
on 2023/8/31 13:47, HAO CHEN GUI wrote:
> Kewen,
>   I refined the patch according to your comments and it passed bootstrap
> and regression test.
> 
>   I committed it as
> https://gcc.gnu.org/g:946b8967b905257ac9f140225db744c9a6ab91be

Thanks!  We want this to be backported, so it's also ok for backporting to all
affected branch releases after a week or so.

BR,
Kewen



Re: [PATCH, rs6000] Call vector load/store with length expand only on 64-bit Power10 [PR96762]

2023-08-29 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/8/29 10:50, HAO CHEN GUI wrote:
> Hi,
>   This patch adds a "TARGET_64BIT" check when calling the vector load/store
> with length expand in expand_block_move.  It matches the expand condition
> of "lxvl" and "stxvl" defined in vsx.md.
> 
>   This patch fixes the ICE that occurred with the test case on 32-bit Power10.
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: call vector load/store with length expand only on 64-bit Power10
> 
> gcc/
>   PR target/96762
>   * config/rs6000/rs6000-string.cc (expand_block_move): Call vector
>   load/store with length expand only on 64-bit Power10.
> 
> gcc/testsuite/
>   PR target/96762
>   * gcc.target/powerpc/pr96762.c: New.
> 
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-string.cc 
> b/gcc/config/rs6000/rs6000-string.cc
> index cd8ee8c..d1b48c2 100644
> --- a/gcc/config/rs6000/rs6000-string.cc
> +++ b/gcc/config/rs6000/rs6000-string.cc
> @@ -2811,8 +2811,9 @@ expand_block_move (rtx operands[], bool might_overlap)
> gen_func.mov = gen_vsx_movv2di_64bit;
>   }
>else if (TARGET_BLOCK_OPS_UNALIGNED_VSX
> -&& TARGET_POWER10 && bytes < 16
> -&& orig_bytes > 16
> +/* Only use lxvl/stxvl on 64bit POWER10.  */
> +&& TARGET_POWER10 && TARGET_64BIT
> +&& bytes < 16 && orig_bytes > 16
>  && !(bytes == 1 || bytes == 2
>   || bytes == 4 || bytes == 8)
>  && (align >= 128 || !STRICT_ALIGNMENT))

Nit: Since you touched this part of the code, could you also format it better,
like:

  else if (TARGET_BLOCK_OPS_UNALIGNED_VSX
   /* Only use lxvl/stxvl on 64bit POWER10.  */
   && TARGET_POWER10
   && TARGET_64BIT
   && bytes < 16
   && orig_bytes > 16
   && !(bytes == 1
|| bytes == 2
|| bytes == 4
|| bytes == 8)
   && (align >= 128
   || !STRICT_ALIGNMENT))
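The combined condition can also be modelled as a plain predicate (a sketch: it leaves out the TARGET_BLOCK_OPS_UNALIGNED_VSX check and replaces the target macros with booleans):

```c
#include <stdbool.h>

/* lxvl/stxvl handle the residual (sub-16-byte, non-power-of-two) tail
   of a copy larger than 16 bytes, and per the fix are only used on
   64-bit Power10.  */
static bool
use_lxvl_stxvl (bool power10, bool is_64bit, unsigned bytes,
                unsigned orig_bytes, unsigned align, bool strict_align)
{
  return power10
         && is_64bit
         && bytes < 16
         && orig_bytes > 16
         && !(bytes == 1 || bytes == 2 || bytes == 4 || bytes == 8)
         && (align >= 128 || !strict_align);
}
```

The PR96762 fix is exactly the `is_64bit` conjunct: on 32-bit Power10 this path is now skipped instead of ICEing in the lxvl/stxvl expanders.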


> diff --git a/gcc/testsuite/gcc.target/powerpc/pr96762.c 
> b/gcc/testsuite/gcc.target/powerpc/pr96762.c
> new file mode 100644
> index 000..1145dd1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr96762.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile { target ilp32 } } */

Nit: we can compile this on lp64, so you can remove the ilp32 restriction,
...

> +/* { dg-options "-O2 -mdejagnu-cpu=power10" } */
> +

... but add one comment line to note the initial purpose, like:

/* Verify there is no ICE on ilp32 env.  */

or similar.

Okay for trunk with these nits fixed, thanks!

BR,
Kewen

> +extern void foo (char *);
> +
> +void
> +bar (void)
> +{
> +  char zj[] = "";
> +  foo (zj);
> +}


Re: [PATCH ver 4] rs6000, add overloaded DFP quantize support

2023-08-29 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/8/29 04:00, Carl Love wrote:
> 
> GCC maintainers:
> 
> Version 4, additional define_insn name fix.  Change Log fix for the
> UNSPEC_DQUAN.  Retested patch on Power 10 LE.
> 
> Version 3, fixed the built-in instance names.  Missed removing the "n"
> in the name.  Added the tighter constraints on the predicates for the
> define_insn.  Updated the wording for the built-ins in the
> documentation file.  Changed the test file name again.  Updated the
> ChangeLog file, added the PR target line.  Retested the patch on Power
> 10LE and Power 8 and Power 9.
> 
> Version 2, renamed the built-in instances.  Changed the name of the
> overloaded built-in.  Added the missing documentation for the new
> built-ins.  Fixed typos.  Changed name of the test.  Updated the
> effective target for the test.  Retested the patch on Power 10LE and
> Power 8 and Power 9.
> 
> The following patch adds four built-ins for the decimal floating point
> (DFP) quantize instructions on rs6000.  The built-ins are for 64-bit
> and 128-bit DFP operands.
> 
> The patch also adds a test case for the new builtins.
> 
> The Patch has been tested on Power 10LE and Power 9 LE/BE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>  Carl Love
> 
> 
> 
> rs6000, add overloaded DFP quantize support
> 
> Add decimal floating point (DFP) quantize built-ins for both 64-bit DFP
> and 128-bit DFP operands.  In each case, there is an immediate version and a
> variable version of the built-in.  The RM value is a 2-bit constant int
> which specifies the rounding mode to use.  For the immediate versions of
> the built-in, the TE field is a 5-bit constant that specifies the value of
> the ideal exponent for the result.  The built-in specifications are:
> 
>   __Decimal64 builtin_dfp_quantize (_Decimal64, _Decimal64,
>   const int RM)
>   __Decimal64 builtin_dfp_quantize (const int TE, _Decimal64,
>   const int RM)
>   __Decimal128 builtin_dfp_quantize (_Decimal128, _Decimal128,
>const int RM)
>   __Decimal128 builtin_dfp_quantize (const int TE, _Decimal128,
>const int RM)
> 
> A testcase is added for the new built-in definitions.
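As an aside for readers: usage would look something like the sketch below (illustrative only, not part of the patch; it assumes a DFP-enabled Power target and that RMC value 0 means round to nearest, ties to even):

```c
/* Quantize V to the exponent of REF, e.g. align a value to a
   two-fractional-digit reference such as 1.00DD.  */
_Decimal64
quantize_like (_Decimal64 ref, _Decimal64 v)
{
  return __builtin_dfp_quantize (ref, v, 0);
}

/* Immediate form: TE = -2 asks for an ideal exponent of -2,
   i.e. two fractional digits.  */
_Decimal64
round_to_cents (_Decimal64 v)
{
  return __builtin_dfp_quantize (-2, v, 0);
}
```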
> 
> gcc/ChangeLog:
>   * config/rs6000/dfp.md (UNSPEC_DQUAN): New unspec.
>   (dfp_dqua_<mode>, dfp_dquai_<mode>): New define_insn.
>   * config/rs6000/rs6000-builtins.def (__builtin_dfp_dqua,
>   __builtin_dfp_dquai, __builtin_dfp_dquaq, __builtin_dfp_dquaqi):
>   New built-in definitions.
>   * config/rs6000/rs6000-overload.def (__builtin_dfp_quantize): New
>   overloaded definition.
>   * doc/extend.texi: Add documentation for __builtin_dfp_quantize.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/pr93448.c: New test case.
> 
>   PR target/93448
> ---
>  gcc/config/rs6000/dfp.md   |  25 ++-
>  gcc/config/rs6000/rs6000-builtins.def  |  15 ++
>  gcc/config/rs6000/rs6000-overload.def  |  10 ++
>  gcc/doc/extend.texi|  17 ++
>  gcc/testsuite/gcc.target/powerpc/pr93448.c | 200 +
>  5 files changed, 266 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr93448.c
> 
> diff --git a/gcc/config/rs6000/dfp.md b/gcc/config/rs6000/dfp.md
> index 5ed8a73ac51..bf4a227b0eb 100644
> --- a/gcc/config/rs6000/dfp.md
> +++ b/gcc/config/rs6000/dfp.md
> @@ -271,7 +271,8 @@ (define_c_enum "unspec"
> UNSPEC_DIEX
> UNSPEC_DSCLI
> UNSPEC_DTSTSFI
> -   UNSPEC_DSCRI])
> +   UNSPEC_DSCRI
> +   UNSPEC_DQUAN])
>  
>  (define_code_iterator DFP_TEST [eq lt gt unordered])
>  
> @@ -395,3 +396,25 @@ (define_insn "dfp_dscri_<mode>"
>"dscri<q> %0,%1,%2"
>[(set_attr "type" "dfp")
> (set_attr "size" "<bits>")])
> +
> +(define_insn "dfp_dqua_<mode>"
> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:DDTD 1 "gpc_reg_operand" "d")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +   (match_operand:SI 3 "const_0_to_3_operand" "n")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dqua<q> %0,%1,%2,%3"
> +  [(set_attr "type" "dfp")
> +   (set_attr "size" "<bits>")])
> +
> +(define_insn "dfp_dquai_<mode>"
> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:SI 1 "s5bit_cint_operand" "n")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +   (match_operand:SI 3 "const_0_to_3_operand" "n")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dquai<q> %1,%0,%2,%3"
> +  [(set_attr "type" "dfp")
> +   (set_attr "size" "<bits>")])
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index 8a294d6c934..ce40600e803 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -2983,6 

Re: [PATCH-2, rs6000] Implement 32bit inline lrint [PR88558]

2023-08-28 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/8/25 14:44, HAO CHEN GUI wrote:
> Hi,
>   This patch implements 32bit inline lrint via "fctiw".  It depends on
> patch 1 to do SImode moves from FP registers on P7.
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: support 32bit inline lrint
> 
> gcc/
>   PR target/88558
>   * config/rs6000/rs6000.md (lrint<mode>di2): Remove TARGET_FPRND
>   from insn condition.
>   (lrint<mode>si2): New insn pattern for 32bit lrint.
> 
> gcc/testsuite/
>   PR target/106769
>   * gcc.target/powerpc/pr88558.h: New.
>   * gcc.target/powerpc/pr88558-p7.c: New.
>   * gcc.target/powerpc/pr88558-p8v.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index fd263e8dfe3..b36304de8c6 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -6655,10 +6655,18 @@ (define_insn "lrint<mode>di2"
>[(set (match_operand:DI 0 "gpc_reg_operand" "=d")
>   (unspec:DI [(match_operand:SFDF 1 "gpc_reg_operand" "<rreg2>")]
>  UNSPEC_FCTID))]
> -  "TARGET_HARD_FLOAT && TARGET_FPRND"
> +  "TARGET_HARD_FLOAT"
>"fctid %0,%1"
>[(set_attr "type" "fp")])
> 
> +(define_insn "lrint<mode>si2"
> +  [(set (match_operand:SI 0 "gpc_reg_operand" "=d")
> + (unspec:SI [(match_operand:SFDF 1 "gpc_reg_operand" "<rreg2>")]
> +UNSPEC_FCTIW))]

It surprises me that we have UNSPEC_FCTIW but it's unused before. :)

> +  "TARGET_HARD_FLOAT && TARGET_POPCNTD"
> +  "fctiw %0,%1"
> +  [(set_attr "type" "fp")])
> +
>  (define_insn "btrunc<mode>2"
>[(set (match_operand:SFDF 0 "gpc_reg_operand" "=d,wa")
>   (unspec:SFDF [(match_operand:SFDF 1 "gpc_reg_operand" "d,wa")]
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88558-p7.c 
> b/gcc/testsuite/gcc.target/powerpc/pr88558-p7.c
> new file mode 100644
> index 000..6437c55fa61
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr88558-p7.c
> @@ -0,0 +1,10 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fno-math-errno -mdejagnu-cpu=power7" } */

Nit: Maybe add one comment for why -fno-math-errno is needed,
such as: "-fno-math-errno is required to make {i,l,ll}rint inlined".

> +
> +#include "pr88558.h"
> +
> +/* { dg-final { scan-assembler-times {\mfctid\M} 2 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctid\M} 1 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 2 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mstfiwx\M} 1 } } */

Shouldn't we also expect different counts for stfiwx on lp64 and ilp32?
1 for lp64 and 2 for ilp32, no?
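i.e., something like (a sketch, untested):

```
/* { dg-final { scan-assembler-times {\mstfiwx\M} 1 { target lp64 } } } */
/* { dg-final { scan-assembler-times {\mstfiwx\M} 2 { target ilp32 } } } */
```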

> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88558-p8v.c 
> b/gcc/testsuite/gcc.target/powerpc/pr88558-p8v.c
> new file mode 100644
> index 000..fd22123ffb6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr88558-p8v.c

Nit: Maybe just name this with "-p8.c" instead of "-p8v.c".

> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-O2 -fno-math-errno -mdejagnu-cpu=power8" } */
> +
> +long int foo (double a)
> +{
> +  return __builtin_lrint (a);
> +}
> +
> +long long bar (double a)
> +{
> +  return __builtin_llrint (a);
> +}
> +
> +int baz (double a)
> +{
> +  return __builtin_irint (a);
> +}

I think you want to use #include "pr88558.h" here, wrong revision?

> +
> +/* { dg-final { scan-assembler-times {\mfctid\M} 2 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctid\M} 1 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times {\mfctiw\M} 2 { target ilp32 } } } */
> +/* { dg-final { scan-assembler-times {\mmfvsrwz\M} 1 } } */

Similar question on mfvsrwz counts (to the above stfiwx).

> diff --git a/gcc/testsuite/gcc.target/powerpc/pr88558.h 
> b/gcc/testsuite/gcc.target/powerpc/pr88558.h
> new file mode 100644
> index 000..0cc0c68dd4e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr88558.h
> @@ -0,0 +1,14 @@
> +long int foo (double a)
> +{
> +  return __builtin_lrint (a);
> +}
> +
> +long long bar (double a)
> +{
> +  return __builtin_llrint (a);
> +}
> +
> +int baz (double a)
> +{
> +  return __builtin_irint (a);
> +}
> 

The PR also mentioned lrintf, I think we can also add some cases for
the coverage?
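For example, float variants mirroring pr88558.h could look like this (just a sketch; the function names are placeholders, and the scan-assembler counts would need adjusting accordingly):

```c
/* Float counterparts, expanded inline with -fno-math-errno.  */
long int foo_f (float a)
{
  return __builtin_lrintf (a);
}

long long bar_f (float a)
{
  return __builtin_llrintf (a);
}

int baz_f (float a)
{
  return __builtin_irintf (a);
}
```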

BR,
Kewen


Re: [PATCH-1, rs6000] Enable SImode in FP register on P7 [PR88558]

2023-08-28 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/8/25 14:44, HAO CHEN GUI wrote:
> Hi,
>   This patch enables SImode in FP register on P7. Instruction "fctiw"
> stores its integer output in an FP register. So SImode in FP register
> needs to be enabled on P7 if we want to support "fctiw" on P7.
> 

It sounds reasonable to support SImode in fpr with lfiwzx and stfiwx
supports, I'd like to hear from Segher/David/Mike/Peter on what they
think of this.


>   The test case is in the second patch which implements 32bit inline
> lrint.
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: enable SImode in FP register on P7
> 
> gcc/
>   PR target/88558
>   * config/rs6000/rs6000.cc (rs6000_hard_regno_mode_ok_uncached):
>   Enable Simode in FP register for P7.

Nit: s/Simode/SImode/

>   * config/rs6000/rs6000.md (*movsi_internal1): Add fmr for SImode
>   move between FP register.  Set attribute isa of stfiwx to "*"

... "between FP register", s/register/registers/.

>   and attribute of lfiwzx to "p7".
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 44b448d2ba6..99085c2cdd7 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -1903,7 +1903,7 @@ rs6000_hard_regno_mode_ok_uncached (int regno, 
> machine_mode mode)
> if(GET_MODE_SIZE (mode) == UNITS_PER_FP_WORD)
>   return 1;
> 
> -   if (TARGET_P8_VECTOR && (mode == SImode))
> +   if (TARGET_POPCNTD && mode == SImode)
>   return 1;
> 
> if (TARGET_P9_VECTOR && (mode == QImode || mode == HImode))
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index cdab49fbb91..ac5d29a2cf8 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -7566,7 +7566,7 @@ (define_split
> 
>  (define_insn "*movsi_internal1"
>[(set (match_operand:SI 0 "nonimmediate_operand"
> -   "=r, r,
> +   "=r, r,  ^d,

Some justification is needed for why we need this disparaging; I guess
you want it to prefer xxlor over fmr when the former is supported?  There
is a related discussion on fmr vs. xxlor, the original thread is:
https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612821.html

>  r,  d,  v,
>  m,  ?Z, ?Z,
>  r,  r,  r,  r,
> @@ -7575,7 +7575,7 @@ (define_insn "*movsi_internal1"
>  wa, r,
>  r,  *h, *h")
>   (match_operand:SI 1 "input_operand"
> -   "r,  U,
> +   "r,  U,  ^d,

This seems to have the effect of doubly disparaging this alternative;
it also needs a justification for why one time isn't enough.

BR,
Kewen

>  m,  ?Z, ?Z,
>  r,  d,  v,
>  I,  L,  eI, n,
> @@ -7588,6 +7588,7 @@ (define_insn "*movsi_internal1"
>"@
> mr %0,%1
> la %0,%a1
> +   fmr %0,%1
> lwz%U1%X1 %0,%1
> lfiwzx %0,%y1
> lxsiwzx %x0,%y1
> @@ -7611,7 +7612,7 @@ (define_insn "*movsi_internal1"
> mt%0 %1
> nop"
>[(set_attr "type"
> -   "*,  *,
> +   "*,  *,  fpsimple,
>  load,   fpload, fpload,
>  store,  fpstore,fpstore,
>  *,  *,  *,  *,
> @@ -7620,7 +7621,7 @@ (define_insn "*movsi_internal1"
>  mtvsr,  mfvsr,
>  *,  *,  *")
> (set_attr "length"
> -   "*,  *,
> +   "*,  *,  *,
>  *,  *,  *,
>  *,  *,  *,
>  *,  *,  *,  8,
> @@ -7629,9 +7630,9 @@ (define_insn "*movsi_internal1"
>  *,  *,
>  *,  *,  *")
> (set_attr "isa"
> -   "*,  *,
> -*,  p8v,p8v,
> -*,  p8v,p8v,
> +   "*,  *,  *,
> +*,  p7, p8v,
> +*,  *,  p8v,
>  *,  *,  p10,*,
>  p8v,p9v,p9v,p8v,
>  p9v,p8v,p9v,




Re: [PATCH ver 3] rs6000, add overloaded DFP quantize support

2023-08-27 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/8/25 03:53, Carl Love wrote:
> GCC maintainers:
> 
> Version 3, fixed the built-in instance names.  Missed removing the "n"
> in the name.  Added the tighter constraints on the predicates for the
> define_insn.  Updated the wording for the built-ins in the
> documentation file.  Changed the test file name again.  Updated the
> ChangeLog file, added the PR target line.  Retested the patch on Power
> 10LE and Power 8 and Power 9.
> 
> Version 2, renamed the built-in instances.  Changed the name of the
> overloaded built-in.  Added the missing documentation for the new
> built-ins.  Fixed typos.  Changed name of the test.  Updated the
> effective target for the test.  Retested the patch on Power 10LE and
> Power 8 and Power 9.
> 
> The following patch adds four built-ins for the decimal floating point
> (DFP) quantize instructions on rs6000.  The built-ins are for 64-bit
> and 128-bit DFP operands.
> 
> The patch also adds a test case for the new builtins.
> 
> The Patch has been tested on Power 10LE and Power 9 LE/BE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>  Carl Love
> 
> 
> ---
> rs6000, add overloaded DFP quantize support
> 
> Add decimal floating point (DFP) quantize built-ins for both 64-bit DFP
> and 128-bit DFP operands.  In each case, there is an immediate version and a
> variable version of the built-in.  The RM value is a 2-bit constant int
> which specifies the rounding mode to use.  For the immediate versions of
> the built-in, the TE field is a 5-bit constant that specifies the value of
> the ideal exponent for the result.  The built-in specifications are:
> 
>   __Decimal64 builtin_dfp_quantize (_Decimal64, _Decimal64,
>   const int RM)
>   __Decimal64 builtin_dfp_quantize (const int TE, _Decimal64,
>   const int RM)
>   __Decimal128 builtin_dfp_quantize (_Decimal128, _Decimal128,
>const int RM)
>   __Decimal128 builtin_dfp_quantize (const int TE, _Decimal128,
>const int RM)
> 
> A testcase is added for the new built-in definitions.
> 
> gcc/ChangeLog:
>   * config/rs6000/dfp.md: New UNSPEC_DQUAN.

Nit: (UNSPEC_DQUAN): New unspec.

>   (dfp_dqua_<mode>, dfp_dqua_i<mode>): New define_insn.
>   * config/rs6000/rs6000-builtins.def (__builtin_dfp_dqua,
>   __builtin_dfp_dquai, __builtin_dfp_dquaq, __builtin_dfp_dquaqi):
>   New built-in definitions.
>   * config/rs6000/rs6000-overload.def (__builtin_dfp_quantize): New
>   overloaded definition.
>   * doc/extend.texi: Add documentation for __builtin_dfp_quantize.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/pr93448.c: New test case.
> 
>   PR target/93448
> ---
>  gcc/config/rs6000/dfp.md   |  25 ++-
>  gcc/config/rs6000/rs6000-builtins.def  |  15 ++
>  gcc/config/rs6000/rs6000-overload.def  |  10 ++
>  gcc/doc/extend.texi|  17 ++
>  gcc/testsuite/gcc.target/powerpc/pr93448.c | 200 +
>  5 files changed, 266 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr93448.c
> 
> diff --git a/gcc/config/rs6000/dfp.md b/gcc/config/rs6000/dfp.md
> index 5ed8a73ac51..052dc0946d3 100644
> --- a/gcc/config/rs6000/dfp.md
> +++ b/gcc/config/rs6000/dfp.md
> @@ -271,7 +271,8 @@ (define_c_enum "unspec"
> UNSPEC_DIEX
> UNSPEC_DSCLI
> UNSPEC_DTSTSFI
> -   UNSPEC_DSCRI])
> +   UNSPEC_DSCRI
> +   UNSPEC_DQUAN])
>  
>  (define_code_iterator DFP_TEST [eq lt gt unordered])
>  
> @@ -395,3 +396,25 @@ (define_insn "dfp_dscri_<mode>"
>"dscri<q> %0,%1,%2"
>[(set_attr "type" "dfp")
> (set_attr "size" "<bits>")])
> +
> +(define_insn "dfp_dqua_<mode>"
> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:DDTD 1 "gpc_reg_operand" "d")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +   (match_operand:SI 3 "const_0_to_3_operand" "n")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dqua<q> %0,%1,%2,%3"
> +  [(set_attr "type" "dfp")
> +   (set_attr "size" "<bits>")])
> +
> +(define_insn "dfp_dqua_i<mode>"

Sorry for nitpicking, but what I suggested previously was "dfp_dquai_<mode>"
instead of "dfp_dqua_i<mode>"; "dquai" matches the corresponding mnemonic so it
reads better, while "i<mode>" expands to "idd" and "itd", which look odd to me.
Do you agree "dquai" is better?  If yes, the changelog and the related
patterns need to be updated as well.

The others look good to me, thanks!

BR,
Kewen

> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:SI 1 "s5bit_cint_operand" "n")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +   (match_operand:SI 3 "const_0_to_3_operand" "n")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dquai<q> %1,%0,%2,%3"
> +  [(set_attr "type" "dfp")
> 

Re: [PATCH] rs6000: Disable PCREL for unsupported targets [PR111045]

2023-08-27 Thread Kewen.Lin via Gcc-patches
on 2023/8/26 06:04, Peter Bergner wrote:
> On 8/25/23 6:20 AM, Kewen.Lin wrote:
>> Assuming the current PCREL_SUPPORTED_BY_OS unchanged, when
>> PCREL_SUPPORTED_BY_OS is true, all its required conditions are
>> satisfied, it should be safe.  while PCREL_SUPPORTED_BY_OS is
>> false, it means the given subtarget doesn't support it, or one
>> or more of conditions below don't hold:
>>
>>  - TARGET_POWER10 
>>  - TARGET_PREFIXED
>>  - ELFv2_ABI_CHECK
>>  - TARGET_CMODEL == CMODEL_MEDIUM
> 
> The pcrel instructions are 64-bit/prefix instructions, so I think
> for PCREL_SUPPORTED_BY_OS, we want to keep the TARGET_POWER10 and
> the TARGET_PREFIXED checks.  Then we should have the checks for
> the OS that can support pcrel, in this case, only ELFv2_ABI_CHECK
> for now.  I think we should remove the TARGET_CMODEL check, since
> that isn't strictly required, it's a current code requirement for
> ELFv2, but could change in the future.  In fact, Mike has talked
> about adding pcrel support for ELFv2 and -mcmodel=large, so I think
> that should move moved out of PCREL_SUPPORTED_BY_OS and into the
> option override checks.

Thanks for clarifying this!  Yes, I know the pcrel support requires
TARGET_PREFIXED (and its required TARGET_POWER10), but IMHO these
TARGET_PREFIXED and TARGET_POWER10 are common for any subtargets
which support or will support pcrel, they can be checked in common
code, instead of being a part of PCREL_SUPPORTED_BY_OS.

We already have the code to check pcrel's dependency on prefixed and
prefixed's dependency on Power10; we can just move those checks after
the PCREL_SUPPORTED_BY_OS check.

Assuming we only have ELFv2_ABI_CHECK in PCREL_SUPPORTED_BY_OS, we
can have either TARGET_PCREL or !TARGET_PCREL after the checking.
The latter is fine and doesn't need any checks.  For the former,
if it's implicit, we clear it silently for !TARGET_PREFIXED,
while if it's explicit, we emit an error for !TARGET_PREFIXED.
The TARGET_PREFIXED check already considers Power10, so that is
covered accordingly.

> 
> 
> 
[snip ...]
> 
> 
>> btw, I was also expecting that we don't implicitly set
>> OPTION_MASK_PCREL any more for Power10, that is to remove
>> OPTION_MASK_PCREL from OTHER_POWER10_MASKS.
> 
> I'm on the fence about this one and would like to hear from Segher
> and Mike on what they think.  In some respect, pcrel is a Power10
> hardware feature, so that would seem to make sense to set the flag,
> but yeah, we only have one OS that currently supports it, so maybe
> not setting it makes sense?  Like I said, I think I need Segher and
> Mike to chime in with their thoughts.

Yeah, looking forward to their opinions.  IMHO, with the currently proposed
change, pcrel doesn't look like a pure Power10 hardware feature; it also
relies quite a bit on the ABI, which is why I thought it seems good not to
turn it on by default for Power10.

BR,
Kewen


Re: [PATCH] rs6000: Disable PCREL for unsupported targets [PR111045]

2023-08-25 Thread Kewen.Lin via Gcc-patches
on 2023/8/25 11:20, Peter Bergner wrote:
> On 8/24/23 12:56 AM, Kewen.Lin wrote:
>> By looking into the uses of function rs6000_pcrel_p, I think we can
>> just replace it with TARGET_PCREL.  Previously we don't require PCREL
>> unset for any unsupported target/OS, so we need rs6000_pcrel_p() to
>> ensure it's really supported in those use places, now if we can guarantee
>> TARGET_PCREL only hold for all places where it's supported, it looks
>> we can just check TARGET_PCREL only?
> 
> I think you're correct on only needing TARGET_PCREL instead of 
> rs6000_pcrel_p()
> as long as we correctly disable PCREL on the targets that either don't support
> it or those that due, but don't due to explicit options (ie, -mno-prel or
> ELFv2 and -mcmodel != medium, etc.).
> 
> 
> 
>> Then the code structure can look like:
>>
>> if (PCREL_SUPPORTED_BY_OS
>> && (rs6000_isa_flags_explicit & OPTION_MASK_PCREL) == 0)
>>// enable
>> else if (TARGET_PCREL && DEFAULT_ABI != ABI_ELFv2)
>>// disable
>> else if (TARGET_PCREL && TARGET_CMODEL != CMODEL_MEDIUM)
>>// disable
> 
> I don't like that first conditional expression. The problem is,
> PCREL_SUPPORTED_BY_OS could be true or false for the following
> tests, because it's anded with the explicit option test, and
> sometimes that won't make sense.  I think something safer is
> something like:

Assuming the current PCREL_SUPPORTED_BY_OS unchanged, when
PCREL_SUPPORTED_BY_OS is true, all its required conditions are
satisfied, it should be safe.  While PCREL_SUPPORTED_BY_OS is
false, it means the given subtarget doesn't support it, or one
or more of the conditions below don't hold:

 - TARGET_POWER10 
 - TARGET_PREFIXED
 - ELFv2_ABI_CHECK
 - TARGET_CMODEL == CMODEL_MEDIUM

The two "else if" arms check the last two (the ABI test can also
catch unsupported targets), and we already have code that checks
TARGET_PCREL without TARGET_PREFIXED support, and further Power10,
so it seems safe?

> 
> if (PCREL_SUPPORTED_BY_OS)
>   { 
> /* PCREL on ELFv2 requires -mcmodel=medium.  */
> if (DEFAULT_ABI == ABI_ELFv2 && TARGET_CMODEL != CMODEL_MEDIUM)
>   error ("%qs requires %qs", "-mpcrel", "-mcmodel=medium");

I guess your proposal drops CMODEL_MEDIUM (even PREFIXED?) from
PCREL_SUPPORTED_BY_OS, leaving just ABI_ELFv2 and P10?  (Otherwise
TARGET_CMODEL would always be CMODEL_MEDIUM for ABI_ELFv2 with the
original PCREL_SUPPORTED_BY_OS.)  And it further makes the
subtarget-specific checking depend on the ABI type?

I was expecting that when new subtargets need to be supported, we
can move these subtarget specific checkings into macro/function
SUB*TARGET_OVERRIDE_OPTIONS, IMHO the upside is each subtarget
can take care of its own checkings separately.  Maybe we can
just move them now (to rs6000_linux64_override_options).  And in
the common code, for the cases with zero value PCREL_SUPPORTED_BY_OS,
(assuming PCREL_SUPPORTED_BY_OS just simple as like ABI_ELFv2), we
can emit an invalid error for explicit -mpcrel as you proposed
below.  Thoughts?
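Concretely, the split could look roughly like this (a hand-written sketch against rs6000.cc, not tested; the exact function placement is an assumption):

```c
/* Subtarget side, e.g. in rs6000_linux64_override_options: ELFv2
   requires -mcmodel=medium for PC-relative addressing.  */
if (TARGET_PCREL
    && DEFAULT_ABI == ABI_ELFv2
    && TARGET_CMODEL != CMODEL_MEDIUM)
  {
    if ((rs6000_isa_flags_explicit & OPTION_MASK_PCREL) != 0)
      error ("%qs requires %qs", "-mpcrel", "-mcmodel=medium");
    rs6000_isa_flags &= ~OPTION_MASK_PCREL;
  }

/* Common code: reject or clear -mpcrel on targets that never
   support it.  */
if (!PCREL_SUPPORTED_BY_OS && TARGET_PCREL)
  {
    if ((rs6000_isa_flags_explicit & OPTION_MASK_PCREL) != 0)
      error ("use of %qs is invalid for this target", "-mpcrel");
    rs6000_isa_flags &= ~OPTION_MASK_PCREL;
  }
```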

btw, I was also expecting that we don't implicitly set
OPTION_MASK_PCREL any more for Power10, that is to remove
OPTION_MASK_PCREL from OTHER_POWER10_MASKS.

> 
> if ((rs6000_isa_flags_explicit & OPTION_MASK_PCREL) == 0)
>   rs6000_isa_flags |= OPTION_MASK_PCREL;
>   } 
> else
>   {
> if ((rs6000_isa_flags_explicit & OPTION_MASK_PCREL) != 0
> && TARGET_PCREL)
>   error ("use of %qs is invalid for this target", "-mpcrel");

I like this, it's more clear!

> rs6000_isa_flags &= ~OPTION_MASK_PCREL;
>   }
> 
> ...although, even the cmodel != medium test above should probably have
> some extra tests to ensure TARGET_PCREL is true (and explicit?) and
> mcmodel != medium before giving an error???  Ie, a ELFv2 P10 compile
> with an implicit -mpcrel and explicit -mcmodel={small,large} probably
> should not error and just silently disable pcrel???  Meaning only
> when we explicitly say -mpcrel -mcmodel={small,large} should we give
> that error.  Thoughts on that?

Yeah, I think it makes more sense to error only for an explicit -mpcrel
with -mcmodel={small,large}.  linux64 sets the cmodel to medium by default;
I guess that's why the current code doesn't check whether the cmodel was
explicitly specified, as !medium implies it was changed explicitly.  So if
we move these checks to the subtarget code, they seem to have good locality
(context)?

BR,
Kewen


Re: [PATCH] rs6000: Disable PCREL for unsupported targets [PR111045]

2023-08-23 Thread Kewen.Lin via Gcc-patches
Hi Peter,

on 2023/8/24 10:07, Peter Bergner wrote:
> On 8/21/23 8:51 PM, Kewen.Lin wrote:
>>> The following patch has been bootstrapped and regtested on powerpc64-linux.
>>
>> I think we should test this on powerpc64le-linux P8 or P9 (no P10) as well.
> 
> That's a good idea!
> 
> 
> 
>> I think this should be moved to be with the hunk on PCREL:
>>
>>   /* If the ABI has support for PC-relative relocations, enable it by 
>> default.
>>  This test depends on the sub-target tests above setting the code model 
>> to
>>  medium for ELF v2 systems.  */
>>   if (PCREL_SUPPORTED_BY_OS
>>   && (rs6000_isa_flags_explicit & OPTION_MASK_PCREL) == 0)
>> rs6000_isa_flags |= OPTION_MASK_PCREL;
>>
>>   /* -mpcrel requires -mcmodel=medium, but we can't check TARGET_CMODEL until
>>   after the subtarget override options are done.  */
>>   else if (TARGET_PCREL && TARGET_CMODEL != CMODEL_MEDIUM)
>> {
>>   if ((rs6000_isa_flags_explicit & OPTION_MASK_PCREL) != 0)
>>  error ("%qs requires %qs", "-mpcrel", "-mcmodel=medium");
>>
>>   rs6000_isa_flags &= ~OPTION_MASK_PCREL;
>> }
> 
> Agreed on the location, but...
> 
> Looking at this closer, I don't think I'm happy with the current code.
> We seem to have duplicated tests for whether the target supports pcrel
> or not in both PCREL_SUPPORTED_BY_OS and rs6000_pcrel_p().  That means
> if another target were to add support for pcrel, they'd have to update
> multiple locations of the code, and that seems error prone.
> 

Good point!  I also noticed this, but I wasn't sure what future
supported target/OS combinations would look like, or what would be
fundamentally common, so I thought maybe we could just leave it for now.
I've changed my mind now and agree we can do more.

> I think we should standardize our tests for whether the target/OS
> supports pcrel (irrespective of the -mpcrel or -mcmodel=medium options)
> and have that in PCREL_SUPPORTED_BY_OS.  Ie, a one-stop-shop for testing
> whether the current target/OS can support pcrel.  Then we should modify
> rs6000_pcrel_p() use PCREL_SUPPORTED_BY_OS rather than its own
> semi-duplicated target/OS tests, plus any other tests for options that
> might disqualify the current target/OS from supporting pcrel, when it
> normally can (ie, -mmodel != medium for ELFv2).

By looking into the uses of function rs6000_pcrel_p, I think we can
just replace it with TARGET_PCREL.  Previously we didn't require PCREL
to be unset for any unsupported target/OS, so we needed rs6000_pcrel_p()
to ensure it's really supported at those use sites; now, if we can
guarantee TARGET_PCREL only holds where it's supported, it looks like
we can just check TARGET_PCREL?
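At the use sites that would then be a mechanical change, e.g. (sketch):

```
-  if (rs6000_pcrel_p ())
+  if (TARGET_PCREL)
```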

Then the code structure can look like:

if (PCREL_SUPPORTED_BY_OS
&& (rs6000_isa_flags_explicit & OPTION_MASK_PCREL) == 0)
   // enable
else if (TARGET_PCREL && DEFAULT_ABI != ABI_ELFv2)
   // disable
else if (TARGET_PCREL && TARGET_CMODEL != CMODEL_MEDIUM)
   // disable

Here, the ABI_ELFv2 and CMODEL_MEDIUM checks are specific to the
currently supported target/OS; in the future, when we have a new
supported target/OS, this part can be factored out into a
subtarget-specific checking function/macro.

Does it make sense?

BR,
Kewen

> 
> I think then, that should allow simplifying the code in
> rs6000_option_override_internal.
> 
> Thoughts?
> 
> 
> Peter
> 
> 



Re: [PATCH 1/3] vect: Remove some manual release in vectorizable_store

2023-08-22 Thread Kewen.Lin via Gcc-patches
on 2023/8/22 20:32, Richard Biener wrote:
> On Tue, Aug 22, 2023 at 10:45 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> To avoid some duplicates in some follow-up patches on
>> function vectorizable_store, this patch is to adjust some
>> existing vec with auto_vec and remove some manual release
>> invocation.  Also refactor a bit and remove some useless
>> codes.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>
>> Is it ok for trunk?
> 
> OK.

Thanks Richi, pushed this as r14-3402, the other two as r14-3403
and r14-3404.

BR,
Kewen



Re: [PATCH] vect: Replace DR_GROUP_STORE_COUNT with DR_GROUP_LAST_ELEMENT

2023-08-22 Thread Kewen.Lin via Gcc-patches
Hi Richi,

on 2023/8/22 20:17, Richard Biener wrote:
> On Tue, Aug 22, 2023 at 10:44 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> Now we use DR_GROUP_STORE_COUNT to record how many stores
>> in a group have been transformed and only do the actual
>> transform when encountering the last one.  I'm making
>> patches to move costing next to the transform code, it's
>> awkward to use this DR_GROUP_STORE_COUNT for both costing
>> and transforming.  This patch is to introduce last_element
>> to record the last element to be transformed in the group
>> rather than to sum up the store number we have seen, then
>> we can only check the given stmt is the last or not.  It
>> can make it work simply for both costing and transforming.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>
>> Is it ok for trunk?
> 
> This is all (existing) gross, so ... can't we do sth like the following
> instead?  Going to test this further besides the quick single
> testcase I verified.

I just realized that dealing with this in vect_transform_stmt is super
neat as you questioned and posted, thanks a lot for pushing commit
r14-3383-g2c27600fa79431 for this!

BR,
Kewen


> 
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 33f62b77710..67de19d9ce5 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8437,16 +8437,6 @@ vectorizable_store (vec_info *vinfo,
>/* FORNOW */
>gcc_assert (!loop || !nested_in_vect_loop_p (loop, stmt_info));
> 
> -  /* We vectorize all the stmts of the interleaving group when we
> -reach the last stmt in the group.  */
> -  if (DR_GROUP_STORE_COUNT (first_stmt_info)
> - < DR_GROUP_SIZE (first_stmt_info)
> - && !slp)
> -   {
> - *vec_stmt = NULL;
> - return true;
> -   }
> -
>if (slp)
>  {
>grouped_store = false;
> @@ -12487,21 +12477,21 @@ vect_transform_stmt (vec_info *vinfo,
>break;
> 
>  case store_vec_info_type:
> -  done = vectorizable_store (vinfo, stmt_info,
> -gsi, _stmt, slp_node, NULL);
> -  gcc_assert (done);
> -  if (STMT_VINFO_GROUPED_ACCESS (stmt_info) && !slp_node)
> +  if (STMT_VINFO_GROUPED_ACCESS (stmt_info)
> + && !slp_node
> + && DR_GROUP_NEXT_ELEMENT (stmt_info))
> +   /* In case of interleaving, the whole chain is vectorized when the
> +  last store in the chain is reached.  Store stmts before the last
> +  one are skipped, and there vec_stmt_info shouldn't be freed
> +  meanwhile.  */
> +   ;
> +  else
> {
> - /* In case of interleaving, the whole chain is vectorized when the
> -last store in the chain is reached.  Store stmts before the last
> -one are skipped, and there vec_stmt_info shouldn't be freed
> -meanwhile.  */
> - stmt_vec_info group_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
> - if (DR_GROUP_STORE_COUNT (group_info) == DR_GROUP_SIZE (group_info))
> -   is_store = true;
> + done = vectorizable_store (vinfo, stmt_info,
> +gsi, _stmt, slp_node, NULL);
> + gcc_assert (done);
> + is_store = true;
> }
> -  else
> -   is_store = true;
>break;
> 
>  case condition_vec_info_type:
> 
> 
>> BR,
>> Kewen
>> -
>>
>> gcc/ChangeLog:
>>
>> * tree-vect-data-refs.cc (vect_set_group_last_element): New function.
>> (vect_analyze_group_access): Call new function
>> vect_set_group_last_element.
>> * tree-vect-stmts.cc (vectorizable_store): Replace DR_GROUP_STORE_COUNT
>> uses with DR_GROUP_LAST_ELEMENT.
>> (vect_transform_stmt): Likewise.
>> * tree-vect-slp.cc (vect_split_slp_store_group): Likewise.
>> (vect_build_slp_instance): Likewise.
>> * tree-vectorizer.h (DR_GROUP_LAST_ELEMENT): New macro.
>> (DR_GROUP_STORE_COUNT): Remove.
>> (class _stmt_vec_info::store_count): Remove.
>> (class _stmt_vec_info::last_element): New class member.
>> (vect_set_group_last_element): New function declaration.
>> ---
>>  gcc/tree-vect-data-refs.cc | 30 ++
>>  gcc/tree-vect-slp.cc   | 13 +
>>  gcc/tree-vect-stmts.cc |  9 +++--
>>  gcc/tree-vectorizer.h  | 12 +++-
>>  4 files changed, 49 insertions(+), 15 deletions(-)
>>
>> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
>> index 3e9a284666c..c4a495431d5 100644
>> --- a/gcc/tree-vect-data-refs.cc
>> +++ b/gcc/tree-vect-data-refs.cc
>> @@ -2832,6 +2832,33 @@ vect_analyze_group_access_1 (vec_info *vinfo, dr_vec_info *dr_info)
>>return true;
>>  }
>>
>> +/* Given vectorization information VINFO, set the last element in the
>> +   group led by FIRST_STMT_INFO.  For now, it's only used for loop
>> +   

[PATCH 3/3] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-22 Thread Kewen.Lin via Gcc-patches
Hi,

Like r14-3317, which moves the handling of memory access
type VMAT_GATHER_SCATTER out of the final loop nest in
vectorizable_load, this one deals with the vectorizable_store side.
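The overall shape of the change can be sketched with a self-contained toy (the enum and function names below are illustrative stand-ins, not GCC internals): a per-iteration dispatch inside one shared loop nest becomes a dedicated loop per access type, with identical observable behavior.

```cpp
#include <cassert>

enum memory_access_type { VMAT_GATHER_SCATTER, VMAT_CONTIGUOUS };

// Before: one loop over all copies, dispatching on the access type in
// every iteration, so unrelated handling is interleaved in one nest.
int transform_before (memory_access_type t, int ncopies)
{
  int work = 0;
  for (int j = 0; j < ncopies; j++)
    {
      if (t == VMAT_GATHER_SCATTER)
	work += 2;	// scatter-specific handling
      else
	work += 1;	// generic handling
    }
  return work;
}

// After: the gather/scatter case gets its own loop and an early return,
// so the final nest only deals with the remaining access types.
int transform_after (memory_access_type t, int ncopies)
{
  if (t == VMAT_GATHER_SCATTER)
    {
      int work = 0;
      for (int j = 0; j < ncopies; j++)
	work += 2;	// scatter-specific handling only
      return work;
    }
  int work = 0;
  for (int j = 0; j < ncopies; j++)
    work += 1;		// generic handling only
  return work;
}
```

Both shapes compute the same result; the payoff is that each loop body stays small enough to reason about (and, later, to cost) on its own.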

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_store): Move the handlings on
VMAT_GATHER_SCATTER in the final loop nest to its own loop,
and update the final nest accordingly.
---
 gcc/tree-vect-stmts.cc | 258 +
 1 file changed, 159 insertions(+), 99 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 18f5ebcc09c..b959c1861ad 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8930,44 +8930,23 @@ vectorizable_store (vec_info *vinfo,
   return true;
 }

-  auto_vec result_chain (group_size);
-  auto_vec vec_offsets;
-  auto_vec vec_oprnds;
-  for (j = 0; j < ncopies; j++)
+  if (memory_access_type == VMAT_GATHER_SCATTER)
 {
-  gimple *new_stmt;
-  if (j == 0)
+  gcc_assert (!slp && !grouped_store);
+  auto_vec vec_offsets;
+  for (j = 0; j < ncopies; j++)
{
- if (slp)
-   {
- /* Get vectorized arguments for SLP_NODE.  */
- vect_get_vec_defs (vinfo, stmt_info, slp_node, 1, op,
-_oprnds);
- vec_oprnd = vec_oprnds[0];
-   }
- else
+ gimple *new_stmt;
+ if (j == 0)
{
- /* For interleaved stores we collect vectorized defs for all the
-stores in the group in DR_CHAIN. DR_CHAIN is then used as an
-input to vect_permute_store_chain().
-
-If the store is not grouped, DR_GROUP_SIZE is 1, and DR_CHAIN
-is of size 1.  */
- stmt_vec_info next_stmt_info = first_stmt_info;
- for (i = 0; i < group_size; i++)
-   {
- /* Since gaps are not supported for interleaved stores,
-DR_GROUP_SIZE is the exact number of stmts in the chain.
-Therefore, NEXT_STMT_INFO can't be NULL_TREE.  In case
-that there is no interleaving, DR_GROUP_SIZE is 1,
-and only one iteration of the loop will be executed.  */
- op = vect_get_store_rhs (next_stmt_info);
- vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
-op, gvec_oprnds[i]);
- vec_oprnd = (*gvec_oprnds[i])[0];
- dr_chain.quick_push (vec_oprnd);
- next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
-   }
+ /* Since the store is not grouped, DR_GROUP_SIZE is 1, and
+DR_CHAIN is of size 1.  */
+ gcc_assert (group_size == 1);
+ op = vect_get_store_rhs (first_stmt_info);
+ vect_get_vec_defs_for_operand (vinfo, first_stmt_info, ncopies,
+op, gvec_oprnds[0]);
+ vec_oprnd = (*gvec_oprnds[0])[0];
+ dr_chain.quick_push (vec_oprnd);
  if (mask)
{
  vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
@@ -8975,91 +8954,55 @@ vectorizable_store (vec_info *vinfo,
 mask_vectype);
  vec_mask = vec_masks[0];
}
-   }

- /* We should have catched mismatched types earlier.  */
- gcc_assert (useless_type_conversion_p (vectype,
-TREE_TYPE (vec_oprnd)));
- bool simd_lane_access_p
-   = STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) != 0;
- if (simd_lane_access_p
- && !loop_masks
- && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
- && VAR_P (TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0))
- && integer_zerop (get_dr_vinfo_offset (vinfo, first_dr_info))
- && integer_zerop (DR_INIT (first_dr_info->dr))
- && alias_sets_conflict_p (get_alias_set (aggr_type),
-   get_alias_set (TREE_TYPE (ref_type
-   {
- dataref_ptr = unshare_expr (DR_BASE_ADDRESS (first_dr_info->dr));
- dataref_offset = build_int_cst (ref_type, 0);
+ /* We should have catched mismatched types earlier.  */
+ gcc_assert (useless_type_conversion_p (vectype,
+TREE_TYPE (vec_oprnd)));
+ if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
+slp_node, _info, _ptr,
+

[PATCH 2/3] vect: Move VMAT_LOAD_STORE_LANES handlings from final loop nest

2023-08-22 Thread Kewen.Lin via Gcc-patches
Hi,

Like commit r14-3214, which moves the handling of memory
access type VMAT_LOAD_STORE_LANES out of the final loop
nest in vectorizable_load, this one deals with the function
vectorizable_store.
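One side benefit of giving each memory-access case its own loop is that the case's invariants can be asserted once, up front, as the patch does with gcc_assert (!slp && grouped_store).  A self-contained sketch of that idea (names are illustrative, not GCC's):

```cpp
#include <cassert>

// With a dedicated path for VMAT_LOAD_STORE_LANES-style stores, the
// preconditions that previously held only implicitly inside the shared
// nest are checked once at the top; the loop body then stays branch-free.
int store_lanes_path (bool slp, bool grouped_store, int ncopies)
{
  assert (!slp && grouped_store);	// case invariant, checked once
  int emitted = 0;
  for (int j = 0; j < ncopies; j++)
    emitted++;				// lane-store handling only
  return emitted;
}
```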

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_store): Move the handlings on
VMAT_LOAD_STORE_LANES in the final loop nest to its own loop,
and update the final nest accordingly.
---
 gcc/tree-vect-stmts.cc | 732 ++---
 1 file changed, 387 insertions(+), 345 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fcaa4127e52..18f5ebcc09c 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8779,42 +8779,29 @@ vectorizable_store (vec_info *vinfo,
   */

   auto_vec dr_chain (group_size);
-  auto_vec result_chain (group_size);
   auto_vec vec_masks;
   tree vec_mask = NULL;
-  auto_vec vec_offsets;
   auto_delete_vec> gvec_oprnds (group_size);
   for (i = 0; i < group_size; i++)
 gvec_oprnds.quick_push (new auto_vec (ncopies));
-  auto_vec vec_oprnds;
-  for (j = 0; j < ncopies; j++)
+
+  if (memory_access_type == VMAT_LOAD_STORE_LANES)
 {
-  gimple *new_stmt;
-  if (j == 0)
+  gcc_assert (!slp && grouped_store);
+  for (j = 0; j < ncopies; j++)
{
-  if (slp)
-{
- /* Get vectorized arguments for SLP_NODE.  */
- vect_get_vec_defs (vinfo, stmt_info, slp_node, 1,
-op, _oprnds);
-  vec_oprnd = vec_oprnds[0];
-}
-  else
-{
- /* For interleaved stores we collect vectorized defs for all the
-stores in the group in DR_CHAIN. DR_CHAIN is then used as an
-input to vect_permute_store_chain().
-
-If the store is not grouped, DR_GROUP_SIZE is 1, and DR_CHAIN
-is of size 1.  */
+ gimple *new_stmt;
+ if (j == 0)
+   {
+ /* For interleaved stores we collect vectorized defs for all
+the stores in the group in DR_CHAIN. DR_CHAIN is then used
+as an input to vect_permute_store_chain().  */
  stmt_vec_info next_stmt_info = first_stmt_info;
  for (i = 0; i < group_size; i++)
{
  /* Since gaps are not supported for interleaved stores,
-DR_GROUP_SIZE is the exact number of stmts in the chain.
-Therefore, NEXT_STMT_INFO can't be NULL_TREE.  In case
-that there is no interleaving, DR_GROUP_SIZE is 1,
-and only one iteration of the loop will be executed.  */
+DR_GROUP_SIZE is the exact number of stmts in the
+chain. Therefore, NEXT_STMT_INFO can't be NULL_TREE.  */
  op = vect_get_store_rhs (next_stmt_info);
  vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
 op, gvec_oprnds[i]);
@@ -8825,66 +8812,37 @@ vectorizable_store (vec_info *vinfo,
  if (mask)
{
  vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-mask, _masks, mask_vectype);
+mask, _masks,
+mask_vectype);
  vec_mask = vec_masks[0];
}
-   }

- /* We should have catched mismatched types earlier.  */
- gcc_assert (useless_type_conversion_p (vectype,
-TREE_TYPE (vec_oprnd)));
- bool simd_lane_access_p
-   = STMT_VINFO_SIMD_LANE_ACCESS_P (stmt_info) != 0;
- if (simd_lane_access_p
- && !loop_masks
- && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
- && VAR_P (TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0))
- && integer_zerop (get_dr_vinfo_offset (vinfo, first_dr_info))
- && integer_zerop (DR_INIT (first_dr_info->dr))
- && alias_sets_conflict_p (get_alias_set (aggr_type),
-   get_alias_set (TREE_TYPE (ref_type
-   {
- dataref_ptr = unshare_expr (DR_BASE_ADDRESS (first_dr_info->dr));
- dataref_offset = build_int_cst (ref_type, 0);
+ /* We should have catched mismatched types earlier.  */
+ gcc_assert (
+   useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
+ dataref_ptr
+   = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
+   NULL, offset, , gsi,
+   _incr, 

[PATCH 1/3] vect: Remove some manual release in vectorizable_store

2023-08-22 Thread Kewen.Lin via Gcc-patches
Hi,

To avoid duplication in some follow-up patches on function
vectorizable_store, this patch adjusts some existing vec
uses to auto_vec and removes the manual release
invocations.  It also refactors a bit and removes some
useless code.
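For readers less familiar with these helpers: auto_vec releases its storage when it goes out of scope, and auto_delete_vec additionally deletes the elements it points to.  A rough standard-library analogy (plain C++, not GCC's actual vec API):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Manual style: the owner must remember to free on every exit path,
// which is what the removed release () calls were doing.
int manual_style ()
{
  std::vector<int> *v = new std::vector<int> {1, 2, 3};
  int sum = 0;
  for (int x : *v)
    sum += x;
  delete v;		// easy to miss on early returns
  return sum;
}

// RAII style (analogous to auto_vec): storage is freed automatically
// when the object leaves scope, so no manual release is needed.
int raii_style ()
{
  std::vector<int> v {1, 2, 3};
  int sum = 0;
  for (int x : v)
    sum += x;
  return sum;
}

// Analogous to auto_delete_vec: the container also owns the objects
// its elements point to, freeing both levels automatically.
int owning_elements ()
{
  std::vector<std::unique_ptr<std::vector<int>>> g;
  g.push_back (std::make_unique<std::vector<int>> ());
  g[0]->push_back (42);
  return (*g[0])[0];
}
```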

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_store): Remove vec oprnds,
adjust vec result_chain, vec_oprnd with auto_vec, and adjust
gvec_oprnds with auto_delete_vec.
---
 gcc/tree-vect-stmts.cc | 64 +++---
 1 file changed, 23 insertions(+), 41 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 1580a396301..fcaa4127e52 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8200,9 +8200,6 @@ vectorizable_store (vec_info *vinfo,
   stmt_vec_info first_stmt_info;
   bool grouped_store;
   unsigned int group_size, i;
-  vec oprnds = vNULL;
-  vec result_chain = vNULL;
-  vec vec_oprnds = vNULL;
   bool slp = (slp_node != NULL);
   unsigned int vec_num;
   bb_vec_info bb_vinfo = dyn_cast  (vinfo);
@@ -8601,6 +8598,7 @@ vectorizable_store (vec_info *vinfo,

   alias_off = build_int_cst (ref_type, 0);
   stmt_vec_info next_stmt_info = first_stmt_info;
+  auto_vec vec_oprnds (ncopies);
   for (g = 0; g < group_size; g++)
{
  running_off = offvar;
@@ -8682,7 +8680,7 @@ vectorizable_store (vec_info *vinfo,
}
}
  next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
- vec_oprnds.release ();
+ vec_oprnds.truncate(0);
  if (slp)
break;
}
@@ -8690,9 +8688,6 @@ vectorizable_store (vec_info *vinfo,
   return true;
 }

-  auto_vec dr_chain (group_size);
-  oprnds.create (group_size);
-
   gcc_assert (alignment_support_scheme);
   vec_loop_masks *loop_masks
 = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
@@ -8783,11 +8778,15 @@ vectorizable_store (vec_info *vinfo,
  STMT_VINFO_RELATED_STMT for the next copies.
   */

+  auto_vec dr_chain (group_size);
+  auto_vec result_chain (group_size);
   auto_vec vec_masks;
   tree vec_mask = NULL;
   auto_vec vec_offsets;
-  auto_vec > gvec_oprnds;
-  gvec_oprnds.safe_grow_cleared (group_size, true);
+  auto_delete_vec> gvec_oprnds (group_size);
+  for (i = 0; i < group_size; i++)
+gvec_oprnds.quick_push (new auto_vec (ncopies));
+  auto_vec vec_oprnds;
   for (j = 0; j < ncopies; j++)
 {
   gimple *new_stmt;
@@ -8803,11 +8802,11 @@ vectorizable_store (vec_info *vinfo,
   else
 {
  /* For interleaved stores we collect vectorized defs for all the
-stores in the group in DR_CHAIN and OPRNDS. DR_CHAIN is then
-used as an input to vect_permute_store_chain().
+stores in the group in DR_CHAIN. DR_CHAIN is then used as an
+input to vect_permute_store_chain().

 If the store is not grouped, DR_GROUP_SIZE is 1, and DR_CHAIN
-and OPRNDS are of size 1.  */
+is of size 1.  */
  stmt_vec_info next_stmt_info = first_stmt_info;
  for (i = 0; i < group_size; i++)
{
@@ -8817,11 +8816,10 @@ vectorizable_store (vec_info *vinfo,
 that there is no interleaving, DR_GROUP_SIZE is 1,
 and only one iteration of the loop will be executed.  */
  op = vect_get_store_rhs (next_stmt_info);
- vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
-ncopies, op, _oprnds[i]);
- vec_oprnd = gvec_oprnds[i][0];
- dr_chain.quick_push (gvec_oprnds[i][0]);
- oprnds.quick_push (gvec_oprnds[i][0]);
+ vect_get_vec_defs_for_operand (vinfo, next_stmt_info, ncopies,
+op, gvec_oprnds[i]);
+ vec_oprnd = (*gvec_oprnds[i])[0];
+ dr_chain.quick_push (vec_oprnd);
  next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
}
  if (mask)
@@ -8863,16 +8861,13 @@ vectorizable_store (vec_info *vinfo,
   else
{
  gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
- /* For interleaved stores we created vectorized defs for all the
-defs stored in OPRNDS in the previous iteration (previous copy).
-DR_CHAIN is then used as an input to vect_permute_store_chain().
-If the store is not grouped, DR_GROUP_SIZE is 1, and DR_CHAIN and
-OPRNDS are of size 1.  */
+ /* DR_CHAIN is then used as an input to vect_permute_store_chain().
+If the store is not grouped, DR_GROUP_SIZE is 1, and DR_CHAIN is
+of size 1.  */
  for (i = 0; 

[PATCH] vect: Replace DR_GROUP_STORE_COUNT with DR_GROUP_LAST_ELEMENT

2023-08-22 Thread Kewen.Lin via Gcc-patches
Hi,

Now we use DR_GROUP_STORE_COUNT to record how many stores
in a group have been transformed, and only do the actual
transform when encountering the last one.  I'm making
patches to move costing next to the transform code, and
it's awkward to use DR_GROUP_STORE_COUNT for both costing
and transforming.  This patch introduces last_element to
record the last element to be transformed in the group,
rather than summing up the number of stores we have seen;
then we only need to check whether the given stmt is the
last one.  That makes it work simply for both costing and
transforming.
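The difference between the two bookkeeping styles can be sketched in isolation (illustrative code, not the GCC implementation): a mutable counter must be threaded through whichever phase walks the group, while a precomputed last element supports a stateless test that costing and transforming can both use.

```cpp
#include <cassert>
#include <vector>

// Counter style: each visit mutates per-group state and fires when the
// count reaches the group size -- two phases would need two counters.
bool is_last_counted (unsigned &count, unsigned group_size)
{
  return ++count == group_size;
}

// Last-element style: compute the last statement (here, the largest
// uid) once, then any phase can ask a stateless question.
unsigned find_last (const std::vector<unsigned> &uids)
{
  unsigned last = uids[0];
  for (unsigned uid : uids)
    if (uid > last)
      last = uid;
  return last;
}

bool is_last (unsigned uid, unsigned last)
{
  return uid == last;
}
```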

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-

gcc/ChangeLog:

* tree-vect-data-refs.cc (vect_set_group_last_element): New function.
(vect_analyze_group_access): Call new function
vect_set_group_last_element.
* tree-vect-stmts.cc (vectorizable_store): Replace DR_GROUP_STORE_COUNT
uses with DR_GROUP_LAST_ELEMENT.
(vect_transform_stmt): Likewise.
* tree-vect-slp.cc (vect_split_slp_store_group): Likewise.
(vect_build_slp_instance): Likewise.
* tree-vectorizer.h (DR_GROUP_LAST_ELEMENT): New macro.
(DR_GROUP_STORE_COUNT): Remove.
(class _stmt_vec_info::store_count): Remove.
(class _stmt_vec_info::last_element): New class member.
(vect_set_group_last_element): New function declaration.
---
 gcc/tree-vect-data-refs.cc | 30 ++
 gcc/tree-vect-slp.cc   | 13 +
 gcc/tree-vect-stmts.cc |  9 +++--
 gcc/tree-vectorizer.h  | 12 +++-
 4 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 3e9a284666c..c4a495431d5 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -2832,6 +2832,33 @@ vect_analyze_group_access_1 (vec_info *vinfo, dr_vec_info *dr_info)
   return true;
 }

+/* Given vectorization information VINFO, set the last element in the
+   group led by FIRST_STMT_INFO.  For now, it's only used for loop
+   vectorization and stores, since for loop-vect the grouped stores
+   are only transformed till encountering the last one.  */
+
+void
+vect_set_group_last_element (vec_info *vinfo, stmt_vec_info first_stmt_info)
+{
+  if (first_stmt_info
+  && is_a (vinfo)
+  && DR_IS_WRITE (STMT_VINFO_DATA_REF (first_stmt_info)))
+{
+  stmt_vec_info stmt_info = DR_GROUP_NEXT_ELEMENT (first_stmt_info);
+  stmt_vec_info last_stmt_info = first_stmt_info;
+  while (stmt_info)
+   {
+ gimple *stmt = stmt_info->stmt;
+ gimple *last_stmt = last_stmt_info->stmt;
+ gcc_assert (gimple_bb (stmt) == gimple_bb (last_stmt));
+ if (gimple_uid (stmt) > gimple_uid (last_stmt))
+   last_stmt_info = stmt_info;
+ stmt_info = DR_GROUP_NEXT_ELEMENT (stmt_info);
+   }
+  DR_GROUP_LAST_ELEMENT (first_stmt_info) = last_stmt_info;
+}
+}
+
 /* Analyze groups of accesses: check that DR_INFO belongs to a group of
accesses of legal size, step, etc.  Detect gaps, single element
interleaving, and other special cases. Set grouped access info.
@@ -2853,6 +2880,9 @@ vect_analyze_group_access (vec_info *vinfo, dr_vec_info *dr_info)
}
   return false;
 }
+
+  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (dr_info->stmt);
+  vect_set_group_last_element (vinfo, first_stmt_info);
   return true;
 }

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 89c3216afac..e9b64efe125 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2827,7 +2827,8 @@ vect_find_first_scalar_stmt_in_slp (slp_tree node)
Return the first stmt in the second group.  */

 static stmt_vec_info
-vect_split_slp_store_group (stmt_vec_info first_vinfo, unsigned group1_size)
+vect_split_slp_store_group (vec_info *vinfo, stmt_vec_info first_vinfo,
+   unsigned group1_size)
 {
   gcc_assert (DR_GROUP_FIRST_ELEMENT (first_vinfo) == first_vinfo);
   gcc_assert (group1_size > 0);
@@ -2860,6 +2861,9 @@ vect_split_slp_store_group (stmt_vec_info first_vinfo, unsigned group1_size)
   /* DR_GROUP_GAP of the first group now has to skip over the second group too.  */
   DR_GROUP_GAP (first_vinfo) += group2_size;

+  vect_set_group_last_element (vinfo, first_vinfo);
+  vect_set_group_last_element (vinfo, group2);
+
   if (dump_enabled_p ())
 dump_printf_loc (MSG_NOTE, vect_location, "Split group into %d and %d\n",
 group1_size, group2_size);
@@ -3321,7 +3325,7 @@ vect_build_slp_instance (vec_info *vinfo,
  if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
 "Splitting SLP group at stmt %u\n", i);
- stmt_vec_info rest = vect_split_slp_store_group (stmt_info,
+ stmt_vec_info rest = vect_split_slp_store_group 

Re: [PATCH] rs6000: Disable PCREL for unsupported targets [PR111045]

2023-08-21 Thread Kewen.Lin via Gcc-patches
Hi Jeevitha,

on 2023/8/21 18:32, jeevitha wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64-linux.

I think we should test this on powerpc64le-linux P8 or P9 (no P10) as well.

> 
> It is currently possible to incorrectly enable PCREL for targets that do not
> officially support it. Disable PCREL for targets that do not support it.
> 
> 2023-08-21  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/111045
>   * config/rs6000/rs6000.cc (rs6000_option_override_internal): Disable PCREL
>   for unsupported targets.
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index efe9adc..4838f8c 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4232,6 +4232,9 @@ rs6000_option_override_internal (bool global_init_p)
>rs6000_isa_flags &= ~OPTION_MASK_FLOAT128_HW;
>  }
>  
> +  if (!rs6000_pcrel_p())

Nit: Only do this check when TARGET_PCREL, and add one space before "()",
that is "...pcrel_p ()".

I think this should be moved to be with the hunk on PCREL:

  /* If the ABI has support for PC-relative relocations, enable it by default.
 This test depends on the sub-target tests above setting the code model to
 medium for ELF v2 systems.  */
  if (PCREL_SUPPORTED_BY_OS
  && (rs6000_isa_flags_explicit & OPTION_MASK_PCREL) == 0)
rs6000_isa_flags |= OPTION_MASK_PCREL;

  /* -mpcrel requires -mcmodel=medium, but we can't check TARGET_CMODEL until
  after the subtarget override options are done.  */
  else if (TARGET_PCREL && TARGET_CMODEL != CMODEL_MEDIUM)
{
  if ((rs6000_isa_flags_explicit & OPTION_MASK_PCREL) != 0)
error ("%qs requires %qs", "-mpcrel", "-mcmodel=medium");

  rs6000_isa_flags &= ~OPTION_MASK_PCREL;
}

==>

   else if (TARGET_PCREL && !rs6000_pcrel_p ())
  rs6000_isa_flags &= ~OPTION_MASK_PCREL;

Using DEFAULT_ABI != ABI_ELFv2 seems more straightforward, but I guess the
reason you used rs6000_pcrel_p is to avoid scattered conditions, so it's
fine to me.  Besides, I think we want an error for an explicit -mpcrel on
unsupported targets, similar to the cmodel handling above?

BR,
Kewen

> +rs6000_isa_flags &= ~OPTION_MASK_PCREL;
> +
>/* Enable -mprefixed by default on power10 systems.  */
>    if (TARGET_POWER10 && (rs6000_isa_flags_explicit & OPTION_MASK_PREFIXED) == 0)
>  rs6000_isa_flags |= OPTION_MASK_PREFIXED;
> 



Re: [PATCH V5] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-21 Thread Kewen.Lin via Gcc-patches
Hi Juzhe,

on 2023/8/21 18:59, Juzhe-Zhong wrote:
> Co-Authored-By: Kewen.Lin 
> 
> Hi, @Richi and @Richard, based on the previous discussion, I simply fixed
> the issues for powerpc and s390 with your suggestions:
> 
> -  machine_mode len_load_mode = get_len_load_store_mode
> -(loop_vinfo->vector_mode, true).require ();
> -  machine_mode len_store_mode = get_len_load_store_mode
> -(loop_vinfo->vector_mode, false).require ();
> +  machine_mode len_load_mode, len_store_mode;
> +  if (!get_len_load_store_mode (loop_vinfo->vector_mode, true)
> +.exists (_load_mode))
> +return false;
> +  if (!get_len_load_store_mode (loop_vinfo->vector_mode, false)
> +.exists (_store_mode))
> +return false;
> 
> Hi, @Kewen and @Stefan
> 
> Could you test this patch again ? Thanks.

I confirmed it's bootstrapped and regress-tested on
powerpc64le-linux-gnu P9/P10.  Thanks!

BR,
Kewen

> 
> Co-Authored-By: Kewen.Lin 
> 
> gcc/ChangeLog:
> 
>   * tree-vect-loop.cc (vect_verify_loop_lens): Add exists check.
>   (vectorizable_live_operation): Add live vectorization for length loop control.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/riscv/rvv/autovec/partial/live-1.c: New test.
>   * gcc.target/riscv/rvv/autovec/partial/live_run-1.c: New test.
> 
> ---
>  .../riscv/rvv/autovec/partial/live-1.c| 34 +++
>  .../riscv/rvv/autovec/partial/live_run-1.c| 35 
>  gcc/tree-vect-loop.cc | 89 ++-
>  3 files changed, 138 insertions(+), 20 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-1.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live_run-1.c
> 
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-1.c
> new file mode 100644
> index 000..75fa2eba8cc
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live-1.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=scalable -fdump-tree-optimized-details" } */
> +
> +#include 
> +
> +#define EXTRACT_LAST(TYPE) \
> +  TYPE __attribute__ ((noinline, noclone)) \
> +  test_##TYPE (TYPE *x, int n, TYPE value) \
> +  { \
> +    TYPE last; \
> +    for (int j = 0; j < n; ++j) \
> +      { \
> +	last = x[j]; \
> +	x[j] = last * value; \
> +      } \
> +    return last; \
> +  }
> +
> +#define TEST_ALL(T) \
> +  T (int8_t) \
> +  T (int16_t) \
> +  T (int32_t) \
> +  T (int64_t) \
> +  T (uint8_t) \
> +  T (uint16_t) \
> +  T (uint32_t) \
> +  T (uint64_t) \
> +  T (_Float16) \
> +  T (float) \
> +  T (double)
> +
> +TEST_ALL (EXTRACT_LAST)
> +
> +/* { dg-final { scan-tree-dump-times "\.VEC_EXTRACT" 10 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live_run-1.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live_run-1.c
> new file mode 100644
> index 000..42913a112c6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/live_run-1.c
> @@ -0,0 +1,35 @@
> +/* { dg-do run { target { riscv_vector } } } */
> +/* { dg-additional-options "--param riscv-autovec-preference=scalable" } */
> +
> +#include "live-1.c"
> +
> +#define N 107
> +#define OP 70
> +
> +#define TEST_LOOP(TYPE)  \
> +  {  \
> +TYPE a[N];   \
> +for (int i = 0; i < N; ++i)  \
> +   

[PATCH] vect: Factor out the handling on scatter store having gs_info.decl

2023-08-17 Thread Kewen.Lin via Gcc-patches
Hi,

Similar to the existing function vect_build_gather_load_calls,
this patch factors out the handling of scatter stores having
gs_info.decl into a new function, vect_build_scatter_store_calls.
It also does some minor refactoring, like moving some variables'
declarations close to their uses, restricting the scope of some
of them, etc.

It's a pre-patch for upcoming vectorizable_store re-structuring
for costing.

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-

gcc/ChangeLog:

* tree-vect-stmts.cc (vect_build_scatter_store_calls): New, factor
out from ...
(vectorizable_store): ... here.
---
 gcc/tree-vect-stmts.cc | 411 +
 1 file changed, 212 insertions(+), 199 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index cd8e0a76374..f8a904de503 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2989,6 +2989,216 @@ vect_build_gather_load_calls (vec_info *vinfo, stmt_vec_info stmt_info,
   *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 }

+/* Build a scatter store call while vectorizing STMT_INFO.  Insert new
+   instructions before GSI and add them to VEC_STMT.  GS_INFO describes
+   the scatter store operation.  If the store is conditional, MASK is the
+   unvectorized condition, otherwise MASK is null.  */
+
+static void
+vect_build_scatter_store_calls (vec_info *vinfo, stmt_vec_info stmt_info,
+   gimple_stmt_iterator *gsi, gimple **vec_stmt,
+   gather_scatter_info *gs_info, tree mask)
+{
+  loop_vec_info loop_vinfo = dyn_cast (vinfo);
+  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
+  int ncopies = vect_get_num_copies (loop_vinfo, vectype);
+  enum { NARROW, NONE, WIDEN } modifier;
+  poly_uint64 scatter_off_nunits
+= TYPE_VECTOR_SUBPARTS (gs_info->offset_vectype);
+
+  tree perm_mask = NULL_TREE, mask_halfvectype = NULL_TREE;
+  if (known_eq (nunits, scatter_off_nunits))
+modifier = NONE;
+  else if (known_eq (nunits * 2, scatter_off_nunits))
+{
+  modifier = WIDEN;
+
+  /* Currently gathers and scatters are only supported for
+fixed-length vectors.  */
+  unsigned int count = scatter_off_nunits.to_constant ();
+  vec_perm_builder sel (count, count, 1);
+  for (unsigned i = 0; i < (unsigned int) count; ++i)
+   sel.quick_push (i | (count / 2));
+
+  vec_perm_indices indices (sel, 1, count);
+  perm_mask = vect_gen_perm_mask_checked (gs_info->offset_vectype, indices);
+  gcc_assert (perm_mask != NULL_TREE);
+}
+  else if (known_eq (nunits, scatter_off_nunits * 2))
+{
+  modifier = NARROW;
+
+  /* Currently gathers and scatters are only supported for
+fixed-length vectors.  */
+  unsigned int count = nunits.to_constant ();
+  vec_perm_builder sel (count, count, 1);
+  for (unsigned i = 0; i < (unsigned int) count; ++i)
+   sel.quick_push (i | (count / 2));
+
+  vec_perm_indices indices (sel, 2, count);
+  perm_mask = vect_gen_perm_mask_checked (vectype, indices);
+  gcc_assert (perm_mask != NULL_TREE);
+  ncopies *= 2;
+
+  if (mask)
+   mask_halfvectype = truth_type_for (gs_info->offset_vectype);
+}
+  else
+gcc_unreachable ();
+
+  tree rettype = TREE_TYPE (TREE_TYPE (gs_info->decl));
+  tree arglist = TYPE_ARG_TYPES (TREE_TYPE (gs_info->decl));
+  tree ptrtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+  tree masktype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+  tree idxtype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+  tree srctype = TREE_VALUE (arglist); arglist = TREE_CHAIN (arglist);
+  tree scaletype = TREE_VALUE (arglist);
+
+  gcc_checking_assert (TREE_CODE (masktype) == INTEGER_TYPE
+  && TREE_CODE (rettype) == VOID_TYPE);
+
+  tree ptr = fold_convert (ptrtype, gs_info->base);
+  if (!is_gimple_min_invariant (ptr))
+{
+  gimple_seq seq;
+  ptr = force_gimple_operand (ptr, , true, NULL_TREE);
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
+  edge pe = loop_preheader_edge (loop);
+  basic_block new_bb = gsi_insert_seq_on_edge_immediate (pe, seq);
+  gcc_assert (!new_bb);
+}
+
+  tree mask_arg = NULL_TREE;
+  if (mask == NULL_TREE)
+{
+  mask_arg = build_int_cst (masktype, -1);
+  mask_arg = vect_init_vector (vinfo, stmt_info, mask_arg, masktype, NULL);
+}
+
+  tree scale = build_int_cst (scaletype, gs_info->scale);
+
+  auto_vec vec_oprnds0;
+  auto_vec vec_oprnds1;
+  auto_vec vec_masks;
+  if (mask)
+{
+  tree mask_vectype = truth_type_for (vectype);
+  vect_get_vec_defs_for_operand (vinfo, stmt_info,
+modifier == NARROW ? ncopies / 2 : ncopies,
+mask, 

[PATCH] Makefile.in: Make TM_P_H depend on $(TREE_H) [PR111021]

2023-08-17 Thread Kewen.Lin via Gcc-patches
Hi,

As PR111021 shows, the ${port}-protos.h headers below include
tree.h for code_helper and tree_code:

  arm/arm-protos.h:#include "tree.h"
  cris/cris-protos.h:#include "tree.h" (H-P removed this in r14-3218)
  microblaze/microblaze-protos.h:#include "tree.h"
  rl78/rl78-protos.h:#include "tree.h"
  stormy16/stormy16-protos.h:#include "tree.h"

, when compiling build/gencondmd.cc, the include hierarchy
makes it depend on tm_p.h -> ${port}-protos.h -> tree.h,
which further includes (depends on) some files that are
generated during the build, such as all-tree.def,
tree-check.h and so on.  The previous commit r14-3215
should already force build/gencondmd.cc to depend on
${TREE_H}, so the reported build failure should be gone.

But for long-term maintenance, especially if one day some
build/xxx.cc requires tm_p.h but not recog.h, the ${TREE_H}
dependence could be missed and a build failure would show
up.  So this patch makes TM_P_H depend on $(TREE_H), so any
new build/xxx.cc depending on tm_p.h will be able to pick
up ${TREE_H}.

It's tested with cross-builds for the affected ports with
steps:
 1) dropped the fix r14-3215;
 2) reproduced the build failure with serial build;
 3) applied this patch, serially built and verified all passed;
 4) added back r14-3215, serially built and verified all passed;

Also bootstrapped and regtested on x86_64-redhat-linux and
powerpc64{,le}-linux-gnu.

Is it ok for trunk?

BR,
Kewen
-
PR bootstrap/111021

gcc/ChangeLog:

* Makefile.in (TM_P_H): Add $(TREE_H) as dependence.
---
 gcc/Makefile.in | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 9dddb65b45d..b85c967951b 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -893,7 +893,8 @@ OPTIONS_C_EXTRA = $(PRETTY_PRINT_H)
 BCONFIG_H = bconfig.h $(build_xm_file_list)
 CONFIG_H  = config.h  $(host_xm_file_list)
 TCONFIG_H = tconfig.h $(xm_file_list)
-TM_P_H= tm_p.h$(tm_p_file_list)
+# Some $(target)-protos.h depends on tree.h
+TM_P_H= tm_p.h$(tm_p_file_list) $(TREE_H)
 TM_D_H= tm_d.h$(tm_d_file_list)
 GTM_H = tm.h  $(tm_file_list) insn-constants.h
 TM_H  = $(GTM_H) insn-flags.h $(OPTIONS_H)
--
2.39.1


Re: [PATCH ver 2] rs6000, add overloaded DFP quantize support

2023-08-17 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/8/17 08:19, Carl Love wrote:
> 
> GCC maintainers:
> 
> Version 2, renamed the built-in instances.  Changed the name of the
> overloaded built-in.  Added the missing documentation for the new
> built-ins.  Fixed typos.  Changed name of the test.  Updated the
> effective target for the test.  Retested the patch on Power 10LE and
> Power 8 and Power 9.
> 
> The following patch adds four built-ins for the decimal floating point
> (DFP) quantize instructions on rs6000.  The built-ins are for 64-bit
> and 128-bit DFP operands.
> 
> The patch also adds a test case for the new builtins.
> 
> The patch has been tested on Power 10LE and Power 9 LE/BE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>  Carl Love
> 
> 
> 
> --
> [PATCH] rs6000, add overloaded DFP quantize support
> 
> Add decimal floating point (DFP) quantize built-ins for both 64-bit DFP
> and 128-bit DFP operands.  In each case, there is an immediate version and a
> variable version of the built-in.  The RM value is a 2-bit constant int
> which specifies the rounding mode to use.  For the immediate versions of
> the built-in, the TE field is a 5-bit constant that specifies the value of
> the ideal exponent for the result.  The built-in specifications are:
> 
>   __Decimal64 builtin_dfp_quantize (_Decimal64, _Decimal64,
>   const int RM)
>   __Decimal64 builtin_dfp_quantize (const int TE, _Decimal64,
>   const int)
>   __Decimal128 builtin_dfp_quantize (_Decimal128, _Decimal128,
>const int RM)
>   __Decimal128 builtin_dfp_quantize (const int TE, _Decimal128,
>const int)

Nit: Add the parameter name "RM" for all instances, otherwise readers
might feel confused about what the other two without RM mean. :)
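As an aside for readers unfamiliar with the operation: quantize adjusts
the first operand so that its exponent matches a target exponent (taken
from the second operand, or from the TE immediate), rounding per RM.
Python's decimal module implements the same IEEE 754-2008 quantize
operation, so the semantics (not the rs6000 built-in itself) can be
sketched as:

```python
# Illustration of IEEE 754 decimal quantize semantics (not the rs6000
# built-in): adjust a value to the exponent of a "template" operand
# under an explicit rounding mode.
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN

value = Decimal("1.2345678")

# Quantize to exponent -2 (two fractional digits), round-to-nearest-even.
q1 = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
print(q1)  # 1.23

# Same target exponent, but rounding toward zero.
q2 = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_DOWN)
print(q2)  # 2.67
```

Here Decimal("0.01") plays the role of the second operand, supplying
the target exponent.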

> 
> A testcase is added for the new built-in definitions.

Nit: Add a PR marker line like:

PR target/93448

> 
> gcc/ChangeLog:
>   * config/rs6000/dfp.md: New UNSPEC_DQUAN.
>   (dfp_quan_, dfp_quan_i): New define_insn.
>   * config/rs6000/rs6000-builtins.def (__builtin_dfp_quantize_64,
>   __builtin_dfp_quantize_64i, __builtin_dfp_quantize_128,
>   __builtin_dfp_quantize_128i): New built-in definitions.
>   * config/rs6000/rs6000-overload.def (__builtin_dfp_quantize,
>   __builtin_dfpq_quantize): New overloaded definitions.

These entries need updates for this new revision; one entry for the
documentation update is also missing.

> 
> gcc/testsuite/
>* gcc.target/powerpc/builtin-dfp-quantize-runnable.c: New test
>   case.

Ditto, inconsistent name.

> ---
>  gcc/config/rs6000/dfp.md  |  25 ++-
>  gcc/config/rs6000/rs6000-builtins.def |  15 ++
>  gcc/config/rs6000/rs6000-overload.def |  10 +
>  gcc/doc/extend.texi   |  15 ++
>  .../gcc.target/powerpc/pr93448-dfp-quantize.c | 199 ++
>  5 files changed, 263 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr93448-dfp-quantize.c
> 
> diff --git a/gcc/config/rs6000/dfp.md b/gcc/config/rs6000/dfp.md
> index 5ed8a73ac51..abd21c5db75 100644
> --- a/gcc/config/rs6000/dfp.md
> +++ b/gcc/config/rs6000/dfp.md
> @@ -271,7 +271,8 @@
> UNSPEC_DIEX
> UNSPEC_DSCLI
> UNSPEC_DTSTSFI
> -   UNSPEC_DSCRI])
> +   UNSPEC_DSCRI
> +   UNSPEC_DQUAN])
>  
>  (define_code_iterator DFP_TEST [eq lt gt unordered])
>  
> @@ -395,3 +396,25 @@
>"dscri %0,%1,%2"
>[(set_attr "type" "dfp")
> (set_attr "size" "")])
> +
> +(define_insn "dfp_dquan_"

I guess I mentioned this previously: I prefer "dfp_dqua_",
which aligns with most of the others ...

> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:DDTD 1 "gpc_reg_operand" "d")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +   (match_operand:QI 3 "immediate_operand" "i")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dqua %0,%1,%2,%3"
> +  [(set_attr "type" "dfp")
> +   (set_attr "size" "")])
> +
> +(define_insn "dfp_dquan_i"

..., also prefer "dfp_dquai_" here.

Please also incorporate Peter's insightful comments on predicates
and constraints on this part.

> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:SI 1 "const_int_operand" "n")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +   (match_operand:SI 3 "immediate_operand" "i")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dquai %1,%0,%2,%3"
> +  [(set_attr "type" "dfp")
> +   (set_attr "size" "")])
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index 8a294d6c934..a7ab90771f9 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ 

Re: [PATCH ver 2] rs6000, add overloaded DFP quantize support

2023-08-16 Thread Kewen.Lin via Gcc-patches
on 2023/8/17 11:11, Peter Bergner wrote:
> On 8/16/23 7:19 PM, Carl Love wrote:
>> +(define_insn "dfp_dquan_"
>> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
>> +(unspec:DDTD [(match_operand:DDTD 1 "gpc_reg_operand" "d")
>> +  (match_operand:DDTD 2 "gpc_reg_operand" "d")
>> +  (match_operand:QI 3 "immediate_operand" "i")]
>> + UNSPEC_DQUAN))]
>> +  "TARGET_DFP"
>> +  "dqua %0,%1,%2,%3"
>> +  [(set_attr "type" "dfp")
>> +   (set_attr "size" "")])
> 
> operand 3 refers to the RMC operand field of the insn we are emitting.
> RMC is a two bit unsigned operand, so I think the predicate should be
> const_0_to_3_operand rather than immediate_operand.  It's always best
> to use a tighter predicate if we have one. Ditto for the other patterns
> with an RMC operand.

Good point!  I agree it's better to use a suitable tighter predicate here,
even if for now it's only used for bif expanding and the bif prototype
already restricts it.

> 
> I don't think we allow anything other than an integer for that operand
> value, so I _think_ that "n" is probably a better constraint than "i"?
> Ke Wen/Segher???

Yeah, I agree "n" is better for this context, it better matches your
proposed const_0_to_3_operand/s5bit_cint_operand (const_int).
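For reference, a tight predicate of this kind usually follows the
standard predicates.md idiom, roughly like the sketch below (the
common GCC pattern; the actual rs6000 definition may differ in name
or details):

```
;; Sketch of a predicate accepting only the const_ints 0..3,
;; in the usual predicates.md style.
(define_predicate "const_0_to_3_operand"
  (and (match_code "const_int")
       (match_test "IN_RANGE (INTVAL (op), 0, 3)")))
```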

BR,
Kewen


Re: [PATCH] Makefile.in: Add variable TM_P_H2 for TM_P_H dependency [PR111021]

2023-08-15 Thread Kewen.Lin via Gcc-patches
on 2023/8/16 10:31, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> As PR111021 shows, the ${port}-protos.h files below include tree.h
> for code_helper and tree_code:
> 
>   arm/arm-protos.h:#include "tree.h"
>   cris/cris-protos.h:#include "tree.h"  (H-P removed this in r14-3218)
>   microblaze/microblaze-protos.h:#include "tree.h"
>   rl78/rl78-protos.h:#include "tree.h"
>   stormy16/stormy16-protos.h:#include "tree.h"
> 
> , when compiling build/gencondmd.cc, the include hierarchy
> makes it depend on tm_p.h -> ${port}-protos.h -> tree.h,
> which further includes (depends on) some files that are
> generated during the building, such as: all-tree.def,
> tree-check.h and so on.  The previous commit r14-3215
> should already force build/gencondmd.cc to depend on
> ${TREE_H}, so the reported build failure should be gone.
> 
> But for long-term maintenance, especially if one day some
> build/xxx.cc requires tm_p.h but not recog.h, the ${TREE_H}
> dependence could be missed and a build failure would show
> up.  So this patch adds one variable under the section
> "# Shorthand variables for dependency lists." to explicitly
> indicate that tm_p.h, which includes ${port}-protos.h, should
> depend on ${TREE_H}.  Then any new build/xxx.cc depending
> on tm_p.h will be able to consider ${TREE_H}.
> 
> Note that the existing ${TM_P_H} variable is also used for
> "generated_files" and isn't dedicated to dependencies, so
> a variable named ${TM_P_H2} is proposed and put under
> "# Shorthand variables for dependency lists."; the
> only use as a dependence is updated accordingly.

I did some more checking and found that not all files in
$(generated_files) are **generated**; some of them actually
sit in the source directory, so I misinterpreted it from its
name.  I think we can just update the existing ${TM_P_H}
instead of adding a new variable.

I'll post a new patch after some testing, sorry for the noise!

BR,
Kewen


[PATCH] Makefile.in: Add variable TM_P_H2 for TM_P_H dependency [PR111021]

2023-08-15 Thread Kewen.Lin via Gcc-patches
Hi,

As PR111021 shows, the ${port}-protos.h files below include tree.h
for code_helper and tree_code:

  arm/arm-protos.h:#include "tree.h"
  cris/cris-protos.h:#include "tree.h"  (H-P removed this in r14-3218)
  microblaze/microblaze-protos.h:#include "tree.h"
  rl78/rl78-protos.h:#include "tree.h"
  stormy16/stormy16-protos.h:#include "tree.h"

, when compiling build/gencondmd.cc, the include hierarchy
makes it depend on tm_p.h -> ${port}-protos.h -> tree.h,
which further includes (depends on) some files that are
generated during the build, such as all-tree.def,
tree-check.h and so on.  The previous commit r14-3215
should already force build/gencondmd.cc to depend on
${TREE_H}, so the reported build failure should be gone.

But for long-term maintenance, especially if one day some
build/xxx.cc requires tm_p.h but not recog.h, the ${TREE_H}
dependence could be missed and a build failure would show
up.  So this patch adds one variable under the section
"# Shorthand variables for dependency lists." to explicitly
indicate that tm_p.h, which includes ${port}-protos.h, should
depend on ${TREE_H}.  Then any new build/xxx.cc depending
on tm_p.h will be able to consider ${TREE_H}.

Note that the existing ${TM_P_H} variable is also used for
"generated_files" and isn't dedicated to dependencies, so
a variable named ${TM_P_H2} is proposed and put under
"# Shorthand variables for dependency lists."; the
only use as a dependence is updated accordingly.

It's tested with cross-builds for the affected ports with
steps:

  1) dropped the fix r14-3215;
  2) reproduced the build failure with serial build;
  3) applied this patch, serially built and verified all passed;
  4) added back r14-3215, serially built and verified all passed;

Is it ok for trunk?

BR,
Kewen
-
PR bootstrap/111021

gcc/ChangeLog:

* Makefile.in (TM_P_H2): New variable for tm_p.h dependence.
---
 gcc/Makefile.in | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 9dddb65b45d..192dc76f294 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1062,6 +1062,7 @@ RTL_SSA_H = $(PRETTY_PRINT_H) insn-config.h 
splay-tree-utils.h \
rtl-ssa/changes.h rtl-ssa/functions.h rtl-ssa/is-a.inl \
rtl-ssa/access-utils.h rtl-ssa/insn-utils.h rtl-ssa/movement.h \
rtl-ssa/change-utils.h rtl-ssa/member-fns.inl
+TM_P_H2 = $(TM_P_H) $(TREE_H)

 #

 # Now figure out from those variables how to compile and link.
@@ -2905,7 +2906,7 @@ build/inchash.o : inchash.cc $(BCONFIG_H) $(SYSTEM_H) 
$(CORETYPES_H)  \
   $(HASHTAB_H) inchash.h
 build/gencondmd.o : build/gencondmd.cc $(BCONFIG_H) $(SYSTEM_H)
\
   $(CORETYPES_H) $(GTM_H) insn-constants.h \
-  $(filter-out insn-flags.h, $(RTL_H) $(TM_P_H) $(FUNCTION_H) $(REGS_H) \
+  $(filter-out insn-flags.h, $(RTL_H) $(TM_P_H2) $(FUNCTION_H) $(REGS_H) \
   $(RECOG_H) output.h $(FLAGS_H) $(RESOURCE_H) toplev.h $(DIAGNOSTIC_CORE_H) 
reload.h  \
   $(EXCEPT_H) tm-constrs.h)
 # This pulls in tm-pred.h which contains inline functions wrapping up
--
2.39.1


Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-15 Thread Kewen.Lin via Gcc-patches
on 2023/8/15 17:13, Richard Sandiford wrote:
> Richard Biener  writes:
>>> OK, fair enough.  So the idea is: see where we end up and then try to
>>> improve/factor the APIs in a less peephole way?
>>
>> Yeah, I think that's the only good way forward.
> 
> OK, no objection from me.  Sorry for holding the patch up.

This hasn't been approved yet (although the patch on VMAT_LOAD_STORE_LANES
was), so it wasn't held up; thanks for sharing your thoughts and getting
it some attention. :)

From the discussions, it seems this looks good to both of you.  But I could
be wrong, so may I ask if it's ok for trunk?

BR,
Kewen


Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-15 Thread Kewen.Lin via Gcc-patches
on 2023/8/15 20:07, Richard Biener wrote:
> On Tue, Aug 15, 2023 at 1:47 PM Kewen.Lin  wrote:
>>
>> on 2023/8/15 15:53, Richard Biener wrote:
>>> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin  wrote:

 on 2023/8/14 22:16, Richard Sandiford wrote:
> "Kewen.Lin"  writes:
>> Hi Richard,
>>
>> on 2023/8/14 20:20, Richard Sandiford wrote:
>>> Thanks for the clean-ups.  But...
>>>
>>> "Kewen.Lin"  writes:
 Hi,

 Following Richi's suggestion [1], this patch is to move the
 handlings on VMAT_GATHER_SCATTER in the final loop nest
 of function vectorizable_load to its own loop.  Basically
 it duplicates the final loop nest, clean up some useless
 set up code for the case of VMAT_GATHER_SCATTER, remove some
 unreachable code.  Also remove the corresponding handlings
 in the final loop nest.

 Bootstrapped and regtested on x86_64-redhat-linux,
 aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

 [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html

 Is it ok for trunk?

 BR,
 Kewen
 -

 gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Move the handlings on
VMAT_GATHER_SCATTER in the final loop nest to its own loop,
and update the final nest accordingly.
 ---
  gcc/tree-vect-stmts.cc | 361 +
  1 file changed, 219 insertions(+), 142 deletions(-)
>>>
>>> ...that seems like quite a lot of +s.  Is there nothing we can do to
>>> avoid the cut-&-paste?
>>
>> Thanks for the comments!  I'm not sure if I get your question, if we
>> want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
>> inevitable?  Your concern is mainly about git blame history?
>
> No, it was more that 219-142=77, so it seems like a lot of lines
> are being duplicated rather than simply being moved.  (Unlike for
> VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
> was a clear improvement.)
>
> So I was just wondering if there was any obvious factoring-out that
> could be done to reduce the duplication.

 ah, thanks for the clarification!

 I think the main duplication are on the loop body beginning and end,
 let's take a look at them in details:

 +  if (memory_access_type == VMAT_GATHER_SCATTER)
 +{
 +  gcc_assert (alignment_support_scheme == dr_aligned
 + || alignment_support_scheme == dr_unaligned_supported);
 +  gcc_assert (!grouped_load && !slp_perm);
 +
 +  unsigned int inside_cost = 0, prologue_cost = 0;

 // These above are newly added.

 +  for (j = 0; j < ncopies; j++)
 +   {
 + /* 1. Create the vector or array pointer update chain.  */
 + if (j == 0 && !costing_p)
 +   {
 + if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
 +   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
 +slp_node, _info, 
 _ptr,
 +_offsets);
 + else
 +   dataref_ptr
 + = vect_create_data_ref_ptr (vinfo, first_stmt_info, 
 aggr_type,
 + at_loop, offset, , gsi,
 + _incr, false, bump);
 +   }
 + else if (!costing_p)
 +   {
 + gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 + if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
 +   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, 
 ptr_incr,
 +  gsi, stmt_info, bump);
 +   }

 // These are for dataref_ptr; in the final loop nest we deal with more
 cases on simd_lane_access_p and diff_first_stmt_info, but don't handle
 STMT_VINFO_GATHER_SCATTER_P any more.  Very few (one case) can be shared
 between them, so IMHO factoring it out seems like overkill.

 +
 + if (mask && !costing_p)
 +   vec_mask = vec_masks[j];

 // It's merged out from j == 0 and j != 0

 +
 + gimple *new_stmt = NULL;
 + for (i = 0; i < vec_num; i++)
 +   {
 + tree final_mask = NULL_TREE;
 + tree final_len = NULL_TREE;
 + tree bias = NULL_TREE;
 + if (!costing_p)
 +   {
 + if (loop_masks)
 +   final_mask
 + = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
 +   

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-15 Thread Kewen.Lin via Gcc-patches
on 2023/8/15 15:53, Richard Biener wrote:
> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin  wrote:
>>
>> on 2023/8/14 22:16, Richard Sandiford wrote:
>>> "Kewen.Lin"  writes:
 Hi Richard,

 on 2023/8/14 20:20, Richard Sandiford wrote:
> Thanks for the clean-ups.  But...
>
> "Kewen.Lin"  writes:
>> Hi,
>>
>> Following Richi's suggestion [1], this patch is to move the
>> handlings on VMAT_GATHER_SCATTER in the final loop nest
>> of function vectorizable_load to its own loop.  Basically
>> it duplicates the final loop nest, clean up some useless
>> set up code for the case of VMAT_GATHER_SCATTER, remove some
>> unreachable code.  Also remove the corresponding handlings
>> in the final loop nest.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>>
>> gcc/ChangeLog:
>>
>>* tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>>VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>and update the final nest accordingly.
>> ---
>>  gcc/tree-vect-stmts.cc | 361 +
>>  1 file changed, 219 insertions(+), 142 deletions(-)
>
> ...that seems like quite a lot of +s.  Is there nothing we can do to
> avoid the cut-&-paste?

 Thanks for the comments!  I'm not sure if I get your question, if we
 want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
 inevitable?  Your concern is mainly about git blame history?
>>>
>>> No, it was more that 219-142=77, so it seems like a lot of lines
>>> are being duplicated rather than simply being moved.  (Unlike for
>>> VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
>>> was a clear improvement.)
>>>
>>> So I was just wondering if there was any obvious factoring-out that
>>> could be done to reduce the duplication.
>>
>> ah, thanks for the clarification!
>>
>> I think the main duplication are on the loop body beginning and end,
>> let's take a look at them in details:
>>
>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>> +{
>> +  gcc_assert (alignment_support_scheme == dr_aligned
>> + || alignment_support_scheme == dr_unaligned_supported);
>> +  gcc_assert (!grouped_load && !slp_perm);
>> +
>> +  unsigned int inside_cost = 0, prologue_cost = 0;
>>
>> // These above are newly added.
>>
>> +  for (j = 0; j < ncopies; j++)
>> +   {
>> + /* 1. Create the vector or array pointer update chain.  */
>> + if (j == 0 && !costing_p)
>> +   {
>> + if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>> +slp_node, _info, 
>> _ptr,
>> +_offsets);
>> + else
>> +   dataref_ptr
>> + = vect_create_data_ref_ptr (vinfo, first_stmt_info, 
>> aggr_type,
>> + at_loop, offset, , gsi,
>> + _incr, false, bump);
>> +   }
>> + else if (!costing_p)
>> +   {
>> + gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> + if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +  gsi, stmt_info, bump);
>> +   }
>>
>> // These are for dataref_ptr; in the final loop nest we deal with more
>> cases on simd_lane_access_p and diff_first_stmt_info, but don't handle
>> STMT_VINFO_GATHER_SCATTER_P any more.  Very few (one case) can be shared
>> between them, so IMHO factoring it out seems like overkill.
>>
>> +
>> + if (mask && !costing_p)
>> +   vec_mask = vec_masks[j];
>>
>> // It's merged out from j == 0 and j != 0
>>
>> +
>> + gimple *new_stmt = NULL;
>> + for (i = 0; i < vec_num; i++)
>> +   {
>> + tree final_mask = NULL_TREE;
>> + tree final_len = NULL_TREE;
>> + tree bias = NULL_TREE;
>> + if (!costing_p)
>> +   {
>> + if (loop_masks)
>> +   final_mask
>> + = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +   vec_num * ncopies, vectype,
>> +   vec_num * j + i);
>> + if (vec_mask)
>> +   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>> +  final_mask, vec_mask, 
>> gsi);
>> +
>> + if (i > 0 && 

Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-15 Thread Kewen.Lin via Gcc-patches
Hi Stefan,

on 2023/8/15 02:51, Stefan Schulze Frielinghaus wrote:
> Hi everyone,
> 
> I have bootstrapped and regtested the patch below on s390.  For the
> 64-bit target I do not see any changes regarding the testsuite.  For the
> 31-bit target I see the following failures:
> 
> FAIL: gcc.dg/vect/no-scevccp-outer-14.c (internal compiler error: in require, 
> at machmode.h:313)
> FAIL: gcc.dg/vect/no-scevccp-outer-14.c (test for excess errors)
> FAIL: gcc.dg/vect/pr50451.c (internal compiler error: in require, at 
> machmode.h:313)
> FAIL: gcc.dg/vect/pr50451.c (test for excess errors)
> FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (internal compiler error: 
> in require, at machmode.h:313)
> FAIL: gcc.dg/vect/pr50451.c -flto -ffat-lto-objects (test for excess errors)
> FAIL: gcc.dg/vect/pr53773.c (internal compiler error: in require, at 
> machmode.h:313)
> FAIL: gcc.dg/vect/pr53773.c (test for excess errors)
> FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (internal compiler error: 
> in require, at machmode.h:313)
> FAIL: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects (test for excess errors)
> FAIL: gcc.dg/vect/pr71407.c (internal compiler error: in require, at 
> machmode.h:313)
> FAIL: gcc.dg/vect/pr71407.c (test for excess errors)
> FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (internal compiler error: 
> in require, at machmode.h:313)
> FAIL: gcc.dg/vect/pr71407.c -flto -ffat-lto-objects (test for excess errors)
> FAIL: gcc.dg/vect/pr71416-1.c (internal compiler error: in require, at 
> machmode.h:313)
> FAIL: gcc.dg/vect/pr71416-1.c (test for excess errors)
> FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (internal compiler 
> error: in require, at machmode.h:313)
> FAIL: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects (test for excess errors)
> FAIL: gcc.dg/vect/pr94443.c (internal compiler error: in require, at 
> machmode.h:313)
> FAIL: gcc.dg/vect/pr94443.c (test for excess errors)
> FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (internal compiler error: 
> in require, at machmode.h:313)
> FAIL: gcc.dg/vect/pr94443.c -flto -ffat-lto-objects (test for excess errors)
> FAIL: gcc.dg/vect/pr97558.c (internal compiler error: in require, at 
> machmode.h:313)
> FAIL: gcc.dg/vect/pr97558.c (test for excess errors)
> FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (internal compiler error: 
> in require, at machmode.h:313)
> FAIL: gcc.dg/vect/pr97558.c -flto -ffat-lto-objects (test for excess errors)
> FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (internal 
> compiler error: in require, at machmode.h:313)
> FAIL: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects (test for 
> excess errors)
> UNRESOLVED: gcc.dg/vect/no-scevccp-outer-14.c compilation failed to produce 
> executable
> UNRESOLVED: gcc.dg/vect/pr53773.c -flto -ffat-lto-objects  
> scan-tree-dump-times optimized "\\* 10" 2
> UNRESOLVED: gcc.dg/vect/pr53773.c scan-tree-dump-times optimized "\\* 10" 2
> UNRESOLVED: gcc.dg/vect/pr71416-1.c -flto -ffat-lto-objects compilation 
> failed to produce executable
> UNRESOLVED: gcc.dg/vect/pr71416-1.c compilation failed to produce executable
> UNRESOLVED: gcc.dg/vect/vect-reduc-pattern-3.c -flto -ffat-lto-objects 
> compilation failed to produce executable
> 
> I've randomely picked pr50451.c and ran gcc against it which results in:
> 
> during GIMPLE pass: vect
> dump file: pr50451.c.174t.vect
> /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c: In 
> function ‘foo’:
> /gcc-verify-workdir/patched/src/gcc/testsuite/gcc.dg/vect/pr50451.c:5:1: 
> internal compiler error: in require, at machmode.h:313
> 0x1265d21 opt_mode::require() const
> /gcc-verify-workdir/patched/src/gcc/machmode.h:313
> 0x1d7e4e9 opt_mode::require() const
> /gcc-verify-workdir/patched/src/gcc/vec.h:955
> 0x1d7e4e9 vect_verify_loop_lens
> /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:1471
> 0x1da29ab vect_analyze_loop_2
> /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:2929
> 0x1da40c7 vect_analyze_loop_1
> /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3330
> 0x1da499d vect_analyze_loop(loop*, vec_info_shared*)
> /gcc-verify-workdir/patched/src/gcc/tree-vect-loop.cc:3484
> 0x1deed27 try_vectorize_loop_1
> /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1064
> 0x1deed27 try_vectorize_loop
> /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1180
> 0x1def5c1 execute
> /gcc-verify-workdir/patched/src/gcc/tree-vectorizer.cc:1296
> Please submit a full bug report, with preprocessed source (by using 
> -freport-bug).
> Please include the complete backtrace with any bug report.
> See <https://gcc.gnu.org/bugs/> for instructions.
> 

It looks like s390 supports variable-index vec_extract at -m31 but
not vectors with length.  It seems we need to further check the
vector-with-length capability, with something like:

diff --git a/gcc/tree-vect-loop.cc 

[PATCH] Makefile.in: Make recog.h depend on $(TREE_H)

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi,

Commit r14-3093 introduced an intermittent build failure when
building build/gencondmd.cc.  Since r14-3093 makes recog.h
include tree.h, which in turn includes (depends on) some
files that are generated during the build, such as
all-tree.def, tree-check.h etc., the build of
build/gencondmd.cc can fail if these dependencies are not
yet ready.  So this patch teaches make about this dependence.

Thanks to Jan-Benedict Glaw for testing this!

gcc/ChangeLog:

* Makefile.in (RECOG_H): Add $(TREE_H) as dependence.
---
 gcc/Makefile.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 2429128cbf2..9dddb65b45d 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -991,7 +991,7 @@ GIMPLE_H = gimple.h gimple.def gsstruct.def $(VEC_H) \
$(GGC_H) $(BASIC_BLOCK_H) $(TREE_H) tree-ssa-operands.h \
tree-ssa-alias.h $(INTERNAL_FN_H) $(HASH_TABLE_H) is-a.h
 GCOV_IO_H = gcov-io.h version.h auto-host.h gcov-counter.def
-RECOG_H = recog.h
+RECOG_H = recog.h $(TREE_H)
 EMIT_RTL_H = emit-rtl.h
 FLAGS_H = flags.h flag-types.h $(OPTIONS_H)
 OPTIONS_H = options.h flag-types.h $(OPTIONS_H_EXTRA)
--
2.31.1


Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-14 Thread Kewen.Lin via Gcc-patches
on 2023/8/14 22:16, Richard Sandiford wrote:
> "Kewen.Lin"  writes:
>> Hi Richard,
>>
>> on 2023/8/14 20:20, Richard Sandiford wrote:
>>> Thanks for the clean-ups.  But...
>>>
>>> "Kewen.Lin"  writes:
 Hi,

 Following Richi's suggestion [1], this patch is to move the
 handlings on VMAT_GATHER_SCATTER in the final loop nest
 of function vectorizable_load to its own loop.  Basically
 it duplicates the final loop nest, clean up some useless
 set up code for the case of VMAT_GATHER_SCATTER, remove some
 unreachable code.  Also remove the corresponding handlings
 in the final loop nest.

 Bootstrapped and regtested on x86_64-redhat-linux,
 aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

 [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html

 Is it ok for trunk?

 BR,
 Kewen
 -

 gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Move the handlings on
VMAT_GATHER_SCATTER in the final loop nest to its own loop,
and update the final nest accordingly.
 ---
  gcc/tree-vect-stmts.cc | 361 +
  1 file changed, 219 insertions(+), 142 deletions(-)
>>>
>>> ...that seems like quite a lot of +s.  Is there nothing we can do to
>>> avoid the cut-&-paste?
>>
>> Thanks for the comments!  I'm not sure if I get your question, if we
>> want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
>> inevitable?  Your concern is mainly about git blame history?
> 
> No, it was more that 219-142=77, so it seems like a lot of lines
> are being duplicated rather than simply being moved.  (Unlike for
> VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
> was a clear improvement.)
> 
> So I was just wondering if there was any obvious factoring-out that
> could be done to reduce the duplication.

ah, thanks for the clarification!

I think the main duplication are on the loop body beginning and end,
let's take a look at them in details:

+  if (memory_access_type == VMAT_GATHER_SCATTER)
+{
+  gcc_assert (alignment_support_scheme == dr_aligned
+ || alignment_support_scheme == dr_unaligned_supported);
+  gcc_assert (!grouped_load && !slp_perm);
+
+  unsigned int inside_cost = 0, prologue_cost = 0;

// These above are newly added.

+  for (j = 0; j < ncopies; j++)
+   {
+ /* 1. Create the vector or array pointer update chain.  */
+ if (j == 0 && !costing_p)
+   {
+ if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
+slp_node, _info, _ptr,
+_offsets);
+ else
+   dataref_ptr
+ = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
+ at_loop, offset, , gsi,
+ _incr, false, bump);
+   }
+ else if (!costing_p)
+   {
+ gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
+ if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+  gsi, stmt_info, bump);
+   }

// These are for dataref_ptr; in the final loop nest we deal with more
cases on simd_lane_access_p and diff_first_stmt_info, but don't handle
STMT_VINFO_GATHER_SCATTER_P any more.  Very few (one case) can be shared
between them, so IMHO factoring it out seems like overkill.

+
+ if (mask && !costing_p)
+   vec_mask = vec_masks[j];

// It's merged out from j == 0 and j != 0

+
+ gimple *new_stmt = NULL;
+ for (i = 0; i < vec_num; i++)
+   {
+ tree final_mask = NULL_TREE;
+ tree final_len = NULL_TREE;
+ tree bias = NULL_TREE;
+ if (!costing_p)
+   {
+ if (loop_masks)
+   final_mask
+ = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
+   vec_num * ncopies, vectype,
+   vec_num * j + i);
+ if (vec_mask)
+   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
+  final_mask, vec_mask, gsi);
+
+ if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+  gsi, stmt_info, bump);
+   }

// This part is directly copied from the original; the original gets updated by
removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size, I didn't consider
this before; do you prefer me to factor this part out?

+   

Re: [PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi Richard,

on 2023/8/14 20:20, Richard Sandiford wrote:
> Thanks for the clean-ups.  But...
> 
> "Kewen.Lin"  writes:
>> Hi,
>>
>> Following Richi's suggestion [1], this patch is to move the
>> handlings on VMAT_GATHER_SCATTER in the final loop nest
>> of function vectorizable_load to its own loop.  Basically
>> it duplicates the final loop nest, clean up some useless
>> set up code for the case of VMAT_GATHER_SCATTER, remove some
>> unreachable code.  Also remove the corresponding handlings
>> in the final loop nest.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>>
>> gcc/ChangeLog:
>>
>>  * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>>  VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>  and update the final nest accordingly.
>> ---
>>  gcc/tree-vect-stmts.cc | 361 +
>>  1 file changed, 219 insertions(+), 142 deletions(-)
> 
> ...that seems like quite a lot of +s.  Is there nothing we can do to
> avoid the cut-&-paste?

Thanks for the comments!  I'm not sure if I get your question; if we
want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
inevitable?  Is your concern mainly about git blame history?

BR,
Kewen

> 
> Richard
> 
>>
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index c361e16cb7b..5e514eca19b 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -10455,6 +10455,218 @@ vectorizable_load (vec_info *vinfo,
>>return true;
>>  }
>>
>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>> +{
>> +  gcc_assert (alignment_support_scheme == dr_aligned
>> +  || alignment_support_scheme == dr_unaligned_supported);
>> +  gcc_assert (!grouped_load && !slp_perm);
>> +
>> +  unsigned int inside_cost = 0, prologue_cost = 0;
>> +  for (j = 0; j < ncopies; j++)
>> +{
>> +  /* 1. Create the vector or array pointer update chain.  */
>> +  if (j == 0 && !costing_p)
>> +{
>> +  if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>> + slp_node, &gs_info, &dataref_ptr,
>> + &vec_offsets);
>> +  else
>> +dataref_ptr
>> +  = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>> +  at_loop, offset, &dummy, gsi,
>> +  &ptr_incr, false, bump);
>> +}
>> +  else if (!costing_p)
>> +{
>> +  gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> +  if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +   gsi, stmt_info, bump);
>> +}
>> +
>> +  if (mask && !costing_p)
>> +vec_mask = vec_masks[j];
>> +
>> +  gimple *new_stmt = NULL;
>> +  for (i = 0; i < vec_num; i++)
>> +{
>> +  tree final_mask = NULL_TREE;
>> +  tree final_len = NULL_TREE;
>> +  tree bias = NULL_TREE;
>> +  if (!costing_p)
>> +{
>> +  if (loop_masks)
>> +final_mask
>> +  = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +vec_num * ncopies, vectype,
>> +vec_num * j + i);
>> +  if (vec_mask)
>> +final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>> +   final_mask, vec_mask, gsi);
>> +
>> +  if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +   gsi, stmt_info, bump);
>> +}
>> +
>> +  /* 2. Create the vector-load in the loop.  */
>> +  unsigned HOST_WIDE_INT align;
>> +  if (gs_info.ifn != IFN_LAST)
>> +{
>> +  if (costing_p)
>> +{
>> +  unsigned int cnunits = vect_nunits_for_cost (vectype);
>> +  inside_cost
>> += record_stmt_cost (cost_vec, cnunits, scalar_load,
>> +stmt_info, 0, vect_body);
>> +  continue;
>> +}
>> +  if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +vec_offset = vec_offsets[vec_num * j + i];
>> +  tree zero = build_zero_cst (vectype);
>> +  tree scale = size_int (gs_info.scale);
>> +
>> +  if (gs_info.ifn == IFN_MASK_LEN_GATHER_LOAD)
>> +{
>> +   

Re: [PATCH] vect: Move VMAT_LOAD_STORE_LANES handlings from final loop nest

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi Richi,

on 2023/8/14 20:04, Richard Biener wrote:
> On Mon, Aug 14, 2023 at 10:54 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> Following Richi's suggestion [1], this patch is to move the
>> handlings on VMAT_LOAD_STORE_LANES in the final loop nest
>> of function vectorizable_load to its own loop.  Basically
>> it duplicates the final loop nest, clean up some useless
>> set up code for the case of VMAT_LOAD_STORE_LANES, remove
>> some unreachable code.  Also remove the corresponding
>> handlings in the final loop nest.
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
> 
> OK (I guess the big diff is mostly because of re-indenting).

Thanks!  Yes, there is some code in the original final loop nest like

if (memory_access_type == VMAT_LOAD_STORE_LANES)
  {
...
  }
else
  {
...
  }

Then the else arm is fully re-indented.

The other patch on VMAT_GATHER_SCATTER looks a bit better since
it doesn't need re-indenting.

BR,
Kewen

> 
> Thanks,
> Richard.
> 
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>
>> gcc/ChangeLog:
>>
>> * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
>> VMAT_LOAD_STORE_LANES in the final loop nest to its own loop,
>> and update the final nest accordingly.
>> ---
>>  gcc/tree-vect-stmts.cc | 1275 
>>  1 file changed, 634 insertions(+), 641 deletions(-)
>>
>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> index 4f2d088484c..c361e16cb7b 100644
>> --- a/gcc/tree-vect-stmts.cc
>> +++ b/gcc/tree-vect-stmts.cc
>> @@ -10332,7 +10332,129 @@ vectorizable_load (vec_info *vinfo,
>> vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies, mask,
>> &vec_masks, mask_vectype);
>>  }
>> +
>>tree vec_mask = NULL_TREE;
>> +  if (memory_access_type == VMAT_LOAD_STORE_LANES)
>> +{
>> +  gcc_assert (alignment_support_scheme == dr_aligned
>> + || alignment_support_scheme == dr_unaligned_supported);
>> +  gcc_assert (grouped_load && !slp);
>> +
>> +  unsigned int inside_cost = 0, prologue_cost = 0;
>> +  for (j = 0; j < ncopies; j++)
>> +   {
>> + if (costing_p)
>> +   {
>> + /* An IFN_LOAD_LANES will load all its vector results,
>> +regardless of which ones we actually need.  Account
>> +for the cost of unused results.  */
>> + if (first_stmt_info == stmt_info)
>> +   {
>> + unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
>> + stmt_vec_info next_stmt_info = first_stmt_info;
>> + do
>> +   {
>> + gaps -= 1;
>> + next_stmt_info = DR_GROUP_NEXT_ELEMENT 
>> (next_stmt_info);
>> +   }
>> + while (next_stmt_info);
>> + if (gaps)
>> +   {
>> + if (dump_enabled_p ())
>> +   dump_printf_loc (MSG_NOTE, vect_location,
>> +"vect_model_load_cost: %d "
>> +"unused vectors.\n",
>> +gaps);
>> + vect_get_load_cost (vinfo, stmt_info, gaps,
>> + alignment_support_scheme,
>> + misalignment, false, _cost,
>> + _cost, cost_vec, cost_vec,
>> + true);
>> +   }
>> +   }
>> + vect_get_load_cost (vinfo, stmt_info, 1, 
>> alignment_support_scheme,
>> + misalignment, false, _cost,
>> + _cost, cost_vec, cost_vec, true);
>> + continue;
>> +   }
>> +
>> + /* 1. Create the vector or array pointer update chain.  */
>> + if (j == 0)
>> +   dataref_ptr
>> + = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
>> + at_loop, offset, &dummy, gsi,
>> + &ptr_incr, false, bump);
>> + else
>> +   {
>> + gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> + dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, 
>> gsi,
>> +stmt_info, bump);
>> +   }
>> + if (mask)
>> +   vec_mask = vec_masks[j];
>> +
>> + tree vec_array = create_vector_array (vectype, vec_num);
>> +
>> + tree final_mask = NULL_TREE;
>> + if (loop_masks)
>> +   final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +ncopies, vectype, j);
>> + if (vec_mask)
>> +   final_mask = 

Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi Juzhe,

on 2023/8/14 20:08, juzhe.zh...@rivai.ai wrote:
> Hi, Kewin.
> 
> Could you test 'can_vec_extract_var_idx_p' and send V5 patch when you pass 
> the testing?

The below diff was bootstrapped and regress-tested on Power10 LE.  Compared to
the previous v4, the only changes should be the proposed
can_vec_extract_var_idx_p and its required new includes as below:

+#include "memmodel.h"
+#include "optabs.h"

Could you double-check?

Since I just tested it on Power10, you have full ownership of the patch;
I'd leave the v5 posting to you.  Thanks!

BR,
Kewen
-
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index bc3063c3615..5ae9f69c7eb 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -32,6 +32,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "ssa.h"
 #include "optabs-tree.h"
+#include "memmodel.h"
+#include "optabs.h"
 #include "diagnostic-core.h"
 #include "fold-const.h"
 #include "stor-layout.h"
@@ -10300,17 +10302,7 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
   /* No transformation required.  */
   if (loop_vinfo && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
{
- if (!direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
-  OPTIMIZE_FOR_SPEED))
-   {
- if (dump_enabled_p ())
-   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-"can't operate on partial vectors "
-"because the target doesn't support extract "
-"last reduction.\n");
- LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
-   }
- else if (slp_node)
+ if (slp_node)
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -10330,9 +10322,26 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
  else
{
  gcc_assert (ncopies == 1 && !slp_node);
- vect_record_loop_mask (loop_vinfo,
-&LOOP_VINFO_MASKS (loop_vinfo),
-1, vectype, NULL);
+ if (direct_internal_fn_supported_p (IFN_EXTRACT_LAST, vectype,
+ OPTIMIZE_FOR_SPEED))
+   vect_record_loop_mask (loop_vinfo,
+  &LOOP_VINFO_MASKS (loop_vinfo),
+  1, vectype, NULL);
+ else if (can_vec_extract_var_idx_p (
+TYPE_MODE (vectype), TYPE_MODE (TREE_TYPE (vectype))))
+   vect_record_loop_len (loop_vinfo,
+ &LOOP_VINFO_LENS (loop_vinfo),
+ 1, vectype, 1);
+ else
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (
+ MSG_MISSED_OPTIMIZATION, vect_location,
+ "can't operate on partial vectors "
+ "because the target doesn't support extract "
+ "last reduction.\n");
+ LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+   }
}
}
   /* ???  Enable for loop costing as well.  */
@@ -10358,7 +10367,9 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
   gimple *vec_stmt;
   if (slp_node)
 {
-  gcc_assert (!loop_vinfo || !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo));
+  gcc_assert (!loop_vinfo
+ || (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
+ && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)));

   /* Get the correct slp vectorized stmt.  */
   vec_lhs = SLP_TREE_VEC_DEFS (slp_node)[vec_entry];
@@ -10402,7 +10413,42 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,

   gimple_seq stmts = NULL;
   tree new_tree;
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+   {
+ /* Emit:
+
+  SCALAR_RES = VEC_EXTRACT <VEC_LHS, LEN + BIAS - 1>
+
+where VEC_LHS is the vectorized live-out result and MASK is
+the loop mask for the final iteration.  */
+ gcc_assert (ncopies == 1 && !slp_node);
+ gimple_seq tem = NULL;
+ gimple_stmt_iterator gsi = gsi_last (tem);
+ tree len
+   = vect_get_loop_len (loop_vinfo, &gsi,
+&LOOP_VINFO_LENS (loop_vinfo),
+1, vectype, 0, 0);
+
+ /* BIAS - 1.  */
+ signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
+ tree bias_minus_one
+   = int_const_binop (MINUS_EXPR,
+  build_int_cst (TREE_TYPE (len), biasval),
+ 

Re: [PATCH] rs6000, add overloaded DFP quantize support

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/8/9 23:52, Carl Love wrote:
> 
> GCC maintainers:
> 
> The following patch adds four built-ins for the decimal floating point
> (DFP) quantize instructions on rs6000.  The built-ins are for 64-bit
> and 128-bit DFP operands.
> 
> The patch also adds a test case for the new builtins.
> 
> The Patch has been tested on Power 10LE and Power 9 LE/BE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>  Carl Love
> 
> 
> --
> rs6000, add overloaded DFP quantize support
> 
> Add decimal floating point (DFP) quantize built-ins for both 64-bit DFP
> and 128-DFP operands.  In each case, there is an immediate version and a
> variable version of the bult-in.  The RM value is a 2-bit const int which

Nit: s/bult-in/built-in/

> specifies the rounding mode to use.  For the immediate versions of the
> built-in, TE field is a 5-bit constant that specifies the value of the
> ideal exponent for the result.  The built-in specifications are:
> 
>   __Decimal64 builtin_dfp_quantize (_Decimal64, _Decimal64,
>   const int RM)
>   __Decimal64 builtin_dfp_quantize (const int TE, _Decimal64,
>   const int)
>   __Decimal128 builtin_dfpq_quantize (_Decimal128, _Decimal128,
> const int RM)
>   __Decimal128 builtin_dfpq_quantize (const int TE, _Decimal128,
> const int)
> 

I noticed that the existing DFP bifs directly use the insn
mnemonics; I prefer to keep consistent with them.  So could we
have one unique external interface like dfp_quantize for users?

And we can have the underlying instances for it:

__Decimal64 builtin_dfp_dqua (_Decimal64, _Decimal64, const int RM)

__Decimal64 builtin_dfp_dquai (const int TE, _Decimal64, const int)

__Decimal128 builtin_dfp_dquaq (_Decimal128, _Decimal128, const int RM)

__Decimal128 builtin_dfp_dquaiq (const int TE, _Decimal128, const int)

Besides, this patch missed updating the documentation; please add the new
entries in gcc/doc/extend.texi by searching "The following built-in functions
are available when hardware decimal floating point".

> A testcase is added for the new built-in definitions.
> 
> gcc/ChangeLog:
>   * config/rs6000/dfp.md: New UNSPEC_DQUAN.
>   (dfp_quan_<mode>, dfp_quan_i<mode>): New define_insn.
>   * config/rs6000/rs6000-builtins.def (__builtin_dfp_quantize_64,
>   __builtin_dfp_quantize_64i, __builtin_dfp_quantize_128,
>   __builtin_dfp_quantize_128i): New buit-in definitions.
>   * config/rs6000/rs6000-overload.def (__builtin_dfp_quantize,
>   __builtin_dfpq_quantize): New overloaded definitions.
> 
> gcc/testsuite/
>* gcc.target/powerpc/builtin-dfp-quantize-runnable.c: New test
>   case.
> ---
>  gcc/config/rs6000/dfp.md  |  25 ++-
>  gcc/config/rs6000/rs6000-builtins.def |  15 ++
>  gcc/config/rs6000/rs6000-overload.def |  12 ++
>  .../powerpc/builtin-dfp-quantize-runnable.c   | 198 ++
>  4 files changed, 249 insertions(+), 1 deletion(-)
>  create mode 100644 
> gcc/testsuite/gcc.target/powerpc/builtin-dfp-quantize-runnable.c
> 
> diff --git a/gcc/config/rs6000/dfp.md b/gcc/config/rs6000/dfp.md
> index 5ed8a73ac51..254c22a5c20 100644
> --- a/gcc/config/rs6000/dfp.md
> +++ b/gcc/config/rs6000/dfp.md
> @@ -271,7 +271,8 @@
> UNSPEC_DIEX
> UNSPEC_DSCLI
> UNSPEC_DTSTSFI
> -   UNSPEC_DSCRI])
> +   UNSPEC_DSCRI
> +   UNSPEC_DQUAN])
>  
>  (define_code_iterator DFP_TEST [eq lt gt unordered])
>  
> @@ -395,3 +396,25 @@
>"dscri<q> %0,%1,%2"
>[(set_attr "type" "dfp")
> (set_attr "size" "<bits>")])
> +
> +(define_insn "dfp_quan_<mode>"

s/quan/dqua/ to align with the others.

> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:DDTD 1 "gpc_reg_operand" "d")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +  (match_operand:QI 3 "immediate_operand" "i")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dqua<q> %0,%1,%2,%3"
> +  [(set_attr "type" "dfp")
> +   (set_attr "size" "<bits>")])
> +
> +(define_insn "dfp_quan_i<mode>"

also s/quan_i/dquai_/

> +  [(set (match_operand:DDTD 0 "gpc_reg_operand" "=d")
> +(unspec:DDTD [(match_operand:SI 1 "const_int_operand" "n")
> +   (match_operand:DDTD 2 "gpc_reg_operand" "d")
> +  (match_operand:SI 3 "immediate_operand" "i")]
> + UNSPEC_DQUAN))]
> +  "TARGET_DFP"
> +  "dquai<q> %1,%0,%2,%3"
> +  [(set_attr "type" "dfp")
> +   (set_attr "size" "<bits>")])
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index 35c4cdf74c5..36a56311643 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -2983,6 +2983,21 @@
>const unsigned long long __builtin_unpack_dec128 

Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi Robin,

on 2023/8/14 16:58, Robin Dapp wrote:
> Hi Kewen,
> 
>> I did a bootstrapping and regression testing on Power10 (LE) and found a lot 
>> of failures.
> 
> I think the problem is that just like for vec_set we're expecting
> the vec_extract expander not to fail.  It is probably passed not a
> const int here anymore and therefore fails to expand?

Thanks for the comments!  Yeah, I think the expectation doesn't hold
on Power, as our vec_extract optab only supports a const index, that
is:

(define_expand "vec_extract<mode>"
  [(match_operand:<VEC_base> 0 "register_operand")
   (match_operand:VEC_E 1 "vlogical_operand")
   (match_operand 2 "const_int_operand")]
  "VECTOR_MEM_ALTIVEC_OR_VSX_P (<MODE>mode)"
{
  rs6000_expand_vector_extract (operands[0], operands[1], operands[2]);
  DONE;
})

> 
> can_vec_extract_var_idx_p is supposed to check if the backend
> supports extracting a variable index.

OK, it sounds like this new capability needs a further check with
function can_vec_extract_var_idx_p to ensure the ifn expansion works
as expected.  I re-spun it by adding the below as per your comments:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 07f3717ed9d..80ba5cae84a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -10328,7 +10328,9 @@ vectorizable_live_operation (vec_info *vinfo, 
stmt_vec_info stmt_info,
   else if (convert_optab_handler (vec_extract_optab,
   TYPE_MODE (vectype),
   TYPE_MODE (TREE_TYPE (vectype)))
-   != CODE_FOR_nothing)
+ != CODE_FOR_nothing
+   && can_vec_extract_var_idx_p (
+ TYPE_MODE (vectype), TYPE_MODE (TREE_TYPE (vectype))))
 vect_record_loop_len (loop_vinfo,
   &LOOP_VINFO_LENS (loop_vinfo),
   1, vectype, 1);

BR,
Kewen


[PATCH] vect: Move VMAT_GATHER_SCATTER handlings from final loop nest

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi,

Following Richi's suggestion [1], this patch moves the
handlings on VMAT_GATHER_SCATTER in the final loop nest
of function vectorizable_load to its own loop.  Basically
it duplicates the final loop nest, cleans up some useless
setup code for the case of VMAT_GATHER_SCATTER, and removes
some unreachable code.  It also removes the corresponding
handlings from the final loop nest.

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html

Is it ok for trunk?

BR,
Kewen
-

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Move the handlings on
VMAT_GATHER_SCATTER in the final loop nest to its own loop,
and update the final nest accordingly.
---
 gcc/tree-vect-stmts.cc | 361 +
 1 file changed, 219 insertions(+), 142 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index c361e16cb7b..5e514eca19b 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -10455,6 +10455,218 @@ vectorizable_load (vec_info *vinfo,
   return true;
 }

+  if (memory_access_type == VMAT_GATHER_SCATTER)
+{
+  gcc_assert (alignment_support_scheme == dr_aligned
+ || alignment_support_scheme == dr_unaligned_supported);
+  gcc_assert (!grouped_load && !slp_perm);
+
+  unsigned int inside_cost = 0, prologue_cost = 0;
+  for (j = 0; j < ncopies; j++)
+   {
+ /* 1. Create the vector or array pointer update chain.  */
+ if (j == 0 && !costing_p)
+   {
+ if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
+slp_node, &gs_info, &dataref_ptr,
+&vec_offsets);
+ else
+   dataref_ptr
+ = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
+ at_loop, offset, &dummy, gsi,
+ &ptr_incr, false, bump);
+   }
+ else if (!costing_p)
+   {
+ gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
+ if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+  gsi, stmt_info, bump);
+   }
+
+ if (mask && !costing_p)
+   vec_mask = vec_masks[j];
+
+ gimple *new_stmt = NULL;
+ for (i = 0; i < vec_num; i++)
+   {
+ tree final_mask = NULL_TREE;
+ tree final_len = NULL_TREE;
+ tree bias = NULL_TREE;
+ if (!costing_p)
+   {
+ if (loop_masks)
+   final_mask
+ = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
+   vec_num * ncopies, vectype,
+   vec_num * j + i);
+ if (vec_mask)
+   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
+  final_mask, vec_mask, gsi);
+
+ if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
+  gsi, stmt_info, bump);
+   }
+
+ /* 2. Create the vector-load in the loop.  */
+ unsigned HOST_WIDE_INT align;
+ if (gs_info.ifn != IFN_LAST)
+   {
+ if (costing_p)
+   {
+ unsigned int cnunits = vect_nunits_for_cost (vectype);
+ inside_cost
+   = record_stmt_cost (cost_vec, cnunits, scalar_load,
+   stmt_info, 0, vect_body);
+ continue;
+   }
+ if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
+   vec_offset = vec_offsets[vec_num * j + i];
+ tree zero = build_zero_cst (vectype);
+ tree scale = size_int (gs_info.scale);
+
+ if (gs_info.ifn == IFN_MASK_LEN_GATHER_LOAD)
+   {
+ if (loop_lens)
+   final_len
+ = vect_get_loop_len (loop_vinfo, gsi, loop_lens,
+  vec_num * ncopies, vectype,
+  vec_num * j + i, 1);
+ else
+   final_len
+ = build_int_cst (sizetype,
+  TYPE_VECTOR_SUBPARTS (vectype));
+ signed char biasval
+   = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);

[PATCH] vect: Move VMAT_LOAD_STORE_LANES handlings from final loop nest

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi,

Following Richi's suggestion [1], this patch moves the
handlings on VMAT_LOAD_STORE_LANES in the final loop nest
of function vectorizable_load to its own loop.  Basically
it duplicates the final loop nest, cleans up some useless
setup code for the case of VMAT_LOAD_STORE_LANES, and removes
some unreachable code.  It also removes the corresponding
handlings from the final loop nest.

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Move the handlings on
VMAT_LOAD_STORE_LANES in the final loop nest to its own loop,
and update the final nest accordingly.
---
 gcc/tree-vect-stmts.cc | 1275 
 1 file changed, 634 insertions(+), 641 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 4f2d088484c..c361e16cb7b 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -10332,7 +10332,129 @@ vectorizable_load (vec_info *vinfo,
vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies, mask,
   &vec_masks, mask_vectype);
 }
+
   tree vec_mask = NULL_TREE;
+  if (memory_access_type == VMAT_LOAD_STORE_LANES)
+{
+  gcc_assert (alignment_support_scheme == dr_aligned
+ || alignment_support_scheme == dr_unaligned_supported);
+  gcc_assert (grouped_load && !slp);
+
+  unsigned int inside_cost = 0, prologue_cost = 0;
+  for (j = 0; j < ncopies; j++)
+   {
+ if (costing_p)
+   {
+ /* An IFN_LOAD_LANES will load all its vector results,
+regardless of which ones we actually need.  Account
+for the cost of unused results.  */
+ if (first_stmt_info == stmt_info)
+   {
+ unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
+ stmt_vec_info next_stmt_info = first_stmt_info;
+ do
+   {
+ gaps -= 1;
+ next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
+   }
+ while (next_stmt_info);
+ if (gaps)
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_NOTE, vect_location,
+"vect_model_load_cost: %d "
+"unused vectors.\n",
+gaps);
+ vect_get_load_cost (vinfo, stmt_info, gaps,
+ alignment_support_scheme,
+ misalignment, false, _cost,
+ _cost, cost_vec, cost_vec,
+ true);
+   }
+   }
+ vect_get_load_cost (vinfo, stmt_info, 1, alignment_support_scheme,
+ misalignment, false, _cost,
+ _cost, cost_vec, cost_vec, true);
+ continue;
+   }
+
+ /* 1. Create the vector or array pointer update chain.  */
+ if (j == 0)
+   dataref_ptr
+ = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
+ at_loop, offset, &dummy, gsi,
+ &ptr_incr, false, bump);
+ else
+   {
+ gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
+ dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
+stmt_info, bump);
+   }
+ if (mask)
+   vec_mask = vec_masks[j];
+
+ tree vec_array = create_vector_array (vectype, vec_num);
+
+ tree final_mask = NULL_TREE;
+ if (loop_masks)
+   final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
+ncopies, vectype, j);
+ if (vec_mask)
+   final_mask = prepare_vec_mask (loop_vinfo, mask_vectype, final_mask,
+  vec_mask, gsi);
+
+ gcall *call;
+ if (final_mask)
+   {
+ /* Emit:
+  VEC_ARRAY = MASK_LOAD_LANES (DATAREF_PTR, ALIAS_PTR,
+   VEC_MASK).  */
+ unsigned int align = TYPE_ALIGN (TREE_TYPE (vectype));
+ tree alias_ptr = build_int_cst (ref_type, align);
+ call = gimple_build_call_internal (IFN_MASK_LOAD_LANES, 3,
+dataref_ptr, alias_ptr,
+final_mask);
+   }
+ else
+   {
+ /* Emit:
+  VEC_ARRAY = LOAD_LANES (MEM_REF[...all 

[PATCH] vect: Remove several useless VMAT_INVARIANT checks

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi,

In function vectorizable_load, there is one hunk dedicated
to the handlings of VMAT_INVARIANT which returns early; it
means we shouldn't encounter any case with memory_access_type
VMAT_INVARIANT in the code that follows.  This patch cleans
up several useless checks on VMAT_INVARIANT.  There should
be no functional changes.

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu and powerpc64{,le}-linux-gnu.

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Remove some useless checks
on VMAT_INVARIANT.
---
 gcc/tree-vect-stmts.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 89607a98f99..d4e781531fd 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -10499,7 +10499,7 @@ vectorizable_load (vec_info *vinfo,
  tree bias = NULL_TREE;
  if (!costing_p)
{
- if (loop_masks && memory_access_type != VMAT_INVARIANT)
+ if (loop_masks)
final_mask
  = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
vec_num * ncopies, vectype,
@@ -10729,7 +10729,7 @@ vectorizable_load (vec_info *vinfo,
bias = build_int_cst (intQI_type_node, biasval);
  }

-   if (final_len && memory_access_type != VMAT_INVARIANT)
+   if (final_len)
  {
tree ptr
  = build_int_cst (ref_type, align * BITS_PER_UNIT);
--
2.31.1


Re: [PATCH V4] VECT: Support loop len control on EXTRACT_LAST vectorization

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi Juzhe,

on 2023/8/14 15:09, juzhe.zh...@rivai.ai wrote:
> Thanks Richi.
> 
> CC kewen to see whether this patch is suitable for powerpc and s390.

I did a bootstrapping and regression testing on Power10 (LE) and found a lot of 
failures.

A short list looks like:

< FAIL: gcc.c-torture/compile/20150108.c   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (internal compiler 
error: in expand_vec_extract_optab_fn,
at internal-fn.cc:3164)
< FAIL: gcc.c-torture/compile/20150108.c   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess 
errors)
< FAIL: gcc.c-torture/compile/20150108.c   -O3 -g  (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.c-torture/compile/20150108.c   -O3 -g  (test for excess errors)
< FAIL: gcc.c-torture/execute/20011126-2.c   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (internal compiler 
error: in expand_vec_extract_optab_fn,
at internal-fn.cc:3164)
< FAIL: gcc.c-torture/execute/20011126-2.c   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess 
errors)
< FAIL: gcc.c-torture/execute/20011126-2.c   -O3 -g  (internal compiler error: 
in expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.c-torture/execute/20011126-2.c   -O3 -g  (test for excess errors)
< FAIL: gcc.c-torture/execute/pr58419.c   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (internal compiler 
error: in expand_vec_extract_optab_fn, at
internal-fn.cc:3164)
< FAIL: gcc.c-torture/execute/pr58419.c   -O3 -fomit-frame-pointer 
-funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess 
errors)
< FAIL: gcc.c-torture/execute/pr58419.c   -O3 -g  (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.c-torture/execute/pr58419.c   -O3 -g  (test for excess errors)
< FAIL: gcc.dg/pr84321.c (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.dg/pr84321.c (test for excess errors)
< FAIL: gcc.dg/torture/pr108793.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.dg/torture/pr108793.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  (test for excess errors)
< FAIL: gcc.dg/torture/pr108793.c   -O3 -g  (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.dg/torture/pr108793.c   -O3 -g  (test for excess errors)
< FAIL: gcc.dg/torture/pr51070-2.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  (internal compiler error: in 
expand_vec_extract_optab_fn, at
internal-fn.cc:3164)
< FAIL: gcc.dg/torture/pr51070-2.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  (test for excess errors)
< FAIL: gcc.dg/torture/pr51070-2.c   -O3 -g  (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.dg/torture/pr51070-2.c   -O3 -g  (test for excess errors)
< FAIL: gcc.dg/torture/pr51070.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)
< FAIL: gcc.dg/torture/pr51070.c   -O3 -fomit-frame-pointer -funroll-loops 
-fpeel-loops -ftracer -finline-functions  (test for excess errors)
< FAIL: gcc.dg/torture/pr51070.c   -O3 -g  (internal compiler error: in 
expand_vec_extract_optab_fn, at internal-fn.cc:3164)


> 
> --
> juzhe.zh...@rivai.ai
> 
>  
> *From:* Richard Biener 
> *Date:* 2023-08-14 14:53
> *To:* Ju-Zhe Zhong 
> *CC:* gcc-patches ; richard.sandiford 
> 
> 

Re: [PATCH 2/3] ivopts: Call valid_mem_ref_p with code_helper [PR110248]

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi,

on 2023/8/14 15:53, Jan-Benedict Glaw wrote:
> On Fri, 2023-06-30 13:46:40 +0800, Kewen.Lin via Gcc-patches 
>  wrote:
>> Bootstrapped and regtested on x86_64-redhat-linux and
>> powerpc64{,le}-linux-gnu.
>>
>> Is it ok for trunk?
> [...]
> 
>> diff --git a/gcc/recog.h b/gcc/recog.h
>> index badf8e3dc1c..c6ef619c5dd 100644
>> --- a/gcc/recog.h
>> +++ b/gcc/recog.h
>> @@ -20,6 +20,9 @@ along with GCC; see the file COPYING3.  If not see
>>  #ifndef GCC_RECOG_H
>>  #define GCC_RECOG_H
>>
>> +/* For enum tree_code ERROR_MARK.  */
>> +#include "tree.h"
>> +
>>  /* Random number that should be large enough for all purposes.  Also define
>> a type that has at least MAX_RECOG_ALTERNATIVES + 1 bits, with the extra
>> bit giving an invalid value that can be used to mean "uninitialized".  */
> 
> This part breaks for me (up-to-date amd64-linux host, cf. for example
> http://toolchain.lug-owl.de/laminar/jobs/gcc-local/82):
> 
> configure '--with-pkgversion=basepoints/gcc-14-3093-g4a8e6fa8016, built 
> at 1691996332'\
>   --prefix=/var/lib/laminar/run/gcc-local/82/toolchain-install
> \
>   --enable-werror-always  
> \
>   --enable-languages=all  
> \
>   --disable-multilib
> make V=1 all-gcc
> 
> echo timestamp > s-preds-h
> TARGET_CPU_DEFAULT="" \
> HEADERS="config/i386/i386-d.h" DEFINES="" \
> /bin/bash ../../gcc/gcc/mkconfig.sh tm_d.h
> /var/lib/laminar/run/gcc-local/82/local-toolchain-install/bin/g++ -std=c++11 
> -c   -g -O2   -DIN_GCC-fno-exceptions -fno-rtti 
> -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings 
> -Wcast-qual -Wmissing-format-attribute -Wconditionally-supported 
> -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros 
> -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H  
> -DGENERATOR_FILE -I. -Ibuild -I../../gcc/gcc -I../../gcc/gcc/build 
> -I../../gcc/gcc/../include  -I../../gcc/gcc/../libcpp/include  \
>  -o build/genflags.o ../../gcc/gcc/genflags.cc
> /var/lib/laminar/run/gcc-local/82/local-toolchain-install/bin/g++ -std=c++11  
>  -g -O2   -DIN_GCC-fno-exceptions -fno-rtti -fasynchronous-unwind-tables 
> -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual 
> -Wmissing-format-attribute -Wconditionally-supported -Woverloaded-virtual 
> -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror 
> -fno-common  -DHAVE_CONFIG_H  -DGENERATOR_FILE -static-libstdc++ 
> -static-libgcc  -o build/genflags \
> build/genflags.o build/rtl.o build/read-rtl.o build/ggc-none.o 
> build/vec.o build/min-insn-modes.o build/gensupport.o build/print-rtl.o 
> build/hash-table.o build/sort.o build/read-md.o build/errors.o 
> ../build-x86_64-pc-linux-gnu/libiberty/libiberty.a
> /var/lib/laminar/run/gcc-local/82/local-toolchain-install/bin/g++ -std=c++11 
> -c   -g -O2   -DIN_GCC-fno-exceptions -fno-rtti 
> -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings 
> -Wcast-qual -Wmissing-format-attribute -Wconditionally-supported 
> -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros 
> -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H  
> -DGENERATOR_FILE -I. -Ibuild -I../../gcc/gcc -I../../gcc/gcc/build 
> -I../../gcc/gcc/../include  -I../../gcc/gcc/../libcpp/include  \
>  -o build/genconditions.o ../../gcc/gcc/genconditions.cc
> /var/lib/laminar/run/gcc-local/82/local-toolchain-install/bin/g++ -std=c++11  
>  -g -O2   -DIN_GCC -fno-exceptions -fno-rtti -fasynchronous-unwind-tables 
> -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual 
> -Wmissing-format-attribute -Wconditionally-supported -Woverloaded-virtual 
> -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror 
> -fno-common  -DHAVE_CONFIG_H  -DGENERATOR_FILE -static-libstdc++ 
> -static-libgcc  -o build/genconditions \
> build/genconditions.o build/rtl.o build/read-rtl.o build/ggc-none.o 
> build/vec.o build/min-insn-modes.o build/gensupport.o build/print-rtl.o 
> build/hash-table.o build/sort.o build/read-md.o build/errors.o 
> ../build-x86_64-pc-linux-gnu/libiberty/libiberty.a
> build/genconditions ../../gcc/gcc/common.md ../../gcc/gcc/config/i386/i386.md 
> > tmp-condmd.cc
> /bin/bash ../../gcc/gcc/../move-if-change tmp-condmd.cc build/gencondmd.cc
> echo timestamp > s-conditions
> build/genpreds -c ../../gcc/gcc/common.md ../../gcc/gcc/config/i386/i386.md > 
> tmp-constrs.h
> /bin/bash ../../gcc/gcc/../move-if-change tmp-constrs.h tm-cons

Re: [PATCHv4, rs6000] Generate mfvsrwz for all subtargets and remove redundant zero extend [PR106769]

2023-08-14 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/8/14 10:18, HAO CHEN GUI wrote:
> Hi,
>   This patch modifies vsx extract expand and generates mfvsrwz/stxsiwx
> for all subtargets when the mode is V4SI and the extracted element is word
> 1 from BE order.  Also this patch adds an insn pattern for mfvsrwz which
> helps eliminate redundant zero extends.
> 
>   Compared to the last version, the main change is to put the word index
> checking in the split condition of "*vsx_extract_v4si_w023". Also modified
> some comments.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625380.html
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Generate mfvsrwz for all platforms and remove redundant zero extend
> 
> mfvsrwz has lower latency than xxextractuw or vextuw[lr]x.  So it should be
> generated even with p9 vector enabled.  Also the instruction is already
> zero extended.  A combine pattern is needed to eliminate redundant zero
> extend instructions.
> 
> gcc/
>   PR target/106769
>   * config/rs6000/vsx.md (expand vsx_extract_): Set it only
>   for V8HI and V16QI.
>   (vsx_extract_v4si): New expand for V4SI extraction.
>   (vsx_extract_v4si_w1): New insn pattern for V4SI extraction on
>   word 1 from BE order.   
>   (*mfvsrwz): New insn pattern for mfvsrwz.
>   (*vsx_extract__di_p9): Assert that it won't be generated on
>   word 1 from BE order.
>   (*vsx_extract_si): Remove.
>   (*vsx_extract_v4si_w023): New insn and split pattern on word 0, 2,
>   3 from BE order.
> 
> gcc/testsuite/
>   PR target/106769
>   * gcc.target/powerpc/pr106769.h: New.
>   * gcc.target/powerpc/pr106769-p8.c: New.
>   * gcc.target/powerpc/pr106769-p9.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 0a34ceebeb5..1cbdc2f1c01 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3722,9 +3722,9 @@ (define_insn "vsx_xxpermdi2__1"
>  (define_expand  "vsx_extract_"
>[(parallel [(set (match_operand: 0 "gpc_reg_operand")
>  (vec_select:
> - (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand")
> + (match_operand:VSX_EXTRACT_I2 1 "gpc_reg_operand")
>   (parallel [(match_operand:QI 2 "const_int_operand")])))
> -   (clobber (match_scratch:VSX_EXTRACT_I 3))])]
> +   (clobber (match_scratch:VSX_EXTRACT_I2 3))])]
>"VECTOR_MEM_VSX_P (mode) && TARGET_DIRECT_MOVE_64BIT"
>  {
>/* If we have ISA 3.0, we can do a xxextractuw/vextractu{b,h}.  */
> @@ -3736,6 +3736,63 @@ (define_expand  "vsx_extract_"
>  }
>  })
> 
> +(define_expand  "vsx_extract_v4si"
> +  [(parallel [(set (match_operand:SI 0 "gpc_reg_operand")
> +(vec_select:SI
> + (match_operand:V4SI 1 "gpc_reg_operand")
> + (parallel [(match_operand:QI 2 "const_0_to_3_operand")])))
> +   (clobber (match_scratch:V4SI 3))])]
> +  "TARGET_DIRECT_MOVE_64BIT"
> +{
> +  /* The word 1 (BE order) can be extracted by mfvsrwz/stxsiwx.  So just
> + fall through to vsx_extract_v4si_w1.  */
> +  if (TARGET_P9_VECTOR
> +  && INTVAL (operands[2]) != (BYTES_BIG_ENDIAN ? 1 : 2))
> +{
> +  emit_insn (gen_vsx_extract_v4si_p9 (operands[0], operands[1],
> +   operands[2]));
> +  DONE;
> +}
> +})
> +
> +/* Extract from word 1 (BE order);  */

Nit: I guess I requested this before; please use ";" instead of
"/* ... */" for the comments, to align with the existing ones.

> +(define_insn "vsx_extract_v4si_w1"
> +  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,wa,Z,wa")
> + (vec_select:SI
> +  (match_operand:V4SI 1 "gpc_reg_operand" "v,v,v,0")
> +  (parallel [(match_operand:QI 2 "const_0_to_3_operand" "n,n,n,n")])))
> +   (clobber (match_scratch:V4SI 3 "=v,v,v,v"))]
> +  "TARGET_DIRECT_MOVE_64BIT
> +   && INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 1 : 2)"
> +{
> +   if (which_alternative == 0)
> + return "mfvsrwz %0,%x1";
> +
> +   if (which_alternative == 1)
> + return "xxlor %x0,%x1,%x1";
> +
> +   if (which_alternative == 2)
> + return "stxsiwx %x1,%y0";
> +
> +   return ASM_COMMENT_START " vec_extract to same register";
> +}
> +  [(set_attr "type" "mfvsr,veclogical,fpstore,*")
> +   (set_attr "length" "4,4,4,0")
> +   (set_attr "isa" "p8v,*,p8v,*")])
> +
> +(define_insn "*mfvsrwz"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (zero_extend:DI
> +   (vec_select:SI
> + (match_operand:V4SI 1 "vsx_register_operand" "wa")
> + (parallel [(match_operand:QI 2 "const_int_operand" "n")]
> +   (clobber (match_scratch:V4SI 3 "=v"))]
> +  "TARGET_DIRECT_MOVE_64BIT
> +   && INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 1 : 2)"
> +  "mfvsrwz %0,%x1"
> +  [(set_attr "type" "mfvsr")
> +   (set_attr "isa" "p8v")])
> +
>  (define_insn "vsx_extract__p9"
>[(set (match_operand: 0 

Re: [PATCH] rs6000: Fix issue in specifying PTImode as an attribute [PR106895]

2023-08-09 Thread Kewen.Lin via Gcc-patches
Hi,

on 2023/7/20 12:35, jeevitha via Gcc-patches wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.
> 
> When the user specifies PTImode as an attribute, compilation breaks.  This
> patch creates a tree node to handle PTImode types.  The PTImode attribute
> helps in generating even/odd register pairs for 128-bit values.
> 
> 2023-07-20  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/110411
>   * config/rs6000/rs6000.h (enum rs6000_builtin_type_index): Add fields
>   to hold PTImode type.
>   * config/rs6000/rs6000-builtin.cc (rs6000_init_builtins): Add node
>   for PTImode type.
> 
> gcc/testsuite/
>   PR target/106895
>   * gcc.target/powerpc/pr106895.c: New testcase.
> 
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
> b/gcc/config/rs6000/rs6000-builtin.cc
> index a8f291c6a72..ca00c3b0d4c 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -756,6 +756,15 @@ rs6000_init_builtins (void)
>else
>  ieee128_float_type_node = NULL_TREE;
>  
> +  /* PTImode to get even/odd register pairs.  */
> +  intPTI_type_internal_node = make_node(INTEGER_TYPE);
> +  TYPE_PRECISION (intPTI_type_internal_node) = GET_MODE_BITSIZE (PTImode);
> +  layout_type (intPTI_type_internal_node);
> +  SET_TYPE_MODE (intPTI_type_internal_node, PTImode);
> +  t = build_qualified_type (intPTI_type_internal_node, TYPE_QUAL_CONST);
> +  lang_hooks.types.register_builtin_type (intPTI_type_internal_node,
> +   "__int128pti");

IIUC, this builtin type registering makes this type exposed to users, so
I wonder if we want to actually expose this type for users' use.
If yes, we need to update the documentation (and not sure if the current
name is good enough); otherwise, I wonder if there is some existing
practice to declare a builtin type with a name which users can't actually
use and is just for shadowing a mode.

BR,
Kewen

> +
>/* Vector pair and vector quad support.  */
>vector_pair_type_node = make_node (OPAQUE_TYPE);
>SET_TYPE_MODE (vector_pair_type_node, OOmode);
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 3503614efbd..0456bf56d17 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -2303,6 +2303,7 @@ enum rs6000_builtin_type_index
>RS6000_BTI_ptr_vector_quad,
>RS6000_BTI_ptr_long_long,
>RS6000_BTI_ptr_long_long_unsigned,
> +  RS6000_BTI_PTI,
>RS6000_BTI_MAX
>  };
>  
> @@ -2347,6 +2348,7 @@ enum rs6000_builtin_type_index
>  #define uintDI_type_internal_node 
> (rs6000_builtin_types[RS6000_BTI_UINTDI])
>  #define intTI_type_internal_node  
> (rs6000_builtin_types[RS6000_BTI_INTTI])
>  #define uintTI_type_internal_node 
> (rs6000_builtin_types[RS6000_BTI_UINTTI])
> +#define intPTI_type_internal_node (rs6000_builtin_types[RS6000_BTI_PTI])
>  #define float_type_internal_node  
> (rs6000_builtin_types[RS6000_BTI_float])
>  #define double_type_internal_node 
> (rs6000_builtin_types[RS6000_BTI_double])
>  #define long_double_type_internal_node
> (rs6000_builtin_types[RS6000_BTI_long_double])
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr106895.c 
> b/gcc/testsuite/gcc.target/powerpc/pr106895.c
> new file mode 100644
> index 000..04630fe1df5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr106895.c
> @@ -0,0 +1,15 @@
> +/* PR target/106895 */
> +/* { dg-require-effective-target int128 } */
> +/* { dg-options "-O2" } */
> +
> +/* Verify the following generates even/odd register pairs.  */
> +
> +typedef __int128 pti __attribute__((mode(PTI)));
> +
> +void
> +set128 (pti val, pti *mem)
> +{
> +asm("stq %1,%0" : "=m"(*mem) : "r"(val));
> +}
> +
> +/* { dg-final { scan-assembler "stq 10,0\\(5\\)" } } */
> 
>


Re: [PATCH ver 3] rs6000: Fix __builtin_altivec_vcmpne{b,h,w} implementation

2023-08-09 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/8/8 01:50, Carl Love wrote:
> 
> GCC maintainers:
> 
> Ver 3: Updated description to make it clear the patch fixes the
> confusion on the availability of the builtins.  Fixed the dg-require-
> effective-target on the test cases and the dg-options.  Changed the test
> case so the for loop for the test will not be unrolled.  Fixed a
> spelling error in a vec-cmpne.c comment.  Retested on Power 10LE.
> 
> Ver 2:  Re-worked the test vec-cmpne.c to create a compile only test
> verify the instruction generation and a runnable test to verify the
> built-in functionality.  Retested the patch on Power 8 LE/BE, Power
> 9LE/BE and Power 10 LE with no regressions.
> 
> The following patch cleans up the definition for the
> __builtin_altivec_vcmpne{b,h,w}.  The current implementation implies
> that the built-in is only supported on Power 9 since it is defined
> under the Power 9 stanza.  However the built-in has no ISA restrictions
> as stated in the Power Vector Intrinsic Programming Reference document.
> The current built-in works because the built-in gets replaced during
> GIMPLE folding by a simple not-equal operator so it doesn't get
> expanded and checked for Power 9 code generation.
> 
> This patch moves the definition to the Altivec stanza in the built-in
> definition file to make it clear the built-ins are valid for Power 8,
> Power 9 and beyond.  
> 
> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
> LE with no regressions.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> 
> 
> rs6000: Fix __builtin_altivec_vcmpne{b,h,w} implementation
> 
> The current built-in definitions for vcmpneb, vcmpneh and vcmpnew are placed
> under the Power 9 section of rs6000-builtins.  This implies they are only
> supported on Power 9 and above, when in fact they are defined for and work
> with Altivec as well, with the appropriate Altivec instruction generation.
> 
> The vec_cmpne builtin should generate the vcmpequ{b,h,w} instruction with
> Altivec enabled and generate the vcmpne{b,h,w} on Power 9 and newer
> processors.
> 
> This patch moves the definitions to the Altivec stanza to make it clear
> the built-ins are supported for all Altivec processors.  The patch
> removes the confusion as to which processors support the vcmpne{b,h,w}
> built-ins.
> 
> There is existing test coverage for the vec_cmpne built-in for
> vector bool char, vector bool short, vector bool int,
> vector bool long long in builtins-3-p9.c and p8vector-builtin-2.c.
> Coverage for vector signed int, vector unsigned int is in
> p8vector-builtin-2.c.
> 
> Test vec-cmpne.c is updated to check the generation of the vcmpequ{b,h,w}
> instructions for Altivec.  A new test vec-cmpne-runnable.c is added to
> verify the built-ins work as expected.
> 
> Patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE
> with no regressions.

Okay for trunk with two nits below fixed, thanks!

> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000-builtins.def (vcmpneb, vcmpneh, vcmpnew):
>   Move definitions to Altivec stanza.
>   * config/rs6000/altivec.md (vcmpneb, vcmpneh, vcmpnew): New
>   define_expand.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/vec-cmpne-runnable.c: New execution test.
>   * gcc.target/powerpc/vec-cmpne.c (define_test_functions,
>   execute_test_functions) moved to vec-cmpne.h.  Added
>   scan-assembler-times for vcmpequb, vcmpequh, vcmpequw.

s/ moved/: Move/ => "... execute_test_functions): Move "

s/Added/Add/

>   * gcc.target/powerpc/vec-cmpne.h: New include file for vec-cmpne.c
>   and vec-cmpne-runnable.c. Split define_test_functions definition
>   into define_test_functions and define_init_verify_functions.
> ---
>  gcc/config/rs6000/altivec.md  |  12 ++
>  gcc/config/rs6000/rs6000-builtins.def |  18 +--
>  .../gcc.target/powerpc/vec-cmpne-runnable.c   |  36 ++
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.c  | 112 ++
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.h  |  90 ++
>  5 files changed, 156 insertions(+), 112 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmpne-runnable.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmpne.h
> 
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index ad1224e0b57..31f65aa1b7a 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -2631,6 +2631,18 @@ (define_insn "altivec_vcmpequt_p"
>"vcmpequq. %0,%1,%2"
>[(set_attr "type" "veccmpfx")])
>  
> +;; Expand for builtin vcmpne{b,h,w}
> +(define_expand "altivec_vcmpne_"
> +  [(set (match_operand:VSX_EXTRACT_I 3 "altivec_register_operand" "=v")
> + (eq:VSX_EXTRACT_I (match_operand:VSX_EXTRACT_I 1 
> "altivec_register_operand" "v")
> +   

Re: [PATCH 1/3] targhooks: Extend legitimate_address_p with code_helper [PR110248]

2023-08-07 Thread Kewen.Lin via Gcc-patches
Hi Richi,

on 2023/6/30 17:13, Kewen.Lin via Gcc-patches wrote:
> Hi Richi,
> 
> Thanks for your review!
> 
> on 2023/6/30 16:56, Richard Biener wrote:
>> On Fri, Jun 30, 2023 at 7:38 AM Kewen.Lin  wrote:
>>>
>>> Hi,
>>>
>>> As PR110248 shows, some middle-end passes like IVOPTs can
>>> query the target hook legitimate_address_p with some
>>> artificially constructed rtx to determine whether some
>>> addressing modes are supported by target for some gimple
>>> statement.  But for now the existing legitimate_address_p
>>> only checks the given mode, it's unable to distinguish
>>> some special cases unfortunately, for example, for LEN_LOAD
>>> ifn on Power port, we would expand it with lxvl hardware
>>> insn, which only supports one register to hold the address
>>> (the other register is holding the length), that is we
>>> don't support base (reg) + index (reg) addressing mode for
>>> sure.  But hook legitimate_address_p only considers the
>>> given mode which would be some vector mode for LEN_LOAD
>>> ifn, and we do support base + index addressing mode for
>>> normal vector load and store insns, so the hook will return
>>> true for the query unexpectedly.
>>>
>>> This patch is to introduce one extra argument of type
>>> code_helper for hook legitimate_address_p; it makes targets
>>> able to handle some special case like what's described
>>> above.  The subsequent patches will show how to leverage
>>> the new code_helper argument.
>>>
>>> I didn't separate those target specific adjustments to
>>> their own patches, since those changes are no function
>>> changes.  One typical change is to add one unnamed argument
>>> with default ERROR_MARK, some ports need to include tree.h
>>> in their {port}-protos.h since the hook is used in some
>>> machine description files.  I've cross-built a corresponding
>>> cc1 successfully for at least one triple of each affected
>>> target so I believe they are safe.  But feel free to correct
>>> me if separating is needed for the review of this patch.
>>>
>>> Besides, it's bootstrapped and regtested on
>>> x86_64-redhat-linux and powerpc64{,le}-linux-gnu.
>>>
>>> Is it ok for trunk?
>>
>> Is defaulting the arguments in the targets necessary for
>> the middle-end or only for direct uses in the targets?
> 
> It's only for the direct uses in target codes, the call
> sites in generic code of these hooks would use the given
> code_helper type variable or an explicit ERROR_MARK, they
> don't require target codes to set that.
> 
>>
>> It looks OK in general but please give others some time to
>> comment.

Some weeks passed and no further comments are received, I guess
this still looks good to you?

BR,
Kewen

> 
> OK, thanks again!
> 
> BR,
> Kewen
> 
>>
>> Thanks,
>> Richard.
>>
>>> BR,
>>> Kewen
>>> --
>>> PR tree-optimization/110248
>>>
>>> gcc/ChangeLog:
>>>
>>> * coretypes.h (class code_helper): Add forward declaration.
>>> * doc/tm.texi: Regenerate.
>>> * lra-constraints.cc (valid_address_p): Call target hook
>>> targetm.addr_space.legitimate_address_p with an extra parameter
>>> ERROR_MARK as its prototype changes.
>>> * recog.cc (memory_address_addr_space_p): Likewise.
>>> * reload.cc (strict_memory_address_addr_space_p): Likewise.
>>> * target.def (legitimate_address_p, 
>>> addr_space.legitimate_address_p):
>>> Extend with one more argument of type code_helper, update the
>>> documentation accordingly.
>>> * targhooks.cc (default_legitimate_address_p): Adjust for the
>>> new code_helper argument.
>>> (default_addr_space_legitimate_address_p): Likewise.
>>> * targhooks.h (default_legitimate_address_p): Likewise.
>>> (default_addr_space_legitimate_address_p): Likewise.
>>> * config/aarch64/aarch64.cc (aarch64_legitimate_address_hook_p): 
>>> Adjust
>>> with extra unnamed code_helper argument with default ERROR_MARK.
>>> * config/alpha/alpha.cc (alpha_legitimate_address_p): Likewise.
>>> * config/arc/arc.cc (arc_legitimate_address_p): Likewise.
>>> * config/arm/arm-protos.h (arm_legitimate_address_p): Likewise.
>>> (tree.h): New include for tree_code ERROR_MARK.
&g

PING^2 [PATCH v2] rs6000: Don't use optimize_function_for_speed_p too early [PR108184]

2023-08-07 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609993.html

BR,
Kewen


> on 2023/1/16 17:08, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> As Honza pointed out in [1], the current uses of function
>> optimize_function_for_speed_p in rs6000_option_override_internal
>> are too early, since the query results from the functions
>> optimize_function_for_{speed,size}_p could be changed later due
>> to profile feedback and some function attributes handlings etc.
>>
>> This patch is to move optimize_function_for_speed_p to all the
>> use places of the corresponding flags, which follows the existing
>> practices.  Maybe we can cache it somewhere at an appropriate
>> timing, but that's another thing.
>>
>> Comparing with v1[2], this version added one test case for
>> SAVE_TOC_INDIRECT as Segher questioned and suggested, and it
>> also considered the possibility of explicit option (see test
>> cases pr108184-2.c and pr108184-4.c).  I believe that except
>> for the intentional change on optimize_function_for_{speed,
>> size}_p, there is no other functional change.
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607527.html
>> [2] https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609379.html
>>
>> Bootstrapped and regtested on powerpc64-linux-gnu P8,
>> powerpc64le-linux-gnu P{9,10} and powerpc-ibm-aix.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>> gcc/ChangeLog:
>>
>>  * config/rs6000/rs6000.cc (rs6000_option_override_internal): Remove
>>  all optimize_function_for_speed_p uses.
>>  (fusion_gpr_load_p): Call optimize_function_for_speed_p along
>>  with TARGET_P8_FUSION_SIGN.
>>  (expand_fusion_gpr_load): Likewise.
>>  (rs6000_call_aix): Call optimize_function_for_speed_p along with
>>  TARGET_SAVE_TOC_INDIRECT.
>>  * config/rs6000/predicates.md (fusion_gpr_mem_load): Call
>>  optimize_function_for_speed_p along with TARGET_P8_FUSION_SIGN.
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.target/powerpc/pr108184-1.c: New test.
>>  * gcc.target/powerpc/pr108184-2.c: New test.
>>  * gcc.target/powerpc/pr108184-3.c: New test.
>>  * gcc.target/powerpc/pr108184-4.c: New test.
>> ---
>>  gcc/config/rs6000/predicates.md   |  5 +++-
>>  gcc/config/rs6000/rs6000.cc   | 19 +-
>>  gcc/testsuite/gcc.target/powerpc/pr108184-1.c | 16 
>>  gcc/testsuite/gcc.target/powerpc/pr108184-2.c | 15 +++
>>  gcc/testsuite/gcc.target/powerpc/pr108184-3.c | 25 +++
>>  gcc/testsuite/gcc.target/powerpc/pr108184-4.c | 24 ++
>>  6 files changed, 97 insertions(+), 7 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-1.c
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-2.c
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-3.c
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108184-4.c
>>
>> diff --git a/gcc/config/rs6000/predicates.md 
>> b/gcc/config/rs6000/predicates.md
>> index a1764018545..9f84468db84 100644
>> --- a/gcc/config/rs6000/predicates.md
>> +++ b/gcc/config/rs6000/predicates.md
>> @@ -1878,7 +1878,10 @@ (define_predicate "fusion_gpr_mem_load"
>>
>>/* Handle sign/zero extend.  */
>>if (GET_CODE (op) == ZERO_EXTEND
>> -  || (TARGET_P8_FUSION_SIGN && GET_CODE (op) == SIGN_EXTEND))
>> +  || (TARGET_P8_FUSION_SIGN
>> +  && GET_CODE (op) == SIGN_EXTEND
>> +  && (rs6000_isa_flags_explicit & OPTION_MASK_P8_FUSION_SIGN
>> +  || optimize_function_for_speed_p (cfun
>>  {
>>op = XEXP (op, 0);
>>mode = GET_MODE (op);
>> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
>> index 6ac3adcec6b..f47d21980a9 100644
>> --- a/gcc/config/rs6000/rs6000.cc
>> +++ b/gcc/config/rs6000/rs6000.cc
>> @@ -3997,8 +3997,7 @@ rs6000_option_override_internal (bool global_init_p)
>>/* If we can shrink-wrap the TOC register save separately, then use
>>   -msave-toc-indirect unless explicitly disabled.  */
>>if ((rs6000_isa_flags_explicit & OPTION_MASK_SAVE_TOC_INDIRECT) == 0
>> -  && flag_shrink_wrap_separate
>> -  && optimize_function_for_speed_p (cfun))
>> +  && flag_shrink_wrap_separate)
>>  rs6000_isa_flags |= OPTION_MASK_SAVE_TOC_INDIRECT;
>>
>>/* Enable power8 fusion if we

PING^3 [PATCH v2] sched: Change no_real_insns_p to no_real_nondebug_insns_p [PR108273]

2023-08-07 Thread Kewen.Lin via Gcc-patches
Hi,

I'd like to gentle ping this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2023-March/614818.html

BR,
Kewen

>> on 2023/3/29 15:18, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> By addressing Alexander's comments, against v1 this
>>> patch v2 mainly:
>>>
>>>   - Rename no_real_insns_p to no_real_nondebug_insns_p;
>>>   - Introduce enum rgn_bb_deps_free_action for three
>>> kinds of actions to free deps;
>>>   - Change function free_deps_for_bb_no_real_insns_p to
>>> resolve_forw_deps which only focuses on forward deps;
>>>   - Extend the handlings to cover dbg-cnt sched_block,
>>> add one test case for it;
>>>   - Move free_trg_info call in schedule_region to an
>>> appropriate place.
>>>
>>> One thing I'm not sure about is the change in function
>>> sched_rgn_local_finish, currently the invocation to
>>> sched_rgn_local_free is guarded with !sel_sched_p (),
>>> so I just follow it, but the initialization of those
>>> structures (in sched_rgn_local_init) isn't guarded
>>> with !sel_sched_p (), it looks odd.
>>>
>>> 
>>>
>>> As PR108273 shows, when there is one block which only has
>>> NOTE_P and LABEL_P insns at non-debug mode while having some
>>> extra DEBUG_INSN_P insns at debug mode, after scheduling
>>> it, the DFA states would be different between debug mode
>>> and non-debug mode.  Since at non-debug mode, the block
>>> meets no_real_insns_p, it gets skipped; while at debug
>>> mode, it gets scheduled; even though it only has NOTE_P, LABEL_P
>>> and DEBUG_INSN_P, the call of function advance_one_cycle
>>> will change the DFA state.  PR108519 also shows this issue
>>> can be exposed by some scheduler changes.
>>>
>>> This patch is to change function no_real_insns_p to
>>> function no_real_nondebug_insns_p by taking debug insn into
>>> account, which make us not try to schedule for the block
>>> having only NOTE_P, LABEL_P and DEBUG_INSN_P insns,
>>> resulting in consistent DFA states between non-debug and
>>> debug mode.
>>>
>>> Changing no_real_insns_p to no_real_nondebug_insns_p caused an
>>> ICE when doing free_block_dependencies; the root cause is
>>> that we create dependencies for debug insns, and those
>>> dependencies are expected to be resolved during scheduling
>>> insns, but that gets skipped after this change.
>>> By checking the code, it looks reasonable to skip computing
>>> block dependences for no_real_nondebug_insns_p
>>> blocks.  There is also another issue, exposed in SPEC2017
>>> bmks builds at option -O2 -g: we could skip scheduling some
>>> block which already gets its dependency graph built, so has
>>> dependencies computed and rgn_n_insns accumulated; then the
>>> later verification of whether the graph becomes exhausted by
>>> scheduling would fail as follows:
>>>
>>>   /* Sanity check: verify that all region insns were
>>>  scheduled.  */
>>> gcc_assert (sched_rgn_n_insns == rgn_n_insns);
>>>
>>> , and also some forward deps aren't resolved.
>>>
>>> As Alexander pointed out, the current debug count handling
>>> also suffers the similar issue, so this patch handles these
>>> two cases together: one is for some block gets skipped by
>>> !dbg_cnt (sched_block), the other is for some block which
>>> is not no_real_nondebug_insns_p initially but becomes
>>> no_real_nondebug_insns_p due to speculative scheduling.
>>>
>>> This patch can be bootstrapped and regress-tested on
>>> x86_64-redhat-linux, aarch64-linux-gnu and
>>> powerpc64{,le}-linux-gnu.
>>>
>>> I also verified this patch can pass SPEC2017 both intrate
>>> and fprate bmks building at -g -O2/-O3.
>>>
>>> Any thoughts?
>>>
>>> BR,
>>> Kewen
>>> 
>>> PR rtl-optimization/108273
>>>
>>> gcc/ChangeLog:
>>>
>>> * haifa-sched.cc (no_real_insns_p): Rename to ...
>>> (no_real_nondebug_insns_p): ... this, and consider DEBUG_INSN_P insn.
>>> * sched-ebb.cc (schedule_ebb): Replace no_real_insns_p with
>>> no_real_nondebug_insns_p.
>>> * sched-int.h (no_real_insns_p): Rename to ...
>>> (no_real_nondebug_insns_p): ... this.
>>> * sched-rgn.cc (enum rgn_bb_deps_free_action): New enum.
>>> (bb_deps_

PING^4 [PATCH 0/9] rs6000: Rework rs6000_emit_vector_compare

2023-08-07 Thread Kewen.Lin via Gcc-patches
Hi,

Gentle ping this series:

https://gcc.gnu.org/pipermail/gcc-patches/2022-November/607146.html

BR,
Kewen

>>> on 2022/11/24 17:15, Kewen Lin wrote:
 Hi,

 Following Segher's suggestion, this patch series is to rework
 function rs6000_emit_vector_compare for vector float and int
 in multiple steps, it's based on the previous attempts [1][2].
 As mentioned in [1], the need to rework this for float is to
 make a centralized place for vector float comparison handling
 instead of supporting it dispersedly with swapped ops, reversed
 codes etc.  It's also for a subsequent patch to handle
 comparison operators with or without trapping math (PR105480).
 With the handling on vector float reworked, we can further
 simplify the handling on vector int, as shown.

 For Segher's concern about whether this rework causes any
 assembly change, I constructed two testcases for vector float[3]
 and int[4] respectively before, it showed the most are fine
 excepting for the difference on LE and UNGT, it's demonstrated
 as improvement since it uses GE instead of GT ior EQ.  The
 associated test case in patch 3/9 is a good example.

 Besides, w/ and w/o the whole patch series, I built the whole
 SPEC2017 at options -O3 and -Ofast separately, checked the
 differences on object assembly.  The result showed that the
 most are unchanged, except for:

   * at -O3, 521.wrf_r has 9 object files and 526.blender_r has
 9 object files with differences.

   * at -Ofast, 521.wrf_r has 12 object files, 526.blender_r has
 one and 527.cam4_r has 4 object files with differences.

 By looking into these differences, all significant differences
 are caused by the known improvement mentioned above transforming
 GT ior EQ to GE, which can also affect unrolling decision due
 to insn count.  Some other trivial differences are branch
 target offset difference, nop difference for alignment, vsx
 register number differences etc.

 I also evaluated the runtime performance for these changed
 benchmarks; the result is neutral.

 These patches are bootstrapped and regress-tested
 incrementally on powerpc64-linux-gnu P7 & P8, and
 powerpc64le-linux-gnu P9 & P10.

 Is it ok for trunk?

 BR,
 Kewen
 -
 [1] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606375.html
 [2] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606376.html
 [3] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606504.html
 [4] https://gcc.gnu.org/pipermail/gcc-patches/2022-November/606506.html

 Kewen Lin (9):
   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - p1
   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - p2
   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - p3
   rs6000: Rework vector float comparison in rs6000_emit_vector_compare - p4
   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare - 
 p1
   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare - 
 p2
   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare - 
 p3
   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare - 
 p4
   rs6000: Rework vector integer comparison in rs6000_emit_vector_compare - 
 p5

  gcc/config/rs6000/rs6000.cc | 180 ++--
  gcc/testsuite/gcc.target/powerpc/vcond-fp.c |  25 +++
  2 files changed, 74 insertions(+), 131 deletions(-)
  create mode 100644 gcc/testsuite/gcc.target/powerpc/vcond-fp.c



Re: [PATCH V2] rs6000: Don't allow AltiVec address in movoo & movxo pattern [PR110411]

2023-08-07 Thread Kewen.Lin via Gcc-patches
Hi Jeevitha,

on 2023/7/20 00:46, jeevitha wrote:
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.
> 
> There are no instructions that do traditional AltiVec addresses (i.e.
> with the low four bits of the address masked off) for OOmode and XOmode
> objects. The solution is to modify the constraints used in the movoo and
> movxo pattern to disallow these types of addresses, which assists LRA in
> resolving this issue. Furthermore, the mode size 16 check has been
> removed in vsx_quad_dform_memory_operand to allow OOmode and

Excepting for the nits Peter caught, one minor nit: "... to allow
OOmode and XOmode, and ..."

This patch looks quite reasonable: it makes mov[ox]o use the same
memory constraints as the underlying vsx paired load/store insns.
 
It's okay for trunk with those nits tweaked, also okay for all affected
release branches after burn-in time, thanks!

BR,
Kewen

> quad_address_p already handles less than size 16.
> 
> 2023-07-19  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/110411
>   * config/rs6000/mma.md (define_insn_and_split movoo): Disallow
>   AltiVec address in movoo and movxo pattern.
>   (define_insn_and_split movxo): Likewise.
>   *config/rs6000/predicates.md (vsx_quad_dform_memory_operand):Remove
>   redundant mode size check.
> 
> gcc/testsuite/
>   PR target/110411
>   * gcc.target/powerpc/pr110411-1.c: New testcase.
>   * gcc.target/powerpc/pr110411-2.c: New testcase.
> 
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index d36dc13872b..575751d477e 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -293,8 +293,8 @@
>  })
>  
>  (define_insn_and_split "*movoo"
> -  [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,m,wa")
> - (match_operand:OO 1 "input_operand" "m,wa,wa"))]
> +  [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
> + (match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
>"TARGET_MMA
> && (gpc_reg_operand (operands[0], OOmode)
> || gpc_reg_operand (operands[1], OOmode))"
> @@ -340,8 +340,8 @@
>  })
>  
>  (define_insn_and_split "*movxo"
> -  [(set (match_operand:XO 0 "nonimmediate_operand" "=d,m,d")
> - (match_operand:XO 1 "input_operand" "m,d,d"))]
> +  [(set (match_operand:XO 0 "nonimmediate_operand" "=d,ZwO,d")
> + (match_operand:XO 1 "input_operand" "ZwO,d,d"))]
>"TARGET_MMA
> && (gpc_reg_operand (operands[0], XOmode)
> || gpc_reg_operand (operands[1], XOmode))"
> diff --git a/gcc/config/rs6000/predicates.md b/gcc/config/rs6000/predicates.md
> index 3552d908e9d..925f69cd3fc 100644
> --- a/gcc/config/rs6000/predicates.md
> +++ b/gcc/config/rs6000/predicates.md
> @@ -924,7 +924,7 @@
>  (define_predicate "vsx_quad_dform_memory_operand"
>(match_code "mem")
>  {
> -  if (!TARGET_P9_VECTOR || GET_MODE_SIZE (mode) != 16)
> +  if (!TARGET_P9_VECTOR)
>  return false;
>  
>return quad_address_p (XEXP (op, 0), mode, false);
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr110411-1.c b/gcc/testsuite/gcc.target/powerpc/pr110411-1.c
> new file mode 100644
> index 000..f42e9388d65
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr110411-1.c
> @@ -0,0 +1,22 @@
> +/* PR target/110411 */
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power10 -mblock-ops-vector-pair" } */
> +
> +/* Verify we do not ICE on the following.  */
> +
> +#include 
> +
> +struct s {
> +  long a;
> +  long b;
> +  long c;
> +  long d: 1;
> +};
> +unsigned long ptr;
> +
> +void
> +bug (struct s *dst)
> +{
> +  struct s *src = (struct s *)(ptr & ~0xFUL);
> +  memcpy (dst, src, sizeof(struct s));
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr110411-2.c b/gcc/testsuite/gcc.target/powerpc/pr110411-2.c
> new file mode 100644
> index 000..c2046fb9855
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr110411-2.c
> @@ -0,0 +1,12 @@
> +/* PR target/110411 */
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power10" } */
> +
> +/* Verify we do not ICE on the following.  */
> +
> +void
> +bug (__vector_quad *dst)
> +{
> +  dst = (__vector_quad *)((unsigned long)dst & ~0xFUL);
> +  __builtin_mma_xxsetaccz (dst);
> +}
> 
> 


Re: [PATCH v2] rs6000: Fix __builtin_altivec_vcmpne{b,h,w} implementation

2023-08-07 Thread Kewen.Lin via Gcc-patches
Hi Carl,

Sorry for the late review.

on 2023/8/2 02:29, Carl Love wrote:
> 
> GCC maintainers:
> 
> Ver 2:  Re-worked the test vec-cmpne.c to create a compile only test
> verify the instruction generation and a runnable test to verify the
> built-in functionality.  Retested the patch on Power 8 LE/BE, Power 9LE/BE 
> and Power 10 LE with no regressions.
> 
> The following patch cleans up the definition for the
> __builtin_altivec_vcmpne{b,h,w}.  The current implementation implies
> that the built-in is only supported on Power 9 since it is defined
> under the Power 9 stanza.  However the built-in has no ISA restrictions
> as stated in the Power Vector Intrinsic Programming Reference document.
> The current built-in works because the built-in gets replaced during
> GIMPLE folding by a simple not-equal operator so it doesn't get
> expanded and checked for Power 9 code generation.
> 
> This patch moves the definition to the Altivec stanza in the built-in
> definition file to make it clear the built-ins are valid for Power 8,
> Power 9 and beyond.  
> 
> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
> LE with no regressions.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> 
> rs6000: Fix __builtin_altivec_vcmpne{b,h,w} implementation
> 
> The current built-in definitions for vcmpneb, vcmpneh, vcmpnew are defined
> under the Power 9 section of r66000-builtins.  This implies they are only
> supported on Power 9 and above when in fact they are defined and work with
> Altivec as well with the appropriate Altivec instruction generation.
> 
> The vec_cmpne builtin should generate the vcmpequ{b,h,w} instruction with
> Altivec enabled and generate the vcmpne{b,h,w} on Power 9 and newer
> processors.
> 
> This patch moves the definitions to the Altivec stanza to make it clear
> the built-ins are supported for all Altivec processors.  The patch
> enables the vcmpequ{b,h,w} instruction to be generated on Altivec and
> the vcmpne{b,h,w} instruction to be generated on Power 9 and beyond.

But as you noted above, the current built-ins work as expected, that is,
they generate vcmpequ{b,h,w} on AltiVec without Power9 while generating
vcmpne{b,h,w} on Power9.  So I think we shouldn't say it's enabled
by this patch.  Instead it's to remove the confusion.

> 
> There is existing test coverage for the vec_cmpne built-in for
> vector bool char, vector bool short, vector bool int,
> vector bool long long in builtins-3-p9.c and p8vector-builtin-2.c.
> Coverage for vector signed int, vector unsigned int is in
> p8vector-builtin-2.c.
> 
> Test vec-cmpne.c is updated to check the generation of the vcmpequ{b,h,w}
> instructions for Altivec.  A new test vec-cmpne-runnable.c is added to
> verify the built-ins work as expected.
> 
> Patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE
> with no regressions.
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000-builtins.def (vcmpneb, vcmpneh, vcmpnew):
>   Move definitions to Altivec stanza.
>   * config/rs6000/altivec.md (vcmpneb, vcmpneh, vcmpnew): New
>   define_expand.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/vec-cmpne-runnable.c: New execution test.
>   * gcc.target/powerpc/vec-cmpne.c (define_test_functions,
>   execute_test_functions) moved to vec-cmpne.h.  Added
>   scan-assembler-times for vcmpequb, vcmpequh, vcmpequw.
>   * gcc.target/powerpc/vec-cmpne.h: New include file for vec-cmpne.c
>   and vec-cmpne-runnable.c. Split define_test_functions definition
>   into define_test_functions and define_init_verify_functions.
> ---
>  gcc/config/rs6000/altivec.md  |  12 ++
>  gcc/config/rs6000/rs6000-builtins.def |  18 +--
>  .../gcc.target/powerpc/vec-cmpne-runnable.c   |  36 ++
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.c  | 110 ++
>  gcc/testsuite/gcc.target/powerpc/vec-cmpne.h  |  86 ++
>  5 files changed, 151 insertions(+), 111 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmpne-runnable.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vec-cmpne.h
> 
> diff --git a/gcc/config/rs6000/altivec.md b/gcc/config/rs6000/altivec.md
> index ad1224e0b57..31f65aa1b7a 100644
> --- a/gcc/config/rs6000/altivec.md
> +++ b/gcc/config/rs6000/altivec.md
> @@ -2631,6 +2631,18 @@ (define_insn "altivec_vcmpequt_p"
>"vcmpequq. %0,%1,%2"
>[(set_attr "type" "veccmpfx")])
>  
> +;; Expand for builtin vcmpne{b,h,w}
> +(define_expand "altivec_vcmpne_"
> +  [(set (match_operand:VSX_EXTRACT_I 3 "altivec_register_operand" "=v")
> + (eq:VSX_EXTRACT_I (match_operand:VSX_EXTRACT_I 1 "altivec_register_operand" "v")
> +   (match_operand:VSX_EXTRACT_I 2 "altivec_register_operand" "v")))
> +   (set (match_operand:VSX_EXTRACT_I 0 "altivec_register_operand" "=v")
> +

Re: [PATCH] rs6000: Fix __builtin_altivec_vcmpne{b,h,w} implementation

2023-07-31 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/28 23:00, Carl Love wrote:
> GCC maintainers:
> 
> The following patch cleans up the definition for the
> __builtin_altivec_vcmpnet.  The current implementation implies that the

s/__builtin_altivec_vcmpnet/__builtin_altivec_vcmpne[bhw]/

> built-in is only supported on Power 9 since it is defined under the
> Power 9 stanza.  However the built-in has no ISA restrictions as stated
> in the Power Vector Intrinsic Programming Reference document. The
> current built-in works because the built-in gets replaced during GIMPLE
> folding by a simple not-equal operator so it doesn't get expanded and
> checked for Power 9 code generation.
> 
> This patch moves the definition to the Altivec stanza in the built-in
> definition file to make it clear the built-ins are valid for Power 8,
> Power 9 and beyond.  
> 
> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
> LE with no regressions.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>   Carl 
> 
> --
> rs6000: Fix __builtin_altivec_vcmpne{b,h,w} implementation
> 
> The current built-in definitions for vcmpneb, vcmpneh, vcmpnew are defined
> under the Power 9 section of r66000-builtins.  This implies they are only
> supported on Power 9 and above when in fact they are defined and work on
> Power 8 as well with the appropriate Power 8 instruction generation.

Nit: It's confusing to say Power8 only, it's actually supported once altivec
is enabled, so I think it's more clear to replace Power8 with altivec here.

> 
> The vec_cmpne builtin should generate the vcmpequ{b,h,w} instruction on
> Power 8 and generate the vcmpne{b,h,w} on Power 9 an newer processors.

Ditto for Power8 and "an" -> "and"?

> 
> This patch moves the definitions to the Altivec stanza to make it clear
> the built-ins are supported for all Altivec processors.  The patch
> enables the vcmpequ{b,h,w} instruction to be generated on Power 8 and
> the vcmpne{b,h,w} instruction to be generated on Power 9 and beyond.

Ditto for Power8.

> 
> There is existing test coverage for the vec_cmpne built-in for
> vector bool char, vector bool short, vector bool int,
> vector bool long long in builtins-3-p9.c and p8vector-builtin-2.c.
> Coverage for vector signed int, vector unsigned int is in
> p8vector-builtin-2.c.

So there is no coverage with the basic altivec support.  I noticed
we have one test case "gcc/testsuite/gcc.target/powerpc/vec-cmpne.c"
which is a running test case but guarded with vsx_ok, I think we can
rewrite it with altivec (vmx), either separating it into compiling and
running cases, or adding -save-temps and checking the expected insns.

> Coverage for unsigned long long int and long long int
> for Power 10 in int_128bit-runnable.c.
> 
> Patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE
> with no regressions.
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000-builtins.def (vcmpneb, vcmpneh, vcmpnew.
>   vcmpnet): Move definitions to Altivec stanza.

vcmpnet which isn't handled in this patch should be removed.

The others look good to me, thanks!

BR,
Kewen


Re: [PATCHv2, rs6000] Generate mfvsrwz for all subtargets and remove redundant zero extend [PR106769]

2023-07-30 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/7/25 10:10, HAO CHEN GUI wrote:
> Hi,
>   This patch modifies vsx extract expand and generates mfvsrwz/stxsiwx
> for all subtargets when the mode is V4SI and the index of extracted element
> is 1 for BE and 2 for LE. Also this patch adds a insn pattern for mfvsrwz
> which helps eliminate redundant zero extend.
> 
>   Compared to last version, the main change is to move "vsx_extract_v4si_w1"
> and "*mfvsrwz" to the front of "*vsx_extract__di_p9". Also some insn
> conditions are changed to assertions.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625128.html

Since the previous one is v2, this is actually v3. ;-)

> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Generate mfvsrwz for all platform and remove redundant zero extend
> 
> mfvsrwz has lower latency than xxextractuw or vextuw[lr]x.  So it should be
> generated even with p9 vector enabled.  Also the instruction is already
> zero extended.  A combine pattern is needed to eliminate redundant zero
> extend instructions.
> 
> gcc/
>   PR target/106769
>   * config/rs6000/vsx.md (expand vsx_extract_): Set it only
>   for V8HI and V16QI.
>   (vsx_extract_v4si): New expand for V4SI extraction.
>   (vsx_extract_v4si_w1): New insn pattern for V4SI extraction
>   when the index of extracted element is 1 with BE and 2 with LE.

Nit: Maybe better to match the name with " ... for V4SI extraction on
word 1 from BE order."

>   (*mfvsrwz): New insn pattern.
>   (*vsx_extract__di_p9): Not generate the insn when the index
>   of extracted element is 1 with BE and 2 with LE.
>   (*vsx_extract_si): Removed.

Nit: s/Removed/Remove/

>   (*vsx_extract_v4si_not_w1): New insn and split pattern which deals
>   with the cases not handled by vsx_extract_v4si_w1.
> 
> gcc/testsuite/
>   PR target/106769
>   * gcc.target/powerpc/pr106769.h: New.
>   * gcc.target/powerpc/pr106769-p8.c: New.
>   * gcc.target/powerpc/pr106769-p9.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 0a34ceebeb5..0065b76fef8 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3722,9 +3722,9 @@ (define_insn "vsx_xxpermdi2__1"
>  (define_expand  "vsx_extract_"
>[(parallel [(set (match_operand: 0 "gpc_reg_operand")
>  (vec_select:
> - (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand")
> + (match_operand:VSX_EXTRACT_I2 1 "gpc_reg_operand")
>   (parallel [(match_operand:QI 2 "const_int_operand")])))
> -   (clobber (match_scratch:VSX_EXTRACT_I 3))])]
> +   (clobber (match_scratch:VSX_EXTRACT_I2 3))])]
>"VECTOR_MEM_VSX_P (mode) && TARGET_DIRECT_MOVE_64BIT"
>  {
>/* If we have ISA 3.0, we can do a xxextractuw/vextractu{b,h}.  */
> @@ -3736,6 +3736,63 @@ (define_expand  "vsx_extract_"
>  }
>  })
> 
> +(define_expand  "vsx_extract_v4si"
> +  [(parallel [(set (match_operand:SI 0 "gpc_reg_operand")
> +(vec_select:SI
> + (match_operand:V4SI 1 "gpc_reg_operand")
> + (parallel [(match_operand:QI 2 "const_0_to_3_operand")])))
> +   (clobber (match_scratch:V4SI 3))])]
> +  "TARGET_DIRECT_MOVE_64BIT"
> +{
> +  /* The word 1 (BE order) can be extracted by mfvsrwz/stxsiwx.  So just
> + fall through to vsx_extract_v4si_w1.  */
> +  if (TARGET_P9_VECTOR
> +  && INTVAL (operands[2]) != (BYTES_BIG_ENDIAN ? 1 : 2))
> +{
> +  emit_insn (gen_vsx_extract_v4si_p9 (operands[0], operands[1],
> +   operands[2]));
> +  DONE;
> +}
> +})
> +
> +/* Extract from word 1 (BE order).  */

Nit: Use semicolon ";" for comments to keep consistent with the others
and what the doc says.

> +(define_insn "vsx_extract_v4si_w1"
> +  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,wa,Z,wa")
> + (vec_select:SI
> +  (match_operand:V4SI 1 "gpc_reg_operand" "v,v,v,0")
> +  (parallel [(match_operand:QI 2 "const_0_to_3_operand" "n,n,n,n")])))
> +   (clobber (match_scratch:V4SI 3 "=v,v,v,v"))]
> +  "TARGET_DIRECT_MOVE_64BIT
> +   && INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 1 : 2)"
> +{
> +   if (which_alternative == 0)
> + return "mfvsrwz %0,%x1";
> +
> +   if (which_alternative == 1)
> + return "xxlor %x0,%x1,%x1";
> +
> +   if (which_alternative == 2)
> + return "stxsiwx %x1,%y0";
> +
> +   return ASM_COMMENT_START " vec_extract to same register";
> +}
> +  [(set_attr "type" "mfvsr,veclogical,fpstore,*")
> +   (set_attr "length" "4,4,4,0")
> +   (set_attr "isa" "p8v,*,p8v,*")])
> +
> +(define_insn "*mfvsrwz"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (zero_extend:DI
> +   (vec_select:SI
> + (match_operand:V4SI 1 "vsx_register_operand" "wa")
> + (parallel [(match_operand:QI 2 "const_int_operand" "n")]
> +   (clobber 

Re: [PATCH, rs6000] Skip redundant vector extract if the element is first element of dword0 [PR110429]

2023-07-28 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/7/5 11:22, HAO CHEN GUI wrote:
> Hi,
>   This patch skips redundant vector extract insn to be generated when
> the extracted element is the first element of dword0 and the destination

"The first element" is confusing, it's easy to be misunderstood as element
0, but in fact the extracted element index is: 
  - for byte, 7 on BE while 8 on LE;
  - for half word, 3 on BE while 4 on LE;

so maybe just say when the extracted index for byte and half word like above,
the element to be stored is already in the corresponding place for stxsi[hb]x,
we don't need a redundant vector extraction at all.

> is a memory operand. Only one 'stxsi[hb]x' instruction is enough.
> 
>   The V4SImode is fixed in a previous patch.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622101.html
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> Thanks
> Gui Haochen
> 
> ChangeLog
> rs6000: Skip redundant vector extract if the element is first element of
> dword0
> 
> gcc/
>   PR target/110429
>   * config/rs6000/vsx.md (*vsx_extract__store_p9): Skip vector
>   extract when the element is the first element of dword0.
> 
> gcc/testsuite/
>   PR target/110429
>   * gcc.target/powerpc/pr110429.c: New.
> 
> 
> patch.diff
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 0c269e4e8d9..b3fec910eb6 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3855,7 +3855,22 @@ (define_insn_and_split "*vsx_extract__store_p9"
>   (parallel [(match_dup 2)])))
> (clobber (match_dup 4))])
> (set (match_dup 0)
> - (match_dup 3))])
> + (match_dup 3))]
> +{
> +  enum machine_mode dest_mode = GET_MODE (operands[0]);

Nit: Move this line ...

> +
> +  if (which_alternative == 0
> +  && ((mode == V16QImode
> +&& INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 7 : 8))
> +   || (mode == V8HImode
> +   && INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 3 : 4
> +{

... here.

> +  emit_move_insn (operands[0],
> +   gen_rtx_REG (dest_mode, REGNO (operands[3])));
> +  DONE;
> +}
> +})
> +
> 
>  (define_insn_and_split  "*vsx_extract_si"
>[(set (match_operand:SI 0 "nonimmediate_operand" "=r,wa,Z")
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr110429.c b/gcc/testsuite/gcc.target/powerpc/pr110429.c
> new file mode 100644
> index 000..5a938f9f90a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr110429.c
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-skip-if "" { powerpc*-*-darwin* } } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-options "-mdejagnu-cpu=power9 -O2" } */
> +/* { dg-require-effective-target has_arch_ppc64 } */
> +
> +#include 
> +
> +#ifdef __BIG_ENDIAN__
> +#define DWORD0_FIRST_SHORT 3
> +#define DWORD0_FIRST_CHAR 7
> +#else
> +#define DWORD0_FIRST_SHORT 4
> +#define DWORD0_FIRST_CHAR 8
> +#endif
> +
> +void vec_extract_short (vector short v, short* p)
> +{
> +   *p = vec_extract(v, DWORD0_FIRST_SHORT);
> +}
> +
> +void vec_extract_char (vector char v, char* p)
> +{
> +   *p = vec_extract(v, DWORD0_FIRST_CHAR);
> +}
> +
> +/* { dg-final { scan-assembler-times "stxsi\[hb\]x" 2 } } */

Nit: Break this check into stxsihx and stxsibx, and surround
with \m and \M.

> +/* { dg-final { scan-assembler-not "vextractu\[hb\]" } } */

Also with \m and \M.

OK for trunk with these nits tweaked and testing goes well,
thanks!

BR,
Kewen


Re: [PATCH] Optimize vec_splats of vec_extract for V2DI/V2DF (PR target/99293)

2023-07-28 Thread Kewen.Lin via Gcc-patches
Hi Mike,

on 2023/7/11 03:50, Michael Meissner wrote:
> This patch optimizes cases like:
> 
>   vector double v1, v2;
>   /* ... */
>   v2 = vec_splats (vec_extract (v1, 0);   /* or  */
>   v2 = vec_splats (vec_extract (v1, 1);
> 
> Previously:
> 
>   vector long long
>   splat_dup_l_0 (vector long long v)
>   {
> return __builtin_vec_splats (__builtin_vec_extract (v, 0));
>   }
> 
> would generate:
> 
> mfvsrld 9,34
> mtvsrdd 34,9,9
> blr
> 
> With this patch, GCC generates:
> 
> xxpermdi 34,34,34,3
>   blr
> > 2023-07-10  Michael Meissner  
> 
> gcc/
> 
>   PR target/99293
>   * gcc/config/rs6000/vsx.md (vsx_splat_extract_): New combiner
>   insn.
> 
> gcc/testsuite/
> 
>   PR target/108958
>   * gcc.target/powerpc/pr99293.c: New test.
>   * gcc.target/powerpc/builtins-1.c: Update insn count.
> ---
>  gcc/config/rs6000/vsx.md  | 18 ++
>  gcc/testsuite/gcc.target/powerpc/builtins-1.c |  2 +-
>  gcc/testsuite/gcc.target/powerpc/pr99293.c| 55 +++
>  3 files changed, 74 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/pr99293.c
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 0c269e4e8d9..d34c3b21abe 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -4600,6 +4600,24 @@ (define_insn "vsx_splat__mem"
>"lxvdsx %x0,%y1"
>[(set_attr "type" "vecload")])
>  
> +;; Optimize SPLAT of an extract from a V2DF/V2DI vector with a constant element
> +(define_insn "*vsx_splat_extract_"
> +  [(set (match_operand:VSX_D 0 "vsx_register_operand" "=wa")
> + (vec_duplicate:VSX_D
> +  (vec_select:
> +   (match_operand:VSX_D 1 "vsx_register_operand" "wa")
> +   (parallel [(match_operand 2 "const_0_to_1_operand" "n")]]
> +  "VECTOR_MEM_VSX_P (mode)"
> +{
> +  int which_word = INTVAL (operands[2]);
> +  if (!BYTES_BIG_ENDIAN)
> +which_word = 1 - which_word;
> +
> +  operands[3] = GEN_INT (which_word ? 3 : 0);
> +  return "xxpermdi %x0,%x1,%x1,%3";
> +}
> +  [(set_attr "type" "vecperm")])
> +
>  ;; V4SI splat support
>  (define_insn "vsx_splat_v4si"
>[(set (match_operand:V4SI 0 "vsx_register_operand" "=wa,wa")
> diff --git a/gcc/testsuite/gcc.target/powerpc/builtins-1.c b/gcc/testsuite/gcc.target/powerpc/builtins-1.c
> index 28cd1aa6b1a..98783668bce 100644
> --- a/gcc/testsuite/gcc.target/powerpc/builtins-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/builtins-1.c
> @@ -1035,4 +1035,4 @@ foo156 (vector unsigned short usa)
>  /* { dg-final { scan-assembler-times {\mvmrglb\M} 3 } } */
>  /* { dg-final { scan-assembler-times {\mvmrgew\M} 4 } } */
>  /* { dg-final { scan-assembler-times {\mvsplth|xxsplth\M} 4 } } */
> -/* { dg-final { scan-assembler-times {\mxxpermdi\M} 44 } } */
> +/* { dg-final { scan-assembler-times {\mxxpermdi\M} 42 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr99293.c b/gcc/testsuite/gcc.target/powerpc/pr99293.c
> new file mode 100644
> index 000..e5f44bd7346
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr99293.c
> @@ -0,0 +1,55 @@
> +/* { dg-require-effective-target powerpc_p8vector_ok } */
> +/* { dg-options "-O2 -mpower8-vector" } */

Nit: IMHO -mdejagnu-cpu=power8 is preferred over -mpower8-vector, which is
considered a workaround option, and we plan to make it go away.

> +
> +/* Test for PR 99263, which wants to do:
> + __builtin_vec_splats (__builtin_vec_extract (v, n))

Nit: Maybe remove all "__builtin_" prefixes since vec_splats and vec_extract
are defined in PVIPR without __builtin_.

This is also applied for the others below.

> +
> +   where v is a V2DF or V2DI vector and n is either 0 or 1.  Previously the GCC
> +   compiler would do a direct move to the GPR registers to select the item and a
> +   direct move from the GPR registers to do the splat.
> +
> +   Before the patch, splat_dup_ll_0 or splat_dup_dbl_0 below would generate:
> +
> +mfvsrld 9,34
> +mtvsrdd 34,9,9
> +blr
> +
> +   and now it generates:
> +
> +xxpermdi 34,34,34,3
> +blr  */
> +
> +#include 
> +
> +vector long long
> +splat_dup_ll_0 (vector long long v)
> +{
> +  /* xxpermdi 34,34,34,3 */
> +  return __builtin_vec_splats (vec_extract (v, 0));
> +}
> +
> +vector double
> +splat_dup_dbl_0 (vector double v)
> +{
> +  /* xxpermdi 34,34,34,3 */
> +  return __builtin_vec_splats (vec_extract (v, 0));
> +}
> +
> +vector long long
> +splat_dup_ll_1 (vector long long v)
> +{
> +  /* xxpermdi 34,34,34,0 */
> +  return __builtin_vec_splats (vec_extract (v, 1));
> +}
> +
> +vector double
> +splat_dup_dbl_1 (vector double v)
> +{
> +  /* xxpermdi 34,34,34,0 */
> +  return __builtin_vec_splats (vec_extract (v, 1));
> +}
> +
> +/* { dg-final { scan-assembler-times "xxpermdi" 4 } } */

Nit: It's good to add \m..\M like the others, i.e.

   /* { dg-final { scan-assembler-times {\mxxpermdi\M} 4 

Re: [PATCH] vect: Treat VMAT_ELEMENTWISE as scalar load in costing [PR110776]

2023-07-26 Thread Kewen.Lin via Gcc-patches
on 2023/7/26 18:02, Richard Biener wrote:
> On Wed, Jul 26, 2023 at 4:52 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> PR110776 exposes one issue: we could query unaligned
>> vector load support for a vector type even though no unaligned
>> vector load is supported there.  The reason is that the costed
>> load is with a single-lane vector type and its memory access type
>> is VMAT_ELEMENTWISE; we actually take it as a scalar load and
>> set its alignment_support_scheme to dr_unaligned_supported.
>>
>> To avoid the ICE as exposed, following Rich's suggestion,
>> this patch is to make VMAT_ELEMENTWISE be costed as scalar
>> load.
>>
>> Bootstrapped and regress-tested on x86_64-redhat-linux,
>> powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9/P10.
>>
>> Is it ok for trunk?
> 
> OK.

Thanks Richi, pushed as r14-2813.

BR,
Kewen


Re: [PATCH] Fix typo in insn name.

2023-07-25 Thread Kewen.Lin via Gcc-patches
Hi Mike,

on 2023/7/11 03:59, Michael Meissner wrote:
> In doing other work, I noticed that there was an insn:
> 
>   vsx_extract_v4sf__load
> 
> Which did not have an iterator.  I removed the useless .

It actually has a mode iterator; the "P" iterator is used for the clobber.

The whole pattern of this define_insn_and_split is

(define_insn_and_split "*vsx_extract_v4sf__load"
  [(set (match_operand:SF 0 "register_operand" "=f,v,v,?r")
(vec_select:SF
 (match_operand:V4SF 1 "memory_operand" "m,Z,m,m")
 (parallel [(match_operand:QI 2 "const_0_to_3_operand" "n,n,n,n")])))
   (clobber (match_scratch:P 3 "=,,,"))] <== *P used here*

Its definition is:

(define_mode_iterator P [(SI "TARGET_32BIT") (DI "TARGET_64BIT")])

I guess we can just leave it there?

BR,
Kewen

> 
> I have tested this patch on the following systems and there was no degration.
> Can I check it into the trunk branch?
> 
> * Power10, LE, --with-cpu=power10, IBM 128-bit long double
> * Power9,  LE, --with-cpu=power9,  IBM 128-bit long double
> * Power9,  LE, --with-cpu=power9,  IEEE 128-bit long double
> *   Power9,  LE, --with-cpu=power9,  64-bit default long double
> * Power9,  BE, --with-cpu=power9,  IBM 128-bit long double
> * Power8,  BE, --with-cpu=power8,  IBM 128-bit long double
> 
> 2023-07-10  Michael Meissner  
> 
> gcc/
> 
>   * config/rs6000/vsx.md (vsx_extract_v4sf_load): Rename from
>   vsx_extract_v4sf__load.
> ---
>  gcc/config/rs6000/vsx.md | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index d34c3b21abe..aed450e31ec 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3576,7 +3576,7 @@ (define_insn_and_split "vsx_extract_v4sf"
>[(set_attr "length" "8")
> (set_attr "type" "fp")])
>  
> -(define_insn_and_split "*vsx_extract_v4sf__load"
> +(define_insn_and_split "*vsx_extract_v4sf_load"
>[(set (match_operand:SF 0 "register_operand" "=f,v,v,?r")
>   (vec_select:SF
>(match_operand:V4SF 1 "memory_operand" "m,Z,m,m")
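As an aside, a sketch of how such a mode iterator expands; this is illustrative machine-description pseudocode, not the real pattern:

```
;; A mode iterator expands one define_insn into one pattern per listed
;; mode, each gated by the condition attached to that mode:
(define_mode_iterator P [(SI "TARGET_32BIT") (DI "TARGET_64BIT")])

;; A clobber written as (match_scratch:P 3 ...) therefore becomes an
;; SImode scratch on 32-bit targets and a DImode scratch on 64-bit
;; targets, even when the iterator never appears in the insn name.
```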





[PATCH] rs6000: Correct vsx operands output for xxeval [PR110741]

2023-07-25 Thread Kewen.Lin via Gcc-patches
Hi,

PR110741 exposes one issue: we didn't use the correct
character for vsx operands in output operand substitution;
consequently they can map to the wrong registers, which hold
some unexpected values.

Bootstrapped and regress-tested on powerpc64-linux-gnu
P7/P8/P9 and powerpc64le-linux-gnu P9/P10.

I'll push this soon and backport to release branches after
a week or so.

BR,
Kewen
-
PR target/110741

gcc/ChangeLog:

* config/rs6000/vsx.md (define_insn xxeval): Correct vsx
operands output with "x".

gcc/testsuite/ChangeLog:

* g++.target/powerpc/pr110741.C: New test.
---
 gcc/config/rs6000/vsx.md|   2 +-
 gcc/testsuite/g++.target/powerpc/pr110741.C | 552 
 2 files changed, 553 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.target/powerpc/pr110741.C

diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 0c269e4e8d9..1a87f1c0b63 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -6586,7 +6586,7 @@ (define_insn "xxeval"
  (match_operand:QI 4 "u8bit_cint_operand" "n")]
 UNSPEC_XXEVAL))]
"TARGET_POWER10"
-   "xxeval %0,%1,%2,%3,%4"
+   "xxeval %x0,%x1,%x2,%x3,%4"
[(set_attr "type" "vecperm")
 (set_attr "prefixed" "yes")])

diff --git a/gcc/testsuite/g++.target/powerpc/pr110741.C b/gcc/testsuite/g++.target/powerpc/pr110741.C
new file mode 100644
index 000..0214936b06d
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/pr110741.C
@@ -0,0 +1,552 @@
+/* { dg-do run { target { power10_hw } } } */
+/* { dg-options "-O2 -mdejagnu-cpu=power10" } */
+
+#include 
+
+typedef unsigned char uint8_t;
+
+template 
+static inline vector unsigned long long
+VSXTernaryLogic (vector unsigned long long a, vector unsigned long long b,
+vector unsigned long long c)
+{
+  return vec_ternarylogic (a, b, c, kTernLogOp);
+}
+
+static vector unsigned long long
+VSXTernaryLogic (vector unsigned long long a, vector unsigned long long b,
+vector unsigned long long c, int ternary_logic_op)
+{
+  switch (ternary_logic_op & 0xFF)
+{
+case 0:
+  return VSXTernaryLogic<0> (a, b, c);
+case 1:
+  return VSXTernaryLogic<1> (a, b, c);
+case 2:
+  return VSXTernaryLogic<2> (a, b, c);
+case 3:
+  return VSXTernaryLogic<3> (a, b, c);
+case 4:
+  return VSXTernaryLogic<4> (a, b, c);
+case 5:
+  return VSXTernaryLogic<5> (a, b, c);
+case 6:
+  return VSXTernaryLogic<6> (a, b, c);
+case 7:
+  return VSXTernaryLogic<7> (a, b, c);
+case 8:
+  return VSXTernaryLogic<8> (a, b, c);
+case 9:
+  return VSXTernaryLogic<9> (a, b, c);
+case 10:
+  return VSXTernaryLogic<10> (a, b, c);
+case 11:
+  return VSXTernaryLogic<11> (a, b, c);
+case 12:
+  return VSXTernaryLogic<12> (a, b, c);
+case 13:
+  return VSXTernaryLogic<13> (a, b, c);
+case 14:
+  return VSXTernaryLogic<14> (a, b, c);
+case 15:
+  return VSXTernaryLogic<15> (a, b, c);
+case 16:
+  return VSXTernaryLogic<16> (a, b, c);
+case 17:
+  return VSXTernaryLogic<17> (a, b, c);
+case 18:
+  return VSXTernaryLogic<18> (a, b, c);
+case 19:
+  return VSXTernaryLogic<19> (a, b, c);
+case 20:
+  return VSXTernaryLogic<20> (a, b, c);
+case 21:
+  return VSXTernaryLogic<21> (a, b, c);
+case 22:
+  return VSXTernaryLogic<22> (a, b, c);
+case 23:
+  return VSXTernaryLogic<23> (a, b, c);
+case 24:
+  return VSXTernaryLogic<24> (a, b, c);
+case 25:
+  return VSXTernaryLogic<25> (a, b, c);
+case 26:
+  return VSXTernaryLogic<26> (a, b, c);
+case 27:
+  return VSXTernaryLogic<27> (a, b, c);
+case 28:
+  return VSXTernaryLogic<28> (a, b, c);
+case 29:
+  return VSXTernaryLogic<29> (a, b, c);
+case 30:
+  return VSXTernaryLogic<30> (a, b, c);
+case 31:
+  return VSXTernaryLogic<31> (a, b, c);
+case 32:
+  return VSXTernaryLogic<32> (a, b, c);
+case 33:
+  return VSXTernaryLogic<33> (a, b, c);
+case 34:
+  return VSXTernaryLogic<34> (a, b, c);
+case 35:
+  return VSXTernaryLogic<35> (a, b, c);
+case 36:
+  return VSXTernaryLogic<36> (a, b, c);
+case 37:
+  return VSXTernaryLogic<37> (a, b, c);
+case 38:
+  return VSXTernaryLogic<38> (a, b, c);
+case 39:
+  return VSXTernaryLogic<39> (a, b, c);
+case 40:
+  return VSXTernaryLogic<40> (a, b, c);
+case 41:
+  return VSXTernaryLogic<41> (a, b, c);
+case 42:
+  return VSXTernaryLogic<42> (a, b, c);
+case 43:
+  return VSXTernaryLogic<43> (a, b, c);
+case 44:
+  return VSXTernaryLogic<44> (a, b, c);
+case 45:
+  return VSXTernaryLogic<45> (a, b, c);
+case 46:
+  return VSXTernaryLogic<46> (a, b, c);
+case 47:
+  return VSXTernaryLogic<47> (a, b, c);
+case 48:
+  
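The truncated fragment above is the body of a switch that maps a runtime immediate onto a compile-time template argument.  A minimal sketch of that dispatch pattern, where ternary_logic<N> is a hypothetical stand-in for VSXTernaryLogic<N> (the arithmetic is illustrative only):

```cpp
#include <cassert>

// Hypothetical stand-in for VSXTernaryLogic<N>: any operation whose
// behavior is fixed by the compile-time constant N.  The arithmetic
// below is illustrative only, not the real ternary-logic semantics.
template <int N>
int ternary_logic (int a, int b, int c)
{
  return (a & b & c) ^ N;
}

// Runtime value -> compile-time template argument, as in the switch above.
int dispatch (int n, int a, int b, int c)
{
  switch (n)
    {
    case 4: return ternary_logic<4> (a, b, c);
    case 5: return ternary_logic<5> (a, b, c);
    /* ... one case per supported immediate ... */
    default: return 0;
    }
}
```

Since a template parameter must be a constant expression, every supported runtime value needs its own case, which is why the original code spells out all the immediates one by one.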

[PATCH] vect: Treat VMAT_ELEMENTWISE as scalar load in costing [PR110776]

2023-07-25 Thread Kewen.Lin via Gcc-patches
Hi,

PR110776 exposes one issue: we could query the cost of an
unaligned load for a vector type even though no unaligned
vector load is actually supported there.  The reason is that
the costed load has a single-lane vector type and its memory
access type is VMAT_ELEMENTWISE, so we actually treat it as
a scalar load yet set its alignment_support_scheme to
dr_unaligned_supported.

To avoid the exposed ICE, following Richi's suggestion,
this patch makes VMAT_ELEMENTWISE accesses be costed as
scalar loads.
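As a rough illustration (not the GCC internals themselves), an elementwise access is just a sequence of scalar loads, which is why asking the unaligned-vector-load question about it is wrong:

```c
#include <assert.h>

/* Illustration only: a VMAT_ELEMENTWISE access assembles the "vector"
   one lane at a time, so every memory access it performs is an
   ordinary scalar load and no unaligned vector load is ever issued.  */
static void
load_elementwise (const long *base, long stride, long out[2])
{
  for (int i = 0; i < 2; i++)
    out[i] = base[i * stride];  /* one scalar load per lane */
}
```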

Bootstrapped and regress-tested on x86_64-redhat-linux,
powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9/P10.

Is it ok for trunk?

BR,
Kewen
-

Co-authored-by: Richard Biener 

PR tree-optimization/110776

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Always cost VMAT_ELEMENTWISE
as scalar load.

gcc/testsuite/ChangeLog:

* gcc.target/powerpc/pr110776.c: New test.
---
 gcc/testsuite/gcc.target/powerpc/pr110776.c | 22 +
 gcc/tree-vect-stmts.cc  |  5 -
 2 files changed, 26 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr110776.c

diff --git a/gcc/testsuite/gcc.target/powerpc/pr110776.c 
b/gcc/testsuite/gcc.target/powerpc/pr110776.c
new file mode 100644
index 000..749159fd675
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr110776.c
@@ -0,0 +1,22 @@
+/* { dg-require-effective-target powerpc_altivec_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power6 -maltivec" } */
+
+/* Verify there is no ICE.  */
+
+int a;
+long *b;
+int
+c ()
+{
+  long e;
+  int d = 0;
+  for (long f; f; f++)
+{
+  e = b[f * a];
+  if (e)
+   d = 1;
+}
+  if (d)
+for (;;)
+  ;
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ed28fbdced3..09705200594 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -9840,7 +9840,10 @@ vectorizable_load (vec_info *vinfo,
{
  if (costing_p)
{
- if (VECTOR_TYPE_P (ltype))
+ /* For VMAT_ELEMENTWISE, just cost it as scalar_load to
+avoid ICE, see PR110776.  */
+ if (VECTOR_TYPE_P (ltype)
+ && memory_access_type != VMAT_ELEMENTWISE)
vect_get_load_cost (vinfo, stmt_info, 1,
alignment_support_scheme, misalignment,
false, _cost, nullptr, cost_vec,
--
2.39.1


Re: [PATCH] vect: Don't vectorize a single scalar iteration loop [PR110740]

2023-07-24 Thread Kewen.Lin via Gcc-patches
on 2023/7/21 19:49, Richard Biener wrote:
> On Fri, Jul 21, 2023 at 8:08 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> The function vect_update_epilogue_niters which has been
>> removed by r14-2281 has some code taking care of that if
>> there is only one scalar iteration left for epilogue then
>> we won't try to vectorize it any more.
>>
>> Although costing should be able to care about it eventually,
>> I think we still want this special casing without costing
>> enabled, so this patch is to add it back in function
>> vect_analyze_loop_costing, and make it more general for
>> both main and epilogue loops as Richi suggested, it can fix
>> some exposed failures on Power10:
>>
>>  - gcc.target/powerpc/p9-vec-length-epil-{1,8}.c
>>  - gcc.dg/vect/slp-perm-{1,5,6,7}.c
>>
>> Bootstrapped and regtested on x86_64-redhat-linux,
>> aarch64-linux-gnu, powerpc64-linux-gnu P8/P9 and
>> powerpc64le-linux-gnu P9/P10.
>>
>> Is it ok for trunk?
> 
> OK.
> 

Thanks Richi, pushed as r14-2736.

BR,
Kewen


Re: [PATCHv2, rs6000] Generate mfvsrwz for all subtargets and remove redundant zero extend [PR106769]

2023-07-23 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/7/21 09:32, HAO CHEN GUI wrote:
> Hi,
>   This patch modifies vsx extract expand and generates mfvsrwz/stxsiwx
> for all subtargets when the mode is V4SI and the index of extracted element
> is 1 for BE and 2 for LE.  Also this patch adds an insn pattern for mfvsrwz
> which can help eliminate redundant zero extend.
> 
>   Compared to last version, the main change is to add a new expand for V4SI
> and separate "vsx_extract_si" into 2 insn patterns.
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/622101.html
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Generate mfvsrwz for all subtargets and remove redundant zero extend
> 
> mfvsrwz has lower latency than xxextractuw or vextuw[lr]x.  So it should be
> generated even with p9 vector enabled.  Also, the instruction's result is
> already zero-extended.  A combine pattern is needed to eliminate redundant zero
> extend instructions.
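The property the proposed combine pattern relies on can be modeled portably.  The sketch below only illustrates why the later zero extend is redundant; it is not the rs6000 pattern itself:

```c
#include <assert.h>
#include <stdint.h>

/* mfvsrwz moves a 32-bit element into a 64-bit GPR with the high 32
   bits already cleared, so a zero-extension that follows it is a
   no-op the combiner can delete.  */
static uint64_t
move_word_zero_extended (uint32_t w)
{
  return (uint64_t) w;  /* high half guaranteed zero */
}

static uint64_t
redundant_zero_extend (uint64_t r)
{
  return r & 0xffffffffu;  /* removable when r came from the move above */
}
```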
> 
> gcc/
>   PR target/106769
>   * config/rs6000/vsx.md (expand vsx_extract_<mode>): Set it only
>   for V8HI and V16QI.
>   (vsx_extract_v4si): New expand for V4SI.
>   (*vsx_extract_<mode>_di_p9): Do not generate the insn when it can
>   be generated by mfvsrwz.
>   (mfvsrwz): New insn pattern for zero extended vsx_extract_v4si.
>   (*vsx_extract_si): Removed.
>   (vsx_extract_v4si_0): New insn pattern to deal with V4SI extract
>   when the index of extracted element is 1 with BE and 2 with LE.
>   (vsx_extract_v4si_1): New insn and split pattern which deals with
>   the cases not handled by vsx_extract_v4si_0.
> 
> gcc/testsuite/
>   PR target/106769
>   * gcc.target/powerpc/pr106769.h: New.
>   * gcc.target/powerpc/pr106769-p8.c: New.
>   * gcc.target/powerpc/pr106769-p9.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 0a34ceebeb5..ad249441bcf 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3722,9 +3722,9 @@ (define_insn "vsx_xxpermdi2__1"
>  (define_expand  "vsx_extract_<mode>"
>[(parallel [(set (match_operand:<VEC_base> 0 "gpc_reg_operand")
>  (vec_select:<VEC_base>
> - (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand")
> + (match_operand:VSX_EXTRACT_I2 1 "gpc_reg_operand")
>   (parallel [(match_operand:QI 2 "const_int_operand")])))
> -   (clobber (match_scratch:VSX_EXTRACT_I 3))])]
> +   (clobber (match_scratch:VSX_EXTRACT_I2 3))])]
>"VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_DIRECT_MOVE_64BIT"
>  {
>/* If we have ISA 3.0, we can do a xxextractuw/vextractu{b,h}.  */
> @@ -3736,6 +3736,23 @@ (define_expand  "vsx_extract_"
>  }
>  })
> 
> +(define_expand  "vsx_extract_v4si"
> +  [(parallel [(set (match_operand:SI 0 "gpc_reg_operand")
> +(vec_select:SI
> + (match_operand:V4SI 1 "gpc_reg_operand")
> + (parallel [(match_operand:QI 2 "const_0_to_3_operand")])))
> +   (clobber (match_scratch:V4SI 3))])]
> +  "TARGET_DIRECT_MOVE_64BIT"
> +{

Nit: Maybe add a comment here for why we special-case op2.

> +  if (TARGET_P9_VECTOR
> +  && INTVAL (operands[2]) != (BYTES_BIG_ENDIAN ? 1 : 2))
> +{
> +  emit_insn (gen_vsx_extract_v4si_p9 (operands[0], operands[1],
> +   operands[2]));
> +  DONE;
> +}
> +})
> +

Nit: Move "(define_insn \"vsx_extract_v4si_0\"..." up here to ensure
it takes the first priority in matching.

>  (define_insn "vsx_extract_<mode>_p9"
>[(set (match_operand: 0 "gpc_reg_operand" "=r,")
>   (vec_select:
> @@ -3798,7 +3815,9 @@ (define_insn_and_split "*vsx_extract_<mode>_di_p9"
> (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand" "v,")
> (parallel [(match_operand:QI 2 "const_int_operand" "n,n")]
> (clobber (match_scratch:SI 3 "=r,X"))]
> -  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_VEXTRACTUB"
> +  "TARGET_VEXTRACTUB
> +   && (<MODE>mode != V4SImode
> +   || INTVAL (operands[2]) != (BYTES_BIG_ENDIAN ? 1 : 2))"

I'd expect that under condition TARGET_VEXTRACTUB, we won't get this kind of
pattern with V4SI and 1/2 op2 now?  Instead of putting one condition to exclude
it, IMHO it's better to assert op2 isn't 1 or 2 in its splitters.

>"#"
>"&& reload_completed"
>[(parallel [(set (match_dup 4)
> @@ -3830,58 +3849,78 @@ (define_insn_and_split "*vsx_extract_<mode>_store_p9"
> (set (match_dup 0)
>   (match_dup 3))])
> 
> -(define_insn_and_split  "*vsx_extract_si"
> +(define_insn "mfvsrwz"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (zero_extend:DI
> +   (vec_select:SI
> + (match_operand:V4SI 1 "vsx_register_operand" "wa")
> + (parallel [(match_operand:QI 2 "const_int_operand" "n")]
> +   (clobber (match_scratch:V4SI 3 "=v"))]
> +  "TARGET_DIRECT_MOVE_64BIT
> +   && INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 1 : 2)"
> +  "mfvsrwz %0,%x1"
> +  [(set_attr "type" 

Re: [PATCH 2/2 ver 5] rs6000, fix vec_replace_unaligned built-in arguments

2023-07-23 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/22 07:38, Carl Love wrote:
> GCC maintainers:
> 
> Version 5, Fixed patch description, the first argument should be of
> type vector.  Fixed comment in vsx.md to say "Vector and scalar
> extract_elt iterator/attr ".  Removed a few of the changes in
> version 4.  Specifically, reverted the names of REPLACE_ELT_V_sh back
> to REPLACE_ELT_sh and REPLACE_ELT_V_max back to REPLACE_ELT_max.
> Combined the REPLACE_ELT_char and REPLACE_ELT_V_char mode attributes
> into REPLACE_ELT_char.  Put the "dg-do link" directive back into the
> vec-replace-word-runnable_1.c test file.  The patch was tested with the
> updated patch 1 in the series on Power 8 LE/BE, Power 9 LE/BE and Power
> 10 with no regressions.
> 

snip...

> 
> rs6000, fix vec_replace_unaligned built-in arguments
> 
> The first argument of the vec_replace_unaligned built-in should always be
> of type vector unsigned char, as specified in gcc/doc/extend.texi.
> 
> This patch fixes the builtin definitions and updates the test cases to use
> the correct arguments.  The original test file is renamed and a second test
> file is added for a new test case.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def: Rename
>   __builtin_altivec_vreplace_un_uv2di as __builtin_altivec_vreplace_un_udi
>   __builtin_altivec_vreplace_un_uv4si as __builtin_altivec_vreplace_un_usi
>   __builtin_altivec_vreplace_un_v2df as __builtin_altivec_vreplace_un_df
>   __builtin_altivec_vreplace_un_v2di as __builtin_altivec_vreplace_un_di
>   __builtin_altivec_vreplace_un_v4sf as __builtin_altivec_vreplace_un_sf
>   __builtin_altivec_vreplace_un_v4si as __builtin_altivec_vreplace_un_si.
>   Rename VREPLACE_UN_UV2DI as VREPLACE_UN_UDI, VREPLACE_UN_UV4SI as
>   VREPLACE_UN_USI, VREPLACE_UN_V2DF as VREPLACE_UN_DF,
>   VREPLACE_UN_V2DI as VREPLACE_UN_DI, VREPLACE_UN_V4SF as
>   VREPLACE_UN_SF, VREPLACE_UN_V4SI as VREPLACE_UN_SI.
>   Rename vreplace_un_v2di as vreplace_un_di, vreplace_un_v4si as
>   vreplace_un_si, vreplace_un_v2df as vreplace_un_df,
>   vreplace_un_v2di as vreplace_un_di, vreplace_un_v4sf as
>   vreplace_un_sf, vreplace_un_v4si as vreplace_un_si.
>   * config/rs6000/rs6000-c.cc (find_instance): Add case
>   RS6000_OVLD_VEC_REPLACE_UN.
>   * config/rs6000/rs6000-overload.def (__builtin_vec_replace_un):
>   Fix first argument type.  Rename VREPLACE_UN_UV4SI as
>   VREPLACE_UN_USI, VREPLACE_UN_V4SI as VREPLACE_UN_SI,
>   VREPLACE_UN_UV2DI as VREPLACE_UN_UDI, VREPLACE_UN_V2DI as
>   VREPLACE_UN_DI, VREPLACE_UN_V4SF as VREPLACE_UN_SF,
>   VREPLACE_UN_V2DF as VREPLACE_UN_DF.
>   * config/rs6000/vsx.md (REPLACE_ELT): Renamed the mode_iterator

Nit: s/Renamed/Rename/

>   REPLACE_ELT_V for vector modes.
>   (REPLACE_ELT): New scalar mode iterator.
>   (REPLACE_ELT_char): Add scalar attributes.
>   (vreplace_un_<mode>): Change iterator and mode attribute.
> 
> gcc/testsuite/ChangeLog:
>   * gcc.target/powerpc/vec-replace-word-runnable.c: Renamed
>   vec-replace-word-runnable_1.c.

Ditto.

>   * gcc.target/powerpc/vec-replace-word-runnable_1.c
>   (dg-options): add -flax-vector-conversions.
>   (vec_replace_unaligned) Fix first argument type.
>   (vresult_uchar): Fix expected results.
>   (vec_replace_unaligned): Update for loop to check uchar results.
>   Remove extra spaces in if statements. Insert missing spaces in
>   for statements.
>   * gcc.target/powerpc/vec-replace-word-runnable_2.c: New test file.
> ---

[snip...]

>  
>  [VEC_REVB, vec_revb, __builtin_vec_revb]
>vss __builtin_vec_revb (vss);
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 0c269e4e8d9..7a4cf492ea5 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -380,10 +380,13 @@ (define_int_attr xvcvbf16   [(UNSPEC_VSX_XVCVSPBF16 
> "xvcvspbf16")
>  ;; Like VI, defined in vector.md, but add ISA 2.07 integer vector ops
>  (define_mode_iterator VI2 [V4SI V8HI V16QI V2DI])
>  
> -;; Vector extract_elt iterator/attr for 32-bit and 64-bit elements
> -(define_mode_iterator REPLACE_ELT [V4SI V4SF V2DI V2DF])
> +;; Vector and scalar extract_elt iterator/attr for 32-bit and 64-bit elements

Nit: Since you touched this comment line, extract_elt is wrong before.
Maybe s/extract_elt/vector replace/?

> +(define_mode_iterator REPLACE_ELT_V [V4SI V4SF V2DI V2DF])
> +(define_mode_iterator REPLACE_ELT [SI SF DI DF])
>  (define_mode_attr REPLACE_ELT_char [(V4SI "w") (V4SF "w")
> - (V2DI  "d") (V2DF "d")])
> + (V2DI "d") (V2DF "d")
> + (SI "w") (SF "w")
> + (DI "d") (DF "d")])
>  (define_mode_attr REPLACE_ELT_sh [(V4SI "2") (V4SF "2")
> (V2DI  "3") (V2DF "3")])
>  (define_mode_attr REPLACE_ELT_max [(V4SI "12") 

Re: [PATCH 1/2 ver 2] rs6000, add argument to function find_instance

2023-07-23 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/22 07:38, Carl Love wrote:
> GCC maintainers:
> 
> Version 2:  Updated a number of formatting and spacing issues.   Added
> the NARGS description to the header comment for function find_instance.
> This patch was tested on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE
> with no regressions.
> 
The rs6000 function find_instance assumes that it is called for built-
ins with only two arguments.  There is no checking for the actual
number of arguments used in the built-in.  This patch adds an
additional parameter to the function call containing the number of
arguments in the built-in.  The function will now do the needed checks
> for all of the arguments.
> 
> This fix is needed for the next patch in the series that fixes the
> vec_replace_unaligned built-in.c test.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
> Carl 
> 
> 
> 
> 
> -
> rs6000, add argument to function find_instance
> 
> The function find_instance assumes it is called to check a built-in with
> only two arguments.  This patch extends the function by adding a parameter
> specifying the number of built-in arguments to check.
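The generalization can be sketched independently of the GCC data structures (the names below are hypothetical, not the rs6000 API):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the extended check in find_instance: compare
// NARGS argument types against the instance's parameter types instead
// of hard-coding exactly two comparisons.
static bool
args_compatible_p (const std::vector<std::string> &arg_types,
                   const std::vector<std::string> &parm_types,
                   unsigned nargs)
{
  for (unsigned i = 0; i < nargs; i++)
    if (arg_types[i] != parm_types[i])
      return false;  /* mismatch at position i */
  return true;
}
```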
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-c.cc (find_instance): Add new parameter that
>   specifies the number of built-in arguments to check.
>   (altivec_resolve_overloaded_builtin): Update calls to find_instance
>   to pass the number of built-in arguments to be checked.
> ---
>  gcc/config/rs6000/rs6000-c.cc | 40 +++
>  1 file changed, 26 insertions(+), 14 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index a353bca19ef..de35490de42 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -1668,18 +1668,20 @@ resolve_vec_step (resolution *res, vec<tree, va_gc>
> *arglist, unsigned nargs)
>  /* Look for a matching instance in a chain of instances.  INSTANCE points to
> the chain of instances; INSTANCE_CODE is the code identifying the specific
> built-in being searched for; FCODE is the overloaded function code; TYPES
> -   contains an array of two types that must match the types of the instance's
> -   parameters; and ARGS contains an array of two arguments to be passed to
> -   the instance.  If found, resolve the built-in and return it, unless the
> -   built-in is not supported in context.  In that case, set
> -   UNSUPPORTED_BUILTIN to true.  If we don't match, return error_mark_node
> -   and leave UNSUPPORTED_BUILTIN alone.  */
> +   contains an array of NARGS types that must match the types of the
> +   instance's parameters; ARGS contains an array of NARGS arguments to be
> +   passed to the instance; and NARGS is the number of built-in arguments to
> +   check.  If found, resolve the built-in and return it, unless the built-in
> +   is not supported in context.  In that case, set UNSUPPORTED_BUILTIN to
> +   true.  If we don't match, return error_mark_node and leave
> +   UNSUPPORTED_BUILTIN alone.
> +*/

Minor nit: "*/" doesn't need to start with a new line, just "... alone.  */".

This patch is okay for trunk with this fixed.  Thanks!

BR,
Kewen



Re: [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP

2023-07-23 Thread Kewen.Lin via Gcc-patches
Hi Iain,

on 2023/7/22 23:58, Iain Sandoe wrote:
> Hi Kewen,
> 
> This patch breaks bootstrap on powerpc-darwin (which has Altivec, but not 
> VSX) while building libgfortran.
> 
>> On 3 Jul 2023, at 04:19, Kewen.Lin via Gcc-patches  
>> wrote:
> 
> Please see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110776
> thanks
> Iain
> 

Thanks for reporting!  I'll have a look at it.


BR,
Kewen


[PATCH] vect: Don't vectorize a single scalar iteration loop [PR110740]

2023-07-21 Thread Kewen.Lin via Gcc-patches
Hi,

The function vect_update_epilogue_niters which has been
removed by r14-2281 has some code taking care of that if
there is only one scalar iteration left for epilogue then
we won't try to vectorize it any more.

Although costing should be able to care about it eventually,
I think we still want this special casing without costing
enabled, so this patch is to add it back in function
vect_analyze_loop_costing, and make it more general for
both main and epilogue loops as Richi suggested, it can fix
some exposed failures on Power10:

 - gcc.target/powerpc/p9-vec-length-epil-{1,8}.c
 - gcc.dg/vect/slp-perm-{1,5,6,7}.c
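A hypothetical example of the shape being rejected: with a vectorization factor of 4, a trip count of 5 leaves exactly one scalar iteration for the epilogue, and vectorizing that lone iteration cannot pay off.

```c
#include <assert.h>

/* Hypothetical example: if the main vector loop covers i = 0..3 with a
   vectorization factor of 4, the epilogue is left with the single
   iteration i = 4, which is better kept scalar.  */
static void
add_one (const int *restrict b, int *restrict a)
{
  for (int i = 0; i < 5; i++)
    a[i] = b[i] + 1;
}
```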

Bootstrapped and regtested on x86_64-redhat-linux,
aarch64-linux-gnu, powerpc64-linux-gnu P8/P9 and
powerpc64le-linux-gnu P9/P10.

Is it ok for trunk?

BR,
Kewen
-
PR tree-optimization/110740

gcc/ChangeLog:

* tree-vect-loop.cc (vect_analyze_loop_costing): Do not vectorize a
loop with a single scalar iteration.
---
 gcc/tree-vect-loop.cc | 55 ++-
 1 file changed, 34 insertions(+), 21 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index b44fb9c7712..92d2abde094 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2158,8 +2158,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
  epilogue we can also decide whether the main loop leaves us
  with enough iterations, prefering a smaller vector epilog then
  also possibly used for the case we skip the vector loop.  */
-  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
-  && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+  if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
 {
   widest_int scalar_niters
= wi::to_widest (LOOP_VINFO_NITERSM1 (loop_vinfo)) + 1;
@@ -2182,32 +2181,46 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
   % lowest_vf + gap);
}
}
-
-  /* Check that the loop processes at least one full vector.  */
-  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
-  if (known_lt (scalar_niters, vf))
+  /* Reject vectorizing for a single scalar iteration, even if
+we could in principle implement that using partial vectors.  */
+  unsigned peeling_gap = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
+  if (scalar_niters <= peeling_gap + 1)
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-"loop does not have enough iterations "
-"to support vectorization.\n");
+"not vectorized: loop only has a single "
+"scalar iteration.\n");
  return 0;
}

-  /* If we need to peel an extra epilogue iteration to handle data
-accesses with gaps, check that there are enough scalar iterations
-available.
-
-The check above is redundant with this one when peeling for gaps,
-but the distinction is useful for diagnostics.  */
-  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
- && known_le (scalar_niters, vf))
+  if (!LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
{
- if (dump_enabled_p ())
-   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-"loop does not have enough iterations "
-"to support peeling for gaps.\n");
- return 0;
+ /* Check that the loop processes at least one full vector.  */
+ poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+ if (known_lt (scalar_niters, vf))
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"loop does not have enough iterations "
+"to support vectorization.\n");
+ return 0;
+   }
+
+ /* If we need to peel an extra epilogue iteration to handle data
+accesses with gaps, check that there are enough scalar iterations
+available.
+
+The check above is redundant with this one when peeling for gaps,
+but the distinction is useful for diagnostics.  */
+ if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
+ && known_le (scalar_niters, vf))
+   {
+ if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"loop does not have enough iterations "
+"to support peeling for gaps.\n");
+ return 0;
+   }
}
 }

--
2.39.3


Re: [PATCH] sccvn: Correct the index of bias for IFN_LEN_STORE [PR110744]

2023-07-20 Thread Kewen.Lin via Gcc-patches
on 2023/7/20 20:37, Richard Sandiford wrote:
> "Kewen.Lin"  writes:
>> Hi,
>>
>> Commit r14-2267-gb8806f6ffbe72e adjusts the argument order
>> of LEN_STORE from {len,vector,bias} to {len,bias,vector},
>> in order to make them consistent with LEN_MASK_STORE and
>> MASK_STORE.  But it missed updating the related handling
>> in tree-ssa-sccvn.cc, which caused the failure shown in PR
>> 110744.  This patch fixes that handling with the correct
>> index.
>>
>> Bootstrapped and regress-tested on x86_64-redhat-linux,
>> powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9/P10.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>>  PR tree-optimization/110744
>>
>> gcc/ChangeLog:
>>
>>  * tree-ssa-sccvn.cc (vn_reference_lookup_3): Correct the index of bias
>>  operand for ifn IFN_LEN_STORE.
> 
> OK, thanks.
> 

Thanks Richard!  Pushed as r14-2694.

BR,
Kewen


Re: [PATCH] testsuite: Add a test case for PR110729

2023-07-20 Thread Kewen.Lin via Gcc-patches
on 2023/7/20 20:34, Richard Sandiford wrote:
> "Kewen.Lin"  writes:
>> Hi,
>>
>> As PR110729 reported, there was one issue for .section
>> __patchable_function_entries with -ffunction-sections, that
>> is we put the same symbol as link_to section symbol for all
>> functions wrongly.  The commit r13-4294 for PR99889 has
>> fixed this with the corresponding label LPFE* which sits in
>> the function_section.
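A hypothetical reduction of the scenario (not the committed testcase): built with -ffunction-sections, each function below gets its own __patchable_function_entries entry, and each entry's linked-to symbol must be the LPFE* label inside that function's own section rather than one shared symbol.

```c
/* Hypothetical reduction: two patchable functions whose
   __patchable_function_entries records must link to distinct LPFE*
   labels once -ffunction-sections splits them apart.  */
__attribute__ ((patchable_function_entry (1, 0)))
int f0 (void) { return 0; }

__attribute__ ((patchable_function_entry (1, 0)))
int f1 (void) { return 1; }
```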
>>
>> As Fangrui suggested[1], this patch is to add a bit more test
>> coverage.  I didn't find a good way to check all linked_to
>> symbols are different, so I checked for LPFE[012] here.
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624866.html
>>
>> Tested well on x86_64-redhat-linux, powerpc64-linux-gnu
>> P7/P8/P9 and powerpc64le-linux-gnu P9/P10.
>>
>> Is it ok for trunk?
>>
>> BR,
>> Kewen
>> -
>>  PR testsuite/110729
>>
>> gcc/testsuite/ChangeLog:
>>
>>  * gcc.dg/pr110729.c: New test.
> 
> OK, thanks.

Thanks Richard!  Pushed as r14-2693.

BR,
Kewen


Re: [PATCH 2/2 ver 4] rs6000, fix vec_replace_unaligned built-in arguments

2023-07-20 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/18 03:20, Carl Love wrote:
> GCC maintainers:
> 
> Version 4, changed the new RS6000_OVLD_VEC_REPLACE_UN case statement
> rs6000/rs6000-c.cc.  The existing REPLACE_ELT iterator name was changed
> to REPLACE_ELT_V along with the associated define_mode_attr.  Renamed
> VEC_RU to REPLACE_ELT for the iterator name and VEC_RU_char to
> REPLACE_ELT_char.  Fixed the double test in vec-replace-word-
> runnable_1.c to be consistent with the other tests.  Removed the "dg-do 
> link" from both tests.  Put in an explicit cast in test 
> vec-replace-word-runnable_2.c to eliminate the need for the 
> -flax-vector-conversions dg-option.
> 
> Version 3, added code to altivec_resolve_overloaded_builtin so the
> correct instruction is selected for the size of the second argument. 
> This restores the instruction counts to the original values where the
> correct instructions were originally being generated.  The naming of
> the overloaded builtin instances and builtin definitions were changed
> to reflect the type of the second argument since the type of the first
> argument is now the same for all overloaded instances.  A new builtin
> test file was added for the case where the first argument is cast to
> the unsigned long long type.  This test requires the -flax-vector-
> conversions gcc command line option.  Since the other tests do not
> require this option, I felt that the new test needed to be in a
> separate file.  Finally some formatting fixes were made in the original
> test file.  Patch has been retested on Power 10 with no regressions.
> 
> Version 2, fixed various typos.  Updated the change log body to say the
> instruction counts were updated.  The instruction counts changed as a
> result of changing the first argument of the vec_replace_unaligned
> builtin call from vector unsigned long long (vull) to vector unsigned
> char (vuc).  When the first argument was vull the builtin call
> generated the vinsd instruction for the two test cases.  The updated
> call with vuc as the first argument generates two vinsw instructions
> instead.  Patch was retested on Power 10 with no regressions.
> 
> The following patch fixes the first argument in the builtin definition
> and the corresponding test cases.  Initially, the builtin specification
> was wrong due to a cut and paste error.  The documentation was fixed in:
> 
>commit ed3fea09b18f67e757b5768b42cb6e816626f1db
>Author: Bill Schmidt 
>Date:   Fri Feb 4 13:07:17 2022 -0600
> 
>rs6000: Correct function prototypes for vec_replace_unaligned
> 
>Due to a pasto error in the documentation, vec_replace_unaligned was
>implemented with the same function prototypes as vec_replace_elt.  
>It was intended that vec_replace_unaligned always specify output
>vectors as having type vector unsigned char, to emphasize that 
>elements are potentially misaligned by this built-in function.  
>This patch corrects the misimplementation.
> 
> 
> This patch fixes the arguments in the definitions and updates the
> testcases accordingly.  Additionally, a few minor spacing issues are
> fixed.
> 
> The patch has been tested on Power 10 with no regressions.  Please let
> me know if the patch is acceptable for mainline.  Thanks.
> 
>  Carl 
> 
> 
> 
> rs6000, fix vec_replace_unaligned built-in arguments
> 
> The first argument of the vec_replace_unaligned built-in should always be
> of type unsigned char, as specified in gcc/doc/extend.texi.

Shouldn't it be "vector unsigned char" instead of "unsigned char"?

Or do I miss something?

> 
> This patch fixes the builtin definitions and updates the test cases to use
> the correct arguments.  The original test file is renamed and a second test
> file is added for a new test case.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def: Rename
>   __builtin_altivec_vreplace_un_uv2di as __builtin_altivec_vreplace_un_udi
>   __builtin_altivec_vreplace_un_uv4si as __builtin_altivec_vreplace_un_usi
>   __builtin_altivec_vreplace_un_v2df as __builtin_altivec_vreplace_un_df
>   __builtin_altivec_vreplace_un_v2di as __builtin_altivec_vreplace_un_di
>   __builtin_altivec_vreplace_un_v4sf as __builtin_altivec_vreplace_un_sf
>   __builtin_altivec_vreplace_un_v4si as __builtin_altivec_vreplace_un_si.
>   Rename VREPLACE_UN_UV2DI as VREPLACE_UN_UDI, VREPLACE_UN_UV4SI as
>   VREPLACE_UN_USI, VREPLACE_UN_V2DF as VREPLACE_UN_DF,
>   VREPLACE_UN_V2DI as VREPLACE_UN_DI, VREPLACE_UN_V4SF as
>   VREPLACE_UN_SF, VREPLACE_UN_V4SI as VREPLACE_UN_SI.
>   Rename vreplace_un_v2di as vreplace_un_di, vreplace_un_v4si as
>   vreplace_un_si, vreplace_un_v2df as vreplace_un_df,
>   vreplace_un_v2di as vreplace_un_di, vreplace_un_v4sf as
>   vreplace_un_sf, vreplace_un_v4si as vreplace_un_si.
>   * config/rs6000/rs6000-c.cc (find_instance): Add case
>   RS6000_OVLD_VEC_REPLACE_UN.

Re: [PATCH 1/2] rs6000, add argument to function find_instance

2023-07-20 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/18 03:19, Carl Love wrote:
> 
> GCC maintainers:
> 
> The rs6000 function find_instance assumes that it is called for built-
> ins with only two arguments.  There is no checking for the actual
> number of arguments used in the built-in.  This patch adds an
> additional parameter to the function call containing the number of
> arguments in the built-in.  The function will now do the needed checks
> for all of the arguments.
> 
> This fix is needed for the next patch in the series that fixes the
> vec_replace_unaligned built-in.c test.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
> Carl 
> 
> 
> 
> rs6000, add argument to function find_instance
> 
> The function find_instance assumes it is called to check a built-in  with 
> ~~ two spaces.
> only two arguments.  Ths patch extends the function by adding a parameter
   s/Ths/This/
> specifying the number of buit-in arguments to check.
  s/buit-in/built-in/

> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-c.cc (find_instance): Add new parameter that
>   specifies the number of built-in arguments to check.
>   (altivec_resolve_overloaded_builtin): Update calls to find_instance
>   to pass the number of built-in argument to be checked.

s/argument/arguments/

> ---
>  gcc/config/rs6000/rs6000-c.cc | 27 +++
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000-c.cc b/gcc/config/rs6000/rs6000-c.cc
> index a353bca19ef..350987b851b 100644
> --- a/gcc/config/rs6000/rs6000-c.cc
> +++ b/gcc/config/rs6000/rs6000-c.cc
> @@ -1679,7 +1679,7 @@ tree

There is one function comment here describing the meaning of each parameter,
I think we should add a corresponding for NARGS, may be something like:

"; and NARGS specifies the number of built-in arguments."

Also we need to update the below "two"s with "NARGS".

"TYPES contains an array of two types..." and "ARGS contains an array of two 
arguments..."

since we already extend this to handle NARGS instead of two.

>  find_instance (bool *unsupported_builtin, ovlddata **instance,
>  rs6000_gen_builtins instance_code,
>  rs6000_gen_builtins fcode,
> -tree *types, tree *args)
> +tree *types, tree *args, int nargs)
>  {
>while (*instance && (*instance)->bifid != instance_code)
>  *instance = (*instance)->next;
> @@ -1691,17 +1691,28 @@ find_instance (bool *unsupported_builtin, ovlddata 
> **instance,
>if (!inst->fntype)
>  return error_mark_node;
>tree fntype = rs6000_builtin_info[inst->bifid].fntype;
> -  tree parmtype0 = TREE_VALUE (TYPE_ARG_TYPES (fntype));
> -  tree parmtype1 = TREE_VALUE (TREE_CHAIN (TYPE_ARG_TYPES (fntype)));
> +  tree argtype = TYPE_ARG_TYPES (fntype);
> +  tree parmtype;

Nit: We can move "tree parmtype" into the loop (close to its only use).

> +  int args_compatible = true;

s/int/bool/

>  
> -  if (rs6000_builtin_type_compatible (types[0], parmtype0)
> -  && rs6000_builtin_type_compatible (types[1], parmtype1))
> +  for (int i = 0; i < nargs; i++)
> +    {
> +      parmtype = TREE_VALUE (argtype);

 tree parmtype = TREE_VALUE (argtype);

> +  if (! rs6000_builtin_type_compatible (types[i], parmtype))

Nit: One unexpected(?) space after "!".

> + {
> +   args_compatible = false;
> +   break;
> + }
> +  argtype = TREE_CHAIN (argtype);
> +}
> +
> +  if (args_compatible)
> +  {

Nit: indent issue for "{".

Ok for trunk with these nits fixed.  Btw, the description doesn't say
how this was tested; I'm not sure if it was only tested together with
"patch 2/2", but please ensure it's bootstrapped and regress-tested
on BE and LE when committing.  Thanks!

BR,
Kewen

>if (rs6000_builtin_decl (inst->bifid, false) != error_mark_node
> && rs6000_builtin_is_supported (inst->bifid))
>   {
> tree ret_type = TREE_TYPE (inst->fntype);
> -   return altivec_build_resolved_builtin (args, 2, fntype, ret_type,
> +   return altivec_build_resolved_builtin (args, nargs, fntype, ret_type,
>inst->bifid, fcode);
>   }
>else
> @@ -1921,7 +1932,7 @@ altivec_resolve_overloaded_builtin (location_t loc, 
> tree fndecl,
> instance_code = RS6000_BIF_CMPB_32;
>  
>   tree call = find_instance (_builtin, ,
> -instance_code, fcode, types, args);
> +instance_code, fcode, types, args, nargs);
>   if (call != error_mark_node)
> return call;
>   break;
> @@ -1958,7 +1969,7 @@ altivec_resolve_overloaded_builtin (location_t loc, 
> tree fndecl,
> }
>  
>   tree call = find_instance (_builtin, ,
> -instance_code, fcode, 

[PATCH] sccvn: Correct the index of bias for IFN_LEN_STORE [PR110744]

2023-07-20 Thread Kewen.Lin via Gcc-patches
Hi,

Commit r14-2267-gb8806f6ffbe72e adjusts the arguments order
of LEN_STORE from {len,vector,bias} to {len,bias,vector},
in order to make them consistent with LEN_MASK_STORE and
MASK_STORE.  But it missed to update the related handlings
in tree-ssa-sccvn.cc, it caused the failure shown in PR
110744.  This patch is to fix the related handlings with
the correct index.

Bootstrapped and regress-tested on x86_64-redhat-linux,
powerpc64-linux-gnu P8/P9 and powerpc64le-linux-gnu P9/P10.

Is it ok for trunk?

BR,
Kewen
-
PR tree-optimization/110744

gcc/ChangeLog:

* tree-ssa-sccvn.cc (vn_reference_lookup_3): Correct the index of bias
operand for ifn IFN_LEN_STORE.
---
 gcc/tree-ssa-sccvn.cc | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-ssa-sccvn.cc b/gcc/tree-ssa-sccvn.cc
index 11061a374a2..c0b3ec420c5 100644
--- a/gcc/tree-ssa-sccvn.cc
+++ b/gcc/tree-ssa-sccvn.cc
@@ -3299,11 +3299,14 @@ vn_reference_lookup_3 (ao_ref *ref, tree vuse, void 
*data_,
return (void *)-1;
  break;
case IFN_LEN_STORE:
- len = gimple_call_arg (call, 2);
- bias = gimple_call_arg (call, 4);
- if (!tree_fits_uhwi_p (len) || !tree_fits_shwi_p (bias))
-   return (void *)-1;
- break;
+ {
+   int len_index = internal_fn_len_index (fn);
+   len = gimple_call_arg (call, len_index);
+   bias = gimple_call_arg (call, len_index + 1);
+   if (!tree_fits_uhwi_p (len) || !tree_fits_shwi_p (bias))
+ return (void *) -1;
+   break;
+ }
default:
  return (void *)-1;
}
--
2.39.3


[PATCH] testsuite: Add a test case for PR110729

2023-07-20 Thread Kewen.Lin via Gcc-patches
Hi,

As PR110729 reported, there was one issue for .section
__patchable_function_entries with -ffunction-sections, that
is we put the same symbol as link_to section symbol for all
functions wrongly.  The commit r13-4294 for PR99889 has
fixed this with the corresponding label LPFE* which sits in
the function_section.

As Fangrui suggested[1], this patch is to add a bit more test
coverage.  I didn't find a good way to check all linked_to
symbols are different, so I checked for LPFE[012] here.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624866.html

Tested well on x86_64-redhat-linux, powerpc64-linux-gnu
P7/P8/P9 and powerpc64le-linux-gnu P9/P10.

Is it ok for trunk?

BR,
Kewen
-
PR testsuite/110729

gcc/testsuite/ChangeLog:

* gcc.dg/pr110729.c: New test.
---
 gcc/testsuite/gcc.dg/pr110729.c | 29 +
 1 file changed, 29 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr110729.c

diff --git a/gcc/testsuite/gcc.dg/pr110729.c b/gcc/testsuite/gcc.dg/pr110729.c
new file mode 100644
index 000..92dfd8ae000
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110729.c
@@ -0,0 +1,29 @@
+/* { dg-do compile { target { ! { nvptx*-*-* visium-*-* } } } } */
+/* { dg-require-effective-target o_flag_in_section } */
+/* { dg-options "-ffunction-sections -fpatchable-function-entry=2" } */
+/* { dg-additional-options "-fno-pie" { target sparc*-*-* } } */
+
+/* Verify there are three different link_to symbols for three
+.section __patchable_function_entries respectively.  */
+
+int
+f ()
+{
+  return 1;
+}
+
+int
+g ()
+{
+  return 2;
+}
+
+int
+h ()
+{
+  return 3;
+}
+
+/* { dg-final { scan-assembler-times {.section[\t 
]*__patchable_function_entries,.*,\.LPFE0} 1 } }  */
+/* { dg-final { scan-assembler-times {.section[\t 
]*__patchable_function_entries,.*,\.LPFE1} 1 } }  */
+/* { dg-final { scan-assembler-times {.section[\t 
]*__patchable_function_entries,.*,\.LPFE2} 1 } }  */
--
2.39.3


Re: PING^2 [PATCH] Adjust the symbol for SECTION_LINK_ORDER linked_to section [PR99889]

2023-07-19 Thread Kewen.Lin via Gcc-patches
Hi Fangrui,

on 2023/7/19 14:33, Fangrui Song wrote:
> On Thu, Nov 24, 2022 at 7:26 PM Kewen.Lin via Gcc-patches
>  wrote:
>>
>> Hi Richard,
>>
>> on 2022/11/23 00:08, Richard Sandiford wrote:
>>> "Kewen.Lin"  writes:
>>>> Hi Richard,
>>>>
>>>> Many thanks for your review comments!
>>>>
>>>>>>> on 2022/8/24 16:17, Kewen.Lin via Gcc-patches wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> As discussed in PR98125, -fpatchable-function-entry with
>>>>>>>> SECTION_LINK_ORDER support doesn't work well on powerpc64
>>>>>>>> ELFv1 because the filled "Symbol" in
>>>>>>>>
>>>>>>>>   .section name,"flags"o,@type,Symbol
>>>>>>>>
>>>>>>>> sits in .opd section instead of in the function_section
>>>>>>>> like .text or named .text*.
>>>>>>>>
>>>>>>>> Since we already generates one label LPFE* which sits in
>>>>>>>> function_section of current_function_decl, this patch is
>>>>>>>> to reuse it as the symbol for the linked_to section.  It
>>>>>>>> avoids the above ABI specific issue when using the symbol
>>>>>>>> concluded from current_function_decl.
>>>>>>>>
>>>>>>>> Besides, with this support some previous workarounds for
>>>>>>>> powerpc64 ELFv1 can be reverted.
>>>>>>>>
>>>>>>>> btw, rs6000_print_patchable_function_entry can be dropped
>>>>>>>> but there is another rs6000 patch which needs this rs6000
>>>>>>>> specific hook rs6000_print_patchable_function_entry, not
>>>>>>>> sure which one gets landed first, so just leave it here.
>>>>>>>>
>>>>>>>> Bootstrapped and regtested on below:
>>>>>>>>
>>>>>>>>   1) powerpc64-linux-gnu P8 with default binutils 2.27
>>>>>>>>  and latest binutils 2.39.
>>>>>>>>   2) powerpc64le-linux-gnu P9 (default binutils 2.30).
>>>>>>>>   3) powerpc64le-linux-gnu P10 (default binutils 2.30).
>>>>>>>>   4) x86_64-redhat-linux with default binutils 2.30
>>>>>>>>  and latest binutils 2.39.
>>>>>>>>   5) aarch64-linux-gnu  with default binutils 2.30
>>>>>>>>  and latest binutils 2.39.
>>>>>>>>
>>>>
>>>> [snip...]
>>>>
>>>>>>>> diff --git a/gcc/varasm.cc b/gcc/varasm.cc
>>>>>>>> index 4db8506b106..d4de6e164ee 100644
>>>>>>>> --- a/gcc/varasm.cc
>>>>>>>> +++ b/gcc/varasm.cc
>>>>>>>> @@ -6906,11 +6906,16 @@ default_elf_asm_named_section (const char 
>>>>>>>> *name, unsigned int flags,
>>>>>>>>  fprintf (asm_out_file, ",%d", flags & SECTION_ENTSIZE);
>>>>>>>>if (flags & SECTION_LINK_ORDER)
>>>>>>>>  {
>>>>>>>> -  tree id = DECL_ASSEMBLER_NAME (decl);
>>>>>>>> -  ultimate_transparent_alias_target ();
>>>>>>>> -  const char *name = IDENTIFIER_POINTER (id);
>>>>>>>> -  name = targetm.strip_name_encoding (name);
>>>>>>>> -  fprintf (asm_out_file, ",%s", name);
>>>>>>>> +  /* For now, only section "__patchable_function_entries"
>>>>>>>> + adopts flag SECTION_LINK_ORDER, internal label LPFE*
>>>>>>>> + was emitted in default_print_patchable_function_entry,
>>>>>>>> + just place it here for linked_to section.  */
>>>>>>>> +  gcc_assert (!strcmp (name, "__patchable_function_entries"));
>>>>>
>>>>> I like the idea of removing the rs600 workaround in favour of making the
>>>>> target-independent more robust.  But this seems a bit hackish.  What
>>>>> would we do if SECTION_LINK_ORDER was used for something else in future?
>>>>>
>>>>
>>>> Good question!  I think it depends on how we can 

Re: [PATCH V2] rs6000: Change GPR2 to volatile & non-fixed register for function that does not use TOC [PR110320]

2023-07-19 Thread Kewen.Lin via Gcc-patches
Hi Jeevitha,

on 2023/7/17 11:40, P Jeevitha wrote:
> 
> Hi All,
> 
> The following patch has been bootstrapped and regtested on powerpc64le-linux.

Since one line touched has (DEFAULT_ABI == ABI_AIX || DEFAULT_ABI == ABI_ELFv2)
and powerpc64le-linux only adopts ABI_ELFv2, could you also test this on
powerpc64-linux or aix to ensure it doesn't break ABI_AIX as we expected?

And Peter made the diff on rs6000.cc, so I guess you want to credit him as
co-author, i.e. by adding one line with:

Co-authored-by: Peter Bergner 

The others look good to me; okay for trunk if the suggested testing goes
well as expected.  Thanks!

BR,
Kewen

> 
> Normally, GPR2 is the TOC pointer and is defined as a fixed and non-volatile
> register. However, it can be used as volatile for PCREL addressing. Therefore,
> modified r2 to be non-fixed in FIXED_REGISTERS and set it to fixed if it is 
> not
> PCREL and also when the user explicitly requests TOC or fixed. If the register
> r2 is fixed, it is made as non-volatile. Changes in register preservation 
> roles
> can be accomplished with the help of available target hooks
> (TARGET_CONDITIONAL_REGISTER_USAGE).
> 
> 2023-07-12  Jeevitha Palanisamy  
> 
> gcc/
>   PR target/110320
>   * config/rs6000/rs6000.cc (rs6000_conditional_register_usage): Change
>   GPR2 to volatile and non-fixed register for PCREL.
> 
> gcc/testsuite/
>   PR target/110320
>   * gcc.target/powerpc/pr110320-1.c: New testcase.
>   * gcc.target/powerpc/pr110320-2.c: New testcase.
>   * gcc.target/powerpc/pr110320-3.c: New testcase.
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 44b448d2ba6..9aa04ec5d57 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -10193,9 +10193,13 @@ rs6000_conditional_register_usage (void)
>  for (i = 32; i < 64; i++)
>fixed_regs[i] = call_used_regs[i] = 1;
>  
> +  /* For non PC-relative code, GPR2 is unavailable for register allocation.  
> */
> +  if (FIXED_R2 && !rs6000_pcrel_p ())
> +fixed_regs[2] = 1;
> +
>/* The TOC register is not killed across calls in a way that is
>   visible to the compiler.  */
> -  if (DEFAULT_ABI == ABI_AIX || DEFAULT_ABI == ABI_ELFv2)
> +  if (fixed_regs[2] && (DEFAULT_ABI == ABI_AIX || DEFAULT_ABI == ABI_ELFv2))
>  call_used_regs[2] = 0;
>  
>if (DEFAULT_ABI == ABI_V4 && flag_pic == 2)
> diff --git a/gcc/config/rs6000/rs6000.h b/gcc/config/rs6000/rs6000.h
> index 3503614efbd..2a24fbdf9fd 100644
> --- a/gcc/config/rs6000/rs6000.h
> +++ b/gcc/config/rs6000/rs6000.h
> @@ -812,7 +812,7 @@ enum data_align { align_abi, align_opt, align_both };
>  
>  #define FIXED_REGISTERS  \
>{/* GPRs */   \
> -   0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
> +   0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
> /* FPRs */   \
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr110320-1.c 
> b/gcc/testsuite/gcc.target/powerpc/pr110320-1.c
> new file mode 100644
> index 000..a4ad34d9303
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr110320-1.c
> @@ -0,0 +1,22 @@
> +/* PR target/110320 */
> +/* { dg-require-effective-target powerpc_pcrel } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power10 -ffixed-r0 -ffixed-r11 
> -ffixed-r12" } */
> +
> +/* Ensure we use r2 as a normal volatile register for the code below.
> +   The test case ensures all of the parameter registers r3 - r10 are used
> +   and needed after we compute the expression "x + y" which requires a
> +   temporary.  The -ffixed-r* options disallow using the other volatile
> +   registers r0, r11 and r12.  That leaves RA to choose from r2 and the more
> +   expensive non-volatile registers for the temporary to be assigned to, and
> +   RA will always chooses the cheaper volatile r2 register.  */
> +
> +extern long bar (long, long, long, long, long, long, long, long *);
> +
> +long
> +foo (long r3, long r4, long r5, long r6, long r7, long r8, long r9, long 
> *r10)
> +{
> +  *r10 = r3 + r4;
> +  return bar (r3, r4, r5, r6, r7, r8, r9, r10);
> +}
> +
> +/* { dg-final { scan-assembler {\madd 2,3,4\M} } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr110320-2.c 
> b/gcc/testsuite/gcc.target/powerpc/pr110320-2.c
> new file mode 100644
> index 000..9d6aefedd2e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr110320-2.c
> @@ -0,0 +1,21 @@
> +/* PR target/110320 */
> +/* { dg-require-effective-target powerpc_pcrel } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power10 -mno-pcrel -ffixed-r0 -ffixed-r11 
> -ffixed-r12" } */
> +
> +/* Ensure we don't use r2 as a normal volatile register for the code below.
> +   The test case ensures all of the parameter registers r3 - r10 are used
> +   and needed after we compute the expression "x + y" 

Re: [PATCH v7, rs6000] Implemented f[min/max]_optab by xs[min/max]dp [PR103605]

2023-07-18 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2022/9/26 11:35, HAO CHEN GUI wrote:
> Hi,
>   This patch implements the f[min/max]_optab patterns via xs[min/max]dp on rs6000.
> Tests show that outputs of xs[min/max]dp are consistent with the standard
> of C99 fmin/max.
> 
>   This patch also binds __builtin_vsx_xs[min/max]dp to fmin/max instead
> of smin/max when fast-math is not set. While fast-math is set, xs[min/max]dp
> are folded to MIN/MAX_EXPR in gimple, and finally expanded to smin/max.
> 
>   Bootstrapped and tested on ppc64 Linux BE and LE with no regressions.
> Is this okay for trunk? Any recommendations? Thanks a lot.

Sorry for the late review, this patch is okay for trunk with the below
nit tweaked or not.  Thanks!

> 
> ChangeLog
> 2022-09-26 Haochen Gui 
> 
> gcc/
>   PR target/103605
>   * config/rs6000/rs6000-builtin.cc (rs6000_gimple_fold_builtin): Gimple
>   fold RS6000_BIF_XSMINDP and RS6000_BIF_XSMAXDP when fast-math is set.
>   * config/rs6000/rs6000.md (FMINMAX): New int iterator.
>   (minmax_op): New int attribute.
>   (UNSPEC_FMAX, UNSPEC_FMIN): New unspecs.
>   (f<minmax_op><mode>3): New pattern by UNSPEC_FMAX and UNSPEC_FMIN.
>   * config/rs6000/rs6000-builtins.def (__builtin_vsx_xsmaxdp): Set
>   pattern to fmaxdf3.
>   (__builtin_vsx_xsmindp): Set pattern to fmindf3.
> 
> gcc/testsuite/
>   PR target/103605
>   * gcc.dg/powerpc/pr103605.h: New.
>   * gcc.dg/powerpc/pr103605-1.c: New.
>   * gcc.dg/powerpc/pr103605-2.c: New.
> 
> patch.diff
> diff --git a/gcc/config/rs6000/rs6000-builtin.cc 
> b/gcc/config/rs6000/rs6000-builtin.cc
> index e925ba9fad9..944ae9fe55c 100644
> --- a/gcc/config/rs6000/rs6000-builtin.cc
> +++ b/gcc/config/rs6000/rs6000-builtin.cc
> @@ -1588,6 +1588,8 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>gimple_set_location (g, gimple_location (stmt));
>gsi_replace (gsi, g, true);
>return true;
> +/* fold into MIN_EXPR when fast-math is set.  */
> +case RS6000_BIF_XSMINDP:
>  /* flavors of vec_min.  */
>  case RS6000_BIF_XVMINDP:
>  case RS6000_BIF_XVMINSP:
> @@ -1614,6 +1616,8 @@ rs6000_gimple_fold_builtin (gimple_stmt_iterator *gsi)
>gimple_set_location (g, gimple_location (stmt));
>gsi_replace (gsi, g, true);
>return true;
> +/* fold into MAX_EXPR when fast-math is set.  */
> +case RS6000_BIF_XSMAXDP:
>  /* flavors of vec_max.  */
>  case RS6000_BIF_XVMAXDP:
>  case RS6000_BIF_XVMAXSP:
> diff --git a/gcc/config/rs6000/rs6000-builtins.def 
> b/gcc/config/rs6000/rs6000-builtins.def
> index f4a9f24bcc5..8b735493b40 100644
> --- a/gcc/config/rs6000/rs6000-builtins.def
> +++ b/gcc/config/rs6000/rs6000-builtins.def
> @@ -1613,10 +1613,10 @@
>  XSCVSPDP vsx_xscvspdp {}
> 
>const double __builtin_vsx_xsmaxdp (double, double);
> -XSMAXDP smaxdf3 {}
> +XSMAXDP fmaxdf3 {}
> 
>const double __builtin_vsx_xsmindp (double, double);
> -XSMINDP smindf3 {}
> +XSMINDP fmindf3 {}
> 
>const double __builtin_vsx_xsrdpi (double);
>  XSRDPI vsx_xsrdpi {}
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index bf85baa5370..ae0dd98f0f9 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -158,6 +158,8 @@ (define_c_enum "unspec"
> UNSPEC_HASHCHK
> UNSPEC_XXSPLTIDP_CONST
> UNSPEC_XXSPLTIW_CONST
> +   UNSPEC_FMAX
> +   UNSPEC_FMIN
>])
> 
>  ;;
> @@ -5341,6 +5343,22 @@ (define_insn_and_split "*s3_fpr"
>DONE;
>  })
> 
> +
> +(define_int_iterator FMINMAX [UNSPEC_FMAX UNSPEC_FMIN])
> +
> +(define_int_attr  minmax_op [(UNSPEC_FMAX "max")
> +  (UNSPEC_FMIN "min")])
> +
> +(define_insn "f<minmax_op><mode>3"
> +  [(set (match_operand:SFDF 0 "vsx_register_operand" "=wa")
> + (unspec:SFDF [(match_operand:SFDF 1 "vsx_register_operand" "wa")
> +   (match_operand:SFDF 2 "vsx_register_operand" "wa")]
> +  FMINMAX))]
> +  "TARGET_VSX && !flag_finite_math_only"
> +  "xs<minmax_op>dp %x0,%x1,%x2"
> +  [(set_attr "type" "fp")]
> +)
> +
>  (define_expand "mov<mode>cc"
> [(set (match_operand:GPR 0 "gpc_reg_operand")
>(if_then_else:GPR (match_operand 1 "comparison_operator")
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr103605-1.c 
> b/gcc/testsuite/gcc.target/powerpc/pr103605-1.c
> new file mode 100644
> index 000..923deec6a1e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr103605-1.c
> @@ -0,0 +1,7 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_vsx_ok } */
> +/* { dg-options "-O2 -mvsx" } */

Nit: Add a comment here like:

/* Verify that GCC generates expected min/max hw insns instead of fmin/fmax 
calls. */

> +/* { dg-final { scan-assembler-times {\mxsmaxdp\M} 3 } } */
> +/* { dg-final { scan-assembler-times {\mxsmindp\M} 3 } } */
> +
> +#include "pr103605.h"
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr103605-2.c 
> b/gcc/testsuite/gcc.target/powerpc/pr103605-2.c
> new file mode 100644
> index 

Re: rs6000: Fix expected counts powerpc/p9-vec-length-full

2023-07-18 Thread Kewen.Lin via Gcc-patches
Hi Carl,

The issue was tracked by PR109971 
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109971)
and I think it had been resolved.

btw, when an expected insn count changes, it does expose some issue,
but that can be either a test issue or a functionality issue; if it's
taken as a test issue, it needs some justification for why the count
changes like that and why the change is expected.

BR,
Kewen

on 2023/7/18 23:39, Carl Love wrote:
> Ping
> 
> On Thu, 2023-06-01 at 16:11 -0700, Carl Love wrote:
>> GCC maintainers:
>>
>> The following patch updates the expected instruction counts in four
>> tests.  The counts in all of the tests changed with commit
>> f574e2dfae79055f16d0c63cc12df24815d8ead6.  
>>
>> The updated counts have been verified on both Power 9 and Power 10.
>>
>> Please let me know if this patch is acceptable for mainline.  Thanks.
>>
>>   Carl 
>>
>> 
>> rs6000: Fix expected counts powerpc/p9-vec-length-full tests
>>
>> The counts for instructions lxvl and stxvl in tests:
>>
>>   p9-vec-length-full-1.c
>>   p9-vec-length-full-2.c
>>   p9-vec-length-full-6.c
>>   p9-vec-length-full-7.c
>>
>> changed with commit:
>>
>>commit f574e2dfae79055f16d0c63cc12df24815d8ead6
>>Author: Ju-Zhe Zhong 
>>Date:   Thu May 25 22:42:35 2023 +0800
>>
>>  VECT: Add decrement IV iteration loop control by variable amount
>> support
>>
>>  This patch is supporting decrement IV by following the flow
>> designed by
>>  Richard:
>>...
>>
>> The expected counts for lxvl changed from 20 to 40 and the counts for
>> stxvl
>> changed from 10 to 20 in the first three tests.  The number of stxvl
>> instructions changed from 12 to 20 in p9-vec-length-full-7.c.  This
>> patch updates the number of expected instructions in the four tests.
>>
>> The counts have been verified on Power 9 and Power 10.
>> ---
>>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c | 4 ++--
>>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c | 4 ++--
>>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c | 4 ++--
>>  gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c | 2 +-
>>  4 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
>> b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
>> index f01f1c54fa5..5e4f34421d3 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-1.c
>> @@ -12,5 +12,5 @@
>>  /* { dg-final { scan-assembler-not   {\mstxv\M} } } */
>>  /* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
>>  /* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
>> -/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
>> -/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
>> +/* { dg-final { scan-assembler-times {\mlxvl\M} 40 } } */
>> +/* { dg-final { scan-assembler-times {\mstxvl\M} 20 } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
>> b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
>> index f546e97fa7d..c7d927382c3 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-2.c
>> @@ -12,5 +12,5 @@
>>  /* { dg-final { scan-assembler-not   {\mstxv\M} } } */
>>  /* { dg-final { scan-assembler-not   {\mlxvx\M} } } */
>>  /* { dg-final { scan-assembler-not   {\mstxvx\M} } } */
>> -/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
>> -/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
>> +/* { dg-final { scan-assembler-times {\mlxvl\M} 40 } } */
>> +/* { dg-final { scan-assembler-times {\mstxvl\M} 20 } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
>> b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
>> index 65ddf2b098a..f3be3842c62 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-6.c
>> @@ -11,5 +11,5 @@
>>  /* It can use normal vector load for constant vector load.  */
>>  /* { dg-final { scan-assembler-times {\mstxvx?\M} 6 } } */
>>  /* 64bit/32bit pairs won't use partial vectors.  */
>> -/* { dg-final { scan-assembler-times {\mlxvl\M} 10 } } */
>> -/* { dg-final { scan-assembler-times {\mstxvl\M} 10 } } */
>> +/* { dg-final { scan-assembler-times {\mlxvl\M} 20 } } */
>> +/* { dg-final { scan-assembler-times {\mstxvl\M} 20 } } */
>> diff --git a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c
>> b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c
>> index e0e51d9a972..da086f1826a 100644
>> --- a/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c
>> +++ b/gcc/testsuite/gcc.target/powerpc/p9-vec-length-full-7.c
>> @@ -12,4 +12,4 @@
>>
>>  /* Each type has one stxvl excepting for int8 and uint8, that have
>> two due to
>> rtl pass bbro duplicating the block which has one stxvl.  */
>> -/* { dg-final { scan-assembler-times {\mstxvl\M} 

Re: [PATCH, rs6000] Generate mfvsrwz for all platforms and remove redundant zero extend [PR106769]

2023-07-18 Thread Kewen.Lin via Gcc-patches
Hi Haochen,

on 2023/6/19 09:14, HAO CHEN GUI wrote:
> Hi,
>   This patch modifies vsx extract expander and generates mfvsrwz/stxsiwx
> for all platforms when the mode is V4SI and the index of extracted element
> is 1 for BE and 2 for LE. Also this patch adds a insn pattern for mfvsrwz
> which can help eliminate redundant zero extend.
> 
>   Bootstrapped and tested on powerpc64-linux BE and LE with no regressions.
> 
> Thanks
> Gui Haochen
> 
> 
> ChangeLog
> rs6000: Generate mfvsrwz for all platforms and remove redundant zero extend
> 
> mfvsrwz has lower latency than xxextractuw.  So it should be generated

Nice, it also has lower latency than vextuw[lr]x.

> even with p9 vector enabled if possible.  Also the instruction is
> already zero extended.  A combine pattern is needed to eliminate
> redundant zero extend instructions.
> 
> gcc/
>   PR target/106769
>   * config/rs6000/vsx.md (expand vsx_extract_<mode>): Skip calling
>   gen_vsx_extract_<mode>_p9 when it can be implemented by
>   mfvsrwz/stxsiwx.
>   (*vsx_extract_<mode>_di_p9): Do not generate the insn when it can
>   be generated by mfvsrwz.
>   (mfvsrwz): New insn pattern.
>   (*vsx_extract_si): Rename to...
>   (vsx_extract_si): ..., remove redundant insn condition and
>   generate the insn on p9 when it can be implemented by
>   mfvsrwz/stxsiwx.  Add a dup alternative for simple vector moving.
>   Remove reload_completed from split condition as it's unnecessary.
>   Remove unnecessary checking from preparation statements.  Set
>   type and length attributes for each alternative.
> 
> gcc/testsuite/
>   PR target/106769
>   * gcc.target/powerpc/pr106769.h: New.
>   * gcc.target/powerpc/pr106769-p8.c: New.
>   * gcc.target/powerpc/pr106769-p9.c: New.
> 
> diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
> index 0a34ceebeb5..09b0f83db86 100644
> --- a/gcc/config/rs6000/vsx.md
> +++ b/gcc/config/rs6000/vsx.md
> @@ -3728,7 +3728,9 @@ (define_expand  "vsx_extract_<mode>"

I noticed that we special-case VSX_D for vsx_extract_<mode>, which then
has two dedicated define_insns for the special operand 2 values (0/1), like:

  define_insn "*vsx_extract_<mode>_0" and "..._1".

I wonder if we can do a similar thing to special-case VSX_W; it seems
clearer?

>"VECTOR_MEM_VSX_P (mode) && TARGET_DIRECT_MOVE_64BIT"
>  {
>/* If we have ISA 3.0, we can do a xxextractuw/vextractu{b,h}.  */
> -  if (TARGET_P9_VECTOR)
> +  if (TARGET_P9_VECTOR
> +  && (<MODE>mode != V4SImode
> +   || INTVAL (operands[2]) != (BYTES_BIG_ENDIAN ? 1 : 2)))
>  {
>emit_insn (gen_vsx_extract__p9 (operands[0], operands[1],
>   operands[2]));
> @@ -3798,7 +3800,9 @@ (define_insn_and_split "*vsx_extract_<mode>_di_p9"
> (match_operand:VSX_EXTRACT_I 1 "gpc_reg_operand" "v,")
> (parallel [(match_operand:QI 2 "const_int_operand" "n,n")]
> (clobber (match_scratch:SI 3 "=r,X"))]
> -  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_VEXTRACTUB"
> +  "TARGET_VEXTRACTUB
> +   && (<MODE>mode != V4SImode
> +   || INTVAL (operands[2]) != (BYTES_BIG_ENDIAN ? 1 : 2))"
>"#"
>"&& reload_completed"
>[(parallel [(set (match_dup 4)
> @@ -3830,58 +3834,67 @@ (define_insn_and_split "*vsx_extract_<mode>_store_p9"
> (set (match_dup 0)
>   (match_dup 3))])
> 
> -(define_insn_and_split  "*vsx_extract_si"
> -  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,wa,Z")
> +(define_insn "mfvsrwz"
> +  [(set (match_operand:DI 0 "register_operand" "=r")
> + (zero_extend:DI
> +   (vec_select:SI
> + (match_operand:V4SI 1 "vsx_register_operand" "wa")
> + (parallel [(match_operand:QI 2 "const_int_operand" "n")]
> +   (clobber (match_scratch:V4SI 3 "=v"))]
> +  "TARGET_DIRECT_MOVE_64BIT
> +   && INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 1 : 2)"
> +  "mfvsrwz %0,%x1"
> +  [(set_attr "type" "mfvsr")
> +   (set_attr "isa" "p8v")])
> +
> +(define_insn_and_split  "vsx_extract_si"
> +  [(set (match_operand:SI 0 "nonimmediate_operand" "=r,wa,Z,wa")
>   (vec_select:SI
> -  (match_operand:V4SI 1 "gpc_reg_operand" "v,v,v")
> -  (parallel [(match_operand:QI 2 "const_0_to_3_operand" "n,n,n")])))
> -   (clobber (match_scratch:V4SI 3 "=v,v,v"))]
> -  "VECTOR_MEM_VSX_P (V4SImode) && TARGET_DIRECT_MOVE_64BIT && 
> !TARGET_P9_VECTOR"
> -  "#"
> -  "&& reload_completed"
> +  (match_operand:V4SI 1 "gpc_reg_operand" "v,v,v,0")
> +  (parallel [(match_operand:QI 2 "const_0_to_3_operand" "n,n,n,n")])))
> +   (clobber (match_scratch:V4SI 3 "=v,v,v,v"))]
> +  "TARGET_DIRECT_MOVE_64BIT
> +   && (!TARGET_P9_VECTOR || INTVAL (operands[2]) == (BYTES_BIG_ENDIAN ? 1 : 
> 2))"
> +{
> +   if (which_alternative == 0)
> + return "mfvsrwz %0,%x1";
> +
> +   if (which_alternative == 1)
> + return "xxlor %x0,%x1,%x1";
> +
> +   if (which_alternative == 2)
> + return "stxsiwx %x1,%y0";
> +
> +   return ASM_COMMENT_START " vec_extract to same register";
> +}
> +  "&& INTVAL (operands[2]) != 

Re: [PATCH] vect: Initialize new_temp to avoid false positive warning [PR110652]

2023-07-17 Thread Kewen.Lin via Gcc-patches
on 2023/7/17 14:39, Richard Biener wrote:
> On Mon, Jul 17, 2023 at 4:22 AM Kewen.Lin  wrote:
>>
>> Hi,
>>
>> As PR110652 and its duplicate PRs show, there could be one
>> build error
>>
>>   error: 'new_temp' may be used uninitialized
>>
>> for some build configurations.  It's a false positive warning
>> (or error at -Werror), but in order to make the build succeed,
>> this patch is to initialize the reported variable 'new_temp'
>> as NULL_TREE.
>>
>> Confirmed this patch fixed the reported issue in PR110652
>> (with the same configuration).
>>
>> Is it ok for trunk?
> 
> OK.

Thanks Richi, pushed as r14-2560.

BR,
Kewen


[PATCH] vect: Initialize new_temp to avoid false positive warning [PR110652]

2023-07-16 Thread Kewen.Lin via Gcc-patches
Hi,

As PR110652 and its duplicate PRs show, there could be one
build error

  error: 'new_temp' may be used uninitialized

for some build configurations.  It's a false positive warning
(or error at -Werror), but in order to make the build succeed,
this patch is to initialize the reported variable 'new_temp'
as NULL_TREE.

Confirmed this patch fixed the reported issue in PR110652
(with the same configuration).

Is it ok for trunk?

BR,
Kewen
-
PR tree-optimization/110652

gcc/ChangeLog:

* tree-vect-stmts.cc (vectorizable_load): Initialize new_temp as
NULL_TREE.
---
 gcc/tree-vect-stmts.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index c08d0ef951f..cb86d544313 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -9297,7 +9297,8 @@ vectorizable_load (vec_info *vinfo,
   class loop *containing_loop = gimple_bb (stmt_info->stmt)->loop_father;
   bool nested_in_vect_loop = false;
   tree elem_type;
-  tree new_temp;
+  /* Avoid false positive uninitialized warning, see PR110652.  */
+  tree new_temp = NULL_TREE;
   machine_mode mode;
   tree dummy;
   tree dataref_ptr = NULL_TREE;
--
2.31.1


Re: [PATCH ver 3] rs6000, fix vec_replace_unaligned built-in arguments

2023-07-13 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/8 04:18, Carl Love wrote:
> 
> GCC maintainers:
> 
> Version 3, added code to altivec_resolve_overloaded_builtin so the
> correct instruction is selected for the size of the second argument. 
> This restores the instruction counts to the original values where the
> correct instructions were originally being generated.  The naming of

Nice, I have some comments inlined below.

> the overloaded builtin instances and builtin definitions were changed
> to reflect the type of the second argument since the type of the first
> argument is now the same for all overloaded instances.  A new builtin
> test file was added for the case where the first argument is cast to
> the unsigned long long type.  This test requires the -flax-vector-
> conversions gcc command line option.  Since the other tests do not
> require this option, I felt that the new test needed to be in a
> separate file.  Finally some formatting fixes were made in the original
> test file.  Patch has been retested on Power 10 with no regressions.
> 
> Version 2, fixed various typos.  Updated the change log body to say the
> instruction counts were updated.  The instruction counts changed as a
> result of changing the first argument of the vec_replace_unaligned
> builtin call from vector unsigned long long (vull) to vector unsigned
> char (vuc).  When the first argument was vull the builtin call
> generated the vinsd instruction for the two test cases.  The updated
> call with vuc as the first argument generates two vinsw instructions
> instead.  Patch was retested on Power 10 with no regressions.
> 
> The following patch fixes the first argument in the builtin definition
> and the corresponding test cases.  Initially, the builtin specification
> was wrong due to a cut and past error.  The documentation was fixed in:
> 
>commit ed3fea09b18f67e757b5768b42cb6e816626f1db
>Author: Bill Schmidt 
>Date:   Fri Feb 4 13:07:17 2022 -0600
> 
>rs6000: Correct function prototypes for vec_replace_unaligned
> 
>Due to a pasto error in the documentation, vec_replace_unaligned
> was
>implemented with the same function prototypes as
> vec_replace_elt.  It was
>intended that vec_replace_unaligned always specify output
> vectors as having
>type vector unsigned char, to emphasize that elements are
> potentially
>misaligned by this built-in function.  This patch corrects the
>misimplementation.
> 
> 
> This patch fixes the arguments in the definitions and updates the
> testcases accordingly.  Additionally, a few minor spacing issues are
> fixed.
> 
> The patch has been tested on Power 10 with no regressions.  Please let
> me know if the patch is acceptable for mainline.  Thanks.
> 
>  Carl 
> 
> --
> rs6000, fix vec_replace_unaligned built-in arguments
> 
> The first argument of the vec_replace_unaligned built-in should always be
> unsigned char, as specified in gcc/doc/extend.texi.

Maybe "be with type vector unsigned char"?

> 
> This patch fixes the builtin definitions and updates the test cases to use
> the correct arguments.  The original test file is renamed and a second test
> file is added for a new test case.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def: Rename
>   __builtin_altivec_vreplace_un_uv2di as __builtin_altivec_vreplace_un_udi
>   __builtin_altivec_vreplace_un_uv4si as __builtin_altivec_vreplace_un_usi
>   __builtin_altivec_vreplace_un_v2df as __builtin_altivec_vreplace_un_df
>   __builtin_altivec_vreplace_un_v2di as __builtin_altivec_vreplace_un_di
>   __builtin_altivec_vreplace_un_v4sf as __builtin_altivec_vreplace_un_sf
>   __builtin_altivec_vreplace_un_v4si as __builtin_altivec_vreplace_un_si.
>   Rename VREPLACE_UN_UV2DI as VREPLACE_UN_UDI, VREPLACE_UN_UV4SI as
>   VREPLACE_UN_USI, VREPLACE_UN_V2DF as VREPLACE_UN_DF,
>   VREPLACE_UN_V2DI as VREPLACE_UN_DI, VREPLACE_UN_V4SF as
>   VREPLACE_UN_SF, VREPLACE_UN_V4SI as VREPLACE_UN_SI.
>   Rename vreplace_un_v2di as vreplace_un_di, vreplace_un_v4si as
>   vreplace_un_si, vreplace_un_v2df as vreplace_un_df,
>   vreplace_un_v2di as vreplace_un_di, vreplace_un_v4sf as
>   vreplace_un_sf, vreplace_un_v4si as vreplace_un_si.
>   * config/rs6000/rs6000-c.cc (find_instance): Add new argument
>   nargs.  Add nargs check.  Extend function to handle three arguments.
>   (altivec_resolve_overloaded_builtin): Add new argument nargs to
>   function calls.  Add case RS6000_OVLD_VEC_REPLACE_UN.
>   * config/rs6000/rs6000-overload.def (__builtin_vec_replace_un):
>   Fix first argument type.  Rename VREPLACE_UN_UV4SI as
>   VREPLACE_UN_USI, VREPLACE_UN_V4SI as VREPLACE_UN_SI,
>   VREPLACE_UN_UV2DI as VREPLACE_UN_UDI, VREPLACE_UN_V2DI as
>   VREPLACE_UN_DI, VREPLACE_UN_V4SF as VREPLACE_UN_SF,
>   VREPLACE_UN_V2DF as VREPLACE_UN_DF.
>   * config/rs6000/vsx.md (VEC_RU): New mode 

Re: [PATCH ver4] rs6000, Add return value to __builtin_set_fpscr_rn

2023-07-13 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/12 02:06, Carl Love wrote:
> GCC maintainers:
> 
> Ver 4, Removed extra space in subject line.  Added comment to commit
> log comments about new __SET_FPSCR_RN_RETURNS_FPSCR__ define.  Changed
> Added to Add and Renamed to Rename in ChangeLog.  Updated define_expand
> "rs6000_set_fpscr_rn" per Peter's comments to use new temporary
> register for output value.  Also, comments from Kewen about moving rtx
> tmp_di1 close to use.  Renamed tmp_di2 as orig_df_in_di.  Additionally,
> changed the name of tmp_di3 to tmp_di2 so the numbering is
> sequential.  Moved the new rtx tmp_di2 = gen_reg_rtx (DImode); right
> before its use to be consistent with previous move request.  Fixed tabs
> in comment.  Remove -std=c99 from test_fpscr_rn_builtin_1.c. Cleaned up
> comment and removed abort from test_fpscr_rn_builtin_2.c.  
> 
> Fixed a couple of additional issues with the ChangeLog per feedback
> from git gcc-verify.
> 
> Retested updated patch on Power 8, 9 and 10 to verify changes.
> 
> Ver 3, Renamed the patch per comments on ver 2.  Previous subject line
> was " [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value".  
> Fixed spelling mistakes and formatting.  Updated define_expand
> "rs6000_set_fpscr_rn" to have the rs6000_get_fpscr_fields and
> rs6000_update_fpscr_rn_field define expands inlined.  Optimized the
> code and fixed use of temporary register values. Updated the test file
> dg-do run arguments and dg-options.  Removed the check for
> __SET_FPSCR_RN_RETURNS_FPSCR__. Removed additional references to the
> overloaded built-in with double argument.  Fixed up the documentation
> file.  Updated patch retested on Power 8 BE/LE, Power 9 BE/LE and Power
> 10 LE.
> 
> Ver 2,  Went back thru the requirements and emails.  Not sure where I
> came up with the requirement for an overloaded version with double
> argument.  Removed the overloaded version with the double argument. 
> Added the macro to announce if the __builtin_set_fpscr_rn returns a
> void or a double with the FPSCR bits.  Updated the documentation file. 
> Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the test
> file.  Per request, the original test file functionality was not
> changed.  Just changed the name from test_fpscr_rn_builtin.c to 
> test_fpscr_rn_builtin_1.c.  Put new tests for the return values into a
> new test file, test_fpscr_rn_builtin_2.c.
> 
> The GLibC team requested a builtin to replace the mffscrn and
> mffscrni inline asm instructions in the GLibC code.  Previously there
> was discussion on adding builtins for the mffscrn instructions.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html
> 
> In the end, it was felt that it would be better to extend the existing
> __builtin_set_fpscr_rn builtin to return a double instead of a void
> type.  The desire is that we could have the functionality of the
> mffscrn and mffscrni instructions on older ISAs.  The two instructions
> were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has the
> needed functionality to set the RN field using the mffscrn and mffscrni
> instructions if ISA 3.0 is supported or fall back to using logical
> instructions to mask and set the bits for earlier ISAs.  The
> instructions return the current value of the FPSCR fields DRN, VE, OE,
> UE, ZE, XE, NI, RN bit positions then update the RN bit positions with
> the new RN value provided.
> 
> The current __builtin_set_fpscr_rn builtin has a return type of void. 
> So, changing the return type to double and returning the  FPSCR fields
> DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
> functionally equivalent of the mffscrn and mffscrni instructions.  Any
> current uses of the builtin would just ignore the return value yet any
> new uses could use the return value.  So the requirement is for the
> change to the __builtin_set_fpscr_rn builtin to be backwardly
> compatible and work for all ISAs.
> 
> The following patch changes the return type of the
>  __builtin_set_fpscr_rn builtin from void to double.  The return value
> is the current value of the various FPSCR fields DRN, VE, OE, UE, ZE,
> XE, NI, RN bit positions when the builtin is called.  The builtin then
> updates the RN field with the new value provided as an argument to the
> builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c to
> check that the builtin returns the current value of the FPSCR fields
> and then updates the RN field.
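The mask-and-update behavior described above can be modeled with plain bit operations.  This is an editor's sketch of the semantics only: fpscr_model and set_fpscr_rn_model are invented names, the field layout is reduced to a 2-bit RN field in the low-order bits, and it is not the actual rs6000 expansion:

```c
#include <stdint.h>

/* Stand-in for the FPSCR control/rounding bits; RN is modeled as the
   low 2 bits, the other control fields as the remaining bits.  */
static uint64_t fpscr_model;

/* Model of the described builtin: return the previous field values,
   then update only the RN field with the new rounding mode.  */
static uint64_t
set_fpscr_rn_model (uint64_t new_rn)
{
  uint64_t old = fpscr_model;                     /* value handed back   */
  fpscr_model = (fpscr_model & ~(uint64_t) 0x3)   /* clear old RN bits   */
                | (new_rn & 0x3);                 /* insert new RN value */
  return old;
}
```

Existing callers can keep ignoring the returned value; new callers can save it to restore the rounding mode later, which is the glibc use case described in the thread.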
> 
> The GLibC team has reviewed the patch to make sure it met their needs
> as a drop-in replacement for the inline asm mffscr and mffscrni
> statements in the GLibC code.
> 
> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
> LE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>Carl 
> 
> -
> rs6000, Add return value to __builtin_set_fpscr_rn
> 
> Change the return value from void to double for 

Re: [PATCH ver3] rs6000, Add return value to __builtin_set_fpscr_rn

2023-07-10 Thread Kewen.Lin via Gcc-patches
Hi Carl,

Excepting for Peter's review comments, some nits are inline below.

on 2023/7/11 03:18, Carl Love wrote:
> 
> GCC maintainers:
> 
> Ver 3, Renamed the patch per comments on ver 2.  Previous subject line
> was " [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value".  
> Fixed spelling mistakes and formatting.  Updated define_expand
> "rs6000_set_fpscr_rn" to have the rs6000_get_fpscr_fields and
> rs6000_update_fpscr_rn_field define expands inlined.  Optimized the
> code and fixed use of temporary register values. Updated the test file
> dg-do run arguments and dg-options.  Removed the check for
> __SET_FPSCR_RN_RETURNS_FPSCR__. Removed additional references to the
> overloaded built-in with double argument.  Fixed up the documentation
> file.  Updated patch retested on Power 8 BE/LE, Power 9 BE/LE and Power
> 10 LE.
> 
> Ver 2,  Went back thru the requirements and emails.  Not sure where I
> came up with the requirement for an overloaded version with double
> argument.  Removed the overloaded version with the double argument. 
> Added the macro to announce if the __builtin_set_fpscr_rn returns a
> void or a double with the FPSCR bits.  Updated the documentation file. 
> Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the test
> file.  Per request, the original test file functionality was not
> changed.  Just changed the name from test_fpscr_rn_builtin.c to 
> test_fpscr_rn_builtin_1.c.  Put new tests for the return values into a
> new test file, test_fpscr_rn_builtin_2.c.
> 
> The GLibC team requested a builtin to replace the mffscrn and
> mffscrni inline asm instructions in the GLibC code.  Previously there
> was discussion on adding builtins for the mffscrn instructions.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html
> 
> In the end, it was felt that it would be better to extend the existing
> __builtin_set_fpscr_rn builtin to return a double instead of a void
> type.  The desire is that we could have the functionality of the
> mffscrn and mffscrni instructions on older ISAs.  The two instructions
> were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has the
> needed functionality to set the RN field using the mffscrn and mffscrni
> instructions if ISA 3.0 is supported or fall back to using logical
> instructions to mask and set the bits for earlier ISAs.  The
> instructions return the current value of the FPSCR fields DRN, VE, OE,
> UE, ZE, XE, NI, RN bit positions then update the RN bit positions with
> the new RN value provided.
> 
> The current __builtin_set_fpscr_rn builtin has a return type of void. 
> So, changing the return type to double and returning the  FPSCR fields
> DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
> functionally equivalent of the mffscrn and mffscrni instructions.  Any
> current uses of the builtin would just ignore the return value yet any
> new uses could use the return value.  So the requirement is for the
> change to the __builtin_set_fpscr_rn builtin to be backwardly
> compatible and work for all ISAs.
> 
> The following patch changes the return type of the
>  __builtin_set_fpscr_rn builtin from void to double.  The return value
> is the current value of the various FPSCR fields DRN, VE, OE, UE, ZE,
> XE, NI, RN bit positions when the builtin is called.  The builtin then
> updates the RN field with the new value provided as an argument to the
> builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c to
> check that the builtin returns the current value of the FPSCR fields
> and then updates the RN field.
> 
> The GLibC team has reviewed the patch to make sure it met their needs
> as a drop-in replacement for the inline asm mffscr and mffscrni
> statements in the GLibC code.
> 
> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
> LE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>Carl 
> 
> 
> 
> -
> rs6000, Add return value  to __builtin_set_fpscr_rn

Nit: One more unexpected space.

> 
> Change the return value from void to double for __builtin_set_fpscr_rn.
> The return value consists of the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI,
> RN bit positions.  A new test file, powerpc/test_fpscr_rn_builtin_2.c,
> is added to test the new return value for the built-in.

Nit: It would be better to note the newly added __SET_FPSCR_RN_RETURNS_FPSCR__
in commit log as well.

> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn): Update
>   built-in definition return type.
>   * config/rs6000-c.cc (rs6000_target_modify_macros): Add check,
>   define __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
>   * config/rs6000/rs6000.md (rs6000_set_fpscr_rn): Added return

Nit: s/Added/Add/

>   argument to return FPSCR fields.
>   * doc/extend.texi (__builtin_set_fpscr_rn): Update description for
>   the return value.  

Re: [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value

2023-07-10 Thread Kewen.Lin via Gcc-patches
on 2023/7/11 03:18, Carl Love wrote:
> On Fri, 2023-07-07 at 12:06 +0800, Kewen.Lin wrote:
>> Hi Carl,
>>
>> Some more minor comments are inline below on top of Peter's
>> insightful
>> review comments.
>>
>> on 2023/7/1 08:58, Carl Love wrote:
>>> GCC maintainers:
>>>
>>> Ver 2,  Went back thru the requirements and emails.  Not sure where
>>> I
>>> came up with the requirement for an overloaded version with double
>>> argument.  Removed the overloaded version with the double
>>> argument. 
>>> Added the macro to announce if the __builtin_set_fpscr_rn returns a
>>> void or a double with the FPSCR bits.  Updated the documentation
>>> file. 
>>> Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the
>>> test
>>> file.  Per request, the original test file functionality was not
>>> changed.  Just changed the name from test_fpscr_rn_builtin.c to 
>>> test_fpscr_rn_builtin_1.c.  Put new tests for the return values
>>> into a
>>> new test file, test_fpscr_rn_builtin_2.c.
>>>
>>> The GLibC team requested a builtin to replace the mffscrn and
>>> mffscrni inline asm instructions in the GLibC code.  Previously
>>> there
>>> was discussion on adding builtins for the mffscrn instructions.
>>>
>>> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html
>>>
>>> In the end, it was felt that it would be better to extend the existing
>>> __builtin_set_fpscr_rn builtin to return a double instead of a void
>>> type.  The desire is that we could have the functionality of the
>>> mffscrn and mffscrni instructions on older ISAs.  The two
>>> instructions
>>> were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has
>>> the
>>> needed functionality to set the RN field using the mffscrn and
>>> mffscrni
>>> instructions if ISA 3.0 is supported or fall back to using logical
>>> instructions to mask and set the bits for earlier ISAs.  The
>>> instructions return the current value of the FPSCR fields DRN, VE,
>>> OE,
>>> UE, ZE, XE, NI, RN bit positions then update the RN bit positions
>>> with
>>> the new RN value provided.
>>>
>>> The current __builtin_set_fpscr_rn builtin has a return type of
>>> void. 
>>> So, changing the return type to double and returning the  FPSCR
>>> fields
>>> DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
>>> functionally equivalent of the mffscrn and mffscrni
>>> instructions.  Any
>>> current uses of the builtin would just ignore the return value yet
>>> any
>>> new uses could use the return value.  So the requirement is for the
>>> change to the __builtin_set_fpscr_rn builtin to be backwardly
>>> compatible and work for all ISAs.
>>>
>>> The following patch changes the return type of the
>>>  __builtin_set_fpscr_rn builtin from void to double.  The return
>>> value
>>> is the current value of the various FPSCR fields DRN, VE, OE, UE,
>>> ZE,
>>> XE, NI, RN bit positions when the builtin is called.  The builtin
>>> then updates the RN field with the new value provided as an argument
>>> to the
>>> builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c
>>> to
>>> check that the builtin returns the current value of the FPSCR
>>> fields
>>> and then updates the RN field.
>>>
>>> The GLibC team has reviewed the patch to make sure it met their
>>> needs
>>> as a drop-in replacement for the inline asm mffscr and mffscrni
>>> statements in the GLibC code.
>>>
>>> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power
>>> 10
>>> LE.
>>>
>>> Please let me know if the patch is acceptable for
>>> mainline.  Thanks.
>>>
>>>Carl 
>>>
>>>
>>> --
>>> rs6000, __builtin_set_fpscr_rn add retrun value
>>>
>>> Change the return value from void to double.  The return value
>>> consists of
>>> the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI, RN bit
>>> positions.  Add an
>>> overloaded version which accepts a double argument.
>>>
>>> The test powerpc/test_fpscr_rn_builtin.c is updated to add tests
>>> for the double return value and the new double argument.
>>>
>>> gcc/ChangeLog:
>>> * config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn):
>>> Update
>>> builtin definition return type.
>>> * config/rs6000-c.cc (rs6000_target_modify_macros): Add check,
>>> define __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
>>> * config/rs6000/rs6000.md (rs6000_get_fpscr_fields): New
>>> define_expand.
>>> (rs6000_update_fpscr_rn_field): New define_expand.
>>> (rs6000_set_fpscr_rn): Added return argument.  Updated to use new
>>> rs6000_get_fpscr_fields and rs6000_update_fpscr_rn_field
>>> define_expands.
>>> * doc/extend.texi (__builtin_set_fpscr_rn): Update description
>>> for
>>> the return value and new double argument.  Add description for
>>> __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
>>>
>>> gcc/testsuite/ChangeLog:
>>> gcc.target/powerpc/test_fpscr_rn_builtin.c: Renamed to
>>> test_fpscr_rn_builtin_1.c.  Added comment.
>>> 

Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread Kewen.Lin via Gcc-patches
Hi Juzhe,

> Hi, Kewen.
> 
> You mean the COND_LEN_* family I added is unlikely to be used by Power in the future?

Since AFAIK there is no plan to extend the len family on Power, it's very
likely that Power will have no chance to leverage them, so yes.

> Could you revise them to make it possible to be used by Power in the future
> so that we won't duplicate too many patterns.

Sorry, since we don't have such a plan for this kind of capability, I don't
have any solid inputs or requirements for this patch.  But IMHO the proposed
interfaces look good enough for any potential future uses.

BR,
Kewen

> For example, COND_LEN_* has a mask operand; is it possible that Power can
> also use it with a dummy mask = { -1, -1, ..., -1 }?
> Thanks.
> 


Re: [PATCH] rs6000: Remove redundant initialization [PR106907]

2023-07-10 Thread Kewen.Lin via Gcc-patches
on 2023/7/11 07:11, Peter Bergner wrote:
> On 6/29/23 4:31 AM, Kewen.Lin via Gcc-patches wrote:
>> This is okay for trunk (no backports needed btw), this fix can even be
>> taken as obvious, thanks!
>>
>>>
>>> 2023-06-07  Jeevitha Palanisamy  
>>>
>>> gcc/
>>> PR target/106907
>>
>> One curious question is that this PR106907 seemed not to report this issue;
>> is there another PR reporting this?  Or did I miss something?
> 
> I think Jeevitha just ran cppcheck by hand and noticed the "new" warnings
> and added them to the list of things to fixup.  Yeah, it would be nice to
> add the new warnings to the PR for historical reasons.

Thanks for clarifying it.  Yeah, I noticed Jeevitha added more comments to
that PR. :)

BR,
Kewen


Re: [PATCH v5] rs6000: Update the vsx-vector-6.* tests.

2023-07-10 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/8 04:40, Carl Love wrote:
> 
> GCC maintainers:
> 
> Ver 5. Removed -compile from the names of the compile only tests. Fixed
> up the reference to the compile file names in the .h file headers. 
> Replaced powerpc_vsx_ok with vsx_hw in the run test files.  Removed the
> -save-temps from all files.  Retested on all of the various platforms
> with no regressions.
> 
> Ver 4. Fixed a few typos.  Redid the tests to create separate run and
> compile tests.
> 
> Ver 3.  Added __attribute__ ((noipa)) to the test files.  Changed some
> of the scan-assembler-times checks to cover multiple similar
> instructions.  Changed the function check macro to a macro to generate a
> function to do the test and check the results.  Retested on the various
> processor types and BE/LE versions.
> 
> Ver 2.  Switched to using code macros to generate the call to the
> builtin and test the results.  Added in instruction counts for the key
> instruction for the builtin.  Moved the tests into an additional
> function call to ensure the compile doesn't replace the builtin call
> code with the statically computed results.  The compiler was doing this
> for a few of the simpler tests.  
> 
> The following patch takes the tests in vsx-vector-6-p7.h,  vsx-vector-
> 6-p8.h, vsx-vector-6-p9.h and reorganizes them into a series of smaller
> test files by functionality rather than processor version.
> 
> Tested the patch on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE with
> no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.

This patch is okay for trunk, thanks for the patience!

BR,
Kewen

> 
>Carl
> 
> 
> 
> -
> rs6000: Update the vsx-vector-6.* tests.
> 
> The vsx-vector-6.h file is included in the processor-specific test files
> vsx-vector-6.p7.c, vsx-vector-6.p8.c, and vsx-vector-6.p9.c.  The .h file
> contains a large number of vsx vector built-in tests.  The processor
> specific files contain the number of instructions that the tests are
> expected to generate for that processor.  The tests are compile only.
> 
> This patch reworks the tests into a series of files for related tests.
> The new tests consist of a runnable test to verify the built-in argument
> types and the functional correctness of each built-in.  There is also a
> compile only test that verifies the built-ins generate the expected number
> of instructions for the various built-in tests.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/vsx-vector-6-func-1op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-1op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-1op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all.c: New test
>   file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6.h: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p7.c: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p8.c: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p9.c: Remove test file.
> ---
>  .../powerpc/vsx-vector-6-func-1op-run.c   |  98 
>  .../powerpc/vsx-vector-6-func-1op.c   |  22 ++
>  .../powerpc/vsx-vector-6-func-1op.h   |  43 
>  .../powerpc/vsx-vector-6-func-2lop-run.c  | 177 ++
>  .../powerpc/vsx-vector-6-func-2lop.c  |  14 ++
>  .../powerpc/vsx-vector-6-func-2lop.h  |  47 
>  .../powerpc/vsx-vector-6-func-2op-run.c   |  96 
>  .../powerpc/vsx-vector-6-func-2op.c   |  21 ++
>  .../powerpc/vsx-vector-6-func-2op.h   |  42 
>  .../powerpc/vsx-vector-6-func-3op-run.c   | 229 ++
>  .../powerpc/vsx-vector-6-func-3op.c   |  17 ++
>  .../powerpc/vsx-vector-6-func-3op.h   |  73 ++
>  .../powerpc/vsx-vector-6-func-cmp-all-run.c   | 147 +++
>  .../powerpc/vsx-vector-6-func-cmp-all.c   |  17 ++
>  .../powerpc/vsx-vector-6-func-cmp-all.h   |  76 ++
>  

Re: [PATCH] VECT: Add COND_LEN_* operations for loop control with length targets

2023-07-10 Thread Kewen.Lin via Gcc-patches
on 2023/7/10 18:40, Richard Biener wrote:
> On Fri, 7 Jul 2023, juzhe.zh...@rivai.ai wrote:
> 
>> From: Ju-Zhe Zhong 
>>
>> Hi, Richard and Richi.
>>
>> This patch is adding cond_len_* operations pattern for target support 
>> loop control with length.
> 
> It looks mostly OK - the probably obvious question is with regard
> to the "missing" bias argument ...
> 
> IBM folks - is there any expectation that the set of len family of
> instructions increases or will they be accounted as "mistake" and
> future additions will happen in different ways?

As far as I know, there is no plan to extend this len family on Power,
and I guess any future extension would very likely adopt a different way.

BR,
Kewen

> 
> At the moment I'd say for consistency reasons 'len' should always
> come with 'bias'.
> 
> Thanks,
> Richard.
> 
>> These patterns will be used in these following case:
>>
>> 1. Integer division:
>>void
>>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
>>{
>>  for (int i = 0; i < n; ++i)
>>   {
>> a[i] = b[i] / c[i];
>>   }
>>}
>>
>>   ARM SVE IR:
>>   
>>   ...
>>   max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });
>>
>>   Loop:
>>   ...
>>   # loop_mask_29 = PHI 
>>   ...
>>   vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
>>   ...
>>   vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
>>   vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, 
>> vect__4.8_28);
>>   ...
>>   .MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
>>   ...
>>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>>   ...
>>   
>>   For target like RVV who support loop control with length, we want to see 
>> IR as follows:
>>   
>>   Loop:
>>   ...
>>   # loop_len_29 = SELECT_VL
>>   ...
>>   vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
>>   ...
>>   vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
>>   vect__8.12_24 = .COND_LEN_DIV (dummy_mask, vect__4.8_28, vect__6.11_25, 
>> vect__4.8_28, loop_len_29);
>>   ...
>>   .LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
>>   ...
>>   next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
>>   ...
>>   
>>   Notice here, we use dummy_mask = { -1, -1, ..., -1 }
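The length-controlled flow sketched in the IR above can be modeled as a scalar strip-mined loop.  This is an editor's illustration under the assumption that SELECT_VL simply returns min(remaining, VF); the helper names are illustrative, not RVV or GCC internals:

```c
#include <stddef.h>
#include <stdint.h>

/* Models SELECT_VL: how many elements this iteration handles.  */
static size_t
select_vl_model (size_t remaining, size_t vf)
{
  return remaining < vf ? remaining : vf;
}

/* Scalar model of the vectorized loop: each outer iteration processes
   loop_len elements, standing in for LEN_MASK_LOAD/LEN_MASK_STORE and
   COND_LEN_DIV with an all-ones dummy mask.  */
void
div_loop_model (int32_t *a, const int32_t *b, const int32_t *c, size_t n)
{
  const size_t vf = 4;          /* stand-in vectorization factor */
  for (size_t i = 0; i < n;)
    {
      size_t loop_len = select_vl_model (n - i, vf);
      for (size_t j = 0; j < loop_len; j++)
        a[i + j] = b[i + j] / c[i + j];
      i += loop_len;
    }
}
```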
>>
>> 2. Integer conditional division:
>>Similar case with (1) but with condtion:
>>void
>>f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t 
>> * cond, int n)
>>{
>>  for (int i = 0; i < n; ++i)
>>{
>>  if (cond[i])
>>  a[i] = b[i] / c[i];
>>}
>>}
>>
>>ARM SVE:
>>...
>>max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });
>>
>>Loop:
>>...
>># loop_mask_55 = PHI 
>>...
>>vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
>>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
>>...
>>vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
>>...
>>vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
>>vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, 
>> vect__8.16_66, vect__6.13_62);
>>...
>>.MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
>>...
>>next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
>>
>>Here, ARM SVE uses vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to
>> guarantee the correct result.
>>
>>However, targets with length control cannot perform this elegant flow;
>> for RVV, we would expect:
>>
>>Loop:
>>...
>>loop_len_55 = SELECT_VL
>>...
>>mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>...
>>vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, 
>> vect__8.16_66, vect__6.13_62, loop_len_55);
>>...
>>
>>Here we expect COND_LEN_DIV predicated by a real mask which is the 
>> outcome of comparison: mask__29.10_58 = vect__4.9_56 != { 0, ... };
>>and a real length which is produced by loop control : loop_len_55 = 
>> SELECT_VL
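The pairing of a real comparison mask with a real length can be modeled lane by lane.  This editor's sketch follows the semantics described above — a lane is active only when it is below the length and its mask bit is set; inactive lanes take the else operand (the names here are illustrative, not GCC internals):

```c
#include <stddef.h>
#include <stdint.h>

/* Lane-wise model of COND_LEN_DIV: lanes below len whose mask bit is set
   take b[i] / c[i]; all other lanes take the else value els[i].  */
static void
cond_len_div_model (int32_t *dest, const int *mask, const int32_t *b,
                    const int32_t *c, const int32_t *els,
                    size_t vl, size_t len)
{
  for (size_t i = 0; i < vl; i++)
    dest[i] = (i < len && mask[i]) ? b[i] / c[i] : els[i];
}
```

Division is the motivating case here: an inactive lane must not execute b[i] / c[i] at all, since c[i] could be zero there.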
>>
>> 3. conditional Floating-point operations (no -ffast-math):
>>
>> void
>> f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
>> {
>>   for (int i = 0; i < n; ++i)
>> {
>>   if (cond[i])
>>   a[i] = b[i] + a[i];
>> }
>> }
>>   
>>   ARM SVE IR:
>>   max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });
>>
>>   ...
>>   # loop_mask_49 = PHI 
>>   ...
>>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>>   vec_mask_and_55 = loop_mask_49 & mask__27.10_52;
>>   ...
>>   vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, 
>> vect__6.13_56);
>>   ...
>>   next_mask_71 = .WHILE_ULT (_22, bnd.6_46, { 0, ... });
>>   ...
>>   
>>   For RVV, we would expect IR:
>>   
>>   ...
>>   loop_len_49 = SELECT_VL
>>   ...
>>   mask__27.10_52 = vect__4.9_50 != { 0, ... };
>>   ...
>>   vect__9.17_62 = .COND_LEN_ADD (mask__27.10_52, vect__6.13_56, 
>> vect__8.16_60, vect__6.13_56, loop_len_49);
>>   ...
>>
>> 4. 

Re: [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value

2023-07-06 Thread Kewen.Lin via Gcc-patches
on 2023/7/7 07:00, Peter Bergner wrote:
> On 7/6/23 5:54 PM, Peter Bergner wrote:
>> On 6/30/23 7:58 PM, Carl Love via Gcc-patches wrote:
>>> +++ b/gcc/testsuite/gcc.target/powerpc/test_fpscr_rn_builtin_2.c
>>> @@ -0,0 +1,153 @@
>>> +/* { dg-do run { target { powerpc*-*-* } } } */
>>
>> powerpc*-*-* is the default for this test directory, so you can drop that,
>> but you need to disable this test for soft-float systems, so you probably 
>> want:
>>
>>   /* { dg-do run { target powerpc_fprs } } */
> 
> We actually want something like powerpc_fprs_hw, but that doesn't exist.
> 

Yeah, good point!  I noticed that we have a few test cases which need to
check the soft-float env as well but don't.  I didn't find any related
issues reported, so I would assume that there is very little actual
testing in this area.  Based on this, I'm not sure if it's worthwhile
to add a new effective target for it.  Personally I'm happy with just using
powerpc_fprs here to keep it simple. :)

BR,
Kewen


Re: [PATCH ver 2] rs6000, __builtin_set_fpscr_rn add retrun value

2023-07-06 Thread Kewen.Lin via Gcc-patches
Hi Carl,

Some more minor comments are inline below on top of Peter's insightful
review comments.

on 2023/7/1 08:58, Carl Love wrote:
> 
> GCC maintainers:
> 
> Ver 2,  Went back thru the requirements and emails.  Not sure where I
> came up with the requirement for an overloaded version with double
> argument.  Removed the overloaded version with the double argument. 
> Added the macro to announce if the __builtin_set_fpscr_rn returns a
> void or a double with the FPSCR bits.  Updated the documentation file. 
> Retested on Power 8 BE/LE, Power 9 BE/LE, Power 10 LE.  Redid the test
> file.  Per request, the original test file functionality was not
> changed.  Just changed the name from test_fpscr_rn_builtin.c to 
> test_fpscr_rn_builtin_1.c.  Put new tests for the return values into a
> new test file, test_fpscr_rn_builtin_2.c.
> 
> The GLibC team requested a builtin to replace the mffscrn and
> mffscrni inline asm instructions in the GLibC code.  Previously there
> was discussion on adding builtins for the mffscrn instructions.
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2023-May/620261.html
> 
> In the end, it was felt that it would be better to extend the existing
> __builtin_set_fpscr_rn builtin to return a double instead of a void
> type.  The desire is that we could have the functionality of the
> mffscrn and mffscrni instructions on older ISAs.  The two instructions
> were initially added in ISA 3.0.  The __builtin_set_fpscr_rn has the
> needed functionality to set the RN field using the mffscrn and mffscrni
> instructions if ISA 3.0 is supported or fall back to using logical
> instructions to mask and set the bits for earlier ISAs.  The
> instructions return the current value of the FPSCR fields DRN, VE, OE,
> UE, ZE, XE, NI, RN bit positions then update the RN bit positions with
> the new RN value provided.
> 
> The current __builtin_set_fpscr_rn builtin has a return type of void. 
> So, changing the return type to double and returning the  FPSCR fields
> DRN, VE, OE, UE, ZE, XE, NI, RN bit positions would then give the
> functionally equivalent of the mffscrn and mffscrni instructions.  Any
> current uses of the builtin would just ignore the return value yet any
> new uses could use the return value.  So the requirement is for the
> change to the __builtin_set_fpscr_rn builtin to be backwardly
> compatible and work for all ISAs.
> 
> The following patch changes the return type of the
>  __builtin_set_fpscr_rn builtin from void to double.  The return value
> is the current value of the various FPSCR fields DRN, VE, OE, UE, ZE,
> XE, NI, RN bit positions when the builtin is called.  The builtin then
> updates the RN field with the new value provided as an argument to the
> builtin.  The patch adds new testcases to test_fpscr_rn_builtin.c to
> check that the builtin returns the current value of the FPSCR fields
> and then updates the RN field.
> 
> The GLibC team has reviewed the patch to make sure it met their needs
> as a drop in replacement for the inline asm mffscr and mffscrni
> statements in the GLibC code.
> 
> The patch has been tested on Power 8 LE/BE, Power 9 LE/BE and Power 10
> LE.
> 
> Please let me know if the patch is acceptable for mainline.  Thanks.
> 
>Carl 
> 
> 
> --
> rs6000, __builtin_set_fpscr_rn add return value
> 
> Change the return value from void to double.  The return value consists of
> the FPSCR fields DRN, VE, OE, UE, ZE, XE, NI, RN bit positions.  Add an
> overloaded version which accepts a double argument.
> 
> The test powerpc/test_fpscr_rn_builtin.c is updated to add tests for the
> double return value and the new double argument.
> 
> gcc/ChangeLog:
>   * config/rs6000/rs6000-builtins.def (__builtin_set_fpscr_rn): Update
>   builtin definition return type.
>   * config/rs6000/rs6000-c.cc (rs6000_target_modify_macros): Add check, define
>   __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
>   * config/rs6000/rs6000.md (rs6000_get_fpscr_fields): New
>   define_expand.
>   (rs6000_update_fpscr_rn_field): New define_expand.
>   (rs6000_set_fpscr_rn): Added return argument.  Updated to use new
>   rs6000_get_fpscr_fields and rs6000_update_fpscr_rn_field
>   define_expands.
>   * doc/extend.texi (__builtin_set_fpscr_rn): Update description for
>   the return value and new double argument.  Add description for
>   __SET_FPSCR_RN_RETURNS_FPSCR__ macro.
> 
> gcc/testsuite/ChangeLog:
>   gcc.target/powerpc/test_fpscr_rn_builtin.c: Renamed to
>   test_fpscr_rn_builtin_1.c.  Added comment.
>   gcc.target/powerpc/test_fpscr_rn_builtin_2.c: New test for the
>   return value of __builtin_set_fpscr_rn builtin.
> ---
>  gcc/config/rs6000/rs6000-builtins.def |   2 +-
>  gcc/config/rs6000/rs6000-c.cc |   4 +
>  gcc/config/rs6000/rs6000.md   |  87 +++---
>  gcc/doc/extend.texi   |  26 ++-
>  
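For readers skimming the thread, the requested semantics -- hand back the current FPSCR control fields, then overwrite only the 2-bit RN field -- can be sketched portably.  The snippet below is an illustration only: a plain variable stands in for the register, and the bit layout is simplified (just the low byte as the returned fields, RN in the low two bits).  It is not the rs6000 implementation nor the real FPSCR layout.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated FPSCR contents; stands in for the real register.  */
static uint64_t fpscr;

/* Sketch of the requested builtin behavior: return the old control
   bits (DRN, VE, OE, UE, ZE, XE, NI, RN in the real register; here
   just the low byte for illustration), then update only the 2-bit
   RN field with the new rounding mode.  */
static uint64_t
set_fpscr_rn_sketch (uint64_t new_rn)
{
  uint64_t old = fpscr & 0xffULL;                 /* bits handed back */
  fpscr = (fpscr & ~0x3ULL) | (new_rn & 0x3ULL);  /* update RN only */
  return old;
}
```

Existing callers that ignore the return value keep working, which is the backward-compatibility requirement discussed above.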

Re: [PATCH v4] rs6000: Update the vsx-vector-6.* tests.

2023-07-06 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/6 23:33, Carl Love wrote:
> GCC maintainers:
> 
> Ver 4. Fixed a few typos.  Redid the tests to create separate run and
> compile tests.

Thanks!  This new version looks good, excepting that we need vsx_hw
for run and two nits, see below.

> 
> Ver 3.  Added __attribute__ ((noipa)) to the test files.  Changed some
> of the scan-assembler-times checks to cover multiple similar
> instructions.  Change the function check macro to a macro to generate a
> function to do the test and check the results.  Retested on the various
> processor types and BE/LE versions.
> 
> Ver 2.  Switched to using code macros to generate the call to the
> builtin and test the results.  Added in instruction counts for the key
> instruction for the builtin.  Moved the tests into an additional
> function call to ensure the compile doesn't replace the builtin call
> code with the statically computed results.  The compiler was doing this
> for a few of the simpler tests.  
> 
> The following patch takes the tests in vsx-vector-6-p7.h,  vsx-vector-
> 6-p8.h, vsx-vector-6-p9.h and reorganizes them into a series of smaller
> test files by functionality rather than processor version.
> 
> Tested the patch on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE with
> no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>Carl
> 
> 
> 
> -
> rs6000: Update the vsx-vector-6.* tests.
> 
> The vsx-vector-6.h file is included into the processor specific test files
> vsx-vector-6.p7.c, vsx-vector-6.p8.c, and vsx-vector-6.p9.c.  The .h file
> contains a large number of vsx vector builtin tests.  The processor
> specific files contain the number of instructions that the tests are
> expected to generate for that processor.  The tests are compile only.
> 
> This patch reworks the tests into a series of files for related tests.
> The new tests consist of a runnable test to verify the builtin argument
> types and the functional correctness of each builtin.  There is also a
> compile only test that verifies the builtins generate the expected number
> of instructions for the various builtin tests.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/vsx-vector-6-func-1op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-1op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-1op-compile.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2lop-compile.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-2op-compile.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-3op-compile.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-all-compile.c: New test
>   file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp.h: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-run.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-func-cmp-compile.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6.h: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p7.c: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p8.c: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6.p9.c: Remove test file.
> ---
>  .../powerpc/vsx-vector-6-func-1op-compile.c   |  22 ++
>  .../powerpc/vsx-vector-6-func-1op-run.c   |  98 
>  .../powerpc/vsx-vector-6-func-1op.h   |  43 
>  .../powerpc/vsx-vector-6-func-2lop-compile.c  |  14 ++
>  .../powerpc/vsx-vector-6-func-2lop-run.c  | 177 ++
>  .../powerpc/vsx-vector-6-func-2lop.h  |  47 
>  .../powerpc/vsx-vector-6-func-2op-compile.c   |  21 ++
>  .../powerpc/vsx-vector-6-func-2op-run.c   |  96 
>  .../powerpc/vsx-vector-6-func-2op.h   |  42 
>  .../powerpc/vsx-vector-6-func-3op-compile.c   |  17 ++
>  .../powerpc/vsx-vector-6-func-3op-run.c   | 229 ++
>  .../powerpc/vsx-vector-6-func-3op.h   |  73 ++
>  .../vsx-vector-6-func-cmp-all-compile.c   |  17 ++
>  .../powerpc/vsx-vector-6-func-cmp-all-run.c   | 147 +++
>  .../powerpc/vsx-vector-6-func-cmp-all.h   |  76 ++
>  .../powerpc/vsx-vector-6-func-cmp-compile.c   |  16 ++
>  .../powerpc/vsx-vector-6-func-cmp-run.c   |  92 +++
>  .../powerpc/vsx-vector-6-func-cmp.h   |  40 +++
>  

Re: [PATCH V4 1/4] rs6000: build constant via li;rotldi

2023-07-03 Thread Kewen.Lin via Gcc-patches
Hi Jeff,

on 2023/7/4 10:18, Jiufu Guo via Gcc-patches wrote:
> Hi,
> 
> If a constant is possible to be rotated to/from a positive or negative
> value from "li", then "li;rotldi" can be used to build the constant.
> 
> Compare with the previous version:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621961.html
> This patch just did minor changes to the style and comments.
> 
> Bootstrap and regtest pass on ppc64{,le}.
> 
> Since the previous version is approved with conditions, this version
> explained the concern too.  If no objection, I would like to apply
> this patch to trunk.
> 
> 
> BR,
> Jeff (Jiufu)
> 
> gcc/ChangeLog:
> 
>   * config/rs6000/rs6000.cc (can_be_built_by_li_and_rotldi): New function.
>   (rs6000_emit_set_long_const): Call can_be_built_by_li_and_rotldi.
> 
> gcc/testsuite/ChangeLog:
> 
>   * gcc.target/powerpc/const-build.c: New test.
> ---
>  gcc/config/rs6000/rs6000.cc   | 47 +--
>  .../gcc.target/powerpc/const-build.c  | 57 +++
>  2 files changed, 98 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/const-build.c
> 
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 42f49e4a56b..acc332acc05 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -10258,6 +10258,31 @@ rs6000_emit_set_const (rtx dest, rtx source)
>return true;
>  }
>  
> +/* Check if value C can be built by 2 instructions: one is 'li', another is
> +   rotldi.

Nit: different style, li is with "'" but rotldi isn't.

> +
> +   If so, *SHIFT is set to the shift operand of rotldi(rldicl), and *MASK
> +   is set to the mask operand of rotldi(rldicl), and return true.
> +   Return false otherwise.  */
> +
> +static bool
> +can_be_built_by_li_and_rotldi (HOST_WIDE_INT c, int *shift,
> +HOST_WIDE_INT *mask)
> +{
> +  /* If C or ~C contains at least 49 successive zeros, then C can be rotated
> + to/from a positive or negative value that 'li' is able to load.  */
> +  int n;
> +  if (can_be_rotated_to_lowbits (c, 15, &n)
> +  || can_be_rotated_to_lowbits (~c, 15, &n))
> +{
> +  *mask = HOST_WIDE_INT_M1;
> +  *shift = HOST_BITS_PER_WIDE_INT - n;
> +  return true;
> +}
> +
> +  return false;
> +}
> +
>  /* Subroutine of rs6000_emit_set_const, handling PowerPC64 DImode.
> Output insns to set DEST equal to the constant C as a series of
> lis, ori and shl instructions.  */
> @@ -10266,15 +10291,14 @@ static void
>  rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
>  {
>rtx temp;
> +  int shift;
> +  HOST_WIDE_INT mask;
>HOST_WIDE_INT ud1, ud2, ud3, ud4;
>  
>ud1 = c & 0xffff;
> -  c = c >> 16;
> -  ud2 = c & 0xffff;
> -  c = c >> 16;
> -  ud3 = c & 0xffff;
> -  c = c >> 16;
> -  ud4 = c & 0xffff;
> +  ud2 = (c >> 16) & 0xffff;
> +  ud3 = (c >> 32) & 0xffff;
> +  ud4 = (c >> 48) & 0xffff;
>  
>if ((ud4 == 0xffff && ud3 == 0xffff && ud2 == 0xffff && (ud1 & 0x8000))
>|| (ud4 == 0 && ud3 == 0 && ud2 == 0 && ! (ud1 & 0x8000)))
> @@ -10305,6 +10329,17 @@ rs6000_emit_set_long_const (rtx dest, HOST_WIDE_INT c)
>emit_move_insn (dest, gen_rtx_XOR (DImode, temp,
>GEN_INT ((ud2 ^ 0x8000) << 16)));
>  }
> +  else if (can_be_built_by_li_and_rotldi (c, &shift, &mask))
> +{
> +  temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
> +  unsigned HOST_WIDE_INT imm = (c | ~mask);
> +  imm = (imm >> shift) | (imm << (HOST_BITS_PER_WIDE_INT - shift));
> +
> +  emit_move_insn (temp, GEN_INT (imm));
> +  if (shift != 0)
> + temp = gen_rtx_ROTATE (DImode, temp, GEN_INT (shift));
> +  emit_move_insn (dest, temp);
> +}
>else if (ud3 == 0 && ud4 == 0)
>  {
>temp = !can_create_pseudo_p () ? dest : gen_reg_rtx (DImode);
> diff --git a/gcc/testsuite/gcc.target/powerpc/const-build.c 
> b/gcc/testsuite/gcc.target/powerpc/const-build.c
> new file mode 100644
> index 000..69b37e2bb53
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/const-build.c
> @@ -0,0 +1,57 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -save-temps" } */
> +/* { dg-require-effective-target has_arch_ppc64 } */
> +
> +/* Verify that two instructions are sucessfully used to build constants.

s/sucessfully/successfully/

> +   One insn is li or lis, another is rotate: rldicl, rldicr or rldic.  */

Nit: This patch is for insn li + insn rldicl only, you probably want to keep
consistent in the comments.

The others look good to me, thanks!

Segher had one question on "~c" before; I saw you had explained it, and it
makes sense to me, but in case he has more questions I'd defer the final
approval to him.

BR,
Kewen
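As a side note, the feasibility test being reviewed can be mirrored in plain C: a constant is a li;rotldi candidate when some rotation of it, or of its bitwise complement, fits in li's 15-bit non-negative range.  The sketch below uses made-up names and a simple loop; it is not the GCC code, which also threads back the shift and mask operands and uses leading/trailing-zero counts instead of iterating.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rotate V left by N bits (N taken modulo 64).  */
static uint64_t
rotl64 (uint64_t v, int n)
{
  n &= 63;
  return n ? (v << n) | (v >> (64 - n)) : v;
}

/* Return true if some rotation of C leaves only the low LOWBITS bits
   set, recording that rotation count in *ROT.  */
static bool
rotatable_to_lowbits (uint64_t c, int lowbits, int *rot)
{
  for (int n = 0; n < 64; n++)
    if (rotl64 (c, n) < (1ULL << lowbits))
      {
	*rot = n;
	return true;
      }
  return false;
}

/* li loads a 16-bit sign-extended immediate, so C is buildable by
   li;rotldi if C or ~C rotates into the low 15 bits.  */
static bool
buildable_by_li_and_rotldi (uint64_t c)
{
  int n;
  return rotatable_to_lowbits (c, 15, &n)
	 || rotatable_to_lowbits (~c, 15, &n);
}
```

The ~c leg covers constants whose rotated form is a small negative value that li can materialize by sign extension.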


Re: [PATCH ver 3] rs6000: Update the vsx-vector-6.* tests.

2023-07-03 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/6/30 05:36, Carl Love wrote:
> GCC maintainers:
> 
> Ver 3.  Added __attribute__ ((noipa)) to the test files.  Changed some
> of the scan-assembler-times checks to cover multiple similar
> instructions.  Change the function check macro to a macro to generate a
> function to do the test and check the results.  Retested on the various
> processor types and BE/LE versions.
> 
> Ver 2.  Switched to using code macros to generate the call to the
> builtin and test the results.  Added in instruction counts for the key
> instruction for the builtin.  Moved the tests into an additional
> function call to ensure the compile doesn't replace the builtin call
> code with the statically computed results.  The compiler was doing this
> for a few of the simpler tests.  
> 
> The following patch takes the tests in vsx-vector-6-p7.h,  vsx-vector-
> 6-p8.h, vsx-vector-6-p9.h and reorganizes them into a series of smaller
> test files by functionality rather than processor version.
> 
> Tested the patch on Power 8 LE/BE, Power 9 LE/BE and Power 10 LE with
> no regressions.
> 
> Please let me know if this patch is acceptable for mainline.  Thanks.
> 
>Carl
> 
> 
> -
> rs6000: Update the vsx-vector-6.* tests.
> 
> The vsx-vector-6.h file is included into the processor specific test files
> vsx-vector-6.p7.c, vsx-vector-6.p8.c, and vsx-vector-6.p9.c.  The .h file
> contains a large number of vsx vector builtin tests.  The processor
> specific files contain the number of instructions that the tests are
> expected to generate for that processor.  The tests are compile only.
> 
> The tests are broken up into a seriers of files for related tests.  The

s/seriers/series/

> new tests are runnable tests to verify the builtin argument types and the
> functional correctness of each test rather than verifying the type and
> number of instructions generated.
> 
> gcc/testsuite/
>   * gcc.target/powerpc/vsx-vector-6-1op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-2lop.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-2op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-3op.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-cmp-all.c: New test file.
>   * gcc.target/powerpc/vsx-vector-6-cmp.c: New test file.

Missing "func-" in the names ...

>   * gcc.target/powerpc/vsx-vector-6.h: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6-p7.h: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6-p8.h: Remove test file.
>   * gcc.target/powerpc/vsx-vector-6-p9.h: Remove test file.

should be vsx-vector-6-p{7,8,9}.c, "git gcc-verify" should catch these.

> ---
>  .../powerpc/vsx-vector-6-func-1op.c   | 141 ++
>  .../powerpc/vsx-vector-6-func-2lop.c  | 217 +++
>  .../powerpc/vsx-vector-6-func-2op.c   | 133 +
>  .../powerpc/vsx-vector-6-func-3op.c   | 257 ++
>  .../powerpc/vsx-vector-6-func-cmp-all.c   | 211 ++
>  .../powerpc/vsx-vector-6-func-cmp.c   | 121 +
>  .../gcc.target/powerpc/vsx-vector-6.h | 154 ---
>  .../gcc.target/powerpc/vsx-vector-6.p7.c  |  43 ---
>  .../gcc.target/powerpc/vsx-vector-6.p8.c  |  43 ---
>  .../gcc.target/powerpc/vsx-vector-6.p9.c  |  42 ---
>  10 files changed, 1080 insertions(+), 282 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-1op.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-2lop.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-2op.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-3op.c
>  create mode 100644 
> gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-cmp-all.c
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-cmp.c
>  delete mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6.h
>  delete mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6.p7.c
>  delete mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6.p8.c
>  delete mode 100644 gcc/testsuite/gcc.target/powerpc/vsx-vector-6.p9.c
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-1op.c 
> b/gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-1op.c
> new file mode 100644
> index 000..52c7ae3e983
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/vsx-vector-6-func-1op.c
> @@ -0,0 +1,141 @@
> +/* { dg-do run { target lp64 } } */
> +/* { dg-skip-if "" { powerpc*-*-darwin* } } */
> +/* { dg-options "-O2 -save-temps" } */

I just noticed that we missed an effective target check here to ensure the
support of those bifs during the test run, and since it's a runnable test
case, we also need to ensure the generated hw insns are supported; it's "vsx_hw"
like:

/* { dg-require-effective-target vsx_hw } */

And adding "-mvsx" to the dg-options.

This is also applied for the other test cases.
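Putting those pieces together, the header of each runnable test would presumably look like the following DejaGnu directives (combining the existing ones quoted above with the vsx_hw and -mvsx additions):

```c
/* { dg-do run { target lp64 } } */
/* { dg-skip-if "" { powerpc*-*-darwin* } } */
/* { dg-require-effective-target vsx_hw } */
/* { dg-options "-O2 -mvsx -save-temps" } */
```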


Re: [PATCH] rs6000: Update the vsx-vector-6.* tests.

2023-07-03 Thread Kewen.Lin via Gcc-patches
Hi Carl,

on 2023/7/3 23:57, Carl Love wrote:
> Kewen:
> 
> On Fri, 2023-06-30 at 15:20 -0700, Carl Love wrote:
>> Segher never liked the above way of looking at the assembly.  He
>> prefers:
>>   gcc -S -g -mcpu=power8 -o vsx-vector-6-func-2lop.s vsx-vector-6-func-2lop.c
>>
>>   grep xxlor vsx-vector-6-func-2lop.s | wc
>>  34  68 516
>>
>> So, again, I get the same count of 34 on both makalu and genoa.  But
>> again, that doesn't agree with what make script/scan-assembler thinks
>> the counts should be.
>>
>> When I looked at the vsx-vector-6-func-2lop.s I see on BE:
>>
>>  
>> lxvd2x 0,10,9
>> xxlor 0,12,0
>> xxlnor 0,0,0
>>  ...
>>
>> I was guessing that it was adjusting the data layout from the load. 
>> But looking again more carefully versus LE:
>>
>> 
>> lxvd2x 0,31,9 
>>xxpermdi 0,0,0,2 
>>xxlor 0,12,0  
>>xxlnor 0,0,0  
>>xxpermdi 0,0,0,2 
>> 
>>
>> the xxpermdi is probably what is really doing the data layout change.
>>
>> So, we have the issue that looking at the assembly gives different
>> instruction counts than what 
>>
>>dg-final { scan-assembler-times {\mxxlor\M} }
>>
>> comes up with???  Now I am really confused.  I don't know how the
>> scan-assembler-times works but I will go see if I can find it and see if I
>> can figure out what the issue is.  I would expect that the scan-
>> assembler is working off the --save-temp files, which get deleted as
>> part of the run.  I would guess that scan-assembler does a grep to
>> find
>> the instructions and then maybe uses wc to count them??? I will go
>> see
>> if I can figure out how scan-assembler-times works.
> 
> OK, I figured out why I was getting 34 xxlor instructions instead of
> the 22 that the scan-assembler-times was getting.  The difference was
> when I compiled the program I forgot to use -O2.  So with -O2 I get the
> same number of xxlor instructions as scan-assembler-instructions.  I get
> 34 if I do not specify optimization.

OK, thanks for looking into it.  When you run a test case with RUNTESTFLAGS,
you can add the "-v" (and even more times) to RUNTESTFLAGS, then you can find
the exact compiling commands in the dumping, I usually used this way for
reproducing and I hope it can avoid some inconsistency for reproduction.

> 
> So, I think the scan-assembler-times are all correct.
> 
> As Peter says, counting xxlor is a bit problematic in general.  We
> could just drop counting xxlor or have the LE/BE count qualifier for
> the instructions.  Your call.

Yeah, I agree that counting xxlor in the checking code (from function main)
is fragile, but as you said we still want to check expected xxlor generated
for bif vec_or, so I'd prefer to separate the existing case into the
compiling part and run part, I'll reply with more details to your latest v3.

Thanks,
Kewen


Re: [PATCH 0/9] vect: Move costing next to the transform for vect load

2023-07-02 Thread Kewen.Lin via Gcc-patches
Hi Richi,

Thanks for your review comments on this and some others!

on 2023/6/30 19:37, Richard Biener wrote:
> On Tue, Jun 13, 2023 at 4:07 AM Kewen Lin  wrote:
>>
>> This patch series follows Richi's suggestion at the link [1],
>> which suggest structuring vectorizable_load to make costing
>> next to the transform, in order to make it easier to keep
>> costing and the transform in sync.  For now, it's a known
>> issue that what we cost can be inconsistent with what we
>> transform, as the case in PR82255 and some other associated
>> test cases in the patches of this series show.
>>
>> Basically this patch series makes costing not call function
>> vect_model_load_cost any more.  To make the review and
>> bisection easy, I organized the changes according to the
>> memory access types of vector load.  For each memory access
>> type, firstly it follows the handlings in the function
>> vect_model_load_cost to avoid any missing, then refines
>> further by referring to the transform code, I also checked
>> them with some typical test cases to verify.  Hope the
>> subjects of patches are clear enough.
>>
>> The whole series can be bootstrapped and regtested
>> incrementally on:
>>   - x86_64-redhat-linux
>>   - aarch64-linux-gnu
>>   - powerpc64-linux-gnu P7, P8 and P9
>>   - powerpc64le-linux-gnu P8, P9 and P10
>>
>> By considering the current vector test buckets are mainly
>> tested without cost model, I also verified the whole patch
>> series was neutral for SPEC2017 int/fp on Power9 at O2,
>> O3 and Ofast separately.
> 
> I went through the series now and I like it overall (well, I suggested
> the change).
> Looking at the changes I think we want some followup to reduce the
> mess in the final loop nest.  We already have some VMAT_* cases handled
> separately, maybe we can split out some more cases.  Maybe we should

At first glance, the simple parts look to be the handlings for
VMAT_LOAD_STORE_LANES and VMAT_GATHER_SCATTER (with ifn and emulated).
It seems fairly straightforward if it's fine to duplicate the nested loop,
though we may need to take care to remove some useless code.

> bite the bullet and duplicate that loop nest for the different VMAT_* cases.
> Maybe we can merge some of the if (!costing_p) checks by clever
> re-ordering.

I've tried a bit to merge them if possible, like the place to check
VMAT_CONTIGUOUS, VMAT_CONTIGUOUS_REVERSE and VMAT_CONTIGUOUS_PERMUTE.
But will keep in mind for the following updates.

> So what
> this series doesn't improve is overall readability of the code (indent and our
> 80 char line limit).

Sorry about that.

> 
> The change also makes it more difficult(?) to separate analysis and transform
> though in the end I hope that analysis will actually "code generate" to a 
> (SLP)
> data structure so the target will have a chance to see the actual flow of 
> insns.
> 
> That said, I'd like to hear from Richard whether he thinks this is a step
> in the right direction.
> 
> Are you willing to followup with doing the same re-structuring to
> vectorizable_store?

Yes, vectorizable_store was also pointed out in your original suggestion [1],
I plan to update it similarly once this series meets your expectations and
gets landed.

> 
> OK from my side with the few comments addressed.  The patch likely needs
> refresh after the RVV changes in this area?

Thanks!  Yes, I've updated 2/9 and 3/9 according to your comments, and updated
5/9 and 9/9 as they had some conflicts when rebasing.  Re-testing is ongoing,
do the updated versions look good to you?  Is this series ok for trunk if all 
the
test runs go well again as before?

BR,
Kewen
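For readers following the series, the overall shape under discussion -- a single statement loop that either records costs or emits code, so the two cannot drift apart -- can be caricatured in a few lines.  Names are hypothetical; this only illustrates the control flow, not the real vectorizable_load.

```c
#include <assert.h>

/* Toy cost accumulator, playing the role of stmt_vector_for_cost.  */
struct cost_vec { int inside_cost; };

/* One walk over the copies: when COSTING_P is set, each step records
   its cost and skips the transform; otherwise it "emits" code.  */
static int
process_loads (int ncopies, int costing_p, struct cost_vec *costs)
{
  int emitted = 0;
  for (int j = 0; j < ncopies; j++)
    {
      if (costing_p)
	{
	  costs->inside_cost += 1;   /* record_stmt_cost (...) */
	  continue;                  /* skip the transform below */
	}
      emitted++;                     /* emit the actual vector load */
    }
  return emitted;
}
```

Costing and transform share one body, which is the sync property the series is after.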


Re: [PATCH 4/9] vect: Adjust vectorizable_load costing on VMAT_ELEMENTWISE and VMAT_STRIDED_SLP

2023-07-02 Thread Kewen.Lin via Gcc-patches
Hi Richard,

on 2023/7/2 16:58, Richard Sandiford wrote:
> Kewen Lin  writes:
>> @@ -9743,11 +9739,23 @@ vectorizable_load (vec_info *vinfo,
>>unsigned int n_groups = 0;
>>for (j = 0; j < ncopies; j++)
>>  {
>> -  if (nloads > 1)
>> +  if (nloads > 1 && !costing_p)
>>  vec_alloc (v, nloads);
>>gimple *new_stmt = NULL;
>>for (i = 0; i < nloads; i++)
>>  {
>> +  if (costing_p)
>> +{
>> +  if (VECTOR_TYPE_P (ltype))
>> +vect_get_load_cost (vinfo, stmt_info, 1,
>> +alignment_support_scheme, misalignment,
>> +false, &inside_cost, nullptr, cost_vec,
>> +cost_vec, true);
>> +  else
>> +inside_cost += record_stmt_cost (cost_vec, 1, scalar_load,
>> + stmt_info, 0, vect_body);
>> +  continue;
>> +}
> 
> Just a note that this might make life harder for AArch64 costing.
> Strided SLP loads are still equivalent to vector loads for AArch64,
> since they happen on the FPR/vector side even if they have integral
> modes.
> 
> But I agree this is more accurate from a general target-independent POV,
> especially given the relatively coarse-grain costing enum.  So I think
> that's just something AArch64 will need to account for.

Sorry for the inconvenience.  Hope accounting for it with target hook on
vect costing isn't very complicated.

BR,
Kewen


[PATCH 9/9 v2] vect: Adjust vectorizable_load costing on VMAT_CONTIGUOUS

2023-07-02 Thread Kewen.Lin via Gcc-patches
This is version v2 rebasing from latest trunk.

=

This patch adjusts the cost handling on VMAT_CONTIGUOUS in
function vectorizable_load.  We don't call function
vect_model_load_cost for it any more.  It removes function
vect_model_load_cost which becomes useless and unreachable
now.

gcc/ChangeLog:

* tree-vect-stmts.cc (vect_model_load_cost): Remove.
(vectorizable_load): Adjust the cost handling on VMAT_CONTIGUOUS without
calling vect_model_load_cost.
---
 gcc/tree-vect-stmts.cc | 94 --
 1 file changed, 9 insertions(+), 85 deletions(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 0a9a75ce3c7..9dfe1903181 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1118,75 +1118,6 @@ vect_get_store_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
 }
 }

-
-/* Function vect_model_load_cost
-
-   Models cost for loads.  In the case of grouped accesses, one access has
-   the overhead of the grouped access attributed to it.  Since unaligned
-   accesses are supported for loads, we also account for the costs of the
-   access scheme chosen.  */
-
-static void
-vect_model_load_cost (vec_info *vinfo,
- stmt_vec_info stmt_info, unsigned ncopies, poly_uint64 vf,
- vect_memory_access_type memory_access_type,
- dr_alignment_support alignment_support_scheme,
- int misalignment,
- slp_tree slp_node,
- stmt_vector_for_cost *cost_vec)
-{
-  gcc_assert (memory_access_type == VMAT_CONTIGUOUS);
-
-  unsigned int inside_cost = 0, prologue_cost = 0;
-  bool grouped_access_p = STMT_VINFO_GROUPED_ACCESS (stmt_info);
-
-  gcc_assert (cost_vec);
-
-  /* ???  Somehow we need to fix this at the callers.  */
-  if (slp_node)
-ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-
-  if (slp_node && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ())
-{
-  /* If the load is permuted then the alignment is determined by
-the first group element not by the first scalar stmt DR.  */
-  stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
-  if (!first_stmt_info)
-   first_stmt_info = stmt_info;
-  /* Record the cost for the permutation.  */
-  unsigned n_perms, n_loads;
-  vect_transform_slp_perm_load (vinfo, slp_node, vNULL, NULL,
-   vf, true, &n_perms, &n_loads);
-  inside_cost += record_stmt_cost (cost_vec, n_perms, vec_perm,
-  first_stmt_info, 0, vect_body);
-
-  /* And adjust the number of loads performed.  This handles
-redundancies as well as loads that are later dead.  */
-  ncopies = n_loads;
-}
-
-  /* Grouped loads read all elements in the group at once,
- so we want the DR for the first statement.  */
-  stmt_vec_info first_stmt_info = stmt_info;
-  if (!slp_node && grouped_access_p)
-first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
-
-  /* True if we should include any once-per-group costs as well as
- the cost of the statement itself.  For SLP we only get called
- once per group anyhow.  */
-  bool first_stmt_p = (first_stmt_info == stmt_info);
-
-  vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme,
- misalignment, first_stmt_p, &inside_cost, &prologue_cost,
- cost_vec, cost_vec, true);
-
-  if (dump_enabled_p ())
-dump_printf_loc (MSG_NOTE, vect_location,
- "vect_model_load_cost: inside_cost = %d, "
- "prologue_cost = %d .\n", inside_cost, prologue_cost);
-}
-
-
 /* Calculate cost of DR's memory access.  */
 void
 vect_get_load_cost (vec_info *, stmt_vec_info stmt_info, int ncopies,
@@ -10830,7 +10761,8 @@ vectorizable_load (vec_info *vinfo,
 we only need to count it once for the whole group.  */
  bool first_stmt_info_p = first_stmt_info == stmt_info;
  bool add_realign_cost = first_stmt_info_p && i == 0;
- if (memory_access_type == VMAT_CONTIGUOUS_REVERSE
+ if (memory_access_type == VMAT_CONTIGUOUS
+ || memory_access_type == VMAT_CONTIGUOUS_REVERSE
  || (memory_access_type == VMAT_CONTIGUOUS_PERMUTE
  && (!grouped_load || first_stmt_info_p)))
vect_get_load_cost (vinfo, stmt_info, 1,
@@ -10954,15 +10886,14 @@ vectorizable_load (vec_info *vinfo,
 direct vect_transform_slp_perm_load to DCE the unused parts.
 ???  This is a hack to prevent compile-time issues as seen
 in PR101120 and friends.  */
- if (costing_p
- && memory_access_type != VMAT_CONTIGUOUS)
+ if (costing_p)
{
  vect_transform_slp_perm_load (vinfo, slp_node, vNULL, nullptr, vf,
true, 
