Re: [PATCH] [testsuite] [powerpc] adjust -m32 counts for fold-vec-extract*

2023-05-24 Thread Kewen.Lin via Gcc-patches
Hi Alexandre,

on 2023/5/24 13:51, Alexandre Oliva wrote:
> 
> Codegen changes caused add instruction count mismatches on
> ppc-*-linux-gnu and other 32-bit ppc targets.  At some point the
> expected counts were adjusted for lp64, but ilp32 differences
> remained, and published test results confirm it.

Thanks for fixing this!  I tested it on ppc64le and ppc64 {-m64,-m32} and it
works well.

> 
> Bootstrapped on x86_64-linux-gnu.  Also tested on ppc- and x86-vx7r2
> with gcc-12.
> 
> for  gcc/testsuite/ChangeLog

I think this is for PR101169; could you add it as a PR marker?

> 
>   * gcc.target/powerpc/fold-vec-extract-char.p7.c: Adjust addi
>   counts for ilp32.
>   * gcc.target/powerpc/fold-vec-extract-double.p7.c: Likewise.
>   * gcc.target/powerpc/fold-vec-extract-float.p7.c: Likewise.
>   * gcc.target/powerpc/fold-vec-extract-float.p8.c: Likewise.
>   * gcc.target/powerpc/fold-vec-extract-int.p7.c: Likewise.
>   * gcc.target/powerpc/fold-vec-extract-int.p8.c: Likewise.
>   * gcc.target/powerpc/fold-vec-extract-short.p7.c: Likewise.
>   * gcc.target/powerpc/fold-vec-extract-short.p8.c: Likewise.
> ---
>  .../gcc.target/powerpc/fold-vec-extract-char.p7.c  |3 ++-
>  .../powerpc/fold-vec-extract-double.p7.c   |2 +-
>  .../gcc.target/powerpc/fold-vec-extract-float.p7.c |2 +-
>  .../gcc.target/powerpc/fold-vec-extract-float.p8.c |2 +-
>  .../gcc.target/powerpc/fold-vec-extract-int.p7.c   |2 +-
>  .../gcc.target/powerpc/fold-vec-extract-int.p8.c   |2 +-
>  .../gcc.target/powerpc/fold-vec-extract-short.p7.c |2 +-
>  .../gcc.target/powerpc/fold-vec-extract-short.p8.c |2 +-
>  8 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c
> index 29a8aa84db282..c6647431d09c9 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-char.p7.c
> @@ -11,7 +11,8 @@
>  /* one extsb (extend sign-bit) instruction generated for each test against
> unsigned types */
> 
> -/* { dg-final { scan-assembler-times {\maddi\M} 9 } } */
> +/* { dg-final { scan-assembler-times {\maddi\M} 9 { target { lp64 } } } } */
> +/* { dg-final { scan-assembler-times {\maddi\M} 6 { target { ilp32 } } } } */
>  /* { dg-final { scan-assembler-times {\mli\M} 6 } } */
>  /* { dg-final { scan-assembler-times {\mstxvw4x\M|\mstvx\M|\mstxv\M} 6 } } */
>  /* -m32 target uses rlwinm in place of rldicl. */
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p7.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p7.c
> index 3cae644b90b71..db325efbb07ff 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p7.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-double.p7.c
> @@ -14,7 +14,7 @@
>  /* { dg-final { scan-assembler-times {\mli\M} 1 } } */
>  /* -m32 target has an 'add' in place of one of the 'addi'. */
>  /* { dg-final { scan-assembler-times {\maddi\M|\madd\M} 2 { target lp64 } } 
> } */
> -/* { dg-final { scan-assembler-times {\maddi\M|\madd\M} 3 { target ilp32 } } 
> } */
> +/* { dg-final { scan-assembler-times {\maddi\M|\madd\M} 2 { target ilp32 } } 
> } */

So both lp64 and ilp32 now have the same count; could we merge the two
directives and remove the selectors?
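
That is, assuming both ABIs really do end up with the same regex and count, a
single unconditional directive could replace the pair, e.g.:

  /* { dg-final { scan-assembler-times {\maddi\M|\madd\M} 2 } } */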

>  /* -m32 target has a rlwinm in place of a rldic .  */
>  /* { dg-final { scan-assembler-times {\mrldic\M|\mrlwinm\M} 1 } } */
>  /* { dg-final { scan-assembler-times {\mstxvd2x\M} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p7.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p7.c
> index 59a4979457dcb..42ec69475fd07 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p7.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p7.c
> @@ -13,7 +13,7 @@
>  /* { dg-final { scan-assembler-times {\mli\M} 1 } } */
>  /* -m32 as an add in place of an addi. */
>  /* { dg-final { scan-assembler-times {\maddi\M|\madd\M} 2 { target lp64 } } 
> } */
> -/* { dg-final { scan-assembler-times {\maddi\M|\madd\M} 3 { target ilp32 } } 
> } */
> +/* { dg-final { scan-assembler-times {\maddi\M|\madd\M} 2 { target ilp32 } } 
> } */

Ditto.

>  /* { dg-final { scan-assembler-times {\mstxvd2x\M|\mstvx\M|\mstxv\M} 1 } } */
>  /* -m32 uses rlwinm in place of rldic */
>  /* { dg-final { scan-assembler-times {\mrldic\M|\mrlwinm\M} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p8.c 
> b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p8.c
> index 4b1d75ee26d0f..68de4b307 100644
> --- a/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p8.c
> +++ b/gcc/testsuite/gcc.target/powerpc/fold-vec-extract-float.p8.c
> @@ -26,7 +26,7 @@
>  /* { dg-final { scan-assembler-times {\mstxvd2x\M} 1 { target ilp32 } } } */
>  /* { dg-final { scan-assembler-times {\madd\M} 1 { target ilp32 } } } 

Re: [V7][PATCH 1/2] Handle component_ref to a structure/union field including flexible array member [PR101832]

2023-05-24 Thread Bernhard Reutner-Fischer via Gcc-patches
On 24 May 2023 16:09:21 CEST, Qing Zhao  wrote:
>Bernhard,
>
>Thanks a lot for your comments.
>
>> On May 19, 2023, at 7:11 PM, Bernhard Reutner-Fischer 
>>  wrote:
>> 
>> On Fri, 19 May 2023 20:49:47 +
>> Qing Zhao via Gcc-patches  wrote:
>> 
>>> GCC extension accepts the case when a struct with a flexible array member
>>> is embedded into another struct or union (possibly recursively).
>> 
>> Do you mean TYPE_TRAILING_FLEXARRAY()?
>
>The following might be a more accurate description:
>
>GCC extension accepts the case when a struct with a flexible array member
> is embedded into another struct or union (possibly recursively) as the last 
> field.
>
>
>
>> 
>>> diff --git a/gcc/tree.h b/gcc/tree.h
>>> index 0b72663e6a1..237644e788e 100644
>>> --- a/gcc/tree.h
>>> +++ b/gcc/tree.h
>>> @@ -786,7 +786,12 @@ extern void omp_clause_range_check_failed (const_tree, 
>>> const char *, int,
>>>(...) prototype, where arguments can be accessed with va_start and
>>>va_arg), as opposed to an unprototyped function.  */
>>> #define TYPE_NO_NAMED_ARGS_STDARG_P(NODE) \
>>> -  (TYPE_CHECK (NODE)->type_common.no_named_args_stdarg_p)
>>> +  (FUNC_OR_METHOD_CHECK (NODE)->type_common.no_named_args_stdarg_p)
>>> +
>>> +/* True if this RECORD_TYPE or UNION_TYPE includes a flexible array member
>>> +   at the last field recursively.  */
>>> +#define TYPE_INCLUDE_FLEXARRAY(NODE) \
>>> +  (RECORD_OR_UNION_CHECK (NODE)->type_common.no_named_args_stdarg_p)
>> 
>> Until I read the description above, I read TYPE_INCLUDE_FLEXARRAY as an
>> option to include or not include something.  The description hints more
>> at TYPE_INCLUDES_FLEXARRAY (with an S), i.e. a type which has at least
>> one member with a trailing flexible array, or which itself has a
>> trailing flexible array.
>
>Yes, TYPE_INCLUDES_FLEXARRAY (maybe the name with an S is better) means the
>structure/union TYPE includes a flexible array member, or includes a struct
>with a flexible array member, as the last field.
>

So ANY_TRAILING_FLEXARRAY, TYPE_CONTAINS_FLEXARRAY, TYPE_INCLUDES_FLEXARRAY,
or something like that would be clearer, I don't know.
I'd probably use the first, but that's enough bike-shedding from me for now.
Let's see what others think.
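
For reference, a minimal sketch of the layout under discussion (the names here
are made up purely for illustration):

  struct flex   { int n; int data[]; };       /* trailing flexible array member */
  struct outer  { int m; struct flex f; };    /* GCC extension: embedded as the last field */
  struct outer2 { int k; struct outer o; };   /* and possibly recursively */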

thanks,

>Hope this is clear.
>thanks.
>
>Qing
>> 
>>> 
>>> /* In an IDENTIFIER_NODE, this means that assemble_name was called with
>>>this string as an argument.  */
>> 
>



[COMMITTED] Stream out NANs correctly.

2023-05-24 Thread Aldy Hernandez via Gcc-patches
NANs don't have bounds, so there's no need to stream them out.

gcc/ChangeLog:

* data-streamer-in.cc (streamer_read_value_range): Handle NANs.
* data-streamer-out.cc (streamer_write_vrange): Same.
* value-range.h (class vrange): Make streamer_write_vrange a friend.
---
 gcc/data-streamer-in.cc  | 16 
 gcc/data-streamer-out.cc | 17 -
 gcc/value-range.h|  1 +
 3 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/gcc/data-streamer-in.cc b/gcc/data-streamer-in.cc
index 07728bef413..578c328475f 100644
--- a/gcc/data-streamer-in.cc
+++ b/gcc/data-streamer-in.cc
@@ -248,14 +248,22 @@ streamer_read_value_range (class lto_input_block *ib, 
data_in *data_in,
   if (is_a <frange> (vr))
 {
   frange &r = as_a <frange> (vr);
-  REAL_VALUE_TYPE lb, ub;
-  streamer_read_real_value (ib, &lb);
-  streamer_read_real_value (ib, &ub);
+
+  // Stream in NAN bits.
   struct bitpack_d bp = streamer_read_bitpack (ib);
   bool pos_nan = (bool) bp_unpack_value (&bp, 1);
   bool neg_nan = (bool) bp_unpack_value (&bp, 1);
   nan_state nan (pos_nan, neg_nan);
-  r.set (type, lb, ub, nan);
+
+  if (kind == VR_NAN)
+   r.set_nan (type, nan);
+  else
+   {
+ REAL_VALUE_TYPE lb, ub;
+ streamer_read_real_value (ib, &lb);
+ streamer_read_real_value (ib, &ub);
+ r.set (type, lb, ub, nan);
+   }
   return;
 }
   gcc_unreachable ();
diff --git a/gcc/data-streamer-out.cc b/gcc/data-streamer-out.cc
index afc9862062b..93dedfcb895 100644
--- a/gcc/data-streamer-out.cc
+++ b/gcc/data-streamer-out.cc
@@ -410,7 +410,7 @@ streamer_write_vrange (struct output_block *ob, const 
vrange &v)
   gcc_checking_assert (!v.undefined_p ());
 
   // Write the common fields to all vranges.
-  value_range_kind kind = v.varying_p () ? VR_VARYING : VR_RANGE;
+  value_range_kind kind = v.m_kind;
   streamer_write_enum (ob->main_stream, value_range_kind, VR_LAST, kind);
   stream_write_tree (ob, v.type (), true);
 
@@ -429,15 +429,22 @@ streamer_write_vrange (struct output_block *ob, const 
vrange )
   if (is_a <frange> (v))
 {
   const frange &r = as_a <frange> (v);
-  REAL_VALUE_TYPE lb = r.lower_bound ();
-  REAL_VALUE_TYPE ub = r.upper_bound ();
-  streamer_write_real_value (ob, &lb);
-  streamer_write_real_value (ob, &ub);
+
+  // Stream out NAN bits.
   bitpack_d bp = bitpack_create (ob->main_stream);
   nan_state nan = r.get_nan_state ();
   bp_pack_value (&bp, nan.pos_p (), 1);
   bp_pack_value (&bp, nan.neg_p (), 1);
   streamer_write_bitpack (&bp);
+
+  // Stream out bounds.
+  if (kind != VR_NAN)
+   {
+ REAL_VALUE_TYPE lb = r.lower_bound ();
+ REAL_VALUE_TYPE ub = r.upper_bound ();
+ streamer_write_real_value (ob, &lb);
+ streamer_write_real_value (ob, &ub);
+   }
   return;
 }
   gcc_unreachable ();
diff --git a/gcc/value-range.h b/gcc/value-range.h
index 39023e7b5eb..2b4ebabe7c8 100644
--- a/gcc/value-range.h
+++ b/gcc/value-range.h
@@ -76,6 +76,7 @@ class GTY((user)) vrange
 {
   template  friend bool is_a (vrange &);
   friend class Value_Range;
+  friend void streamer_write_vrange (struct output_block *, const vrange &);
 public:
   virtual void accept (const class vrange_visitor &v) const = 0;
   virtual void set (tree, tree, value_range_kind = VR_RANGE);
-- 
2.40.1



[COMMITTED] Disallow setting of NANs in frange setter unless setting trees.

2023-05-24 Thread Aldy Hernandez via Gcc-patches
frange::set() is confusing in that we can set a NAN by specifying a
bound of +-NAN, even though we technically disallow NANs in the setter
because the kind can never be VR_NAN.  This wart exists so that
get_tree_range(), which builds a range out of a tree from the source,
works correctly.  It's ugly, and it showed its limitations while
implementing LTO streaming of ranges.

This patch disallows passing NAN bounds in frange::set() and fixes
get_tree_range.

gcc/ChangeLog:

* value-query.cc (range_query::get_tree_range): Set NAN directly
if necessary.
* value-range.cc (frange::set): Assert that bounds are not NAN.
---
 gcc/value-query.cc | 13 ++---
 gcc/value-range.cc |  9 +
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/gcc/value-query.cc b/gcc/value-query.cc
index 43297f17c39..a84f164d77b 100644
--- a/gcc/value-query.cc
+++ b/gcc/value-query.cc
@@ -189,9 +189,16 @@ range_query::get_tree_range (vrange , tree expr, gimple 
*stmt)
   {
frange &f = as_a <frange> (r);
REAL_VALUE_TYPE *rv = TREE_REAL_CST_PTR (expr);
-   f.set (TREE_TYPE (expr), *rv, *rv);
-   if (!real_isnan (rv))
- f.clear_nan ();
+   if (real_isnan (rv))
+ {
+   bool sign = real_isneg (rv);
+   f.set_nan (TREE_TYPE (expr), sign);
+ }
+   else
+ {
+   nan_state nan (false);
+   f.set (TREE_TYPE (expr), *rv, *rv, nan);
+ }
return true;
   }
 
diff --git a/gcc/value-range.cc b/gcc/value-range.cc
index 2f37ff3e58e..707b1f15fd4 100644
--- a/gcc/value-range.cc
+++ b/gcc/value-range.cc
@@ -359,14 +359,7 @@ frange::set (tree type,
   gcc_unreachable ();
 }
 
-  // Handle NANs.
-  if (real_isnan (&min) || real_isnan (&max))
-{
-  gcc_checking_assert (real_identical (&min, &max));
-  bool sign = real_isneg (&min);
-  set_nan (type, sign);
-  return;
-}
+  gcc_checking_assert (!real_isnan (&min) && !real_isnan (&max));
 
   m_kind = kind;
   m_type = type;
-- 
2.40.1



[COMMITTED] Hash known NANs correctly for franges.

2023-05-24 Thread Aldy Hernandez via Gcc-patches
We're ICEing when trying to hash a known NAN.  This has gone unnoticed
because the only user would be IPA, and even so, it currently doesn't
handle floats.  However, handling floats is a flip of a switch, so
it's best to handle them now.

gcc/ChangeLog:

* value-range.cc (add_vrange): Handle known NANs.
---
 gcc/value-range.cc | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/gcc/value-range.cc b/gcc/value-range.cc
index 874a1843ebf..2f37ff3e58e 100644
--- a/gcc/value-range.cc
+++ b/gcc/value-range.cc
@@ -269,14 +269,14 @@ add_vrange (const vrange , inchash::hash ,
   if (is_a <frange> (v))
 {
   const frange &r = as_a <frange> (v);
-  if (r.varying_p ())
-   hstate.add_int (VR_VARYING);
+  if (r.known_isnan ())
+   hstate.add_int (VR_NAN);
   else
-   hstate.add_int (VR_RANGE);
-
-  hstate.add_real_value (r.lower_bound ());
-  hstate.add_real_value (r.upper_bound ());
-
+   {
+ hstate.add_int (r.varying_p () ? VR_VARYING : VR_RANGE);
+ hstate.add_real_value (r.lower_bound ());
+ hstate.add_real_value (r.upper_bound ());
+   }
   nan_state nan = r.get_nan_state ();
   hstate.add_int (nan.pos_p ());
   hstate.add_int (nan.neg_p ());
-- 
2.40.1



[COMMITTED] Add an frange::set_nan() variant that takes a nan_state.

2023-05-24 Thread Aldy Hernandez via Gcc-patches
Generalize frange::set_nan() to take a nan_state and make the current
set_nan() methods syntactic sugar on top of it.

This is in preparation for better streaming of NANs for LTO/IPA.

gcc/ChangeLog:

* value-range.h (frange::set_nan): New.
---
 gcc/value-range.h | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/gcc/value-range.h b/gcc/value-range.h
index b8cc2a0e76a..39023e7b5eb 100644
--- a/gcc/value-range.h
+++ b/gcc/value-range.h
@@ -327,6 +327,7 @@ public:
const nan_state &, value_range_kind = VR_RANGE);
   void set_nan (tree type);
   void set_nan (tree type, bool sign);
+  void set_nan (tree type, const nan_state &);
   virtual void set_varying (tree type) override;
   virtual void set_undefined () override;
   virtual bool union_ (const vrange &) override;
@@ -1219,17 +1220,18 @@ frange_val_is_max (const REAL_VALUE_TYPE , const_tree 
type)
   return real_identical (, );
 }
 
-// Build a signless NAN of type TYPE.
+// Build a NAN with a state of NAN.
 
 inline void
-frange::set_nan (tree type)
+frange::set_nan (tree type, const nan_state &nan)
 {
+  gcc_checking_assert (nan.pos_p () || nan.neg_p ());
   if (HONOR_NANS (type))
 {
   m_kind = VR_NAN;
   m_type = type;
-  m_pos_nan = true;
-  m_neg_nan = true;
+  m_neg_nan = nan.neg_p ();
+  m_pos_nan = nan.pos_p ();
   if (flag_checking)
verify_range ();
 }
@@ -1237,22 +1239,22 @@ frange::set_nan (tree type)
 set_undefined ();
 }
 
+// Build a signless NAN of type TYPE.
+
+inline void
+frange::set_nan (tree type)
+{
+  nan_state nan (true);
+  set_nan (type, nan);
+}
+
 // Build a NAN of type TYPE with SIGN.
 
 inline void
 frange::set_nan (tree type, bool sign)
 {
-  if (HONOR_NANS (type))
-{
-  m_kind = VR_NAN;
-  m_type = type;
-  m_neg_nan = sign;
-  m_pos_nan = !sign;
-  if (flag_checking)
-   verify_range ();
-}
-  else
-set_undefined ();
+  nan_state nan (/*pos=*/!sign, /*neg=*/sign);
+  set_nan (type, nan);
 }
 
 // Return TRUE if range is known to be finite.
-- 
2.40.1



Re: [PATCH v2] rs6000: Add buildin for mffscrn instructions

2023-05-24 Thread Kewen.Lin via Gcc-patches
on 2023/5/24 23:20, Carl Love wrote:
> On Wed, 2023-05-24 at 13:32 +0800, Kewen.Lin wrote:
>> on 2023/5/24 06:30, Peter Bergner wrote:
>>> On 5/23/23 12:24 AM, Kewen.Lin wrote:
 on 2023/5/23 01:31, Carl Love wrote:
> The builtins were requested for use in GLibC.  As of version
> 2.31 they
> were added as inline asm.  They requested a builtin so the asm
> could be
> removed.

 So IMHO we also want similar support for mffscrn, that is, to make
 use of mffscrn and mffscrni on Power9 and later, but fall back to
 something similar with __builtin_set_fpscr_rn + mffs on older platforms.
>>>
>>> So __builtin_set_fpscr_rn does everything we want (sets the RN bits) and
>>> uses mffscrn/mffscrni on P9 and later and uses older insns on pre-
>>> P9.
>>> The only problem is we don't return the current FPSCR bits, as the
>>> bif
>>> is defined to return void.
>>
>> Yes.
>>
>>> Crazy idea, but could we extend the built-in
>>> with an overload that returns the FPSCR bits?  
>>
>> So you agree that we should make this proposed new bif handle pre-P9
>> just
>> like some other existing bifs. :)  I think extending it is good and
>> doable,
>> but the only concern here is the bif name "__builtin_set_fpscr_rn",
>> which
>> matches the existing behavior (only set rounding) but doesn't match
>> the
>> proposed extending behavior (set rounding and get some env bits
>> back).
>> Maybe it's not a big deal if the documentation clarify it well.
> 
> Extending the builtin to pre-Power9 is straightforward, and I agree it
> would make good sense to do.
> 
> I am a bit concerned about how to extend __builtin_set_fpscr_rn to add the
> new functionality.  Peter suggests overloading the builtin to either
> return void or return the FPSCR bits.  It is my understanding that the
> return value for a given builtin has to be the same, i.e. you can't
> overload the return value.  Maybe you can with Bill's new
> infrastructure?  I recall having problems trying to overload the return
> value in the past, and Bill said you couldn't do it.  I'll play with this
> and see if I can overload the return value.

Your understanding is correct that we can't overload this for just different
return types.  But I previously interpreted the extension proposal as
extending
 
  void __builtin_set_fpscr_rn (int);

to 

  void __builtin_set_fpscr_rn (int, double*);

The related address-taking and store here can normally be optimized away.
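
For illustration only (the two-argument form below is just the proposal under
discussion, not an existing builtin), a caller could then look like:

  double prev_fpscr;
  /* Set the rounding mode and capture the previous FPSCR contents via the
     second argument (hypothetical interface).  */
  __builtin_set_fpscr_rn (3, &prev_fpscr);

and when the captured value is unused, the associated address-taking and store
can be optimized away as mentioned above.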

BR,
Kewen


Re: Re: RISC-V Bootstrap problems

2023-05-24 Thread juzhe.zh...@rivai.ai
>> It's highly unlikely we'll switch from the mechanisms we're using.
>>They're pretty deeply embedded into how all the ports are developed and
>>work.

We just took a look at the build file.  It seems that the functions generated
by define_insn are very numerous.  Is there a chance to optimize this?
I believe LLVM's TableGen mechanism is well optimized with respect to the
generated files and functions, so they aren't affected too much as the number
of instructions grows.

Thanks.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-25 12:07
To: juzhe.zh...@rivai.ai; kito.cheng
CC: jeffreyalaw; palmer; vineetg; Kito.cheng; gcc-patches; Patrick O'Neill; 
macro
Subject: Re: RISC-V Bootstrap problems
 
 
On 5/24/23 21:54, juzhe.zh...@rivai.ai wrote:
>  >> IIRC LLVM is using the table driven mechanism, so it's less impact 
> on the
>>>compilation time when the instruction becomes more and more.
> Oh, I see. Could you share more details ?
> Maybe we can support this in GCC.
It's highly unlikely we'll switch from the mechanisms we're using. 
They're pretty deeply embedded into how all the ports are developed and 
work.
 
The first step is to figure out what's exploding.  I strongly suspect 
we'll be able to see this in a cross, but again, the magnitude will be 
smaller.
 
jeff
 


Re: RISC-V Bootstrap problems

2023-05-24 Thread Kito Cheng via Gcc-patches
Yeah, JoJo is still working on toolchain stuff, but is just not active on upstream GCC.

cc. jojo

On Thu, May 25, 2023 at 12:06 PM Jeff Law  wrote:
>
>
>
> On 5/24/23 21:53, Kito Cheng wrote:
> > Jojo has a patch to try to split those things that should help this,
> > but seems not landed.
> >
> > https://patchwork.ozlabs.org/project/gcc/patch/20201104015315.81416-1-jiejie_r...@c-sky.com/
> Is JoJo still active?  I haven't heard from JoJo in many months, perhaps
> as long as a year or two.
>
> Jeff


Re: RISC-V Bootstrap problems

2023-05-24 Thread Jeff Law




On 5/24/23 21:54, juzhe.zh...@rivai.ai wrote:
 >> IIRC LLVM is using the table driven mechanism, so it's less impact 
on the

compilation time when the instruction becomes more and more.

Oh, I see. Could you share more details ?
Maybe we can support this in GCC.
It's highly unlikely we'll switch from the mechanisms we're using. 
They're pretty deeply embedded into how all the ports are developed and 
work.


The first step is to figure out what's exploding.  I strongly suspect 
we'll be able to see this in a cross, but again, the magnitude will be 
smaller.


jeff


Re: RISC-V Bootstrap problems

2023-05-24 Thread Jeff Law




On 5/24/23 21:53, Kito Cheng wrote:

Jojo has a patch to try to split those things that should help this,
but seems not landed.

https://patchwork.ozlabs.org/project/gcc/patch/20201104015315.81416-1-jiejie_r...@c-sky.com/
Is JoJo still active?  I haven't heard from JoJo in many months, perhaps 
as long as a year or two.


Jeff


Re: [PATCH] RISC-V: Remove FRM_REGNUM dependency for rtx conversions

2023-05-24 Thread Kito Cheng via Gcc-patches
LGTM, thanks :)

On Wed, May 24, 2023 at 7:26 PM  wrote:
>
> From: Juzhe-Zhong 
>
> According to RVV ISA:
> The conversions use the dynamic rounding mode in frm, except for the rtz 
> variants, which round towards zero.
>
> So rtz conversion patterns should not have FRM dependency.
>
> We can't support mode switching for FRM yet since rvv intrinsic doc is not 
> updated but
> I think this patch is correct.
>
> gcc/ChangeLog:
>
> * config/riscv/vector.md: Remove FRM_REGNUM dependency in rtz 
> instructions.
>
> ---
>  gcc/config/riscv/vector.md | 12 +++-
>  1 file changed, 3 insertions(+), 9 deletions(-)
>
> diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
> index 9afef0d12bc..15f66efaa48 100644
> --- a/gcc/config/riscv/vector.md
> +++ b/gcc/config/riscv/vector.md
> @@ -7072,10 +7072,8 @@
>  (match_operand 5 "const_int_operand""  i,  i,  i,  
> i")
>  (match_operand 6 "const_int_operand""  i,  i,  i,  
> i")
>  (match_operand 7 "const_int_operand""  i,  i,  i,  
> i")
> -(match_operand 8 "const_int_operand""  i,  i,  i,  
> i")
>  (reg:SI VL_REGNUM)
> -(reg:SI VTYPE_REGNUM)
> -(reg:SI FRM_REGNUM)] UNSPEC_VPREDICATE)
> +(reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
>   (any_fix:
>  (match_operand:VF 3 "register_operand"  " vr, vr, vr, 
> vr"))
>   (match_operand: 2 "vector_merge_operand" " vu,  0, vu,  
> 0")))]
> @@ -7142,10 +7140,8 @@
>  (match_operand 5 "const_int_operand""i,i")
>  (match_operand 6 "const_int_operand""i,i")
>  (match_operand 7 "const_int_operand""i,i")
> -(match_operand 8 "const_int_operand""i,i")
>  (reg:SI VL_REGNUM)
> -(reg:SI VTYPE_REGNUM)
> -(reg:SI FRM_REGNUM)] UNSPEC_VPREDICATE)
> +(reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
>   (any_fix:VWCONVERTI
>  (match_operand: 3 "register_operand" "   vr,   vr"))
>   (match_operand:VWCONVERTI 2 "vector_merge_operand" "   vu,0")))]
> @@ -7233,10 +7229,8 @@
>  (match_operand 5 "const_int_operand" "  i,  i,  i,  
> i,i,i")
>  (match_operand 6 "const_int_operand" "  i,  i,  i,  
> i,i,i")
>  (match_operand 7 "const_int_operand" "  i,  i,  i,  
> i,i,i")
> -(match_operand 8 "const_int_operand" "  i,  i,  i,  
> i,i,i")
>  (reg:SI VL_REGNUM)
> -(reg:SI VTYPE_REGNUM)
> -(reg:SI FRM_REGNUM)] UNSPEC_VPREDICATE)
> +(reg:SI VTYPE_REGNUM)] UNSPEC_VPREDICATE)
>   (any_fix:
>  (match_operand:VF 3 "register_operand"   "  0,  0,  0,  
> 0,   vr,   vr"))
>   (match_operand: 2 "vector_merge_operand" " vu,  0, vu,  
> 0,   vu,0")))]
> --
> 2.36.1
>


Re: Re: RISC-V Bootstrap problems

2023-05-24 Thread juzhe.zh...@rivai.ai
>> IIRC LLVM is using the table driven mechanism, so it's less impact on the
>> compilation time when the instruction becomes more and more.
Oh, I see. Could you share more details ?
Maybe we can support this in GCC.



juzhe.zh...@rivai.ai
 
From: Kito Cheng
Date: 2023-05-25 11:53
To: juzhe.zh...@rivai.ai
CC: jeffreyalaw; palmer; vineetg; Kito.cheng; gcc-patches; Patrick O'Neill; 
jlaw; macro
Subject: Re: Re: RISC-V Bootstrap problems
Jojo has a patch to try to split those things that should help this,
but seems not landed.
 
https://patchwork.ozlabs.org/project/gcc/patch/20201104015315.81416-1-jiejie_r...@c-sky.com/
 
 
> How about LLVM? Can kito help with this issue?
> LLVM has already supported full intrinsics for a long time and no issues.
 
IIRC LLVM is using the table driven mechanism, so it's less impact on the
compilation time when the instruction becomes more and more.
 
 
On Thu, May 25, 2023 at 11:46 AM juzhe.zh...@rivai.ai
 wrote:
>
> segment intrinsics are really huge amount.
>
> Even though I have tried to optimized them, still we have the issues..
>
> How about LLVM? Can kito help with this issue?
> LLVM has already support full intrinsics for a long time and no issues.
>
> Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: Jeff Law
> Date: 2023-05-25 11:43
> To: Palmer Dabbelt; Vineet Gupta
> CC: kito.cheng; gcc-patches; Kito Cheng; Patrick O'Neill; Jeff Law; macro; 
> juzhe.zh...@rivai.ai
> Subject: Re: RISC-V Bootstrap problems
>
>
> On 5/24/23 17:13, Palmer Dabbelt wrote:
> > On Wed, 24 May 2023 16:12:20 PDT (-0700), Vineet Gupta wrote:
>
> [ ... big snip ... ]
>
> >>
> >> Never mind. Looks like I found the issue - with just trial and error and
> >> no idea of how this stuff works.
> >> The torture-{init,finish} needs to be in riscv.exp not rvv.exp
> >> Running full tests now.
> >
> > Thanks!
> Marginally related.  I was able to bisect the "hang" when 3-staging the
> trunk on RISC-V with qemu user mode emulation.
>
> So it wasn't actually hanging, but after the introduction of segment
> intrinsics the compilation time for insn-emit explodes -- previously I
> could do a full 3-stage bootstrap, build the glibc & the kernel, then
> test c/c++/fortran in ~10 hours.
>
> Now just building insn-emit.o alone takes ~10 hours in that environment.
>   I suspect (but have not yet confirmed) that we should see a huge
> compile-time spike in cross builds as well, though obviously it won't be
> as bad since we're not using qemu emulation.
>
> Clearly something isn't scaling well.  I don't know if we've got a crazy
> large function in there, a crazy number of functions or something that's
> just triggering a compile-time scaling problem.  Whatever it is, we
> probably need to address it.
>
> jeff
>
>
 


Re: Re: RISC-V Bootstrap problems

2023-05-24 Thread Kito Cheng via Gcc-patches
Jojo has a patch to try to split those things, which should help this,
but it seems it never landed.

https://patchwork.ozlabs.org/project/gcc/patch/20201104015315.81416-1-jiejie_r...@c-sky.com/


> How about LLVM? Can kito help with this issue?
> LLVM has already supported full intrinsics for a long time and no issues.

IIRC LLVM is using a table-driven mechanism, so there is less impact on
compilation time as the number of instructions grows.


On Thu, May 25, 2023 at 11:46 AM juzhe.zh...@rivai.ai
 wrote:
>
> segment intrinsics are really huge amount.
>
> Even though I have tried to optimized them, still we have the issues..
>
> How about LLVM? Can kito help with this issue?
> LLVM has already support full intrinsics for a long time and no issues.
>
> Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: Jeff Law
> Date: 2023-05-25 11:43
> To: Palmer Dabbelt; Vineet Gupta
> CC: kito.cheng; gcc-patches; Kito Cheng; Patrick O'Neill; Jeff Law; macro; 
> juzhe.zh...@rivai.ai
> Subject: Re: RISC-V Bootstrap problems
>
>
> On 5/24/23 17:13, Palmer Dabbelt wrote:
> > On Wed, 24 May 2023 16:12:20 PDT (-0700), Vineet Gupta wrote:
>
> [ ... big snip ... ]
>
> >>
> >> Never mind. Looks like I found the issue - with just trial and error and
> >> no idea of how this stuff works.
> >> The torture-{init,finish} needs to be in riscv.exp not rvv.exp
> >> Running full tests now.
> >
> > Thanks!
> Marginally related.  I was able to bisect the "hang" when 3-staging the
> trunk on RISC-V with qemu user mode emulation.
>
> So it wasn't actually hanging, but after the introduction of segment
> intrinsics the compilation time for insn-emit explodes -- previously I
> could do a full 3-stage bootstrap, build the glibc & the kernel, then
> test c/c++/fortran in ~10 hours.
>
> Now just building insn-emit.o alone takes ~10 hours in that environment.
>   I suspect (but have not yet confirmed) that we should see a huge
> compile-time spike in cross builds as well, though obviously it won't be
> as bad since we're not using qemu emulation.
>
> Clearly something isn't scaling well.  I don't know if we've got a crazy
> large function in there, a crazy number of functions or something that's
> just triggering a compile-time scaling problem.  Whatever it is, we
> probably need to address it.
>
> jeff
>
>


Re: Re: RISC-V Bootstrap problems

2023-05-24 Thread juzhe.zh...@rivai.ai
Besides, we don't have compilation issues when cross-compiling (with segment
intrinsics).
But I do agree we need to address this issue.

As far as I know, GCC compiles insn-emit.cc in a single thread on a single core.
Could we compile it multi-threaded / on multiple cores to speed up the compilation?

Thanks.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-25 11:43
To: Palmer Dabbelt; Vineet Gupta
CC: kito.cheng; gcc-patches; Kito Cheng; Patrick O'Neill; Jeff Law; macro; 
juzhe.zh...@rivai.ai
Subject: Re: RISC-V Bootstrap problems
 
 
On 5/24/23 17:13, Palmer Dabbelt wrote:
> On Wed, 24 May 2023 16:12:20 PDT (-0700), Vineet Gupta wrote:
 
[ ... big snip ... ]
 
>>
>> Never mind. Looks like I found the issue - with just trial and error and
>> no idea of how this stuff works.
>> The torture-{init,finish} needs to be in riscv.exp not rvv.exp
>> Running full tests now.
> 
> Thanks!
Marginally related.  I was able to bisect the "hang" when 3-staging the 
trunk on RISC-V with qemu user mode emulation.
 
So it wasn't actually hanging, but after the introduction of segment 
intrinsics the compilation time for insn-emit explodes -- previously I 
could do a full 3-stage bootstrap, build the glibc & the kernel, then 
test c/c++/fortran in ~10 hours.
 
Now just building insn-emit.o alone takes ~10 hours in that environment. 
  I suspect (but have not yet confirmed) that we should see a huge 
compile-time spike in cross builds as well, though obviously it won't be 
as bad since we're not using qemu emulation.
 
Clearly something isn't scaling well.  I don't know if we've got a crazy 
large function in there, a crazy number of functions or something that's 
just triggering a compile-time scaling problem.  Whatever it is, we 
probably need to address it.
 
jeff
 
 


Re: Re: RISC-V Bootstrap problems

2023-05-24 Thread juzhe.zh...@rivai.ai
The segment intrinsics are a really huge amount of code.

Even though I have tried to optimize them, we still have the issue.

How about LLVM?  Can Kito help with this issue?
LLVM has already supported the full intrinsics for a long time with no issues.

Thanks.


juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-05-25 11:43
To: Palmer Dabbelt; Vineet Gupta
CC: kito.cheng; gcc-patches; Kito Cheng; Patrick O'Neill; Jeff Law; macro; 
juzhe.zh...@rivai.ai
Subject: Re: RISC-V Bootstrap problems
 
 
On 5/24/23 17:13, Palmer Dabbelt wrote:
> On Wed, 24 May 2023 16:12:20 PDT (-0700), Vineet Gupta wrote:
 
[ ... big snip ... ]
 
>>
>> Never mind. Looks like I found the issue - with just trial and error and
>> no idea of how this stuff works.
>> The torture-{init,finish} needs to be in riscv.exp not rvv.exp
>> Running full tests now.
> 
> Thanks!
Marginally related.  I was able to bisect the "hang" when 3-staging the 
trunk on RISC-V with qemu user mode emulation.
 
So it wasn't actually hanging, but after the introduction of segment 
intrinsics the compilation time for insn-emit explodes -- previously I 
could do a full 3-stage bootstrap, build the glibc & the kernel, then 
test c/c++/fortran in ~10 hours.
 
Now just building insn-emit.o alone takes ~10 hours in that environment. 
  I suspect (but have not yet confirmed) that we should see a huge 
compile-time spike in cross builds as well, though obviously it won't be 
as bad since we're not using qemu emulation.
 
Clearly something isn't scaling well.  I don't know if we've got a crazy 
large function in there, a crazy number of functions or something that's 
just triggering a compile-time scaling problem.  Whatever it is, we 
probably need to address it.
 
jeff
 
 


Re: RISC-V Bootstrap problems

2023-05-24 Thread Jeff Law via Gcc-patches




On 5/24/23 17:13, Palmer Dabbelt wrote:

On Wed, 24 May 2023 16:12:20 PDT (-0700), Vineet Gupta wrote:


[ ... big snip ... ]



Never mind. Looks like I found the issue - with just trial and error and
no idea of how this stuff works.
The torture-{init,finish} needs to be in riscv.exp not rvv.exp
Running full tests now.


Thanks!
Marginally related.  I was able to bisect the "hang" when 3-staging the 
trunk on RISC-V with qemu user mode emulation.


So it wasn't actually hanging, but after the introduction of segment 
intrinsics the compilation time for insn-emit explodes -- previously I 
could do a full 3-stage bootstrap, build the glibc & the kernel, then 
test c/c++/fortran in ~10 hours.


Now just building insn-emit.o alone takes ~10 hours in that environment. 
 I suspect (but have not yet confirmed) that we should see a huge 
compile-time spike in cross builds as well, though obviously it won't be 
as bad since we're not using qemu emulation.


Clearly something isn't scaling well.  I don't know if we've got a crazy 
large function in there, a crazy number of functions or something that's 
just triggering a compile-time scaling problem.  Whatever it is, we 
probably need to address it.


jeff



Re: [PATCH] LoongArch: Fix the problem of structure parameter passing in C++. This structure has empty structure members and less than three floating point members.

2023-05-24 Thread Lulu Cheng



On 2023/5/25 10:52 AM, WANG Xuerui wrote:


On 2023/5/25 10:46, Lulu Cheng wrote:


On 2023/5/25 4:15 AM, Jason Merrill wrote:
On Wed, May 24, 2023 at 5:00 AM Jonathan Wakely via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:


    On Wed, 24 May 2023 at 09:41, Xi Ruoyao  wrote:

    > Wang Lei raised some concerns about Itanium C++ ABI, so let's
    ask a C++
    > expert here...
    >
    > Jonathan: AFAIK the standard and the Itanium ABI treats an empty
    class
    > as size 1

    Only as a complete object, not as a subobject.


Also as a data member subobject.

    > in order to guarantee unique address, so for the following:
    >
    > class Empty {};
    > class Test { Empty empty; double a, b; };

    There is no need to have a unique address here, so Test::empty and
    Test::a
    have the same address. It's a potentially-overlapping subobject.

    For the Itanium ABI, sizeof(Test) == 2 * sizeof(double).


That would be true if Test::empty were marked [[no_unique_address]], 
but without that attribute, sizeof(Test) is actually 3 * 
sizeof(double).


    > When we pass "Test" via registers, we may only allocate the
    registers
    > for Test::a and Test::b, and complete ignore Test::empty because
    there
    > is no addresses of registers.  Is this correct or not?

    I think that's a decision for the loongarch psABI. In principle,
    there's no
    reason a register has to be used to pass Test::empty, since you
    can't read
    from it or write to it.


Agreed.  The Itanium C++ ABI has nothing to say about how registers 
are allocated for parameter passing; this is a matter for the psABI.


And there is no need for a psABI to allocate a register for 
Test::empty because it contains no data.


In the x86_64 psABI, Test above is passed in memory because of its 
size ("the size of the aggregate exceeds two eightbytes...").  But


struct Test2 { Empty empty; double a; };

is passed in a single floating-point register; the Test2::empty 
subobject is not passed anywhere, because its eightbyte is 
classified as NO_CLASS, because there is no actual data there.






I know nothing about the LoongArch psABI, but going out of your way 
to assign a register to an empty class seems like a mistake.


MIPS64 and ARM64 also allocate parameter registers for empty structs. 
https://godbolt.org/z/jT4cY3T5o


Our original intention is not to pass this empty structure member,
but to make the following two structures handle empty structure members
in the same way when passing parameters.

struct st1
{
 struct empty {} e1;
 long a;
 long b;
};

struct st2
{
 struct empty {} e1;
 double f0;
 double f1;
};


Then shouldn't we try to avoid the extra register in all cases, 
instead of wasting one regardless? ;-)


https://godbolt.org/z/eK5T3Erbs

Compared with the x86-64 situation, if empty structure members must not be
passed, it is difficult to handle both cases uniformly.




Re: [PATCH] i386: Fix incorrect intrinsic signature for AVX512 s{lli|rai|rli}

2023-05-24 Thread Hongtao Liu via Gcc-patches
On Thu, May 25, 2023 at 10:55 AM Hu, Lin1 via Gcc-patches
 wrote:
>
> Hi all,
>
> This patch aims to fix incorrect intrinsic signature for 
> _mm{512|256|}_s{lli|rai|rli}_epi*. And it has been tested on 
> x86_64-pc-linux-gnu. OK for trunk?
>
> BRs,
> Lin
>
> gcc/ChangeLog:
>
> PR target/109173
> PR target/109174
> * config/i386/avx512bwintrin.h (_mm512_srli_epi16): Change type from
> int to const int.
This should say "int to unsigned int" (or "const int to const unsigned int").
Others LGTM.
> (_mm512_mask_srli_epi16): Ditto.
> (_mm512_slli_epi16): Ditto.
> (_mm512_mask_slli_epi16): Ditto.
> (_mm512_maskz_slli_epi16): Ditto.
> (_mm512_srai_epi16): Ditto.
> (_mm512_mask_srai_epi16): Ditto.
> (_mm512_maskz_srai_epi16): Ditto.
> * config/i386/avx512vlintrin.h (_mm256_mask_srli_epi32): Ditto.
> (_mm256_maskz_srli_epi32): Ditto.
> (_mm_mask_srli_epi32): Ditto.
> (_mm_maskz_srli_epi32): Ditto.
> (_mm256_mask_srli_epi64): Ditto.
> (_mm256_maskz_srli_epi64): Ditto.
> (_mm_mask_srli_epi64): Ditto.
> (_mm_maskz_srli_epi64): Ditto.
> (_mm256_mask_srai_epi32): Ditto.
> (_mm256_maskz_srai_epi32): Ditto.
> (_mm_mask_srai_epi32): Ditto.
> (_mm_maskz_srai_epi32): Ditto.
> (_mm256_srai_epi64): Ditto.
> (_mm256_mask_srai_epi64): Ditto.
> (_mm256_maskz_srai_epi64): Ditto.
> (_mm_srai_epi64): Ditto.
> (_mm_mask_srai_epi64): Ditto.
> (_mm_maskz_srai_epi64): Ditto.
> (_mm_mask_slli_epi32): Ditto.
> (_mm_maskz_slli_epi32): Ditto.
> (_mm_mask_slli_epi64): Ditto.
> (_mm_maskz_slli_epi64): Ditto.
> (_mm256_mask_slli_epi32): Ditto.
> (_mm256_maskz_slli_epi32): Ditto.
> (_mm256_mask_slli_epi64): Ditto.
> (_mm256_maskz_slli_epi64): Ditto.
> (_mm_mask_srai_epi16): Ditto.
> (_mm_maskz_srai_epi16): Ditto.
> (_mm256_srai_epi16): Ditto.
> (_mm256_mask_srai_epi16): Ditto.
> (_mm_mask_slli_epi16): Ditto.
> (_mm_maskz_slli_epi16): Ditto.
> (_mm256_mask_slli_epi16): Ditto.
> (_mm256_maskz_slli_epi16): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> PR target/109173
> PR target/109174
> * gcc.target/i386/pr109173-1.c: New test.
> * gcc.target/i386/pr109174-1.c: Ditto.
> ---
>  gcc/config/i386/avx512bwintrin.h   |  32 +++---
>  gcc/config/i386/avx512fintrin.h|  58 +++
>  gcc/config/i386/avx512vlbwintrin.h |  36 ---
>  gcc/config/i386/avx512vlintrin.h   | 112 +++--
>  gcc/testsuite/gcc.target/i386/pr109173-1.c |  57 +++
>  gcc/testsuite/gcc.target/i386/pr109174-1.c |  45 +
>  6 files changed, 236 insertions(+), 104 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr109173-1.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr109174-1.c
>
> diff --git a/gcc/config/i386/avx512bwintrin.h 
> b/gcc/config/i386/avx512bwintrin.h
> index 89790f7917b..791d4e35f32 100644
> --- a/gcc/config/i386/avx512bwintrin.h
> +++ b/gcc/config/i386/avx512bwintrin.h
> @@ -2880,7 +2880,7 @@ _mm512_maskz_dbsad_epu8 (__mmask32 __U, __m512i __A, 
> __m512i __B,
>
>  extern __inline __m512i
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm512_srli_epi16 (__m512i __A, const int __imm)
> +_mm512_srli_epi16 (__m512i __A, const unsigned int __imm)
>  {
>return (__m512i) __builtin_ia32_psrlwi512_mask ((__v32hi) __A, __imm,
>   (__v32hi)
> @@ -2891,7 +2891,7 @@ _mm512_srli_epi16 (__m512i __A, const int __imm)
>  extern __inline __m512i
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>  _mm512_mask_srli_epi16 (__m512i __W, __mmask32 __U, __m512i __A,
> -   const int __imm)
> +   const unsigned int __imm)
>  {
>return (__m512i) __builtin_ia32_psrlwi512_mask ((__v32hi) __A, __imm,
>   (__v32hi) __W,
> @@ -2910,7 +2910,7 @@ _mm512_maskz_srli_epi16 (__mmask32 __U, __m512i __A, 
> const int __imm)
>
>  extern __inline __m512i
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
> -_mm512_slli_epi16 (__m512i __A, const int __B)
> +_mm512_slli_epi16 (__m512i __A, const unsigned int __B)
>  {
>return (__m512i) __builtin_ia32_psllwi512_mask ((__v32hi) __A, __B,
>   (__v32hi)
> @@ -2921,7 +2921,7 @@ _mm512_slli_epi16 (__m512i __A, const int __B)
>  extern __inline __m512i
>  __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
>  _mm512_mask_slli_epi16 (__m512i __W, __mmask32 __U, __m512i __A,
> -   const int __B)
> +   const unsigned int __B)
>  {
>return (__m512i) __builtin_ia32_psllwi512_mask ((__v32hi) __A, __B,
>

RE: [PATCH v6] RISC-V: Using merge approach to optimize repeating sequence

2023-05-24 Thread Li, Pan2 via Gcc-patches
Oops, I forgot to remove it in the previous version; I will wait a while and
update them together.

Pan

From: juzhe.zh...@rivai.ai 
Sent: Thursday, May 25, 2023 11:14 AM
To: Li, Pan2 ; gcc-patches 
Cc: Kito.cheng ; Li, Pan2 ; Wang, 
Yanzhang 
Subject: Re: [PATCH v6] RISC-V: Using merge approach to optimize repeating 
sequence


* machmode.h (VECTOR_BOOL_MODE_P): New macro.

--- a/gcc/machmode.h

+++ b/gcc/machmode.h

@@ -134,6 +134,10 @@ extern const unsigned char mode_class[NUM_MACHINE_MODES];

|| GET_MODE_CLASS (MODE) == MODE_VECTOR_ACCUM\

|| GET_MODE_CLASS (MODE) == MODE_VECTOR_UACCUM)



+/* Nonzero if MODE is a vector bool mode.  */

+#define VECTOR_BOOL_MODE_P(MODE)\

+  (GET_MODE_CLASS (MODE) == MODE_VECTOR_BOOL)   \

+
Why did you add this?  It is unused; you should drop it.


juzhe.zh...@rivai.ai

From: pan2.li
Date: 2023-05-25 11:09
To: gcc-patches
CC: juzhe.zhong; 
kito.cheng; pan2.li; 
yanzhang.wang
Subject: [PATCH v6] RISC-V: Using merge approach to optimize repeating sequence
From: Pan Li <pan2...@intel.com>

This patch would like to optimize the VLS vector initialization like
repeating sequence. From the vslide1down to the vmerge with a simple
cost model, aka every instruction only has 1 cost.

Given code with -march=rv64gcv_zvl256b --param 
riscv-autovec-preference=fixed-vlmax
typedef int64_t vnx32di __attribute__ ((vector_size (256)));

__attribute__ ((noipa)) void
f_vnx32di (int64_t a, int64_t b, int64_t *out)
{
  vnx32di v = {
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
  };
  *(vnx32di *) out = v;
}

Before this patch:
vslide1down.vx (x31 times)

After this patch:
li a5,-1431654400
addi a5,a5,-1365
li a3,-1431654400
addi a3,a3,-1366
slli a5,a5,32
add a5,a5,a3
vsetvli a4,zero,e64,m8,ta,ma
vmv.v.x v8,a0
vmv.s.x v0,a5
vmerge.vxm v8,v8,a1,v0
vs8r.v v8,0(a2)

Since we don't have SEW = 128 in vec_duplicate, we can't combine a and b into
a SEW = 128 element and then broadcast this big element.

Signed-off-by: Pan Li <pan2...@intel.com>
Co-Authored by: Juzhe-Zhong <juzhe.zh...@rivai.ai>

gcc/ChangeLog:

* config/riscv/riscv-protos.h (enum insn_type): New type.
* config/riscv/riscv-v.cc (RVV_INSN_OPERANDS_MAX): New macro.
(rvv_builder::can_duplicate_repeating_sequence_p): Align the
referenced class member.
(rvv_builder::get_merged_repeating_sequence):
(rvv_builder::repeating_sequence_use_merge_profitable_p): New
function to evaluate the optimization cost.
(rvv_builder::get_merge_scalar_mask): New function to get the
merge mask.
(emit_scalar_move_insn): New function to emit vmv.s.x.
(emit_vlmax_integer_move_insn): New function to emit vlmax vmv.v.x.
(emit_nonvlmax_integer_move_insn): New function to emit nonvlmax
vmv.v.x.
(get_repeating_sequence_dup_machine_mode): New function to get
the dup machine mode.
(expand_vector_init_merge_repeating_sequence): New function to
perform the optimization.
(expand_vec_init): Add this vector init optimization.
* config/riscv/riscv.h (BITS_PER_WORD): New macro.
* machmode.h (VECTOR_BOOL_MODE_P): New macro.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-3.c: New test.

Signed-off-by: Pan Li <pan2...@intel.com>
---
gcc/config/riscv/riscv-protos.h   |   1 +
gcc/config/riscv/riscv-v.cc   | 225 +-
gcc/config/riscv/riscv.h  |   1 +
gcc/machmode.h|   4 +
.../vls-vlmax/init-repeat-sequence-1.c|  21 ++
.../vls-vlmax/init-repeat-sequence-2.c|  24 ++
.../vls-vlmax/init-repeat-sequence-3.c|  25 ++
.../vls-vlmax/init-repeat-sequence-4.c|  15 ++
.../vls-vlmax/init-repeat-sequence-5.c|  17 ++
.../vls-vlmax/init-repeat-sequence-run-1.c|  47 
.../vls-vlmax/init-repeat-sequence-run-2.c|  46 
.../vls-vlmax/init-repeat-sequence-run-3.c|  41 
12 files changed, 461 insertions(+), 6 deletions(-)
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-1.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-2.c

Re: [PATCH v6] RISC-V: Using merge approach to optimize repeating sequence

2023-05-24 Thread juzhe.zh...@rivai.ai
* machmode.h (VECTOR_BOOL_MODE_P): New macro.
--- a/gcc/machmode.h
+++ b/gcc/machmode.h
@@ -134,6 +134,10 @@ extern const unsigned char mode_class[NUM_MACHINE_MODES];
|| GET_MODE_CLASS (MODE) == MODE_VECTOR_ACCUM   \
|| GET_MODE_CLASS (MODE) == MODE_VECTOR_UACCUM)
 
+/* Nonzero if MODE is a vector bool mode.  */
+#define VECTOR_BOOL_MODE_P(MODE)   \
+  (GET_MODE_CLASS (MODE) == MODE_VECTOR_BOOL)  \
+
Why did you add this?  It is unused; you should drop it.



juzhe.zh...@rivai.ai
 
From: pan2.li
Date: 2023-05-25 11:09
To: gcc-patches
CC: juzhe.zhong; kito.cheng; pan2.li; yanzhang.wang
Subject: [PATCH v6] RISC-V: Using merge approach to optimize repeating sequence
From: Pan Li 
 
This patch would like to optimize the VLS vector initialization like
repeating sequence. From the vslide1down to the vmerge with a simple
cost model, aka every instruction only has 1 cost.
 
Given code with -march=rv64gcv_zvl256b --param 
riscv-autovec-preference=fixed-vlmax
typedef int64_t vnx32di __attribute__ ((vector_size (256)));
 
__attribute__ ((noipa)) void
f_vnx32di (int64_t a, int64_t b, int64_t *out)
{
  vnx32di v = {
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
  };
  *(vnx32di *) out = v;
}
 
Before this patch:
vslide1down.vx (x31 times)
 
After this patch:
li a5,-1431654400
addi a5,a5,-1365
li a3,-1431654400
addi a3,a3,-1366
slli a5,a5,32
add a5,a5,a3
vsetvli a4,zero,e64,m8,ta,ma
vmv.v.x v8,a0
vmv.s.x v0,a5
vmerge.vxm v8,v8,a1,v0
vs8r.v v8,0(a2)
 
Since we don't have SEW = 128 in vec_duplicate, we can't combine a and b into
a SEW = 128 element and then broadcast this big element.
 
Signed-off-by: Pan Li 
Co-Authored by: Juzhe-Zhong 
 
gcc/ChangeLog:
 
* config/riscv/riscv-protos.h (enum insn_type): New type.
* config/riscv/riscv-v.cc (RVV_INSN_OPERANDS_MAX): New macro.
(rvv_builder::can_duplicate_repeating_sequence_p): Align the
referenced class member.
(rvv_builder::get_merged_repeating_sequence):
(rvv_builder::repeating_sequence_use_merge_profitable_p): New
function to evaluate the optimization cost.
(rvv_builder::get_merge_scalar_mask): New function to get the
merge mask.
(emit_scalar_move_insn): New function to emit vmv.s.x.
(emit_vlmax_integer_move_insn): New function to emit vlmax vmv.v.x.
(emit_nonvlmax_integer_move_insn): New function to emit nonvlmax
vmv.v.x.
(get_repeating_sequence_dup_machine_mode): New function to get
the dup machine mode.
(expand_vector_init_merge_repeating_sequence): New function to
perform the optimization.
(expand_vec_init): Add this vector init optimization.
* config/riscv/riscv.h (BITS_PER_WORD): New macro.
* machmode.h (VECTOR_BOOL_MODE_P): New macro.
 
gcc/testsuite/ChangeLog:
 
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-3.c: New test.
 
Signed-off-by: Pan Li 
---
gcc/config/riscv/riscv-protos.h   |   1 +
gcc/config/riscv/riscv-v.cc   | 225 +-
gcc/config/riscv/riscv.h  |   1 +
gcc/machmode.h|   4 +
.../vls-vlmax/init-repeat-sequence-1.c|  21 ++
.../vls-vlmax/init-repeat-sequence-2.c|  24 ++
.../vls-vlmax/init-repeat-sequence-3.c|  25 ++
.../vls-vlmax/init-repeat-sequence-4.c|  15 ++
.../vls-vlmax/init-repeat-sequence-5.c|  17 ++
.../vls-vlmax/init-repeat-sequence-run-1.c|  47 
.../vls-vlmax/init-repeat-sequence-run-2.c|  46 
.../vls-vlmax/init-repeat-sequence-run-3.c|  41 
12 files changed, 461 insertions(+), 6 deletions(-)
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-1.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-2.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-3.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-4.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-5.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-1.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-2.c
create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-3.c
 
diff --git a/gcc/config/riscv/riscv-protos.h 

RE: Re: [PATCH V5] RISC-V: Using merge approach to optimize repeating sequence in vec_init

2023-05-24 Thread Li, Pan2 via Gcc-patches
Hi Kito,

Update the PATCH v6 with refactored framework as below, thanks for comments.

https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619536.html

Pan

-Original Message-
From: Gcc-patches  On Behalf 
Of Kito Cheng via Gcc-patches
Sent: Wednesday, May 17, 2023 11:52 AM
To: juzhe.zh...@rivai.ai
Cc: gcc-patches ; palmer ; 
jeffreyalaw 
Subject: Re: Re: [PATCH V5] RISC-V: Using merge approach to optimize repeating 
sequence in vec_init

On Wed, May 17, 2023 at 11:36 AM juzhe.zh...@rivai.ai  
wrote:
>
> >> Does it means we assume inner_int_mode is DImode? (because sizeof 
> >> (uint64_t)) or it should be something like `for (unsigned int i = 
> >> 0; i < (GET_MODE_SIZE(inner_int_mode ()) * 8 / npatterns ()); i++)` ?
> No, sizeof (uint64_t) means uint64_t mask = 0;

+  return gen_int_mode (mask, inner_int_mode ());
And do we expect that the uint64_t mask can always fit into inner_int_mode ()?
If not, why do we fill up all 64 bits?

>
> >> Do you mind giving more comments about this?  What does it check and what does it do?
> The reason we use known_gt (GET_MODE_SIZE (dup_mode),
> BYTES_PER_RISCV_VECTOR) is that we are using a vector integer mode to
> generate the mask.  For example, to generate a 0b01010101010101 mask, we
> should use a scalar register holding the value 0b010101010...,
> then vmv.v.x it into a vector; this vector will then be used as a mask.
>
> >> Why is this only handled in the else?  I guess I have this question because
> >> I don't fully understand the logic of the if condition.
>
> Since we can't use a vector floating-point instruction to generate a mask.

I don't get why it's not something like below?

if (known_gt (GET_MODE_SIZE (dup_mode), BYTES_PER_RISCV_VECTOR)) { ...
}
if (FLOAT_MODE_P (dup_mode))
{
...
}



>
> >> nit: builder.inner_mode () rather than GET_MODE_INNER (dup_mode)?
>
> They are the same. I can change it using GET_MODE_INNER
>
>> And I would like to have more comments explaining why we need force_reg here.
> Because otherwise it causes an ICE.

But why?  And why can it be resolved by force_reg?  You need a few more
comments in the code.


[PATCH v6] RISC-V: Using merge approach to optimize repeating sequence

2023-05-24 Thread Pan Li via Gcc-patches
From: Pan Li 

This patch would like to optimize the VLS vector initialization like
repeating sequence. From the vslide1down to the vmerge with a simple
cost model, aka every instruction only has 1 cost.

Given code with -march=rv64gcv_zvl256b --param 
riscv-autovec-preference=fixed-vlmax
typedef int64_t vnx32di __attribute__ ((vector_size (256)));

__attribute__ ((noipa)) void
f_vnx32di (int64_t a, int64_t b, int64_t *out)
{
  vnx32di v = {
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
a, b, a, b, a, b, a, b,
  };
  *(vnx32di *) out = v;
}

Before this patch:
vslide1down.vx (x31 times)

After this patch:
li  a5,-1431654400
addia5,a5,-1365
li  a3,-1431654400
addia3,a3,-1366
sllia5,a5,32
add a5,a5,a3
vsetvli a4,zero,e64,m8,ta,ma
vmv.v.x v8,a0
vmv.s.x v0,a5
vmerge.vxm  v8,v8,a1,v0
vs8r.v  v8,0(a2)

Since we don't have SEW = 128 in vec_duplicate, we can't combine a and b into
a SEW = 128 element and then broadcast this big element.

Signed-off-by: Pan Li 
Co-Authored by: Juzhe-Zhong 

gcc/ChangeLog:

* config/riscv/riscv-protos.h (enum insn_type): New type.
* config/riscv/riscv-v.cc (RVV_INSN_OPERANDS_MAX): New macro.
(rvv_builder::can_duplicate_repeating_sequence_p): Align the
referenced class member.
(rvv_builder::get_merged_repeating_sequence):
(rvv_builder::repeating_sequence_use_merge_profitable_p): New
function to evaluate the optimization cost.
(rvv_builder::get_merge_scalar_mask): New function to get the
merge mask.
(emit_scalar_move_insn): New function to emit vmv.s.x.
(emit_vlmax_integer_move_insn): New function to emit vlmax vmv.v.x.
(emit_nonvlmax_integer_move_insn): New function to emit nonvlmax
vmv.v.x.
(get_repeating_sequence_dup_machine_mode): New function to get
the dup machine mode.
(expand_vector_init_merge_repeating_sequence): New function to
perform the optimization.
(expand_vec_init): Add this vector init optimization.
* config/riscv/riscv.h (BITS_PER_WORD): New macro.
* machmode.h (VECTOR_BOOL_MODE_P): New macro.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-1.c: New 
test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-2.c: New 
test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-3.c: New 
test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-4.c: New 
test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-5.c: New 
test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-1.c: 
New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-2.c: 
New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-3.c: 
New test.

Signed-off-by: Pan Li 
---
 gcc/config/riscv/riscv-protos.h   |   1 +
 gcc/config/riscv/riscv-v.cc   | 225 +-
 gcc/config/riscv/riscv.h  |   1 +
 gcc/machmode.h|   4 +
 .../vls-vlmax/init-repeat-sequence-1.c|  21 ++
 .../vls-vlmax/init-repeat-sequence-2.c|  24 ++
 .../vls-vlmax/init-repeat-sequence-3.c|  25 ++
 .../vls-vlmax/init-repeat-sequence-4.c|  15 ++
 .../vls-vlmax/init-repeat-sequence-5.c|  17 ++
 .../vls-vlmax/init-repeat-sequence-run-1.c|  47 
 .../vls-vlmax/init-repeat-sequence-run-2.c|  46 
 .../vls-vlmax/init-repeat-sequence-run-3.c|  41 
 12 files changed, 461 insertions(+), 6 deletions(-)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-1.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-4.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-5.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-1.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-2.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/vls-vlmax/init-repeat-sequence-run-3.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 36419c95bbd..768b646fec1 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -140,6 +140,7 @@ enum insn_type
   RVV_MERGE_OP = 4,
   RVV_CMP_OP = 4,
   RVV_CMP_MU_OP = RVV_CMP_OP + 2, /* +2 means mask and maskoff operand.  */
+  RVV_SCALAR_MOV_OP = 4,
 };
 enum vlmul_type
 {
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index f71ad9e46a1..458020ce0a1 100644

Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread juzhe.zh...@rivai.ai
Hi, Richard. 
After several tries with your testcases (which I have already added into the V15 patch),
I think "using a new IV" is better than "multiplication".

Now:
 loop_len_34 = MIN_EXPR ;
  _74 = MIN_EXPR ;   --> the multiplication approach would change this
into  _74 = loop_len_34 * 2;
  loop_len_48 = MIN_EXPR <_74, 4>;
  _77 = _74 - loop_len_48;
  loop_len_49 = MIN_EXPR <_77, 4>;
  _78 = _77 - loop_len_49;
  loop_len_50 = MIN_EXPR <_78, 4>;
  loop_len_51 = _78 - loop_len_50;

I prefer "new IV" since it looks more reasonable and better codegen.
Could you take a look at it:
V15 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619534.html 
  
Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-25 04:05
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
I'll look at the samples tomorrow, but just to address one thing:
 
钟居哲  writes:
>>> What gives the best code in these cases?  Is emitting a multiplication
>>> better?  Or is using a new IV better?
> Could you give me more detail information about "new refresh IV" approach.
> I'd like to try that.
 
By “using a new IV” I meant calling vect_set_loop_controls_directly
for every rgroup, not just the first.  So in the earlier example,
there would be one decrementing IV for x and one decrementing IV for y.
 
Thanks,
Richard
 
 
 


[PATCH V15] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread juzhe . zhong
From: Ju-Zhe Zhong 

This patch supports decrementing the IV by following the flow designed by Richard:

(1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
call vect_set_loop_controls_directly.

(2) vect_set_loop_controls_directly calculates "step" as in your patch.
If rgc has 1 control, this step is the SSA name created for that control.
Otherwise the step is a fresh SSA name, as in your patch.

(3) vect_set_loop_controls_directly stores this step somewhere for later
use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.

(4) After the vect_set_loop_controls_directly call above, and outside
the "if" statement that now contains vect_set_loop_controls_directly,
check whether rgc->controls.length () > 1.  If so, use
vect_adjust_loop_lens_control to set the controls based on S.

Then the only caller of vect_adjust_loop_lens_control is
vect_set_loop_condition_partial_vectors.  And the starting
step for vect_adjust_loop_lens_control is always S.

This patch has been well tested for single-rgroup and multiple-rgroup (SLP) and
passed all testcases in the RISC-V port.

It also passes the tests for multiple-rgroup (non-SLP), exercised on vec_pack_trunk.

It fixes these failures of the V14 patch:
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution 
test

This patch passed all testcases listed above.

gcc/ChangeLog:

* tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Add 
decrement IV support.
(vect_adjust_loop_lens_control): Ditto.
(vect_set_loop_condition_partial_vectors): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): New variables.
* tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
(LOOP_VINFO_DECREMENTING_IV_STEP): Ditto.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c: New test.
* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c: New test.
* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c: New 
test.
* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c: New 
test.

---
 .../rvv/autovec/partial/multiple_rgroup-3.c   | 288 ++
 .../rvv/autovec/partial/multiple_rgroup-4.c   |  75 +
 .../autovec/partial/multiple_rgroup_run-3.c   |  36 +++
 .../autovec/partial/multiple_rgroup_run-4.c   |  15 +
 gcc/tree-vect-loop-manip.cc   | 153 ++
 gcc/tree-vect-loop.cc |  13 +
 gcc/tree-vectorizer.h |  12 +
 7 files changed, 592 insertions(+)
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c
 create mode 100644 
gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c

diff --git 
a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c 
b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
new file mode 100644
index 000..9579749c285
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
@@ -0,0 +1,288 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param 
riscv-autovec-preference=fixed-vlmax" } */
+
+#include 
+
+void __attribute__ ((noinline, noclone))
+f0 (int8_t *__restrict x, int16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+{
+  x[i + 0] += 1;
+  x[i + 1] += 2;
+  x[i + 2] += 3;
+  x[i + 3] += 4;
+  y[j + 0] += 1;
+  y[j + 1] += 2;
+  y[j + 2] += 3;
+  y[j + 3] += 4;
+  y[j + 4] += 5;
+  y[j + 5] += 6;
+  y[j + 6] += 7;
+  y[j + 7] += 8;
+}
+}
+
+void __attribute__ ((optimize (0)))
+f0_init (int8_t *__restrict x, int8_t *__restrict x2, int16_t *__restrict y,
+int16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+{
+  x[i + 0] = i % 120;
+  x[i + 1] = i % 78;
+  x[i + 2] = i % 55;
+  x[i + 3] = i % 27;
+  y[j + 0] = j % 33;
+  y[j + 1] = j % 44;
+  y[j + 2] = j % 66;
+  y[j + 3] = j % 88;
+  y[j + 4] = j % 99;
+  y[j + 5] = j % 39;
+  y[j + 6] = j % 49;
+  y[j + 7] = j % 101;
+
+  x2[i + 0] = i % 120;
+  

[PATCH] i386: Fix incorrect intrinsic signature for AVX512 s{lli|rai|rli}

2023-05-24 Thread Hu, Lin1 via Gcc-patches
Hi all,

This patch aims to fix the incorrect intrinsic signatures for
_mm{512|256|}_s{lli|rai|rli}_epi*.  It has been tested on
x86_64-pc-linux-gnu.  OK for trunk?
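A minimal usage sketch (illustrative only; requires -mavx512bw): with the
corrected signatures the shift count parameter is const unsigned int, so a
call such as the following matches the declared type.

#include <immintrin.h>

__m512i
shift_left_by_3 (__m512i v)
{
  /* The count argument now has type const unsigned int.  */
  return _mm512_slli_epi16 (v, 3);
}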

BRs,
Lin

gcc/ChangeLog:

PR target/109173
PR target/109174
* config/i386/avx512bwintrin.h (_mm512_srli_epi16): Change type from
int to const int.
(_mm512_mask_srli_epi16): Ditto.
(_mm512_slli_epi16): Ditto.
(_mm512_mask_slli_epi16): Ditto.
(_mm512_maskz_slli_epi16): Ditto.
(_mm512_srai_epi16): Ditto.
(_mm512_mask_srai_epi16): Ditto.
(_mm512_maskz_srai_epi16): Ditto.
* config/i386/avx512vlintrin.h (_mm256_mask_srli_epi32): Ditto.
(_mm256_maskz_srli_epi32): Ditto.
(_mm_mask_srli_epi32): Ditto.
(_mm_maskz_srli_epi32): Ditto.
(_mm256_mask_srli_epi64): Ditto.
(_mm256_maskz_srli_epi64): Ditto.
(_mm_mask_srli_epi64): Ditto.
(_mm_maskz_srli_epi64): Ditto.
(_mm256_mask_srai_epi32): Ditto.
(_mm256_maskz_srai_epi32): Ditto.
(_mm_mask_srai_epi32): Ditto.
(_mm_maskz_srai_epi32): Ditto.
(_mm256_srai_epi64): Ditto.
(_mm256_mask_srai_epi64): Ditto.
(_mm256_maskz_srai_epi64): Ditto.
(_mm_srai_epi64): Ditto.
(_mm_mask_srai_epi64): Ditto.
(_mm_maskz_srai_epi64): Ditto.
(_mm_mask_slli_epi32): Ditto.
(_mm_maskz_slli_epi32): Ditto.
(_mm_mask_slli_epi64): Ditto.
(_mm_maskz_slli_epi64): Ditto.
(_mm256_mask_slli_epi32): Ditto.
(_mm256_maskz_slli_epi32): Ditto.
(_mm256_mask_slli_epi64): Ditto.
(_mm256_maskz_slli_epi64): Ditto.
(_mm_mask_srai_epi16): Ditto.
(_mm_maskz_srai_epi16): Ditto.
(_mm256_srai_epi16): Ditto.
(_mm256_mask_srai_epi16): Ditto.
(_mm_mask_slli_epi16): Ditto.
(_mm_maskz_slli_epi16): Ditto.
(_mm256_mask_slli_epi16): Ditto.
(_mm256_maskz_slli_epi16): Ditto.

gcc/testsuite/ChangeLog:

PR target/109173
PR target/109174
* gcc.target/i386/pr109173-1.c: New test.
* gcc.target/i386/pr109174-1.c: Ditto.
---
 gcc/config/i386/avx512bwintrin.h   |  32 +++---
 gcc/config/i386/avx512fintrin.h|  58 +++
 gcc/config/i386/avx512vlbwintrin.h |  36 ---
 gcc/config/i386/avx512vlintrin.h   | 112 +++--
 gcc/testsuite/gcc.target/i386/pr109173-1.c |  57 +++
 gcc/testsuite/gcc.target/i386/pr109174-1.c |  45 +
 6 files changed, 236 insertions(+), 104 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr109173-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr109174-1.c

diff --git a/gcc/config/i386/avx512bwintrin.h b/gcc/config/i386/avx512bwintrin.h
index 89790f7917b..791d4e35f32 100644
--- a/gcc/config/i386/avx512bwintrin.h
+++ b/gcc/config/i386/avx512bwintrin.h
@@ -2880,7 +2880,7 @@ _mm512_maskz_dbsad_epu8 (__mmask32 __U, __m512i __A, 
__m512i __B,
 
 extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm512_srli_epi16 (__m512i __A, const int __imm)
+_mm512_srli_epi16 (__m512i __A, const unsigned int __imm)
 {
   return (__m512i) __builtin_ia32_psrlwi512_mask ((__v32hi) __A, __imm,
  (__v32hi)
@@ -2891,7 +2891,7 @@ _mm512_srli_epi16 (__m512i __A, const int __imm)
 extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_mask_srli_epi16 (__m512i __W, __mmask32 __U, __m512i __A,
-   const int __imm)
+   const unsigned int __imm)
 {
   return (__m512i) __builtin_ia32_psrlwi512_mask ((__v32hi) __A, __imm,
  (__v32hi) __W,
@@ -2910,7 +2910,7 @@ _mm512_maskz_srli_epi16 (__mmask32 __U, __m512i __A, 
const int __imm)
 
 extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm512_slli_epi16 (__m512i __A, const int __B)
+_mm512_slli_epi16 (__m512i __A, const unsigned int __B)
 {
   return (__m512i) __builtin_ia32_psllwi512_mask ((__v32hi) __A, __B,
  (__v32hi)
@@ -2921,7 +2921,7 @@ _mm512_slli_epi16 (__m512i __A, const int __B)
 extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
 _mm512_mask_slli_epi16 (__m512i __W, __mmask32 __U, __m512i __A,
-   const int __B)
+   const unsigned int __B)
 {
   return (__m512i) __builtin_ia32_psllwi512_mask ((__v32hi) __A, __B,
  (__v32hi) __W,
@@ -2930,7 +2930,7 @@ _mm512_mask_slli_epi16 (__m512i __W, __mmask32 __U, 
__m512i __A,
 
 extern __inline __m512i
 __attribute__ ((__gnu_inline__, __always_inline__, __artificial__))
-_mm512_maskz_slli_epi16 (__mmask32 __U, __m512i __A, const int __B)
+_mm512_maskz_slli_epi16 (__mmask32 __U, 

Re: [PATCH] LoongArch: Fix the problem of structure parameter passing in C++. This structure has empty structure members and less than three floating point members.

2023-05-24 Thread WANG Xuerui



On 2023/5/25 10:46, Lulu Cheng wrote:


在 2023/5/25 上午4:15, Jason Merrill 写道:
On Wed, May 24, 2023 at 5:00 AM Jonathan Wakely via Gcc-patches 
<gcc-patches@gcc.gnu.org> wrote:


On Wed, 24 May 2023 at 09:41, Xi Ruoyao  wrote:

> Wang Lei raised some concerns about Itanium C++ ABI, so let's
ask a C++
> expert here...
>
> Jonathan: AFAIK the standard and the Itanium ABI treats an empty
class
> as size 1

Only as a complete object, not as a subobject.


Also as a data member subobject.

> in order to guarantee unique address, so for the following:
>
> class Empty {};
> class Test { Empty empty; double a, b; };

There is no need to have a unique address here, so Test::empty and
Test::a
have the same address. It's a potentially-overlapping subobject.

For the Itanium ABI, sizeof(Test) == 2 * sizeof(double).


That would be true if Test::empty were marked [[no_unique_address]], 
but without that attribute, sizeof(Test) is actually 3 * sizeof(double).


> When we pass "Test" via registers, we may only allocate the
registers
> for Test::a and Test::b, and complete ignore Test::empty because
there
> is no addresses of registers.  Is this correct or not?

I think that's a decision for the loongarch psABI. In principle,
there's no
reason a register has to be used to pass Test::empty, since you
can't read
from it or write to it.


Agreed.  The Itanium C++ ABI has nothing to say about how registers 
are allocated for parameter passing; this is a matter for the psABI.


And there is no need for a psABI to allocate a register for 
Test::empty because it contains no data.


In the x86_64 psABI, Test above is passed in memory because of its 
size ("the size of the aggregate exceeds two eightbytes...").  But


struct Test2 { Empty empty; double a; };

is passed in a single floating-point register; the Test2::empty 
subobject is not passed anywhere, because its eightbyte is classified 
as NO_CLASS, because there is no actual data there.
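A quick compile-time self-check of the sizes discussed above (editorial
sketch; Itanium C++ ABI layout assumed):

struct Empty {};
struct Test  { Empty empty; double a, b; };
struct Test2 { Empty empty; double a; };

static_assert (sizeof (Test)  == 3 * sizeof (double), "empty member occupies one byte plus padding");
static_assert (sizeof (Test2) == 2 * sizeof (double), "still two eightbytes despite the empty member");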






I know nothing about the LoongArch psABI, but going out of your way to 
assign a register to an empty class seems like a mistake.


MIPS64 and ARM64 also allocate parameter registers for empty structs. 
https://godbolt.org/z/jT4cY3T5o


Our original intention is not to pass this empty structure member, but 
to make the following two structures treat empty structure members


in the same way in the process of passing parameters.

struct st1
{
     struct empty {} e1;
     long a;
     long b;
};

struct st2
{
     struct empty {} e1;
     double f0;
     double f1;
};


Then shouldn't we try to avoid the extra register in all cases, instead 
of wasting one regardless? ;-)


Re: [PATCH] LoongArch: Fix the problem of structure parameter passing in C++. This structure has empty structure members and less than three floating point members.

2023-05-24 Thread Lulu Cheng



在 2023/5/25 上午4:15, Jason Merrill 写道:
On Wed, May 24, 2023 at 5:00 AM Jonathan Wakely via Gcc-patches 
<gcc-patches@gcc.gnu.org> wrote:


On Wed, 24 May 2023 at 09:41, Xi Ruoyao  wrote:

> Wang Lei raised some concerns about Itanium C++ ABI, so let's
ask a C++
> expert here...
>
> Jonathan: AFAIK the standard and the Itanium ABI treats an empty
class
> as size 1

Only as a complete object, not as a subobject.


Also as a data member subobject.

> in order to guarantee unique address, so for the following:
>
> class Empty {};
> class Test { Empty empty; double a, b; };

There is no need to have a unique address here, so Test::empty and
Test::a
have the same address. It's a potentially-overlapping subobject.

For the Itanium ABI, sizeof(Test) == 2 * sizeof(double).


That would be true if Test::empty were marked [[no_unique_address]], 
but without that attribute, sizeof(Test) is actually 3 * sizeof(double).


> When we pass "Test" via registers, we may only allocate the
registers
> for Test::a and Test::b, and complete ignore Test::empty because
there
> is no addresses of registers.  Is this correct or not?

I think that's a decision for the loongarch psABI. In principle,
there's no
reason a register has to be used to pass Test::empty, since you
can't read
from it or write to it.


Agreed.  The Itanium C++ ABI has nothing to say about how registers 
are allocated for parameter passing; this is a matter for the psABI.


And there is no need for a psABI to allocate a register for 
Test::empty because it contains no data.


In the x86_64 psABI, Test above is passed in memory because of its 
size ("the size of the aggregate exceeds two eightbytes...").  But


struct Test2 { Empty empty; double a; };

is passed in a single floating-point register; the Test2::empty 
subobject is not passed anywhere, because its eightbyte is classified 
as NO_CLASS, because there is no actual data there.






I know nothing about the LoongArch psABI, but going out of your way to 
assign a register to an empty class seems like a mistake.


MIPS64 and ARM64 also allocate parameter registers for empty structs. 
https://godbolt.org/z/jT4cY3T5o


Our original intention is not to pass this empty structure member, but 
to make the following two structures treat empty structure members


in the same way in the process of passing parameters.

struct st1
{
    struct empty {} e1;
    long a;
    long b;
};

struct st2
{
    struct empty {} e1;
    double f0;
    double f1;
};




> On Wed, 2023-05-24 at 14:45 +0800, Xi Ruoyao via Gcc-patches wrote:
> > On Wed, 2023-05-24 at 14:04 +0800, Lulu Cheng wrote:
> > > An empty struct type that is not non-trivial for the purposes of
> > > calls
> > > will be treated as though it were the following C type:
> > >
> > > struct {
> > >   char c;
> > > };
> > >
> > > Before this patch was added, a structure parameter containing an
> > > empty structure and
> > > less than three floating-point members was passed through
one or two
> > > floating-point
> > > registers, while nested empty structures are ignored. Which
did not
> > > conform to the
> > > calling convention.
> >
> > No, it's a deliberate decision I've made in
> > https://gcc.gnu.org/r12-8294. And we already agreed "the ABI
needs to
> > be updated" when we applied r12-8294, but I've never improved my
> > English
> > skill to revise the ABI myself :(.
> >
> > We are also using the same "de-facto" ABI throwing away the empty
> > struct
> > for Clang++ (https://reviews.llvm.org/D132285). So we should
update
> > the
> > spec here, instead of changing every implementation.
> >
> > The C++ standard treats the empty struct as size 1 for
ensuring the
> > semantics of pointer comparison operations.  When we pass it
through
> > the
> > registers, there is no need to really consider the empty field
because
> > there is no pointers to registers.
> >
>
>



[V8][PATCH 0/2]Accept and Handle the case when a structure including a FAM nested in another structure

2023-05-24 Thread Qing Zhao via Gcc-patches
Hi,

This is the 8th version of the patch, which rebased on the latest trunk.
This is an important patch needed by the Linux Kernel security project.

Compared to the 7th version, the major changes are:
1. update the documentation wordings based on Joseph's suggestions.
2. change the name of the new macro TYPE_INCLUDE_FLEXARRAY to
   TYPE_INCLUDES_FLEXARRAY. 

All others stay the same as in version 7.

The 7th version is here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619033.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619034.html
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619036.html

bootstrapped and regression tested on aarch64 and x86.

Okay for commit?

thanks a lot.

Qing



[PATCH 2/2] Update documentation to clarify a GCC extension [PR77650]

2023-05-24 Thread Qing Zhao via Gcc-patches
on a structure with a C99 flexible array member being nested in
another structure.

"The GCC extension accepts a structure containing an ISO C99 "flexible array
member", or a union containing such a structure (possibly recursively)
to be a member of a structure.

 There are two situations:

   * A structure containing a C99 flexible array member, or a union
 containing such a structure, is the last field of another structure,
 for example:

  struct flex  { int length; char data[]; };
  union union_flex { int others; struct flex f; };

  struct out_flex_struct { int m; struct flex flex_data; };
  struct out_flex_union { int n; union union_flex flex_data; };

 In the above, both 'out_flex_struct.flex_data.data[]' and
 'out_flex_union.flex_data.f.data[]' are considered as flexible
 arrays too.

   * A structure containing a C99 flexible array member, or a union
 containing such a structure, is not the last field of another structure,
 for example:

  struct flex  { int length; char data[]; };

  struct mid_flex { int m; struct flex flex_data; int n; };

 In the above, accessing a member of the array 'mid_flex.flex_data.data[]'
 might have undefined behavior.  Compilers do not handle such a case
 consistently; any code relying on this case should be modified to ensure
 that flexible array members only end up at the ends of structures.

 Please use the warning option '-Wflex-array-member-not-at-end' to
 identify all such cases in the source code and modify them.  This
 warning will be on by default starting from GCC 15.
"

gcc/c-family/ChangeLog:

* c.opt: New option -Wflex-array-member-not-at-end.

gcc/c/ChangeLog:

* c-decl.cc (finish_struct): Issue warnings for new option.

gcc/ChangeLog:

* doc/extend.texi: Document GCC extension on a structure containing
a flexible array member to be a member of another structure.

gcc/testsuite/ChangeLog:

* gcc.dg/variable-sized-type-flex-array.c: New test.
---
 gcc/c-family/c.opt|  5 +++
 gcc/c/c-decl.cc   |  9 
 gcc/doc/extend.texi   | 44 ++-
 .../gcc.dg/variable-sized-type-flex-array.c   | 31 +
 4 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/variable-sized-type-flex-array.c

diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index cddeece..c26d9801b63 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -737,6 +737,11 @@ Wformat-truncation=
 C ObjC C++ LTO ObjC++ Joined RejectNegative UInteger Var(warn_format_trunc) 
Warning LangEnabledBy(C ObjC C++ LTO ObjC++,Wformat=, warn_format >= 1, 0) 
IntegerRange(0, 2)
 Warn about calls to snprintf and similar functions that truncate output.
 
+Wflex-array-member-not-at-end
+C C++ Var(warn_flex_array_member_not_at_end) Warning
+Warn when a structure containing a C99 flexible array member as the last
+field is not at the end of another structure.
+
 Wif-not-aligned
 C ObjC C++ ObjC++ Var(warn_if_not_aligned) Init(1) Warning
 Warn when the field in a struct is not aligned.
diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index e14f514cb6e..ecd10ebb69c 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -9278,6 +9278,15 @@ finish_struct (location_t loc, tree t, tree fieldlist, 
tree attributes,
TYPE_INCLUDES_FLEXARRAY (t)
  = is_last_field && TYPE_INCLUDES_FLEXARRAY (TREE_TYPE (x));
 
+  if (warn_flex_array_member_not_at_end
+ && !is_last_field
+ && RECORD_OR_UNION_TYPE_P (TREE_TYPE (x))
+ && TYPE_INCLUDES_FLEXARRAY (TREE_TYPE (x)))
+   warning_at (DECL_SOURCE_LOCATION (x),
+   OPT_Wflex_array_member_not_at_end,
+   "structure containing a flexible array member"
+   " is not at the end of another structure");
+
   if (DECL_NAME (x)
  || RECORD_OR_UNION_TYPE_P (TREE_TYPE (x)))
saw_named_field = true;
diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index f9d13b495ad..17ef80e75cc 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -1751,7 +1751,49 @@ Flexible array members may only appear as the last 
member of a
 A structure containing a flexible array member, or a union containing
 such a structure (possibly recursively), may not be a member of a
 structure or an element of an array.  (However, these uses are
-permitted by GCC as extensions.)
+permitted by GCC as extensions, see details below.)
+@end itemize
+
+The GCC extension accepts a structure containing an ISO C99 @dfn{flexible array
+member}, or a union containing such a structure (possibly recursively)
+to be a member of a structure.
+
+There are two situations:
+
+@itemize @bullet
+@item
+A structure containing a C99 flexible array member, or a union containing
+such a structure, is the last field of another structure, for example:
+

[PATCH 1/2] Handle component_ref to a structre/union field including flexible array member [PR101832]

2023-05-24 Thread Qing Zhao via Gcc-patches
The GCC extension accepts the case where a struct with a C99 flexible array
member is embedded into another struct or union (possibly recursively) as the
last field.  __builtin_object_size should treat such a struct as having a
flexible size.
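A minimal sketch of the intended behavior (hypothetical example modeled on the
description above; the type and function names are illustrative):

#include <stddef.h>

struct flex { int n; char data[]; };                /* C99 flexible array member */
struct out_flex { int m; struct flex flex_data; };  /* GCC extension: flex as last field */

size_t
query_size (struct out_flex *p)
{
  /* With this patch the enclosing struct is treated as having flexible
     size, so the maximum object size of the trailing array is unknown
     and this is expected to return (size_t) -1.  */
  return __builtin_object_size (p->flex_data.data, 1);
}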

gcc/c/ChangeLog:

PR tree-optimization/101832
* c-decl.cc (finish_struct): Set TYPE_INCLUDES_FLEXARRAY for
struct/union type.

gcc/lto/ChangeLog:

PR tree-optimization/101832
* lto-common.cc (compare_tree_sccs_1): Compare bit
TYPE_NO_NAMED_ARGS_STDARG_P or TYPE_INCLUDES_FLEXARRAY properly
for its corresponding type.

gcc/ChangeLog:

PR tree-optimization/101832
* print-tree.cc (print_node): Print new bit type_includes_flexarray.
* tree-core.h (struct tree_type_common): Use bit no_named_args_stdarg_p
as type_includes_flexarray for RECORD_TYPE or UNION_TYPE.
* tree-object-size.cc (addr_object_size): Handle structure/union type
when it has flexible size.
* tree-streamer-in.cc (unpack_ts_type_common_value_fields): Stream
in bit no_named_args_stdarg_p properly for its corresponding type.
* tree-streamer-out.cc (pack_ts_type_common_value_fields): Stream
out bit no_named_args_stdarg_p properly for its corresponding type.
* tree.h (TYPE_INCLUDES_FLEXARRAY): New macro TYPE_INCLUDES_FLEXARRAY.

gcc/testsuite/ChangeLog:

PR tree-optimization/101832
* gcc.dg/builtin-object-size-pr101832.c: New test.
---
 gcc/c/c-decl.cc   |  11 ++
 gcc/lto/lto-common.cc |   5 +-
 gcc/print-tree.cc |   5 +
 .../gcc.dg/builtin-object-size-pr101832.c | 134 ++
 gcc/tree-core.h   |   2 +
 gcc/tree-object-size.cc   |  23 ++-
 gcc/tree-streamer-in.cc   |   5 +-
 gcc/tree-streamer-out.cc  |   5 +-
 gcc/tree.h|   7 +-
 9 files changed, 192 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/builtin-object-size-pr101832.c

diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index 1af51c4acfc..e14f514cb6e 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -9267,6 +9267,17 @@ finish_struct (location_t loc, tree t, tree fieldlist, 
tree attributes,
   /* Set DECL_NOT_FLEXARRAY flag for FIELD_DECL x.  */
   DECL_NOT_FLEXARRAY (x) = !is_flexible_array_member_p (is_last_field, x);
 
+  /* Set TYPE_INCLUDES_FLEXARRAY for the context of x, t.
+when x is an array and is the last field.  */
+  if (TREE_CODE (TREE_TYPE (x)) == ARRAY_TYPE)
+   TYPE_INCLUDES_FLEXARRAY (t)
+ = is_last_field && flexible_array_member_type_p (TREE_TYPE (x));
+  /* Recursively set TYPE_INCLUDES_FLEXARRAY for the context of x, t
+when x is an union or record and is the last field.  */
+  else if (RECORD_OR_UNION_TYPE_P (TREE_TYPE (x)))
+   TYPE_INCLUDES_FLEXARRAY (t)
+ = is_last_field && TYPE_INCLUDES_FLEXARRAY (TREE_TYPE (x));
+
   if (DECL_NAME (x)
  || RECORD_OR_UNION_TYPE_P (TREE_TYPE (x)))
saw_named_field = true;
diff --git a/gcc/lto/lto-common.cc b/gcc/lto/lto-common.cc
index 537570204b3..f6b85bbc6f7 100644
--- a/gcc/lto/lto-common.cc
+++ b/gcc/lto/lto-common.cc
@@ -1275,7 +1275,10 @@ compare_tree_sccs_1 (tree t1, tree t2, tree **map)
   if (AGGREGATE_TYPE_P (t1))
compare_values (TYPE_TYPELESS_STORAGE);
   compare_values (TYPE_EMPTY_P);
-  compare_values (TYPE_NO_NAMED_ARGS_STDARG_P);
+  if (FUNC_OR_METHOD_TYPE_P (t1))
+   compare_values (TYPE_NO_NAMED_ARGS_STDARG_P);
+  if (RECORD_OR_UNION_TYPE_P (t1))
+   compare_values (TYPE_INCLUDES_FLEXARRAY);
   compare_values (TYPE_PACKED);
   compare_values (TYPE_RESTRICT);
   compare_values (TYPE_USER_ALIGN);
diff --git a/gcc/print-tree.cc b/gcc/print-tree.cc
index ccecd3dc6a7..62451b6cf4e 100644
--- a/gcc/print-tree.cc
+++ b/gcc/print-tree.cc
@@ -632,6 +632,11 @@ print_node (FILE *file, const char *prefix, tree node, int 
indent,
  && TYPE_CXX_ODR_P (node))
fputs (" cxx-odr-p", file);
 
+  if ((code == RECORD_TYPE
+  || code == UNION_TYPE)
+ && TYPE_INCLUDES_FLEXARRAY (node))
+   fputs (" includes-flexarray", file);
+
   /* The transparent-union flag is used for different things in
 different nodes.  */
   if ((code == UNION_TYPE || code == RECORD_TYPE)
diff --git a/gcc/testsuite/gcc.dg/builtin-object-size-pr101832.c 
b/gcc/testsuite/gcc.dg/builtin-object-size-pr101832.c
new file mode 100644
index 000..60078e11634
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/builtin-object-size-pr101832.c
@@ -0,0 +1,134 @@
+/* PR 101832: 
+   GCC extension accepts the case when a struct with a C99 flexible array
+   member is embedded into another struct (possibly recursively).
+   __builtin_object_size will treat such struct as flexible size.

RE: [PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass

2023-05-24 Thread Cui, Lili via Gcc-patches
> > +rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> > +			    const vec<operand_entry *> &ops)
> >  {
> >enum tree_code opcode = gimple_assign_rhs_code (stmt);
> >int op_num = ops.length ();
> > @@ -5483,10 +5494,11 @@ rewrite_expr_tree_parallel (gassign *stmt, int
> width,
> >int stmt_num = op_num - 1;
> >gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
> >int op_index = op_num - 1;
> > -  int stmt_index = 0;
> > -  int ready_stmts_end = 0;
> > -  int i = 0;
> > -  gimple *stmt1 = NULL, *stmt2 = NULL;
> > +  int width_count = width;
> > +  int i = 0, j = 0;
> > +  tree tmp_op[2], op1;
> > +  operand_entry *oe;
> > +  gimple *stmt1 = NULL;
> >tree last_rhs1 = gimple_assign_rhs1 (stmt);
> >
> >/* We start expression rewriting from the top statements.
> > @@ -5496,91 +5508,84 @@ rewrite_expr_tree_parallel (gassign *stmt, int
> width,
> >for (i = stmt_num - 2; i >= 0; i--)
> >  stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
> >
> > -  for (i = 0; i < stmt_num; i++)
> > +  /* Build parallel dependency chain according to width.  */  for (i
> > + = 0; i < width; i++)
> >  {
> > -  tree op1, op2;
> > -
> > -  /* Determine whether we should use results of
> > -already handled statements or not.  */
> > -  if (ready_stmts_end == 0
> > - && (i - stmt_index >= width || op_index < 1))
> > -   ready_stmts_end = i;
> > -
> > -  /* Now we choose operands for the next statement.  Non zero
> > -value in ready_stmts_end means here that we should use
> > -the result of already generated statements as new operand.  */
> > -  if (ready_stmts_end > 0)
> > -   {
> > - op1 = gimple_assign_lhs (stmts[stmt_index++]);
> > - if (ready_stmts_end > stmt_index)
> > -   op2 = gimple_assign_lhs (stmts[stmt_index++]);
> > - else if (op_index >= 0)
> > -   {
> > - operand_entry *oe = ops[op_index--];
> > - stmt2 = oe->stmt_to_insert;
> > - op2 = oe->op;
> > -   }
> > - else
> > -   {
> > - gcc_assert (stmt_index < i);
> > - op2 = gimple_assign_lhs (stmts[stmt_index++]);
> > -   }
> > +  /*   */
> 
> empty comment?

Added it, thanks.

> 
> > +  if (op_index > 1 && !has_fma)
> > +   swap_ops_for_binary_stmt (ops, op_index - 2);
> >
> > - if (stmt_index >= ready_stmts_end)
> > -   ready_stmts_end = 0;
> > -   }
> > -  else
> > +  for (j = 0; j < 2; j++)
> > {
> > - if (op_index > 1)
> > -   swap_ops_for_binary_stmt (ops, op_index - 2);
> > - operand_entry *oe2 = ops[op_index--];
> > - operand_entry *oe1 = ops[op_index--];
> > - op2 = oe2->op;
> > - stmt2 = oe2->stmt_to_insert;
> > - op1 = oe1->op;
> > - stmt1 = oe1->stmt_to_insert;
> > + gcc_assert (op_index >= 0);
> > + oe = ops[op_index--];
> > + tmp_op[j] = oe->op;
> > + /* If the stmt that defines operand has to be inserted, insert it
> > +before the use.  */
> > + stmt1 = oe->stmt_to_insert;
> > + if (stmt1)
> > +   insert_stmt_before_use (stmts[i], stmt1);
> > + stmt1 = NULL;
> > }
> > -
> > -  /* If we emit the last statement then we should put
> > -operands into the last statement.  It will also
> > -break the loop.  */
> > -  if (op_index < 0 && stmt_index == i)
> > -   i = stmt_num - 1;
> > +  stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1],
> tmp_op[0], opcode);
> > +  gimple_set_visited (stmts[i], true);
> >
> >if (dump_file && (dump_flags & TDF_DETAILS))
> > {
> > - fprintf (dump_file, "Transforming ");
> > + fprintf (dump_file, " into ");
> >   print_gimple_stmt (dump_file, stmts[i], 0);
> > }
> > +}
> >
> > -  /* If the stmt that defines operand has to be inserted, insert it
> > -before the use.  */
> > -  if (stmt1)
> > -   insert_stmt_before_use (stmts[i], stmt1);
> > -  if (stmt2)
> > -   insert_stmt_before_use (stmts[i], stmt2);
> > -  stmt1 = stmt2 = NULL;
> > -
> > -  /* We keep original statement only for the last one.  All
> > -others are recreated.  */
> > -  if (i == stmt_num - 1)
> > +  for (i = width; i < stmt_num; i++)
> > +{
> > +  /* We keep original statement only for the last one.  All others are
> > +recreated.  */
> > +  if ( op_index < 0)
> > {
> > - gimple_assign_set_rhs1 (stmts[i], op1);
> > - gimple_assign_set_rhs2 (stmts[i], op2);
> > - update_stmt (stmts[i]);
> > + if (width_count == 2)
> > +   {
> > +
> > + /* We keep original statement only for the last one.  All
> > +others are recreated.  */
> > + 

[PATCH] Handle FMA friendly in reassoc pass

2023-05-24 Thread Cui, Lili via Gcc-patches
From: Lili Cui 

Make some changes in the reassoc pass to make it more friendly to the FMA pass later.
Using FMA instead of mult + add reduces register pressure and instructions
retired.

There are mainly two changes:
1. Put no-mult ops and mult ops alternately at the end of the queue, which is
conducive to generating more FMAs and to reducing the loss of FMAs when breaking
the chain (see the short sketch after this list).
2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
chains according to the given correlation width, keeping the FMA chance as
much as possible.
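A small sketch of change 1 (association shown at the source level; illustrative
only, not the pass's internal representation):

double
sketch (double a, double b, double c, double d, double t1, double t2)
{
  /* Pairing each product with a non-product term lets both multiplies
     fuse: fma(a, b, t1), then an add, then fma(c, d, ...).  Forming
     a*b + c*d first would leave one multiplication that cannot fuse.  */
  return ((t1 + a * b) + t2) + c * d;
}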

With the patch applied

On ICX:
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r   : Improved by 0.60% for single copy.
507.cactuBSSN_r: Improved by 1.10% for single copy.
519.lbm_r      : Improved by 2.21% for single copy.
No measurable changes for other benchmarks.

On aarch64:
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r   : Improved by 6.00% for single-copy.
No measurable changes for other benchmarks.

TEST1:

float
foo (float a, float b, float c, float d, float *e)
{
   return  *e  + a * b + c * d ;
}

For "-Ofast -mfpmath=sse -mfma" GCC generates:
vmulss  %xmm3, %xmm2, %xmm2
vfmadd132ss %xmm1, %xmm2, %xmm0
vaddss  (%rdi), %xmm0, %xmm0
ret

With this patch GCC generates:
vfmadd213ss   (%rdi), %xmm1, %xmm0
vfmadd231ss   %xmm2, %xmm3, %xmm0
ret

TEST2:

for (int i = 0; i < N; i++)
{
  a[i] += b[i]* c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + 
m[i]* o[i] + p[i];
}

For "-Ofast -mfpmath=sse -mfma"  GCC generates:
vmovapd e(%rax), %ymm4
vmulpd  d(%rax), %ymm4, %ymm3
addq    $32, %rax
vmovapd c-32(%rax), %ymm5
vmovapd j-32(%rax), %ymm6
vmulpd  h-32(%rax), %ymm6, %ymm2
vmovapd a-32(%rax), %ymm6
vaddpd  p-32(%rax), %ymm6, %ymm0
vmovapd g-32(%rax), %ymm7
vfmadd231pd b-32(%rax), %ymm5, %ymm3
vmovapd o-32(%rax), %ymm4
vmulpd  m-32(%rax), %ymm4, %ymm1
vmovapd l-32(%rax), %ymm5
vfmadd231pd f-32(%rax), %ymm7, %ymm2
vfmadd231pd k-32(%rax), %ymm5, %ymm1
vaddpd  %ymm3, %ymm0, %ymm0
vaddpd  %ymm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm0
vmovapd %ymm0, a-32(%rax)
cmpq    $8192, %rax
jne .L4
vzeroupper
ret

With this patch applied, GCC breaks the chain with width = 2 and generates 6 FMAs:

vmovapd a(%rax), %ymm2
vmovapd c(%rax), %ymm0
addq    $32, %rax
vmovapd e-32(%rax), %ymm1
vmovapd p-32(%rax), %ymm5
vmovapd g-32(%rax), %ymm3
vmovapd j-32(%rax), %ymm6
vmovapd l-32(%rax), %ymm4
vmovapd o-32(%rax), %ymm7
vfmadd132pd b-32(%rax), %ymm2, %ymm0
vfmadd132pd d-32(%rax), %ymm5, %ymm1
vfmadd231pd f-32(%rax), %ymm3, %ymm0
vfmadd231pd h-32(%rax), %ymm6, %ymm1
vfmadd231pd k-32(%rax), %ymm4, %ymm0
vfmadd231pd m-32(%rax), %ymm7, %ymm1
vaddpd  %ymm1, %ymm0, %ymm0
vmovapd %ymm0, a-32(%rax)
cmpq    $8192, %rax
jne .L2
vzeroupper
ret

gcc/ChangeLog:

PR gcc/98350
* tree-ssa-reassoc.cc
(rewrite_expr_tree_parallel): Rewrite this function.
(rank_ops_for_fma): New.
(reassociate_bb): Handle new function.

gcc/testsuite/ChangeLog:

PR gcc/98350
* gcc.dg/pr98350-1.c: New test.
* gcc.dg/pr98350-2.c: Ditto.
---
 gcc/testsuite/gcc.dg/pr98350-1.c |  31 
 gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
 gcc/tree-ssa-reassoc.cc  | 256 +--
 3 files changed, 215 insertions(+), 83 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
 create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c

diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
new file mode 100644
index 000..6bcf78a19ab
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast  -fdump-tree-widening_mul" } */
+
+/* Test that the compiler properly optimizes multiply and add 
+   to generate more FMA instructions.  */
+#define N 1024
+double a[N];
+double b[N];
+double c[N];
+double d[N];
+double e[N];
+double f[N];
+double g[N];
+double h[N];
+double j[N];
+double k[N];
+double l[N];
+double m[N];
+double o[N];
+double p[N];
+
+
+void
+foo (void)
+{
+  for (int i = 0; i < N; i++)
+  {
+a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * 
l[i] + m[i]* o[i] + p[i];
+  }
+}
+/* { dg-final { scan-tree-dump-times { = \.FMA \(} 6 "widening_mul" } } */
diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
new file mode 100644
index 000..333d34f026a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98350-2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -fdump-tree-widening_mul" } */
+
+/* 

Re: [PATCH] RISC-V: Add missing torture-init and torture-finish for rvv.exp

2023-05-24 Thread Jeff Law




On 5/24/23 17:12, Vineet Gupta wrote:



On 5/24/23 15:13, Vineet Gupta wrote:


PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  (test for excess errors)
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-times mul\t 1
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-not div\t
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-not rem\t
testcase 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp completed in 60 seconds
Running 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp ...
ERROR: tcl error sourcing 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp.

ERROR: tcl error code NONE
ERROR: torture-init: torture_without_loops is not empty as expected
    while executing
"error "torture-init: torture_without_loops is not empty as expected""
    invoked from within
"if [info exists torture_without_loops] {
    error "torture-init: torture_without_loops is not empty as expected"
    }"
    (procedure "torture-init" line 4)
    invoked from within
"torture-init"
    (file 
"/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp" line 42)

    invoked from within
"source 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"

    ("uplevel" body line 1)
    invoked from within
"uplevel #0 source 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"

    invoked from within
"catch "uplevel #0 source $test_file_name" msg"
UNRESOLVED: testcase 
'/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp' aborted due to Tcl error
testcase 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp completed in 0 seconds
Running 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/rl78/rl78.exp ...

...



Never mind. Looks like I found the issue - with just trial and error and 
no idea of how this stuff works.

The torture-{init,finish} needs to be in riscv.exp not rvv.exp
Running full tests now.

Trial and error is how I think most of us deal with the TCL insanity.

I have found send_user (aka printf debugging) to be quite helpful 
through the years.  There's also verbosity and trace options, but they 
can be painfully hard to interpret.


jeff



[RFC] RISC-V: Eliminate extension after for *w instructions

2023-05-24 Thread Jivan Hakobyan via Gcc-patches
This patch tries to prevent generating unnecessary sign extensions
after *w instructions like "addiw" or "divw".

The main idea of it is to add SUBREG_PROMOTED fields during expansion.

I have tested on SPEC2017 and there is no regression.
Only the gcc.dg/pr30957-1.c test failed.
To solve that I made some changes in loop-iv.cc, but I am not sure they are
suitable.
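A small example of the kind of code this targets (illustrative only):

/* On RV64, the SImode result of a "*w" instruction such as divw is
   already sign-extended to 64 bits, so a separate sext.w before the
   64-bit use below is redundant and should not be emitted.  */
long
quotient (int a, int b)
{
  int q = a / b;   /* divw sign-extends its 32-bit result */
  return q;        /* ideally reuses that result with no extra sext.w */
}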


gcc/ChangeLog:
* config/riscv/bitmanip.md (rotrdi3): New pattern.
(rotrsi3): Likewise.
(rotlsi3): Likewise.
* config/riscv/riscv-protos.h (riscv_emit_binary): New function
declaration
* config/riscv/riscv.cc (riscv_emit_binary): Removed static
* config/riscv/riscv.md (addsi3): New pattern
(subsi3): Likewise.
(negsi2): Likewise.
(mulsi3): Likewise.
(si3): New pattern for any_div.
(si3): New pattern for any_shift.
* loop-iv.cc (get_biv_step_1): Process src of extension when it is
PLUS.

gcc/testsuite/ChangeLog:
* testsuite/gcc.target/riscv/shift-and-2.c: New test
* testsuite/gcc.target/riscv/shift-shift-2.c: New test
* testsuite/gcc.target/riscv/sign-extend.c: New test
* testsuite/gcc.target/riscv/zbb-rol-ror-03.c: New test


-- 
With the best regards
Jivan Hakobyan
diff --git a/gcc/config/riscv/bitmanip.md b/gcc/config/riscv/bitmanip.md
index 96d31d92670b27d495dc5a9fbfc07e8767f40976..0430af7c95b1590308648dc4d5aaea78ada71760 100644
--- a/gcc/config/riscv/bitmanip.md
+++ b/gcc/config/riscv/bitmanip.md
@@ -304,9 +304,9 @@
   [(set_attr "type" "bitmanip,load")
(set_attr "mode" "HI")])
 
-(define_expand "rotr3"
-  [(set (match_operand:GPR 0 "register_operand")
-	(rotatert:GPR (match_operand:GPR 1 "register_operand")
+(define_expand "rotrdi3"
+  [(set (match_operand:DI 0 "register_operand")
+	(rotatert:DI (match_operand:DI 1 "register_operand")
 		 (match_operand:QI 2 "arith_operand")))]
   "TARGET_ZBB || TARGET_XTHEADBB || TARGET_ZBKB"
 {
@@ -322,6 +322,26 @@
   "ror%i2%~\t%0,%1,%2"
   [(set_attr "type" "bitmanip")])
 
+(define_expand "rotrsi3"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+   (rotatert:SI (match_operand:SI 1 "register_operand" "r")
+(match_operand:QI 2 "arith_operand" "rI")))]
+  "TARGET_ZBB || TARGET_ZBKB || TARGET_XTHEADBB"
+{
+  if (TARGET_XTHEADBB && !immediate_operand (operands[2], VOIDmode))
+FAIL;
+  if (TARGET_64BIT && register_operand(operands[2], QImode))
+{
+  rtx t = gen_reg_rtx (DImode);
+  emit_insn (gen_rotrsi3_sext (t, operands[1], operands[2]));
+  t = gen_lowpart (SImode, t);
+  SUBREG_PROMOTED_VAR_P (t) = 1;
+  SUBREG_PROMOTED_SET (t, SRP_SIGNED);
+  emit_move_insn (operands[0], t);
+  DONE;
+}
+})
+
 (define_insn "*rotrdi3"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(rotatert:DI (match_operand:DI 1 "register_operand" "r")
@@ -330,7 +350,7 @@
   "ror%i2\t%0,%1,%2"
   [(set_attr "type" "bitmanip")])
 
-(define_insn "*rotrsi3_sext"
+(define_insn "rotrsi3_sext"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(sign_extend:DI (rotatert:SI (match_operand:SI 1 "register_operand" "r")
  (match_operand:QI 2 "arith_operand" "rI"]
@@ -338,7 +358,7 @@
   "ror%i2%~\t%0,%1,%2"
   [(set_attr "type" "bitmanip")])
 
-(define_insn "rotlsi3"
+(define_insn "*rotlsi3"
   [(set (match_operand:SI 0 "register_operand" "=r")
 	(rotate:SI (match_operand:SI 1 "register_operand" "r")
 		   (match_operand:QI 2 "register_operand" "r")))]
@@ -346,6 +366,24 @@
   "rol%~\t%0,%1,%2"
   [(set_attr "type" "bitmanip")])
 
+(define_expand "rotlsi3"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+   (rotate:SI (match_operand:SI 1 "register_operand" "r")
+  (match_operand:QI 2 "register_operand" "r")))]
+  "TARGET_ZBB || TARGET_ZBKB"
+{
+  if (TARGET_64BIT)
+{
+  rtx t = gen_reg_rtx (DImode);
+  emit_insn (gen_rotlsi3_sext (t, operands[1], operands[2]));
+  t = gen_lowpart (SImode, t);
+  SUBREG_PROMOTED_VAR_P (t) = 1;
+  SUBREG_PROMOTED_SET (t, SRP_SIGNED);
+  emit_move_insn (operands[0], t);
+  DONE;
+}
+})
+
 (define_insn "rotldi3"
   [(set (match_operand:DI 0 "register_operand" "=r")
 	(rotate:DI (match_operand:DI 1 "register_operand" "r")
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 36419c95bbd8eebcb499ae0e02ca7aafde6c879f..de16ffd607e8e004e9b98ee9e25e4f3693818762 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -61,6 +61,7 @@ extern const char *riscv_output_return ();
 extern void riscv_expand_int_scc (rtx, enum rtx_code, rtx, rtx);
 extern void riscv_expand_float_scc (rtx, enum rtx_code, rtx, rtx);
 extern void riscv_expand_conditional_branch (rtx, enum rtx_code, rtx, rtx);
+extern rtx riscv_emit_binary (enum rtx_code code, rtx dest, rtx x, rtx y);
 #endif
 extern bool riscv_expand_conditional_move (rtx, rtx, rtx, rtx);
 extern rtx 

Re: [PATCH] RISC-V: Add missing torture-init and torture-finish for rvv.exp

2023-05-24 Thread Palmer Dabbelt

On Wed, 24 May 2023 16:12:20 PDT (-0700), Vineet Gupta wrote:



On 5/24/23 15:13, Vineet Gupta wrote:


PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects  (test for excess errors)
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects   scan-assembler-times mul\t 1
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects   scan-assembler-not div\t
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-objects   scan-assembler-not rem\t
testcase
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp
completed in 60 seconds
Running
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp
...
ERROR: tcl error sourcing
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp.
ERROR: tcl error code NONE
ERROR: torture-init: torture_without_loops is not empty as expected
    while executing
"error "torture-init: torture_without_loops is not empty as expected""
    invoked from within
"if [info exists torture_without_loops] {
    error "torture-init: torture_without_loops is not empty as expected"
    }"
    (procedure "torture-init" line 4)
    invoked from within
"torture-init"
    (file
"/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"
line 42)
    invoked from within
"source
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"
    ("uplevel" body line 1)
    invoked from within
"uplevel #0 source
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"
    invoked from within
"catch "uplevel #0 source $test_file_name" msg"
UNRESOLVED: testcase
'/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp'
aborted due to Tcl error
testcase
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp
completed in 0 seconds
Running
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/rl78/rl78.exp
...
...



Never mind. Looks like I found the issue - with just trial and error and
no idea of how this stuff works.
The torture-{init,finish} needs to be in riscv.exp not rvv.exp
Running full tests now.


Thanks!



-Vineet


Re: [PATCH] RISC-V: Add missing torture-init and torture-finish for rvv.exp

2023-05-24 Thread Vineet Gupta




On 5/24/23 15:13, Vineet Gupta wrote:


PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  (test for excess errors)
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-times mul\t 1
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-not div\t
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-not rem\t
testcase 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp 
completed in 60 seconds
Running 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp 
...
ERROR: tcl error sourcing 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp.

ERROR: tcl error code NONE
ERROR: torture-init: torture_without_loops is not empty as expected
    while executing
"error "torture-init: torture_without_loops is not empty as expected""
    invoked from within
"if [info exists torture_without_loops] {
    error "torture-init: torture_without_loops is not empty as expected"
    }"
    (procedure "torture-init" line 4)
    invoked from within
"torture-init"
    (file 
"/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp" 
line 42)

    invoked from within
"source 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"

    ("uplevel" body line 1)
    invoked from within
"uplevel #0 source 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"

    invoked from within
"catch "uplevel #0 source $test_file_name" msg"
UNRESOLVED: testcase 
'/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp' 
aborted due to Tcl error
testcase 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp 
completed in 0 seconds
Running 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/rl78/rl78.exp 
...

...



Never mind. Looks like I found the issue - with just trial and error and 
no idea of how this stuff works.

The torture-{init,finish} needs to be in riscv.exp not rvv.exp
Running full tests now.

-Vineet


Re: [PATCH] RISC-V: Add missing torture-init and torture-finish for rvv.exp

2023-05-24 Thread Vineet Gupta

On 5/24/23 13:34, Thomas Schwinge wrote:

Yeah, at this point I'm not sure whether my recent changes really are
related/relevant here.


Apparently in addition to Kito's patch below, if I comment out the
additional torture options, failures go down drastically.

Meaning that *all* those ERRORs disappear?


No, but they were reduced significantly.  Anyhow, I think the issue should be
simple enough for someone familiar with how the Tcl stuff works...





diff --git a/gcc/testsuite/gcc.target/riscv/riscv.exp
b/gcc/testsuite/gcc.target/riscv/riscv.exp

-lappend ADDITIONAL_TORTURE_OPTIONS {-Og -g} {-Oz}
+#lappend ADDITIONAL_TORTURE_OPTIONS {-Og -g} {-Oz}

@Thomas, do you have some thoughts on how to fix riscv.exp properly in
light of recent changes to exp files.

I'm trying to understand this, but so far don't.  Can I please see a
complete 'gcc.log' file where the ERRORs are visible?


So we are at bleeding edge gcc from today
 2023-05-24 ec2e86274427 Fortran: reject bad DIM argument of SIZE 
intrinsic in simplification [PR104350]


With an additional fix from Kito along the lines of..

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp 
b/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp


 dg-init
+torture-init

 # All done.
+torture-finish
 dg-finish

I'm pasting a snippet of gcc.log. Issue is indeed triggered by rvv.exp 
which needs some love.



PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects  (test for excess errors)
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-times mul\t 1
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-not div\t
PASS: gcc.target/riscv/zmmul-2.c   -O2 -flto -fuse-linker-plugin 
-fno-fat-lto-objects   scan-assembler-not rem\t
testcase 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/riscv.exp 
completed in 60 seconds
Running 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp 
...
ERROR: tcl error sourcing 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp.

ERROR: tcl error code NONE
ERROR: torture-init: torture_without_loops is not empty as expected
    while executing
"error "torture-init: torture_without_loops is not empty as expected""
    invoked from within
"if [info exists torture_without_loops] {
    error "torture-init: torture_without_loops is not empty as expected"
    }"
    (procedure "torture-init" line 4)
    invoked from within
"torture-init"
    (file 
"/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp" 
line 42)

    invoked from within
"source 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"

    ("uplevel" body line 1)
    invoked from within
"uplevel #0 source 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp"

    invoked from within
"catch "uplevel #0 source $test_file_name" msg"
UNRESOLVED: testcase 
'/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp' 
aborted due to Tcl error
testcase 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp 
completed in 0 seconds
Running 
/scratch/vineetg/gnu/toolchain-upstream/gcc/gcc/testsuite/gcc.target/rl78/rl78.exp 
...

...



[COMMITTED 2/4] - Make ssa_cache a range_query.

2023-05-24 Thread Andrew MacLeod via Gcc-patches
By having an ssa_cache inherit from a range_query, and then providing a 
range_of_expr routine which returns the current global value, we open up 
the possibility of folding statements and doing other interesting things 
with an ssa-cache.


In particular, you can now call fold_range () with an ssa_cache
and fold a stmt by retrieving the values which are stored in the cache.
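A small usage sketch (names taken from the patch below; illustrative only):
consulting the values recorded in an ssa_cache through the standard
range_query interface.

void
dump_cached_range (ssa_cache &cache, tree name, gimple *stmt)
{
  int_range_max r;
  if (cache.range_of_expr (r, name, stmt))   /* falls back to the global range */
    r.dump (stderr);
}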


This patch also provides the ranger object with a const_query () method
which allows access to the current global ranges ranger knows, for
folding.  There are times where we use get_global_range_query (), but
we'd actually get more accurate results if we have a ranger and use
const_query ().  const_query should be a superset of what
get_global_range_query knows.


There is 0 performance impact.

Bootstraps on x86_64-pc-linux-gnu  with no regressions.  Pushed.

Andrew


From be6e6b93cc5d42a09a1f2be26dfdf7e3f897d296 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Wed, 24 May 2023 09:06:26 -0400
Subject: [PATCH 2/4] Make ssa_cache a range_query.

By providing range_of_expr as a range_query, we can fold and do other
interesting things using values from the global table.  Make ranger's
known globals available via const_query.

	* gimple-range-cache.cc (ssa_cache::range_of_expr): New.
	* gimple-range-cache.h (class ssa_cache): Inherit from range_query.
	(ranger_cache::const_query): New.
	* gimple-range.cc (gimple_ranger::const_query): New.
	* gimple-range.h (gimple_ranger::const_query): New prototype.
---
 gcc/gimple-range-cache.cc | 14 ++
 gcc/gimple-range-cache.h  |  5 -
 gcc/gimple-range.cc   |  8 
 gcc/gimple-range.h|  1 +
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index f25abaffd34..52165d2405b 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -545,6 +545,20 @@ ssa_cache::~ssa_cache ()
   delete m_range_allocator;
 }
 
+// Enable a query to evaluate staements/ramnges based on picking up ranges
+// from just an ssa-cache.
+
+bool
+ssa_cache::range_of_expr (vrange &r, tree expr, gimple *stmt)
+{
+  if (!gimple_range_ssa_p (expr))
+return get_tree_range (r, expr, stmt);
+
+  if (!get_range (r, expr))
+gimple_range_global (r, expr, cfun);
+  return true;
+}
+
 // Return TRUE if the global range of NAME has a cache entry.
 
 bool
diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index 4fc98230430..afcf8d7de7b 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -52,7 +52,7 @@ private:
 // has been visited during this incarnation.  Once the ranger evaluates
 // a name, it is typically not re-evaluated again.
 
-class ssa_cache
+class ssa_cache : public range_query
 {
 public:
   ssa_cache ();
@@ -63,6 +63,8 @@ public:
   virtual void clear_range (tree name);
   virtual void clear ();
   void dump (FILE *f = stderr);
+  virtual bool range_of_expr (vrange &r, tree expr, gimple *stmt);
+
 protected:
   vec<vrange *> m_tab;
   vrange_allocator *m_range_allocator;
@@ -103,6 +105,7 @@ public:
   bool get_global_range (vrange &r, tree name) const;
   bool get_global_range (vrange &r, tree name, bool &current_p);
   void set_global_range (tree name, const vrange &r, bool changed = true);
+  range_query &const_query () { return m_globals; }
 
   void propagate_updated_value (tree name, basic_block bb);
 
diff --git a/gcc/gimple-range.cc b/gcc/gimple-range.cc
index 4fae3f95e6a..01e62d3ff39 100644
--- a/gcc/gimple-range.cc
+++ b/gcc/gimple-range.cc
@@ -70,6 +70,14 @@ gimple_ranger::~gimple_ranger ()
   m_stmt_list.release ();
 }
 
+// Return a range_query which accesses just the known global values.
+
+range_query &
+gimple_ranger::const_query ()
+{
+  return m_cache.const_query ();
+}
+
 bool
 gimple_ranger::range_of_expr (vrange &r, tree expr, gimple *stmt)
 {
diff --git a/gcc/gimple-range.h b/gcc/gimple-range.h
index e3aa9475f5e..6587e4923ff 100644
--- a/gcc/gimple-range.h
+++ b/gcc/gimple-range.h
@@ -64,6 +64,7 @@ public:
   bool fold_stmt (gimple_stmt_iterator *gsi, tree (*) (tree));
   void register_inferred_ranges (gimple *s);
   void register_transitive_inferred_ranges (basic_block bb);
+  range_query &const_query ();
 protected:
   bool fold_range_internal (vrange , gimple *s, tree name);
   void prefill_name (vrange , tree name);
-- 
2.40.1



[COMMITTED 4/4] - Gimple range PHI analyzer and testcases

2023-05-24 Thread Andrew MacLeod via Gcc-patches

This patch provides the framework for a gimple-range phi analyzer.

Currently, the primary purpose is to give better initial values for 
members of a "PHI group".


A PHI group is defined as a group of PHI nodes whose arguments are all 
either members of the same PHI group, or one of 2 other values:

 - An initializer (typically, but not necessarily, a constant),
 - A modifier, which is always of the form:  member_ssa = member_ssa OP op2


When the analyzer finds a group which matches this pattern, it tries to 
evaluate the modifier using the initial value and project a range for 
the entire group.


This initial version is fairly simplistic.  It looks for 2 things:

1) If there is a relation between the LHS and the other ssa_name in the 
modifier, then we can project a range, i.e.,

    a_3 = a_2 + 1
If there is a relation generated by the stmt which says a_3 > a_2, and 
the initial value is 0, we can project a range of [0, +INF] as the 
modifier will cause the value to always increase, and not wrap.


Likewise, for a_3 = a_2 - 1,  we can project a range of [-INF, 0] based 
on the "<" relationship between a_3 and a_2.
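
For illustration (my sketch of the shape being described, not taken from
the new testcases), such a group in gimple form would look roughly like:

    # a_2 = PHI <0(2), a_3(3)>    ;; group member, initializer 0
    ...
    a_3 = a_2 + 1;                ;; modifier: a_3 > a_2, so project [0, +INF]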


2) If there is no relationship, then we use the initial range and 
"simulate" the modifier statement a set number of times looking to see 
if the value converges.
Currently I have arbitrarily hard coded 10 attempts, but intend to 
change this down the road with a --param, as well as to perhaps 
influence it with any known values from SCEV regarding known iterations 
of the loop and possibly change it based on optimization levels.


I also suspect something like one more than the number of bits in the 
type might help with any bitmasking tricks.


There's a lot of additional things we can do to enhance this, but this 
framework provides a start.  These 2 initial evaluations fix PR 107822, and 
part of PR 107986.


There is about a 1.5% slowdown to VRP to invoke and utilize the 
analyzer in all 3 passes of VRP.  Overall compile time is 0.06% slower.


Bootstraps on x86_64-pc-linux-gnu  with no regressions.  Pushed.

Andrew




From 64e844c1182198e49d33f9fa138b9a782371225d Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Wed, 24 May 2023 09:52:26 -0400
Subject: [PATCH 4/4] Gimple range PHI analyzer and testcases

Provide a PHI analyzer framework to provide better initial values for
PHI nodes which form groups with initial values and single statements
which modify the PHI values in some predictable way.

	PR tree-optimization/107822
	PR tree-optimization/107986
	gcc/
	* Makefile.in (OBJS): Add gimple-range-phi.o.
	* gimple-range-cache.h (ranger_cache::m_estimate): New
	phi_analyzer pointer member.
	* gimple-range-fold.cc (fold_using_range::range_of_phi): Use
	phi_analyzer if no loop info is available.
	* gimple-range-phi.cc: New file.
	* gimple-range-phi.h: New file.
	* tree-vrp.cc (execute_ranger_vrp): Utilize a phi_analyzer.

	gcc/testsuite/
	* gcc.dg/pr107822.c: New.
	* gcc.dg/pr107986-1.c: New.
---
 gcc/Makefile.in   |   1 +
 gcc/gimple-range-cache.h  |   2 +
 gcc/gimple-range-fold.cc  |  27 ++
 gcc/gimple-range-phi.cc   | 518 ++
 gcc/gimple-range-phi.h| 109 +++
 gcc/testsuite/gcc.dg/pr107822.c   |  20 ++
 gcc/testsuite/gcc.dg/pr107986-1.c |  16 +
 gcc/tree-vrp.cc   |   7 +-
 8 files changed, 699 insertions(+), 1 deletion(-)
 create mode 100644 gcc/gimple-range-phi.cc
 create mode 100644 gcc/gimple-range-phi.h
 create mode 100644 gcc/testsuite/gcc.dg/pr107822.c
 create mode 100644 gcc/testsuite/gcc.dg/pr107986-1.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index bb63b5c501d..1d39e6dd3f8 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1454,6 +1454,7 @@ OBJS = \
 	gimple-range-gori.o \
 	gimple-range-infer.o \
 	gimple-range-op.o \
+	gimple-range-phi.o \
 	gimple-range-trace.o \
 	gimple-ssa-backprop.o \
 	gimple-ssa-isolate-paths.o \
diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index afcf8d7de7b..93d16294d2e 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -23,6 +23,7 @@ along with GCC; see the file COPYING3.  If not see
 
 #include "gimple-range-gori.h" 
 #include "gimple-range-infer.h"
+#include "gimple-range-phi.h"
 
 // This class manages a vector of pointers to ssa_block ranges.  It
 // provides the basis for the "range on entry" cache for all
@@ -136,6 +137,7 @@ private:
   void exit_range (vrange , tree expr, basic_block bb, enum rfd_mode);
   bool edge_range (vrange , edge e, tree name, enum rfd_mode);
 
+  phi_analyzer *m_estimate;
   vec<basic_block> m_workback;
   class update_list *m_update;
 };
diff --git a/gcc/gimple-range-fold.cc b/gcc/gimple-range-fold.cc
index 4df065c8a6e..173d9f386c5 100644
--- a/gcc/gimple-range-fold.cc
+++ b/gcc/gimple-range-fold.cc
@@ -934,6 +934,7 @@ fold_using_range::range_of_phi (vrange &r, gphi *phi, fur_source &src)
 	  }
   }
 
+  bool loop_info_p = false;
   // If SCEV is available, query if this PHI has any 

[COMMITTED 3/4] Provide relation queries for a stmt.

2023-05-24 Thread Andrew MacLeod via Gcc-patches
This tweaks some of the fold_stmt routines and helpers, in particular 
the ones to which you provide a vector of ranges to satisfy any ssa-names.


Previously, once the vector was depleted, any remaining values were 
picked up from the default get_global_range_query() query.  It is useful 
to be able to specify your own range_query to these routines, as most 
of the other fold_stmt routines allow.


This patch changes it so the default doesn't change, but you can 
optionally specify your own range_query to the routines.


It also provides a new routine:

    relation_trio fold_relations (gimple *s, range_query *q)

which, instead of folding a stmt, will return a relation trio based on 
folding the stmt with the range_query.  The relation trio will let you 
know if the statement causes a relation between LHS-OP1, LHS-OP2, or 
OP1-OP2... so for something like

   a_3 = b_4 + 6
based on known ranges and types, we might get back (LHS > OP1).

It just provides  a generic interface into what relations a statement 
may provide based on what a range_query returns for values and the stmt 
itself.
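
A rough usage sketch (mine, with assumed variables "stmt" and "ranger";
not from the patch):

  // Sketch: ask what relations STMT implies, given ranger's known values.
  relation_trio trio = fold_relations (stmt, &ranger.const_query ());
  if (trio.lhs_op1 () == VREL_GT)
    {
      // e.g. for a_3 = b_4 + 6 with suitable ranges, the LHS is known > OP1.
    }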


There is no performance impact.

Bootstraps on x86_64-pc-linux-gnu  with no regressions.  Pushed.

Andrew


From 933e14dc613269641ffe3613bf4792ac50590275 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Wed, 24 May 2023 09:17:32 -0400
Subject: [PATCH 3/4] Provide relation queries for a stmt.

Allow fur_list and fold_stmt to be provided a range_query rather than
always defaulting to NULL (which becomes a global query).
Also provide a fold_relations () routine which can provide a relation_trio
for an arbitrary statement using any range_query.

	* gimple-range-fold.cc (fur_list::fur_list): Add range_query param
	to constructors.
	(fold_range): Add range_query parameter.
	(fur_relation::fur_relation): New.
	(fur_relation::trio): New.
	(fur_relation::register_relation): New.
	(fold_relations): New.
	* gimple-range-fold.h (fold_range): Adjust prototypes.
	(fold_relations): New.
---
 gcc/gimple-range-fold.cc | 128 +++
 gcc/gimple-range-fold.h  |  11 +++-
 2 files changed, 124 insertions(+), 15 deletions(-)

diff --git a/gcc/gimple-range-fold.cc b/gcc/gimple-range-fold.cc
index 96cbd799488..4df065c8a6e 100644
--- a/gcc/gimple-range-fold.cc
+++ b/gcc/gimple-range-fold.cc
@@ -214,9 +214,9 @@ fur_depend::register_relation (edge e, relation_kind k, tree op1, tree op2)
 class fur_list : public fur_source
 {
 public:
-  fur_list (vrange &r);
-  fur_list (vrange &r1, vrange &r2);
-  fur_list (unsigned num, vrange **list);
+  fur_list (vrange &r, range_query *q = NULL);
+  fur_list (vrange &r1, vrange &r2, range_query *q = NULL);
+  fur_list (unsigned num, vrange **list, range_query *q = NULL);
   virtual bool get_operand (vrange &r, tree expr) override;
   virtual bool get_phi_operand (vrange &r, tree expr, edge e) override;
 private:
@@ -228,7 +228,7 @@ private:
 
 // One range supplied for unary operations.
 
-fur_list::fur_list (vrange &r) : fur_source (NULL)
+fur_list::fur_list (vrange &r, range_query *q) : fur_source (q)
 {
   m_list = m_local;
   m_index = 0;
@@ -238,7 +238,7 @@ fur_list::fur_list (vrange &r) : fur_source (NULL)
 
 // Two ranges supplied for binary operations.
 
-fur_list::fur_list (vrange &r1, vrange &r2) : fur_source (NULL)
+fur_list::fur_list (vrange &r1, vrange &r2, range_query *q) : fur_source (q)
 {
   m_list = m_local;
   m_index = 0;
@@ -249,7 +249,8 @@ fur_list::fur_list (vrange &r1, vrange &r2) : fur_source (NULL)
 
 // Arbitrary number of ranges in a vector.
 
-fur_list::fur_list (unsigned num, vrange **list) : fur_source (NULL)
+fur_list::fur_list (unsigned num, vrange **list, range_query *q)
+  : fur_source (q)
 {
   m_list = list;
   m_index = 0;
@@ -278,20 +279,20 @@ fur_list::get_phi_operand (vrange &r, tree expr, edge e ATTRIBUTE_UNUSED)
 // Fold stmt S into range R using R1 as the first operand.
 
 bool
-fold_range (vrange &r, gimple *s, vrange &r1)
+fold_range (vrange &r, gimple *s, vrange &r1, range_query *q)
 {
   fold_using_range f;
-  fur_list src (r1);
+  fur_list src (r1, q);
   return f.fold_stmt (r, s, src);
 }
 
 // Fold stmt S into range R using R1  and R2 as the first two operands.
 
 bool
-fold_range (vrange &r, gimple *s, vrange &r1, vrange &r2)
+fold_range (vrange &r, gimple *s, vrange &r1, vrange &r2, range_query *q)
 {
   fold_using_range f;
-  fur_list src (r1, r2);
+  fur_list src (r1, r2, q);
   return f.fold_stmt (r, s, src);
 }
 
@@ -299,10 +300,11 @@ fold_range (vrange &r, gimple *s, vrange &r1, vrange &r2)
 // operands encountered.
 
 bool
-fold_range (vrange &r, gimple *s, unsigned num_elements, vrange **vector)
+fold_range (vrange &r, gimple *s, unsigned num_elements, vrange **vector,
+	    range_query *q)
 {
   fold_using_range f;
-  fur_list src (num_elements, vector);
+  fur_list src (num_elements, vector, q);
   return f.fold_stmt (r, s, src);
 }
 
@@ -326,6 +328,108 @@ fold_range (vrange &r, gimple *s, edge on_edge, range_query *q)
   return f.fold_stmt (r, s, src);
 }
 
+// Provide a fur_source which can be used 

[COMMITTED 1/4] - Make ssa_cache and ssa_lazy_cache virtual.

2023-05-24 Thread Andrew MacLeod via Gcc-patches
I originally implemented the lazy ssa cache by inheriting from an 
ssa_cache in protected mode and providing the required routines. This 
makes it a little awkward to do various things, and they also become not 
quite as interchangeable as I'd like.   Making the routines virtual and 
using proper inheritance will avoid an inevitable issue down the road, 
and allows me to remove the printing hack which provided a protected 
output routine.
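
To illustrate what that buys (a sketch of intended usage, not code from
the patch), either cache can now be handed to code that only needs the
base class interface:

  // Sketch: callers that only need the ssa_cache interface no longer care
  // which flavour they are handed, once the routines are virtual and the
  // inheritance is public.
  void
  note_range (ssa_cache &cache, tree name, const vrange &r)
  {
    cache.set_range (name, r);   // dispatches to ssa_cache or ssa_lazy_cache
  }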


Overall performance impact is pretty negligible, so let's just clean it up.

Bootstraps on x86_64-pc-linux-gnu  with no regressions.  Pushed.

Andrew

From 3079056d0b779b907f8adc01d48a8aa495b8a661 Mon Sep 17 00:00:00 2001
From: Andrew MacLeod 
Date: Wed, 24 May 2023 08:49:30 -0400
Subject: [PATCH 1/4] Make ssa_cache and ssa_lazy_cache virtual.

Making them virtual allows us to interchangeably use the caches.

	* gimple-range-cache.cc (ssa_cache::dump): Use get_range.
	(ssa_cache::dump_range_query): Delete.
	(ssa_lazy_cache::dump_range_query): Delete.
	(ssa_lazy_cache::get_range): Move from header file.
	(ssa_lazy_cache::clear_range): ditto.
	(ssa_lazy_cache::clear): Ditto.
	* gimple-range-cache.h (class ssa_cache): Virtualize.
	(class ssa_lazy_cache): Inherit and virtualize.
---
 gcc/gimple-range-cache.cc | 43 +++
 gcc/gimple-range-cache.h  | 37 ++---
 2 files changed, 41 insertions(+), 39 deletions(-)

diff --git a/gcc/gimple-range-cache.cc b/gcc/gimple-range-cache.cc
index e069241bc9d..f25abaffd34 100644
--- a/gcc/gimple-range-cache.cc
+++ b/gcc/gimple-range-cache.cc
@@ -626,7 +626,7 @@ ssa_cache::dump (FILE *f)
   // Invoke dump_range_query which is a private virtual version of
   // get_range.   This avoids performance impacts on general queries,
   // but allows sharing of the dump routine.
-  if (dump_range_query (r, ssa_name (x)) && !r.varying_p ())
+  if (get_range (r, ssa_name (x)) && !r.varying_p ())
 	{
 	  if (print_header)
 	{
@@ -648,23 +648,14 @@ ssa_cache::dump (FILE *f)
 fputc ('\n', f);
 }
 
-// Virtual private get_range query for dumping.
+// Return true if NAME has an active range in the cache.
 
 bool
-ssa_cache::dump_range_query (vrange &r, tree name) const
+ssa_lazy_cache::has_range (tree name) const
 {
-  return get_range (r, name);
+  return bitmap_bit_p (active_p, SSA_NAME_VERSION (name));
 }
 
-// Virtual private get_range query for dumping.
-
-bool
-ssa_lazy_cache::dump_range_query (vrange &r, tree name) const
-{
-  return get_range (r, name);
-}
-
-
 // Set range of NAME to R in a lazy cache.  Return FALSE if it did not already
 // have a range.
 
@@ -684,6 +675,32 @@ ssa_lazy_cache::set_range (tree name, const vrange &r)
   return false;
 }
 
+// Return TRUE if NAME has a range, and return it in R.
+
+bool
+ssa_lazy_cache::get_range (vrange &r, tree name) const
+{
+  if (!bitmap_bit_p (active_p, SSA_NAME_VERSION (name)))
+    return false;
+  return ssa_cache::get_range (r, name);
+}
+
+// Remove NAME from the active range list.
+
+void
+ssa_lazy_cache::clear_range (tree name)
+{
+  bitmap_clear_bit (active_p, SSA_NAME_VERSION (name));
+}
+
+// Remove all ranges from the active range list.
+
+void
+ssa_lazy_cache::clear ()
+{
+  bitmap_clear (active_p);
+}
+
 // --
 
 
diff --git a/gcc/gimple-range-cache.h b/gcc/gimple-range-cache.h
index 871255a8116..4fc98230430 100644
--- a/gcc/gimple-range-cache.h
+++ b/gcc/gimple-range-cache.h
@@ -57,14 +57,13 @@ class ssa_cache
 public:
   ssa_cache ();
   ~ssa_cache ();
-  bool has_range (tree name) const;
-  bool get_range (vrange &r, tree name) const;
-  bool set_range (tree name, const vrange &r);
-  void clear_range (tree name);
-  void clear ();
+  virtual bool has_range (tree name) const;
+  virtual bool get_range (vrange &r, tree name) const;
+  virtual bool set_range (tree name, const vrange &r);
+  virtual void clear_range (tree name);
+  virtual void clear ();
   void dump (FILE *f = stderr);
 protected:
-  virtual bool dump_range_query (vrange &r, tree name) const;
   vec<vrange *> m_tab;
   vrange_allocator *m_range_allocator;
 };
@@ -72,35 +71,21 @@ protected:
 // This is the same as global cache, except it maintains an active bitmap
 // rather than depending on a zero'd out vector of pointers.  This is better
 // for sparsely/lightly used caches.
-// It could be made a fully derived class, but at this point there doesnt seem
-// to be a need to take the performance hit for it.
 
-class ssa_lazy_cache : protected ssa_cache
+class ssa_lazy_cache : public ssa_cache
 {
 public:
   inline ssa_lazy_cache () { active_p = BITMAP_ALLOC (NULL); }
   inline ~ssa_lazy_cache () { BITMAP_FREE (active_p); }
-  bool set_range (tree name, const vrange &r);
-  inline bool get_range (vrange &r, tree name) const;
-  inline void clear_range (tree name)
-    { bitmap_clear_bit (active_p, SSA_NAME_VERSION (name)); } ;
-  inline void clear () { bitmap_clear (active_p); }
-  inline 

Re: [PATCH] testsuite, analyzer: Fix testcases with fclose

2023-05-24 Thread David Malcolm via Gcc-patches
On Tue, 2023-05-23 at 09:34 +, Christophe Lyon wrote:
> The gcc.dg/analyzer/data-model-4.c and
> gcc.dg/analyzer/torture/conftest-1.c fail with recent glibc headers
> and succeed with older headers.
> 
> The new error message is:
> warning: use of possibly-NULL 'f' where non-null expected [CWE-690]
> [-Wanalyzer-possible-null-argument]
> 
> Like similar previous fixes in this area, this patch updates the
> testcase so that this warning isn't reported.

LGTM

Thanks
Dave

> 
> 2023-05-23  Christophe Lyon  
> 
> gcc/testsuite/
> * gcc.dg/analyzer/data-model-4.c: Exit if fopen returns NULL.
> * gcc.dg/analyzer/torture/conftest-1.c: Likewise.
> ---
>  gcc/testsuite/gcc.dg/analyzer/data-model-4.c   | 2 ++
>  gcc/testsuite/gcc.dg/analyzer/torture/conftest-1.c | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/gcc/testsuite/gcc.dg/analyzer/data-model-4.c
> b/gcc/testsuite/gcc.dg/analyzer/data-model-4.c
> index 33f90871dfb..d41868d6dbc 100644
> --- a/gcc/testsuite/gcc.dg/analyzer/data-model-4.c
> +++ b/gcc/testsuite/gcc.dg/analyzer/data-model-4.c
> @@ -8,6 +8,8 @@ int
>  main ()
>  {
>    FILE *f = fopen ("conftest.out", "w");
> +  if (f == NULL)
> +    return 1;
>    return ferror (f) || fclose (f) != 0;
>  
>    ;
> diff --git a/gcc/testsuite/gcc.dg/analyzer/torture/conftest-1.c
> b/gcc/testsuite/gcc.dg/analyzer/torture/conftest-1.c
> index 0cf85f0ebe1..9631bcf73e0 100644
> --- a/gcc/testsuite/gcc.dg/analyzer/torture/conftest-1.c
> +++ b/gcc/testsuite/gcc.dg/analyzer/torture/conftest-1.c
> @@ -3,6 +3,8 @@ int
>  main ()
>  {
>    FILE *f = fopen ("conftest.out", "w");
> +  if (f == NULL)
> +    return 1;
>    return ferror (f) || fclose (f) != 0;
>  
>    ;



Re: [PATCH] RISC-V: Add missing torture-init and torture-finish for rvv.exp

2023-05-24 Thread Thomas Schwinge via Gcc-patches
Hi!

On 2023-05-24T11:18:35-0700, Vineet Gupta  wrote:
> On 5/22/23 20:52, Vineet Gupta wrote:
>> On 5/22/23 02:17, Kito Cheng wrote:
>>> Ooops, seems still some issue around here,
>>
>> Yep still 5000 fails :-(
>>
>>>   but I found something might
>>> related this issue:
>>>
>>> https://github.com/gcc-mirror/gcc/commit/d6654a4be3ba44c0d57be7c8a51d76d9721345e1
>>>  
>>>
>>> https://github.com/gcc-mirror/gcc/commit/23c49bb8d09bc3bfce9a08be637cf32ac014de56
>>>  
>>>
>>
>> It seems both of these patches are essentially doing what yours did. 
>> So something else is amiss still.

Yeah, at this point I'm not sure whether my recent changes really are
related/relevant here.

> Apparently in addition to Kito's patch below, If I comment out the 
> additional torture options, failures go down drastically.

Meaning that *all* those ERRORs disappear?

> diff --git a/gcc/testsuite/gcc.target/riscv/riscv.exp 
> b/gcc/testsuite/gcc.target/riscv/riscv.exp
>
> -lappend ADDITIONAL_TORTURE_OPTIONS {-Og -g} {-Oz}
> +#lappend ADDITIONAL_TORTURE_OPTIONS {-Og -g} {-Oz}
>
> @Thomas, do you have some thoughts on how to fix riscv.exp properly in 
> light of recent changes to exp files.

I'm trying to understand this, but so far don't.  Can I please see a
complete 'gcc.log' file where the ERRORs are visible?


Regards
 Thomas


>>> On Mon, May 22, 2023 at 2:42 PM Kito Cheng  
>>> wrote:
 Hi Vineet:

 Could you help to test this patch, this could resolve that issue on our
 machine, but I would like to also work for other env.

 Thanks :)

 ---

 We got bunch of following error message for multi-lib run:

 ERROR: torture-init: torture_without_loops is not empty as expected
 ERROR: tcl error code NONE

 And seems we need torture-init and torture-finish around the test
 loop.

 gcc/testsuite/ChangeLog:

  * gcc.target/riscv/rvv/rvv.exp: Add torture-init and
  torture-finish.
 ---
   gcc/testsuite/gcc.target/riscv/rvv/rvv.exp | 3 +++
   1 file changed, 3 insertions(+)

 diff --git a/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp 
 b/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp
 index bc99cc0c3cf4..19179564361a 100644
 --- a/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp
 +++ b/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp
 @@ -39,6 +39,7 @@ if [istarget riscv32-*-*] then {

   # Initialize `dg'.
   dg-init
 +torture-init

   # Main loop.
   set CFLAGS "$DEFAULT_CFLAGS -march=$gcc_march -mabi=$gcc_mabi -O3"
 @@ -69,5 +70,7 @@ foreach op $AUTOVEC_TEST_OPTS {
   dg-runtest [lsort [glob -nocomplain 
 $srcdir/$subdir/autovec/vls-vlmax/*.\[cS\]]] \
  "-std=c99 -O3 -ftree-vectorize --param 
 riscv-autovec-preference=fixed-vlmax" $CFLAGS

 +torture-finish
 +
   # All done.
   dg-finish
 -- 
 2.40.1

>>


Re: [PATCH] LoongArch: Fix the problem of structure parameter passing in C++. This structure has empty structure members and less than three floating point members.

2023-05-24 Thread Jason Merrill via Gcc-patches
On Wed, May 24, 2023 at 5:00 AM Jonathan Wakely via Gcc-patches <
gcc-patches@gcc.gnu.org> wrote:

> On Wed, 24 May 2023 at 09:41, Xi Ruoyao  wrote:
>
> > Wang Lei raised some concerns about Itanium C++ ABI, so let's ask a C++
> > expert here...
> >
> > Jonathan: AFAIK the standard and the Itanium ABI treats an empty class
> > as size 1
>
> Only as a complete object, not as a subobject.
>

Also as a data member subobject.


> > in order to guarantee unique address, so for the following:
> >
> > class Empty {};
> > class Test { Empty empty; double a, b; };
>
> There is no need to have a unique address here, so Test::empty and Test::a
> have the same address. It's a potentially-overlapping subobject.
>
> For the Itanium ABI, sizeof(Test) == 2 * sizeof(double).
>

That would be true if Test::empty were marked [[no_unique_address]], but
without that attribute, sizeof(Test) is actually 3 * sizeof(double).
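
A small illustration of the layout being discussed (my sketch, assuming the
Itanium C++ ABI layout; "Test3" is a made-up name so it does not clash with
Test2 below):

  struct Empty {};
  struct Test  { Empty empty; double a, b; };
  struct Test3 { [[no_unique_address]] Empty empty; double a, b; };

  static_assert (sizeof (Test)  == 3 * sizeof (double), "");
  static_assert (sizeof (Test3) == 2 * sizeof (double), "");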


> > When we pass "Test" via registers, we may only allocate the registers
> > for Test::a and Test::b, and complete ignore Test::empty because there
> > is no addresses of registers.  Is this correct or not?
>
> I think that's a decision for the loongarch psABI. In principle, there's no
> reason a register has to be used to pass Test::empty, since you can't read
> from it or write to it.
>

Agreed.  The Itanium C++ ABI has nothing to say about how registers are
allocated for parameter passing; this is a matter for the psABI.

And there is no need for a psABI to allocate a register for Test::empty
because it contains no data.

In the x86_64 psABI, Test above is passed in memory because of its size
("the size of the aggregate exceeds two eightbytes...").  But

struct Test2 { Empty empty; double a; };

is passed in a single floating-point register; the Test2::empty subobject
is not passed anywhere, because its eightbyte is classified as NO_CLASS,
because there is no actual data there.

I know nothing about the LoongArch psABI, but going out of your way to
assign a register to an empty class seems like a mistake.

> On Wed, 2023-05-24 at 14:45 +0800, Xi Ruoyao via Gcc-patches wrote:
> > > On Wed, 2023-05-24 at 14:04 +0800, Lulu Cheng wrote:
> > > > An empty struct type that is not non-trivial for the purposes of
> > > > calls
> > > > will be treated as though it were the following C type:
> > > >
> > > > struct {
> > > >   char c;
> > > > };
> > > >
> > > > Before this patch was added, a structure parameter containing an
> > > > empty structure and
> > > > less than three floating-point members was passed through one or two
> > > > floating-point
> > > > registers, while nested empty structures are ignored. Which did not
> > > > conform to the
> > > > calling convention.
> > >
> > > No, it's a deliberate decision I've made in
> > > https://gcc.gnu.org/r12-8294.  And we already agreed "the ABI needs to
> > > be updated" when we applied r12-8294, but I've never improved my
> > > English
> > > skill to revise the ABI myself :(.
> > >
> > > We are also using the same "de-facto" ABI throwing away the empty
> > > struct
> > > for Clang++ (https://reviews.llvm.org/D132285).  So we should update
> > > the
> > > spec here, instead of changing every implementation.
> > >
> > > The C++ standard treats the empty struct as size 1 for ensuring the
> > > semantics of pointer comparison operations.  When we pass it through
> > > the
> > > registers, there is no need to really consider the empty field because
> > > there is no pointers to registers.
> > >
> >
> >
>
>


Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Sandiford via Gcc-patches
I'll look at the samples tomorrow, but just to address one thing:

钟居哲  writes:
>>> What gives the best code in these cases?  Is emitting a multiplication
>>> better?  Or is using a new IV better?
> Could you give me more detail information about "new refresh IV" approach.
> I'd like to try that.

By “using a new IV” I meant calling vect_set_loop_controls_directly
for every rgroup, not just the first.  So in the earlier example,
there would be one decrementing IV for x and one decrementing IV for y.

Thanks,
Richard




Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-24 Thread Richard Sandiford via Gcc-patches
Prathamesh Kulkarni  writes:
> On Wed, 24 May 2023 at 15:40, Richard Sandiford
>  wrote:
>>
>> Prathamesh Kulkarni  writes:
>> > On Mon, 22 May 2023 at 14:18, Richard Sandiford
>> >  wrote:
>> >>
>> >> Prathamesh Kulkarni  writes:
>> >> > Hi Richard,
>> >> > Thanks for the suggestions. Does the attached patch look OK ?
>> >> > Boostrap+test in progress on aarch64-linux-gnu.
>> >>
>> >> Like I say, please wait for the tests to complete before sending an RFA.
>> >> It saves a review cycle if the tests don't in fact pass.
>> > Right, sorry, will post patches after completion of testing henceforth.
>> >>
>> >> > diff --git a/gcc/config/aarch64/aarch64.cc 
>> >> > b/gcc/config/aarch64/aarch64.cc
>> >> > index 29dbacfa917..e611a7cca25 100644
>> >> > --- a/gcc/config/aarch64/aarch64.cc
>> >> > +++ b/gcc/config/aarch64/aarch64.cc
>> >> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, 
>> >> > rtx vals, bool even_p)
>> >> >return gen_rtx_PARALLEL (new_mode, vec);
>> >> >  }
>> >> >
>> >> > +/* Return true if INSN is a scalar move.  */
>> >> > +
>> >> > +static bool
>> >> > +scalar_move_insn_p (const rtx_insn *insn)
>> >> > +{
>> >> > +  rtx set = single_set (insn);
>> >> > +  if (!set)
>> >> > +return false;
>> >> > +  rtx src = SET_SRC (set);
>> >> > +  rtx dest = SET_DEST (set);
>> >> > +  return is_a<scalar_mode>(GET_MODE (dest))
>> >> > +  && aarch64_mov_operand_p (src, GET_MODE (src));
>> >>
>> >> Formatting:
>> >>
>> >>   return (is_a<scalar_mode>(GET_MODE (dest))
>> >>   && aarch64_mov_operand_p (src, GET_MODE (src)));
>> >>
>> >> OK with that change if the tests pass, thanks.
>> > Unfortunately, the patch regressed vec-init-21.c:
>> >
>> > int8x16_t f_s8(int8_t x, int8_t y)
>> > {
>> >   return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
>> >7, 8, 9, 10, 11, 12, 13, 14 };
>> > }
>> >
>> > -O3 code-gen trunk:
>> > f_s8:
> >> > adrp x2, .LC0
>> > ldr q0, [x2, #:lo12:.LC0]
>> > ins v0.b[0], w0
>> > ins v0.b[1], w1
>> > ret
>> >
>> > -O3 code-gen patch:
>> > f_s8:
> >> > adrp x2, .LC0
> >> > ldr d31, [x2, #:lo12:.LC0]
> >> > adrp x2, .LC1
> >> > ldr d0, [x2, #:lo12:.LC1]
> >> > ins v31.b[0], w0
> >> > ins v0.b[0], w1
> >> > zip1 v0.16b, v31.16b, v0.16b
>> > ret
>> >
>> > With trunk, it chooses the fallback sequence because both fallback
>> > and zip1 sequence had cost = 20, however with patch applied,
>> > we end up with zip1 sequence cost = 24 and fallback sequence
>> > cost = 28.
>> >
>> > This happens because of using insn_cost instead of
>> > set_rtx_cost for the following expression:
>> > (set (reg:QI 100)
>> > (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
>> > set_rtx_cost returns 0 for above expression but insn_cost returns 4.
>>
>> Yeah, was wondering why you'd dropped the set_rtx_cost thing,
>> but decided not to question it since using insn_cost seemed
>> reasonable if it worked.
> The attached patch uses set_rtx_cost for single_set and insn_cost
> otherwise for non debug insns similar to seq_cost.

FWIW, I think with the aarch64_mov_operand fix, the old way of using
insn_cost for everything would have worked too.  But either way is fine.

>> > This expression template appears twice in fallback sequence, which raises
>> > the cost to 28 from 20, while it appears once in each half of zip1 
>> > sequence,
>> > which raises the cost to 24 from 20, and so it now prefers zip1 sequence
>> > instead.
>> >
>> > I assumed this expression would be ignored because it looks like a scalar 
>> > move,
>> > but that doesn't seem to be the case ?
>> > aarch64_classify_symbolic_expression returns
>> > SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
>> > and thus aarch64_mov_operand_p returns false.
>>
>> Ah, I guess it should be aarch64_mov_operand instead.  Confusing that
>> they're so different...
> Thanks, using aarch64_mov_operand worked.
>>
>> > Another issue with the zip1 sequence above is using same register x2
>> > for loading another half of constant in:
> >> > adrp x2, .LC1
>> >
>> > I guess this will create an output dependency from adrp x2, .LC0 ->
>> > adrp x2, .LC1
>> > and anti-dependency from  ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1
>> > essentially forcing almost the entire sequence (except ins
>> > instructions) to execute sequentially ?
>>
>> I'd expect modern cores to handle that via renaming.
> Ah right, thanks for the clarification.
>
> For some reason, it seems git diff is not formatting the patch correctly :/
> Or perhaps I am doing something wrongly.

No, I think it's fine.  It's just tabs vs. spaces.  A leading
"+" followed by a tab is still only indented 8 columns, whereas
"+" followed by 6 spaces is indented 7 columns.  So indentation
can look a bit weird in the diff.

I was accounting for that though. :)

> For eg, it shows:
> +  return is_a<scalar_mode>(GET_MODE (dest))
> +&& aarch64_mov_operand (src, GET_MODE 

Re: [aarch64] Code-gen for vector initialization involving constants

2023-05-24 Thread Prathamesh Kulkarni via Gcc-patches
On Wed, 24 May 2023 at 15:40, Richard Sandiford
 wrote:
>
> Prathamesh Kulkarni  writes:
> > On Mon, 22 May 2023 at 14:18, Richard Sandiford
> >  wrote:
> >>
> >> Prathamesh Kulkarni  writes:
> >> > Hi Richard,
> >> > Thanks for the suggestions. Does the attached patch look OK ?
> >> > Boostrap+test in progress on aarch64-linux-gnu.
> >>
> >> Like I say, please wait for the tests to complete before sending an RFA.
> >> It saves a review cycle if the tests don't in fact pass.
> > Right, sorry, will post patches after completion of testing henceforth.
> >>
> >> > diff --git a/gcc/config/aarch64/aarch64.cc 
> >> > b/gcc/config/aarch64/aarch64.cc
> >> > index 29dbacfa917..e611a7cca25 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, 
> >> > rtx vals, bool even_p)
> >> >return gen_rtx_PARALLEL (new_mode, vec);
> >> >  }
> >> >
> >> > +/* Return true if INSN is a scalar move.  */
> >> > +
> >> > +static bool
> >> > +scalar_move_insn_p (const rtx_insn *insn)
> >> > +{
> >> > +  rtx set = single_set (insn);
> >> > +  if (!set)
> >> > +return false;
> >> > +  rtx src = SET_SRC (set);
> >> > +  rtx dest = SET_DEST (set);
> >> > +  return is_a<scalar_mode>(GET_MODE (dest))
> >> > +  && aarch64_mov_operand_p (src, GET_MODE (src));
> >>
> >> Formatting:
> >>
> >>   return (is_a<scalar_mode>(GET_MODE (dest))
> >>   && aarch64_mov_operand_p (src, GET_MODE (src)));
> >>
> >> OK with that change if the tests pass, thanks.
> > Unfortunately, the patch regressed vec-init-21.c:
> >
> > int8x16_t f_s8(int8_t x, int8_t y)
> > {
> >   return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
> >7, 8, 9, 10, 11, 12, 13, 14 };
> > }
> >
> > -O3 code-gen trunk:
> > f_s8:
> > adrp x2, .LC0
> > ldr q0, [x2, #:lo12:.LC0]
> > ins v0.b[0], w0
> > ins v0.b[1], w1
> > ret
> >
> > -O3 code-gen patch:
> > f_s8:
> > adrp x2, .LC0
> > ldr d31, [x2, #:lo12:.LC0]
> > adrp x2, .LC1
> > ldr d0, [x2, #:lo12:.LC1]
> > ins v31.b[0], w0
> > ins v0.b[0], w1
> > zip1 v0.16b, v31.16b, v0.16b
> > ret
> >
> > With trunk, it chooses the fallback sequence because both fallback
> > and zip1 sequence had cost = 20, however with patch applied,
> > we end up with zip1 sequence cost = 24 and fallback sequence
> > cost = 28.
> >
> > This happens because of using insn_cost instead of
> > set_rtx_cost for the following expression:
> > (set (reg:QI 100)
> > (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
> > set_rtx_cost returns 0 for above expression but insn_cost returns 4.
>
> Yeah, was wondering why you'd dropped the set_rtx_cost thing,
> but decided not to question it since using insn_cost seemed
> reasonable if it worked.
[reposting because my reply got blocked for moderator approval]

The attached patch uses set_rtx_cost for single_set and insn_cost
otherwise for non debug insns similar to seq_cost.
>
> > This expression template appears twice in fallback sequence, which raises
> > the cost to 28 from 20, while it appears once in each half of zip1 sequence,
> > which raises the cost to 24 from 20, and so it now prefers zip1 sequence
> > instead.
> >
> > I assumed this expression would be ignored because it looks like a scalar 
> > move,
> > but that doesn't seem to be the case ?
> > aarch64_classify_symbolic_expression returns
> > SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
> > and thus aarch64_mov_operand_p returns false.
>
> Ah, I guess it should be aarch64_mov_operand instead.  Confusing that
> they're so different...
Thanks, using aarch64_mov_operand worked.
>
> > Another issue with the zip1 sequence above is using same register x2
> > for loading another half of constant in:
> > adrp x2, .LC1
> >
> > I guess this will create an output dependency from adrp x2, .LC0 ->
> > adrp x2, .LC1
> > and anti-dependency from  ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1
> > essentially forcing almost the entire sequence (except ins
> > instructions) to execute sequentially ?
>
> I'd expect modern cores to handle that via renaming.
Ah right, thanks for the clarification.

For some reason, it seems git diff is not formatting the patch correctly :/
Or perhaps I am doing something wrongly.
For eg, it shows:
+  return is_a<scalar_mode>(GET_MODE (dest))
+&& aarch64_mov_operand (src, GET_MODE (src));
but after applying the patch, it's formatted correctly with
"&"  right below is_a, both on column 10.

Similarly, for following hunk in seq_cost_ignoring_scalar_moves:
+if (NONDEBUG_INSN_P (seq)
+   && !scalar_move_insn_p (seq))
After applying patch, "&&" is below N, and not '('. Both N and "&&"
are on col 9.

And for the following just below:
+  {
+   if (rtx set = single_set (seq))

diff shows only one space difference between '{' and the following if,
but after applying the patch 

Re: [PATCH] Fortran: checking and simplification of RESHAPE intrinsic [PR103794]

2023-05-24 Thread Mikael Morin

On 21/05/2023 at 22:48, Harald Anlauf via Fortran wrote:

Dear all,

checking and simplification of the RESHAPE intrinsic could fail in
various ways for sufficiently complicated arguments, like array
constructors.  Debugging revealed that in these cases we determined
that the array arguments were constant but we did not properly
simplify and expand the constructors.

A possible solution is to extend the test for constant arrays -
which already does an expansion for initialization expressions -
to also perform an expansion for small constructors in the
non-initialization case.

Regtested on x86_64-pc-linux-gnu.  OK for mainline?


OK, thanks.


Re: [PATCH] Fortran: reject bad DIM argument of SIZE intrinsic in simplification [PR104350]

2023-05-24 Thread Mikael Morin

On 24/05/2023 at 21:16, Harald Anlauf via Fortran wrote:

Dear all,

the attached almost obvious patch fixes an ICE on invalid that may
occur when we attempt to simplify an initialization expression with
SIZE for an out-of-range DIM argument.  Returning gfc_bad_expr
allows for a more graceful error recovery.

Regtested on x86_64-pc-linux-gnu.  OK for mainline?


OK, thanks.


Re: [PATCH v4] libgfortran: Replace mutex with rwlock

2023-05-24 Thread Thomas Koenig via Gcc-patches

Hi Lipeng,


May I know any comment or concern on this patch, thanks for your time 


Thanks for your patience in getting this reviewed.

A few remarks / questions.

Which strategy is used in this implementation, read-preferring or
write-preferring?  And if read-preferring is used, is there
a danger of deadlock if people do unreasonable things?
Maybe you could explain that, also in a comment in the code.
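
For concreteness (my example, not from the patch; "rd_lock"/"wr_lock" and
"unit_lock" are stand-ins for whatever the patch's helpers are called), the
classic hazard is a recursive read acquisition:

  /* Sketch: thread A already holds a read lock and takes it again,
     while thread B asks for the write lock in between.  */
  rd_lock (&unit_lock);        /* A: first (outer) read lock            */
  /* B: wr_lock (&unit_lock) -- queued, waiting for A to release        */
  rd_lock (&unit_lock);        /* A: recursive read lock                */
  /* Write-preferring: A waits behind B's queued writer while B waits
     for A's outer read lock -> deadlock.  Read-preferring: A proceeds,
     but writers like B can starve under heavy read traffic.  */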

Can you add some sort of torture test case(s) which does a lot of
opening/closing/reading/writing, possibly with asynchronous
I/O and/or pthreads, to catch possible problems?  If there is a
system dependency or some race condition, chances are that regression
testers will catch this.

With this, the libgfortran parts are OK, unless somebody else has more
comments, so give this a couple of days.  I cannot approve the libgcc
parts, that would be somebody else (Jakub?)

Best regards

Thomas




[PATCH] Fortran: reject bad DIM argument of SIZE intrinsic in simplification [PR104350]

2023-05-24 Thread Harald Anlauf via Gcc-patches
Dear all,

the attached almost obvious patch fixes an ICE on invalid that may
occur when we attempt to simplify an initialization expression with
SIZE for an out-of-range DIM argument.  Returning gfc_bad_expr
allows for a more graceful error recovery.

Regtested on x86_64-pc-linux-gnu.  OK for mainline?

Thanks,
Harald

From 738bdcce46bd760fcafd1eb56700c8824621266f Mon Sep 17 00:00:00 2001
From: Harald Anlauf 
Date: Wed, 24 May 2023 21:04:43 +0200
Subject: [PATCH] Fortran: reject bad DIM argument of SIZE intrinsic in
 simplification [PR104350]

gcc/fortran/ChangeLog:

	PR fortran/104350
	* simplify.cc (simplify_size): Reject DIM argument of intrinsic SIZE
	with error when out of valid range.

gcc/testsuite/ChangeLog:

	PR fortran/104350
	* gfortran.dg/size_dim_2.f90: New test.
---
 gcc/fortran/simplify.cc  | 12 +++-
 gcc/testsuite/gfortran.dg/size_dim_2.f90 | 19 +++
 2 files changed, 30 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gfortran.dg/size_dim_2.f90

diff --git a/gcc/fortran/simplify.cc b/gcc/fortran/simplify.cc
index 3f77203e62e..81680117f70 100644
--- a/gcc/fortran/simplify.cc
+++ b/gcc/fortran/simplify.cc
@@ -7594,7 +7594,17 @@ simplify_size (gfc_expr *array, gfc_expr *dim, int k)
   if (dim->expr_type != EXPR_CONSTANT)
 	return NULL;

-  d = mpz_get_ui (dim->value.integer) - 1;
+  if (array->rank == -1)
+	return NULL;
+
+  d = mpz_get_si (dim->value.integer) - 1;
+  if (d < 0 || d > array->rank - 1)
+	{
+	  gfc_error ("DIM argument (%d) to intrinsic SIZE at %L out of range "
+		     "(1:%d)", d+1, &dim->where, array->rank);
+	  return _bad_expr;
+	}
+
   if (!gfc_array_dimen_size (array, d, &size))
 	return NULL;
 }
diff --git a/gcc/testsuite/gfortran.dg/size_dim_2.f90 b/gcc/testsuite/gfortran.dg/size_dim_2.f90
new file mode 100644
index 000..27a71d90a47
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/size_dim_2.f90
@@ -0,0 +1,19 @@
+! { dg-do compile }
+! PR fortran/104350 - ICE with SIZE and bad DIM in initialization expression
+! Contributed by G. Steinmetz
+
+program p
+  implicit none
+  integer :: k
+  integer, parameter :: x(2,3) = 42
+  integer, parameter :: s(*) = [(size(x,dim=k),k=1,rank(x))]
+  integer, parameter :: t(*) = [(size(x,dim=k),k=1,3)]   ! { dg-error "out of range" }
+  integer, parameter :: u(*) = [(size(x,dim=k),k=0,3)]   ! { dg-error "out of range" }
+  integer, parameter :: v = product(shape(x))
+  integer, parameter :: w = product([(size(x,k),k=0,3)]) ! { dg-error "out of range" }
+  print *,([(size(x,dim=k),k=1,rank(x))])
+  print *, [(size(x,dim=k),k=1,rank(x))]
+  print *, [(size(x,dim=k),k=0,rank(x))]
+  print *, product([(size(x,dim=k),k=1,rank(x))])
+  print *, product([(size(x,dim=k),k=0,rank(x))])
+end
--
2.35.3



[PATCH RFC] c++: use __cxa_call_terminate for MUST_NOT_THROW [PR97720]

2023-05-24 Thread Jason Merrill via Gcc-patches
Middle-end folks: any thoughts about how best to make the change described in
the last paragraph below?

Library folks: any thoughts on the changes to __cxa_call_terminate?

-- 8< --

[except.handle]/7 says that when we enter std::terminate due to a throw,
that is considered an active handler.  We already implemented that properly
for the case of not finding a handler (__cxa_throw calls __cxa_begin_catch
before std::terminate) and the case of finding a callsite with no landing
pad (the personality function calls __cxa_call_terminate which calls
__cxa_begin_catch), but for the case of a throw in a try/catch in a noexcept
function, we were emitting a cleanup that calls std::terminate directly
without ever calling __cxa_begin_catch to handle the exception.
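
A minimal sketch of that scenario (mine, not the new terminate2.C test added
below):

  // The throw escapes the try (no matching handler) and hits the noexcept
  // boundary; [except.handle]/7 wants an active handler, i.e. a
  // __cxa_begin_catch call, when std::terminate is entered here.
  void f () noexcept
  {
    try { throw 1; }
    catch (const char *) { }
  }

  int main () { f (); }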

A straightforward way to fix this seems to be calling __cxa_call_terminate
instead.  However, that requires exporting it from libstdc++, which we have
not previously done.  Despite the name, it isn't actually part of the ABI
standard.  Nor is __cxa_call_unexpected, as far as I can tell, but that one
is also used by clang.  For this case they use __clang_call_terminate; it
seems reasonable to me for us to stick with __cxa_call_terminate.

I also change __cxa_call_terminate to take void* for simplicity in the front
end (and consistency with __cxa_call_unexpected) but that isn't necessary if
it's undesirable for some reason.

This patch does not fix the issue that representing the noexcept as a
cleanup is wrong, and confuses the handler search; since it looks like a
cleanup in the EH tables, the unwinder keeps looking until it finds the
catch in main(), which it should never have gotten to.  Without the
try/catch in main, the unwinder would reach the end of the stack and say no
handler was found.  The noexcept is a handler, and should be treated as one,
as it is when the landing pad is omitted.

The best fix for that issue seems to me to be to represent an
ERT_MUST_NOT_THROW after an ERT_TRY in an action list as though it were an
ERT_ALLOWED_EXCEPTIONS (since indeed it is an exception-specification).  The
actual code generation shouldn't need to change (apart from the change made
by this patch), only the action table entry.

PR c++/97720

gcc/cp/ChangeLog:

* cp-tree.h (enum cp_tree_index): Add CPTI_CALL_TERMINATE_FN.
(call_terminate_fn): New macro.
* cp-gimplify.cc (gimplify_must_not_throw_expr): Use it.
* except.cc (init_exception_processing): Set it.
(cp_protect_cleanup_actions): Return it.

gcc/ChangeLog:

* tree-eh.cc (lower_resx): Pass the exception pointer to the
failure_decl.
* except.h: Tweak comment.

libstdc++-v3/ChangeLog:

* libsupc++/eh_call.cc (__cxa_call_terminate): Take void*.
* config/abi/pre/gnu.ver: Add it.

gcc/testsuite/ChangeLog:

* g++.dg/eh/terminate2.C: New test.
---
 gcc/cp/cp-tree.h |  2 ++
 gcc/except.h |  2 +-
 gcc/cp/cp-gimplify.cc|  2 +-
 gcc/cp/except.cc |  5 -
 gcc/testsuite/g++.dg/eh/terminate2.C | 30 
 gcc/tree-eh.cc   | 16 ++-
 libstdc++-v3/libsupc++/eh_call.cc|  4 +++-
 libstdc++-v3/config/abi/pre/gnu.ver  |  7 +++
 8 files changed, 63 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/eh/terminate2.C

diff --git a/gcc/cp/cp-tree.h b/gcc/cp/cp-tree.h
index a1b882f11fe..a8465a988b5 100644
--- a/gcc/cp/cp-tree.h
+++ b/gcc/cp/cp-tree.h
@@ -217,6 +217,7 @@ enum cp_tree_index
definitions.  */
 CPTI_ALIGN_TYPE,
 CPTI_TERMINATE_FN,
+CPTI_CALL_TERMINATE_FN,
 CPTI_CALL_UNEXPECTED_FN,
 
 /* These are lazily inited.  */
@@ -358,6 +359,7 @@ extern GTY(()) tree cp_global_trees[CPTI_MAX];
 /* Exception handling function declarations.  */
 #define terminate_fn   cp_global_trees[CPTI_TERMINATE_FN]
 #define call_unexpected_fn cp_global_trees[CPTI_CALL_UNEXPECTED_FN]
+#define call_terminate_fn  cp_global_trees[CPTI_CALL_TERMINATE_FN]
 #define get_exception_ptr_fn   cp_global_trees[CPTI_GET_EXCEPTION_PTR_FN]
 #define begin_catch_fn cp_global_trees[CPTI_BEGIN_CATCH_FN]
 #define end_catch_fn   cp_global_trees[CPTI_END_CATCH_FN]
diff --git a/gcc/except.h b/gcc/except.h
index 5ecdbc0d1dc..378a9e4cb77 100644
--- a/gcc/except.h
+++ b/gcc/except.h
@@ -155,7 +155,7 @@ struct GTY(()) eh_region_d
 struct eh_region_u_must_not_throw {
   /* A function decl to be invoked if this region is actually reachable
 from within the function, rather than implementable from the runtime.
-The normal way for this to happen is for there to be a CLEANUP region
+The normal way for this to happen is for there to be a TRY region
 contained within this MUST_NOT_THROW region.  Note that if the
 runtime handles the MUST_NOT_THROW region, we have no 

[COMMITTED] Remove deprecated vrange::kind().

2023-05-24 Thread Aldy Hernandez via Gcc-patches
gcc/ChangeLog:

* value-range.h (vrange::kind): Remove.
---
 gcc/value-range.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/gcc/value-range.h b/gcc/value-range.h
index 936eb175062..b8cc2a0e76a 100644
--- a/gcc/value-range.h
+++ b/gcc/value-range.h
@@ -100,9 +100,6 @@ public:
   bool operator== (const vrange &) const;
   bool operator!= (const vrange &r) const { return !(*this == r); }
   void dump (FILE *) const;
-
-  enum value_range_kind kind () const; // DEPRECATED
-
 protected:
   vrange (enum value_range_discriminator d) : m_discriminator (d) { }
   ENUM_BITFIELD(value_range_kind) m_kind : 8;
-- 
2.40.1



Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches


On Wed, 24 May 2023, Richard Biener via Gcc-patches wrote:

> I’d have to check the ISAs what they actually do here - it of course depends
> on RTL semantics as well but as you say those are not strictly defined here
> either.

Plus, we can add the following executable test to the testsuite:

#include <stdint.h>

#define CHECK(TYPE, WIDTH, OP, COUNT, INVERT) \
{ \
typedef TYPE vec __attribute__((vector_size(WIDTH))); \
  \
static volatile vec zero; \
vec tmp = (zero-2) OP (COUNT);\
vec ref = INVERT zero;\
if (__builtin_memcmp(&tmp, &ref, sizeof tmp)) \
__builtin_abort();\
}

int main(void)
{
CHECK( uint8_t, 16, <<, 8,  )
CHECK( uint8_t, 16, <<, 31, )
CHECK( uint8_t, 16, >>, 8,  )
CHECK( uint8_t, 16, >>, 31, )
CHECK(  int8_t, 16, <<, 8,  )
CHECK(  int8_t, 16, <<, 31, )
CHECK(  int8_t, 16, >>, 8,  ~)
CHECK(  int8_t, 16, >>, 31, ~)
CHECK(uint16_t, 16, <<, 16, )
CHECK(uint16_t, 16, <<, 31, )
CHECK(uint16_t, 16, >>, 16, )
CHECK(uint16_t, 16, >>, 31, )
CHECK( int16_t, 16, <<, 16, )
CHECK( int16_t, 16, <<, 31, )
CHECK( int16_t, 16, >>, 16, ~)
CHECK( int16_t, 16, >>, 31, ~)
// Per-lane-variable shifts:
CHECK( uint8_t, 16, <<, zero+8,  )
CHECK( uint8_t, 16, <<, zero+31, )
CHECK( uint8_t, 16, >>, zero+8,  )
CHECK( uint8_t, 16, >>, zero+31, )
CHECK(  int8_t, 16, <<, zero+8,  )
CHECK(  int8_t, 16, <<, zero+31, )
CHECK(  int8_t, 16, >>, zero+8,  ~)
CHECK(  int8_t, 16, >>, zero+31, ~)
CHECK(uint16_t, 16, <<, zero+16, )
CHECK(uint16_t, 16, <<, zero+31, )
CHECK(uint16_t, 16, >>, zero+16, )
CHECK(uint16_t, 16, >>, zero+31, )
CHECK( int16_t, 16, <<, zero+16, )
CHECK( int16_t, 16, <<, zero+31, )
CHECK( int16_t, 16, >>, zero+16, ~)
CHECK( int16_t, 16, >>, zero+31, ~)

// Repeat for WIDTH=32 and WIDTH=64
}

Alexander


Re: [PATCH] RISC-V: Add missing torture-init and torture-finish for rvv.exp

2023-05-24 Thread Vineet Gupta

+CC Thomas and Maciej

On 5/22/23 20:52, Vineet Gupta wrote:

On 5/22/23 02:17, Kito Cheng wrote:

Ooops, seems still some issue around here,


Yep still 5000 fails :-(


  but I found something might
related this issue:

https://github.com/gcc-mirror/gcc/commit/d6654a4be3ba44c0d57be7c8a51d76d9721345e1 

https://github.com/gcc-mirror/gcc/commit/23c49bb8d09bc3bfce9a08be637cf32ac014de56 



It seems both of these patches are essentially doing what yours did. 
So something else is amiss still.


Apparently in addition to Kito's patch below, If I comment out the 
additional torture options, failures go down drastically.


diff --git a/gcc/testsuite/gcc.target/riscv/riscv.exp 
b/gcc/testsuite/gcc.target/riscv/riscv.exp


-lappend ADDITIONAL_TORTURE_OPTIONS {-Og -g} {-Oz}
+#lappend ADDITIONAL_TORTURE_OPTIONS {-Og -g} {-Oz}

@Thomas, do you have some thoughts on how to fix riscv.exp properly in 
light of recent changes to exp files.




Thx,
-Vineet



On Mon, May 22, 2023 at 2:42 PM Kito Cheng  
wrote:

Hi Vineet:

Could you help to test this patch, this could resolve that issue on our
machine, but I would like to also work for other env.

Thanks :)

---

We got bunch of following error message for multi-lib run:

ERROR: torture-init: torture_without_loops is not empty as expected
ERROR: tcl error code NONE

And seems we need torture-init and torture-finish around the test
loop.

gcc/testsuite/ChangeLog:

 * gcc.target/riscv/rvv/rvv.exp: Add torture-init and
 torture-finish.
---
  gcc/testsuite/gcc.target/riscv/rvv/rvv.exp | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp 
b/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp

index bc99cc0c3cf4..19179564361a 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp
+++ b/gcc/testsuite/gcc.target/riscv/rvv/rvv.exp
@@ -39,6 +39,7 @@ if [istarget riscv32-*-*] then {

  # Initialize `dg'.
  dg-init
+torture-init

  # Main loop.
  set CFLAGS "$DEFAULT_CFLAGS -march=$gcc_march -mabi=$gcc_mabi -O3"
@@ -69,5 +70,7 @@ foreach op $AUTOVEC_TEST_OPTS {
  dg-runtest [lsort [glob -nocomplain 
$srcdir/$subdir/autovec/vls-vlmax/*.\[cS\]]] \
 "-std=c99 -O3 -ftree-vectorize --param 
riscv-autovec-preference=fixed-vlmax" $CFLAGS


+torture-finish
+
  # All done.
  dg-finish
--
2.40.1







[i386 PATCH] A minor code clean-up: Use NULL_RTX instead of nullptr

2023-05-24 Thread Roger Sayle

My understanding is that GCC's preferred null value for rtx is NULL_RTX
(and for tree is NULL_TREE), and that by being typed it allows strict type
checking, and use with function polymorphism and template instantiation.
C++'s nullptr is preferred over NULL and 0 for pointer types that don't
have a defined null of the correct type.
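
A tiny example of the type-checking point (my sketch; the "record" overloads
are made up for illustration):

  void record (rtx x);
  void record (tree t);

  void
  demo ()
  {
    record (NULL_RTX);    // unambiguous: NULL_RTX is (rtx) 0
    // record (nullptr);  // would be ambiguous between the two overloads
  }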

This minor clean-up uses NULL_RTX consistently in i386-expand.cc.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}
with no new failures.  Ok for mainline?  Is my understanding correct?


2023-05-24  Roger Sayle  

gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_convert_wide_int_to_broadcast): Use
NULL_RTX instead of nullptr.
(ix86_convert_const_wide_int_to_broadcast): Likewise.
(ix86_broadcast_from_constant): Likewise.
(ix86_expand_vector_move): Likewise.
(ix86_extract_perm_from_pool_constant): Likewise.


Thanks,
Roger
--

diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 634fe61..a867288 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -296,7 +296,7 @@ ix86_convert_const_wide_int_to_broadcast (machine_mode 
mode, rtx op)
   /* Don't use integer vector broadcast if we can't move from GPR to SSE
  register directly.  */
   if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
-return nullptr;
+return NULL_RTX;
 
   /* Convert CONST_WIDE_INT to a non-standard SSE constant integer
  broadcast only if vector broadcast is available.  */
@@ -305,7 +305,7 @@ ix86_convert_const_wide_int_to_broadcast (machine_mode 
mode, rtx op)
   || standard_sse_constant_p (op, mode)
   || (CONST_WIDE_INT_NUNITS (op) * HOST_BITS_PER_WIDE_INT
  != GET_MODE_BITSIZE (mode)))
-return nullptr;
+return NULL_RTX;
 
   HOST_WIDE_INT val = CONST_WIDE_INT_ELT (op, 0);
   HOST_WIDE_INT val_broadcast;
@@ -326,12 +326,12 @@ ix86_convert_const_wide_int_to_broadcast (machine_mode 
mode, rtx op)
  val_broadcast))
 broadcast_mode = DImode;
   else
-return nullptr;
+return NULL_RTX;
 
   /* Check if OP can be broadcasted from VAL.  */
   for (int i = 1; i < CONST_WIDE_INT_NUNITS (op); i++)
 if (val != CONST_WIDE_INT_ELT (op, i))
-  return nullptr;
+  return NULL_RTX;
 
   unsigned int nunits = (GET_MODE_SIZE (mode)
 / GET_MODE_SIZE (broadcast_mode));
@@ -525,7 +525,7 @@ ix86_expand_move (machine_mode mode, rtx operands[])
{
  rtx tmp = ix86_convert_const_wide_int_to_broadcast
(GET_MODE (op0), op1);
- if (tmp != nullptr)
+ if (tmp != NULL_RTX)
op1 = tmp;
}
}
@@ -541,13 +541,13 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
 {
   int nunits = GET_MODE_NUNITS (mode);
   if (nunits < 2)
-return nullptr;
+return NULL_RTX;
 
   /* Don't use integer vector broadcast if we can't move from GPR to SSE
  register directly.  */
   if (!TARGET_INTER_UNIT_MOVES_TO_VEC
   && INTEGRAL_MODE_P (mode))
-return nullptr;
+return NULL_RTX;
 
   /* Convert CONST_VECTOR to a non-standard SSE constant integer
  broadcast only if vector broadcast is available.  */
@@ -557,7 +557,7 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
|| GET_MODE_INNER (mode) == DImode))
|| FLOAT_MODE_P (mode))
   || standard_sse_constant_p (op, mode))
-return nullptr;
+return NULL_RTX;
 
   /* Don't broadcast from a 64-bit integer constant in 32-bit mode.
  We can still put 64-bit integer constant in memory when
@@ -565,14 +565,14 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
   if (GET_MODE_INNER (mode) == DImode && !TARGET_64BIT
   && (!TARGET_AVX512F
  || (GET_MODE_SIZE (mode) < 64 && !TARGET_AVX512VL)))
-return nullptr;
+return NULL_RTX;
 
   if (GET_MODE_INNER (mode) == TImode)
-return nullptr;
+return NULL_RTX;
 
   rtx constant = get_pool_constant (XEXP (op, 0));
   if (GET_CODE (constant) != CONST_VECTOR)
-return nullptr;
+return NULL_RTX;
 
   /* There could be some rtx like
  (mem/u/c:V16QI (symbol_ref/u:DI ("*.LC1")))
@@ -581,8 +581,8 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
 {
   constant = simplify_subreg (mode, constant, GET_MODE (constant),
  0);
-  if (constant == nullptr || GET_CODE (constant) != CONST_VECTOR)
-   return nullptr;
+  if (constant == NULL_RTX || GET_CODE (constant) != CONST_VECTOR)
+   return NULL_RTX;
 }
 
   rtx first = XVECEXP (constant, 0, 0);
@@ -592,7 +592,7 @@ ix86_broadcast_from_constant (machine_mode mode, rtx op)
   rtx tmp = XVECEXP (constant, 0, i);
   /* Vector duplicate value.  */
   if (!rtx_equal_p (tmp, first))
-   return nullptr;
+   return NULL_RTX;
 }
 
   return first;
@@ -641,7 

[PATCH v3] tree-ssa-sink: Improve code sinking pass

2023-05-24 Thread Ajit Agarwal via Gcc-patches
Hello All:

This patch improves the code sinking pass to sink statements before a call to
reduce register pressure.
Review comments are incorporated.

For example :

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  l = a + b + c + d +e + f;
  if (a != 5)
{
  bar();
  j = l;
}
}

Code Sinking does the following:

void bar();
int j;
void foo(int a, int b, int c, int d, int e, int f)
{
  int l;
  
  if (a != 5)
{
  l = a + b + c + d +e + f; 
  bar();
  j = l;
}
}

Bootstrapped regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit
  
tree-ssa-sink: Improve code sinking pass

Code sinking sinks the blocks after a call.  This increases register pressure
for callee-saved registers.  Improve code sinking before the call in the use
blocks or the immediate dominator of the use blocks.

2023-05-24  Ajit Kumar Agarwal  

gcc/ChangeLog:

* tree-ssa-sink.cc (statement_sink_location): Move statements before
calls.
(def_use_same_block): New function.
(select_best_block): Add heuristics to select the best blocks in the
immediate post dominator.

gcc/testsuite/ChangeLog:

* gcc.dg/tree-ssa/ssa-sink-20.c: New testcase.
* gcc.dg/tree-ssa/ssa-sink-21.c: New testcase.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c | 15 +
 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c | 19 ++
 gcc/tree-ssa-sink.cc| 74 +
 3 files changed, 96 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
new file mode 100644
index 000..69fa6d32e7c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-20.c
@@ -0,0 +1,15 @@
+/* { dg-options "-O2 -fdump-tree-optimized -fdump-tree-sink-stats" } */
+
+void bar();
+int j;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump-times "Sunk statements: 5" 1 "sink" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c 
b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
new file mode 100644
index 000..b34959c8a4d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-sink-21.c
@@ -0,0 +1,19 @@
+/* { dg-options "-O2 -fdump-tree-sink-stats" } */
+
+void bar();
+int j, x;
+void foo(int a, int b, int c, int d, int e, int f)
+{
+  int l;
+  l = a + b + c + d +e + f;
+  if (a != 5)
+{
+  bar();
+  if (b != 3)
+x = 3;
+  else
+x = 5;
+  j = l;
+}
+}
+/* { dg-final { scan-tree-dump-times "Sunk statements: 5" 1 "sink" } } */
diff --git a/gcc/tree-ssa-sink.cc b/gcc/tree-ssa-sink.cc
index b1ba7a2ad6c..ee8988bbb2c 100644
--- a/gcc/tree-ssa-sink.cc
+++ b/gcc/tree-ssa-sink.cc
@@ -171,9 +171,28 @@ nearest_common_dominator_of_uses (def_operand_p def_p, 
bool *debug_stmts)
   return commondom;
 }
 
+/* Return TRUE if immediate uses of the defs in
+   STMT occur in the same block as STMT, FALSE otherwise.  */
+
+bool
+def_use_same_block (gimple *stmt)
+{
+  def_operand_p def;
+  ssa_op_iter iter;
+
+  FOR_EACH_SSA_DEF_OPERAND (def, stmt, iter, SSA_OP_DEF)
+{
+  gimple *def_stmt = SSA_NAME_DEF_STMT (DEF_FROM_PTR (def));
+  if ((gimple_bb (def_stmt) == gimple_bb (stmt)))
+   return true;
+ }
+  return false;
+}
+
 /* Given EARLY_BB and LATE_BB, two blocks in a path through the dominator
tree, return the best basic block between them (inclusive) to place
-   statements.
+   statements. The best basic block should be in immediate dominator of
+   best basic block if the use stmt is after the call.
 
We want the most control dependent block in the shallowest loop nest.
 
@@ -190,7 +209,8 @@ nearest_common_dominator_of_uses (def_operand_p def_p, bool 
*debug_stmts)
 static basic_block
 select_best_block (basic_block early_bb,
   basic_block late_bb,
-  gimple *stmt)
+  gimple *stmt,
+  gimple *use)
 {
   basic_block best_bb = late_bb;
   basic_block temp_bb = late_bb;
@@ -230,14 +250,46 @@ select_best_block (basic_block early_bb,
   if (threshold > 100)
threshold = 100;
 }
-
   /* If BEST_BB is at the same nesting level, then require it to have
  significantly lower execution frequency to avoid gratuitous movement.  */
   if (bb_loop_depth (best_bb) == bb_loop_depth (early_bb)
   /* If result of comparsion is unknown, prefer EARLY_BB.
 Thus use !(...>=..) rather than (...<...)  */
   && !(best_bb->count * 100 >= early_bb->count * threshold))
-return best_bb;
+{
+  basic_block new_best_bb = get_immediate_dominator (CDI_DOMINATORS, 
best_bb);
+  /* Return best_bb if def and use are in same block otherwise new_best_bb.
+
+Things to 

Re: [V7][PATCH 2/2] Update documentation to clarify a GCC extension [PR77650]

2023-05-24 Thread Qing Zhao via Gcc-patches
Hi, Joseph,

I modified gcc/doc/extend.texi per your suggestion as follows:

Let me know if you have any further comments or suggestions on this patch.

I will send out the V8 of the patch after some testing.

Thanks.

Qing.



diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index 6425ba57e88..9aedaa802e0 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -1754,7 +1754,7 @@ structure or an element of an array.  (However, these 
uses are
 permitted by GCC as extensions, see details below.)
 @end itemize
 
-GCC extension accepts a structure containing an ISO C99 @dfn{flexible array
+The GCC extension accepts a structure containing an ISO C99 @dfn{flexible array
 member}, or a union containing such a structure (possibly recursively)
 to be a member of a structure.
 
@@ -1776,10 +1776,9 @@ struct out_flex_union @{ int n; union union_flex 
flex_data; @};
 In the above, both @code{out_flex_struct.flex_data.data[]} and
 @code{out_flex_union.flex_data.f.data[]} are considered as flexible arrays too.
 
-
 @item
 A structure containing a C99 flexible array member, or a union containing
-such a structure, is the middle field of another structure, for example:
+such a structure, is not the last field of another structure, for example:
 
 @smallexample
 struct flex  @{ int length; char data[]; @};
@@ -1787,12 +1786,12 @@ struct flex  @{ int length; char data[]; @};
 struct mid_flex @{ int m; struct flex flex_data; int n; @};
 @end smallexample
 
-In the above, @code{mid_flex.flex_data.data[]} has undefined behavior.
-Compilers do not handle such case consistently, Any code relying on
-such case should be modified to ensure that flexible array members
-only end up at the ends of structures.
+In the above, accessing a member of the array @code{mid_flex.flex_data.data[]}
+might have undefined behavior.  Compilers do not handle such a case
+consistently.   Any code relying on this case should be modified to ensure
+that flexible array members only end up at the ends of structures.
 
-Please use warning option  @option{-Wflex-array-member-not-at-end} to
+Please use the warning option @option{-Wflex-array-member-not-at-end} to
 identify all such cases in the source code and modify them.  This warning
 will be on by default starting from GCC 15.
 @end itemize
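
For illustration, the undefined case described above boils down to something
like this (a minimal sketch reusing the struct names from the @smallexample):

  struct flex  { int length; char data[]; };
  struct mid_flex { int m; struct flex flex_data; int n; };

  void set_first (struct mid_flex *p)
  {
    /* flex_data is not the last member of mid_flex, so an access into
       flex_data.data[] may overlap the storage of p->n.  Such an access
       has undefined behavior, and this is the kind of declaration the
       proposed -Wflex-array-member-not-at-end warning is meant to
       identify.  */
    p->flex_data.data[0] = 1;
  }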



> On May 19, 2023, at 5:12 PM, Joseph Myers  wrote:
> 
> On Fri, 19 May 2023, Qing Zhao via Gcc-patches wrote:
> 
>> +GCC extension accepts a structure containing an ISO C99 @dfn{flexible array
> 
> "The GCC extension" or "A GCC extension".
> 
>> +@item
>> +A structure containing a C99 flexible array member, or a union containing
>> +such a structure, is the middle field of another structure, for example:
> 
> There might be more than one middle field, and I think this case also 
> includes where it's the *first* field - any field other than the last.
> 
>> +@smallexample
>> +struct flex  @{ int length; char data[]; @};
>> +
>> +struct mid_flex @{ int m; struct flex flex_data; int n; @};
>> +@end smallexample
>> +
>> +In the above, @code{mid_flex.flex_data.data[]} has undefined behavior.
> 
> And it's not literally mid_flex.flex_data.data[] that has undefined 
> behavior, but trying to access a member of that array.
> 
>> +Compilers do not handle such case consistently, Any code relying on
> 
> "such a case", and "," should be "." at the end of a sentence.
> 
> -- 
> Joseph S. Myers
> jos...@codesourcery.com



Re: [PATCH] Dump if a pattern fails after having printed applying it

2023-05-24 Thread Andrew Pinski via Gcc-patches
On Wed, May 24, 2023 at 2:03 AM Richard Biener via Gcc-patches
 wrote:
>
> On Wed, May 24, 2023 at 1:16 AM Andrew Pinski via Gcc-patches
>  wrote:
> >
> > While trying to understand how to use the ! operand for match
> > patterns, I noticed that the debug dumps would print out applying
> > a pattern but nothing when it was rejected in the end. This was confusing
> > me.
> > This adds that by emitting a dump for the failed case.
> > Note the patch is little more complex as we don't want to print out
> > if debug counter rejected the pattern and then we need to fix up
> > when we mark needing a label or not.
> >
> > OK? Bootstrapped and tested on x86_64-linux-gnu with no regressions.
>
> Hmm, can we maybe simply move the
>
> if (UNLIKELY
> (debug_dump)) fprintf (dump_file, "Applying pattern %s:%d, %s:%d\n",
> "match.pd", 1157, __FILE__, __LINE__);
>
> right before the return true; instead?

Yes that should work. Let me do a patch for that.

Thanks,
Andrew
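
For illustration, with that change the generated matcher code would end up
looking roughly like this (a sketch of genmatch output with the pattern body
elided; the match.pd line number is just the one from the example above):

  /* ... code matching the pattern and building the replacement ... */
  if (UNLIKELY (debug_dump))
    fprintf (dump_file, "Applying pattern %s:%d, %s:%d\n",
             "match.pd", 1157, __FILE__, __LINE__);
  return true;
next_after_fail1:;
  /* fall through and try the next pattern */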

>
> > gcc/ChangeLog:
> >
> > * genmatch.cc (needs_label): New variable
> > (expr::gen_transform): Set needs_label
> > if we use the local_label.
> > (dt_simplify::gen_1): Use `_1` for the debug count label.
> > After the local label, emit debug print for the failure.
> > Emit `_1` label if needed.
> > ---
> >  gcc/genmatch.cc | 28 
> >  1 file changed, 24 insertions(+), 4 deletions(-)
> >
> > diff --git a/gcc/genmatch.cc b/gcc/genmatch.cc
> > index 177c13d87cb..2ea80d341a2 100644
> > --- a/gcc/genmatch.cc
> > +++ b/gcc/genmatch.cc
> > @@ -2433,6 +2433,7 @@ capture_info::walk_c_expr (c_expr *e)
> >  /* The current label failing the current matched pattern during
> > code generation.  */
> >  static char *fail_label;
> > +static bool needs_label;
> >
> >  /* Code generation off the decision tree and the refered AST nodes.  */
> >
> > @@ -2611,6 +2612,7 @@ expr::gen_transform (FILE *f, int indent, const char 
> > *dest, bool gimple,
> >fprintf_indent (f, indent,
> >   "if (!_r%d) goto %s;\n",
> >   depth, fail_label);
> > +  needs_label = true;
> >if (*opr == CONVERT_EXPR)
> > {
> >   indent -= 4;
> > @@ -2640,11 +2642,13 @@ expr::gen_transform (FILE *f, int indent, const 
> > char *dest, bool gimple,
> > {
> >   fprintf_indent (f, indent, "if (!_r%d)\n", depth);
> >   fprintf_indent (f, indent, "  goto %s;\n", fail_label);
> > + needs_label = true;
> > }
> >if (force_leaf)
> > {
> >   fprintf_indent (f, indent, "if (EXPR_P (_r%d))\n", depth);
> >   fprintf_indent (f, indent, "  goto %s;\n", fail_label);
> > + needs_label = true;
> > }
> >if (*opr == CONVERT_EXPR)
> > {
> > @@ -3409,7 +3413,8 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, 
> > operand *result)
> >char local_fail_label[256];
> >snprintf (local_fail_label, 256, "next_after_fail%u", ++fail_label_cnt);
> >fail_label = local_fail_label;
> > -  bool needs_label = false;
> > +  needs_label = false;
> > +  bool needs_label_1 = false;
> >
> >/* Analyze captures and perform early-outs on the incoming arguments
> >   that cover cases we cannot handle.  */
> > @@ -3484,8 +3489,8 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, 
> > operand *result)
> >
> >if (s->kind == simplify::SIMPLIFY)
> >  {
> > -  fprintf_indent (f, indent, "if (UNLIKELY (!dbg_cnt (match))) goto 
> > %s;\n", fail_label);
> > -  needs_label = true;
> > +  fprintf_indent (f, indent, "if (UNLIKELY (!dbg_cnt (match))) goto 
> > %s_1;\n", fail_label);
> > +  needs_label_1 = true;
> >  }
> >
> >fprintf_indent (f, indent, "if (UNLIKELY (debug_dump)) "
> > @@ -3718,7 +3723,22 @@ dt_simplify::gen_1 (FILE *f, int indent, bool 
> > gimple, operand *result)
> >indent -= 2;
> >fprintf_indent (f, indent, "}\n");
> >if (needs_label)
> > -fprintf (f, "%s:;\n", fail_label);
> > +{
> > +  fprintf (f, "%s:;\n", fail_label);
> > +  if (s->kind == simplify::SIMPLIFY)
> > +   {
> > + fprintf_indent (f, indent, "if (UNLIKELY (debug_dump)) "
> > + "fprintf (dump_file, \"Pattern failed ");
> > + fprintf (f, "%%s:%%d, %%s:%%d\\n\", ");
> > + output_line_directive (f,
> > +result ? result->location : 
> > s->match->location, true,
> > +true);
> > + fprintf (f, ", __FILE__, __LINE__);\n");
> > +   }
> > +}
> > +  if (needs_label_1)
> > +fprintf (f, "%s_1:;\n", fail_label);
> > +  needs_label = false;
> >fail_label = NULL;
> >  }
> >
> > --
> > 2.31.1
> >


Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Hi, Richard. After I fixed the code, I think the IR is now correct:

loop_len_34 = MIN_EXPR ;
  _74 = loop_len_34 * 2;
  loop_len_48 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_48;
  loop_len_49 = MIN_EXPR <_75, 4>;
  _76 = _75 - loop_len_49;
  loop_len_50 = MIN_EXPR <_76, 4>;
  loop_len_51 = _76 - loop_len_50;
  ...
  vect__1.8_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);
...
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.11_21, 0);
...

  vect__10.16_52 = .LEN_LOAD (_31, 32B, loop_len_48, 0);
...
  vect__10.17_54 = .LEN_LOAD (_29, 32B, loop_len_49, 0);
...
  vect__10.18_56 = .LEN_LOAD (_25, 32B, loop_len_50, 0);
...
  vect__10.19_58 = .LEN_LOAD (_80, 32B, loop_len_51, 0);


For this case:

uint64_t x2[100];
uint16_t y2[200];

void f2(int n) {
  for (int i = 0, j = 0; i < n; i += 2, j += 4) {
x2[i + 0] += 1;
x2[i + 1] += 2;
y2[j + 0] += 1;
y2[j + 1] += 2;
y2[j + 2] += 3;
y2[j + 3] += 4;
  }
}

The IR is like this:

  loop_len_56 = MIN_EXPR ;
  _66 = loop_len_56 * 4;
  loop_len_43 = _66 + 18446744073709551614;
  ...
  vect__1.44_44 = .LEN_LOAD (_6, 64B, 2, 0);
  ...
  vect__1.45_46 = .LEN_LOAD (_14, 64B, loop_len_43, 0);
  vect__2.46_47 = vect__1.44_44 + { 1, 2 };
  vect__2.46_48 = vect__1.45_46 + { 1, 2 };
  .LEN_STORE (_6, 64B, 2, vect__2.46_47, 0);
  .LEN_STORE (_14, 64B, loop_len_43, vect__2.46_48, 0);
  ...
  vect__6.51_57 = .LEN_LOAD (_10, 16B, loop_len_56, 0);

  vect__7.52_58 = vect__6.51_57 + { 1, 2, 3, 4, 1, 2, 3, 4 };
  .LEN_STORE (_10, 16B, loop_len_56, vect__7.52_58, 0);

It seems correct too?

>> What gives the best code in these cases?  Is emitting a multiplication
>> better?  Or is using a new IV better?
Could you give me more detailed information about the "new IV" approach?
I'd like to try that.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-25 00:00
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Oh. I see. Thank you so much for pointing this.
> Could you tell me what I should do in the codes?
> It seems that I should adjust it in 
> vect_adjust_loop_lens_control
>
> muliply by some factor ? Is this correct multiply by max_nscalars_per_iter
> ?
 
max_nscalars_per_iter * factor rather than just max_nscalars_per_iter
 
Note that it's possible for later max_nscalars_per_iter * factor to
be smaller, so a division might be needed in rare cases.  E.g.:
 
uint64_t x[100];
uint16_t y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
where y has a single-control rgroup with max_nscalars_per_iter == 4
and x has a 2-control rgroup with max_nscalars_per_iter == 2
 
What gives the best code in these cases?  Is emitting a multiplication
better?  Or is using a new IV better?
 
Thanks,
Richard
 


Re: [PATCH v2] rs6000: Add buildin for mffscrn instructions

2023-05-24 Thread Peter Bergner via Gcc-patches
On 5/24/23 10:20 AM, Carl Love wrote:
> Extending the builtin to pre Power 9 is straight forward and I agree
> would make good sense to do.
> 
> I am a bit concerned on how to extend __builtin_set_fpscr_rn to add the
> new functionality.  Peter suggests overloading the builtin to either
> return void or returns FPSCR bits.  It is my understanding that the
> return value for a given builtin had to be the same, i.e. you can't
> overload the return value. Maybe you can with Bill's new
> infrastructure?  I recall having problems trying to overload the return
> value in the past and Bill said you couldn't do it.  I play with this
> and see if I can overload the return value.

In this case, I don't think we need a built-in overload, but just change
the current built-in to return a double rather than void.  All of the
old code should still work, since they'll just ignore the return
value.  As I said, the built-in machinery can see whether we're assigning
the built-in return value to a variable or not, ie, the difference between

  __builtin_set_fpscr_rn ();

versus:

  foo = __builtin_set_fpscr_rn ();

In the former case, the built-in can expand exactly as it does now.
In the latter case, we'll use the target rtx we're passed in as the
destination of the mffscrn[i] insn for P9/10 and for pre-P9, we'll
use the target for an additional mffs instruction which we don't
generate now.  Note we'd only generate the mffs when we're passed in
a target rtx as in the second case.  The first case we won't.

This is all assuming Segher is fine with the change this way.  If we do
go this way, I would recommend adding a predefined macro that users can
test for to know whether the built-in returns a value or not.
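
For illustration, user code could then test that macro (the macro name below
is purely hypothetical, only to show the idea):

  /* Hypothetical feature-test macro; the real name would be chosen when
     the built-in is extended.  */
  #ifdef __SET_FPSCR_RN_RETURNS_FPSCR__
    double old_fpscr = __builtin_set_fpscr_rn (3);  /* set RN, get old FPSCR */
  #else
    __builtin_set_fpscr_rn (3);                     /* current form: void */
  #endif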

If Segher doesn't like this idea, then it's all moot! :-)

Peter


Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Richard Biener via Gcc-patches


> On 24.05.2023 at 16:21, Alexander Monakov wrote:
> 
> 
>> On Wed, 24 May 2023, Richard Biener wrote:
>>> On Wed, May 24, 2023 at 2:54 PM Alexander Monakov via Gcc-patches
>>>  wrote:
>>> Explicitly say that bitwise shifts for narrow types work similar to
>>> element-wise C shifts with integer promotions, which coincides with
>>> OpenCL semantics.
>> Do we need to clarify that v << w with v being a vector of shorts
>> still yields a vector of shorts and not a vector of ints?
> 
> I don't think so, but if necessary we could add "and the result was
> truncated back to the base type":
> 
>   When the base type is narrower than @code{int}, element-wise shifts
>   are performed as if operands underwent C integer promotions, and
>   the result was truncated back to the base type, like in OpenCL. 
> 
>> Btw, I don't see this promotion reflected in the IL.  For
>> typedef short v8hi __attribute__((vector_size(16)));
>> v8hi foo (v8hi a, v8hi b)
>> {
>> return a << b;
>> }
>> I get no masking of 'b' and vector lowering if the target doens't handle it
>> yields
>> short int _5;
>> short int _6;
>> _5 = BIT_FIELD_REF ;
>> _6 = BIT_FIELD_REF ;
>> _7 = _5 << _6;
>> which we could derive ranges from for _6 (apparantly we don't yet).
> 
> Here it depends on how we define the GIMPLE-level semantics of bit-shift
> operators for narrow types. To avoid changing lowering we could say that
> shifting by up to 31 bits is well-defined for narrow types.
> 
> RTL-level semantics are also undocumented, unfortunately.
> 
>> Even
>> typedef int v8hi __attribute__((vector_size(16)));
>> v8hi x;
>> int foo (v8hi a, v8hi b)
>> {
>> x = a << b;
>> return (b[0] > 33);
>> }
>> isn't optimized currently (but could - note I've used 'int' elements here).
> 
> Yeah. But let's constrain the optimizations first.
> 
>> So, I don't see us making sure the hardware does the right thing for
>> out-of bound values.
> 
> I think in practice it worked out even if GCC did not pay attention to it,
> because SIMD instructions had to facilitate autovectorization for C with
> corresponding shift semantics.

I'd have to check what the ISAs actually do here - it of course depends on
the RTL semantics as well, but as you say those are not strictly defined here
either.

I agree we can go with having types smaller than int behave as if promoted
(also for scalars, for consistency).  Those operations do not exist in the
C standard after all (maybe with _BitInt it's now a thing).

Richard.
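
For illustration, the promoted semantics discussed above correspond to the
following element-wise scalar model (a sketch only, not the actual GCC
lowering):

  typedef short v8hi __attribute__((vector_size(16)));

  /* Element-wise meaning of a << b for a narrow base type under the
     OpenCL-like rule: each element is promoted to int, shifted, then
     truncated back to the base type, so shift counts up to 31 are
     well-defined even though short is only 16 bits wide.  */
  static inline short
  shift_element (short a, short b)
  {
    return (short) ((int) a << b);
  }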

> Alexander
> 
>> Richard.
>>> gcc/ChangeLog:
>>>   * doc/extend.texi (Vector Extensions): Clarify bitwise shift
>>>   semantics.
>>> ---
>>> gcc/doc/extend.texi | 7 ++-
>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
>>> index e426a2eb7d..6b4e94b6a1 100644
>>> --- a/gcc/doc/extend.texi
>>> +++ b/gcc/doc/extend.texi
>>> @@ -12026,7 +12026,12 @@ elements in the operand.
>>> It is possible to use shifting operators @code{<<}, @code{>>} on
>>> integer-type vectors. The operation is defined as following: @code{@{a0,
>>> a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
>>> -@dots{}, an >> bn@}}@. Vector operands must have the same number of
>>> +@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
>>> +element-wise shifts are performed as if operands underwent C integer
>>> +promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
>>> +well-defined for vectors with @code{char} and @code{short} base types.
>>> +
>>> +Operands of binary vector operations must have the same number of
>>> elements.
>>> For convenience, it is allowed to use a binary vector operation
>>> --
>>> 2.39.2


Re: [PATCH] libstdc++: Fix type of first argument to vec_cntm call

2023-05-24 Thread Jonathan Wakely via Gcc-patches
On Wed, 24 May 2023 at 16:06, Matthias Kretz via Libstdc++ <
libstd...@gcc.gnu.org> wrote:

> OK for master and backports? (also a long-standing bug that didn't surface
> until the new constexpr test was added)
>

OK for all


>
> tested on powerpc64le-linux-gnu
>
> - 8< -
>
> Signed-off-by: Matthias Kretz 
>
> libstdc++-v3/ChangeLog:
>
> PR libstdc++/109949
> * include/experimental/bits/simd.h (__intrinsic_type): If
> __ALTIVEC__ is defined, map gnu::vector_size types to their
> corresponding __vector T types without losing unsignedness of
> integer types. Also prefer long long over long.
> * include/experimental/bits/simd_ppc.h (_S_popcount): Cast mask
> object to the expected unsigned vector type.
> ---
>  libstdc++-v3/include/experimental/bits/simd.h | 39 ---
>  .../include/experimental/bits/simd_ppc.h  |  3 +-
>  2 files changed, 36 insertions(+), 6 deletions(-)
>
>
> --
> ──
>  Dr. Matthias Kretz   https://mattkretz.github.io
>  GSI Helmholtz Centre for Heavy Ion Research   https://gsi.de
>  stdₓ::simd
> ──


Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Hi, for the first piece of code, I tried:
  unsigned int nitems_per_iter
= dest_rgm->max_nscalars_per_iter * dest_rgm->factor;
  step = gimple_build (seq, MULT_EXPR, iv_type, step,
   build_int_cst (iv_type, nitems_per_iter));

Then optimized IR:
loop_len_34 = MIN_EXPR ;
  _74 = loop_len_34 * 4;
  loop_len_51 = _74 + 18446744073709551604;

  _16 = (void *) ivtmp.27_41;
  _17 =   [(short int *)_16];

  vect__1.7_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);

  vect__2.8_23 = VIEW_CONVERT_EXPR(vect__1.7_33);
  vect__3.9_22 = vect__2.8_23 + { 1, 2, 1, 2, 1, 2, 1, 2 };
  vect__4.10_21 = VIEW_CONVERT_EXPR(vect__3.9_22);
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.10_21, 0);
  _20 = (void *) ivtmp.28_1;
  _31 =   [(int *)_20];

  vect__10.15_52 = .LEN_LOAD (_31, 32B, 4, 0);

  _30 = (void *) ivtmp.31_4;
  _29 =   [(int *)_30];

  vect__10.16_54 = .LEN_LOAD (_29, 32B, 4, 0);

  _26 = (void *) ivtmp.32_8;
  _25 =   [(int *)_26];

  vect__10.17_56 = .LEN_LOAD (_25, 32B, 4, 0);

  _79 = (void *) ivtmp.33_12;
  _80 =   [(int *)_79];

  vect__10.18_58 = .LEN_LOAD (_80, 32B, loop_len_51, 0);

Is it correct?  It looks weird to me.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-25 00:00
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Oh. I see. Thank you so much for pointing this.
> Could you tell me what I should do in the codes?
> It seems that I should adjust it in 
> vect_adjust_loop_lens_control
>
> muliply by some factor ? Is this correct multiply by max_nscalars_per_iter
> ?
 
max_nscalars_per_iter * factor rather than just max_nscalars_per_iter
 
Note that it's possible for later max_nscalars_per_iter * factor to
be smaller, so a division might be needed in rare cases.  E.g.:
 
uint64_t x[100];
uint16_t y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
where y has a single-control rgroup with max_nscalars_per_iter == 4
and x has a 2-control rgroup with max_nscalars_per_iter == 2
 
What gives the best code in these cases?  Is emitting a multiplication
better?  Or is using a new IV better?
 
Thanks,
Richard
 


Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> Oh. I see. Thank you so much for pointing this.
> Could you tell me what I should do in the codes?
> It seems that I should adjust it in 
> vect_adjust_loop_lens_control
>
> muliply by some factor ? Is this correct multiply by max_nscalars_per_iter
> ?

max_nscalars_per_iter * factor rather than just max_nscalars_per_iter

Note that it's possible for later max_nscalars_per_iter * factor to
be smaller, so a division might be needed in rare cases.  E.g.:

uint64_t x[100];
uint16_t y[200];

void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}

where y has a single-control rgroup with max_nscalars_per_iter == 4
and x has a 2-control rgroup with max_nscalars_per_iter == 2

What gives the best code in these cases?  Is emitting a multiplication
better?  Or is using a new IV better?

Thanks,
Richard


Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Oh, I see. Thank you so much for pointing this out.
Could you tell me what I should do in the code?
It seems that I should adjust it in
vect_adjust_loop_lens_control

multiply by some factor?  Is it correct to multiply by max_nscalars_per_iter?
Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 23:47
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Hi, Richard. I still don't understand it. Sorry about that.
>
>>>  loop_len_48 = MIN_EXPR ;
>   >>   _74 = loop_len_34 * 2 - loop_len_48;
>
> I have the tests already tested.
> We have a MIN_EXPR to calculate the total elements:
> loop_len_34 = MIN_EXPR ;
> I think "8" is already multiplied by 2?
>
> Why do we need loop_len_34 * 2 ?
> Could you give me more informations, The similiar tests you present we 
> already have
> execution check and passed. I am not sure whether this patch has the issue 
> that I didn't notice.
 
Think about the maximum values of each SSA name:
 
   loop_len_34 = MIN_EXPR ;   // MAX 8
   loop_len_48 = MIN_EXPR ;// MAX 4
   _74 = loop_len_34 - loop_len_48;// MAX 4
   loop_len_49 = MIN_EXPR <_74, 4>;// MAX 4 (always == _74)
   _75 = _74 - loop_len_49;// 0
   loop_len_50 = MIN_EXPR <_75, 4>;// 0
   loop_len_51 = _75 - loop_len_50;// 0
 
So the final two y vectors will always have 0 controls.
 
Thanks,
Richard
 


Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> Hi, Richard. I still don't understand it. Sorry about that.
>
>>>  loop_len_48 = MIN_EXPR ;
>   >>   _74 = loop_len_34 * 2 - loop_len_48;
>
> I have the tests already tested.
> We have a MIN_EXPR to calculate the total elements:
> loop_len_34 = MIN_EXPR ;
> I think "8" is already multiplied by 2?
>
> Why do we need loop_len_34 * 2 ?
> Could you give me more informations, The similiar tests you present we 
> already have
> execution check and passed. I am not sure whether this patch has the issue 
> that I didn't notice.

Think about the maximum values of each SSA name:

   loop_len_34 = MIN_EXPR ;   // MAX 8
   loop_len_48 = MIN_EXPR ;// MAX 4
   _74 = loop_len_34 - loop_len_48;// MAX 4
   loop_len_49 = MIN_EXPR <_74, 4>;// MAX 4 (always == _74)
   _75 = _74 - loop_len_49;// 0
   loop_len_50 = MIN_EXPR <_75, 4>;// 0
   loop_len_51 = _75 - loop_len_50;// 0

So the final two y vectors will always have 0 controls.

Thanks,
Richard


Re: [patch]: Implement PR104327 for avr

2023-05-24 Thread Georg-Johann Lay




Am 24.05.23 um 11:38 schrieb Richard Biener:

On Tue, May 23, 2023 at 2:56 PM Georg-Johann Lay  wrote:


PR target/104327 not only affects s390 but also avr:
The avr backend pre-sets some options depending on optimization level.
The inliner then thinks that always_inline functions are not eligible
for inlining and terminates with an error.

Proposing the following patch that implements TARGET_CAN_INLINE_P.

Ok to apply?

Johann

--

target/104327: Allow more inlining between different optimization levels.

avr-common.cc introduces the following options that are set depending
on optimization level: -mgas-isr-prologues, -mmain-is-OS-task and
-fsplit-wide-types-early.  The inliner thinks that different options
disallow cross-optimization inlining, so provide can_inline_p.

gcc/
 PR target/104327
 * config/avr/avr.cc (avr_can_inline_p): New static function.
 (TARGET_CAN_INLINE_P): Define to that function.
diff --git a/gcc/config/avr/avr.cc b/gcc/config/avr/avr.cc
index 9fa50ca230d..55b48f63865 100644
--- a/gcc/config/avr/avr.cc
+++ b/gcc/config/avr/avr.cc
@@ -1018,6 +1018,22 @@ avr_no_gccisr_function_p (tree func)
 return avr_lookup_function_attribute1 (func, "no_gccisr");
   }

+
+/* Implement `TARGET_CAN_INLINE_P'.  */
+/* Some options like -mgas_isr_prologues depend on optimization level,
+   and the inliner might think that due to different options, inlining
+   is not permitted; see PR104327.  */
+
+static bool
+avr_can_inline_p (tree /* caller */, tree callee)
+{
+  // For now, don't allow inlining of ISRs.  If the user actually wants
+  // to inline ISR code, they have to turn the body of the ISR into an
+  // ordinary function.
+
+  return ! avr_interrupt_function_p (callee);


I'm not sure if AVR has ISA extensions but the above will likely break
things like

void __attribute__((target("-mX"))) foo () { asm ("isa X opcode");
stmt-that-generates-X-ISA; }


This yields

warning: target attribute is not supported on this machine [-Wattributes]

avr has -mmcu= target options, but switching them in mid-air
won't work because the file prologue might already be different
and incompatible across different architectures.  And I never
saw any user requesting such a thing, and I can't imagine
any reasonable use case...  If the warning is not strong enough,
maybe it can be turned into an error, but -Wattributes is not
specific enough for that.


void bar ()
{
   if (cpu-has-X)
 foo ();
}

if always-inlines are the concern you can use

   bool always_inline
 = (DECL_DISREGARD_INLINE_LIMITS (callee)
&& lookup_attribute ("always_inline",
 DECL_ATTRIBUTES (callee)));
   /* Do what the user says.  */
   if (always_inline)
 return true;

   return default_target_can_inline_p (caller, callee);


The default implementation of can_inline_p worked fine for avr.
As far as I understand, the new behavior is due to clean-up
of global states for options?

So I need to take inlining costs into account and decide based on that
whether it's preferable to inline a function or not?

Johann
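
For illustration, Richard's suggestion above could be folded into the hook
roughly like this (a sketch only, assuming default_target_can_inline_p from
targhooks.h is the intended fallback):

  /* Sketch: keep the ISR restriction, honour always_inline, and defer
     everything else to the default answer.  */
  static bool
  avr_can_inline_p (tree caller, tree callee)
  {
    if (avr_interrupt_function_p (callee))
      return false;

    if (DECL_DISREGARD_INLINE_LIMITS (callee)
        && lookup_attribute ("always_inline", DECL_ATTRIBUTES (callee)))
      return true;

    return default_target_can_inline_p (caller, callee);
  }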


+}
+
   /* Implement `TARGET_SET_CURRENT_FUNCTION'.  */
   /* Sanity cheching for above function attributes.  */

@@ -14713,6 +14729,9 @@ avr_float_lib_compare_returns_bool (machine_mode
mode, enum rtx_code)
   #undef  TARGET_MD_ASM_ADJUST
   #define TARGET_MD_ASM_ADJUST avr_md_asm_adjust

+#undef  TARGET_CAN_INLINE_P
+#define TARGET_CAN_INLINE_P avr_can_inline_p
+
   struct gcc_target targetm = TARGET_INITIALIZER;


Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
 
Hi, Richard. I still don't understand it. Sorry about that.

>>  loop_len_48 = MIN_EXPR ;
  >>   _74 = loop_len_34 * 2 - loop_len_48;

I have already run the tests.
We have a MIN_EXPR to calculate the total elements:
loop_len_34 = MIN_EXPR ;
I think "8" is already multiplied by 2?

Why do we need loop_len_34 * 2?
Could you give me more information?  The similar tests you present we have
already run with execution checks, and they passed.  I am not sure whether
this patch has an issue that I didn't notice.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 23:31
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
> Hi, the .optimized dump is like this:
>
>[local count: 21045336]:
>   ivtmp.26_36 = (unsigned long) 
>   ivtmp.27_3 = (unsigned long) 
>   ivtmp.30_6 = (unsigned long)   [(void *) + 16B];
>   ivtmp.31_10 = (unsigned long)   [(void *) + 32B];
>   ivtmp.32_14 = (unsigned long)   [(void *) + 48B];
>
>[local count: 273589366]:
>   # ivtmp_72 = PHI 
>   # ivtmp.26_41 = PHI 
>   # ivtmp.27_1 = PHI 
>   # ivtmp.30_4 = PHI 
>   # ivtmp.31_8 = PHI 
>   # ivtmp.32_12 = PHI 
>   loop_len_34 = MIN_EXPR ;
>   loop_len_48 = MIN_EXPR ;
>   _74 = loop_len_34 - loop_len_48;
 
Yeah, I think this needs to be:
 
  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 * 2 - loop_len_48;
  
(as valid gimple).  The point is that...
 
>   loop_len_49 = MIN_EXPR <_74, 4>;
>   _75 = _74 - loop_len_49;
>   loop_len_50 = MIN_EXPR <_75, 4>;
>   loop_len_51 = _75 - loop_len_50;
 
...there are 4 lengths capped to 4, for a total element count of 16.
But loop_len_34 is never greater than 8.
 
So for this case we either need to multiply, or we need to create
a fresh IV for the second rgroup.  Both approaches are fine.
 
Thanks,
Richard
 


Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Hi, Richard.

After analyzing it, I think it can work.
Let's take a look at the code:

void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}

For "x", each scalar iteration calculate 2 elements (x[i + 0] and x[i + 1])
For "y", each scalar iteration calculate 4 elements (y[i + 0] and y[i + 1] and 
y[j + 2] and y[j + 3)
With this patch:

loop_len_34 = MIN_EXPR ;
The total number of "x" vector elements per iteration is at most 8, which is
128 bits (8 16-bit elements).
So the vector can process "4" scalar iterations (of x[i + 0] and x[i + 1]).
So there is a len_load: vect__1.6_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);

Since the INT16 ("x") part covers "4" scalar iterations, the INT32 ("y") part
also covers 4 scalar iterations, each processing 4 scalar elements
(y[j + 0], y[j + 1], y[j + 2] and y[j + 3]).

So you can see 4 vector operations of y:
 vect__11.18_59 = vect__10.14_52 + { 1, 2, 3, 4 };
  vect__11.18_60 = vect__10.15_54 + { 1, 2, 3, 4 };
  vect__11.18_61 = vect__10.16_56 + { 1, 2, 3, 4 };
  vect__11.18_62 = vect__10.17_58 + { 1, 2, 3, 4 };
  .LEN_STORE (_31, 32B, loop_len_48, vect__11.18_59, 0);
  .LEN_STORE (_29, 32B, loop_len_49, vect__11.18_60, 0);
  .LEN_STORE (_25, 32B, loop_len_50, vect__11.18_61, 0);
  .LEN_STORE (_79, 32B, loop_len_51, vect__11.18_62, 0);

So each vector loop iteration has 1 "x" group (4 * 2 elements = 8 elements)
and 4 "y" groups (4 * 4 = 16 elements).

And we adjust loop len for each control of y:
loop_len_34 = MIN_EXPR ;
  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 - loop_len_48;
  loop_len_49 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_49;
  loop_len_50 = MIN_EXPR <_75, 4>;
  loop_len_51 = _75 - loop_len_50;

It seems to work.  I wonder why we need the multiplication?

Thanks.


juzhe.zh...@rivai.ai
 
From: 钟居哲
Date: 2023-05-24 23:13
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by
variable amount support
Hi, the .optimized dump is like this:

   [local count: 21045336]:
  ivtmp.26_36 = (unsigned long) 
  ivtmp.27_3 = (unsigned long) 
  ivtmp.30_6 = (unsigned long)   [(void *) + 16B];
  ivtmp.31_10 = (unsigned long)   [(void *) + 32B];
  ivtmp.32_14 = (unsigned long)   [(void *) + 48B];

   [local count: 273589366]:
  # ivtmp_72 = PHI 
  # ivtmp.26_41 = PHI 
  # ivtmp.27_1 = PHI 
  # ivtmp.30_4 = PHI 
  # ivtmp.31_8 = PHI 
  # ivtmp.32_12 = PHI 
  loop_len_34 = MIN_EXPR ;
  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 - loop_len_48;
  loop_len_49 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_49;
  loop_len_50 = MIN_EXPR <_75, 4>;
  loop_len_51 = _75 - loop_len_50;
  _16 = (void *) ivtmp.26_41;
  _17 =   [(short int *)_16];
  vect__1.6_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);
  vect__2.7_23 = VIEW_CONVERT_EXPR(vect__1.6_33);
  vect__3.8_22 = vect__2.7_23 + { 1, 2, 1, 2, 1, 2, 1, 2 };
  vect__4.9_21 = VIEW_CONVERT_EXPR(vect__3.8_22);
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.9_21, 0);
  _20 = (void *) ivtmp.27_1;
  _31 =   [(int *)_20];
  vect__10.14_52 = .LEN_LOAD (_31, 32B, loop_len_48, 0);
  _30 = (void *) ivtmp.30_4;
  _29 =   [(int *)_30];
  vect__10.15_54 = .LEN_LOAD (_29, 32B, loop_len_49, 0);
  _26 = (void *) ivtmp.31_8;
  _25 =   [(int *)_26];
  vect__10.16_56 = .LEN_LOAD (_25, 32B, loop_len_50, 0);
  _78 = (void *) ivtmp.32_12;
  _79 =   [(int *)_78];
  vect__10.17_58 = .LEN_LOAD (_79, 32B, loop_len_51, 0);
  vect__11.18_59 = vect__10.14_52 + { 1, 2, 3, 4 };
  vect__11.18_60 = vect__10.15_54 + { 1, 2, 3, 4 };
  vect__11.18_61 = vect__10.16_56 + { 1, 2, 3, 4 };
  vect__11.18_62 = vect__10.17_58 + { 1, 2, 3, 4 };
  .LEN_STORE (_31, 32B, loop_len_48, vect__11.18_59, 0);
  .LEN_STORE (_29, 32B, loop_len_49, vect__11.18_60, 0);
  .LEN_STORE (_25, 32B, loop_len_50, vect__11.18_61, 0);
  .LEN_STORE (_79, 32B, loop_len_51, vect__11.18_62, 0);
  ivtmp_73 = ivtmp_72 - loop_len_34;
  ivtmp.26_37 = ivtmp.26_41 + 16;
  ivtmp.27_2 = ivtmp.27_1 + 64;
  ivtmp.30_5 = ivtmp.30_4 + 64;
  ivtmp.31_9 = ivtmp.31_8 + 64;
  ivtmp.32_13 = ivtmp.32_12 + 64;
  if (ivtmp_73 != 0)
goto ; [92.31%]
  else
goto ; [7.69%]

I am still checking it, but I am sending it to you early.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 23:07
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
Thanks for trying it.  I'm still surprised that no multiplication
is needed though.  Does the patch work for:
 
short x[100];
int y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
?  Here, there should be a single-control rgroup for x, counting
2 units per scalar iteration.  I'd expect the IV to use this scale.
 
There should also be a 4-control rgroup for y, counting 4 units per
scalar iteration.  So I 

Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
> Hi, the .optimized dump is like this:
>
>[local count: 21045336]:
>   ivtmp.26_36 = (unsigned long) 
>   ivtmp.27_3 = (unsigned long) 
>   ivtmp.30_6 = (unsigned long)   [(void *) + 16B];
>   ivtmp.31_10 = (unsigned long)   [(void *) + 32B];
>   ivtmp.32_14 = (unsigned long)   [(void *) + 48B];
>
>[local count: 273589366]:
>   # ivtmp_72 = PHI 
>   # ivtmp.26_41 = PHI 
>   # ivtmp.27_1 = PHI 
>   # ivtmp.30_4 = PHI 
>   # ivtmp.31_8 = PHI 
>   # ivtmp.32_12 = PHI 
>   loop_len_34 = MIN_EXPR ;
>   loop_len_48 = MIN_EXPR ;
>   _74 = loop_len_34 - loop_len_48;

Yeah, I think this needs to be:

  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 * 2 - loop_len_48;
  
(as valid gimple).  The point is that...

>   loop_len_49 = MIN_EXPR <_74, 4>;
>   _75 = _74 - loop_len_49;
>   loop_len_50 = MIN_EXPR <_75, 4>;
>   loop_len_51 = _75 - loop_len_50;

...there are 4 lengths capped to 4, for a total element count of 16.
But loop_len_34 is never greater than 8.

So for this case we either need to multiply, or we need to create
a fresh IV for the second rgroup.  Both approaches are fine.

Thanks,
Richard


Re: [PATCH v2] rs6000: Add buildin for mffscrn instructions

2023-05-24 Thread Carl Love via Gcc-patches
On Wed, 2023-05-24 at 13:32 +0800, Kewen.Lin wrote:
> on 2023/5/24 06:30, Peter Bergner wrote:
> > On 5/23/23 12:24 AM, Kewen.Lin wrote:
> > > on 2023/5/23 01:31, Carl Love wrote:
> > > > The builtins were requested for use in GLibC.  As of version
> > > > 2.31 they
> > > > were added as inline asm.  They requested a builtin so the asm
> > > > could be
> > > > removed.
> > > 
> > > So IMHO we also want the similar support for mffscrn, that is to
> > > make
> > > use of mffscrn and mffscrni on Power9 and later, but falls back
> > > to 
> > > __builtin_set_fpscr_rn + mffs similar on older platforms.
> > 
> > So __builtin_set_fpscr_rn everything we want (sets the RN bits) and
> > uses mffscrn/mffscrni on P9 and later and uses older insns on pre-
> > P9.
> > The only problem is we don't return the current FPSCR bits, as the
> > bif
> > is defined to return void.
> 
> Yes.
> 
> > Crazy idea, but could we extend the built-in
> > with an overload that returns the FPSCR bits?  
> 
> So you agree that we should make this proposed new bif handle pre-P9
> just
> like some other existing bifs. :)  I think extending it is good and
> doable,
> but the only concern here is the bif name "__builtin_set_fpscr_rn",
> which
> matches the existing behavior (only set rounding) but doesn't match
> the
> proposed extending behavior (set rounding and get some env bits
> back).
> Maybe it's not a big deal if the documentation clarify it well.

Extending the builtin to pre-Power9 is straightforward, and I agree it
would make good sense to do.

I am a bit concerned about how to extend __builtin_set_fpscr_rn to add the
new functionality.  Peter suggests overloading the builtin to either
return void or return the FPSCR bits.  It is my understanding that the
return value for a given builtin has to be the same, i.e. you can't
overload the return value.  Maybe you can with Bill's new
infrastructure?  I recall having problems trying to overload the return
value in the past, and Bill said you couldn't do it.  I'll play with this
and see if I can overload the return value.
> 
> 
> > To be honest, I like
> > the __builtin_set_fpscr_rn name better than __builtin_mffscrn[i].
> 
> +1
> 
> BR,
> Kewen
> 
> > The built-in machinery can see that the usage is expecting a return
> > value
> > or not and for the pre-P9 code, can skip generating the ending mffs
> > if
> > we don't want the return value.
> > 
> > Peter
> > 
> > 



Re: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Hi, the .optimized dump is like this:

   [local count: 21045336]:
  ivtmp.26_36 = (unsigned long) 
  ivtmp.27_3 = (unsigned long) 
  ivtmp.30_6 = (unsigned long)   [(void *) + 16B];
  ivtmp.31_10 = (unsigned long)   [(void *) + 32B];
  ivtmp.32_14 = (unsigned long)   [(void *) + 48B];

   [local count: 273589366]:
  # ivtmp_72 = PHI 
  # ivtmp.26_41 = PHI 
  # ivtmp.27_1 = PHI 
  # ivtmp.30_4 = PHI 
  # ivtmp.31_8 = PHI 
  # ivtmp.32_12 = PHI 
  loop_len_34 = MIN_EXPR ;
  loop_len_48 = MIN_EXPR ;
  _74 = loop_len_34 - loop_len_48;
  loop_len_49 = MIN_EXPR <_74, 4>;
  _75 = _74 - loop_len_49;
  loop_len_50 = MIN_EXPR <_75, 4>;
  loop_len_51 = _75 - loop_len_50;
  _16 = (void *) ivtmp.26_41;
  _17 =   [(short int *)_16];
  vect__1.6_33 = .LEN_LOAD (_17, 16B, loop_len_34, 0);
  vect__2.7_23 = VIEW_CONVERT_EXPR(vect__1.6_33);
  vect__3.8_22 = vect__2.7_23 + { 1, 2, 1, 2, 1, 2, 1, 2 };
  vect__4.9_21 = VIEW_CONVERT_EXPR(vect__3.8_22);
  .LEN_STORE (_17, 16B, loop_len_34, vect__4.9_21, 0);
  _20 = (void *) ivtmp.27_1;
  _31 =   [(int *)_20];
  vect__10.14_52 = .LEN_LOAD (_31, 32B, loop_len_48, 0);
  _30 = (void *) ivtmp.30_4;
  _29 =   [(int *)_30];
  vect__10.15_54 = .LEN_LOAD (_29, 32B, loop_len_49, 0);
  _26 = (void *) ivtmp.31_8;
  _25 =   [(int *)_26];
  vect__10.16_56 = .LEN_LOAD (_25, 32B, loop_len_50, 0);
  _78 = (void *) ivtmp.32_12;
  _79 =   [(int *)_78];
  vect__10.17_58 = .LEN_LOAD (_79, 32B, loop_len_51, 0);
  vect__11.18_59 = vect__10.14_52 + { 1, 2, 3, 4 };
  vect__11.18_60 = vect__10.15_54 + { 1, 2, 3, 4 };
  vect__11.18_61 = vect__10.16_56 + { 1, 2, 3, 4 };
  vect__11.18_62 = vect__10.17_58 + { 1, 2, 3, 4 };
  .LEN_STORE (_31, 32B, loop_len_48, vect__11.18_59, 0);
  .LEN_STORE (_29, 32B, loop_len_49, vect__11.18_60, 0);
  .LEN_STORE (_25, 32B, loop_len_50, vect__11.18_61, 0);
  .LEN_STORE (_79, 32B, loop_len_51, vect__11.18_62, 0);
  ivtmp_73 = ivtmp_72 - loop_len_34;
  ivtmp.26_37 = ivtmp.26_41 + 16;
  ivtmp.27_2 = ivtmp.27_1 + 64;
  ivtmp.30_5 = ivtmp.30_4 + 64;
  ivtmp.31_9 = ivtmp.31_8 + 64;
  ivtmp.32_13 = ivtmp.32_12 + 64;
  if (ivtmp_73 != 0)
goto ; [92.31%]
  else
goto ; [7.69%]

I am still checking it, but I am sending it to you early.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 23:07
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V14] VECT: Add decrement IV iteration loop control by 
variable amount support
Thanks for trying it.  I'm still surprised that no multiplication
is needed though.  Does the patch work for:
 
short x[100];
int y[200];
 
void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}
 
?  Here, there should be a single-control rgroup for x, counting
2 units per scalar iteration.  I'd expect the IV to use this scale.
 
There should also be a 4-control rgroup for y, counting 4 units per
scalar iteration.  So I think the IV would need to be multiplied by 2
before being used for the y rgroup.
 
Thanks,
Richard
 
juzhe.zh...@rivai.ai writes:
> From: Ju-Zhe Zhong 
>
> This patch is supporting decrement IV by following the flow designed by 
> Richard:
>
> (1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
> call vect_set_loop_controls_directly.
>
> (2) vect_set_loop_controls_directly calculates "step" as in your patch.
> If rgc has 1 control, this step is the SSA name created for that control.
> Otherwise the step is a fresh SSA name, as in your patch.
>
> (3) vect_set_loop_controls_directly stores this step somewhere for later
> use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.
>
> (4) After the vect_set_loop_controls_directly call above, and outside
> the "if" statement that now contains vect_set_loop_controls_directly,
> check whether rgc->controls.length () > 1.  If so, use
> vect_adjust_loop_lens_control to set the controls based on S.
>
> Then the only caller of vect_adjust_loop_lens_control is
> vect_set_loop_condition_partial_vectors.  And the starting
> step for vect_adjust_loop_lens_control is always S.
>
> This patch has well tested for single-rgroup and multiple-rgroup (SLP) and
> passed all testcase in RISC-V port.
>
> Also, pass tests for multiple-rgroup (non-SLP) tested on vec_pack_trunk.
>
> ---
>  gcc/tree-vect-loop-manip.cc | 178 +---
>  gcc/tree-vect-loop.cc   |  13 +++
>  gcc/tree-vectorizer.h   |  12 +++
>  3 files changed, 192 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..578ac5b783e 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>gimple_stmt_iterator incr_gsi;
>bool insert_after;
>standard_iv_increment_position (loop, _gsi, _after);
> +  if 

Re: [PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Sandiford via Gcc-patches
Thanks for trying it.  I'm still surprised that no multiplication
is needed though.  Does the patch work for:

short x[100];
int y[200];

void f() {
  for (int i = 0, j = 0; i < 100; i += 2, j += 4) {
x[i + 0] += 1;
x[i + 1] += 2;
y[j + 0] += 1;
y[j + 1] += 2;
y[j + 2] += 3;
y[j + 3] += 4;
  }
}

?  Here, there should be a single-control rgroup for x, counting
2 units per scalar iteration.  I'd expect the IV to use this scale.

There should also be a 4-control rgroup for y, counting 4 units per
scalar iteration.  So I think the IV would need to be multiplied by 2
before being used for the y rgroup.

Thanks,
Richard

juzhe.zh...@rivai.ai writes:
> From: Ju-Zhe Zhong 
>
> This patch is supporting decrement IV by following the flow designed by 
> Richard:
>
> (1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
> call vect_set_loop_controls_directly.
>
> (2) vect_set_loop_controls_directly calculates "step" as in your patch.
> If rgc has 1 control, this step is the SSA name created for that control.
> Otherwise the step is a fresh SSA name, as in your patch.
>
> (3) vect_set_loop_controls_directly stores this step somewhere for later
> use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.
>
> (4) After the vect_set_loop_controls_directly call above, and outside
> the "if" statement that now contains vect_set_loop_controls_directly,
> check whether rgc->controls.length () > 1.  If so, use
> vect_adjust_loop_lens_control to set the controls based on S.
>
> Then the only caller of vect_adjust_loop_lens_control is
> vect_set_loop_condition_partial_vectors.  And the starting
> step for vect_adjust_loop_lens_control is always S.
>
> This patch has well tested for single-rgroup and multiple-rgroup (SLP) and
> passed all testcase in RISC-V port.
>
> Also, pass tests for multiple-rgroup (non-SLP) tested on vec_pack_trunk.
>
> ---
>  gcc/tree-vect-loop-manip.cc | 178 +---
>  gcc/tree-vect-loop.cc   |  13 +++
>  gcc/tree-vectorizer.h   |  12 +++
>  3 files changed, 192 insertions(+), 11 deletions(-)
>
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..578ac5b783e 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>gimple_stmt_iterator incr_gsi;
>bool insert_after;
>standard_iv_increment_position (loop, _gsi, _after);
> +  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +{
> +  /* single rgroup:
> +  ...
> +  _10 = (unsigned long) count_12(D);
> +  ...
> +  # ivtmp_9 = PHI 
> +  _36 = MIN_EXPR ;
> +  ...
> +  vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
> +  ...
> +  ivtmp_35 = ivtmp_9 - _36;
> +  ...
> +  if (ivtmp_35 != 0)
> +goto ; [83.33%]
> +  else
> +goto ; [16.67%]
> +  */
> +  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
> +  tree step = rgc->controls.length () == 1 ? rgc->controls[0]
> +: make_ssa_name (iv_type);
> +  /* Create decrement IV.  */
> +  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, _gsi,
> +  insert_after, _before_incr, _after_incr);
> +  gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
> + index_before_incr,
> + nitems_step));
> +  LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
> +  return index_after_incr;
> +}
> +
> +  /* Create increment IV.  */
>create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
>loop, _gsi, insert_after, _before_incr,
>_after_incr);
> @@ -683,6 +715,63 @@ vect_set_loop_controls_directly (class loop *loop, 
> loop_vec_info loop_vinfo,
>return next_ctrl;
>  }
>  
> +/* Try to use adjust loop lens for multiple-rgroups.
> +
> + _36 = MIN_EXPR ;
> +
> + First length (MIN (X, VF/N)):
> +   loop_len_15 = MIN_EXPR <_36, VF/N>;
> +
> + Second length:
> +   tmp = _36 - loop_len_15;
> +   loop_len_16 = MIN (tmp, VF/N);
> +
> + Third length:
> +   tmp2 = tmp - loop_len_16;
> +   loop_len_17 = MIN (tmp2, VF/N);
> +
> + Last length:
> +   loop_len_18 = tmp2 - loop_len_17;
> +*/
> +
> +static void
> +vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
> +rgroup_controls *dest_rgm, tree step)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> += TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
> +
> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +{
> +  tree ctrl = dest_rgm->controls[i];
> +   

[PATCH] libstdc++: Fix type of first argument to vec_cntm call

2023-05-24 Thread Matthias Kretz via Gcc-patches
OK for master and backports? (also a long-standing bug that didn't surface 
until the new constexpr test was added)

tested on powerpc64le-linux-gnu

- 8< -

Signed-off-by: Matthias Kretz 

libstdc++-v3/ChangeLog:

PR libstdc++/109949
* include/experimental/bits/simd.h (__intrinsic_type): If
__ALTIVEC__ is defined, map gnu::vector_size types to their
corresponding __vector T types without losing unsignedness of
integer types. Also prefer long long over long.
* include/experimental/bits/simd_ppc.h (_S_popcount): Cast mask
object to the expected unsigned vector type.
---
 libstdc++-v3/include/experimental/bits/simd.h | 39 ---
 .../include/experimental/bits/simd_ppc.h  |  3 +-
 2 files changed, 36 insertions(+), 6 deletions(-)


--
──
 Dr. Matthias Kretz   https://mattkretz.github.io
 GSI Helmholtz Centre for Heavy Ion Research   https://gsi.de
 stdₓ::simd
──
diff --git a/libstdc++-v3/include/experimental/bits/simd.h b/libstdc++-v3/include/experimental/bits/simd.h
index d1f388310f9..26f08f83ab0 100644
--- a/libstdc++-v3/include/experimental/bits/simd.h
+++ b/libstdc++-v3/include/experimental/bits/simd.h
@@ -2466,11 +2466,40 @@ struct __intrinsic_type<_Tp, _Bytes, enable_if_t<__is_vectorizable_v<_Tp> && _By
 		  "no __intrinsic_type support for 64-bit floating point on PowerPC w/o VSX");
 #endif
 
-using type =
-  typename __intrinsic_type_impl<
-		 conditional_t,
-			   conditional_t<_S_is_ldouble, double, _Tp>,
-			   __int_for_sizeof_t<_Tp>>>::type;
+static constexpr auto __element_type()
+{
+  if constexpr (is_floating_point_v<_Tp>)
+	{
+	  if constexpr (_S_is_ldouble)
+	return double {};
+	  else
+	return _Tp {};
+	}
+  else if constexpr (is_signed_v<_Tp>)
+	{
+	  if constexpr (sizeof(_Tp) == sizeof(_SChar))
+	return _SChar {};
+	  else if constexpr (sizeof(_Tp) == sizeof(short))
+	return short {};
+	  else if constexpr (sizeof(_Tp) == sizeof(int))
+	return int {};
+	  else if constexpr (sizeof(_Tp) == sizeof(_LLong))
+	return _LLong {};
+	}
+  else
+	{
+	  if constexpr (sizeof(_Tp) == sizeof(_UChar))
+	return _UChar {};
+	  else if constexpr (sizeof(_Tp) == sizeof(_UShort))
+	return _UShort {};
+	  else if constexpr (sizeof(_Tp) == sizeof(_UInt))
+	return _UInt {};
+	  else if constexpr (sizeof(_Tp) == sizeof(_ULLong))
+	return _ULLong {};
+	}
+}
+
+using type = typename __intrinsic_type_impl::type;
   };
 #endif // __ALTIVEC__
 
diff --git a/libstdc++-v3/include/experimental/bits/simd_ppc.h b/libstdc++-v3/include/experimental/bits/simd_ppc.h
index eca1b34241b..2ea7234bd99 100644
--- a/libstdc++-v3/include/experimental/bits/simd_ppc.h
+++ b/libstdc++-v3/include/experimental/bits/simd_ppc.h
@@ -130,7 +130,8 @@ _S_popcount(simd_mask<_Tp, _Abi> __k)
 	const auto __kv = __as_vector(__k);
 	if constexpr (__have_power10vec)
 	  {
-	return vec_cntm(__to_intrin(__kv), 1);
+	using _Intrin = __intrinsic_type16_t>>;
+	return vec_cntm(reinterpret_cast<_Intrin>(__kv), 1);
 	  }
 	else if constexpr (sizeof(_Tp) >= sizeof(int))
 	  {


Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Yeah, thanks. I have sent V14:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619478.html
in which I found there is no distinction between SLP and non-SLP.

Could you review it? I think it's more reasonable now.

Thanks.



juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:57
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> Both approaches are fine.  I'm not against one or the other.
>
>>> What I didn't understand was why your patch only reuses existing IVs
>>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>>> multiplication (well, really a shift left) when moving from one
>>> rgroup to another?  E.g. if one rgroup had;
>
>>>   nscalars_per_iter == 2 && factor == 1
>
>>> and another had:
>
>>>   nscalars_per_iter == 4 && factor == 1
>
>>> then we would need to mulitply by 2 when going from the first rgroup
>>> to the second.
>
>>> If so, avoiding a multiplication seems like a good reason for the choice
>>> you were making in the path.  But we then need to check
>>> max_nscalars_per_iter == 1 for both the source rgroup and the
>>> destination rgroup, not just the destination.  And I think the
>>> condition for “no multiplication needed” should be that:
>
> Oh, I didn't realize such complicated problem. Frankly, I didn't understand 
> well
> rgroup. Sorry about that :).
>
> I just remember last time you said I need to handle multiple-rgroup
> not only for SLP but also non-SLP (which is vec_pack_trunk that I tested).
> Then I asked you when is non-SLP, you said max_nscalars_per_iter == 1.
 
Yeah, max_nscalars_per_iter == 1 is the right way of checking for non-SLP.
 
But I've never been convinced that SLP vs. non-SLP is a meaningful
distinction for this patch (that is, the parts that don't use
SELECT_VL).
 
SLP vs. non-SLP matters for SELECT_VL.  But the rgroup abstraction
should mean that SLP vs. non-SLP doesn't matter otherwise.
 
Thanks,
Richard
 


Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
>>> Both approaches are fine.  I'm not against one or the other.
>
>>> What I didn't understand was why your patch only reuses existing IVs
>>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>>> multiplication (well, really a shift left) when moving from one
>>> rgroup to another?  E.g. if one rgroup had;
>
>>>   nscalars_per_iter == 2 && factor == 1
>
>>> and another had:
>
>>>   nscalars_per_iter == 4 && factor == 1
>
>>> then we would need to mulitply by 2 when going from the first rgroup
>>> to the second.
>
>>> If so, avoiding a multiplication seems like a good reason for the choice
>>> you were making in the path.  But we then need to check
>>> max_nscalars_per_iter == 1 for both the source rgroup and the
>>> destination rgroup, not just the destination.  And I think the
>>> condition for “no multiplication needed” should be that:
>
> Oh, I didn't realize such complicated problem. Frankly, I didn't understand 
> well
> rgroup. Sorry about that :).
>
> I just remember last time you said I need to handle multiple-rgroup
> not only for SLP but also non-SLP (which is vec_pack_trunk that I tested).
> Then I asked you when is non-SLP, you said max_nscalars_per_iter == 1.

Yeah, max_nscalars_per_iter == 1 is the right way of checking for non-SLP.

But I've never been convinced that SLP vs. non-SLP is a meaningful
distinction for this patch (that is, the parts that don't use
SELECT_VL).

SLP vs. non-SLP matters for SELECT_VL.  But the rgroup abstraction
should mean that SLP vs. non-SLP doesn't matter otherwise.

Thanks,
Richard


Re: [PATCH] LoongArch: Fix the problem of structure parameter passing in C++. This structure has empty structure members and less than three floating point members.

2023-05-24 Thread Xi Ruoyao via Gcc-patches
On Wed, 2023-05-24 at 18:07 +0800, Lulu Cheng wrote:
> 
> 在 2023/5/24 下午5:25, Xi Ruoyao 写道:
> > On Wed, 2023-05-24 at 16:47 +0800, Lulu Cheng wrote:
> > > 在 2023/5/24 下午2:45, Xi Ruoyao 写道:
> > > > On Wed, 2023-05-24 at 14:04 +0800, Lulu Cheng wrote:
> > > > > An empty struct type that is not non-trivial for the purposes of
> > > > > calls
> > > > > will be treated as though it were the following C type:
> > > > > 
> > > > > struct {
> > > > >     char c;
> > > > > };
> > > > > 
> > > > > Before this patch, a structure parameter containing an empty
> > > > > structure and fewer than three floating-point members was passed
> > > > > through one or two floating-point registers, while nested empty
> > > > > structures were ignored.  This did not conform to the calling
> > > > > convention.
> > > > No, it's a deliberate decision I've made in
> > > > https://gcc.gnu.org/r12-8294.  And we already agreed "the ABI needs
> > > > to
> > > > be updated" when we applied r12-8294, but I've never improved my
> > > > English
> > > > skill to revise the ABI myself :(.
> > > > 
> > > > We are also using the same "de-facto" ABI throwing away the empty
> > > > struct
> > > > for Clang++ (https://reviews.llvm.org/D132285).  So we should update
> > > > the
> > > > spec here, instead of changing every implementation.
> > > > 
> > > > The C++ standard treats the empty struct as size 1 for ensuring the
> > > > semantics of pointer comparison operations.  When we pass it through
> > > > the
> > > > registers, there is no need to really consider the empty field
> > > > because
> > > > there are no pointers to registers.
> > > > 
> > > I think that the rules for passing empty structures or nested empty
> > > structures as parameters should be unified,
> > There is no need to unify them because "passing a struct" is already
> > different from "passing its members one by one".  Say:
> > 
> > int f1(int a, int b);
> > 
> > and
> > 
> > int f2(struct {int a, b;} ab);
> > 
> > "a" and "b" are already passed differently.
> I mean that the empty structs in st1 and st2 should be treated the same
> way when passing parameters.
> > 
> > > but the current implementation in GCC is as follows (in C++).
> > >
> > > Comparing the two structures, the current behavior is:
> > > 
> > > struct st1
> > > {
> > >     struct empty {} e1;
> > >     long a;
> > >     long b;
> > > };
> > > 
> > > passed by reference.
> > > 
> > > 
> > > struct st2
> > > {
> > >     struct empty {} e1;
> > >     double f0;
> > >     double f1;
> > > };
> > > 
> > > passed through two floating-point registers.
> > Well this is nasty, but it is the same behavior as RISC-V:
> > https://godbolt.org/z/fEexq148r
> > 
> > I deliberately made our logic similar to RISC-V in r12-8294 because
> > "there seems no reason to do it differently".  Maybe I was wrong and we
> > should have ignored st1::e1 as well (but IIRC we were running out of
> > time for GCC 12 release so we didn't have time to consider this :( ).
> > 
> > But now it's better to "keep the current behavior as-is" because:
> > 
> > 1. The current behavior of GCC and Clang already matches, and the
> > behavior has been kept since day one of LoongArch support in GCC and Clang.  So
> > there is currently no ABI incompatibility in practice, but changing the
> > behavior will introduce an ABI incompatibility.
> 
> The parameter passing rules for a single empty structure are different
> in GCC and Clang.
> 
> eg:
> 
> void test (struct empty, int a);
> 
> In GCC, the empty structure is passed through $a0, and the variable a is 
> passed through $a1,
> 
> but Clang passes a through $a0, and the empty structure is ignored.
> 
> > 2. Changing the behavior will make the compiler more complex, and
> > slower.
> > 3. Changing the behavior will need a -Wpsabi warning according to the
> > GCC policy, leading to more boring code (and more slow-down) in the
> > compiler.
> 
> I really do understand, and thank you for your concerns; we have also
> considered the compatibility issue.
>
> Before making the modification, we did an assessment: our operating-system
> colleagues built a total of 3,300 basic Linux packages, and only one package
> was affected by this modification.
> 
> This is why GCC fixes this as a bug without adding -Wpsabi.

If you are really determined to do this, then OK.  I'm in a very bad
mood and I don't want to spend my mental strength on debating (esp. on a
corner case unlikely to affect "real" code) anymore.

But remember to add an entry in the GCC 14 changes.html, and test this thing:

struct Empty {};

struct Something : Empty
{
  double a, b;
};

If we are not careful enough we may introduce an ABI mismatch between
-std=c++14 and -std=c++17 here.  See https://gcc.gnu.org/PR94383.
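
A minimal way to sanity-check this (an illustrative sketch, not an existing
testcase) is to compile something like the following with both -std=c++14 and
-std=c++17 and compare the generated argument-passing and return sequences:

struct Empty {};

struct Something : Empty
{
  double a, b;
};

/* The empty base must not change how 'Something' is passed or returned.  */
double take (Something s) { return s.a + s.b; }

Something make (double x, double y)
{
  Something s;
  s.a = x;
  s.b = y;
  return s;
}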

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCH V13] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Forget about V13.  Please go directly to reviewing V14:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619478.html 

Thanks.



juzhe.zh...@rivai.ai
 
From: juzhe.zhong
Date: 2023-05-24 22:29
To: gcc-patches
CC: richard.sandiford; rguenther; Ju-Zhe Zhong
Subject: [PATCH V13] VECT: Add decrement IV iteration loop control by variable 
amount support
From: Ju-Zhe Zhong 
 
This patch supports decrement IV, following the flow designed by Richard:
 
(1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
call vect_set_loop_controls_directly.
 
(2) vect_set_loop_controls_directly calculates "step" as in your patch.
If rgc has 1 control, this step is the SSA name created for that control.
Otherwise the step is a fresh SSA name, as in your patch.
 
(3) vect_set_loop_controls_directly stores this step somewhere for later
use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.
 
(4) After the vect_set_loop_controls_directly call above, and outside
the "if" statement that now contains vect_set_loop_controls_directly,
check whether rgc->controls.length () > 1.  If so, use
vect_adjust_loop_lens_control to set the controls based on S.
 
Then the only caller of vect_adjust_loop_lens_control is
vect_set_loop_condition_partial_vectors.  And the starting
step for vect_adjust_loop_lens_control is always S.
 
This patch has been well tested for single-rgroup and multiple-rgroup (SLP) and
passed all testcases in the RISC-V port.

It also passes the multiple-rgroup (non-SLP) tests, exercised via vec_pack_trunk.
 
 
gcc/ChangeLog:
 
* tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Add 
decrement IV support.
(vect_adjust_loop_lens_control): Ditto.
(vect_set_loop_condition_partial_vectors): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): New variable.
* tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
(LOOP_VINFO_DECREMENTING_IV_STEP): New macro.
 
---
gcc/tree-vect-loop-manip.cc | 179 +---
gcc/tree-vect-loop.cc   |  13 +++
gcc/tree-vectorizer.h   |  12 +++
3 files changed, 193 insertions(+), 11 deletions(-)
 
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ff6159e08d5..3a872668f89 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
   standard_iv_increment_position (loop, _gsi, _after);
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+{
+  /* single rgroup:
+ ...
+ _10 = (unsigned long) count_12(D);
+ ...
+ # ivtmp_9 = PHI 
+ _36 = MIN_EXPR ;
+ ...
+ vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
+ ...
+ ivtmp_35 = ivtmp_9 - _36;
+ ...
+ if (ivtmp_35 != 0)
+goto ; [83.33%]
+ else
+goto ; [16.67%]
+  */
+  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
+  tree step = rgc->controls.length () == 1 ? rgc->controls[0]
+: make_ssa_name (iv_type);
+  /* Create decrement IV.  */
+  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, _gsi,
+ insert_after, _before_incr, _after_incr);
+  gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
+ index_before_incr,
+ nitems_step));
+  LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
+  return index_after_incr;
+}
+
+  /* Create increment IV.  */
   create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
 loop, _gsi, insert_after, _before_incr,
 _after_incr);
@@ -683,6 +715,63 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   return next_ctrl;
}
+/* Try to use adjust loop lens for multiple-rgroups.
+
+ _36 = MIN_EXPR ;
+
+ First length (MIN (X, VF/N)):
+   loop_len_15 = MIN_EXPR <_36, VF/N>;
+
+ Second length:
+   tmp = _36 - loop_len_15;
+   loop_len_16 = MIN (tmp, VF/N);
+
+ Third length:
+   tmp2 = tmp - loop_len_16;
+   loop_len_17 = MIN (tmp2, VF/N);
+
+ Last length:
+   loop_len_18 = tmp2 - loop_len_17;
+*/
+
+static void
+vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
+rgroup_controls *dest_rgm, tree step)
+{
+  tree ctrl_type = dest_rgm->type;
+  poly_uint64 nitems_per_ctrl
+= TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
+  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
+
+  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
+{
+  tree ctrl = dest_rgm->controls[i];
+  if (i == 0)
+ {
+   /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
+   gassign *assign
+ = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+   gimple_seq_add_stmt (seq, assign);
+ }
+  else if (i == dest_rgm->controls.length () - 1)
+ {
+   /* Last iteration: Remain capped to the range [0, VF/N].  */
+   gassign *assign = 

[PATCH V14] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread juzhe . zhong
From: Ju-Zhe Zhong 

This patch supports decrement IV, following the flow designed by Richard:

(1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
call vect_set_loop_controls_directly.

(2) vect_set_loop_controls_directly calculates "step" as in your patch.
If rgc has 1 control, this step is the SSA name created for that control.
Otherwise the step is a fresh SSA name, as in your patch.

(3) vect_set_loop_controls_directly stores this step somewhere for later
use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.

(4) After the vect_set_loop_controls_directly call above, and outside
the "if" statement that now contains vect_set_loop_controls_directly,
check whether rgc->controls.length () > 1.  If so, use
vect_adjust_loop_lens_control to set the controls based on S.

Then the only caller of vect_adjust_loop_lens_control is
vect_set_loop_condition_partial_vectors.  And the starting
step for vect_adjust_loop_lens_control is always S.
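
In scalar terms, the single-rgroup decrement-IV loop sketched in the comment in
the patch below behaves roughly like this (an illustrative C sketch only, not
the emitted GIMPLE; names and the VF value are made up):

#define VF 8	/* illustrative vectorization factor */

void
f (int *restrict a, int *restrict b, unsigned n)
{
  unsigned remain = n;				/* _10 = (unsigned long) count */
  while (remain != 0)
    {
      unsigned len = remain < VF ? remain : VF;	/* _36 = MIN_EXPR <ivtmp, VF> */
      for (unsigned i = 0; i < len; ++i)	/* stands in for the length-controlled vector ops */
	a[i] += b[i];
      a += len;
      b += len;
      remain -= len;				/* ivtmp_35 = ivtmp_9 - _36 */
    }
}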

This patch has been well tested for single-rgroup and multiple-rgroup (SLP) and
passed all testcases in the RISC-V port.

It also passes the multiple-rgroup (non-SLP) tests, exercised via vec_pack_trunk.

---
 gcc/tree-vect-loop-manip.cc | 178 +---
 gcc/tree-vect-loop.cc   |  13 +++
 gcc/tree-vectorizer.h   |  12 +++
 3 files changed, 192 insertions(+), 11 deletions(-)

diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ff6159e08d5..578ac5b783e 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
   standard_iv_increment_position (loop, _gsi, _after);
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+{
+  /* single rgroup:
+...
+_10 = (unsigned long) count_12(D);
+...
+# ivtmp_9 = PHI 
+_36 = MIN_EXPR ;
+...
+vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
+...
+ivtmp_35 = ivtmp_9 - _36;
+...
+if (ivtmp_35 != 0)
+  goto ; [83.33%]
+else
+  goto ; [16.67%]
+  */
+  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
+  tree step = rgc->controls.length () == 1 ? rgc->controls[0]
+  : make_ssa_name (iv_type);
+  /* Create decrement IV.  */
+  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, _gsi,
+insert_after, _before_incr, _after_incr);
+  gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
+   index_before_incr,
+   nitems_step));
+  LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
+  return index_after_incr;
+}
+
+  /* Create increment IV.  */
   create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
 loop, _gsi, insert_after, _before_incr,
 _after_incr);
@@ -683,6 +715,63 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   return next_ctrl;
 }
 
+/* Try to use adjust loop lens for multiple-rgroups.
+
+ _36 = MIN_EXPR ;
+
+ First length (MIN (X, VF/N)):
+   loop_len_15 = MIN_EXPR <_36, VF/N>;
+
+ Second length:
+   tmp = _36 - loop_len_15;
+   loop_len_16 = MIN (tmp, VF/N);
+
+ Third length:
+   tmp2 = tmp - loop_len_16;
+   loop_len_17 = MIN (tmp2, VF/N);
+
+ Last length:
+   loop_len_18 = tmp2 - loop_len_17;
+*/
+
+static void
+vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
+  rgroup_controls *dest_rgm, tree step)
+{
+  tree ctrl_type = dest_rgm->type;
+  poly_uint64 nitems_per_ctrl
+= TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
+  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
+
+  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
+{
+  tree ctrl = dest_rgm->controls[i];
+  if (i == 0)
+   {
+ /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
+ gassign *assign
+   = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+ gimple_seq_add_stmt (seq, assign);
+   }
+  else if (i == dest_rgm->controls.length () - 1)
+   {
+ /* Last iteration: Remain capped to the range [0, VF/N].  */
+ gassign *assign = gimple_build_assign (ctrl, MINUS_EXPR, step,
+dest_rgm->controls[i - 1]);
+ gimple_seq_add_stmt (seq, assign);
+   }
+  else
+   {
+ /* (MIN (remain, VF*I/N)) capped to the range [0, VF/N].  */
+ step = gimple_build (seq, MINUS_EXPR, iv_type, step,
+  dest_rgm->controls[i - 1]);
+ gassign *assign
+   = 
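
To make the length-splitting scheme in the vect_adjust_loop_lens_control comment
above concrete, here is a minimal scalar sketch (illustrative only; the names are
made up and this is not the GIMPLE the patch emits):

/* Distribute the per-iteration element count X across N controls,
   each capped at VF/N, as the comment describes.  */
void
split_lens (unsigned x, unsigned vf_per_n, unsigned n, unsigned *lens)
{
  unsigned remain = x;
  for (unsigned i = 0; i < n; ++i)
    {
      if (i == n - 1)
	lens[i] = remain;				/* last length: whatever is left */
      else
	lens[i] = remain < vf_per_n ? remain : vf_per_n;	/* MIN (remain, VF/N) */
      remain -= lens[i];
    }
}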

Re: [PATCH] Fix artificial overflow during GENERIC folding

2023-05-24 Thread Eric Botcazou via Gcc-patches
> But nobody is going to understand why the INTEGER_CST case goes the
> other way.

I can add a fat comment to that effect of course. :-)

> As you say we don't have a good way to say we're doing
> this to avoid undefined behavior, but then a view-convert back would
> be a good way to indicate that?  I can't come up with a better name
> for a custom operator we could also use,
> 
>   (convert_without_overflow (negate (convert:utype @1
> 
> maybe?  As said, if view_convert works I prefer that.  Does it?

Well, VIEW_CONVERT_EXPR adds its own set of problems in GENERIC and it will 
precisely survive when it is not needed, so I'm not sure that's any better.
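
As a rough illustration of the general technique under discussion (this is not
the patch itself, and the function name is made up), the point of negating in the
unsigned type is that wrap-around there is well defined:

int
negate_without_ub (int x)
{
  /* Negate in unsigned arithmetic, then convert back; this avoids the
     signed-overflow UB that -x would have when x == INT_MIN.  */
  return (int) -(unsigned int) x;
}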

-- 
Eric Botcazou




Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Hi Richard.  I have sent V13:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619475.html 
It looks more reasonable now.
Could you review it again?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_trunk
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling of ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt (seq, stmt);
>   stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>   build1 (VIEW_CONVERT_EXPR,
>   dest_masktype, temp));
> }
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   vec_perm_indices indices[2];
>   if (dest_masktype == src_masktype
>   && interleave_supported_p ([0], src_masktype, 0)
>   && interleave_supported_p ([1], src_masktype, 1))
> {
>   /* The destination requires twice as many mask bits as the source, so
>  we can use interleaving permutes to double up the number of bits.  */
>   tree masks[2];
>   for (unsigned int i = 0; i < 2; ++i)
> masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
>   src, src, masks[i & 1]);
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   return false;
> }
>
> I know this is just an optimization for ARM SVE, where sub_rgc (int16) is half
> the size of rgc (int8).
> But when I copied the code from ARM SVE and made it general for all cases
> (int8 <-> int64),
> they all work well and the codegen is good.
>
> If you don't like this approach, would you mind giving me some suggestions?
 
It's not a case of disliking one approach or disliking another.
There are two separate parts of this: one specific and one general.
 
The specific part is that the code had:
 
rgroup_controls *sub_rgc
  = &(*controls)[nmasks / rgc->controls.length () - 1];
if (!sub_rgc->controls.is_empty ())
  {
tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
vect_adjust_loop_lens_control (iv_type, _seq, rgc,
   sub_rgc, NULL_TREE);
continue;
  }
 
But AIUI, nmasks is always 

[PATCH V13] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread juzhe . zhong
From: Ju-Zhe Zhong 

This patch supports decrement IV, following the flow designed by Richard:

(1) In vect_set_loop_condition_partial_vectors, for the first iteration of:
call vect_set_loop_controls_directly.

(2) vect_set_loop_controls_directly calculates "step" as in your patch.
If rgc has 1 control, this step is the SSA name created for that control.
Otherwise the step is a fresh SSA name, as in your patch.

(3) vect_set_loop_controls_directly stores this step somewhere for later
use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.

(4) After the vect_set_loop_controls_directly call above, and outside
the "if" statement that now contains vect_set_loop_controls_directly,
check whether rgc->controls.length () > 1.  If so, use
vect_adjust_loop_lens_control to set the controls based on S.

Then the only caller of vect_adjust_loop_lens_control is
vect_set_loop_condition_partial_vectors.  And the starting
step for vect_adjust_loop_lens_control is always S.

This patch has been well tested for single-rgroup and multiple-rgroup (SLP) and
passed all testcases in the RISC-V port.

It also passes the multiple-rgroup (non-SLP) tests, exercised via vec_pack_trunk.


gcc/ChangeLog:

* tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Add 
decrement IV support.
(vect_adjust_loop_lens_control): Ditto.
(vect_set_loop_condition_partial_vectors): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): New variable.
* tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
(LOOP_VINFO_DECREMENTING_IV_STEP): New macro.

---
 gcc/tree-vect-loop-manip.cc | 179 +---
 gcc/tree-vect-loop.cc   |  13 +++
 gcc/tree-vectorizer.h   |  12 +++
 3 files changed, 193 insertions(+), 11 deletions(-)

diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ff6159e08d5..3a872668f89 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
   standard_iv_increment_position (loop, _gsi, _after);
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+{
+  /* single rgroup:
+...
+_10 = (unsigned long) count_12(D);
+...
+# ivtmp_9 = PHI 
+_36 = MIN_EXPR ;
+...
+vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
+...
+ivtmp_35 = ivtmp_9 - _36;
+...
+if (ivtmp_35 != 0)
+  goto ; [83.33%]
+else
+  goto ; [16.67%]
+  */
+  nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
+  tree step = rgc->controls.length () == 1 ? rgc->controls[0]
+  : make_ssa_name (iv_type);
+  /* Create decrement IV.  */
+  create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, _gsi,
+insert_after, _before_incr, _after_incr);
+  gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
+   index_before_incr,
+   nitems_step));
+  LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
+  return index_after_incr;
+}
+
+  /* Create increment IV.  */
   create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
 loop, _gsi, insert_after, _before_incr,
 _after_incr);
@@ -683,6 +715,63 @@ vect_set_loop_controls_directly (class loop *loop, 
loop_vec_info loop_vinfo,
   return next_ctrl;
 }
 
+/* Try to use adjust loop lens for multiple-rgroups.
+
+ _36 = MIN_EXPR ;
+
+ First length (MIN (X, VF/N)):
+   loop_len_15 = MIN_EXPR <_36, VF/N>;
+
+ Second length:
+   tmp = _36 - loop_len_15;
+   loop_len_16 = MIN (tmp, VF/N);
+
+ Third length:
+   tmp2 = tmp - loop_len_16;
+   loop_len_17 = MIN (tmp2, VF/N);
+
+ Last length:
+   loop_len_18 = tmp2 - loop_len_17;
+*/
+
+static void
+vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
+  rgroup_controls *dest_rgm, tree step)
+{
+  tree ctrl_type = dest_rgm->type;
+  poly_uint64 nitems_per_ctrl
+= TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
+  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
+
+  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
+{
+  tree ctrl = dest_rgm->controls[i];
+  if (i == 0)
+   {
+ /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
+ gassign *assign
+   = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+ gimple_seq_add_stmt (seq, assign);
+   }
+  else if (i == dest_rgm->controls.length () - 1)
+   {
+ /* Last iteration: Remain capped to the range [0, VF/N].  */
+ gassign *assign = 

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
Oh, I just realized that the flow you designed works well for vec_pack_trunk too.
I will send the V13 patch soon.

Thanks.



juzhe.zh...@rivai.ai
 
From: 钟居哲
Date: 2023-05-24 22:10
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
>> Both approaches are fine.  I'm not against one or the other.

>> What I didn't understand was why your patch only reuses existing IVs
>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>> multiplication (well, really a shift left) when moving from one
>> rgroup to another?  E.g. if one rgroup had;

>>   nscalars_per_iter == 2 && factor == 1

>> and another had:

>>   nscalars_per_iter == 4 && factor == 1

>> then we would need to multiply by 2 when going from the first rgroup
>> to the second.

>> If so, avoiding a multiplication seems like a good reason for the choice
>> you were making in the patch.  But we then need to check
>> max_nscalars_per_iter == 1 for both the source rgroup and the
>> destination rgroup, not just the destination.  And I think the
>> condition for “no multiplication needed” should be that:

Oh, I didn't realize it was such a complicated problem. Frankly, I didn't understand
rgroups well. Sorry about that :).

I just remember that last time you said I needed to handle multiple-rgroup
not only for SLP but also for non-SLP (which is the vec_pack_trunk case I tested).
Then I asked you when non-SLP applies, and you said max_nscalars_per_iter == 1.
Then I used max_nscalars_per_iter == 1 here (I didn't really learn much from
this, I just added it as you said).

Actually, I just want to handle multiple-rgroup for non-SLP here, I am trying to
avoid multiplication, and I think
scalar multiplication (which does not cost much) is fine on modern CPUs.

So, how do you suggest I handle multiple-rgroup for non-SLP?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_trunk
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling of ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt (seq, stmt);
>   stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>   build1 (VIEW_CONVERT_EXPR,
>   dest_masktype, temp));
> }
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   vec_perm_indices indices[2];
>   if 

Re: [V7][PATCH 1/2] Handle component_ref to a structre/union field including flexible array member [PR101832]

2023-05-24 Thread Qing Zhao via Gcc-patches
Bernhard,

Thanks a lot for your comments.

> On May 19, 2023, at 7:11 PM, Bernhard Reutner-Fischer  
> wrote:
> 
> On Fri, 19 May 2023 20:49:47 +
> Qing Zhao via Gcc-patches  wrote:
> 
>> GCC extension accepts the case when a struct with a flexible array member
>> is embedded into another struct or union (possibly recursively).
> 
> Do you mean TYPE_TRAILING_FLEXARRAY()?

The following might be a more accurate description:

The GCC extension accepts the case in which a struct with a flexible array
member is embedded into another struct or union (possibly recursively) as
the last field.
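
A minimal sketch of the case being described (field names are illustrative only):

struct flex
{
  int n;
  int data[];		/* flexible array member as the last field */
};

struct outer
{
  long tag;
  struct flex f;	/* GCC extension: struct with a flexarray embedded as the last field */
};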



> 
>> diff --git a/gcc/tree.h b/gcc/tree.h
>> index 0b72663e6a1..237644e788e 100644
>> --- a/gcc/tree.h
>> +++ b/gcc/tree.h
>> @@ -786,7 +786,12 @@ extern void omp_clause_range_check_failed (const_tree, 
>> const char *, int,
>>(...) prototype, where arguments can be accessed with va_start and
>>va_arg), as opposed to an unprototyped function.  */
>> #define TYPE_NO_NAMED_ARGS_STDARG_P(NODE) \
>> -  (TYPE_CHECK (NODE)->type_common.no_named_args_stdarg_p)
>> +  (FUNC_OR_METHOD_CHECK (NODE)->type_common.no_named_args_stdarg_p)
>> +
>> +/* True if this RECORD_TYPE or UNION_TYPE includes a flexible array member
>> +   at the last field recursively.  */
>> +#define TYPE_INCLUDE_FLEXARRAY(NODE) \
>> +  (RECORD_OR_UNION_CHECK (NODE)->type_common.no_named_args_stdarg_p)
> 
> Until I read the description above, I read TYPE_INCLUDE_FLEXARRAY as an
> option to include or not include something. The description hints more
> at TYPE_INCLUDES_FLEXARRAY (with an S) being a type which has at least
> one member with a trailing flexible array, or which itself has a
> trailing flexible array.

Yes, TYPE_INCLUDES_FLEXARRAY (maybe the name with an S is better) means the
structure/union TYPE includes a flexible array member, or includes a struct with
a flexible array member, as the last field.

Hope this is clear.
thanks.

Qing
> 
>> 
>> /* In an IDENTIFIER_NODE, this means that assemble_name was called with
>>this string as an argument.  */
> 



Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches


On Wed, 24 May 2023, Richard Biener wrote:

> On Wed, May 24, 2023 at 2:54 PM Alexander Monakov via Gcc-patches
>  wrote:
> >
> > Explicitly say that bitwise shifts for narrow types work similarly to
> > element-wise C shifts with integer promotions, which coincides with
> > OpenCL semantics.
> 
> Do we need to clarify that v << w with v being a vector of shorts
> still yields a vector of shorts and not a vector of ints?

I don't think so, but if necessary we could add "and the result was
truncated back to the base type":

When the base type is narrower than @code{int}, element-wise shifts
are performed as if operands underwent C integer promotions, and
the result was truncated back to the base type, like in OpenCL. 

> Btw, I don't see this promotion reflected in the IL.  For
> 
> typedef short v8hi __attribute__((vector_size(16)));
> 
> v8hi foo (v8hi a, v8hi b)
> {
>   return a << b;
> }
> 
> I get no masking of 'b', and vector lowering if the target doesn't handle it
> yields
> 
>   short int _5;
>   short int _6;
> 
>   _5 = BIT_FIELD_REF ;
>   _6 = BIT_FIELD_REF ;
>   _7 = _5 << _6;
> 
> which we could derive ranges from for _6 (apparently we don't yet).

Here it depends on how we define the GIMPLE-level semantics of bit-shift
operators for narrow types. To avoid changing lowering we could say that
shifting by up to 31 bits is well-defined for narrow types.

RTL-level semantics are also undocumented, unfortunately.

> Even
> 
> typedef int v8hi __attribute__((vector_size(16)));
> 
> v8hi x;
> int foo (v8hi a, v8hi b)
> {
>   x = a << b;
>   return (b[0] > 33);
> }
> 
> isn't optimized currently (but could - note I've used 'int' elements here).

Yeah. But let's constrain the optimizations first.

> So, I don't see us making sure the hardware does the right thing for
> out-of bound values.

I think in practice it worked out even if GCC did not pay attention to it,
because SIMD instructions had to facilitate autovectorization for C with
corresponding shift semantics.
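
For what it's worth, a small self-contained sketch of the semantics being
documented (illustrative only, not part of the patch):

typedef short v8hi __attribute__ ((vector_size (16)));

v8hi
shift_elements (v8hi a)
{
  v8hi cnt = { 20, 20, 20, 20, 20, 20, 20, 20 };
  /* Each element behaves as if promoted to int, shifted, and then
     truncated back to short, so a count of 20 is well defined here.  */
  return a >> cnt;
}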

Alexander

> 
> Richard.
> 
> > gcc/ChangeLog:
> >
> > * doc/extend.texi (Vector Extensions): Clarify bitwise shift
> > semantics.
> > ---
> >  gcc/doc/extend.texi | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> > index e426a2eb7d..6b4e94b6a1 100644
> > --- a/gcc/doc/extend.texi
> > +++ b/gcc/doc/extend.texi
> > @@ -12026,7 +12026,12 @@ elements in the operand.
> >  It is possible to use shifting operators @code{<<}, @code{>>} on
> >  integer-type vectors. The operation is defined as following: @code{@{a0,
> >  a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
> > -@dots{}, an >> bn@}}@. Vector operands must have the same number of
> > +@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
> > +element-wise shifts are performed as if operands underwent C integer
> > +promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
> > +well-defined for vectors with @code{char} and @code{short} base types.
> > +
> > +Operands of binary vector operations must have the same number of
> >  elements.
> >
> >  For convenience, it is allowed to use a binary vector operation
> > --
> > 2.39.2
> >
> 


[COMMITTED] i386: Add vv4qi3 expander

2023-05-24 Thread Uros Bizjak via Gcc-patches
Also, move vv8qi3 expander to a better place and enable
it with TARGET_MMX_WITH_SSE.  Remove handling of V8QImode from
ix86_expand_vecop_qihi2 since all partial QI->HI vector modes expand
via ix86_expand_vecop_qihi_partial.

gcc/ChangeLog:

* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2):
Remove handling of V8QImode.
* config/i386/mmx.md (vv8qi3): Move from sse.md.
Call ix86_expand_vecop_qihi_partial.  Enable for TARGET_MMX_WITH_SSE.
(vv4qi3): Ditto.
* config/i386/sse.md (vv8qi3): Remove.

gcc/testsuite/ChangeLog:

* gcc.target/i386/vect-shiftv4qi.c (dg-options):
Remove -ftree-vectorize.
* gcc.target/i386/vect-shiftv8qi.c (dg-options): Ditto.
* gcc.target/i386/vect-vshiftv4qi.c: New test.
* gcc.target/i386/vect-vshiftv8qi.c: New test.

Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.

Uros.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index ff3d382f1b4..2e6e6585aeb 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -23132,9 +23132,10 @@ ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest, 
rtx op1, rtx op2)
   /* vpmovwb only available under AVX512BW.  */
   if (!TARGET_AVX512BW)
 return false;
-  if ((qimode == V8QImode || qimode == V16QImode)
-  && !TARGET_AVX512VL)
+
+  if (qimode == V16QImode && !TARGET_AVX512VL)
 return false;
+
   /* Do not generate ymm/zmm instructions when
  target prefers 128/256 bit vector width.  */
   if ((qimode == V16QImode && TARGET_PREFER_AVX128)
@@ -23143,10 +23144,6 @@ ix86_expand_vecop_qihi2 (enum rtx_code code, rtx dest, 
rtx op1, rtx op2)
 
   switch (qimode)
 {
-case E_V8QImode:
-  himode = V8HImode;
-  gen_truncate = gen_truncv8hiv8qi2;
-  break;
 case E_V16QImode:
   himode = V16HImode;
   gen_truncate = gen_truncv16hiv16qi2;
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index a37811f..dbcb850ffde 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -2734,6 +2734,30 @@ (define_insn_and_split "v2qi3"
   [(set_attr "type" "multi")
(set_attr "mode" "QI")])
 
+(define_expand "vv8qi3"
+  [(set (match_operand:V8QI 0 "register_operand")
+   (any_shift:V8QI
+ (match_operand:V8QI 1 "register_operand")
+ (match_operand:V8QI 2 "register_operand")))]
+  "TARGET_AVX512BW && TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+{
+  ix86_expand_vecop_qihi_partial (, operands[0],
+ operands[1], operands[2]);
+  DONE;
+})
+
+(define_expand "vv4qi3"
+  [(set (match_operand:V4QI 0 "register_operand")
+   (any_shift:V4QI
+ (match_operand:V4QI 1 "register_operand")
+ (match_operand:V4QI 2 "register_operand")))]
+  "TARGET_AVX512BW && TARGET_AVX512VL"
+{
+  ix86_expand_vecop_qihi_partial (, operands[0],
+ operands[1], operands[2]);
+  DONE;
+})
+
 ;
 ;;
 ;; Parallel integral comparisons
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 26dd0b1aa10..0656a5ce717 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -24564,17 +24564,6 @@ (define_expand "v3"
 }
 })
 
-(define_expand "vv8qi3"
-  [(set (match_operand:V8QI 0 "register_operand")
-   (any_shift:V8QI
- (match_operand:V8QI 1 "register_operand")
- (match_operand:V8QI 2 "nonimmediate_operand")))]
-  "TARGET_AVX512BW && TARGET_AVX512VL && TARGET_64BIT"
-{
-  ix86_expand_vecop_qihi (, operands[0], operands[1], operands[2]);
-  DONE;
-})
-
 (define_expand "vlshr3"
   [(set (match_operand:VI48_512 0 "register_operand")
(lshiftrt:VI48_512
diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c 
b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c
index c06dfb87bd1..c6a63903604 100644
--- a/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c
+++ b/gcc/testsuite/gcc.target/i386/vect-shiftv4qi.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+/* { dg-options "-O2 -msse2" } */
 
 #define N 4
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c 
b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c
index f5e8925aa25..244b0dbd28a 100644
--- a/gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c
+++ b/gcc/testsuite/gcc.target/i386/vect-shiftv8qi.c
@@ -1,5 +1,5 @@
 /* { dg-do compile { target { ! ia32 } } } */
-/* { dg-options "-O2 -ftree-vectorize -msse2" } */
+/* { dg-options "-O2 -msse2" } */
 
 #define N 8
 
diff --git a/gcc/testsuite/gcc.target/i386/vect-vshiftv4qi.c 
b/gcc/testsuite/gcc.target/i386/vect-vshiftv4qi.c
new file mode 100644
index 000..c74cc991f59
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-vshiftv4qi.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx512bw -mavx512vl" } */
+
+#define N 4
+
+typedef unsigned char __vu __attribute__ ((__vector_size__ (N)));
+typedef signed char __vi __attribute__ ((__vector_size__ 
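
As a rough sketch of the kind of code the new vv4qi3 expander is meant to handle
(illustrative only, not the committed testcase):

typedef unsigned char v4qi __attribute__ ((__vector_size__ (4)));

/* A variable per-element shift of a partial QImode vector; with
   -mavx512bw -mavx512vl this should now expand via
   ix86_expand_vecop_qihi_partial instead of scalar code.  */
v4qi
shl_var (v4qi a, v4qi b)
{
  return a << b;
}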

Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
>> Actually, I just want to handle multiple-rgroup for non-SLP here, I am trying to
>> avoid multiplication, and I think
>> scalar multiplication (which does not cost much) is fine on modern CPUs.
Sorry, that was a typo: I didn't mean that I was trying to avoid multiplication; I think
multiplication is fine.


juzhe.zh...@rivai.ai
 
From: 钟居哲
Date: 2023-05-24 22:10
To: richard.sandiford
CC: gcc-patches; rguenther
Subject: Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
>> Both approaches are fine.  I'm not against one or the other.

>> What I didn't understand was why your patch only reuses existing IVs
>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>> multiplication (well, really a shift left) when moving from one
>> rgroup to another?  E.g. if one rgroup had;

>>   nscalars_per_iter == 2 && factor == 1

>> and another had:

>>   nscalars_per_iter == 4 && factor == 1

>> then we would need to multiply by 2 when going from the first rgroup
>> to the second.

>> If so, avoiding a multiplication seems like a good reason for the choice
>> you were making in the patch.  But we then need to check
>> max_nscalars_per_iter == 1 for both the source rgroup and the
>> destination rgroup, not just the destination.  And I think the
>> condition for “no multiplication needed” should be that:

Oh, I didn't realize it was such a complicated problem. Frankly, I didn't understand
rgroups well. Sorry about that :).

I just remember that last time you said I needed to handle multiple-rgroup
not only for SLP but also for non-SLP (which is the vec_pack_trunk case I tested).
Then I asked you when non-SLP applies, and you said max_nscalars_per_iter == 1.
Then I used max_nscalars_per_iter == 1 here (I didn't really learn much from
this, I just added it as you said).

Actually, I just want to handle multiple-rgroup for non-SLP here, I am trying to
avoid multiplication, and I think
scalar multiplication (which does not cost much) is fine on modern CPUs.

So, how do you suggest I handle multiple-rgroup for non-SLP?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_trunk
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling of ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt (seq, stmt);
>   stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>   build1 (VIEW_CONVERT_EXPR,
>   

Re: [PATCH] Provide an API for ipa_vr.

2023-05-24 Thread Martin Jambor
Hello,

On Wed, May 17 2023, Aldy Hernandez wrote:
> This patch encapsulates the ipa_vr internals into an API.  It also
> makes it type agnostic, in preparation for upcoming changes to IPA.
>
> Interestingly, there's a 0.44% improvement to IPA-cp, which I'm sure
> we'll soak up with future changes in this area :).
>
> BTW, there's a note here:
> +  // vrange_storage is typeless, but we need to know what type of
> +  // range that is being streamed out (irange, frange, etc).  AFAICT,
> +  // there's no way to get at the underlying type by the time we
> +  // stream out in write_ipcp_transformation_info.
> +  tree m_type;
>
> Could someone more IPA savvy double check this is indeed the case?

Yes, that is true and keeping the type around in ipa_vr is probably
easier than postponing the deallocation of parameter descriptors
somehow.

>
> OK for trunk?

Yes, thanks.

Martin

>
> gcc/ChangeLog:
>
>   * ipa-cp.cc (ipa_value_range_from_jfunc): Use new ipa_vr API.
>   (ipcp_store_vr_results): Same.
>   * ipa-prop.cc (ipa_vr::ipa_vr): New.
>   (ipa_vr::get_vrange): New.
>   (ipa_vr::set_unknown): New.
>   (ipa_vr::streamer_read): New.
>   (ipa_vr::streamer_write): New.
>   (write_ipcp_transformation_info): Use new ipa_vr API.
>   (read_ipcp_transformation_info): Same.
>   (ipa_vr::nonzero_p): Delete.
>   (ipcp_update_vr): Use new ipa_vr API.
>   * ipa-prop.h (class ipa_vr): Provide an API and hide internals.
>   * ipa-sra.cc (zap_useless_ipcp_results): Use new ipa_vr API.
>   * gcc.dg/ipa/pr78121.c: Adjust for vrange::dump use.
>   * gcc.dg/ipa/vrp1.c: Same.
>   * gcc.dg/ipa/vrp2.c: Same.
>   * gcc.dg/ipa/vrp3.c: Same.
>   * gcc.dg/ipa/vrp4.c: Same.
>   * gcc.dg/ipa/vrp5.c: Same.
>   * gcc.dg/ipa/vrp6.c: Same.
>   * gcc.dg/ipa/vrp7.c: Same.
>   * gcc.dg/ipa/vrp8.c: Same.


Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
>> Both approaches are fine.  I'm not against one or the other.

>> What I didn't understand was why your patch only reuses existing IVs
>> for max_nscalars_per_iter == 1.  Was it to avoid having to do a
>> multiplication (well, really a shift left) when moving from one
>> rgroup to another?  E.g. if one rgroup had;

>>   nscalars_per_iter == 2 && factor == 1

>> and another had:

>>   nscalars_per_iter == 4 && factor == 1

>> then we would need to multiply by 2 when going from the first rgroup
>> to the second.

>> If so, avoiding a multiplication seems like a good reason for the choice
>> you were making in the patch.  But we then need to check
>> max_nscalars_per_iter == 1 for both the source rgroup and the
>> destination rgroup, not just the destination.  And I think the
>> condition for “no multiplication needed” should be that:

Oh, I didn't realize it was such a complicated problem. Frankly, I didn't understand
rgroups well. Sorry about that :).

I just remember that last time you said I needed to handle multiple-rgroup
not only for SLP but also for non-SLP (which is the vec_pack_trunk case I tested).
Then I asked you when non-SLP applies, and you said max_nscalars_per_iter == 1.
Then I used max_nscalars_per_iter == 1 here (I didn't really learn much from
this, I just added it as you said).

Actually, I just want to handle multiple-rgroup for non-SLP here, I am trying to
avoid multiplication, and I think
scalar multiplication (which does not cost much) is fine on modern CPUs.

So, how do you suggest I handle multiple-rgroup for non-SLP?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 22:01
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_trunk
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling of ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt (seq, stmt);
>   stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>   build1 (VIEW_CONVERT_EXPR,
>   dest_masktype, temp));
> }
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   vec_perm_indices indices[2];
>   if (dest_masktype == src_masktype
>   && interleave_supported_p ([0], src_masktype, 0)
>   && interleave_supported_p ([1], src_masktype, 1))
> {
>   /* The destination requires twice as many mask bits as the source, so
>  we can use interleaving permutes to double up the number of bits.  */
>   tree 

Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Sandiford via Gcc-patches
钟居哲  writes:
>>> In other words, why is this different from what
>>>vect_set_loop_controls_directly would do?
> Oh, I see.  You are asking why I do not do the multiple-rgroup vec_trunk
> handling inside "vect_set_loop_controls_directly".
>
> Well, frankly, I just replicated the handling of ARM SVE:
> unsigned int nmasks = i + 1;
> if (use_masks_p && (nmasks & 1) == 0)
>   {
> rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
> if (!half_rgc->controls.is_empty ()
> && vect_maybe_permute_loop_masks (_seq, rgc, half_rgc))
>   continue;
>   }
>
> /* Try to use permutes to define the masks in DEST_RGM using the masks
>in SRC_RGM, given that the former has twice as many masks as the
>latter.  Return true on success, adding any new statements to SEQ.  */
>
> static bool
> vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
>rgroup_controls *src_rgm)
> {
>   tree src_masktype = src_rgm->type;
>   tree dest_masktype = dest_rgm->type;
>   machine_mode src_mode = TYPE_MODE (src_masktype);
>   insn_code icode1, icode2;
>   if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
>   && (icode1 = optab_handler (vec_unpacku_hi_optab,
>   src_mode)) != CODE_FOR_nothing
>   && (icode2 = optab_handler (vec_unpacku_lo_optab,
>   src_mode)) != CODE_FOR_nothing)
> {
>   /* Unpacking the source masks gives at least as many mask bits as
>  we need.  We can then VIEW_CONVERT any excess bits away.  */
>   machine_mode dest_mode = insn_data[icode1].operand[0].mode;
>   gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
>   tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
> ? VEC_UNPACK_HI_EXPR
> : VEC_UNPACK_LO_EXPR);
>   gassign *stmt;
>   if (dest_masktype == unpack_masktype)
> stmt = gimple_build_assign (dest, code, src);
>   else
> {
>   tree temp = make_ssa_name (unpack_masktype);
>   stmt = gimple_build_assign (temp, code, src);
>   gimple_seq_add_stmt (seq, stmt);
>   stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
>   build1 (VIEW_CONVERT_EXPR,
>   dest_masktype, temp));
> }
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   vec_perm_indices indices[2];
>   if (dest_masktype == src_masktype
>   && interleave_supported_p ([0], src_masktype, 0)
>   && interleave_supported_p ([1], src_masktype, 1))
> {
>   /* The destination requires twice as many mask bits as the source, so
>  we can use interleaving permutes to double up the number of bits.  */
>   tree masks[2];
>   for (unsigned int i = 0; i < 2; ++i)
> masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
>   for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> {
>   tree src = src_rgm->controls[i / 2];
>   tree dest = dest_rgm->controls[i];
>   gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
>   src, src, masks[i & 1]);
>   gimple_seq_add_stmt (seq, stmt);
> }
>   return true;
> }
>   return false;
> }
>
> I know this is just an optimization for ARM SVE, where sub_rgc (int16) is half
> the size of rgc (int8).
> But when I copied the code from ARM SVE and made it general for all cases
> (int8 <-> int64),
> they all work well and the codegen is good.
>
> If you don't like this approach, would you mind giving me some suggestions?

It's not a case of disliking one approach or disliking another.
There are two separate parts of this: one specific and one general.

The specific part is that the code had:

rgroup_controls *sub_rgc
  = &(*controls)[nmasks / rgc->controls.length () - 1];
if (!sub_rgc->controls.is_empty ())
  {
tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
vect_adjust_loop_lens_control (iv_type, _seq, rgc,
   sub_rgc, NULL_TREE);
continue;
  }

But AIUI, nmasks is always equal to rgc->controls.length ()
(if rgc->controls is non-empty).  So I think this always used
(*controls)[0] as the source rgroup.  And I think that's the
only case that would work, since vect_adjust_loop_lens_control
only reads from sub_rgc once.  

[PATCH][committed] aarch64: PR target/99195 Annotate vector shift patterns for vec-concat-zero

2023-05-24 Thread Kyrylo Tkachov via Gcc-patches
Hi all,

Continuing the series of straightforward annotations, this one handles the 
normal (not widening or narrowing) vector shifts.
Tests included.

Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
Pushing to trunk.
Thanks,
Kyrill

gcc/ChangeLog:

PR target/99195
* config/aarch64/aarch64-simd.md (aarch64_simd_lshr): Rename to...
(aarch64_simd_lshr): ... This.
(aarch64_simd_ashr): Rename to...
(aarch64_simd_ashr): ... This.
(aarch64_simd_imm_shl): Rename to...
(aarch64_simd_imm_shl): ... This.
(aarch64_simd_reg_sshl): Rename to...
(aarch64_simd_reg_sshl): ... This.
(aarch64_simd_reg_shl_unsigned): Rename to...
(aarch64_simd_reg_shl_unsigned): ... This.
(aarch64_simd_reg_shl_signed): Rename to...
(aarch64_simd_reg_shl_signed): ... This.
(vec_shr_): Rename to...
(vec_shr_): ... This.
(aarch64_shl): Rename to...
(aarch64_shl): ... This.
(aarch64_qshl): Rename to...
(aarch64_qshl): ... This.

gcc/testsuite/ChangeLog:

* gcc.target/aarch64/simd/pr99195_1.c: Add testing for shifts.
* gcc.target/aarch64/simd/pr99195_6.c: Likewise.
* gcc.target/aarch64/simd/pr99195_8.c: New test.


shift.patch
Description: shift.patch
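
Roughly, the kind of pattern these vec-concat-zero annotations target looks like
the following (an illustrative sketch, not one of the added tests):

#include <arm_neon.h>

/* The 64-bit shift result already zeroes the upper half of the full
   128-bit register, so the concatenation with zero should not need a
   separate move once the shift patterns carry the annotation.  */
int16x8_t
shift_concat_zero (int16x4_t a)
{
  int16x4_t s = vshl_n_s16 (a, 3);
  return vcombine_s16 (s, vdup_n_s16 (0));
}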


Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
OK. Thanks. I am gonna refine the patch following Richard's idea and test it.
Thanks both Richard and Richi.



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-05-24 20:51
To: Richard Sandiford
CC: 钟居哲; gcc-patches
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
On Wed, 24 May 2023, Richard Sandiford wrote:
 
> Sorry, I realised later that I had an implicit assumption here:
> if there are multiple rgroups, it's better to have a single IV
> for the smallest rgroup and scale that up to bigger rgroups.
> 
> E.g. if the loop control IV is taken from an N-control rgroup
> and has a step S, an N*M-control rgroup would be based on M*S.
> 
> Of course, it's also OK to create multiple IVs if you prefer.
> It's just a question of which approach gives the best output
> in practice.
 
One thing to check is whether IVOPTs is ever able to eliminate
one such IV using another.  You can then also check whether
when presented with a single IV it already considers the
others you can create as candidates so you get the optimal
selection in the end.
 
Richard.
 


Re: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread 钟居哲
>> In other words, why is this different from what
>>vect_set_loop_controls_directly would do?
Oh, I see.  You are asking why I do not do the multiple-rgroup vec_trunk
handling inside "vect_set_loop_controls_directly".

Well, frankly, I just replicated the handling of ARM SVE:
unsigned int nmasks = i + 1;
if (use_masks_p && (nmasks & 1) == 0)
  {
rgroup_controls *half_rgc = &(*controls)[nmasks / 2 - 1];
if (!half_rgc->controls.is_empty ()
&& vect_maybe_permute_loop_masks (&header_seq, rgc, half_rgc))
  continue;
  }

/* Try to use permutes to define the masks in DEST_RGM using the masks
   in SRC_RGM, given that the former has twice as many masks as the
   latter.  Return true on success, adding any new statements to SEQ.  */

static bool
vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_controls *dest_rgm,
   rgroup_controls *src_rgm)
{
  tree src_masktype = src_rgm->type;
  tree dest_masktype = dest_rgm->type;
  machine_mode src_mode = TYPE_MODE (src_masktype);
  insn_code icode1, icode2;
  if (dest_rgm->max_nscalars_per_iter <= src_rgm->max_nscalars_per_iter
  && (icode1 = optab_handler (vec_unpacku_hi_optab,
  src_mode)) != CODE_FOR_nothing
  && (icode2 = optab_handler (vec_unpacku_lo_optab,
  src_mode)) != CODE_FOR_nothing)
{
  /* Unpacking the source masks gives at least as many mask bits as
 we need.  We can then VIEW_CONVERT any excess bits away.  */
  machine_mode dest_mode = insn_data[icode1].operand[0].mode;
  gcc_assert (dest_mode == insn_data[icode2].operand[0].mode);
  tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode);
  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
{
  tree src = src_rgm->controls[i / 2];
  tree dest = dest_rgm->controls[i];
  tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1)
? VEC_UNPACK_HI_EXPR
: VEC_UNPACK_LO_EXPR);
  gassign *stmt;
  if (dest_masktype == unpack_masktype)
stmt = gimple_build_assign (dest, code, src);
  else
{
  tree temp = make_ssa_name (unpack_masktype);
  stmt = gimple_build_assign (temp, code, src);
  gimple_seq_add_stmt (seq, stmt);
  stmt = gimple_build_assign (dest, VIEW_CONVERT_EXPR,
  build1 (VIEW_CONVERT_EXPR,
  dest_masktype, temp));
}
  gimple_seq_add_stmt (seq, stmt);
}
  return true;
}
  vec_perm_indices indices[2];
  if (dest_masktype == src_masktype
      && interleave_supported_p (&indices[0], src_masktype, 0)
      && interleave_supported_p (&indices[1], src_masktype, 1))
{
  /* The destination requires twice as many mask bits as the source, so
 we can use interleaving permutes to double up the number of bits.  */
  tree masks[2];
  for (unsigned int i = 0; i < 2; ++i)
masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]);
  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
{
  tree src = src_rgm->controls[i / 2];
  tree dest = dest_rgm->controls[i];
  gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR,
  src, src, masks[i & 1]);
  gimple_seq_add_stmt (seq, stmt);
}
  return true;
}
  return false;
}

I know this is just an optimization for ARM SVE, where sub_rgc (int16) is half
the size of rgc (int8).
But when I copied the code from ARM SVE and made it general for all cases
(int8 <-> int64), it all worked well and the codegen was good.

If you don't like this approach, would you mind giving me some suggestions?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Sandiford
Date: 2023-05-24 20:41
To: 钟居哲
CC: gcc-patches; rguenther
Subject: Re: [PATCH V12] VECT: Add decrement IV iteration loop control by 
variable amount support
Sorry, I realised later that I had an implicit assumption here:
if there are multiple rgroups, it's better to have a single IV
for the smallest rgroup and scale that up to bigger rgroups.
 
E.g. if the loop control IV is taken from an N-control rgroup
and has a step S, an N*M-control rgroup would be based on M*S.
 
Of course, it's also OK to create multiple IVs if you prefer.
It's just a question of which approach gives the best output
in practice.
 
Another way of going from an N-control rgroup ("G1") to an N*M-control
rgroup ("G2") would be to reuse all N controls from G1.  E.g. the
first M controls in G2 would come from G1[0], the next M from
G1[1], etc.  That might lower the longest dependency chain.
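
As a concrete (hypothetical) illustration of the two schemes, with N = 2 and
M = 4; this only shows the mapping being described, not actual vectorizer code:

#include <stdio.h>

int
main (void)
{
  enum { N = 2, M = 4 };   /* hypothetical rgroup sizes */

  /* Scheme 1: a single IV taken from the N-control rgroup with step S;
     the N*M-control rgroup would then be based on a step of M*S.  */
  int S = 16;
  printf ("G1 IV step %d -> G2 based on step %d\n", S, M * S);

  /* Scheme 2: reuse G1's controls directly; the first M controls of G2
     come from G1[0], the next M from G1[1], and so on.  */
  for (int i = 0; i < N * M; i++)
    printf ("G2[%d] derived from G1[%d]\n", i, i / M);
  return 0;
}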
 
But whatever we do, it doesn't feel like max_nscalars_per_iter
should be part of the decision.  (I realise it will be part of

Re: [PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Richard Biener via Gcc-patches
On Wed, May 24, 2023 at 2:54 PM Alexander Monakov via Gcc-patches
 wrote:
>
> Explicitly say that bitwise shifts for narrow types work similar to
> element-wise C shifts with integer promotions, which coincides with
> OpenCL semantics.

Do we need to clarify that v << w with v being a vector of shorts
still yields a vector of shorts and not a vector of ints?

Btw, I don't see this promotion reflected in the IL.  For

typedef short v8hi __attribute__((vector_size(16)));

v8hi foo (v8hi a, v8hi b)
{
  return a << b;
}

I get no masking of 'b', and if the target doesn't handle it, vector lowering
yields

  short int _5;
  short int _6;

  _5 = BIT_FIELD_REF ;
  _6 = BIT_FIELD_REF ;
  _7 = _5 << _6;

which we could derive ranges from for _6 (apparently we don't yet).  Even

typedef int v8hi __attribute__((vector_size(16)));

v8hi x;
int foo (v8hi a, v8hi b)
{
  x = a << b;
  return (b[0] > 33);
}

isn't optimized currently (but could - note I've used 'int' elements here).

So, I don't see us making sure the hardware does the right thing for
out-of-bound values.
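
For concreteness, a self-contained example of the two user-visible questions
(my sketch, not part of the patch): the result type of a short-vector shift,
and what happens for shift counts of 16..31, which is exactly what the
proposed wording would pin down:

#include <stdio.h>

typedef short v8hi __attribute__ ((vector_size (16)));

int
main (void)
{
  v8hi a = { 1, 1, 1, 1, 1, 1, 1, 1 };
  v8hi b = { 0, 1, 2, 3, 7, 15, 16, 31 };

  v8hi r = a << b;

  /* The shift yields a vector of shorts, not a vector of ints.  */
  printf ("sizeof (a << b) = %zu\n", sizeof (a << b));

  /* Under promotion semantics each lane would behave like the scalar
     (short) ((int) a[i] << b[i]), so counts of 16 and 31 would be
     well defined despite the 16-bit element type.  */
  for (int i = 0; i < 8; i++)
    printf ("lane %d: a << %d = %d\n", i, b[i], r[i]);
  return 0;
}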

Richard.

> gcc/ChangeLog:
>
> * doc/extend.texi (Vector Extensions): Clarify bitwise shift
> semantics.
> ---
>  gcc/doc/extend.texi | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
> index e426a2eb7d..6b4e94b6a1 100644
> --- a/gcc/doc/extend.texi
> +++ b/gcc/doc/extend.texi
> @@ -12026,7 +12026,12 @@ elements in the operand.
>  It is possible to use shifting operators @code{<<}, @code{>>} on
>  integer-type vectors. The operation is defined as following: @code{@{a0,
>  a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
> -@dots{}, an >> bn@}}@. Vector operands must have the same number of
> +@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
> +element-wise shifts are performed as if operands underwent C integer
> +promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
> +well-defined for vectors with @code{char} and @code{short} base types.
> +
> +Operands of binary vector operations must have the same number of
>  elements.
>
>  For convenience, it is allowed to use a binary vector operation
> --
> 2.39.2
>


Re: [PATCH] Fix artificial overflow during GENERIC folding

2023-05-24 Thread Richard Biener via Gcc-patches
On Wed, May 24, 2023 at 2:39 PM Eric Botcazou  wrote:
>
> > I don't like littering the patterns with this and it's likely far from the
> > only cases we have?
>
> Maybe, but that's the only problematic case we have in Ada.  It occurs only on
> mainline because we have streamlined address calculations there, from out-of-
> line to inline expansion, i.e. from run time to compile time.
>
> > Since we did move some of the patterns from fold-const.cc to match.pd and
> > the frontends might be interested in TREE_OVERFLOW (otherwise we'd just
> > scrap that!) I'm not sure removing the flag is good (and I never was really
> > convinced the setting for the implementation defined behavior on conversion
> > to unsigned is good).
>
> Yes, the Ada front-end relies on the TREE_OVERFLOW flag to detect overflows at
> compile time, so it cannot be removed, but it must be set correctly, which is
> not the case here: (T)p - (T) (p + 4) where T is signed should just yield -4.
>
> > Am I correct that the user writing such a conversion in Ada _should_
> > get a constraint violation?  So it's just the middle-end introducing it
> > to avoid undefined signed overflow that's on error?
>
> Yes, it's a Constraint_Error in Ada to convert a value of an unsigned type to
> a signed type if it does not fit in the signed type.
>
> > I'll also note that fold_convert_const_int_from_int shouldn't set
> > TREE_OVERFLOW on unsigned destination types?  So it's the
> > outer conversion back to signed that generates the TREE_OVERFLOW?
>
> Yes, 4 is converted to unsigned, then negated, yielding a huge number, and the
> final conversion back to signed yields -4 with TREE_OVERFLOW set.
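
(A minimal sketch of that value sequence using plain 32-bit scalars, just to
make the arithmetic concrete; the real fold of course operates on INTEGER_CST
trees:)

#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  uint32_t u = 4;                     /* 4 converted to the unsigned type */
  uint32_t negated = -u;              /* negation yields 0xfffffffc, the "huge number" */
  int32_t back = (int32_t) negated;   /* conversion back to signed gives -4;
                                         this is where TREE_OVERFLOW gets set */
  printf ("%u 0x%x %d\n", (unsigned) u, (unsigned) negated, (int) back);
  return 0;
}
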
>
> > Would it help to use a (view_convert ...) here?  For non-constants that
> > should be folded back to a sign changing (convert ...) but the constant
> > folding should hopefully happen earlier?  But it's again implementation
> > defined behavior we have here, so not sure we need TREE_OVERFLOW at all.
>
> I'm not sure we need to jump through too many hoops here: the intermediate
> conversion trick is a kludge because we lack a proper method to selectively
> disable undefined overflow at run time, but that's not the case at compile
> time where we have a finer-grained control (and even different rules) so I
> don't really see a problem with handling the two cases differently.

But nobody is going to understand why the INTEGER_CST case goes the
other way.  As you say we don't have a good way to say we're doing
this to avoid undefined behavior, but then a view-convert back would
be a good way to indicate that?  I can't come up with a better name
for a custom operator we could also use,

  (convert_without_overflow (negate (convert:utype @1

maybe?  As said, if view_convert works I prefer that.  Does it?

Richard.

>
> --
> Eric Botcazou
>
>


Re: [V7][PATCH 2/2] Update documentation to clarify a GCC extension [PR77650]

2023-05-24 Thread Qing Zhao via Gcc-patches
Joseph,

Thanks a lot for the review. And sorry for my late reply (just came back from a 
short vacation).

> On May 19, 2023, at 5:12 PM, Joseph Myers  wrote:
> 
> On Fri, 19 May 2023, Qing Zhao via Gcc-patches wrote:
> 
>> +GCC extension accepts a structure containing an ISO C99 @dfn{flexible array
> 
> "The GCC extension" or "A GCC extension".

Okay.
> 
>> +@item
>> +A structure containing a C99 flexible array member, or a union containing
>> +such a structure, is the middle field of another structure, for example:
> 
> There might be more than one middle field, and I think this case also 
> includes where it's the *first* field - any field other than the last.

Good point. Will fix this.
> 
>> +@smallexample
>> +struct flex  @{ int length; char data[]; @};
>> +
>> +struct mid_flex @{ int m; struct flex flex_data; int n; @};
>> +@end smallexample
>> +
>> +In the above, @code{mid_flex.flex_data.data[]} has undefined behavior.
> 
> And it's not literally mid_flex.flex_data.data[] that has undefined 
> behavior, but trying to access a member of that array.

Yes, you are right. Will fix this.
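
A small sketch of the distinction (my example; offsets assume a typical
4-byte int), which might be worth reflecting in the reworded text:

#include <stdio.h>
#include <stddef.h>

/* Declaring these types is the extension being documented; the undefined
   behavior is in accessing an element of the middle flexible array.  */
struct flex     { int length; char data[]; };
struct mid_flex { int m; struct flex flex_data; int n; };

int
main (void)
{
  /* With a typical 4-byte int, flex_data.data starts at the same offset
     as n, so an access such as mid_flex.flex_data.data[0] (rather than
     the declaration itself) is what overlaps n and is undefined.  */
  printf ("data offset %zu, n offset %zu\n",
          offsetof (struct mid_flex, flex_data.data),
          offsetof (struct mid_flex, n));
  return 0;
}
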
> 
>> +Compilers do not handle such case consistently, Any code relying on
> 
> "such a case", and "," should be "." at the end of a sentence.
Okay, will fix this.

Thanks

Qing
> 
> -- 
> Joseph S. Myers
> jos...@codesourcery.com



[PATCH] doc: clarify semantics of vector bitwise shifts

2023-05-24 Thread Alexander Monakov via Gcc-patches
Explicitly say that bitwise shifts for narrow types work similar to
element-wise C shifts with integer promotions, which coincides with
OpenCL semantics.

gcc/ChangeLog:

* doc/extend.texi (Vector Extensions): Clarify bitwise shift
semantics.
---
 gcc/doc/extend.texi | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/gcc/doc/extend.texi b/gcc/doc/extend.texi
index e426a2eb7d..6b4e94b6a1 100644
--- a/gcc/doc/extend.texi
+++ b/gcc/doc/extend.texi
@@ -12026,7 +12026,12 @@ elements in the operand.
 It is possible to use shifting operators @code{<<}, @code{>>} on
 integer-type vectors. The operation is defined as following: @code{@{a0,
 a1, @dots{}, an@} >> @{b0, b1, @dots{}, bn@} == @{a0 >> b0, a1 >> b1,
-@dots{}, an >> bn@}}@. Vector operands must have the same number of
+@dots{}, an >> bn@}}@.  When the base type is narrower than @code{int},
+element-wise shifts are performed as if operands underwent C integer
+promotions, like in OpenCL.  This makes vector shifts by up to 31 bits
+well-defined for vectors with @code{char} and @code{short} base types.
+
+Operands of binary vector operations must have the same number of
 elements. 
 
 For convenience, it is allowed to use a binary vector operation
-- 
2.39.2



Re: [PATCH V12] VECT: Add decrement IV iteration loop control by variable amount support

2023-05-24 Thread Richard Biener via Gcc-patches
On Wed, 24 May 2023, Richard Sandiford wrote:

> Sorry, I realised later that I had an implicit assumption here:
> if there are multiple rgroups, it's better to have a single IV
> for the smallest rgroup and scale that up to bigger rgroups.
> 
> E.g. if the loop control IV is taken from an N-control rgroup
> and has a step S, an N*M-control rgroup would be based on M*S.
> 
> Of course, it's also OK to create multiple IVs if you prefer.
> It's just a question of which approach gives the best output
> in practice.

One thing to check is whether IVOPTs is ever able to eliminate
one such IV using another.  You can then also check whether
when presented with a single IV it already considers the
others you can create as candidates so you get the optimal
selection in the end.

Richard.

