Re: [PATCH] vect: Fix inconsistency in fully-masked lane-reducing op generation [PR116985]

2024-10-12 Thread Feng Xue OS
Added.

Thanks,
Feng


From: Richard Biener 
Sent: Saturday, October 12, 2024 8:12 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] vect: Fix inconsistency in fully-masked lane-reducing op generation [PR116985]

On Sat, Oct 12, 2024 at 9:12 AM Feng Xue OS  wrote:
>
> To align vectorized def/use when lane-reducing op is present in loop
> reduction, we may need to insert extra trivial pass-through copies, which
> would cause a mismatch between the lane-reducing vector copy and the loop
> mask index. This can be fixed by computing the right index from a new
> counter of effective lane-reducing vector copies.

OK, but can you add the reduced testcase from the PR in a way that it ICEs
before and not after the patch?

Thanks,
Richard.

> Thanks,
> Feng
> ---
> gcc/
> PR tree-optimization/116985
> * tree-vect-loop.cc (vect_transform_reduction): Compute loop mask
> index based on effective vector copies for reduction op.
> ---
>  gcc/tree-vect-loop.cc | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index ade72a5124f..025442aabc3 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -8916,6 +8916,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>
>bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> +  unsigned mask_index = 0;
>
>for (unsigned i = 0; i < num; ++i)
>  {
> @@ -8954,7 +8955,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>   std::swap (vop[0], vop[1]);
> }
>   tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> - vec_num * ncopies, vectype_in, i);
> + vec_num * ncopies, vectype_in,
> + mask_index++);
>   gcall *call = gimple_build_call_internal (cond_fn, 4, mask,
> vop[0], vop[1], vop[0]);
>   new_temp = make_ssa_name (vec_dest, call);
> @@ -8971,7 +8973,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>   if (masked_loop_p && mask_by_cond_expr)
> {
>   tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
> - vec_num * ncopies, vectype_in, i);
> + vec_num * ncopies, vectype_in,
> + mask_index++);
>   build_vect_cond_expr (code, vop, mask, gsi);
> }
>
> --
> 2.17.1
>
>
>


[PATCH] vect: Fix inconsistency in fully-masked lane-reducing op generation [PR116985]

2024-10-12 Thread Feng Xue OS
To align vectorized def/use when lane-reducing op is present in loop reduction,
we may need to insert extra trivial pass-through copies, which would cause a
mismatch between the lane-reducing vector copy and the loop mask index. This
can be fixed by computing the right index from a new counter of effective
lane-reducing vector copies.
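
To illustrate where the mismatch comes from (a simplified sketch in the style
of the examples from the earlier patch series, not the reduced testcase of the
PR): when a lane-reducing op needs fewer effective vector copies than the
whole def-use cycle, trivial pass-through copies are appended, and only the
effective copies should consume a loop mask:

   sum_v0 = DOT_PROD (d0_v0, d1_v0, sum_v0);  // effective copy, loop mask #0
   sum_v1 = sum_v1;  // pass-through copy, no loop mask
   sum_v2 = sum_v2;  // pass-through copy, no loop mask
   sum_v3 = sum_v3;  // pass-through copy, no loop mask

Indexing loop masks with the plain vector-copy index can therefore drift from
the masks actually consumed; the new mask_index counter advances only when a
masked lane-reducing statement is emitted.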

Thanks,
Feng
---
gcc/
PR tree-optimization/116985
* tree-vect-loop.cc (vect_transform_reduction): Compute loop mask
index based on effective vector copies for reduction op.
---
 gcc/tree-vect-loop.cc | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index ade72a5124f..025442aabc3 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8916,6 +8916,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,

   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
+  unsigned mask_index = 0;

   for (unsigned i = 0; i < num; ++i)
 {
@@ -8954,7 +8955,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
  std::swap (vop[0], vop[1]);
}
  tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
- vec_num * ncopies, vectype_in, i);
+ vec_num * ncopies, vectype_in,
+ mask_index++);
  gcall *call = gimple_build_call_internal (cond_fn, 4, mask,
vop[0], vop[1], vop[0]);
  new_temp = make_ssa_name (vec_dest, call);
@@ -8971,7 +8973,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
  if (masked_loop_p && mask_by_cond_expr)
{
  tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
- vec_num * ncopies, vectype_in, i);
+ vec_num * ncopies, vectype_in,
+ mask_index++);
  build_vect_cond_expr (code, vop, mask, gsi);
}

-- 
2.17.1





Re: [RFC] Generalize formation of lane-reducing ops in loop reduction

2024-08-21 Thread Feng Xue OS
>>
>> >> 1. Background
>> >>
>> >> For loop reduction of accumulating result of a widening operation, the
>> >> preferred pattern is lane-reducing operation, if supported by target.
>> >> Because this kind of operation need not preserve intermediate results of
>> >> widening operation, and only produces reduced amount of final results
>> >> for accumulation, choosing the pattern could lead to pretty compact
>> >> codegen.
>> >>
>> >> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
>> >> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
>> >> (SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerated dot-product
>> >> with a constant operand as "1". Currently, gcc only supports recognition
>> >> of simple lane-reducing case, in which each accumulation statement of
>> >> loop reduction forms one pattern:
>> >>
>> >>  char  *d0, *d1;
>> >>  short *s0, *s1;
>> >>
>> >>  for (i) {
>> >>sum += d0[i] * d1[i];  // DOT_PROD
>> >>sum += abs(s0[i] - s1[i]); // SAD
>> >>  }
>> >>
>> >> We could rewrite the example as the below using only one statement,
>> >> whose non-reduction addend is the sum of the above right-side parts. As
>> >> a whole, the addend would match nothing, while its two sub-expressions
>> >> could be recognized as corresponding lane-reducing patterns.
>> >>
>> >>  for (i) {
>> >>sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
>> >>  }
>> >
>> > Note we try to recognize the original form as SLP reduction (which of
>> > course fails).
>> >
>> >> This case might be too elaborately crafted to be very common in reality.
>> >> Though, we do find seemingly variant but essentially similar code
>> >> pattern in some AI applications, which use matrix-vector operations
>> >> extensively, some usages are just single loop reduction composed of
>> >> multiple dot-products. A code snippet from ggml:
>> >>
>> >>  for (int j = 0; j < qk/2; ++j) {
>> >>const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
>> >>const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10;
>> >>
>> >>const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
>> >>const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;
>> >>
>> >>sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
>> >>  }
>> >>
>> >> In the source level, it appears to be a natural and minor scaling-up of
>> >> simple one lane-reducing pattern, but it is beyond capability of current
>> >> vectorization pattern recognition, and needs some kind of generic
>> >> extension to the framework.
>>
>> Sorry for the late response.
>>
>> > So this is about re-associating lane-reducing ops to alternative
>> > lane-reducing ops to save repeated accumulation steps?
>>
>> You mean re-associating slp-based lane-reducing ops to loop-based?
> 
> Yes.
> 
>> > The thing is that IMO pattern recognition as we do now is limiting and
>> > should eventually move to the SLP side where we should be able to more
>> > freely "undo" and associate.
>>
>> No matter whether pattern recognition is done prior to or within SLP, the
>> essential thing is that we need to figure out which op is qualified for
>> lane-reducing by some means.
>>
>> For example, when seeing a mult in a loop with vectorization-favored shape,
>> ...
>> t = a * b; // char a, b
>> ...
>>
>> we could not say it is decidedly applicable for reduced computation via
>> dot-product even if the corresponding target ISA is available.
> 
> True.  Note there's a PR which shows SLP lane-reducing written out like
> 
>   a[i] = b[4*i] * 3 + b[4*i+1] * 3 + b[4*i+2] * 3 + b[4*i+3] * 3;
> 
> which we cannot change to a DOT_PROD because we do not know which
> lanes are reduced.  My point was there are non-reduction cases where knowing
> which actual lanes get reduced would help.  For reductions it's not important
> and associating in a way to expose more possible (reduction) lane reductions
> is almost always going to be a win.
> 
>> Recognition of normal patterns merely involves local statement-based match,
>> while for lane-reducing, the validity check requires global loop-wise
>> analysis on the structure of the reduction, probably not the same as, but
>> close to what is proposed in the RFC. The basic logic, IMHO, is independent
>> of where pattern recognition is implemented. As a matter of fact, this is
>> not about "associating", but "tagging" (marking all lane-reducing
>> qualifiable statements). After the process, the "re-associator" could play
>> its role to guide selection of either loop-based or slp-based lane-reducing
>> op.
>>
>> > I've searched twice now, a few days ago I read that the optabs not
>> > specifying which lanes are combined/reduced is a limitation.  Yes, it is -
>> > I hope we can rectify this, so if this is motivation enough we should
>> > split the optabs up into even/odd/hi/lo (or whatever else interesting
>> > targets actually do).
>>
>> Actually, how lanes are combined/reduced does not matter too much regarding
>> recognition of lane-reducing patterns.

[PATCH] vect: Add missed opcodes in vect_get_smallest_scalar_type [PR115228]

2024-08-05 Thread Feng Xue OS
Some opcodes are missed when determining the smallest scalar type for a
vectorizable statement. Currently, this bug does not cause any problem,
because vect_get_smallest_scalar_type is only used to compute the max nunits
vectype, and even if a statement with a missed opcode is incorrectly bypassed,
the max nunits vectype can still be correctly deduced from the def statements
of the statement's operands.

In the future, if this function is called to do other things, we may get
something wrong. So fix it in this patch.
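
For instance (a constructed example, not taken from the patch): in the loop
below the accumulation statement may be recognized as a SAD_EXPR whose int
lhs is wider than its char inputs, so the smallest scalar type of the
statement is char rather than int:

int
sad (signed char *a, signed char *b, int n)
{
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += __builtin_abs (a[i] - b[i]);  /* may become SAD_EXPR */
  return sum;
}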

Thanks,
Feng

---
gcc/
* tree-vect-data-refs.cc (vect_get_smallest_scalar_type): Add
missed opcodes that involve widening operation.
---
 gcc/tree-vect-data-refs.cc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 39fd887a96b..5b0d548f847 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -162,7 +162,10 @@ vect_get_smallest_scalar_type (stmt_vec_info stmt_info, tree scalar_type)
   if (gimple_assign_cast_p (assign)
  || gimple_assign_rhs_code (assign) == DOT_PROD_EXPR
  || gimple_assign_rhs_code (assign) == WIDEN_SUM_EXPR
+ || gimple_assign_rhs_code (assign) == SAD_EXPR
  || gimple_assign_rhs_code (assign) == WIDEN_MULT_EXPR
+ || gimple_assign_rhs_code (assign) == WIDEN_MULT_PLUS_EXPR
+ || gimple_assign_rhs_code (assign) == WIDEN_MULT_MINUS_EXPR
  || gimple_assign_rhs_code (assign) == WIDEN_LSHIFT_EXPR
  || gimple_assign_rhs_code (assign) == FLOAT_EXPR)
{
-- 
2.17.1

[PATCH] vect: Allow unsigned-to-signed promotion in vect_look_through_possible_promotion [PR115707]

2024-08-05 Thread Feng Xue OS
The function vect_look_through_possible_promotion() fails to figure out the
root definition if casts involve more than two promotions with a sign change,
as:

long a = (long)b;   // promotion cast
 -> int b = (int)c; // promotion cast, sign change
   -> unsigned short c = ...;

For this case, the function thinks the 2nd cast has a different sign from the
1st, so it stops looking through, while "unsigned short -> int" is a natural
value-preserving promotion. This patch allows this unsigned-to-signed
promotion in the function.
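
A minimal C sketch (constructed for illustration) in which the vectorizer
sees exactly this chain of promotions; the root definition the function
should find is the unsigned short load:

long
f (unsigned short *p, int n)
{
  long sum = 0;
  for (int i = 0; i < n; i++)
    sum += p[i];  /* unsigned short -> int -> long, all value-preserving */
  return sum;
}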

Thanks,
Feng

---
gcc/
* tree-vect-patterns.cc (vect_look_through_possible_promotion): Allow
unsigned-to-signed promotion.
---
 gcc/tree-vect-patterns.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 4674a16d15f..b2c83cfd219 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -434,7 +434,9 @@ vect_look_through_possible_promotion (vec_info *vinfo, tree op,
 sign of the previous promotion.  */
  if (!res
  || TYPE_PRECISION (unprom->type) == orig_precision
- || TYPE_SIGN (unprom->type) == TYPE_SIGN (op_type))
+ || TYPE_SIGN (unprom->type) == TYPE_SIGN (op_type)
+ || (TYPE_UNSIGNED (op_type)
+ && TYPE_PRECISION (op_type) < TYPE_PRECISION (unprom->type)))
{
  unprom->set_op (op, dt, caster);
  min_precision = TYPE_PRECISION (op_type);
-- 
2.17.1

From 334998e1d991e1d2c8e4c2234663b4d829e88e5c Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 5 Aug 2024 15:23:56 +0800
Subject: [PATCH] vect: Allow unsigned-to-signed promotion in
 vect_look_through_possible_promotion [PR115707]

The function fails to figure out the root definition if casts involve more
than two promotions with a sign change, as:

long a = (long)b;   // promotion cast
 -> int b = (int)c; // promotion cast, sign change
   -> unsigned short c = ...;

For this case, the function thinks the 2nd cast has a different sign from the
1st, so it stops looking through, while "unsigned short -> int" is a natural
value-preserving promotion.

2024-08-05 Feng Xue 

gcc/
	* tree-vect-patterns.cc (vect_look_through_possible_promotion): Allow
	unsigned-to-signed promotion.
---
 gcc/tree-vect-patterns.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 4674a16d15f..b2c83cfd219 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -434,7 +434,9 @@ vect_look_through_possible_promotion (vec_info *vinfo, tree op,
 	 sign of the previous promotion.  */
 	  if (!res
 	  || TYPE_PRECISION (unprom->type) == orig_precision
-	  || TYPE_SIGN (unprom->type) == TYPE_SIGN (op_type))
+	  || TYPE_SIGN (unprom->type) == TYPE_SIGN (op_type)
+	  || (TYPE_UNSIGNED (op_type)
+		  && TYPE_PRECISION (op_type) < TYPE_PRECISION (unprom->type)))
 	{
 	  unprom->set_op (op, dt, caster);
 	  min_precision = TYPE_PRECISION (op_type);
-- 
2.17.1



Re: [RFC] Generalize formation of lane-reducing ops in loop reduction

2024-08-03 Thread Feng Xue OS
>> 1. Background
>>
>> For loop reduction of accumulating result of a widening operation, the
>> preferred pattern is lane-reducing operation, if supported by target. Because
>> this kind of operation need not preserve intermediate results of widening
>> operation, and only produces reduced amount of final results for
>> accumulation, choosing the pattern could lead to pretty compact codegen.
>>
>> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
>> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
>> (SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerated dot-product with a
>> constant operand as "1". Currently, gcc only supports recognition of simple
>> lane-reducing case, in which each accumulation statement of loop reduction
>> forms one pattern:
>>
>>  char  *d0, *d1;
>>  short *s0, *s1;
>>
>>  for (i) {
>>sum += d0[i] * d1[i];  // DOT_PROD
>>sum += abs(s0[i] - s1[i]); // SAD
>>  }
>>
>> We could rewrite the example as the below using only one statement, whose
>> non-reduction addend is the sum of the above right-side parts. As a whole,
>> the addend would match nothing, while its two sub-expressions could be
>> recognized as corresponding lane-reducing patterns.
>>
>>  for (i) {
>>sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
>>  }
> 
> Note we try to recognize the original form as SLP reduction (which of
> course fails).
> 
>> This case might be too elaborately crafted to be very common in reality.
>> Though, we do find seemingly variant but essentially similar code pattern in
>> some AI applications, which use matrix-vector operations extensively, some
>> usages are just single loop reduction composed of multiple dot-products. A
>> code snippet from ggml:
>>
>>  for (int j = 0; j < qk/2; ++j) {
>>const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
>>const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10;
>>
>>const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
>>const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;
>>
>>sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
>>  }
>>
>> In the source level, it appears to be a natural and minor scaling-up of
>> simple one lane-reducing pattern, but it is beyond capability of current
>> vectorization pattern recognition, and needs some kind of generic extension
>> to the framework.

Sorry for the late response.

> So this is about re-associating lane-reducing ops to alternative lane-reducing
> ops to save repeated accumulation steps?

You mean re-associating slp-based lane-reducing ops to loop-based?

> The thing is that IMO pattern recognition as we do now is limiting and should
> eventually move to the SLP side where we should be able to more freely
> "undo" and associate.

No matter whether pattern recognition is done prior to or within SLP, the
essential thing is that we need to figure out which op is qualified for
lane-reducing by some means.

For example, when seeing a mult in a loop with vectorization-favored shape,
...
t = a * b; // char a, b
...

we could not say it is decidedly applicable for reduced computation via
dot-product even if the corresponding target ISA is available.
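
For instance (a constructed sketch), whether the multiply is lane-reducible
depends on its other uses in the loop:

t = a * b;        // char a, b
sum += t;         // fine for dot-product: only the sum over lanes matters
res[i] = t >> 2;  // elementwise use: every widened lane value is needed,
                  // so t cannot be evaluated in lane-reduced form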

Recognition of normal patterns merely involves local statement-based match,
while for lane-reducing, the validity check requires global loop-wise analysis
on the structure of the reduction, probably not the same as, but close to what
is proposed in the RFC. The basic logic, IMHO, is independent of where pattern
recognition is implemented. As a matter of fact, this is not about
"associating", but "tagging" (marking all lane-reducing qualifiable
statements). After the process, the "re-associator" could play its role to
guide selection of either loop-based or slp-based lane-reducing op.

> I've searched twice now, a few days ago I read that the optabs not specifying
> which lanes are combined/reduced is a limitation.  Yes, it is - I hope we can
> rectify this, so if this is motivation enough we should split the optabs up
> into even/odd/hi/lo (or whatever else interesting targets actually do).

Actually, how lanes are combined/reduced does not matter too much regarding
recognition of lane-reducing patterns.

> I did read through the rest of the e-mail before, I do in general like the
> idea to do better.  Costing is another place where we need to re-do things

Yes, the current pattern recognition framework is not costing-model-driven,
and has no way to "undo" a decision previously made even if it negatively
impacts pattern matching later. But this is a weakness of the framework, not
of any specific pattern. To overcome it, we may count on another separate
task instead of mingling it with this RFC, and it would be better to have
that task contained in your plan of moving pattern recognition to SLP.

> completely;  my current idea is to feed targets the SLP graph so they'll
> have some dependence info.  They should already have access to the
> actual operation done, though in awkward ways.  I guess the first target
> to impleme

[RFC][PATCH 5/5] vect: Add accumulating-result pattern for lane-reducing operation

2024-07-21 Thread Feng Xue OS
This patch adds a pattern to fold a summation into the last operand of a
lane-reducing operation when appropriate, which is a supplement to those
operation-specific patterns for dot-prod/sad/widen-sum.

  sum = lane-reducing-op(..., 0) + value;
=>
  sum = lane-reducing-op(..., value);
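
A concrete instance with dot-product (a sketch of the pattern statements,
mirroring the example in patch 4/5):

  t = DOT_PROD(a, b, 0);   // produced by the dot-prod pattern
  sum = t + value;
=>
  sum = DOT_PROD(a, b, value);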

Thanks,
Feng
---
gcc/
* tree-vect-patterns.cc (vect_recog_lane_reducing_accum_pattern): New
pattern function.
(vect_vect_recog_func_ptrs): Add the new pattern function.
* params.opt (vect-lane-reducing-accum-pattern): New parameter.

gcc/testsuite/
* gcc.dg/vect/vect-reduc-accum-pattern.c: New test.
---
 gcc/params.opt|   4 +
 .../gcc.dg/vect/vect-reduc-accum-pattern.c|  61 ++
 gcc/tree-vect-patterns.cc | 106 ++
 3 files changed, 171 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c

diff --git a/gcc/params.opt b/gcc/params.opt
index c17ba17b91b..b94bdc26cbd 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1198,6 +1198,10 @@ The maximum factor which the loop vectorizer applies to the cost of statements i
 Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization
 Enable loop vectorization of floating point inductions.
 
+-param=vect-lane-reducing-accum-pattern=
+Common Joined UInteger Var(param_vect_lane_reducing_accum_pattern) Init(2) IntegerRange(0, 2) Param Optimization
+Allow pattern of combining plus into lane reducing operation or not. If value is 2, allow this for all statements, or if 1, only for reduction statement, otherwise, disable it.
+
 -param=vrp-block-limit=
 Common Joined UInteger Var(param_vrp_block_limit) Init(15) Optimization Param
 Maximum number of basic blocks before VRP switches to a fast model with less memory requirements.
diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c
new file mode 100644
index 000..80a2c4f047e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c
@@ -0,0 +1,61 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#define FN(name, S1, S2)   \
+S1 int __attribute__ ((noipa)) \
+name (S1 int res,  \
+  S2 char *restrict a, \
+  S2 char *restrict b, \
+  S2 char *restrict c, \
+  S2 char *restrict d) \
+{  \
+  for (int i = 0; i < N; i++)  \
+res += a[i] * b[i];\
+   \
+  asm volatile ("" ::: "memory");  \
+  for (int i = 0; i < N; ++i)  \
+res += (a[i] * b[i] + c[i] * d[i]) << 3;   \
+   \
+  return res;  \
+}
+
+FN(f1_vec, signed, signed)
+
+#pragma GCC push_options
+#pragma GCC optimize ("O0")
+FN(f1_novec, signed, signed)
+#pragma GCC pop_options
+
+#define BASE2 ((signed int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  signed char a[N], b[N];
+  signed char c[N], d[N];
+
+#pragma GCC novector
+  for (int i = 0; i < N; ++i)
+{
+  a[i] = BASE2 + i * 5;
+  b[i] = BASE2 + OFFSET + i * 4;
+  c[i] = BASE2 + i * 6;
+  d[i] = BASE2 + OFFSET + i * 5;
+}
+
+  if (f1_vec (0x12345, a, b, c, d) != f1_novec (0x12345, a, b, c, d))
+__builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
+/* { dg-final { scan-tree-dump "vect_recog_lane_reducing_accum_pattern: detected" "vect" { target { vect_sdot_qi } } } } */
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index bb037af0b68..9a6b16532e4 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -1490,6 +1490,111 @@ vect_recog_abd_pattern (vec_info *vinfo,
   return vect_convert_output (vinfo, stmt_vinfo, out_type, stmt, vectype_out);
 }
 
+/* Function vect_recog_lane_reducing_accum_pattern
+
+   Try to fold a summation into the last operand of lane-reducing operation.
+
+   sum = lane-reducing-op(..., 0) + value;
+
+   A lane-reducing operation contains two aspects: main primitive operation
+   and appendant result-accumulation.  Pattern matching for the basic aspect
+   is handled in specific pattern for dot-prod/sad/widen-sum respectively.
+   The function is in charge of the other aspect.
+
+   Input:
+
+   * STMT_VINFO: The stmt from which the pattern se

[RFC][PATCH 2/5] vect: Introduce loop reduction affine closure to vect pattern recog

2024-07-21 Thread Feng Xue OS
For sum-based loop reduction, its affine closure is composed of statements
whose results and derived computation only end up in the reduction, and are
not used in any non-linear transform operation. The concept underlies the
generalized lane-reducing pattern recognition in the coming patches. As
mathematically proved, it is legitimate to optimize evaluation of a value
with a lane-reducing pattern only if its definition statement is located in
the affine closure. That is to say, the canonicalized representation for loop
reduction could be of the following affine form, in which "opX" denotes an
operation for a lane-reducing pattern, and h(i) represents the remaining
operations irrelevant to those patterns.

  for (i)
    sum += cst0 * op0 + cst1 * op1 + ... + cstN * opN + h(i);

At initialization, we invoke a preprocessing step to mark all statements in
the affine closure, which eases retrieval of the property during pattern
matching. Since a pattern hit would replace the original statement with new
pattern statements, we resort to a postprocessing step after recognition, to
parse the semantics of those new statements and incrementally update the
affine closure, or roll back the pattern change if it would break the
completeness of the existing closure.

Thus, inside the affine closure, the recog framework could universally handle
both lane-reducing and normal patterns. Also with this patch, we are able to
add more complicated logic to enhance lane-reducing patterns.
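
A constructed example of the distinction (not from the patch itself):

  for (i)
    {
      int t = d0[i] * d1[i];
      sum += 3 * t + c[i];  // t is in the affine closure: it reaches the
                            // reduction only through linear operations
    }

  for (i)
    {
      int t = d0[i] * d1[i];
      sum += t * t;         // t is NOT in the affine closure: squaring is
                            // a non-linear transform of t
    }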

Thanks,
Feng
---
gcc/
* tree-vectorizer.h (enum vect_reduc_pattern_status): New enum.
(_stmt_vec_info): Add a new field reduc_pattern_status.
* tree-vect-patterns.cc (vect_split_statement): Adjust statement
status for reduction affine closure.
(vect_convert_input): Do not reuse conversion statement in process.
(vect_reassociating_reduction_p): Add a condition check to only allow
statement in reduction affine closure.
(vect_pattern_expr_invariant_p): New function.
(vect_get_affine_operands_mask): Likewise.
(vect_mark_reduction_affine_closure): Likewise.
(vect_mark_stmts_for_reduction_pattern_recog): Likewise.
(vect_get_prev_reduction_stmt): Likewise.
(vect_mark_reduction_pattern_sequence_formed): Likewise.
(vect_check_pattern_stmts_for_reduction): Likewise.
(vect_pattern_recog_1): Check if a pattern recognition would break
existing lane-reducing pattern statements.
(vect_pattern_recog): Mark loop reduction affine closure.
---
 gcc/tree-vect-patterns.cc | 722 +-
 gcc/tree-vectorizer.h |  23 ++
 2 files changed, 742 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index ca8809e7cfd..02f6b942026 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -750,7 +750,6 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info stmt2_info, tree new_rhs,
  gimple_stmt_iterator gsi = gsi_for_stmt (stmt2_info->stmt, def_seq);
  gsi_insert_before_without_update (&gsi, stmt1, GSI_SAME_STMT);
}
-  return true;
 }
   else
 {
@@ -783,9 +782,35 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info stmt2_info, tree new_rhs,
  dump_printf_loc (MSG_NOTE, vect_location, "and: %G",
   (gimple *) new_stmt2);
}
+}
 
-  return true;
+  /* Since this function would change existing conversion statement no matter
+ the pattern is finally applied or not, we should check whether affine
+ closure of loop reduction need to be adjusted for impacted statements.  */
+  unsigned int status = stmt2_info->reduc_pattern_status;
+
+  if (status != rpatt_none)
+{
+  tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (stmt1));
+  tree new_rhs_type = TREE_TYPE (new_rhs);
+
+  /* The new statement generated by splitting is a natural widening
+conversion. */
+  gcc_assert (TYPE_PRECISION (rhs_type) < TYPE_PRECISION (new_rhs_type));
+  gcc_assert (TYPE_UNSIGNED (rhs_type) || !TYPE_UNSIGNED (new_rhs_type));
+
+  /* The new statement would not break transform invariance of lane-
+reducing operation, if the original conversion depends on the one
+formed previously.  For the case, it should also be marked with
+rpatt_formed status.  */
+  if (status & rpatt_formed)
+   vinfo->lookup_stmt (stmt1)->reduc_pattern_status = rpatt_formed;
+
+  if (!is_pattern_stmt_p (stmt2_info))
+   STMT_VINFO_RELATED_STMT (stmt2_info)->reduc_pattern_status = status;
 }
+
+  return true;
 }
 
 /* Look for the following pattern
@@ -890,7 +915,10 @@ vect_convert_input (vec_info *vinfo, stmt_vec_info stmt_info, tree type,
 return wide_int_to_tree (type, wi::to_widest (unprom->op));
 
   tree input = unprom->op;
-  if (unprom->caster)
+
+  /* We should not reuse conversion, if it is just the statement under pattern
+ recognition.  */
+  if (unprom->caster && unprom->cast

[RFC][PATCH 4/5] vect: Extend lane-reducing patterns to non-loop-reduction statement

2024-07-21 Thread Feng Xue OS
Previously, only the simple lane-reducing case was supported, in which one
loop reduction statement forms one pattern match:

  char *d0, *d1, *s0, *s1, *w;
  for (i) {
sum += d0[i] * d1[i];  // sum = DOT_PROD(d0, d1, sum);
sum += abs(s0[i] - s1[i]); // sum = SAD(s0, s1, sum);
sum += w[i];   // sum = WIDEN_SUM(w, sum);
  }

This patch removes the limitation of the current lane-reducing matching
strategy, and extends the candidate scope to the whole loop reduction affine
closure. Thus, we could optimize the reduction with as many lane-reducing ops
as possible, which ends up with generalized pattern recognition as ("opX"
denotes an operation for a lane-reducing pattern):

 for (i)
   sum += cst0 * op0 + cst1 * op1 + ... + cstN * opN + h(i);

A lane-reducing operation contains two aspects: the main primitive operation
and the appendant result-accumulation. The original design handles the match
of the compound semantics in a single pattern, but that means is not suitable
for an operation that does not directly participate in loop reduction. In
this patch, we only focus on the basic aspect, and leave another patch to
cover the rest. An example with dot-product:

sum = DOT_PROD(d0, d1, sum);   // original
sum = DOT_PROD(d0, d1, 0) + sum;   // now

Thanks,
Feng
---
gcc/
* tree-vect-patterns.cc (vect_reassociating_reduction_p): Remove the
function.
(vect_recog_dot_prod_pattern): Relax check to allow any statement in
reduction affine closure.
(vect_recog_sad_pattern): Likewise.
(vect_recog_widen_sum_pattern): Likewise. And use dot-product if
widen-sum is not supported.
(vect_vect_recog_func_ptrs): Move lane-reducing patterns to the topmost.

gcc/testsuite/
* gcc.dg/vect/vect-reduc-affine-1.c: New test.
* gcc.dg/vect/vect-reduc-affine-2.c: New test.
* gcc.dg/vect/vect-reduc-affine-slp-1.c: New test.
---
 .../gcc.dg/vect/vect-reduc-affine-1.c | 112 ++
 .../gcc.dg/vect/vect-reduc-affine-2.c |  81 +
 .../gcc.dg/vect/vect-reduc-affine-slp-1.c |  74 
 gcc/tree-vect-patterns.cc | 321 ++
 4 files changed, 372 insertions(+), 216 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-affine-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-affine-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c
new file mode 100644
index 000..a5e99ce703b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c
@@ -0,0 +1,112 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#define FN(name, S1, S2)   \
+S1 int __attribute__ ((noipa)) \
+name (S1 int res,  \
+  S2 char *restrict a, \
+  S2 char *restrict b, \
+  S2 int *restrict c,  \
+  S2 int cst1, \
+  S2 int cst2, \
+  int shift)   \
+{  \
+  for (int i = 0; i < N; i++)  \
+res += a[i] * b[i] + 16;   \
+   \
+  asm volatile ("" ::: "memory");  \
+  for (int i = 0; i < N; i++)  \
+res += a[i] * b[i] + cst1; \
+   \
+  asm volatile ("" ::: "memory");  \
+  for (int i = 0; i < N; i++)  \
+res += a[i] * b[i] + c[i]; \
+   \
+  asm volatile ("" ::: "memory");  \
+  for (int i = 0; i < N; i++)  \
+res += a[i] * b[i] * 23;   \
+   \
+  asm volatile ("" ::: "memory");  \
+  for (int i = 0; i < N; i++)  \
+res += a[i] * b[i] << 6;   \
+   \
+  asm volatile ("" ::: "memory");  \
+  for (int i = 0; i < N; i++)  \
+res += a[i] * b[i] * cst2; \
+   \
+  asm volatile ("" ::: "memory");  \
+  for 

[RFC][PATCH 3/5] vect: Enable lane-reducing operation that is not loop reduction statement

2024-07-21 Thread Feng Xue OS
This patch extends the original vect analysis and transform to support a new
kind of lane-reducing operation that participates in loop reduction
indirectly. The operation itself is not a reduction statement, but its value
is eventually accumulated into the reduction result.
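
A constructed example of such an indirect operation:

  for (i)
    sum += d0[i] * d1[i] + n[i];

Here the dot-product part is matched as t = DOT_PROD(d0, d1, 0) and reaches
the reduction through the outer plus, so the lane-reducing statement itself
is not the reduction statement, while its value still ends up in the
reduction result.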

Thanks,
Feng
---
gcc/
* tree-vect-loop.cc (vectorizable_lane_reducing): Allow indirect lane-
reducing operation.
(vect_transform_reduction): Extend transform for indirect lane-reducing
operation.
---
 gcc/tree-vect-loop.cc | 48 +++
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index d7d628efa60..c344158b419 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7520,9 +7520,7 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
 
   stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
 
-  /* TODO: Support lane-reducing operation that does not directly participate
- in loop reduction.  */
-  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+  if (!reduc_info)
 return false;
 
   /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
@@ -7530,7 +7528,16 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
   gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
   gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
 
-  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
+  int sum_idx = STMT_VINFO_REDUC_IDX (stmt_info);
+  int num_ops = (int) gimple_num_ops (stmt) - 1;
+
+  /* Participate in loop reduction either directly or indirectly.  */
+  if (sum_idx >= 0)
+gcc_assert (sum_idx  == num_ops - 1);
+  else
+sum_idx = num_ops - 1;
+
+  for (int i = 0; i < num_ops; i++)
 {
   stmt_vec_info def_stmt_info;
   slp_tree slp_op;
@@ -7573,7 +7580,24 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
 
   tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
 
-  gcc_assert (vectype_in);
+  if (!vectype_in)
+{
+  enum vect_def_type dt;
+  tree rhs1 = gimple_assign_rhs1 (stmt);
+
+  if (!vect_is_simple_use (rhs1, loop_vinfo, &dt, &vectype_in))
+   return false;
+
+  if (!vectype_in)
+   {
+ vectype_in = get_vectype_for_scalar_type (loop_vinfo,
+   TREE_TYPE (rhs1));
+ if (!vectype_in)
+   return false;
+   }
+
+  STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
+}
 
   /* Compute number of effective vector statements for costing.  */
   unsigned int ncopies_for_cost = vect_get_num_copies (loop_vinfo, slp_node,
@@ -8750,9 +8774,17 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   gcc_assert (single_defuse_cycle || lane_reducing);
 
   if (lane_reducing)
-{
-  /* The last operand of lane-reducing op is for reduction.  */
-  gcc_assert (reduc_index == (int) op.num_ops - 1);
+{  
+  if (reduc_index < 0)
+   {
+ reduc_index = (int) op.num_ops - 1;
+ single_defuse_cycle = false;
+   }
+  else
+   {
+ /* The last operand of lane-reducing op is for reduction.  */
+ gcc_assert (reduc_index == (int) op.num_ops - 1);
+   }
 }
 
   /* Create the destination vector  */
-- 
2.17.1

From 5e65c65786d9594c172b58a6cd1af50c67efb927 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Wed, 24 Apr 2024 16:46:49 +0800
Subject: [PATCH 3/5] vect: Enable lane-reducing operation that is not loop
 reduction statement

This patch extends the original vect analysis and transform to support a new
kind of lane-reducing operation that participates in loop reduction
indirectly. The operation itself is not a reduction statement, but its value
is eventually accumulated into the reduction result.

2024-04-24 Feng Xue 

gcc/
	* tree-vect-loop.cc (vectorizable_lane_reducing): Allow indirect lane-
	reducing operation.
	(vect_transform_reduction): Extend transform for indirect lane-reducing
	operation.
---
 gcc/tree-vect-loop.cc | 48 +++
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index d7d628efa60..c344158b419 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7520,9 +7520,7 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
 
   stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
 
-  /* TODO: Support lane-reducing operation that does not directly participate
- in loop reduction.  */
-  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
+  if (!reduc_info)
 return false;
 
   /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
@@ -7530,7 +7528,16 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
   gcc_assert (ST

[RFC][PATCH 1/5] vect: Fix single_imm_use in tree_vect_patterns

2024-07-21 Thread Feng Xue OS
The work for the RFC
(https://gcc.gnu.org/pipermail/gcc-patches/2024-July/657860.html)
involves quite a lot of code change, so I have to separate it into several
batches of patchsets. This and the following patches constitute the first
batch.

Since a pattern statement coexists with normal statements in a way that it is
not linked into the function body, we should not invoke, on pattern
statements, utility procedures that depend on the def/use graph, such as
counting uses of a pseudo value defined by a pattern statement. This patch
fixes a bug of this type in vect pattern formation.

Thanks,
Feng
---
gcc/
* tree-vect-patterns.cc (vect_recog_bitfield_ref_pattern): Only call
single_imm_use if statement is not generated by pattern recognition.
---
 gcc/tree-vect-patterns.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 4570c25b664..ca8809e7cfd 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -2700,7 +2700,8 @@ vect_recog_bitfield_ref_pattern (vec_info *vinfo, stmt_vec_info stmt_info,
   /* If the only use of the result of this BIT_FIELD_REF + CONVERT is a
  PLUS_EXPR then do the shift last as some targets can combine the shift and
  add into a single instruction.  */
-  if (lhs && single_imm_use (lhs, &use_p, &use_stmt))
+  if (lhs && !STMT_VINFO_RELATED_STMT (stmt_info)
+  && single_imm_use (lhs, &use_p, &use_stmt))
 {
   if (gimple_code (use_stmt) == GIMPLE_ASSIGN
  && gimple_assign_rhs_code (use_stmt) == PLUS_EXPR)
-- 
2.17.1

From 52e1725339fc7e4552eb7916570790c4ab7f133d Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Fri, 14 Jun 2024 15:49:23 +0800
Subject: [PATCH 1/5] vect: Fix single_imm_use in tree_vect_patterns

Since a pattern statement coexists with normal statements in a way that it is
not linked into the function body, we should not invoke, on pattern
statements, utility procedures that depend on the def/use graph, such as
counting uses of a pseudo value defined by a pattern statement. This patch
fixes a bug of this type in vect pattern formation.

2024-06-14 Feng Xue 

gcc/
	* tree-vect-patterns.cc (vect_recog_bitfield_ref_pattern): Only call
	single_imm_use if statement is not generated by pattern recognition.
---
 gcc/tree-vect-patterns.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 4570c25b664..ca8809e7cfd 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -2700,7 +2700,8 @@ vect_recog_bitfield_ref_pattern (vec_info *vinfo, stmt_vec_info stmt_info,
   /* If the only use of the result of this BIT_FIELD_REF + CONVERT is a
  PLUS_EXPR then do the shift last as some targets can combine the shift and
  add into a single instruction.  */
-  if (lhs && single_imm_use (lhs, &use_p, &use_stmt))
+  if (lhs && !STMT_VINFO_RELATED_STMT (stmt_info)
+  && single_imm_use (lhs, &use_p, &use_stmt))
 {
   if (gimple_code (use_stmt) == GIMPLE_ASSIGN
 	  && gimple_assign_rhs_code (use_stmt) == PLUS_EXPR)
-- 
2.17.1



[RFC] Generalize formation of lane-reducing ops in loop reduction

2024-07-21 Thread Feng Xue OS
Hi,

  I composed some patches to generalize lane-reducing (dot-product is a typical
representative) pattern recognition, and prepared an RFC document to help
review. The original intention was to make a complete solution for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440. For sure, the work might
be limited, so I hope for your comments. Thanks.

-----------------------------------------------------------------

1. Background

For loop reduction of accumulating result of a widening operation, the
preferred pattern is lane-reducing operation, if supported by target. Because
this kind of operation need not preserve intermediate results of widening
operation, and only produces reduced amount of final results for accumulation,
choosing the pattern could lead to pretty compact codegen.
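
For example (a sketch; the exact instruction depends on the target), on
AArch64 with dot-product support the loop below can be vectorized so that
each vector step consumes 16 char inputs but keeps only 4 int partial sums,
e.g. with a single "sdot v0.4s, v1.16b, v2.16b", instead of materializing
all 16 widened products:

 int dot (signed char *a, signed char *b, int n)
 {
   int sum = 0;
   for (int i = 0; i < n; i++)
     sum += a[i] * b[i];
   return sum;
 }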

Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
(SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerated dot-product with a
constant operand as "1". Currently, gcc only supports recognition of simple
lane-reducing case, in which each accumulation statement of loop reduction
forms one pattern:

 char  *d0, *d1;
 short *s0, *s1;

 for (i) {
   sum += d0[i] * d1[i];  // DOT_PROD
   sum += abs(s0[i] - s1[i]); // SAD
 }

We could rewrite the example as the below using only one statement, whose non-
reduction addend is the sum of the above right-side parts. As a whole, the
addend would match nothing, while its two sub-expressions could be recognized
as corresponding lane-reducing patterns.

 for (i) {
   sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
 }

This case might be too elaborately crafted to be very common in reality.
Though, we do find seemingly variant but essentially similar code pattern in
some AI applications, which use matrix-vector operations extensively, some
usages are just single loop reduction composed of multiple dot-products. A
code snippet from ggml:

 for (int j = 0; j < qk/2; ++j) {
   const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
   const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10;

   const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
   const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;

   sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
 }

In the source level, it appears to be a natural and minor scaling-up of simple
one lane-reducing pattern, but it is beyond capability of current vectorization
pattern recognition, and needs some kind of generic extension to the framework.

2. Reasoning on validity of transform

First of all, we should tell what kind of expression is appropriate for lane-
reducing transform. Given a loop, we use the language of mathematics to define
an abstract function f(x, i), whose first independent variable "x" denotes a
value that will participate sum-based loop reduction either directly or
indirectly, and the 2nd one "i" specifies index of a loop iteration, which
implies other intra-iteration factor irrelevant to "x". The function itself
represents the transformed value by applying a series of operations on "x" in
the context of "i"th loop iteration, and this value is directly accumulated to
the loop reduction result. For the purpose of vectorization, it is implicitly
supposed that f(x, i) is a pure function, and free of loop dependency.

Additionally, for a value "x" defined in the loop, let "X" be a vector as
<x0, x1, ..., xM>, consisting of the "x" values in all iterations, to be
specific, "X[i]" corresponds to "x" at iteration "i", or "xi". With sequential
execution order, a loop reduction regarding f(x, i) would be expanded to:

 sum += f(x0, 0);
 sum += f(x1, 1);
 ...
 sum += f(xM, M);

2.1 Lane-reducing vs. Lane-combining

Following lane-reducing semantics, we introduce a new similar lane-combining
operation that also manipulates a subset of lanes/elements in a vector, by
accumulating all of them into one lane, and at the same time clearing the
rest of the lanes to zero. The two operations are equivalent in essence,
while a major difference is that the lane-combining operation does not reduce
the lanes of the vector. One advantage of this is that codegen of a
lane-combining operation could seamlessly inter-operate with that of normal
(non-lane-reducing) vector operations.

Any lane-combining operation could be synthesized by a sequence of the most
basic two-lane operations, which become the focuses of our analysis. Given two
lanes "i" and "j", and let X' = lane-combine(X, i, j), then we have:

  X  = <..., xi , ...,  xj, ...>
  X' = <..., xi + xj, ...,   0, ...>
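
For a concrete instance (constructed numbers), with i = 0 and j = 2:

  X  = <5, 2, 7, 4>
  X' = <12, 2, 0, 4>

The sum over all lanes is preserved (18 in both), which is exactly the
property that the invariance equations below build on.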

2.2 Equations for loop reduction invariance

Since the combining strategy of lane-reducing operations is target-specific,
for example, accumulating quad lanes into one (#0 + #1 + #2 + #3 => #0), or
low to high (#0 + #4 => #4), we just make a conservative assumption that
combining could happen on any two lanes in either order. Under this
precondition, it is legitimate to optimize evaluation of a value "x" with a
lane-reducing pattern, only if the loop reduction always produces an
invariant result no matter
w

Re: [PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp

2024-07-17 Thread Feng Xue OS
>> +inline unsigned int
>> +vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL)
>> +{
>> +  poly_uint64 vf;
>> +
>> +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
>> +vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>> +  else
>> +vf = 1;
>> +
>> +  if (node)
>> +{
>> +  vf *= SLP_TREE_LANES (node);
>> +  if (!vectype)
>> +   vectype = SLP_TREE_VECTYPE (node);
>> +}
>> +  else
>> +gcc_checking_assert (vectype);
>
> can you make the checking assert unconditional?
>
> OK with that change.  vect_get_num_vectors will ICE anyway
> I guess, so at your choice remove the assert completely.
>

OK, I removed the assert.

Thanks,
Feng


From: Richard Biener 
Sent: Monday, July 15, 2024 10:00 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp

On Sat, Jul 13, 2024 at 5:46 PM Feng Xue OS  wrote:
>
> Extend original vect_get_num_copies (pure loop-based) to calculate number of
> vector stmts for slp node regarding a generic vect region.
>
> Thanks,
> Feng
> ---
> gcc/
> * tree-vectorizer.h (vect_get_num_copies): New overload function.
> (vect_get_slp_num_vectors): New function.
> * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate
> number of vector stmts for slp node with vect_get_num_copies.
> (vect_slp_analyze_node_operations): Calculate number of vector elements
> for constant/external slp node with vect_get_num_copies.
> ---
>  gcc/tree-vect-slp.cc  | 19 +++
>  gcc/tree-vectorizer.h | 29 -
>  2 files changed, 31 insertions(+), 17 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index d0a8531fd3b..4dadbc6854d 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -6573,17 +6573,7 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
>   }
>  }
>else
> -{
> -  poly_uint64 vf;
> -  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> -   vf = loop_vinfo->vectorization_factor;
> -  else
> -   vf = 1;
> -  unsigned int group_size = SLP_TREE_LANES (node);
> -  tree vectype = SLP_TREE_VECTYPE (node);
> -  SLP_TREE_NUMBER_OF_VEC_STMTS (node)
> -   = vect_get_num_vectors (vf * group_size, vectype);
> -}
> +SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
>
>/* Handle purely internal nodes.  */
>if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> @@ -6851,12 +6841,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
>   && j == 1);
>   continue;
> }
> - unsigned group_size = SLP_TREE_LANES (child);
> - poly_uint64 vf = 1;
> - if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> -   vf = loop_vinfo->vectorization_factor;
> +
>   SLP_TREE_NUMBER_OF_VEC_STMTS (child)
> -   = vect_get_num_vectors (vf * group_size, vector_type);
> +   = vect_get_num_copies (vinfo, child);
>   /* And cost them.  */
>   vect_prologue_cost_for_slp (child, cost_vec);
> }
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 8eb3ec4df86..09923b9b440 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2080,6 +2080,33 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
>return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
>  }
>
> +/* Return the number of vectors in the context of vectorization region VINFO,
> +   needed for a group of total SIZE statements that are supposed to be
> +   interleaved together with no gap, and all operate on vectors of type
> +   VECTYPE.  If NULL, SLP_TREE_VECTYPE of NODE is used.  */
> +
> +inline unsigned int
> +vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL)
> +{
> +  poly_uint64 vf;
> +
> +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> +vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  else
> +vf = 1;
> +
> +  if (node)
> +{
> +  vf *= SLP_TREE_LANES (node);
> +  if (!vectype)
> +   vectype = SLP_TREE_VECTYPE (node);
> +}
> +  else
> +gcc_checking_assert (vectype);

can you make the checking assert unconditional?

OK with that change.  vect_get_num_vectors will ICE anyway
I guess, so at your choice remove the assert completely.

Thanks,
Richard.

> +
> +  return vect_get_num_vectors (vf, vecty

[PATCH 4/4] vect: Optimize order of lane-reducing statements in loop def-use cycles

2024-07-13 Thread Feng Xue OS
When transforming multiple lane-reducing operations in a loop reduction chain,
originally, corresponding vectorized statements are generated into def-use
cycles starting from 0. The def-use cycle with a smaller index would contain
more statements, which means more instruction dependency. For example:

   int sum = 1;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
   sum += n[i];   // normal 
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   ...
 }

For higher instruction parallelism in the final vectorized loop, an optimal
means is to distribute those effective vector lane-reducing ops evenly among
all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM and SADs are
generated into disparate cycles, and instruction dependency among them could
be eliminated.

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = sum_v1;  // copy
   sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
   sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

   ...
 }

Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 64 ++-
 gcc/tree-vectorizer.h |  6 
 2 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index e72d692ffa3..5bc6e526d43 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8841,6 +8841,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
+  sum += n[i];   // normal 
 }
 
 The vector size is 128-bit,vectorization factor is 16.  Reduction
@@ -8858,19 +8859,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 
-  sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
-  sum_v1 = sum_v1;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 
-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy
+
+  sum_v0 += n_v0[i: 0  ~ 3 ];
+  sum_v1 += n_v1[i: 4  ~ 7 ];
+  sum_v2 += n_v2[i: 8  ~ 11];
+  sum_v3 += n_v3[i: 12 ~ 15];
 }
 
-  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
-   */
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vector lane-reducing
+ops be distributed evenly among all def-use cycles.  In the above
+example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate
+cycles, instruction dependency among them could be eliminated.  */
   unsigned effec_ncopies = vec_oprnds[0].length ();
   unsigned total_ncopies = vec_oprnds[reduc_index].length ();
 
@@ -8884,6 +8893,47 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
  vec_oprnds[i].safe_grow_cleared (total_ncopies);
}
}
+
+  tree reduc_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  gcc_assert (reduc_vectype_in);
+
+  unsigned effec_reduc_ncopies
+   = vect_get_num_copies (loop_vinfo, slp_no

[PATCH 3/4] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-07-13 Thread Feng Xue OS
For a lane-reducing operation (dot-prod/widen-sum/sad) in loop reduction, the
current vectorizer could only handle the pattern if the reduction chain does
not contain any other operation, no matter whether the other is normal or
lane-reducing.

This patch removes some constraints in reduction analysis to allow multiple
arbitrary lane-reducing operations with mixed input vectypes in a loop
reduction chain. For example:

   int sum = 1;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
 }

The vector size is 128-bit, and the vectorization factor is 16. Reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 1 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1

Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (vectorizable_lane_reducing): New function
declaration.
* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
vectorizable_lane_reducing to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
code related to emulated_mixed_dot_prod.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing.
(vect_transform_reduction): Adjust comments with updated example.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c: New test.
* gcc.dg/vect/vect-reduc-chain-2.c: New test.
* gcc.dg/vect/vect-reduc-chain-3.c: New test.
* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c: New test.
* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c: New test.
* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c: New test.
* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c: New test.
* gcc.dg/vect/vect-reduc-dot-slp-1.c: New test.
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  64 +
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  79 ++
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  68 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 ++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 +
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  60 +
 gcc/tree-vect-loop.cc | 240 +-
 gcc/tree-vect-stmts.cc|   2 +
 gcc/tree-vectorizer.h |   2 +
 11 files changed, 750 insertions(+), 69 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..80b0089ea0f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,64 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+{
+  res += a[i] * b[i];
+  res += c[i] * d[i];
+  r

[PATCH 2/4] vect: Refit lane-reducing to be normal operation

2024-07-13 Thread Feng Xue OS
The number of vector stmts for an operation is calculated based on the output
vectype. This is over-estimated for a lane-reducing operation, which would
cause vector def/use mismatches when we want to support loop reductions that
mix lane-reducing and normal operations. One solution is to refit the
lane-reducing operation to behave like a normal one, by adding new
pass-through copies to fill any possible def/use gap. The resultant
superfluous statements could be optimized away after vectorization.  For
example:

  int sum = 1;
  for (i)
{
  sum += d0[i] * d1[i];  // dot-prod <vector(16) char>
}

  The vector size is 128-bit, the vectorization factor is 16.  Reduction
  statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
{
  sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
  sum_v1 = sum_v1;  // copy
  sum_v2 = sum_v2;  // copy
  sum_v3 = sum_v3;  // copy
}

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
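
The inserted copies are semantically no-ops, which a small standalone model
can demonstrate (illustration only; names and values are made up, assuming
VF = 16 as above):

  #include <cassert>

  int
  main ()
  {
    signed char d0[16], d1[16];
    for (int i = 0; i < 16; i++)
      {
        d0[i] = (signed char) (i - 8);
        d1[i] = (signed char) (3 * i - 20);
      }

    int scalar = 1;
    for (int i = 0; i < 16; i++)
      scalar += d0[i] * d1[i];

    /* Model the four def-use cycles: only sum_v[0] receives the DOT_PROD
       result; sum_v[1..3] are just passed through unchanged.  */
    int sum_v[4] = { 1, 0, 0, 0 };
    for (int i = 0; i < 16; i++)
      sum_v[0] += d0[i] * d1[i];

    assert (scalar == sum_v[0] + sum_v[1] + sum_v[2] + sum_v[3]);
    return 0;
  }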

Thanks,
Feng
---
gcc/
* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
Calculate effective vector stmts number with generic
vect_get_num_copies.
(vect_transform_reduction): Insert copies for lane-reducing so as to
fix over-estimated vector stmts number.
(vect_transform_cycle_phi): Calculate vector PHI number only based on
output vectype.
* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
adjustment on vector stmts number specific to slp reduction.
---
 gcc/tree-vect-loop.cc | 134 +++---
 gcc/tree-vect-slp.cc  |  27 +++--
 2 files changed, 121 insertions(+), 40 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a64b5082bd1..5ac83e76975 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7468,12 +7468,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
= get_masked_reduction_fn (reduc_fn, vectype_in);
   vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
   vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
-  unsigned nvectors;
-
-  if (slp_node)
-   nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-  else
-   nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+  unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node,
+  vectype_in);
 
   if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1);
@@ -8595,12 +8591,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
   gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
   int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+
+  if (!vectype_in)
+vectype_in = STMT_VINFO_VECTYPE (stmt_info);
 
   if (slp_node)
 {
   ncopies = 1;
-  vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+  vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in);
 }
   else
 {
@@ -8658,13 +8657,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   bool lane_reducing = lane_reducing_op_p (code);
   gcc_assert (single_defuse_cycle || lane_reducing);
 
+  if (lane_reducing)
+{
+  /* The last operand of lane-reducing op is for reduction.  */
+  gcc_assert (reduc_index == (int) op.num_ops - 1);
+}
+
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
   tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out);
 
+  if (lane_reducing && !slp_node && !single_defuse_cycle)
+{
+  /* Note: there are still vectorizable cases that can not be handled by
+single-lane slp.  Probably it would take some time to evolve the
+feature to a mature state.  So we have to keep the below non-slp code
+path as failsafe for lane-reducing support.  */
+  gcc_assert (op.num_ops <= 3);
+  for (unsigned i = 0; i < op.num_ops; i++)
+   {
+ unsigned oprnd_ncopies = ncopies;
+
+ if ((int) i == reduc_index)
+   {
+ tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+ oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype);
+   }
+
+ vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies,
+op.ops[i], &vec_oprnds[i]);
+   }
+}
   /* Get NCOPIES vector definitions for all operands except the reduction
  definition.  */
-  if (!cond_fn_p)
+  else if (!cond_fn_p)
 {
   gcc_assert (reduc_index >= 0 && reduc_index <= 2);
   vect_get_vec_defs (loop_vinf

[PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp

2024-07-13 Thread Feng Xue OS
Extend the original vect_get_num_copies (purely loop-based) to calculate the
number of vector stmts for an slp node with regard to a generic vect region.
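
A standalone mimic of the new overload's arithmetic (illustration only,
assuming a constant VF; this is not the GCC implementation):

  #include <cassert>

  static unsigned
  get_num_copies (unsigned vf, unsigned slp_lanes /* 0 = non-slp */,
                  unsigned nunits)
  {
    if (slp_lanes)
      vf *= slp_lanes;   /* mirrors vf *= SLP_TREE_LANES (node)  */
    return vf / nunits;  /* mirrors vect_get_num_vectors, exact here  */
  }

  int
  main ()
  {
    /* With VF = 16 and vector(4) int, a single-lane slp node and the
       pure loop-based path agree on 4 vector stmts.  */
    assert (get_num_copies (16, 1, 4) == get_num_copies (16, 0, 4));
    /* Overriding with a lane-reducing input vectype vector(16) char
       yields 1 copy instead.  */
    assert (get_num_copies (16, 1, 16) == 1);
    return 0;
  }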

Thanks,
Feng
---
gcc/
* tree-vectorizer.h (vect_get_num_copies): New overload function.
(vect_get_slp_num_vectors): New function.
* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate
number of vector stmts for slp node with vect_get_num_copies.
(vect_slp_analyze_node_operations): Calculate number of vector elements
for constant/external slp node with vect_get_num_copies.
---
 gcc/tree-vect-slp.cc  | 19 +++
 gcc/tree-vectorizer.h | 29 -
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index d0a8531fd3b..4dadbc6854d 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6573,17 +6573,7 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
  }
 }
   else
-{
-  poly_uint64 vf;
-  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-   vf = loop_vinfo->vectorization_factor;
-  else
-   vf = 1;
-  unsigned int group_size = SLP_TREE_LANES (node);
-  tree vectype = SLP_TREE_VECTYPE (node);
-  SLP_TREE_NUMBER_OF_VEC_STMTS (node)
-   = vect_get_num_vectors (vf * group_size, vectype);
-}
+SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node);
 
   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
@@ -6851,12 +6841,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
  && j == 1);
  continue;
}
- unsigned group_size = SLP_TREE_LANES (child);
- poly_uint64 vf = 1;
- if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-   vf = loop_vinfo->vectorization_factor;
+
  SLP_TREE_NUMBER_OF_VEC_STMTS (child)
-   = vect_get_num_vectors (vf * group_size, vector_type);
+   = vect_get_num_copies (vinfo, child);
  /* And cost them.  */
  vect_prologue_cost_for_slp (child, cost_vec);
}
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..09923b9b440 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2080,6 +2080,33 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
   return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
 }
 
+/* Return the number of vectors in the context of vectorization region VINFO,
+   needed for a group of total SIZE statements that are supposed to be
+   interleaved together with no gap, and all operate on vectors of type
+   VECTYPE.  If NULL, SLP_TREE_VECTYPE of NODE is used.  */
+
+inline unsigned int
+vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL)
+{
+  poly_uint64 vf;
+
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  else
+vf = 1;
+
+  if (node)
+{
+  vf *= SLP_TREE_LANES (node);
+  if (!vectype)
+   vectype = SLP_TREE_VECTYPE (node);
+}
+  else
+gcc_checking_assert (vectype);
+
+  return vect_get_num_vectors (vf, vectype);
+}
+
 /* Return the number of copies needed for loop vectorization when
a statement operates on vectors of type VECTYPE.  This is the
vectorization factor divided by the number of elements in
@@ -2088,7 +2115,7 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
 inline unsigned int
 vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype)
 {
-  return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo), vectype);
+  return vect_get_num_copies (loop_vinfo, NULL, vectype);
 }
 
 /* Update maximum unit count *MAX_NUNITS so that it accounts for
-- 
2.17.1

Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing

2024-07-13 Thread Feng Xue OS
> > Hi, Richard,
> >
> > Let me explain some idea that has to be chosen for lane-reducing. The key
> > complication is that these ops are associated with two kinds of vec_nums,
> > one is number of effective vector stmts, which is used by partial 
> > vectorization
> > function such as vect_get_loop_mask.  The other is number of total created
> > vector stmts. Now we should make it aligned with normal op, in order to
> > interoperate with normal op. Suppose expressions mixed with lane-reducing
> > and normal as:
> >
> > temp = lane_reducing<16*char> + expr<4*int>;
> > temp = cst<4*int> * lane_reducing<16*char>;
> >
> > If only generating effective vector stmt for lane_reducing, vector def/use
> > between ops will never be matched, so extra pass-through copies are
> > necessary. This is why we say "refit a lane-reducing to be a fake normal 
> > op".
> 
> And this only happens in vect_transform_reduction, right?

Yes, it is.

> 
> The other pre-existing issue is that for single_defuse_cycle optimization
> SLP_TREE_NUMBER_OF_VEC_STMTS is also off (too large).  But here
> the transform also goes through vect_transform_reduction.
> 
> > The requirement of two vec_stmts are independent of how we will implement
> > SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor vect code
> > to unify ncopies/vec_num computation and completely remove
> > SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a large task, and might
> > be overkill for these lane-reducing patches. So I will keep it as before, 
> > and do
> > not touch it as what I have done in this patch.
> >
> > Since one SLP_TREE_NUMBER_OF_VEC_STMTS could not be used for two purposes,
> > your previous suggestion might not work:
> >
> > > As said, I don't like this much.  vect_slp_analyze_node_operations_1 sets 
> > > this
> > > and I think the existing "exception"
> > >
> > >  /* Calculate the number of vector statements to be created for the
> > > scalar stmts in this node.  For SLP reductions it is equal to the
> > > number of vector statements in the children (which has already been
> > > calculated by the recursive call).  Otherwise it is the number of
> > > scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> > > VF divided by the number of elements in a vector.  */
> > >  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> > >  && !STMT_VINFO_DATA_REF (stmt_info)
> > >  && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> > >{
> > >  for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> > >if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) ==
> > >  vect_internal_def)
> > >  {
> > >SLP_TREE_NUMBER_OF_VEC_STMTS (node)
> > >  = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
> > >break;
> > >  }
> > >}
> > >
> > > could be changed (or amended if replacing doesn't work out) to
> > >
> > >   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> > >   && STMT_VINFO_REDUC_IDX (stmt_info)
> > >   // do we have this always set?
> > >   && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
> > >{
> > >   do the same as in else {} but using VECTYPE_IN
> > >}
> > >
> > > Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN
> > > when that's set instead of SLP_TREE_VECTYPE?  As said having wrong
> > > SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire.
> >
> > Then the alternative is to limit special handling related to the vec_num 
> > only
> > inside vect_transform_reduction. Is that ok? Or any other suggestion?
> 
> I think that's kind-of in line with the suggestion of a reduction
> specific VF, so yes,
> not using SLP_TREE_NUMBER_OF_VEC_STMTS in vect_transform_reduction
> sounds fine to me and would be a step towards not having
> SLP_TREE_NUMBER_OF_VEC_STMTS
> where the function would be responsible for appropriate allocation as well.

OK. I remade the 4 patches and sent them in new emails.

Thanks,
Feng


> From: Richard Biener 
> Sent: Thursday, July 11, 2024 5:43 PM
> To: Feng Xue OS; Richard Sandiford
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp 
> reduction with lane-reducing
>
> On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS

Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing

2024-07-11 Thread Feng Xue OS
Hi, Richard,

Let me explain an idea that has to be chosen for lane-reducing. The key
complication is that these ops are associated with two kinds of vec_nums:
one is the number of effective vector stmts, which is used by partial
vectorization functions such as vect_get_loop_mask; the other is the number
of total created vector stmts. Now we should make the latter aligned with a
normal op, in order to interoperate with normal ops. Suppose expressions
mixing lane-reducing and normal ops as:

temp = lane_reducing<16*char> + expr<4*int>;
temp = cst<4*int> * lane_reducing<16*char>;

If we only generate the effective vector stmts for lane_reducing, the vector
def/use between ops will never match, so extra pass-through copies are
necessary. This is why we say "refit a lane-reducing to be a fake normal op".
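
A tiny standalone model of the refitting (hypothetical names, not GCC code):
a dot-prod over vector(16) char produces one effective vector def per
iteration, while the vector(4) int side of the expression needs four, so the
gap is filled with pass-through copies of the reduction carries:

  #include <cstdio>
  #include <string>
  #include <vector>

  int
  main ()
  {
    std::vector<std::string> phi_defs
      = { "sum_v0", "sum_v1", "sum_v2", "sum_v3" };
    std::vector<std::string> defs = { "DOT_PROD (..., sum_v0)" };

    /* Pad the def list up to the normal-op count; each added def is a
       trivial copy sum_vi = sum_vi.  */
    for (size_t i = defs.size (); i < phi_defs.size (); i++)
      defs.push_back (phi_defs[i]);

    for (size_t i = 0; i < defs.size (); i++)
      std::printf ("cycle %zu: %s\n", i, defs[i].c_str ());
    return 0;
  }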

The requirement for two vec_stmt counts is independent of how we will implement
SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor the vect code
to unify the ncopies/vec_num computation and completely remove
SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a large task, and might
be overkill for these lane-reducing patches. So I will keep it as before, and
not touch it beyond what I have done in this patch.

Since one SLP_TREE_NUMBER_OF_VEC_STMTS could not be used for two purposes,
your previous suggestion might not work:

> As said, I don't like this much.  vect_slp_analyze_node_operations_1 sets this
> and I think the existing "exception"
>
>  /* Calculate the number of vector statements to be created for the
> scalar stmts in this node.  For SLP reductions it is equal to the
> number of vector statements in the children (which has already been
> calculated by the recursive call).  Otherwise it is the number of
> scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> VF divided by the number of elements in a vector.  */
>  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
>  && !STMT_VINFO_DATA_REF (stmt_info)
>  && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
>{
>  for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
>if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) ==
>  vect_internal_def)
>  {
>SLP_TREE_NUMBER_OF_VEC_STMTS (node)
>  = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
>break;
>  }
>}
>
> could be changed (or amended if replacing doesn't work out) to
> 
>   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
>   && STMT_VINFO_REDUC_IDX (stmt_info)
>   // do we have this always set?
>   && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
>{
>   do the same as in else {} but using VECTYPE_IN
>}
> 
> Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN
> when that's set instead of SLP_TREE_VECTYPE?  As said having wrong
> SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire.

Then the alternative is to limit special handling related to the vec_num only
inside vect_transform_reduction. Is that ok? Or any other suggestion?

Thanks,
Feng

From: Richard Biener 
Sent: Thursday, July 11, 2024 5:43 PM
To: Feng Xue OS; Richard Sandiford
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp 
reduction with lane-reducing

On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS
 wrote:
>
> Vector stmts number of an operation is calculated based on output vectype.
> This is over-estimated for lane-reducing operation. Sometimes, to work around
> the issue, we have to rely on additional logic to deduce an exactly accurate
> number by other means. Aiming at the inconvenience, in this patch, we would
> "turn" lane-reducing operation into a normal one by inserting new trivial
> statements like zero-valued PHIs and pass-through copies, which could be
> optimized away by later passes. At the same time, a new field is added for
> slp node to hold number of vector stmts that are really effective after
> vectorization. For example:

Adding Richard into the loop.

I'm sorry, but this feels a bit backwards - in the end I was hoping that we
can get rid of SLP_TREE_NUMBER_OF_VEC_STMTS completely.
We do currently have the odd ncopies (non-SLP) vs. vec_num (SLP)
duality but in reality all vectorizable_* should know the number of
stmt copies (or output vector defs) to produce by looking at the vector
type and the vectorization factor (and in the SLP case the number of
lanes represented by the node).

That means that in the end vectorizable_* could at transform time
simply make sure that SLP_TREE_VEC_DEF is appropriately
created (currently generic code does this based on
SLP_TREE_NUMBER_OF_VEC_STMTS and also generic code
tries to determine SLP_TREE_NUMBER_OF_VEC

[PATCH 4/4] vect: Optimize order of lane-reducing statements in loop def-use cycles

2024-07-11 Thread Feng Xue OS
When transforming multiple lane-reducing operations in a loop reduction chain,
originally, the corresponding vectorized statements are generated into def-use
cycles starting from 0. The def-use cycle with the smaller index would contain
more statements, which means more instruction dependency. For example:

   int sum = 1;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod <vector(16) char>
   sum += w[i];   // widen-sum <vector(16) char>
   sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
   sum += n[i];   // normal <vector(4) int>
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   ...
 }

For higher instruction parallelism in the final vectorized loop, an optimal
means is to distribute those effective vector lane-reducing ops evenly
among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM and
the SADs are generated into disparate cycles, so the instruction dependency
among them can be eliminated.

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = sum_v1;  // copy
   sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
   sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

   ...
 }
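
The placement policy itself is simple round-robin. A standalone model
(illustration only, with made-up names) shows how a rolling start position,
which the patch records as reduc_result_pos, spreads the effective stmts:

  #include <cstdio>

  int
  main ()
  {
    const unsigned ncycles = 4;  /* def-use cycles, as in the example  */
    unsigned result_pos = 0;     /* models the new reduc_result_pos  */

    struct { const char *name; unsigned effec; } ops[]
      = { { "DOT_PROD", 1 }, { "WIDEN_SUM", 1 }, { "SAD", 2 } };

    for (const auto &op : ops)
      {
        for (unsigned i = 0; i < op.effec; i++)
          std::printf ("%s -> cycle %u\n", op.name,
                       (result_pos + i) % ncycles);
        result_pos = (result_pos + op.effec) % ncycles;
      }
    return 0;
  }

This assigns DOT_PROD to cycle 0, WIDEN_SUM to cycle 1 and the two SADs to
cycles 2 and 3, exactly the distribution shown above.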

Thanks,
Feng

---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
(vect_transform_reduction): Add a new parameter of slp_instance type.
* tree-vect-stmts.cc (vect_transform_stmt): Add a new argument
slp_node_instance to vect_transform_reduction.
* tree-vect-loop.cc (vect_transform_reduction): Add a new parameter
slp_node_instance. Generate lane-reducing statements in an optimized
order.
---
 gcc/tree-vect-loop.cc  | 73 +++---
 gcc/tree-vect-stmts.cc |  3 +-
 gcc/tree-vectorizer.h  |  8 -
 3 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a3374fb2d1a..841ef4c9120 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8673,7 +8673,8 @@ vect_emulate_mixed_dot_prod (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
 bool
 vect_transform_reduction (loop_vec_info loop_vinfo,
  stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
- gimple **vec_stmt, slp_tree slp_node)
+ gimple **vec_stmt, slp_tree slp_node,
+ slp_instance slp_node_instance)
 {
   tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
   class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
@@ -8863,6 +8864,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum += d0[i] * d1[i];  // dot-prod <vector(16) char>
   sum += w[i];   // widen-sum <vector(16) char>
   sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
+  sum += n[i];   // normal <vector(4) int>
 }
 
 The vector size is 128-bit,vectorization factor is 16.  Reduction
@@ -8880,25 +8882,30 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 
-  sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
-  sum_v1 = sum_v1;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 
-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy
+
+  sum_v0 += n_v0[i: 0  ~ 3 ];
+  sum_v1 += n_v1[i: 4  ~ 7 ];
+  sum_v2 += n_v2[i: 8  ~ 11];
+  sum_v3 += n_v3[i: 12 ~ 15];
 }
 
-  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
-   */
+Moreover, for a higher instruction pa

[PATCH 3/4] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-07-11 Thread Feng Xue OS
For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction, the
current vectorizer could only handle the pattern if the reduction chain did not
contain any other operation, whether normal or lane-reducing.

This patch removes some constraints in reduction analysis to allow multiple
arbitrary lane-reducing operations with mixed input vectypes in a loop
reduction chain. For example:

   int sum = 1;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod <vector(16) char>
   sum += w[i];   // widen-sum <vector(16) char>
   sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
 }

The vector size is 128-bit and the vectorization factor is 16. Reduction statements
would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 1 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1

Thanks,
Feng

---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (vectorizable_lane_reducing): New function
declaration.
* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
vectorizable_lane_reducing to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
code related to emulated_mixed_dot_prod.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing.
(vect_transform_reduction): Extend transformation to support reduction
statements with mixed input vectypes for non-slp code path.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c
* gcc.dg/vect/vect-reduc-chain-2.c
* gcc.dg/vect/vect-reduc-chain-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  64 
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  79 +
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  68 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 ++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  60 
 gcc/tree-vect-loop.cc | 285 +-
 gcc/tree-vect-stmts.cc|   2 +
 gcc/tree-vectorizer.h |   2 +
 11 files changed, 785 insertions(+), 79 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..80b0089ea0f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,64 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+ 

[PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing

2024-07-11 Thread Feng Xue OS
The number of vector stmts for an operation is calculated based on the output
vectype. This is over-estimated for a lane-reducing operation. Sometimes, to
work around the issue, we have to rely on additional logic to deduce an
accurate number by other means. Given this inconvenience, in this patch we
"turn" a lane-reducing operation into a normal one by inserting new trivial
statements like zero-valued PHIs and pass-through copies, which can be
optimized away by later passes. At the same time, a new field is added to the
slp node to hold the number of vector stmts that are really effective after
vectorization. For example:

  int sum = 1;
  for (i)
{
  sum += d0[i] * d1[i];  // dot-prod <vector(16) char>
}

  The vector size is 128-bit, the vectorization factor is 16.  Reduction
  statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
{
  sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
  sum_v1 = sum_v1;  // copy
  sum_v2 = sum_v2;  // copy
  sum_v3 = sum_v3;  // copy
}

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
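
In macro terms, the patch keeps two counts per slp node; for the dot-prod
example above they differ as follows (standalone model, illustration only):

  #include <cassert>

  int
  main ()
  {
    const unsigned vf = 16;
    const unsigned nunits_out = 4;    /* vector(4) int   */
    const unsigned nunits_in  = 16;   /* vector(16) char */

    unsigned total = vf / nunits_out; /* SLP_TREE_VEC_STMTS_NUM       */
    unsigned effec = vf / nunits_in;  /* SLP_TREE_VEC_STMTS_EFFEC_NUM */

    assert (total == 4 && effec == 1);
    /* The difference (3) is the number of pass-through copies inserted
       per iteration in the transformed loop above.  */
    return 0;
  }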

Thanks,
Feng

---
gcc/
* tree-vectorizer.h (vec_stmts_effec_size): New field in _slp_tree.
(SLP_TREE_VEC_STMTS_EFFEC_NUM): New macro.
(vect_get_num_vectors): New overload function.
(vect_get_slp_num_vectors): New function.
* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Use
effective vector stmts number.
(vectorizable_reduction): Compute number of effective vector stmts for
lane-reducing op and reduction PHI.
(vect_transform_reduction): Insert copies for lane-reducing so as to
fix inaccurate vector stmts number.
(vect_transform_cycle_phi): Only need to calculate vector PHI number
based on input vectype for non-slp code path.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize effective vector
stmts number to zero.
(vect_slp_analyze_node_operations_1): Remove adjustment on vector
stmts number specific to slp reduction.
(vect_slp_analyze_node_operations): Compute number of vector elements
for constant/external slp node with vect_get_slp_num_vectors.
---
 gcc/tree-vect-loop.cc | 139 --
 gcc/tree-vect-slp.cc  |  56 ++---
 gcc/tree-vectorizer.h |  45 ++
 3 files changed, 183 insertions(+), 57 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index c183e2b6068..5ad9836d6c8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7471,7 +7471,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
   unsigned nvectors;
 
   if (slp_node)
-   nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node);
+   nvectors = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
   else
nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
 
@@ -7594,6 +7594,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   stmt_vec_info phi_info = stmt_info;
   if (!is_a <gphi *> (stmt_info->stmt))
 {
+  if (lane_reducing_stmt_p (stmt_info->stmt) && slp_node)
+   {
+ /* Compute number of effective vector statements for lane-reducing
+ops.  */
+ vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+ gcc_assert (vectype_in);
+ SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
+   = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
+   }
   STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
   return true;
 }
@@ -8012,14 +8021,25 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (STMT_VINFO_LIVE_P (phi_info))
 return false;
 
-  if (slp_node)
-ncopies = 1;
-  else
-ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
 
-  gcc_assert (ncopies >= 1);
+  if (slp_node)
+{
+  ncopies = 1;
 
-  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
+  if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype_in), nunits_out))
+   {
+ /* Not all vector reduction PHIs would be used, compute number
+of the effective statements.  */
+ SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
+   = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
+   }
+}
+  else
+{
+  ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+  gcc_assert (ncopies >= 1);
+}
 
   if (nested_cycle)
 {
@@ -8360,7 +8380,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
|| (slp_node
   && !REDUC_GROUP_FIRST_ELEMENT (stmt_info)
   && SLP_TREE_LANES (slp_node) == 1
-  && vect_get_num_copies (loop_vinfo, vectype_in) > 1))
+  && SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) > 1))
   && (STMT_VINFO_RELEVANT (stmt

[PATCH 1/4] vect: Shorten name of macro SLP_TREE_NUMBER_OF_VEC_STMTS

2024-07-11 Thread Feng Xue OS
This patch series is recomposed and split from
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655974.html.

As I will add a new field tightly coupled with "vec_stmts_size", following the
original naming convention would make the new macro very long. So it is better
to choose equally meaningful but shorter names; this patch makes the change for
the existing macro, and the other new patch will handle the new field and macro
accordingly.

Thanks,
Feng

---
gcc/
* tree-vectorizer.h (SLP_TREE_NUMBER_OF_VEC_STMTS): Change the macro
to SLP_TREE_VEC_STMTS_NUM.
* tree-vect-stmts.cc (vect_model_simple_cost): Likewise.
(check_load_store_for_partial_vectors): Likewise.
(vectorizable_bswap): Likewise.
(vectorizable_call): Likewise.
(vectorizable_conversion): Likewise.
(vectorizable_shift): Likewise. And replace direct field reference
to "vec_stmts_size" with the new macro.
(vectorizable_operation): Likewise.
(vectorizable_store): Likewise.
(vectorizable_load): Likewise.
(vectorizable_condition): Likewise.
* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
Likewise.
(vectorizable_reduction): Likewise.
(vect_transform_reduction): Likewise.
(vectorizable_phi): Likewise.
(vectorizable_recurr): Likewise.
(vectorizable_induction): Likewise.
(vectorizable_live_operation): Likewise.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Likewise.
(vect_slp_analyze_node_operations_1): Likewise.
(vect_prologue_cost_for_slp): Likewise.
(vect_slp_analyze_node_operations): Likewise.
(vect_create_constant_vectors): Likewise.
(vect_get_slp_vect_def): Likewise.
(vect_transform_slp_perm_load_1): Likewise.
(vectorizable_slp_permutation_1): Likewise.
(vect_schedule_slp_node): Likewise.
(vectorize_slp_instance_root_stmt): Likewise.
---
 gcc/tree-vect-loop.cc  | 17 +++---
 gcc/tree-vect-slp.cc   | 34 +--
 gcc/tree-vect-stmts.cc | 52 --
 gcc/tree-vectorizer.h  |  2 +-
 4 files changed, 51 insertions(+), 54 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a64b5082bd1..c183e2b6068 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7471,7 +7471,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
   unsigned nvectors;
 
   if (slp_node)
-   nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+   nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node);
   else
nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
 
@@ -8121,7 +8121,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
|| reduction_type == CONST_COND_REDUCTION
|| reduction_type == EXTRACT_LAST_REDUCTION)
   && slp_node
-  && SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) > 1)
+  && SLP_TREE_VEC_STMTS_NUM (slp_node) > 1)
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -8600,7 +8600,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   if (slp_node)
 {
   ncopies = 1;
-  vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+  vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
 }
   else
 {
@@ -9196,7 +9196,7 @@ vectorizable_phi (vec_info *,
 for the scalar and the vector PHIs.  This avoids artificially
 favoring the vector path (but may pessimize it in some cases).  */
   if (gimple_phi_num_args (as_a <gphi *> (stmt_info->stmt)) > 1)
-   record_stmt_cost (cost_vec, SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node),
+   record_stmt_cost (cost_vec, SLP_TREE_VEC_STMTS_NUM (slp_node),
  vector_stmt, stmt_info, vectype, 0, vect_body);
   STMT_VINFO_TYPE (stmt_info) = phi_info_type;
   return true;
@@ -9304,7 +9304,7 @@ vectorizable_recurr (loop_vec_info loop_vinfo, stmt_vec_info stmt_info,
   tree vectype = STMT_VINFO_VECTYPE (stmt_info);
   unsigned ncopies;
   if (slp_node)
-ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+ncopies = SLP_TREE_VEC_STMTS_NUM (slp_node);
   else
 ncopies = vect_get_num_copies (loop_vinfo, vectype);
   poly_int64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
@@ -10217,8 +10217,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
  }
  /* loop cost for vec_loop.  */
  inside_cost
-   = record_stmt_cost (cost_vec,
-   SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node),
+   = record_stmt_cost (cost_vec, SLP_TREE_VEC_STMTS_NUM (slp_node),
vector_stmt, stmt_info, 0, vect_body);
  /* prologue cost for vec_init (if not nested) and step.  */
  prologue_cost = record_stmt_cost (cost_vec, 1 + !nested_in_vect_loop,
@@ -10289,7 +10288,7 @@ vectorizable_induction (loop_vec_info loop_vinfo,
}
 
   /* 

Re: [PATCH] vect: Fix shift-by-induction for single-lane slp

2024-06-27 Thread Feng Xue OS
I added two test cases for the examples you mentioned.

BTW: would you please look over the other 3 lane-reducing patches that have been
updated? If OK, I would consider checking them in.

Thanks,
Feng

--

Allow shift-by-induction for an slp node when it is single-lane, which is
aligned with the original loop-based handling.

gcc/
* tree-vect-stmts.cc (vectorizable_shift): Allow shift-by-induction
for single-lane slp node.

gcc/testsuite/
* gcc.dg/vect/vect-shift-6.c
* gcc.dg/vect/vect-shift-7.c
---
 gcc/testsuite/gcc.dg/vect/vect-shift-6.c | 51 +++
 gcc/testsuite/gcc.dg/vect/vect-shift-7.c | 65 
 gcc/tree-vect-stmts.cc   |  2 +-
 3 files changed, 117 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-shift-6.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-shift-7.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-shift-6.c b/gcc/testsuite/gcc.dg/vect/vect-shift-6.c
new file mode 100644
index 000..940f7f2a4db
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-shift-6.c
@@ -0,0 +1,51 @@
+/* { dg-require-effective-target vect_shift } */
+/* { dg-require-effective-target vect_int } */
+
+#include <stdlib.h>
+#include "tree-vect.h"
+
+#define N 32
+
+int A[N];
+int B[N];
+
+#define FN(name)   \
+__attribute__((noipa)) \
+void name(int *a)  \
+{  \
+  for (int i = 0; i < N / 2; i++)  \
+{  \
+   a[2 * i + 0] <<= i; \
+   a[2 * i + 1] <<= i; \
+}  \
+}
+
+
+FN(foo_vec)
+
+#pragma GCC push_options
+#pragma GCC optimize ("O0")
+FN(foo_novec)
+#pragma GCC pop_options
+
+int main ()
+{
+  int i;
+
+  check_vect ();
+
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+A[i] = B[i] = -(i + 1);
+
+  foo_vec(A);
+  foo_novec(B);
+
+  /* check results:  */
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+if (A[i] != B[i])
+  abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/vect/vect-shift-7.c b/gcc/testsuite/gcc.dg/vect/vect-shift-7.c
new file mode 100644
index 000..a33b120343b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-shift-7.c
@@ -0,0 +1,65 @@
+/* { dg-require-effective-target vect_shift } */
+/* { dg-require-effective-target vect_int } */
+
+#include <stdlib.h>
+#include "tree-vect.h"
+
+#define N 32
+#define M 64
+
+int A[N];
+int B[N];
+
+#define FN(name)   \
+__attribute__((noipa)) \
+void name(int *a)  \
+{  \
+  for (int i = 0; i < N / 2; i++)  \
+{  \
+  int s1 = i;  \
+  int s2 = s1 + 1; \
+  int r1 = 0;  \
+  int r2 = 1;  \
+  \
+  for (int j = 0; j < M; j++)  \
+ { \
+r1 += j << s1; \
+r2 += j << s2; \
+s1++;  \
+s2++;  \
+ } \
+   \
+   a[2 * i + 0] = r1;  \
+   a[2 * i + 1] = r2;  \
+}  \
+}
+
+
+FN(foo_vec)
+
+#pragma GCC push_options
+#pragma GCC optimize ("O0")
+FN(foo_novec)
+#pragma GCC pop_options
+
+int main ()
+{
+  int i;
+
+  check_vect ();
+
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+A[i] = B[i] = 0;
+
+  foo_vec(A);
+  foo_novec(B);
+
+  /* check results:  */
+#pragma GCC novector
+  for (i = 0; i < N; i++)
+if (A[i] != B[i])
+  abort ();
+
+  return 0;
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..840e162c7f0 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6247,7 +6247,7 @@ vectorizable_shift (vec_info *vinfo,
   if ((dt[1] == vect_internal_def
|| dt[1] == vect_induction_def
|| dt[1] == vect_nested_cycle)
-  && !slp_node)
+  && (!slp_node || SLP_TREE_LANES (slp_node) == 1))
 scalar_shift_arg = false;
   else if (dt[1] == vect_constant_def
   || dt[1] == vect_external_def
--
2.17.1

________
From: Richard Biener 
Sent: Thursday, June 27, 2024 12:49 AM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] vect: Fix shift-by-induction for single-lane slp

On Wed, Jun 26, 2024 at 4:58 PM Feng Xue OS  wrote:
>
> Allow shift-by-induction for slp node, when it is single lane, which is
> aligned with the original loop-based handling.

OK.

Did you try whether we handle multiple lanes correctly?  The simplest
case would be a loop
body with say

  a[2*i] = x << i;
  a[2*i+1] = x << i;

I'm not sure how we match up multiple (different) inductions in the
sam

[PATCH] vect: Fix shift-by-induction for single-lane slp

2024-06-26 Thread Feng Xue OS
Allow shift-by-induction for an slp node when it is single-lane, which is
aligned with the original loop-based handling.
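
A minimal example of the case this enables (sketch only; any loop where the
shift amount is the induction variable qualifies):

  void
  f (int *a, int n)
  {
    if (n > 31)
      n = 31;       /* keep the shift amount in range  */
    for (int i = 0; i < n; i++)
      a[i] <<= i;   /* shift-by-induction, single lane  */
  }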

Thanks,
Feng

---
 gcc/tree-vect-stmts.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..840e162c7f0 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -6247,7 +6247,7 @@ vectorizable_shift (vec_info *vinfo,
   if ((dt[1] == vect_internal_def
|| dt[1] == vect_induction_def
|| dt[1] == vect_nested_cycle)
-  && !slp_node)
+  && (!slp_node || SLP_TREE_LANES (slp_node) == 1))
 scalar_shift_arg = false;
   else if (dt[1] == vect_constant_def
   || dt[1] == vect_external_def
-- 
2.17.1

Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles

2024-06-26 Thread Feng Xue OS
unsigned k = j - 1;
+ std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+ gcc_assert (!vec_oprnds[i][k]);
+   }
+   }
+   }
}
 }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+ lane-reducing operation participating in it may not use all of those
+ results, this field specifies result index starting from which any
+ following lane-reducing operation would be assigned to.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
  N scalar reductions in parallel, this variable gives the initial
  scalar values of those N reductions.  */
--
2.17.1

____________
From: Feng Xue OS 
Sent: Thursday, June 20, 2024 2:02 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in 
loop  def-use cycles

This patch was updated with some new changes.

When transforming multiple lane-reducing operations in a loop reduction chain,
originally, the corresponding vectorized statements are generated into def-use
cycles starting from 0. The def-use cycle with the smaller index would contain
more statements, which means more instruction dependency. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod <vector(16) char>
   sum += w[i];   // widen-sum <vector(16) char>
   sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

For higher instruction parallelism in the final vectorized loop, an optimal
means is to distribute those effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM
and the SADs are generated into disparate cycles, so the instruction
dependency among them can be eliminated.

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = sum_v1;  // copy
   sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
   sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
 }

2024-03-22 Feng Xue 

gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 43 +++
 gcc/tree-vectorizer.h |  6 ++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 5a27a2c3d9c..adee54350d4 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8821,9 +8821,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8831,7 +8831,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
-   */
+
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vectorized lane-
+reducing statements be distributed evenly among all def-use cycles.
+In the above example, SADs are generated into other cycles rather
+than that of DOT_PROD.  */
 

Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-26 Thread Feng Xue OS
_v2 += n_v2[i: 8  ~ 11];
+  sum_v3 += n_v3[i: 12 ~ 15];
+}
+   */
+  unsigned using_ncopies = vec_oprnds[0].length ();
+  unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+
+  gcc_assert (using_ncopies <= reduc_ncopies);
+
+  if (using_ncopies < reduc_ncopies)
+   {
+ for (unsigned i = 0; i < op.num_ops - 1; i++)
+   {
+ gcc_assert (vec_oprnds[i].length () == using_ncopies);
+ vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+   }
+   }
+}

   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8706,7 +8874,18 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 {
   gimple *new_stmt;
   tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-  if (masked_loop_p && !mask_by_cond_expr)
+
+  if (!vop[0] || !vop[1])
+   {
+ tree reduc_vop = vec_oprnds[reduc_index][i];
+
+ /* Insert trivial copy if no need to generate vectorized
+statement.  */
+ gcc_assert (reduc_vop);
+
+ new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
+   }
+  else if (masked_loop_p && !mask_by_cond_expr)
{
  /* No conditional ifns have been defined for lane-reducing op
 yet.  */
@@ -8735,8 +8914,22 @@ vect_transform_reduction (loop_vec_info loop_vinfo,

  if (masked_loop_p && mask_by_cond_expr)
{
+ unsigned nvectors = vec_num * ncopies;
+ tree stmt_vectype_in = vectype_in;
+
+ /* For single-lane slp node on lane-reducing op, we need to
+compute exact number of vector stmts from its input vectype,
+since the value got from the slp node is over-estimated.
+TODO: properly set this number somewhere, so that this
+fixup could be removed.  */
+ if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
+   {
+ stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+ nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+   }
+
  tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
- vec_num * ncopies, vectype_in, i);
+ nvectors, stmt_vectype_in, i);
  build_vect_cond_expr (code, vop, mask, gsi);
}

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 840e162c7f0..845647b4399 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
          NULL, NULL, node, cost_vec)
  || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
  || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+ || vectorizable_lane_reducing (as_a <loop_vec_info> (vinfo),
+stmt_info, node, cost_vec)
  || vectorizable_reduction (as_a <loop_vec_info> (vinfo), stmt_info,
 node, node_instance, cost_vec)
  || vectorizable_induction (as_a <loop_vec_info> (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
 slp_tree, slp_instance, int,
 bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+   slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
slp_tree, slp_instance,
stmt_vector_for_cost *);
--
2.17.1


From: Feng Xue OS 
Sent: Tuesday, June 25, 2024 5:32 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for 
loop reduction [PR114440]

>>
>> >> -  if (slp_node)
>> >> +  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
>> >
>> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
>> > instead, which is bad.
>> >
>> >> nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>> >>else
>> >> nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
>> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage 
>> >> (loop_vec_info lo

Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing

2024-06-26 Thread Feng Xue OS
ype_in;
-
-  /* Each lane-reducing operation has its own input vectype, while reduction
- PHI records the input vectype with least lanes.  */
-  if (lane_reducing)
-STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;

   enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
   STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
--
2.17.1



From: Feng Xue OS 
Sent: Thursday, June 20, 2024 1:47 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple 
lane-reducing

>> + if (lane_reducing_op_p (op.code))
>> +   {
>> + unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
>> + tree op_type = TREE_TYPE (op.ops[0]);
>> + tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
>> +op_type,
>> +group_size);
>
> I think doing it this way does not adhere to the vector type size constraint
> with loop vectorization.  You should use vect_is_simple_use like the
> original code did as the actual vector definition determines the vector type
> used.

OK, though this might be wordy.

Actually, STMT_VINFO_REDUC_VECTYPE_IN is logically equivalent to nunits_vectype
that is determined in vect_determine_vf_for_stmt_1(). So how about setting the type in this function?

>
> You are always using op.ops[0] here - I think that works because
> reduc_idx is the last operand of all lane-reducing ops.  But then
> we should assert reduc_idx != 0 here and add a comment.

Already added in the following assertion.

>> +
>> + /* The last operand of lane-reducing operation is for
>> +reduction.  */
>> + gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 
>> 1);

 ^^
>> +
>> + /* For lane-reducing operation vectorizable analysis needs the
>> +reduction PHI information */
>> + STMT_VINFO_REDUC_DEF (def) = phi_info;
>> +
>> + if (!new_vectype_in)
>> +   return false;
>> +
>> + /* Each lane-reducing operation has its own input vectype, 
>> while
>> +reduction PHI will record the input vectype with the least
>> +lanes.  */
>> + STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in;
>> +
>> + /* To accommodate lane-reducing operations of mixed input
>> +vectypes, choose input vectype with the least lanes for the
>> +reduction PHI statement, which would result in the most
>> +ncopies for vectorized reduction results.  */
>> + if (!vectype_in
>> + || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE 
>> (vectype_in)))
>> +  < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type
>> +   vectype_in = new_vectype_in;
>
> I know this is a fragile area but I always wonder since the accumulating 
> operand
> is the largest (all lane-reducing ops are widening), and that will be
> equal to the
> type of the PHI node, how this condition can be ever true.

In the original code, the accumulating operand is skipped! And that is
correct: we should not count that operand, which is why we call the operation
lane-reducing.

>
> ncopies is determined by the VF, so the comment is at least misleading.
>
>> +   }
>> + else
>> +   vectype_in = STMT_VINFO_VECTYPE (phi_info);
>
> Please initialize vectype_in from phi_info before the loop (that
> should never be NULL).
>

Maybe not, per the explanation below.

> I'll note that with your patch it seems we'd initialize vectype_in to
> the biggest
> non-accumulation vector type involved in lane-reducing ops but the 
> accumulating
> type might still be larger.   Why, when we have multiple lane-reducing
> ops, would
> we chose the largest input here?  I see we eventually do
>
>   if (slp_node)
> ncopies = 1;
>   else
> ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
>
> but then IIRC we always force a single cycle def for lane-reducing ops(?).


> In particular for vect_transform_reduction and SLP we rely on
> SLP_TREE_NUMBER_OF_VEC_STMTS while non-SLP uses
> STMT_VINFO_REDUC_VECTYPE_IN.
>
> So I wonder what breaks when we set vectype_in = vector type of PHI?
>

Yes. It is right; nothing is broken. Suppose that a loop contains three 
dot_prods,
two are <16 * char>, on

Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-25 Thread Feng Xue OS
>>
>> >> -  if (slp_node)
>> >> +  if (slp_node && SLP_TREE_LANES (slp_node) > 1)
>> >
>> > Hmm, that looks wrong.  It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off
>> > instead, which is bad.
>> >
>> >> nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>> >>else
>> >> nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
>> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage 
>> >> (loop_vec_info loop_vinfo,
>> >>  }
>> >>  }
>> >>
>> >> +/* Check if STMT_INFO is a lane-reducing operation that can be 
>> >> vectorized in
>> >> +   the context of LOOP_VINFO, and vector cost will be recorded in 
>> >> COST_VEC.
>> >> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
>> >> +   (sum-of-absolute-differences).
>> >> +
>> >> +   For a lane-reducing operation, the loop reduction path that it lies 
>> >> in,
>> >> +   may contain normal operation, or other lane-reducing operation of 
>> >> different
>> >> +   input type size, an example as:
>> >> +
>> >> + int sum = 0;
>> >> + for (i)
>> >> +   {
>> >> + ...
>> >> + sum += d0[i] * d1[i];   // dot-prod <vector(16) char>
>> >> + sum += w[i];// widen-sum <vector(16) char>
>> >> + sum += abs(s0[i] - s1[i]);  // sad <vector(8) short>
>> >> + sum += n[i];// normal <vector(4) int>
>> >> + ...
>> >> +   }
>> >> +
>> >> +   Vectorization factor is essentially determined by operation whose 
>> >> input
>> >> +   vectype has the most lanes ("vector(16) char" in the example), while 
>> >> we
>> >> +   need to choose input vectype with the least lanes ("vector(4) int" in 
>> >> the
>> >> +   example) for the reduction PHI statement.  */
>> >> +
>> >> +bool
>> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info 
>> >> stmt_info,
>> >> +   slp_tree slp_node, stmt_vector_for_cost 
>> >> *cost_vec)
>> >> +{
>> >> +  gimple *stmt = stmt_info->stmt;
>> >> +
>> >> +  if (!lane_reducing_stmt_p (stmt))
>> >> +return false;
>> >> +
>> >> +  tree type = TREE_TYPE (gimple_assign_lhs (stmt));
>> >> +
>> >> +  if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type))
>> >> +return false;
>> >> +
>> >> +  /* Do not try to vectorize bit-precision reductions.  */
>> >> +  if (!type_has_mode_precision_p (type))
>> >> +return false;
>> >> +
>> >> +  if (!slp_node)
>> >> +return false;
>> >> +
>> >> +  for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++)
>> >> +{
>> >> +  stmt_vec_info def_stmt_info;
>> >> +  slp_tree slp_op;
>> >> +  tree op;
>> >> +  tree vectype;
>> >> +  enum vect_def_type dt;
>> >> +
>> >> +  if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op,
>> >> +  &slp_op, &dt, &vectype, &def_stmt_info))
>> >> +   {
>> >> + if (dump_enabled_p ())
>> >> +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> +"use not simple.\n");
>> >> + return false;
>> >> +   }
>> >> +
>> >> +  if (!vectype)
>> >> +   {
>> >> + vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE 
>> >> (op),
>> >> +slp_op);
>> >> + if (!vectype)
>> >> +   return false;
>> >> +   }
>> >> +
>> >> +  if (!vect_maybe_update_slp_op_vectype (slp_op, vectype))
>> >> +   {
>> >> + if (dump_enabled_p ())
>> >> +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>> >> +"incompatible vector types for 
>> >> invariants\n");
>> >> + return false;
>> >> +   }
>> >> +
>> >> +  if (i == STMT_VINFO_REDUC_IDX (stmt_info))
>> >> +   continue;
>> >> +
>> >> +  /* There should be at most one cycle def in the stmt.  */
>> >> +  if (VECTORIZABLE_CYCLE_DEF (dt))
>> >> +   return false;
>> >> +}
>> >> +
>> >> +  stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt 
>> >> (stmt_info));
>> >> +
>> >> +  /* TODO: Support lane-reducing operation that does not directly 
>> >> participate
>> >> + in loop reduction. */
>> >> +  if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0)
>> >> +return false;
>> >> +
>> >> +  /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not
>> >> + recognized.  */
>> >> +  gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def);
>> >> +  gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION);
>> >> +
>> >> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> >> +  int ncopies_for_cost;
>> >> +
>> >> +  if (SLP_TREE_LANES (slp_node) > 1)
>> >> +{
>> >> +  /* Now lane-reducing operations in a non-single-lane slp node 
>> >> should only
>> >> +come from the same loop reduction path.  */
>> >> +  gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info));
>> >> +  ncopies_for_cost = 1;
>> >> +}
>> >> +  else
>> >> +   

Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-23 Thread Feng Xue OS
;
>> +
> 
> assert reduc_ncopies >= using_ncopies?  Maybe assert
> reduc_index == op.num_ops - 1 given you use one above
> and the other below?  Or simply iterate till op.num_ops
> and skip i == reduc_index.
> 
>> +  for (unsigned i = 0; i < op.num_ops - 1; i++)
>> +   {
>> + gcc_assert (vec_oprnds[i].length () == using_ncopies);
>> + vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
>> +   }
>> +}
>>
>>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod 
>> (stmt_info);
>>unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
>> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>  {
>>gimple *new_stmt;
>>tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
>> -  if (masked_loop_p && !mask_by_cond_expr)
>> +
>> +  if (!vop[0] || !vop[1])
>> +   {
>> + tree reduc_vop = vec_oprnds[reduc_index][i];
>> +
>> + /* Insert trivial copy if no need to generate vectorized
>> +statement.  */
>> + gcc_assert (reduc_vop);
>> +
>> + new_stmt = gimple_build_assign (vec_dest, reduc_vop);
>> + new_temp = make_ssa_name (vec_dest, new_stmt);
>> + gimple_set_lhs (new_stmt, new_temp);
>> + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
> 
> I think you could simply do
> 
>slp_node->push_vec_def (reduc_vop);
>continue;
> 
> without any code generation.
> 

OK, that would be easy. Here comes another question: this patch assumes a
lane-reducing op would always be contained in a slp node, since the
single-lane slp node feature has been enabled. But I got some regressions
when I enforced such a constraint in the lane-reducing op check. Those cases
were found to be unvectorizable with single-lane slp, so presumably this is
not what we want and needs to be fixed?

>> +   }
>> +  else if (masked_loop_p && !mask_by_cond_expr)
>> {
>>   /* No conditional ifns have been defined for lane-reducing op
>>  yet.  */
>> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>>
>>   if (masked_loop_p && mask_by_cond_expr)
>> {
>> + tree stmt_vectype_in = vectype_in;
>> + unsigned nvectors = vec_num * ncopies;
>> +
>> + if (lane_reducing && SLP_TREE_LANES (slp_node) == 1)
>> +   {
>> + /* Input vectype of the reduction PHI may be defferent from
> 
> different
> 
>> +that of lane-reducing operation.  */
>> + stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
>> + nvectors = vect_get_num_copies (loop_vinfo, 
>> stmt_vectype_in);
> 
> I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS.

To partially vectorize a dot_prod<16 * char> with 128-bit vector width,
shouldn't we pass (nvectors=4, vectype=<4 * int>) instead of (nvectors=1,
vectype=<16 * char>) to vect_get_loop_mask?

Thanks,
Feng



From: Richard Biener 
Sent: Thursday, June 20, 2024 8:26 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for 
loop reduction [PR114440]

On Sun, Jun 16, 2024 at 9:31 AM Feng Xue OS  wrote:
>
> For a lane-reducing operation (dot-prod/widen-sum/sad) in loop reduction, the
> current vectorizer can only handle the pattern if the reduction chain does
> not contain any other operation, whether normal or lane-reducing.
>
> Actually, to allow multiple arbitrary lane-reducing operations, we need to
> support vectorization of loop reduction chains with mixed input vectypes.
> Since the lanes of a vectype may vary with the operation, the effective
> ncopies of the vectorized statements may also differ from one operation to
> another, which causes a mismatch in the vectorized def-use cycles. A simple
> way is to align all operations with the one that has the most ncopies; the
> gap can be filled by generating extra trivial pass-through copies. For example:
>
>int sum = 0;
>for (i)
>  {
>sum += d0[i] * d1[i];  // dot-prod 
>sum += w[i];   // widen-sum 
>sum += abs(s0[i] - s1[i]); // sad 
>sum += n[i];   // normal 
>  }
>
> The vector size is 128-bit and the vectorization factor is 16. Reduction
> statements would be transformed as:
>
>vector<4> int sum_v0 = { 0, 0, 0, 0 };
>vector<4> i

Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles

2024-06-19 Thread Feng Xue OS
s; j > start; j--)
+   {
+ unsigned k = j - 1;
+ std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+ gcc_assert (!vec_oprnds[i][k]);
+   }
+   }
+   }
}
 }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+ lane-reducing operation participating in it may not use all of those
+ results; this field specifies the result index starting from which any
+ following lane-reducing operation would be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
  N scalar reductions in parallel, this variable gives the initial
  scalar values of those N reductions.  */
--
2.17.1

____________
From: Feng Xue OS 
Sent: Sunday, June 16, 2024 3:32 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org
Subject: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop  
def-use cycles

When transforming multiple lane-reducing operations in a loop reduction chain,
originally, the corresponding vectorized statements are generated into def-use
cycles starting from 0. The def-use cycle with a smaller index would contain
more statements, which means more instruction dependencies. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

For higher instruction parallelism in the final vectorized loop, an optimal
approach is to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as sketched below, DOT_PROD,
WIDEN_SUM and the SADs are generated into disparate cycles, so instruction
dependencies can be reduced.
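
The optimized layout is not spelled out in this message, but it can be
reconstructed from the comment update in the diff below (a sketch shown here
for convenience; only the SAD block changes, the other statements keep their
original cycles):

   for (i / 16)
     {
       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
       ...
       sum_v0 = sum_v0;  // copy
       sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
       sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
       sum_v3 = sum_v3;  // copy
       ...
     }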

Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 39 +++
 gcc/tree-vectorizer.h |  6 ++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d91665a341..c7e13d655d8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
-   */
+
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vectorized lane-
+reducing statements be distributed evenly among all def-use cycles.
+In the above example, SADs are generated into other cycles rather
+than that of DOT_PROD.  */
   unsigned using_ncopies = vec_oprnds[0].length ();
   unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+  unsigned result_pos = reduc_info->reduc_result_pos;
+
+  reduc_info->reduc_result_pos
+   = (result_pos + using_ncopies) % reduc_ncopies;
+  gcc_assert (result_pos < reduc_ncopies);

   for (unsigned i = 0; i < op.num_ops - 1; i++)
{
  gcc_assert (vec_oprnds[i].length () == using_ncopies);
  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+
+ /* Find suitable def-use cycles 

Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-19 Thread Feng Xue OS
py
+  sum_v2 = sum_v2;  // copy
+  sum_v3 = sum_v3;  // copy
+
+  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
+  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
+  sum_v2 = sum_v2;  // copy
+  sum_v3 = sum_v3;  // copy
+
+  sum_v0 += n_v0[i: 0  ~ 3 ];
+  sum_v1 += n_v1[i: 4  ~ 7 ];
+  sum_v2 += n_v2[i: 8  ~ 11];
+  sum_v3 += n_v3[i: 12 ~ 15];
+}
+   */
+  tree phi_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  unsigned all_ncopies = vect_get_num_copies (loop_vinfo, phi_vectype_in);
+  unsigned use_ncopies = vec_oprnds[0].length ();
+
+  if (use_ncopies < all_ncopies)
+   {
+ if (!slp_node)
+   {
+ tree reduc_oprnd = op.ops[reduc_index];
+
+ vec_oprnds[reduc_index].truncate (0);
+ vect_get_vec_defs_for_operand (loop_vinfo, stmt_info,
+all_ncopies, reduc_oprnd,
+&vec_oprnds[reduc_index]);
+   }
+ else
+   gcc_assert (all_ncopies == vec_oprnds[reduc_index].length ());
+
+ for (unsigned i = 0; i < op.num_ops - 1; i++)
+   {
+ gcc_assert (vec_oprnds[i].length () == use_ncopies);
+ vec_oprnds[i].safe_grow_cleared (all_ncopies);
+   }
+   }
+}

   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8699,7 +8865,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 {
   gimple *new_stmt;
   tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-  if (masked_loop_p && !mask_by_cond_expr)
+
+  if (!vop[0] || !vop[1])
+   {
+ tree reduc_vop = vec_oprnds[reduc_index][i];
+
+ /* Insert trivial copy if no need to generate vectorized
+statement.  */
+ gcc_assert (reduc_vop);
+
+ new_stmt = gimple_build_assign (vec_dest, reduc_vop);
+ new_temp = make_ssa_name (vec_dest, new_stmt);
+ gimple_set_lhs (new_stmt, new_temp);
+ vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi);
+   }
+  else if (masked_loop_p && !mask_by_cond_expr)
{
  /* No conditional ifns have been defined for lane-reducing op
 yet.  */
@@ -8728,8 +8908,16 @@ vect_transform_reduction (loop_vec_info loop_vinfo,

  if (masked_loop_p && mask_by_cond_expr)
{
+ unsigned nvectors = vec_num * ncopies;
+
+ /* For single-lane slp node on lane-reducing op, we need to
+compute the exact number of vector stmts from its input vectype,
+since the value obtained from the slp node is over-estimated.  */
+ if (lane_reducing && slp_node && SLP_TREE_LANES (slp_node) == 1)
+   nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+
  tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks,
- vec_num * ncopies, vectype_in, i);
+ nvectors, vectype_in, i);
  build_vect_cond_expr (code, vop, mask, gsi);
}

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index ca6052662a3..1b73ef01ade 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo,
  NULL, NULL, node, cost_vec)
  || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec)
  || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec)
+ || vectorizable_lane_reducing (as_a  (vinfo),
+stmt_info, node, cost_vec)
  || vectorizable_reduction (as_a  (vinfo), stmt_info,
 node, node_instance, cost_vec)
  || vectorizable_induction (as_a  (vinfo), stmt_info,
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 60224f4e284..94736736dcc 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop 
*, vec_info_shared *,
 extern bool vectorizable_live_operation (vec_info *, stmt_vec_info,
 slp_tree, slp_instance, int,
 bool, stmt_vector_for_cost *);
+extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info,
+   slp_tree, stmt_vector_for_cost *);
 extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info,
slp_tree, slp_instance,
stmt_vector_for_cost *);
--
2.17.1


From: F

Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing

2024-06-19 Thread Feng Xue OS
>(short_c0_hi, short_c1_hi, sum_v1);

 sum_v2 = sum_v2;
 sum_v3 = sum_v3;
  }

The def/use cycles (sum_v2 and sum_v3) would eventually be optimized away.
Then this gets the same result as setting vectype_in to <8 * short>.

With the patch #8, we get:

  vector<4> int sum_v0 = { 0, 0, 0, 0 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  loop () {
 sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0); 

 sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1);

 sum_v2 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v2);
 sum_v3 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v3);
  }

All dot_prods are assigned to separate def/use cycles, with no dependency
between them. More def/use cycles means higher instruction parallelism, but
extra cost is needed in the epilogue to combine the results.

So we consider a somewhat compact def/use layout similar to
single-defuse-cycle, in which the two <16 * char> dot_prods are independent
and cycles 2 and 3 are not used; this is better than the first scheme.

  vector<4> int sum_v0 = { 0, 0, 0, 0 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };

  loop () {
 sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0); 

     sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1);

 sum_v0 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v0);
     sum_v1 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v1);
  }

For this purpose, we need to track the vectype_in that results in
the most ncopies; in this case, that type is <8 * short>.
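
To double-check the arithmetic (a sketch; vect_get_num_copies is essentially
the VF divided by the number of lanes of the given vectype):

   /* VF = 16, 128-bit vectors:
      vectype_in = vector(16) char   =>  ncopies = 16 / 16 = 1
      vectype_in = vector(8)  short  =>  ncopies = 16 / 8  = 2  */

Tracking the input vectype with the fewest lanes (<8 * short>) therefore
yields the two def/use cycles (sum_v0, sum_v1) of the compact layout above.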

BTW: would you please also take a look at patch #7 and #8?

Thanks,
Feng


From: Richard Biener 
Sent: Wednesday, June 19, 2024 9:01 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple 
lane-reducing

On Sun, Jun 16, 2024 at 9:25 AM Feng Xue OS  wrote:
>
> The input vectype of the reduction PHI statement must be determined before
> vect cost computation for the reduction. Since a lane-reducing operation has
> a different input vectype from a normal one, we need to traverse all
> reduction statements to find the input vectype with the least lanes, and set
> that on the PHI statement.
>
> Thanks,
> Feng
>
> ---
> gcc/
> * tree-vect-loop.cc (vectorizable_reduction): Determine input vectype
> during traversal of reduction statements.
> ---
>  gcc/tree-vect-loop.cc | 72 +--
>  1 file changed, 49 insertions(+), 23 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 0f7b125e72d..39aa5cb1197 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>  {
>stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
>stmt_vec_info vdef = vect_stmt_to_vectorize (def);
> -  if (STMT_VINFO_REDUC_IDX (vdef) == -1)
> +  int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
> +
> +  if (reduc_idx == -1)
> {
>   if (dump_enabled_p ())
> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -7686,10 +7688,50 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>   return false;
> }
> }
> -  else if (!stmt_info)
> -   /* First non-conversion stmt.  */
> -   stmt_info = vdef;
> -  reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
> +  else
> +   {
> + /* First non-conversion stmt.  */
> + if (!stmt_info)
> +   stmt_info = vdef;
> +
> + if (lane_reducing_op_p (op.code))
> +   {
> + unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
> + tree op_type = TREE_TYPE (op.ops[0]);
> + tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
> +op_type,
> +group_size);

I think doing it this way does not adhere to the vector type size constraint
with loop vectorization.  You should use vect_is_simple_use like the
original code did as the actual vector definition determines the vector type
used.

You are always using op.ops[0] here - I think that works because
reduc_idx is the last operand of all lane-reducing ops.  But then
we should assert reduc_idx != 0 here and add a comment.

> +
> + /* The last operand of lane-reducing operation is for
> +reduction.  */
> + gcc_assert (redu

[PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles

2024-06-16 Thread Feng Xue OS
When transforming multiple lane-reducing operations in a loop reduction chain,
originally, the corresponding vectorized statements are generated into def-use
cycles starting from 0. The def-use cycle with a smaller index would contain
more statements, which means more instruction dependencies. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

For higher instruction parallelism in the final vectorized loop, an optimal
approach is to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed this way, DOT_PROD, WIDEN_SUM
and the SADs are generated into disparate cycles, so instruction dependencies
can be reduced.

Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 39 +++
 gcc/tree-vectorizer.h |  6 ++
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6d91665a341..c7e13d655d8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 
-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy
 
   sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
-   */
+
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vectorized lane-
+reducing statements be distributed evenly among all def-use cycles.
+In the above example, SADs are generated into other cycles rather
+than that of DOT_PROD.  */
   unsigned using_ncopies = vec_oprnds[0].length ();
   unsigned reduc_ncopies = vec_oprnds[reduc_index].length ();
+  unsigned result_pos = reduc_info->reduc_result_pos;
+
+  reduc_info->reduc_result_pos
+   = (result_pos + using_ncopies) % reduc_ncopies;
+  gcc_assert (result_pos < reduc_ncopies);
 
   for (unsigned i = 0; i < op.num_ops - 1; i++)
{
  gcc_assert (vec_oprnds[i].length () == using_ncopies);
  vec_oprnds[i].safe_grow_cleared (reduc_ncopies);
+
+ /* Find suitable def-use cycles to generate vectorized statements
+into, and reorder operands based on the selection.  */
+ if (result_pos)
+   {
+ unsigned count = reduc_ncopies - using_ncopies;
+ unsigned start = result_pos - count;
+
+ if ((int) start < 0)
+   {
+ count = result_pos;
+ start = 0;
+   }
+
+ for (unsigned j = using_ncopies; j > start; j--)
+   {
+ unsigned k = j - 1;
+ std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]);
+ gcc_assert (!vec_oprnds[i][k]);
+   }
+   }
}
 }
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 94736736dcc..64c6571a293 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;
 
+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+ lane-reducing operation participating in it may not use all of those
+ results; this field specifies the result index starting from which any
+ following lane-reducing operation would be assigned.  */
+  unsigned int reduc_result_pos;
+
   /* If IS_RED

[PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-16 Thread Feng Xue OS
For a lane-reducing operation (dot-prod/widen-sum/sad) in loop reduction, the
current vectorizer can only handle the pattern if the reduction chain does
not contain any other operation, whether normal or lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of loop reduction chains with mixed input vectypes.
Since the lanes of a vectype may vary with the operation, the effective
ncopies of the vectorized statements may also differ from one operation to
another, which causes a mismatch in the vectorized def-use cycles. A simple
way is to align all operations with the one that has the most ncopies; the
gap can be filled by generating extra trivial pass-through copies. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
   sum += n[i];   // normal 
 }

The vector size is 128-bit and the vectorization factor is 16. Reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
   sum_v1 += n_v1[i: 4  ~ 7 ];
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
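
(After the loop, the partial accumulators still have to be combined; a sketch
of the usual epilogue, not part of this patch:

   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;  // combine def-use cycles
   sum = REDUC_PLUS (sum_v);                   // horizontal reduction

This is the extra epilogue cost incurred when more def-use cycles are used.)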

Thanks,
Feng

---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (vectorizable_lane_reducing): New function
declaration.
* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
vectorizable_lane_reducing to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
code related to emulated_mixed_dot_prod.
(vect_reduction_update_partial_vector_usage): Compute ncopies as the
original means for single-lane slp node.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing.
(vect_transform_reduction): Extend transformation to support reduction
statements with mixed input vectypes.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c
* gcc.dg/vect/vect-reduc-chain-2.c
* gcc.dg/vect/vect-reduc-chain-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  62 
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  77 +
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  66 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  95 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  67 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c  |  79 +
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c  |  63 
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  35 ++
 gcc/tree-vect-loop.cc | 324 ++
 gcc/tree-vect-stmts.cc|   2 +
 gcc/tree-vectorizer.h |   2 +
 11 files changed, 802 insertions(+), 70 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require

[PATCH 6/8] vect: Tighten an assertion for lane-reducing in transform

2024-06-16 Thread Feng Xue OS
According to the logic of the code near the assertion, no lane-reducing
operation should appear there, not just DOT_PROD_EXPR. Since
"use_mask_by_cond_expr_p" treats SAD_EXPR the same as DOT_PROD_EXPR, and
WIDEN_SUM_EXPR would not be allowed by the following assertion
"gcc_assert (commutative_binary_op_p (...))", tighten the assertion.
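
A sketch of the case analysis, assuming use_mask_by_cond_expr_p returns true
for both DOT_PROD_EXPR and SAD_EXPR:

   if (masked_loop_p && !mask_by_cond_expr)
     /* DOT_PROD and SAD take the mask_by_cond_expr path, so among the
        lane-reducing ops only WIDEN_SUM could reach this branch, and it
        would trip the later gcc_assert (commutative_binary_op_p (...))
        anyway.  Hence it is safe to assert that no lane-reducing op gets
        here at all.  */
     gcc_assert (!lane_reducing);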

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vect_transform_reduction): Change assertion to
cover all lane-reducing ops.
---
 gcc/tree-vect-loop.cc | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 7909d63d4df..e0561feddce 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8643,7 +8643,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 }
 
   bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
-  gcc_assert (single_defuse_cycle || lane_reducing_op_p (code));
+  bool lane_reducing = lane_reducing_op_p (code);
+  gcc_assert (single_defuse_cycle || lane_reducing);
 
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
@@ -8698,8 +8699,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
   if (masked_loop_p && !mask_by_cond_expr)
{
- /* No conditional ifns have been defined for dot-product yet.  */
- gcc_assert (code != DOT_PROD_EXPR);
+ /* No conditional ifns have been defined for lane-reducing op
+yet.  */
+ gcc_assert (!lane_reducing);
 
  /* Make sure that the reduction accumulator is vop[0].  */
  if (reduc_index == 1)
-- 
2.17.1

From d348e63c001e65067876a80dfae75abefe10c240 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Sun, 16 Jun 2024 13:33:52 +0800
Subject: [PATCH 6/8] vect: Tighten an assertion for lane-reducing in transform

According to the logic of the code near the assertion, no lane-reducing
operation should appear there, not just DOT_PROD_EXPR. Since
"use_mask_by_cond_expr_p" treats SAD_EXPR the same as DOT_PROD_EXPR, and
WIDEN_SUM_EXPR would not be allowed by the following assertion
"gcc_assert (commutative_binary_op_p (...))", tighten the assertion.

2024-06-16 Feng Xue 

gcc/
	* tree-vect-loop.cc (vect_transform_reduction): Change assertion to
	cover all lane-reducing ops.
---
 gcc/tree-vect-loop.cc | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 7909d63d4df..e0561feddce 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8643,7 +8643,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 }
 
   bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
-  gcc_assert (single_defuse_cycle || lane_reducing_op_p (code));
+  bool lane_reducing = lane_reducing_op_p (code);
+  gcc_assert (single_defuse_cycle || lane_reducing);
 
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
@@ -8698,8 +8699,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
   if (masked_loop_p && !mask_by_cond_expr)
 	{
-	  /* No conditional ifns have been defined for dot-product yet.  */
-	  gcc_assert (code != DOT_PROD_EXPR);
+	  /* No conditional ifns have been defined for lane-reducing op
+	 yet.  */
+	  gcc_assert (!lane_reducing);
 
 	  /* Make sure that the reduction accumulator is vop[0].  */
 	  if (reduc_index == 1)
-- 
2.17.1



[PATCH 5/8] vect: Use an array to replace 3 relevant variables

2024-06-16 Thread Feng Xue OS
It's better to place the 3 related independent variables into an array, since
we will need to access them via an index in a following patch. At the same
time, this change makes some duplicated code more compact.

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vect_transform_reduction): Replace vec_oprnds0/1/2
with one new array variable vec_oprnds[3].
---
 gcc/tree-vect-loop.cc | 42 +-
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 39aa5cb1197..7909d63d4df 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8605,9 +8605,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 
   /* Transform.  */
   tree new_temp = NULL_TREE;
-  auto_vec vec_oprnds0;
-  auto_vec vec_oprnds1;
-  auto_vec vec_oprnds2;
+  auto_vec vec_oprnds[3];
 
   if (dump_enabled_p ())
 dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n");
@@ -8657,12 +8655,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 {
   vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
 single_defuse_cycle && reduc_index == 0
-? NULL_TREE : op.ops[0], &vec_oprnds0,
+? NULL_TREE : op.ops[0], &vec_oprnds[0],
 single_defuse_cycle && reduc_index == 1
-? NULL_TREE : op.ops[1], &vec_oprnds1,
+? NULL_TREE : op.ops[1], &vec_oprnds[1],
 op.num_ops == 3
 && !(single_defuse_cycle && reduc_index == 2)
-? op.ops[2] : NULL_TREE, &vec_oprnds2);
+? op.ops[2] : NULL_TREE, &vec_oprnds[2]);
 }
   else
 {
@@ -8670,12 +8668,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 vectype.  */
   gcc_assert (single_defuse_cycle
  && (reduc_index == 1 || reduc_index == 2));
-  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies,
-op.ops[0], truth_type_for (vectype_in), &vec_oprnds0,
+  vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, op.ops[0],
+truth_type_for (vectype_in), &vec_oprnds[0],
 reduc_index == 1 ? NULL_TREE : op.ops[1],
-NULL_TREE, &vec_oprnds1,
+NULL_TREE, &vec_oprnds[1],
 reduc_index == 2 ? NULL_TREE : op.ops[2],
-NULL_TREE, &vec_oprnds2);
+NULL_TREE, &vec_oprnds[2]);
 }
 
   /* For single def-use cycles get one copy of the vectorized reduction
@@ -8683,20 +8681,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   if (single_defuse_cycle)
 {
   vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, 1,
-reduc_index == 0 ? op.ops[0] : NULL_TREE, &vec_oprnds0,
-reduc_index == 1 ? op.ops[1] : NULL_TREE, &vec_oprnds1,
+reduc_index == 0 ? op.ops[0] : NULL_TREE,
+&vec_oprnds[0],
+reduc_index == 1 ? op.ops[1] : NULL_TREE,
+&vec_oprnds[1],
 reduc_index == 2 ? op.ops[2] : NULL_TREE,
-&vec_oprnds2);
+&vec_oprnds[2]);
 }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
+  unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
 
-  unsigned num = (reduc_index == 0
- ? vec_oprnds1.length () : vec_oprnds0.length ());
   for (unsigned i = 0; i < num; ++i)
 {
   gimple *new_stmt;
-  tree vop[3] = { vec_oprnds0[i], vec_oprnds1[i], NULL_TREE };
+  tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
   if (masked_loop_p && !mask_by_cond_expr)
{
  /* No conditional ifns have been defined for dot-product yet.  */
@@ -8721,7 +8720,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   else
{
  if (op.num_ops >= 3)
-   vop[2] = vec_oprnds2[i];
+   vop[2] = vec_oprnds[2][i];
 
  if (masked_loop_p && mask_by_cond_expr)
{
@@ -8752,14 +8751,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
}
 
   if (single_defuse_cycle && i < num - 1)
-   {
- if (reduc_index == 0)
-   vec_oprnds0.safe_push (gimple_get_lhs (new_stmt));
- else if (reduc_index == 1)
-   vec_oprnds1.safe_push (gimple_get_lhs (new_stmt));
- else if (reduc_index == 2)
-   vec_oprnds2.safe_push (gimple_get_lhs (new_stmt));
-   }
+   vec_oprnds[reduc_index].safe_push (gimple_get_lhs (new_stmt));
   else if (slp_node)
slp_node->push_vec_def (new_stmt);
   else
-- 
2.17.1

From 168a55952ae317fca34af55d025c1235b4ff34b5 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Sun, 16 Jun 

[PATCH 4/8] vect: Determine input vectype for multiple lane-reducing

2024-06-16 Thread Feng Xue OS
The input vectype of the reduction PHI statement must be determined before
vect cost computation for the reduction. Since a lane-reducing operation has
a different input vectype from a normal one, we need to traverse all
reduction statements to find the input vectype with the least lanes, and set
that on the PHI statement.

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Determine input vectype
during traversal of reduction statements.
---
 gcc/tree-vect-loop.cc | 72 +--
 1 file changed, 49 insertions(+), 23 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 0f7b125e72d..39aa5cb1197 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 {
   stmt_vec_info def = loop_vinfo->lookup_def (reduc_def);
   stmt_vec_info vdef = vect_stmt_to_vectorize (def);
-  if (STMT_VINFO_REDUC_IDX (vdef) == -1)
+  int reduc_idx = STMT_VINFO_REDUC_IDX (vdef);
+
+  if (reduc_idx == -1)
{
  if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -7686,10 +7688,50 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  return false;
}
}
-  else if (!stmt_info)
-   /* First non-conversion stmt.  */
-   stmt_info = vdef;
-  reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)];
+  else
+   {
+ /* First non-conversion stmt.  */
+ if (!stmt_info)
+   stmt_info = vdef;
+
+ if (lane_reducing_op_p (op.code))
+   {
+ unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0;
+ tree op_type = TREE_TYPE (op.ops[0]);
+ tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo,
+op_type,
+group_size);
+
+ /* The last operand of lane-reducing operation is for
+reduction.  */
+ gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1);
+
+ /* For lane-reducing operation vectorizable analysis needs the
+reduction PHI information */
+ STMT_VINFO_REDUC_DEF (def) = phi_info;
+
+ if (!new_vectype_in)
+   return false;
+
+ /* Each lane-reducing operation has its own input vectype, while
+reduction PHI will record the input vectype with the least
+lanes.  */
+ STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in;
+
+ /* To accommodate lane-reducing operations of mixed input
+vectypes, choose input vectype with the least lanes for the
+reduction PHI statement, which would result in the most
+ncopies for vectorized reduction results.  */
+ if (!vectype_in
+ || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
+  < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type
+   vectype_in = new_vectype_in;
+   }
+ else
+   vectype_in = STMT_VINFO_VECTYPE (phi_info);
+   }
+
+  reduc_def = op.ops[reduc_idx];
   reduc_chain_length++;
   if (!stmt_info && slp_node)
slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0];
@@ -7747,6 +7789,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 
   tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
   STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out;
+  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
+
   gimple_match_op op;
   if (!gimple_extract_op (stmt_info->stmt, &op))
 gcc_unreachable ();
@@ -7831,16 +7875,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  = get_vectype_for_scalar_type (loop_vinfo,
 TREE_TYPE (op.ops[i]), slp_op[i]);
 
-  /* To properly compute ncopies we are interested in the widest
-non-reduction input type in case we're looking at a widening
-accumulation that we later handle in vect_transform_reduction.  */
-  if (lane_reducing
- && vectype_op[i]
- && (!vectype_in
- || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
- < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE 
(vectype_op[i]))
-   vectype_in = vectype_op[i];
-
   /* Record how the non-reduction-def value of COND_EXPR is defined.
 ???  For a chain of multiple CONDs we'd have to match them up all.  */
   if (op.code == COND_EXPR && reduc_chain_length == 1)
@@ -7859,14 +7893,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
}
}
 }
-  if (!vectype_in)
-vectype_in = STMT_VINFO_VECTYPE (phi_info);
-  STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
-
-  /* Each lane-reducing operation has 

[PATCH 3/8] vect: Use one reduction_type local variable

2024-06-16 Thread Feng Xue OS
Two local variables were defined to refer to the same STMT_VINFO_REDUC_TYPE;
better to keep only one.

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Remove v_reduc_type, and
replace it with another local variable reduction_type.
---
 gcc/tree-vect-loop.cc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6e8b3639daf..0f7b125e72d 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7868,10 +7868,10 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (lane_reducing)
 STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
 
-  enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info);
-  STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type;
+  enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
+  STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
   /* If we have a condition reduction, see if we can simplify it further.  */
-  if (v_reduc_type == COND_REDUCTION)
+  if (reduction_type == COND_REDUCTION)
 {
   if (slp_node && SLP_TREE_LANES (slp_node) != 1)
return false;
@@ -8038,7 +8038,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 
   STMT_VINFO_REDUC_CODE (reduc_info) = orig_code;
 
-  vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
+  reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
   if (reduction_type == TREE_CODE_REDUCTION)
 {
   /* Check whether it's ok to change the order of the computation.
-- 
2.17.1

From 19dc1c91f10ec22e695b9003cae1f4ab5aa45250 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Sun, 16 Jun 2024 12:17:26 +0800
Subject: [PATCH 3/8] vect: Use one reduction_type local variable

Two local variables were defined to refer to the same STMT_VINFO_REDUC_TYPE;
better to keep only one.

2024-06-16 Feng Xue 

gcc/
	* tree-vect-loop.cc (vectorizable_reduction): Remove v_reduc_type, and
	replace it with another local variable reduction_type.
---
 gcc/tree-vect-loop.cc | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 6e8b3639daf..0f7b125e72d 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7868,10 +7868,10 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (lane_reducing)
 STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
 
-  enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info);
-  STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type;
+  enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info);
+  STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type;
   /* If we have a condition reduction, see if we can simplify it further.  */
-  if (v_reduc_type == COND_REDUCTION)
+  if (reduction_type == COND_REDUCTION)
 {
   if (slp_node && SLP_TREE_LANES (slp_node) != 1)
 	return false;
@@ -8038,7 +8038,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 
   STMT_VINFO_REDUC_CODE (reduc_info) = orig_code;
 
-  vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
+  reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
   if (reduction_type == TREE_CODE_REDUCTION)
 {
   /* Check whether it's ok to change the order of the computation.
-- 
2.17.1



[PATCH 2/8] vect: Remove duplicated check on reduction operand

2024-06-16 Thread Feng Xue OS
In vectorizable_reduction, one check on a reduction operand via its index is
subsumed by another check via pointer, so remove the former.

Thanks,
Feng

---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Remove the duplicated
check.
---
 gcc/tree-vect-loop.cc | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index d9a2ad69484..6e8b3639daf 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7815,11 +7815,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 "use not simple.\n");
  return false;
}
-  if (i == STMT_VINFO_REDUC_IDX (stmt_info))
-   continue;
 
-  /* For an IFN_COND_OP we might hit the reduction definition operand
-twice (once as definition, once as else).  */
+  /* Skip reduction operands, and for an IFN_COND_OP we might hit the
+reduction operand twice (once as definition, once as else).  */
   if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)])
continue;
 
-- 
2.17.1

From 5d2c22ad724856db12bf0ca568650f471447fa34 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Sun, 16 Jun 2024 12:08:56 +0800
Subject: [PATCH 2/8] vect: Remove duplicated check on reduction operand

In vectorizable_reduction, one check on a reduction operand via its index is
subsumed by another check via pointer, so remove the former.

2024-06-16 Feng Xue 

gcc/
	* tree-vect-loop.cc (vectorizable_reduction): Remove the duplicated
	check.
---
 gcc/tree-vect-loop.cc | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index d9a2ad69484..6e8b3639daf 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7815,11 +7815,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 			 "use not simple.\n");
 	  return false;
 	}
-  if (i == STMT_VINFO_REDUC_IDX (stmt_info))
-	continue;
 
-  /* For an IFN_COND_OP we might hit the reduction definition operand
-	 twice (once as definition, once as else).  */
+  /* Skip reduction operands, and for an IFN_COND_OP we might hit the
+	 reduction operand twice (once as definition, once as else).  */
   if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)])
 	continue;
 
-- 
2.17.1



[PATCH 1/8] vect: Add a function to check lane-reducing stmt

2024-06-16 Thread Feng Xue OS
This series of patches is meant to support multiple lane-reducing reduction
statements. Since the original ones conflicted with the new single-lane slp
node patches, I have reworked most of the patches and split them as small as
possible, which may make code review easier.

In the 1st one, I add a utility function to check whether a statement is a
lane-reducing operation, which could simplify some existing code.
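
For illustration, the helper folds the open-coded dyn_cast-plus-opcode check
into a single call (a minimal sketch mirroring the vect_analyze_slp hunk
below):

   /* Before: open-coded check.  */
   gassign *g = dyn_cast <gassign *> (stmt);
   if (g && lane_reducing_op_p (gimple_assign_rhs_code (g)))
     ...

   /* After: one helper call.  */
   if (lane_reducing_stmt_p (stmt))
     ...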

Thanks,
Feng

---
gcc/
* tree-vectorizer.h (lane_reducing_stmt_p): New function.
* tree-vect-slp.cc (vect_analyze_slp): Use new function
lane_reducing_stmt_p to check statement.
---
 gcc/tree-vect-slp.cc  |  4 +---
 gcc/tree-vectorizer.h | 12 
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 7e3d0107b4e..b4ea2e18f00 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3919,7 +3919,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
  scalar_stmts.create (loop_vinfo->reductions.length ());
  for (auto next_info : loop_vinfo->reductions)
{
- gassign *g;
  next_info = vect_stmt_to_vectorize (next_info);
  if ((STMT_VINFO_RELEVANT_P (next_info)
   || STMT_VINFO_LIVE_P (next_info))
@@ -3931,8 +3930,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
{
  /* Do not discover SLP reductions combining lane-reducing
 ops, that will fail later.  */
- if (!(g = dyn_cast  (STMT_VINFO_STMT (next_info)))
- || !lane_reducing_op_p (gimple_assign_rhs_code (g)))
+ if (!lane_reducing_stmt_p (STMT_VINFO_STMT (next_info)))
scalar_stmts.quick_push (next_info);
  else
{
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6bb0f5c3a56..60224f4e284 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2169,12 +2169,24 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
  && th >= vect_vf_for_cost (loop_vinfo));
 }
 
+/* Return true if CODE is a lane-reducing opcode.  */
+
 inline bool
 lane_reducing_op_p (code_helper code)
 {
   return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR;
 }
 
+/* Return true if STMT is a lane-reducing statement.  */
+
+inline bool
+lane_reducing_stmt_p (gimple *stmt)
+{
+  if (auto *assign = dyn_cast  (stmt))
+return lane_reducing_op_p (gimple_assign_rhs_code (assign));
+  return false;
+}
+
 /* Source location + hotness information. */
 extern dump_user_location_t vect_location;
 
-- 
2.17.1

From 0a90550b4ed3addfb2a36c40085bfa9b4bb05b7c Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Sat, 15 Jun 2024 23:17:10 +0800
Subject: [PATCH 1/8] vect: Add a function to check lane-reducing stmt

Add a utility function to check if a statement is lane-reducing operation,
which could simplify some existing code.

2024-06-16 Feng Xue 

gcc/
	* tree-vectorizer.h (lane_reducing_stmt_p): New function.
	* tree-vect-slp.cc (vect_analyze_slp): Use new function
	lane_reducing_stmt_p to check statement.
---
 gcc/tree-vect-slp.cc  |  4 +---
 gcc/tree-vectorizer.h | 12 
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 7e3d0107b4e..b4ea2e18f00 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3919,7 +3919,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	  scalar_stmts.create (loop_vinfo->reductions.length ());
 	  for (auto next_info : loop_vinfo->reductions)
 	{
-	  gassign *g;
 	  next_info = vect_stmt_to_vectorize (next_info);
 	  if ((STMT_VINFO_RELEVANT_P (next_info)
 		   || STMT_VINFO_LIVE_P (next_info))
@@ -3931,8 +3930,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 		{
 		  /* Do not discover SLP reductions combining lane-reducing
 		 ops, that will fail later.  */
-		  if (!(g = dyn_cast  (STMT_VINFO_STMT (next_info)))
-		  || !lane_reducing_op_p (gimple_assign_rhs_code (g)))
+		  if (!lane_reducing_stmt_p (STMT_VINFO_STMT (next_info)))
 		scalar_stmts.quick_push (next_info);
 		  else
 		{
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 6bb0f5c3a56..60224f4e284 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2169,12 +2169,24 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo)
 	  && th >= vect_vf_for_cost (loop_vinfo));
 }
 
+/* Return true if CODE is a lane-reducing opcode.  */
+
 inline bool
 lane_reducing_op_p (code_helper code)
 {
   return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR;
 }
 
+/* Return true if STMT is a lane-reducing statement.  */
+
+inline bool
+lane_reducing_stmt_p (gimple *stmt)
+{
+  if (auto *assign = dyn_cast  (stmt))
+return lane_reducing_op_p (gimple_assign_rhs_code (assign));
+  return false;
+}
+
 

Re: [PATCH 6/6] vect: Optimize order of lane-reducing statements in loop def-use cycles [PR114440]

2024-06-13 Thread Feng Xue OS
Regenerate the patch due to changes on its dependent patches.

Thanks,
Feng,
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 51 ++-
 gcc/tree-vectorizer.h |  6 +
 2 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index fb9259d115c..de7a9bab990 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8734,7 +8734,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 }

   bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
-  gcc_assert (single_defuse_cycle || lane_reducing_op_p (code));
+  bool lane_reducing = lane_reducing_op_p (code);
+  gcc_assert (single_defuse_cycle || lane_reducing);

   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
@@ -8751,6 +8752,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 }
   else
 {
+  int result_pos = 0;
+
   /* The input vectype of the reduction PHI determines copies of
 vectorized def-use cycles, which might be more than effective copies
 of vectorized lane-reducing reduction statements.  This could be
@@ -8780,9 +8783,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8790,7 +8793,20 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
-   */
+
+Moreover, for a higher instruction parallelism in final vectorized
+loop, it is considered to make those effective vectorized
+lane-reducing statements be distributed evenly among all def-use
+cycles. In the above example, SADs are generated into other cycles
+rather than that of DOT_PROD.  */
+
+  if (stmt_ncopies < ncopies)
+   {
+ gcc_assert (lane_reducing);
+ result_pos = reduc_info->reduc_result_pos;
+ reduc_info->reduc_result_pos = (result_pos + stmt_ncopies) % ncopies;
+ gcc_assert (result_pos >= 0 && result_pos < ncopies);
+   }

   for (i = 0; i < MIN (3, (int) op.num_ops); i++)
{
@@ -8826,7 +8842,30 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   op.ops[i], &vec_oprnds[i], vectype);

  if (used_ncopies < ncopies)
-   vec_oprnds[i].safe_grow_cleared (ncopies);
+   {
+ vec_oprnds[i].safe_grow_cleared (ncopies);
+
+ /* Find suitable def-use cycles to generate vectorized
+statements into, and reorder operands based on the
+selection.  */
+ if (i != reduc_index && result_pos)
+   {
+ int count = ncopies - used_ncopies;
+ int start = result_pos - count;
+
+ if (start < 0)
+   {
+ count = result_pos;
+ start = 0;
+   }
+
+ for (int j = used_ncopies - 1; j >= start; j--)
+   {
+ std::swap (vec_oprnds[i][j], vec_oprnds[i][j + count]);
+ gcc_assert (!vec_oprnds[i][j]);
+   }
+   }
+   }
}
 }

diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 3f7db707d97..b9bc9d432ee 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -1402,6 +1402,12 @@ public:
   /* The vector type for performing the actual reduction.  */
   tree reduc_vectype;

+  /* For loop reduction with multiple vectorized results (ncopies > 1), a
+ lane-reducing operation participating in it may not use all of those
+ results; this field specifies the result index starting from which any
+ following lane-reducing operation would be assigned.  */
+  int reduc_result_pos;
+
   /* If IS_REDUC_INFO is true and if the vector code is performing
  N scalar reductions in parallel, this variable gives the initial
  scalar values of those N reductions.  */
--
2.17.1

________
From: Feng Xue OS 
Sent: Thursday, May 30, 2024 10:56 PM
To: Richard Biener

Re: [PATCH 3/6] vect: Set STMT_VINFO_REDUC_DEF for non-live stmt in loop reduction

2024-06-13 Thread Feng Xue OS
Updated the patch.

Thanks,
Feng
--

gcc/
* tree-vect-loop.cc (vectorizable_reduction): Set STMT_VINFO_REDUC_DEF
for non-live stmt.
* tree-vect-stmts.cc (vectorizable_condition): Treat the condition
statement that is pointed to by the stmt_vec_info of the reduction PHI as the
real "for_reduction" statement.
---
 gcc/tree-vect-loop.cc  |  7 +--
 gcc/tree-vect-stmts.cc | 11 ++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index bbd5d261907..35c50eb72cb 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7665,8 +7665,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
if (STMT_VINFO_LIVE_P (s))
  STMT_VINFO_REDUC_DEF (vect_orig_stmt (s)) = phi_info;
}
-  else if (STMT_VINFO_LIVE_P (vdef))
-   STMT_VINFO_REDUC_DEF (def) = phi_info;
+
+  /* For lane-reducing operation vectorizable analysis needs the
+reduction PHI information */
+  STMT_VINFO_REDUC_DEF (def) = phi_info;
+
   gimple_match_op op;
   if (!gimple_extract_op (vdef->stmt, &op))
{
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index e32d44050e5..dbdb59054e0 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -12137,11 +12137,20 @@ vectorizable_condition (vec_info *vinfo,
   vect_reduction_type reduction_type = TREE_CODE_REDUCTION;
   bool for_reduction
 = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) != NULL;
+  if (for_reduction)
+{
+  reduc_info = info_for_reduction (vinfo, stmt_info);
+  if (STMT_VINFO_REDUC_DEF (reduc_info) != vect_orig_stmt (stmt_info))
+   {
+ for_reduction = false;
+ reduc_info = NULL;
+   }
+}
+
   if (for_reduction)
 {
   if (slp_node && SLP_TREE_LANES (slp_node) > 1)
return false;
-  reduc_info = info_for_reduction (vinfo, stmt_info);
   reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
   reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
   gcc_assert (reduction_type != EXTRACT_LAST_REDUCTION
--
2.17.1

____________
From: Feng Xue OS 
Sent: Thursday, May 30, 2024 10:51 PM
To: Richard Biener
Cc: Tamar Christina; gcc-patches@gcc.gnu.org
Subject: [PATCH 3/6] vect: Set STMT_VINFO_REDUC_DEF for non-live stmt in loop reduction

Normally, vectorizable checking on a statement in a loop reduction chain does
not use the reduction PHI information. But some special statements might need
it during vectorizable analysis, especially for the multiple lane-reducing
operations support added later.

Thanks,
Feng
---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Set STMT_VINFO_REDUC_DEF
for non-live stmt.
* tree-vect-stmts.cc (vectorizable_condition): Treat the condition
statement that is pointed by stmt_vec_info of reduction PHI as the
real "for_reduction" statement.
---
 gcc/tree-vect-loop.cc  |  5 +++--
 gcc/tree-vect-stmts.cc | 11 ++-
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index aa5f21ccd1a..51627c27f8a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7632,14 +7632,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 all lanes here - even though we only will vectorize from
 the SLP node with live lane zero the other live lanes also
 need to be identified as part of a reduction to be able
-to skip code generation for them.  */
+to skip code generation for them.  For lane-reducing operations,
+vectorizable analysis needs the reduction PHI information.  */
   if (slp_for_stmt_info)
{
  for (auto s : SLP_TREE_SCALAR_STMTS (slp_for_stmt_info))
if (STMT_VINFO_LIVE_P (s))
  STMT_VINFO_REDUC_DEF (vect_orig_stmt (s)) = phi_info;
}
-  else if (STMT_VINFO_LIVE_P (vdef))
+  else
STMT_VINFO_REDUC_DEF (def) = phi_info;
   gimple_match_op op;
   if (!gimple_extract_op (vdef->stmt, &op))
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 935d80f0e1b..2e0be763abb 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -12094,11 +12094,20 @@ vectorizable_condition (vec_info *vinfo,
   vect_reduction_type reduction_type = TREE_CODE_REDUCTION;
   bool for_reduction
 = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) != NULL;
+  if (for_reduction)
+{
+  reduc_info = info_for_reduction (vinfo, stmt_info);
+  if (STMT_VINFO_REDUC_DEF (reduc_info) != vect_orig_stmt (stmt_info))
+   {
+ for_reduction = false;
+ reduc_info = NULL;
+   }
+}
+
   if (for_reduction)
 {
   if (slp_node)
return false;
-  reduc_info = info_for_reduction (vinfo, stmt_info);
   reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
   reduc_index = STMT_V

Re: [PATCH 5/6] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-06-02 Thread Feng Xue OS
Please see my comments below.

Thanks,
Feng

> On Thu, May 30, 2024 at 4:55 PM Feng Xue OS wrote:
>>
>> For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
>> the current vectorizer can only handle the pattern if the reduction chain
>> contains no other operation, whether normal or lane-reducing.
>>
>> Actually, to allow multiple arbitrary lane-reducing operations, we need to
>> support vectorization of loop reduction chains with mixed input vectypes.
>> Since the lanes of a vectype may vary with the operation, the effective
>> ncopies of vectorized statements may also differ between operations, which
>> causes a mismatch between vectorized def-use cycles. A simple way is to
>> align all operations with the one that has the most ncopies; the gap can
>> be filled by generating extra trivial pass-through copies. For example:
>>
>>int sum = 0;
>>for (i)
>>  {
>>sum += d0[i] * d1[i];  // dot-prod 
>>sum += w[i];   // widen-sum 
>>sum += abs(s0[i] - s1[i]); // sad 
>>sum += n[i];   // normal 
>>  }
>>
>> The vector size is 128-bit and the vectorization factor is 16. Reduction
>> statements would be transformed as:
>>
>>vector<4> int sum_v0 = { 0, 0, 0, 0 };
>>vector<4> int sum_v1 = { 0, 0, 0, 0 };
>>vector<4> int sum_v2 = { 0, 0, 0, 0 };
>>vector<4> int sum_v3 = { 0, 0, 0, 0 };
>>
>>for (i / 16)
>>  {
>>sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>>sum_v1 = sum_v1;  // copy
>>sum_v2 = sum_v2;  // copy
>>sum_v3 = sum_v3;  // copy
>>
>>sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
>>sum_v1 = sum_v1;  // copy
>>sum_v2 = sum_v2;  // copy
>>sum_v3 = sum_v3;  // copy
>>
>>sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
>>sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
>>sum_v2 = sum_v2;  // copy
>>sum_v3 = sum_v3;  // copy
>>
>>sum_v0 += n_v0[i: 0  ~ 3 ];
>>sum_v1 += n_v1[i: 4  ~ 7 ];
>>sum_v2 += n_v2[i: 8  ~ 11];
>>sum_v3 += n_v3[i: 12 ~ 15];
>>  }
>>
>> Thanks,
>> Feng
>>
>> ...
>>
>> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
>> index 20c99f11e9a..b5849dbb08a 100644
>> --- a/gcc/tree-vect-loop.cc
>> +++ b/gcc/tree-vect-loop.cc
>> @@ -5322,8 +5322,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>>if (!gimple_extract_op (orig_stmt_info->stmt, &op))
>>  gcc_unreachable ();
>>
>> -  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod 
>> (stmt_info);
>> -
>>if (reduction_type == EXTRACT_LAST_REDUCTION)
>>  /* No extra instructions are needed in the prologue.  The loop body
>> operations are costed in vectorizable_condition.  */
>> @@ -5358,12 +5356,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
>>initial result of the data reduction, initial value of the index
>>reduction.  */
>> prologue_stmts = 4;
>> -  else if (emulated_mixed_dot_prod)
>> -   /* We need the initial reduction value and two invariants:
>> -  one that contains the minimum signed value and one that
>> -  contains half of its negative.  */
>> -   prologue_stmts = 3;
>>else
>> +   /* We need the initial reduction value.  */
>> prologue_stmts = 1;
>>prologue_cost += record_stmt_cost (cost_vec, prologue_stmts,
>>  scalar_to_vec, stmt_info, 0,
>> @@ -7464,6 +7458,169 @@ vect_reduction_use_partial_vector (loop_vec_info 
>> loop_vinfo,
>>  }
>>  }
>>
>> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in
>> +   the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC.
>> +   Now there are three such kinds of operations: dot-prod/widen-sum/sad
>> +   (sum-of-absolute-differences).
>> +
>> +   For a lane-reducing operation, the loop reduction path that it lies in
>> +   may contain a normal operation, or another lane-reducing operation of
>> +   different input type size, for example:
>> +
>> + int sum = 0;
>> + for (i)
>> +   {
>> + ...
>> + sum += d0[i] * d1[i];   // dot-prod 
>> + 

Re: [PATCH 2/6] vect: Split out partial vect checking for reduction into a function

2024-05-31 Thread Feng Xue OS
Ok. Updated as the comments.

Thanks,
Feng


From: Richard Biener 
Sent: Friday, May 31, 2024 3:29 PM
To: Feng Xue OS
Cc: Tamar Christina; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 2/6] vect: Split out partial vect checking for reduction into a function

On Thu, May 30, 2024 at 4:48 PM Feng Xue OS  wrote:
>
> This is a patch that is split out from 
> https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652626.html.
>
> Partial vectorization checking for vectorizable_reduction is a piece of
> relatively isolated code, which may be reused in other places. Move the
> code into a new function for sharing.
>
> Thanks,
> Feng
> ---
> gcc/
> * tree-vect-loop.cc (vect_reduction_use_partial_vector): New function.

Can you rename the function to vect_reduction_update_partial_vector_usage
please?  And keep ...

> (vectorizable_reduction): Move partial vectorization checking code to
> vect_reduction_use_partial_vector.
> ---
>  gcc/tree-vect-loop.cc | 138 --
>  1 file changed, 78 insertions(+), 60 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index a42d79c7cbf..aa5f21ccd1a 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7391,6 +7391,81 @@ build_vect_cond_expr (code_helper code, tree vop[3], 
> tree mask,
>  }
>  }
>
> +/* Given an operation with CODE in a loop reduction path whose reduction PHI
> +   is specified by REDUC_INFO, the operation has scalar result type TYPE, and
> +   its input vectype is represented by VECTYPE_IN.  The vectype of the
> +   vectorized result may differ from VECTYPE_IN, either in base type or in
> +   number of lanes; lane-reducing operations are such a case.  This function
> +   checks whether and how partial vectorization can be performed on the
> +   operation in the context of LOOP_VINFO.  */
> +
> +static void
> +vect_reduction_use_partial_vector (loop_vec_info loop_vinfo,
> +  stmt_vec_info reduc_info,
> +  slp_tree slp_node, code_helper code,
> +  tree type, tree vectype_in)
> +{
> +  if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +return;
> +
> +  enum vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (reduc_info);
> +  internal_fn reduc_fn = STMT_VINFO_REDUC_FN (reduc_info);
> +  internal_fn cond_fn = get_conditional_internal_fn (code, type);
> +
> +  if (reduc_type != FOLD_LEFT_REDUCTION
> +  && !use_mask_by_cond_expr_p (code, cond_fn, vectype_in)
> +  && (cond_fn == IFN_LAST
> + || !direct_internal_fn_supported_p (cond_fn, vectype_in,
> + OPTIMIZE_FOR_SPEED)))
> +{
> +  if (dump_enabled_p ())
> +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +"can't operate on partial vectors because"
> +" no conditional operation is available.\n");
> +  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +}
> +  else if (reduc_type == FOLD_LEFT_REDUCTION
> +  && reduc_fn == IFN_LAST
> +  && !expand_vec_cond_expr_p (vectype_in, truth_type_for 
> (vectype_in),
> +  SSA_NAME))
> +{
> +  if (dump_enabled_p ())
> +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +   "can't operate on partial vectors because"
> +   " no conditional operation is available.\n");
> +  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +}
> +  else if (reduc_type == FOLD_LEFT_REDUCTION
> +  && internal_fn_mask_index (reduc_fn) == -1
> +  && FLOAT_TYPE_P (vectype_in)
> +  && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
> +{
> +  if (dump_enabled_p ())
> +   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +"can't operate on partial vectors because"
> +" signed zeros cannot be preserved.\n");
> +  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +}
> +  else
> +{
> +  internal_fn mask_reduc_fn
> +   = get_masked_reduction_fn (reduc_fn, vectype_in);
> +  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +  unsigned nvectors;
> +
> +  if (slp_node)
> +   nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
> +  el

[PATCH 6/6] vect: Optimize order of lane-reducing statements in loop def-use cycles [PR114440]

2024-05-30 Thread Feng Xue OS
When transforming multiple lane-reducing operations in a loop reduction chain,
the corresponding vectorized statements were originally generated into def-use
cycles starting from index 0. A def-use cycle with a smaller index would then
contain more statements, which means more instruction dependency. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
 }

Original transformation result:

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 }

For higher instruction parallelism in the final vectorized loop, a better
scheme is to distribute the effective vectorized lane-reducing statements
evenly among all def-use cycles. Transformed as below, DOT_PROD, WIDEN_SUM
and the SADs are generated into separate cycles, so the instruction
dependency can be eliminated.

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = sum_v1;  // copy
   sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
   sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
 }
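
The distribution amounts to a per-reduction round-robin cursor: each
lane-reducing statement claims the next stmt_ncopies def-use cycles starting
at the cursor, then advances it modulo ncopies. A minimal standalone sketch
of that bookkeeping (plain C with hypothetical names; the actual patch keeps
the cursor in reduc_info->reduc_result_pos):

   /* Round-robin assignment of STMT_NCOPIES slots out of NCOPIES def-use
      cycles; returns the starting cycle index and advances the
      per-reduction cursor.  */
   static int
   assign_result_pos (int *cursor, int stmt_ncopies, int ncopies)
   {
     int start = *cursor;
     *cursor = (start + stmt_ncopies) % ncopies;
     return start;
   }

With ncopies == 4, the DOT_PROD (1 effective copy), WIDEN_SUM (1) and SAD
(2) above claim cycles 0, 1 and 2~3 respectively.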

Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
* tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
statements in an optimized order.
---
 gcc/tree-vect-loop.cc | 51 ++-
 gcc/tree-vectorizer.h |  6 +
 2 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index b5849dbb08a..4807f529506 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -8703,7 +8703,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 }
 
   bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
-  gcc_assert (single_defuse_cycle || lane_reducing_op_p (code));
+  bool lane_reducing = lane_reducing_op_p (code);
+  gcc_assert (single_defuse_cycle || lane_reducing);
 
   /* Create the destination vector  */
   tree scalar_dest = gimple_get_lhs (stmt_info->stmt);
@@ -8720,6 +8721,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 }
   else
 {
+  int result_pos = 0;
+
   /* The input vectype of the reduction PHI determines copies of
 vectorized def-use cycles, which might be more than effective copies
 of vectorized lane-reducing reduction statements.  This could be
@@ -8749,9 +8752,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy
 
-  sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
-  sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
-  sum_v2 = sum_v2;  // copy
+  sum_v0 = sum_v0;  // copy
+  sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1);
+  sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2);
   sum_v3 = sum_v3;  // copy
 
   sum_v0 += n_v0[i: 0  ~ 3 ];
@@ -8759,7 +8762,20 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
-   */
+
+Moreover, for higher instruction parallelism in the final vectorized
+loop, the effective vectorized lane-reducing statements are distributed
+evenly among all def-use cycles. In the above example, the SADs are
+generated into cycles other than that of the DOT_PROD.  */
+
+  if (stmt_ncopies < ncopies)
+   {
+ gcc_assert (lane_reducing);
+ result_pos = reduc_info->reduc_result_pos;
+ reduc_info->reduc_result_pos = (result_pos + stmt_ncopies) % ncopies;
+ gcc_assert (result_pos >= 0 && result_pos < ncopies);
+   }
 
   for (i = 0; i < MIN (3, (int) op.num_ops); i++)
{
@@ -8792,7 +8808,30 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
 op.ops[i], &vec_oprnds[i], vectype);
 
  if (used_ncopies <

[PATCH 5/6] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-05-30 Thread Feng Xue OS
For a lane-reducing operation (dot-prod/widen-sum/sad) in a loop reduction,
the current vectorizer can only handle the pattern if the reduction chain
contains no other operation, whether normal or lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of loop reduction chains with mixed input vectypes.
Since the lanes of a vectype may vary with the operation, the effective
ncopies of vectorized statements may also differ between operations, which
causes a mismatch between vectorized def-use cycles. A simple way is to
align all operations with the one that has the most ncopies; the gap can be
filled by generating extra trivial pass-through copies. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
   sum += n[i];   // normal 
 }

The vector size is 128-bit and the vectorization factor is 16. Reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
   sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 += n_v0[i: 0  ~ 3 ];
   sum_v1 += n_v1[i: 4  ~ 7 ];
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }
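
To make the mismatch concrete: with a 128-bit vector size and a
vectorization factor of 16, the int accumulator needs ncopies = 16 / 4 = 4
def-use cycles, while the DOT_PROD covers 16 input lanes per statement
(16 / 16 = 1 effective copy) and the SAD here covers 8 lanes per statement
(16 / 8 = 2 effective copies); the remaining cycles are padded with the
pass-through copies shown above.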

Thanks,
Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (vectorizable_lane_reducing): New function
declaration.
* tree-vect-stmts.cc (vect_analyze_stmt): Call new function
vectorizable_lane_reducing to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
code related to emulated_mixed_dot_prod.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing.
(vect_transform_reduction): Extend transformation to support reduction
statements with mixed input vectypes.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c
* gcc.dg/vect/vect-reduc-chain-2.c
* gcc.dg/vect/vect-reduc-chain-3.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
* gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
* gcc.dg/vect/vect-reduc-dot-slp-1.c
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  62 +++
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  77 +++
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  66 +++
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c  |  97 
 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c  |  81 +++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  35 ++
 gcc/tree-vect-loop.cc | 478 --
 gcc/tree-vect-stmts.cc|   2 +
 gcc/tree-vectorizer.h |   2 +
 9 files changed, 755 insertions(+), 145 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { 
aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *

[PATCH 4/6] vect: Bind input vectype to lane-reducing operation

2024-05-30 Thread Feng Xue OS
The input vectype is an attribute of the lane-reducing operation itself, not
of the reduction PHI it is associated with, since there might be more than one
lane-reducing operation with different input types in a loop reduction chain.
So bind each lane-reducing operation to its own input vectype.
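
A short scalar example of the situation this addresses (hypothetical source,
in the style of the testcases elsewhere in the series): two lane-reducing
operations in one reduction chain whose input element types differ, so a
single per-PHI input vectype cannot describe both.

   int
   f (signed char *c0, signed char *c1, short *s0, short *s1, int n)
   {
     int sum = 0;
     for (int i = 0; i < n; i++)
       {
         sum += c0[i] * c1[i];                  /* dot-prod, char inputs  */
         sum += __builtin_abs (s0[i] - s1[i]);  /* sad, short inputs      */
       }
     return sum;
   }

With 128-bit vectors, the dot-prod's input vectype has 16 lanes while the
sad's has 8, hence recording the input vectype on each statement.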

Thanks,
Feng
---
gcc/
* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter
loop_vinfo. Get input vectype from stmt_info instead of reduction PHI.
(vect_model_reduction_cost): Remove loop_vinfo argument of call to
vect_is_emulated_mixed_dot_prod.
(vect_transform_reduction): Likewise.
(vectorizable_reduction): Likewise, and bind input vectype to
lane-reducing operation.
---
 gcc/tree-vect-loop.cc | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 51627c27f8a..20c99f11e9a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -5270,8 +5270,7 @@ have_whole_vector_shift (machine_mode mode)
See vect_emulate_mixed_dot_prod for the actual sequence used.  */
 
 static bool
-vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo,
-stmt_vec_info stmt_info)
+vect_is_emulated_mixed_dot_prod (stmt_vec_info stmt_info)
 {
   gassign *assign = dyn_cast (stmt_info->stmt);
   if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR)
@@ -5282,10 +5281,9 @@ vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo,
   if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2)))
 return false;
 
-  stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
-  gcc_assert (reduc_info->is_reduc_info);
+  gcc_assert (STMT_VINFO_REDUC_VECTYPE_IN (stmt_info));
   return !directly_supported_p (DOT_PROD_EXPR,
-   STMT_VINFO_REDUC_VECTYPE_IN (reduc_info),
+   STMT_VINFO_REDUC_VECTYPE_IN (stmt_info),
optab_vector_mixed_sign);
 }
 
@@ -5324,8 +5322,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo,
   if (!gimple_extract_op (orig_stmt_info->stmt, &op))
 gcc_unreachable ();
 
-  bool emulated_mixed_dot_prod
-= vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
+  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
+
   if (reduction_type == EXTRACT_LAST_REDUCTION)
 /* No extra instructions are needed in the prologue.  The loop body
operations are costed in vectorizable_condition.  */
@@ -7840,6 +7838,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 vectype_in = STMT_VINFO_VECTYPE (phi_info);
   STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in;
 
+  /* Each lane-reducing operation has its own input vectype, while the
+ reduction PHI records the input vectype with the fewest lanes.  */
+  if (lane_reducing)
+STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in;
+
   enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info);
   STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type;
   /* If we have a condition reduction, see if we can simplify it further.  */
@@ -8366,7 +8369,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   if (single_defuse_cycle || lane_reducing)
 {
   int factor = 1;
-  if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info))
+  if (vect_is_emulated_mixed_dot_prod (stmt_info))
/* Three dot-products and a subtraction.  */
factor = 4;
   record_stmt_cost (cost_vec, ncopies * factor, vector_stmt,
@@ -8617,8 +8620,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
: &vec_oprnds2));
 }
 
-  bool emulated_mixed_dot_prod
-= vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info);
+  bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
+
   FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
 {
   gimple *new_stmt;
-- 
2.17.1

From b885de76ad7e9f5accceff18cb6c11de73a36225 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Wed, 29 May 2024 16:41:57 +0800
Subject: [PATCH 4/6] vect: Bind input vectype to lane-reducing operation

The input vectype is an attribute of the lane-reducing operation itself, not
of the reduction PHI it is associated with, since there might be more than one
lane-reducing operation with different input types in a loop reduction chain.
So bind each lane-reducing operation to its own input vectype.

2024-05-29 Feng Xue 

gcc/
	* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter
	loop_vinfo. Get input vectype from stmt_info instead of reduction PHI.
	(vect_model_reduction_cost): Remove loop_vinfo argument of call to
	vect_is_emulated_mixed_dot_prod.
	(vect_transform_reduction): Likewise.
	(vectorizable_reduction): Likewise, and bind input vectype to
	lane-reducing operation.
---
 gcc/tree-vect-loop.cc | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/gcc/tr

[PATCH 3/6] vect: Set STMT_VINFO_REDUC_DEF for non-live stmt in loop reduction

2024-05-30 Thread Feng Xue OS
Normally, vectorizable checking on a statement in a loop reduction chain does
not use the reduction PHI information. But some special statements might need
it during vectorizable analysis, especially for the multiple lane-reducing
operations support added later.
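
An illustrative case (hypothetical source, not from the patch) is a
COND_EXPR sitting in the same reduction chain as a lane-reducing operation;
vectorizable_condition then has to consult the reduction PHI to tell whether
it is itself the real reduction statement:

   int
   f (int *a, int *b, signed char *d0, signed char *d1, int n)
   {
     int sum = 0;
     for (int i = 0; i < n; i++)
       {
         sum += d0[i] * d1[i];            /* lane-reducing dot-prod   */
         sum = b[i] ? sum + a[i] : sum;   /* condition in the chain   */
       }
     return sum;
   }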

Thanks,
Feng
---
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Set STMT_VINFO_REDUC_DEF
for non-live stmt.
* tree-vect-stmts.cc (vectorizable_condition): Treat the condition
statement that is pointed by stmt_vec_info of reduction PHI as the
real "for_reduction" statement.
---
 gcc/tree-vect-loop.cc  |  5 +++--
 gcc/tree-vect-stmts.cc | 11 ++-
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index aa5f21ccd1a..51627c27f8a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7632,14 +7632,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 all lanes here - even though we only will vectorize from
 the SLP node with live lane zero the other live lanes also
 need to be identified as part of a reduction to be able
-to skip code generation for them.  */
+to skip code generation for them.  For lane-reducing operations,
+vectorizable analysis needs the reduction PHI information.  */
   if (slp_for_stmt_info)
{
  for (auto s : SLP_TREE_SCALAR_STMTS (slp_for_stmt_info))
if (STMT_VINFO_LIVE_P (s))
  STMT_VINFO_REDUC_DEF (vect_orig_stmt (s)) = phi_info;
}
-  else if (STMT_VINFO_LIVE_P (vdef))
+  else
STMT_VINFO_REDUC_DEF (def) = phi_info;
   gimple_match_op op;
   if (!gimple_extract_op (vdef->stmt, &op))
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 935d80f0e1b..2e0be763abb 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -12094,11 +12094,20 @@ vectorizable_condition (vec_info *vinfo,
   vect_reduction_type reduction_type = TREE_CODE_REDUCTION;
   bool for_reduction
 = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) != NULL;
+  if (for_reduction)
+{
+  reduc_info = info_for_reduction (vinfo, stmt_info);
+  if (STMT_VINFO_REDUC_DEF (reduc_info) != vect_orig_stmt (stmt_info))
+   {
+ for_reduction = false;
+ reduc_info = NULL;
+   }
+}
+
   if (for_reduction)
 {
   if (slp_node)
return false;
-  reduc_info = info_for_reduction (vinfo, stmt_info);
   reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info);
   reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
   gcc_assert (reduction_type != EXTRACT_LAST_REDUCTION
-- 
2.17.1

[PATCH 2/6] vect: Split out partial vect checking for reduction into a function

2024-05-30 Thread Feng Xue OS
This is a patch that is split out from 
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652626.html.

Partial vectorization checking for vectorizable_reduction is a piece of
relatively isolated code, which may be reused in other places. Move the
code into a new function for sharing.
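
After the split, the analysis path reduces to a single call with the
reduction's context (a sketch of the call shape, matching the signature in
the diff below; the argument names follow vectorizable_reduction's locals):

   vect_reduction_use_partial_vector (loop_vinfo, reduc_info, slp_node,
                                      op.code, op.type, vectype_in);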

Thanks,
Feng
---
gcc/
* tree-vect-loop.cc (vect_reduction_use_partial_vector): New function.
(vectorizable_reduction): Move partial vectorization checking code to
vect_reduction_use_partial_vector.
---
 gcc/tree-vect-loop.cc | 138 --
 1 file changed, 78 insertions(+), 60 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a42d79c7cbf..aa5f21ccd1a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7391,6 +7391,81 @@ build_vect_cond_expr (code_helper code, tree vop[3], 
tree mask,
 }
 }
 
+/* Given an operation with CODE in a loop reduction path whose reduction PHI
+   is specified by REDUC_INFO, the operation has scalar result type TYPE, and
+   its input vectype is represented by VECTYPE_IN.  The vectype of the
+   vectorized result may differ from VECTYPE_IN, either in base type or in
+   number of lanes; lane-reducing operations are such a case.  This function
+   checks whether and how partial vectorization can be performed on the
+   operation in the context of LOOP_VINFO.  */
+
+static void
+vect_reduction_use_partial_vector (loop_vec_info loop_vinfo,
+  stmt_vec_info reduc_info,
+  slp_tree slp_node, code_helper code,
+  tree type, tree vectype_in)
+{
+  if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
+return;
+
+  enum vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (reduc_info);
+  internal_fn reduc_fn = STMT_VINFO_REDUC_FN (reduc_info);
+  internal_fn cond_fn = get_conditional_internal_fn (code, type);
+
+  if (reduc_type != FOLD_LEFT_REDUCTION
+  && !use_mask_by_cond_expr_p (code, cond_fn, vectype_in)
+  && (cond_fn == IFN_LAST
+ || !direct_internal_fn_supported_p (cond_fn, vectype_in,
+ OPTIMIZE_FOR_SPEED)))
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"can't operate on partial vectors because"
+" no conditional operation is available.\n");
+  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+}
+  else if (reduc_type == FOLD_LEFT_REDUCTION
+  && reduc_fn == IFN_LAST
+  && !expand_vec_cond_expr_p (vectype_in, truth_type_for (vectype_in),
+  SSA_NAME))
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+   "can't operate on partial vectors because"
+   " no conditional operation is available.\n");
+  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+}
+  else if (reduc_type == FOLD_LEFT_REDUCTION
+  && internal_fn_mask_index (reduc_fn) == -1
+  && FLOAT_TYPE_P (vectype_in)
+  && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in))
+{
+  if (dump_enabled_p ())
+   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+"can't operate on partial vectors because"
+" signed zeros cannot be preserved.\n");
+  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+}
+  else
+{
+  internal_fn mask_reduc_fn
+   = get_masked_reduction_fn (reduc_fn, vectype_in);
+  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
+  vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
+  unsigned nvectors;
+
+  if (slp_node)
+   nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+  else
+   nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
+
+  if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS)
+   vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1);
+  else
+   vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_in, NULL);
+}
+}
+
 /* Function vectorizable_reduction.
 
Check if STMT_INFO performs a reduction operation that can be vectorized.
@@ -7456,7 +7531,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   bool single_defuse_cycle = false;
   bool nested_cycle = false;
   bool double_reduc = false;
-  int vec_num;
   tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE;
   tree cond_reduc_val = NULL_TREE;
 
@@ -8283,11 +8357,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  return false;
}
 
-  if (slp_node)
-vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
-  else
-vec_num = 1;
-
   vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn,
 reduction_type, ncopies, cost_vec);
   /* Cost the reducti

[PATCH 1/6] vect: Add a function to check lane-reducing code [PR114440]

2024-05-30 Thread Feng Xue OS
This is a patch that is split out from 
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652626.html.

Checking whether an operation is lane-reducing requires comparing its code
against three kinds (DOT_PROD_EXPR/WIDEN_SUM_EXPR/SAD_EXPR).  Add a utility
function to make the check handy and concise.
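
The helper is a one-liner over code_helper. A sketch of its likely shape
(the hunk adding it to tree-vectorizer.h is not shown in the truncated
diff below):

   /* Return true if CODE is one of the lane-reducing operations:
      dot-product, widening sum or sum-of-absolute-differences.  */

   inline bool
   lane_reducing_op_p (code_helper code)
   {
     return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR
            || code == SAD_EXPR;
   }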

Feng
--
gcc/
* tree-vectorizer.h (lane_reducing_op_p): New function.
* tree-vect-slp.cc (vect_analyze_slp): Use new function
lane_reducing_op_p to check statement code.
* tree-vect-loop.cc (vect_transform_reduction): Likewise.
(vectorizable_reduction): Likewise, and change name of a local
variable that holds the result flag.
---
 gcc/tree-vect-loop.cc | 29 -
 gcc/tree-vect-slp.cc  |  4 +---
 gcc/tree-vectorizer.h |  6 ++
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 04a9ac64df7..a42d79c7cbf 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7650,9 +7650,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   gimple_match_op op;
   if (!gimple_extract_op (stmt_info->stmt, &op))
 gcc_unreachable ();
-  bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR
-   || op.code == WIDEN_SUM_EXPR
-   || op.code == SAD_EXPR);
+  bool lane_reducing = lane_reducing_op_p (op.code);
 
   if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type)
   && !SCALAR_FLOAT_TYPE_P (op.type))
@@ -7664,7 +7662,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 
   /* For lane-reducing ops we're reducing the number of reduction PHIs
   which means the only use of that may be in the lane-reducing operation.  */
-  if (lane_reduc_code_p
+  if (lane_reducing
   && reduc_chain_length != 1
   && !only_slp_reduc_chain)
 {
@@ -7678,7 +7676,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  since we'll mix lanes belonging to different reductions.  But it's
  OK to use them in a reduction chain or when the reduction group
  has just one element.  */
-  if (lane_reduc_code_p
+  if (lane_reducing
   && slp_node
   && !REDUC_GROUP_FIRST_ELEMENT (stmt_info)
   && SLP_TREE_LANES (slp_node) > 1)
@@ -7738,7 +7736,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   /* To properly compute ncopies we are interested in the widest
 non-reduction input type in case we're looking at a widening
 accumulation that we later handle in vect_transform_reduction.  */
-  if (lane_reduc_code_p
+  if (lane_reducing
  && vectype_op[i]
  && (!vectype_in
  || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in)))
@@ -8211,7 +8209,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   && loop_vinfo->suggested_unroll_factor == 1)
 single_defuse_cycle = true;
 
-  if (single_defuse_cycle || lane_reduc_code_p)
+  if (single_defuse_cycle || lane_reducing)
 {
   gcc_assert (op.code != COND_EXPR);
 
@@ -8227,7 +8225,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 mixed-sign dot-products can be implemented using signed
 dot-products.  */
   machine_mode vec_mode = TYPE_MODE (vectype_in);
-  if (!lane_reduc_code_p
+  if (!lane_reducing
  && !directly_supported_p (op.code, vectype_in, optab_vector))
 {
   if (dump_enabled_p ())
@@ -8252,7 +8250,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
  For the other cases try without the single cycle optimization.  */
   if (!ok)
{
- if (lane_reduc_code_p)
+ if (lane_reducing)
return false;
  else
single_defuse_cycle = false;
@@ -8263,7 +8261,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   /* If the reduction stmt is one of the patterns that have lane
  reduction embedded we cannot handle the case of ! single_defuse_cycle.  */
   if ((ncopies > 1 && ! single_defuse_cycle)
-  && lane_reduc_code_p)
+  && lane_reducing)
 {
   if (dump_enabled_p ())
dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -8274,7 +8272,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
 
   if (slp_node
   && !(!single_defuse_cycle
-  && !lane_reduc_code_p
+  && !lane_reducing
   && reduction_type != FOLD_LEFT_REDUCTION))
 for (i = 0; i < (int) op.num_ops; i++)
   if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i]))
@@ -8295,7 +8293,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
   /* Cost the reduction op inside the loop if transformed via
  vect_transform_reduction.  Otherwise this is costed by the
  separate vectorizable_* routines.  */
-  if (single_defuse_cycle || lane_reduc_code_p)
+  if (single_defuse_cycle || lane_reducing)
 {
   int factor = 1;
   if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info))
@@ -8313,7 +8311,7 @@ vectori

Re: [PATCH] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-05-30 Thread Feng Xue OS
>> Hi,
>>
>> The patch has been updated against the newest trunk, and also contains
>> some minor changes.
>>
>> I am working on another new feature which is meant to support pattern
>> recognition of lane-reducing operations in an affine closure originating
>> from the loop reduction variable, like:
>>
>>   sum += cst1 * dot_prod_1 + cst2 * sad_2 + ... + cstN * lane_reducing_op_N
>>
>> The WIP feature depends on this patch. It has been quite a long time
>> since its post; would you please take some time to review this one? Thanks.

> This seems to do multiple things so I wonder if you can split up the
> patch a bit?

OK. Will send out split patches in new mails.

> For example adding lane_reducing_op_p can be split out, it also seems like
> the vect_transform_reduction change to better distribute work can be done
> separately?  Likewise refactoring like splitting out
> vect_reduction_use_partial_vector.
> 
> When we have
> 
>sum += d0[i] * d1[i];  // dot-prod 
>sum += w[i];   // widen-sum 
>sum += abs(s0[i] - s1[i]); // sad 
>sum += n[i];   // normal 
> 
> the vector DOT_PROD and friend ops can end up mixing different lanes
> since it is not specified which lanes are reduced into which output lane.
> So, DOT_PROD might combine 0-3, 4-7, ... but SAD might combine
> 0,4,8,12; 1,5,9,13; ... I think this isn't worse than what one op itself
> is doing, but it's worth pointing out (it's probably unlikely a target
> mixes different reduction strategies anyway).

Yes. But even if, on a peculiar target, DOT_PROD and SAD have different
reduction strategies, that does not impact result correctness, at least
for integer operations.
Is there anything special that we need to consider?
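
(A small illustration of why the grouping is harmless for integer
reductions: with lanes v0..v7, a DOT_PROD-style grouping reduces
(v0+v1+v2+v3) and (v4+v5+v6+v7), while a SAD-style grouping reduces (v0+v4),
(v1+v5), (v2+v6) and (v3+v7); the final horizontal sum of either partition
is v0+v1+...+v7, since integer addition is associative and commutative,
including under wrap-around.)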

> 
> Can you make sure to add at least one SLP reduction example to show
> this works for SLP as well?
OK. The patches contain cases for an SLP reduction chain. I will add one for
SLP reduction; this should be a negative case.

Thanks,
Feng


Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info

2024-05-29 Thread Feng Xue OS
Ok. Then I will add a TODO comment on the "bbs" field to describe it.

Thanks,
Feng



From: Richard Biener 
Sent: Wednesday, May 29, 2024 3:14 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info

On Tue, May 28, 2024 at 6:11 PM Feng Xue OS  wrote:
>
> Because the bbs of loop_vec_info needs to be allocated via old-fashioned
> XCNEWVEC in order to receive the result from dfs_enumerate_from (), we
> have to make bb_vec_info align with loop_vec_info and use basic_block *
> instead of vec<>. Another reason is that some loop-vect-related code
> assumes that bbs is a pointer, such as using LOOP_VINFO_BBS () to
> directly free the bbs area.

I think dfs_enumerate_from is fine with receiving bbs.address ()
(if you first grow the vector, of course).  There might be other code
that needs changing, sure.

> Encapsulating bbs into an array_slice might make the changed code more
> wordy, so I still chose basic_block * as its type. I updated the patch
> by removing bbs_as_vector.

The updated patch looks good to me.  Lifetime management of
the base class bbs done differently by _loop_vec_info and _bb_vec_info
is a bit ugly but it's a well isolated fact.

Thus, OK.

I do think we can turn the basic_block * back to a vec<> but this
can be done as followup if anybody has spare cycles.

Thanks,
Richard.

> Feng.
> 
> gcc/
> * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move
> initialization of bbs to explicit construction code.  Adjust the
> definition of nbbs.
> (update_epilogue_loop_vinfo): Update nbbs for epilog vinfo.
> * tree-vect-pattern.cc (vect_determine_precisions): Make
> loop_vec_info and bb_vec_info share same code.
> (vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop.
> * tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0]
> via base vec_info class.
> (_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data
> fields of input auto_vec<> bbs.
> (vect_slp_region): Use access to nbbs to replace original
> bbs.length().
> (vect_schedule_slp_node): Access to bbs[0] via base vec_info class.
> * tree-vectorizer.cc (vec_info::vec_info): Add initialization of
> bbs and nbbs.
> (vec_info::insert_seq_on_entry): Access to bbs[0] via base vec_info
> class.
> * tree-vectorizer.h (vec_info): Add new fields bbs and nbbs.
> (LOOP_VINFO_NBBS): New macro.
> (BB_VINFO_BBS): Rename BB_VINFO_BB to BB_VINFO_BBS.
> (BB_VINFO_NBBS): New macro.
> (_loop_vec_info): Remove field bbs.
> (_bb_vec_info): Rename field bbs.
> ---
>  gcc/tree-vect-loop.cc |   7 +-
>  gcc/tree-vect-patterns.cc | 142 +++---
>  gcc/tree-vect-slp.cc  |  23 +++---
>  gcc/tree-vectorizer.cc|   7 +-
>  gcc/tree-vectorizer.h |  19 +++--
>  5 files changed, 70 insertions(+), 128 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 3b94bb13a8b..04a9ac64df7 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -1028,7 +1028,6 @@ bb_in_loop_p (const_basic_block bb, const void *data)
>  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
>: vec_info (vec_info::loop, shared),
>  loop (loop_in),
> -bbs (XCNEWVEC (basic_block, loop->num_nodes)),
>  num_itersm1 (NULL_TREE),
>  num_iters (NULL_TREE),
>  num_iters_unchanged (NULL_TREE),
> @@ -1079,8 +1078,9 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, 
> vec_info_shared *shared)
>   case of the loop forms we allow, a dfs order of the BBs would the same
>   as reversed postorder traversal, so we are safe.  */
>
> -  unsigned int nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p,
> - bbs, loop->num_nodes, loop);
> +  bbs = XCNEWVEC (basic_block, loop->num_nodes);
> +  nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, bbs,
> +loop->num_nodes, loop);
>gcc_assert (nbbs == loop->num_nodes);
>
>for (unsigned int i = 0; i < nbbs; i++)
> @@ -11667,6 +11667,7 @@ update_epilogue_loop_vinfo (class loop *epilogue, 
> tree advance)
>
>free (LOOP_VINFO_BBS (epilogue_vinfo));
>LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs;
> +  LOOP_VINFO_NBBS (epilogue_vinfo) = epilogue->num_nodes;
>
>/* Advance data_reference's with the number of iterations of the previous
>   loop and its prologue.  */
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> i

Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info

2024-05-28 Thread Feng Xue OS
0644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -463,7 +463,9 @@ shrink_simd_arrays
 vec_info::vec_info (vec_info::vec_kind kind_in, vec_info_shared *shared_)
   : kind (kind_in),
 shared (shared_),
-stmt_vec_info_ro (false)
+stmt_vec_info_ro (false),
+bbs (NULL),
+nbbs (0)
 {
   stmt_vec_infos.create (50);
 }
@@ -660,9 +662,8 @@ vec_info::insert_seq_on_entry (stmt_vec_info context, 
gimple_seq seq)
 }
   else
 {
-  bb_vec_info bb_vinfo = as_a  (this);
   gimple_stmt_iterator gsi_region_begin
-   = gsi_after_labels (bb_vinfo->bbs[0]);
+   = gsi_after_labels (bbs[0]);
   gsi_insert_seq_before (&gsi_region_begin, seq, GSI_SAME_STMT);
 }
 }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 93bc30ef660..bd4f5952f4b 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -499,6 +499,12 @@ public:
  made any decisions about which vector modes to use.  */
   machine_mode vector_mode;

+  /* The basic blocks in the vectorization region.  */
+  basic_block *bbs;
+
+  /* The count of the basic blocks in the vectorization region.  */
+  unsigned int nbbs;
+
 private:
   stmt_vec_info new_stmt_vec_info (gimple *stmt);
   void set_vinfo_for_stmt (gimple *, stmt_vec_info, bool = true);
@@ -679,9 +685,6 @@ public:
   /* The loop to which this info struct refers to.  */
   class loop *loop;

-  /* The loop basic blocks.  */
-  basic_block *bbs;
-
   /* Number of latch executions.  */
   tree num_itersm1;
   /* Number of iterations.  */
@@ -969,6 +972,7 @@ public:
 #define LOOP_VINFO_EPILOGUE_IV_EXIT(L) (L)->vec_epilogue_loop_iv_exit
 #define LOOP_VINFO_SCALAR_IV_EXIT(L)   (L)->scalar_loop_iv_exit
 #define LOOP_VINFO_BBS(L)  (L)->bbs
+#define LOOP_VINFO_NBBS(L) (L)->nbbs
 #define LOOP_VINFO_NITERSM1(L) (L)->num_itersm1
 #define LOOP_VINFO_NITERS(L)   (L)->num_iters
 /* Since LOOP_VINFO_NITERS and LOOP_VINFO_NITERSM1 can change after
@@ -1094,16 +1098,11 @@ public:
   _bb_vec_info (vec bbs, vec_info_shared *);
   ~_bb_vec_info ();

-  /* The region we are operating on.  bbs[0] is the entry, excluding
- its PHI nodes.  In the future we might want to track an explicit
- entry edge to cover bbs[0] PHI nodes and have a region entry
- insert location.  */
-  vec bbs;
-
   vec roots;
 } *bb_vec_info;

-#define BB_VINFO_BB(B)   (B)->bb
+#define BB_VINFO_BBS(B)  (B)->bbs
+#define BB_VINFO_NBBS(B) (B)->nbbs
 #define BB_VINFO_GROUPED_STORES(B)   (B)->grouped_stores
 #define BB_VINFO_SLP_INSTANCES(B)(B)->slp_instances
 #define BB_VINFO_DATAREFS(B) (B)->shared->datarefs
--
2.17.1


From: Richard Biener 
Sent: Tuesday, May 28, 2024 5:43 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info

On Sat, May 25, 2024 at 4:54 PM Feng Xue OS  wrote:
>
> Both derived classes (loop_vec_info/bb_vec_info) have their own "bbs"
> field, which has exactly the same purpose of recording all basic blocks
> inside the corresponding vect region, but the fields are of different
> data types: one is a plain array, the other an auto_vec. This difference
> causes some duplicated code for handling the same stuff, mostly in
> tree-vect-patterns. One refinement is lifting this field into the base
> class "vec_info" and resetting its value to the contiguous memory area
> pointed to by the two old "bbs" fields in each derived-class constructor.

Nice.  But.  bbs_as_vector - why is that necessary?  Why is vinfo->bbs
not a vec?  Having bbs and nbbs feels like a step back.

Note the code duplications can probably be removed by "indirecting"
through an array_slice.

I'm a bit torn to approve this as-is given the above.  Can you explain what
made you not choose vec<> for bbs?  I bet you tried.

Richard.

> Regression test on x86-64 and aarch64.
>
> Feng
> --
> gcc/
> * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move
> initialization of bbs to explicit construction code.  Adjust the
> definition of nbbs.
> * tree-vect-pattern.cc (vect_determine_precisions): Make
> loop_vec_info and bb_vec_info share same code.
> (vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop.
> * tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0]
> via base vec_info class.
> (_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data
> fields of input auto_vec<> bbs.
> (_bb_vec_info::_bb_vec_info): Add assertions on bbs and nbbs to ensure
> they are not changed externally.
> (vect_slp_region): Use access to nbbs to replace original
> bbs.length(

Re: [PATCH] vect: Use vect representative statement instead of original in patch recog [PR115060]

2024-05-28 Thread Feng Xue OS
Changed as the comments.

Thanks,
Feng


From: Richard Biener 
Sent: Tuesday, May 28, 2024 5:34 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] vect: Use vect representative statement instead of original in patch recog [PR115060]

On Sat, May 25, 2024 at 4:45 PM Feng Xue OS  wrote:
>
> Some utility functions (such as vect_look_through_possible_promotion) that
> find a certain kind of direct or indirect SSA definition for a value may
> return the original SSA name rather than its pattern representative, even
> when a pattern is involved. For example,
>
>    a = (T1) patt_b;
>    patt_b = (T2) c;        // b = ...
>    patt_c = not-a-cast;    // c = ...
>
> Given 'a', the mentioned function will return 'c' instead of 'patt_c'. This
> subtlety can make pattern recog code that is unaware of it misuse the
> original instead of the new pattern statement, which is inconsistent with
> the processing logic of the pattern formation pass. This patch corrects the
> issue by making another utility function (vect_get_internal_def) return the
> pattern statement information to the caller by default.
>
> Regression test on x86-64 and aarch64.
>
> Feng
> --
> gcc/
> PR tree-optimization/115060
> * tree-vect-patterns.h (vect_get_internal_def): Add a new parameter
> for_vectorize.
> (vect_widened_op_tree): Call vect_get_internal_def instead of look_def
> to get statement information.
> (vect_recog_widen_abd_pattern): No need to call vect_stmt_to_vectorize.
> ---
>  gcc/tree-vect-patterns.cc | 16 +++-
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
> index a313dc64643..fa35bf26372 100644
> --- a/gcc/tree-vect-patterns.cc
> +++ b/gcc/tree-vect-patterns.cc
> @@ -258,15 +258,21 @@ vect_element_precision (unsigned int precision)
>  }
>
>  /* If OP is defined by a statement that's being considered for vectorization,
> -   return information about that statement, otherwise return NULL.  */
> +   return information about that statement, otherwise return NULL.
> +   FOR_VECTORIZE specifies whether the original or the vectorization
> +   representative (if any) statement information is returned.  */
>
>  static stmt_vec_info
> -vect_get_internal_def (vec_info *vinfo, tree op)
> +vect_get_internal_def (vec_info *vinfo, tree op, bool for_vectorize = true)

I'm probably blind - but you nowhere pass 'false' and I think returning the
pattern stmt is the correct behavior always.

OK with omitting the new parameter.

>  {
>stmt_vec_info def_stmt_info = vinfo->lookup_def (op);
>if (def_stmt_info
>&& STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_internal_def)
> -return def_stmt_info;
> +{
> +  if (for_vectorize)
> +   def_stmt_info = vect_stmt_to_vectorize (def_stmt_info);
> +  return def_stmt_info;
> +}
>return NULL;
>  }
>
> @@ -655,7 +661,8 @@ vect_widened_op_tree (vec_info *vinfo, stmt_vec_info 
> stmt_info, tree_code code,
>
>   /* Recursively process the definition of the operand.  */
>   stmt_vec_info def_stmt_info
> -   = vinfo->lookup_def (this_unprom->op);
> +   = vect_get_internal_def (vinfo, this_unprom->op);
> +
>   nops = vect_widened_op_tree (vinfo, def_stmt_info, code,
>widened_code, shift_p, max_nops,
>this_unprom, common_type,
> @@ -1739,7 +1746,6 @@ vect_recog_widen_abd_pattern (vec_info *vinfo, 
> stmt_vec_info stmt_vinfo,
>if (!abd_pattern_vinfo)
>  return NULL;
>
> -  abd_pattern_vinfo = vect_stmt_to_vectorize (abd_pattern_vinfo);
>gcall *abd_stmt = dyn_cast  (STMT_VINFO_STMT (abd_pattern_vinfo));
>if (!abd_stmt


[PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info

2024-05-25 Thread Feng Xue OS
Both derived classes (loop_vec_info/bb_vec_info) have their own "bbs"
field, which has exactly the same purpose of recording all basic blocks
inside the corresponding vect region, but the fields are of different
data types: one is a plain array, the other an auto_vec. This difference
causes some duplicated code for handling the same stuff, mostly in
tree-vect-patterns. One refinement is lifting this field into the base
class "vec_info" and resetting its value to the contiguous memory area
pointed to by the two old "bbs" fields in each derived-class constructor.
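
In miniature, the refactoring looks like this (a schematic sketch only, not
the real GCC classes; the base class owns the view of the region's blocks
while each derived class decides how the storage is created and freed):

   #include <vector>

   struct basic_block_def;                 // stands in for GCC's basic_block
   typedef basic_block_def *basic_block;

   struct vec_info_sketch
   {
     basic_block *bbs = nullptr;   // blocks of the vectorization region
     unsigned int nbbs = 0;        // their count
   };

   struct loop_vinfo_sketch : vec_info_sketch
   {
     // Loop case: allocate a plain array (XCNEWVEC in GCC) and fill it
     // via a DFS walk (dfs_enumerate_from in GCC).
     explicit loop_vinfo_sketch (unsigned num_nodes)
     {
       bbs = new basic_block[num_nodes] ();
       nbbs = num_nodes;
     }
     ~loop_vinfo_sketch () { delete[] bbs; }
   };

   struct bb_vinfo_sketch : vec_info_sketch
   {
     // BB case: alias the storage of the caller's block vector; the
     // derived class does not own it.
     bb_vinfo_sketch (std::vector<basic_block> &v)
     {
       bbs = v.data ();
       nbbs = (unsigned int) v.size ();
     }
   };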

Regression test on x86-64 and aarch64.

Feng
--
gcc/
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move
initialization of bbs to explicit construction code.  Adjust the
definition of nbbs.
* tree-vect-pattern.cc (vect_determine_precisions): Make
loop_vec_info and bb_vec_info share same code.
(vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop.
* tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0]
via base vec_info class.
(_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data
fields of input auto_vec<> bbs.
(_bb_vec_info::_bb_vec_info): Add assertions on bbs and nbbs to ensure
they are not changed externally.
(vect_slp_region): Use access to nbbs to replace original
bbs.length().
(vect_schedule_slp_node): Access to bbs[0] via base vec_info class.
* tree-vectorizer.cc (vec_info::vec_info): Add initialization of
bbs and nbbs.
(vec_info::insert_seq_on_entry): Access to bbs[0] via base vec_info
class.
* tree-vectorizer.h (vec_info): Add new fields bbs and nbbs.
(_loop_vec_info): Remove field bbs.
(_bb_vec_info): Rename old bbs field to bbs_as_vector, and make it
be private.
---
 gcc/tree-vect-loop.cc |   6 +-
 gcc/tree-vect-patterns.cc | 142 +++---
 gcc/tree-vect-slp.cc  |  24 ---
 gcc/tree-vectorizer.cc|   7 +-
 gcc/tree-vectorizer.h |  19 ++---
 5 files changed, 72 insertions(+), 126 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 83c0544b6aa..aef17420a5f 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -1028,7 +1028,6 @@ bb_in_loop_p (const_basic_block bb, const void *data)
 _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
   : vec_info (vec_info::loop, shared),
 loop (loop_in),
-bbs (XCNEWVEC (basic_block, loop->num_nodes)),
 num_itersm1 (NULL_TREE),
 num_iters (NULL_TREE),
 num_iters_unchanged (NULL_TREE),
@@ -1079,8 +1078,9 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, 
vec_info_shared *shared)
  case of the loop forms we allow, a dfs order of the BBs would the same
  as reversed postorder traversal, so we are safe.  */
 
-  unsigned int nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p,
- bbs, loop->num_nodes, loop);
+  bbs = XCNEWVEC (basic_block, loop->num_nodes);
+  nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, bbs,
+loop->num_nodes, loop);
   gcc_assert (nbbs == loop->num_nodes);
 
   for (unsigned int i = 0; i < nbbs; i++)
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index a313dc64643..848a3195a93 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -6925,81 +6925,41 @@ vect_determine_stmt_precisions (vec_info *vinfo, 
stmt_vec_info stmt_info)
 void
 vect_determine_precisions (vec_info *vinfo)
 {
+  basic_block *bbs = vinfo->bbs;
+  unsigned int nbbs = vinfo->nbbs;
+
   DUMP_VECT_SCOPE ("vect_determine_precisions");
 
-  if (loop_vec_info loop_vinfo = dyn_cast  (vinfo))
+  for (unsigned int i = 0; i < nbbs; i++)
 {
-  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
-  basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo);
-  unsigned int nbbs = loop->num_nodes;
-
-  for (unsigned int i = 0; i < nbbs; i++)
+  basic_block bb = bbs[i];
+  for (auto gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
{
- basic_block bb = bbs[i];
- for (auto gsi = gsi_start_phis (bb);
-  !gsi_end_p (gsi); gsi_next (&gsi))
-   {
- stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi.phi ());
- if (stmt_info)
-   vect_determine_mask_precision (vinfo, stmt_info);
-   }
- for (auto si = gsi_start_bb (bb); !gsi_end_p (si); gsi_next (&si))
-   if (!is_gimple_debug (gsi_stmt (si)))
- vect_determine_mask_precision
-   (vinfo, vinfo->lookup_stmt (gsi_stmt (si)));
+ stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi.phi ());
+ if (stmt_info && STMT_VINFO_VECTORIZABLE (stmt_info))
+   vect_determine_mask_precision (vinfo, stmt_info);
}
-  for (unsigned int i = 0; i < nbbs; 

[PATCH] vect: Use vect representative statement instead of original in patch recog [PR115060]

2024-05-25 Thread Feng Xue OS
Some utility functions (such as vect_look_through_possible_promotion) that are
to find out certain kind of direct or indirect definition SSA for a value, may
return the original one of the SSA, not its pattern representative SSA, even
pattern is involved. For example,

   a = (T1) patt_b;
   patt_b = (T2) c;// b = ...
   patt_c = not-a-cast;// c = ...

Given 'a', the mentioned function will return 'c' instead of 'patt_c'. This
subtlety would make pattern recog code that is unaware of it misuse the
original instead of the new pattern statement, which is inconsistent with the
processing logic of the pattern formation pass. This patch corrects the issue
by making another utility function (vect_get_internal_def) return the pattern
statement information to the caller by default.
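
As a hedged sketch using only names from the patch below, for the example
above with OP being the SSA name 'c', a caller now gets the pattern
representative by default, and can still request the original statement
explicitly via the new parameter:

   stmt_vec_info patt = vect_get_internal_def (vinfo, op);         /* patt_c */
   stmt_vec_info orig = vect_get_internal_def (vinfo, op, false);  /* c */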

Regression tested on x86-64 and aarch64.

Feng
--
gcc/
PR tree-optimization/115060
* tree-vect-patterns.cc (vect_get_internal_def): Add a new parameter
for_vectorize.
(vect_widened_op_tree): Call vect_get_internal_def instead of lookup_def
to get statement information.
(vect_recog_widen_abd_pattern): No need to call vect_stmt_to_vectorize.
---
 gcc/tree-vect-patterns.cc | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index a313dc64643..fa35bf26372 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -258,15 +258,21 @@ vect_element_precision (unsigned int precision)
 }
 
 /* If OP is defined by a statement that's being considered for vectorization,
-   return information about that statement, otherwise return NULL.  */
+   return information about that statement, otherwise return NULL.
+   FOR_VECTORIZE is used to specify whether original or vectorization
+   representative (if any) statement information is returned.  */
 
 static stmt_vec_info
-vect_get_internal_def (vec_info *vinfo, tree op)
+vect_get_internal_def (vec_info *vinfo, tree op, bool for_vectorize = true)
 {
   stmt_vec_info def_stmt_info = vinfo->lookup_def (op);
   if (def_stmt_info
   && STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_internal_def)
-return def_stmt_info;
+{
+  if (for_vectorize)
+   def_stmt_info = vect_stmt_to_vectorize (def_stmt_info);
+  return def_stmt_info;
+}
   return NULL;
 }
 
@@ -655,7 +661,8 @@ vect_widened_op_tree (vec_info *vinfo, stmt_vec_info 
stmt_info, tree_code code,
 
  /* Recursively process the definition of the operand.  */
  stmt_vec_info def_stmt_info
-   = vinfo->lookup_def (this_unprom->op);
+   = vect_get_internal_def (vinfo, this_unprom->op);
+
  nops = vect_widened_op_tree (vinfo, def_stmt_info, code,
   widened_code, shift_p, max_nops,
   this_unprom, common_type,
@@ -1739,7 +1746,6 @@ vect_recog_widen_abd_pattern (vec_info *vinfo, 
stmt_vec_info stmt_vinfo,
   if (!abd_pattern_vinfo)
 return NULL;
 
-  abd_pattern_vinfo = vect_stmt_to_vectorize (abd_pattern_vinfo);
   gcall *abd_stmt = dyn_cast  (STMT_VINFO_STMT (abd_pattern_vinfo));
   if (!abd_stmt

Re: [PATCH] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-05-24 Thread Feng Xue OS
Hi,

The patch has been updated against the newest trunk, and also contains some
minor changes.

I am working on another new feature which is meant to support pattern
recognition of lane-reducing operations in an affine closure originating
from the loop reduction variable, like:

  sum += cst1 * dot_prod_1 + cst2 * sad_2 + ... + cstN * lane_reducing_op_N
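
As a concrete illustration, such a closure could come from a source loop like
the following hypothetical one (the types, constants and array names are all
made up here):

   signed char d0[N], d1[N];
   unsigned char s0[N], s1[N];
   int sum = 0;
   for (int i = 0; i < N; i++)
     sum += 3 * (d0[i] * d1[i])                   /* cst1 * dot_prod_1 */
            + 5 * __builtin_abs (s0[i] - s1[i]);  /* cst2 * sad_2 */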

The WIP feature depends on this patch. It has been quite a while since it
was posted; would you please take some time to review this one? Thanks.

Feng


gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
(lane_reducing_op_p): New function.
(vectorizable_lane_reducing): New function declaration.
* tree-vect-stmts.cc (vectorizable_condition): Treat the condition
statement that is pointed to by stmt_vec_info of reduction PHI as the
real "for_reduction" statement.
(vect_analyze_stmt): Call new function vectorizable_lane_reducing
to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter
loop_vinfo. Get input vectype from stmt_info instead of reduction PHI.
(vect_model_reduction_cost): Remove cost computation code related to
emulated_mixed_dot_prod.
(vect_reduction_use_partial_vector): New function.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing, and move partial vectorization checking
code to vect_reduction_use_partial_vector.
(vect_transform_reduction): Extend transformation to support reduction
statements with mixed input vectypes.
* tree-vect-slp.cc (vect_analyze_slp): Use new function
lane_reducing_op_p to check statement code.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c: New test.
* gcc.dg/vect/vect-reduc-chain-2.c: New test.
* gcc.dg/vect/vect-reduc-chain-3.c: New test.
* gcc.dg/vect/vect-reduc-dot-slp-1.c: New test.
* gcc.dg/vect/vect-reduc-dot-slp-2.c: New test.
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  62 ++
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  77 ++
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  66 ++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  97 +++
 .../gcc.dg/vect/vect-reduc-dot-slp-2.c|  81 +++
 gcc/tree-vect-loop.cc | 680 --
 gcc/tree-vect-slp.cc  |   4 +-
 gcc/tree-vect-stmts.cc|  13 +-
 gcc/tree-vectorizer.h |  14 +
 9 files changed, 873 insertions(+), 221 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a better way to deal with scans.  */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { 
aarch64*-*-* || arm*-*-* } } } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+
+#include "tree-vect.h"
+
+#define N 50
+
+#ifndef SIGNEDNESS_1
+#define SIGNEDNESS_1 signed
+#define SIGNEDNESS_2 signed
+#endif
+
+SIGNEDNESS_1 int __attribute__ ((noipa))
+f (SIGNEDNESS_1 int res,
+   SIGNEDNESS_2 char *restrict a,
+   SIGNEDNESS_2 char *restrict b,
+   SIGNEDNESS_2 char *restrict c,
+   SIGNEDNESS_2 char *restrict d,
+   SIGNEDNESS_1 int *restrict e)
+{
+  for (int i = 0; i < N; ++i)
+{
+  res += a[i] * b[i];
+  res += c[i] * d[i];
+  res += e[i];
+}
+  return res;
+}
+
+#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4)
+#define OFFSET 20
+
+int
+main (void)
+{
+  check_vect ();
+
+  SIGNEDNESS_2 char a[N], b[N];
+  SIGNEDNESS_2 char c[N], d[N];
+  SIGNEDNESS_1 int e[N];
+  int expected = 0x12345;
+  for (int i = 0; i < N; ++i)
+{
+  a[i] = BASE + i * 5;
+  b[i] = BASE + OFFSET + i * 4;
+  c[i] = BASE + i * 2;
+  d[i] = BASE + OFFSET + i * 3;
+  e[i] = i;
+  asm volatile ("" ::: "memory");
+  expected += a[i] * b[i];
+  expected += c[i] * d[i];
+  expected += e[i];
+}
+  if (f (0x12345, a, b, c, d, e) != expected)
+__builtin_abort ();
+}
+
+/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" 
} } */
+/* { dg-final { scan-tree

[PATCH] vect: Support multiple lane-reducing operations for loop reduction [PR114440]

2024-04-07 Thread Feng Xue OS
For a lane-reducing operation (dot-prod/widen-sum/sad) in loop reduction, the
current vectorizer can only handle the pattern if the reduction chain does not
contain any other operation, no matter whether that other is normal or
lane-reducing.

Actually, to allow multiple arbitrary lane-reducing operations, we need to
support vectorization of a loop reduction chain with mixed input vectypes.
Since the lanes of a vectype may vary with the operation, the effective
ncopies of the vectorized statements may also differ between operations, which
causes a mismatch in the vectorized def-use cycles. A simple way is to align
all operations with the one that has the most ncopies; the gap can be filled
by generating extra trivial pass-through copies. For example:

   int sum = 0;
   for (i)
 {
   sum += d0[i] * d1[i];  // dot-prod 
   sum += w[i];   // widen-sum 
   sum += abs(s0[i] - s1[i]); // sad 
   sum += n[i];   // normal 
 }

The vector size is 128-bit and the vectorization factor is 16. The reduction
statements would be transformed as:

   vector<4> int sum_v0 = { 0, 0, 0, 0 };
   vector<4> int sum_v1 = { 0, 0, 0, 0 };
   vector<4> int sum_v2 = { 0, 0, 0, 0 };
   vector<4> int sum_v3 = { 0, 0, 0, 0 };

   for (i / 16)
 {
   sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
   sum_v1 = sum_v1;  // copy
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
   sum_v2 = sum_v2;  // copy
   sum_v3 = sum_v3;  // copy

   sum_v0 = sum_v0;  // copy
   sum_v1 = sum_v1;  // copy
   sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
   sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);

   sum_v0 += n_v0[i: 0  ~ 3 ];
   sum_v1 += n_v1[i: 4  ~ 7 ];
   sum_v2 += n_v2[i: 8  ~ 11];
   sum_v3 += n_v3[i: 12 ~ 15];
 }

Moreover, for higher instruction parallelism in the final vectorized loop,
the effective vectorized lane-reducing statements are distributed evenly among
all def-use cycles. In the above example, DOT_PROD, WIDEN_SUM and the SADs are
generated into disparate cycles.
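
After the loop, the four def-use cycles would be combined by the usual
reduction epilogue, roughly as in this sketch (the epilogue itself is not
part of this patch):

   vector<4> int sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;
   sum = REDUC_PLUS (sum_v);   // horizontal add of the four lanes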

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
gcc/
PR tree-optimization/114440
* tree-vectorizer.h (struct _stmt_vec_info): Add a new field
reduc_result_pos.
(vectorizable_lane_reducing): New function declaration.
* tree-vect-stmts.cc (vectorizable_condition): Treat the condition
statement that is pointed to by stmt_vec_info of reduction PHI as the
real "for_reduction" statement.
(vect_analyze_stmt): Call new function vectorizable_lane_reducing
to analyze lane-reducing operation.
* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter
loop_vinfo. Get input vectype from stmt_info instead of reduction PHI.
(vect_model_reduction_cost): Remove cost computation code related to
emulated_mixed_dot_prod.
(vect_reduction_use_partial_vector): New function.
(vectorizable_lane_reducing): New function.
(vectorizable_reduction): Allow multiple lane-reducing operations in
loop reduction. Move some original lane-reducing related code to
vectorizable_lane_reducing, and move partial vectorization checking
code to vect_reduction_use_partial_vector.
(vect_transform_reduction): Extend transformation to support reduction
statements with mixed input vectypes.

gcc/testsuite/
PR tree-optimization/114440
* gcc.dg/vect/vect-reduc-chain-1.c: New test.
* gcc.dg/vect/vect-reduc-chain-2.c: New test.
* gcc.dg/vect/vect-reduc-chain-3.c: New test.
* gcc.dg/vect/vect-reduc-dot-slp-1.c: New test.
* gcc.dg/vect/vect-reduc-dot-slp-2.c: New test.
---
 .../gcc.dg/vect/vect-reduc-chain-1.c  |  62 ++
 .../gcc.dg/vect/vect-reduc-chain-2.c  |  77 ++
 .../gcc.dg/vect/vect-reduc-chain-3.c  |  66 ++
 .../gcc.dg/vect/vect-reduc-dot-slp-1.c|  97 +++
 .../gcc.dg/vect/vect-reduc-dot-slp-2.c|  81 +++
 gcc/tree-vect-loop.cc | 668 --
 gcc/tree-vect-stmts.cc|  13 +-
 gcc/tree-vectorizer.h |   8 +
 8 files changed, 863 insertions(+), 209 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c

diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c 
b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
new file mode 100644
index 000..04bfc419dbd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c
@@ -0,0 +1,62 @@
+/* Disabling epilogues until we find a

Re: [PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091]

2024-01-11 Thread Feng Xue OS
+ scalar_use_map.put (op, 1);
+   }
+  else
+   {
+ for (slp_tree child : SLP_TREE_CHILDREN (node))
+   if (child && !visited.add (child))
+ worklist.safe_push (child);
+   }
+} while (!worklist.is_empty ());
+
+  visited.empty ();
+
+  for (slp_instance instance : bb_vinfo->slp_instances)
+{
+  vect_location = instance->location ();
+  vect_bb_slp_mark_live_stmts (bb_vinfo, SLP_INSTANCE_TREE (instance),
+  instance, &instance->cost_vec,
+  scalar_use_map, svisited, visited);
+}
 }

 /* Determine whether we can vectorize the reduction epilogue for INSTANCE.  */
@@ -6684,17 +6823,7 @@ vect_slp_analyze_operations (vec_info *vinfo)

   /* Compute vectorizable live stmts.  */
   if (bb_vec_info bb_vinfo = dyn_cast  (vinfo))
-{
-  hash_set svisited;
-  hash_set visited;
-  for (i = 0; vinfo->slp_instances.iterate (i, &instance); ++i)
-   {
- vect_location = instance->location ();
- vect_bb_slp_mark_live_stmts (bb_vinfo, SLP_INSTANCE_TREE (instance),
-  instance, &instance->cost_vec, svisited,
-  visited);
-   }
-}
+vect_bb_slp_mark_live_stmts (bb_vinfo);

   return !vinfo->slp_instances.is_empty ();
 }
--
2.17.1


From: Richard Biener 
Sent: Thursday, January 11, 2024 5:52 PM
To: Feng Xue OS; Richard Sandiford
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Do not count unused scalar use when marking 
STMT_VINFO_LIVE_P [PR113091]

On Thu, Jan 11, 2024 at 10:46 AM Richard Biener
 wrote:
>
> On Fri, Dec 29, 2023 at 11:29 AM Feng Xue OS
>  wrote:
> >
> > This patch is meant to fix over-estimation about SLP vector-to-scalar cost 
> > for
> > STMT_VINFO_LIVE_P statement. When pattern recognition is involved, a
> > statement whose definition is consumed in some pattern, may not be
> > included in the final replacement pattern statements, and would be skipped
> > when building SLP graph.
> >
> >  * Original
> >   char a_c = *(char *) a;
> >   char b_c = *(char *) b;
> >   unsigned short a_s = (unsigned short) a_c;
> >   int a_i = (int) a_s;
> >   int b_i = (int) b_c;
> >   int r_i = a_i - b_i;
> >
> >  * After pattern replacement
> >   a_s = (unsigned short) a_c;
> >   a_i = (int) a_s;
> >
> >   patt_b_s = (unsigned short) b_c;// b_i = (int) b_c
> >   patt_b_i = (int) patt_b_s;  // b_i = (int) b_c
> >
> >   patt_r_s = widen_minus(a_c, b_c);   // r_i = a_i - b_i
> >   patt_r_i = (int) patt_r_s;  // r_i = a_i - b_i
> >
> > The definitions of a_i(original statement) and b_i(pattern statement)
> > are related to, but actually not part of widen_minus pattern.
> > Vectorizing the pattern does not cause these definition statements to
> > be marked as PURE_SLP.  For this case, we need to recursively check
> > whether their uses are all absorbed into vectorized code.  But there
> > is an exception that some use may participate in an vectorized
> > operation via an external SLP node containing that use as an element.
> >
> > Feng
> >
> > ---
> >  .../gcc.target/aarch64/bb-slp-pr113091.c  |  22 ++
> >  gcc/tree-vect-slp.cc  | 189 ++
> >  2 files changed, 172 insertions(+), 39 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c 
> > b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
> > new file mode 100644
> > index 000..ff822e90b4a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
> > @@ -0,0 +1,22 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -fdump-tree-slp-details 
> > -ftree-slp-vectorize" } */
> > +
> > +int test(unsigned array[8]);
> > +
> > +int foo(char *a, char *b)
> > +{
> > +  unsigned array[8];
> > +
> > +  array[0] = (a[0] - b[0]);
> > +  array[1] = (a[1] - b[1]);
> > +  array[2] = (a[2] - b[2]);
> > +  array[3] = (a[3] - b[3]);
> > +  array[4] = (a[4] - b[4]);
> > +  array[5] = (a[5] - b[5]);
> > +  array[6] = (a[6] - b[6]);
> > +  array[7] = (a[7] - b[7]);
> > +
> > +  return test(array);
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using 
> > SLP" 1 "slp2" } } */
> > diff --git a/gcc

PING: [PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091]

2024-01-10 Thread Feng Xue OS
Hi, Richard,

  Would you please take a look at this patch?

Thanks,
Feng


From: Feng Xue OS 
Sent: Friday, December 29, 2023 6:28 PM
To: gcc-patches@gcc.gnu.org
Subject: [PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P 
[PR113091]

This patch is meant to fix an over-estimation of the SLP vector-to-scalar
cost for STMT_VINFO_LIVE_P statements. When pattern recognition is involved, a
statement whose definition is consumed in some pattern may not be included in
the final replacement pattern statements, and would be skipped when building
the SLP graph.

 * Original
  char a_c = *(char *) a;
  char b_c = *(char *) b;
  unsigned short a_s = (unsigned short) a_c;
  int a_i = (int) a_s;
  int b_i = (int) b_c;
  int r_i = a_i - b_i;

 * After pattern replacement
  a_s = (unsigned short) a_c;
  a_i = (int) a_s;

  patt_b_s = (unsigned short) b_c;    // b_i = (int) b_c
  patt_b_i = (int) patt_b_s;          // b_i = (int) b_c

  patt_r_s = widen_minus(a_c, b_c);   // r_i = a_i - b_i
  patt_r_i = (int) patt_r_s;          // r_i = a_i - b_i

The definitions of a_i (original statement) and b_i (pattern statement)
are related to, but actually not part of, the widen_minus pattern.
Vectorizing the pattern does not cause these definition statements to
be marked as PURE_SLP.  For this case, we need to recursively check
whether their uses are all absorbed into vectorized code.  But there
is an exception: some use may participate in a vectorized operation
via an external SLP node containing that use as an element.
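
For contrast, a hypothetical case where the live-scalar cost is genuinely
needed (use_scalar is a made-up non-vectorizable consumer):

  int r_i = a_i - b_i;   // still replaced by the widen_minus pattern
  use_scalar (r_i);      // this scalar use is not absorbed into
                         // vectorized code, so r_i must stay live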

Feng

---
 .../gcc.target/aarch64/bb-slp-pr113091.c  |  22 ++
 gcc/tree-vect-slp.cc  | 189 ++
 2 files changed, 172 insertions(+), 39 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c

diff --git a/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c 
b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
new file mode 100644
index 000..ff822e90b4a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-slp-details -ftree-slp-vectorize" 
} */
+
+int test(unsigned array[8]);
+
+int foo(char *a, char *b)
+{
+  unsigned array[8];
+
+  array[0] = (a[0] - b[0]);
+  array[1] = (a[1] - b[1]);
+  array[2] = (a[2] - b[2]);
+  array[3] = (a[3] - b[3]);
+  array[4] = (a[4] - b[4]);
+  array[5] = (a[5] - b[5]);
+  array[6] = (a[6] - b[6]);
+  array[7] = (a[7] - b[7]);
+
+  return test(array);
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using 
SLP" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index a82fca45161..d36ff37114e 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6418,6 +6418,84 @@ vect_slp_analyze_node_operations (vec_info *vinfo, 
slp_tree node,
   return res;
 }

+/* Given a definition DEF, analyze if it will have any live scalar use after
+   performing SLP vectorization whose information is represented by BB_VINFO,
+   and record result into hash map SCALAR_USE_MAP as cache for later fast
+   check.  */
+
+static bool
+vec_slp_has_scalar_use (bb_vec_info bb_vinfo, tree def,
+   hash_map &scalar_use_map)
+{
+  imm_use_iterator use_iter;
+  gimple *use_stmt;
+
+  if (bool *res = scalar_use_map.get (def))
+return *res;
+
+  FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def)
+{
+  if (is_gimple_debug (use_stmt))
+   continue;
+
+  stmt_vec_info use_stmt_info = bb_vinfo->lookup_stmt (use_stmt);
+
+  if (!use_stmt_info)
+   break;
+
+  if (PURE_SLP_STMT (vect_stmt_to_vectorize (use_stmt_info)))
+   continue;
+
+  /* Do not step forward when encountering a PHI statement, since it may
+involve cyclic reference and cause infinite recursive invocation.  */
+  if (gimple_code (use_stmt) == GIMPLE_PHI)
+   break;
+
+  /* When pattern recognition is involved, a statement whose definition is
+consumed in some pattern, may not be included in the final replacement
+pattern statements, so would be skipped when building SLP graph.
+
+* Original
+ char a_c = *(char *) a;
+ char b_c = *(char *) b;
+ unsigned short a_s = (unsigned short) a_c;
+ int a_i = (int) a_s;
+ int b_i = (int) b_c;
+ int r_i = a_i - b_i;
+
+* After pattern replacement
+ a_s = (unsigned short) a_c;
+ a_i = (int) a_s;
+
+ patt_b_s = (unsigned short) b_c;// b_i = (int) b_c
+ patt_b_i = (int) patt_b_s;  // b_i = (int) b_c
+
+ patt_r_s = widen_minus(a_c, b_c);   // r_i = a_i - b_i
+ patt_r_i = (int) patt_r_s;  // r_i = a_i - b_i
+
+The definitions of a_i(original statement) and b_i(pattern statement)
+are related to, but act

[PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091]

2023-12-29 Thread Feng Xue OS
This patch is meant to fix an over-estimation of the SLP vector-to-scalar
cost for STMT_VINFO_LIVE_P statements. When pattern recognition is involved, a
statement whose definition is consumed in some pattern may not be included in
the final replacement pattern statements, and would be skipped when building
the SLP graph.

 * Original
  char a_c = *(char *) a;
  char b_c = *(char *) b;
  unsigned short a_s = (unsigned short) a_c;
  int a_i = (int) a_s;
  int b_i = (int) b_c;
  int r_i = a_i - b_i;

 * After pattern replacement
  a_s = (unsigned short) a_c;
  a_i = (int) a_s;

  patt_b_s = (unsigned short) b_c;    // b_i = (int) b_c
  patt_b_i = (int) patt_b_s;          // b_i = (int) b_c

  patt_r_s = widen_minus(a_c, b_c);   // r_i = a_i - b_i
  patt_r_i = (int) patt_r_s;          // r_i = a_i - b_i

The definitions of a_i (original statement) and b_i (pattern statement)
are related to, but actually not part of, the widen_minus pattern.
Vectorizing the pattern does not cause these definition statements to
be marked as PURE_SLP.  For this case, we need to recursively check
whether their uses are all absorbed into vectorized code.  But there
is an exception: some use may participate in a vectorized operation
via an external SLP node containing that use as an element.

Feng

---
 .../gcc.target/aarch64/bb-slp-pr113091.c  |  22 ++
 gcc/tree-vect-slp.cc  | 189 ++
 2 files changed, 172 insertions(+), 39 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c

diff --git a/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c 
b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
new file mode 100644
index 000..ff822e90b4a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-slp-details -ftree-slp-vectorize" 
} */
+
+int test(unsigned array[8]);
+
+int foo(char *a, char *b)
+{
+  unsigned array[8];
+
+  array[0] = (a[0] - b[0]);
+  array[1] = (a[1] - b[1]);
+  array[2] = (a[2] - b[2]);
+  array[3] = (a[3] - b[3]);
+  array[4] = (a[4] - b[4]);
+  array[5] = (a[5] - b[5]);
+  array[6] = (a[6] - b[6]);
+  array[7] = (a[7] - b[7]);
+
+  return test(array);
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using 
SLP" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index a82fca45161..d36ff37114e 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6418,6 +6418,84 @@ vect_slp_analyze_node_operations (vec_info *vinfo, 
slp_tree node,
   return res;
 }
 
+/* Given a definition DEF, analyze if it will have any live scalar use after
+   performing SLP vectorization whose information is represented by BB_VINFO,
+   and record result into hash map SCALAR_USE_MAP as cache for later fast
+   check.  */
+
+static bool
+vec_slp_has_scalar_use (bb_vec_info bb_vinfo, tree def,
+   hash_map &scalar_use_map)
+{
+  imm_use_iterator use_iter;
+  gimple *use_stmt;
+
+  if (bool *res = scalar_use_map.get (def))
+return *res;
+
+  FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def)
+{
+  if (is_gimple_debug (use_stmt))
+   continue;
+
+  stmt_vec_info use_stmt_info = bb_vinfo->lookup_stmt (use_stmt);
+
+  if (!use_stmt_info)
+   break;
+
+  if (PURE_SLP_STMT (vect_stmt_to_vectorize (use_stmt_info)))
+   continue;
+
+  /* Do not step forward when encountering a PHI statement, since it may
+involve cyclic reference and cause infinite recursive invocation.  */
+  if (gimple_code (use_stmt) == GIMPLE_PHI)
+   break;
+
+  /* When pattern recognition is involved, a statement whose definition is
+consumed in some pattern, may not be included in the final replacement
+pattern statements, so would be skipped when building SLP graph.
+
+* Original
+ char a_c = *(char *) a;
+ char b_c = *(char *) b;
+ unsigned short a_s = (unsigned short) a_c;
+ int a_i = (int) a_s;
+ int b_i = (int) b_c;
+ int r_i = a_i - b_i;
+
+* After pattern replacement
+ a_s = (unsigned short) a_c;
+ a_i = (int) a_s;
+
+ patt_b_s = (unsigned short) b_c;// b_i = (int) b_c
+ patt_b_i = (int) patt_b_s;  // b_i = (int) b_c
+
+ patt_r_s = widen_minus(a_c, b_c);   // r_i = a_i - b_i
+ patt_r_i = (int) patt_r_s;  // r_i = a_i - b_i
+
+The definitions of a_i(original statement) and b_i(pattern statement)
+are related to, but actually not part of widen_minus pattern.
+Vectorizing the pattern does not cause these definition statements to
+be marked as PURE_SLP.  For this case, we need to recursively check
+whether their uses are all absorbed into vectorized code.  But there
+is an exception that some use may participate in an vectorized
+  

[PATCH] arm/aarch64: Add bti for all functions [PR106671]

2023-08-02 Thread Feng Xue OS via Gcc-patches
This patch extends the option -mbranch-protection=bti with an optional
argument, bti[+all], to force the compiler to unconditionally insert a BTI
for all functions. The reason is that a direct function call present at
compile time might later be rewritten to an indirect call through some kind
of linker-generated thunk stub acting as an invocation relay. One instance is
when a direct callee is placed far from its caller: a direct BL {imm}
instruction could not represent the distance, so an indirect BLR {reg} has to
be used. For this case, a BTI is required at the beginning of the callee.

   caller() {
   bl callee
   }

=>

   caller() {
   adrp   reg, 
   add    reg, reg, #constant
   blr    reg
   }

Although the issue could be fixed with a fairly new version of ld, here we
provide another means for users who have to rely on an old ld or another
non-ld linker. I also checked LLVM: by default, it implements BTI just as the
proposed -mbranch-protection=bti+all does.
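
A hypothetical usage example (the file name is made up; the option form is
the one proposed here):

   gcc -mbranch-protection=bti+all -c callee.c

With plain -mbranch-protection=bti, a function that appears to be only called
directly receives no BTI landing pad; with bti+all every function receives
one, so a linker-generated BLR-based thunk cannot reach an unguarded entry.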

Feng

---
 gcc/config/aarch64/aarch64.cc| 12 +++-
 gcc/config/aarch64/aarch64.opt   |  2 +-
 gcc/config/arm/aarch-bti-insert.cc   |  3 ++-
 gcc/config/arm/aarch-common.cc   | 22 ++
 gcc/config/arm/aarch-common.h| 18 ++
 gcc/config/arm/arm.cc|  4 ++--
 gcc/config/arm/arm.opt   |  2 +-
 gcc/doc/invoke.texi  | 16 ++--
 gcc/testsuite/gcc.target/aarch64/bti-5.c | 17 +
 9 files changed, 76 insertions(+), 20 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/bti-5.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 71215ef9fee..a404447c8d0 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -8997,7 +8997,8 @@ void aarch_bti_arch_check (void)
 bool
 aarch_bti_enabled (void)
 {
-  return (aarch_enable_bti == 1);
+  gcc_checking_assert (aarch_enable_bti != AARCH_BTI_FUNCTION_UNSET);
+  return (aarch_enable_bti != AARCH_BTI_FUNCTION_NONE);
 }
 
 /* Check if INSN is a BTI J insn.  */
@@ -18454,12 +18455,12 @@ aarch64_override_options (void)
 
   selected_tune = tune ? tune->ident : cpu->ident;
 
-  if (aarch_enable_bti == 2)
+  if (aarch_enable_bti == AARCH_BTI_FUNCTION_UNSET)
 {
 #ifdef TARGET_ENABLE_BTI
-  aarch_enable_bti = 1;
+  aarch_enable_bti = AARCH_BTI_FUNCTION;
 #else
-  aarch_enable_bti = 0;
+  aarch_enable_bti = AARCH_BTI_FUNCTION_NONE;
 #endif
 }
 
@@ -22881,7 +22882,8 @@ aarch64_print_patchable_function_entry (FILE *file,
   basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
 
   if (!aarch_bti_enabled ()
-  || cgraph_node::get (cfun->decl)->only_called_directly_p ())
+  || (aarch_enable_bti != AARCH_BTI_FUNCTION_ALL
+ && cgraph_node::get (cfun->decl)->only_called_directly_p ()))
 {
   /* Emit the patchable_area at the beginning of the function.  */
   rtx_insn *insn = emit_insn_before (pa, BB_HEAD (bb));
diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
index 025e52d40e5..5571f7e916d 100644
--- a/gcc/config/aarch64/aarch64.opt
+++ b/gcc/config/aarch64/aarch64.opt
@@ -37,7 +37,7 @@ TargetVariable
 aarch64_feature_flags aarch64_isa_flags = 0
 
 TargetVariable
-unsigned aarch_enable_bti = 2
+enum aarch_bti_function_type aarch_enable_bti = AARCH_BTI_FUNCTION_UNSET
 
 TargetVariable
 enum aarch_key_type aarch_ra_sign_key = AARCH_KEY_A
diff --git a/gcc/config/arm/aarch-bti-insert.cc 
b/gcc/config/arm/aarch-bti-insert.cc
index 71a77e29406..babd2490c9f 100644
--- a/gcc/config/arm/aarch-bti-insert.cc
+++ b/gcc/config/arm/aarch-bti-insert.cc
@@ -164,7 +164,8 @@ rest_of_insert_bti (void)
  functions that are already protected by Return Address Signing (PACIASP/
  PACIBSP).  For all other cases insert a BTI C at the beginning of the
  function.  */
-  if (!cgraph_node::get (cfun->decl)->only_called_directly_p ())
+  if (aarch_enable_bti == AARCH_BTI_FUNCTION_ALL
+  || !cgraph_node::get (cfun->decl)->only_called_directly_p ())
 {
   bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb;
   insn = BB_HEAD (bb);
diff --git a/gcc/config/arm/aarch-common.cc b/gcc/config/arm/aarch-common.cc
index 5b96ff4c2e8..7751d40f909 100644
--- a/gcc/config/arm/aarch-common.cc
+++ b/gcc/config/arm/aarch-common.cc
@@ -666,7 +666,7 @@ static enum aarch_parse_opt_result
 aarch_handle_no_branch_protection (char* str, char* rest)
 {
   aarch_ra_sign_scope = AARCH_FUNCTION_NONE;
-  aarch_enable_bti = 0;
+  aarch_enable_bti = AARCH_BTI_FUNCTION_NONE;
   if (rest)
 {
   error ("unexpected %<%s%> after %<%s%>", rest, str);
@@ -680,7 +680,7 @@ aarch_handle_standard_branch_protection (char* str, char* 
rest)
 {
   aarch_ra_sign_scope = AARCH_FUNCTION_NON_LEAF;
   aarch_ra_sign_key = AARCH_KEY_A;
-  aarch_enable_bti = 1;
+  aarch_enable_bti = AARCH_BTI_FUNCTION;
   if (rest)
 {
   error ("unexpected %<%s%> after %<%s%>", res

PING^2: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

2021-10-14 Thread Feng Xue OS via Gcc-patches
Thanks,
Feng


From: Feng Xue OS
Sent: Thursday, September 16, 2021 5:26 PM
To: Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org
Cc: JiangNing OS
Subject: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

This patch is to extend the applicability of full devirtualization to the
LTRANS stage. Normally, the whole program assumption would not hold when WPA
splits the whole compilation into more than one LTRANS partition. To avoid
information loss for WPD at LTRANS, we will record all vtable nodes and
related member function references into each partition.
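
A hypothetical illustration of the information loss this guards against (the
file and class names are made up):

   // tu1.C, placed into LTRANS partition 1
   struct B { virtual int f () { return 0; } };
   int call (B *p) { return p->f (); }  // needs the full set of targets

   // tu2.C, placed into LTRANS partition 2
   struct C : B { int f () { return 1; } };
   B *make () { return new C; }  // without recording C's vtable into
                                 // partition 1, C::f would be invisible
                                 // to devirtualization there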

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Streaming out
initial value of vtable even if its class is optimized away.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
of TYPE_CXX_LOCAL for a type into streaming data.
---
From 624aef44d72799ae488a431b4dce730f4b0fc28e Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 20:34:50 +0800
Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS

The whole program assumption would not hold when WPA splits the whole
compilation into more than one LTRANS partition. To avoid information loss
for WPD at LTRANS, we will record all vtable nodes and related member
function references into each partition.

2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Streaming out
initial value of vtable even if its class is optimized away.
* lto-streamer-in.c (lto_input_tree): There might be more than
one decl in dref_queue; register debuginfo for all of them.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
of TYPE_CXX_LOCAL for a type into streaming data.
---
 gcc/ipa-devirt.c| 33 ++---
 gcc/ipa.c   |  7 ++-
 gcc/lto-cgraph.c| 19 +++
 gcc/lto-streamer-in.c   |  3 +--
 gcc/lto-streamer-out.c  | 12 +++-
 gcc/lto/lto-lang.c  |  6 ++
 gcc/tree-core.h |  3 +++
 gcc/tree-streamer-in.c  | 11 ---
 gcc/tree-streamer-out.c | 16 +---
 gcc/tree.h  |  5 +
 10 files changed, 90 insertions(+), 25 deletions(-)

diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c
index 284c449c6c1..bb929f016f8 100644
--- a/gcc/ipa-devirt.c
+++ b/gcc/ipa-devirt.c
@@ -216,8 +216,6 @@ struct GTY(()) odr_type_d
   int id;
   /* Is it in anonymous namespace? */
   bool anonymous_namespace;
-  /* Set when type is not used outside of program.  */
-  bool whole_program_local;
   /* Did we

PING^2: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-10-14 Thread Feng Xue OS via Gcc-patches
Hi, Honza & Martin,

  Would you please take some time to review the proposal and patches of whole
program devirtualization? We have to say, this feature is not 100% safe, but
it provides us a way to deploy correct WPD on a C++ program if we elaborately
prepare the linked libraries to ensure rtti symbols are contained, which is
always the case for libstdc++ and well-composed third-party C++ libraries
built with default gcc options. If not, the libraries could be rebuilt with
the desirable options, and this does not require invasive modification of
source code, which is an advantage over LLVM's visibility-based scheme.
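
As a hypothetical sketch of what the analysis enables (the class names are
made up): if no regular object or library resolves the typeinfo symbol of C,
then C is deemed whole-program local, and the indirect call below can be
replaced by a direct call once C::f is the only possible target:

   struct B { virtual int f (); };
   struct C : B { int f () { return 1; } };
   int call (B *p) { return p->f (); }  // -> direct call to C::f under WPD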

Now the gcc-12 dev branch is at a late stage, since time will soon step into
November.  Anyway, we are not sure whether it is acceptable or not. But if
yes, getting it in before the code freeze would be a good time point.

I also made some minor changes to the patches, and posted the RFC link here
for your convenience:  (https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html)

Thanks,
Feng


From: Feng Xue OS 
Sent: Saturday, September 18, 2021 5:38 PM
To: Jason Merrill; Jan Hubicka; mjam...@suse.cz; Richard Biener; 
gcc-patches@gcc.gnu.org
Subject: Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

>On 9/16/21 22:29, Feng Xue OS wrote:
>>> On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote:
>>>> This and following patches are composed to enable full devirtualization
>>>> under whole program assumption (so also called whole-program
>>>> devirtualization, WPD for short), which is an enhancement to current
>>>> speculative devirtualization. The base of the optimization is how to
>>>> identify class type that is local in terms of whole-program scope, at
>>>> least  those class types in libstdc++ must be excluded in some way.
>>>> Our means is to use typeinfo symbol as identity marker of a class since
>>>> it is unique and always generated once the class or its derived type
>>>> is instantiated somewhere, and rely on symbol resolution by
>>>> lto-linker-plugin to detect whether  a typeinfo is referenced by regular
>>>> object/library, which indirectly tells class types are escaped or not.
>>>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
>>>> gives more details on that.
>>>>
>>>> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>> 
>>>> 2021-09-07  Feng Xue  
>>>>
>>>> gcc/
>>>>* common.opt (-fdevirtualize-fully): New option.
>>>>* class.c (build_rtti_vtbl_entries): Force generation of typeinfo
>>>>even -fno-rtti is specificied under full devirtualization.
>>>
>>> This makes -fno-rtti useless; rather than this, you should warn about
>>> the combination of flags and force flag_rtti on.  It also sounds like
>>> you depend on the library not being built with -fno-rtti.
>>
>> Although rtti is generated by front-end, we will remove it after lto symtab
>> merge, which is meant to keep same behavior as -fno-rtti.
>
> Ah, the cp/ change is OK, then, with a comment about that.
>
>> Yes, regular library to be linked with should contain rtti data, otherwise
>> WPD could not deduce class type usage safely. By default, we can think
>> that it should work for libstdc++, but it probably becomes a problem for
>> user library, which might be avoided if we properly document this
>> requirement and suggest user doing that when using WPD.
>
> Yes, I would expect that external libraries would be built with RTTI on
> to allow users to use RTTI features even if they aren't used within the
> library.  But it's good to document it as a requirement.
>
>> +   /* If a class with virtual base is only instantiated as
>> +  subobjects of derived classes, and has no complete object in
>> +  compilation unit, merely construction vtables will be 
>> involved,
>> +  its primary vtable is really not needed, and subject to being
>> +  removed.  So once a vtable node is encountered, for all
>> +  polymorphic base classes of the vtable's context class, always
>> +  force generation of primary vtable nodes when full
>> +  devirtualization is enabled.  */
>
> Why do you need the primary vtable if you're relying on RTTI info?
> Construction vtables will point to the same RTTI node.

At middle end, the easiest way to get vtable of type is via TYPE_BINFO(type),
it is the primary one. And WPD relies on existence of varpool_node of the
vtable dec

PING: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

2021-09-29 Thread Feng Xue OS via Gcc-patches
Made some minor changes.

Thanks,
Feng


From: Feng Xue OS
Sent: Thursday, September 16, 2021 5:26 PM
To: Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org
Cc: JiangNing OS
Subject: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

This patch is to extend the applicability of full devirtualization to the
LTRANS stage. Normally, the whole program assumption would not hold when WPA
splits the whole compilation into more than one LTRANS partition. To avoid
information loss for WPD at LTRANS, we will record all vtable nodes and
related member function references into each partition.

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Streaming out
initial value of vtable even if its class is optimized away.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
of TYPE_CXX_LOCAL for a type into streaming data.
---
From 2c0d243b0c092585561c732bac490700f41001fb Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 20:34:50 +0800
Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS

The whole program assumption would not hold when WPA splits the whole
compilation into more than one LTRANS partition. To avoid information loss
for WPD at LTRANS, we will record all vtable nodes and related member
function references into each partition.

2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Streaming out
initial value of vtable even if its class is optimized away.
* lto-streamer-in.c (lto_input_tree): There might be more than
one decl in dref_queue; register debuginfo for all of them.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
of TYPE_CXX_LOCAL for a type into streaming data.

---
 gcc/ipa-devirt.c| 29 ++---
 gcc/ipa.c   |  7 ++-
 gcc/lto-cgraph.c| 18 ++
 gcc/lto-streamer-in.c   |  3 +--
 gcc/lto-streamer-out.c  | 12 +++-
 gcc/lto/lto-lang.c  |  6 ++
 gcc/tree-core.h |  3 +++
 gcc/tree-streamer-in.c  | 11 ---
 gcc/tree-streamer-out.c | 11 ---
 gcc/tree.h  |  5 +
 10 files changed, 84 insertions(+), 21 deletions(-)

diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c
index a7d04388dab..4ff551bace8 100644
--- a/gcc/ipa-devirt.c
+++ b/gcc/ipa-devirt.c
@@ -216,8 +216,6 @@ struct GTY(()) odr_type_d
   int id;
   /* Is it in anonymous namespace? */
   bool anonymous_namespace;
-  /* Set when type is not used outside of program.  */
-  bool

PING: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-29 Thread Feng Xue OS via Gcc-patches
Minor update with some bug fixes and comment wording changes.

Thanks,
Feng


From: Feng Xue OS 
Sent: Saturday, September 18, 2021 5:38 PM
To: Jason Merrill; Jan Hubicka; mjam...@suse.cz; Richard Biener; 
gcc-patches@gcc.gnu.org
Subject: Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

>On 9/16/21 22:29, Feng Xue OS wrote:
>>> On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote:
>>>> This and following patches are composed to enable full devirtualization
>>>> under whole program assumption (so also called whole-program
>>>> devirtualization, WPD for short), which is an enhancement to current
>>>> speculative devirtualization. The base of the optimization is how to
>>>> identify class type that is local in terms of whole-program scope, at
>>>> least  those class types in libstdc++ must be excluded in some way.
>>>> Our means is to use typeinfo symbol as identity marker of a class since
>>>> it is unique and always generated once the class or its derived type
>>>> is instantiated somewhere, and rely on symbol resolution by
>>>> lto-linker-plugin to detect whether  a typeinfo is referenced by regular
>>>> object/library, which indirectly tells class types are escaped or not.
>>>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
>>>> gives more details on that.
>>>>
>>>> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>> 
>>>> 2021-09-07  Feng Xue  
>>>>
>>>> gcc/
>>>>* common.opt (-fdevirtualize-fully): New option.
>>>>* class.c (build_rtti_vtbl_entries): Force generation of typeinfo
>>>>even -fno-rtti is specificied under full devirtualization.
>>>
>>> This makes -fno-rtti useless; rather than this, you should warn about
>>> the combination of flags and force flag_rtti on.  It also sounds like
>>> you depend on the library not being built with -fno-rtti.
>>
>> Although rtti is generated by front-end, we will remove it after lto symtab
>> merge, which is meant to keep same behavior as -fno-rtti.
>
> Ah, the cp/ change is OK, then, with a comment about that.
>
>> Yes, regular library to be linked with should contain rtti data, otherwise
>> WPD could not deduce class type usage safely. By default, we can think
>> that it should work for libstdc++, but it probably becomes a problem for
>> user library, which might be avoided if we properly document this
>> requirement and suggest user doing that when using WPD.
>
> Yes, I would expect that external libraries would be built with RTTI on
> to allow users to use RTTI features even if they aren't used within the
> library.  But it's good to document it as a requirement.
>
>> +   /* If a class with virtual base is only instantiated as
>> +  subobjects of derived classes, and has no complete object in
>> +  compilation unit, merely construction vtables will be 
>> involved,
>> +  its primary vtable is really not needed, and subject to being
>> +  removed.  So once a vtable node is encountered, for all
>> +  polymorphic base classes of the vtable's context class, always
>> +  force generation of primary vtable nodes when full
>> +  devirtualization is enabled.  */
>
> Why do you need the primary vtable if you're relying on RTTI info?
> Construction vtables will point to the same RTTI node.

At middle end, the easiest way to get the vtable of a type is via
TYPE_BINFO(type); it is the primary one. And WPD relies on the existence of a
varpool_node for the vtable decl to determine whether the type has been
removed (when it is never instantiated), so we will force generation of
vtable nodes at a very early stage. Additionally, a construction vtable
(C-in-D) belongs to the class (D) of the complete object, not the class (C)
of the subobject actually being constructed; it is not easy to correlate a
construction vtable with its subobject class (C) after the front end.

>
>> +   /* Public class w/o key member function (or local class in a public
>> +  inline function) requires COMDAT-like vtable so as to be shared
>> +  among units.  But C++ privatizing via -fno-weak would introduce
>> +  multiple static vtable copies for one class in merged lto symbol
>> +  table.  This breaks one-to-one correspondence between class and
>> +  vtable, and makes class liveness check become not that easy.  

[PATCH] Fix value uninitialization in vn_reference_insert_pieces [PR102400]

2021-09-22 Thread Feng Xue OS via Gcc-patches
Bootstrapped/regtested on x86_64-linux.

Thanks,
Feng
---
2021-09-23  Feng Xue  

gcc/ChangeLog
PR tree-optimization/102400
* tree-ssa-sccvn.c (vn_reference_insert_pieces): Initialize
result_vdef to NULL_TREE.
---
 gcc/tree-ssa-sccvn.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c
index a901f51a025..e8b1c39184d 100644
--- a/gcc/tree-ssa-sccvn.c
+++ b/gcc/tree-ssa-sccvn.c
@@ -3811,6 +3811,7 @@ vn_reference_insert_pieces (tree vuse, alias_set_type set,
   if (result && TREE_CODE (result) == SSA_NAME)
 result = SSA_VAL (result);
   vr1->result = result;
+  vr1->result_vdef = NULL_TREE;
 
   slot = valid_info->references->find_slot_with_hash (vr1, vr1->hashcode,
  INSERT);
-- 
2.17.1



[PATCH] Fix null-pointer dereference in delete_dead_or_redundant_call [PR102451]

2021-09-22 Thread Feng Xue OS via Gcc-patches
Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng

---
2021-09-23  Feng Xue  

gcc/ChangeLog:
PR tree-optimization/102451
* tree-ssa-dse.c (delete_dead_or_redundant_call): Record bb of stmt
before removal.
---
 gcc/tree-ssa-dse.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c
index 98daa8ab24c..27287fe88ee 100644
--- a/gcc/tree-ssa-dse.c
+++ b/gcc/tree-ssa-dse.c
@@ -978,6 +978,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, 
const char *type)
   fprintf (dump_file, "\n");
 }
 
+  basic_block bb = gimple_bb (stmt);
   tree lhs = gimple_call_lhs (stmt);
   if (lhs)
 {
@@ -985,7 +986,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, 
const char *type)
   gimple *new_stmt = gimple_build_assign (lhs, ptr);
   unlink_stmt_vdef (stmt);
   if (gsi_replace (gsi, new_stmt, true))
-bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
+   bitmap_set_bit (need_eh_cleanup, bb->index);
 }
   else
 {
@@ -994,7 +995,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, 
const char *type)
 
   /* Remove the dead store.  */
   if (gsi_remove (gsi, true))
-   bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index);
+   bitmap_set_bit (need_eh_cleanup, bb->index);
   release_defs (stmt);
 }
 }
-- 
2.17.1



Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-18 Thread Feng Xue OS via Gcc-patches
>On 9/16/21 22:29, Feng Xue OS wrote:
>>> On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote:
>>>> This and following patches are composed to enable full devirtualization
>>>> under whole program assumption (so also called whole-program
>>>> devirtualization, WPD for short), which is an enhancement to current
>>>> speculative devirtualization. The base of the optimization is how to
>>>> identify class type that is local in terms of whole-program scope, at
>>>> least  those class types in libstdc++ must be excluded in some way.
>>>> Our means is to use typeinfo symbol as identity marker of a class since
>>>> it is unique and always generated once the class or its derived type
>>>> is instantiated somewhere, and rely on symbol resolution by
>>>> lto-linker-plugin to detect whether  a typeinfo is referenced by regular
>>>> object/library, which indirectly tells class types are escaped or not.
>>>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
>>>> gives more details on that.
>>>>
>>>> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>>>>
>>>> Thanks,
>>>> Feng
>>>>
>>>> 
>>>> 2021-09-07  Feng Xue  
>>>>
>>>> gcc/
>>>>* common.opt (-fdevirtualize-fully): New option.
>>>>* class.c (build_rtti_vtbl_entries): Force generation of typeinfo
>>>>even -fno-rtti is specificied under full devirtualization.
>>>
>>> This makes -fno-rtti useless; rather than this, you should warn about
>>> the combination of flags and force flag_rtti on.  It also sounds like
>>> you depend on the library not being built with -fno-rtti.
>>
>> Although rtti is generated by front-end, we will remove it after lto symtab
>> merge, which is meant to keep same behavior as -fno-rtti.
>
> Ah, the cp/ change is OK, then, with a comment about that.
>
>> Yes, regular library to be linked with should contain rtti data, otherwise
>> WPD could not deduce class type usage safely. By default, we can think
>> that it should work for libstdc++, but it probably becomes a problem for
>> user library, which might be avoided if we properly document this
>> requirement and suggest user doing that when using WPD.
>
> Yes, I would expect that external libraries would be built with RTTI on
> to allow users to use RTTI features even if they aren't used within the
> library.  But it's good to document it as a requirement.
>
>> +   /* If a class with virtual base is only instantiated as
>> +  subobjects of derived classes, and has no complete object in
>> +  compilation unit, merely construction vtables will be 
>> involved,
>> +  its primary vtable is really not needed, and subject to being
>> +  removed.  So once a vtable node is encountered, for all
>> +  polymorphic base classes of the vtable's context class, always
>> +  force generation of primary vtable nodes when full
>> +  devirtualization is enabled.  */
>
> Why do you need the primary vtable if you're relying on RTTI info?
> Construction vtables will point to the same RTTI node.

At middle end, the easiest way to get the vtable of a type is via
TYPE_BINFO(type); it is the primary one. And WPD relies on the existence of a
varpool_node for the vtable decl to determine whether the type has been
removed (when it is never instantiated), so we will force generation of
vtable nodes at a very early stage. Additionally, a construction vtable
(C-in-D) belongs to the class (D) of the complete object, not the class (C)
of the subobject actually being constructed; it is not easy to correlate a
construction vtable with its subobject class (C) after the front end.
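
A hypothetical shape of that situation:

   struct B { virtual ~B () {} };
   struct C : virtual B { };  // only ever constructed as a base subobject
   struct D : C { };          // constructing D's C subobject uses the
                              // C-in-D construction vtable, which belongs
                              // to D rather than to C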

>
>> +   /* Public class w/o key member function (or local class in a public
>> +  inline function) requires COMDAT-like vtable so as to be shared
>> +  among units.  But C++ privatizing via -fno-weak would introduce
>> +  multiple static vtable copies for one class in merged lto symbol
>> +  table.  This breaks one-to-one correspondence between class and
>> +  vtable, and makes class liveness check become not that easy.  To
>> +  be simple, we exclude such kind of class from our choice list.
>
> Same question.  Also, why would you use -fno-weak?  Forcing multiple
> copies of things we're perfectly capable of combining seems like a
> strange choice.  You can privatize things with the symbol visibility
> controls or RTLD_LOCAL.

We expect that the user does not specify -fno-weak for WPD. But if it is
specified, we should correctly handle that and bypass the type. And indeed
there is no need to force generation of the vtable in this situation.  But if
a vtable is not keyed to any compilation unit, we might never have any copy
of it in an ordinary build, while its class type is meaningful to
whole-program analysis, such as an abstract root class.
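
For instance, a made-up abstract root class of that kind:

   struct Root
   {
     virtual void run () = 0;  // abstract, no key function, and never
     virtual ~Root () {}       // directly instantiated: an ordinary build
   };                          // may emit no vtable copy for it at all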

Thanks,
Feng

Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-16 Thread Feng Xue OS via Gcc-patches
>On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote:
>> This and following patches are composed to enable full devirtualization
>> under whole program assumption (so also called whole-program
>> devirtualization, WPD for short), which is an enhancement to current
>> speculative devirtualization. The base of the optimization is how to
>> identify class type that is local in terms of whole-program scope, at
>> least  those class types in libstdc++ must be excluded in some way.
>> Our means is to use typeinfo symbol as identity marker of a class since
>> it is unique and always generated once the class or its derived type
>> is instantiated somewhere, and rely on symbol resolution by
>> lto-linker-plugin to detect whether  a typeinfo is referenced by regular
>> object/library, which indirectly tells class types are escaped or not.
>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
>> gives more details on that.
>>
>> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>>
>> Thanks,
>> Feng
>>
>> 
>> 2021-09-07  Feng Xue  
>>
>> gcc/
>>   * common.opt (-fdevirtualize-fully): New option.
>>   * class.c (build_rtti_vtbl_entries): Force generation of typeinfo
>>   even -fno-rtti is specificied under full devirtualization.
>
>This makes -fno-rtti useless; rather than this, you should warn about
>the combination of flags and force flag_rtti on.  It also sounds like
>you depend on the library not being built with -fno-rtti.

Although rtti is generated by the front-end, we will remove it after lto
symtab merge, which is meant to keep the same behavior as -fno-rtti.

Yes, a regular library to be linked with should contain rtti data, otherwise
WPD could not deduce class type usage safely. By default, we can assume that
this holds for libstdc++, but it probably becomes a problem for user
libraries, which might be avoided if we properly document this requirement
and suggest users do that when using WPD.

Thanks
Feng
>
>>   * cgraph.c (cgraph_update_edges_for_call_stmt): Add an assertion
>>   to check node to be traversed.
>>   * cgraphclones.c (cgraph_node::find_replacement): Record
>>   former_clone_of on replacement node.
>>   * cgraphunit.c (symtab_node::needed_p): Always output vtable for
>>   full devirtualization.
>>   (analyze_functions): Force generation of primary vtables for all
>>   base classes.
>>   * ipa-devirt.c (odr_type_d::whole_program_local): New field.
>>   (odr_type_d::has_virtual_base): Likewise.
>>   (odr_type_d::all_derivations_known): Removed.
>>   (odr_type_d::whole_program_local_p): New member function.
>>   (odr_type_d::all_derivations_known_p): Likewise.
>>   (odr_type_d::possibly_instantiated_p): Likewise.
>>   (odr_type_d::set_has_virtual_base): Likewise.
>>   (get_odr_type): Set "whole_program_local" and "has_virtual_base"
>>   when adding a type.
>>   (type_all_derivations_known_p): Replace implementation by a call
>>   to odr_type_d::all_derivations_known_p.
>>   (type_possibly_instantiated_p): Replace implementation by a call
>>   to odr_type_d::possibly_instantiated_p.
>>   (type_known_to_have_no_derivations_p): Replace call to
>>   type_possibly_instantiated_p with call to
>>   odr_type_d::possibly_instantiated_p.
>>   (type_all_ctors_visible_p): Removed.
>>   (type_whole_program_local_p): New function.
>>   (get_type_vtable): Likewise.
>>   (extract_typeinfo_in_vtable): Likewise.
>>   (identify_whole_program_local_types): Likewise.
>>   (dump_odr_type): Dump has_virtual_base and whole_program_local_p()
>>   of type.
>>   (maybe_record_node): Resort to type_whole_program_local_p to
>>   check whether a class has been optimized away.
>>   (record_target_from_binfo): Remove parameter "anonymous", add
>>   a new parameter "possibly_instantiated", and adjust code
>>   accordingly.
>>   (devirt_variable_node_removal_hook): Replace call to
>>   "type_in_anonymous_namespace_p" with "type_whole_program_local_p".
>>   (possible_polymorphic_call_targets): Replace call to
>>   "type_possibly_instantiated_p" with "possibly_instantiated_p",
>>   replace flag check on "all_derivations_known" with call to
>>"all_derivations_known_p".
>>   * ipa-icf.c (filter_removed_items): Disable folding on vtable
>>   under full devirtualization.
>>   * ipa-

[PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS

2021-09-16 Thread Feng Xue OS via Gcc-patches
This patch extends applicability of full devirtualization to the LTRANS
stage. Normally, the whole-program assumption would not hold when WPA splits
the compilation into more than one LTRANS partition. To avoid information
loss for WPD at LTRANS, we record all vtable nodes and related member
function references into each partition.

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* tree.h (TYPE_CXX_LOCAL): New macro for type using
base.nothrow_flag.
* tree-core.h (tree_base): Update comment on using
base.nothrow_flag to represent TYPE_CXX_LOCAL.
* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
on type, and enable WPD at LTRANS when flag_devirtualize_fully
is true.
(get_odr_type): Remove setting whole_program_local flag on type.
(identify_whole_program_local_types): Replace whole_program_local
in odr_type_d by TYPE_CXX_LOCAL on type.
(maybe_record_node): Enable WPD at LTRANS when
flag_devirtualize_fully is true.
* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
stage under full devirtualization.
* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
to boundary of each LTRANS partition.
* lto-streamer-out.c (get_symbol_initial_value): Stream out
initial value of vtable even if its class is optimized away.
* lto-lang.c (lto_post_options): Disable full devirtualization
if flag_ltrans_devirtualize is false.
* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
of TYPE_CXX_LOCAL for a type from streaming data.
* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
of TYPE_CXX_LOCAL for a type into streaming data.
---
From 3af32b9aadff23d339750ada4541386b3d358edc Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 20:34:50 +0800
Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS

The whole-program assumption would not hold when WPA splits the compilation
into more than one LTRANS partition. To avoid information loss for WPD
at LTRANS, we record all vtable nodes and related member function
references into each partition.

2021-09-07  Feng Xue  

gcc/
	* tree.h (TYPE_CXX_LOCAL): New macro for type using
	base.nothrow_flag.
	* tree-core.h (tree_base): Update comment on using
	base.nothrow_flag to represent TYPE_CXX_LOCAL.
	* ipa-devirt.c (odr_type_d::whole_program_local): Removed.
	(odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag
	on type, and enable WPD at LTRANS when flag_devirtualize_fully
	is true.
	(get_odr_type): Remove setting whole_program_local flag on type.
	(identify_whole_program_local_types): Replace whole_program_local
	in odr_type_d by TYPE_CXX_LOCAL on type.
	(maybe_record_node): Enable WPD at LTRANS when
	flag_devirtualize_fully is true.
	* ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS
	stage under full devirtualization.
	* lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables
	to boundary of each LTRANS partition.
	* lto-streamer-out.c (get_symbol_initial_value): Stream out
	initial value of vtable even if its class is optimized away.
	* lto-lang.c (lto_post_options): Disable full devirtualization
	if flag_ltrans_devirtualize is false.
	* tree-streamer-in.c (unpack_ts_base_value_fields): Unpack value
	of TYPE_CXX_LOCAL for a type from streaming data.
	* tree-streamer-out.c (pack_ts_base_value_fields): Pack value
	of TYPE_CXX_LOCAL for a type into streaming data.
---
 gcc/ipa-devirt.c| 29 ++---
 gcc/ipa.c   |  7 ++-
 gcc/lto-cgraph.c| 18 ++
 gcc/lto-streamer-out.c  | 12 +++-
 gcc/lto/lto-lang.c  |  6 ++
 gcc/tree-core.h |  3 +++
 gcc/tree-streamer-in.c  | 11 ---
 gcc/tree-streamer-out.c | 11 ---
 gcc/tree.h  |  5 +
 9 files changed, 83 insertions(+), 19 deletions(-)

diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c
index fcb097d7156..65e9ebbfb59 100644
--- a/gcc/ipa-devirt.c
+++ b/gcc/ipa-devirt.c
@@ -216,8 +216,6 @@ struct GTY(()) odr_type_d
   int id;
   /* Is it in anonymous namespace? */
   bool anonymous_namespace;
-  /* Set when type is not used outside of program.  */
-  bool whole_program_local;
   /* Did we report ODR violation here?  */
   bool odr_violated;
   /* Set when virtual table without RTTI prevailed table with.  */
@@ -290,10 +288,18 @@ get_type_vtable (tree type)
 bool
 odr_type_d::whole_program_local_p ()
 {
-  if (flag_ltrans)
+  if (flag_ltrans && !flag_devirtualize_fully)
 return false;
 
-  return whole_program_local;
+  if (in_lto_p)
+return TYPE_CXX_LOCAL (type);
+
+  /* Although a local class is always considered as whole program loca

[PATCH/RFC 1/2] WPD: Enable whole program devirtualization

2021-09-16 Thread Feng Xue OS via Gcc-patches
This and following patches are composed to enable full devirtualization
under whole program assumption (so also called whole-program
devirtualization, WPD for short), which is an enhancement to current
speculative devirtualization. The base of the optimization is how to
identify class type that is local in terms of whole-program scope, at
least those class types in libstdc++ must be excluded in some way.
Our means is to use typeinfo symbol as identity marker of a class since
it is unique and always generated once the class or its derived type
is instantiated somewhere, and rely on symbol resolution by
lto-linker-plugin to detect whether a typeinfo is referenced by regular
object/library, which indirectly tells class types are escaped or not.
The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html
gives more details on that.
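
As a concrete illustration (standard Itanium C++ ABI behavior, not anything
patch-specific):

  // For this class the typeinfo symbol is _ZTI4Base.  Any regular
  // (non-LTO) object that uses Base polymorphically, via dynamic_cast,
  // catch handlers, or by deriving from it, references _ZTI4Base, so the
  // lto-linker-plugin resolution marks it as externally referenced and
  // WPD treats the whole hierarchy as escaped.
  class Base
  {
  public:
    virtual ~Base () {}
  };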

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Thanks,
Feng


2021-09-07  Feng Xue  

gcc/
* common.opt (-fdevirtualize-fully): New option.
* class.c (build_rtti_vtbl_entries): Force generation of typeinfo
even if -fno-rtti is specified under full devirtualization.
* cgraph.c (cgraph_update_edges_for_call_stmt): Add an assertion
to check node to be traversed.
* cgraphclones.c (cgraph_node::find_replacement): Record
former_clone_of on replacement node.
* cgraphunit.c (symtab_node::needed_p): Always output vtable for
full devirtualization.
(analyze_functions): Force generation of primary vtables for all
base classes.
* ipa-devirt.c (odr_type_d::whole_program_local): New field.
(odr_type_d::has_virtual_base): Likewise.
(odr_type_d::all_derivations_known): Removed.
(odr_type_d::whole_program_local_p): New member function.
(odr_type_d::all_derivations_known_p): Likewise.
(odr_type_d::possibly_instantiated_p): Likewise.
(odr_type_d::set_has_virtual_base): Likewise.
(get_odr_type): Set "whole_program_local" and "has_virtual_base"
when adding a type.
(type_all_derivations_known_p): Replace implementation by a call
to odr_type_d::all_derivations_known_p.
(type_possibly_instantiated_p): Replace implementation by a call
to odr_type_d::possibly_instantiated_p.
(type_known_to_have_no_derivations_p): Replace call to
type_possibly_instantiated_p with call to
odr_type_d::possibly_instantiated_p.
(type_all_ctors_visible_p): Removed.
(type_whole_program_local_p): New function.
(get_type_vtable): Likewise.
(extract_typeinfo_in_vtable): Likewise.
(identify_whole_program_local_types): Likewise.
(dump_odr_type): Dump has_virtual_base and whole_program_local_p()
of type.
(maybe_record_node): Resort to type_whole_program_local_p to
check whether a class has been optimized away.
(record_target_from_binfo): Remove parameter "anonymous", add
a new parameter "possibly_instantiated", and adjust code
accordingly.
(devirt_variable_node_removal_hook): Replace call to
"type_in_anonymous_namespace_p" with "type_whole_program_local_p".
(possible_polymorphic_call_targets): Replace call to
"type_possibly_instantiated_p" with "possibly_instantiated_p",
replace flag check on "all_derivations_known" with call to
 "all_derivations_known_p".
* ipa-icf.c (filter_removed_items): Disable folding on vtable
under full devirtualization.
* ipa-polymorphic-call.c (restrict_to_inner_class): Move odr
type check to type_known_to_have_no_derivations_p.
* ipa-utils.h (identify_whole_program_local_types): New
declaration.
(type_all_derivations_known_p): Parameter type adjustment.
* ipa.c (walk_polymorphic_call_targets): Do not mark vcall
targets as reachable for full devirtualization.
(can_remove_vtable_if_no_refs_p): New function.
(symbol_table::remove_unreachable_nodes): Add defined vtables
to reachable list under full devirtualization.
* lto-symtab.c (lto_symtab_merge_symbols): Identify whole
program local types after symbol table merge.
---
From 2632d8e7ea8f96cb545e57dedd9e4148b5a2cae4 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 6 Sep 2021 15:03:31 +0800
Subject: [PATCH 1/2] WPD: Enable whole program devirtualization

Enable full devirtualization under whole program assumption (so also
called whole-program devirtualization, WPD for short). The base of the
optimization is how to identify class type that is local in terms of
whole-program scope. But "whole program" does not ensure that class
hierarchy of a type never span to dependent C++ libraries (one is
libstdc++), which would result in incorrect devirtualization. An
example is given below to demonstrate the problem.

// Has been pre-compiled to a library
class Base {
vi

Re: [PATCH] Fix loop split incorrect count and probability

2021-08-10 Thread Feng Xue OS via Gcc-patches
Any transformation involving CFG alteration faces the same problem; it is
not easy to update the new CFG with reasonable and seemingly-correct
profile counts. We can adjust probabilities for the impacted condition bbs,
but we lack a utility like the static profile estimation pass, and only
propagate counts partially.

Thanks,
Feng


From: Richard Biener 
Sent: Tuesday, August 10, 2021 10:47 PM
To: Xionghu Luo
Cc: gcc-patches@gcc.gnu.org; seg...@kernel.crashing.org; Feng Xue OS; 
wschm...@linux.ibm.com; guoji...@linux.ibm.com; li...@gcc.gnu.org; 
hubi...@ucw.cz
Subject: Re: [PATCH] Fix loop split incorrect count and probability

On Mon, 9 Aug 2021, Xionghu Luo wrote:

> Thanks,
>
> On 2021/8/6 19:46, Richard Biener wrote:
> > On Tue, 3 Aug 2021, Xionghu Luo wrote:
> >
> >> loop split condition is moved between loop1 and loop2, the split bb's
> >> count and probability should also be duplicated instead of (100% vs INV),
> >> secondly, the original loop1 and loop2 count need be propotional from the
> >> original loop.
> >>
> >>
> >> diff base/loop-cond-split-1.c.151t.lsplit  
> >> patched/loop-cond-split-1.c.151t.lsplit:
> >> ...
> >> int prephitmp_16;
> >> int prephitmp_25;
> >>
> >>  [local count: 118111600]:
> >> if (n_7(D) > 0)
> >>   goto ; [89.00%]
> >> else
> >>   goto ; [11.00%]
> >>
> >>  [local count: 118111600]:
> >> return;
> >>
> >>  [local count: 105119324]:
> >> pretmp_3 = ga;
> >>
> >> -   [local count: 955630225]:
> >> +   [local count: 315357973]:
> >> # i_13 = PHI 
> >> # prephitmp_12 = PHI 
> >> if (prephitmp_12 != 0)
> >>   goto ; [33.00%]
> >> else
> >>   goto ; [67.00%]
> >>
> >> -   [local count: 315357972]:
> >> +   [local count: 104068130]:
> >> _2 = do_something ();
> >> ga = _2;
> >>
> >> -   [local count: 955630225]:
> >> +   [local count: 315357973]:
> >> # prephitmp_5 = PHI 
> >> i_10 = inc (i_13);
> >> if (n_7(D) > i_10)
> >>   goto ; [89.00%]
> >> else
> >>   goto ; [11.00%]
> >>
> >>  [local count: 105119324]:
> >> goto ; [100.00%]
> >>
> >> -   [local count: 850510901]:
> >> +   [local count: 280668596]:
> >> if (prephitmp_12 != 0)
> >> -goto ; [100.00%]
> >> +goto ; [33.00%]
> >> else
> >> -goto ; [INV]
> >> +goto ; [67.00%]
> >>
> >> -   [local count: 850510901]:
> >> +   [local count: 280668596]:
> >> goto ; [100.00%]
> >>
> >> -   [count: 0]:
> >> +   [local count: 70429947]:
> >> # i_23 = PHI 
> >> # prephitmp_25 = PHI 
> >>
> >> -   [local count: 955630225]:
> >> +   [local count: 640272252]:
> >> # i_15 = PHI 
> >> # prephitmp_16 = PHI 
> >> i_22 = inc (i_15);
> >> if (n_7(D) > i_22)
> >>   goto ; [89.00%]
> >> else
> >>   goto ; [11.00%]
> >>
> >> -   [local count: 850510901]:
> >> +   [local count: 569842305]:
> >> goto ; [100.00%]
> >>
> >>   }
> >>
> >> gcc/ChangeLog:
> >>
> >>* tree-ssa-loop-split.c (split_loop): Fix incorrect probability.
> >>(do_split_loop_on_cond): Likewise.
> >> ---
> >>   gcc/tree-ssa-loop-split.c | 16 
> >>   1 file changed, 8 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> >> index 3a09bbc39e5..8e5a7ded0f7 100644
> >> --- a/gcc/tree-ssa-loop-split.c
> >> +++ b/gcc/tree-ssa-loop-split.c
> >> @@ -583,10 +583,10 @@ split_loop (class loop *loop1)
> >>basic_block cond_bb;
>
>   if (!initial_true)
> -   cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
> +   cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
> +
> + edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE
> +? EDGE_SUCC (bbs[i], 0)
> +: EDGE_SUCC (bbs[i], 1);
>
> >>
> >>class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> >> - profile_probability::always (),
> >> -   

Re: [PATCH] Fix loop split incorrect count and probability

2021-08-08 Thread Feng Xue OS via Gcc-patches
Yes. The condition to switch between the two versioned loops is "true", so
the first two arguments should be 100% and 0%.

Unlike a normal loop split, we cannot deduce an exactly precise probability
for a condition-based loop split, since the CFG inside loop2 is changed (the
invar-branch is replaced by "true", as shown in the comment on
do_split_loop_on_cond). Anyway, your way of scaling the two loops'
probabilities according to that of the invar-branch seems a better heuristic
than the original, and gives us a more reasonable execution count, at least
for the loop header bb (see the worked numbers below).
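
To put numbers on it using the quoted dump below: with the invar-branch at
33%/67%, the original header count 955630225 is scaled to 315357973 (~33%)
for loop1 and 640272252 (~67%) for loop2, which is exactly what the patched
dump shows.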

Thanks,
Feng


From: Gcc-patches  
on behalf of Richard Biener via Gcc-patches 
Sent: Friday, August 6, 2021 7:46 PM
To: Xionghu Luo
Cc: seg...@kernel.crashing.org; wschm...@linux.ibm.com; li...@gcc.gnu.org; 
gcc-patches@gcc.gnu.org; hubi...@ucw.cz; dje@gmail.com
Subject: Re: [PATCH] Fix loop split incorrect count and probability

On Tue, 3 Aug 2021, Xionghu Luo wrote:

> loop split condition is moved between loop1 and loop2, the split bb's
> count and probability should also be duplicated instead of (100% vs INV),
> secondly, the original loop1 and loop2 count need be proportional from the
> original loop.
>
> Regression tested pass, OK for master?
>
> diff base/loop-cond-split-1.c.151t.lsplit  
> patched/loop-cond-split-1.c.151t.lsplit:
> ...
>int prephitmp_16;
>int prephitmp_25;
>
> [local count: 118111600]:
>if (n_7(D) > 0)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> [local count: 118111600]:
>return;
>
> [local count: 105119324]:
>pretmp_3 = ga;
>
> -   [local count: 955630225]:
> +   [local count: 315357973]:
># i_13 = PHI 
># prephitmp_12 = PHI 
>if (prephitmp_12 != 0)
>  goto ; [33.00%]
>else
>  goto ; [67.00%]
>
> -   [local count: 315357972]:
> +   [local count: 104068130]:
>_2 = do_something ();
>ga = _2;
>
> -   [local count: 955630225]:
> +   [local count: 315357973]:
># prephitmp_5 = PHI 
>i_10 = inc (i_13);
>if (n_7(D) > i_10)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> [local count: 105119324]:
>goto ; [100.00%]
>
> -   [local count: 850510901]:
> +   [local count: 280668596]:
>if (prephitmp_12 != 0)
> -goto ; [100.00%]
> +goto ; [33.00%]
>else
> -goto ; [INV]
> +goto ; [67.00%]
>
> -   [local count: 850510901]:
> +   [local count: 280668596]:
>goto ; [100.00%]
>
> -   [count: 0]:
> +   [local count: 70429947]:
># i_23 = PHI 
># prephitmp_25 = PHI 
>
> -   [local count: 955630225]:
> +   [local count: 640272252]:
># i_15 = PHI 
># prephitmp_16 = PHI 
>i_22 = inc (i_15);
>if (n_7(D) > i_22)
>  goto ; [89.00%]
>else
>  goto ; [11.00%]
>
> -   [local count: 850510901]:
> +   [local count: 569842305]:
>goto ; [100.00%]
>
>  }
>
> gcc/ChangeLog:
>
>   * tree-ssa-loop-split.c (split_loop): Fix incorrect probability.
>   (do_split_loop_on_cond): Likewise.
> ---
>  gcc/tree-ssa-loop-split.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c
> index 3a09bbc39e5..8e5a7ded0f7 100644
> --- a/gcc/tree-ssa-loop-split.c
> +++ b/gcc/tree-ssa-loop-split.c
> @@ -583,10 +583,10 @@ split_loop (class loop *loop1)
>   basic_block cond_bb;
>
>   class loop *loop2 = loop_version (loop1, cond, &cond_bb,
> -profile_probability::always (),
> -profile_probability::always (),
> -profile_probability::always (),
> -profile_probability::always (),
> +true_edge->probability,
> +true_edge->probability.invert (),
> +true_edge->probability,
> +true_edge->probability.invert (),
>  true);

there is no 'true_edge' variable at this point.

>   gcc_assert (loop2);
>
> @@ -1486,10 +1486,10 @@ do_split_loop_on_cond (struct loop *loop1, edge 
> invar_branch)
>initialize_original_copy_tables ();
>
>struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL,
> -  profile_probability::always (),
> -  profile_probability::never (),
> -  profile_probability::always (),
> -  profile_probability::always (),
> +  invar_branch->probability.invert (),
> +  invar_branch->probability,
> +  invar_branch->probability.invert (),
> +  invar_branch->probability,
>true);
>if (!loop2)
> 

Question about non-POD class type

2021-05-14 Thread Feng Xue OS via Gcc-patches
For an instance of a non-POD class, can I always assume that any operation
on it should be type-safe, and that any wrong or even tricky code violating
this is UB per the C++ spec? For example, here are some ways:

 union {
Type1  *p1;
Type2  *p2;
};

or 

union {
Type1 t1;
Type2 t2;
};

or

Type1 *p1 = ...;
void *p = p1;
Type2 *p2 = (Type2 *) p;
p2->xxx;

Feng

Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)

2021-05-06 Thread Feng Xue OS via Gcc-patches
>> gcc/
>> PR tree-optimization/98598
>> * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o.
>> * common.opt (-ftree-loop-mgo): New option.
> 
> Just a quick comment - -ftree-loop-mgo is user-facing and it isn't really a 
> good
> name.  -floop-mgo would be better but still I'd have no idea what this would 
> do.
> 
> I don't have a good suggestion here other than to expand it to
> -floop-gather-memory (?!).

OK, that is better than "mgo"; this abbreviation is only a term for development use.

> The option documentation isn't informative either.
> 
> From:
> 
>   outer-loop ()
> {
>   inner-loop (iter, iter_count)
> {
>   Type1 v1 = LOAD (iter);
>   Type2 v2 = LOAD (v1);
>   Type3 v3 = LOAD (v2);
>   ...
>   iter = NEXT (iter);
> }
> }
> 
> To:
> 
>   typedef struct cache_elem
> {
>   bool   init;
>   Type1  c_v1;
>   Type2  c_v2;
>   Type3  c_v3;
> } cache_elem;
> 
>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));
> 
>   outer-loop ()
> {
>   size_t cache_idx = 0;
> 
>   inner-loop (iter, iter_count)
> {
>   if (!cache_arr[cache_idx]->init)
> {
>   v1 = LOAD (iter);
>   v2 = LOAD (v1);
>   v3 = LOAD (v2);
> 
>   cache_arr[cache_idx]->init = true;
>   cache_arr[cache_idx]->c_v1 = v1;
>   cache_arr[cache_idx]->c_v2 = v2;
>   cache_arr[cache_idx]->c_v3 = v3;
> }
>   else
> {
>   v1 = cache_arr[cache_idx]->c_v1;
>   v2 = cache_arr[cache_idx]->c_v2;
>   v3 = cache_arr[cache_idx]->c_v3;
> }
>   ...
>   cache_idx++;
>   iter = NEXT (iter);
> }
> }
> 
>   free (cache_arr);
> 
> This is a _very_ special transform.  What it seems to do is
> optimize the dependent loads for outer loop iteration n > 1
> by caching the result(s).  If that's possible then you should
> be able to distribute the outer loop to one doing the caching
> and one using the cache.  Then this transform would be more
> like a tradidional array expansion of scalars?  In some cases
> also loop interchange could remove the need for the caching.
> 
> Doing MGO as the very first loop pass thus looks bad, I think
> MGO should be much later, for example after interchange.
> I also think that MGO should work in concert with loop
> distribution (which could have an improved cost model)
> rather than being a separate pass.
> 
> Your analysis phase looks quite expensive, building sth
> like a on-the side representation very closely matching SSA.
> It seems to work from PHI defs to uses, which looks backwards.

I did not quite catch this point. Would you please explain it in more detail?

> You seem to roll your own dependence analysis code :/  Please
> have a look at loop distribution.
> 
> Also you build an actual structure type for reasons that escape
> me rather than simply accessing the allocated storage at
> appropriate offsets.
> 
> I think simply calling 'calloc' isn't OK because you might need
> aligned storage and because calloc might not be available.
> Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT
> is large enough for the data you want to place (or perform
> dynamic re-alignment yourself).  We probably want some generic
> middle-end utility to obtain aligned allocated storage at some
> point.
> 
> As said above I think you want to re-do this transform as
> a loop distribution transform.  I think if caching works then
> the loads should be distributable and the loop distribution
> transform should be enhanced to expand the scalars to arrays.

I checked the code of loop distribution, and its trigger strategy seems
very conservative: it currently targets only simple and regular
index-based loops, and cannot handle linked-list traversal, which
consists of a series of discrete memory accesses and is where MGO
matters a lot. Additionally, for some complicated cases, we could
not completely decompose MGO into two separate loops for
"do caching" and "use caching" respectively. An example:

for (i = 0; i < N; i++)
  {
for (j = 0; j < i; j++)
   {
   Type1 v1 = LOAD_FN1 (j);
   Type2 v2 = LOAD_FN2 (v1);
   Type3 v3 = LOAD_FN3 (v2);

   ...

   condition = ...
   }

if (condition)
  break;
  }

We should not cache all loads (N in total) in one step, since some of them
might be invalid after "condition" breaks the loop. We have to mix
"do caching" and "use caching", and let them be dynamically switched on the
"init" flag; a sketch is given below. That said, loop distribution does
overlap with MGO in analysis and transformation, and we will try to see
whether there is a way to unify them.
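
A compilable sketch of that mixed form for the example above, with stand-ins
for Type1/Type2/Type3 and LOAD_FN1/2/3 (illustrative only):

  typedef int Type1; typedef int Type2; typedef int Type3;
  extern Type1 LOAD_FN1 (int j);
  extern Type2 LOAD_FN2 (Type1 v);
  extern Type3 LOAD_FN3 (Type2 v);

  struct cache_elem { bool init; Type1 c_v1; Type2 c_v2; Type3 c_v3; };

  void inner_loop (cache_elem *cache_arr, int i)
  {
    for (int j = 0; j < i; j++)
      {
        cache_elem *e = &cache_arr[j];
        if (!e->init)                /* "do caching" arm, first visit only */
          {
            e->c_v1 = LOAD_FN1 (j);
            e->c_v2 = LOAD_FN2 (e->c_v1);
            e->c_v3 = LOAD_FN3 (e->c_v2);
            e->init = true;
          }
        /* "use caching" arm: elements cached before "condition" breaks the
           loop stay valid for the next outer iteration.  */
        Type1 v1 = e->c_v1;
        Type2 v2 = e->c_v2;
        Type3 v3 = e->c_v3;
        (void) v1; (void) v2; (void) v3;
      }
  }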

Thanks,
Feng

Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)

2021-04-29 Thread Feng Xue OS via Gcc-patches
>> This patch implements a new loop optimization according to the proposal
>> in RFC given at
>> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html.
>> So do not repeat the idea in this mail. Hope your comments on it.
> 
> With the caveat that I'm not an optimization expert (but no one else
> seems to have replied), here are some thoughts.
> 
> [...snip...]
> 
>> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop
>>  [PR98598]
> 
> BTW, did you mean to also post patches 2 and 3?
>

Not yet, but they are ready. Since this is a somewhat special optimization
that uses the heap as temporary storage, not a common technique in gcc, we
do not know the community's basic attitude towards it. So only the first
patch was sent out for initial comments, as it implements a generic MGO
framework and is complete and self-contained. The other two patches add
enhancements for a specific code pattern and a dynamic alias check. If the
proposal is accepted in principle, we will submit the other two for review.

> 
>> In nested loops, if scattered memory accesses inside inner loop remain
>> unchanged in outer loop, we can sequentialize these loads by caching
>> their values into a temporary memory region at the first time, and
>> reuse the caching data in following iterations. This way can improve
>> efficiency of cpu cache subsystem by reducing its unpredictable activies.
> 
> I don't think you've cited any performance numbers so far.  Does the
> optimization show a measurable gain on some benchmark(s)?  e.g. is this
> ready to run SPEC yet, and how does it do?

Yes, we have done that. A minor improvement of several percentage points can
be gained for some real applications. To be specific, we also get a major
improvement of more than 30% for a certain benchmark in SPEC2017.

> 
>> To illustrate what the optimization will do, two pieces of pseudo code,
>> before and after transformation, are given. Suppose all loads and
>> "iter_count" are invariant in outer loop.
>>
>> From:
>>
>>   outer-loop ()
>> {
>>   inner-loop (iter, iter_count)
>> {
>>   Type1 v1 = LOAD (iter);
>>   Type2 v2 = LOAD (v1);
>>   Type3 v3 = LOAD (v2);
>>   ...
>>   iter = NEXT (iter);
>> }
>> }
>>
>> To:
>>
>>   typedef struct cache_elem
>> {
>>   bool   init;
>>   Type1  c_v1;
>>   Type2  c_v2;
>>   Type3  c_v3;
> 
> Putting the "bool init;" at the front made me think "what about
> packing?" but presumably the idea is that every element is accessed in
> order, so it presumably benefits speed to have "init" at the top of the
> element, right?

Yes, the layout of the struct could be optimized for size by some means,
such as:
  o. packing "init" into a padding hole after a certain field
  o. if a certain field is a pointer type, letting that field take the role
     of "init" (non-NULL implies "initialized"), as in the sketch below
For now this simple scheme is straightforward; it would be enhanced in
various aspects later.
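
A hedged sketch of the second variant (hypothetical layout, assuming the
cached values are themselves pointers, as in the linked-list case):

  struct Type1; struct Type2; struct Type3;   // stand-in pointee types

  struct cache_elem
  {
    Type1 *c_v1;   // pointer-typed cached value doubles as the "init" flag:
                   // non-NULL means this element has been initialized
    Type2 *c_v2;
    Type3 *c_v3;
  };

  inline bool
  elem_initialized (const cache_elem *e)
  {
    return e->c_v1 != nullptr;
  }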

>> } cache_elem;
>>
>>   cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem));

> What if the allocation fails at runtime?  Do we keep an unoptimized
> copy of the nested loops around as a fallback and have an unlikely
> branch to that copy?

Yes, we should, but in a different way: a flag is added into the original
nested loop to control a runtime switch between optimized and unoptimized
execution, as sketched below. This definitely incurs runtime cost, but it
avoids possible code size bloat. A better handling, left as a TODO, is to
apply the dynamic switch for large loops and loop cloning for small ones.
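
A compilable sketch of that dynamic switch (names and the int-typed load
chain are illustrative stand-ins):

  #include <cstdlib>

  extern int LOAD (int j);   // stand-in for the invariant load chain

  struct cache_elem { bool init; int c_v1; };

  void nested_loops (unsigned outer_n, unsigned iter_count)
  {
    cache_elem *cache_arr
      = (cache_elem *) std::calloc (iter_count, sizeof (cache_elem));
    bool mgo_on = (cache_arr != nullptr);  // stays false if calloc failed

    for (unsigned o = 0; o < outer_n; o++)        // outer loop
      for (unsigned i = 0; i < iter_count; i++)   // inner loop
        {
          int v1;
          if (mgo_on && cache_arr[i].init)
            v1 = cache_arr[i].c_v1;               // optimized: reuse cache
          else
            {
              v1 = LOAD ((int) i);                // unoptimized: original load
              if (mgo_on)
                {
                  cache_arr[i].c_v1 = v1;         // fill for later iterations
                  cache_arr[i].init = true;
                }
            }
          (void) v1;
        }
    std::free (cache_arr);
  }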

> I notice that you're using calloc, presumably to clear all of the
> "init" flags (and the whole buffer).
> 
> FWIW, this feels like a case where it would be nice to have a thread-
> local heap allocation, perhaps something like an obstack implemented in
> the standard library - but that's obviously scope creep for this.

Yes, that's good, specially for many-thread application.

> Could it make sense to use alloca for small allocations?  (or is that
> scope creep?)

We did consider using alloca as you said. But if we cannot determine an
upper limit for a non-constant size, we have to place the alloca inside a
loop that encloses the nested loop. Without a corresponding free operation,
this kind of alloca-in-loop might cause stack overflow, as the sketch below
illustrates. So it becomes another TODO.
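
A tiny sketch of the hazard (illustrative):

  void walk (unsigned outer_n, unsigned long inner_bytes)
  {
    for (unsigned i = 0; i < outer_n; i++)
      {
        // Each iteration allocates a fresh block; nothing is released until
        // the function returns, so with a non-constant inner_bytes the stack
        // grows on every outer iteration and may eventually overflow.
        void *cache = __builtin_alloca (inner_bytes);
        (void) cache;
      }
  }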

>>   outer-loop ()
>> {
>>   size_t cache_idx = 0;
>>
>>   inner-loop (iter, iter_count)
>> {
>>   if (!cache_arr[cache_idx]->init)
>> {
>>   v1 = LOAD (iter);
>>   v2 = LOAD (v1);
>>   v3 = LOAD (v2);
>>
>>   cache_arr[cache_idx]->init = true;
>>   cache_arr[cache_idx]->c_v1 = v1;
>>   cache_arr[cache_idx]->c_v2 = v2;
>>   cache_arr[cache_idx]->c_v3 = v3;
>> }
>> else
>> {
>>   v1 = cache_arr[cache_idx]->c_v1;
>>   v2

[PATCH] Fix testcases to avoid plusminus-with-convert pattern (PR 97066)

2020-09-16 Thread Feng Xue OS via Gcc-patches
With the new pattern rule (T)(A) +- (T)(B) -> (T)(A +- B),
some testcases are simplified and no longer keep the code
pattern expected by their test checks. Minor changes are made to
those cases to avoid the simplification effect of the rule.

Tested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-16  Feng Xue  

gcc/testsuite/
PR testsuite/97066
* gcc.dg/ifcvt-3.c: Modified to suppress simplification.
* gcc.dg/tree-ssa/20030807-10.c: Likewise.From ac768c385f1332e276260c6de83b12929180fbfb Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Wed, 16 Sep 2020 16:21:14 +0800
Subject: [PATCH] testsuite/97066 - minor change to bypass
 plusminus-with-convert rule

The following testcases are simplified by the new rule
(T)(A) +- (T)(B) -> (T)(A +- B), so they no longer keep the code pattern
expected by their test checks. Adjust the test code to suppress the
simplification.

2020-09-16  Feng Xue  

gcc/testsuite/
	PR testsuite/97066
	* gcc.dg/ifcvt-3.c: Modified to suppress simplification.
	* gcc.dg/tree-ssa/20030807-10.c: Likewise.
---
 gcc/testsuite/gcc.dg/ifcvt-3.c  | 2 +-
 gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/ifcvt-3.c b/gcc/testsuite/gcc.dg/ifcvt-3.c
index b250bc15e08..56fdd753a0a 100644
--- a/gcc/testsuite/gcc.dg/ifcvt-3.c
+++ b/gcc/testsuite/gcc.dg/ifcvt-3.c
@@ -11,7 +11,7 @@ foo (s64 a, s64 b, s64 c)
   if (d == 0)
 return a + c;
   else
-return b + d + c;
+return b + c + d;
 }
 
 /* This test can be reduced to just return a + c;  */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c b/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c
index 0903f3c4321..0e01e511b78 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c
@@ -7,7 +7,7 @@ unsigned int
 subreg_highpart_offset (outermode, innermode)
  int outermode, innermode;
 {
-  unsigned int offset = 0;
+  unsigned int offset = 1;
   int difference = (mode_size[innermode] - mode_size[outermode]);
   if (difference > 0)
 {
-- 
2.17.1



Re: [PATCH 2/2 V4] Add plusminus-with-convert pattern (PR 94234)

2020-09-15 Thread Feng Xue OS via Gcc-patches
>> Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B)
>> could be folded to a simple value. By this rule, a plusminus-mult-with-convert
>> expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B) * C.
>
>Please use INTEGRAL_TYPE_P () instead of TREE_CODE == INTEGER_TYPE
>in all three cases.  It's enough to check for INTEGRAL_TYPE_P on one operand,
>the types_match will take care of the other.

I would have considered using INTEGRAL_TYPE_P (), but if the inner type is
bool or an enum, can we do a plus/minus operation on it?

Feng

>
>OK with those changes.
>
>Thanks,
>Richard.
>
>
> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>
> Feng
> ---
> 2020-09-15  Feng Xue  
>
> gcc/
> PR tree-optimization/94234
> * match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification.
>
> gcc/testsuite/
> PR tree-optimization/94234
> * gcc.dg/pr94234-3.c: New test.


Re: Ping: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-15 Thread Feng Xue OS via Gcc-patches
>> This patch is to handle simplification of plusminus-mult-with-convert 
>> expression
>> as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of 
>> multiplication.
>> This is done in forwprop pass. We try to transform it to (T) (X +- Y), and 
>> resort
>> to gimple-matcher to fold (X +- Y) instead of manually code pattern 
>> recognition.
>
>I still don't like the complete new function with all its correctness
>issues - the existing
>fold_plusminus_mult_expr was difficult enough to get correct for
>corner cases and
>we do have a set of match.pd patterns (partly?) implementing its transforms.
>
>Looking at
>
>+unsigned goo (unsigned m_param, unsigned n_param)
>+{
>+  unsigned b1 = m_param * (n_param + 2);
>+  unsigned b2 = m_param * (n_param + 1);
>+  int r = (int)(b1) - (int)(b2);
>
>it seems we want to simplify (signed)A - (signed)B to
>(signed)(A - B) if A - B "simplifies"?  I guess
>
>(simplify
>  (plusminus (nop_convert @0) (nop_convert? @1))
>  (convert (plusminus! @0 @1)))
>
>probably needs a swapped pattern or not iterate over plus/minus
>to handle at least one converted operand and avoid adding
>a (plus @0 @1) -> (convert (plus! @0 @1)) rule.
>
>Even
>
>(simplify
> (minus (nop_convert @0) (nop_convert @1))
> (convert (minus! @0 @1)))
>
>seems to handle all your testcases already (which means
>they are all the same and not very exhaustive...)
Yes. This is much simpler.

Thanks,
Feng

>Richard.
>
>
>> Regards,
>> Feng
>> ---
>> 2020-09-03  Feng Xue  
>>
>> gcc/
>> PR tree-optimization/94234
>> * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
>> function.
>> (fwprop_ssa_val): Move it before its new caller.
>> (pass_forwprop::execute): Add call to
>> simplify_plusminus_mult_with_convert.
>>
>> gcc/testsuite/
>> PR tree-optimization/94234
>> * gcc.dg/pr94234-3.c: New test.
>

[PATCH 2/2 V4] Add plusminus-with-convert pattern (PR 94234)

2020-09-15 Thread Feng Xue OS via Gcc-patches
Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B)
could be folded to a simple value. By this rule, a plusminus-mult-with-convert
expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B) * C.
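
Taking the goo testcase below as a worked example:
(int)(m_param * (n_param + 2)) - (int)(m_param * (n_param + 1)) first becomes
(int)(m_param * (n_param + 2) - m_param * (n_param + 1)) under the new rule,
and the existing rule then folds it to
(int)(m_param * ((n_param + 2) - (n_param + 1))), i.e. (int) m_param, which
is what the "return m_param" scan checks.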

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-15  Feng Xue  

gcc/
PR tree-optimization/94234
* match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.From f7c7483bd61fe1e3d6888f84d718fb4be4ea9e14 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] tree-optimization/94234 - add plusminus-with-convert pattern

Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B)
could be folded to a simple value. By this rule, a plusminus-mult-with-convert
expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B) * C.

2020-09-15  Feng Xue  

gcc/
	PR tree-optimization/94234
	* match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification.

gcc/testsuite/
	PR tree-optimization/94234
 	* gcc.dg/pr94234-3.c: New test.
---
 gcc/match.pd | 16 
 gcc/testsuite/gcc.dg/pr94234-3.c | 42 
 2 files changed, 58 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 46fd880bd37..d8c59fad9c1 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2397,6 +2397,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
(plus (convert @0) (op @2 (convert @1))
 #endif
 
+/* (T)(A) +- (T)(B) -> (T)(A +- B) only when (A +- B) could be simplified
+   to a simple value.  */
+#if GIMPLE
+  (for op (plus minus)
+   (simplify
+(op (convert @0) (convert @1))
+ (if (TREE_CODE (type) == INTEGER_TYPE
+	  && TREE_CODE (TREE_TYPE (@0)) == INTEGER_TYPE
+	  && TREE_CODE (TREE_TYPE (@1)) == INTEGER_TYPE
+	  && TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@0))
+	  && types_match (TREE_TYPE (@0), TREE_TYPE (@1))
+	  && !TYPE_OVERFLOW_TRAPS (type)
+	  && !TYPE_OVERFLOW_SANITIZED (type))
+  (convert (op! @0 @1)
+#endif
+
   /* ~A + A -> -1 */
   (simplify
(plus:c (bit_not @0) @0)
diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c
new file mode 100644
index 000..9bb9b46bd96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-3.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo1 (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+int use_ptr (char *a, char *b);
+
+ptrdiff_t foo2 (char *a, size_t n)
+{
+  char *b1 = a + 8 * (n - 1);
+  char *b2 = a + 8 * n;
+
+  use_ptr (b1, b2);
+
+  return b1 - b2;
+}
+
+int use_int (int i);
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+  int r = (int)(b1) - (int)(b2);
+
+  use_int (r);
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
-- 
2.17.1



Re: Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

2020-09-14 Thread Feng Xue OS via Gcc-patches
>@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool
>gimple, operand *result)
>  /* Re-fold the toplevel result.  It's basically an embedded
> gimple_build w/o actually building the stmt.  */
>  if (!is_predicate)
>-   fprintf_indent (f, indent,
>-   "res_op->resimplify (lseq, valueize);\n");
>+   {
>+ fprintf_indent (f, indent,
>+ "res_op->resimplify (lseq, valueize);\n");
>+ if (e->force_leaf)
>+   {
>+ fprintf_indent (f, indent,
>+ "if (!maybe_push_res_to_seq (res_op, NULL))\n");
>+ fprintf_indent (f, indent + 2, "return false;\n");
>
>please use "goto %s;\n", fail_label)  here.  OK with that change.
Ok.

>
>I've tried again to think about sth prettier to cover these kind of
>single-use checks but failed to come up with sth.
Maybe we need a smart combiner that can deduce cost globally and remove
these single-use specifiers from the rule description.

Feng


From: Richard Biener 
Sent: Monday, September 14, 2020 9:39 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands 
(PR 94234)

On Mon, Sep 14, 2020 at 5:17 AM Feng Xue OS via Gcc-patches
 wrote:
>
> Thanks,

@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool
gimple, operand *result)
  /* Re-fold the toplevel result.  It's basically an embedded
 gimple_build w/o actually building the stmt.  */
  if (!is_predicate)
-   fprintf_indent (f, indent,
-   "res_op->resimplify (lseq, valueize);\n");
+   {
+ fprintf_indent (f, indent,
+ "res_op->resimplify (lseq, valueize);\n");
+ if (e->force_leaf)
+   {
+ fprintf_indent (f, indent,
+ "if (!maybe_push_res_to_seq (res_op, NULL))\n");
+ fprintf_indent (f, indent + 2, "return false;\n");

please use "goto %s;\n", fail_label)  here.  OK with that change.

I've tried again to think about sth prettier to cover these kind of
single-use checks but failed to come up with sth.

Thanks and sorry for the delay,
Richard.

> Feng
>
> 
> From: Feng Xue OS
> Sent: Thursday, September 3, 2020 2:06 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 
> 94234)
>
> For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled
> when A and B are not single-use. This patch is a minor enhancement
> on the pattern, which allows folding if final result is found to be a
> simple gimple value (constant/existing SSA).
>
> Bootstrapped/regtested on x86_64-linux and aarch64-linux.
>
> Feng
> ---
> 2020-09-03  Feng Xue  
>
> gcc/
> PR tree-optimization/94234
> * genmatch.c (dt_simplify::gen_1): Emit check on final simplification
> result when "!" is specified on toplevel output expr.
> * match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
> expr with multi-use operands if final result is a simple gimple value.
>
> gcc/testsuite/
> PR tree-optimization/94234
> * gcc.dg/pr94234-2.c: New test.
> ---


Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

2020-09-13 Thread Feng Xue OS via Gcc-patches
Thanks,
Feng


From: Feng Xue OS
Sent: Thursday, September 3, 2020 2:06 PM
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled
when A and B are not single-use. This patch is a minor enhancement
on the pattern, which allows folding if final result is found to be a
simple gimple value (constant/existing SSA).

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
result when "!" is specified on toplevel output expr.
* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-2.c: New test.
---
From e247eb0d9a43856cc0b46f98414ed58d13796d62 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Tue, 1 Sep 2020 17:17:58 +0800
Subject: [PATCH] tree-optimization/94234 - Fold plusminus_mult expr with
 multi-use operands

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
	result when "!" is specified on toplevel output expr.
	* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
	expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
	PR tree-optimization/94234
	* gcc.dg/pr94234-2.c: New test.
---
 gcc/genmatch.c   | 12 --
 gcc/match.pd | 22 ++
 gcc/testsuite/gcc.dg/pr94234-2.c | 39 
 3 files changed, 62 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-2.c

diff --git a/gcc/genmatch.c b/gcc/genmatch.c
index 906d842c4d8..d4f01401964 100644
--- a/gcc/genmatch.c
+++ b/gcc/genmatch.c
@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result)
 	  /* Re-fold the toplevel result.  It's basically an embedded
 	 gimple_build w/o actually building the stmt.  */
 	  if (!is_predicate)
-	fprintf_indent (f, indent,
-			"res_op->resimplify (lseq, valueize);\n");
+	{
+	  fprintf_indent (f, indent,
+			  "res_op->resimplify (lseq, valueize);\n");
+	  if (e->force_leaf)
+		{
+		  fprintf_indent (f, indent,
+		  "if (!maybe_push_res_to_seq (res_op, NULL))\n");
+		  fprintf_indent (f, indent + 2, "return false;\n");
+		}
+	}
 	}
   else if (result->type == operand::OP_CAPTURE
 	   || result->type == operand::OP_C_EXPR)
diff --git a/gcc/match.pd b/gcc/match.pd
index 6e45836e32b..46fd880bd37 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2570,15 +2570,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (for plusminus (plus minus)
   (simplify
(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
-   (if ((!ANY_INTEGRAL_TYPE_P (type)
-	 || TYPE_OVERFLOW_WRAPS (type)
-	 || (INTEGRAL_TYPE_P (type)
-	 && tree_expr_nonzero_p (@0)
-	 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
-	/* If @1 +- @2 is constant require a hard single-use on either
-	   original operand (but not on both).  */
-	&& (single_use (@3) || single_use (@4)))
-(mult (plusminus @1 @2) @0)))
+   (if (!ANY_INTEGRAL_TYPE_P (type)
+	|| TYPE_OVERFLOW_WRAPS (type)
+	|| (INTEGRAL_TYPE_P (type)
+	&& tree_expr_nonzero_p (@0)
+	&& expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
+(if (single_use (@3) || single_use (@4))
+ /* If @1 +- @2 is constant require a hard single-use on either
+	original operand (but not on both).  */
+ (mult (plusminus @1 @2) @0)
+#if GIMPLE
+ (mult! (plusminus @1 @2) @0)
+#endif
+  )))
   /* We cannot generate constant 1 for fract.  */
   (if (!ALL_FRACT_MODE_P (TYPE_MODE (type)))
(simplify
diff --git a/gcc/testsuite/gcc.dg/pr94234-2.c b/gcc/testsuite/gcc.dg/pr94234-2.c
new file mode 100644
index 000..1f4b194dd43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-2.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */ 
+
+int use_fn (int a);
+
+int foo (int n)
+{
+  int b1 = 8 * (n + 1);
+  int b2 = 8 * n;
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned hoo (unsigned k_param)
+{
+  unsigned b1 = k_param * 28;
+  unsigned b2 = k_param * 15;
+  unsigned b3 = k_param * 12;
+
+  use_fn (b1 ^ b2 ^ b3);
+
+  return (b1 - b2) - b3;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "for

Ping: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-13 Thread Feng Xue OS via Gcc-patches
Thanks,
Feng


From: Feng Xue OS 
Sent: Thursday, September 3, 2020 5:29 PM
To: Richard Biener; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in 
forwprop (PR 94234)

Attach patch file.

Feng

From: Gcc-patches  on behalf of Feng Xue OS 
via Gcc-patches 
Sent: Thursday, September 3, 2020 5:27 PM
To: Richard Biener; gcc-patches@gcc.gnu.org
Subject: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop 
(PR 94234)

This patch handles simplification of a plusminus-mult-with-convert expression
of the form ((T) X) +- ((T) Y), in which at least one of (X, Y) is the result
of a multiplication. This is done in the forwprop pass. We try to transform
it to (T) (X +- Y), and resort to the gimple matcher to fold (X +- Y) instead
of manual code pattern recognition.

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.
From 98c4b97989207dcef5742e9cb451799feafd125e Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] tree-optimization/94234 - simplify
 plusminus-mult-with-convert in forwprop

For an expression ((T) X) +- ((T) Y), where at least one of (X, Y) is the
result of a multiplication, try to transform it to (T) (X +- Y), and apply
simplification on (X +- Y) if possible. In this way, we avoid creating an
almost duplicated rule to handle the plusminus-mult-with-convert variant.

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
	function.
	(fwprop_ssa_val): Move it before its new caller.
	(pass_forwprop::execute): Add call to
	simplify_plusminus_mult_with_convert.

gcc/testsuite/
	PR tree-optimization/94234
 	* gcc.dg/pr94234-3.c: New test.
---
 gcc/testsuite/gcc.dg/pr94234-3.c |  42 
 gcc/tree-ssa-forwprop.c  | 168 +++
 2 files changed, 191 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c

diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c
new file mode 100644
index 000..9bb9b46bd96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-3.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo1 (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+int use_ptr (char *a, char *b);
+
+ptrdiff_t foo2 (char *a, size_t n)
+{
+  char *b1 = a + 8 * (n - 1);
+  char *b2 = a + 8 * n;
+
+  use_ptr (b1, b2);
+
+  return b1 - b2;
+}
+
+int use_int (int i);
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+  int r = (int)(b1) - (int)(b2);
+
+  use_int (r);
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index e2d008dfb92..7b9d46ec919 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -338,6 +338,25 @@ remove_prop_source_from_use (tree name)
   return cfg_changed;
 }
 
+/* Primitive "lattice" function for gimple_simplify.  */
+
+static tree
+fwprop_ssa_val (tree name)
+{
+  /* First valueize NAME.  */
+  if (TREE_CODE (name) == SSA_NAME
+  && SSA_NAME_VERSION (name) < lattice.length ())
+{
+  tree val = lattice[SSA_NAME_VERSION (name)];
+  if (val)
+	name = val;
+}
+  /* We continue matching along SSA use-def edges for SSA names
+ that are not single-use.  Currently there are no patterns
+ that would cause any issues with that.  */
+  return name;
+}
+
 /* Return the rhs of a gassign *STMT in a form of a single tree,
converted to type TYPE.
 
@@ -1821,6 +1840,133 @@ simplify_rotate (gimple_stmt_iterator *gsi)
   return true;
 }
 
+/* Given ((T) X) +- ((T) Y), and at least one of (X, Y) is result of
+   multiplication, if the expr can be transformed to (T) (X +- Y) in terms of
+   two's complement computation, apply simplification on (X +- Y) if it is
+   possible.  As a prerequisite, outer result type (T) has precision not more
+   than that of inner operand type.  */
+
+static bool
+simplify_plusminus_mult_with_convert (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stm

Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-09-03 Thread Feng Xue OS via Gcc-patches
>> Hi,
>>
>> On Mon, Aug 31 2020, Feng Xue OS wrote:
>> > This patch is to fix a bug that cost that is used to evaluate clone 
>> > candidate
>> > becomes negative due to integer overflow.
>> >
>> > Feng
>> > ---
>> > 2020-08-31  Feng Xue  
>> >
>> > gcc/
>> > PR tree-optimization/96806
>>
>> the component is "ipa," please change that when you commit the patch.
>>
>> > * ipa-cp.c (decide_about_value): Use safe_add to avoid cost 
>> > addition
>> > overflow.
>>
>> assuming you have bootstrapped and tested it, it is OK for both trunk
>> and all affected release branches.
>
>I have already added caps on things that come from profile counts so
>things do not overflow, but I think in longer run we want to simply use
>sreals here..
>> >&& !good_cloning_opportunity_p (node,
>> > - val->local_time_benefit
>> > - + val->prop_time_benefit,
>> > + safe_add (val->local_time_benefit,
>> > +   val->prop_time_benefit),
>> >   freq_sum, count_sum,
>> > - val->local_size_cost
>> > - + val->prop_size_cost))
>> > + safe_add (val->local_size_cost,
>> > +   val->prop_size_cost)))
>
>Is it also size cost that may overflow? That seems a bit odd ;)
>

Yes. prop_size_cost accumulates all callees' size_cost, and since
there are two recursive calls, this value grows exponentially as a
power of two and easily exceeds the value range of an integer; see the
illustration below.
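
A minimal illustration, assuming each level of recursive cloning propagates
the callee cost through both self-recursive edges:

    cost(d) = size_cost + 2 * cost(d - 1)  =>  cost(d) = (2^(d+1) - 1) * size_cost

so even with a per-node size_cost of 1, a recursion depth of around 31 is
already enough to overflow a 32-bit int.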

It is actually a defect of the cost computation for recursive cloning.
But I think we need a comprehensive consideration of how to adjust the
cost model for recursive cloning, including profile estimation,
threshold, size_cost...

And a quick fix is to add a cap here to avoid the overflow.

Feng

>Honza
>> >  return false;
>> >
>> >if (dump_file)
>>
>> [...]
>

Re: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-03 Thread Feng Xue OS via Gcc-patches
Attach patch file.

Feng

From: Gcc-patches  on behalf of Feng Xue OS 
via Gcc-patches 
Sent: Thursday, September 3, 2020 5:27 PM
To: Richard Biener; gcc-patches@gcc.gnu.org
Subject: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop 
(PR 94234)

This patch handles simplification of a plusminus-mult-with-convert expression
of the form ((T) X) +- ((T) Y), in which at least one of (X, Y) is the result
of a multiplication. This is done in the forwprop pass. We try to transform
it to (T) (X +- Y), and resort to the gimple matcher to fold (X +- Y) instead
of manual code pattern recognition.

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.
From 98c4b97989207dcef5742e9cb451799feafd125e Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 17 Aug 2020 23:00:35 +0800
Subject: [PATCH] tree-optimization/94234 - simplify
 plusminus-mult-with-convert in forwprop

For an expression ((T) X) +- ((T) Y), where at least one of (X, Y) is the
result of a multiplication, try to transform it to (T) (X +- Y), and apply
simplification on (X +- Y) if possible. In this way, we avoid creating an
almost duplicated rule to handle the plusminus-mult-with-convert variant.

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
	function.
	(fwprop_ssa_val): Move it before its new caller.
	(pass_forwprop::execute): Add call to
	simplify_plusminus_mult_with_convert.

gcc/testsuite/
	PR tree-optimization/94234
 	* gcc.dg/pr94234-3.c: New test.
---
 gcc/testsuite/gcc.dg/pr94234-3.c |  42 
 gcc/tree-ssa-forwprop.c  | 168 +++
 2 files changed, 191 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c

diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c
new file mode 100644
index 000..9bb9b46bd96
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-3.c
@@ -0,0 +1,42 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo1 (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+int use_ptr (char *a, char *b);
+
+ptrdiff_t foo2 (char *a, size_t n)
+{
+  char *b1 = a + 8 * (n - 1);
+  char *b2 = a + 8 * n;
+
+  use_ptr (b1, b2);
+
+  return b1 - b2;
+}
+
+int use_int (int i);
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+  int r = (int)(b1) - (int)(b2);
+
+  use_int (r);
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
index e2d008dfb92..7b9d46ec919 100644
--- a/gcc/tree-ssa-forwprop.c
+++ b/gcc/tree-ssa-forwprop.c
@@ -338,6 +338,25 @@ remove_prop_source_from_use (tree name)
   return cfg_changed;
 }
 
+/* Primitive "lattice" function for gimple_simplify.  */
+
+static tree
+fwprop_ssa_val (tree name)
+{
+  /* First valueize NAME.  */
+  if (TREE_CODE (name) == SSA_NAME
+  && SSA_NAME_VERSION (name) < lattice.length ())
+{
+  tree val = lattice[SSA_NAME_VERSION (name)];
+  if (val)
+	name = val;
+}
+  /* We continue matching along SSA use-def edges for SSA names
+ that are not single-use.  Currently there are no patterns
+ that would cause any issues with that.  */
+  return name;
+}
+
 /* Return the rhs of a gassign *STMT in a form of a single tree,
converted to type TYPE.
 
@@ -1821,6 +1840,133 @@ simplify_rotate (gimple_stmt_iterator *gsi)
   return true;
 }
 
+/* Given ((T) X) +- ((T) Y), where at least one of (X, Y) is the result of
+   a multiplication, transform the expr to (T) (X +- Y) when that is valid
+   under two's complement computation, and apply simplification on (X +- Y)
+   if possible.  As a prerequisite, the outer result type (T) must have
+   precision not more than that of the inner operand type.  */
+
+static bool
+simplify_plusminus_mult_with_convert (gimple_stmt_iterator *gsi)
+{
+  gimple *stmt = gsi_stmt (*gsi);
+  tree lhs = gimple_assign_lhs (stmt);
+  tree rtype = TREE_TYPE (lhs);
+  tree ctype = NULL_TREE;
+  enum tree_code code = gimple_assign_rhs_code (stmt);
+
+  if (code != PLUS_EXPR && code != MINUS_EXPR)
+return false;
+
+  /*

[PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)

2020-09-03 Thread Feng Xue OS via Gcc-patches
This patch handles simplification of a plusminus-mult-with-convert expression
of the form ((T) X) +- ((T) Y), in which at least one of (X, Y) is the result
of a multiplication. This is done in the forwprop pass. We try to transform it
to (T) (X +- Y), and resort to the gimple-matcher to fold (X +- Y) instead of
manual pattern recognition.
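
For instance (a sketch mirroring the goo case in the new test pr94234-3.c;
not itself part of the patch):

  int use_int (int i);

  int goo (unsigned m, unsigned n)
  {
    unsigned b1 = m * (n + 2);
    unsigned b2 = m * (n + 1);
    int r = (int) b1 - (int) b2;   /* ((T) X) - ((T) Y) */

    use_int (r);
    /* forwprop rewrites the subtraction as (int) (b1 - b2), and the
       gimple-matcher then folds m*(n+2) - m*(n+1) down to m.  */
    return r;
  }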

Regards,
Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New
function.
(fwprop_ssa_val): Move it before its new caller.
(pass_forwprop::execute): Add call to
simplify_plusminus_mult_with_convert.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-3.c: New test.

[PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)

2020-09-02 Thread Feng Xue OS via Gcc-patches
For the pattern A * C +- B * C -> (A +- B) * C, simplification is disabled
when A and B are not single-use. This patch is a minor enhancement to the
pattern, which allows folding if the final result is found to be a simple
gimple value (a constant or an existing SSA name).
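
As an illustration (a sketch of the foo case from the new test pr94234-2.c,
included further below):

  int use_fn (int a);

  int foo (int n)
  {
    int b1 = 8 * (n + 1);
    int b2 = 8 * n;

    use_fn (b1 ^ b2);   /* b1 and b2 are multi-use */

    /* 8*(n+1) - 8*n folds to the constant 8, so the transform is now
       allowed even though neither operand is single-use.  */
    return b1 - b2;
  }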

Bootstrapped/regtested on x86_64-linux and aarch64-linux.

Feng
---
2020-09-03  Feng Xue  

gcc/
PR tree-optimization/94234
* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
result when "!" is specified on toplevel output expr.
* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234-2.c: New test.
---
From e247eb0d9a43856cc0b46f98414ed58d13796d62 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Tue, 1 Sep 2020 17:17:58 +0800
Subject: [PATCH] tree-optimization/94234 - Fold plusminus_mult expr with
 multi-use operands

2020-09-03  Feng Xue  

gcc/
	PR tree-optimization/94234
	* genmatch.c (dt_simplify::gen_1): Emit check on final simplification
	result when "!" is specified on toplevel output expr.
	* match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for
	expr with multi-use operands if final result is a simple gimple value.

gcc/testsuite/
	PR tree-optimization/94234
	* gcc.dg/pr94234-2.c: New test.
---
 gcc/genmatch.c   | 12 --
 gcc/match.pd | 22 ++
 gcc/testsuite/gcc.dg/pr94234-2.c | 39 
 3 files changed, 62 insertions(+), 11 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234-2.c

diff --git a/gcc/genmatch.c b/gcc/genmatch.c
index 906d842c4d8..d4f01401964 100644
--- a/gcc/genmatch.c
+++ b/gcc/genmatch.c
@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result)
 	  /* Re-fold the toplevel result.  It's basically an embedded
 	 gimple_build w/o actually building the stmt.  */
 	  if (!is_predicate)
-	fprintf_indent (f, indent,
-			"res_op->resimplify (lseq, valueize);\n");
+	{
+	  fprintf_indent (f, indent,
+			  "res_op->resimplify (lseq, valueize);\n");
+	  if (e->force_leaf)
+		{
+		  fprintf_indent (f, indent,
+		  "if (!maybe_push_res_to_seq (res_op, NULL))\n");
+		  fprintf_indent (f, indent + 2, "return false;\n");
+		}
+	}
 	}
   else if (result->type == operand::OP_CAPTURE
 	   || result->type == operand::OP_C_EXPR)
diff --git a/gcc/match.pd b/gcc/match.pd
index 6e45836e32b..46fd880bd37 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2570,15 +2570,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (for plusminus (plus minus)
   (simplify
(plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2))
-   (if ((!ANY_INTEGRAL_TYPE_P (type)
-	 || TYPE_OVERFLOW_WRAPS (type)
-	 || (INTEGRAL_TYPE_P (type)
-	 && tree_expr_nonzero_p (@0)
-	 && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
-	/* If @1 +- @2 is constant require a hard single-use on either
-	   original operand (but not on both).  */
-	&& (single_use (@3) || single_use (@4)))
-(mult (plusminus @1 @2) @0)))
+   (if (!ANY_INTEGRAL_TYPE_P (type)
+	|| TYPE_OVERFLOW_WRAPS (type)
+	|| (INTEGRAL_TYPE_P (type)
+	&& tree_expr_nonzero_p (@0)
+	&& expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type)
+(if (single_use (@3) || single_use (@4))
+ /* If @1 +- @2 is constant require a hard single-use on either
+	original operand (but not on both).  */
+ (mult (plusminus @1 @2) @0)
+#if GIMPLE
+ (mult! (plusminus @1 @2) @0)
+#endif
+  )))
   /* We cannot generate constant 1 for fract.  */
   (if (!ALL_FRACT_MODE_P (TYPE_MODE (type)))
(simplify
diff --git a/gcc/testsuite/gcc.dg/pr94234-2.c b/gcc/testsuite/gcc.dg/pr94234-2.c
new file mode 100644
index 000..1f4b194dd43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234-2.c
@@ -0,0 +1,39 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */ 
+
+int use_fn (int a);
+
+int foo (int n)
+{
+  int b1 = 8 * (n + 1);
+  int b2 = 8 * n;
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned goo (unsigned m_param, unsigned n_param)
+{
+  unsigned b1 = m_param * (n_param + 2);
+  unsigned b2 = m_param * (n_param + 1);
+
+  use_fn (b1 ^ b2);
+
+  return b1 - b2;
+}
+
+unsigned hoo (unsigned k_param)
+{
+  unsigned b1 = k_param * 28;
+  unsigned b2 = k_param * 15;
+  unsigned b3 = k_param * 12;
+
+  use_fn (b1 ^ b2 ^ b3);
+
+  return (b1 - b2) - b3;
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-not "return k_param" "forwprop1" } } */
-- 
2.17.1



Re: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-09-01 Thread Feng Xue OS via Gcc-patches
>> >> gcc/
>> >> PR tree-optimization/94234
>> >> * tree-ssa-forwprop.c (simplify_binary_with_convert): New 
>> >> function.
>> >> * (fwprop_ssa_val): Move it before its new caller.
>>
>> > No * at this line.  There's an entry for (pass_forwprop::execute) missing.
>> OK.
>>
>> > I don't think the transform as implemented, ((T) X) OP ((T) Y) to
>> > (T) (X OP Y) is useful to do in tree-ssa-forwprop.c.  Instead what I
>> > suggested was to do the original
>> >
>> > +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
>> > +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
>> >
>> > but realize we already do this for GENERIC in fold_plusminus_mult_expr, 
>> > just
>> > without the conversions (also look at the conditions in the callers).  This
>> > function takes great care for handling overflow correctly and thus I 
>> > suggested
>> > to take that over to GIMPLE in tree-ssa-forwprop.c and try extend it to 
>> > cover
>> > the conversions you need for the specific cases.
>> But this way would introduce duplicate handling. Wouldn't it be more concise
>> to reuse the existing rule?

> Sure moving the GENERIC folding to match.pd so it covers both GENERIC
> and GIMPLE would be nice to avoid duplication.

>> And differently from GENERIC, we might need to check whether an operand is
>> single-use or not, and take distinct actions accordingly.
>>
>>    (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C)
>>
>> Suppose both A and B are multiply-used; in most situations the transform is
>> unprofitable and avoided. But if (A +- B) could be folded to a constant, we
>> can still allow the transform. For this, we have to recursively fold
>> (A +- B), either handling it manually or resorting to the gimple-matcher to
>> tell the result. The latter is the natural choice. If so, why not do it at
>> the top.
>
> I don't understand.  From the comments in your patch you are just
> hoisting conversions in the transform.  I don't really see the connection
> to the originally desired transform here?

Consider a code sequence like:

  t1 = (T)(A * C)
  t2 = (T)(B * C)

  ... = use (t1)
  ... = use (t2)

  t3 = t1 - t2

Since t1 and t2 are not single-use, we do not expect the transform on t3 to
happen, as it would incur more (add/mul) operations in most situations. But if
(A - B) * C can be folded to a constant or an existing SSA name, the transform
is OK. That is to say, we need to try to fold (A - B) and (A - B) * C to see
the final result. To do this, it is natural to use the gimple-matcher instead
of manual pattern matching as in fold_plusminus_mult_expr, which cannot cover
all the cases that the gimple rules do.

Some examples:
 A = n + 2,  B = n + 1,  C=m
 A = n - m,  B = n,  C = -1
 A = 3 * n,  B = 2 * n,  C = 1

And this way can be easily generalized to handle ((T) X) OP ((T) Y).
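
To make the third example concrete (a hypothetical snippet, with n unsigned
and a result type of no greater precision):

  int diff (unsigned n)
  {
    int b1 = (int) (3 * n);
    int b2 = (int) (2 * n);
    /* Folding (3*n - 2*n) inside the conversion yields (int) n.  */
    return b1 - b2;
  }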

>> > Alternatively one could move the GENERIC bits to match.pd, leaving a
>> > worker in fold-const.c.  Then try to extend that there.
>> This worker function is meant to be used by both GENERIC and GIMPLE?

> Yes, for both.

Thanks,
Feng

Re: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-09-01 Thread Feng Xue OS via Gcc-patches

>> gcc/
>> PR tree-optimization/94234
>> * tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
>> * (fwprop_ssa_val): Move it before its new caller.

> No * at this line.  There's an entry for (pass_forwprop::execute) missing.
OK.

> I don't think the transform as implemented, ((T) X) OP ((T) Y) to
> (T) (X OP Y) is useful to do in tree-ssa-forwprop.c.  Instead what I
> suggested was to do the original
> 
> +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
> +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
> 
> but realize we already do this for GENERIC in fold_plusminus_mult_expr, just
> without the conversions (also look at the conditions in the callers).  This
> function takes great care for handling overflow correctly and thus I suggested
> to take that over to GIMPLE in tree-ssa-forwprop.c and try extend it to cover
> the conversions you need for the specific cases.
But this way would introduce duplicate handling. Wouldn't it be more concise
to reuse the existing rule?

And differently from GENERIC, we might need to check whether an operand is
single-use or not, and take distinct actions accordingly.

   (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C)

Suppose both A and B are multiply-used; in most situations the transform is
unprofitable and avoided. But if (A +- B) could be folded to a constant, we
can still allow the transform. For this, we have to recursively fold (A +- B),
either handling it manually or resorting to the gimple-matcher to tell the
result. The latter is the natural choice. If so, why not do it at the top.

> Alternatively one could move the GENERIC bits to match.pd, leaving a
> worker in fold-const.c.  Then try to extend that there.
This worker function is meant to be used by both GENERIC and GIMPLE?

> I just remember this is a very fragile area with respect to overflow
> correctness.

Thanks,
Feng

PING: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-08-31 Thread Feng Xue OS via Gcc-patches
Thanks,
Feng

From: Feng Xue OS 
Sent: Wednesday, August 19, 2020 5:17 PM
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org; Marc Glisse
Subject: [PATCH V2] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

Following Richard's comment, this patch is composed to simplify a generalized
binary-with-convert pattern like ((T) X) OP ((T) Y). Instead of creating
almost duplicated rules in match.pd, we try to transform it to (T) (X OP Y),
and apply simplification on (X OP Y) in the forwprop pass.

Regards,
Feng
---
2020-08-19  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
* (fwprop_ssa_val): Move it before its new caller.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/ifcvt-3.c: Modified to suppress forward propagation.
* gcc.dg/tree-ssa/20030807-10.c: Likewise.
* gcc.dg/pr94234-2.c: New test.

> 
> From: Richard Biener 
> Sent: Monday, June 15, 2020 3:41 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org; Marc Glisse
> Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
> base/offset (PR 94234)
>
> On Fri, Jun 5, 2020 at 11:20 AM Feng Xue OS  
> wrote:
>>
>>  As Marc suggested, removed the new pointer_diff rule and added another
>>  rule to fold the convert-add expression. The new rule is:
>>
>>(T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)
>>
>>  Regards,
>>  Feng
>>
>>  ---
>> 2020-06-01  Feng Xue  
>>
>>  gcc/
>>  PR tree-optimization/94234
>>  * match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
>>  simplification.
>>  * ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
>>  simplification.
>>
>>  gcc/testsuite/
>>  PR tree-optimization/94234
>>  * gcc.dg/pr94234.c: New test.
>>  ---
>>   gcc/match.pd   | 28 
>>   gcc/testsuite/gcc.dg/pr94234.c | 24 
>>   2 files changed, 52 insertions(+)
>>   create mode 100644 gcc/testsuite/gcc.dg/pr94234.c
>>
>>  diff --git a/gcc/match.pd b/gcc/match.pd
>>  index 33ee1a920bf..4f340bfe40a 100644
>>  --- a/gcc/match.pd
>>  +++ b/gcc/match.pd
>>  @@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>   && TREE_CODE (@2) == INTEGER_CST
>>   && tree_int_cst_sign_bit (@2) == 0))
>>(minus (convert @1) (convert @2)
>>  +   (simplify
>>  +(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
>>  + (pointer_diff @0 @1))
>
> This new pattern is OK.  Please commit it separately.
>
>>  (simplify
>>   (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
>>   /* The second argument of pointer_plus must be interpreted as signed, 
>> and
>>  @@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>(minus (convert (view_convert:stype @1))
>>  (convert (view_convert:stype @2)))
>>
>>  +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
>>  +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
>>  +(if (INTEGRAL_TYPE_P (type))
>>  + (for plusminus (plus minus)
>>  +  (simplify
>>  +   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
>>  +(convert (mult (plusminus @1 @2) @0
>>  +  (simplify
>>  +   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) 
>> @0
>>  +  (simplify
>>  +   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) 
>> @0))
>>  +
>
> This shows the limit of pattern matching IMHO.  I'm also not convinced
> it gets the
> overflow cases correct (but I didn't spend too much time here).  Note we have
> similar functionality implemented in fold_plusminus_mult_expr.  IMHO instead
> of doing the above moving fold_plusminus_mult_expr to GIMPLE by executing
> it from inside the forwprop pass would make more sense.  Or finally biting the
> bullet and try to teach reassociation about how to handle signed arithmetic
> with non-wrapping overflow behavior.
>
> Richard.



Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-08-31 Thread Feng Xue OS via Gcc-patches
>>> the component is "ipa," please change that when you commit the patch.
>> A mistake has been made; I've pushed it. Is there a way to correct it?
>> git push --force?
>
> There is.  You need to wait until tomorrow (after the commit message
> gets copied to gcc/ChangeLog by a script) and then push a commit that
> modifies nothing else but the ChangeLog. IIUC.
> 
> Thanks again for taking care of this,

I will. Thanks.

Feng


Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-08-31 Thread Feng Xue OS via Gcc-patches
>> gcc/
>> PR tree-optimization/96806

> the component is "ipa," please change that when you commit the patch.
A mistake has been made; I've pushed it. Is there a way to correct it?
git push --force?

Thanks,
Feng

[PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)

2020-08-31 Thread Feng Xue OS via Gcc-patches
This patch fixes a bug in which the cost used to evaluate a clone candidate
becomes negative due to integer overflow.
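
The fix routes the cost additions through safe_add, ipa-cp's saturating add
helper. A minimal sketch of the idea behind that helper (the exact definition
lives in ipa-cp.c and may differ in detail):

  #include <limits.h>

  /* Sketch of a saturating addition, the assumed semantics of ipa-cp's
     safe_add: clamp at INT_MAX instead of wrapping into negative values.  */
  static int
  safe_add (int a, int b)
  {
    if (a > INT_MAX / 2 || b > INT_MAX / 2)
      return INT_MAX;
    return a + b;
  }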

Feng
---
2020-08-31  Feng Xue  

gcc/
PR tree-optimization/96806
* ipa-cp.c (decide_about_value): Use safe_add to avoid cost addition
overflow.

gcc/testsuite/
PR tree-optimization/96806
* g++.dg/ipa/pr96806.C: New test.
From 8d92b4ca4be2303a73f0a2441e57564488ca1c23 Mon Sep 17 00:00:00 2001
From: Feng Xue 
Date: Mon, 31 Aug 2020 15:00:52 +0800
Subject: [PATCH] ipa/96806 - Fix ICE in ipa-cp due to integer addition
 overflow

2020-08-31  Feng Xue  

gcc/
PR tree-optimization/96806
* ipa-cp.c (decide_about_value): Use safe_add to avoid cost addition
	overflow.

gcc/testsuite/
PR tree-optimization/96806
* g++.dg/ipa/pr96806.C: New test.
---
 gcc/ipa-cp.c   |  8 ++---
 gcc/testsuite/g++.dg/ipa/pr96806.C | 53 ++
 2 files changed, 57 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/g++.dg/ipa/pr96806.C

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index e4910a04ffa..8e5d6e2a393 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -5480,11 +5480,11 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
    freq_sum, count_sum,
    val->local_size_cost)
   && !good_cloning_opportunity_p (node,
-  val->local_time_benefit
-  + val->prop_time_benefit,
+  safe_add (val->local_time_benefit,
+		val->prop_time_benefit),
   freq_sum, count_sum,
-  val->local_size_cost
-  + val->prop_size_cost))
+  safe_add (val->local_size_cost,
+		val->prop_size_cost)))
 return false;
 
   if (dump_file)
diff --git a/gcc/testsuite/g++.dg/ipa/pr96806.C b/gcc/testsuite/g++.dg/ipa/pr96806.C
new file mode 100644
index 000..28fdf7787a1
--- /dev/null
+++ b/gcc/testsuite/g++.dg/ipa/pr96806.C
@@ -0,0 +1,53 @@
+/* { dg-do compile } */
+/* { dg-options "-std=c++11 -O -fipa-cp -fipa-cp-clone --param=ipa-cp-max-recursive-depth=94 --param=logical-op-non-short-circuit=0" } */
+
+enum a {};
+struct m;
+struct n {
+  a d;
+};
+int o(int, int);
+struct p {
+  char d;
+  char aa;
+  p *ab;
+  bool q() const {
+int h = d & 4;
+return h;
+  }
+  char r() const { return aa; }
+  int s(const m *, bool) const;
+} l;
+struct t {
+  p *ac;
+  p *u() { return ac; }
+  p *v(int);
+};
+int w(const p *, const p *, const m *, int = 0);
+struct m : n {
+  struct {
+t *ad;
+  } ae;
+  char x() const;
+  p *y(int z) const { return ae.ad ? nullptr : ae.ad->v(z); }
+} j;
+int w(const p *z, const p *af, const m *ag, int ah) {
+  int a, g = z->s(ag, true), i = af->s(ag, true);
+  if (af->q()) {
+if (ag->x())
+  return 0;
+ah++;
+char b = af->r();
+p *c = ag->y(b), *e = ag->ae.ad->u();
+int d = w(z, c, ag, ah), f = w(z, af ? e : af->ab, ag, ah);
+a = f ? d : f;
+return a;
+  }
+  if (g || i == 1)
+return ag->d ? o(g, i) : o(g, i);
+  return 0;
+}
+void ai() {
+  for (p k;;)
+w(&k, &l, &j);
+}
-- 
2.17.1



[PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-08-19 Thread Feng Xue OS via Gcc-patches
Following Richard's comment, this patch is composed to simplify a generalized
binary-with-convert pattern like ((T) X) OP ((T) Y). Instead of creating
almost duplicated rules in match.pd, we try to transform it to (T) (X OP Y),
and apply simplification on (X OP Y) in the forwprop pass.

Regards,
Feng
---
2020-08-19  Feng Xue  

gcc/
PR tree-optimization/94234
* tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
* (fwprop_ssa_val): Move it before its new caller.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/ifcvt-3.c: Modified to suppress forward propagation.
* gcc.dg/tree-ssa/20030807-10.c: Likewise.
* gcc.dg/pr94234-2.c: New test.

> 
> From: Richard Biener 
> Sent: Monday, June 15, 2020 3:41 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org; Marc Glisse
> Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
> base/offset (PR 94234)
> 
> On Fri, Jun 5, 2020 at 11:20 AM Feng Xue OS  
> wrote:
>>
>>  As Marc suggested, removed the new pointer_diff rule and added another
>>  rule to fold the convert-add expression. The new rule is:
>> 
>>(T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)
>> 
>>  Regards,
>>  Feng
>> 
>>  ---
>> 2020-06-01  Feng Xue  
>> 
>>  gcc/
>>  PR tree-optimization/94234
>>  * match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
>>  simplification.
>>  * ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
>>  simplification.
>> 
>>  gcc/testsuite/
>>  PR tree-optimization/94234
>>  * gcc.dg/pr94234.c: New test.
>>  ---
>>   gcc/match.pd   | 28 
>>   gcc/testsuite/gcc.dg/pr94234.c | 24 
>>   2 files changed, 52 insertions(+)
>>   create mode 100644 gcc/testsuite/gcc.dg/pr94234.c
>> 
>>  diff --git a/gcc/match.pd b/gcc/match.pd
>>  index 33ee1a920bf..4f340bfe40a 100644
>>  --- a/gcc/match.pd
>>  +++ b/gcc/match.pd
>>  @@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>   && TREE_CODE (@2) == INTEGER_CST
>>   && tree_int_cst_sign_bit (@2) == 0))
>>(minus (convert @1) (convert @2)
>>  +   (simplify
>>  +(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
>>  + (pointer_diff @0 @1))
> 
> This new pattern is OK.  Please commit it separately.
> 
>>  (simplify
>>   (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
>>   /* The second argument of pointer_plus must be interpreted as signed, 
>> and
>>  @@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>(minus (convert (view_convert:stype @1))
>>  (convert (view_convert:stype @2)))
>> 
>>  +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
>>  +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
>>  +(if (INTEGRAL_TYPE_P (type))
>>  + (for plusminus (plus minus)
>>  +  (simplify
>>  +   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
>>  +(convert (mult (plusminus @1 @2) @0
>>  +  (simplify
>>  +   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) 
>> @0
>>  +  (simplify
>>  +   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
>>  +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>>  +   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>>  +   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>>  +   && single_use (@2) && single_use (@3))
>>  +(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) 
>> @0))
>>  +
> 
> This shows the limit of pattern matching IMHO.  I'm also not convinced
> it gets the
> overflow cases correct (but I didn't spend too much time here).  Note we have
> similar functionality implemented in fold_plusminus_mult_expr.

Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls

2020-08-12 Thread Feng Xue OS via Gcc-patches
> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
> 
> There is the self recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do an amazing job of guessing the profile since it is quite
> atypical. The first observation is that the callback frequency needs to be
> less than 1, otherwise the program never terminates; however, with 10
> nested loops one needs to predict every loop to iterate just a few times
> and the conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago and I fixed it yesterday
> (causing a regression in exchange2 since the bad profile turned out to
> disable some harmful vectorization), and I also now added a cap to the
> self-recursive frequency so things do not get mispropagated by ipa-cp.

With the default setting of PRED_LOOP_GUARD_WITH_RECURSION, static profile
estimation for exchange2 is far from accurate; the hottest recursive function
is predicted as infrequent. However, this low execution estimate works fine
with IRA. I've tried to tweak the likelihood of the predictor, same as you,
and performance degraded as the estimated profile increased. This regression
is also found to be correlated with IRA, which produces many more register
spills than with the default estimate. In the presence of deep loops and high
register pressure, IRA behaves more sensitively to profile estimation, and
this exhibits an unwanted property of the current IRA algorithm. I've
described it in a tracker
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174).

Feng

> 
> Now if ipa-cp decides to duplicate digits a few times we have a new
> problem.  The tree of recursion is organized in a way that the depth is
> bounded by 10 (which GCC does not know) and moreover most time is not
> spent on very deep levels of recursion.
> 
> For that you have the patch which increases frequencies of recursively
> cloned nodes; however, it still seems to me a very specific hack for
> exchange: I do not see how to guess where most of the time is spent.
> Even for very regular trees, by the master theorem, it depends on very
> small differences in the estimates of recursion frequency whether most
> of the time is spent at the top of the tree, at the bottom, or balanced.
> 
> With algorithms doing backtracking, like exchange, the likeliness of
> recursion reduces with deeper recursion level, but we do not know how
> quickly and what the level is.
> 
>> From: Xiong Hu Luo 
>> 
>>  For SPEC2017 exchange2, there is a large recursive function digits_2
>>  (function size 1300) that generates specialized nodes from digits_2.1 to
>>  digits_2.8 with the added build option:
>> 
>>  --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>> 
>>  The ipa-inline pass will consider inlining these nodes called only once,
>>  but these large functions inlined too deeply will cause serious register
>>  spilling and performance degradation, as follows.
>> 
>>  inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 
>> 2.6, 2.7, 2.8)
>>  inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline 
>> digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>>  inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 
>> 2.5 -> 2.6 (inline 2.7 ) -> 2.8
>>  inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 
>> 2.5 -> 2.6 -> 2.7 -> 2.8
>> 
>>  Performance diff:
>>  inlineB is ~25% faster than inlineA;
>>  inlineC is ~20% faster than inlineB;
>>  inlineD is ~30% faster than inlineC.
>> 
>>  The master GCC code now generates an inline sequence like inlineB; this
>>  patch makes the ipa-inline pass behave like inlineD by:
>>   1) Accumulating growth for recursive calls by adding the growth data
>>  to the edge when the edge's caller is inlined into another function, to
>>  avoid inlining too deeply;
>>   2) And if caller and callee are both specialized from the same node, the
>>  edge should also be considered a recursive edge.
>> 
>>  SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56%
>>  without exchange2).
>>  Any comments?  Thanks.
>> 
>>  523.xalancbmk_r +1.32%
>>  541.leela_r +1.51%
>>  548.exchange2_r +31.87%
>>  507.cactuBSSN_r +0.80%
>>  526.blender_r   +1.25%
>>  538.imagick_r   +1.82%
>> 
>>  gcc/ChangeLog:
>> 
>>  2020-08-12  Xionghu Luo  
>> 
>>* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>>callee are specialized from the same node.
>>* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>>inlined_to growth to edge whose caller is inlined.
>>  ---
>>   gcc/cgraph.h  | 2 ++
>>   gcc/ipa-inline-analysis.c | 3 +++
>>   2 files changed, 5 insertions(+)
>> 
>>  diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>>  index 0211f08964f..11903ac1960 100644
>>  --- a/gcc/cgraph.h
>>  +++ b/gcc/cgraph.h
>>  @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>> cgraph_node *c = callee->ultimate_alias_target

Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-15 Thread Feng Xue OS via Gcc-patches
Here is a question about pointer operations:

Pointers are treated as unsigned in comparison operations, while the distance
between pointers is signed. Does that mean we cannot assume the conclusion
below is true?

 (ptr_a > ptr_b) => (ptr_a - ptr_b) >= 0

Thanks,
Feng


From: Marc Glisse 
Sent: Wednesday, June 3, 2020 10:32 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

On Wed, 3 Jun 2020, Feng Xue OS via Gcc-patches wrote:

>> Ah, looking at the PR, you decided to perform the operation as unsigned
>> because that has fewer NOP conversions, which, in that particular testcase
>> where the offsets are originally unsigned, means we simplify better. But I
>> would expect it to regress other testcases (in particular if the offsets
>> were originally signed). Also, changing the second argument of
>> pointer_plus to be signed, as is supposed to eventually happen, would
>> break your testcase again.
> The old rule might produce an overflowed result (offset_a = (signed_int_max)UL,
> offset_b = 1UL).

signed_int_max-1 does not overflow. But the point is that pointer_plus /
pointer_diff are defined in a way that if that subtraction would overflow,
then one of the pointer_plus or pointer_diff would have been undefined
already. In particular, you cannot have objects larger than half the
address space, and pointer_plus/pointer_diff have to remain inside an
object. Doing the subtraction in a signed type keeps (part of) that
information.
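
A hypothetical snippet illustrating that guarantee (not from the thread
itself):

  #include <stddef.h>

  ptrdiff_t f (void)
  {
    char a[100];
    char *p = a + 80;   /* pointer_plus stays inside the object */
    char *q = a + 10;
    /* Since no object may span more than half the address space, this
       pointer_diff (70) always fits in the signed ptrdiff_t.  */
    return p - q;
  }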

> Additionally, (stype)(offset_a - offset_b) is more compact,

Not if offset_a comes from (utype)a and offset_b from (utype)b with a and
b signed. Using size_t indices as in the bugzilla testcase is not
recommended practice. Change it to ssize_t, and we do optimize the
testcase in CCP1 already.

> there might be
> further simplification opportunities on offset_a - offset_b, even if it is
> not in the form of (A * C - B * C), for example (~A - 1 -> -A). But for the
> old rule, we have to introduce another rule as (T)A - (T)(B) -> (T)(A - B),
> which seems too generic to benefit performance in all situations.

Sadly, conversions complicate optimizations and are all over the place, we
need to handle them in more places. I sometimes dream of getting rid of
NOP conversions, and having a single PLUS_EXPR with some kind of flag
saying if it can wrap/saturate/trap when seen as a signed/unsigned
operation, i.e. push the information on the operations instead of objects.

> If the 2nd argument is signed, we can add a specific rule as your suggestion
> (T)(A * C) - (T)(B * C) -> (T) (A - B) * C.
>
>> At the very least we want to keep a comment next to the transformation
>> explaining the situation.
>
>> If there are platforms where the second argument of pointer_plus is a
>> smaller type than the result of pointer_diff (can this happen? I keep
>> forgetting all the weird things some platforms do), this version may do an
>> unsafe zero-extension.
> If the 2nd argument is a smaller type, this might bring confusing semantics
> to the pointer_plus operator. Suppose the type is an (unsigned) char; does
> the expression "ptr + ((char) -1)" represent ptr + 255 or ptr - 1?

(pointer_plus ptr 255) would mean ptr - 1 on a platform where the second
argument of pointer_plus has size 1 byte.


Do note that I am not a reviewer, what I say isn't final.

--
Marc Glisse


Ping: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

2020-06-14 Thread Feng Xue OS via Gcc-patches
Thanks,
Feng


From: Feng Xue OS 
Sent: Friday, June 5, 2020 5:20 PM
To: Richard Biener; gcc-patches@gcc.gnu.org; Marc Glisse
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

As Marc suggested, removed the new pointer_diff rule and added another rule
to fold the convert-add expression. The new rule is:

  (T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)

Regards,
Feng

---
2020-06-01  Feng Xue  

gcc/
PR tree-optimization/94234
* match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
simplification.
* ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
simplification.

gcc/testsuite/
PR tree-optimization/94234
* gcc.dg/pr94234.c: New test.
---
 gcc/match.pd   | 28 
 gcc/testsuite/gcc.dg/pr94234.c | 24 
 2 files changed, 52 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 33ee1a920bf..4f340bfe40a 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 && TREE_CODE (@2) == INTEGER_CST
 && tree_int_cst_sign_bit (@2) == 0))
  (minus (convert @1) (convert @2)
+   (simplify
+(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
+ (pointer_diff @0 @1))
(simplify
 (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
 /* The second argument of pointer_plus must be interpreted as signed, and
@@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
  (minus (convert (view_convert:stype @1))
(convert (view_convert:stype @2)))

+/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
+   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
+(if (INTEGRAL_TYPE_P (type))
+ (for plusminus (plus minus)
+  (simplify
+   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+(convert (mult (plusminus @1 @2) @0
+  (simplify
+   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+   && single_use (@2) && single_use (@3))
+(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) @0
+  (simplify
+   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+   && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+   && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+   && single_use (@2) && single_use (@3))
+(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) @0))
+
 /* (A * C) +- (B * C) -> (A+-B) * C and (A * C) +- A -> A * (C+-1).
 Modeled after fold_plusminus_mult_expr.  */
 (if (!TYPE_SATURATING (type)
diff --git a/gcc/testsuite/gcc.dg/pr94234.c b/gcc/testsuite/gcc.dg/pr94234.c
new file mode 100644
index 000..3f7c7a5e58f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+ptrdiff_t goo (char *a, size_t n, size_t m)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n + 1);
+
+  return (b1 + m) - (b2 + m);
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
--



From: Richard Biener 
Sent: Thursday, June 4, 2020 4:30 PM
To: gcc-patches@gcc.gnu.org
Cc: Feng Xue OS
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same 
base/offset (PR 94234)

On Wed, Jun 3, 2020 at 4:33 PM Marc Glisse  wrote:
>
> On Wed, 3 Jun 2020, Feng Xue OS via Gcc-patches wrote:
>
> >> Ah, looking at the PR, you decided to perform the operation as unsigned
> >> because that has fewer NOP conversions, which, in that particular testcase
> >> where the offsets are originally unsigned, means we simplify better. But I
> >> would expect it to regress other testcases (in particular if the offsets
> >> were originally signed). Also, changing the second argument of
> >> pointer_plus to be signed, as is supposed to eventually happen, would
> >> break your testcase again.
