Re: [PATCH] vect: Fix inconsistency in fully-masked lane-reducing op generation [PR116985]
Added. Thanks, Feng From: Richard Biener Sent: Saturday, October 12, 2024 8:12 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] vect: Fix inconsistency in fully-masked lane-reducing op generation [PR116985] On Sat, Oct 12, 2024 at 9:12 AM Feng Xue OS wrote: > > To align vectorized def/use when lane-reducing op is present in loop > reduction, > we may need to insert extra trivial pass-through copies, which would cause > mismatch between lane-reducing vector copy and loop mask index. This could be > fixed by computing the right index around a new counter on effective lane- > reducing vector copies. OK, but can you add the reduced testcase from the PR in a way that it ICEs before and not after the patch? Thanks, Richard. > Thanks, > Feng > --- > gcc/ > PR tree-optimization/116985 > * tree-vect-loop.cc (vect_transform_reduction): Compute loop mask > index based on effective vector copies for reduction op. > --- > gcc/tree-vect-loop.cc | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index ade72a5124f..025442aabc3 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -8916,6 +8916,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > >bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); >unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); > + unsigned mask_index = 0; > >for (unsigned i = 0; i < num; ++i) > { > @@ -8954,7 +8955,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > std::swap (vop[0], vop[1]); > } > tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, > - vec_num * ncopies, vectype_in, i); > + vec_num * ncopies, vectype_in, > + mask_index++); > gcall *call = gimple_build_call_internal (cond_fn, 4, mask, > vop[0], vop[1], vop[0]); > new_temp = make_ssa_name (vec_dest, call); > @@ -8971,7 +8973,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > if (masked_loop_p && mask_by_cond_expr) > { > tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, > - vec_num * ncopies, vectype_in, > i); > + vec_num * ncopies, vectype_in, > + mask_index++); > build_vect_cond_expr (code, vop, mask, gsi); > } > > -- > 2.17.1 > > >
[PATCH] vect: Fix inconsistency in fully-masked lane-reducing op generation [PR116985]
To align vectorized def/use when lane-reducing op is present in loop reduction, we may need to insert extra trivial pass-through copies, which would cause mismatch between lane-reducing vector copy and loop mask index. This could be fixed by computing the right index around a new counter on effective lane- reducing vector copies. Thanks, Feng --- gcc/ PR tree-optimization/116985 * tree-vect-loop.cc (vect_transform_reduction): Compute loop mask index based on effective vector copies for reduction op. --- gcc/tree-vect-loop.cc | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index ade72a5124f..025442aabc3 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8916,6 +8916,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); + unsigned mask_index = 0; for (unsigned i = 0; i < num; ++i) { @@ -8954,7 +8955,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, std::swap (vop[0], vop[1]); } tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, - vec_num * ncopies, vectype_in, i); + vec_num * ncopies, vectype_in, + mask_index++); gcall *call = gimple_build_call_internal (cond_fn, 4, mask, vop[0], vop[1], vop[0]); new_temp = make_ssa_name (vec_dest, call); @@ -8971,7 +8973,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, if (masked_loop_p && mask_by_cond_expr) { tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, - vec_num * ncopies, vectype_in, i); + vec_num * ncopies, vectype_in, + mask_index++); build_vect_cond_expr (code, vop, mask, gsi); } -- 2.17.1
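For reference, the loop shape this fix targets looks roughly like the sketch
below (illustrative only, not the reduced testcase from the PR; the function
name is made up): a reduction mixing a lane-reducing dot-product with a
normal accumulation, compiled as a fully-masked vector loop (e.g. AArch64 SVE
with partial vectors), so that the dot-product gets trivial pass-through
copies.

  int
  foo (signed char *a, signed char *b, int *c, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      {
        sum += a[i] * b[i];   /* lane-reducing dot-product candidate */
        sum += c[i];          /* normal accumulation, needs more vector copies */
      }
    return sum;
  }

With the previous code the loop mask was fetched with the overall copy index
"i"; the new mask_index counter advances only when a mask is actually
consumed, so the index stays consistent with the effective lane-reducing
vector copies.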
Re: [RFC] Generalize formation of lane-reducing ops in loop reduction
>> >> >> 1. Background >> >> >> >> For loop reduction of accumulating result of a widening operation, the >> >> preferred pattern is lane-reducing operation, if supported by target. >> >> Because >> >> this kind of operation need not preserve intermediate results of widening >> >> operation, and only produces reduced amount of final results for >> >> accumulation, >> >> choosing the pattern could lead to pretty compact codegen. >> >> >> >> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of >> >> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference >> >> (SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerated dot-product >> >> with a >> >> constant operand as "1". Currently, gcc only supports recognition of >> >> simple >> >> lane-reducing case, in which each accumulation statement of loop reduction >> >> forms one pattern: >> >> >> >> char *d0, *d1; >> >> short *s0, *s1; >> >> >> >> for (i) { >> >>sum += d0[i] * d1[i]; // = DOT_PROD > >> char> >> >>sum += abs(s0[i] - s1[i]); // = SAD >> >> } >> >> >> >> We could rewrite the example as the below using only one statement, whose >> >> non- >> >> reduction addend is the sum of the above right-side parts. As a whole, the >> >> addend would match nothing, while its two sub-expressions could be >> >> recognized >> >> as corresponding lane-reducing patterns. >> >> >> >> for (i) { >> >>sum += d0[i] * d1[i] + abs(s0[i] - s1[i]); >> >> } >> > >> > Note we try to recognize the original form as SLP reduction (which of >> > course fails). >> > >> >> This case might be too elaborately crafted to be very common in reality. >> >> Though, we do find seemingly variant but essentially similar code pattern >> >> in >> >> some AI applications, which use matrix-vector operations extensively, some >> >> usages are just single loop reduction composed of multiple dot-products. A >> >> code snippet from ggml: >> >> >> >> for (int j = 0; j < qk/2; ++j) { >> >>const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; >> >>const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; >> >> >> >>const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0; >> >>const int32_t x1 = (x[i].qs[j] >> 4) | xh_1; >> >> >> >>sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]); >> >> } >> >> >> >> In the source level, it appears to be a nature and minor scaling-up of >> >> simple >> >> one lane-reducing pattern, but it is beyond capability of current >> >> vectorization >> >> pattern recognition, and needs some kind of generic extension to the >> >> framework. >> >> Sorry for late response. >> >> > So this is about re-associating lane-reducing ops to alternative >> > lane-reducing >> > ops to save repeated accumulation steps? >> >> You mean re-associating slp-based lane-reducing ops to loop-based? > > Yes. > >> > The thing is that IMO pattern recognition as we do now is limiting and >> > should >> > eventually move to the SLP side where we should be able to more freely >> > "undo" and associate. >> >> No matter pattern recognition is done prior to or within SLP, the must thing >> is we >> need to figure out which op is qualified for lane-reducing by some means. >> >> For example, when seeing a mult in a loop with vectorization-favored shape, >> ... >> t = a * b; // char a, b >> ... >> >> we could not say it is decidedly applicable for reduced computation via >> dot-product >> even the corresponding target ISA is available. > > True. 
Note there's a PR which shows SLP lane-reducing written out like > > a[i] = b[4*i] * 3 + b[4*i+1] * 3 + b[4*i+2] * 3 + b[4*i+3] * 3; > > which we cannot change to a DOT_PROD because we do not know which > lanes are reduced. My point was there are non-reduction cases where knowing > which actual lanes get reduced would help. For reductions it's not important > and associating in a way to expose more possible (reduction) lane reductions > is almost always going to be a win. > >> Recognition of normal patterns merely involves local statement-based match, >> while >> for lane-reducing, validity check requires global loop-wise analysis on >> structure of >> reduction, probably not same as, but close to what is proposed in the RFC. >> The >> basic logic, IMHO, is independent of where pattern recognition is >> implemented. >> As the matter of fact, this is not about of "associating", but "tagging" >> (mark all lane- >> reducing quantifiable statements). After the process, "re-associator" could >> play its >> role to guide selection of either loop-based or slp-based lane-reducing op. >> >> > I've searched twice now, a few days ago I read that the optabs not >> > specifying >> > which lanes are combined/reduced is a limitation. Yes, it is - I hope we >> > can >> > rectify this, so if this is motivation enough we should split the optabs up >> > into even/odd/hi/lo (or whatever else interesting targets actually do). >> >> Actually, how lanes are combined/reduced does
[PATCH] vect: Add missed opcodes in vect_get_smallest_scalar_type [PR115228]
Some opcodes are missed when determining the smallest scalar type for a
vectorizable statement. Currently, this bug does not cause any problem,
because vect_get_smallest_scalar_type is only used to compute the max nunits
vectype, and even when a statement with a missed opcode is incorrectly
bypassed, the max nunits vectype can still be correctly deduced from the def
statements of the statement's operands. But if this function is ever used for
something else, we may get something wrong. So fix it in this patch.

Thanks,
Feng
---
gcc/
	* tree-vect-data-refs.cc (vect_get_smallest_scalar_type): Add missed
	opcodes that involve widening operation.
---
 gcc/tree-vect-data-refs.cc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index 39fd887a96b..5b0d548f847 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -162,7 +162,10 @@ vect_get_smallest_scalar_type (stmt_vec_info stmt_info, tree scalar_type)
   if (gimple_assign_cast_p (assign)
       || gimple_assign_rhs_code (assign) == DOT_PROD_EXPR
       || gimple_assign_rhs_code (assign) == WIDEN_SUM_EXPR
+      || gimple_assign_rhs_code (assign) == SAD_EXPR
       || gimple_assign_rhs_code (assign) == WIDEN_MULT_EXPR
+      || gimple_assign_rhs_code (assign) == WIDEN_MULT_PLUS_EXPR
+      || gimple_assign_rhs_code (assign) == WIDEN_MULT_MINUS_EXPR
       || gimple_assign_rhs_code (assign) == WIDEN_LSHIFT_EXPR
       || gimple_assign_rhs_code (assign) == FLOAT_EXPR)
     {
--
2.17.1
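As an illustration of why the added opcodes matter, the loop below is
typically recognized as a SAD_EXPR pattern on suitable targets, and the
smallest scalar type of that statement is the 8-bit input type rather than
the 32-bit result type (a hedged sketch; the function name is made up):

  int
  sum_abs_diff (unsigned char *a, unsigned char *b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      {
        int d = a[i] - b[i];
        sum += d > 0 ? d : -d;   /* abs of the difference, SAD-style */
      }
    return sum;
  }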
[PATCH] vect: Allow unsigned-to-signed promotion in vect_look_through_possible_promotion [PR115707]
The function vect_look_through_possible_promotion() fails to figure out the
root definition if the cast chain involves more than two promotions with a
sign change, as in:

  long a = (long)b;              // promotion cast
    -> int b = (int)c;           // promotion cast, sign change
      -> unsigned short c = ...;

For this case, the function thinks the 2nd cast has a different sign than the
1st, so it stops looking through, while "unsigned short -> integer" is a
natural sign extension (the unsigned value is never negative, so the
promotion preserves it). This patch allows this unsigned-to-signed promotion
in the function.

Thanks,
Feng
---
gcc/
	* tree-vect-patterns.cc (vect_look_through_possible_promotion): Allow
	unsigned-to-signed promotion.
---
 gcc/tree-vect-patterns.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 4674a16d15f..b2c83cfd219 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -434,7 +434,9 @@ vect_look_through_possible_promotion (vec_info *vinfo, tree op,
 	 sign of the previous promotion.  */
       if (!res
 	  || TYPE_PRECISION (unprom->type) == orig_precision
-	  || TYPE_SIGN (unprom->type) == TYPE_SIGN (op_type))
+	  || TYPE_SIGN (unprom->type) == TYPE_SIGN (op_type)
+	  || (TYPE_UNSIGNED (op_type)
+	      && TYPE_PRECISION (op_type) < TYPE_PRECISION (unprom->type)))
 	{
 	  unprom->set_op (op, dt, caster);
 	  min_precision = TYPE_PRECISION (op_type);
--
2.17.1
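Written out as a compilable loop, the cast chain above looks like this (a
hedged sketch with a made-up function name; the point is only that looking
through both promotions should reach the unsigned short definition):

  long
  widen_chain (unsigned short *p, int n)
  {
    long sum = 0;
    for (int i = 0; i < n; i++)
      {
        unsigned short c = p[i];
        int b = (int) c;     /* promotion cast, sign change */
        long a = (long) b;   /* promotion cast */
        sum += a;
      }
    return sum;
  }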
Re: [RFC] Generalize formation of lane-reducing ops in loop reduction
>> 1. Background >> >> For loop reduction of accumulating result of a widening operation, the >> preferred pattern is lane-reducing operation, if supported by target. Because >> this kind of operation need not preserve intermediate results of widening >> operation, and only produces reduced amount of final results for >> accumulation, >> choosing the pattern could lead to pretty compact codegen. >> >> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of >> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference >> (SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerated dot-product with a >> constant operand as "1". Currently, gcc only supports recognition of simple >> lane-reducing case, in which each accumulation statement of loop reduction >> forms one pattern: >> >> char *d0, *d1; >> short *s0, *s1; >> >> for (i) { >>sum += d0[i] * d1[i]; // = DOT_PROD >>sum += abs(s0[i] - s1[i]); // = SAD >> } >> >> We could rewrite the example as the below using only one statement, whose >> non- >> reduction addend is the sum of the above right-side parts. As a whole, the >> addend would match nothing, while its two sub-expressions could be recognized >> as corresponding lane-reducing patterns. >> >> for (i) { >>sum += d0[i] * d1[i] + abs(s0[i] - s1[i]); >> } > > Note we try to recognize the original form as SLP reduction (which of > course fails). > >> This case might be too elaborately crafted to be very common in reality. >> Though, we do find seemingly variant but essentially similar code pattern in >> some AI applications, which use matrix-vector operations extensively, some >> usages are just single loop reduction composed of multiple dot-products. A >> code snippet from ggml: >> >> for (int j = 0; j < qk/2; ++j) { >>const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; >>const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; >> >>const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0; >>const int32_t x1 = (x[i].qs[j] >> 4) | xh_1; >> >>sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]); >> } >> >> In the source level, it appears to be a nature and minor scaling-up of simple >> one lane-reducing pattern, but it is beyond capability of current >> vectorization >> pattern recognition, and needs some kind of generic extension to the >> framework. Sorry for late response. > So this is about re-associating lane-reducing ops to alternative lane-reducing > ops to save repeated accumulation steps? You mean re-associating slp-based lane-reducing ops to loop-based? > The thing is that IMO pattern recognition as we do now is limiting and should > eventually move to the SLP side where we should be able to more freely > "undo" and associate. No matter pattern recognition is done prior to or within SLP, the must thing is we need to figure out which op is qualified for lane-reducing by some means. For example, when seeing a mult in a loop with vectorization-favored shape, ... t = a * b; // char a, b ... we could not say it is decidedly applicable for reduced computation via dot-product even the corresponding target ISA is available. Recognition of normal patterns merely involves local statement-based match, while for lane-reducing, validity check requires global loop-wise analysis on structure of reduction, probably not same as, but close to what is proposed in the RFC. The basic logic, IMHO, is independent of where pattern recognition is implemented. As the matter of fact, this is not about of "associating", but "tagging" (mark all lane- reducing quantifiable statements). 
After the process, "re-associator" could play its role to guide selection of either loop-based or slp-based lane-reducing op. > I've searched twice now, a few days ago I read that the optabs not specifying > which lanes are combined/reduced is a limitation. Yes, it is - I hope we can > rectify this, so if this is motivation enough we should split the optabs up > into even/odd/hi/lo (or whatever else interesting targets actually do). Actually, how lanes are combined/reduced does not matter too much regarding to recognition of lane-reducing patterns. > I did read through the rest of the e-mail before, I do in general like the > idea > to do better. Costing is another place where we need to re-do things Yes, current pattern recognition framework is not costing-model-driven, and has no way to "undo" decision previously made even it negatively impacts pattern match later. But this is a weakness of the framework, not any specific pattern. To overcome it, we may count on another separate task instead of mingling with this RFC, and better to have the task contained into your plan of moving pattern recognition to SLP. > completely; my current idea is to feed targets the SLP graph so they'll > have some dependence info. They should already have access to the > actual operation done, though in awkward ways. I guess the first target > to impleme
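To make the point concrete, here is a hedged sketch (function names are made
up) of why the same char multiply may or may not be a dot-product candidate;
the decision depends on how its value reaches the reduction, not on the
statement itself:

  int
  dotprod_ok (signed char *a, signed char *b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      {
        int t = a[i] * b[i];
        sum += t;        /* t feeds the reduction linearly: DOT_PROD is valid */
      }
    return sum;
  }

  int
  dotprod_not_ok (signed char *a, signed char *b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      {
        int t = a[i] * b[i];
        sum += t * t;    /* non-linear use: per-lane results of the multiply
                            must be preserved, so DOT_PROD is not valid */
      }
    return sum;
  }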
[RFC][PATCH 5/5] vect: Add accumulating-result pattern for lane-reducing operation
This patch adds a pattern to fold a summation into the last operand of lane- reducing operation when appropriate, which is a supplement to those operation- specific patterns for dot-prod/sad/widen-sum. sum = lane-reducing-op(..., 0) + value; => sum = lane-reducing-op(..., value); Thanks, Feng --- gcc/ * tree-vect-patterns (vect_recog_lane_reducing_accum_pattern): New pattern function. (vect_vect_recog_func_ptrs): Add the new pattern function. * params.opt (vect-lane-reducing-accum-pattern): New parameter. gcc/testsuite/ * gcc.dg/vect/vect-reduc-accum-pattern.c --- gcc/params.opt| 4 + .../gcc.dg/vect/vect-reduc-accum-pattern.c| 61 ++ gcc/tree-vect-patterns.cc | 106 ++ 3 files changed, 171 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c diff --git a/gcc/params.opt b/gcc/params.opt index c17ba17b91b..b94bdc26cbd 100644 --- a/gcc/params.opt +++ b/gcc/params.opt @@ -1198,6 +1198,10 @@ The maximum factor which the loop vectorizer applies to the cost of statements i Common Joined UInteger Var(param_vect_induction_float) Init(1) IntegerRange(0, 1) Param Optimization Enable loop vectorization of floating point inductions. +-param=vect-lane-reducing-accum-pattern= +Common Joined UInteger Var(param_vect_lane_reducing_accum_pattern) Init(2) IntegerRange(0, 2) Param Optimization +Allow pattern of combining plus into lane reducing operation or not. If value is 2, allow this for all statements, or if 1, only for reduction statement, otherwise, disable it. + -param=vrp-block-limit= Common Joined UInteger Var(param_vrp_block_limit) Init(15) Optimization Param Maximum number of basic blocks before VRP switches to a fast model with less memory requirements. diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c new file mode 100644 index 000..80a2c4f047e --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-accum-pattern.c @@ -0,0 +1,61 @@ +/* Disabling epilogues until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#define N 50 + +#define FN(name, S1, S2) \ +S1 int __attribute__ ((noipa)) \ +name (S1 int res, \ + S2 char *restrict a, \ + S2 char *restrict b, \ + S2 char *restrict c, \ + S2 char *restrict d) \ +{ \ + for (int i = 0; i < N; i++) \ +res += a[i] * b[i];\ + \ + asm volatile ("" ::: "memory"); \ + for (int i = 0; i < N; ++i) \ +res += (a[i] * b[i] + c[i] * d[i]) << 3; \ + \ + return res; \ +} + +FN(f1_vec, signed, signed) + +#pragma GCC push_options +#pragma GCC optimize ("O0") +FN(f1_novec, signed, signed) +#pragma GCC pop_options + +#define BASE2 ((signed int) -1 < 0 ? 
-126 : 4) +#define OFFSET 20 + +int +main (void) +{ + check_vect (); + + signed char a[N], b[N]; + signed char c[N], d[N]; + +#pragma GCC novector + for (int i = 0; i < N; ++i) +{ + a[i] = BASE2 + i * 5; + b[i] = BASE2 + OFFSET + i * 4; + c[i] = BASE2 + i * 6; + d[i] = BASE2 + OFFSET + i * 5; +} + + if (f1_vec (0x12345, a, b, c, d) != f1_novec (0x12345, a, b, c, d)) +__builtin_abort (); +} + +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ +/* { dg-final { scan-tree-dump "vect_recog_lane_reducing_accum_pattern: detected" "vect" { target { vect_sdot_qi } } } } */ diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index bb037af0b68..9a6b16532e4 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -1490,6 +1490,111 @@ vect_recog_abd_pattern (vec_info *vinfo, return vect_convert_output (vinfo, stmt_vinfo, out_type, stmt, vectype_out); } +/* Function vect_recog_lane_reducing_accum_pattern + + Try to fold a summation into the last operand of lane-reducing operation. + + sum = lane-reducing-op(..., 0) + value; + + A lane-reducing operation contains two aspects: main primitive operation + and appendant result-accumulation. Pattern matching for the basic aspect + is handled in specific pattern for dot-prod/sad/widen-sum respectively. + The function is in charge of the other aspect. + + Input: + + * STMT_VINFO: The stmt from which the pattern se
[RFC][PATCH 2/5] vect: Introduce loop reduction affine closure to vect pattern recog
For sum-based loop reduction, its affine closure is composed by statements whose results and derived computation only end up in the reduction, and are not used in any non-linear transform operation. The concept underlies the generalized lane-reducing pattern recognition in the coming patches. As mathematically proved, it is legitimate to optimize evaluation of a value with lane-reducing pattern, only if its definition statement locates in affine closure. That is to say, canonicalized representation for loop reduction could be of the following affine form, in which "opX" denotes an operation for lane-reducing pattern, h(i) represents remaining operations irrelvant to those patterns. for (i) sum += cst0 * op0 + cst1 * op1 + ... + cstN * opN + h(i); At initialization, we invoke a preprocessing step to mark all statements in affine closure, which could ease retrieval of the property during pattern matching. Since a pattern hit would replace original statement with new pattern statements, we resort to a postprocessing step after recognition, to parse semantics of those new, and incrementally update affine closure, or rollback the pattern change if it would break completeness of existing closure. Thus, inside affine closure, recog framework could universally handle both lane-reducing and normal patterns. Also with this patch, we are able to add more complicated logic to enhance lane-reducing patterns. Thanks, Feng --- gcc/ * tree-vectorizer.h (enum vect_reduc_pattern_status): New enum. (_stmt_vec_info): Add a new field reduc_pattern_status. * tree-vect-patterns.cc (vect_split_statement): Adjust statement status for reduction affine closure. (vect_convert_input): Do not reuse conversion statement in process. (vect_reassociating_reduction_p): Add a condition check to only allow statement in reduction affine closure. (vect_pattern_expr_invariant_p): New function. (vect_get_affine_operands_mask): Likewise. (vect_mark_reduction_affine_closure): Likewise. (vect_mark_stmts_for_reduction_pattern_recog): Likewise. (vect_get_prev_reduction_stmt): Likewise. (vect_mark_reduction_pattern_sequence_formed): Likewise. (vect_check_pattern_stmts_for_reduction): Likewise. (vect_pattern_recog_1): Check if a pattern recognition would break existing lane-reducing pattern statements. (vect_pattern_recog): Mark loop reduction affine closure. --- gcc/tree-vect-patterns.cc | 722 +- gcc/tree-vectorizer.h | 23 ++ 2 files changed, 742 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index ca8809e7cfd..02f6b942026 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -750,7 +750,6 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info stmt2_info, tree new_rhs, gimple_stmt_iterator gsi = gsi_for_stmt (stmt2_info->stmt, def_seq); gsi_insert_before_without_update (&gsi, stmt1, GSI_SAME_STMT); } - return true; } else { @@ -783,9 +782,35 @@ vect_split_statement (vec_info *vinfo, stmt_vec_info stmt2_info, tree new_rhs, dump_printf_loc (MSG_NOTE, vect_location, "and: %G", (gimple *) new_stmt2); } +} - return true; + /* Since this function would change existing conversion statement no matter + the pattern is finally applied or not, we should check whether affine + closure of loop reduction need to be adjusted for impacted statements. 
*/ + unsigned int status = stmt2_info->reduc_pattern_status; + + if (status != rpatt_none) +{ + tree rhs_type = TREE_TYPE (gimple_assign_rhs1 (stmt1)); + tree new_rhs_type = TREE_TYPE (new_rhs); + + /* The new statement generated by splitting is a nature widening +conversion. */ + gcc_assert (TYPE_PRECISION (rhs_type) < TYPE_PRECISION (new_rhs_type)); + gcc_assert (TYPE_UNSIGNED (rhs_type) || !TYPE_UNSIGNED (new_rhs_type)); + + /* The new statement would not break transform invariance of lane- +reducing operation, if the original conversion depends on the one +formed previously. For the case, it should also be marked with +rpatt_formed status. */ + if (status & rpatt_formed) + vinfo->lookup_stmt (stmt1)->reduc_pattern_status = rpatt_formed; + + if (!is_pattern_stmt_p (stmt2_info)) + STMT_VINFO_RELATED_STMT (stmt2_info)->reduc_pattern_status = status; } + + return true; } /* Look for the following pattern @@ -890,7 +915,10 @@ vect_convert_input (vec_info *vinfo, stmt_vec_info stmt_info, tree type, return wide_int_to_tree (type, wi::to_widest (unprom->op)); tree input = unprom->op; - if (unprom->caster) + + /* We should not reuse conversion, if it is just the statement under pattern + recognition. */ + if (unprom->caster && unprom->cast
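As an illustration of the canonical affine form above, the reduction below
has two lane-reducing candidates scaled by constants plus a residual term
h(i) that is irrelevant to the patterns (a hedged sketch with made-up names,
not taken from the patch's testcases):

  int
  affine_reduc (signed char *a, signed char *b,
                unsigned char *s0, unsigned char *s1, int *h, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      {
        int d = s0[i] - s1[i];
        sum += 3 * (a[i] * b[i])        /* cst0 * op0: dot-product candidate */
               + 2 * (d > 0 ? d : -d)   /* cst1 * op1: SAD candidate */
               + h[i];                  /* h(i): remaining affine term */
      }
    return sum;
  }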
[RFC][PATCH 4/5] vect: Extend lane-reducing patterns to non-loop-reduction statement
Previously, only simple lane-reducing case is supported, in which one loop reduction statement forms one pattern match: char *d0, *d1, *s0, *s1, *w; for (i) { sum += d0[i] * d1[i]; // sum = DOT_PROD(d0, d1, sum); sum += abs(s0[i] - s1[i]); // sum = SAD(s0, s1, sum); sum += w[i]; // sum = WIDEN_SUM(w, sum); } This patch removes limitation of current lane-reducing matching strategy, and extends candidate scope to the whole loop reduction affine closure. Thus, we could optimize reduction with lane-reducing as many as possible, which ends up with generalized pattern recognition as ("opX" denotes an operation for lane-reducing pattern): for (i) sum += cst0 * op0 + cst1 * op1 + ... + cstN * opN + h(i); A lane-reducing operation contains two aspects: main primitive operation and appendant result-accumulation. Original design handles match of the compound semantics in single pattern, but the means is not suitable for operation that does not directly participate in loop reduction. In this patch, we only focus on the basic aspect, and leave another patch to cover the rest. An example with dot-product: sum = DOT_PROD(d0, d1, sum); // original sum = DOT_PROD(d0, d1, 0) + sum; // now Thanks, Feng --- gcc/ * tree-vect-patterns (vect_reassociating_reduction_p): Remove the function. (vect_recog_dot_prod_pattern): Relax check to allow any statement in reduction affine closure. (vect_recog_sad_pattern): Likewise. (vect_recog_widen_sum_pattern): Likewise. And use dot-product if widen-sum is not supported. (vect_vect_recog_func_ptrs): Move lane-reducing patterns to the topmost. gcc/testsuite/ * gcc.dg/vect/vect-reduc-affine-1.c * gcc.dg/vect/vect-reduc-affine-2.c * gcc.dg/vect/vect-reduc-affine-slp-1.c --- .../gcc.dg/vect/vect-reduc-affine-1.c | 112 ++ .../gcc.dg/vect/vect-reduc-affine-2.c | 81 + .../gcc.dg/vect/vect-reduc-affine-slp-1.c | 74 gcc/tree-vect-patterns.cc | 321 ++ 4 files changed, 372 insertions(+), 216 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-affine-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-affine-slp-1.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c new file mode 100644 index 000..a5e99ce703b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-affine-1.c @@ -0,0 +1,112 @@ +/* Disabling epilogues until we find a better way to deal with scans. 
*/ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#define N 50 + +#define FN(name, S1, S2) \ +S1 int __attribute__ ((noipa)) \ +name (S1 int res, \ + S2 char *restrict a, \ + S2 char *restrict b, \ + S2 int *restrict c, \ + S2 int cst1, \ + S2 int cst2, \ + int shift) \ +{ \ + for (int i = 0; i < N; i++) \ +res += a[i] * b[i] + 16; \ + \ + asm volatile ("" ::: "memory"); \ + for (int i = 0; i < N; i++) \ +res += a[i] * b[i] + cst1; \ + \ + asm volatile ("" ::: "memory"); \ + for (int i = 0; i < N; i++) \ +res += a[i] * b[i] + c[i]; \ + \ + asm volatile ("" ::: "memory"); \ + for (int i = 0; i < N; i++) \ +res += a[i] * b[i] * 23; \ + \ + asm volatile ("" ::: "memory"); \ + for (int i = 0; i < N; i++) \ +res += a[i] * b[i] << 6; \ + \ + asm volatile ("" ::: "memory"); \ + for (int i = 0; i < N; i++) \ +res += a[i] * b[i] * cst2; \ + \ + asm volatile ("" ::: "memory"); \ + for
[RFC][PATCH 3/5] vect: Enable lane-reducing operation that is not loop reduction statement
This patch extends original vect analysis and transform to support a new kind of lane-reducing operation that participates in loop reduction indirectly. The operation itself is not reduction statement, but its value would be accumulated into reduction result finally. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vectorizable_lane_reducing): Allow indirect lane- reducing operation. (vect_transform_reduction): Extend transform for indirect lane-reducing operation. --- gcc/tree-vect-loop.cc | 48 +++ 1 file changed, 40 insertions(+), 8 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index d7d628efa60..c344158b419 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7520,9 +7520,7 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); - /* TODO: Support lane-reducing operation that does not directly participate - in loop reduction. */ - if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0) + if (!reduc_info) return false; /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not @@ -7530,7 +7528,16 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def); gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION); - for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++) + int sum_idx = STMT_VINFO_REDUC_IDX (stmt_info); + int num_ops = (int) gimple_num_ops (stmt) - 1; + + /* Participate in loop reduction either directly or indirectly. */ + if (sum_idx >= 0) +gcc_assert (sum_idx == num_ops - 1); + else +sum_idx = num_ops - 1; + + for (int i = 0; i < num_ops; i++) { stmt_vec_info def_stmt_info; slp_tree slp_op; @@ -7573,7 +7580,24 @@ vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); - gcc_assert (vectype_in); + if (!vectype_in) +{ + enum vect_def_type dt; + tree rhs1 = gimple_assign_rhs1 (stmt); + + if (!vect_is_simple_use (rhs1, loop_vinfo, &dt, &vectype_in)) + return false; + + if (!vectype_in) + { + vectype_in = get_vectype_for_scalar_type (loop_vinfo, + TREE_TYPE (rhs1)); + if (!vectype_in) + return false; + } + + STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in; +} /* Compute number of effective vector statements for costing. */ unsigned int ncopies_for_cost = vect_get_num_copies (loop_vinfo, slp_node, @@ -8750,9 +8774,17 @@ vect_transform_reduction (loop_vec_info loop_vinfo, gcc_assert (single_defuse_cycle || lane_reducing); if (lane_reducing) -{ - /* The last operand of lane-reducing op is for reduction. */ - gcc_assert (reduc_index == (int) op.num_ops - 1); +{ + if (reduc_index < 0) + { + reduc_index = (int) op.num_ops - 1; + single_defuse_cycle = false; + } + else + { + /* The last operand of lane-reducing op is for reduction. */ + gcc_assert (reduc_index == (int) op.num_ops - 1); + } } /* Create the destination vector */ -- 2.17.1From 5e65c65786d9594c172b58a6cd1af50c67efb927 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Wed, 24 Apr 2024 16:46:49 +0800 Subject: [PATCH 3/5] vect: Enable lane-reducing operation that is not loop reduction statement This patch extends original vect analysis and transform to support a new kind of lane-reducing operation that participates in loop reduction indirectly. The operation itself is not reduction statement, but its value would be accumulated into reduction result finally. 
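A hedged sketch (with a made-up function name) of the indirect participation
this patch enables: the dot-product candidate below is not itself the
reduction statement; its value is first combined with another term and only
then accumulated.

  int
  indirect_dotprod (signed char *a, signed char *b, int *c, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; i++)
      {
        int t = a[i] * b[i];   /* lane-reducing candidate, not the reduction stmt */
        sum += t + c[i];       /* its value reaches the reduction indirectly */
      }
    return sum;
  }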
[RFC][PATCH 1/5] vect: Fix single_imm_use in tree_vect_patterns
The work for the RFC (https://gcc.gnu.org/pipermail/gcc-patches/2024-July/657860.html)
involves quite a lot of code change, so I have to separate it into several
batches of patches. This and the following patches constitute the first batch.

Since a pattern statement coexists with normal statements in a way that it is
not linked into the function body, we should not invoke utility procedures
that depend on the def/use graph on a pattern statement, such as counting the
uses of a pseudo value defined by a pattern statement. This patch fixes a bug
of this kind in vect pattern formation.

Thanks,
Feng
---
gcc/
	* tree-vect-patterns.cc (vect_recog_bitfield_ref_pattern): Only
	call single_imm_use if statement is not generated by pattern
	recognition.
---
 gcc/tree-vect-patterns.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 4570c25b664..ca8809e7cfd 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -2700,7 +2700,8 @@ vect_recog_bitfield_ref_pattern (vec_info *vinfo, stmt_vec_info stmt_info,
   /* If the only use of the result of this BIT_FIELD_REF + CONVERT is a
      PLUS_EXPR then do the shift last as some targets can combine the shift and
      add into a single instruction.  */
-  if (lhs && single_imm_use (lhs, &use_p, &use_stmt))
+  if (lhs && !STMT_VINFO_RELATED_STMT (stmt_info)
+      && single_imm_use (lhs, &use_p, &use_stmt))
     {
       if (gimple_code (use_stmt) == GIMPLE_ASSIGN
 	  && gimple_assign_rhs_code (use_stmt) == PLUS_EXPR)
--
2.17.1
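For context, the pattern touched here fires on bit-field reads inside
vectorizable loops, roughly like the sketch below (illustrative only, with
made-up names; whether it actually vectorizes depends on the target). The
guard only matters when the statement being matched is itself a pattern
statement, whose definition is not linked into the IL and therefore has no
valid immediate-use chain.

  struct S { unsigned int lo : 10; unsigned int hi : 22; };

  void
  extract_hi (struct S *s, unsigned int *out, int n)
  {
    for (int i = 0; i < n; i++)
      out[i] = s[i].hi + 1;   /* bit-field read + convert, used in a PLUS_EXPR */
  }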
[RFC] Generalize formation of lane-reducing ops in loop reduction
Hi, I composed some patches to generalize lane-reducing (dot-product is a typical representative) pattern recognition, and prepared a RFC document so as to help review. The original intention was to make a complete solution for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440. For sure, the work might be limited, so hope your comments. Thanks. - 1. Background For loop reduction of accumulating result of a widening operation, the preferred pattern is lane-reducing operation, if supported by target. Because this kind of operation need not preserve intermediate results of widening operation, and only produces reduced amount of final results for accumulation, choosing the pattern could lead to pretty compact codegen. Three lane-reducing opcodes are defined in gcc, belonging to two kinds of operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference (SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerated dot-product with a constant operand as "1". Currently, gcc only supports recognition of simple lane-reducing case, in which each accumulation statement of loop reduction forms one pattern: char *d0, *d1; short *s0, *s1; for (i) { sum += d0[i] * d1[i]; // = DOT_PROD sum += abs(s0[i] - s1[i]); // = SAD } We could rewrite the example as the below using only one statement, whose non- reduction addend is the sum of the above right-side parts. As a whole, the addend would match nothing, while its two sub-expressions could be recognized as corresponding lane-reducing patterns. for (i) { sum += d0[i] * d1[i] + abs(s0[i] - s1[i]); } This case might be too elaborately crafted to be very common in reality. Though, we do find seemingly variant but essentially similar code pattern in some AI applications, which use matrix-vector operations extensively, some usages are just single loop reduction composed of multiple dot-products. A code snippet from ggml: for (int j = 0; j < qk/2; ++j) { const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0; const int32_t x1 = (x[i].qs[j] >> 4) | xh_1; sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]); } In the source level, it appears to be a nature and minor scaling-up of simple one lane-reducing pattern, but it is beyond capability of current vectorization pattern recognition, and needs some kind of generic extension to the framework. 2. Reasoning on validity of transform First of all, we should tell what kind of expression is appropriate for lane- reducing transform. Given a loop, we use the language of mathematics to define an abstract function f(x, i), whose first independent variable "x" denotes a value that will participate sum-based loop reduction either directly or indirectly, and the 2nd one "i" specifies index of a loop iteration, which implies other intra-iteration factor irrelevant to "x". The function itself represents the transformed value by applying a series of operations on "x" in the context of "i"th loop iteration, and this value is directly accumulated to the loop reduction result. For the purpose of vectorization, it is implicitly supposed that f(x, i) is a pure function, and free of loop dependency. Additionally, for a value "x" defined in the loop, let "X" be a vector as , consisting of the "x" values in all iterations, to be specific, "X[i]" corresponds to "x" at iteration "i", or "xi". With sequential execution order, a loop reduction regarding to f(x, i) would be expanded to: sum += f(x0, 0); sum += f(x1, 1); ... 
sum += f(xM, M); 2.1 Lane-reducing vs. Lane-combining Following lane-reducing semantics, we introduce a new similar lane-combining operation that also manipulates a subset of lanes/elements in vector, by accumulating all into one of them, at the same time, clearing the rest lanes to be zero. Two operations are equivalent in essence, while a major difference is that lane-combining operation does not reduce the lanes of vector. One advantage about this is codegen of lane-combining operation could seamlessly inter-operate with that of normal (non-lane-reducing) vector operation. Any lane-combining operation could be synthesized by a sequence of the most basic two-lane operations, which become the focuses of our analysis. Given two lanes "i" and "j", and let X' = lane-combine(X, i, j), then we have: X = <..., xi , ..., xj, ...> X' = <..., xi + xj, ..., 0, ...> 2.2 Equations for loop reduction invariance Since combining strategy of lane-reducing operations is target-specific, for examples, accumulating quad lanes to one (#0 + #1 + #2 + #3 => #0), or low to high (#0 + #4 => #4), we just make a conservative assumption that combining could happen on arbitrary two lanes in either order. Under the precondition, it is legitimate to optimize evaluation of a value "x" with a lane-reducing pattern, only if loop reduction always produces invariant result no matter w
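For the two-lane case, the invariance requirement can be written out
explicitly. Combining lanes "i" and "j" replaces the pair (xi, xj) by
(xi + xj, 0), so the reduction result is unchanged only if

  f(xi, i) + f(xj, j) == f(xi + xj, i) + f(0, j)    for all xi, xj.

Under the affine form used elsewhere in this series, f(x, k) = c * x + h(k)
with a loop-invariant coefficient "c", both sides evaluate to
c * (xi + xj) + h(i) + h(j), so the equality holds for any pair of lanes
combined in either order. A non-affine use of "x", for example
f(x, k) = x * x, breaks the equality (the two sides differ by 2 * xi * xj),
which is why only statements inside the affine closure are safe candidates
for the lane-reducing transform. (This is a sketch of the reasoning, stated
in the notation introduced above.)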
Re: [PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp
>> +inline unsigned int >> +vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL) >> +{ >> + poly_uint64 vf; >> + >> + if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) >> +vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); >> + else >> +vf = 1; >> + >> + if (node) >> +{ >> + vf *= SLP_TREE_LANES (node); >> + if (!vectype) >> + vectype = SLP_TREE_VECTYPE (node); >> +} >> + else >> +gcc_checking_assert (vectype); > > can you make the checking assert unconditional? > > OK with that change. vect_get_num_vectors will ICE anyway > I guess, so at your choice remove the assert completely. > OK, I removed the assert. Thanks, Feng From: Richard Biener Sent: Monday, July 15, 2024 10:00 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp On Sat, Jul 13, 2024 at 5:46 PM Feng Xue OS wrote: > > Extend original vect_get_num_copies (pure loop-based) to calculate number of > vector stmts for slp node regarding a generic vect region. > > Thanks, > Feng > --- > gcc/ > * tree-vectorizer.h (vect_get_num_copies): New overload function. > (vect_get_slp_num_vectors): New function. > * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate > number of vector stmts for slp node with vect_get_num_copies. > (vect_slp_analyze_node_operations): Calculate number of vector > elements > for constant/external slp node with vect_get_num_copies. > --- > gcc/tree-vect-slp.cc | 19 +++ > gcc/tree-vectorizer.h | 29 - > 2 files changed, 31 insertions(+), 17 deletions(-) > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index d0a8531fd3b..4dadbc6854d 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -6573,17 +6573,7 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, > slp_tree node, > } > } >else > -{ > - poly_uint64 vf; > - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) > - vf = loop_vinfo->vectorization_factor; > - else > - vf = 1; > - unsigned int group_size = SLP_TREE_LANES (node); > - tree vectype = SLP_TREE_VECTYPE (node); > - SLP_TREE_NUMBER_OF_VEC_STMTS (node) > - = vect_get_num_vectors (vf * group_size, vectype); > -} > +SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node); > >/* Handle purely internal nodes. */ >if (SLP_TREE_CODE (node) == VEC_PERM_EXPR) > @@ -6851,12 +6841,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, > slp_tree node, > && j == 1); > continue; > } > - unsigned group_size = SLP_TREE_LANES (child); > - poly_uint64 vf = 1; > - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) > - vf = loop_vinfo->vectorization_factor; > + > SLP_TREE_NUMBER_OF_VEC_STMTS (child) > - = vect_get_num_vectors (vf * group_size, vector_type); > + = vect_get_num_copies (vinfo, child); > /* And cost them. */ > vect_prologue_cost_for_slp (child, cost_vec); > } > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index 8eb3ec4df86..09923b9b440 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -2080,6 +2080,33 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype) >return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant (); > } > > +/* Return the number of vectors in the context of vectorization region VINFO, > + needed for a group of total SIZE statements that are supposed to be > + interleaved together with no gap, and all operate on vectors of type > + VECTYPE. If NULL, SLP_TREE_VECTYPE of NODE is used. 
*/ > + > +inline unsigned int > +vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL) > +{ > + poly_uint64 vf; > + > + if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) > +vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > + else > +vf = 1; > + > + if (node) > +{ > + vf *= SLP_TREE_LANES (node); > + if (!vectype) > + vectype = SLP_TREE_VECTYPE (node); > +} > + else > +gcc_checking_assert (vectype); can you make the checking assert unconditional? OK with that change. vect_get_num_vectors will ICE anyway I guess, so at your choice remove the assert completely. Thanks, Richard. > + > + return vect_get_num_vectors (vf, vecty
[PATCH 4/4] vect: Optimize order of lane-reducing statements in loop def-use cycles
When transforming multiple lane-reducing operations in a loop reduction chain, originally, corresponding vectorized statements are generated into def-use cycles starting from 0. The def-use cycle with smaller index, would contain more statements, which means more instruction dependency. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad sum += n[i]; // normal } Original transformation result: for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy ... } For a higher instruction parallelism in final vectorized loop, an optimal means is to make those effective vector lane-reducing ops be distributed evenly among all def-use cycles. Transformed as the below, DOT_PROD, WIDEN_SUM and SADs are generated into disparate cycles, instruction dependency among them could be eliminated. for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = sum_v1; // copy sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2); sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3); ... } Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing statements in an optimized order. --- gcc/tree-vect-loop.cc | 64 ++- gcc/tree-vectorizer.h | 6 2 files changed, 63 insertions(+), 7 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index e72d692ffa3..5bc6e526d43 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8841,6 +8841,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad + sum += n[i]; // normal } The vector size is 128-bit,vectorization factor is 16. Reduction @@ -8858,19 +8859,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); - sum_v1 = sum_v1; // copy + sum_v0 = sum_v0; // copy + sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy + + sum_v0 += n_v0[i: 0 ~ 3 ]; + sum_v1 += n_v1[i: 4 ~ 7 ]; + sum_v2 += n_v2[i: 8 ~ 11]; + sum_v3 += n_v3[i: 12 ~ 15]; } - sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v1 - */ +Moreover, for a higher instruction parallelism in final vectorized +loop, it is considered to make those effective vector lane-reducing +ops be distributed evenly among all def-use cycles. 
In the above +example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate +cycles, instruction dependency among them could be eliminated. */ unsigned effec_ncopies = vec_oprnds[0].length (); unsigned total_ncopies = vec_oprnds[reduc_index].length (); @@ -8884,6 +8893,47 @@ vect_transform_reduction (loop_vec_info loop_vinfo, vec_oprnds[i].safe_grow_cleared (total_ncopies); } } + + tree reduc_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); + gcc_assert (reduc_vectype_in); + + unsigned effec_reduc_ncopies + = vect_get_num_copies (loop_vinfo, slp_no
[PATCH 3/4] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current vectorizer could only handle the pattern if the reduction chain does not contain other operation, no matter the other is normal or lane-reducing. This patches removes some constraints in reduction analysis to allow multiple arbitrary lane-reducing operations with mixed input vectypes in a loop reduction chain. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad } The vector size is 128-bit vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 1 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v1 Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (vectorizable_lane_reducing): New function declaration. * tree-vect-stmts.cc (vect_analyze_stmt): Call new function vectorizable_lane_reducing to analyze lane-reducing operation. * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation code related to emulated_mixed_dot_prod. (vectorizable_lane_reducing): New function. (vectorizable_reduction): Allow multiple lane-reducing operations in loop reduction. Move some original lane-reducing related code to vectorizable_lane_reducing. (vect_transform_reduction): Adjust comments with updated example. 
gcc/testsuite/ PR tree-optimization/114440 * gcc.dg/vect/vect-reduc-chain-1.c * gcc.dg/vect/vect-reduc-chain-2.c * gcc.dg/vect/vect-reduc-chain-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c * gcc.dg/vect/vect-reduc-dot-slp-1.c --- .../gcc.dg/vect/vect-reduc-chain-1.c | 64 + .../gcc.dg/vect/vect-reduc-chain-2.c | 79 ++ .../gcc.dg/vect/vect-reduc-chain-3.c | 68 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 95 +++ .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 67 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c | 79 ++ .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c | 63 + .../gcc.dg/vect/vect-reduc-dot-slp-1.c| 60 + gcc/tree-vect-loop.cc | 240 +- gcc/tree-vect-stmts.cc| 2 + gcc/tree-vectorizer.h | 2 + 11 files changed, 750 insertions(+), 69 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c new file mode 100644 index 000..80b0089ea0f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c @@ -0,0 +1,64 @@ +/* Disabling epilogues until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#define N 50 + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res, + SIGNEDNESS_2 char *restrict a, + SIGNEDNESS_2 char *restrict b, + SIGNEDNESS_2 char *restrict c, + SIGNEDNESS_2 char *restrict d, + SIGNEDNESS_1 int *restrict e) +{ + for (int i = 0; i < N; ++i) +{ + res += a[i] * b[i]; + res += c[i] * d[i]; + r
[PATCH 2/4] vect: Refit lane-reducing to be normal operation
Vector stmts number of an operation is calculated based on output vectype. This is over-estimated for lane-reducing operation, which would cause vector def/use mismatched when we want to support loop reduction mixed with lane- reducing and normal operations. One solution is to refit lane-reducing to make it behave like a normal one, by adding new pass-through copies to fix possible def/use gap. And resultant superfluous statements could be optimized away after vectorization. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod } The vector size is 128-bit,vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 1 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 Thanks, Feng --- gcc/ * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Calculate effective vector stmts number with generic vect_get_num_copies. (vect_transform_reduction): Insert copies for lane-reducing so as to fix over-estimated vector stmts number. (vect_transform_cycle_phi): Calculate vector PHI number only based on output vectype. * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove adjustment on vector stmts number specific to slp reduction. --- gcc/tree-vect-loop.cc | 134 +++--- gcc/tree-vect-slp.cc | 27 +++-- 2 files changed, 121 insertions(+), 40 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a64b5082bd1..5ac83e76975 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7468,12 +7468,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, = get_masked_reduction_fn (reduc_fn, vectype_in); vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); - unsigned nvectors; - - if (slp_node) - nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); - else - nvectors = vect_get_num_copies (loop_vinfo, vectype_in); + unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node, + vectype_in); if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1); @@ -8595,12 +8591,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo, stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); gphi *reduc_def_phi = as_a (phi_info->stmt); int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); - tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); + + if (!vectype_in) +vectype_in = STMT_VINFO_VECTYPE (stmt_info); if (slp_node) { ncopies = 1; - vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in); } else { @@ -8658,13 +8657,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo, bool lane_reducing = lane_reducing_op_p (code); gcc_assert (single_defuse_cycle || lane_reducing); + if (lane_reducing) +{ + /* The last operand of lane-reducing op is for reduction. 
*/ + gcc_assert (reduc_index == (int) op.num_ops - 1); +} + /* Create the destination vector */ tree scalar_dest = gimple_get_lhs (stmt_info->stmt); tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); + if (lane_reducing && !slp_node && !single_defuse_cycle) +{ + /* Note: there are still vectorizable cases that can not be handled by +single-lane slp. Probably it would take some time to evolve the +feature to a mature state. So we have to keep the below non-slp code +path as failsafe for lane-reducing support. */ + gcc_assert (op.num_ops <= 3); + for (unsigned i = 0; i < op.num_ops; i++) + { + unsigned oprnd_ncopies = ncopies; + + if ((int) i == reduc_index) + { + tree vectype = STMT_VINFO_VECTYPE (stmt_info); + oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype); + } + + vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies, +op.ops[i], &vec_oprnds[i]); + } +} /* Get NCOPIES vector definitions for all operands except the reduction definition. */ - if (!cond_fn_p) + else if (!cond_fn_p) { gcc_assert (reduc_index >= 0 && reduc_index <= 2); vect_get_vec_defs (loop_vinf
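To make the def/use alignment concrete, below is a minimal C sketch of a reduction chain that mixes a dot-product candidate with a normal operation; it is illustrative only, not one of the committed testcases, and the function and array names are made up. Assuming 128-bit vectors and a vectorization factor of 16, the normal add needs four vector statements per vectorized iteration while the dot-product needs only one, so the refit inserts three pass-through copies into the dot-product's def-use cycles.

#define N 256

int __attribute__ ((noipa))
mixed_reduc (signed char *restrict a, signed char *restrict b,
             int *restrict n)
{
  int sum = 0;
  for (int i = 0; i < N; i++)
    {
      sum += a[i] * b[i];   /* lane-reducing: dot-product candidate */
      sum += n[i];          /* normal operation in the same reduction chain */
    }
  return sum;
}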
[PATCH 1/4] vect: Add a unified vect_get_num_copies for slp and non-slp
Extend original vect_get_num_copies (pure loop-based) to calculate number of vector stmts for slp node regarding a generic vect region. Thanks, Feng --- gcc/ * tree-vectorizer.h (vect_get_num_copies): New overload function. (vect_get_slp_num_vectors): New function. * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate number of vector stmts for slp node with vect_get_num_copies. (vect_slp_analyze_node_operations): Calculate number of vector elements for constant/external slp node with vect_get_num_copies. --- gcc/tree-vect-slp.cc | 19 +++ gcc/tree-vectorizer.h | 29 - 2 files changed, 31 insertions(+), 17 deletions(-) diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index d0a8531fd3b..4dadbc6854d 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -6573,17 +6573,7 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node, } } else -{ - poly_uint64 vf; - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) - vf = loop_vinfo->vectorization_factor; - else - vf = 1; - unsigned int group_size = SLP_TREE_LANES (node); - tree vectype = SLP_TREE_VECTYPE (node); - SLP_TREE_NUMBER_OF_VEC_STMTS (node) - = vect_get_num_vectors (vf * group_size, vectype); -} +SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node); /* Handle purely internal nodes. */ if (SLP_TREE_CODE (node) == VEC_PERM_EXPR) @@ -6851,12 +6841,9 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, && j == 1); continue; } - unsigned group_size = SLP_TREE_LANES (child); - poly_uint64 vf = 1; - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) - vf = loop_vinfo->vectorization_factor; + SLP_TREE_NUMBER_OF_VEC_STMTS (child) - = vect_get_num_vectors (vf * group_size, vector_type); + = vect_get_num_copies (vinfo, child); /* And cost them. */ vect_prologue_cost_for_slp (child, cost_vec); } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 8eb3ec4df86..09923b9b440 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2080,6 +2080,33 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype) return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant (); } +/* Return the number of vectors in the context of vectorization region VINFO, + needed for a group of total SIZE statements that are supposed to be + interleaved together with no gap, and all operate on vectors of type + VECTYPE. If NULL, SLP_TREE_VECTYPE of NODE is used. */ + +inline unsigned int +vect_get_num_copies (vec_info *vinfo, slp_tree node, tree vectype = NULL) +{ + poly_uint64 vf; + + if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) +vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + else +vf = 1; + + if (node) +{ + vf *= SLP_TREE_LANES (node); + if (!vectype) + vectype = SLP_TREE_VECTYPE (node); +} + else +gcc_checking_assert (vectype); + + return vect_get_num_vectors (vf, vectype); +} + /* Return the number of copies needed for loop vectorization when a statement operates on vectors of type VECTYPE. This is the vectorization factor divided by the number of elements in @@ -2088,7 +2115,7 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype) inline unsigned int vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype) { - return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo), vectype); + return vect_get_num_copies (loop_vinfo, NULL, vectype); } /* Update maximum unit count *MAX_NUNITS so that it accounts for -- 2.17.1
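As a rough illustration of the arithmetic behind the new overload, the sketch below restates it with plain integers, ignoring poly_uint64 and the exact-division machinery; the helper name num_copies is made up and not part of the patch.

#include <assert.h>

/* (VF * number of SLP lanes) / elements per vector, which is what
   vect_get_num_vectors (vf * group_size, vectype) boils down to when
   the division is exact.  */
static unsigned
num_copies (unsigned vf, unsigned group_size, unsigned nunits)
{
  assert ((vf * group_size) % nunits == 0);
  return vf * group_size / nunits;
}

int
main (void)
{
  assert (num_copies (16, 1, 16) == 1);  /* VF 16, 1 lane, vector(16) char */
  assert (num_copies (16, 1, 4) == 4);   /* VF 16, 1 lane, vector(4) int   */
  assert (num_copies (16, 2, 4) == 8);   /* VF 16, 2 lanes, vector(4) int  */
  return 0;
}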
Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing
> > Hi, Richard, > > > > Let me explain some idea that has to be chosen for lane-reducing. The key > > complication is that these ops are associated with two kinds of vec_nums, > > one is number of effective vector stmts, which is used by partial > > vectorzation > > function such as vect_get_loop_mask. The other is number of total created > > vector stmts. Now we should make it aligned with normal op, in order to > > interoperate with normal op. Suppose expressions mixed with lane-reducing > > and normal as: > > > > temp = lane_reducing<16*char> + expr<4*int>; > > temp = cst<4*int> * lane_reducing<16*char>; > > > > If only generating effective vector stmt for lane_reducing, vector def/use > > between ops will never be matched, so extra pass-through copies are > > necessary. This is why we say "refit a lane-reducing to be a fake normal > > op". > > And this only happens in vect_transform_reduction, right? Yes. it is. > > The other pre-existing issue is that for single_defuse_cycle optimization > SLP_TREE_NUMBER_OF_VEC_STMTS is also off (too large). But here > the transform also goes through vect_transform_reduction. > > > The requirement of two vec_stmts are independent of how we will implement > > SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor vect code > > to unify ncopies/vec_num computation and completely remove > > SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a a large task, and might > > be overkill for these lane-reducing patches. So I will keep it as before, > > and do > > not touch it as what I have done in this patch. > > > > Since one SLP_TREE_NUMBER_OF_VEC_STMTS could not be used for two purposes. > > The your previous suggestion might not be work: > > > > > As said, I don't like this much. vect_slp_analyze_node_operations_1 sets > > > this > > > and I think the existing "exception" > > > > > > /* Calculate the number of vector statements to be created for the > > > scalar stmts in this node. For SLP reductions it is equal to the > > > number of vector statements in the children (which has already been > > > calculated by the recursive call). Otherwise it is the number of > > > scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by > > > VF divided by the number of elements in a vector. */ > > > if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > > > && !STMT_VINFO_DATA_REF (stmt_info) > > > && REDUC_GROUP_FIRST_ELEMENT (stmt_info)) > > >{ > > > for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i) > > >if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == > > > vect_internal_def) > > > { > > >SLP_TREE_NUMBER_OF_VEC_STMTS (node) > > > = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]); > > >break; > > > } > > >} > > > > > > could be changed (or amended if replacing doesn't work out) to > > > > > > if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > > > && STMT_VINFO_REDUC_IDX (stmt_info) > > > // do we have this always set? > > > && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info)) > > >{ > > > do the same as in else {} but using VECTYPE_IN > > >} > > > > > > Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN > > > when that's set instead of SLP_TREE_VECTYPE? As said having wrong > > > SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire. > > > > Then the alternative is to limit special handling related to the vec_num > > only > > inside vect_transform_reduction. Is that ok? Or any other suggestion? 
> > I think that's kind-of in line with the suggestion of a reduction > specific VF, so yes, > not using SLP_TREE_NUMBER_OF_VEC_STMTS in vect_transform_reduction > sounds fine to me and would be a step towards not having > SLP_TREE_NUMBER_OF_VEC_STMTS > where the function would be responsible for appropriate allocation as well. OK. I remade the 4 patches and sent them in new emails. Thanks, Feng > From: Richard Biener > Sent: Thursday, July 11, 2024 5:43 PM > To: Feng Xue OS; Richard Sandiford > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp > reduction with lane-reducing > > On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS
Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing
Hi, Richard, Let me explain some idea that has to be chosen for lane-reducing. The key complication is that these ops are associated with two kinds of vec_nums, one is number of effective vector stmts, which is used by partial vectorzation function such as vect_get_loop_mask. The other is number of total created vector stmts. Now we should make it aligned with normal op, in order to interoperate with normal op. Suppose expressions mixed with lane-reducing and normal as: temp = lane_reducing<16*char> + expr<4*int>; temp = cst<4*int> * lane_reducing<16*char>; If only generating effective vector stmt for lane_reducing, vector def/use between ops will never be matched, so extra pass-through copies are necessary. This is why we say "refit a lane-reducing to be a fake normal op". The requirement of two vec_stmts are independent of how we will implement SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor vect code to unify ncopies/vec_num computation and completely remove SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a a large task, and might be overkill for these lane-reducing patches. So I will keep it as before, and do not touch it as what I have done in this patch. Since one SLP_TREE_NUMBER_OF_VEC_STMTS could not be used for two purposes. The your previous suggestion might not be work: > As said, I don't like this much. vect_slp_analyze_node_operations_1 sets this > and I think the existing "exception" > > /* Calculate the number of vector statements to be created for the > scalar stmts in this node. For SLP reductions it is equal to the > number of vector statements in the children (which has already been > calculated by the recursive call). Otherwise it is the number of > scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by > VF divided by the number of elements in a vector. */ > if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > && !STMT_VINFO_DATA_REF (stmt_info) > && REDUC_GROUP_FIRST_ELEMENT (stmt_info)) >{ > for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i) >if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == > vect_internal_def) > { >SLP_TREE_NUMBER_OF_VEC_STMTS (node) > = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]); >break; > } >} > > could be changed (or amended if replacing doesn't work out) to > > if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > && STMT_VINFO_REDUC_IDX (stmt_info) > // do we have this always set? > && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info)) >{ > do the same as in else {} but using VECTYPE_IN >} > > Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN > when that's set instead of SLP_TREE_VECTYPE? As said having wrong > SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire. Then the alternative is to limit special handling related to the vec_num only inside vect_transform_reduction. Is that ok? Or any other suggestion? Thanks, Feng From: Richard Biener Sent: Thursday, July 11, 2024 5:43 PM To: Feng Xue OS; Richard Sandiford Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS wrote: > > Vector stmts number of an operation is calculated based on output vectype. > This is over-estimated for lane-reducing operation. Sometimes, to workaround > the issue, we have to rely on additional logic to deduce an exactly accurate > number by other means. 
Aiming at the inconvenience, in this patch, we would > "turn" lane-reducing operation into a normal one by inserting new trivial > statements like zero-valued PHIs and pass-through copies, which could be > optimized away by later passes. At the same time, a new field is added for > slp node to hold number of vector stmts that are really effective after > vectorization. For example: Adding Richard into the loop. I'm sorry, but this feels a bit backwards - in the end I was hoping that we can get rid of SLP_TREE_NUMBER_OF_VEC_STMTS completely. We do currently have the odd ncopies (non-SLP) vs. vec_num (SLP) duality but in reality all vectorizable_* should know the number of stmt copies (or output vector defs) to produce by looking at the vector type and the vectorization factor (and in the SLP case the number of lanes represented by the node). That means that in the end vectorizable_* could at transform time simply make sure that SLP_TREE_VEC_DEF is appropriately created (currently generic code does this based on SLP_TREE_NUMBER_OF_VEC_STMTS and also generic code tries to determine SLP_TREE_NUMBER_OF_VEC
[PATCH 4/4] vect: Optimize order of lane-reducing statements in loop def-use cycles
When transforming multiple lane-reducing operations in a loop reduction chain, originally, corresponding vectorized statements are generated into def-use cycles starting from 0. The def-use cycle with smaller index, would contain more statements, which means more instruction dependency. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad sum += n[i]; // normal } Original transformation result: for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy ... } For a higher instruction parallelism in final vectorized loop, an optimal means is to make those effective vector lane-reducing ops be distributed evenly among all def-use cycles. Transformed as the below, DOT_PROD, WIDEN_SUM and SADs are generated into disparate cycles, instruction dependency among them could be eliminated. for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = sum_v1; // copy sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2); sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3); ... } Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. (vect_transform_reduction): Add a new parameter of slp_instance type. * tree-vect-stmts.cc (vect_transform_stmt): Add a new argument slp_node_instance to vect_transform_reduction. * tree-vect-loop.cc (vect_transform_reduction): Add a new parameter slp_node_instance. Generate lane-reducing statements in an optimized order. --- gcc/tree-vect-loop.cc | 73 +++--- gcc/tree-vect-stmts.cc | 3 +- gcc/tree-vectorizer.h | 8 - 3 files changed, 71 insertions(+), 13 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a3374fb2d1a..841ef4c9120 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8673,7 +8673,8 @@ vect_emulate_mixed_dot_prod (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, bool vect_transform_reduction (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, gimple_stmt_iterator *gsi, - gimple **vec_stmt, slp_tree slp_node) + gimple **vec_stmt, slp_tree slp_node, + slp_instance slp_node_instance) { tree vectype_out = STMT_VINFO_VECTYPE (stmt_info); class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); @@ -8863,6 +8864,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad + sum += n[i]; // normal } The vector size is 128-bit,vectorization factor is 16. 
Reduction @@ -8880,25 +8882,30 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); - sum_v1 = sum_v1; // copy + sum_v0 = sum_v0; // copy + sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy + + sum_v0 += n_v0[i: 0 ~ 3 ]; + sum_v1 += n_v1[i: 4 ~ 7 ]; + sum_v2 += n_v2[i: 8 ~ 11]; + sum_v3 += n_v3[i: 12 ~ 15]; } - sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v1 - */ +Moreover, for a higher instruction pa
[PATCH 3/4] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current vectorizer could only handle the pattern if the reduction chain does not contain other operation, no matter the other is normal or lane-reducing. This patches removes some constraints in reduction analysis to allow multiple arbitrary lane-reducing operations with mixed input vectypes in a loop reduction chain. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad } The vector size is 128-bit vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 1 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + sum_v1 Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (vectorizable_lane_reducing): New function declaration. * tree-vect-stmts.cc (vect_analyze_stmt): Call new function vectorizable_lane_reducing to analyze lane-reducing operation. * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation code related to emulated_mixed_dot_prod. (vectorizable_lane_reducing): New function. (vectorizable_reduction): Allow multiple lane-reducing operations in loop reduction. Move some original lane-reducing related code to vectorizable_lane_reducing. (vect_transform_reduction): Extend transformation to support reduction statements with mixed input vectypes for non-slp code path. 
gcc/testsuite/ PR tree-optimization/114440 * gcc.dg/vect/vect-reduc-chain-1.c * gcc.dg/vect/vect-reduc-chain-2.c * gcc.dg/vect/vect-reduc-chain-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c * gcc.dg/vect/vect-reduc-dot-slp-1.c --- .../gcc.dg/vect/vect-reduc-chain-1.c | 64 .../gcc.dg/vect/vect-reduc-chain-2.c | 79 + .../gcc.dg/vect/vect-reduc-chain-3.c | 68 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 95 ++ .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 67 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c | 79 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c | 63 .../gcc.dg/vect/vect-reduc-dot-slp-1.c| 60 gcc/tree-vect-loop.cc | 285 +- gcc/tree-vect-stmts.cc| 2 + gcc/tree-vectorizer.h | 2 + 11 files changed, 785 insertions(+), 79 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c new file mode 100644 index 000..80b0089ea0f --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c @@ -0,0 +1,64 @@ +/* Disabling epilogues until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#define N 50 + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res, + SIGNEDNESS_2 char *restrict a, + SIGNEDNESS_2 char *restrict b, + SIGNEDNESS_2 char *restrict c, + SIGNEDNESS_2 char *restrict d, + SIGNEDNESS_1 int *restrict e) +{ + for (int i = 0; i < N; ++i) +
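A scalar loop of the kind this patch targets, with lane-reducing ops of different input element widths in one reduction chain, could look like the sketch below. It is illustrative only, not a committed testcase, and whether the absolute-difference form is actually recognized as a sad pattern depends on the target. With 128-bit vectors the dot-product reads vector(16) char while the absolute-difference sum reads vector(8) short, i.e. the mixed input vectypes described above.

#define N 64

int __attribute__ ((noipa))
chain (signed char *restrict d0, signed char *restrict d1,
       short *restrict s0, short *restrict s1)
{
  int sum = 0;
  for (int i = 0; i < N; i++)
    {
      sum += d0[i] * d1[i];              /* dot-prod: vector(16) char input */
      short diff = s0[i] - s1[i];
      sum += diff < 0 ? -diff : diff;    /* sad: vector(8) short input */
    }
  return sum;
}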
[PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing
Vector stmts number of an operation is calculated based on output vectype. This is over-estimated for lane-reducing operation. Sometimes, to workaround the issue, we have to rely on additional logic to deduce an exactly accurate number by other means. Aiming at the inconvenience, in this patch, we would "turn" lane-reducing operation into a normal one by inserting new trivial statements like zero-valued PHIs and pass-through copies, which could be optimized away by later passes. At the same time, a new field is added for slp node to hold number of vector stmts that are really effective after vectorization. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod } The vector size is 128-bit,vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 1 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 Thanks, Feng --- gcc/ * tree-vectorizer.h (vec_stmts_effec_size): New field in _slp_tree. (SLP_TREE_VEC_STMTS_EFFEC_NUM): New macro. (vect_get_num_vectors): New overload function. (vect_get_slp_num_vectors): New function. * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Use effective vector stmts number. (vectorizable_reduction): Compute number of effective vector stmts for lane-reducing op and reduction PHI. (vect_transform_reduction): Insert copies for lane-reducing so as to fix inaccurate vector stmts number. (vect_transform_cycle_phi): Only need to calculate vector PHI number based on input vectype for non-slp code path. * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize effective vector stmts number to zero. (vect_slp_analyze_node_operations_1): Remove adjustment on vector stmts number specific to slp reduction. (vect_slp_analyze_node_operations): Compute number of vector elements for constant/external slp node with vect_get_slp_num_vectors. --- gcc/tree-vect-loop.cc | 139 -- gcc/tree-vect-slp.cc | 56 ++--- gcc/tree-vectorizer.h | 45 ++ 3 files changed, 183 insertions(+), 57 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index c183e2b6068..5ad9836d6c8 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7471,7 +7471,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, unsigned nvectors; if (slp_node) - nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node); + nvectors = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node); else nvectors = vect_get_num_copies (loop_vinfo, vectype_in); @@ -7594,6 +7594,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo, stmt_vec_info phi_info = stmt_info; if (!is_a (stmt_info->stmt)) { + if (lane_reducing_stmt_p (stmt_info->stmt) && slp_node) + { + /* Compute number of effective vector statements for lane-reducing +ops. 
*/ + vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); + gcc_assert (vectype_in); + SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) + = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in); + } STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type; return true; } @@ -8012,14 +8021,25 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (STMT_VINFO_LIVE_P (phi_info)) return false; - if (slp_node) -ncopies = 1; - else -ncopies = vect_get_num_copies (loop_vinfo, vectype_in); + poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out); - gcc_assert (ncopies >= 1); + if (slp_node) +{ + ncopies = 1; - poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out); + if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype_in), nunits_out)) + { + /* Not all vector reduction PHIs would be used, compute number +of the effective statements. */ + SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) + = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in); + } +} + else +{ + ncopies = vect_get_num_copies (loop_vinfo, vectype_in); + gcc_assert (ncopies >= 1); +} if (nested_cycle) { @@ -8360,7 +8380,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, || (slp_node && !REDUC_GROUP_FIRST_ELEMENT (stmt_info) && SLP_TREE_LANES (slp_node) == 1 - && vect_get_num_copies (loop_vinfo, vectype_in) > 1)) + && SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) > 1)) && (STMT_VINFO_RELEVANT (stmt
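The worked numbers behind the new "effective" count, for the dot-product example above with 128-bit vectors and VF = 16, can be spelled out as follows; this is a plain-integer sketch, not GCC code.

#include <assert.h>

int
main (void)
{
  unsigned vf = 16;
  unsigned out_lanes = 4;    /* vector(4) int: output vectype  */
  unsigned in_lanes = 16;    /* vector(16) char: input vectype */

  unsigned total = vf / out_lanes;     /* 4 vector stmts counted from the
                                          output vectype, copies included */
  unsigned effective = vf / in_lanes;  /* 1 real DOT_PROD statement */

  assert (total == 4 && effective == 1);
  /* total - effective == 3 pass-through copies: sum_v1 .. sum_v3.  */
  return 0;
}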
[PATCH 1/4] vect: Shorten name of macro SLP_TREE_NUMBER_OF_VEC_STMTS
This patch series are recomposed and split from https://gcc.gnu.org/pipermail/gcc-patches/2024-June/655974.html. As I will add a new field tightly coupled with "vec_stmts_size", if following naming conversion as original, the new macro would be very long. So better to choose samely meaningful but shorter names, the patch makes change for this macro, the other new patch would handle the new field and macro accordingly as this. Thanks, Feng --- gcc/ * tree-vectorizer.h (SLP_TREE_NUMBER_OF_VEC_STMTS): Change the macro to SLP_TREE_VEC_STMTS_NUM. * tree-vect-stmts.cc (vect_model_simple_cost): Likewise. (check_load_store_for_partial_vectors): Likewise. (vectorizable_bswap): Likewise. (vectorizable_call): Likewise. (vectorizable_conversion): Likewise. (vectorizable_shift): Likewise. And replace direct field reference to "vec_stmts_size" with the new macro. (vectorizable_operation): Likewise. (vectorizable_store): Likewise. (vectorizable_load): Likewise. (vectorizable_condition): Likewise. * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Likewise. (vectorizable_reduction): Likewise. (vect_transform_reduction): Likewise. (vectorizable_phi): Likewise. (vectorizable_recurr): Likewise. (vectorizable_induction): Likewise. (vectorizable_live_operation): Likewise. * tree-vect-slp.cc (_slp_tree::_slp_tree): Likewise. (vect_slp_analyze_node_operations_1): Likewise. (vect_prologue_cost_for_slp): Likewise. (vect_slp_analyze_node_operations): Likewise. (vect_create_constant_vectors): Likewise. (vect_get_slp_vect_def): Likewise. (vect_transform_slp_perm_load_1): Likewise. (vectorizable_slp_permutation_1): Likewise. (vect_schedule_slp_node): Likewise. (vectorize_slp_instance_root_stmt): Likewise. --- gcc/tree-vect-loop.cc | 17 +++--- gcc/tree-vect-slp.cc | 34 +-- gcc/tree-vect-stmts.cc | 52 -- gcc/tree-vectorizer.h | 2 +- 4 files changed, 51 insertions(+), 54 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a64b5082bd1..c183e2b6068 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7471,7 +7471,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, unsigned nvectors; if (slp_node) - nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node); else nvectors = vect_get_num_copies (loop_vinfo, vectype_in); @@ -8121,7 +8121,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, || reduction_type == CONST_COND_REDUCTION || reduction_type == EXTRACT_LAST_REDUCTION) && slp_node - && SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) > 1) + && SLP_TREE_VEC_STMTS_NUM (slp_node) > 1) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -8600,7 +8600,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, if (slp_node) { ncopies = 1; - vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node); } else { @@ -9196,7 +9196,7 @@ vectorizable_phi (vec_info *, for the scalar and the vector PHIs. This avoids artificially favoring the vector path (but may pessimize it in some cases). 
*/ if (gimple_phi_num_args (as_a (stmt_info->stmt)) > 1) - record_stmt_cost (cost_vec, SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node), + record_stmt_cost (cost_vec, SLP_TREE_VEC_STMTS_NUM (slp_node), vector_stmt, stmt_info, vectype, 0, vect_body); STMT_VINFO_TYPE (stmt_info) = phi_info_type; return true; @@ -9304,7 +9304,7 @@ vectorizable_recurr (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, tree vectype = STMT_VINFO_VECTYPE (stmt_info); unsigned ncopies; if (slp_node) -ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); +ncopies = SLP_TREE_VEC_STMTS_NUM (slp_node); else ncopies = vect_get_num_copies (loop_vinfo, vectype); poly_int64 nunits = TYPE_VECTOR_SUBPARTS (vectype); @@ -10217,8 +10217,7 @@ vectorizable_induction (loop_vec_info loop_vinfo, } /* loop cost for vec_loop. */ inside_cost - = record_stmt_cost (cost_vec, - SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node), + = record_stmt_cost (cost_vec, SLP_TREE_VEC_STMTS_NUM (slp_node), vector_stmt, stmt_info, 0, vect_body); /* prologue cost for vec_init (if not nested) and step. */ prologue_cost = record_stmt_cost (cost_vec, 1 + !nested_in_vect_loop, @@ -10289,7 +10288,7 @@ vectorizable_induction (loop_vec_info loop_vinfo, } /*
Re: [PATCH] vect: Fix shift-by-induction for single-lane slp
I added two test cases for the examples your mentioned. BTW: would you please look over another 3 lane-reducing patches that have been updated? If ok, I would consider to check them in. Thanks, Feng -- Allow shift-by-induction for slp node, when it is single lane, which is aligned with the original loop-based handling. gcc/ * tree-vect-stmts.cc (vectorizable_shift): Allow shift-by-induction for single-lane slp node. gcc/testsuite/ * gcc.dg/vect/vect-shift-6.c * gcc.dg/vect/vect-shift-7.c --- gcc/testsuite/gcc.dg/vect/vect-shift-6.c | 51 +++ gcc/testsuite/gcc.dg/vect/vect-shift-7.c | 65 gcc/tree-vect-stmts.cc | 2 +- 3 files changed, 117 insertions(+), 1 deletion(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-shift-6.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-shift-7.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-shift-6.c b/gcc/testsuite/gcc.dg/vect/vect-shift-6.c new file mode 100644 index 000..940f7f2a4db --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-shift-6.c @@ -0,0 +1,51 @@ +/* { dg-require-effective-target vect_shift } */ +/* { dg-require-effective-target vect_int } */ + +#include +#include "tree-vect.h" + +#define N 32 + +int A[N]; +int B[N]; + +#define FN(name) \ +__attribute__((noipa)) \ +void name(int *a) \ +{ \ + for (int i = 0; i < N / 2; i++) \ +{ \ + a[2 * i + 0] <<= i; \ + a[2 * i + 1] <<= i; \ +} \ +} + + +FN(foo_vec) + +#pragma GCC push_options +#pragma GCC optimize ("O0") +FN(foo_novec) +#pragma GCC pop_options + +int main () +{ + int i; + + check_vect (); + +#pragma GCC novector + for (i = 0; i < N; i++) +A[i] = B[i] = -(i + 1); + + foo_vec(A); + foo_novec(B); + + /* check results: */ +#pragma GCC novector + for (i = 0; i < N; i++) +if (A[i] != B[i]) + abort (); + + return 0; +} diff --git a/gcc/testsuite/gcc.dg/vect/vect-shift-7.c b/gcc/testsuite/gcc.dg/vect/vect-shift-7.c new file mode 100644 index 000..a33b120343b --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-shift-7.c @@ -0,0 +1,65 @@ +/* { dg-require-effective-target vect_shift } */ +/* { dg-require-effective-target vect_int } */ + +#include +#include "tree-vect.h" + +#define N 32 +#define M 64 + +int A[N]; +int B[N]; + +#define FN(name) \ +__attribute__((noipa)) \ +void name(int *a) \ +{ \ + for (int i = 0; i < N / 2; i++) \ +{ \ + int s1 = i; \ + int s2 = s1 + 1; \ + int r1 = 0; \ + int r2 = 1; \ + \ + for (int j = 0; j < M; j++) \ + { \ +r1 += j << s1; \ +r2 += j << s2; \ +s1++; \ +s2++; \ + } \ + \ + a[2 * i + 0] = r1; \ + a[2 * i + 1] = r2; \ +} \ +} + + +FN(foo_vec) + +#pragma GCC push_options +#pragma GCC optimize ("O0") +FN(foo_novec) +#pragma GCC pop_options + +int main () +{ + int i; + + check_vect (); + +#pragma GCC novector + for (i = 0; i < N; i++) +A[i] = B[i] = 0; + + foo_vec(A); + foo_novec(B); + + /* check results: */ +#pragma GCC novector + for (i = 0; i < N; i++) +if (A[i] != B[i]) + abort (); + + return 0; +} diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index ca6052662a3..840e162c7f0 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -6247,7 +6247,7 @@ vectorizable_shift (vec_info *vinfo, if ((dt[1] == vect_internal_def || dt[1] == vect_induction_def || dt[1] == vect_nested_cycle) - && !slp_node) + && (!slp_node || SLP_TREE_LANES (slp_node) == 1)) scalar_shift_arg = false; else if (dt[1] == vect_constant_def || dt[1] == vect_external_def -- 2.17.1 ________ From: Richard Biener Sent: Thursday, June 27, 2024 12:49 AM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] vect: Fix shift-by-induction for single-lane slp On Wed, Jun 
26, 2024 at 4:58 PM Feng Xue OS wrote: > > Allow shift-by-induction for slp node, when it is single lane, which is > aligned with the original loop-based handling. OK. Did you try whether we handle multiple lanes correctly? The simplest case would be a loop body with say a[2*i] = x << i; a[2*i+1] = x << i; I'm not sure how we match up multiple (different) inductions in the sam
[PATCH] vect: Fix shift-by-induction for single-lane slp
Allow shift-by-induction for an slp node when it is single-lane, which is aligned with the original loop-based handling. Thanks, Feng --- gcc/tree-vect-stmts.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index ca6052662a3..840e162c7f0 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -6247,7 +6247,7 @@ vectorizable_shift (vec_info *vinfo, if ((dt[1] == vect_internal_def || dt[1] == vect_induction_def || dt[1] == vect_nested_cycle) - && !slp_node) + && (!slp_node || SLP_TREE_LANES (slp_node) == 1)) scalar_shift_arg = false; else if (dt[1] == vect_constant_def || dt[1] == vect_external_def -- 2.17.1
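A minimal shift-by-induction loop that this change lets the single-lane SLP path handle could look like the sketch below. It is illustrative only and distinct from the vect-shift-6/7 testcases posted in the follow-up; vectorization still requires a target with vector-by-vector shifts, and unsigned elements are used just to keep the shifts well defined.

#define N 16

void __attribute__ ((noipa))
shift_by_iv (unsigned int *restrict a, unsigned int *restrict b)
{
  /* The shift amount is the loop induction variable, so after
     vectorization it becomes a vector rather than a scalar shift
     operand.  */
  for (int i = 0; i < N; i++)
    a[i] = b[i] << i;
}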
Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles
unsigned k = j - 1; + std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]); + gcc_assert (!vec_oprnds[i][k]); + } + } + } } } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 94736736dcc..64c6571a293 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1402,6 +1402,12 @@ public: /* The vector type for performing the actual reduction. */ tree reduc_vectype; + /* For loop reduction with multiple vectorized results (ncopies > 1), a + lane-reducing operation participating in it may not use all of those + results, this field specifies result index starting from which any + following land-reducing operation would be assigned to. */ + unsigned int reduc_result_pos; + /* If IS_REDUC_INFO is true and if the vector code is performing N scalar reductions in parallel, this variable gives the initial scalar values of those N reductions. */ -- 2.17.1 ____________ From: Feng Xue OS Sent: Thursday, June 20, 2024 2:02 PM To: Richard Biener Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles This patch was updated with some new change. When transforming multiple lane-reducing operations in a loop reduction chain, originally, corresponding vectorized statements are generated into def-use cycles starting from 0. The def-use cycle with smaller index, would contain more statements, which means more instruction dependency. For example: int sum = 0; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad } Original transformation result: for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } For a higher instruction parallelism in final vectorized loop, an optimal means is to make those effective vectorized lane-reducing statements be distributed evenly among all def-use cycles. Transformed as the below, DOT_PROD, WIDEN_SUM and SADs are generated into disparate cycles, instruction dependency could be eliminated. for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = sum_v1; // copy sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2); sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3); } 2024-03-22 Feng Xue gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing statements in an optimized order. 
--- gcc/tree-vect-loop.cc | 43 +++ gcc/tree-vectorizer.h | 6 ++ 2 files changed, 45 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 5a27a2c3d9c..adee54350d4 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8821,9 +8821,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy sum_v0 += n_v0[i: 0 ~ 3 ]; @@ -8831,7 +8831,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } - */ + +Moreover, for a higher instruction parallelism in final vectorized +loop, it is considered to make those effective vectorized lane- +reducing statements be distributed evenly among all def-use cycles. +In the above example, SADs are generated into other cycles rather +than that of DOT_PROD. */
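The round-robin placement that reduc_result_pos implements can be pictured with a small standalone sketch, plain C rather than GCC internals; the names are made up, and the real code additionally reorders the operand vectors, which is omitted here. Each lane-reducing op occupies as many consecutive def-use cycles as it has effective vector statements, starting at the current position, and the position then advances modulo the number of cycles, reproducing the assignment shown in the example above.

#include <stdio.h>

int
main (void)
{
  const unsigned reduc_ncopies = 4;            /* four vector(4) int cycles */
  const unsigned effective[3] = { 1, 1, 2 };   /* dot-prod, widen-sum, sad */
  const char *name[3] = { "DOT_PROD", "WIDEN_SUM", "SAD" };
  unsigned pos = 0;                            /* models reduc_result_pos */

  for (unsigned i = 0; i < 3; i++)
    {
      printf ("%s -> cycle(s)", name[i]);
      for (unsigned j = 0; j < effective[i]; j++)
        printf (" %u", (pos + j) % reduc_ncopies);
      printf ("\n");
      pos = (pos + effective[i]) % reduc_ncopies;
    }
  /* Prints: DOT_PROD -> cycle(s) 0
             WIDEN_SUM -> cycle(s) 1
             SAD -> cycle(s) 2 3  */
  return 0;
}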
Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
_v2 += n_v2[i: 8 ~ 11]; + sum_v3 += n_v3[i: 12 ~ 15]; +} + */ + unsigned using_ncopies = vec_oprnds[0].length (); + unsigned reduc_ncopies = vec_oprnds[reduc_index].length (); + + gcc_assert (using_ncopies <= reduc_ncopies); + + if (using_ncopies < reduc_ncopies) + { + for (unsigned i = 0; i < op.num_ops - 1; i++) + { + gcc_assert (vec_oprnds[i].length () == using_ncopies); + vec_oprnds[i].safe_grow_cleared (reduc_ncopies); + } + } +} bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); @@ -8706,7 +8874,18 @@ vect_transform_reduction (loop_vec_info loop_vinfo, { gimple *new_stmt; tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; - if (masked_loop_p && !mask_by_cond_expr) + + if (!vop[0] || !vop[1]) + { + tree reduc_vop = vec_oprnds[reduc_index][i]; + + /* Insert trivial copy if no need to generate vectorized +statement. */ + gcc_assert (reduc_vop); + + new_stmt = SSA_NAME_DEF_STMT (reduc_vop); + } + else if (masked_loop_p && !mask_by_cond_expr) { /* No conditional ifns have been defined for lane-reducing op yet. */ @@ -8735,8 +8914,22 @@ vect_transform_reduction (loop_vec_info loop_vinfo, if (masked_loop_p && mask_by_cond_expr) { + unsigned nvectors = vec_num * ncopies; + tree stmt_vectype_in = vectype_in; + + /* For single-lane slp node on lane-reducing op, we need to +compute exact number of vector stmts from its input vectype, +since the value got from the slp node is over-estimated. +TODO: properly set the number this somewhere, so that this +fixup could be removed. */ + if (lane_reducing && SLP_TREE_LANES (slp_node) == 1) + { + stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); + nvectors = vect_get_num_copies (loop_vinfo, vectype_in); + } + tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, - vec_num * ncopies, vectype_in, i); + nvectors, stmt_vectype_in, i); build_vect_cond_expr (code, vop, mask, gsi); } diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 840e162c7f0..845647b4399 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo, NULL, NULL, node, cost_vec) || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec) || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec) + || vectorizable_lane_reducing (as_a (vinfo), +stmt_info, node, cost_vec) || vectorizable_reduction (as_a (vinfo), stmt_info, node, node_instance, cost_vec) || vectorizable_induction (as_a (vinfo), stmt_info, diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 60224f4e284..94736736dcc 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *, extern bool vectorizable_live_operation (vec_info *, stmt_vec_info, slp_tree, slp_instance, int, bool, stmt_vector_for_cost *); +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info, + slp_tree, stmt_vector_for_cost *); extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info, slp_tree, slp_instance, stmt_vector_for_cost *); -- 2.17.1 From: Feng Xue OS Sent: Tuesday, June 25, 2024 5:32 PM To: Richard Biener Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440] >> >> >> - if (slp_node) >> >> + if (slp_node && SLP_TREE_LANES (slp_node) > 1) >> > >> > Hmm, that looks wrong. 
It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off >> > instead, which is bad. >> > >> >> nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); >> >>else >> >> nvectors = vect_get_num_copies (loop_vinfo, vectype_in); >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage >> >> (loop_vec_info lo
Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
ype_in; - - /* Each lane-reducing operation has its own input vectype, while reduction - PHI records the input vectype with least lanes. */ - if (lane_reducing) -STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in; enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info); STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type; -- 2.17.1 From: Feng Xue OS Sent: Thursday, June 20, 2024 1:47 PM To: Richard Biener Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing >> + if (lane_reducing_op_p (op.code)) >> + { >> + unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0; >> + tree op_type = TREE_TYPE (op.ops[0]); >> + tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo, >> +op_type, >> +group_size); > > I think doing it this way does not adhere to the vector type size constraint > with loop vectorization. You should use vect_is_simple_use like the > original code did as the actual vector definition determines the vector type > used. OK, though this might be wordy. Actually, STMT_VINFO_REDUC_VECTYPE_IN is logically equivalent to nunits_vectype that is determined in vect_determine_vf_for_stmt_1(). So how about setting the type in this function? > > You are always using op.ops[0] here - I think that works because > reduc_idx is the last operand of all lane-reducing ops. But then > we should assert reduc_idx != 0 here and add a comment. Already added in the following assertion. >> + >> + /* The last operand of lane-reducing operation is for >> +reduction. */ >> + gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - >> 1); ^^ >> + >> + /* For lane-reducing operation vectorizable analysis needs the >> +reduction PHI information */ >> + STMT_VINFO_REDUC_DEF (def) = phi_info; >> + >> + if (!new_vectype_in) >> + return false; >> + >> + /* Each lane-reducing operation has its own input vectype, >> while >> +reduction PHI will record the input vectype with the least >> +lanes. */ >> + STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in; >> + >> + /* To accommodate lane-reducing operations of mixed input >> +vectypes, choose input vectype with the least lanes for the >> +reduction PHI statement, which would result in the most >> +ncopies for vectorized reduction results. */ >> + if (!vectype_in >> + || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE >> (vectype_in))) >> + < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type >> + vectype_in = new_vectype_in; > > I know this is a fragile area but I always wonder since the accumulating > operand > is the largest (all lane-reducing ops are widening), and that will be > equal to the > type of the PHI node, how this condition can be ever true. In the original code, accumulating operand is skipped! While it is correctly, we should not count the operand, this is why we call operation lane-reducing. > > ncopies is determined by the VF, so the comment is at least misleading. > >> + } >> + else >> + vectype_in = STMT_VINFO_VECTYPE (phi_info); > > Please initialize vectype_in from phi_info before the loop (that > should never be NULL). > May not, as the below explanation. > I'll note that with your patch it seems we'd initialize vectype_in to > the biggest > non-accumulation vector type involved in lane-reducing ops but the > accumulating > type might still be larger. Why, when we have multiple lane-reducing > ops, would > we chose the largest input here? 
I see we eventually do > > if (slp_node) > ncopies = 1; > else > ncopies = vect_get_num_copies (loop_vinfo, vectype_in); > > but then IIRC we always force a single cycle def for lane-reducing ops(?). > In particular for vect_transform_reduction and SLP we rely on > SLP_TREE_NUMBER_OF_VEC_STMTS while non-SLP uses > STMT_VINFO_REDUC_VECTYPE_IN. > > So I wonder what breaks when we set vectype_in = vector type of PHI? > Yes. It is right, nothing is broken. Suppose that a loop contains three dot_prods, two are <16 * char>, on
Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
>> >> >> - if (slp_node) >> >> + if (slp_node && SLP_TREE_LANES (slp_node) > 1) >> > >> > Hmm, that looks wrong. It looks like SLP_TREE_NUMBER_OF_VEC_STMTS is off >> > instead, which is bad. >> > >> >> nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); >> >>else >> >> nvectors = vect_get_num_copies (loop_vinfo, vectype_in); >> >> @@ -7478,6 +7472,152 @@ vect_reduction_update_partial_vector_usage >> >> (loop_vec_info loop_vinfo, >> >> } >> >> } >> >> >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be >> >> vectorized in >> >> + the context of LOOP_VINFO, and vector cost will be recorded in >> >> COST_VEC. >> >> + Now there are three such kinds of operations: dot-prod/widen-sum/sad >> >> + (sum-of-absolute-differences). >> >> + >> >> + For a lane-reducing operation, the loop reduction path that it lies >> >> in, >> >> + may contain normal operation, or other lane-reducing operation of >> >> different >> >> + input type size, an example as: >> >> + >> >> + int sum = 0; >> >> + for (i) >> >> + { >> >> + ... >> >> + sum += d0[i] * d1[i]; // dot-prod >> >> + sum += w[i];// widen-sum >> >> + sum += abs(s0[i] - s1[i]); // sad >> >> + sum += n[i];// normal >> >> + ... >> >> + } >> >> + >> >> + Vectorization factor is essentially determined by operation whose >> >> input >> >> + vectype has the most lanes ("vector(16) char" in the example), while >> >> we >> >> + need to choose input vectype with the least lanes ("vector(4) int" in >> >> the >> >> + example) for the reduction PHI statement. */ >> >> + >> >> +bool >> >> +vectorizable_lane_reducing (loop_vec_info loop_vinfo, stmt_vec_info >> >> stmt_info, >> >> + slp_tree slp_node, stmt_vector_for_cost >> >> *cost_vec) >> >> +{ >> >> + gimple *stmt = stmt_info->stmt; >> >> + >> >> + if (!lane_reducing_stmt_p (stmt)) >> >> +return false; >> >> + >> >> + tree type = TREE_TYPE (gimple_assign_lhs (stmt)); >> >> + >> >> + if (!INTEGRAL_TYPE_P (type) && !SCALAR_FLOAT_TYPE_P (type)) >> >> +return false; >> >> + >> >> + /* Do not try to vectorize bit-precision reductions. */ >> >> + if (!type_has_mode_precision_p (type)) >> >> +return false; >> >> + >> >> + if (!slp_node) >> >> +return false; >> >> + >> >> + for (int i = 0; i < (int) gimple_num_ops (stmt) - 1; i++) >> >> +{ >> >> + stmt_vec_info def_stmt_info; >> >> + slp_tree slp_op; >> >> + tree op; >> >> + tree vectype; >> >> + enum vect_def_type dt; >> >> + >> >> + if (!vect_is_simple_use (loop_vinfo, stmt_info, slp_node, i, &op, >> >> + &slp_op, &dt, &vectype, &def_stmt_info)) >> >> + { >> >> + if (dump_enabled_p ()) >> >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >> >> +"use not simple.\n"); >> >> + return false; >> >> + } >> >> + >> >> + if (!vectype) >> >> + { >> >> + vectype = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE >> >> (op), >> >> +slp_op); >> >> + if (!vectype) >> >> + return false; >> >> + } >> >> + >> >> + if (!vect_maybe_update_slp_op_vectype (slp_op, vectype)) >> >> + { >> >> + if (dump_enabled_p ()) >> >> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >> >> +"incompatible vector types for >> >> invariants\n"); >> >> + return false; >> >> + } >> >> + >> >> + if (i == STMT_VINFO_REDUC_IDX (stmt_info)) >> >> + continue; >> >> + >> >> + /* There should be at most one cycle def in the stmt. 
*/ >> >> + if (VECTORIZABLE_CYCLE_DEF (dt)) >> >> + return false; >> >> +} >> >> + >> >> + stmt_vec_info reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt >> >> (stmt_info)); >> >> + >> >> + /* TODO: Support lane-reducing operation that does not directly >> >> participate >> >> + in loop reduction. */ >> >> + if (!reduc_info || STMT_VINFO_REDUC_IDX (stmt_info) < 0) >> >> +return false; >> >> + >> >> + /* Lane-reducing pattern inside any inner loop of LOOP_VINFO is not >> >> + recoginized. */ >> >> + gcc_assert (STMT_VINFO_DEF_TYPE (reduc_info) == vect_reduction_def); >> >> + gcc_assert (STMT_VINFO_REDUC_TYPE (reduc_info) == TREE_CODE_REDUCTION); >> >> + >> >> + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); >> >> + int ncopies_for_cost; >> >> + >> >> + if (SLP_TREE_LANES (slp_node) > 1) >> >> +{ >> >> + /* Now lane-reducing operations in a non-single-lane slp node >> >> should only >> >> +come from the same loop reduction path. */ >> >> + gcc_assert (REDUC_GROUP_FIRST_ELEMENT (stmt_info)); >> >> + ncopies_for_cost = 1; >> >> +} >> >> + else >> >> +
Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
; >> + > > assert reduc_ncopies >= using_ncopies? Maybe assert > reduc_index == op.num_ops - 1 given you use one above > and the other below? Or simply iterate till op.num_ops > and sip i == reduc_index. > >> + for (unsigned i = 0; i < op.num_ops - 1; i++) >> + { >> + gcc_assert (vec_oprnds[i].length () == using_ncopies); >> + vec_oprnds[i].safe_grow_cleared (reduc_ncopies); >> + } >> +} >> >> bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod >> (stmt_info); >>unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); >> @@ -8697,7 +8856,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo, >> { >>gimple *new_stmt; >>tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; >> - if (masked_loop_p && !mask_by_cond_expr) >> + >> + if (!vop[0] || !vop[1]) >> + { >> + tree reduc_vop = vec_oprnds[reduc_index][i]; >> + >> + /* Insert trivial copy if no need to generate vectorized >> +statement. */ >> + gcc_assert (reduc_vop); >> + >> + new_stmt = gimple_build_assign (vec_dest, reduc_vop); >> + new_temp = make_ssa_name (vec_dest, new_stmt); >> + gimple_set_lhs (new_stmt, new_temp); >> + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); > > I think you could simply do > >slp_node->push_vec_def (reduc_vop); >continue; > > without any code generation. > OK, that would be easy. Here comes another question, this patch assumes lane-reducing op would always be contained in a slp node, since single-lane slp node feature has been enabled. But I got some regression if I enforced such constraint on lane-reducing op check. Those cases are founded to be unvectorizable with single-lane slp, so this should not be what we want? and need to be fixed? >> + } >> + else if (masked_loop_p && !mask_by_cond_expr) >> { >> /* No conditional ifns have been defined for lane-reducing op >> yet. */ >> @@ -8726,8 +8899,19 @@ vect_transform_reduction (loop_vec_info loop_vinfo, >> >> if (masked_loop_p && mask_by_cond_expr) >> { >> + tree stmt_vectype_in = vectype_in; >> + unsigned nvectors = vec_num * ncopies; >> + >> + if (lane_reducing && SLP_TREE_LANES (slp_node) == 1) >> + { >> + /* Input vectype of the reduction PHI may be defferent from > > different > >> +that of lane-reducing operation. */ >> + stmt_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); >> + nvectors = vect_get_num_copies (loop_vinfo, >> stmt_vectype_in); > > I think this again points to a wrong SLP_TREE_NUMBER_OF_VEC_STMTS. To partially vectorizing a dot_prod<16 * char> with 128-bit vector width, we should pass (nvector=4, vectype=<4 *int>) instead of (nvector=1, vectype=<16 *char>) to vect_get_loop_mask? Thanks, Feng From: Richard Biener Sent: Thursday, June 20, 2024 8:26 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440] On Sun, Jun 16, 2024 at 9:31?AM Feng Xue OS wrote: > > For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current > vectorizer could only handle the pattern if the reduction chain does not > contain other operation, no matter the other is normal or lane-reducing. > > Actually, to allow multiple arbitrary lane-reducing operations, we need to > support vectorization of loop reduction chain with mixed input vectypes. Since > lanes of vectype may vary with operation, the effective ncopies of vectorized > statements for operation also may not be same to each other, this causes > mismatch on vectorized def-use cycles. 
A simple way is to align all operations > with the one that has the most ncopies, the gap could be complemented by > generating extra trivial pass-through copies. For example: > >int sum = 0; >for (i) > { >sum += d0[i] * d1[i]; // dot-prod >sum += w[i]; // widen-sum >sum += abs(s0[i] - s1[i]); // sad >sum += n[i]; // normal > } > > The vector size is 128-bit vectorization factor is 16. Reduction statements > would be transformed as: > >vector<4> int sum_v0 = { 0, 0, 0, 0 }; >vector<4> i
Re: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles
s; j > start; j--) + { + unsigned k = j - 1; + std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]); + gcc_assert (!vec_oprnds[i][k]); + } + } + } } } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 94736736dcc..64c6571a293 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1402,6 +1402,12 @@ public: /* The vector type for performing the actual reduction. */ tree reduc_vectype; + /* For loop reduction with multiple vectorized results (ncopies > 1), a + lane-reducing operation participating in it may not use all of those + results, this field specifies result index starting from which any + following land-reducing operation would be assigned to. */ + unsigned int reduc_result_pos; + /* If IS_REDUC_INFO is true and if the vector code is performing N scalar reductions in parallel, this variable gives the initial scalar values of those N reductions. */ -- 2.17.1 ____________ From: Feng Xue OS Sent: Sunday, June 16, 2024 3:32 PM To: Richard Biener Cc: gcc-patches@gcc.gnu.org Subject: [PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles When transforming multiple lane-reducing operations in a loop reduction chain, originally, corresponding vectorized statements are generated into def-use cycles starting from 0. The def-use cycle with smaller index, would contain more statements, which means more instruction dependency. For example: int sum = 0; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad } Original transformation result: for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } For a higher instruction parallelism in final vectorized loop, an optimal means is to make those effective vectorized lane-reducing statements be distributed evenly among all def-use cycles. Transformed as the below, DOT_PROD, WIDEN_SUM and SADs are generated into disparate cycles, instruction dependency could be eliminated. Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing statements in an optimized order. 
--- gcc/tree-vect-loop.cc | 39 +++ gcc/tree-vectorizer.h | 6 ++ 2 files changed, 41 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 6d91665a341..c7e13d655d8 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy sum_v0 += n_v0[i: 0 ~ 3 ]; @@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } - */ + +Moreover, for a higher instruction parallelism in final vectorized +loop, it is considered to make those effective vectorized lane- +reducing statements be distributed evenly among all def-use cycles. +In the above example, SADs are generated into other cycles rather +than that of DOT_PROD. */ unsigned using_ncopies = vec_oprnds[0].length (); unsigned reduc_ncopies = vec_oprnds[reduc_index].length (); + unsigned result_pos = reduc_info->reduc_result_pos; + + reduc_info->reduc_result_pos + = (result_pos + using_ncopies) % reduc_ncopies; + gcc_assert (result_pos < reduc_ncopies); for (unsigned i = 0; i < op.num_ops - 1; i++) { gcc_assert (vec_oprnds[i].length () == using_ncopies); vec_oprnds[i].safe_grow_cleared (reduc_ncopies); + + /* Find suitable def-use cycles
Re: [PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
py + sum_v2 = sum_v2; // copy + sum_v3 = sum_v3; // copy + + sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); + sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); + sum_v2 = sum_v2; // copy + sum_v3 = sum_v3; // copy + + sum_v0 += n_v0[i: 0 ~ 3 ]; + sum_v1 += n_v1[i: 4 ~ 7 ]; + sum_v2 += n_v2[i: 8 ~ 11]; + sum_v3 += n_v3[i: 12 ~ 15]; +} + */ + tree phi_vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); + unsigned all_ncopies = vect_get_num_copies (loop_vinfo, phi_vectype_in); + unsigned use_ncopies = vec_oprnds[0].length (); + + if (use_ncopies < all_ncopies) + { + if (!slp_node) + { + tree reduc_oprnd = op.ops[reduc_index]; + + vec_oprnds[reduc_index].truncate (0); + vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, +all_ncopies, reduc_oprnd, +&vec_oprnds[reduc_index]); + } + else + gcc_assert (all_ncopies == vec_oprnds[reduc_index].length ()); + + for (unsigned i = 0; i < op.num_ops - 1; i++) + { + gcc_assert (vec_oprnds[i].length () == use_ncopies); + vec_oprnds[i].safe_grow_cleared (all_ncopies); + } + } +} bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); @@ -8699,7 +8865,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo, { gimple *new_stmt; tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; - if (masked_loop_p && !mask_by_cond_expr) + + if (!vop[0] || !vop[1]) + { + tree reduc_vop = vec_oprnds[reduc_index][i]; + + /* Insert trivial copy if no need to generate vectorized +statement. */ + gcc_assert (reduc_vop); + + new_stmt = gimple_build_assign (vec_dest, reduc_vop); + new_temp = make_ssa_name (vec_dest, new_stmt); + gimple_set_lhs (new_stmt, new_temp); + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, gsi); + } + else if (masked_loop_p && !mask_by_cond_expr) { /* No conditional ifns have been defined for lane-reducing op yet. */ @@ -8728,8 +8908,16 @@ vect_transform_reduction (loop_vec_info loop_vinfo, if (masked_loop_p && mask_by_cond_expr) { + unsigned nvectors = vec_num * ncopies; + + /* For single-lane slp node on lane-reducing op, we need to +compute exact number of vector stmts from its input vectype, +since the value got from the slp node is over-estimated. 
*/ + if (lane_reducing && slp_node && SLP_TREE_LANES (slp_node) == 1) + nvectors = vect_get_num_copies (loop_vinfo, vectype_in); + tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, - vec_num * ncopies, vectype_in, i); + nvectors, vectype_in, i); build_vect_cond_expr (code, vop, mask, gsi); } diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index ca6052662a3..1b73ef01ade 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -13350,6 +13350,8 @@ vect_analyze_stmt (vec_info *vinfo, NULL, NULL, node, cost_vec) || vectorizable_load (vinfo, stmt_info, NULL, NULL, node, cost_vec) || vectorizable_store (vinfo, stmt_info, NULL, NULL, node, cost_vec) + || vectorizable_lane_reducing (as_a (vinfo), +stmt_info, node, cost_vec) || vectorizable_reduction (as_a (vinfo), stmt_info, node, node_instance, cost_vec) || vectorizable_induction (as_a (vinfo), stmt_info, diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 60224f4e284..94736736dcc 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2455,6 +2455,8 @@ extern loop_vec_info vect_create_loop_vinfo (class loop *, vec_info_shared *, extern bool vectorizable_live_operation (vec_info *, stmt_vec_info, slp_tree, slp_instance, int, bool, stmt_vector_for_cost *); +extern bool vectorizable_lane_reducing (loop_vec_info, stmt_vec_info, + slp_tree, stmt_vector_for_cost *); extern bool vectorizable_reduction (loop_vec_info, stmt_vec_info, slp_tree, slp_instance, stmt_vector_for_cost *); -- 2.17.1 From: F
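One detail that ties the hunks above together: vec::safe_grow_cleared zero-fills the newly added operand slots, and it is exactly those NULL slots that the later "!vop[0] || !vop[1]" test turns into trivial pass-through copies (or, as suggested elsewhere in the thread, into a plain push_vec_def of the reduction operand with no code generation). A reduced picture for a lane-reducing op with one effective copy out of four def-use cycles (the def0/acc0 names are illustrative only):

  vec_oprnds[0]            = { def0, NULL, NULL, NULL }   after safe_grow_cleared
  vec_oprnds[reduc_index]  = { acc0, acc1, acc2, acc3 }
  generated per cycle      :  op (def0, acc0), copy acc1, copy acc2, copy acc3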
Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
>(short_c0_hi, short_c1_hi, sum_v1); sum_v2 = sum_v2; sum_v3 = sum_v3; } The def/use cycles (sum_v2 and sum_v3> would be optimized away finally. Then this gets same result as setting vectype_in to <8 * short>. With the patch #8, we get: vector<4> int sum_v0 = { 0, 0, 0, 0 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; loop () { sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0); sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1); sum_v2 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v2); sum_v3 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v3); } All dot_prods are assigned to separate def/use cycles, and no dependency. More def/use cycles, higher instruction parallelism, but there need extra cost in epilogue to combine the result. So we consider a somewhat compact def/use layout similar to single-defuse-cycle, in which two <16 * char> dot_prods are independent, and cycle 2 and 3 are not used, and this is better than the 1st scheme. vector<4> int sum_v0 = { 0, 0, 0, 0 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; loop () { sum_v0 = dot_prod<16 * char>(char_a0, char_a1, sum_v0); sum_v1 = dot_prod<16 * char>(char_b0, char_b1, sum_v1); sum_v0 = dot_prod<8 * short>(short_c0_lo, short_c1_lo, sum_v0); sum_v1 = dot_prod<8 * short>(short_c0_hi, short_c1_hi, sum_v1); } For this purpose, we need to track the vectype_in that results in the most ncopies, for this case, the type is <8 * short>. BTW: would you please also take a look at patch #7 and #8? Thanks, Feng From: Richard Biener Sent: Wednesday, June 19, 2024 9:01 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH 4/8] vect: Determine input vectype for multiple lane-reducing On Sun, Jun 16, 2024 at 9:25 AM Feng Xue OS wrote: > > The input vectype of reduction PHI statement must be determined before > vect cost computation for the reduction. Since lance-reducing operation has > different input vectype from normal one, so we need to traverse all reduction > statements to find out the input vectype with the least lanes, and set that to > the PHI statement. > > Thanks, > Feng > > --- > gcc/ > * tree-vect-loop.cc (vectorizable_reduction): Determine input vectype > during traversal of reduction statements. > --- > gcc/tree-vect-loop.cc | 72 +-- > 1 file changed, 49 insertions(+), 23 deletions(-) > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 0f7b125e72d..39aa5cb1197 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > { >stmt_vec_info def = loop_vinfo->lookup_def (reduc_def); >stmt_vec_info vdef = vect_stmt_to_vectorize (def); > - if (STMT_VINFO_REDUC_IDX (vdef) == -1) > + int reduc_idx = STMT_VINFO_REDUC_IDX (vdef); > + > + if (reduc_idx == -1) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -7686,10 +7688,50 @@ vectorizable_reduction (loop_vec_info loop_vinfo, > return false; > } > } > - else if (!stmt_info) > - /* First non-conversion stmt. */ > - stmt_info = vdef; > - reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)]; > + else > + { > + /* First non-conversion stmt. */ > + if (!stmt_info) > + stmt_info = vdef; > + > + if (lane_reducing_op_p (op.code)) > + { > + unsigned group_size = slp_node ? 
SLP_TREE_LANES (slp_node) : 0; > + tree op_type = TREE_TYPE (op.ops[0]); > + tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo, > +op_type, > +group_size); I think doing it this way does not adhere to the vector type size constraint with loop vectorization. You should use vect_is_simple_use like the original code did as the actual vector definition determines the vector type used. You are always using op.ops[0] here - I think that works because reduc_idx is the last operand of all lane-reducing ops. But then we should assert reduc_idx != 0 here and add a comment. > + > + /* The last operand of lane-reducing operation is for > +reduction. */ > + gcc_assert (redu
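To make the mixed-width case above concrete, a reduced scalar kernel for it could look like the sketch below. The function name and element types are assumptions (signed char and short chosen so the two statements map to dot_prod<16 x char> and dot_prod<8 x short> with 128-bit vectors); it is an illustration, not one of the testcases from this series.

  int
  mixed_width_dotprod (signed char *a0, signed char *a1,
                       short *c0, short *c1, int len)
  {
    int sum = 0;
    for (int i = 0; i < len; i++)
      {
        sum += a0[i] * a1[i];   /* dot-prod over chars: 1 input vector per VF = 16  */
        sum += c0[i] * c1[i];   /* dot-prod over shorts: 2 input vectors per VF = 16 */
      }
    return sum;
  }

With VF = 16 the char variant needs a single input vector per iteration while the short variant needs two, so recording <8 x short> (the input vectype with the least lanes) on the reduction PHI is what provides the two def-use cycles the short dot-product requires.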
[PATCH 8/8] vect: Optimize order of lane-reducing statements in loop def-use cycles
When transforming multiple lane-reducing operations in a loop reduction chain, originally, corresponding vectorized statements are generated into def-use cycles starting from 0. The def-use cycle with smaller index, would contain more statements, which means more instruction dependency. For example: int sum = 0; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad } Original transformation result: for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } For a higher instruction parallelism in final vectorized loop, an optimal means is to make those effective vectorized lane-reducing statements be distributed evenly among all def-use cycles. Transformed as the below, DOT_PROD, WIDEN_SUM and SADs are generated into disparate cycles, instruction dependency could be eliminated. Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing statements in an optimized order. --- gcc/tree-vect-loop.cc | 39 +++ gcc/tree-vectorizer.h | 6 ++ 2 files changed, 41 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 6d91665a341..c7e13d655d8 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8828,9 +8828,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy sum_v0 += n_v0[i: 0 ~ 3 ]; @@ -8838,14 +8838,45 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } - */ + +Moreover, for a higher instruction parallelism in final vectorized +loop, it is considered to make those effective vectorized lane- +reducing statements be distributed evenly among all def-use cycles. +In the above example, SADs are generated into other cycles rather +than that of DOT_PROD. */ unsigned using_ncopies = vec_oprnds[0].length (); unsigned reduc_ncopies = vec_oprnds[reduc_index].length (); + unsigned result_pos = reduc_info->reduc_result_pos; + + reduc_info->reduc_result_pos + = (result_pos + using_ncopies) % reduc_ncopies; + gcc_assert (result_pos < reduc_ncopies); for (unsigned i = 0; i < op.num_ops - 1; i++) { gcc_assert (vec_oprnds[i].length () == using_ncopies); vec_oprnds[i].safe_grow_cleared (reduc_ncopies); + + /* Find suitable def-use cycles to generate vectorized statements +into, and reorder operands based on the selection. 
*/ + if (result_pos) + { + unsigned count = reduc_ncopies - using_ncopies; + unsigned start = result_pos - count; + + if ((int) start < 0) + { + count = result_pos; + start = 0; + } + + for (unsigned j = using_ncopies; j > start; j--) + { + unsigned k = j - 1; + std::swap (vec_oprnds[i][k], vec_oprnds[i][k + count]); + gcc_assert (!vec_oprnds[i][k]); + } + } } } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 94736736dcc..64c6571a293 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1402,6 +1402,12 @@ public: /* The vector type for performing the actual reduction. */ tree reduc_vectype; + /* For loop reduction with multiple vectorized results (ncopies > 1), a + lane-reducing operation participating in it may not use all of those + results, this field specifies result index starting from which any + following land-reducing operation would be assigned to. */ + unsigned int reduc_result_pos; + /* If IS_RED
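Tracing the reorder logic above through the three-op example in the description (reduc_ncopies = 4; DOT_PROD and WIDEN_SUM each have one effective copy, SAD has two), and assuming the statements are transformed in the order they appear in the chain, gives the even layout the patch aims for:

  DOT_PROD : result_pos = 0, no reorder, lands in cycle 0;      result_pos advances to 1
  WIDEN_SUM: result_pos = 1, count clamps to 1, its operand is
             swapped from slot 0 to slot 1, lands in cycle 1;   result_pos advances to 2
  SAD      : result_pos = 2, count = 2, start = 0, slots {0,1}
             move to slots {2,3}, lands in cycles 2 and 3;      result_pos wraps to 0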
[PATCH 7/8] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current vectorizer could only handle the pattern if the reduction chain does not contain other operation, no matter the other is normal or lane-reducing. Actually, to allow multiple arbitrary lane-reducing operations, we need to support vectorization of loop reduction chain with mixed input vectypes. Since lanes of vectype may vary with operation, the effective ncopies of vectorized statements for operation also may not be same to each other, this causes mismatch on vectorized def-use cycles. A simple way is to align all operations with the one that has the most ncopies, the gap could be complemented by generating extra trivial pass-through copies. For example: int sum = 0; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad sum += n[i]; // normal } The vector size is 128-bit vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 0 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 += n_v0[i: 0 ~ 3 ]; sum_v1 += n_v1[i: 4 ~ 7 ]; sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (vectorizable_lane_reducing): New function declaration. * tree-vect-stmts.cc (vect_analyze_stmt): Call new function vectorizable_lane_reducing to analyze lane-reducing operation. * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation code related to emulated_mixed_dot_prod. (vect_reduction_update_partial_vector_usage): Compute ncopies as the original means for single-lane slp node. (vectorizable_lane_reducing): New function. (vectorizable_reduction): Allow multiple lane-reducing operations in loop reduction. Move some original lane-reducing related code to vectorizable_lane_reducing. (vect_transform_reduction): Extend transformation to support reduction statements with mixed input vectypes. 
gcc/testsuite/ PR tree-optimization/114440 * gcc.dg/vect/vect-reduc-chain-1.c * gcc.dg/vect/vect-reduc-chain-2.c * gcc.dg/vect/vect-reduc-chain-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c * gcc.dg/vect/vect-reduc-dot-slp-1.c --- .../gcc.dg/vect/vect-reduc-chain-1.c | 62 .../gcc.dg/vect/vect-reduc-chain-2.c | 77 + .../gcc.dg/vect/vect-reduc-chain-3.c | 66 .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 95 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 67 .../gcc.dg/vect/vect-reduc-chain-dot-slp-3.c | 79 + .../gcc.dg/vect/vect-reduc-chain-dot-slp-4.c | 63 .../gcc.dg/vect/vect-reduc-dot-slp-1.c| 35 ++ gcc/tree-vect-loop.cc | 324 ++ gcc/tree-vect-stmts.cc| 2 + gcc/tree-vectorizer.h | 2 + 11 files changed, 802 insertions(+), 70 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-4.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c new file mode 100644 index 000..04bfc419dbd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c @@ -0,0 +1,62 @@ +/* Disabling epilogues until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require
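The new testcase sources are not reproduced here; a minimal kernel of the kind these tests exercise might look like the sketch below. The function name and element types are assumptions (signed char for the dot-product inputs, short for the widen-sum input, unsigned char for the SAD inputs), and whether every statement is actually recognized as DOT_PROD/WIDEN_SUM/SAD depends on the target's support for those patterns.

  int
  mixed_lane_reducing_chain (signed char *d0, signed char *d1, short *w,
                             unsigned char *s0, unsigned char *s1,
                             int *n, int len)
  {
    int sum = 0;
    for (int i = 0; i < len; i++)
      {
        sum += d0[i] * d1[i];                  /* DOT_PROD_EXPR  */
        sum += w[i];                           /* WIDEN_SUM_EXPR */
        sum += __builtin_abs (s0[i] - s1[i]);  /* SAD_EXPR       */
        sum += n[i];                           /* ordinary add   */
      }
    return sum;
  }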
[PATCH 6/8] vect: Tighten an assertion for lane-reducing in transform
According to logic of code nearby the assertion, all lane-reducing operations should not appear, not just DOT_PROD_EXPR. Since "use_mask_by_cond_expr_p" treats SAD_EXPR same as DOT_PROD_EXPR, and WIDEN_SUM_EXPR should not be allowed by the following assertion "gcc_assert (commutative_binary_op_p (...))", so tighten the assertion. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vect_transform_reduction): Change assertion to cover all lane-reducing ops. --- gcc/tree-vect-loop.cc | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 7909d63d4df..e0561feddce 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8643,7 +8643,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); - gcc_assert (single_defuse_cycle || lane_reducing_op_p (code)); + bool lane_reducing = lane_reducing_op_p (code); + gcc_assert (single_defuse_cycle || lane_reducing); /* Create the destination vector */ tree scalar_dest = gimple_get_lhs (stmt_info->stmt); @@ -8698,8 +8699,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo, tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; if (masked_loop_p && !mask_by_cond_expr) { - /* No conditional ifns have been defined for dot-product yet. */ - gcc_assert (code != DOT_PROD_EXPR); + /* No conditional ifns have been defined for lane-reducing op +yet. */ + gcc_assert (!lane_reducing); /* Make sure that the reduction accumulator is vop[0]. */ if (reduc_index == 1) -- 2.17.1From d348e63c001e65067876a80dfae75abefe10c240 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Sun, 16 Jun 2024 13:33:52 +0800 Subject: [PATCH 6/8] vect: Tighten an assertion for lane-reducing in transform According to logic of code nearby the assertion, all lane-reducing operations should not appear, not just DOT_PROD_EXPR. Since "use_mask_by_cond_expr_p" treats SAD_EXPR same as DOT_PROD_EXPR, and WIDEN_SUM_EXPR should not be allowed by the following assertion "gcc_assert (commutative_binary_op_p (...))", so tighten the assertion. 2024-06-16 Feng Xue gcc/ * tree-vect-loop.cc (vect_transform_reduction): Change assertion to cover all lane-reducing ops. --- gcc/tree-vect-loop.cc | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 7909d63d4df..e0561feddce 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8643,7 +8643,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); - gcc_assert (single_defuse_cycle || lane_reducing_op_p (code)); + bool lane_reducing = lane_reducing_op_p (code); + gcc_assert (single_defuse_cycle || lane_reducing); /* Create the destination vector */ tree scalar_dest = gimple_get_lhs (stmt_info->stmt); @@ -8698,8 +8699,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo, tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; if (masked_loop_p && !mask_by_cond_expr) { - /* No conditional ifns have been defined for dot-product yet. */ - gcc_assert (code != DOT_PROD_EXPR); + /* No conditional ifns have been defined for lane-reducing op + yet. */ + gcc_assert (!lane_reducing); /* Make sure that the reduction accumulator is vop[0]. */ if (reduc_index == 1) -- 2.17.1
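The mask_by_cond_expr fallback that this assertion guards works because a lane-reducing op can be masked on its inputs instead of on the operation itself: if the inactive lanes are adjusted so that they contribute nothing (0 * x for a dot-product, equal operands for a SAD), the accumulator is unchanged. The function below is a scalar model of that idea for the dot-product case only, not the GCC internals; its name and signature are made up for illustration.

  /* Scalar model: inactive lanes are replaced by zero on one input,
     so they add nothing to the accumulator.  */
  static int
  masked_dotprod_step (const signed char *a, const signed char *b,
                       const unsigned char *mask, int lanes, int acc)
  {
    for (int k = 0; k < lanes; k++)
      {
        int ak = mask[k] ? a[k] : 0;   /* the VEC_COND_EXPR analogue */
        acc += ak * b[k];
      }
    return acc;
  }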
[PATCH 5/8] vect: Use an array to replace 3 relevant variables
It's better to place 3 relevant independent variables into array, since we have requirement to access them via an index in the following patch. At the same time, this change may get some duplicated code be more compact. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vect_transform_reduction): Replace vec_oprnds0/1/2 with one new array variable vec_oprnds[3]. --- gcc/tree-vect-loop.cc | 42 +- 1 file changed, 17 insertions(+), 25 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 39aa5cb1197..7909d63d4df 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8605,9 +8605,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, /* Transform. */ tree new_temp = NULL_TREE; - auto_vec vec_oprnds0; - auto_vec vec_oprnds1; - auto_vec vec_oprnds2; + auto_vec vec_oprnds[3]; if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "transform reduction.\n"); @@ -8657,12 +8655,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo, { vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, single_defuse_cycle && reduc_index == 0 -? NULL_TREE : op.ops[0], &vec_oprnds0, +? NULL_TREE : op.ops[0], &vec_oprnds[0], single_defuse_cycle && reduc_index == 1 -? NULL_TREE : op.ops[1], &vec_oprnds1, +? NULL_TREE : op.ops[1], &vec_oprnds[1], op.num_ops == 3 && !(single_defuse_cycle && reduc_index == 2) -? op.ops[2] : NULL_TREE, &vec_oprnds2); +? op.ops[2] : NULL_TREE, &vec_oprnds[2]); } else { @@ -8670,12 +8668,12 @@ vect_transform_reduction (loop_vec_info loop_vinfo, vectype. */ gcc_assert (single_defuse_cycle && (reduc_index == 1 || reduc_index == 2)); - vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, -op.ops[0], truth_type_for (vectype_in), &vec_oprnds0, + vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, op.ops[0], +truth_type_for (vectype_in), &vec_oprnds[0], reduc_index == 1 ? NULL_TREE : op.ops[1], -NULL_TREE, &vec_oprnds1, +NULL_TREE, &vec_oprnds[1], reduc_index == 2 ? NULL_TREE : op.ops[2], -NULL_TREE, &vec_oprnds2); +NULL_TREE, &vec_oprnds[2]); } /* For single def-use cycles get one copy of the vectorized reduction @@ -8683,20 +8681,21 @@ vect_transform_reduction (loop_vec_info loop_vinfo, if (single_defuse_cycle) { vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, 1, -reduc_index == 0 ? op.ops[0] : NULL_TREE, &vec_oprnds0, -reduc_index == 1 ? op.ops[1] : NULL_TREE, &vec_oprnds1, +reduc_index == 0 ? op.ops[0] : NULL_TREE, +&vec_oprnds[0], +reduc_index == 1 ? op.ops[1] : NULL_TREE, +&vec_oprnds[1], reduc_index == 2 ? op.ops[2] : NULL_TREE, -&vec_oprnds2); +&vec_oprnds[2]); } bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); + unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); - unsigned num = (reduc_index == 0 - ? vec_oprnds1.length () : vec_oprnds0.length ()); for (unsigned i = 0; i < num; ++i) { gimple *new_stmt; - tree vop[3] = { vec_oprnds0[i], vec_oprnds1[i], NULL_TREE }; + tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; if (masked_loop_p && !mask_by_cond_expr) { /* No conditional ifns have been defined for dot-product yet. 
*/ @@ -8721,7 +8720,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, else { if (op.num_ops >= 3) - vop[2] = vec_oprnds2[i]; + vop[2] = vec_oprnds[2][i]; if (masked_loop_p && mask_by_cond_expr) { @@ -8752,14 +8751,7 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } if (single_defuse_cycle && i < num - 1) - { - if (reduc_index == 0) - vec_oprnds0.safe_push (gimple_get_lhs (new_stmt)); - else if (reduc_index == 1) - vec_oprnds1.safe_push (gimple_get_lhs (new_stmt)); - else if (reduc_index == 2) - vec_oprnds2.safe_push (gimple_get_lhs (new_stmt)); - } + vec_oprnds[reduc_index].safe_push (gimple_get_lhs (new_stmt)); else if (slp_node) slp_node->push_vec_def (new_stmt); else -- 2.17.1From 168a55952ae317fca34af55d025c1235b4ff34b5 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Sun, 16 Jun
[PATCH 4/8] vect: Determine input vectype for multiple lane-reducing
The input vectype of reduction PHI statement must be determined before vect cost computation for the reduction. Since lance-reducing operation has different input vectype from normal one, so we need to traverse all reduction statements to find out the input vectype with the least lanes, and set that to the PHI statement. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vectorizable_reduction): Determine input vectype during traversal of reduction statements. --- gcc/tree-vect-loop.cc | 72 +-- 1 file changed, 49 insertions(+), 23 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 0f7b125e72d..39aa5cb1197 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7643,7 +7643,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo, { stmt_vec_info def = loop_vinfo->lookup_def (reduc_def); stmt_vec_info vdef = vect_stmt_to_vectorize (def); - if (STMT_VINFO_REDUC_IDX (vdef) == -1) + int reduc_idx = STMT_VINFO_REDUC_IDX (vdef); + + if (reduc_idx == -1) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -7686,10 +7688,50 @@ vectorizable_reduction (loop_vec_info loop_vinfo, return false; } } - else if (!stmt_info) - /* First non-conversion stmt. */ - stmt_info = vdef; - reduc_def = op.ops[STMT_VINFO_REDUC_IDX (vdef)]; + else + { + /* First non-conversion stmt. */ + if (!stmt_info) + stmt_info = vdef; + + if (lane_reducing_op_p (op.code)) + { + unsigned group_size = slp_node ? SLP_TREE_LANES (slp_node) : 0; + tree op_type = TREE_TYPE (op.ops[0]); + tree new_vectype_in = get_vectype_for_scalar_type (loop_vinfo, +op_type, +group_size); + + /* The last operand of lane-reducing operation is for +reduction. */ + gcc_assert (reduc_idx > 0 && reduc_idx == (int) op.num_ops - 1); + + /* For lane-reducing operation vectorizable analysis needs the +reduction PHI information */ + STMT_VINFO_REDUC_DEF (def) = phi_info; + + if (!new_vectype_in) + return false; + + /* Each lane-reducing operation has its own input vectype, while +reduction PHI will record the input vectype with the least +lanes. */ + STMT_VINFO_REDUC_VECTYPE_IN (vdef) = new_vectype_in; + + /* To accommodate lane-reducing operations of mixed input +vectypes, choose input vectype with the least lanes for the +reduction PHI statement, which would result in the most +ncopies for vectorized reduction results. */ + if (!vectype_in + || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in))) + < GET_MODE_SIZE (SCALAR_TYPE_MODE (op_type + vectype_in = new_vectype_in; + } + else + vectype_in = STMT_VINFO_VECTYPE (phi_info); + } + + reduc_def = op.ops[reduc_idx]; reduc_chain_length++; if (!stmt_info && slp_node) slp_for_stmt_info = SLP_TREE_CHILDREN (slp_for_stmt_info)[0]; @@ -7747,6 +7789,8 @@ vectorizable_reduction (loop_vec_info loop_vinfo, tree vectype_out = STMT_VINFO_VECTYPE (stmt_info); STMT_VINFO_REDUC_VECTYPE (reduc_info) = vectype_out; + STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in; + gimple_match_op op; if (!gimple_extract_op (stmt_info->stmt, &op)) gcc_unreachable (); @@ -7831,16 +7875,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op.ops[i]), slp_op[i]); - /* To properly compute ncopies we are interested in the widest -non-reduction input type in case we're looking at a widening -accumulation that we later handle in vect_transform_reduction. 
*/ - if (lane_reducing - && vectype_op[i] - && (!vectype_in - || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in))) - < GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_op[i])) - vectype_in = vectype_op[i]; - /* Record how the non-reduction-def value of COND_EXPR is defined. ??? For a chain of multiple CONDs we'd have to match them up all. */ if (op.code == COND_EXPR && reduc_chain_length == 1) @@ -7859,14 +7893,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, } } } - if (!vectype_in) -vectype_in = STMT_VINFO_VECTYPE (phi_info); - STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in; - - /* Each lane-reducing operation has
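Spelled out with the usual element types, the GET_MODE_SIZE comparison above keeps the candidate whose scalar element is wider, i.e. the input vectype with the least lanes and therefore the most vector copies:

  <16 x char> : scalar size 1, 16 lanes -> VF = 16 needs 1 copy
  <8 x short> : scalar size 2,  8 lanes -> VF = 16 needs 2 copies   <- recorded on the PHI

so the reduction PHI ends up with enough def-use cycles for every lane-reducing statement in the chain.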
[PATCH 3/8] vect: Use one reduction_type local variable
Two local variables were defined to refer same STMT_VINFO_REDUC_TYPE, better to keep only one. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vectorizable_reduction): Remove v_reduc_type, and replace it to another local variable reduction_type. --- gcc/tree-vect-loop.cc | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 6e8b3639daf..0f7b125e72d 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7868,10 +7868,10 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (lane_reducing) STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in; - enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info); - STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type; + enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info); + STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type; /* If we have a condition reduction, see if we can simplify it further. */ - if (v_reduc_type == COND_REDUCTION) + if (reduction_type == COND_REDUCTION) { if (slp_node && SLP_TREE_LANES (slp_node) != 1) return false; @@ -8038,7 +8038,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, STMT_VINFO_REDUC_CODE (reduc_info) = orig_code; - vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); + reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); if (reduction_type == TREE_CODE_REDUCTION) { /* Check whether it's ok to change the order of the computation. -- 2.17.1From 19dc1c91f10ec22e695b9003cae1f4ab5aa45250 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Sun, 16 Jun 2024 12:17:26 +0800 Subject: [PATCH 3/8] vect: Use one reduction_type local variable Two local variables were defined to refer same STMT_VINFO_REDUC_TYPE, better to keep only one. 2024-06-16 Feng Xue gcc/ * tree-vect-loop.cc (vectorizable_reduction): Remove v_reduc_type, and replace it to another local variable reduction_type. --- gcc/tree-vect-loop.cc | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 6e8b3639daf..0f7b125e72d 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7868,10 +7868,10 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (lane_reducing) STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in; - enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info); - STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type; + enum vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (phi_info); + STMT_VINFO_REDUC_TYPE (reduc_info) = reduction_type; /* If we have a condition reduction, see if we can simplify it further. */ - if (v_reduc_type == COND_REDUCTION) + if (reduction_type == COND_REDUCTION) { if (slp_node && SLP_TREE_LANES (slp_node) != 1) return false; @@ -8038,7 +8038,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, STMT_VINFO_REDUC_CODE (reduc_info) = orig_code; - vect_reduction_type reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); + reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); if (reduction_type == TREE_CODE_REDUCTION) { /* Check whether it's ok to change the order of the computation. -- 2.17.1
[PATCH 2/8] vect: Remove duplicated check on reduction operand
In vectorizable_reduction, one check on a reduction operand via index could be contained by another one check via pointer, so remove the former. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vectorizable_reduction): Remove the duplicated check. --- gcc/tree-vect-loop.cc | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index d9a2ad69484..6e8b3639daf 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7815,11 +7815,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo, "use not simple.\n"); return false; } - if (i == STMT_VINFO_REDUC_IDX (stmt_info)) - continue; - /* For an IFN_COND_OP we might hit the reduction definition operand -twice (once as definition, once as else). */ + /* Skip reduction operands, and for an IFN_COND_OP we might hit the +reduction operand twice (once as definition, once as else). */ if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)]) continue; -- 2.17.1From 5d2c22ad724856db12bf0ca568650f471447fa34 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Sun, 16 Jun 2024 12:08:56 +0800 Subject: [PATCH 2/8] vect: Remove duplicated check on reduction operand In vectorizable_reduction, one check on a reduction operand via index could be contained by another one check via pointer, so remove the former. 2024-06-16 Feng Xue gcc/ * tree-vect-loop.cc (vectorizable_reduction): Remove the duplicated check. --- gcc/tree-vect-loop.cc | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index d9a2ad69484..6e8b3639daf 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7815,11 +7815,9 @@ vectorizable_reduction (loop_vec_info loop_vinfo, "use not simple.\n"); return false; } - if (i == STMT_VINFO_REDUC_IDX (stmt_info)) - continue; - /* For an IFN_COND_OP we might hit the reduction definition operand - twice (once as definition, once as else). */ + /* Skip reduction operands, and for an IFN_COND_OP we might hit the + reduction operand twice (once as definition, once as else). */ if (op.ops[i] == op.ops[STMT_VINFO_REDUC_IDX (stmt_info)]) continue; -- 2.17.1
[PATCH 1/8] vect: Add a function to check lane-reducing stmt
The series of patches are meant to support multiple lane-reducing reduction statements. Since the original ones conflicted with the new single-lane slp node patches, I have reworked most of the patches, and split them as small as possible, which may make code review easier. In the 1st one, I add a utility function to check if a statement is lane-reducing operation, which could simplify some existing code. Thanks, Feng --- gcc/ * tree-vectorizer.h (lane_reducing_stmt_p): New function. * tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_stmt_p to check statement. --- gcc/tree-vect-slp.cc | 4 +--- gcc/tree-vectorizer.h | 12 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 7e3d0107b4e..b4ea2e18f00 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -3919,7 +3919,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size) scalar_stmts.create (loop_vinfo->reductions.length ()); for (auto next_info : loop_vinfo->reductions) { - gassign *g; next_info = vect_stmt_to_vectorize (next_info); if ((STMT_VINFO_RELEVANT_P (next_info) || STMT_VINFO_LIVE_P (next_info)) @@ -3931,8 +3930,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size) { /* Do not discover SLP reductions combining lane-reducing ops, that will fail later. */ - if (!(g = dyn_cast (STMT_VINFO_STMT (next_info))) - || !lane_reducing_op_p (gimple_assign_rhs_code (g))) + if (!lane_reducing_stmt_p (STMT_VINFO_STMT (next_info))) scalar_stmts.quick_push (next_info); else { diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 6bb0f5c3a56..60224f4e284 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2169,12 +2169,24 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo) && th >= vect_vf_for_cost (loop_vinfo)); } +/* Return true if CODE is a lane-reducing opcode. */ + inline bool lane_reducing_op_p (code_helper code) { return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR; } +/* Return true if STMT is a lane-reducing statement. */ + +inline bool +lane_reducing_stmt_p (gimple *stmt) +{ + if (auto *assign = dyn_cast (stmt)) +return lane_reducing_op_p (gimple_assign_rhs_code (assign)); + return false; +} + /* Source location + hotness information. */ extern dump_user_location_t vect_location; -- 2.17.1From 0a90550b4ed3addfb2a36c40085bfa9b4bb05b7c Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Sat, 15 Jun 2024 23:17:10 +0800 Subject: [PATCH 1/8] vect: Add a function to check lane-reducing stmt Add a utility function to check if a statement is lane-reducing operation, which could simplify some existing code. 2024-06-16 Feng Xue gcc/ * tree-vectorizer.h (lane_reducing_stmt_p): New function. * tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_stmt_p to check statement. 
--- gcc/tree-vect-slp.cc | 4 +--- gcc/tree-vectorizer.h | 12 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 7e3d0107b4e..b4ea2e18f00 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -3919,7 +3919,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size) scalar_stmts.create (loop_vinfo->reductions.length ()); for (auto next_info : loop_vinfo->reductions) { - gassign *g; next_info = vect_stmt_to_vectorize (next_info); if ((STMT_VINFO_RELEVANT_P (next_info) || STMT_VINFO_LIVE_P (next_info)) @@ -3931,8 +3930,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size) { /* Do not discover SLP reductions combining lane-reducing ops, that will fail later. */ - if (!(g = dyn_cast (STMT_VINFO_STMT (next_info))) - || !lane_reducing_op_p (gimple_assign_rhs_code (g))) + if (!lane_reducing_stmt_p (STMT_VINFO_STMT (next_info))) scalar_stmts.quick_push (next_info); else { diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 6bb0f5c3a56..60224f4e284 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2169,12 +2169,24 @@ vect_apply_runtime_profitability_check_p (loop_vec_info loop_vinfo) && th >= vect_vf_for_cost (loop_vinfo)); } +/* Return true if CODE is a lane-reducing opcode. */ + inline bool lane_reducing_op_p (code_helper code) { return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR; } +/* Return true if STMT is a lane-reducing statement. */ + +inline bool +lane_reducing_stmt_p (gimple *stmt) +{ + if (auto *assign = dyn_cast (stmt)) +return lane_reducing_op_p (gimple_assign_rhs_code (assign)); + return false; +} +
Re: [PATCH 6/6] vect: Optimize order of lane-reducing statements in loop def-use cycles [PR114440]
Regenerate the patch due to changes on its dependent patches. Thanks, Feng, --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing statements in an optimized order. --- gcc/tree-vect-loop.cc | 51 ++- gcc/tree-vectorizer.h | 6 + 2 files changed, 51 insertions(+), 6 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index fb9259d115c..de7a9bab990 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8734,7 +8734,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); - gcc_assert (single_defuse_cycle || lane_reducing_op_p (code)); + bool lane_reducing = lane_reducing_op_p (code); + gcc_assert (single_defuse_cycle || lane_reducing); /* Create the destination vector */ tree scalar_dest = gimple_get_lhs (stmt_info->stmt); @@ -8751,6 +8752,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } else { + int result_pos = 0; + /* The input vectype of the reduction PHI determines copies of vectorized def-use cycles, which might be more than effective copies of vectorized lane-reducing reduction statements. This could be @@ -8780,9 +8783,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy sum_v0 += n_v0[i: 0 ~ 3 ]; @@ -8790,7 +8793,20 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } - */ + +Moreover, for a higher instruction parallelism in final vectorized +loop, it is considered to make those effective vectorized +lane-reducing statements be distributed evenly among all def-use +cycles. In the above example, SADs are generated into other cycles +rather than that of DOT_PROD. */ + + if (stmt_ncopies < ncopies) + { + gcc_assert (lane_reducing); + result_pos = reduc_info->reduc_result_pos; + reduc_info->reduc_result_pos = (result_pos + stmt_ncopies) % ncopies; + gcc_assert (result_pos >= 0 && result_pos < ncopies); + } for (i = 0; i < MIN (3, (int) op.num_ops); i++) { @@ -8826,7 +8842,30 @@ vect_transform_reduction (loop_vec_info loop_vinfo, op.ops[i], &vec_oprnds[i], vectype); if (used_ncopies < ncopies) - vec_oprnds[i].safe_grow_cleared (ncopies); + { + vec_oprnds[i].safe_grow_cleared (ncopies); + + /* Find suitable def-use cycles to generate vectorized +statements into, and reorder operands based on the +selection. */ + if (i != reduc_index && result_pos) + { + int count = ncopies - used_ncopies; + int start = result_pos - count; + + if (start < 0) + { + count = result_pos; + start = 0; + } + + for (int j = used_ncopies - 1; j >= start; j--) + { + std::swap (vec_oprnds[i][j], vec_oprnds[i][j + count]); + gcc_assert (!vec_oprnds[i][j]); + } + } + } } } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 3f7db707d97..b9bc9d432ee 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1402,6 +1402,12 @@ public: /* The vector type for performing the actual reduction. 
*/ tree reduc_vectype; + /* For loop reduction with multiple vectorized results (ncopies > 1), a + lane-reducing operation participating in it may not use all of those + results, this field specifies result index starting from which any + following land-reducing operation would be assigned to. */ + int reduc_result_pos; + /* If IS_REDUC_INFO is true and if the vector code is performing N scalar reductions in parallel, this variable gives the initial scalar values of those N reductions. */ -- 2.17.1 ________ From: Feng Xue OS Sent: Thursday, May 30, 2024 10:56 PM To: Richard Biener
Re: [PATCH 3/6] vect: Set STMT_VINFO_REDUC_DEF for non-live stmt in loop reduction
Updated the patch. Thanks, Feng -- gcc/ * tree-vect-loop.cc (vectorizable_reduction): Set STMT_VINFO_REDUC_DEF for non-live stmt. * tree-vect-stmts.cc (vectorizable_condition): Treat the condition statement that is pointed by stmt_vec_info of reduction PHI as the real "for_reduction" statement. --- gcc/tree-vect-loop.cc | 7 +-- gcc/tree-vect-stmts.cc | 11 ++- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index bbd5d261907..35c50eb72cb 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7665,8 +7665,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (STMT_VINFO_LIVE_P (s)) STMT_VINFO_REDUC_DEF (vect_orig_stmt (s)) = phi_info; } - else if (STMT_VINFO_LIVE_P (vdef)) - STMT_VINFO_REDUC_DEF (def) = phi_info; + + /* For lane-reducing operation vectorizable analysis needs the +reduction PHI information */ + STMT_VINFO_REDUC_DEF (def) = phi_info; + gimple_match_op op; if (!gimple_extract_op (vdef->stmt, &op)) { diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index e32d44050e5..dbdb59054e0 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -12137,11 +12137,20 @@ vectorizable_condition (vec_info *vinfo, vect_reduction_type reduction_type = TREE_CODE_REDUCTION; bool for_reduction = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) != NULL; + if (for_reduction) +{ + reduc_info = info_for_reduction (vinfo, stmt_info); + if (STMT_VINFO_REDUC_DEF (reduc_info) != vect_orig_stmt (stmt_info)) + { + for_reduction = false; + reduc_info = NULL; + } +} + if (for_reduction) { if (slp_node && SLP_TREE_LANES (slp_node) > 1) return false; - reduc_info = info_for_reduction (vinfo, stmt_info); reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); gcc_assert (reduction_type != EXTRACT_LAST_REDUCTION -- 2.17.1 ____________ From: Feng Xue OS Sent: Thursday, May 30, 2024 10:51 PM To: Richard Biener Cc: Tamar Christina; gcc-patches@gcc.gnu.org Subject: [PATCH 3/6] vect: Set STMT_VINFO_REDUC_DEF for non-live stmt in loop reduction Normally, vectorizable checking on statement in a loop reduction chain does not use the reduction PHI information. But some special statements might need it in vectorizable analysis, especially, for multiple lane-reducing operations support later. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vectorizable_reduction): Set STMT_VINFO_REDUC_DEF for non-live stmt. * tree-vect-stmts.cc (vectorizable_condition): Treat the condition statement that is pointed by stmt_vec_info of reduction PHI as the real "for_reduction" statement. --- gcc/tree-vect-loop.cc | 5 +++-- gcc/tree-vect-stmts.cc | 11 ++- 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index aa5f21ccd1a..51627c27f8a 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7632,14 +7632,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo, all lanes here - even though we only will vectorize from the SLP node with live lane zero the other live lanes also need to be identified as part of a reduction to be able -to skip code generation for them. */ +to skip code generation for them. For lane-reducing operation +vectorizable analysis needs the reduction PHI information. 
*/ if (slp_for_stmt_info) { for (auto s : SLP_TREE_SCALAR_STMTS (slp_for_stmt_info)) if (STMT_VINFO_LIVE_P (s)) STMT_VINFO_REDUC_DEF (vect_orig_stmt (s)) = phi_info; } - else if (STMT_VINFO_LIVE_P (vdef)) + else STMT_VINFO_REDUC_DEF (def) = phi_info; gimple_match_op op; if (!gimple_extract_op (vdef->stmt, &op)) diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 935d80f0e1b..2e0be763abb 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -12094,11 +12094,20 @@ vectorizable_condition (vec_info *vinfo, vect_reduction_type reduction_type = TREE_CODE_REDUCTION; bool for_reduction = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) != NULL; + if (for_reduction) +{ + reduc_info = info_for_reduction (vinfo, stmt_info); + if (STMT_VINFO_REDUC_DEF (reduc_info) != vect_orig_stmt (stmt_info)) + { + for_reduction = false; + reduc_info = NULL; + } +} + if (for_reduction) { if (slp_node) return false; - reduc_info = info_for_reduction (vinfo, stmt_info); reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); reduc_index = STMT_V
Re: [PATCH 5/6] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
Please see my comments below. Thanks, Feng > On Thu, May 30, 2024 at 4:55 PM Feng Xue OS > wrote: >> >> For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, >> current >> vectorizer could only handle the pattern if the reduction chain does not >> contain other operation, no matter the other is normal or lane-reducing. >> >> Actually, to allow multiple arbitray lane-reducing operations, we need to >> support vectorization of loop reduction chain with mixed input vectypes. >> Since >> lanes of vectype may vary with operation, the effective ncopies of vectorized >> statements for operation also may not be same to each other, this causes >> mismatch on vectorized def-use cycles. A simple way is to align all >> operations >> with the one that has the most ncopies, the gap could be complemented by >> generating extra trival pass-through copies. For example: >> >>int sum = 0; >>for (i) >> { >>sum += d0[i] * d1[i]; // dot-prod >>sum += w[i]; // widen-sum >>sum += abs(s0[i] - s1[i]); // sad >>sum += n[i]; // normal >> } >> >> The vector size is 128-bit,vectorization factor is 16. Reduction statements >> would be transformed as: >> >>vector<4> int sum_v0 = { 0, 0, 0, 0 }; >>vector<4> int sum_v1 = { 0, 0, 0, 0 }; >>vector<4> int sum_v2 = { 0, 0, 0, 0 }; >>vector<4> int sum_v3 = { 0, 0, 0, 0 }; >> >>for (i / 16) >> { >>sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); >>sum_v1 = sum_v1; // copy >>sum_v2 = sum_v2; // copy >>sum_v3 = sum_v3; // copy >> >>sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); >>sum_v1 = sum_v1; // copy >>sum_v2 = sum_v2; // copy >>sum_v3 = sum_v3; // copy >> >>sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); >>sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); >>sum_v2 = sum_v2; // copy >>sum_v3 = sum_v3; // copy >> >>sum_v0 += n_v0[i: 0 ~ 3 ]; >>sum_v1 += n_v1[i: 4 ~ 7 ]; >>sum_v2 += n_v2[i: 8 ~ 11]; >>sum_v3 += n_v3[i: 12 ~ 15]; >> } >> >> Thanks, >> Feng >> >> ... >> >> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc >> index 20c99f11e9a..b5849dbb08a 100644 >> --- a/gcc/tree-vect-loop.cc >> +++ b/gcc/tree-vect-loop.cc >> @@ -5322,8 +5322,6 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, >>if (!gimple_extract_op (orig_stmt_info->stmt, &op)) >> gcc_unreachable (); >> >> - bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod >> (stmt_info); >> - >>if (reduction_type == EXTRACT_LAST_REDUCTION) >> /* No extra instructions are needed in the prologue. The loop body >> operations are costed in vectorizable_condition. */ >> @@ -5358,12 +5356,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, >>initial result of the data reduction, initial value of the index >>reduction. */ >> prologue_stmts = 4; >> - else if (emulated_mixed_dot_prod) >> - /* We need the initial reduction value and two invariants: >> - one that contains the minimum signed value and one that >> - contains half of its negative. */ >> - prologue_stmts = 3; >>else >> + /* We need the initial reduction value. */ >> prologue_stmts = 1; >>prologue_cost += record_stmt_cost (cost_vec, prologue_stmts, >> scalar_to_vec, stmt_info, 0, >> @@ -7464,6 +7458,169 @@ vect_reduction_use_partial_vector (loop_vec_info >> loop_vinfo, >> } >> } >> >> +/* Check if STMT_INFO is a lane-reducing operation that can be vectorized in >> + the context of LOOP_VINFO, and vector cost will be recorded in COST_VEC. >> + Now there are three such kinds of operations: dot-prod/widen-sum/sad >> + (sum-of-absolute-differences). 
>> + >> + For a lane-reducing operation, the loop reduction path that it lies in, >> + may contain normal operation, or other lane-reducing operation of >> different >> + input type size, an example as: >> + >> + int sum = 0; >> + for (i) >> + { >> + ... >> + sum += d0[i] * d1[i]; // dot-prod >> +
Re: [PATCH 2/6] vect: Split out partial vect checking for reduction into a function
Ok. Updated as the comments. Thanks, Feng From: Richard Biener Sent: Friday, May 31, 2024 3:29 PM To: Feng Xue OS Cc: Tamar Christina; gcc-patches@gcc.gnu.org Subject: Re: [PATCH 2/6] vect: Split out partial vect checking for reduction into a function On Thu, May 30, 2024 at 4:48 PM Feng Xue OS wrote: > > This is a patch that is split out from > https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652626.html. > > Partial vectorization checking for vectorizable_reduction is a piece of > relatively isolated code, which may be reused by other places. Move the > code into a new function for sharing. > > Thanks, > Feng > --- > gcc/ > * tree-vect-loop.cc (vect_reduction_use_partial_vector): New function. Can you rename the function to vect_reduction_update_partial_vector_usage please? And keep ... > (vectorizable_reduction): Move partial vectorization checking code to > vect_reduction_use_partial_vector. > --- > gcc/tree-vect-loop.cc | 138 -- > 1 file changed, 78 insertions(+), 60 deletions(-) > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index a42d79c7cbf..aa5f21ccd1a 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -7391,6 +7391,81 @@ build_vect_cond_expr (code_helper code, tree vop[3], > tree mask, > } > } > > +/* Given an operation with CODE in loop reduction path whose reduction PHI is > + specified by REDUC_INFO, the operation has TYPE of scalar result, and its > + input vectype is represented by VECTYPE_IN. The vectype of vectorized > result > + may be different from VECTYPE_IN, either in base type or vectype lanes, > + lane-reducing operation is the case. This function check if it is > possible, > + and how to perform partial vectorization on the operation in the context > + of LOOP_VINFO. */ > + > +static void > +vect_reduction_use_partial_vector (loop_vec_info loop_vinfo, > + stmt_vec_info reduc_info, > + slp_tree slp_node, code_helper code, > + tree type, tree vectype_in) > +{ > + if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) > +return; > + > + enum vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (reduc_info); > + internal_fn reduc_fn = STMT_VINFO_REDUC_FN (reduc_info); > + internal_fn cond_fn = get_conditional_internal_fn (code, type); > + > + if (reduc_type != FOLD_LEFT_REDUCTION > + && !use_mask_by_cond_expr_p (code, cond_fn, vectype_in) > + && (cond_fn == IFN_LAST > + || !direct_internal_fn_supported_p (cond_fn, vectype_in, > + OPTIMIZE_FOR_SPEED))) > +{ > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > +"can't operate on partial vectors because" > +" no conditional operation is available.\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > +} > + else if (reduc_type == FOLD_LEFT_REDUCTION > + && reduc_fn == IFN_LAST > + && !expand_vec_cond_expr_p (vectype_in, truth_type_for > (vectype_in), > + SSA_NAME)) > +{ > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "can't operate on partial vectors because" > + " no conditional operation is available.\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > +} > + else if (reduc_type == FOLD_LEFT_REDUCTION > + && internal_fn_mask_index (reduc_fn) == -1 > + && FLOAT_TYPE_P (vectype_in) > + && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in)) > +{ > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > +"can't operate on partial vectors because" > +" signed zeros cannot be preserved.\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = 
false; > +} > + else > +{ > + internal_fn mask_reduc_fn > + = get_masked_reduction_fn (reduc_fn, vectype_in); > + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > + unsigned nvectors; > + > + if (slp_node) > + nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > + el
[PATCH 6/6] vect: Optimize order of lane-reducing statements in loop def-use cycles [PR114440]
When transforming multiple lane-reducing operations in a loop reduction chain, originally, corresponding vectorized statements are generated into def-use cycles starting from 0. The def-use cycle with smaller index, would contain more statements, which means more instruction dependency. For example: int sum = 0; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad } Original transformation result: for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } For a higher instruction parallelism in final vectorized loop, an optimal means is to make those effective vectorized lane-reducing statements be distributed evenly among all def-use cycles. Transformed as the below, DOT_PROD, WIDEN_SUM and SADs are generated into disparate cycles, instruction dependency could be eliminated. for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = sum_v1; // copy sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2); sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3); } Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing statements in an optimized order. --- gcc/tree-vect-loop.cc | 51 ++- gcc/tree-vectorizer.h | 6 + 2 files changed, 51 insertions(+), 6 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index b5849dbb08a..4807f529506 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -8703,7 +8703,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } bool single_defuse_cycle = STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info); - gcc_assert (single_defuse_cycle || lane_reducing_op_p (code)); + bool lane_reducing = lane_reducing_op_p (code); + gcc_assert (single_defuse_cycle || lane_reducing); /* Create the destination vector */ tree scalar_dest = gimple_get_lhs (stmt_info->stmt); @@ -8720,6 +8721,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, } else { + int result_pos = 0; + /* The input vectype of the reduction PHI determines copies of vectorized def-use cycles, which might be more than effective copies of vectorized lane-reducing reduction statements. 
This could be @@ -8749,9 +8752,9 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy - sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); - sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); - sum_v2 = sum_v2; // copy + sum_v0 = sum_v0; // copy + sum_v1 = SAD (s0_v1[i: 0 ~ 7 ], s1_v1[i: 0 ~ 7 ], sum_v1); + sum_v2 = SAD (s0_v2[i: 8 ~ 15], s1_v2[i: 8 ~ 15], sum_v2); sum_v3 = sum_v3; // copy sum_v0 += n_v0[i: 0 ~ 3 ]; @@ -8759,7 +8762,20 @@ vect_transform_reduction (loop_vec_info loop_vinfo, sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } - */ + +Moreover, for a higher instruction parallelism in final vectorized +loop, it is considered to make those effective vectorized +lane-reducing statements be distributed evenly among all def-use +cycles. In the above example, SADs are generated into other cycles +rather than that of DOT_PROD. */ + + if (stmt_ncopies < ncopies) + { + gcc_assert (lane_reducing); + result_pos = reduc_info->reduc_result_pos; + reduc_info->reduc_result_pos = (result_pos + stmt_ncopies) % ncopies; + gcc_assert (result_pos >= 0 && result_pos < ncopies); + } for (i = 0; i < MIN (3, (int) op.num_ops); i++) { @@ -8792,7 +8808,30 @@ vect_transform_reduction (loop_vec_info loop_vinfo, op.ops[i], &vec_oprnds[i], vectype); if (used_ncopies <
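The rotation itself is plain modular arithmetic over the def-use cycles. The stand-alone sketch below replays it for the dot-prod/widen-sum/sad example above with ncopies = 4; the variable names mirror the patch, but placing the j-th effective statement at (result_pos + j) % ncopies is this sketch's own simplification of what the transform code does.

#include <cstdio>

int
main ()
{
  const int ncopies = 4;     /* def-use cycles implied by the reduction PHI  */
  int reduc_result_pos = 0;  /* rotating start position, as in the patch  */

  const char *name[] = { "DOT_PROD", "WIDEN_SUM", "SAD" };
  const int stmt_ncopies[] = { 1, 1, 2 };  /* effective copies of each statement  */

  for (int s = 0; s < 3; s++)
    {
      int result_pos = reduc_result_pos;
      reduc_result_pos = (result_pos + stmt_ncopies[s]) % ncopies;
      for (int j = 0; j < stmt_ncopies[s]; j++)
        std::printf ("%s copy %d goes into cycle sum_v%d\n",
                     name[s], j, (result_pos + j) % ncopies);
    }
  /* Prints sum_v0 for DOT_PROD, sum_v1 for WIDEN_SUM and sum_v2/sum_v3 for
     the two SAD copies, matching the transformed loop shown above.  */
}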
[PATCH 5/6] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current vectorizer could only handle the pattern if the reduction chain does not contain other operation, no matter the other is normal or lane-reducing. Actually, to allow multiple arbitray lane-reducing operations, we need to support vectorization of loop reduction chain with mixed input vectypes. Since lanes of vectype may vary with operation, the effective ncopies of vectorized statements for operation also may not be same to each other, this causes mismatch on vectorized def-use cycles. A simple way is to align all operations with the one that has the most ncopies, the gap could be complemented by generating extra trival pass-through copies. For example: int sum = 0; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad sum += n[i]; // normal } The vector size is 128-bit,vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 0 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0); sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 += n_v0[i: 0 ~ 3 ]; sum_v1 += n_v1[i: 4 ~ 7 ]; sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } Thanks, Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (vectorizable_lane_reducing): New function declaration. * tree-vect-stmts.cc (vect_analyze_stmt): Call new function vectorizable_lane_reducing to analyze lane-reducing operation. * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation code related to emulated_mixed_dot_prod. (vectorizable_lane_reducing): New function. (vectorizable_reduction): Allow multiple lane-reducing operations in loop reduction. Move some original lane-reducing related code to vectorizable_lane_reducing. (vect_transform_reduction): Extend transformation to support reduction statements with mixed input vectypes. 
gcc/testsuite/ PR tree-optimization/114440 * gcc.dg/vect/vect-reduc-chain-1.c * gcc.dg/vect/vect-reduc-chain-2.c * gcc.dg/vect/vect-reduc-chain-3.c * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c * gcc.dg/vect/vect-reduc-dot-slp-1.c --- .../gcc.dg/vect/vect-reduc-chain-1.c | 62 +++ .../gcc.dg/vect/vect-reduc-chain-2.c | 77 +++ .../gcc.dg/vect/vect-reduc-chain-3.c | 66 +++ .../gcc.dg/vect/vect-reduc-chain-dot-slp-1.c | 97 .../gcc.dg/vect/vect-reduc-chain-dot-slp-2.c | 81 +++ .../gcc.dg/vect/vect-reduc-dot-slp-1.c| 35 ++ gcc/tree-vect-loop.cc | 478 -- gcc/tree-vect-stmts.cc| 2 + gcc/tree-vectorizer.h | 2 + 9 files changed, 755 insertions(+), 145 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-dot-slp-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c new file mode 100644 index 000..04bfc419dbd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c @@ -0,0 +1,62 @@ +/* Disabling epilogues until we find a better way to deal with scans. */ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#define N 50 + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res, + SIGNEDNESS_2 char *restrict a, + SIGNEDNESS_2 char *restrict b, + SIGNEDNESS_2 char *restrict c, + SIGNEDNESS_2 char *restrict d, + SIGNEDNESS_1 int *
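The alignment rule can be summarized as: every operation is padded up to the ncopies dictated by the reduction PHI, and the padding consists of pure pass-through copies. A tiny stand-alone sketch of that bookkeeping for the example above; the effective statement counts are taken from the example and in general depend on each operation's input vectype.

#include <cstdio>

int
main ()
{
  const int phi_ncopies = 4;  /* copies of the vector<4> int reduction PHI, VF = 16  */

  struct { const char *op; int effective; } ops[] =
    { { "DOT_PROD", 1 }, { "WIDEN_SUM", 1 }, { "SAD", 2 }, { "normal +", 4 } };

  for (const auto &o : ops)
    std::printf ("%-9s: %d effective vector stmt(s), %d pass-through copy(ies)\n",
                 o.op, o.effective, phi_ncopies - o.effective);
}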
[PATCH 4/6] vect: Bind input vectype to lane-reducing operation
The input vectype is an attribute of lane-reducing operation, instead of reduction PHI that it is associated to, since there might be more than one lane-reducing operations with different type in a loop reduction chain. So bind each lane-reducing operation with its own input type. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter loop_vinfo. Get input vectype from stmt_info instead of reduction PHI. (vect_model_reduction_cost): Remove loop_vinfo argument of call to vect_is_emulated_mixed_dot_prod. (vect_transform_reduction): Likewise. (vectorizable_reduction): Likewise, and bind input vectype to lane-reducing operation. --- gcc/tree-vect-loop.cc | 23 +-- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 51627c27f8a..20c99f11e9a 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -5270,8 +5270,7 @@ have_whole_vector_shift (machine_mode mode) See vect_emulate_mixed_dot_prod for the actual sequence used. */ static bool -vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo, -stmt_vec_info stmt_info) +vect_is_emulated_mixed_dot_prod (stmt_vec_info stmt_info) { gassign *assign = dyn_cast (stmt_info->stmt); if (!assign || gimple_assign_rhs_code (assign) != DOT_PROD_EXPR) @@ -5282,10 +5281,9 @@ vect_is_emulated_mixed_dot_prod (loop_vec_info loop_vinfo, if (TYPE_SIGN (TREE_TYPE (rhs1)) == TYPE_SIGN (TREE_TYPE (rhs2))) return false; - stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info); - gcc_assert (reduc_info->is_reduc_info); + gcc_assert (STMT_VINFO_REDUC_VECTYPE_IN (stmt_info)); return !directly_supported_p (DOT_PROD_EXPR, - STMT_VINFO_REDUC_VECTYPE_IN (reduc_info), + STMT_VINFO_REDUC_VECTYPE_IN (stmt_info), optab_vector_mixed_sign); } @@ -5324,8 +5322,8 @@ vect_model_reduction_cost (loop_vec_info loop_vinfo, if (!gimple_extract_op (orig_stmt_info->stmt, &op)) gcc_unreachable (); - bool emulated_mixed_dot_prod -= vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); + bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); + if (reduction_type == EXTRACT_LAST_REDUCTION) /* No extra instructions are needed in the prologue. The loop body operations are costed in vectorizable_condition. */ @@ -7840,6 +7838,11 @@ vectorizable_reduction (loop_vec_info loop_vinfo, vectype_in = STMT_VINFO_VECTYPE (phi_info); STMT_VINFO_REDUC_VECTYPE_IN (reduc_info) = vectype_in; + /* Each lane-reducing operation has its own input vectype, while reduction + PHI records the input vectype with least lanes. */ + if (lane_reducing) +STMT_VINFO_REDUC_VECTYPE_IN (stmt_info) = vectype_in; + enum vect_reduction_type v_reduc_type = STMT_VINFO_REDUC_TYPE (phi_info); STMT_VINFO_REDUC_TYPE (reduc_info) = v_reduc_type; /* If we have a condition reduction, see if we can simplify it further. */ @@ -8366,7 +8369,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (single_defuse_cycle || lane_reducing) { int factor = 1; - if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info)) + if (vect_is_emulated_mixed_dot_prod (stmt_info)) /* Three dot-products and a subtraction. 
*/ factor = 4; record_stmt_cost (cost_vec, ncopies * factor, vector_stmt, @@ -8617,8 +8620,8 @@ vect_transform_reduction (loop_vec_info loop_vinfo, : &vec_oprnds2)); } - bool emulated_mixed_dot_prod -= vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info); + bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); + FOR_EACH_VEC_ELT (vec_oprnds0, i, def0) { gimple *new_stmt; -- 2.17.1From b885de76ad7e9f5accceff18cb6c11de73a36225 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Wed, 29 May 2024 16:41:57 +0800 Subject: [PATCH 4/6] vect: Bind input vectype to lane-reducing operation The input vectype is an attribute of lane-reducing operation, instead of reduction PHI that it is associated to, since there might be more than one lane-reducing operations with different type in a loop reduction chain. So bind each lane-reducing operation with its own input type. 2024-05-29 Feng Xue gcc/ * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter loop_vinfo. Get input vectype from stmt_info instead of reduction PHI. (vect_model_reduction_cost): Remove loop_vinfo argument of call to vect_is_emulated_mixed_dot_prod. (vect_transform_reduction): Likewise. (vectorizable_reduction): Likewise, and bind input vectype to lane-reducing operation. --- gcc/tree-vect-loop.cc | 23 +-- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/gcc/tr
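A stand-alone illustration of why one vectype recorded on the reduction PHI is not enough once the chain mixes operations: each statement can have its own input vectype, while the PHI only keeps the one with the fewest lanes. The lane counts below are made up for the sketch.

#include <algorithm>
#include <cstdio>

int
main ()
{
  /* Lanes of the input vectype of each statement in a mixed reduction chain
     (illustrative numbers only).  */
  struct { const char *op; int in_lanes; } chain[] =
    { { "DOT_PROD", 16 }, { "SAD", 8 }, { "normal +", 4 } };

  int phi_lanes = chain[0].in_lanes;
  for (const auto &c : chain)
    phi_lanes = std::min (phi_lanes, c.in_lanes);  /* what the PHI records  */

  for (const auto &c : chain)
    std::printf ("%-9s input lanes %2d%s\n", c.op, c.in_lanes,
                 c.in_lanes == phi_lanes ? "" : "  (differs from the PHI's vectype_in)");
  std::printf ("reduction PHI records the %d-lane input vectype\n", phi_lanes);
}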
[PATCH 3/6] vect: Set STMT_VINFO_REDUC_DEF for non-live stmt in loop reduction
Normally, vectorizable checking on statement in a loop reduction chain does not use the reduction PHI information. But some special statements might need it in vectorizable analysis, especially, for multiple lane-reducing operations support later. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vectorizable_reduction): Set STMT_VINFO_REDUC_DEF for non-live stmt. * tree-vect-stmts.cc (vectorizable_condition): Treat the condition statement that is pointed by stmt_vec_info of reduction PHI as the real "for_reduction" statement. --- gcc/tree-vect-loop.cc | 5 +++-- gcc/tree-vect-stmts.cc | 11 ++- 2 files changed, 13 insertions(+), 3 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index aa5f21ccd1a..51627c27f8a 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7632,14 +7632,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo, all lanes here - even though we only will vectorize from the SLP node with live lane zero the other live lanes also need to be identified as part of a reduction to be able -to skip code generation for them. */ +to skip code generation for them. For lane-reducing operation +vectorizable analysis needs the reduction PHI information. */ if (slp_for_stmt_info) { for (auto s : SLP_TREE_SCALAR_STMTS (slp_for_stmt_info)) if (STMT_VINFO_LIVE_P (s)) STMT_VINFO_REDUC_DEF (vect_orig_stmt (s)) = phi_info; } - else if (STMT_VINFO_LIVE_P (vdef)) + else STMT_VINFO_REDUC_DEF (def) = phi_info; gimple_match_op op; if (!gimple_extract_op (vdef->stmt, &op)) diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 935d80f0e1b..2e0be763abb 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -12094,11 +12094,20 @@ vectorizable_condition (vec_info *vinfo, vect_reduction_type reduction_type = TREE_CODE_REDUCTION; bool for_reduction = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)) != NULL; + if (for_reduction) +{ + reduc_info = info_for_reduction (vinfo, stmt_info); + if (STMT_VINFO_REDUC_DEF (reduc_info) != vect_orig_stmt (stmt_info)) + { + for_reduction = false; + reduc_info = NULL; + } +} + if (for_reduction) { if (slp_node) return false; - reduc_info = info_for_reduction (vinfo, stmt_info); reduction_type = STMT_VINFO_REDUC_TYPE (reduc_info); reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); gcc_assert (reduction_type != EXTRACT_LAST_REDUCTION -- 2.17.1
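To picture where the vectorizable_condition change matters: with REDUC_DEF now set on every statement in the chain, a plain conditional statement that merely feeds the reduction must not be confused with a genuine condition reduction. A loop of roughly the following shape, purely illustrative and not taken from a PR, contains such a statement.

#include <cstdio>

int
main ()
{
  int a[8] = { 1, 0, 1, 0, 1, 1, 0, 1 };
  int b[8] = { 2, 3, 4, 5, 6, 7, 8, 9 };
  int c[8] = { 9, 8, 7, 6, 5, 4, 3, 2 };

  int sum = 0;
  for (int i = 0; i < 8; i++)
    {
      int t = a[i] ? b[i] : c[i];  /* conditional statement inside the reduction chain  */
      sum += t;                    /* the actual reduction statement  */
    }
  std::printf ("%d\n", sum);
}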
[PATCH 2/6] vect: Split out partial vect checking for reduction into a function
This is a patch that is split out from https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652626.html. Partial vectorization checking for vectorizable_reduction is a piece of relatively isolated code, which may be reused by other places. Move the code into a new function for sharing. Thanks, Feng --- gcc/ * tree-vect-loop.cc (vect_reduction_use_partial_vector): New function. (vectorizable_reduction): Move partial vectorization checking code to vect_reduction_use_partial_vector. --- gcc/tree-vect-loop.cc | 138 -- 1 file changed, 78 insertions(+), 60 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a42d79c7cbf..aa5f21ccd1a 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7391,6 +7391,81 @@ build_vect_cond_expr (code_helper code, tree vop[3], tree mask, } } +/* Given an operation with CODE in loop reduction path whose reduction PHI is + specified by REDUC_INFO, the operation has TYPE of scalar result, and its + input vectype is represented by VECTYPE_IN. The vectype of vectorized result + may be different from VECTYPE_IN, either in base type or vectype lanes, + lane-reducing operation is the case. This function check if it is possible, + and how to perform partial vectorization on the operation in the context + of LOOP_VINFO. */ + +static void +vect_reduction_use_partial_vector (loop_vec_info loop_vinfo, + stmt_vec_info reduc_info, + slp_tree slp_node, code_helper code, + tree type, tree vectype_in) +{ + if (!LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) +return; + + enum vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (reduc_info); + internal_fn reduc_fn = STMT_VINFO_REDUC_FN (reduc_info); + internal_fn cond_fn = get_conditional_internal_fn (code, type); + + if (reduc_type != FOLD_LEFT_REDUCTION + && !use_mask_by_cond_expr_p (code, cond_fn, vectype_in) + && (cond_fn == IFN_LAST + || !direct_internal_fn_supported_p (cond_fn, vectype_in, + OPTIMIZE_FOR_SPEED))) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, +"can't operate on partial vectors because" +" no conditional operation is available.\n"); + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; +} + else if (reduc_type == FOLD_LEFT_REDUCTION + && reduc_fn == IFN_LAST + && !expand_vec_cond_expr_p (vectype_in, truth_type_for (vectype_in), + SSA_NAME)) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't operate on partial vectors because" + " no conditional operation is available.\n"); + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; +} + else if (reduc_type == FOLD_LEFT_REDUCTION + && internal_fn_mask_index (reduc_fn) == -1 + && FLOAT_TYPE_P (vectype_in) + && HONOR_SIGN_DEPENDENT_ROUNDING (vectype_in)) +{ + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, +"can't operate on partial vectors because" +" signed zeros cannot be preserved.\n"); + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; +} + else +{ + internal_fn mask_reduc_fn + = get_masked_reduction_fn (reduc_fn, vectype_in); + vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); + unsigned nvectors; + + if (slp_node) + nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + else + nvectors = vect_get_num_copies (loop_vinfo, vectype_in); + + if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1); + else + vect_record_loop_mask (loop_vinfo, masks, nvectors, vectype_in, 
NULL); +} +} + /* Function vectorizable_reduction. Check if STMT_INFO performs a reduction operation that can be vectorized. @@ -7456,7 +7531,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, bool single_defuse_cycle = false; bool nested_cycle = false; bool double_reduc = false; - int vec_num; tree cr_index_scalar_type = NULL_TREE, cr_index_vector_type = NULL_TREE; tree cond_reduc_val = NULL_TREE; @@ -8283,11 +8357,6 @@ vectorizable_reduction (loop_vec_info loop_vinfo, return false; } - if (slp_node) -vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); - else -vec_num = 1; - vect_model_reduction_cost (loop_vinfo, stmt_info, reduc_fn, reduction_type, ncopies, cost_vec); /* Cost the reducti
[PATCH 1/6] vect: Add a function to check lane-reducing code [PR114440]
This is a patch that is split out from https://gcc.gnu.org/pipermail/gcc-patches/2024-May/652626.html. Check if an operation is lane-reducing requires comparison of code against three kinds (DOT_PROD_EXPR/WIDEN_SUM_EXPR/SAD_EXPR). Add an utility function to make source coding for the check handy and concise. Feng -- gcc/ * tree-vectorizer.h (lane_reducing_op_p): New function. * tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_op_p to check statement code. * tree-vect-loop.cc (vect_transform_reduction): Likewise. (vectorizable_reduction): Likewise, and change name of a local variable that holds the result flag. --- gcc/tree-vect-loop.cc | 29 - gcc/tree-vect-slp.cc | 4 +--- gcc/tree-vectorizer.h | 6 ++ 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 04a9ac64df7..a42d79c7cbf 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7650,9 +7650,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, gimple_match_op op; if (!gimple_extract_op (stmt_info->stmt, &op)) gcc_unreachable (); - bool lane_reduc_code_p = (op.code == DOT_PROD_EXPR - || op.code == WIDEN_SUM_EXPR - || op.code == SAD_EXPR); + bool lane_reducing = lane_reducing_op_p (op.code); if (!POINTER_TYPE_P (op.type) && !INTEGRAL_TYPE_P (op.type) && !SCALAR_FLOAT_TYPE_P (op.type)) @@ -7664,7 +7662,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* For lane-reducing ops we're reducing the number of reduction PHIs which means the only use of that may be in the lane-reducing operation. */ - if (lane_reduc_code_p + if (lane_reducing && reduc_chain_length != 1 && !only_slp_reduc_chain) { @@ -7678,7 +7676,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, since we'll mix lanes belonging to different reductions. But it's OK to use them in a reduction chain or when the reduction group has just one element. */ - if (lane_reduc_code_p + if (lane_reducing && slp_node && !REDUC_GROUP_FIRST_ELEMENT (stmt_info) && SLP_TREE_LANES (slp_node) > 1) @@ -7738,7 +7736,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* To properly compute ncopies we are interested in the widest non-reduction input type in case we're looking at a widening accumulation that we later handle in vect_transform_reduction. */ - if (lane_reduc_code_p + if (lane_reducing && vectype_op[i] && (!vectype_in || (GET_MODE_SIZE (SCALAR_TYPE_MODE (TREE_TYPE (vectype_in))) @@ -8211,7 +8209,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, && loop_vinfo->suggested_unroll_factor == 1) single_defuse_cycle = true; - if (single_defuse_cycle || lane_reduc_code_p) + if (single_defuse_cycle || lane_reducing) { gcc_assert (op.code != COND_EXPR); @@ -8227,7 +8225,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, mixed-sign dot-products can be implemented using signed dot-products. */ machine_mode vec_mode = TYPE_MODE (vectype_in); - if (!lane_reduc_code_p + if (!lane_reducing && !directly_supported_p (op.code, vectype_in, optab_vector)) { if (dump_enabled_p ()) @@ -8252,7 +8250,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, For the other cases try without the single cycle optimization. */ if (!ok) { - if (lane_reduc_code_p) + if (lane_reducing) return false; else single_defuse_cycle = false; @@ -8263,7 +8261,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* If the reduction stmt is one of the patterns that have lane reduction embedded we cannot handle the case of ! single_defuse_cycle. */ if ((ncopies > 1 && ! 
single_defuse_cycle) - && lane_reduc_code_p) + && lane_reducing) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -8274,7 +8272,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, if (slp_node && !(!single_defuse_cycle - && !lane_reduc_code_p + && !lane_reducing && reduction_type != FOLD_LEFT_REDUCTION)) for (i = 0; i < (int) op.num_ops; i++) if (!vect_maybe_update_slp_op_vectype (slp_op[i], vectype_op[i])) @@ -8295,7 +8293,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo, /* Cost the reduction op inside the loop if transformed via vect_transform_reduction. Otherwise this is costed by the separate vectorizable_* routines. */ - if (single_defuse_cycle || lane_reduc_code_p) + if (single_defuse_cycle || lane_reducing) { int factor = 1; if (vect_is_emulated_mixed_dot_prod (loop_vinfo, stmt_info)) @@ -8313,7 +8311,7 @@ vectori
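Since the patch describes the new helper only through its uses, here is what a minimal, self-contained version of the check looks like; the real one lives in tree-vectorizer.h and takes a code_helper rather than the stand-in enum used below.

#include <cstdio>

/* Stand-in for the relevant tree codes; illustrative only.  */
enum op_code { DOT_PROD_EXPR, WIDEN_SUM_EXPR, SAD_EXPR, PLUS_EXPR };

static inline bool
lane_reducing_op_p (op_code code)
{
  return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR;
}

int
main ()
{
  std::printf ("SAD_EXPR: %d, PLUS_EXPR: %d\n",
               lane_reducing_op_p (SAD_EXPR), lane_reducing_op_p (PLUS_EXPR));
}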
Re: [PATCH] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
>> Hi, >> >> The patch was updated with the newest trunk, and also contained some minor >> changes. >> >> I am working on another new feature which is meant to support pattern >> recognition >> of lane-reducing operations in affine closure originated from loop reduction >> variable, >> like: >> >> sum += cst1 * dot_prod_1 + cst2 * sad_2 + ... + cstN * lane_reducing_op_N >> >> The feature WIP depends on the patch. It has been a little bit long time >> since its post, >> would you please take a time to review this one? Thanks. > This seems to do multiple things so I wonder if you can split up the > patch a bit? OK. Will send out split patches in new mails. > For example adding lane_reducing_op_p can be split out, it also seems like > the vect_transform_reduction change to better distribute work can be done > separately? Likewise refactoring like splitting out > vect_reduction_use_partial_vector. > > When we have > >sum += d0[i] * d1[i]; // dot-prod >sum += w[i]; // widen-sum >sum += abs(s0[i] - s1[i]); // sad >sum += n[i]; // normal > > the vector DOT_PROD and friend ops can end up mixing different lanes > since it is not specified which lanes are reduced into which output lane. > So, DOT_PROD might combine 0-3, 4-7, ... but SAD might combine > 0,4,8,12; 1,5,9,13; ... I think this isn't worse than what one op itself > is doing, but it's worth pointing out (it's probably unlikely a target > mixes different reduction strategies anyway). Yes. But even if DOT_PROD and SAD use different reduction strategies on some peculiar target, that does not impact result correctness, at least for integer operations. Is there anything special that we need to consider? > > Can you make sure to add at least one SLP reduction example to show > this works for SLP as well? OK. The patches contain cases for SLP reduction chains. Will add one for SLP reduction; this should be a negative case. Thanks, Feng
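On the lane-mixing question: for wrap-around integer arithmetic the grouping of lanes never changes the final value, since addition is associative and commutative. A small stand-alone demonstration, unsigned on purpose so overflow is well defined:

#include <cstdio>

int
main ()
{
  unsigned v[16], contiguous[4] = { 0, 0, 0, 0 }, strided[4] = { 0, 0, 0, 0 };
  for (int i = 0; i < 16; i++)
    v[i] = 1000000007u * (unsigned) (i + 1);  /* arbitrary values, wrap-around is fine  */

  for (int i = 0; i < 16; i++)
    {
      contiguous[i / 4] += v[i];  /* lanes 0-3, 4-7, ... reduced together  */
      strided[i % 4] += v[i];     /* lanes 0,4,8,12 / 1,5,9,13 / ... reduced together  */
    }

  unsigned a = 0, b = 0;
  for (int i = 0; i < 4; i++)
    {
      a += contiguous[i];
      b += strided[i];
    }
  std::printf ("%u %u\n", a, b);  /* always identical  */
}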
Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info
Ok. Then I will add a TODO comment on "bbs" field to describe it. Thanks, Feng From: Richard Biener Sent: Wednesday, May 29, 2024 3:14 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info On Tue, May 28, 2024 at 6:11 PM Feng Xue OS wrote: > > Because bbs of loop_vec_info need to be allocated via old-fashion > XCNEWVEC, in order to receive result from dfs_enumerate_from(), > so have to make bb_vec_info align with loop_vec_info, use > basic_block * instead of vec. Another reason is that > some loop vect related codes assume that bbs is a pointer, such > as using LOOP_VINFO_BBS() to directly free the bbs area. I think dfs_enumerate_from is fine with receiving bbs.address () (if you first grow the vector, of course). There might be other code that needs changing, sure. > While encapsulating bbs into array_slice might make changed code > more wordy. So still choose basic_block * as its type. Updated the > patch by removing bbs_as_vector. The updated patch looks good to me. Lifetime management of the base class bbs done differently by _loop_vec_info and _bb_vec_info is a bit ugly but it's a well isolated fact. Thus, OK. I do think we can turn the basic_block * back to a vec<> but this can be done as followup if anybody has spare cycles. Thanks, Richard. > Feng. > > gcc/ > * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move > initialization of bbs to explicit construction code. Adjust the > definition of nbbs. > (update_epilogue_loop_vinfo): Update nbbs for epilog vinfo. > * tree-vect-pattern.cc (vect_determine_precisions): Make > loop_vec_info and bb_vec_info share same code. > (vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop. > * tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0] > via base vec_info class. > (_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data > fields of input auto_vec<> bbs. > (vect_slp_region): Use access to nbbs to replace original > bbs.length(). > (vect_schedule_slp_node): Access to bbs[0] via base vec_info class. > * tree-vectorizer.cc (vec_info::vec_info): Add initialization of > bbs and nbbs. > (vec_info::insert_seq_on_entry): Access to bbs[0] via base vec_info > class. > * tree-vectorizer.h (vec_info): Add new fields bbs and nbbs. > (LOOP_VINFO_NBBS): New macro. > (BB_VINFO_BBS): Rename BB_VINFO_BB to BB_VINFO_BBS. > (BB_VINFO_NBBS): New macro. > (_loop_vec_info): Remove field bbs. > (_bb_vec_info): Rename field bbs. > --- > gcc/tree-vect-loop.cc | 7 +- > gcc/tree-vect-patterns.cc | 142 +++--- > gcc/tree-vect-slp.cc | 23 +++--- > gcc/tree-vectorizer.cc| 7 +- > gcc/tree-vectorizer.h | 19 +++-- > 5 files changed, 70 insertions(+), 128 deletions(-) > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 3b94bb13a8b..04a9ac64df7 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -1028,7 +1028,6 @@ bb_in_loop_p (const_basic_block bb, const void *data) > _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) >: vec_info (vec_info::loop, shared), > loop (loop_in), > -bbs (XCNEWVEC (basic_block, loop->num_nodes)), > num_itersm1 (NULL_TREE), > num_iters (NULL_TREE), > num_iters_unchanged (NULL_TREE), > @@ -1079,8 +1078,9 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, > vec_info_shared *shared) > case of the loop forms we allow, a dfs order of the BBs would the same > as reversed postorder traversal, so we are safe. 
*/ > > - unsigned int nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, > - bbs, loop->num_nodes, loop); > + bbs = XCNEWVEC (basic_block, loop->num_nodes); > + nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, bbs, > +loop->num_nodes, loop); >gcc_assert (nbbs == loop->num_nodes); > >for (unsigned int i = 0; i < nbbs; i++) > @@ -11667,6 +11667,7 @@ update_epilogue_loop_vinfo (class loop *epilogue, > tree advance) > >free (LOOP_VINFO_BBS (epilogue_vinfo)); >LOOP_VINFO_BBS (epilogue_vinfo) = epilogue_bbs; > + LOOP_VINFO_NBBS (epilogue_vinfo) = epilogue->num_nodes; > >/* Advance data_reference's with the number of iterations of the previous > loop and its prologue. */ > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc > i
Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info
0644 --- a/gcc/tree-vectorizer.cc +++ b/gcc/tree-vectorizer.cc @@ -463,7 +463,9 @@ shrink_simd_arrays vec_info::vec_info (vec_info::vec_kind kind_in, vec_info_shared *shared_) : kind (kind_in), shared (shared_), -stmt_vec_info_ro (false) +stmt_vec_info_ro (false), +bbs (NULL), +nbbs (0) { stmt_vec_infos.create (50); } @@ -660,9 +662,8 @@ vec_info::insert_seq_on_entry (stmt_vec_info context, gimple_seq seq) } else { - bb_vec_info bb_vinfo = as_a (this); gimple_stmt_iterator gsi_region_begin - = gsi_after_labels (bb_vinfo->bbs[0]); + = gsi_after_labels (bbs[0]); gsi_insert_seq_before (&gsi_region_begin, seq, GSI_SAME_STMT); } } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 93bc30ef660..bd4f5952f4b 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -499,6 +499,12 @@ public: made any decisions about which vector modes to use. */ machine_mode vector_mode; + /* The basic blocks in the vectorization region. */ + basic_block *bbs; + + /* The count of the basic blocks in the vectorization region. */ + unsigned int nbbs; + private: stmt_vec_info new_stmt_vec_info (gimple *stmt); void set_vinfo_for_stmt (gimple *, stmt_vec_info, bool = true); @@ -679,9 +685,6 @@ public: /* The loop to which this info struct refers to. */ class loop *loop; - /* The loop basic blocks. */ - basic_block *bbs; - /* Number of latch executions. */ tree num_itersm1; /* Number of iterations. */ @@ -969,6 +972,7 @@ public: #define LOOP_VINFO_EPILOGUE_IV_EXIT(L) (L)->vec_epilogue_loop_iv_exit #define LOOP_VINFO_SCALAR_IV_EXIT(L) (L)->scalar_loop_iv_exit #define LOOP_VINFO_BBS(L) (L)->bbs +#define LOOP_VINFO_NBBS(L) (L)->nbbs #define LOOP_VINFO_NITERSM1(L) (L)->num_itersm1 #define LOOP_VINFO_NITERS(L) (L)->num_iters /* Since LOOP_VINFO_NITERS and LOOP_VINFO_NITERSM1 can change after @@ -1094,16 +1098,11 @@ public: _bb_vec_info (vec bbs, vec_info_shared *); ~_bb_vec_info (); - /* The region we are operating on. bbs[0] is the entry, excluding - its PHI nodes. In the future we might want to track an explicit - entry edge to cover bbs[0] PHI nodes and have a region entry - insert location. */ - vec bbs; - vec roots; } *bb_vec_info; -#define BB_VINFO_BB(B) (B)->bb +#define BB_VINFO_BBS(B) (B)->bbs +#define BB_VINFO_NBBS(B) (B)->nbbs #define BB_VINFO_GROUPED_STORES(B) (B)->grouped_stores #define BB_VINFO_SLP_INSTANCES(B)(B)->slp_instances #define BB_VINFO_DATAREFS(B) (B)->shared->datarefs -- 2.17.1 From: Richard Biener Sent: Tuesday, May 28, 2024 5:43 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info On Sat, May 25, 2024 at 4:54 PM Feng Xue OS wrote: > > Both derived classes ( loop_vec_info/bb_vec_info) have their own "bbs" > field, which have exactly same purpose of recording all basic blocks > inside the corresponding vect region, while the fields are composed by > different data type, one is normal array, the other is auto_vec. This > difference causes some duplicated code even handling the same stuff, > almost in tree-vect-patterns. One refinement is lifting this field into the > base class "vec_info", and reset its value to the continuous memory area > pointed by two old "bbs" in each constructor of derived classes. Nice. But. bbs_as_vector - why is that necessary? Why is vinfo->bbs not a vec? Having bbs and nbbs feels like a step back. Note the code duplications can probably be removed by "indirecting" through an array_slice. I'm a bit torn to approve this as-is given the above. 
Can you explain what made you not choose vec<> for bbs? I bet you tried. Richard. > Regression test on x86-64 and aarch64. > > Feng > -- > gcc/ > * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move > initialization of bbs to explicit construction code. Adjust the > definition of nbbs. > * tree-vect-pattern.cc (vect_determine_precisions): Make > loop_vec_info and bb_vec_info share same code. > (vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop. > * tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0] > via base vec_info class. > (_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data > fields of input auto_vec<> bbs. > (_bb_vec_info::_bb_vec_info): Add assertions on bbs and nbbs to ensure > they are not changed externally. > (vect_slp_region): Use access to nbbs to replace original > bbs.length(
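The "indirecting through a slice" idea mentioned above is simply: hand the shared walking code a non-owning pointer-plus-length view, so it no longer cares whether the blocks came from an XCNEWVEC'd array or a vec<>. A minimal stand-alone stand-in follows; GCC has its own slice type for this, and the class below is only meant to show the calling convention, not that type's interface.

#include <cstdio>

/* Minimal non-owning view over a pointer + length pair.  */
struct bb_view
{
  int *ptr;       /* would be basic_block * in GCC  */
  unsigned len;
  int *begin () const { return ptr; }
  int *end () const { return ptr + len; }
};

/* One walker can then serve both vec_info flavours, as long as each can
   expose its blocks as pointer + length.  */
static void
walk_blocks (bb_view bbs)
{
  for (int bb : bbs)
    std::printf ("visit block %d\n", bb);
}

int
main ()
{
  int region_bbs[3] = { 2, 3, 4 };  /* imagine basic block indices  */
  walk_blocks (bb_view { region_bbs, 3u });
}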
Re: [PATCH] vect: Use vect representative statement instead of original in pattern recog [PR115060]
Changed as the comments. Thanks, Feng From: Richard Biener Sent: Tuesday, May 28, 2024 5:34 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] vect: Use vect representative statement instead of original in patch recog [PR115060] On Sat, May 25, 2024 at 4:45 PM Feng Xue OS wrote: > > Some utility functions (such as vect_look_through_possible_promotion) that are > to find out certain kind of direct or indirect definition SSA for a value, may > return the original one of the SSA, not its pattern representative SSA, even > pattern is involved. For example, > >a = (T1) patt_b; >patt_b = (T2) c;// b = ... >patt_c = not-a-cast;// c = ... > > Given 'a', the mentioned function will return 'c', instead of 'patt_c'. This > subtlety would make some pattern recog code that is unaware of it mis-use the > original instead of the new pattern statement, which is inconsistent wth > processing logic of the pattern formation pass. This patch corrects the issue > by forcing another utility function (vect_get_internal_def) return the pattern > statement information to caller by default. > > Regression test on x86-64 and aarch64. > > Feng > -- > gcc/ > PR tree-optimization/115060 > * tree-vect-patterns.h (vect_get_internal_def): Add a new parameter > for_vectorize. > (vect_widened_op_tree): Call vect_get_internal_def instead of look_def > to get statement information. > (vect_recog_widen_abd_pattern): No need to call > vect_stmt_to_vectorize. > --- > gcc/tree-vect-patterns.cc | 16 +++- > 1 file changed, 11 insertions(+), 5 deletions(-) > > diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc > index a313dc64643..fa35bf26372 100644 > --- a/gcc/tree-vect-patterns.cc > +++ b/gcc/tree-vect-patterns.cc > @@ -258,15 +258,21 @@ vect_element_precision (unsigned int precision) > } > > /* If OP is defined by a statement that's being considered for vectorization, > - return information about that statement, otherwise return NULL. */ > + return information about that statement, otherwise return NULL. > + FOR_VECTORIZE is used to specify whether original or vectorization > + representative (if have) statement information is returned. */ > > static stmt_vec_info > -vect_get_internal_def (vec_info *vinfo, tree op) > +vect_get_internal_def (vec_info *vinfo, tree op, bool for_vectorize = true) I'm probably blind - but you nowhere pass 'false' and I think returning the pattern stmt is the correct behavior always. OK with omitting the new parameter. > { >stmt_vec_info def_stmt_info = vinfo->lookup_def (op); >if (def_stmt_info >&& STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_internal_def) > -return def_stmt_info; > +{ > + if (for_vectorize) > + def_stmt_info = vect_stmt_to_vectorize (def_stmt_info); > + return def_stmt_info; > +} >return NULL; > } > > @@ -655,7 +661,8 @@ vect_widened_op_tree (vec_info *vinfo, stmt_vec_info > stmt_info, tree_code code, > > /* Recursively process the definition of the operand. */ > stmt_vec_info def_stmt_info > - = vinfo->lookup_def (this_unprom->op); > + = vect_get_internal_def (vinfo, this_unprom->op); > + > nops = vect_widened_op_tree (vinfo, def_stmt_info, code, >widened_code, shift_p, max_nops, >this_unprom, common_type, > @@ -1739,7 +1746,6 @@ vect_recog_widen_abd_pattern (vec_info *vinfo, > stmt_vec_info stmt_vinfo, >if (!abd_pattern_vinfo) > return NULL; > > - abd_pattern_vinfo = vect_stmt_to_vectorize (abd_pattern_vinfo); >gcall *abd_stmt = dyn_cast (STMT_VINFO_STMT (abd_pattern_vinfo)); >if (!abd_stmt
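A toy model of the behavioural change, using the a/patt_b/patt_c example from the description: the lookup used to hand back the original statement and now prefers the pattern representative when one exists. Everything below is illustrative scaffolding, not vectorizer code.

#include <cstdio>
#include <map>
#include <string>

struct toy_stmt_info
{
  std::string text;
  toy_stmt_info *pattern;  /* replacement pattern statement, if any  */
};

/* Toy counterpart of the lookup: prefer the pattern statement.  */
static toy_stmt_info *
get_internal_def (std::map<std::string, toy_stmt_info *> &defs, const std::string &op)
{
  auto it = defs.find (op);
  if (it == defs.end ())
    return nullptr;
  toy_stmt_info *info = it->second;
  return info->pattern ? info->pattern : info;
}

int
main ()
{
  toy_stmt_info patt_c = { "patt_c = not-a-cast;", nullptr };
  toy_stmt_info c = { "c = ...;", &patt_c };
  std::map<std::string, toy_stmt_info *> defs = { { "c", &c } };
  std::printf ("%s\n", get_internal_def (defs, "c")->text.c_str ());
}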
[PATCH] vect: Unify bbs in loop_vec_info and bb_vec_info
Both derived classes ( loop_vec_info/bb_vec_info) have their own "bbs" field, which have exactly same purpose of recording all basic blocks inside the corresponding vect region, while the fields are composed by different data type, one is normal array, the other is auto_vec. This difference causes some duplicated code even handling the same stuff, almost in tree-vect-patterns. One refinement is lifting this field into the base class "vec_info", and reset its value to the continuous memory area pointed by two old "bbs" in each constructor of derived classes. Regression test on x86-64 and aarch64. Feng -- gcc/ * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move initialization of bbs to explicit construction code. Adjust the definition of nbbs. * tree-vect-pattern.cc (vect_determine_precisions): Make loop_vec_info and bb_vec_info share same code. (vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop. * tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0] via base vec_info class. (_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data fields of input auto_vec<> bbs. (_bb_vec_info::_bb_vec_info): Add assertions on bbs and nbbs to ensure they are not changed externally. (vect_slp_region): Use access to nbbs to replace original bbs.length(). (vect_schedule_slp_node): Access to bbs[0] via base vec_info class. * tree-vectorizer.cc (vec_info::vec_info): Add initialization of bbs and nbbs. (vec_info::insert_seq_on_entry): Access to bbs[0] via base vec_info class. * tree-vectorizer.h (vec_info): Add new fields bbs and nbbs. (_loop_vec_info): Remove field bbs. (_bb_vec_info): Rename old bbs field to bbs_as_vector, and make it be private. --- gcc/tree-vect-loop.cc | 6 +- gcc/tree-vect-patterns.cc | 142 +++--- gcc/tree-vect-slp.cc | 24 --- gcc/tree-vectorizer.cc| 7 +- gcc/tree-vectorizer.h | 19 ++--- 5 files changed, 72 insertions(+), 126 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 83c0544b6aa..aef17420a5f 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -1028,7 +1028,6 @@ bb_in_loop_p (const_basic_block bb, const void *data) _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) : vec_info (vec_info::loop, shared), loop (loop_in), -bbs (XCNEWVEC (basic_block, loop->num_nodes)), num_itersm1 (NULL_TREE), num_iters (NULL_TREE), num_iters_unchanged (NULL_TREE), @@ -1079,8 +1078,9 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) case of the loop forms we allow, a dfs order of the BBs would the same as reversed postorder traversal, so we are safe. 
*/ - unsigned int nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, - bbs, loop->num_nodes, loop); + bbs = XCNEWVEC (basic_block, loop->num_nodes); + nbbs = dfs_enumerate_from (loop->header, 0, bb_in_loop_p, bbs, +loop->num_nodes, loop); gcc_assert (nbbs == loop->num_nodes); for (unsigned int i = 0; i < nbbs; i++) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index a313dc64643..848a3195a93 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -6925,81 +6925,41 @@ vect_determine_stmt_precisions (vec_info *vinfo, stmt_vec_info stmt_info) void vect_determine_precisions (vec_info *vinfo) { + basic_block *bbs = vinfo->bbs; + unsigned int nbbs = vinfo->nbbs; + DUMP_VECT_SCOPE ("vect_determine_precisions"); - if (loop_vec_info loop_vinfo = dyn_cast (vinfo)) + for (unsigned int i = 0; i < nbbs; i++) { - class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); - basic_block *bbs = LOOP_VINFO_BBS (loop_vinfo); - unsigned int nbbs = loop->num_nodes; - - for (unsigned int i = 0; i < nbbs; i++) + basic_block bb = bbs[i]; + for (auto gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi)) { - basic_block bb = bbs[i]; - for (auto gsi = gsi_start_phis (bb); - !gsi_end_p (gsi); gsi_next (&gsi)) - { - stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi.phi ()); - if (stmt_info) - vect_determine_mask_precision (vinfo, stmt_info); - } - for (auto si = gsi_start_bb (bb); !gsi_end_p (si); gsi_next (&si)) - if (!is_gimple_debug (gsi_stmt (si))) - vect_determine_mask_precision - (vinfo, vinfo->lookup_stmt (gsi_stmt (si))); + stmt_vec_info stmt_info = vinfo->lookup_stmt (gsi.phi ()); + if (stmt_info && STMT_VINFO_VECTORIZABLE (stmt_info)) + vect_determine_mask_precision (vinfo, stmt_info); } - for (unsigned int i = 0; i < nbbs;
[PATCH] vect: Use vect representative statement instead of original in pattern recog [PR115060]
Some utility functions (such as vect_look_through_possible_promotion) that are to find out certain kind of direct or indirect definition SSA for a value, may return the original one of the SSA, not its pattern representative SSA, even pattern is involved. For example, a = (T1) patt_b; patt_b = (T2) c;// b = ... patt_c = not-a-cast;// c = ... Given 'a', the mentioned function will return 'c', instead of 'patt_c'. This subtlety would make some pattern recog code that is unaware of it mis-use the original instead of the new pattern statement, which is inconsistent wth processing logic of the pattern formation pass. This patch corrects the issue by forcing another utility function (vect_get_internal_def) return the pattern statement information to caller by default. Regression test on x86-64 and aarch64. Feng -- gcc/ PR tree-optimization/115060 * tree-vect-patterns.h (vect_get_internal_def): Add a new parameter for_vectorize. (vect_widened_op_tree): Call vect_get_internal_def instead of look_def to get statement information. (vect_recog_widen_abd_pattern): No need to call vect_stmt_to_vectorize. --- gcc/tree-vect-patterns.cc | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc index a313dc64643..fa35bf26372 100644 --- a/gcc/tree-vect-patterns.cc +++ b/gcc/tree-vect-patterns.cc @@ -258,15 +258,21 @@ vect_element_precision (unsigned int precision) } /* If OP is defined by a statement that's being considered for vectorization, - return information about that statement, otherwise return NULL. */ + return information about that statement, otherwise return NULL. + FOR_VECTORIZE is used to specify whether original or vectorization + representative (if have) statement information is returned. */ static stmt_vec_info -vect_get_internal_def (vec_info *vinfo, tree op) +vect_get_internal_def (vec_info *vinfo, tree op, bool for_vectorize = true) { stmt_vec_info def_stmt_info = vinfo->lookup_def (op); if (def_stmt_info && STMT_VINFO_DEF_TYPE (def_stmt_info) == vect_internal_def) -return def_stmt_info; +{ + if (for_vectorize) + def_stmt_info = vect_stmt_to_vectorize (def_stmt_info); + return def_stmt_info; +} return NULL; } @@ -655,7 +661,8 @@ vect_widened_op_tree (vec_info *vinfo, stmt_vec_info stmt_info, tree_code code, /* Recursively process the definition of the operand. */ stmt_vec_info def_stmt_info - = vinfo->lookup_def (this_unprom->op); + = vect_get_internal_def (vinfo, this_unprom->op); + nops = vect_widened_op_tree (vinfo, def_stmt_info, code, widened_code, shift_p, max_nops, this_unprom, common_type, @@ -1739,7 +1746,6 @@ vect_recog_widen_abd_pattern (vec_info *vinfo, stmt_vec_info stmt_vinfo, if (!abd_pattern_vinfo) return NULL; - abd_pattern_vinfo = vect_stmt_to_vectorize (abd_pattern_vinfo); gcall *abd_stmt = dyn_cast (STMT_VINFO_STMT (abd_pattern_vinfo)); if (!abd_stmt
Re: [PATCH] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
Hi, The patch was updated with the newest trunk, and also contained some minor changes. I am working on another new feature which is meant to support pattern recognition of lane-reducing operations in affine closure originated from loop reduction variable, like: sum += cst1 * dot_prod_1 + cst2 * sad_2 + ... + cstN * lane_reducing_op_N The feature WIP depends on the patch. It has been a little bit long time since its post, would you please take a time to review this one? Thanks. Feng gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. (lane_reducing_op_p): New function. (vectorizable_lane_reducing): New function declaration. * tree-vect-stmts.cc (vectorizable_condition): Treat the condition statement that is pointed by stmt_vec_info of reduction PHI as the real "for_reduction" statement. (vect_analyze_stmt): Call new function vectorizable_lane_reducing to analyze lane-reducing operation. * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter loop_vinfo. Get input vectype from stmt_info instead of reduction PHI. (vect_model_reduction_cost): Remove cost computation code related to emulated_mixed_dot_prod. (vect_reduction_use_partial_vector): New function. (vectorizable_lane_reducing): New function. (vectorizable_reduction): Allow multiple lane-reducing operations in loop reduction. Move some original lane-reducing related code to vectorizable_lane_reducing, and move partial vectorization checking code to vect_reduction_use_partial_vector. (vect_transform_reduction): Extend transformation to support reduction statements with mixed input vectypes. * tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_op_p to check statement code. gcc/testsuite/ PR tree-optimization/114440 * gcc.dg/vect/vect-reduc-chain-1.c * gcc.dg/vect/vect-reduc-chain-2.c * gcc.dg/vect/vect-reduc-chain-3.c * gcc.dg/vect/vect-reduc-dot-slp-1.c * gcc.dg/vect/vect-reduc-dot-slp-2.c --- .../gcc.dg/vect/vect-reduc-chain-1.c | 62 ++ .../gcc.dg/vect/vect-reduc-chain-2.c | 77 ++ .../gcc.dg/vect/vect-reduc-chain-3.c | 66 ++ .../gcc.dg/vect/vect-reduc-dot-slp-1.c| 97 +++ .../gcc.dg/vect/vect-reduc-dot-slp-2.c| 81 +++ gcc/tree-vect-loop.cc | 680 -- gcc/tree-vect-slp.cc | 4 +- gcc/tree-vect-stmts.cc| 13 +- gcc/tree-vectorizer.h | 14 + 9 files changed, 873 insertions(+), 221 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c new file mode 100644 index 000..04bfc419dbd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c @@ -0,0 +1,62 @@ +/* Disabling epilogues until we find a better way to deal with scans. 
*/ +/* { dg-additional-options "--param vect-epilogues-nomask=0" } */ +/* { dg-require-effective-target vect_int } */ +/* { dg-require-effective-target arm_v8_2a_dotprod_neon_hw { target { aarch64*-*-* || arm*-*-* } } } */ +/* { dg-add-options arm_v8_2a_dotprod_neon } */ + +#include "tree-vect.h" + +#define N 50 + +#ifndef SIGNEDNESS_1 +#define SIGNEDNESS_1 signed +#define SIGNEDNESS_2 signed +#endif + +SIGNEDNESS_1 int __attribute__ ((noipa)) +f (SIGNEDNESS_1 int res, + SIGNEDNESS_2 char *restrict a, + SIGNEDNESS_2 char *restrict b, + SIGNEDNESS_2 char *restrict c, + SIGNEDNESS_2 char *restrict d, + SIGNEDNESS_1 int *restrict e) +{ + for (int i = 0; i < N; ++i) +{ + res += a[i] * b[i]; + res += c[i] * d[i]; + res += e[i]; +} + return res; +} + +#define BASE ((SIGNEDNESS_2 int) -1 < 0 ? -126 : 4) +#define OFFSET 20 + +int +main (void) +{ + check_vect (); + + SIGNEDNESS_2 char a[N], b[N]; + SIGNEDNESS_2 char c[N], d[N]; + SIGNEDNESS_1 int e[N]; + int expected = 0x12345; + for (int i = 0; i < N; ++i) +{ + a[i] = BASE + i * 5; + b[i] = BASE + OFFSET + i * 4; + c[i] = BASE + i * 2; + d[i] = BASE + OFFSET + i * 3; + e[i] = i; + asm volatile ("" ::: "memory"); + expected += a[i] * b[i]; + expected += c[i] * d[i]; + expected += e[i]; +} + if (f (0x12345, a, b, c, d, e) != expected) +__builtin_abort (); +} + +/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */ +/* { dg-final { scan-tree
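For what the mentioned follow-up feature is aiming at, a loop of roughly the following shape gives the flavour of an "affine closure" of lane-reducing idioms around the reduction variable. It is purely illustrative; the mail does not say this exact form will be recognized.

#include <cstdio>
#include <cstdlib>

int
main ()
{
  signed char d0[16], d1[16], s0[16], s1[16];
  for (int i = 0; i < 16; i++)
    {
      d0[i] = (signed char) i;
      d1[i] = (signed char) (2 * i);
      s0[i] = (signed char) i;
      s1[i] = (signed char) (16 - i);
    }

  int sum = 0;
  for (int i = 0; i < 16; i++)
    /* Constant multiples of a dot-product term and of a sum-of-absolute-
       differences term, both feeding the same reduction variable.  */
    sum += 3 * (d0[i] * d1[i]) + 5 * std::abs (s0[i] - s1[i]);

  std::printf ("%d\n", sum);
}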
[PATCH] vect: Support multiple lane-reducing operations for loop reduction [PR114440]
For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current vectorizer could only handle the pattern if the reduction chain does not contain other operation, no matter the other is normal or lane-reducing. Acctually, to allow multiple arbitray lane-reducing operations, we need to support vectorization of loop reduction chain with mixed input vectypes. Since lanes of vectype may vary with operation, the effective ncopies of vectorized statements for operation also may not be same to each other, this causes mismatch on vectorized def-use cycles. A simple way is to align all operations with the one that has the most ncopies, the gap could be complemented by generating extra trival pass-through copies. For example: int sum = 0; for (i) { sum += d0[i] * d1[i]; // dot-prod sum += w[i]; // widen-sum sum += abs(s0[i] - s1[i]); // sad sum += n[i]; // normal } The vector size is 128-bit,vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 0 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1); sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy sum_v0 = sum_v0; // copy sum_v1 = sum_v1; // copy sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2); sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3); sum_v0 += n_v0[i: 0 ~ 3 ]; sum_v1 += n_v1[i: 4 ~ 7 ]; sum_v2 += n_v2[i: 8 ~ 11]; sum_v3 += n_v3[i: 12 ~ 15]; } Moreover, for a higher instruction parallelism in final vectorized loop, it is considered to make those effective vectorized lane-reducing statements be distributed evenly among all def-use cycles. In the above example, DOT_PROD, WIDEN_SUM and SADs are generated into disparate cycles. Bootstrapped/regtested on x86_64-linux and aarch64-linux. Feng --- gcc/ PR tree-optimization/114440 * tree-vectorizer.h (struct _stmt_vec_info): Add a new field reduc_result_pos. (vectorizable_lane_reducing): New function declaration. * tree-vect-stmts.cc (vectorizable_condition): Treat the condition statement that is pointed by stmt_vec_info of reduction PHI as the real "for_reduction" statement. (vect_analyze_stmt): Call new function vectorizable_lane_reducing to analyze lane-reducing operation. * tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Remove parameter loop_vinfo. Get input vectype from stmt_info instead of reduction PHI. (vect_model_reduction_cost): Remove cost computation code related to emulated_mixed_dot_prod. (vect_reduction_use_partial_vector): New function. (vectorizable_lane_reducing): New function. (vectorizable_reduction): Allow multiple lane-reducing operations in loop reduction. Move some original lane-reducing related code to vectorizable_lane_reducing, and move partial vectorization checking code to vect_reduction_use_partial_vector. (vect_transform_reduction): Extend transformation to support reduction statements with mixed input vectypes. 
gcc/testsuite/ PR tree-optimization/114440 * gcc.dg/vect/vect-reduc-chain-1.c * gcc.dg/vect/vect-reduc-chain-2.c * gcc.dg/vect/vect-reduc-chain-3.c * gcc.dg/vect/vect-reduc-dot-slp-1.c * gcc.dg/vect/vect-reduc-dot-slp-2.c --- .../gcc.dg/vect/vect-reduc-chain-1.c | 62 ++ .../gcc.dg/vect/vect-reduc-chain-2.c | 77 ++ .../gcc.dg/vect/vect-reduc-chain-3.c | 66 ++ .../gcc.dg/vect/vect-reduc-dot-slp-1.c| 97 +++ .../gcc.dg/vect/vect-reduc-dot-slp-2.c| 81 +++ gcc/tree-vect-loop.cc | 668 -- gcc/tree-vect-stmts.cc| 13 +- gcc/tree-vectorizer.h | 8 + 8 files changed, 863 insertions(+), 209 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-2.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-chain-3.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-1.c create mode 100644 gcc/testsuite/gcc.dg/vect/vect-reduc-dot-slp-2.c diff --git a/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c new file mode 100644 index 000..04bfc419dbd --- /dev/null +++ b/gcc/testsuite/gcc.dg/vect/vect-reduc-chain-1.c @@ -0,0 +1,62 @@ +/* Disabling epilogues until we find a
Re: [PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091]
+ scalar_use_map.put (op, 1); + } + else + { + for (slp_tree child : SLP_TREE_CHILDREN (node)) + if (child && !visited.add (child)) + worklist.safe_push (child); + } +} while (!worklist.is_empty ()); + + visited.empty (); + + for (slp_instance instance : bb_vinfo->slp_instances) +{ + vect_location = instance->location (); + vect_bb_slp_mark_live_stmts (bb_vinfo, SLP_INSTANCE_TREE (instance), + instance, &instance->cost_vec, + scalar_use_map, svisited, visited); +} } /* Determine whether we can vectorize the reduction epilogue for INSTANCE. */ @@ -6684,17 +6823,7 @@ vect_slp_analyze_operations (vec_info *vinfo) /* Compute vectorizable live stmts. */ if (bb_vec_info bb_vinfo = dyn_cast (vinfo)) -{ - hash_set svisited; - hash_set visited; - for (i = 0; vinfo->slp_instances.iterate (i, &instance); ++i) - { - vect_location = instance->location (); - vect_bb_slp_mark_live_stmts (bb_vinfo, SLP_INSTANCE_TREE (instance), - instance, &instance->cost_vec, svisited, - visited); - } -} +vect_bb_slp_mark_live_stmts (bb_vinfo); return !vinfo->slp_instances.is_empty (); } -- 2.17.1 From: Richard Biener Sent: Thursday, January 11, 2024 5:52 PM To: Feng Xue OS; Richard Sandiford Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091] On Thu, Jan 11, 2024 at 10:46 AM Richard Biener wrote: > > On Fri, Dec 29, 2023 at 11:29 AM Feng Xue OS > wrote: > > > > This patch is meant to fix over-estimation about SLP vector-to-scalar cost > > for > > STMT_VINFO_LIVE_P statement. When pattern recognition is involved, a > > statement whose definition is consumed in some pattern, may not be > > included in the final replacement pattern statements, and would be skipped > > when building SLP graph. > > > > * Original > > char a_c = *(char *) a; > > char b_c = *(char *) b; > > unsigned short a_s = (unsigned short) a_c; > > int a_i = (int) a_s; > > int b_i = (int) b_c; > > int r_i = a_i - b_i; > > > > * After pattern replacement > > a_s = (unsigned short) a_c; > > a_i = (int) a_s; > > > > patt_b_s = (unsigned short) b_c;// b_i = (int) b_c > > patt_b_i = (int) patt_b_s; // b_i = (int) b_c > > > > patt_r_s = widen_minus(a_c, b_c); // r_i = a_i - b_i > > patt_r_i = (int) patt_r_s; // r_i = a_i - b_i > > > > The definitions of a_i(original statement) and b_i(pattern statement) > > are related to, but actually not part of widen_minus pattern. > > Vectorizing the pattern does not cause these definition statements to > > be marked as PURE_SLP. For this case, we need to recursively check > > whether their uses are all absorbed into vectorized code. But there > > is an exception that some use may participate in an vectorized > > operation via an external SLP node containing that use as an element. 
> > > > Feng > > > > --- > > .../gcc.target/aarch64/bb-slp-pr113091.c | 22 ++ > > gcc/tree-vect-slp.cc | 189 ++ > > 2 files changed, 172 insertions(+), 39 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c > > b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c > > new file mode 100644 > > index 000..ff822e90b4a > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c > > @@ -0,0 +1,22 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-O3 -fdump-tree-slp-details > > -ftree-slp-vectorize" } */ > > + > > +int test(unsigned array[8]); > > + > > +int foo(char *a, char *b) > > +{ > > + unsigned array[8]; > > + > > + array[0] = (a[0] - b[0]); > > + array[1] = (a[1] - b[1]); > > + array[2] = (a[2] - b[2]); > > + array[3] = (a[3] - b[3]); > > + array[4] = (a[4] - b[4]); > > + array[5] = (a[5] - b[5]); > > + array[6] = (a[6] - b[6]); > > + array[7] = (a[7] - b[7]); > > + > > + return test(array); > > +} > > + > > +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using > > SLP" 1 "slp2" } } */ > > diff --git a/gcc
PING: [PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091]
Hi, Richard, Would you please talk a look at this patch? Thanks, Feng From: Feng Xue OS Sent: Friday, December 29, 2023 6:28 PM To: gcc-patches@gcc.gnu.org Subject: [PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091] This patch is meant to fix over-estimation about SLP vector-to-scalar cost for STMT_VINFO_LIVE_P statement. When pattern recognition is involved, a statement whose definition is consumed in some pattern, may not be included in the final replacement pattern statements, and would be skipped when building SLP graph. * Original char a_c = *(char *) a; char b_c = *(char *) b; unsigned short a_s = (unsigned short) a_c; int a_i = (int) a_s; int b_i = (int) b_c; int r_i = a_i - b_i; * After pattern replacement a_s = (unsigned short) a_c; a_i = (int) a_s; patt_b_s = (unsigned short) b_c;// b_i = (int) b_c patt_b_i = (int) patt_b_s; // b_i = (int) b_c patt_r_s = widen_minus(a_c, b_c); // r_i = a_i - b_i patt_r_i = (int) patt_r_s; // r_i = a_i - b_i The definitions of a_i(original statement) and b_i(pattern statement) are related to, but actually not part of widen_minus pattern. Vectorizing the pattern does not cause these definition statements to be marked as PURE_SLP. For this case, we need to recursively check whether their uses are all absorbed into vectorized code. But there is an exception that some use may participate in an vectorized operation via an external SLP node containing that use as an element. Feng --- .../gcc.target/aarch64/bb-slp-pr113091.c | 22 ++ gcc/tree-vect-slp.cc | 189 ++ 2 files changed, 172 insertions(+), 39 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c diff --git a/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c new file mode 100644 index 000..ff822e90b4a --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O3 -fdump-tree-slp-details -ftree-slp-vectorize" } */ + +int test(unsigned array[8]); + +int foo(char *a, char *b) +{ + unsigned array[8]; + + array[0] = (a[0] - b[0]); + array[1] = (a[1] - b[1]); + array[2] = (a[2] - b[2]); + array[3] = (a[3] - b[3]); + array[4] = (a[4] - b[4]); + array[5] = (a[5] - b[5]); + array[6] = (a[6] - b[6]); + array[7] = (a[7] - b[7]); + + return test(array); +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index a82fca45161..d36ff37114e 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -6418,6 +6418,84 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, return res; } +/* Given a definition DEF, analyze if it will have any live scalar use after + performing SLP vectorization whose information is represented by BB_VINFO, + and record result into hash map SCALAR_USE_MAP as cache for later fast + check. 
*/ + +static bool +vec_slp_has_scalar_use (bb_vec_info bb_vinfo, tree def, + hash_map &scalar_use_map) +{ + imm_use_iterator use_iter; + gimple *use_stmt; + + if (bool *res = scalar_use_map.get (def)) +return *res; + + FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def) +{ + if (is_gimple_debug (use_stmt)) + continue; + + stmt_vec_info use_stmt_info = bb_vinfo->lookup_stmt (use_stmt); + + if (!use_stmt_info) + break; + + if (PURE_SLP_STMT (vect_stmt_to_vectorize (use_stmt_info))) + continue; + + /* Do not step forward when encounter PHI statement, since it may +involve cyclic reference and cause infinite recursive invocation. */ + if (gimple_code (use_stmt) == GIMPLE_PHI) + break; + + /* When pattern recognition is involved, a statement whose definition is +consumed in some pattern, may not be included in the final replacement +pattern statements, so would be skipped when building SLP graph. + +* Original + char a_c = *(char *) a; + char b_c = *(char *) b; + unsigned short a_s = (unsigned short) a_c; + int a_i = (int) a_s; + int b_i = (int) b_c; + int r_i = a_i - b_i; + +* After pattern replacement + a_s = (unsigned short) a_c; + a_i = (int) a_s; + + patt_b_s = (unsigned short) b_c;// b_i = (int) b_c + patt_b_i = (int) patt_b_s; // b_i = (int) b_c + + patt_r_s = widen_minus(a_c, b_c); // r_i = a_i - b_i + patt_r_i = (int) patt_r_s; // r_i = a_i - b_i + +The definitions of a_i(original statement) and b_i(pattern statement) +are related to, but act
[PATCH] Do not count unused scalar use when marking STMT_VINFO_LIVE_P [PR113091]
This patch is meant to fix over-estimation about SLP vector-to-scalar cost for STMT_VINFO_LIVE_P statement. When pattern recognition is involved, a statement whose definition is consumed in some pattern, may not be included in the final replacement pattern statements, and would be skipped when building SLP graph. * Original char a_c = *(char *) a; char b_c = *(char *) b; unsigned short a_s = (unsigned short) a_c; int a_i = (int) a_s; int b_i = (int) b_c; int r_i = a_i - b_i; * After pattern replacement a_s = (unsigned short) a_c; a_i = (int) a_s; patt_b_s = (unsigned short) b_c;// b_i = (int) b_c patt_b_i = (int) patt_b_s; // b_i = (int) b_c patt_r_s = widen_minus(a_c, b_c); // r_i = a_i - b_i patt_r_i = (int) patt_r_s; // r_i = a_i - b_i The definitions of a_i(original statement) and b_i(pattern statement) are related to, but actually not part of widen_minus pattern. Vectorizing the pattern does not cause these definition statements to be marked as PURE_SLP. For this case, we need to recursively check whether their uses are all absorbed into vectorized code. But there is an exception that some use may participate in an vectorized operation via an external SLP node containing that use as an element. Feng --- .../gcc.target/aarch64/bb-slp-pr113091.c | 22 ++ gcc/tree-vect-slp.cc | 189 ++ 2 files changed, 172 insertions(+), 39 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c diff --git a/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c new file mode 100644 index 000..ff822e90b4a --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/bb-slp-pr113091.c @@ -0,0 +1,22 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O3 -fdump-tree-slp-details -ftree-slp-vectorize" } */ + +int test(unsigned array[8]); + +int foo(char *a, char *b) +{ + unsigned array[8]; + + array[0] = (a[0] - b[0]); + array[1] = (a[1] - b[1]); + array[2] = (a[2] - b[2]); + array[3] = (a[3] - b[3]); + array[4] = (a[4] - b[4]); + array[5] = (a[5] - b[5]); + array[6] = (a[6] - b[6]); + array[7] = (a[7] - b[7]); + + return test(array); +} + +/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */ diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index a82fca45161..d36ff37114e 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -6418,6 +6418,84 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, return res; } +/* Given a definition DEF, analyze if it will have any live scalar use after + performing SLP vectorization whose information is represented by BB_VINFO, + and record result into hash map SCALAR_USE_MAP as cache for later fast + check. */ + +static bool +vec_slp_has_scalar_use (bb_vec_info bb_vinfo, tree def, + hash_map &scalar_use_map) +{ + imm_use_iterator use_iter; + gimple *use_stmt; + + if (bool *res = scalar_use_map.get (def)) +return *res; + + FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, def) +{ + if (is_gimple_debug (use_stmt)) + continue; + + stmt_vec_info use_stmt_info = bb_vinfo->lookup_stmt (use_stmt); + + if (!use_stmt_info) + break; + + if (PURE_SLP_STMT (vect_stmt_to_vectorize (use_stmt_info))) + continue; + + /* Do not step forward when encounter PHI statement, since it may +involve cyclic reference and cause infinite recursive invocation. 
*/ + if (gimple_code (use_stmt) == GIMPLE_PHI) + break; + + /* When pattern recognition is involved, a statement whose definition is +consumed in some pattern, may not be included in the final replacement +pattern statements, so would be skipped when building SLP graph. + +* Original + char a_c = *(char *) a; + char b_c = *(char *) b; + unsigned short a_s = (unsigned short) a_c; + int a_i = (int) a_s; + int b_i = (int) b_c; + int r_i = a_i - b_i; + +* After pattern replacement + a_s = (unsigned short) a_c; + a_i = (int) a_s; + + patt_b_s = (unsigned short) b_c;// b_i = (int) b_c + patt_b_i = (int) patt_b_s; // b_i = (int) b_c + + patt_r_s = widen_minus(a_c, b_c); // r_i = a_i - b_i + patt_r_i = (int) patt_r_s; // r_i = a_i - b_i + +The definitions of a_i(original statement) and b_i(pattern statement) +are related to, but actually not part of widen_minus pattern. +Vectorizing the pattern does not cause these definition statements to +be marked as PURE_SLP. For this case, we need to recursively check +whether their uses are all absorbed into vectorized code. But there +is an exception that some use may participate in an vectorized +
[PATCH] arm/aarch64: Add bti for all functions [PR106671]
This patch extends option -mbranch-protection=bti with an optional argument as bti[+all] to force compiler to unconditionally insert bti for all functions. Because a direct function call at the stage of compiling might be rewritten to an indirect call with some kind of linker-generated thunk stub as invocation relay for some reasons. One instance is if a direct callee is placed far from its caller, direct BL {imm} instruction could not represent the distance, so indirect BLR {reg} should be used. For this case, a bti is required at the beginning of the callee. caller() { bl callee } => caller() { adrp reg, addreg, reg, #constant blrreg } Although the issue could be fixed with a pretty new version of ld, here we provide another means for user who has to rely on the old ld or other non-ld linker. I also checked LLVM, by default, it implements bti just as the proposed -mbranch-protection=bti+all. Feng --- gcc/config/aarch64/aarch64.cc| 12 +++- gcc/config/aarch64/aarch64.opt | 2 +- gcc/config/arm/aarch-bti-insert.cc | 3 ++- gcc/config/arm/aarch-common.cc | 22 ++ gcc/config/arm/aarch-common.h| 18 ++ gcc/config/arm/arm.cc| 4 ++-- gcc/config/arm/arm.opt | 2 +- gcc/doc/invoke.texi | 16 ++-- gcc/testsuite/gcc.target/aarch64/bti-5.c | 17 + 9 files changed, 76 insertions(+), 20 deletions(-) create mode 100644 gcc/testsuite/gcc.target/aarch64/bti-5.c diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 71215ef9fee..a404447c8d0 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -8997,7 +8997,8 @@ void aarch_bti_arch_check (void) bool aarch_bti_enabled (void) { - return (aarch_enable_bti == 1); + gcc_checking_assert (aarch_enable_bti != AARCH_BTI_FUNCTION_UNSET); + return (aarch_enable_bti != AARCH_BTI_FUNCTION_NONE); } /* Check if INSN is a BTI J insn. */ @@ -18454,12 +18455,12 @@ aarch64_override_options (void) selected_tune = tune ? tune->ident : cpu->ident; - if (aarch_enable_bti == 2) + if (aarch_enable_bti == AARCH_BTI_FUNCTION_UNSET) { #ifdef TARGET_ENABLE_BTI - aarch_enable_bti = 1; + aarch_enable_bti = AARCH_BTI_FUNCTION; #else - aarch_enable_bti = 0; + aarch_enable_bti = AARCH_BTI_FUNCTION_NONE; #endif } @@ -22881,7 +22882,8 @@ aarch64_print_patchable_function_entry (FILE *file, basic_block bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb; if (!aarch_bti_enabled () - || cgraph_node::get (cfun->decl)->only_called_directly_p ()) + || (aarch_enable_bti != AARCH_BTI_FUNCTION_ALL + && cgraph_node::get (cfun->decl)->only_called_directly_p ())) { /* Emit the patchable_area at the beginning of the function. */ rtx_insn *insn = emit_insn_before (pa, BB_HEAD (bb)); diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt index 025e52d40e5..5571f7e916d 100644 --- a/gcc/config/aarch64/aarch64.opt +++ b/gcc/config/aarch64/aarch64.opt @@ -37,7 +37,7 @@ TargetVariable aarch64_feature_flags aarch64_isa_flags = 0 TargetVariable -unsigned aarch_enable_bti = 2 +enum aarch_bti_function_type aarch_enable_bti = AARCH_BTI_FUNCTION_UNSET TargetVariable enum aarch_key_type aarch_ra_sign_key = AARCH_KEY_A diff --git a/gcc/config/arm/aarch-bti-insert.cc b/gcc/config/arm/aarch-bti-insert.cc index 71a77e29406..babd2490c9f 100644 --- a/gcc/config/arm/aarch-bti-insert.cc +++ b/gcc/config/arm/aarch-bti-insert.cc @@ -164,7 +164,8 @@ rest_of_insert_bti (void) functions that are already protected by Return Address Signing (PACIASP/ PACIBSP). For all other cases insert a BTI C at the beginning of the function. 
*/ - if (!cgraph_node::get (cfun->decl)->only_called_directly_p ()) + if (aarch_enable_bti == AARCH_BTI_FUNCTION_ALL + || !cgraph_node::get (cfun->decl)->only_called_directly_p ()) { bb = ENTRY_BLOCK_PTR_FOR_FN (cfun)->next_bb; insn = BB_HEAD (bb); diff --git a/gcc/config/arm/aarch-common.cc b/gcc/config/arm/aarch-common.cc index 5b96ff4c2e8..7751d40f909 100644 --- a/gcc/config/arm/aarch-common.cc +++ b/gcc/config/arm/aarch-common.cc @@ -666,7 +666,7 @@ static enum aarch_parse_opt_result aarch_handle_no_branch_protection (char* str, char* rest) { aarch_ra_sign_scope = AARCH_FUNCTION_NONE; - aarch_enable_bti = 0; + aarch_enable_bti = AARCH_BTI_FUNCTION_NONE; if (rest) { error ("unexpected %<%s%> after %<%s%>", rest, str); @@ -680,7 +680,7 @@ aarch_handle_standard_branch_protection (char* str, char* rest) { aarch_ra_sign_scope = AARCH_FUNCTION_NON_LEAF; aarch_ra_sign_key = AARCH_KEY_A; - aarch_enable_bti = 1; + aarch_enable_bti = AARCH_BTI_FUNCTION; if (rest) { error ("unexpected %<%s%> after %<%s%>", res
PING^2: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS
Thanks, Feng From: Feng Xue OS Sent: Thursday, September 16, 2021 5:26 PM To: Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org Cc: JiangNing OS Subject: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS This patch is to extend applicability of full devirtualization to LTRANS stage. Normally, whole program assumption would not hold when WPA splits whole compilation into more than one LTRANS partitions. To avoid information lost for WPD at LTRANS, we will record all vtable nodes and related member function references into each partition. Bootstrapped/regtested on x86_64-linux and aarch64-linux. Thanks, Feng 2021-09-07 Feng Xue gcc/ * tree.h (TYPE_CXX_LOCAL): New macro for type using base.nothrow_flag. * tree-core.h (tree_base): Update comment on using base.nothrow_flag to represent TYPE_CXX_LOCAL. * ipa-devirt.c (odr_type_d::whole_program_local): Removed. (odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag on type, and enable WPD at LTRANS when flag_devirtualize_fully is true. (get_odr_type): Remove setting whole_program_local flag on type. (identify_whole_program_local_types): Replace whole_program_local in odr_type_d by TYPE_CXX_LOCAL on type. (maybe_record_node): Enable WPD at LTRANS when flag_devirtualize_fully is true. * ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS stage under full devirtualization. * lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables to boundary of each LTRANS partition. * lto-streamer-out.c (get_symbol_initial_value): Streaming out initial value of vtable even its class is optimized away. * lto-lang.c (lto_post_options): Disable full devirtualization if flag_ltrans_devirtualize is false. * tree-streamer-in.c (unpack_ts_base_value_fields): unpack value of TYPE_CXX_LOCAL for a type from streaming data. * tree-streamer-out.c (pack_ts_base_value_fields): pack value ofTYPE_CXX_LOCAL for a type into streaming data. --- From 624aef44d72799ae488a431b4dce730f4b0fc28e Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 6 Sep 2021 20:34:50 +0800 Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS Whole program assumption would not hold when WPA splits whole compilation into more than one LTRANS partitions. To avoid information lost for WPD at LTRANS, we will record all vtable nodes and related member function references into each partition. 2021-09-07 Feng Xue gcc/ * tree.h (TYPE_CXX_LOCAL): New macro for type using base.nothrow_flag. * tree-core.h (tree_base): Update comment on using base.nothrow_flag to represent TYPE_CXX_LOCAL. * ipa-devirt.c (odr_type_d::whole_program_local): Removed. (odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag on type, and enable WPD at LTRANS when flag_devirtualize_fully is true. (get_odr_type): Remove setting whole_program_local flag on type. (identify_whole_program_local_types): Replace whole_program_local in odr_type_d by TYPE_CXX_LOCAL on type. (maybe_record_node): Enable WPD at LTRANS when flag_devirtualize_fully is true. * ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS stage under full devirtualization. * lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables to boundary of each LTRANS partition. * lto-streamer-out.c (get_symbol_initial_value): Streaming out initial value of vtable even its class is optimized away. * lto-streamer-in.c (lto_input_tree): There might be more than one decls in dref_queue, register debuginfo for all of them. 
* lto-lang.c (lto_post_options): Disable full devirtualization if flag_ltrans_devirtualize is false. * tree-streamer-in.c (unpack_ts_base_value_fields): unpack value of TYPE_CXX_LOCAL for a type from streaming data. * tree-streamer-out.c (pack_ts_base_value_fields): pack value ofTYPE_CXX_LOCAL for a type into streaming data. --- gcc/ipa-devirt.c| 33 ++--- gcc/ipa.c | 7 ++- gcc/lto-cgraph.c| 19 +++ gcc/lto-streamer-in.c | 3 +-- gcc/lto-streamer-out.c | 12 +++- gcc/lto/lto-lang.c | 6 ++ gcc/tree-core.h | 3 +++ gcc/tree-streamer-in.c | 11 --- gcc/tree-streamer-out.c | 16 +--- gcc/tree.h | 5 + 10 files changed, 90 insertions(+), 25 deletions(-) diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c index 284c449c6c1..bb929f016f8 100644 --- a/gcc/ipa-devirt.c +++ b/gcc/ipa-devirt.c @@ -216,8 +216,6 @@ struct GTY(()) odr_type_d int id; /* Is it in anonymous namespace? */ bool anonymous_namespace; - /* Set when type is not used outside of program. */ - bool whole_program_local; /* Did we
PING^2: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization
Hi, Honza & Martin, Would you please take some time to review proposal and patches of whole program devirtualization? We have to say, this feature is not 100% safe, but provides us a way to deploy correct WPD on C++ program if we elaborately prepare linked libraries to ensure rtti symbols are contained, which is always the case for libstdc++ and well-composed third-part c++libraries with default gcc options. If not, we could get an expected rebuild with desirable options, and this does not require invasive modification on source codes, which is an advantage over LLVM visibility-based scheme. Now gcc-12 dev branch is at late stage since time will step into Nov. Anyway, we are not sure it is acceptable or not. But if yes, getting it in before code freeze would be a good time point. And made some minor changes on patches, also posted RFC link here for your convenience. (https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html) Thanks, Feng From: Feng Xue OS Sent: Saturday, September 18, 2021 5:38 PM To: Jason Merrill; Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org Subject: Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization >On 9/16/21 22:29, Feng Xue OS wrote: >>> On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote: >>>> This and following patches are composed to enable full devirtualization >>>> under whole program assumption (so also called whole-program >>>> devirtualization, WPD for short), which is an enhancement to current >>>> speculative devirtualization. The base of the optimization is how to >>>> identify class type that is local in terms of whole-program scope, at >>>> least those class types in libstdc++ must be excluded in some way. >>>> Our means is to use typeinfo symbol as identity marker of a class since >>>> it is unique and always generated once the class or its derived type >>>> is instantiated somewhere, and rely on symbol resolution by >>>> lto-linker-plugin to detect whether a typeinfo is referenced by regular >>>> object/library, which indirectly tells class types are escaped or not. >>>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html >>>> gives more details on that. >>>> >>>> Bootstrapped/regtested on x86_64-linux and aarch64-linux. >>>> >>>> Thanks, >>>> Feng >>>> >>>> >>>> 2021-09-07 Feng Xue >>>> >>>> gcc/ >>>>* common.opt (-fdevirtualize-fully): New option. >>>>* class.c (build_rtti_vtbl_entries): Force generation of typeinfo >>>>even -fno-rtti is specificied under full devirtualization. >>> >>> This makes -fno-rtti useless; rather than this, you should warn about >>> the combination of flags and force flag_rtti on. It also sounds like >>> you depend on the library not being built with -fno-rtti. >> >> Although rtti is generated by front-end, we will remove it after lto symtab >> merge, which is meant to keep same behavior as -fno-rtti. > > Ah, the cp/ change is OK, then, with a comment about that. > >> Yes, regular library to be linked with should contain rtti data, otherwise >> WPD could not deduce class type usage safely. By default, we can think >> that it should work for libstdc++, but it probably becomes a problem for >> user library, which might be avoided if we properly document this >> requirement and suggest user doing that when using WPD. > > Yes, I would expect that external libraries would be built with RTTI on > to allow users to use RTTI features even if they aren't used within the > library. But it's good to document it as a requirement. 
> >> + /* If a class with virtual base is only instantiated as >> + subobjects of derived classes, and has no complete object in >> + compilation unit, merely construction vtables will be >> involved, >> + its primary vtable is really not needed, and subject to being >> + removed. So once a vtable node is encountered, for all >> + polymorphic base classes of the vtable's context class, always >> + force generation of primary vtable nodes when full >> + devirtualization is enabled. */ > > Why do you need the primary vtable if you're relying on RTTI info? > Construction vtables will point to the same RTTI node. At middle end, the easiest way to get vtable of type is via TYPE_BINFO(type), it is the primary one. And WPD relies on existence of varpool_node of the vtable dec
PING: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS
Made some minor changes. Thanks, Feng From: Feng Xue OS Sent: Thursday, September 16, 2021 5:26 PM To: Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org Cc: JiangNing OS Subject: [PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS This patch is to extend applicability of full devirtualization to LTRANS stage. Normally, whole program assumption would not hold when WPA splits whole compilation into more than one LTRANS partitions. To avoid information lost for WPD at LTRANS, we will record all vtable nodes and related member function references into each partition. Bootstrapped/regtested on x86_64-linux and aarch64-linux. Thanks, Feng 2021-09-07 Feng Xue gcc/ * tree.h (TYPE_CXX_LOCAL): New macro for type using base.nothrow_flag. * tree-core.h (tree_base): Update comment on using base.nothrow_flag to represent TYPE_CXX_LOCAL. * ipa-devirt.c (odr_type_d::whole_program_local): Removed. (odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag on type, and enable WPD at LTRANS when flag_devirtualize_fully is true. (get_odr_type): Remove setting whole_program_local flag on type. (identify_whole_program_local_types): Replace whole_program_local in odr_type_d by TYPE_CXX_LOCAL on type. (maybe_record_node): Enable WPD at LTRANS when flag_devirtualize_fully is true. * ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS stage under full devirtualization. * lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables to boundary of each LTRANS partition. * lto-streamer-out.c (get_symbol_initial_value): Streaming out initial value of vtable even its class is optimized away. * lto-lang.c (lto_post_options): Disable full devirtualization if flag_ltrans_devirtualize is false. * tree-streamer-in.c (unpack_ts_base_value_fields): unpack value of TYPE_CXX_LOCAL for a type from streaming data. * tree-streamer-out.c (pack_ts_base_value_fields): pack value ofTYPE_CXX_LOCAL for a type into streaming data. --- From 2c0d243b0c092585561c732bac490700f41001fb Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 6 Sep 2021 20:34:50 +0800 Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS Whole program assumption would not hold when WPA splits whole compilation into more than one LTRANS partitions. To avoid information lost for WPD at LTRANS, we will record all vtable nodes and related member function references into each partition. 2021-09-07 Feng Xue gcc/ * tree.h (TYPE_CXX_LOCAL): New macro for type using base.nothrow_flag. * tree-core.h (tree_base): Update comment on using base.nothrow_flag to represent TYPE_CXX_LOCAL. * ipa-devirt.c (odr_type_d::whole_program_local): Removed. (odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag on type, and enable WPD at LTRANS when flag_devirtualize_fully is true. (get_odr_type): Remove setting whole_program_local flag on type. (identify_whole_program_local_types): Replace whole_program_local in odr_type_d by TYPE_CXX_LOCAL on type. (maybe_record_node): Enable WPD at LTRANS when flag_devirtualize_fully is true. * ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS stage under full devirtualization. * lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables to boundary of each LTRANS partition. * lto-streamer-out.c (get_symbol_initial_value): Streaming out initial value of vtable even its class is optimized away. * lto-streamer-in.c (lto_input_tree): There might be more than one decls in dref_queue, register debuginfo for all of them. 
* lto-lang.c (lto_post_options): Disable full devirtualization if flag_ltrans_devirtualize is false. * tree-streamer-in.c (unpack_ts_base_value_fields): unpack value of TYPE_CXX_LOCAL for a type from streaming data. * tree-streamer-out.c (pack_ts_base_value_fields): pack value ofTYPE_CXX_LOCAL for a type into streaming data. temp --- gcc/ipa-devirt.c| 29 ++--- gcc/ipa.c | 7 ++- gcc/lto-cgraph.c| 18 ++ gcc/lto-streamer-in.c | 3 +-- gcc/lto-streamer-out.c | 12 +++- gcc/lto/lto-lang.c | 6 ++ gcc/tree-core.h | 3 +++ gcc/tree-streamer-in.c | 11 --- gcc/tree-streamer-out.c | 11 --- gcc/tree.h | 5 + 10 files changed, 84 insertions(+), 21 deletions(-) diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c index a7d04388dab..4ff551bace8 100644 --- a/gcc/ipa-devirt.c +++ b/gcc/ipa-devirt.c @@ -216,8 +216,6 @@ struct GTY(()) odr_type_d int id; /* Is it in anonymous namespace? */ bool anonymous_namespace; - /* Set when type is not used outside of program. */ - bool
PING: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization
Minor update for some bugfixs and comment wording change. Thanks, Feng From: Feng Xue OS Sent: Saturday, September 18, 2021 5:38 PM To: Jason Merrill; Jan Hubicka; mjam...@suse.cz; Richard Biener; gcc-patches@gcc.gnu.org Subject: Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization >On 9/16/21 22:29, Feng Xue OS wrote: >>> On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote: >>>> This and following patches are composed to enable full devirtualization >>>> under whole program assumption (so also called whole-program >>>> devirtualization, WPD for short), which is an enhancement to current >>>> speculative devirtualization. The base of the optimization is how to >>>> identify class type that is local in terms of whole-program scope, at >>>> least those class types in libstdc++ must be excluded in some way. >>>> Our means is to use typeinfo symbol as identity marker of a class since >>>> it is unique and always generated once the class or its derived type >>>> is instantiated somewhere, and rely on symbol resolution by >>>> lto-linker-plugin to detect whether a typeinfo is referenced by regular >>>> object/library, which indirectly tells class types are escaped or not. >>>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html >>>> gives more details on that. >>>> >>>> Bootstrapped/regtested on x86_64-linux and aarch64-linux. >>>> >>>> Thanks, >>>> Feng >>>> >>>> >>>> 2021-09-07 Feng Xue >>>> >>>> gcc/ >>>>* common.opt (-fdevirtualize-fully): New option. >>>>* class.c (build_rtti_vtbl_entries): Force generation of typeinfo >>>>even -fno-rtti is specificied under full devirtualization. >>> >>> This makes -fno-rtti useless; rather than this, you should warn about >>> the combination of flags and force flag_rtti on. It also sounds like >>> you depend on the library not being built with -fno-rtti. >> >> Although rtti is generated by front-end, we will remove it after lto symtab >> merge, which is meant to keep same behavior as -fno-rtti. > > Ah, the cp/ change is OK, then, with a comment about that. > >> Yes, regular library to be linked with should contain rtti data, otherwise >> WPD could not deduce class type usage safely. By default, we can think >> that it should work for libstdc++, but it probably becomes a problem for >> user library, which might be avoided if we properly document this >> requirement and suggest user doing that when using WPD. > > Yes, I would expect that external libraries would be built with RTTI on > to allow users to use RTTI features even if they aren't used within the > library. But it's good to document it as a requirement. > >> + /* If a class with virtual base is only instantiated as >> + subobjects of derived classes, and has no complete object in >> + compilation unit, merely construction vtables will be >> involved, >> + its primary vtable is really not needed, and subject to being >> + removed. So once a vtable node is encountered, for all >> + polymorphic base classes of the vtable's context class, always >> + force generation of primary vtable nodes when full >> + devirtualization is enabled. */ > > Why do you need the primary vtable if you're relying on RTTI info? > Construction vtables will point to the same RTTI node. At middle end, the easiest way to get vtable of type is via TYPE_BINFO(type), it is the primary one. And WPD relies on existence of varpool_node of the vtable decl to determine if the type has been removed (when it is never instantiated), so we will force generation of vtable node at very early stage. 
Additionally, construction vtable (C-in-D) belongs to the class (D) of complete object, not the class (C) of subobject actually being constructed for, it is not easy to correlate construction vtable with the subobject class (C) after front end. > >> + /* Public class w/o key member function (or local class in a public >> + inline function) requires COMDAT-like vtable so as to be shared >> + among units. But C++ privatizing via -fno-weak would introduce >> + multiple static vtable copies for one class in merged lto symbol >> + table. This breaks one-to-one correspondence between class and >> + vtable, and makes class liveness check become not that easy.
[PATCH] Fix value uninitialization in vn_reference_insert_pieces [PR102400]
Bootstrapped/regtested on x86_64-linux. Thanks, Feng --- 2021-09-23 Feng Xue gcc/ChangeLog PR tree-optimization/102400 * tree-ssa-sccvn.c (vn_reference_insert_pieces): Initialize result_vdef to zero value. --- gcc/tree-ssa-sccvn.c | 1 + 1 file changed, 1 insertion(+) diff --git a/gcc/tree-ssa-sccvn.c b/gcc/tree-ssa-sccvn.c index a901f51a025..e8b1c39184d 100644 --- a/gcc/tree-ssa-sccvn.c +++ b/gcc/tree-ssa-sccvn.c @@ -3811,6 +3811,7 @@ vn_reference_insert_pieces (tree vuse, alias_set_type set, if (result && TREE_CODE (result) == SSA_NAME) result = SSA_VAL (result); vr1->result = result; + vr1->result_vdef = NULL_TREE; slot = valid_info->references->find_slot_with_hash (vr1, vr1->hashcode, INSERT); -- 2.17.1
[PATCH] Fix null-pointer dereference in delete_dead_or_redundant_call [PR102451]
Bootstrapped/regtested on x86_64-linux and aarch64-linux. Thanks, Feng --- 2021-09-23 Feng Xue gcc/ChangeLog: PR tree-optimization/102451 * tree-ssa-dse.c (delete_dead_or_redundant_call): Record bb of stmt before removal. --- gcc/tree-ssa-dse.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/gcc/tree-ssa-dse.c b/gcc/tree-ssa-dse.c index 98daa8ab24c..27287fe88ee 100644 --- a/gcc/tree-ssa-dse.c +++ b/gcc/tree-ssa-dse.c @@ -978,6 +978,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, const char *type) fprintf (dump_file, "\n"); } + basic_block bb = gimple_bb (stmt); tree lhs = gimple_call_lhs (stmt); if (lhs) { @@ -985,7 +986,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, const char *type) gimple *new_stmt = gimple_build_assign (lhs, ptr); unlink_stmt_vdef (stmt); if (gsi_replace (gsi, new_stmt, true)) -bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index); + bitmap_set_bit (need_eh_cleanup, bb->index); } else { @@ -994,7 +995,7 @@ delete_dead_or_redundant_call (gimple_stmt_iterator *gsi, const char *type) /* Remove the dead store. */ if (gsi_remove (gsi, true)) - bitmap_set_bit (need_eh_cleanup, gimple_bb (stmt)->index); + bitmap_set_bit (need_eh_cleanup, bb->index); release_defs (stmt); } } -- 2.17.1
Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization
>On 9/16/21 22:29, Feng Xue OS wrote: >>> On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote: >>>> This and following patches are composed to enable full devirtualization >>>> under whole program assumption (so also called whole-program >>>> devirtualization, WPD for short), which is an enhancement to current >>>> speculative devirtualization. The base of the optimization is how to >>>> identify class type that is local in terms of whole-program scope, at >>>> least those class types in libstdc++ must be excluded in some way. >>>> Our means is to use typeinfo symbol as identity marker of a class since >>>> it is unique and always generated once the class or its derived type >>>> is instantiated somewhere, and rely on symbol resolution by >>>> lto-linker-plugin to detect whether a typeinfo is referenced by regular >>>> object/library, which indirectly tells class types are escaped or not. >>>> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html >>>> gives more details on that. >>>> >>>> Bootstrapped/regtested on x86_64-linux and aarch64-linux. >>>> >>>> Thanks, >>>> Feng >>>> >>>> >>>> 2021-09-07 Feng Xue >>>> >>>> gcc/ >>>>* common.opt (-fdevirtualize-fully): New option. >>>>* class.c (build_rtti_vtbl_entries): Force generation of typeinfo >>>>even -fno-rtti is specificied under full devirtualization. >>> >>> This makes -fno-rtti useless; rather than this, you should warn about >>> the combination of flags and force flag_rtti on. It also sounds like >>> you depend on the library not being built with -fno-rtti. >> >> Although rtti is generated by front-end, we will remove it after lto symtab >> merge, which is meant to keep same behavior as -fno-rtti. > > Ah, the cp/ change is OK, then, with a comment about that. > >> Yes, regular library to be linked with should contain rtti data, otherwise >> WPD could not deduce class type usage safely. By default, we can think >> that it should work for libstdc++, but it probably becomes a problem for >> user library, which might be avoided if we properly document this >> requirement and suggest user doing that when using WPD. > > Yes, I would expect that external libraries would be built with RTTI on > to allow users to use RTTI features even if they aren't used within the > library. But it's good to document it as a requirement. > >> + /* If a class with virtual base is only instantiated as >> + subobjects of derived classes, and has no complete object in >> + compilation unit, merely construction vtables will be >> involved, >> + its primary vtable is really not needed, and subject to being >> + removed. So once a vtable node is encountered, for all >> + polymorphic base classes of the vtable's context class, always >> + force generation of primary vtable nodes when full >> + devirtualization is enabled. */ > > Why do you need the primary vtable if you're relying on RTTI info? > Construction vtables will point to the same RTTI node. At middle end, the easiest way to get vtable of type is via TYPE_BINFO(type), it is the primary one. And WPD relies on existence of varpool_node of the vtable decl to determine if the type has been removed (when it is never instantiated), so we will force generation of vtable node at very early stage. Additionally, construction vtable (C-in-D) belongs to the class (D) of complete object, not the class (C) of subobject actually being constructed for, it is not easy to correlate construction vtable with the subobject class (C) after front end. 
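To make the construction-vtable situation above concrete, here is a minimal sketch (my own illustration, not from the patch): B has a virtual base and is only ever created as a subobject of D, so ordinarily only the construction vtable B-in-D and D's own vtable are needed, and B's primary vtable could be dropped unless its generation is forced.

  struct V { virtual ~V () {} };

  /* B has a virtual base and is never instantiated as a complete object
     below, so its primary vtable is not strictly needed -- only the
     construction vtable B-in-D (used while constructing D's B subobject)
     and D's vtable are.  */
  struct B : virtual V { virtual int f () { return 1; } };
  struct D : B         { int f () override { return 2; } };

  int
  use ()
  {
    D d;                 /* only complete objects of D exist */
    return d.f ();
  }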
> >> + /* Public class w/o key member function (or local class in a public >> + inline function) requires COMDAT-like vtable so as to be shared >> + among units. But C++ privatizing via -fno-weak would introduce >> + multiple static vtable copies for one class in merged lto symbol >> + table. This breaks one-to-one correspondence between class and >> + vtable, and makes class liveness check become not that easy. To >> + be simple, we exclude such kind of class from our choice list. > > Same question. Also, why would you use -fno-weak? Forcing multiple > copies of things we're perfectly capable of combining seems like a > strange choice. You can privatize things with the symbol visibility > controls or RTLD_LOCAL. We expect that user does not specify -fno-weak for WPD. But if specified, we should correctly handle that and bypass the type. And indeed there is no need to force generation of vtable under this situation. But if vtable is not keyed to any compilation unit, we might never have any copy of it in ordinary build, while its class type is meaningful to whole-program analysis, such as an abstract root class. Thanks, Feng
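As a concrete (hypothetical) illustration of the "class without a key member function" case discussed above: every virtual member of K below is defined inline, so K has no key function and its vtable is emitted as a COMDAT in every translation unit that needs it; building with -fno-weak would instead give each unit its own static copy, breaking the one-to-one class/vtable correspondence the analysis relies on.

  /* K has no non-inline, non-pure virtual member function ("key function"),
     so its vtable is COMDAT and may appear in many translation units.  */
  struct K
  {
    virtual int f () { return 42; }   /* defined inline */
  };

  int
  call (K *p)
  {
    return p->f ();
  }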
Re: [PATCH/RFC 1/2] WPD: Enable whole program devirtualization
>On 9/16/21 05:25, Feng Xue OS via Gcc-patches wrote: >> This and following patches are composed to enable full devirtualization >> under whole program assumption (so also called whole-program >> devirtualization, WPD for short), which is an enhancement to current >> speculative devirtualization. The base of the optimization is how to >> identify class type that is local in terms of whole-program scope, at >> least those class types in libstdc++ must be excluded in some way. >> Our means is to use typeinfo symbol as identity marker of a class since >> it is unique and always generated once the class or its derived type >> is instantiated somewhere, and rely on symbol resolution by >> lto-linker-plugin to detect whether a typeinfo is referenced by regular >> object/library, which indirectly tells class types are escaped or not. >> The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html >> gives more details on that. >> >> Bootstrapped/regtested on x86_64-linux and aarch64-linux. >> >> Thanks, >> Feng >> >> >> 2021-09-07 Feng Xue >> >> gcc/ >> * common.opt (-fdevirtualize-fully): New option. >> * class.c (build_rtti_vtbl_entries): Force generation of typeinfo >> even -fno-rtti is specificied under full devirtualization. > >This makes -fno-rtti useless; rather than this, you should warn about >the combination of flags and force flag_rtti on. It also sounds like >you depend on the library not being built with -fno-rtti. Although rtti is generated by front-end, we will remove it after lto symtab merge, which is meant to keep same behavior as -fno-rtti. Yes, regular library to be linked with should contain rtti data, otherwise WPD could not deduce class type usage safely. By default, we can think that it should work for libstdc++, but it probably becomes a problem for user library, which might be avoided if we properly document this requirement and suggest user doing that when using WPD. Thanks Feng > >> * cgraph.c (cgraph_update_edges_for_call_stmt): Add an assertion >> to check node to be traversed. >> * cgraphclones.c (cgraph_node::find_replacement): Record >> former_clone_of on replacement node. >> * cgraphunit.c (symtab_node::needed_p): Always output vtable for >> full devirtualization. >> (analyze_functions): Force generation of primary vtables for all >> base classes. >> * ipa-devirt.c (odr_type_d::whole_program_local): New field. >> (odr_type_d::has_virtual_base): Likewise. >> (odr_type_d::all_derivations_known): Removed. >> (odr_type_d::whole_program_local_p): New member function. >> (odr_type_d::all_derivations_known_p): Likewise. >> (odr_type_d::possibly_instantiated_p): Likewise. >> (odr_type_d::set_has_virtual_base): Likewise. >> (get_odr_type): Set "whole_program_local" and "has_virtual_base" >> when adding a type. >> (type_all_derivations_known_p): Replace implementation by a call >> to odr_type_d::all_derivations_known_p. >> (type_possibly_instantiated_p): Replace implementation by a call >> to odr_type_d::possibly_instantiated_p. >> (type_known_to_have_no_derivations_p): Replace call to >> type_possibly_instantiated_p with call to >> odr_type_d::possibly_instantiated_p. >> (type_all_ctors_visible_p): Removed. >> (type_whole_program_local_p): New function. >> (get_type_vtable): Likewise. >> (extract_typeinfo_in_vtable): Likewise. >> (identify_whole_program_local_types): Likewise. >> (dump_odr_type): Dump has_virtual_base and whole_program_local_p() >> of type. 
>> (maybe_record_node): Resort to type_whole_program_local_p to >> check whether a class has been optimized away. >> (record_target_from_binfo): Remove parameter "anonymous", add >> a new parameter "possibly_instantiated", and adjust code >> accordingly. >> (devirt_variable_node_removal_hook): Replace call to >> "type_in_anonymous_namespace_p" with "type_whole_program_local_p". >> (possible_polymorphic_call_targets): Replace call to >> "type_possibly_instantiated_p" with "possibly_instantiated_p", >> replace flag check on "all_derivations_known" with call to >>"all_derivations_known_p". >> * ipa-icf.c (filter_removed_items): Disable folding on vtable >> under full devirtualization. >> * ipa-
[PATCH/RFC 2/2] WPD: Enable whole program devirtualization at LTRANS
This patch is to extend applicability of full devirtualization to LTRANS stage. Normally, whole program assumption would not hold when WPA splits whole compilation into more than one LTRANS partitions. To avoid information lost for WPD at LTRANS, we will record all vtable nodes and related member function references into each partition. Bootstrapped/regtested on x86_64-linux and aarch64-linux. Thanks, Feng 2021-09-07 Feng Xue gcc/ * tree.h (TYPE_CXX_LOCAL): New macro for type using base.nothrow_flag. * tree-core.h (tree_base): Update comment on using base.nothrow_flag to represent TYPE_CXX_LOCAL. * ipa-devirt.c (odr_type_d::whole_program_local): Removed. (odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag on type, and enable WPD at LTRANS when flag_devirtualize_fully is true. (get_odr_type): Remove setting whole_program_local flag on type. (identify_whole_program_local_types): Replace whole_program_local in odr_type_d by TYPE_CXX_LOCAL on type. (maybe_record_node): Enable WPD at LTRANS when flag_devirtualize_fully is true. * ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS stage under full devirtualization. * lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables to boundary of each LTRANS partition. * lto-streamer-out.c (get_symbol_initial_value): Streaming out initial value of vtable even its class is optimized away. * lto-lang.c (lto_post_options): Disable full devirtualization if flag_ltrans_devirtualize is false. * tree-streamer-in.c (unpack_ts_base_value_fields): unpack value of TYPE_CXX_LOCAL for a type from streaming data. * tree-streamer-out.c (pack_ts_base_value_fields): pack value ofTYPE_CXX_LOCAL for a type into streaming data. --- From 3af32b9aadff23d339750ada4541386b3d358edc Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 6 Sep 2021 20:34:50 +0800 Subject: [PATCH 2/2] WPD: Enable whole program devirtualization at LTRANS Whole program assumption would not hold when WPA splits whole compilation into more than one LTRANS partitions. To avoid information lost for WPD at LTRANS, we will record all vtable nodes and related member function references into each partition. 2021-09-07 Feng Xue gcc/ * tree.h (TYPE_CXX_LOCAL): New macro for type using base.nothrow_flag. * tree-core.h (tree_base): Update comment on using base.nothrow_flag to represent TYPE_CXX_LOCAL. * ipa-devirt.c (odr_type_d::whole_program_local): Removed. (odr_type_d::whole_program_local_p): Check TYPE_CXX_LOCAL flag on type, and enable WPD at LTRANS when flag_devirtualize_fully is true. (get_odr_type): Remove setting whole_program_local flag on type. (identify_whole_program_local_types): Replace whole_program_local in odr_type_d by TYPE_CXX_LOCAL on type. (maybe_record_node): Enable WPD at LTRANS when flag_devirtualize_fully is true. * ipa.c (can_remove_vtable_if_no_refs_p): Retain vtables at LTRANS stage under full devirtualization. * lto-cgraph.c (compute_ltrans_boundary): Add all defined vtables to boundary of each LTRANS partition. * lto-streamer-out.c (get_symbol_initial_value): Streaming out initial value of vtable even its class is optimized away. * lto-lang.c (lto_post_options): Disable full devirtualization if flag_ltrans_devirtualize is false. * tree-streamer-in.c (unpack_ts_base_value_fields): unpack value of TYPE_CXX_LOCAL for a type from streaming data. * tree-streamer-out.c (pack_ts_base_value_fields): pack value ofTYPE_CXX_LOCAL for a type into streaming data. 
--- gcc/ipa-devirt.c| 29 ++--- gcc/ipa.c | 7 ++- gcc/lto-cgraph.c| 18 ++ gcc/lto-streamer-out.c | 12 +++- gcc/lto/lto-lang.c | 6 ++ gcc/tree-core.h | 3 +++ gcc/tree-streamer-in.c | 11 --- gcc/tree-streamer-out.c | 11 --- gcc/tree.h | 5 + 9 files changed, 83 insertions(+), 19 deletions(-) diff --git a/gcc/ipa-devirt.c b/gcc/ipa-devirt.c index fcb097d7156..65e9ebbfb59 100644 --- a/gcc/ipa-devirt.c +++ b/gcc/ipa-devirt.c @@ -216,8 +216,6 @@ struct GTY(()) odr_type_d int id; /* Is it in anonymous namespace? */ bool anonymous_namespace; - /* Set when type is not used outside of program. */ - bool whole_program_local; /* Did we report ODR violation here? */ bool odr_violated; /* Set when virtual table without RTTI prevailed table with. */ @@ -290,10 +288,18 @@ get_type_vtable (tree type) bool odr_type_d::whole_program_local_p () { - if (flag_ltrans) + if (flag_ltrans && !flag_devirtualize_fully) return false; - return whole_program_local; + if (in_lto_p) +return TYPE_CXX_LOCAL (type); + + /* Although a local class is always considered as whole program loca
[PATCH/RFC 1/2] WPD: Enable whole program devirtualization
This and following patches are composed to enable full devirtualization under whole program assumption (so also called whole-program devirtualization, WPD for short), which is an enhancement to current speculative devirtualization. The base of the optimization is how to identify class type that is local in terms of whole-program scope, at least those class types in libstdc++ must be excluded in some way. Our means is to use typeinfo symbol as identity marker of a class since it is unique and always generated once the class or its derived type is instantiated somewhere, and rely on symbol resolution by lto-linker-plugin to detect whether a typeinfo is referenced by regular object/library, which indirectly tells class types are escaped or not. The RFC at https://gcc.gnu.org/pipermail/gcc/2021-August/237132.html gives more details on that. Bootstrapped/regtested on x86_64-linux and aarch64-linux. Thanks, Feng 2021-09-07 Feng Xue gcc/ * common.opt (-fdevirtualize-fully): New option. * class.c (build_rtti_vtbl_entries): Force generation of typeinfo even -fno-rtti is specificied under full devirtualization. * cgraph.c (cgraph_update_edges_for_call_stmt): Add an assertion to check node to be traversed. * cgraphclones.c (cgraph_node::find_replacement): Record former_clone_of on replacement node. * cgraphunit.c (symtab_node::needed_p): Always output vtable for full devirtualization. (analyze_functions): Force generation of primary vtables for all base classes. * ipa-devirt.c (odr_type_d::whole_program_local): New field. (odr_type_d::has_virtual_base): Likewise. (odr_type_d::all_derivations_known): Removed. (odr_type_d::whole_program_local_p): New member function. (odr_type_d::all_derivations_known_p): Likewise. (odr_type_d::possibly_instantiated_p): Likewise. (odr_type_d::set_has_virtual_base): Likewise. (get_odr_type): Set "whole_program_local" and "has_virtual_base" when adding a type. (type_all_derivations_known_p): Replace implementation by a call to odr_type_d::all_derivations_known_p. (type_possibly_instantiated_p): Replace implementation by a call to odr_type_d::possibly_instantiated_p. (type_known_to_have_no_derivations_p): Replace call to type_possibly_instantiated_p with call to odr_type_d::possibly_instantiated_p. (type_all_ctors_visible_p): Removed. (type_whole_program_local_p): New function. (get_type_vtable): Likewise. (extract_typeinfo_in_vtable): Likewise. (identify_whole_program_local_types): Likewise. (dump_odr_type): Dump has_virtual_base and whole_program_local_p() of type. (maybe_record_node): Resort to type_whole_program_local_p to check whether a class has been optimized away. (record_target_from_binfo): Remove parameter "anonymous", add a new parameter "possibly_instantiated", and adjust code accordingly. (devirt_variable_node_removal_hook): Replace call to "type_in_anonymous_namespace_p" with "type_whole_program_local_p". (possible_polymorphic_call_targets): Replace call to "type_possibly_instantiated_p" with "possibly_instantiated_p", replace flag check on "all_derivations_known" with call to "all_derivations_known_p". * ipa-icf.c (filter_removed_items): Disable folding on vtable under full devirtualization. * ipa-polymorphic-call.c (restrict_to_inner_class): Move odr type check to type_known_to_have_no_derivations_p. * ipa-utils.h (identify_whole_program_local_types): New declaration. (type_all_derivations_known_p): Parameter type adjustment. * ipa.c (walk_polymorphic_call_targets): Do not mark vcall targets as reachable for full devirtualization. 
(can_remove_vtable_if_no_refs_p): New function. (symbol_table::remove_unreachable_nodes): Add defined vtables to reachable list under full devirtualization. * lto-symtab.c (lto_symtab_merge_symbols): Identify whole program local types after symbol table merge. ---From 2632d8e7ea8f96cb545e57dedd9e4148b5a2cae4 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 6 Sep 2021 15:03:31 +0800 Subject: [PATCH 1/2] WPD: Enable whole program devirtualization Enable full devirtualization under whole program assumption (so also called whole-program devirtualization, WPD for short). The base of the optimization is how to identify class type that is local in terms of whole-program scope. But "whole program" does not ensure that class hierarchy of a type never span to dependent C++ libraries (one is libstdc++), which would result in incorrect devirtualization. An example is given below to demonstrate the problem. // Has been pre-compiled to a library class Base { vi
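The example quoted above is cut off; a rough sketch of the situation it describes (my reconstruction, not the original text): the class hierarchy leaks into a library that is outside the LTO whole-program view, so treating Base as whole-program local and devirtualizing calls through it would be wrong.

  /* Pre-compiled into a regular (non-LTO) library: */
  class Base
  {
  public:
    virtual int foo () { return 0; }
  };
  class Derived : public Base
  {
  public:
    int foo () override { return 1; }
  };
  Base *make_object () { return new Derived (); }

  /* LTO-compiled program: only Base and the declaration are visible, so a
     "whole program" assumption sees no derived type, and an unguarded
     devirtualization of p->foo () to Base::foo would be incorrect.  */
  extern Base *make_object ();
  int call_foo () { Base *p = make_object (); return p->foo (); }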
Re: [PATCH] Fix loop split incorrect count and probability
Any transformation involving cfg alteration would face same problem, it is not that easy to update new cfg with reasonable and seemly-correct profile count. We can adjust probability for impacted condition bbs, but lack of a utility like what static profile estimating pass does, and only propagates count partially. Thanks, Feng From: Richard Biener Sent: Tuesday, August 10, 2021 10:47 PM To: Xionghu Luo Cc: gcc-patches@gcc.gnu.org; seg...@kernel.crashing.org; Feng Xue OS; wschm...@linux.ibm.com; guoji...@linux.ibm.com; li...@gcc.gnu.org; hubi...@ucw.cz Subject: Re: [PATCH] Fix loop split incorrect count and probability On Mon, 9 Aug 2021, Xionghu Luo wrote: > Thanks, > > On 2021/8/6 19:46, Richard Biener wrote: > > On Tue, 3 Aug 2021, Xionghu Luo wrote: > > > >> loop split condition is moved between loop1 and loop2, the split bb's > >> count and probability should also be duplicated instead of (100% vs INV), > >> secondly, the original loop1 and loop2 count need be propotional from the > >> original loop. > >> > >> > >> diff base/loop-cond-split-1.c.151t.lsplit > >> patched/loop-cond-split-1.c.151t.lsplit: > >> ... > >> int prephitmp_16; > >> int prephitmp_25; > >> > >> [local count: 118111600]: > >> if (n_7(D) > 0) > >> goto ; [89.00%] > >> else > >> goto ; [11.00%] > >> > >> [local count: 118111600]: > >> return; > >> > >> [local count: 105119324]: > >> pretmp_3 = ga; > >> > >> - [local count: 955630225]: > >> + [local count: 315357973]: > >> # i_13 = PHI > >> # prephitmp_12 = PHI > >> if (prephitmp_12 != 0) > >> goto ; [33.00%] > >> else > >> goto ; [67.00%] > >> > >> - [local count: 315357972]: > >> + [local count: 104068130]: > >> _2 = do_something (); > >> ga = _2; > >> > >> - [local count: 955630225]: > >> + [local count: 315357973]: > >> # prephitmp_5 = PHI > >> i_10 = inc (i_13); > >> if (n_7(D) > i_10) > >> goto ; [89.00%] > >> else > >> goto ; [11.00%] > >> > >> [local count: 105119324]: > >> goto ; [100.00%] > >> > >> - [local count: 850510901]: > >> + [local count: 280668596]: > >> if (prephitmp_12 != 0) > >> -goto ; [100.00%] > >> +goto ; [33.00%] > >> else > >> -goto ; [INV] > >> +goto ; [67.00%] > >> > >> - [local count: 850510901]: > >> + [local count: 280668596]: > >> goto ; [100.00%] > >> > >> - [count: 0]: > >> + [local count: 70429947]: > >> # i_23 = PHI > >> # prephitmp_25 = PHI > >> > >> - [local count: 955630225]: > >> + [local count: 640272252]: > >> # i_15 = PHI > >> # prephitmp_16 = PHI > >> i_22 = inc (i_15); > >> if (n_7(D) > i_22) > >> goto ; [89.00%] > >> else > >> goto ; [11.00%] > >> > >> - [local count: 850510901]: > >> + [local count: 569842305]: > >> goto ; [100.00%] > >> > >> } > >> > >> gcc/ChangeLog: > >> > >>* tree-ssa-loop-split.c (split_loop): Fix incorrect probability. > >>(do_split_loop_on_cond): Likewise. > >> --- > >> gcc/tree-ssa-loop-split.c | 16 > >> 1 file changed, 8 insertions(+), 8 deletions(-) > >> > >> diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c > >> index 3a09bbc39e5..8e5a7ded0f7 100644 > >> --- a/gcc/tree-ssa-loop-split.c > >> +++ b/gcc/tree-ssa-loop-split.c > >> @@ -583,10 +583,10 @@ split_loop (class loop *loop1) > >>basic_block cond_bb; > > if (!initial_true) > - cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond); > + cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond); > + > + edge true_edge = EDGE_SUCC (bbs[i], 0)->flags & EDGE_TRUE_VALUE > +? 
EDGE_SUCC (bbs[i], 0) > +: EDGE_SUCC (bbs[i], 1); > > >> > >>class loop *loop2 = loop_version (loop1, cond, &cond_bb, > >> - profile_probability::always (), > >> -
Re: [PATCH] Fix loop split incorrect count and probability
Yes. Condition to to switch two versioned loops is "true", the first two arguments should be 100% and 0%. It is different from normal loop split, we could not deduce exactly precise probability for condition-based loop split, since cfg inside loop2 would be changed. (invar-branch is replaced to "true", as shown in the comment on do_split_loop_on_cond). Any way, your way of scaling two loops' probabilities according to that of invar-branch, seems to be a better heuristics than original, which would give us more reasonable execution count, at least for loop header bb. Thanks, Feng From: Gcc-patches on behalf of Richard Biener via Gcc-patches Sent: Friday, August 6, 2021 7:46 PM To: Xionghu Luo Cc: seg...@kernel.crashing.org; wschm...@linux.ibm.com; li...@gcc.gnu.org; gcc-patches@gcc.gnu.org; hubi...@ucw.cz; dje@gmail.com Subject: Re: [PATCH] Fix loop split incorrect count and probability On Tue, 3 Aug 2021, Xionghu Luo wrote: > loop split condition is moved between loop1 and loop2, the split bb's > count and probability should also be duplicated instead of (100% vs INV), > secondly, the original loop1 and loop2 count need be propotional from the > original loop. > > Regression tested pass, OK for master? > > diff base/loop-cond-split-1.c.151t.lsplit > patched/loop-cond-split-1.c.151t.lsplit: > ... >int prephitmp_16; >int prephitmp_25; > > [local count: 118111600]: >if (n_7(D) > 0) > goto ; [89.00%] >else > goto ; [11.00%] > > [local count: 118111600]: >return; > > [local count: 105119324]: >pretmp_3 = ga; > > - [local count: 955630225]: > + [local count: 315357973]: ># i_13 = PHI ># prephitmp_12 = PHI >if (prephitmp_12 != 0) > goto ; [33.00%] >else > goto ; [67.00%] > > - [local count: 315357972]: > + [local count: 104068130]: >_2 = do_something (); >ga = _2; > > - [local count: 955630225]: > + [local count: 315357973]: ># prephitmp_5 = PHI >i_10 = inc (i_13); >if (n_7(D) > i_10) > goto ; [89.00%] >else > goto ; [11.00%] > > [local count: 105119324]: >goto ; [100.00%] > > - [local count: 850510901]: > + [local count: 280668596]: >if (prephitmp_12 != 0) > -goto ; [100.00%] > +goto ; [33.00%] >else > -goto ; [INV] > +goto ; [67.00%] > > - [local count: 850510901]: > + [local count: 280668596]: >goto ; [100.00%] > > - [count: 0]: > + [local count: 70429947]: ># i_23 = PHI ># prephitmp_25 = PHI > > - [local count: 955630225]: > + [local count: 640272252]: ># i_15 = PHI ># prephitmp_16 = PHI >i_22 = inc (i_15); >if (n_7(D) > i_22) > goto ; [89.00%] >else > goto ; [11.00%] > > - [local count: 850510901]: > + [local count: 569842305]: >goto ; [100.00%] > > } > > gcc/ChangeLog: > > * tree-ssa-loop-split.c (split_loop): Fix incorrect probability. > (do_split_loop_on_cond): Likewise. > --- > gcc/tree-ssa-loop-split.c | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/gcc/tree-ssa-loop-split.c b/gcc/tree-ssa-loop-split.c > index 3a09bbc39e5..8e5a7ded0f7 100644 > --- a/gcc/tree-ssa-loop-split.c > +++ b/gcc/tree-ssa-loop-split.c > @@ -583,10 +583,10 @@ split_loop (class loop *loop1) > basic_block cond_bb; > > class loop *loop2 = loop_version (loop1, cond, &cond_bb, > -profile_probability::always (), > -profile_probability::always (), > -profile_probability::always (), > -profile_probability::always (), > +true_edge->probability, > +true_edge->probability.invert (), > +true_edge->probability, > +true_edge->probability.invert (), > true); there is no 'true_edge' variable at this point. 
> gcc_assert (loop2); > > @@ -1486,10 +1486,10 @@ do_split_loop_on_cond (struct loop *loop1, edge > invar_branch) >initialize_original_copy_tables (); > >struct loop *loop2 = loop_version (loop1, boolean_true_node, NULL, > - profile_probability::always (), > - profile_probability::never (), > - profile_probability::always (), > - profile_probability::always (), > + invar_branch->probability.invert (), > + invar_branch->probability, > + invar_branch->probability.invert (), > + invar_branch->probability, >true); >if (!loop2) >
Question about non-POD class type
For an instance of a non-POD class, can I always assume that any operation on it is type-safe, and that any wrong or even tricky code that violates this is UB per the C++ spec? For example, here are some ways: union { Type1 *p1; Type2 *p2; }; or union { Type1 t1; Type2 t2; }; or Type1 *p1; void *p = p1; Type2 *p2 = (Type2 *) p; p2->xxx; Feng
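To make the question concrete, here is a compilable C++ sketch of the three patterns above; Type1 and Type2 are placeholder non-POD classes (not from any real code base), and each access through the wrong type is the kind of violation being asked about:

// Sketch only: Type1/Type2 are hypothetical, unrelated non-POD classes.
struct Type1 { Type1 () : a (1) {} int a; };
struct Type2 { Type2 () : b (2) {} int b; };

int f (Type1 *t1)
{
  // 1) Type punning through a union of pointers.
  union { Type1 *p1; Type2 *p2; } u1;
  u1.p1 = t1;
  int x = u1.p2->b;                     // reads a Type1 object through Type2*

  // 2) Union of the objects themselves; reading the inactive member.
  union U2 { Type1 m1; Type2 m2; U2 () : m1 () {} } u2;
  int y = u2.m2.b;                      // m2 is not the active member

  // 3) Converting the pointer through void *.
  void *p = t1;
  Type2 *p2 = static_cast<Type2 *> (p);
  int z = p2->b;                        // accesses a Type1 object as a Type2

  return x + y + z;
}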
Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
>> gcc/ >> PR tree-optimization/98598 >> * Makefile.in (OBJS): Add tree-ssa-loop-mgo.o. >> * common.opt (-ftree-loop-mgo): New option. > > Just a quick comment - -ftree-loop-mgo is user-facing and it isn't really a > good > name. -floop-mgo would be better but still I'd have no idea what this would > do. > > I don't have a good suggestion here other than to expand it to > -floop-gather-memory (?!). OK. Better than "mgo", this abbr. is only a term for development use. > The option documentation isn't informative either. > > From: > > outer-loop () > { > inner-loop (iter, iter_count) > { > Type1 v1 = LOAD (iter); > Type2 v2 = LOAD (v1); > Type3 v3 = LOAD (v2); > ... > iter = NEXT (iter); > } > } > > To: > > typedef struct cache_elem > { > bool init; > Type1 c_v1; > Type2 c_v2; > Type3 c_v3; > } cache_elem; > > cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem)); > > outer-loop () > { > size_t cache_idx = 0; > > inner-loop (iter, iter_count) > { > if (!cache_arr[cache_idx]->init) > { > v1 = LOAD (iter); > v2 = LOAD (v1); > v3 = LOAD (v2); > > cache_arr[cache_idx]->init = true; > cache_arr[cache_idx]->c_v1 = v1; > cache_arr[cache_idx]->c_v2 = v2; > cache_arr[cache_idx]->c_v3 = v3; > } > else > { > v1 = cache_arr[cache_idx]->c_v1; > v2 = cache_arr[cache_idx]->c_v2; > v3 = cache_arr[cache_idx]->c_v3; > } > ... > cache_idx++; > iter = NEXT (iter); > } > } > > free (cache_arr); > > This is a _very_ special transform. What it seems to do is > optimize the dependent loads for outer loop iteration n > 1 > by caching the result(s). If that's possible then you should > be able to distribute the outer loop to one doing the caching > and one using the cache. Then this transform would be more > like a tradidional array expansion of scalars? In some cases > also loop interchange could remove the need for the caching. > > Doing MGO as the very first loop pass thus looks bad, I think > MGO should be much later, for example after interchange. > I also think that MGO should work in concert with loop > distribution (which could have an improved cost model) > rather than being a separate pass. > > Your analysis phase looks quite expensive, building sth > like a on-the side representation very closely matching SSA. > It seems to work from PHI defs to uses, which looks backwards. Did not catch this point very clearly. Would you please detail it more? > You seem to roll your own dependence analysis code :/ Please > have a look at loop distribution. > > Also you build an actual structure type for reasons that escape > me rather than simply accessing the allocated storage at > appropriate offsets. > > I think simply calling 'calloc' isn't OK because you might need > aligned storage and because calloc might not be available. > Please at least use 'malloc' and make sure MALLOC_ABI_ALIGNMENT > is large enough for the data you want to place (or perform > dynamic re-alignment yourself). We probably want some generic > middle-end utility to obtain aligned allocated storage at some > point. > > As said above I think you want to re-do this transform as > a loop distribution transform. I think if caching works then > the loads should be distributable and the loop distribution > transform should be enhanced to expand the scalars to arrays. I checked code of loop distribution, and its trigger strategy seems to be very conservative, now only targets simple and regular index-based loop, and could not handle link-list traversal, which consists of a series of discrete memory accesses, and MGO would matter a lot. 
Additionally, for some complicated cases, we could not completely decompose MGO into two separate loops for "do caching" and "use caching" respectively. An example: for (i = 0; i < N; i++) { for (j = 0; j < i; j++) { Type1 v1 = LOAD_FN1 (j); Type2 v2 = LOAD_FN2 (v1); Type3 v3 = LOAD_FN3 (v2); ... condition = ... } if (condition) break; } We should not cache all N loads in one step, since some of them might become invalid after "condition" breaks the loop. We have to mix up "do caching" and "use caching", and let them be switched dynamically on the "init" flag. But loop distribution does have some overlap with MGO in analysis and transformation, and we will try to see if there is a way to unify them. Thanks, Feng
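To make the mixed form concrete, here is a minimal C sketch following the pseudo code earlier in this thread; Type1/2/3, LOAD_FN1/2/3, check and the cache layout are placeholders, not real interfaces:

/* Sketch of mixing "do caching" and "use caching" in one loop, switched
   dynamically on the per-element init flag.  */
#include <stdbool.h>
#include <stdlib.h>

typedef int Type1, Type2, Type3;
extern Type1 LOAD_FN1 (int j);
extern Type2 LOAD_FN2 (Type1 v);
extern Type3 LOAD_FN3 (Type2 v);
extern bool check (Type3 v);

typedef struct { bool init; Type1 c_v1; Type2 c_v2; Type3 c_v3; } cache_elem;

void
outer_loop (int N)
{
  cache_elem *cache_arr = calloc (N, sizeof (cache_elem));
  if (!cache_arr)
    return;

  for (int i = 0; i < N; i++)
    {
      bool condition = false;
      for (int j = 0; j < i; j++)
        {
          cache_elem *ce = &cache_arr[j];
          if (!ce->init)                  /* First visit of j: do caching.  */
            {
              ce->c_v1 = LOAD_FN1 (j);
              ce->c_v2 = LOAD_FN2 (ce->c_v1);
              ce->c_v3 = LOAD_FN3 (ce->c_v2);
              ce->init = true;
            }
          condition = check (ce->c_v3);   /* Later visits: use caching.  */
        }
      if (condition)
        break;
    }

  free (cache_arr);
}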
Re: [PATCH/RFC] Add a new memory gathering optimization for loop (PR98598)
>> This patch implements a new loop optimization according to the proposal >> in RFC given at >> https://gcc.gnu.org/pipermail/gcc/2021-January/234682.html. >> So do not repeat the idea in this mail. Hope your comments on it. > > With the caveat that I'm not an optimization expert (but no one else > seems to have replied), here are some thoughts. > > [...snip...] > >> Subject: [PATCH 1/3] mgo: Add a new memory gathering optimization for loop >> [PR98598] > > BTW, did you mean to also post patches 2 and 3? > Not yet, but they are ready. Since this is kind of special optimization that uses heap as temporary storage, not a common means in gcc, we do not know basic attitude of the community towards it. So only the first patch was sent out for initial comments, in that it implements a generic MGO framework, and is complete and self-contained. Other 2 patches just composed some enhancements for specific code pattern and dynamic alias check. If possible, this proposal would be accepted principally, we will submit other 2 for review. > >> In nested loops, if scattered memory accesses inside inner loop remain >> unchanged in outer loop, we can sequentialize these loads by caching >> their values into a temporary memory region at the first time, and >> reuse the caching data in following iterations. This way can improve >> efficiency of cpu cache subsystem by reducing its unpredictable activies. > > I don't think you've cited any performance numbers so far. Does the > optimization show a measurable gain on some benchmark(s)? e.g. is this > ready to run SPEC yet, and how does it do? Yes, we have done that. Minor improvement about several point percentage could gain for some real applications. And to be specific, we also get major improvement as more than 30% for certain benchmark in SPEC2017. > >> To illustrate what the optimization will do, two pieces of pseudo code, >> before and after transformation, are given. Suppose all loads and >> "iter_count" are invariant in outer loop. >> >> From: >> >> outer-loop () >> { >> inner-loop (iter, iter_count) >> { >> Type1 v1 = LOAD (iter); >> Type2 v2 = LOAD (v1); >> Type3 v3 = LOAD (v2); >> ... >> iter = NEXT (iter); >> } >> } >> >> To: >> >> typedef struct cache_elem >> { >> bool init; >> Type1 c_v1; >> Type2 c_v2; >> Type3 c_v3; > > Putting the "bool init;" at the front made me think "what about > packing?" but presumably the idea is that every element is accessed in > order, so it presumably benefits speed to have "init" at the top of the > element, right? Yes, layout of the struct layout could be optimized in terms of size by some means, such as: o. packing "init" into a padding hole after certain field o. if certain field is a pointer type, the field can take the role of "init" (Non-NULL implies "initialized") Now this simple scheme is straightforward, and would be enhanced in various aspects later. >> } cache_elem; >> >> cache_elem *cache_arr = calloc (iter_count, sizeof (cache_elem)); > What if the allocation fails at runtime? Do we keep an unoptimized > copy of the nested loops around as a fallback and have an unlikely > branch to that copy? Yes, we should. But in a different way, a flag is added into original nested loop to control runtime switch between optimized and unoptimized execution. This definitely incurs runtime cost, but avoid possible code size bloating. A better handling, as a TODO is to apply dynamic-switch for large loop, and loop-clone for small one. 
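For reference, a small C sketch of that runtime-switch fallback; it only models the allocation-failure flag, caching a single value per inner iteration for brevity (placeholder names, not actual generated code):

/* If calloc fails, use_cache stays false and the original unoptimized
   load sequence runs, so no separate copy of the loop nest is needed.  */
#include <stdbool.h>
#include <stdlib.h>

extern int LOAD (int i);

long
outer_loop (int iter_count, int outer_count)
{
  int *cache_arr = calloc (iter_count, sizeof (int));
  bool use_cache = (cache_arr != NULL);
  long sum = 0;

  for (int o = 0; o < outer_count; o++)
    for (int i = 0; i < iter_count; i++)
      {
        int v;
        if (use_cache && o > 0)
          v = cache_arr[i];      /* Optimized path: reuse the cached value.  */
        else
          {
            v = LOAD (i);        /* Original path, or the first outer pass.  */
            if (use_cache)
              cache_arr[i] = v;
          }
        sum += v;
      }

  free (cache_arr);
  return sum;
}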
> I notice that you're using calloc, presumably to clear all of the > "init" flags (and the whole buffer). > > FWIW, this feels like a case where it would be nice to have a thread- > local heap allocation, perhaps something like an obstack implemented in > the standard library - but that's obviously scope creep for this. Yes, that would be good, especially for multi-threaded applications. > Could it make sense to use alloca for small allocations? (or is that > scope creep?) We did consider using alloca as you said. But if we cannot determine an upper limit for a non-constant size, we have to place the alloca inside a loop that encloses the nested loop. Without a corresponding free operation, this kind of alloca-in-loop might cause stack overflow. So it becomes another TODO. >> outer-loop () >> { >> size_t cache_idx = 0; >> >> inner-loop (iter, iter_count) >> { >> if (!cache_arr[cache_idx]->init) >> { >> v1 = LOAD (iter); >> v2 = LOAD (v1); >> v3 = LOAD (v2); >> >> cache_arr[cache_idx]->init = true; >> cache_arr[cache_idx]->c_v1 = v1; >> cache_arr[cache_idx]->c_v2 = v2; >> cache_arr[cache_idx]->c_v3 = v3; >> } >> else >> { >> v1 = cache_arr[cache_idx]->c_v1; >> v2
[PATCH] Fix testcases to avoid plusminus-with-convert pattern (PR 97066)
With the new pattern rule (T)(A) +- (T)(B) -> (T)(A +- B), some testcases are simplified and could not keep expected code pattern as test-check. Minor changes are made to those cases to avoid simplification effect of the rule. Tested on x86_64-linux and aarch64-linux. Feng --- 2020-09-16 Feng Xue gcc/testsuite/ PR testsuite/97066 * gcc.dg/ifcvt-3.c: Modified to suppress simplification. * gcc.dg/tree-ssa/20030807-10.c: Likewise.From ac768c385f1332e276260c6de83b12929180fbfb Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Wed, 16 Sep 2020 16:21:14 +0800 Subject: [PATCH] testsuite/97066 - minor change to bypass plusminus-with-convert rule The following testcases will be simplified by the new rule (T)(A) +- (T)(B) -> (T)(A +- B), so could not keep expected code pattern as test-check. Adjust test code to suppress simplification. 2020-09-16 Feng Xue gcc/testsuite/ PR testsuite/97066 * gcc.dg/ifcvt-3.c: Modified to suppress simplification. * gcc.dg/tree-ssa/20030807-10.c: Likewise. --- gcc/testsuite/gcc.dg/ifcvt-3.c | 2 +- gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/gcc/testsuite/gcc.dg/ifcvt-3.c b/gcc/testsuite/gcc.dg/ifcvt-3.c index b250bc15e08..56fdd753a0a 100644 --- a/gcc/testsuite/gcc.dg/ifcvt-3.c +++ b/gcc/testsuite/gcc.dg/ifcvt-3.c @@ -11,7 +11,7 @@ foo (s64 a, s64 b, s64 c) if (d == 0) return a + c; else -return b + d + c; +return b + c + d; } /* This test can be reduced to just return a + c; */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c b/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c index 0903f3c4321..0e01e511b78 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/20030807-10.c @@ -7,7 +7,7 @@ unsigned int subreg_highpart_offset (outermode, innermode) int outermode, innermode; { - unsigned int offset = 0; + unsigned int offset = 1; int difference = (mode_size[innermode] - mode_size[outermode]); if (difference > 0) { -- 2.17.1
Re: [PATCH 2/2 V4] Add plusminus-with-convert pattern (PR 94234)
>> Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B) >> could be folded to a simple value. By this rule, a >> plusminus-mult-with-convert >> expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B). > >Please use INTEGRAL_TYPE_P () instead of TREE_CODE == INTEGER_TYPE >in all three cases. It's enough to check for INTEGRAL_TYPE_P on one operand, >the types_match will take care of the other. I would have considered using INTEGRAL_TYPE_P(), but if inner type is bool or enum, can we do plus/minus operation on that? Feng > >OK with those changes. > >Thanks, >Richard. > > > Bootstrapped/regtested on x86_64-linux and aarch64-linux. > > Feng > --- > 2020-09-15 Feng Xue > > gcc/ > PR tree-optimization/94234 > * match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification. > > gcc/testsuite/ > PR tree-optimization/94234 > * gcc.dg/pr94234-3.c: New test.
Re: Ping: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)
>> This patch is to handle simplification of plusminus-mult-with-convert >> expression >> as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of >> multiplication. >> This is done in forwprop pass. We try to transform it to (T) (X +- Y), and >> resort >> to gimple-matcher to fold (X +- Y) instead of manually code pattern >> recognition. > >I still don't like the complete new function with all its correctness >issues - the existing >fold_plusminus_mult_expr was difficult enough to get correct for >corner cases and >we do have a set of match.pd patterns (partly?) implementing its transforms. > >Looking at > >+unsigned goo (unsigned m_param, unsigned n_param) >+{ >+ unsigned b1 = m_param * (n_param + 2); >+ unsigned b2 = m_param * (n_param + 1); >+ int r = (int)(b1) - (int)(b2); > >it seems we want to simplify (signed)A - (signed)B to >(signed)(A - B) if A - B "simplifies"? I guess > >(simplify > (plusminus (nop_convert @0) (nop_convert? @1)) > (convert (plusminus! @0 @1))) > >probably needs a swapped pattern or not iterate over plus/minus >to handle at least one converted operand and avoid adding >a (plus @0 @1) -> (convert (plus! @0 @1)) rule. > >Even > >(simplify > (minus (nop_convert @0) (nop_convert @1)) > (convert (minus! @0 @1))) > >seems to handle all your testcases already (which means >they are all the same and not very exhaustive...) Yes. This is much simpler. Thanks, Feng >Richard. > > >> Regards, >> Feng >> --- >> 2020-09-03 Feng Xue >> >> gcc/ >> PR tree-optimization/94234 >> * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New >> function. >> (fwprop_ssa_val): Move it before its new caller. >> (pass_forwprop::execute): Add call to >> simplify_plusminus_mult_with_convert. >> >> gcc/testsuite/ >> PR tree-optimization/94234 >> * gcc.dg/pr94234-3.c: New test. >
[PATCH 2/2 V4] Add plusminus-with-convert pattern (PR 94234)
Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B) could be folded to a simple value. By this rule, a plusminus-mult-with-convert expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B). Bootstrapped/regtested on x86_64-linux and aarch64-linux. Feng --- 2020-09-15 Feng Xue gcc/ PR tree-optimization/94234 * match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-3.c: New test.From f7c7483bd61fe1e3d6888f84d718fb4be4ea9e14 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 17 Aug 2020 23:00:35 +0800 Subject: [PATCH] tree-optimization/94234 - add plusminus-with-convert pattern Add a rule (T)(A) +- (T)(B) -> (T)(A +- B), which works only when (A +- B) could be folded to a simple value. By this rule, a plusminus-mult-with-convert expression could be handed over to the rule (A * C) +- (B * C) -> (A +- B). 2020-09-15 Feng Xue gcc/ PR tree-optimization/94234 * match.pd (T)(A) +- (T)(B) -> (T)(A +- B): New simplification. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-3.c: New test. --- gcc/match.pd | 16 gcc/testsuite/gcc.dg/pr94234-3.c | 42 2 files changed, 58 insertions(+) create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c diff --git a/gcc/match.pd b/gcc/match.pd index 46fd880bd37..d8c59fad9c1 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -2397,6 +2397,22 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (plus (convert @0) (op @2 (convert @1)) #endif +/* (T)(A) +- (T)(B) -> (T)(A +- B) only when (A +- B) could be simplified + to a simple value. */ +#if GIMPLE + (for op (plus minus) + (simplify +(op (convert @0) (convert @1)) + (if (TREE_CODE (type) == INTEGER_TYPE + && TREE_CODE (TREE_TYPE (@0)) == INTEGER_TYPE + && TREE_CODE (TREE_TYPE (@1)) == INTEGER_TYPE + && TYPE_PRECISION (type) <= TYPE_PRECISION (TREE_TYPE (@0)) + && types_match (TREE_TYPE (@0), TREE_TYPE (@1)) + && !TYPE_OVERFLOW_TRAPS (type) + && !TYPE_OVERFLOW_SANITIZED (type)) + (convert (op! @0 @1) +#endif + /* ~A + A -> -1 */ (simplify (plus:c (bit_not @0) @0) diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c new file mode 100644 index 000..9bb9b46bd96 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr94234-3.c @@ -0,0 +1,42 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop1" } */ + +typedef __SIZE_TYPE__ size_t; +typedef __PTRDIFF_TYPE__ ptrdiff_t; + +ptrdiff_t foo1 (char *a, size_t n) +{ + char *b1 = a + 8 * n; + char *b2 = a + 8 * (n - 1); + + return b1 - b2; +} + +int use_ptr (char *a, char *b); + +ptrdiff_t foo2 (char *a, size_t n) +{ + char *b1 = a + 8 * (n - 1); + char *b2 = a + 8 * n; + + use_ptr (b1, b2); + + return b1 - b2; +} + +int use_int (int i); + +unsigned goo (unsigned m_param, unsigned n_param) +{ + unsigned b1 = m_param * (n_param + 2); + unsigned b2 = m_param * (n_param + 1); + int r = (int)(b1) - (int)(b2); + + use_int (r); + + return r; +} + +/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */ -- 2.17.1
Re: Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)
>@@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool >gimple, operand *result) > /* Re-fold the toplevel result. It's basically an embedded > gimple_build w/o actually building the stmt. */ > if (!is_predicate) >- fprintf_indent (f, indent, >- "res_op->resimplify (lseq, valueize);\n"); >+ { >+ fprintf_indent (f, indent, >+ "res_op->resimplify (lseq, valueize);\n"); >+ if (e->force_leaf) >+ { >+ fprintf_indent (f, indent, >+ "if (!maybe_push_res_to_seq (res_op, NULL))\n"); >+ fprintf_indent (f, indent + 2, "return false;\n"); > >please use "goto %s;\n", fail_label) here. OK with that change. Ok. > >I've tried again to think about sth prettier to cover these kind of >single-use checks but failed to come up with sth. Maybe we need a smart combiner that can deduce cost globally, and remove these single-use specifiers from rule description. Feng From: Richard Biener Sent: Monday, September 14, 2020 9:39 PM To: Feng Xue OS Cc: gcc-patches@gcc.gnu.org Subject: Re: Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234) On Mon, Sep 14, 2020 at 5:17 AM Feng Xue OS via Gcc-patches wrote: > > Thanks, @@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result) /* Re-fold the toplevel result. It's basically an embedded gimple_build w/o actually building the stmt. */ if (!is_predicate) - fprintf_indent (f, indent, - "res_op->resimplify (lseq, valueize);\n"); + { + fprintf_indent (f, indent, + "res_op->resimplify (lseq, valueize);\n"); + if (e->force_leaf) + { + fprintf_indent (f, indent, + "if (!maybe_push_res_to_seq (res_op, NULL))\n"); + fprintf_indent (f, indent + 2, "return false;\n"); please use "goto %s;\n", fail_label) here. OK with that change. I've tried again to think about sth prettier to cover these kind of single-use checks but failed to come up with sth. Thanks and sorry for the delay, Richard. > Feng > > > From: Feng Xue OS > Sent: Thursday, September 3, 2020 2:06 PM > To: gcc-patches@gcc.gnu.org > Subject: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR > 94234) > > For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled > when A and B are not single-use. This patch is a minor enhancement > on the pattern, which allows folding if final result is found to be a > simple gimple value (constant/existing SSA). > > Bootstrapped/regtested on x86_64-linux and aarch64-linux. > > Feng > --- > 2020-09-03 Feng Xue > > gcc/ > PR tree-optimization/94234 > * genmatch.c (dt_simplify::gen_1): Emit check on final simplification > result when "!" is specified on toplevel output expr. > * match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for > expr with multi-use operands if final result is a simple gimple value. > > gcc/testsuite/ > PR tree-optimization/94234 > * gcc.dg/pr94234-2.c: New test. > ---
Ping: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)
Thanks, Feng From: Feng Xue OS Sent: Thursday, September 3, 2020 2:06 PM To: gcc-patches@gcc.gnu.org Subject: [PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234) For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled when A and B are not single-use. This patch is a minor enhancement on the pattern, which allows folding if final result is found to be a simple gimple value (constant/existing SSA). Bootstrapped/regtested on x86_64-linux and aarch64-linux. Feng --- 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * genmatch.c (dt_simplify::gen_1): Emit check on final simplification result when "!" is specified on toplevel output expr. * match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for expr with multi-use operands if final result is a simple gimple value. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-2.c: New test. --- From e247eb0d9a43856cc0b46f98414ed58d13796d62 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Tue, 1 Sep 2020 17:17:58 +0800 Subject: [PATCH] tree-optimization/94234 - Fold plusminus_mult expr with multi-use operands 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * genmatch.c (dt_simplify::gen_1): Emit check on final simplification result when "!" is specified on toplevel output expr. * match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for expr with multi-use operands if final result is a simple gimple value. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-2.c: New test. --- gcc/genmatch.c | 12 -- gcc/match.pd | 22 ++ gcc/testsuite/gcc.dg/pr94234-2.c | 39 3 files changed, 62 insertions(+), 11 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/pr94234-2.c diff --git a/gcc/genmatch.c b/gcc/genmatch.c index 906d842c4d8..d4f01401964 100644 --- a/gcc/genmatch.c +++ b/gcc/genmatch.c @@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result) /* Re-fold the toplevel result. It's basically an embedded gimple_build w/o actually building the stmt. */ if (!is_predicate) - fprintf_indent (f, indent, - "res_op->resimplify (lseq, valueize);\n"); + { + fprintf_indent (f, indent, + "res_op->resimplify (lseq, valueize);\n"); + if (e->force_leaf) + { + fprintf_indent (f, indent, + "if (!maybe_push_res_to_seq (res_op, NULL))\n"); + fprintf_indent (f, indent + 2, "return false;\n"); + } + } } else if (result->type == operand::OP_CAPTURE || result->type == operand::OP_C_EXPR) diff --git a/gcc/match.pd b/gcc/match.pd index 6e45836e32b..46fd880bd37 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -2570,15 +2570,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (for plusminus (plus minus) (simplify (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2)) - (if ((!ANY_INTEGRAL_TYPE_P (type) - || TYPE_OVERFLOW_WRAPS (type) - || (INTEGRAL_TYPE_P (type) - && tree_expr_nonzero_p (@0) - && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type) - /* If @1 +- @2 is constant require a hard single-use on either - original operand (but not on both). */ - && (single_use (@3) || single_use (@4))) -(mult (plusminus @1 @2) @0))) + (if (!ANY_INTEGRAL_TYPE_P (type) + || TYPE_OVERFLOW_WRAPS (type) + || (INTEGRAL_TYPE_P (type) + && tree_expr_nonzero_p (@0) + && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type) +(if (single_use (@3) || single_use (@4)) + /* If @1 +- @2 is constant require a hard single-use on either + original operand (but not on both). */ + (mult (plusminus @1 @2) @0) +#if GIMPLE + (mult! (plusminus @1 @2) @0) +#endif + ))) /* We cannot generate constant 1 for fract. 
*/ (if (!ALL_FRACT_MODE_P (TYPE_MODE (type))) (simplify diff --git a/gcc/testsuite/gcc.dg/pr94234-2.c b/gcc/testsuite/gcc.dg/pr94234-2.c new file mode 100644 index 000..1f4b194dd43 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr94234-2.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop1" } */ + +int use_fn (int a); + +int foo (int n) +{ + int b1 = 8 * (n + 1); + int b2 = 8 * n; + + use_fn (b1 ^ b2); + + return b1 - b2; +} + +unsigned goo (unsigned m_param, unsigned n_param) +{ + unsigned b1 = m_param * (n_param + 2); + unsigned b2 = m_param * (n_param + 1); + + use_fn (b1 ^ b2); + + return b1 - b2; +} + +unsigned hoo (unsigned k_param) +{ + unsigned b1 = k_param * 28; + unsigned b2 = k_param * 15; + unsigned b3 = k_param * 12; + + use_fn (b1 ^ b2 ^ b3); + + return (b1 - b2) - b3; +} + +/* { dg-final { scan-tree-dump-times "return 8;" 1 "for
Ping: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)
Thanks, Feng From: Feng Xue OS Sent: Thursday, September 3, 2020 5:29 PM To: Richard Biener; gcc-patches@gcc.gnu.org Subject: Re: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234) Attach patch file. Feng From: Gcc-patches on behalf of Feng Xue OS via Gcc-patches Sent: Thursday, September 3, 2020 5:27 PM To: Richard Biener; gcc-patches@gcc.gnu.org Subject: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234) This patch is to handle simplification of plusminus-mult-with-convert expression as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of multiplication. This is done in forwprop pass. We try to transform it to (T) (X +- Y), and resort to gimple-matcher to fold (X +- Y) instead of manually code pattern recognition. Regards, Feng --- 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New function. (fwprop_ssa_val): Move it before its new caller. (pass_forwprop::execute): Add call to simplify_plusminus_mult_with_convert. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-3.c: New test. From 98c4b97989207dcef5742e9cb451799feafd125e Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 17 Aug 2020 23:00:35 +0800 Subject: [PATCH] tree-optimization/94234 - simplify plusminus-mult-with-convert in forwprop For expression as ((T) X) +- ((T) Y), and at lease of (X, Y) is result of multification, try to transform it to (T) (X +- Y), and apply simplification on (X +- Y) if possible. In this way, we can avoid creating almost duplicated rule to handle plusminus-mult-with-convert variant. 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New function. (fwprop_ssa_val): Move it before its new caller. (pass_forwprop::execute): Add call to simplify_plusminus_mult_with_convert. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-3.c: New test. --- gcc/testsuite/gcc.dg/pr94234-3.c | 42 gcc/tree-ssa-forwprop.c | 168 +++ 2 files changed, 191 insertions(+), 19 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c new file mode 100644 index 000..9bb9b46bd96 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr94234-3.c @@ -0,0 +1,42 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop1" } */ + +typedef __SIZE_TYPE__ size_t; +typedef __PTRDIFF_TYPE__ ptrdiff_t; + +ptrdiff_t foo1 (char *a, size_t n) +{ + char *b1 = a + 8 * n; + char *b2 = a + 8 * (n - 1); + + return b1 - b2; +} + +int use_ptr (char *a, char *b); + +ptrdiff_t foo2 (char *a, size_t n) +{ + char *b1 = a + 8 * (n - 1); + char *b2 = a + 8 * n; + + use_ptr (b1, b2); + + return b1 - b2; +} + +int use_int (int i); + +unsigned goo (unsigned m_param, unsigned n_param) +{ + unsigned b1 = m_param * (n_param + 2); + unsigned b2 = m_param * (n_param + 1); + int r = (int)(b1) - (int)(b2); + + use_int (r); + + return r; +} + +/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */ diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c index e2d008dfb92..7b9d46ec919 100644 --- a/gcc/tree-ssa-forwprop.c +++ b/gcc/tree-ssa-forwprop.c @@ -338,6 +338,25 @@ remove_prop_source_from_use (tree name) return cfg_changed; } +/* Primitive "lattice" function for gimple_simplify. 
*/ + +static tree +fwprop_ssa_val (tree name) +{ + /* First valueize NAME. */ + if (TREE_CODE (name) == SSA_NAME + && SSA_NAME_VERSION (name) < lattice.length ()) +{ + tree val = lattice[SSA_NAME_VERSION (name)]; + if (val) + name = val; +} + /* We continue matching along SSA use-def edges for SSA names + that are not single-use. Currently there are no patterns + that would cause any issues with that. */ + return name; +} + /* Return the rhs of a gassign *STMT in a form of a single tree, converted to type TYPE. @@ -1821,6 +1840,133 @@ simplify_rotate (gimple_stmt_iterator *gsi) return true; } +/* Given ((T) X) +- ((T) Y), and at least one of (X, Y) is result of + multiplication, if the expr can be transformed to (T) (X +- Y) in terms of + two's complement computation, apply simplification on (X +- Y) if it is + possible. As a prerequisite, outer result type (T) has precision not more + than that of inner operand type. */ + +static bool +simplify_plusminus_mult_with_convert (gimple_stmt_iterator *gsi) +{ + gimple *stmt = gsi_stm
Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)
>> Hi, >> >> On Mon, Aug 31 2020, Feng Xue OS wrote: >> > This patch is to fix a bug that cost that is used to evaluate clone >> > candidate >> > becomes negative due to integer overflow. >> > >> > Feng >> > --- >> > 2020-08-31 Feng Xue >> > >> > gcc/ >> > PR tree-optimization/96806 >> >> the component is "ipa," please change that when you commit the patch. >> >> > * ipa-cp.c (decide_about_value): Use safe_add to avoid cost >> > addition >> > overflow. >> >> assuming you have bootstrapped and tested it, it is OK for both trunk >> and all affected release branches. > >I have already added caps on things that come from profile counts so >things do not overflow, but I think in longer run we want to simply use >sreals here.. >> >&& !good_cloning_opportunity_p (node, >> > - val->local_time_benefit >> > - + val->prop_time_benefit, >> > + safe_add (val->local_time_benefit, >> > + val->prop_time_benefit), >> > freq_sum, count_sum, >> > - val->local_size_cost >> > - + val->prop_size_cost)) >> > + safe_add (val->local_size_cost, >> > + val->prop_size_cost))) > >Is it also size cost that may overflow? That seem bit odd ;) > Yes. prop_size_cost accumulates all callees' size_cost, and since there are two recursive calls, this value grows exponentially (as a power of two) and easily exceeds the value range of an integer. It is actually a defect of the cost computation for recursive cloning. I think we need to consider comprehensively how to adjust the cost model for recursive cloning, including profile estimation, threshold, size_cost... A quick fix is to add a cap here to avoid the overflow. Feng >Honza >> > return false; >> > >> >if (dump_file) >> >> [...] >
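For illustration, the kind of saturating addition the patch relies on can be sketched as below; the patch itself reuses the existing safe_add helper in ipa-cp.c, so this is only an illustrative stand-in, not the actual GCC implementation:

/* Clamp to INT_MAX instead of wrapping into a negative value.
   Assumes non-negative operands, as ipa-cp costs are.  */
#include <limits.h>

static int
saturating_add (int a, int b)
{
  if (a > INT_MAX - b)
    return INT_MAX;
  return a + b;
}

With two self-recursive call sites and recursive cloning depth d, the accumulated prop_size_cost grows roughly with 2^d, so even a small per-clone size cost exceeds a 32-bit int once d gets into the thirties; clamping keeps the comparison in good_cloning_opportunity_p meaningful until a better cost model for recursive cloning is worked out.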
Re: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)
Attach patch file. Feng From: Gcc-patches on behalf of Feng Xue OS via Gcc-patches Sent: Thursday, September 3, 2020 5:27 PM To: Richard Biener; gcc-patches@gcc.gnu.org Subject: [PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234) This patch is to handle simplification of plusminus-mult-with-convert expression as ((T) X) +- ((T) Y), in which at least one of (X, Y) is result of multiplication. This is done in forwprop pass. We try to transform it to (T) (X +- Y), and resort to gimple-matcher to fold (X +- Y) instead of manually code pattern recognition. Regards, Feng --- 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New function. (fwprop_ssa_val): Move it before its new caller. (pass_forwprop::execute): Add call to simplify_plusminus_mult_with_convert. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-3.c: New test. From 98c4b97989207dcef5742e9cb451799feafd125e Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 17 Aug 2020 23:00:35 +0800 Subject: [PATCH] tree-optimization/94234 - simplify plusminus-mult-with-convert in forwprop For expression as ((T) X) +- ((T) Y), and at lease of (X, Y) is result of multification, try to transform it to (T) (X +- Y), and apply simplification on (X +- Y) if possible. In this way, we can avoid creating almost duplicated rule to handle plusminus-mult-with-convert variant. 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New function. (fwprop_ssa_val): Move it before its new caller. (pass_forwprop::execute): Add call to simplify_plusminus_mult_with_convert. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-3.c: New test. --- gcc/testsuite/gcc.dg/pr94234-3.c | 42 gcc/tree-ssa-forwprop.c | 168 +++ 2 files changed, 191 insertions(+), 19 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/pr94234-3.c diff --git a/gcc/testsuite/gcc.dg/pr94234-3.c b/gcc/testsuite/gcc.dg/pr94234-3.c new file mode 100644 index 000..9bb9b46bd96 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr94234-3.c @@ -0,0 +1,42 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop1" } */ + +typedef __SIZE_TYPE__ size_t; +typedef __PTRDIFF_TYPE__ ptrdiff_t; + +ptrdiff_t foo1 (char *a, size_t n) +{ + char *b1 = a + 8 * n; + char *b2 = a + 8 * (n - 1); + + return b1 - b2; +} + +int use_ptr (char *a, char *b); + +ptrdiff_t foo2 (char *a, size_t n) +{ + char *b1 = a + 8 * (n - 1); + char *b2 = a + 8 * n; + + use_ptr (b1, b2); + + return b1 - b2; +} + +int use_int (int i); + +unsigned goo (unsigned m_param, unsigned n_param) +{ + unsigned b1 = m_param * (n_param + 2); + unsigned b2 = m_param * (n_param + 1); + int r = (int)(b1) - (int)(b2); + + use_int (r); + + return r; +} + +/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */ diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c index e2d008dfb92..7b9d46ec919 100644 --- a/gcc/tree-ssa-forwprop.c +++ b/gcc/tree-ssa-forwprop.c @@ -338,6 +338,25 @@ remove_prop_source_from_use (tree name) return cfg_changed; } +/* Primitive "lattice" function for gimple_simplify. */ + +static tree +fwprop_ssa_val (tree name) +{ + /* First valueize NAME. 
*/ + if (TREE_CODE (name) == SSA_NAME + && SSA_NAME_VERSION (name) < lattice.length ()) +{ + tree val = lattice[SSA_NAME_VERSION (name)]; + if (val) + name = val; +} + /* We continue matching along SSA use-def edges for SSA names + that are not single-use. Currently there are no patterns + that would cause any issues with that. */ + return name; +} + /* Return the rhs of a gassign *STMT in a form of a single tree, converted to type TYPE. @@ -1821,6 +1840,133 @@ simplify_rotate (gimple_stmt_iterator *gsi) return true; } +/* Given ((T) X) +- ((T) Y), and at least one of (X, Y) is result of + multiplication, if the expr can be transformed to (T) (X +- Y) in terms of + two's complement computation, apply simplification on (X +- Y) if it is + possible. As a prerequisite, outer result type (T) has precision not more + than that of inner operand type. */ + +static bool +simplify_plusminus_mult_with_convert (gimple_stmt_iterator *gsi) +{ + gimple *stmt = gsi_stmt (*gsi); + tree lhs = gimple_assign_lhs (stmt); + tree rtype = TREE_TYPE (lhs); + tree ctype = NULL_TREE; + enum tree_code code = gimple_assign_rhs_code (stmt); + + if (code != PLUS_EXPR && code != MINUS_EXPR) +return false; + + /*
[PATCH 2/2 V3] Simplify plusminus-mult-with-convert expr in forwprop (PR 94234)
This patch handles simplification of a plusminus-mult-with-convert expression of the form ((T) X) +- ((T) Y), in which at least one of (X, Y) is the result of a multiplication. This is done in the forwprop pass. We try to transform it to (T) (X +- Y), and resort to the gimple-matcher to fold (X +- Y) instead of manual code pattern recognition. Regards, Feng --- 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * tree-ssa-forwprop.c (simplify_plusminus_mult_with_convert): New function. (fwprop_ssa_val): Move it before its new caller. (pass_forwprop::execute): Add call to simplify_plusminus_mult_with_convert. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-3.c: New test.
[PATCH 1/2] Fold plusminus_mult expr with multi-use operands (PR 94234)
For pattern A * C +- B * C -> (A +- B) * C, simplification is disabled when A and B are not single-use. This patch is a minor enhancement on the pattern, which allows folding if final result is found to be a simple gimple value (constant/existing SSA). Bootstrapped/regtested on x86_64-linux and aarch64-linux. Feng --- 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * genmatch.c (dt_simplify::gen_1): Emit check on final simplification result when "!" is specified on toplevel output expr. * match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for expr with multi-use operands if final result is a simple gimple value. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-2.c: New test. ---From e247eb0d9a43856cc0b46f98414ed58d13796d62 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Tue, 1 Sep 2020 17:17:58 +0800 Subject: [PATCH] tree-optimization/94234 - Fold plusminus_mult expr with multi-use operands 2020-09-03 Feng Xue gcc/ PR tree-optimization/94234 * genmatch.c (dt_simplify::gen_1): Emit check on final simplification result when "!" is specified on toplevel output expr. * match.pd ((A * C) +- (B * C) -> (A +- B) * C): Allow folding for expr with multi-use operands if final result is a simple gimple value. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/pr94234-2.c: New test. --- gcc/genmatch.c | 12 -- gcc/match.pd | 22 ++ gcc/testsuite/gcc.dg/pr94234-2.c | 39 3 files changed, 62 insertions(+), 11 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/pr94234-2.c diff --git a/gcc/genmatch.c b/gcc/genmatch.c index 906d842c4d8..d4f01401964 100644 --- a/gcc/genmatch.c +++ b/gcc/genmatch.c @@ -3426,8 +3426,16 @@ dt_simplify::gen_1 (FILE *f, int indent, bool gimple, operand *result) /* Re-fold the toplevel result. It's basically an embedded gimple_build w/o actually building the stmt. */ if (!is_predicate) - fprintf_indent (f, indent, - "res_op->resimplify (lseq, valueize);\n"); + { + fprintf_indent (f, indent, + "res_op->resimplify (lseq, valueize);\n"); + if (e->force_leaf) + { + fprintf_indent (f, indent, + "if (!maybe_push_res_to_seq (res_op, NULL))\n"); + fprintf_indent (f, indent + 2, "return false;\n"); + } + } } else if (result->type == operand::OP_CAPTURE || result->type == operand::OP_C_EXPR) diff --git a/gcc/match.pd b/gcc/match.pd index 6e45836e32b..46fd880bd37 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -2570,15 +2570,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) (for plusminus (plus minus) (simplify (plusminus (mult:cs@3 @0 @1) (mult:cs@4 @0 @2)) - (if ((!ANY_INTEGRAL_TYPE_P (type) - || TYPE_OVERFLOW_WRAPS (type) - || (INTEGRAL_TYPE_P (type) - && tree_expr_nonzero_p (@0) - && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type) - /* If @1 +- @2 is constant require a hard single-use on either - original operand (but not on both). */ - && (single_use (@3) || single_use (@4))) -(mult (plusminus @1 @2) @0))) + (if (!ANY_INTEGRAL_TYPE_P (type) + || TYPE_OVERFLOW_WRAPS (type) + || (INTEGRAL_TYPE_P (type) + && tree_expr_nonzero_p (@0) + && expr_not_equal_to (@0, wi::minus_one (TYPE_PRECISION (type) +(if (single_use (@3) || single_use (@4)) + /* If @1 +- @2 is constant require a hard single-use on either + original operand (but not on both). */ + (mult (plusminus @1 @2) @0) +#if GIMPLE + (mult! (plusminus @1 @2) @0) +#endif + ))) /* We cannot generate constant 1 for fract. 
*/ (if (!ALL_FRACT_MODE_P (TYPE_MODE (type))) (simplify diff --git a/gcc/testsuite/gcc.dg/pr94234-2.c b/gcc/testsuite/gcc.dg/pr94234-2.c new file mode 100644 index 000..1f4b194dd43 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr94234-2.c @@ -0,0 +1,39 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-forwprop1" } */ + +int use_fn (int a); + +int foo (int n) +{ + int b1 = 8 * (n + 1); + int b2 = 8 * n; + + use_fn (b1 ^ b2); + + return b1 - b2; +} + +unsigned goo (unsigned m_param, unsigned n_param) +{ + unsigned b1 = m_param * (n_param + 2); + unsigned b2 = m_param * (n_param + 1); + + use_fn (b1 ^ b2); + + return b1 - b2; +} + +unsigned hoo (unsigned k_param) +{ + unsigned b1 = k_param * 28; + unsigned b2 = k_param * 15; + unsigned b3 = k_param * 12; + + use_fn (b1 ^ b2 ^ b3); + + return (b1 - b2) - b3; +} + +/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-times "return m_param" 1 "forwprop1" } } */ +/* { dg-final { scan-tree-dump-not "return k_param" "forwprop1" } } */ -- 2.17.1
Re: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)
>> >> gcc/ >> >> PR tree-optimization/94234 >> >> * tree-ssa-forwprop.c (simplify_binary_with_convert): New >> >> function. >> >> * (fwprop_ssa_val): Move it before its new caller. >> >> > No * at this line. There's an entry for (pass_forwprop::execute) missing. >> OK. >> >> > I don't think the transform as implemented, ((T) X) OP ((T) Y) to >> > (T) (X OP Y) is useful to do in tree-ssa-forwprop.c. Instead what I >> > suggested was to do the original >> > >> > +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and >> > + (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */ >> > >> > but realize we already do this for GENERIC in fold_plusminus_mult_expr, >> > just >> > without the conversions (also look at the conditions in the callers). This >> > function takes great care for handling overflow correctly and thus I >> > suggested >> > to take that over to GIMPLE in tree-ssa-forwprop.c and try extend it to >> > cover >> > the conversions you need for the specific cases. >> But this way would introduce duplicate handling. Is it more concise to reuse >> existing rule? > Sure moving the GENERIC folding to match.pd so it covers both GENERIC > and GIMPLE would be nice to avoid duplication. >> And different from GENERIC, we might need to check whether operand is >> single-use >> or not, and have distinct actions accordingly. >> >>(T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) >> >> Suppose both A and B are multiple-used, in most situations, the transform >> is unprofitable and avoided. But if (A +- B) could be folded to a constant, >> we >> can still allow the transform. For this, we have to recursively fold (A >> +-B), either >> handle it manually or resort to gimple-matcher to tell result. The latter is >> a >> natural choice. If so, why not do it on the top. > > I don't understand. From the comments in your patch you are just > hoisting conversions in the transform. I don't really see the connection > to the originally desired transform here? Consider a code sequence such as: t1 = (T)(A * C) t2 = (T)(B * C) ... = use (t1) ... = use (t2) t3 = t1 - t2 Since t1 and t2 are not single-use, we normally do not want the transform on t3 to happen, because in most situations it incurs more (add/mul) operations. But if (A - B) * C can be folded to a constant or an existing SSA, the transform is OK. That is to say, we need to try to fold (A - B) and (A - B) * C to see the final result. To do this, it is natural to use the gimple-matcher instead of manual pattern matching as in fold_plusminus_mult_expr, which could not cover as many cases as the gimple rules. Some examples: A = n + 2, B = n + 1, C = m; A = n - m, B = n, C = -1; A = 3 * n, B = 2 * n, C = 1 (a concrete example follows below). This way can be easily generalized to handle ((T) X) OP ((T) Y). >> > Alternatively one could move the GENERIC bits to match.pd, leaving a >> > worker in fold-const.c. Then try to extend that there. >> This worker function is meant to be used by both GENERIC and GIMPLE? > Yes, for both. Thanks, Feng
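To make the multiple-use case concrete, here is a small C example in the spirit of the goo testcase elsewhere in this thread; use_int is just an external sink that creates the extra uses:

/* Both b1 and b2 have an extra use, yet (A - B) * C simplifies to m_param,
   so the transform is still profitable.  */
extern int use_int (int i);

unsigned
goo (unsigned m_param, unsigned n_param)
{
  unsigned b1 = m_param * (n_param + 2);   /* A = n_param + 2, C = m_param */
  unsigned b2 = m_param * (n_param + 1);   /* B = n_param + 1, C = m_param */
  int r = (int) b1 - (int) b2;             /* folds to (int) m_param */

  use_int (r);
  return r;
}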
Re: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)
>> gcc/ >> PR tree-optimization/94234 >> * tree-ssa-forwprop.c (simplify_binary_with_convert): New function. >> * (fwprop_ssa_val): Move it before its new caller. > No * at this line. There's an entry for (pass_forwprop::execute) missing. OK. > I don't think the transform as implemented, ((T) X) OP ((T) Y) to > (T) (X OP Y) is useful to do in tree-ssa-forwprop.c. Instead what I > suggested was to do the original > > +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and > + (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */ > > but realize we already do this for GENERIC in fold_plusminus_mult_expr, just > without the conversions (also look at the conditions in the callers). This > function takes great care for handling overflow correctly and thus I suggested > to take that over to GIMPLE in tree-ssa-forwprop.c and try extend it to cover > the conversions you need for the specific cases. But this way would introduce duplicate handling. Is it more concise to reuse the existing rule? And unlike GENERIC, we might need to check whether an operand is single-use or not, and take distinct actions accordingly. (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) Suppose both A and B have multiple uses; in most situations the transform is unprofitable and should be avoided. But if (A +- B) could be folded to a constant, we can still allow the transform. For this, we have to recursively fold (A +- B), either handling it manually or resorting to the gimple-matcher to tell the result. The latter is a natural choice. If so, why not do it at the top level. > Alternatively one could move the GENERIC bits to match.pd, leaving a > worker in fold-const.c. Then try to extend that there. This worker function is meant to be used by both GENERIC and GIMPLE? > I just remember this is a very fragile area with respect to overflow > correctness. Thanks, Feng
PING: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)
Thanks, Feng From: Feng Xue OS Sent: Wednesday, August 19, 2020 5:17 PM To: Richard Biener Cc: gcc-patches@gcc.gnu.org; Marc Glisse Subject: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234) As Richard's comment, this patch is composed to simplify generalized binary-with-convert pattern like ((T) X) OP ((T) Y). Instead of creating almost duplicated rules into match.pd, we try to transform it to (T) (X OP Y), and apply simplification on (X OP Y) in forwprop pass. Regards, Feng --- 2020-08-19 Feng Xue gcc/ PR tree-optimization/94234 * tree-ssa-forwprop.c (simplify_binary_with_convert): New function. * (fwprop_ssa_val): Move it before its new caller. gcc/testsuite/ PR tree-optimization/94234 * gcc.dg/ifcvt-3.c: Modified to suppress forward propagation. * gcc.dg/tree-ssa/20030807-10.c: Likewise. * gcc.dg/pr94234-2.c: New test. > > From: Richard Biener > Sent: Monday, June 15, 2020 3:41 PM > To: Feng Xue OS > Cc: gcc-patches@gcc.gnu.org; Marc Glisse > Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same > base/offset (PR 94234) > > On Fri, Jun 5, 2020 at 11:20 AM Feng Xue OS > wrote: >> >> As Marc suggested, removed the new pointer_diff rule, and add another rule >> to fold >> convert-add expression. This new rule is: >> >>(T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C) >> >> Regards, >> Feng >> >> --- >> 2020-06-01 Feng Xue >> >> gcc/ >> PR tree-optimization/94234 >> * match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New >> simplification. >> * ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New >> simplification. >> >> gcc/testsuite/ >> PR tree-optimization/94234 >> * gcc.dg/pr94234.c: New test. >> --- >> gcc/match.pd | 28 >> gcc/testsuite/gcc.dg/pr94234.c | 24 >> 2 files changed, 52 insertions(+) >> create mode 100644 gcc/testsuite/gcc.dg/pr94234.c >> >> diff --git a/gcc/match.pd b/gcc/match.pd >> index 33ee1a920bf..4f340bfe40a 100644 >> --- a/gcc/match.pd >> +++ b/gcc/match.pd >> @@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) >> && TREE_CODE (@2) == INTEGER_CST >> && tree_int_cst_sign_bit (@2) == 0)) >>(minus (convert @1) (convert @2) >> + (simplify >> +(pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2)) >> + (pointer_diff @0 @1)) > > This new pattern is OK. Please commit it separately. > >> (simplify >> (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2)) >> /* The second argument of pointer_plus must be interpreted as signed, >> and >> @@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) >>(minus (convert (view_convert:stype @1)) >> (convert (view_convert:stype @2))) >> >> +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and >> + (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). 
*/ >> +(if (INTEGRAL_TYPE_P (type)) >> + (for plusminus (plus minus) >> + (simplify >> + (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2))) >> + (if (element_precision (type) <= element_precision (TREE_TYPE (@0)) >> + && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type)) >> + && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))) >> +(convert (mult (plusminus @1 @2) @0 >> + (simplify >> + (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1))) >> + (if (element_precision (type) <= element_precision (TREE_TYPE (@0)) >> + && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type)) >> + && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)) >> + && single_use (@2) && single_use (@3)) >> +(convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) >> @0 >> + (simplify >> + (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0)) >> + (if (element_precision (type) <= element_precision (TREE_TYPE (@0)) >> + && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type)) >> + && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)) >> + && single_use (@2) && single_use (@3)) >> +(convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) >> @0)) >> + > > This shows the limit of pattern matching IMHO. I'm also not convinced > it gets the > overflow cases correct (but I didn't spend too much time here). Note we have > similar functionality implemented in fold_plusminus_mult_expr. IMHO instead > of doing the above moving fold_plusminus_mult_expr to GIMPLE by executing > it from inside the forwprop pass would make more sense. Or finally biting the > bullet and try to teach reassociation about how to handle signed arithmetic > with non-wrapping overflow behavior. > > Richard.
Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)
>>> the component is "ipa," please change that when you commit the patch. >> Mistake has been made, I'v pushed it. Is there a way to correct it? git push >> --force? > > There is. You need to wait until tomorrow (after the commit message > gets copied to gcc/ChangeLog by a script) and then push a commit that > modifies nothing else but the ChangeLog. IIUC. > > Thanks again for taking care of this, I will. Thanks. Feng
Re: [PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)
>> gcc/ >> PR tree-optimization/96806 > the component is "ipa," please change that when you commit the patch. A mistake has been made; I've already pushed it. Is there a way to correct it? git push --force? Thanks, Feng
[PATCH] Fix ICE in ipa-cp due to cost addition overflow (PR 96806)
This patch is to fix a bug that cost that is used to evaluate clone candidate becomes negative due to integer overflow. Feng --- 2020-08-31 Feng Xue gcc/ PR tree-optimization/96806 * ipa-cp.c (decide_about_value): Use safe_add to avoid cost addition overflow. gcc/testsuite/ PR tree-optimization/96806 * g++.dg/ipa/pr96806.C: New test.From 8d92b4ca4be2303a73f0a2441e57564488ca1c23 Mon Sep 17 00:00:00 2001 From: Feng Xue Date: Mon, 31 Aug 2020 15:00:52 +0800 Subject: [PATCH] ipa/96806 - Fix ICE in ipa-cp due to integer addition overflow 2020-08-31 Feng Xue gcc/ PR tree-optimization/96806 * ipa-cp.c (decide_about_value): Use safe_add to avoid cost addition overflow. gcc/testsuite/ PR tree-optimization/96806 * g++.dg/ipa/pr96806.C: New test. --- gcc/ipa-cp.c | 8 ++--- gcc/testsuite/g++.dg/ipa/pr96806.C | 53 ++ 2 files changed, 57 insertions(+), 4 deletions(-) create mode 100644 gcc/testsuite/g++.dg/ipa/pr96806.C diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index e4910a04ffa..8e5d6e2a393 100644 --- a/gcc/ipa-cp.c +++ b/gcc/ipa-cp.c @@ -5480,11 +5480,11 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset, freq_sum, count_sum, val->local_size_cost) && !good_cloning_opportunity_p (node, - val->local_time_benefit - + val->prop_time_benefit, + safe_add (val->local_time_benefit, + val->prop_time_benefit), freq_sum, count_sum, - val->local_size_cost - + val->prop_size_cost)) + safe_add (val->local_size_cost, + val->prop_size_cost))) return false; if (dump_file) diff --git a/gcc/testsuite/g++.dg/ipa/pr96806.C b/gcc/testsuite/g++.dg/ipa/pr96806.C new file mode 100644 index 000..28fdf7787a1 --- /dev/null +++ b/gcc/testsuite/g++.dg/ipa/pr96806.C @@ -0,0 +1,53 @@ +/* { dg-do compile } */ +/* { dg-options "-std=c++11 -O -fipa-cp -fipa-cp-clone --param=ipa-cp-max-recursive-depth=94 --param=logical-op-non-short-circuit=0" } */ + +enum a {}; +struct m; +struct n { + a d; +}; +int o(int, int); +struct p { + char d; + char aa; + p *ab; + bool q() const { +int h = d & 4; +return h; + } + char r() const { return aa; } + int s(const m *, bool) const; +} l; +struct t { + p *ac; + p *u() { return ac; } + p *v(int); +}; +int w(const p *, const p *, const m *, int = 0); +struct m : n { + struct { +t *ad; + } ae; + char x() const; + p *y(int z) const { return ae.ad ? nullptr : ae.ad->v(z); } +} j; +int w(const p *z, const p *af, const m *ag, int ah) { + int a, g = z->s(ag, true), i = af->s(ag, true); + if (af->q()) { +if (ag->x()) + return 0; +ah++; +char b = af->r(); +p *c = ag->y(b), *e = ag->ae.ad->u(); +int d = w(z, c, ag, ah), f = w(z, af ? e : af->ab, ag, ah); +a = f ? d : f; +return a; + } + if (g || i == 1) +return ag->d ? o(g, i) : o(g, i); + return 0; +} +void ai() { + for (p k;;) +w(&k, &l, &j); +} -- 2.17.1
[PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)
Per Richard's comment, this patch simplifies the generalized binary-with-convert pattern ((T) X) OP ((T) Y). Instead of adding almost-duplicated rules to match.pd, we transform it to (T) (X OP Y) and apply simplification on (X OP Y) in the forwprop pass.

Regards,
Feng

---
2020-08-19  Feng Xue

gcc/
        PR tree-optimization/94234
        * tree-ssa-forwprop.c (simplify_binary_with_convert): New function.
        (fwprop_ssa_val): Move it before its new caller.

gcc/testsuite/
        PR tree-optimization/94234
        * gcc.dg/ifcvt-3.c: Modified to suppress forward propagation.
        * gcc.dg/tree-ssa/20030807-10.c: Likewise.
        * gcc.dg/pr94234-2.c: New test.

>
> From: Richard Biener
> Sent: Monday, June 15, 2020 3:41 PM
> To: Feng Xue OS
> Cc: gcc-patches@gcc.gnu.org; Marc Glisse
> Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)
>
> On Fri, Jun 5, 2020 at 11:20 AM Feng Xue OS wrote:
>>
>> As Marc suggested, removed the new pointer_diff rule, and add another rule to fold
>> convert-add expression. This new rule is:
>>
>>    (T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)
>>
>> Regards,
>> Feng
>>
>> ---
>> 2020-06-01  Feng Xue
>>
>> gcc/
>>         PR tree-optimization/94234
>>         * match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
>>         simplification.
>>         * ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
>>         simplification.
>>
>> gcc/testsuite/
>>         PR tree-optimization/94234
>>         * gcc.dg/pr94234.c: New test.
>> ---
>>  gcc/match.pd                   | 28
>>  gcc/testsuite/gcc.dg/pr94234.c | 24
>>  2 files changed, 52 insertions(+)
>>  create mode 100644 gcc/testsuite/gcc.dg/pr94234.c
>>
>> diff --git a/gcc/match.pd b/gcc/match.pd
>> index 33ee1a920bf..4f340bfe40a 100644
>> --- a/gcc/match.pd
>> +++ b/gcc/match.pd
>> @@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>         && TREE_CODE (@2) == INTEGER_CST
>>         && tree_int_cst_sign_bit (@2) == 0))
>>     (minus (convert @1) (convert @2)
>> + (simplify
>> +  (pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
>> +  (pointer_diff @0 @1))
>
> This new pattern is OK. Please commit it separately.
>
>>  (simplify
>>   (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
>>   /* The second argument of pointer_plus must be interpreted as signed, and
>> @@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>>     (minus (convert (view_convert:stype @1))
>>            (convert (view_convert:stype @2)))
>>
>> +/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
>> +   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
>> +(if (INTEGRAL_TYPE_P (type))
>> + (for plusminus (plus minus)
>> +  (simplify
>> +   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
>> +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>> +        && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>> +        && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
>> +    (convert (mult (plusminus @1 @2) @0
>> +  (simplify
>> +   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
>> +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>> +        && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>> +        && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>> +        && single_use (@2) && single_use (@3))
>> +    (convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) @0
>> +  (simplify
>> +   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
>> +   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
>> +        && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
>> +        && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
>> +        && single_use (@2) && single_use (@3))
>> +    (convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) @0))
>> +
>
> This shows the limit of pattern matching IMHO. I'm also not convinced it gets the
> overflow cases correct (but I didn't spend too much time here). Note we have
> similar functionality implemented in f
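To make the intent of the V2 approach concrete, here is a hypothetical C-level illustration of the kind of expression the transform targets; the function name and types are made up, and this is only the shape of the rewrite, not a GCC testcase or compiler output.

typedef __PTRDIFF_TYPE__ ptrdiff_t;

/* Conceptually, (T)(a * 8) - (T)(b * 8) is rewritten to (T)((a - b) * 8),
   which can then fold further when a - b is known.  */
ptrdiff_t
scaled_diff (unsigned long a, unsigned long b)
{
  return (ptrdiff_t) (a * 8) - (ptrdiff_t) (b * 8);
}

Doing this rewrite once in forwprop, rather than enumerating every convert-wrapped variant in match.pd, is exactly the duplication the cover letter above refers to.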
Re: [PATCH] ipa-inline: Improve growth accumulation for recursive calls
> Hello,
> with Martin we spent some time looking into exchange2 and my
> understanding of the problem is the following:
>
> There is the self-recursive function digits_2 with the property that it
> has 10 nested loops and calls itself from the innermost.
> Now we do not do an amazing job of guessing the profile since it is quite
> atypical. The first observation is that the callback frequency needs to be
> less than 1, otherwise the program never terminates; however, with 10
> nested loops one needs to predict every loop to iterate just a few times
> and the conditionals guarding them as not very likely. For that we added
> PRED_LOOP_GUARD_WITH_RECURSION some time ago, and I fixed it yesterday
> (causing a regression in exchange since the bad profile turned out to
> disable some harmful vectorization), and I also now added a cap to the
> self-recursive frequency so things do not get mispropagated by ipa-cp.

With the default setting of PRED_LOOP_GUARD_WITH_RECURSION, static profile estimation for exchange2 is far from accurate: the hottest recursive function is predicted to be infrequent. However, this low execution estimate happens to work fine with IRA. I've tried to tweak the likelihood of the predictor as you did; performance degraded when the estimated profile increased. This regression was also found to be correlated with IRA, which produces many more register spills than with the default. In the presence of deep loops and high register pressure, IRA is more sensitive to profile estimation, and this exhibits an unwanted property of the current IRA algorithm. I've described it in a tracker (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90174).

Feng

>
> Now if ipa-cp decides to duplicate digits_2 a few times we have a new
> problem. The tree of recursion is organized in a way that the depth is
> bounded by 10 (which GCC does not know), and moreover most time is not
> spent on very deep levels of recursion.
>
> For that you have the patch which increases frequencies of recursively
> cloned nodes, however it still seems to me a very specific hack for
> exchange: I do not see how to guess where most of the time is spent.
> Even for very regular trees, by the master theorem, it depends on very
> small differences in the estimates of recursion frequency whether most
> of the time is spent on the top of the tree, the bottom, or is balanced.
>
> With algorithms doing backtracking, like exchange, the likelihood of
> recursion reduces with deeper recursion level, but we do not know how
> quickly and what the level is.
>
>> From: Xiong Hu Luo
>>
>> For SPEC2017 exchange2, there is a large recursive function digits_2
>> (function size 1300) that generates specialized nodes from digits_2.1 to
>> digits_2.8 with the added build options:
>>
>> --param ipa-cp-eval-threshold=1 --param ipa-cp-unit-growth=80
>>
>> The ipa-inline pass will consider inlining these nodes called only once, but
>> these large functions inlined too deeply cause serious register spilling and
>> performance degradation, as follows.
>>
>> inlineA: brute (inline digits_2.1, 2.2, 2.3, 2.4) -> digits_2.5 (inline 2.6, 2.7, 2.8)
>> inlineB: digits_2.1 (inline digits_2.2, 2.3) -> call digits_2.4 (inline digits_2.5, 2.6) -> call digits_2.7 (inline 2.8)
>> inlineC: brute (inline digits_2) -> call 2.1 -> 2.2 (inline 2.3) -> 2.4 -> 2.5 -> 2.6 (inline 2.7) -> 2.8
>> inlineD: brute -> call digits_2 -> call 2.1 -> call 2.2 -> 2.3 -> 2.4 -> 2.5 -> 2.6 -> 2.7 -> 2.8
>>
>> Performance diff:
>> inlineB is ~25% faster than inlineA;
>> inlineC is ~20% faster than inlineB;
>> inlineD is ~30% faster than inlineC.
>>
>> The master GCC code now generates an inline sequence like inlineB; this patch
>> makes the ipa-inline pass behave like inlineD by:
>> 1) Accumulating growth for recursive calls, by adding the growth data
>> to the edge when the edge's caller is inlined into another function, to
>> avoid inlining too deeply;
>> 2) If caller and callee are both specialized from the same node, the edge
>> is also considered a recursive edge.
>>
>> SPEC2017 testing shows a GEOMEAN improvement of +2.75% in total (+0.56%
>> without exchange2). Any comments? Thanks.
>>
>> 523.xalancbmk_r +1.32%
>> 541.leela_r     +1.51%
>> 548.exchange2_r +31.87%
>> 507.cactuBSSN_r +0.80%
>> 526.blender_r   +1.25%
>> 538.imagick_r   +1.82%
>>
>> gcc/ChangeLog:
>>
>> 2020-08-12  Xionghu Luo
>>
>>	* cgraph.h (cgraph_edge::recursive_p): Return true if caller and
>>	callee are specialized from the same node.
>>	* ipa-inline-analysis.c (do_estimate_growth_1): Add caller's
>>	inlined_to growth to edge whose caller is inlined.
>> ---
>>  gcc/cgraph.h              | 2 ++
>>  gcc/ipa-inline-analysis.c | 3 +++
>>  2 files changed, 5 insertions(+)
>>
>> diff --git a/gcc/cgraph.h b/gcc/cgraph.h
>> index 0211f08964f..11903ac1960 100644
>> --- a/gcc/cgraph.h
>> +++ b/gcc/cgraph.h
>> @@ -3314,6 +3314,8 @@ cgraph_edge::recursive_p (void)
>>    cgraph_node *c = callee->ultimate_alias_target
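For readers unfamiliar with exchange2, the following toy sketch captures the shape of the problem discussed above; it is loosely modeled on the description of digits_2, not the actual benchmark code, and feasible() is a hypothetical predicate. A static profile must guess both the loop trip counts and the recursion depth, and a small error in the guessed recursion frequency shifts the predicted time between the top and the bottom of the recursion tree.

/* Hypothetical predicate standing in for the pruning test in the real code.  */
extern int feasible (int depth, int i, int j, int k);

void
search (int depth)
{
  if (depth > 10)          /* Depth bound the compiler cannot see statically.  */
    return;
  for (int i = 0; i < 9; i++)
    for (int j = 0; j < 9; j++)
      for (int k = 0; k < 9; k++)
        if (feasible (depth, i, j, k))
          search (depth + 1);   /* Self-recursion from the innermost loop.  */
}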
Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)
Here is a question about pointer operations: pointers are treated as unsigned in comparison operations, while the distance between pointers is signed. Does that mean we cannot assume the conclusion below is true?

   (ptr_a > ptr_b) => (ptr_a - ptr_b) >= 0

Thanks,
Feng

From: Marc Glisse
Sent: Wednesday, June 3, 2020 10:32 PM
To: Feng Xue OS
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

On Wed, 3 Jun 2020, Feng Xue OS via Gcc-patches wrote:

>> Ah, looking at the PR, you decided to perform the operation as unsigned
>> because that has fewer NOP conversions, which, in that particular testcase
>> where the offsets are originally unsigned, means we simplify better. But I
>> would expect it to regress other testcases (in particular if the offsets
>> were originally signed). Also, changing the second argument of
>> pointer_plus to be signed, as is supposed to eventually happen, would
>> break your testcase again.
> The old rule might produce an overflowed result (offset_a = (signed_int_max)UL,
> offset_b = 1UL).

signed_int_max-1 does not overflow. But the point is that pointer_plus / pointer_diff are defined in a way that if that subtraction would overflow, then one of the pointer_plus or pointer_diff would have been undefined already. In particular, you cannot have objects larger than half the address space, and pointer_plus/pointer_diff have to remain inside an object. Doing the subtraction in a signed type keeps (part of) that information.

> Additionally, (stype)(offset_a - offset_b) is more compact,

Not if offset_a comes from (utype)a and offset_b from (utype)b with a and b signed. Using size_t indices as in the bugzilla testcase is not recommended practice. Change it to ssize_t, and we do optimize the testcase in CCP1 already.

> there might be further simplification opportunities on offset_a - offset_b,
> even if it is not in the form of (A * C - B * C), for example (~A - 1 -> -A).
> But for the old rule, we would have to introduce another rule such as
> (T)A - (T)(B) -> (T)(A - B), which seems too generic to benefit performance
> in all situations.

Sadly, conversions complicate optimizations and are all over the place, we need to handle them in more places. I sometimes dream of getting rid of NOP conversions, and having a single PLUS_EXPR with some kind of flag saying if it can wrap/saturate/trap when seen as a signed/unsigned operation, i.e. push the information on the operations instead of objects.

> If the 2nd argument is signed, we can add a specific rule as you suggest:
> (T)(A * C) - (T)(B * C) -> (T) (A - B) * C.
>
>> At the very least we want to keep a comment next to the transformation
>> explaining the situation.
>
>> If there are platforms where the second argument of pointer_plus is a
>> smaller type than the result of pointer_diff (can this happen? I keep
>> forgetting all the weird things some platforms do), this version may do an
>> unsafe zero-extension.
> If the 2nd argument is a smaller type, this might give confusing semantics to
> the pointer_plus operator. Suppose the type is an (unsigned) char; does the
> expression "ptr + ((char) -1)" represent ptr + 255 or ptr - 1?

(pointer_plus ptr 255) would mean ptr - 1 on a platform where the second argument of pointer_plus has size 1 byte.

Do note that I am not a reviewer, what I say isn't final.

--
Marc Glisse
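A small C-level illustration of the point made above (illustration only, not part of any patch here): within a single object both the comparison and the subtraction are well defined, and since an object cannot span more than half of the address space, p > q does imply a non-negative, non-overflowing difference. Comparing or subtracting pointers into different objects is undefined, so the question only matters for same-object pointers. The offsets i and j are assumed to be valid offsets into the same array.

#include <stddef.h>

ptrdiff_t
ordered_distance (char *base, size_t i, size_t j)
{
  char *p = base + i;
  char *q = base + j;
  /* For same-object pointers, the larger pointer minus the smaller one is
     always a non-negative ptrdiff_t.  */
  return p > q ? p - q : q - p;
}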
Ping: [PATCH V2] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)
Thanks,
Feng

From: Feng Xue OS
Sent: Friday, June 5, 2020 5:20 PM
To: Richard Biener; gcc-patches@gcc.gnu.org; Marc Glisse
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

As Marc suggested, I removed the new pointer_diff rule and added another rule to fold the convert-add expression. This new rule is:

   (T)(A * C) +- (T)(B * C) -> (T) ((A +- B) * C)

Regards,
Feng

---
2020-06-01  Feng Xue

gcc/
        PR tree-optimization/94234
        * match.pd ((T)(A * C) +- (T)(B * C)) -> (T)((A +- B) * C): New
        simplification.
        * ((PTR_A + OFF) - (PTR_B + OFF)) -> (PTR_A - PTR_B): New
        simplification.

gcc/testsuite/
        PR tree-optimization/94234
        * gcc.dg/pr94234.c: New test.
---
 gcc/match.pd                   | 28
 gcc/testsuite/gcc.dg/pr94234.c | 24
 2 files changed, 52 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/pr94234.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 33ee1a920bf..4f340bfe40a 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2515,6 +2515,9 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && TREE_CODE (@2) == INTEGER_CST
        && tree_int_cst_sign_bit (@2) == 0))
    (minus (convert @1) (convert @2)
+ (simplify
+  (pointer_diff (pointer_plus @0 @2) (pointer_plus @1 @2))
+  (pointer_diff @0 @1))
 (simplify
  (pointer_diff (pointer_plus @@0 @1) (pointer_plus @0 @2))
  /* The second argument of pointer_plus must be interpreted as signed, and
@@ -2526,6 +2529,31 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    (minus (convert (view_convert:stype @1))
           (convert (view_convert:stype @2)))

+/* (T)(A * C) +- (T)(B * C) -> (T)((A +- B) * C) and
+   (T)(A * C) +- (T)(A) -> (T)(A * (C +- 1)). */
+(if (INTEGRAL_TYPE_P (type))
+ (for plusminus (plus minus)
+  (simplify
+   (plusminus (convert:s (mult:cs @0 @1)) (convert:s (mult:cs @0 @2)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+        && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+        && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0)))
+    (convert (mult (plusminus @1 @2) @0
+  (simplify
+   (plusminus (convert @0) (convert@2 (mult:c@3 @0 @1)))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+        && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+        && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+        && single_use (@2) && single_use (@3))
+    (convert (mult (plusminus { build_one_cst (TREE_TYPE (@1)); } @1) @0
+  (simplify
+   (plusminus (convert@2 (mult:c@3 @0 @1)) (convert @0))
+   (if (element_precision (type) <= element_precision (TREE_TYPE (@0))
+        && (TYPE_OVERFLOW_UNDEFINED (type) || TYPE_OVERFLOW_WRAPS (type))
+        && TYPE_OVERFLOW_WRAPS (TREE_TYPE (@0))
+        && single_use (@2) && single_use (@3))
+    (convert (mult (plusminus @1 { build_one_cst (TREE_TYPE (@1)); }) @0))
+
 /* (A * C) +- (B * C) -> (A+-B) * C and (A * C) +- A -> A * (C+-1).
    Modeled after fold_plusminus_mult_expr.  */
 (if (!TYPE_SATURATING (type)
diff --git a/gcc/testsuite/gcc.dg/pr94234.c b/gcc/testsuite/gcc.dg/pr94234.c
new file mode 100644
index 000..3f7c7a5e58f
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr94234.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-forwprop1" } */
+
+typedef __SIZE_TYPE__ size_t;
+typedef __PTRDIFF_TYPE__ ptrdiff_t;
+
+ptrdiff_t foo (char *a, size_t n)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n - 1);
+
+  return b1 - b2;
+}
+
+ptrdiff_t goo (char *a, size_t n, size_t m)
+{
+  char *b1 = a + 8 * n;
+  char *b2 = a + 8 * (n + 1);
+
+  return (b1 + m) - (b2 + m);
+}
+
+/* { dg-final { scan-tree-dump-times "return 8;" 1 "forwprop1" } } */
+/* { dg-final { scan-tree-dump-times "return -8;" 1 "forwprop1" } } */
--

From: Richard Biener
Sent: Thursday, June 4, 2020 4:30 PM
To: gcc-patches@gcc.gnu.org
Cc: Feng Xue OS
Subject: Re: [PATCH] Add pattern for pointer-diff on addresses with same base/offset (PR 94234)

On Wed, Jun 3, 2020 at 4:33 PM Marc Glisse wrote:
>
> On Wed, 3 Jun 2020, Feng Xue OS via Gcc-patches wrote:
>
> >> Ah, looking at the PR, you decided to perform the operation as unsigned
> >> because that has fewer NOP conversions, which, in that particular testcase
> >> where the offsets are originally unsigned, means we simplify better. But I
> >> would expect it to regress other testcases (in particular if the offsets
> >> were originally signed). Also, changing the second argument of
> >> pointer_plus to be signed, as is supposed to eventually happen, would
> >> brea