Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops

2021-07-12 Thread Richard Biener via Gcc-patches
On Mon, Jul 12, 2021 at 7:55 PM Richard Sandiford wrote:
>
> Richard Biener via Gcc-patches  writes:
> > On Fri, Jul 9, 2021 at 3:12 PM Richard Sandiford wrote:
> >>
> >> Thanks for the review.
> >>
> >> Richard Biener  writes:
> >> >> @@ -588,6 +600,23 @@ public:
> >> >>/* Unrolling factor  */
> >> >>poly_uint64 vectorization_factor;
> >> >>
> >> >> +  /* If this loop is an epilogue loop whose main loop can be skipped,
> >> >> + MAIN_LOOP_EDGE is the edge from the main loop to this loop's
> >> >> + preheader.  SKIP_MAIN_LOOP_EDGE is then the edge that skips the
> >> >> + main loop and goes straight to this loop's preheader.
> >> >> +
> >> >> + Both fields are null otherwise.  */
> >> >> +  edge main_loop_edge;
> >> >> +  edge skip_main_loop_edge;
> >> >> +
> >> >> +  /* If this loop is an epilogue loop that might be skipped after executing
> >> >> + the main loop, this edge is the one that skips the epilogue.  */
> >> >> +  edge skip_this_loop_edge;
> >> >> +
> >> >> +  /* After vectorization, maps live-out SSA names to information about
> >> >> + the reductions that generated them.  */
> >> >> +  hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
> >> >
> >> > Is that the LC PHI node defs or the definition inside of the loop?
> >> > If the latter we could attach the info directly to its stmt-info?
> >>
> >> Ah, yeah, I should improve the comment there.  It's the vectoriser's
> >> replacement for the original LC PHI node, i.e. the final scalar result
> >> after the reduction has taken place.
> >
> > OK
> >
> >> >> @@ -1186,6 +1215,21 @@ public:
> >> >>/* The vector type for performing the actual reduction.  */
> >> >>tree reduc_vectype;
> >> >>
> >> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> >> >> + elements in parallel, this vector gives the initial values of these
> >> >> + N elements.  */
> >> >
> >> > That's N scalar elements or N vector elements?  I suppose it's for
> >> > SLP reductions (rather than SLP reduction chains) and never non-SLP
> >> > reductions?
> >>
> >> Yeah, poor wording again, sorry.  I meant something closer to:
> >>
> >>   /* If IS_REDUC_INFO is true and if the vector code is performing
> >>  N scalar reductions in parallel, this vector gives the initial
> >>  scalar values of those N reductions.  */
> >>
> >> >> +  vec<tree> reduc_initial_values;
> >> >> +
> >> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> >> >> + elements in parallel, this vector gives the scalar result of each
> >> >> + reduction.  */
> >> >> +  vec<tree> reduc_scalar_results;
> >>
> >> Same change here.
> >>
> >> >> […]
> >> >> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> >> >> index 2909e8a0fc3..b7b0523e3c8 100644
> >> >> --- a/gcc/tree-vect-loop-manip.c
> >> >> +++ b/gcc/tree-vect-loop-manip.c
> >> >> @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epilogue_vinfo,
> >> >>return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, true);
> >> >>  }
> >> >>
> >> >> +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on exit
> >> >> +   from the corresponding main loop.  Return a value that is available in
> >> >> +   LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipped.
> >> >> +   Passing a null SKIP_VALUE is equivalent to passing zero.  */
> >> >> +
> >> >> +tree
> >> >> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_value,
> >> >> +  tree skip_value)
> >> >> +{
> >> >> +  if (!loop_vinfo->main_loop_edge)
> >> >> +return main_loop_value;
> >> >> +
> >> >> +  if (!skip_value)
> >> >> +skip_value = build_zero_cst (TREE_TYPE (main_loop_value));
> >> >
> >> > shouldn't that be the initial value?
> >>
> >> For the current use case, the above two conditions are never true.
> >> I wrote it like this because I had a follow-on patch (which might
> >> not go anywhere) that needed this function for 0-based IVs.
> >>
> >> Maybe that's a bad risk/reward trade-off though.  Not having to pass
> >> zero makes things only slightly simpler for the follow-on patch,
> >> and I guess could be dangerous in other cases.
> >>
> >> Perhaps in that case though I should change loop_vinfo->main_loop_edge
> >> into a gcc_assert as well.
> >
> > Yeah, I think asserts (and comments in case it's because we don't handle
> > some specific cases yet) are better than possibly wrong behavior.
>
> OK.
>
> >> >> +  tree phi_result = make_ssa_name (TREE_TYPE (main_loop_value));
> >> >> +  basic_block bb = loop_vinfo->main_loop_edge->dest;
> >> >> +  gphi *new_phi = create_phi_node (phi_result, bb);
> >> >> +  add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge,
> >> >> +  UNKNOWN_LOCATION);
> >> >> +  add_phi_arg (new_phi, skip_value,
> >> >> +  loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION);
> >> >>

Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops

2021-07-12 Thread Richard Sandiford via Gcc-patches
Richard Biener via Gcc-patches writes:
> On Fri, Jul 9, 2021 at 3:12 PM Richard Sandiford wrote:
>>
>> Thanks for the review.
>>
>> Richard Biener  writes:
>> >> @@ -588,6 +600,23 @@ public:
>> >>/* Unrolling factor  */
>> >>poly_uint64 vectorization_factor;
>> >>
>> >> +  /* If this loop is an epilogue loop whose main loop can be skipped,
>> >> + MAIN_LOOP_EDGE is the edge from the main loop to this loop's
>> >> + preheader.  SKIP_MAIN_LOOP_EDGE is then the edge that skips the
>> >> + main loop and goes straight to this loop's preheader.
>> >> +
>> >> + Both fields are null otherwise.  */
>> >> +  edge main_loop_edge;
>> >> +  edge skip_main_loop_edge;
>> >> +
>> >> +  /* If this loop is an epilogue loop that might be skipped after executing
>> >> + the main loop, this edge is the one that skips the epilogue.  */
>> >> +  edge skip_this_loop_edge;
>> >> +
>> >> +  /* After vectorization, maps live-out SSA names to information about
>> >> + the reductions that generated them.  */
>> >> +  hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
>> >
>> > Is that the LC PHI node defs or the definition inside of the loop?
>> > If the latter we could attach the info directly to its stmt-info?
>>
>> Ah, yeah, I should improve the comment there.  It's the vectoriser's
>> replacement for the original LC PHI node, i.e. the final scalar result
>> after the reduction has taken place.
>
> OK
>
>> >> @@ -1186,6 +1215,21 @@ public:
>> >>/* The vector type for performing the actual reduction.  */
>> >>tree reduc_vectype;
>> >>
>> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
>> >> + elements in parallel, this vector gives the initial values of these
>> >> + N elements.  */
>> >
>> > That's N scalar elements or N vector elements?  I suppose it's for
>> > SLP reductions (rather than SLP reduction chains) and never non-SLP
>> > reductions?
>>
>> Yeah, poor wording again, sorry.  I meant something closer to:
>>
>>   /* If IS_REDUC_INFO is true and if the vector code is performing
>>  N scalar reductions in parallel, this vector gives the initial
>>  scalar values of those N reductions.  */
>>
>> >> +  vec<tree> reduc_initial_values;
>> >> +
>> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
>> >> + elements in parallel, this vector gives the scalar result of each
>> >> + reduction.  */
>> >> +  vec<tree> reduc_scalar_results;
>>
>> Same change here.
>>
>> >> […]
>> >> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
>> >> index 2909e8a0fc3..b7b0523e3c8 100644
>> >> --- a/gcc/tree-vect-loop-manip.c
>> >> +++ b/gcc/tree-vect-loop-manip.c
>> >> @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epilogue_vinfo,
>> >>return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, true);
>> >>  }
>> >>
>> >> +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on exit
>> >> +   from the corresponding main loop.  Return a value that is available in
>> >> +   LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipped.
>> >> +   Passing a null SKIP_VALUE is equivalent to passing zero.  */
>> >> +
>> >> +tree
>> >> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_value,
>> >> +  tree skip_value)
>> >> +{
>> >> +  if (!loop_vinfo->main_loop_edge)
>> >> +return main_loop_value;
>> >> +
>> >> +  if (!skip_value)
>> >> +skip_value = build_zero_cst (TREE_TYPE (main_loop_value));
>> >
>> > shouldn't that be the initial value?
>>
>> For the current use case, the above two conditions are never true.
>> I wrote it like this because I had a follow-on patch (which might
>> not go anywhere) that needed this function for 0-based IVs.
>>
>> Maybe that's a bad risk/reward trade-off though.  Not having to pass
>> zero makes things only slightly simpler for the follow-on patch,
>> and I guess could be dangerous in other cases.
>>
>> Perhaps in that case though I should change loop_vinfo->main_loop_edge
>> into a gcc_assert as well.
>
> Yeah, I think asserts (and comments in case it's because we don't handle
> some specific cases yet) are better than possibly wrong behavior.

OK.

>> >> +  tree phi_result = make_ssa_name (TREE_TYPE (main_loop_value));
>> >> +  basic_block bb = loop_vinfo->main_loop_edge->dest;
>> >> +  gphi *new_phi = create_phi_node (phi_result, bb);
>> >> +  add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge,
>> >> +  UNKNOWN_LOCATION);
>> >> +  add_phi_arg (new_phi, skip_value,
>> >> +  loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION);
>> >> +  return phi_result;
>> >> +}
>> >> +
>> >>  /* Function vect_do_peeling.
>> >>
>> >> Input:
>> >> […]
>> >> @@ -4823,6 +4842,100 @@ info_for_reduction (vec_info *vinfo, stmt_vec_info stmt_info)
>> >>return stmt_info;
>> >>  }
>> >>
>> >> +/* PHI is a reduction in LOOP_VINFO that we are going to vectorize

Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops

2021-07-11 Thread Richard Biener via Gcc-patches
On Fri, Jul 9, 2021 at 3:12 PM Richard Sandiford wrote:
>
> Thanks for the review.
>
> Richard Biener  writes:
> >> @@ -588,6 +600,23 @@ public:
> >>/* Unrolling factor  */
> >>poly_uint64 vectorization_factor;
> >>
> >> +  /* If this loop is an epilogue loop whose main loop can be skipped,
> >> + MAIN_LOOP_EDGE is the edge from the main loop to this loop's
> >> + preheader.  SKIP_MAIN_LOOP_EDGE is then the edge that skips the
> >> + main loop and goes straight to this loop's preheader.
> >> +
> >> + Both fields are null otherwise.  */
> >> +  edge main_loop_edge;
> >> +  edge skip_main_loop_edge;
> >> +
> >> +  /* If this loop is an epilogue loop that might be skipped after executing
> >> + the main loop, this edge is the one that skips the epilogue.  */
> >> +  edge skip_this_loop_edge;
> >> +
> >> +  /* After vectorization, maps live-out SSA names to information about
> >> + the reductions that generated them.  */
> >> +  hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
> >
> > Is that the LC PHI node defs or the definition inside of the loop?
> > If the latter we could attach the info directly to its stmt-info?
>
> Ah, yeah, I should improve the comment there.  It's the vectoriser's
> replacement for the original LC PHI node, i.e. the final scalar result
> after the reduction has taken place.

OK

> >> @@ -1186,6 +1215,21 @@ public:
> >>/* The vector type for performing the actual reduction.  */
> >>tree reduc_vectype;
> >>
> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> >> + elements in parallel, this vector gives the initial values of these
> >> + N elements.  */
> >
> > That's N scalar elements or N vector elements?  I suppose it's for
> > SLP reductions (rather than SLP reduction chains) and never non-SLP
> > reductions?
>
> Yeah, poor wording again, sorry.  I meant something closer to:
>
>   /* If IS_REDUC_INFO is true and if the vector code is performing
>  N scalar reductions in parallel, this vector gives the initial
>  scalar values of those N reductions.  */
>
> >> +  vec<tree> reduc_initial_values;
> >> +
> >> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
> >> + elements in parallel, this vector gives the scalar result of each
> >> + reduction.  */
> >> +  vec<tree> reduc_scalar_results;
>
> Same change here.
>
> >> […]
> >> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
> >> index 2909e8a0fc3..b7b0523e3c8 100644
> >> --- a/gcc/tree-vect-loop-manip.c
> >> +++ b/gcc/tree-vect-loop-manip.c
> >> @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epilogue_vinfo,
> >>return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, true);
> >>  }
> >>
> >> +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on exit
> >> +   from the corresponding main loop.  Return a value that is available in
> >> +   LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipped.
> >> +   Passing a null SKIP_VALUE is equivalent to passing zero.  */
> >> +
> >> +tree
> >> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_value,
> >> +  tree skip_value)
> >> +{
> >> +  if (!loop_vinfo->main_loop_edge)
> >> +return main_loop_value;
> >> +
> >> +  if (!skip_value)
> >> +skip_value = build_zero_cst (TREE_TYPE (main_loop_value));
> >
> > shouldn't that be the initial value?
>
> For the current use case, the above two conditions are never true.
> I wrote it like this because I had a follow-on patch (which might
> not go anywhere) that needed this function for 0-based IVs.
>
> Maybe that's a bad risk/reward trade-off though.  Not having to pass
> zero makes things only slightly simpler for the follow-on patch,
> and I guess could be dangerous in other cases.
>
> Perhaps in that case though I should change loop_vinfo->main_loop_edge
> into a gcc_assert as well.

Yeah, I think asserts (and comments in case it's because we don't handle
some specific cases yet) are better than possibly wrong behavior.

> >> +  tree phi_result = make_ssa_name (TREE_TYPE (main_loop_value));
> >> +  basic_block bb = loop_vinfo->main_loop_edge->dest;
> >> +  gphi *new_phi = create_phi_node (phi_result, bb);
> >> +  add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge,
> >> +  UNKNOWN_LOCATION);
> >> +  add_phi_arg (new_phi, skip_value,
> >> +  loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION);
> >> +  return phi_result;
> >> +}
> >> +
> >>  /* Function vect_do_peeling.
> >>
> >> Input:
> >> […]
> >> @@ -4823,6 +4842,100 @@ info_for_reduction (vec_info *vinfo, stmt_vec_info stmt_info)
> >>return stmt_info;
> >>  }
> >>
> >> +/* PHI is a reduction in LOOP_VINFO that we are going to vectorize using vector
> >> +   type VECTYPE.  See if LOOP_VINFO is an epilogue loop whose main loop had a
> >> +   matching reduction that we can build on.  Adjust REDUC_INFO and return

Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops

2021-07-09 Thread Richard Sandiford via Gcc-patches
Thanks for the review.

Richard Biener writes:
>> @@ -588,6 +600,23 @@ public:
>>    /* Unrolling factor  */
>>    poly_uint64 vectorization_factor;
>>
>> +  /* If this loop is an epilogue loop whose main loop can be skipped,
>> +     MAIN_LOOP_EDGE is the edge from the main loop to this loop's
>> +     preheader.  SKIP_MAIN_LOOP_EDGE is then the edge that skips the
>> +     main loop and goes straight to this loop's preheader.
>> +
>> +     Both fields are null otherwise.  */
>> +  edge main_loop_edge;
>> +  edge skip_main_loop_edge;
>> +
>> +  /* If this loop is an epilogue loop that might be skipped after executing
>> +     the main loop, this edge is the one that skips the epilogue.  */
>> +  edge skip_this_loop_edge;
>> +
>> +  /* After vectorization, maps live-out SSA names to information about
>> +     the reductions that generated them.  */
>> +  hash_map<tree, vect_reusable_accumulator> reusable_accumulators;
>
> Is that the LC PHI node defs or the definition inside of the loop?
> If the latter we could attach the info directly to its stmt-info?

Ah, yeah, I should improve the comment there.  It's the vectoriser's
replacement for the original LC PHI node, i.e. the final scalar result
after the reduction has taken place.

>> @@ -1186,6 +1215,21 @@ public:
>>    /* The vector type for performing the actual reduction.  */
>>    tree reduc_vectype;
>>
>> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
>> +     elements in parallel, this vector gives the initial values of these
>> +     N elements.  */
>
> That's N scalar elements or N vector elements?  I suppose it's for
> SLP reductions (rather than SLP reduction chains) and never non-SLP
> reductions?

Yeah, poor wording again, sorry.  I meant something closer to:

  /* If IS_REDUC_INFO is true and if the vector code is performing
     N scalar reductions in parallel, this vector gives the initial
     scalar values of those N reductions.  */
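
To make the reworded comment concrete, here is a small standalone model (not GCC code) of N = 2 scalar reductions carried in one 4-lane accumulator. ReductionInfo, its fields and run_two_reductions are hypothetical names chosen for illustration; they only loosely mirror reduc_initial_values and reduc_scalar_results, and the scalar tail stands in for whatever handles leftover elements.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-reduction bookkeeping: one entry per
// scalar reduction, not one entry per vector lane.
struct ReductionInfo {
  std::vector<int> initial_values;  // initial scalar value of each reduction
  std::vector<int> scalar_results;  // final scalar result of each reduction
};

// Two interleaved sums performed in parallel: reduction 0 adds the
// even-indexed elements and reduction 1 adds the odd-indexed elements.
void run_two_reductions(const std::vector<int> &data, ReductionInfo &info) {
  // Lanes 0 and 1 start from the scalar initial values; lanes 2 and 3 start
  // from the neutral value of addition.
  std::array<int, 4> acc = {info.initial_values[0], info.initial_values[1],
                            0, 0};
  std::size_t i = 0;
  for (; i + 4 <= data.size(); i += 4)
    for (std::size_t lane = 0; lane < 4; ++lane)
      acc[lane] += data[i + lane];

  // Lanes 0 and 2 belong to reduction 0, lanes 1 and 3 to reduction 1.
  int r0 = acc[0] + acc[2];
  int r1 = acc[1] + acc[3];

  // Scalar tail for elements that did not fill a whole vector.
  for (; i < data.size(); ++i)
    (i % 2 == 0 ? r0 : r1) += data[i];

  info.scalar_results = {r0, r1};
}
```

With data {1, 2, 3, 4, 5, 6} and initial values {100, 200}, the model records {109, 212}: two scalar inputs and two scalar outputs, however many lanes the accumulator has.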

>> +  vec<tree> reduc_initial_values;
>> +
>> +  /* If IS_REDUC_INFO is true and if the reduction is operating on N
>> +     elements in parallel, this vector gives the scalar result of each
>> +     reduction.  */
>> +  vec<tree> reduc_scalar_results;

Same change here.

>> […]
>> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
>> index 2909e8a0fc3..b7b0523e3c8 100644
>> --- a/gcc/tree-vect-loop-manip.c
>> +++ b/gcc/tree-vect-loop-manip.c
>> @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epilogue_vinfo,
>>    return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, true);
>>  }
>>
>> +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on exit
>> +   from the corresponding main loop.  Return a value that is available in
>> +   LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipped.
>> +   Passing a null SKIP_VALUE is equivalent to passing zero.  */
>> +
>> +tree
>> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_value,
>> +                           tree skip_value)
>> +{
>> +  if (!loop_vinfo->main_loop_edge)
>> +    return main_loop_value;
>> +
>> +  if (!skip_value)
>> +    skip_value = build_zero_cst (TREE_TYPE (main_loop_value));
>
> shouldn't that be the initial value?

For the current use case, the above two conditions are never true.
I wrote it like this because I had a follow-on patch (which might
not go anywhere) that needed this function for 0-based IVs.

Maybe that's a bad risk/reward trade-off though.  Not having to pass
zero makes things only slightly simpler for the follow-on patch,
and I guess could be dangerous in other cases.

Perhaps in that case though I should change loop_vinfo->main_loop_edge
into a gcc_assert as well.
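
As a rough illustration of the assert-based direction (a standalone model, not the GCC function), the PHI in the epilogue preheader can be pictured as a select keyed on the incoming edge. EpilogueEdges, EnteredVia and preheader_value are hypothetical names, and plain assert stands in for gcc_assert.

```cpp
#include <cassert>

// Hypothetical stand-in for the two edges recorded on an epilogue loop;
// each bool only says whether the corresponding edge exists.
struct EpilogueEdges {
  bool has_main_loop_edge;
  bool has_skip_main_loop_edge;
};

// Which edge control actually took into the epilogue preheader.
enum class EnteredVia { MainLoop, SkipMainLoop };

// Models the PHI built in the preheader: one argument per incoming edge.
// Both edges must exist and the caller must always supply the skip value,
// so there is no silent fallback to zero.
int preheader_value(const EpilogueEdges &edges, EnteredVia via,
                    int main_loop_value, int skip_value) {
  assert(edges.has_main_loop_edge && edges.has_skip_main_loop_edge);
  return via == EnteredVia::MainLoop ? main_loop_value : skip_value;
}
```

In this model a caller that wants a zero skip value has to pass 0 explicitly, which is the trade-off described above.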

>> +  tree phi_result = make_ssa_name (TREE_TYPE (main_loop_value));
>> +  basic_block bb = loop_vinfo->main_loop_edge->dest;
>> +  gphi *new_phi = create_phi_node (phi_result, bb);
>> +  add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge,
>> +               UNKNOWN_LOCATION);
>> +  add_phi_arg (new_phi, skip_value,
>> +               loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION);
>> +  return phi_result;
>> +}
>> +
>>  /* Function vect_do_peeling.
>>
>> Input:
>> […]
>> @@ -4823,6 +4842,100 @@ info_for_reduction (vec_info *vinfo, stmt_vec_info stmt_info)
>>    return stmt_info;
>>  }
>>
>> +/* PHI is a reduction in LOOP_VINFO that we are going to vectorize using vector
>> +   type VECTYPE.  See if LOOP_VINFO is an epilogue loop whose main loop had a
>> +   matching reduction that we can build on.  Adjust REDUC_INFO and return true
>> +   if so, otherwise return false.  */
>> +
>> +static bool
>> +vect_find_reusable_accumulator (loop_vec_info loop_vinfo,
>> +                                stmt_vec_info reduc_info)
>> +{
>> +  loop_vec_info main_loop_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
>> +  if (!main_loop_vinfo)
>> +    return false;
>> +
>> +  if (STMT_VINFO_REDUC_TYPE (reduc_info) != TREE_CODE_REDUCTION)
>> +    return false;
>> +
>> +  unsi

Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops

2021-07-09 Thread Richard Biener via Gcc-patches
On Thu, Jul 8, 2021 at 2:50 PM Richard Sandiford via Gcc-patches wrote:
>
> This patch adds support for reusing a main loop's reduction accumulator
> in an epilogue loop.  This in turn lets the loops share a single piece
> of vector->scalar reduction code.
>
> The patch has the following restrictions:
>
> (1) The epilogue reduction can only operate on a single vector
> (e.g. ncopies must be 1 for non-SLP reductions, and the group size
> must be <= the element count for SLP reductions).
>
> (2) Both loops must use the same vector mode for their accumulators.
> This means that the patch is restricted to targets that support
> --param vect-partial-vector-usage=1.
>
> (3) The reduction must be a standard “tree code” reduction.
>
> However, these restrictions could be lifted in future.  For example,
> if the main loop operates on 128-bit vectors and the epilogue loop
> operates on 64-bit vectors, we could in future reduce the 128-bit
> vector by one stage and use the 64-bit result as the starting point
> for the epilogue result.

Yeah, I hope that can be done quickly - it should make the
approach usable on x86_64.

> The patch tries to handle chained SLP reductions, unchained SLP
> reductions and non-SLP reductions.  It also handles cases in which
> the epilogue loop is entered directly (rather than via the main loop)
> and cases in which the epilogue loop can be skipped.
>
> vect_get_main_loop_result is a bit more general than the current
> patch needs.

I didn't see anything that would adjust the costing of the vectorization
(though I don't specifically remember how we cost vectorized epilogues
in general).

Few comments / questions inline below - I think the patch is OK
as-is though.

Thanks,
Richard.

> gcc/
> * tree-vectorizer.h (vect_reusable_accumulator): New structure.
> (_loop_vec_info::main_loop_edge): New field.
> (_loop_vec_info::skip_main_loop_edge): Likewise.
> (_loop_vec_info::skip_this_loop_edge): Likewise.
> (_loop_vec_info::reusable_accumulators): Likewise.
> (_stmt_vec_info::reduc_scalar_results): Likewise.
> (_stmt_vec_info::reused_accumulator): Likewise.
> (vect_get_main_loop_result): Declare.
> * tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
> reduc_scalar_inputs.
> (vec_info::free_stmt_vec_info): Free reduc_scalar_inputs.
> * tree-vect-loop-manip.c (vect_get_main_loop_result): New function.
> (vect_do_peeling): Fill an epilogue loop's main_loop_edge,
> skip_main_loop_edge and skip_this_loop_edge fields.
> * tree-vect-loop.c (INCLUDE_ALGORITHM): Define.
> (vect_emit_reduction_init_stmts): New function.
> (get_initial_def_for_reduction): Use it.
> (get_initial_defs_for_reduction): Likewise.  Change the vinfo
> parameter to a loop_vec_info.
> (vect_create_epilog_for_reduction): Store the scalar results
> in the reduc_info.  If an epilogue loop is reusing an accumulator
> from the main loop, and if the epilogue loop can also be skipped,
> try to place the reduction code in the join block.  Record
> accumulators that could potentially be reused by epilogue loops.
> (vect_transform_cycle_phi): When vectorizing epilogue loops,
> try to reuse accumulators from the main loop.  Record the initial
> value in reduc_info for non-SLP reductions too.
>
> gcc/testsuite/
> * gcc.target/aarch64/sve/reduc_9.c: New test.
> * gcc.target/aarch64/sve/reduc_9_run.c: Likewise.
> * gcc.target/aarch64/sve/reduc_10.c: Likewise.
> * gcc.target/aarch64/sve/reduc_10_run.c: Likewise.
> * gcc.target/aarch64/sve/reduc_11.c: Likewise.
> * gcc.target/aarch64/sve/reduc_11_run.c: Likewise.
> * gcc.target/aarch64/sve/reduc_12.c: Likewise.
> * gcc.target/aarch64/sve/reduc_12_run.c: Likewise.
> * gcc.target/aarch64/sve/reduc_13.c: Likewise.
> * gcc.target/aarch64/sve/reduc_13_run.c: Likewise.
> * gcc.target/aarch64/sve/reduc_14.c: Likewise.
> * gcc.target/aarch64/sve/reduc_14_run.c: Likewise.
> * gcc.target/aarch64/sve/reduc_15.c: Likewise.
> * gcc.target/aarch64/sve/reduc_15_run.c: Likewise.
> ---
>  .../gcc.target/aarch64/sve/reduc_10.c |  77 +
>  .../gcc.target/aarch64/sve/reduc_10_run.c |  49 +++
>  .../gcc.target/aarch64/sve/reduc_11.c |  71 
>  .../gcc.target/aarch64/sve/reduc_11_run.c |  34 ++
>  .../gcc.target/aarch64/sve/reduc_12.c |  71 
>  .../gcc.target/aarch64/sve/reduc_12_run.c |  66 
>  .../gcc.target/aarch64/sve/reduc_13.c | 101 ++
>  .../gcc.target/aarch64/sve/reduc_13_run.c |  61 
>  .../gcc.target/aarch64/sve/reduc_14.c | 107 ++
>  .../gcc.target/aarch64/sve/reduc_14_run.c | 187 +++
>  .../gcc.target/aarch64/sve/reduc_15.c |  16 +
>  .../gcc.target/aarch64

[PATCH 10/10] vect: Reuse reduction accumulators between loops

2021-07-08 Thread Richard Sandiford via Gcc-patches
This patch adds support for reusing a main loop's reduction accumulator
in an epilogue loop.  This in turn lets the loops share a single piece
of vector->scalar reduction code.

The patch has the following restrictions:

(1) The epilogue reduction can only operate on a single vector
(e.g. ncopies must be 1 for non-SLP reductions, and the group size
must be <= the element count for SLP reductions).

(2) Both loops must use the same vector mode for their accumulators.
This means that the patch is restricted to targets that support
--param vect-partial-vector-usage=1.

(3) The reduction must be a standard “tree code” reduction.

However, these restrictions could be lifted in future.  For example,
if the main loop operates on 128-bit vectors and the epilogue loop
operates on 64-bit vectors, we could in future reduce the 128-bit
vector by one stage and use the 64-bit result as the starting point
for the epilogue result.
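
The possible future extension can be sketched in plain C++ (a model of the idea only; the patch does not implement it). Vec4 and Vec2 stand in for 128-bit and 64-bit accumulators of 32-bit elements, and all names here are illustrative assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Vec4 = std::array<std::int32_t, 4>;  // models a 128-bit accumulator
using Vec2 = std::array<std::int32_t, 2>;  // models a 64-bit accumulator

// One reduction stage: fold the high half of the wide accumulator onto the
// low half, giving a half-width accumulator.
Vec2 reduce_one_stage(const Vec4 &acc) {
  return {acc[0] + acc[2], acc[1] + acc[3]};
}

std::int32_t sum_with_narrow_epilogue(const std::vector<std::int32_t> &data) {
  std::size_t i = 0;

  // Main loop on 4-lane vectors.
  Vec4 wide = {0, 0, 0, 0};
  for (; i + 4 <= data.size(); i += 4)
    for (std::size_t lane = 0; lane < 4; ++lane)
      wide[lane] += data[i + lane];

  // The epilogue starts from the partially reduced main-loop accumulator
  // instead of starting again from zero.
  Vec2 narrow = reduce_one_stage(wide);
  for (; i + 2 <= data.size(); i += 2)
    for (std::size_t lane = 0; lane < 2; ++lane)
      narrow[lane] += data[i + lane];

  // A single final vector->scalar reduction, then a scalar tail.
  std::int32_t result = narrow[0] + narrow[1];
  for (; i < data.size(); ++i)
    result += data[i];
  return result;
}
```

For the input {1, 2, 3, 4, 5, 6, 7} the wide accumulator ends as {1, 2, 3, 4}, one stage gives {4, 6}, the epilogue makes it {9, 12}, and the scalar tail brings the total to 28.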

The patch tries to handle chained SLP reductions, unchained SLP
reductions and non-SLP reductions.  It also handles cases in which
the epilogue loop is entered directly (rather than via the main loop)
and cases in which the epilogue loop can be skipped.
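
A scalar model of the resulting control flow may help; this is a sketch under simplifying assumptions, not what the vectoriser emits. Both loops use the same 4-lane accumulator (restriction 2), the epilogue is a single predicated iteration, and kLanes, Acc and sum_reusing_accumulator are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;
using Acc = std::array<long, kLanes>;

long sum_reusing_accumulator(const std::vector<long> &data, long init) {
  const std::size_t n = data.size();

  // Lane 0 carries the scalar initial value; the other lanes start from the
  // neutral value.  This is also what the epilogue sees when the main loop
  // is skipped and the epilogue is entered directly.
  Acc acc = {init, 0, 0, 0};
  std::size_t i = 0;

  // Main loop: executes zero times when there are fewer than kLanes elements.
  for (; i + kLanes <= n; i += kLanes)
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      acc[lane] += data[i + lane];

  // Epilogue loop: skipped when nothing is left.  It continues in the same
  // accumulator, with inactive lanes masked off.
  if (i < n)
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      if (i + lane < n)
        acc[lane] += data[i + lane];

  // Join point: one vector->scalar reduction shared by every path.
  long result = 0;
  for (long lane_value : acc)
    result += lane_value;
  return result;
}
```

Six elements exercise both loops, two elements skip the main loop, and four elements skip the epilogue; in all three cases the final reduction code is the same.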

vect_get_main_loop_result is a bit more general than the current
patch needs.

gcc/
* tree-vectorizer.h (vect_reusable_accumulator): New structure.
(_loop_vec_info::main_loop_edge): New field.
(_loop_vec_info::skip_main_loop_edge): Likewise.
(_loop_vec_info::skip_this_loop_edge): Likewise.
(_loop_vec_info::reusable_accumulators): Likewise.
(_stmt_vec_info::reduc_scalar_results): Likewise.
(_stmt_vec_info::reused_accumulator): Likewise.
(vect_get_main_loop_result): Declare.
* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
reduc_scalar_inputs.
(vec_info::free_stmt_vec_info): Free reduc_scalar_inputs.
* tree-vect-loop-manip.c (vect_get_main_loop_result): New function.
(vect_do_peeling): Fill an epilogue loop's main_loop_edge,
skip_main_loop_edge and skip_this_loop_edge fields.
* tree-vect-loop.c (INCLUDE_ALGORITHM): Define.
(vect_emit_reduction_init_stmts): New function.
(get_initial_def_for_reduction): Use it.
(get_initial_defs_for_reduction): Likewise.  Change the vinfo
parameter to a loop_vec_info.
(vect_create_epilog_for_reduction): Store the scalar results
in the reduc_info.  If an epilogue loop is reusing an accumulator
from the main loop, and if the epilogue loop can also be skipped,
try to place the reduction code in the join block.  Record
accumulators that could potentially be reused by epilogue loops.
(vect_transform_cycle_phi): When vectorizing epilogue loops,
try to reuse accumulators from the main loop.  Record the initial
value in reduc_info for non-SLP reductions too.

gcc/testsuite/
* gcc.target/aarch64/sve/reduc_9.c: New test.
* gcc.target/aarch64/sve/reduc_9_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_10.c: Likewise.
* gcc.target/aarch64/sve/reduc_10_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_11.c: Likewise.
* gcc.target/aarch64/sve/reduc_11_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_12.c: Likewise.
* gcc.target/aarch64/sve/reduc_12_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_13.c: Likewise.
* gcc.target/aarch64/sve/reduc_13_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_14.c: Likewise.
* gcc.target/aarch64/sve/reduc_14_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_15.c: Likewise.
* gcc.target/aarch64/sve/reduc_15_run.c: Likewise.
---
 .../gcc.target/aarch64/sve/reduc_10.c |  77 +
 .../gcc.target/aarch64/sve/reduc_10_run.c |  49 +++
 .../gcc.target/aarch64/sve/reduc_11.c |  71 
 .../gcc.target/aarch64/sve/reduc_11_run.c |  34 ++
 .../gcc.target/aarch64/sve/reduc_12.c |  71 
 .../gcc.target/aarch64/sve/reduc_12_run.c |  66 
 .../gcc.target/aarch64/sve/reduc_13.c | 101 ++
 .../gcc.target/aarch64/sve/reduc_13_run.c |  61 
 .../gcc.target/aarch64/sve/reduc_14.c | 107 ++
 .../gcc.target/aarch64/sve/reduc_14_run.c | 187 +++
 .../gcc.target/aarch64/sve/reduc_15.c |  16 +
 .../gcc.target/aarch64/sve/reduc_15_run.c |  22 ++
 .../gcc.target/aarch64/sve/reduc_9.c  |  77 +
 .../gcc.target/aarch64/sve/reduc_9_run.c  |  29 ++
 gcc/tree-vect-loop-manip.c|  29 ++
 gcc/tree-vect-loop.c  | 309 ++
 gcc/tree-vectorizer.c |   4 +
 gcc/tree-vectorizer.h |  51 ++-
 18 files changed, 1297 insertions(+), 64 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_10.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/reduc_10_run