Richard Biener <rguent...@suse.de> writes:
> This adds a simple reduction vectorization capability to the
> non-loop vectorizer.  Simple meaning it lacks any of the fancy
> ways to generate the reduction epilogue but only supports
> those we can handle via a direct internal function reducing
> a vector to a scalar.  One of the main reasons is to avoid
> massive refactoring at this point but also that more complex
> epilogue operations are hardly profitable.
>
> Mixed sign reductions are for now fended off and I'm not yet
> settled on whether we want an explicit SLP node for the
> reduction epilogue operation.  Handling mixed signs could be
> done by multiplying with a { 1, -1, .. } vector.  Also fended
> off are reductions with non-internal operands (constants
> or register parameters for example).
>
> Costing is done by accounting the original scalar participating
> stmts for the scalar cost and log2 permutes and operations for
> the vectorized epilogue.

It would be good if we had a standard way of asking for this
cost for both loops and SLP, perhaps based on the internal function.
E.g. for aarch64 we have a cost table that gives a more precise cost
(and log2 of the scalar op isn't always it :-)).

I don't have a specific suggestion for how, though.  And I guess it
can be a follow-on patch anyway.
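
To make the costing comparison concrete, here is a toy sketch of the
scheme as described in the cover letter.  It is not the code in the
patch: the struct, the function names and the unit costs are all
hypothetical, and it assumes an N-lane chain takes N - 1 scalar
operations.

```c
/* Hypothetical per-operation costs; a real target would supply its own.  */
struct reduc_costs {
  unsigned scalar_op;  /* one scalar operation of the reduction code */
  unsigned vector_op;  /* one vector operation of the reduction code */
  unsigned permute;    /* one vector permute/shuffle */
};

/* Smallest k with 2^k >= n, i.e. ceil (log2 (n)); 0 for n <= 1.  */
static unsigned
ceil_log2 (unsigned n)
{
  unsigned k = 0;
  while (k < 31 && (1u << k) < n)
    k++;
  return k;
}

/* Assumption: reducing LANES scalars takes LANES - 1 scalar operations.  */
static unsigned
scalar_chain_cost (const struct reduc_costs *c, unsigned lanes)
{
  return lanes > 1 ? (lanes - 1) * c->scalar_op : 0;
}

/* Assumption: each of the log2 (LANES) halving steps costs one permute
   plus one vector operation.  */
static unsigned
vector_epilogue_cost (const struct reduc_costs *c, unsigned lanes)
{
  return ceil_log2 (lanes) * (c->permute + c->vector_op);
}
```

With all three costs set to 1, two lanes give a scalar cost of 1 against
an epilogue cost of 2, and eight lanes give 7 against 6.  This only
covers the epilogue; the operand computation on both sides is left out.
A per-target table could in principle replace vector_epilogue_cost here.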

> SPEC CPU 2017 FP with rate workload measurements show (picked
> fastest runs of three) regressions for 507.cactuBSSN_r (1.5%),
> 508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5%) and
> 527.cam4_r (2.5%) and improvements for 510.parest_r (5%) and
> 538.imagick_r (1.5%).  This is with -Ofast -march=znver2 on a Zen2.
>
> Statistics on CPU 2017 show that the overwhelming number of seeds
> we find are reductions of two lanes (well - that's basically every
> associative operation).  That means we put quite high pressure
> on the SLP discovery process this way.
>
> In total we find 583218 seeds we put to SLP discovery out of which
> 66205 pass that and only 6185 of those make it through
> code generation checks. 796 of those are discarded because the reduction
> is part of a larger SLP instance.  4195 of the remaining
> are deemed not profitable to vectorize and 1194 are finally
> vectorized.  That's a poor 0.2% rate.

Oof.

> Of the 583218 seeds 486826 (83%) have two lanes, 60912 have three (10%),
> 28181 four (5%), 4808 five, 909 six and there are instances up to 120
> lanes.
>
> There's a set of 54086 candidate seeds we reject because
> they contain a constant or invariant (not implemented yet) but still
> have two or more lanes that could be put to SLP discovery.

It looks like the patch doesn't explicitly forbid 2-element reductions
and instead relies on the cost model.  Is that right?
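
For anyone following along, a toy sketch of why almost every
associative operation ends up as a two-lane seed.  This is not how the
patch linearizes chains; the expression type and count_lanes are
hypothetical, and real chain linearization has further conditions
(single uses, mixed operations and so on) that are ignored here.

```c
#include <stddef.h>

/* Hypothetical expression node; OP is 0 for a leaf operand.  */
struct expr {
  char op;
  const struct expr *lhs;
  const struct expr *rhs;
};

/* Count the leaf operands ("lanes") reached by walking through nodes
   whose operation is CODE (assumed nonzero).  Any other node, leaf or
   not, counts as a single lane.  */
static unsigned
count_lanes (const struct expr *e, char code)
{
  if (e == NULL)
    return 0;
  if (e->op != code)
    return 1;
  return count_lanes (e->lhs, code) + count_lanes (e->rhs, code);
}
```

A plain a + b already counts as two lanes, (a + b) + (c + d) as four,
and (a * b) + c as two again because the multiplication ends the chain.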

> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> built and tested SPEC CPU 2017 with -Ofast -march=znver2 successfully.
>
> I do think this is good enough(TM) at this point, please speak up
> if you disagree and/or would like to see changes.

No objection from me FWIW.  Looks like a nice feature :-)

Thanks,
Richard

>
> Thanks,
> Richard.
>
> 2021-06-16  Richard Biener   <rguent...@suse.de>
>
>       PR tree-optimization/54400
>       * tree-vectorizer.h (enum slp_instance_kind): Add
>       slp_inst_kind_bb_reduc.
>       (reduction_fn_for_scalar_code): Declare.
>       * tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
>       Check SLP_INSTANCE_KIND instead of looking at the
>       representative.
>       (vect_slp_analyze_instance_alignment): Likewise.
>       * tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
>       * tree-vect-slp.c (vect_slp_linearize_chain): Split out
>       chain linearization from vect_build_slp_tree_2 and generalize
>       for the use of BB reduction vectorization.
>       (vect_build_slp_tree_2): Adjust accordingly.
>       (vect_optimize_slp): Elide permutes at the root of BB reduction
>       instances.
>       (vectorizable_bb_reduc_epilogue): New function.
>       (vect_slp_prune_covered_roots): Likewise.
>       (vect_slp_analyze_operations): Use them.
>       (vect_slp_check_for_constructors): Recognize associatable
>       chains for BB reduction vectorization.
>       (vectorize_slp_instance_root_stmt): Generate code for the
>       BB reduction epilogue.
>
>       * gcc.dg/vect/bb-slp-pr54400.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c |  43 +++
>  gcc/tree-vect-data-refs.c                  |   9 +-
>  gcc/tree-vect-loop.c                       |   2 +-
>  gcc/tree-vect-slp.c                        | 383 +++++++++++++++++----
>  gcc/tree-vectorizer.h                      |   2 +
>  5 files changed, 367 insertions(+), 72 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
