Richard Biener <rguent...@suse.de> writes:
> This adds a simple reduction vectorization capability to the
> non-loop vectorizer.  Simple meaning it lacks any of the fancy
> ways to generate the reduction epilogue but only supports
> those we can handle via a direct internal function reducing
> a vector to a scalar.  One of the main reasons is to avoid
> massive refactoring at this point but also that more complex
> epilogue operations are hardly profitable.
>
> Mixed sign reductions are for now fend off and I'm not finally
> settled with whether we want an explicit SLP node for the
> reduction epilogue operation.  Handling mixed signs could be
> done by multiplying with a { 1, -1, .. } vector.  Fend off
> are also reductions with non-internal operands (constants
> or register parameters for example).
>
> Costing is done by accounting the original scalar participating
> stmts for the scalar cost and log2 permutes and operations for
> the vectorized epilogue.
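[Illustration, not part of the patch.]  A minimal sketch of the shapes discussed above, assuming 4 integer lanes so that reassociation is exact.  The function names are hypothetical.  reduce_add_log2 only emulates the "log2 permutes and operations" that the costing assumes; the patch itself emits a direct internal function for the epilogue.  mixed_sign4 shows the { 1, -1, .. } multiplication idea, which the patch does not implement.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar form: a straight-line sum of four lanes, i.e. the kind of
   associative chain that would be a seed for BB reduction.  */
int sum4_scalar(const int *a)
{
    return a[0] + a[1] + a[2] + a[3];
}

/* Emulates a log2 reduction tree: each step adds the upper half of
   the live lanes onto the lower half.  n must be a power of two
   between 1 and 16 (an assumption of this sketch).  */
int reduce_add_log2(const int *a, size_t n)
{
    int v[16];

    assert(n >= 1 && n <= 16 && (n & (n - 1)) == 0);
    for (size_t i = 0; i < n; i++)
        v[i] = a[i];
    for (size_t half = n / 2; half >= 1; half /= 2)
        for (size_t i = 0; i < half; i++)
            v[i] += v[i + half];
    return v[0];
}

/* Mixed-sign chain a[0] - a[1] + a[2] - a[3], rewritten as a plain
   sum after a lane-wise multiply with { 1, -1, 1, -1 }.  */
int mixed_sign4(const int *a)
{
    static const int sign[4] = { 1, -1, 1, -1 };
    int t[4];

    for (size_t i = 0; i < 4; i++)
        t[i] = a[i] * sign[i];
    return reduce_add_log2(t, 4);
}
```

For four lanes the tree takes two add steps (log2 of 4), against three scalar adds in sum4_scalar.  With floating-point lanes the same rewrite changes the evaluation order, so it is only valid when reassociation is allowed (the measurements in this thread use -Ofast).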

It would be good if we had a standard way of asking for this
cost for both loops and SLP, perhaps based on the internal function.
E.g. for aarch64 we have a cost table that gives a more precise cost
(and log2 of the scalar op isn't always it :-)).

I don't have any specific suggestion how though.  And I guess it
can be a follow-on patch anyway.

> SPEC CPU 2017 FP with rate workload measurements show (picked
> fastest runs of three) regressions for 507.cactuBSSN_r (1.5%),
> 508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5) and
> 527.cam4_r (2.5%) and improvements for 510.parest_r (5%) and
> 538.imagick_r (1.5%).  This is with -Ofast -march=znver2 on a Zen2.
>
> Statistics on CPU 2017 shows that the overwhelming number of seeds
> we find are reductions of two lanes (well - that's basically every
> associative operation).  That means we put a quite high pressure
> on the SLP discovery process this way.
>
> In total we find 583218 seeds we put to SLP discovery out of which
> 66205 pass that and only 6185 of those make it through
> code generation checks.  796 of those are discarded because the reduction
> is part of a larger SLP instance.  4195 of the remaining
> are deemed not profitable to vectorize and 1194 are finally
> vectorized.  That's a poor 0.2% rate.

Oof.

> Of the 583218 seeds 486826 (83%) have two lanes, 60912 have three (10%),
> 28181 four (5%), 4808 five, 909 six and there are instances up to 120
> lanes.
>
> There's a set of 54086 candidate seeds we reject because
> they contain a constant or invariant (not implemented yet) but still
> have two or more lanes that could be put to SLP discovery.

It looks like the patch doesn't explicitly forbid 2-element reductions
and instead relies on the cost model.  Is that right?

> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> built and tested SPEC CPU 2017 with -Ofast -march=znver2 successfully.
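[Illustration, not part of the patch.]  The funnel figures quoted above are self-consistent, which a few lines of arithmetic can confirm.  The counts are copied from the message; the enum and function names are hypothetical, and rates are returned in tenths of a percent to stay in integer arithmetic.

```c
#include <assert.h>

/* Counts quoted in the message.  */
enum {
    SEEDS        = 583218,  /* seeds put to SLP discovery */
    PASS_CODEGEN = 6185,    /* survive code generation checks */
    COVERED      = 796,     /* part of a larger SLP instance */
    UNPROFITABLE = 4195,    /* rejected by the cost model */
    VECTORIZED   = 1194,    /* finally vectorized */
    TWO_LANES    = 486826   /* seeds with exactly two lanes */
};

/* Seeds left after pruning covered roots and unprofitable ones.  */
long long remaining_after_pruning(void)
{
    return (long long) PASS_CODEGEN - COVERED - UNPROFITABLE;
}

/* part / whole in tenths of a percent, rounded to nearest.  */
long long permille(long long part, long long whole)
{
    assert(whole > 0);
    return (part * 1000 + whole / 2) / whole;
}
```

remaining_after_pruning () gives 1194, matching the vectorized count.  permille (VECTORIZED, SEEDS) gives 2, i.e. the quoted 0.2%, and permille (TWO_LANES, SEEDS) gives 835, i.e. the quoted 83% up to rounding.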
>
> I do think this is good enough(TM) for this point, please speak up
> if you disagree and/or like to see changes.

No objection from me FWIW.  Looks like a nice feature :-)

Thanks,
Richard

>
> Thanks,
> Richard.
>
> 2021-06-16  Richard Biener  <rguent...@suse.de>
>
>       PR tree-optimization/54400
>       * tree-vectorizer.h (enum slp_instance_kind): Add
>       slp_inst_kind_bb_reduc.
>       (reduction_fn_for_scalar_code): Declare.
>       * tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
>       Check SLP_INSTANCE_KIND instead of looking at the
>       representative.
>       (vect_slp_analyze_instance_alignment): Likewise.
>       * tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
>       * tree-vect-slp.c (vect_slp_linearize_chain): Split out
>       chain linearization from vect_build_slp_tree_2 and generalize
>       for the use of BB reduction vectorization.
>       (vect_build_slp_tree_2): Adjust accordingly.
>       (vect_optimize_slp): Elide permutes at the root of BB reduction
>       instances.
>       (vectorizable_bb_reduc_epilogue): New function.
>       (vect_slp_prune_covered_roots): Likewise.
>       (vect_slp_analyze_operations): Use them.
>       (vect_slp_check_for_constructors): Recognize associatable
>       chains for BB reduction vectorization.
>       (vectorize_slp_instance_root_stmt): Generate code for the
>       BB reduction epilogue.
>
>       * gcc.dg/vect/bb-slp-pr54400.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c |  43 +++
>  gcc/tree-vect-data-refs.c                  |   9 +-
>  gcc/tree-vect-loop.c                       |   2 +-
>  gcc/tree-vect-slp.c                        | 383 +++++++++++++++++----
>  gcc/tree-vectorizer.h                      |   2 +
>  5 files changed, 367 insertions(+), 72 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c