On Mon, 28 Mar 2022, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > Since we're now vectorizing by default at -O2, issues like PR101908
> > become more important, where we apply basic-block vectorization to
> > parts of the function covering loads from function parameters passed
> > on the stack.  While we have no good idea how the stack pushing
> > was performed, we do have a good idea that it was done recently, so
> > for loads at the start of the function a STLF failure is inevitable
> > unless the argument passing code used the same or larger vector
> > stores and also aligned them the same way.
> >
> > Until there's a robust IPA based solution the following implements
> > target independent heuristics in the vectorizer to retain scalar
> > loads for loads from parameters likely passed in memory (I use
> > a BLKmode DECL_MODE check for this rather than firing up
> > cumulative-args).  I've also restricted this to loads from the
> > first "block" (which can be smaller than the first basic block if
> > there's a call, for example), since that covers the testcase.
> >
> > Note that for the testcase (but not c-ray from the bug report)
> > there's an x86 peephole2 that vectorizes things back, so the patch
> > is not effective there.
> >
> > Any comments?  I know we're also looking at x86 port specific
> > mitigations, but I think the issue will hit arm and power/z as well.
> 
> I'm not sure this is a target-independent win.  In a loop that:
> 
>   stores 2 scalars
>   loads the stored scalars as a vector
>   adds a vector
>   stores a vector
> 
> (no feedback), I see a 20% regression using elementwise accesses for
> the load vs. using a normal vector load (measured on a Cortex-A72).
> With feedback the elementwise version is still slower, but obviously
> not by such a big factor.
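If I picture that kernel at the source level it would be roughly the
following (an illustrative sketch with made-up names, not your actual
benchmark, assuming SLP turns the second statement pair into vector
code):

  void
  f (double *restrict a, const double *restrict b, double x, double y,
     int n)
  {
    for (int i = 0; i < n; i += 2)
      {
        /* Two scalar stores ...  */
        a[i] = x;
        a[i + 1] = y;
        /* ... immediately reloaded; SLP can turn this pair into a
           vector load, a vector add and a vector store.  */
        a[i] += b[i];
        a[i + 1] += b[i + 1];
      }
  }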
I see, so that's even without a call in between the scalar stores and
the vector load, which is the case we're trying to cover.  I suspect
the two elementwise accesses may execute too close to the two scalar
stores to benefit from any forwarding on the A72 micro-architecture?
Do you see a speedup when performing a vector store instead of two
scalar stores?

> I can see the argument for penalising this in the cost model in
> a target-independent way, but I'm not sure we should change the
> vectorisation strategy.

OK.  We tried that, but I think the better idea is to split the stores
when that is likely faster, late during md-reorg, where we should have
the best idea of whether we'll actually hit a STLF fail.  Since that's
also what Hongtao prefers I'll drop this patch for now.

Thanks,
Richard.

> Thanks,
> Richard
>
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Thanks,
> > Richard.
> >
> > 2022-03-25  Richard Biener  <rguent...@suse.de>
> >
> > 	PR tree-optimization/101908
> > 	* tree-vect-stmts.cc (get_group_load_store_type): Add
> > 	heuristic to limit BB vectorization of function parameters.
> >
> > 	* gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
> >  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
> >  2 files changed, 44 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > new file mode 100644
> > index 00000000000..b7534a18f0e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target vect_double } */
> > +
> > +struct vec3 {
> > +  double x, y, z;
> > +};
> > +
> > +struct ray {
> > +  struct vec3 orig, dir;
> > +};
> > +
> > +void ray_sphere (struct ray ray, double *res)
> > +{
> > +  res[0] = ray.orig.y * ray.dir.x;
> > +  res[1] = ray.orig.z * ray.dir.y;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index f7449a79d1c..1e37e9678b6 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> >    /* Stores can't yet have gaps.  */
> >    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
> > 
> > -  if (slp_node)
> > +  tree parm;
> > +  if (!loop_vinfo
> > +      && vls_type == VLS_LOAD
> > +      /* The access is based on a PARM_DECL.  */
> > +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> > +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> > +      && TREE_CODE (parm) == PARM_DECL
> > +      /* Likely passed on the stack.  */
> > +      && DECL_MODE (parm) == BLKmode
> > +      /* The access is in the first group.  */
> > +      && first_dr_info->group == 0)
> > +    {
> > +      /* When doing BB vectorizing force early loads from function parameters
> > +	 passed on the stack and thus stored recently to be done elementwise
> > +	 to avoid store-to-load forwarding penalties.
> > +	 Note this will cause vectorization to fail for the load because of
> > +	 the fear of underestimating the cost of elementwise accesses,
> > +	 see the end of get_load_store_type.  We are then going to effectively
> > +	 do the same via handling the loads as external input to the SLP.
> > +	 */
> > +      if (dump_enabled_p ())
> > +	dump_printf_loc (MSG_NOTE, vect_location,
> > +			 "Not using vector loads from function parameter "
> > +			 "for the fear of causing a STLF fail\n");
> > +      *memory_access_type = VMAT_ELEMENTWISE;
> > +    }
> > +  else if (slp_node)
> >      {
> >        /* For SLP vectorization we directly vectorize a subchain
> > 	 without permutation.  */

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)