On Mon, 28 Mar 2022, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > Since we're now vectorizing by default at -O2, issues like PR101908
> > become more important, where we apply basic-block vectorization to
> > parts of the function covering loads from function parameters passed
> > on the stack.  We have no good idea how the stack pushing was
> > performed, but we do have a good idea that it was done recently, so
> > when such a load sits at the start of the function a STLF failure is
> > inevitable unless the argument-passing code used the same or larger
> > vector stores and also aligned them the same way.
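> >
> > For example, the caller side might push such an argument with
> > element-wise scalar stores that the callee's vectorized code then
> > reads back with a single vector load.  A sketch (made-up names;
> > 32 bytes makes the aggregate BLKmode and memory-passed on x86-64):
> >
> >   struct args { double a, b, c, d; };
> >
> >   void callee (struct args x, double *out);
> >
> >   void caller (double a, double b, double *out)
> >   {
> >     /* Argument setup: scalar stores into the outgoing
> >        argument slot on the stack.  */
> >     struct args x = { a, b, 0., 0. };
> >     /* If the vectorized callee immediately loads x.a/x.b with one
> >        vector load, that load overlaps two narrower stores and
> >        cannot be store-forwarded.  */
> >     callee (x, out);
> >   }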
> >
> > Until there's a robust IPA-based solution, the following implements
> > a target-independent heuristic in the vectorizer to retain scalar
> > loads for loads from parameters likely passed in memory (I use a
> > BLKmode DECL_MODE check for this rather than firing up
> > CUMULATIVE_ARGS).  I've also restricted this to loads from the
> > first "block" (which can be less than the first basic block if
> > there's a call, for example), since that covers the testcase.
> >
> > Note that for the testcase (but not for c-ray from the bug report)
> > there's an x86 peephole2 that vectorizes things back, so the patch
> > is not effective there.
> >
> > Any comments?  I know we're also looking at x86 port-specific
> > mitigations, but the issue will hit arm and power/z as well, I think.
> 
> I'm not sure this is a target-independent win.  In a loop that:
> 
>   stores 2 scalars
>   loads the stored scalars as a vector
>   adds a vector
>   stores a vector
> 
> (no feedback), I see a 20% regression using elementwise accesses for
> the load vs. using a normal vector load (measured on a Cortex-A72).
> With feedback the elementwise version is still slower, but obviously
> not by such a big factor.

I see, so that's even without a call in between the scalar stores
and the vector load, as in the case we're trying to cover.  Could it
be that the two elementwise accesses execute too close to the two
scalar stores to benefit from any forwarding on the A72
micro-architecture?  Do you see a speedup when performing a vector
store instead of two scalar stores?
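
The kernel I imagine looks something like this (a sketch with
made-up names, not necessarily your actual benchmark):

  void f (double *restrict dst, double *restrict src,
          double x, double y, int n)
  {
    for (int i = 0; i < n; i += 2)
      {
        /* Two scalar stores ...  */
        src[i] = x;
        src[i + 1] = y;
        /* ... read back as one vector load, followed by a vector
           add and a vector store.  */
        dst[i] = src[i] + 1.0;
        dst[i + 1] = src[i + 1] + 1.0;
      }
  }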

> I can see the argument for penalising this in the cost model in
> a target-independent way, but I'm not sure we should change the
> vectorisation strategy.

OK.  We tried that, but I think the better idea is to split the
offending vector loads back into scalar loads when that is likely
faster, late during md-reorg, where we should have the best idea of
whether we'll actually hit a STLF fail.
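
Concretely, for a pair of doubles the split would look something
like this (an SSE2 sketch of the intended transform with a made-up
function name, not actual md-reorg code):

  #include <emmintrin.h>

  __m128d load_pair (const double *p)
  {
    /* Instead of one 16-byte load, which would have to forward
       from two preceding 8-byte stores ...
         return _mm_loadu_pd (p);
       ... do two 8-byte element loads, each of which can forward
       from a store of the same width, and combine them.  */
    __m128d lo = _mm_load_sd (p);     /* load element 0 */
    return _mm_loadh_pd (lo, p + 1);  /* insert element 1 */
  }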

Since that's also what Hongtao prefers, I'll drop this patch
for now.

Thanks,
Richard.

> Thanks,
> Richard
> 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Thanks,
> > Richard.
> >
> > 2022-03-25  Richard Biener  <rguent...@suse.de>
> >
> >     PR tree-optimization/101908
> >     * tree-vect-stmts.cc (get_group_load_store_type): Add
> >     heuristic to limit BB vectorization of function parameters.
> >
> >     * gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
> >  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
> >  2 files changed, 44 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > new file mode 100644
> > index 00000000000..b7534a18f0e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target vect_double } */
> > +
> > +struct vec3 {
> > +  double x, y, z;
> > +};
> > +
> > +struct ray {
> > +  struct vec3 orig, dir;
> > +};
> > +
> > +void ray_sphere (struct ray ray, double *res)
> > +{
> > +  res[0] = ray.orig.y * ray.dir.x;
> > +  res[1] = ray.orig.z * ray.dir.y;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index f7449a79d1c..1e37e9678b6 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> >    /* Stores can't yet have gaps.  */
> >    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
> >  
> > -  if (slp_node)
> > +  tree parm;
> > +  if (!loop_vinfo
> > +      && vls_type == VLS_LOAD
> > +      /* The access is based on a PARM_DECL.  */
> > +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> > +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> > +      && TREE_CODE (parm) == PARM_DECL
> > +      /* Likely passed on the stack.  */
> > +      && DECL_MODE (parm) == BLKmode
> > +      /* The access is in the first group.  */
> > +      && first_dr_info->group == 0)
> > +    {
> > +      /* When doing BB vectorization, force early loads from function
> > +         parameters passed on the stack, and thus stored recently, to
> > +         be done elementwise to avoid store-to-load forwarding
> > +         penalties.  Note this will cause vectorization of the load to
> > +         fail, for fear of underestimating the cost of elementwise
> > +         accesses; see the end of get_load_store_type.  We then
> > +         effectively do the same by handling the loads as external
> > +         input to the SLP.  */
> > +      if (dump_enabled_p ())
> > +   dump_printf_loc (MSG_NOTE, vect_location,
> > +                    "Not using vector loads from function parameter "
> > +                    "for fear of causing a STLF fail\n");
> > +      *memory_access_type = VMAT_ELEMENTWISE;
> > +    }
> > +  else if (slp_node)
> >      {
> >        /* For SLP vectorization we directly vectorize a subchain
> >      without permutation.  */
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)
