Re: [RFC PATCH] Do not cache SLP trees across SLP instances [PR116352]

Richard Biener Fri, 30 Aug 2024 04:25:05 -0700

On Fri, 30 Aug 2024, Manolis Tsamis wrote:

> For more context please see
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116352#c11.
> 
> I've sent this as an RFC because it's hard to be sure that this was
> just uncovered by r15-2820-gab18785840d7b8 rather than being
> introduced by it.
> Still, to the best of my knowledge after reading through the code,
> this looked like a general issue to me and hence I'm proposing this
> fix.


It's a general issue indeed.  Note we do want to (re-)perform CSE
even between SLP instances - we're CSEing only internal defs,
not vect_{constant,external}_def at the moment.

The reason is that we want to put two instances accessing the
same stmts into the same partition (I think that would happen
also without the CSE, they'd just not share vectors).

Now - in this case we're indirectly CSEing vect_{constant,external}_def
given we have

  _119 = {j_160, _75, i_159, _73};
  _80 = VEC_PERM_EXPR <_119, _119, { 1, 1, 3, 3 }>;

and

 _21 = {j_132, _75, i_131, _73};
     = VEC_PERM_EXPR <_21, _21, { 1, 1, 3, 3 }>;

CSE doesn't honor scheduling constraints - it relies on SSA defs
being available (for the scalar lanes) but here, while the scalar
lanes are available the def they are picked from are not.

So in general I don't agree with the patch - it's a workaround only.

Note we can't reject this at the point we want to schedule, and we
do currently have no separate check of whether there is a valid
schedule - instead we rely on this by construction which means you
have a point in CSE violating this.

One way to avoid this in CSE would be to not allow CSEing of
non-bijective permutes like { 1, 1, 3, 3 } that do not select all
of the input lanes (not selecting a lane that has a NULL
scalar def would be OK I think).

The other obvious thing in this particular case is that the
permutes could all have been vector constructors, so instead of

  _119 = {j_160, _75, i_159, _73};
  _80 = VEC_PERM_EXPR <_119, _119, { 1, 1, 3, 3 }>;
  _150 = VEC_PERM_EXPR <_119, _119, { 0, 0, 2, 2 }>;

we'd have

  _80 = { _75, _75, _73, _73 };
  _150 = { j_160, j_160, i_159, i_159 };

I think that would be a profitable simplification but of course
it doesn't address the general issue.  I'd have expected that
the SLP permute optimization would eventually do this but I
think it only ever considers bijective permutes.

Richard.


> Thanks,
> Manolis
> 
> On Fri, Aug 30, 2024 at 1:46 PM Manolis Tsamis <manolis.tsa...@vrull.eu> 
> wrote:
> >
> > Currently we do SLP CSE after permute optimization using a single map across
> > all SLP instances. These SLP instances can affect many different basic 
> > blocks
> > and the cache may replace a statement in one block with one from another 
> > block.
> > Because there are no further limitations and the blocks may be visited in an
> > arbitrary order, this may lead to a statement being used in paths that may 
> > not
> > be defined. This patch creates one map per SLP instance to address that.
> >
> >         PR tree-optimization/116352
> >
> > gcc/ChangeLog:
> >
> >         * tree-vect-slp.cc (vect_optimize_slp): Use one scalar stmts to tree
> >         map per SLP instance.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.dg/pr116352.c: New test.
> >
> > Signed-off-by: Manolis Tsamis <manolis.tsa...@vrull.eu>
> > ---
> >
> >  gcc/testsuite/gcc.dg/pr116352.c | 18 ++++++++++++++++++
> >  gcc/tree-vect-slp.cc            | 12 +++++++-----
> >  2 files changed, 25 insertions(+), 5 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/pr116352.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/pr116352.c 
> > b/gcc/testsuite/gcc.dg/pr116352.c
> > new file mode 100644
> > index 00000000000..c427eff3c08
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr116352.c
> > @@ -0,0 +1,18 @@
> > +/* PR tree-optimization/116352 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O3 -fchecking" } */
> > +
> > +int a;
> > +float b, c;
> > +void l(int h, int f, int g, float *e)
> > +{
> > +  for (int m = 0; m < h; m++)
> > +  {
> > +    float i = 2 * b, j = 2 * c;
> > +    if (a) {
> > +      e[m*4 + 0] = e[m*4 + 1] = (j - g * 0.5f);
> > +      e[m*4 + 2] = e[m*4 + 3] = (i + f * 0.5f);
> > +    } else
> > +      e[m*4 + 0] = f * 0.5f + g * 0.5f;
> > +  }
> > +}
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index cfdf59ad386..a8836c2fb04 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -6356,13 +6356,15 @@ vect_optimize_slp (vec_info *vinfo)
> >    vect_optimize_slp_pass (vinfo).run ();
> >
> >    /* Apply CSE again to nodes after permute optimization.  */
> > -  scalar_stmts_to_slp_tree_map_t *bst_map
> > -    = new scalar_stmts_to_slp_tree_map_t ();
> > -
> >    for (auto inst : vinfo->slp_instances)
> > -    vect_cse_slp_nodes (bst_map, SLP_INSTANCE_TREE (inst));
> > +    {
> > +      scalar_stmts_to_slp_tree_map_t *bst_map
> > +       = new scalar_stmts_to_slp_tree_map_t ();
> >
> > -  release_scalar_stmts_to_slp_tree_map (bst_map);
> > +      vect_cse_slp_nodes (bst_map, SLP_INSTANCE_TREE (inst));
> > +
> > +      release_scalar_stmts_to_slp_tree_map (bst_map);
> > +    }
> >  }
> >
> >  /* Gather loads reachable from the individual SLP graph entries.  */
> > --
> > 2.34.1
> >
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Re: [RFC PATCH] Do not cache SLP trees across SLP instances [PR116352]

Reply via email to