> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Thursday, July 18, 2024 10:00 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
> <richard.sandif...@arm.com>
> Subject: RE: [RFC][middle-end] SLP Early break and control flow support in GCC
> 
> On Wed, 17 Jul 2024, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguent...@suse.de>
> > > Sent: Tuesday, July 16, 2024 4:08 PM
> > > To: Tamar Christina <tamar.christ...@arm.com>
> > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
> > > <richard.sandif...@arm.com>
> > > Subject: Re: [RFC][middle-end] SLP Early break and control flow support 
> > > in GCC
> > >
> > > On Mon, 15 Jul 2024, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This RFC document covers at a high level how to extend early break 
> > > > support in
> > > > GCC to support SLP and how this will be extended in the future to 
> > > > support
> > > > full control flow in GCC.
> > > >
> > > > The basic idea in this is based on the paper "All You Need Is 
> > > > Superword-Level
> > > > Parallelism: Systematic Control-Flow Vectorization with SLP"[1] but it 
> > > > is
> > > > adapted to fit within the GCC vectorizer.
> > >
> > > An interesting read - I think the approach is viable for loop
> > > vectorization where we schedule the whole vectorized loop but difficult
> > > for basic-block vectorization as they seem to re-build the whole function
> > > from scratch.  They also do not address how to code-generate predicated
> > > not vectorized code or how they decide to handle mixed vector/non-vector
> > > code at all.  For example I don't think any current CPU architecture
> > > supports a full set of predicated _scalar_ ops and not every scalar
> > > op would have a vector equivalent in case one would use single-lane
> > > vectors.
> >
> > Hmm, I'm guessing you mean here that they don't address, for BB
> > vectorization, how to deal with the fact that you may not always be able
> > to vectorize the entire function up from the seed?  I thought today we
> > dealt with that by splitting during discovery?  Can we not do the same?
> > i.e. treat them as externals?
> 
> Currently not all scalar stmts are even reached by the seeds we use.
> They seem to simply rewrite the whole function into predicated form
> and I don't see how that is a transform that gets you back 1:1 the old
> code (or code of at least the same quality) if not all statements end
> up vectorized?
> 
> Sure we could build up single-lane SLP instances for "everything",
> but then how do you code-generate these predicated single-lane SLP
> instances?
> 
> > >
> > > For GCCs basic-block vectorization the main outstanding issue is one
> > > of scheduling and dependences with scalar code (live lane extracts,
> > > vector builds from scalars) as well.
> > >
> > > > What will not be covered is the support for First-Faulting Loads nor
> > > > alignment peeling as these are done by a different engineer.
> > > >
> > > > == Concept and Theory ==
> > > >
> > > > Supporting Early Break in SLP requires the same theory as general
> > > > control flow; the only difference is that Early break can be supported
> > > > for non-masked architectures while full control flow support requires
> > > > a fully masked architecture.
> > > >
> > > > In GCC 14 Early break was added for non-SLP by teaching the
> > > > vectorizer to branch to a scalar epilogue whenever an early exit is
> > > > taken.  This means the vectorizer itself doesn't need to know how to
> > > > perform side-effects or final reductions.
> > >
> > > With current GCC we avoid the need of predicated stmts by using the scalar
> > > epilog and a branch-on-all-true/false stmt.  To make that semantically
> > > valid stmts are moved around.
> > >
> > > > The additional benefit of this is that it works for any target
> > > > providing a vector cbranch optab implementation since we don't
> > > > require masking support.  In order for this to work we need to have
> > > > code motion that moves side effects (primarily stores) into the latch
> > > > basic block, i.e. we delay any side-effects up to a point where we
> > > > know the full vector iteration would have been performed.  For this
> > > > to work however we had to disable support for epilogue vectorization,
> > > > as when the scalar statements are moved we break the link to the
> > > > original BB they were in.  This means that the stmt_vinfo for the
> > > > stores that need to be moved will no longer be valid for epilogue
> > > > vectorization and the only way to recover this would be to perform a
> > > > full dataflow analysis again.  We decided against this as the plan of
> > > > record was to switch to SLP.
> > > >
> > > > -- Predicated SSA --
> > > >
> > > > The core of the proposal is to support a form of Predicated SSA
> > > > (PSSA) in the vectorizer [2].  The idea is to assign a control
> > > > predicate to every SSA statement.  This control predicate denotes the
> > > > statement's dependencies with respect to the BB it is in and defines
> > > > the dependencies between statements.
> > > >
> > > > As an example the following early break code:
> > > >
> > > > unsigned test4(unsigned x)
> > > > {
> > > >  unsigned ret = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    vect_b[i] = x + i;
> > > >    if (vect_a[i]*2 != x)
> > > >      break;
> > > >    vect_a[i] = x;
> > > >
> > > >  }
> > > >  return ret;
> > > > }
> > > >
> > > > is transformed into
> > > >
> > > > unsigned test4(unsigned x)
> > > > {
> > > >  unsigned ret = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    vect_b[i] = x + i;      :: !(vect_a[i]*2 != x)
> > >
> > > why's this not 'true'?
> > >
> >
> > I think there are three cases here:
> >
> > 1. control flow with no exit == true; in this case we
> >    can perform statements with side-effects in place as
> >    you never exit the loop early.
> > 2. early break non-masked; this one requires the store
> >    to be moved to the latch, or the block with the last
> >    early break.  This is what we do today, and the predicate
> >    would cause it to be moved.
> > 3. early break masked; in this case I think we need an exit
> >    block that performs any side effects masked, since on every
> >    exit you must still perform the stores, but based on the union
> >    of the masks you have so far.  The final one should still be moved
> >    to the latch block.
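To make case 3 a bit more concrete, here is a scalar sketch (my reading of the semantics; the function names `before_exit_mask` and `masked_store` are hypothetical, not vectorizer code): side-effects that precede the break are still performed for lanes up to and including the first exiting lane, i.e. under the union of the not-yet-exited lanes.

```c
#include <assert.h>
#include <stdbool.h>

/* Compute which lanes may still perform pre-break side-effects: a lane is
   active until some earlier lane has taken the exit.  The exiting lane
   itself still stores, since the store precedes the break in the source.  */
static void before_exit_mask (const bool *exit_cond, bool *mask, int lanes)
{
  bool exited = false;
  for (int i = 0; i < lanes; i++)
    {
      mask[i] = !exited;
      exited = exited || exit_cond[i];
    }
}

/* Perform the pending stores only for the active lanes.  */
static void masked_store (unsigned *dst, const unsigned *val,
                          const bool *mask, int lanes)
{
  for (int i = 0; i < lanes; i++)
    if (mask[i])
      dst[i] = val[i];
}
```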
> >
> > > >    if (vect_a[i]*2 != x)   :: true
> > > >      break;                :: vect_a[i]*2 != x
> > > >    vect_a[i] = x;          :: !(vect_a[i]*2 != x)
> > > >
> > > >  }
> > > >  return ret;
> > > > }
> > > >
> > > > A more complicated example:
> > > >
> > > > int foo (int n)
> > > > {
> > > >     int res = 0;
> > > >     for (int i = 0; i < n; i++)
> > > >       {
> > > >         y[i] = x[i] * 2;      :: true
> > > >         res += x[i] + y[i];   :: true
> > > >
> > > >         if (x[i] > 5)         :: true
> > > >           break;              :: x[i] > 5
> > > >
> > > >         if (z[i] > 5)         :: !(x[i] > 5)
> > > >           break;              :: !(x[i] > 5) && (z[i] > 5)
> > > >       }
> > > >
> > > >     return res;
> > > > }
> > > >
> > > > any statement guarded by a block with a non-true predicate can be
> > > > simplified, so the above can be turned into
> > > >
> > > > int foo (int n)
> > > > {
> > > >     int res = 0;
> > > >     for (int i = 0; i < n; i++)
> > > >       {
> > > >         y[i] = x[i] * 2;      :: true
> > > >         res += x[i] + y[i];   :: true
> > > >
> > > >         if (x[i] > 5)         :: true
> > > >           break;              :: x[i] > 5
> > > >
> > > >         if (z[i] > 5)         :: !(x[i] > 5)
> > > >           break;              :: (z[i] > 5)
> > > >       }
> > > >
> > > >     return res;
> > > > }
> > > >
> > > > if we make the predicates a *gimple, where true is just NULL;
> > >
> > > I'd make predicates a SSA name instead (or possibly abstract this
> > > so we can more easily change our minds) - the disadvantage is
> > > that gcond * doesn't have a SSA def.
> > >
> > > We might want to look into tree-if-conv.cc (remotely possibly into
> > > gimple-predicate-analysis.{h,cc}) since there are multiple places in
> > > GCC where we need to maintain a composition of individual predicates
> > > (and possibly compare / simplify such compositions).
> > >
> >
> > Aha, good shout.
> >
> > > > Lastly examples such as:
> > > >
> > > > unsigned test4(unsigned x, unsigned y)
> > > > {
> > > >  unsigned ret = 0;
> > > >  unsigned sum = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    vect_b[i] = x + i;
> > > >    if (vect_a[i] > x)
> > > >      return vect_a[i];
> > > >
> > > >   vect_b[i] += x + i;
> > > >   if (vect_a[i] > y)
> > > >       return sum;
> > > >   sum += vect_a[i];
> > > >   vect_a[i] = x;
> > > >  }
> > > >  return ret;
> > > > }
> > > >
> > > > are tagged as:
> > > >
> > > > unsigned test4(unsigned x, unsigned y)
> > > > {
> > > >  unsigned ret = 0;
> > > >  unsigned sum = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > >    if (vect_a[i] > x)    :: true
> > > >      return vect_a[i];   :: vect_a[i] > x
> > > >
> > > >   vect_b[i] += x + i;    :: !(vect_a[i] > y)
> > > >   if (vect_a[i] > y)     :: !(vect_a[i] > x)
> > > >       return sum;        :: vect_a[i] > y
> > > >   sum += vect_a[i];      :: !(vect_a[i] > y)
> > > >   vect_a[i] = x;         :: !(vect_a[i] > y)
> > > >  }
> > > >  return ret;
> > > > }
> > >
> > > I'll note you have all early-exit examples.  Whatever we build should
> > > ideally be able to handle cases we currently if-convert.
> > >
> >
> > Yeah, I think cases we currently if-convert and don't if-convert can be
> > handled approximately the same way.  Though we might want to have
> > some kind of lowering phase that transforms the block into conditional
> > statements when we can.  Probably in optimize_slp?
> 
> But isn't this handled by the rewrite to predicated form?  Of course
> you would need to handle PHIs then (with early exit we have/allow no
> in-loop PHIs), but those are simply COND_EXPRs (or blends when
> vectorized).
> 
> > > > This representation has a number of benefits:
> > > >
> > > > 1. code-motion becomes just SLP scheduling, where scheduling has to
> > > >    take care not to schedule blocks which have a data dependency
> > > >    across each other.  This becomes simpler because when we build the
> > > >    SLP tree we can merge blocks with the same control dependencies.
> > > > 2. this representation also helps for the case where we want to stay
> > > >    inside the vector loop, as the predicate tells us which mask we
> > > >    need to use to compute the reduction values.
> > > > 3. this representation is easily extendable to general control flow
> > > >    support since the statements will all be explicitly guarded by a
> > > >    predicate.  The PHI node can be given a special type, to indicate
> > > >    to codegen that a branch should be generated.
> > >
> > > Indeed.
> > >
> > > Now, we're for example currently missing the dependence on the loop
> > > mask computation for fully-masked loops as we're not representing its
> > > computation nor the implicit uses on SLP nodes.
> > >
> > > Each predicate would be a dependence on the predicate computation itself,
> > > is that predicate computation supposed to be a SLP node?
> >
> > This is where I was hoping to get some feedback into Richard S's plans.
> > My understanding is that he'll be working on BB SLP support for VLA and it
> > looks to me like we could use the same bookkeeping here.
> 
> I think you are talking about allowing a subset of active lanes from the
> real vector?  For BB SLP the static constant mask is kind-of implicit
> with the number of lanes in the SLP node, the main issue we have here
> currently is that we have to mask (or fill with safe values) the inactive
> lanes when we have a "gap in the vector".  But sure, we could trigger
> the masking logic itself by adding a all-ones mask (all lanes in the
> SLP node are active), but the main issue will be to support "gaps"
> at the end of vectors.
> 
> > My initial thought was to indeed just have a new node similar to how
> > TWO_OPERANDS is implemented.  Except instead of having a permute node
> > it would have a mask node.  This should also be usable when we're
> > building an SLP tree from statements where we could only build the tree
> > if masking is applied, i.e.
> >
> > a[i] = b[i]
> > a[i+1] = b[i+1] * 5
> >
> > should be SLP'able by masking out the second operand.
> >
> > I think an SLP node makes sense because we then don't need to keep a
> > separate CFG.
> 
> Good.
> 
> > >
> > > > conditions which read from the same sources can be merged (SLP'ed) if 
> > > > the
> > > > statements in the early exit block are the same.
> > > >
> > > > That is, the above can be handled as:
> > > >
> > > > unsigned test4(unsigned x, unsigned y)
> > > > {
> > > >  unsigned ret = 0;
> > > >  unsigned sum = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    if (vect_a[i] > x)    :: true
> > > >      return vect_a[i];   :: vect_a[i] > x
> > > >
> > > >    if (vect_a[i] > y)    :: !(vect_a[i] > x)
> > > >      return sum;         :: vect_a[i] > y
> > > >
> > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > >    vect_b[i] += x + i;   :: !(vect_a[i] > y)
> > > >    sum += vect_a[i];     :: !(vect_a[i] > y)
> > > >    vect_a[i] = x;        :: !(vect_a[i] > y)
> > > >  }
> > > >  return ret;
> > > > }
> > > >
> > > > But the following:
> > > >
> > > > unsigned test4(unsigned x, unsigned y)
> > > > {
> > > >  unsigned ret = 0;
> > > >  unsigned sum = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > >    if (vect_a[i] > x)    :: true
> > > >      break;              :: vect_a[i] > x
> > > >
> > > >   vect_b[i] += x + i;    :: !(vect_a[i] > y)
> > > >   if (vect_a[i] > y)     :: !(vect_a[i] > x)
> > > >      break;              :: vect_a[i] > y
> > > >   sum += vect_a[i];      :: !(vect_a[i] > y)
> > > >   vect_a[i] = x;         :: !(vect_a[i] > y)
> > > >  }
> > > >  return ret;
> > > > }
> > > >
> > > > becomes:
> > > >
> > > > unsigned test4(unsigned x, unsigned y)
> > > > {
> > > >  unsigned ret = 0;
> > > >  unsigned sum = 0;
> > > >  for (int i = 0; i < N; i++)
> > > >  {
> > > >    if (vect_a[i] > x || vect_a[i] > y)    :: true
> > > >      break;              :: vect_a[i] > x
> > > >
> > > >    vect_b[i] = x + i;    :: !(vect_a[i] > y)
> > > >    vect_b[i] += x + i;   :: !(vect_a[i] > y)
> > > >    sum += vect_a[i];     :: !(vect_a[i] > y)
> > > >    vect_a[i] = x;        :: !(vect_a[i] > y)
> > > >  }
> > > >  return ret;
> > > > }
> > > >
> > > > In addition for this to work the vectorizer must be changed to generate 
> > > > code
> > > > solely based on the SLP tree and not directly from the BB.
> > >
> > > Yep - see my comments on the paper above and about the existing issues
> > > with the BB vectorizer.  I'll remind myself that in future SLP build
> > > should start with a single-lane SLP node for each SSA def so we'd have
> > > SLP nodes for both the scalar and the vector part of a BB which might
> > > "solve" this if we chose to never "schedule" scalar SLP nodes
> > > (we currently tend to create cyclic dependences in some cases).
> > >
> > > > Note: the control predicates should be easy to determine through
> > > > the post-order dominance tree.
> > >
> > > The paper (and the talk) talks about loops and the need to represent
> > > them with a special "AST" node.  I think without doing that this restricts
> > > us to non-loopy CFGs (like no outer loop vectorization)?
> >
> > Yeah, the outer loop vectorization approach from the paper also looked
> > interesting.  The approach would theoretically allow for loop merging as
> > well.  Though I think that requires you to try BB SLP before loop
> > vectorization.
> 
> Note in the talk (I didn't read the paper thoroughly) they "cheat" and
> perform loop fusion after rewriting into conditional IL and only then
> perform SLP rather than doing the fusion on-demand.
> 
> As said I'd first see predication as a way to get rid of loop
> if-conversion, merging that with the vectorizer itself, and of course
> to get early break supported with SLP.  I'd look at expanding BB
> vectorization later - similar for the idea of rewriting SLP discovery
> (which predication will likely require to some extent).  I'm not sure
> we want to go fully no-CFG, I'd rather have some SLP nodes "tied" to
> a spot in the CFG for now and get "CFG aware SLP scheduling" this way.

I guess the problem here is that this way we won't be able to "SLP" the CFG.
At the moment, in GIMPLE compound expressions get broken up into multiple
exits.  So if (a[i] != 0 || a[i+1] != 0) return; becomes two exits.

The idea with going CFG-less is that we should be able to SLP the control
flow.  We should also be able to re-arrange the order of exits to e.g. test
the cheapest one first.  So I think I agree that at least for version 0 we
should stick with requiring the CFG; going without an explicit CFG is
probably better left for future versions?
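To make the exit-merging concrete, here's a scalar sketch (plain C, illustrative only; the function names are hypothetical) of the two-exit form GIMPLE produces for the compound condition versus the merged single-test form we'd like SLP to recover:

```c
#include <assert.h>
#include <stddef.h>

/* Two-exit form: how GIMPLE lowers if (a[i] != 0 || a[i+1] != 0),
   each disjunct becoming its own exit.  */
static ptrdiff_t find_two_exits (const int *a, ptrdiff_t n)
{
  for (ptrdiff_t i = 0; i + 1 < n; i += 2)
    {
      if (a[i] != 0)      /* first exit */
        return i;
      if (a[i + 1] != 0)  /* second exit */
        return i;
    }
  return -1;
}

/* Merged form: one combined test per iteration, the shape a single
   SLP'ed vector compare + cbranch would take.  The exits could also be
   re-ordered here, e.g. to test the cheaper condition first.  */
static ptrdiff_t find_merged (const int *a, ptrdiff_t n)
{
  for (ptrdiff_t i = 0; i + 1 < n; i += 2)
    if (a[i] != 0 || a[i + 1] != 0)
      return i;
  return -1;
}
```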

> For example dependence analysis relies on a vector load to be emitted
> at the earliest scalar load position of the set of loads vectorized
> and for the store to be emitted at the latest scalar store position
> of the set of stores vectorized.  Without also representing memory
> data dependences in the SLP graph we have to preserve that
> (you'd have to add SLP edges for virtual operands).  The paper doesn't
> really cover this detail very explicitly AFAICS (how memory dependences
> are preserved when re-writing the IL).

Yeah, I noticed this gap as well; they don't seem to handle statements with
side-effects.  So for that I had extended it with my own scheme.  I believe
this essentially just requires different predicates for statements with
side-effects.

At least on paper it works out, so it's a simple extension of the idea.
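As a sanity check of that scheme, here is a scalar sketch (illustrative only, not the vectorizer's actual code) of the earlier test4 loop with the pre-exit store delayed past the break and the breaking iteration's store redone by a "scalar epilogue", matching the original's observable side-effects:

```c
#include <assert.h>

#define N 8
static unsigned vect_a[N];
static unsigned vect_b[N];

/* Original loop from the RFC: the store to vect_b happens before the
   early exit, so the breaking iteration performs it too.  */
static void original (unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] * 2 != x)
        break;
      vect_a[i] = x;
    }
}

/* Transformed form: the side-effect is delayed until we know the
   iteration doesn't exit, and a "scalar epilogue" redoes the breaking
   iteration's pre-exit store.  */
static void transformed (unsigned x)
{
  int i;
  for (i = 0; i < N; i++)
    {
      unsigned b = x + i;       /* compute, but delay the store ...     */
      if (vect_a[i] * 2 != x)
        break;                  /* ... past the early exit              */
      vect_b[i] = b;
      vect_a[i] = x;
    }
  if (i < N)
    vect_b[i] = x + i;          /* epilogue redoes the breaking
                                   iteration's pre-exit side-effect     */
}
```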

> 
> >
> > >
> > > > == Implementation ==
> > > >
> > > > The high-level steps for this implementation, in the order I'll
> > > > work on them, are:
> > > >
> > > > 1. Detach SLP codegen from BB
> > > > 2. Teach SLP build to assign control predicates to relevant
> > > >    statements.  It might be easier to do this during loop analysis,
> > > >    but I don't think that'll work for BB SLP.  Thoughts?
> > > > 3. Teach optimize SLP to merge blocks with similar control predicates
> > > > 4. Teach SLP scheduling to move blocks
> > > > 5. Teach vectorizable_early_break to produce new CFG
> > >
> > > Let me ask you to start with 0. the minimum viable "hack" to make
> > > SLP of early break work.  For 1. to 5. I'd first concentrate on loop SLP
> > > vectorization because there "detached codegen" is trivial (ha, joking
> > > of course).  I'd probably go as far as to re-do SLP discovery between
> > > 0. and tackling 1. - SLP discovery for loops, that is.
> >
> > Sure, that's fair enough.  I have some Arm specific patches to get out 
> > first,
> > which I'll do this week and can start on this next Tuesday.
> >
> > I just wanted to outline what I'm thinking for the longer term plan.
> 
> Sounds good.  Just to re-iterate, my plan is to separate SLP discovery
> for loop vectorization and BB vectorization - I'll motivate that
> with some of the SLP-only patches that are queued until after I have
> managed to make myself happy with the load/store permute handling ...
> 
> > >
> > > I think for 0. we want to be able to attach a [set of] predicates to
> > > each SLP node.  With that we can represent loop masking and early
> > > break conditions.  Those [set of] predicates would be an alternate
> > > unsorted vector of "SLP children" and the SLP children would be
> > > the predicate computation SLP nodes.  We'd have the chance to
> > > combine the set of predicates down to one by adding intermediate
> > > merging SLP nodes (in the end code generation just accepts one
> > > predicate but discovery for simplicity(?) would add many, or two).
> > >
> >
> > So if I understood correctly, during discovery do we maintain breaks
> > in the same instance or do we represent them using these merge nodes
> > which are essentially a sort of PHI node?  So they represent your branch
> > in control flow and the predicates the jump locations?
> 
> I think stmts after a break will pick up the predicate SLP nodes from
> the break but yes, the break stmt itself (the cbranch) would be the
> seed of a separate SLP instance.  Caching of the predicate SLP
> nodes should probably be tied to the basic-block (similar to how
> if-conversion does this).  Whether to build SLP nodes to "and"
> two exit predicates or whether to record them in a vector of
> and_predicates in each SLP node is probably something that experiments
> will tell - having a representation of predicates not "lowered" to
> SLP operations is probably more convenient for simplification and/or
> merging with loop mask predicates, but for codegen and costing having
> the actual predicate merging ops spelled out is preferred.
> 
> > > For early break and without 1. we'd have to "fix" the position of
> > > some SLP nodes in the CFG, namely the "cbranch" SLP node (maybe
> > > that's just a predicate compute node of special kind).  I think
> > > I mentioned adding an optional gimple_stmt_iterator to SLP nodes
> > > to support this (or special-case the cbranch node during scheduling
> > > by looking at the gcond * that's inevitably there).  The loop
> > > mask compute would get a gsi_after_labels for example.
> > >
> > > In principle if we have this as 0. then 1. should become just
> > > an engineering exercise.
> >
> > OK, and so if I understand correctly, code motion becomes a matter of
> > pushing nodes down the instance until the merge predicate simplifies?
> >
> > Or do we keep them in place, and just materialize them in the correct
> > BB?
> >
> > I imagine that for 0 not doing 1 would be easier as we can just re-use
> > the scalar BB as is today and not have to go down the path of being
> > able to generate new CFG yet.
> 
> Yes, as said I'd first do only 0 and record fixed scheduling points
> for SLP nodes where that's necessary (like for the cbranch node).
> It might be that the code motion that's required for correctness(!)
> happens "magically", but I'm not sure we want to rely on that ;)
> ISTR we only push stmts down after the exits, so predicating them
> with the in-loop block predicate should get them scheduled appropriately?
> 
> > >
> > > I'm trying to make you focus on 0. as I'm really eager to make
> > > the only-SLP transition this stage1 and doing all the neat stuff
> > > will take longer.
> >
> > That is fair, I'll move it up my priority list. I'm hoping to get as much of
> > this done during this stage-1 (spent a lot of time thinking about it).
> >
> > So I'll get my AArch64 patches out and start on 0.  I'll continue on
> > IV opts in the background.
> 
> Thanks a lot.
> 
> > >
> > > > == Switch vectorization support ==
> > > >
> > > > The following code
> > > >
> > > > short a[100];
> > > >
> > > > int foo(int n, int counter)
> > > > {
> > > >    for (int i = 0; i < n; i++)
> > > >      {
> > > >         if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
> > > >           return 1;
> > > >      }
> > > >     return 0;
> > > > }
> > > >
> > > > fails to vectorize at -O3 -march=armv9-a because in GIMPLE the if is 
> > > > rewritten
> > > > into a switch statement:
> > > >
> > > >   <bb 2> [local count: 114863530]:
> > > >   if (n_6(D) > 0)
> > > >     goto <bb 6>; [94.50%]
> > > >   else
> > > >     goto <bb 11>; [5.50%]
> > > >
> > > >   <bb 6> [local count: 108546036]:
> > > >
> > > >   <bb 3> [local count: 1014686025]:
> > > >   # i_3 = PHI <i_8(7), 0(6)>
> > > >   _1 = a[i_3];
> > > >   switch (_1) <default: <L9> [94.50%], case 1 ... 2: <L11> [5.50%],
> > > >     case 4: <L11> [5.50%], case 7: <L11> [5.50%]>
> > > >
> > > >   <bb 9> [local count: 55807731]:
> > > > <L11>:
> > > >   goto <bb 5>; [100.00%]
> > > >
> > > >   <bb 4> [local count: 958878295]:
> > > > <L9>:
> > > >   i_8 = i_3 + 1;
> > > >   if (n_6(D) > i_8)
> > > >     goto <bb 7>; [94.50%]
> > > >   else
> > > >     goto <bb 12>; [5.50%]
> > > >
> > > >   <bb 12> [local count: 52738306]:
> > > >   goto <bb 8>; [100.00%]
> > > >
> > > >   <bb 7> [local count: 906139989]:
> > > >   goto <bb 3>; [100.00%]
> > > >
> > > >   <bb 11> [local count: 6317494]:
> > > >
> > > >   <bb 8> [local count: 59055800]:
> > > >
> > > >   <bb 5> [local count: 114863531]:
> > > >   # _5 = PHI <1(9), 0(8)>
> > > > <L10>:
> > > >   return _5;
> > > >
> > > > However such switch statements, where all the entries lead to the
> > > > same destination, are easy to vectorize.  In SVE we have the MATCH
> > > > instruction that can be used here, and for other targets we can
> > > > duplicate the constants and lower the switch to a series of compares
> > > > and ORRs.
> > > >
> > > > This is similar to what's done when the values don't fit inside a
> > > > switch:
> > > >
> > > > short a[100];
> > > >
> > > > int foo(int n, int counter, short x, short b, short c)
> > > > {
> > > >    for (int i = 0; i < n; i++)
> > > >      {
> > > >         if (a[i] == x || a[i] == b || a[i] == 7 || a[i] == c)
> > > >           return 1;
> > > >      }
> > > >     return 0;
> > > > }
> > > >
> > > > is vectorized as:
> > > >
> > > > .L4:
> > > >         incb    x5
> > > >         incw    z30.s, all, mul #2
> > > >         cmp     w6, w1
> > > >         bcc     .L15
> > > > .L6:
> > > >         ld1h    z31.h, p7/z, [x5]
> > > >         cmpeq   p15.h, p7/z, z31.h, z27.h
> > > >         cmpeq   p11.h, p7/z, z31.h, z28.h
> > > >         cmpeq   p14.h, p7/z, z31.h, #7
> > > >         cmpeq   p12.h, p7/z, z31.h, z29.h
> > > >         orr     p15.b, p7/z, p15.b, p11.b
> > > >         orr     p14.b, p7/z, p14.b, p12.b
> > > >         inch    x1
> > > >         orr     p15.b, p7/z, p15.b, p14.b
> > > >         ptest   p13, p15.b
> > > >         b.none  .L4
> > > >         umov    w1, v30.s[0]
> > > > .L3:
> > > >         sxtw    x1, w1
> > > >         b       .L7
> > > >
> > > > which is great!  But it should be an SVE MATCH instruction as well.
> > > >
> > > > This kind of code shows up through the use of std::find_if as well:
> > > >
> > > > #include <bits/stdc++.h>
> > > > using namespace std;
> > > >
> > > > bool pred(int i) { return i == 1 || i == 2 || i == 7 || i == 4; }
> > > >
> > > > int foo(vector<int> vec)
> > > > {
> > > >     vector<int>::iterator it;
> > > >
> > > >     it = find_if(vec.begin(), vec.end(), pred);
> > > >
> > > >     return *it;
> > > > }
> > > >
> > > > and once the unrolled loop is gone we should be able to vectorize it.
> > > >
> > > > My proposal is to create two new IFNs and optabs:
> > > >
> > > > IFN_MATCH, IFN_NMATCH, and have a vectorizer pattern recognize the
> > > > ORR reduction cases into MATCH and have ifcvt lower the switch into
> > > > ORR statements.
> > > >
> > > > Such switch statements only have two branches and so are identical
> > > > to cbranch for codegen and vectorization.
> > > >
> > > > The unrolled ORRs are replaced by MATCH and so the gimple becomes:
> > > >
> > > > a = .MATCH (...)
> > > > if (a != {0,..})
> > > >
> > > > and so no other change is needed for codegen.
> > >
> > > This sounds separate from the rest, I think teaching if-conversion to
> > > lower (small, or simple by some measure) switches is the way to go
> > > for now.  There's existing code to produce ifs from switch and those
> > > can be if-converted in the classical way (in the if (vectorized) loop
> > > copy only, of course).
> >
> > Ack, it was just something we noticed in some cases which blocked 
> > vectorization
> > which seemed easy to address 😊
> >
> > >
> > > An alternative with IFN_MATCH sounds good in principle.
> > >
> > > > == Better cbranch support ==
> > > >
> > > > At the moment, one of the big downsides of re-using the existing
> > > > cbranch is that in the target we can't tell whether the result of
> > > > the cbranch is actually needed or not.
> > > >
> > > > i.e. for SVE we can't tell if the predicate is needed.  For the
> > > > cases where we don't stay inside the vector loop we can generate
> > > > more efficient code if we know that the loop only cares about any or
> > > > all bits set and doesn't need to know which one.
> > > >
> > > > For this reason I propose adding new optabs cbranch_any_ and
> > > > cbranch_all_ and have emit_cmp_and_jump_insns lower to these when
> > > > appropriate.
> > >
> > > Hmm, but isn't this then more a vec_cmpeq_any that produces a scalar
> > > rather than a vector and then a regular scalar compare-and-jump?  That is,
> > > does SVE have such a compare instruction?  Can you show the current
> > > code-gen and how the improved one would look like?
> >
> > Ah, so the point here is that we don't need the scalar results; the
> > optabs would only care about the CC flags.
> >
> > Consider the following loop:
> >
> > #ifndef N
> > #define N 800
> > #endif
> > unsigned vect_a[N];
> > unsigned vect_b[N];
> >
> > unsigned test4(unsigned x)
> > {
> >  unsigned ret = 0;
> >  for (int i = 0; i < N; i++)
> >  {
> >    vect_b[i] = x + i;
> >    if (vect_a[i]*2 != x)
> >      break;
> >    vect_a[i] = x;
> >
> >  }
> >  return ret;
> > }
> >
> > For early break, where we branch to scalar code, we don't care about
> > the result of the mask, as it's unused:
> >
> > so in gimple we have:
> >
> >   mask_patt_6.15_69 = vect_cst__60 != vect__4.14_67;
> >   vec_mask_and_72 = loop_mask_64 & mask_patt_6.15_69;
> >   if (vec_mask_and_72 != { 0, ... })
> >
> > which generates:
> >
> >         cmpne   p14.s, p7/z, z30.s, z29.s
> 
> p7 is the loop mask here?  p14.s the output of the masked compare?
> 
> >         ptest   p15, p14.b
> 
> and p15 is all-zero?

p15 here is all true; in SVE cbranch emits a ptest, and in this case a
ptrue is fine since we only care about 0 and != 0.

I'll explain more below.

> 
> >         b.none  .L3
> >
> > notice the ptest.  In SVE, if the size of a predicate of an operation
> > does not match that of the governing predicate, a ptest is required to
> > reshape the flags.
> >
> > However, for early breaks all we care about is any or all.  So the
> > exact shape of the predicate register doesn't matter, and since vector
> > compares already set flags we already have the answer we need.  So the
> > above should be:
> >
> >         cmpne   p14.s, p7/z, z30.s, z29.s
> >         b.none  .L3
> 
> Hmm, I don't really see anything missing on the GIMPLE side - doesn't
> the required info all translate to RTL and thus instruction combine
> should be able to handle this?

For SVE only we could potentially do it, but for using SVE instructions for
Adv. SIMD we can't.  Adv. SIMD requires us to compress the vector results
into an integer and compare that with 0 to get the CC flags.  SVE doesn't
have this problem, because in SVE a vector compare sets flags.  So in this
case, we don't care about p14 at all.

But if we, say, want to use an SVE compare to replace an Adv. SIMD compare,
the sequence of instructions will be longer than what combine can match.
And it becomes very hairy: I've tried it before, and because of the various
combinations of compares (the vector and the flag ones) it ends up with a
lot of complicated patterns.
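For reference, here is a scalar model (4 lanes is an illustrative assumption; the function name is hypothetical) of what the Adv. SIMD cbranch expansion has to do: reduce the lane-wise compare results to one integer and compare that against zero, which is exactly the extra work an SVE flag-setting compare avoids.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the Adv. SIMD any-lane check: the per-lane compare masks must
   be reduced to a single integer and that integer compared with zero.
   An SVE compare sets the NZCV flags directly, needing neither step.  */
static int any_lane_nonzero (const uint32_t mask[4])
{
  uint32_t acc = 0;
  for (int i = 0; i < 4; i++)
    acc |= mask[i];   /* models the umaxp/addp-style lane reduction */
  return acc != 0;    /* models the scalar compare-and-branch       */
}
```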

> 
> > This makes a big difference in loops for performance.  We have a small
> > CC optimization pass that tries to optimize such cases:
> > https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5a783f42d77b2f00a1ed171c119b020e8df8e521
> >
> > But this case is hard to detect as you need to know that the predicate
> > is not used and we only care about any/all.  This is information we
> > have in GIMPLE so we can more readily do it in expand.
> 
> But the ptest p15, p14.b is the same as vec_mask_and_72 != { 0, ... }
> so I don't see where information is missing?  I probably don't
> understand the "shape" thing.

So for SVE, the early-break uses of the compares don't care about the mask
value; that is, we don't care about the predicate result of
vec_mask_and_72 != { 0, ... }.

That is because for SVE a vector compare sets the CC flags, indicating one of
three conditions:
1. none  == the result predicate is all false
2. first == the first element in the predicate is true
3. !last == the last element in the predicate is not true

This lets early break branch on the CC directly, since we only care about
none vs. !none, which is abstracted by the pseudo branch instructions b.any
and b.none.

The ptest is used to reshape the CC into the correct form when the size of
the governing predicate doesn't match that of the compare result.

For none and any this doesn't matter, but it changes the meaning of first and
!last.  Because early break only requires none and any, the ptest is never
needed.
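The three conditions can be modelled in scalar code roughly as follows (a
sketch that ignores the governing predicate and element sizes, which is
exactly what makes first/!last sensitive to predicate shape on the real ISA
while none/any are not):

```c
#include <stdint.h>

/* Scalar sketch of the three CC conditions an SVE predicated compare
   produces.  none (and its negation, any) depends only on whether any
   lane of the result predicate is set, so it is insensitive to how the
   predicate is "shaped"; first and !last inspect specific lane
   positions and therefore are not.  */
struct sve_cc { int none, first, not_last; };

static struct sve_cc cc_from_pred(const uint8_t pred[], int nelts)
{
    struct sve_cc cc = { 1, 0, 0 };
    for (int i = 0; i < nelts; i++)
        if (pred[i])
            cc.none = 0;                  /* some lane is true */
    cc.first    = pred[0] != 0;           /* first lane true */
    cc.not_last = pred[nelts - 1] == 0;   /* last lane not true */
    return cc;
}
```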

Now for Adv. SIMD: when an SVE instruction can be used in place of an
Adv. SIMD sequence, we typically deal with it at expand time.  But for early
break that's not possible, because cbranch doesn't see the compare.  So we
are forced to expand to the Adv. SIMD equivalent and later try to match it up
with an SVE instruction, which results in more instructions than combine can
handle.  The new optab would let us do this easily, as we never have to emit
the Adv. SIMD code first.

And for SVE it also tells us that the result value of the comparison is not
needed; i.e. it conveys that all we care about is branching based on a
compare, and how you do it is up to you.
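A scalar sketch of that contract (hypothetical model, not the real optab
interface or its name): the operation compares two vectors and reports only
whether any lane matched, without ever producing a mask value for later use.
The target is then free to implement it with an SVE flag-setting compare, an
Adv. SIMD reduction, or anything else:

```c
#include <stdint.h>

/* Hypothetical scalar model of a compare-and-branch operation: the
   only observable result is the branch decision (any lane differs /
   no lane differs); no mask vector escapes.  */
static int cbranch_any_ne(const uint32_t a[], const uint32_t b[], int n)
{
    for (int i = 0; i < n; i++)
        if (a[i] != b[i])
            return 1;   /* take the early-break edge */
    return 0;           /* fall through */
}
```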

Cheers,
Tamar

> 
> Richard.
> 
> > The additional benefit of the optab is that it allows us to more easily
> > use SVE instructions to do the flag setting when we choose Adv. SIMD but
> > SVE is available.  Since we know we don't need to materialize an
> > Adv. SIMD boolean vector, we can avoid materializing the conversion from
> > an SVE predicate to an Adv. SIMD mask vector.
> >
> > Cheers,
> > Tamar
> >
> > >
> > > > Are the general idea and steps OK?
> > >
> > > See above.  Thanks for the write-up.
> > >
> > > Richard.
> > >
> > > > If so I'll start implementation now.
> > > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > [1] Yishen Chen, Charith Mendis, and Saman Amarasinghe. 2022. All
> > > > You Need Is Superword-Level Parallelism: Systematic Control-Flow
> > > > Vectorization with SLP. In Proceedings of the 43rd ACM SIGPLAN
> > > > International Conference on Programming Language Design and
> > > > Implementation (PLDI '22), June 13-17, 2022, San Diego, CA, USA.
> > > > ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3519939.
> > > > 3523701
> > > >
> > > > [2] Predicated Static Single Assignment
> > > >
> > > > Lori Carter Beth Simon Brad Calder Larry Carter Jeanne Ferrante
> > > > Department of Computer Science and Engineering
> > > > University of California, San Diego
> > > > flcarter,esimon,calder,carter,ferran...@cs.ucsd.edu
> > > >
> > >
> > > --
> > > Richard Biener <rguent...@suse.de>
> > > SUSE Software Solutions Germany GmbH,
> > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> Nuernberg)
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
