Hi All,

This RFC document covers at a high level how to extend early break support in
GCC to support SLP and how this will be extended in the future to support
full control flow in GCC.

The basic idea in this is based on the paper "All You Need Is Superword-Level
Parallelism: Systematic Control-Flow Vectorization with SLP"[1] but it is
adapted to fit within the GCC vectorizer.

What will not be covered is the support for First-Faulting Loads nor alignment
peeling as these are done by a different engineer.

== Concept and Theory ==

Supporting Early Break in SLP requires the same theory as general control flow,
the only difference is in that Early break can be supported for non-masked
architectures while full control flow support requires a fully masked
architecture.

In GCC 14 Early break was added for non-SLP by teaching the vectorizer to branch
to a scalar epilogue whenever an early exit is taken.  This means the vectorizer
itself doesn't need to know how to perform side-effects or final reductions.

The additional benefit of this is that it works for any target providing a
vector cbranch optab implementation since we don't require masking support.  In
order for this to work we need to have code motion that moves side effects
(primarily stores) into the latch basic block.  i.e. we delay any side-effects
up to a point where we know the full vector iteration would have been performed.
For this to work however we had to disable support for epilogue vectorization as
when the scalar statements are moved we break the link to the original BB they
were in.  This means that the stmt_vinfo for the stores that need to be moved
will no longer be valid for epilogue vectorization and the only way to recover
this would be to perform a full dataflow analysis again.  We decided against
this as the plan of record was to switch to SLP.

-- Predicated SSA --

The core of the proposal is to support a form of Predicated SSA (PSSA) in the
vectorizer [2]. The idea is to assign a control predicate to every SSA
statement.  This control predicate denotes their dependencies wrt to the BB they
are in and defines the dependencies between statements.

As an example the following early break code:

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i]*2 != x)
     break;
   vect_a[i] = x;
 
 }
 return ret;
}

is transformed into

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;      :: !(vect_a[i]*2 != x)
   if (vect_a[i]*2 != x)   :: true
     break;                :: vect_a[i]*2 != x
   vect_a[i] = x;          :: !(vect_a[i]*2 != x)
 
 }
 return ret;
}

A more complicated example:

int foo (int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
      {
        y[i] = x[i] * 2;      :: true
        res += x[i] + y[i];   :: true
 
        if (x[i] > 5)         :: true
          break;              :: x[i] > 5
 
        if (z[i] > 5)         :: !(x[i] > 5)
          break;              :: !(x[i] > 5) && (z[i] > 5)
      }
 
    return res;
}

any statement guarded by a block with a non-true predicate can be simplified, so
the above can be turned into

int foo (int n)
{
    int res = 0;
    for (int i = 0; i < n; i++)
      {
        y[i] = x[i] * 2;      :: true
        res += x[i] + y[i];   :: true
 
        if (x[i] > 5)         :: true
          break;              :: x[i] > 5
 
        if (z[i] > 5)         :: !(x[i] > 5)
          break;              :: (z[i] > 5)
      }
 
    return res;
}

if we make the predicates a *gimple, where true is just NULL;

Lastly examples such as:

unsigned test4(unsigned x, unsigned y)
{
 unsigned ret = 0;
 unsigned sum = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i] > x)
     return vect_a[i];
 
  vect_b[i] += x + i;
  if (vect_a[i] > y)
      return sum;
  sum += vect_a[i];
  vect_a[i] = x;
 }
 return ret;
}

are tagged as:

unsigned test4(unsigned x, unsigned y)
{
 unsigned ret = 0;
 unsigned sum = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;    :: !(vect_a[i] > y)
   if (vect_a[i] > x)    :: true
     return vect_a[i];   :: vect_a[i] > x
 
  vect_b[i] += x + i;    :: !(vect_a[i] > y)
  if (vect_a[i] > y)     :: !(vect_a[i] > x)
      return sum;        :: vect_a[i] > y
  sum += vect_a[i];      :: !(vect_a[i] > y)
  vect_a[i] = x;         :: !(vect_a[i] > y)
 }
 return ret;
}

This representation has a number of benefits:

1. code-motion becomes just SLP scheduling, where scheduling has to take account
   not to schedule blocks across eachother which have a data dependency. This
   becomes simpler when we build the SLP tree we can merge blocks with the same
   control dependencies.
2. this representation also helps for the case where we want to stay inside the
   vector loop, as the predicate tells us which mask we need to use to compute
   the reduction values.
3. this representation is easily extendable to general control flow support
   since the statements will all be explicitly guarded by  a predicate.  The PHI
   node can be given a special type, to indicate to codegen that a branch should
   be generated.

conditions which read from the same sources can be merged (SLP'ed) if the
statements in the early exit block are the same.

That is, the above can be handled as:

unsigned test4(unsigned x, unsigned y)
{
 unsigned ret = 0;
 unsigned sum = 0;
 for (int i = 0; i < N; i++)
 {
   if (vect_a[i] > x)    :: true
     return vect_a[i];   :: vect_a[i] > x
 
   if (vect_a[i] > y)    :: !(vect_a[i] > x)
     return sum;         :: vect_a[i] > y
 
   vect_b[i] = x + i;    :: !(vect_a[i] > y)
   vect_b[i] += x + i;   :: !(vect_a[i] > y)
   sum += vect_a[i];     :: !(vect_a[i] > y)
   vect_a[i] = x;        :: !(vect_a[i] > y)
 }
 return ret;
}

But the following:

unsigned test4(unsigned x, unsigned y)
{
 unsigned ret = 0;
 unsigned sum = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;    :: !(vect_a[i] > y)
   if (vect_a[i] > x)    :: true
     break;              :: vect_a[i] > x
 
  vect_b[i] += x + i;    :: !(vect_a[i] > y)
  if (vect_a[i] > y)     :: !(vect_a[i] > x)
     break;              :: vect_a[i] > y
  sum += vect_a[i];      :: !(vect_a[i] > y)
  vect_a[i] = x;         :: !(vect_a[i] > y)
 }
 return ret;
}

becomes:

unsigned test4(unsigned x, unsigned y)
{
 unsigned ret = 0;
 unsigned sum = 0;
 for (int i = 0; i < N; i++)
 {
   if (vect_a[i] > x || vect_a[i] > y)    :: true
     break;              :: vect_a[i] > x
 
   vect_b[i] = x + i;    :: !(vect_a[i] > y)
   vect_b[i] += x + i;   :: !(vect_a[i] > y)
   sum += vect_a[i];     :: !(vect_a[i] > y)
   vect_a[i] = x;        :: !(vect_a[i] > y)
 }
 return ret;
}

In addition for this to work the vectorizer must be changed to generate code
solely based on the SLP tree and not directly from the BB.

Note: that the the control predicates should be easy to determine through the
post-order dominance tree.

== Implementation ==

The high level steps for implementing this implementation and the order I'll
work is:

1. Detach SLP codegen from BB
2. Teach build SLP to assign control predicates to relevant statements. It might
   be easier to do this during loop analysis,  but I don't think that'll work
   for BB SLP. Thoughts?
3. Teach optimize SLP to merge blocks with similar control predicates
4. Teach SLP scheduling to move blocks
5. Teach vectorizable_early_break to produce new CFG

== Switch vectorization support ==

The following code

short a[100];
 
int foo(int n, int counter)
{
   for (int i = 0; i < n; i++)
     {
        if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
          return 1;
     }
    return 0;
}

fails to vectorize at -O3 -march=armv9-a because in GIMPLE the if is rewritten
into a switch statement:

  <bb 2> [local count: 114863530]:
  if (n_6(D) > 0)
    goto <bb 6>; [94.50%]
  else
    goto <bb 11>; [5.50%]
 
  <bb 6> [local count: 108546036]:
 
  <bb 3> [local count: 1014686025]:
  # i_3 = PHI <i_8(7), 0(6)>
  _1 = a[i_3];
  switch (_1) <default: <L9> [94.50%], case 1 ... 2: <L11> [5.50%], case 4: 
<L11> [5.50%], case 7: <L11> [5.50%]>
 
  <bb 9> [local count: 55807731]:
<L11>:
  goto <bb 5>; [100.00%]
 
  <bb 4> [local count: 958878295]:
<L9>:
  i_8 = i_3 + 1;
  if (n_6(D) > i_8)
    goto <bb 7>; [94.50%]
  else
    goto <bb 12>; [5.50%]
 
  <bb 12> [local count: 52738306]:
  goto <bb 8>; [100.00%]
 
  <bb 7> [local count: 906139989]:
  goto <bb 3>; [100.00%]
 
  <bb 11> [local count: 6317494]:
 
  <bb 8> [local count: 59055800]:
 
  <bb 5> [local count: 114863531]:
  # _5 = PHI <1(9), 0(8)>
<L10>:
  return _5;

However such switch statements, where all the entries lead to the same
destination are easy to vectorize.  In SVE we have the MATCH instruction that
can be used here, and for others we can duplicate the constants and lower the
switch to a series of compare and ORRs.

This is similar as what's done for when the values don't fit inside a switch:

short a[100];
 
int foo(int n, int counter, short x, short b, short c)
{
   for (int i = 0; i < n; i++)
     {
        if (a[i] == x || a[i] == b || a[i] == 7 || a[i] == c)
          return 1;
     }
    return 0;
}

is vectorized as:

.L4:
        incb    x5
        incw    z30.s, all, mul #2
        cmp     w6, w1
        bcc     .L15
.L6:
        ld1h    z31.h, p7/z, [x5]
        cmpeq   p15.h, p7/z, z31.h, z27.h
        cmpeq   p11.h, p7/z, z31.h, z28.h
        cmpeq   p14.h, p7/z, z31.h, #7
        cmpeq   p12.h, p7/z, z31.h, z29.h
        orr     p15.b, p7/z, p15.b, p11.b
        orr     p14.b, p7/z, p14.b, p12.b
        inch    x1
        orr     p15.b, p7/z, p15.b, p14.b
        ptest   p13, p15.b
        b.none  .L4
        umov    w1, v30.s[0]
.L3:
        sxtw    x1, w1
        b       .L7

which is great! but should be an SVE MATCH instruction as well.

This kind of code shows up through the use of std::find_if as well:

#include <bits/stdc++.h>
using namespace std;
  
bool pred(int i) { return i == 1 || i == 2 || i == 7 || i == 4; }
 
int foo(vector<int> vec)
{
    vector<int>::iterator it;
 
    it = find_if(vec.begin(), vec.end(), pred);
  
    return *it;
}

and once the unrolled loop is gone we should be able to vectorize it.

My proposal is to create two new IFNs and optabs:

IFN_MATCH, IFN_NMATCH and have a vectorizer pattern recognize the ORR reduction
cases into MATCH and have ifcvt lower the switch into ORR statements.

Such switch statements only have two branches and so are identical to cbranch
for codegen and vectorization.

The unrolled ORR are replace by match and so the gimple becomes:

a = .MATCH (...)
if (a != {0,..})
and so no other change is needed for codegen.

== Better cbranch support ==

At the moment, one of the big downsides of re-using the existing cbranch is that
in the target we can't tell whether the result of the cbranch is actually needed
or not.

i.e. for SVE we can't tell if the predicate is needed.  For the cases where we
don't stay inside the vector loop we can generate more efficient code if we know
that the loop only cares about any or all bits set and doesn't need to know
which one.

For this reason I propose adding new optabs cbranch_any_ and branch_all_ and
have emit_cmp_and_jump_insns lower to these when appropriate.

Are the general idea and steps OK?

If so I'll start implementation now.

Thanks,
Tamar

[1] Yishen Chen, Charith Mendis, and Saman Amarasinghe. 2022. All
You Need Is Superword-Level Parallelism: Systematic Control-Flow
Vectorization with SLP. In Proceedings of the 43rd ACM SIGPLAN
International Conference on Programming Language Design and
Implementation (PLDI '22), June 13?17, 2022, San Diego, CA, USA.
ACM, NewYork, NY, USA, 15 pages. https://doi.org/10.1145/3519939.
3523701

[2] Predicated Static Single Assignment

Lori Carter Beth Simon Brad Calder Larry Carter Jeanne Ferrante
Department of Computer Science and Engineering
University of California, San Diego
flcarter,esimon,calder,carter,ferran...@cs.ucsd.edu

Reply via email to