https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107096

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ams at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #2)
> See the comment above rgroup_controls in tree-vectorizer.h for the
> current assumptions around loop predication.  If AVX512 wants something
> different then some extensions will be needed :-)

Coming back to this now.  Crucially

   If only the first three lanes are active, the masks we need are:

     f rgroup: 1 1 | 1 1 | 1 1 | 0 0
     d rgroup:  1  |  1  |  1  |  0

   Here we can use a mask calculated for f's rgroup for d's, but not
   vice versa.

that seems to assume that the space in the mask vector for the "bools"
in the d rgroup is twice as large as in that for the f rgroup.

For AVX512 there's a single bit for each lane, independently on the
width of the actual data.  So instead of 

   Thus for each value of nV, it is enough to provide nV masks, with the
   mask being calculated based on the highest nL (or, equivalently, based
   on the highest nS) required by any rgroup with that nV.  We therefore
   represent the entire collection of masks as a two-level table, with the
   first level being indexed by nV - 1 (since nV == 0 doesn't exist) and
   the second being indexed by the mask index 0 <= i < nV.

we need a set of nV masks for each value of nS, and we can pick the
smallest nV for each nS and generate the corresponding larger nV
masks by a series of shifts.  In fact we can re-use the first vector
(excess bits are OK).  So for the example in tree-vectorizer.h

     float *f;
     double *d;
     for (int i = 0; i < n; ++i)
       {
         f[i * 2 + 0] += 1.0f;
         f[i * 2 + 1] += 2.0f;
         d[i] += 3.0;
       }

we'd need to perform two WHILE_ULT.  For

     float *f;
     double *d;
     for (int i = 0; i < n; ++i)
       {
         f[i] += 1.0f;
         d[i] += 3.0;
       }

we'd compute the mask for the f rgroup with a WHILE_ULT and we'll
have nV_d = 2 * nV_f and the first mask vector from f can be reused
for d (but not the other way around).  The second mask vector for
d can be obtained by kshiftr.

There's no other way to do N bit to two N/2 bit hi/lo (un)packing
(there's a 2x N/2 bit -> N bit operation, for whatever reason).
There's also no way to transform the d rgroup mask into the
f rgroup mask for the first example aka duplicate bits in place,
{ b0, b1, b2, ... bN } -> { b0, b0, b1, b1, b2, b2, ... bN, bN },
nor the reverse.

So in reality it seems we need a set of mask vectors for the full
set of nS * nV combinations with AVX512.  Doing fully masking with
AVX2 style vectors would work with the existing rgroup control scheme.

Currently the "key" to the AVX512 behavior is the use of scalar modes
for the mask vectors but then that's also what GCN uses.  Do GCN
mask bits really map to bytes to allow the currently used rgroup scheme?

Reply via email to