Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors

Richard Biener via Gcc-patches Wed, 26 Jul 2023 05:27:23 -0700

On Wed, Jul 26, 2023 at 9:21 AM Tejas Belagod <[email protected]> wrote:
>
> On 7/17/23 5:46 PM, Richard Biener wrote:
> > On Fri, Jul 14, 2023 at 12:18 PM Tejas Belagod <[email protected]> 
> > wrote:
> >>
> >> On 7/13/23 4:05 PM, Richard Biener wrote:
> >>> On Thu, Jul 13, 2023 at 12:15 PM Tejas Belagod <[email protected]> 
> >>> wrote:
> >>>>
> >>>> On 7/3/23 1:31 PM, Richard Biener wrote:
> >>>>> On Mon, Jul 3, 2023 at 8:50 AM Tejas Belagod <[email protected]> 
> >>>>> wrote:
> >>>>>>
> >>>>>> On 6/29/23 6:55 PM, Richard Biener wrote:
> >>>>>>> On Wed, Jun 28, 2023 at 1:26 PM Tejas Belagod <[email protected]> 
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From: Richard Biener <[email protected]>
> >>>>>>>> Date: Tuesday, June 27, 2023 at 12:58 PM
> >>>>>>>> To: Tejas Belagod <[email protected]>
> >>>>>>>> Cc: [email protected] <[email protected]>
> >>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>>>>>
> >>>>>>>> On Tue, Jun 27, 2023 at 8:30 AM Tejas Belagod 
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> From: Richard Biener <[email protected]>
> >>>>>>>>> Date: Monday, June 26, 2023 at 2:23 PM
> >>>>>>>>> To: Tejas Belagod <[email protected]>
> >>>>>>>>> Cc: [email protected] <[email protected]>
> >>>>>>>>> Subject: Re: [RFC] GNU Vector Extension -- Packed Boolean Vectors
> >>>>>>>>>
> >>>>>>>>> On Mon, Jun 26, 2023 at 8:24 AM Tejas Belagod via Gcc-patches
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Packed Boolean Vectors
> >>>>>>>>>> ----------------------
> >>>>>>>>>>
> >>>>>>>>>> I'd like to propose a feature addition to GNU Vector extensions to 
> >>>>>>>>>> add packed
> >>>>>>>>>> boolean vectors (PBV).  This has been discussed in the past 
> >>>>>>>>>> here[1] and a variant has
> >>>>>>>>>> been implemented in Clang recently[2].
> >>>>>>>>>>
> >>>>>>>>>> With predication features being added to vector architectures 
> >>>>>>>>>> (SVE, MVE, AVX),
> >>>>>>>>>> it is a useful feature to have to model predication on targets.  
> >>>>>>>>>> This could
> >>>>>>>>>> find its use in intrinsics or just used as is as a GNU vector 
> >>>>>>>>>> extension being
> >>>>>>>>>> mapped to underlying target features.  For example, the packed 
> >>>>>>>>>> boolean vector
> >>>>>>>>>> could directly map to a predicate register on SVE.
> >>>>>>>>>>
> >>>>>>>>>> Also, this new packed boolean type GNU extension can be used with 
> >>>>>>>>>> SVE ACLE
> >>>>>>>>>> intrinsics to replace a fixed-length svbool_t.
> >>>>>>>>>>
> >>>>>>>>>> Here are a few options to represent the packed boolean vector type.
> >>>>>>>>>
> >>>>>>>>> The GIMPLE frontend uses a new 'vector_mask' attribute:
> >>>>>>>>>
> >>>>>>>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>>>>>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>>>>>>
> >>>>>>>>> it gets you a vector type that's the appropriate (dependent on the
> >>>>>>>>> target) vector
> >>>>>>>>> mask type for the vector data type (v8si in this case).
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks Richard.
> >>>>>>>>>
> >>>>>>>>> Having had a quick look at the implementation, it does seem to tick 
> >>>>>>>>> the boxes.
> >>>>>>>>>
> >>>>>>>>> I must admit I haven't dug deep, but if the target hook allows the 
> >>>>>>>>> mask to be
> >>>>>>>>>
> >>>>>>>>> defined in way that is target-friendly (and I don't know how much 
> >>>>>>>>> effort it will
> >>>>>>>>>
> >>>>>>>>> be to migrate the attribute to more front-ends), it should do the 
> >>>>>>>>> job nicely.
> >>>>>>>>>
> >>>>>>>>> Let me go back and dig a bit deeper and get back with questions if 
> >>>>>>>>> any.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Let me add that the advantage of this is the compiler doesn't need
> >>>>>>>> to support weird explicitly laid out packed boolean vectors that do
> >>>>>>>> not match what the target supports and the user doesn't need to know
> >>>>>>>> what the target supports (and thus have an #ifdef maze around 
> >>>>>>>> explicitly
> >>>>>>>> specified layouts).
> >>>>>>>>
> >>>>>>>> Sorry for the delayed response – I spent a day experimenting with 
> >>>>>>>> vector_mask.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Yeah, this is what option 4 in the RFC is trying to achieve – be 
> >>>>>>>> portable enough
> >>>>>>>>
> >>>>>>>> to avoid having to sprinkle the code with ifdefs.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It does remove some flexibility though, for example with -mavx512f 
> >>>>>>>> -mavx512vl
> >>>>>>>> you'll get AVX512 style masks for V4SImode data vectors but of 
> >>>>>>>> course the
> >>>>>>>> target still supports SSE2/AVX2 style masks as well, but those would 
> >>>>>>>> not be
> >>>>>>>> available as "packed boolean vectors", though they are of course in 
> >>>>>>>> fact
> >>>>>>>> equal to V4SImode data vectors with -1 or 0 values, so in this 
> >>>>>>>> particular
> >>>>>>>> case it might not matter.
> >>>>>>>>
> >>>>>>>> That said, the vector_mask attribute will get you V4SImode vectors 
> >>>>>>>> with
> >>>>>>>> signed boolean elements of 32 bits for V4SImode data vectors with
> >>>>>>>> SSE2/AVX2.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> This sounds very much like what the scenario would be with NEON vs 
> >>>>>>>> SVE. Coming to think
> >>>>>>>>
> >>>>>>>> of it, vector_mask resembles option 4 in the proposal with ‘n’ 
> >>>>>>>> implied by the ‘base’ vector type
> >>>>>>>>
> >>>>>>>> and a ‘w’ specified for the type.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Given its current implementation, if vector_mask is exposed to the 
> >>>>>>>> CFE, would there be any
> >>>>>>>>
> >>>>>>>> major challenges wrt implementation or defining behaviour semantics? 
> >>>>>>>> I played around with a
> >>>>>>>>
> >>>>>>>> few examples from the testsuite and wrote some new ones. I mostly 
> >>>>>>>> tried operations that
> >>>>>>>>
> >>>>>>>> the new type would have to support (unary, binary bitwise, 
> >>>>>>>> initializations etc) – with a couple of exceptions
> >>>>>>>>
> >>>>>>>> most of the ops seem to be supported. I also triggered a couple of 
> >>>>>>>> ICEs in some tests involving
> >>>>>>>>
> >>>>>>>> implicit conversions to wider/narrower vector_mask types (will raise 
> >>>>>>>> reports for these). Correct me
> >>>>>>>>
> >>>>>>>> if I’m wrong here, but we’d probably have to support a couple of new 
> >>>>>>>> ops if vector_mask is exposed
> >>>>>>>>
> >>>>>>>> to the CFE – initialization and subscript operations?
> >>>>>>>
> >>>>>>> Yes, either that or restrict how the mask vectors can be used, thus
> >>>>>>> properly diagnose improper
> >>>>>>> uses.
> >>>>>>
> >>>>>> Indeed.
> >>>>>>
> >>>>>>> A question would be for example how to write common mask test
> >>>>>>> operations like
> >>>>>>> if (any (mask)) or if (all (mask)).
> >>>>>>
> >>>>>> I see 2 options here. New builtins could support new types - they'd
> >>>>>> provide a target independent way to test any and all conditions. 
> >>>>>> Another
> >>>>>> would be to let the target use its intrinsics to do them in the most
> >>>>>> efficient way possible (which the builtins would get lowered down to
> >>>>>> anyway).
> >>>>>>
> >>>>>>
> >>>>>>> Likewise writing merge operations
> >>>>>>> - do those as
> >>>>>>>
> >>>>>>>      a = a | (mask ? b : 0);
> >>>>>>>
> >>>>>>> thus use ternary ?: for this?
> >>>>>>
> >>>>>> Yes, like now, the ternary could just translate to
> >>>>>>
> >>>>>>       {mask[0] ? b[0] : 0, mask[1] ? b[1] : 0, ... }
> >>>>>>
> >>>>>> One thing to flesh out is the semantics. Should we allow this operation
> >>>>>> as long as the number of elements is the same even if the mask type is
> >>>>>> different i.e.
> >>>>>>
> >>>>>>       v4hib ? v4si : v4si;
> >>>>>>
> >>>>>> I don't see why this can't be allowed as now we let
> >>>>>>
> >>>>>>       v4si ? v4sf : v4sf;
> >>>>>>
> >>>>>>
> >>>>>>> For initialization regular vector
> >>>>>>> syntax should work:
> >>>>>>>
> >>>>>>> mtype mask = (mtype){ -1, -1, 0, 0, ... };
> >>>>>>>
> >>>>>>> there's the question of the signedness of the mask elements.  GCC
> >>>>>>> internally uses signed
> >>>>>>> bools with values -1 for true and 0 for false.
> >>>>>>
> >>>>>> One of the things is the value that represents true. This is largely
> >>>>>> target-dependent when it comes to the vector_mask type. When 
> >>>>>> vector_mask
> >>>>>> types are created from GCC's internal representation of bool vectors
> >>>>>> (signed ints) the point about implicit/explicit conversions from signed
> >>>>>> int vect to mask types in the proposal covers this. So mask in
> >>>>>>
> >>>>>>       v4sib mask = (v4sib){-1, -1, 0, 0, ... }
> >>>>>>
> >>>>>> will probably end up being represented as 0x3xxxx on AVX512 and 0x11xxx
> >>>>>> on SVE. On AVX2/SSE they'd still be represented as vector of signed 
> >>>>>> ints
> >>>>>> {-1, -1, 0, 0, ... }. I'm not entirely confident what ramifications 
> >>>>>> this
> >>>>>> new mask type representations will have in the mid-end while being
> >>>>>> converted back and forth to and from GCC's internal representation, but
> >>>>>> I'm guessing this is already being handled at some level by the
> >>>>>> vector_mask type's current support?
> >>>>>
> >>>>> Yes, I would guess so.  Of course what the middle-end is currently 
> >>>>> exposed
> >>>>> to is simply what the vectorizer generates - once fuzzers discover this 
> >>>>> feature
> >>>>> we'll see "interesting" uses that might run into missed or wrong 
> >>>>> handling of
> >>>>> them.
> >>>>>
> >>>>> So whatever we do on the side of exposing this to users a good portion
> >>>>> of testsuite coverage for the allowed use cases is important.
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>
> >>>> Apologies for the long-ish reply, but here's a TLDR and gory details 
> >>>> follow.
> >>>>
> >>>> TLDR:
> >>>> GIMPLE's vector_mask type semantics seems to be target-dependent, so
> >>>> elevating vector_mask to CFE with same semantics is undesirable. OTOH,
> >>>> changing vector_mask to have target-independent CFE semantics will cause
> >>>> dichotomy between its CFE and GFE behaviours. But vector_mask approach
> >>>> scales well for sizeless types. Is the solution to have something like
> >>>> vector_mask with defined target-independent type semantics, but call it
> >>>> something else to prevent conflation with GIMPLE, a viable option?
> >>>>
> >>>> Details:
> >>>> After some more analysis of the proposed options, here are some
> >>>> interesting findings:
> >>>>
> >>>> vector_mask looked like a very interesting option until I ran into some
> >>>> semantic uncertainty. This code:
> >>>>
> >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>
> >>>> typedef short v8hi __attribute__((vector_size(8*sizeof(short))));
> >>>> typedef v8hi v8hib __attribute__((vector_mask));
> >>>>
> >>>> v8si res;
> >>>> v8hi resh;
> >>>>
> >>>> v8hib __GIMPLE () foo (v8hib x, v8sib y)
> >>>> {
> >>>>      v8hib res;
> >>>>
> >>>>      res = x & y;
> >>>>      return res;
> >>>> }
> >>>>
> >>>> When compiled on AArch64, produces a type-mismatch error for binary
> >>>> expression involving '&' because the 'derived' types 'v8hib' and 'v8sib'
> >>>>     have a different target-layout. If the layout of these two 'derived'
> >>>> types match, then the above code has no issue. Which is the case on
> >>>> amdgcn-amdhsa target where it compiles without any error(amdgcn uses a
> >>>> scalar DImode mask mode). IoW such code seems to be allowed on some
> >>>> targets and not on others.
> >>>>
> >>>> With the same code, I tried putting casts and it worked fine on AArch64
> >>>> and amdgcn. This target-specific behaviour of vector_mask derived types
> >>>> will be difficult to specify once we move it to the CFE - in fact we
> >>>> probably don't want target-specific behaviour once it moves to the CFE.
> >>>>
> >>>> If we expose vector_mask to CFE, we'd have to specify consistent
> >>>> semantics for vector_mask types. We'd have to resolve ambiguities like
> >>>> 'v4hib & v4sib' clearly to be able to specify the semantics of the type
> >>>> system involving vector_mask. If we do this, don't we run the risk of a
> >>>> dichotomy between the CFE and GFE semantics of vector_mask? I'm assuming
> >>>> we'd want to retain vector_mask semantics as they are in GIMPLE.
> >>>>
> >>>> If we want to enforce constant semantics for vector_mask in the CFE, one
> >>>> way is to treat vector_mask types as distinct if they're 'attached' to
> >>>> distinct data vector types. In such a scenario, vector_mask types
> >>>> attached to two data vector types with the same lane-width and number of
> >>>> lanes would be classified as distinct. For eg:
> >>>>
> >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>> typedef v8si v8sib __attribute__((vector_mask));
> >>>>
> >>>> typedef float v8sf __attribute__((vector_size(8*sizeof(float))));
> >>>> typedef v8sf v8sfb __attribute__((vector_mask));
> >>>>
> >>>> v8si  foo (v8sf x, v8sf y, v8si i, v8si j)
> >>>> {
> >>>>      (i == j) & (v8sfb)(x == y) ? i : (v8si){0};
> >>>> }
> >>>>
> >>>> This could be the case for unsigned vs signed int vectors too for eg -
> >>>> seems a bit unnecessary tbh.
> >>>>
> >>>> Though vector_mask's being 'attached' to a type has its drawbacks, it
> >>>> does seem to have an advantage when sizeless types are considered. If we
> >>>> have to define a sizeless vector boolean type that is implied by the
> >>>> lane size, we could do something like
> >>>>
> >>>> typedef svint32_t svbool32_t __attribute__((vector_mask));
> >>>>
> >>>> int32_t foo (svint32_t a, svint32_t b)
> >>>> {
> >>>>      svbool32_t pred = a > b;
> >>>>
> >>>>      return pred[2] ? a[2] : b[2];
> >>>> }
> >>>>
> >>>> This is harder to do in the other schemes proposed so far as they're
> >>>> size-based.
> >>>>
> >>>> To be able to free the boolean from the base type (not size) and retain
> >>>> vector_mask's flexibility to declare sizeless types, we could have an
> >>>> attribute that is more flexibly-typed and only 'derives' the lane-size
> >>>> and number of lanes from its 'base' type without actually inheriting the
> >>>> actual base type(char, short, int etc) or its signedness. This creates a
> >>>> purer and stand-alone boolean type without the associated semantics'
> >>>> complexity of having to cast between two same-size types with the same
> >>>> number of lanes. Eg.
> >>>>
> >>>> typedef int v8si __attribute__((vector_size(8*sizeof(int))));
> >>>> typedef v8si v8b __attribute__((vector_bool));
> >>>>
> >>>> However, with differing lane-sizes, there will have to be a cast as the
> >>>> 'derived' element size is different which could impact the layout of the
> >>>> vector mask. Eg.
> >>>>
> >>>> v8si  foo (v8hi x, v8hi y, v8si i, v8si j)
> >>>> {
> >>>>      (v8sib)(x == y) & (i == j) ? i : (v8si){0};
> >>>> }
> >>>>
> >>>> Such conversions on targets like AVX512/AMDGCN will be a NOP, but
> >>>> non-trivial on SVE (depending on the implemented layout of the bool 
> >>>> vector).
> >>>>
> >>>> vector_bool decouples us from having to retain the behaviour of
> >>>> vector_mask and provides the flexibility of not having to cast across
> >>>> same-element-size vector types. Wrt sizeless types, it could scale 
> >>>> well.
> >>>>
> >>>> typedef svint32_t svbool32_t __attribute__((vector_bool));
> >>>> typedef svint16_t svbool16_t __attribute__((vector_bool));
> >>>>
> >>>> int32_t foo (svint32_t a, svint32_t b)
> >>>> {
> >>>>      svbool32_t pred = a > b;
> >>>>
> >>>>      return pred[2] ? a[2] : b[2];
> >>>> }
> >>>>
> >>>> int16_t bar (svint16_t a, svint16_t b)
> >>>> {
> >>>>      svbool16_t pred = a > b;
> >>>>
> >>>>      return pred[2] ? a[2] : b[2];
> >>>> }
> >>>>
> >>>> On SVE, pred[2] refers to bit 4 for svint16_t and bit 8 for svint32_t on
> >>>> the target predicate.
> >>>>
> >>>> Thoughts?
> >>>
> >>> The GIMPLE frontend accepts just what is valid on the target here.  Any
> >>> "plumbing" such as implicit conversions (if we do not want to require
> >>> explicit ones even when NOP) need to be done/enforced by the C frontend.
> >>>
> >>
> >> Sorry, I'm not sure I follow - correct me if I'm wrong here.
> >>
> >> If we desire to define/allow operations like implicit/explicit
> >> conversion on vector_mask types in CFE, don't we have to start from a
> >> position of defining what makes vector_mask types distinct and therefore
> >> require implicit/explicit conversions?
> >
> > We need to look at which operations we want to produce vector masks and
> > which operations consume them and what operations operate on them.
> >
> > In GIMPLE comparisons produce them, conditionals consume them and
> > we allow bitwise ops to operate on them directly (GIMPLE doesn't have
> > logical && it just has bitwise &).
> >
>
> Thanks for your thoughts - after I spent more cycles researching and
> experimenting, I think I understand the driving factors here. Comparison
> producers generate signed integer vectors of the same lane-width as the
> comparison operands. This means mixed type vectors can't be applied to
> conditional consumers or bitwise operators eg:
>
>    v8hi foo (v8si a, v8si b, v8hi c, v8hi d)
>    {
>      return a > b || c > d; // error!
>      return a > b || __builtin_convertvector (c > d, v8si); // OK.
>      return a | b && c | d; // error!
>      return a | b && __builtin_convertvector (c | d, v8si); // OK.
>    }
>
> Similarly, if we extend these 'stricter-typing' rules to vector_mask, it
> could look like:
>
> typedef v8si v8sib __attribute__((vector_mask));
> typedef v8hi v8hib __attribute__((vector_mask));
>
>    v8sib foo (v8si a, v8si b, v8hi c, v8hi d)
>    {
>      v8sib psi = a > b;
>      v8hib phi = c > d;
>
>      return psi || phi; // error!
>      return psi || __builtin_convertvector (phi, v8sib); // OK.
>      return psi | phi; // error!
>      return psi | __builtin_convertvector (phi, v8sib); // OK.
>    }
>
> At GIMPLE stage, on targets where the layout allows it (eg AMDGCN),
> expressions like
>    psi | __builtin_convertvector (phi, v8sib)
> can be optimized to
>    psi | phi
> because __builtin_convertvector (phi, v8sib) is a NOP.
>
> I think this could make vector_mask more portable across targets. If one
> wants to take CFE vector_mask code and run it on the GFE, it should
> work; while the reverse won't as CFE vector_mask rules are more restrictive.
>
> Does this look like a sensible approach for progress?


Yes, that looks good.
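
For illustration, the stricter-typing rule can already be tried with
today's data-vector masks, where a comparison yields a signed integer
vector rather than a vector_mask type.  A minimal sketch, assuming GCC 9
or later for __builtin_convertvector; the names any_greater and
any_greater_example are made up for this example and the 4-lane types
are chosen only to keep it small:

```c
#include <assert.h>

typedef int   v4si __attribute__((vector_size(16)));
typedef short v4hi __attribute__((vector_size(8)));

/* Lanes where a > b or c > d, as a 32-bit-lane mask (-1 true, 0 false). */
static v4si any_greater(v4si a, v4si b, v4hi c, v4hi d)
{
    v4si msi = a > b;   /* comparison result has 32-bit lanes */
    v4hi mhi = c > d;   /* comparison result has 16-bit lanes */

    /* "msi | mhi" is rejected because the vector types differ.  The
       explicit conversion sign-extends each lane, so -1 stays -1. */
    return msi | __builtin_convertvector(mhi, v4si);
}

/* Usage: lane 1 is true via a > b, lane 0 via c > d. */
static void any_greater_example(void)
{
    v4si a = {1, 5, 3, 0}, b = {2, 4, 3, 0};
    v4hi c = {9, 0, 0, 0}, d = {0, 0, 1, 0};
    v4si m = any_greater(a, b, c, d);

    assert(m[0] == -1 && m[1] == -1 && m[2] == 0 && m[3] == 0);
}
```

On a target where both masks share a layout the conversion could be
elided; here it is a real widening.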

> >> IIUC, GFE's distinctness of vector_mask types depends on how the mask
> >> mode is implemented on the target. If implemented in CFE, vector_mask
> >> types' distinctness probably shouldn't be based on target layout and
> >> could be based on the type they're 'attached' to.
> >
> > But since we eventually run on the target the layout should ideally
> > match that of the target.  Now, the question is whether that's ever
> > OK behavior - it effectively makes the mask somewhat opaque and
> > only "observable" by probing it in defined manners.
> >
> >> Wouldn't that diverge from target-specific GFE behaviour - or are you
> >> suggesting it's OK for vector_mask type semantics to be different in CFE
> >> and GFE?
> >
> > It's definitely undesirable but as said I'm not sure it has to differ
> > [the layout].
> >
>
> I agree it is best to have a consistent layout of vector_mask across CFE
> and GFE and also implement it to match the target layout for optimal
> code quality.
>
> For observability, I think it makes sense to allow operations that are
> relevant and have a consistent meaning irrespective of the target. Eg.
> 'vector_mask & 2' might not mean the same thing on all targets, but
> vector_mask[2] does. Therefore, I think the opaqueness is useful and
> necessary to some extent.

Yes.  The main question regarding observability will be
things like sizeof or alignof and putting masks into addressable storage.

I think IBM folks have introduced some "opaque" types for their
matrix-multiplication accelerator where intrinsics need something to
work with but the observability of many aspects is restricted.  In
the middle-end we have OPAQUE_TYPE and MODE_OPAQUE
(but IIRC there can only be a single kind of that at the moment).
Interestingly an OPAQUE_TYPE does have a size.

Note one way out would be to make vector_mask types "decay"
to a value vector type.  Thus any time you try to observe it
you get a vector bool (a vector of actual 8 bit bool data elements)
and when you use a vector data type in mask context you
get a "mask conversion" aka vector bool != 0.  It would then be
up to the compiler to elide round-trips between mask and data.

That would mean sizeof (vector_mask) would be sizeof (vector bool)
even when for example the hardware would produce a V4SImode
mask from a V4SFmode compare or when it would produce a
QImode 4-bit integer from the same?
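
To make the decay idea concrete, here is a rough sketch written with
plain GNU vectors rather than the vector_mask attribute.  The v4qb
typedef is only a stand-in for the hypothetical "vector bool", and
mask_decay, mask_from_bool and decay_example are made-up names; it
assumes GCC 9 or later for __builtin_convertvector and a target where
the mask is a V4SImode data vector:

```c
#include <assert.h>

typedef int           v4si __attribute__((vector_size(16)));
typedef float         v4sf __attribute__((vector_size(16)));
typedef unsigned char v4qb __attribute__((vector_size(4)));  /* stand-in "vector bool" */

/* Observing a mask: -1/0 lanes decay to 1/0 byte elements. */
static v4qb mask_decay(v4si mask)
{
    return __builtin_convertvector(mask & 1, v4qb);
}

/* Data in mask context: "vector bool != 0", widened to a v4si mask. */
static v4si mask_from_bool(v4qb data)
{
    return __builtin_convertvector(data != 0, v4si);
}

/* Usage: round-trip the result of a V4SF compare. */
static void decay_example(void)
{
    v4sf x = {1.0f, 2.0f, 3.0f, 4.0f}, y = {0.0f, 2.0f, 5.0f, 1.0f};
    v4si m = x > y;                 /* lanes 0 and 3 are true */
    v4qb b = mask_decay(m);
    v4si back = mask_from_bool(b);

    assert(sizeof b == 4);          /* one byte per lane, whatever the mask mode */
    assert(b[0] == 1 && b[1] == 0 && b[2] == 0 && b[3] == 1);
    assert(back[0] == -1 && back[1] == 0 && back[2] == 0 && back[3] == -1);
}
```

In this sketch the observed size is 4 bytes even though the compare
itself produced a 16-byte vector; a compiler implementing the decay
would be expected to elide the round-trip when the mask is never
observed.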

Richard.

> Thanks,
> Tejas.
>
> >>> There's one issue I can see that wasn't mentioned yet - GCC currently
> >>> accepts
> >>>
> >>> typedef long gv1024di __attribute__((vector_size(1024*8)));
> >>>
> >>> even if there's no underlying support on the target which either has 
> >>> support
> >>> only for smaller vectors or no vectors at all.  Currently vector_mask will
> >>> simply fail to produce something desirable here.  What's your idea of making
> >>> that not target dependent?  GCC will later lower operations with such
> >>> vectors, possibly splitting them up into sizes supported by the hardware
> >>> natively, possibly performing elementwise operations.  For the former
> >>> one would need to guess the "decomposition type" and based on that
> >>> select the mask type [layout]?
> >>>
> >>> One idea would be to specify the mask layout follows the largest vector
> >>> kind supported by the target and if there is none follow the layout
> >>> of (signed?) _Bool [n]?  When there's no target support for vectors
> >>> GCC will generally use elementwise operations apart from some
> >>> special-cases.
> >>>
> >>
> >> That is a very good point - thanks for raising it. For when GCC chooses
> >> to lower to a vector type supported by the target, my initial thought
> >> would be to, as you say, choose a mask that has enough bits to represent
> >> the largest vector size with the smallest lane-width. The actual layout
> >> of the mask will depend on how the target implements its mask mode.
> >> Decomposition of vector_mask ought to follow the decomposition of the
> >> GNU vectors type and each decomposed vector_mask type ought to have
> >> enough bits to represent the decomposed GNU vector shape. It sounds nice
> >> on paper, but I haven't really worked through a design for this. Do you
> >> see any gotchas here?
> >
> > Not really.  In the end it comes down to what the C writer is allowed to
> > do with a vector mask.  I would for example expect that I could do
> >
> >   auto m = v1 < v2;
> >   _mm512_mask_sub_epi32 (a, m, b, c);
> >
> > so generic masks should inter-operate with intrinsics (when the appropriate
> > ISA is enabled).  That works for the data vectors themselves for example
> > (quite some intrinsics are implemented with GCCs generic vector code).
> >
> > I for example can't do
> >
> >    _Bool lane2 = m[2];
> >
> > to inspect lane two of a mask with AVX512.  I can do m & 2 but I wouldn't 
> > expect
> > that to work (should I?) with a vector_mask mask (it's at least not
> > valid directly
> > in GIMPLE).  There's _mm512_int2mask and _mm512_mask2int which transfer
> > between mask and int (but the mask types are really just typedef'd to
> > integer types).
> >
> >>> While using a different name than vector_mask is certainly possible
> >>> it wouldn't be me to decide that, but I'm also not yet convinced it's
> >>> really necessary.  As said, what the GIMPLE frontend accepts
> >>> or not shouldn't limit us here - just the actual chosen layout of the
> >>> boolean vectors.
> >>>
> >>
> >> I'm just concerned about creating an alternate vector_mask functionality
> >> in the CFE and risk not being consistent with GFE.
> >
> > I think it's more important to double-check usability from the user's side.
> > If the implementation necessarily diverges from GIMPLE then we can
> > choose a different attribute name but then it will also inevitably have
> > code-generation (quality) issues as GIMPLE matches what the hardware
> > can do.
> >
> > Richard.
> >
> >> Thanks,
> >> Tejas.
> >>
> >>> Richard.
> >>>
> >>>> Thanks,
> >>>> Tejas.
> >>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Tejas.
> >>>>>>
> >>>>>>>
> >>>>>>> Richard.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Tejas.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Richard.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Tejas.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> 1. __attribute__((vector_size (n))) where n represents bytes
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool vbool __attribute__ ((vector_size (1)));
> >>>>>>>>>>
> >>>>>>>>>> In this approach, the shape of the boolean vector is unclear. IoW, 
> >>>>>>>>>> it is not
> >>>>>>>>>> clear if each bit in 'n' controls a byte or an element. On targets
> >>>>>>>>>> like SVE, it would be natural to have each bit control a byte of 
> >>>>>>>>>> the target
> >>>>>>>>>> vector (therefore resulting in an 'unpacked' layout of the PBV) 
> >>>>>>>>>> and on AVX, each
> >>>>>>>>>> bit would control one element/lane on the target vector(therefore 
> >>>>>>>>>> resulting in a
> >>>>>>>>>> 'packed' layout with all significant bits at the LSB).
> >>>>>>>>>>
> >>>>>>>>>> 2. __attribute__((vector_size (n))) where n represents num of lanes
> >>>>>>>>>>
> >>>>>>>>>>       typedef int v4si __attribute__ ((vector_size (4 * sizeof (int))));
> >>>>>>>>>>       typedef bool v4bi __attribute__ ((vector_size (sizeof (v4si) / sizeof (v4si){0}[0])));
> >>>>>>>>>>
> >>>>>>>>>> Here the 'n' in the vector_size attribute represents the number of 
> >>>>>>>>>> bits that
> >>>>>>>>>> is needed to represent a vector quantity.  In this case, this 
> >>>>>>>>>> packed boolean
> >>>>>>>>>> vector can represent up to 'n' vector lanes. The size of the type is
> >>>>>>>>>> rounded up to the nearest byte.  For example, the sizeof v4bi in the 
> >>>>>>>>>> above
> >>>>>>>>>> example is 1.
> >>>>>>>>>>
> >>>>>>>>>> In this approach, because of the nature of the representation, the 
> >>>>>>>>>> n bits required
> >>>>>>>>>> to represent the n lanes of the vector are packed at the LSB. This 
> >>>>>>>>>> does not naturally
> >>>>>>>>>> align with the SVE approach of each bit representing a byte of the 
> >>>>>>>>>> target vector
> >>>>>>>>>> and PBV therefore having an 'unpacked' layout.
> >>>>>>>>>>
> >>>>>>>>>> More importantly, another drawback here is that the change in 
> >>>>>>>>>> units for vector_size
> >>>>>>>>>> might be confusing to programmers.  The units will have to be 
> >>>>>>>>>> interpreted based on the
> >>>>>>>>>> base type of the typedef. It does not offer any flexibility in 
> >>>>>>>>>> terms of the layout of
> >>>>>>>>>> the bool vector - it is fixed.
> >>>>>>>>>>
> >>>>>>>>>> 3. Combination of 1 and 2.
> >>>>>>>>>>
> >>>>>>>>>> Combining the best of 1 and 2, we can introduce extra parameters 
> >>>>>>>>>> to vector_size that will
> >>>>>>>>>> unambiguously represent the layout of the PBV. Consider
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool vbool __attribute__((vector_size (s, n[, w])));
> >>>>>>>>>>
> >>>>>>>>>> where 's' is size in bytes, 'n' is the number of lanes and an 
> >>>>>>>>>> optional 3rd parameter 'w'
> >>>>>>>>>> is the number of bits of the PBV that represents a lane of the 
> >>>>>>>>>> target vector. 'w' would
> >>>>>>>>>> allow a target to force a certain layout of the PBV.
> >>>>>>>>>>
> >>>>>>>>>> The 2-parameter form of vector_size allows the target to have an
> >>>>>>>>>> implementation-defined layout of the PBV. The target is free to 
> >>>>>>>>>> choose the 'w'
> >>>>>>>>>> if it is not specified to mirror the target layout of predicate 
> >>>>>>>>>> registers. For
> >>>>>>>>>> eg. AVX would choose 'w' as 1 and SVE would choose s*8/n.
> >>>>>>>>>>
> >>>>>>>>>> As an example, to represent the result of a comparison on 2 
> >>>>>>>>>> int16x8_t, we'd need
> >>>>>>>>>> 8 lanes of boolean which could be represented by
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool v8b __attribute__ ((vector_size (2, 8)));
> >>>>>>>>>>
> >>>>>>>>>> SVE would implement v8b layout to make every 2nd bit significant 
> >>>>>>>>>> i.e. w == 2
> >>>>>>>>>>
> >>>>>>>>>> and AVX would choose a layout where all 8 consecutive bits packed 
> >>>>>>>>>> at LSB would
> >>>>>>>>>> be significant i.e. w == 1.
> >>>>>>>>>>
> >>>>>>>>>> This scheme would accommodate more than 1 target to effectively 
> >>>>>>>>>> represent vector
> >>>>>>>>>> bools that mirror the target properties.
> >>>>>>>>>>
> >>>>>>>>>> 4. A new attribute
> >>>>>>>>>>
> >>>>>>>>>> This is based on a suggestion from Richard S in [3]. The idea is 
> >>>>>>>>>> to introduce a new
> >>>>>>>>>> attribute to define the PBV and make it general enough to
> >>>>>>>>>>
> >>>>>>>>>> * represent all targets flexibly (SVE, AVX etc)
> >>>>>>>>>> * represent sub-byte length predicates
> >>>>>>>>>> * have no change in units of vector_size/no new vector_size 
> >>>>>>>>>> signature
> >>>>>>>>>> * not have the number of bytes constrain representation
> >>>>>>>>>>
> >>>>>>>>>> If we call the new attribute 'bool_vec' (for lack of a better 
> >>>>>>>>>> name), consider
> >>>>>>>>>>
> >>>>>>>>>>       typedef bool vbool __attribute__((bool_vec (n[, w])))
> >>>>>>>>>>
> >>>>>>>>>> where 'n' represents number of lanes/elements and the optional 'w' 
> >>>>>>>>>> is bits-per-lane.
> >>>>>>>>>>
> >>>>>>>>>> If 'w' is not specified, it and bytes-per-predicate are 
> >>>>>>>>>> implementation-defined based on target.
> >>>>>>>>>> If 'w' is specified,  sizeof (vbool) will be ceil (n*w/8).
> >>>>>>>>>>
> >>>>>>>>>> 5. Behaviour of the packed vector boolean type.
> >>>>>>>>>>
> >>>>>>>>>> Taking the example of one of the options above, following is an 
> >>>>>>>>>> illustration of its behavior
> >>>>>>>>>>
> >>>>>>>>>> * ABI
> >>>>>>>>>>
> >>>>>>>>>>       New ABI rules will need to be defined for this type - eg 
> >>>>>>>>>> alignment, PCS,
> >>>>>>>>>>       mangling etc
> >>>>>>>>>>
> >>>>>>>>>> * Initialization:
> >>>>>>>>>>
> >>>>>>>>>>       Packed Boolean Vectors(PBV) can be initialized like so:
> >>>>>>>>>>
> >>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 4)));
> >>>>>>>>>>         v4bi p = {false, true, false, false};
> >>>>>>>>>>
> >>>>>>>>>>       Each value in the initializer constant is of type bool. The 
> >>>>>>>>>> lowest numbered
> >>>>>>>>>>       element in the const array corresponds to the LSbit of p, 
> >>>>>>>>>> element 1 is
> >>>>>>>>>>       assigned to bit 4 etc.
> >>>>>>>>>>
> >>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0010
> >>>>>>>>>>
> >>>>>>>>>>       With a different layout
> >>>>>>>>>>
> >>>>>>>>>>         typedef bool v4bi __attribute__ ((vector_size (2, 4, 1)));
> >>>>>>>>>>         v4bi p = {false, true, false, false};
> >>>>>>>>>>
> >>>>>>>>>>       p is effectively a 2-byte bitmask with value 0x0002
> >>>>>>>>>>
> >>>>>>>>>> * Operations:
> >>>>>>>>>>
> >>>>>>>>>>       Packed Boolean Vectors support the following operations:
> >>>>>>>>>>       . unary ~
> >>>>>>>>>>       . unary !
> >>>>>>>>>>       . binary &, | and ^
> >>>>>>>>>>       . assignments &=, |= and ^=
> >>>>>>>>>>       . comparisons <, <=, ==, !=, >= and >
> >>>>>>>>>>       . Ternary operator ?:
> >>>>>>>>>>
> >>>>>>>>>>       Operations are defined as applied to the individual elements 
> >>>>>>>>>> i.e the bits
> >>>>>>>>>>       that are significant in the PBV. Whether the PBVs are 
> >>>>>>>>>> treated as bitmasks
> >>>>>>>>>>       or otherwise is implementation-defined.
> >>>>>>>>>>
> >>>>>>>>>>       Insignificant bits could affect results of comparisons or 
> >>>>>>>>>> ternary operators.
> >>>>>>>>>>       In such cases, it is implementation defined how the unused 
> >>>>>>>>>> bits are treated.
> >>>>>>>>>>
> >>>>>>>>>>       . Subscript operator []
> >>>>>>>>>>
> >>>>>>>>>>       For the subscript operator, the packed boolean vector acts 
> >>>>>>>>>> like an array of
> >>>>>>>>>>       elements - the first or the 0th indexed element being the 
> >>>>>>>>>> LSbit of the PBV.
> >>>>>>>>>>       Subscript operator yields a scalar boolean value.
> >>>>>>>>>>       For example:
> >>>>>>>>>>
> >>>>>>>>>>         typedef bool v8b __attribute__ ((vector_size (2, 8, 2)));
> >>>>>>>>>>
> >>>>>>>>>>         // Subscript operator result yields a boolean value.
> >>>>>>>>>>         // p[3] is the 7th LSbit and p[1] is the 3rd LSbit of p.
> >>>>>>>>>>         bool foo (v8b p, int n) { p[3] = true; return p[1]; }
> >>>>>>>>>>
> >>>>>>>>>>       Out of bounds access: OOB access can be determined at 
> >>>>>>>>>> compile time given the
> >>>>>>>>>>       strong typing of the PBVs.
> >>>>>>>>>>
> >>>>>>>>>>       PBV does not support address of operator(&) for elements of 
> >>>>>>>>>> PBVs.
> >>>>>>>>>>
> >>>>>>>>>>       . Implicit conversion from integer vectors to PBVs
> >>>>>>>>>>
> >>>>>>>>>>       We would like to support the output of comparison operations 
> >>>>>>>>>> to be PBVs. This
> >>>>>>>>>>       requires us to define the implicit conversion from an 
> >>>>>>>>>> integer vector to PBV
> >>>>>>>>>>       as the result of vector comparisons are integer vectors.
> >>>>>>>>>>
> >>>>>>>>>>       To define this operation:
> >>>>>>>>>>
> >>>>>>>>>>         bool_vector = vector <cmpop> vector
> >>>>>>>>>>
> >>>>>>>>>>       There is no change in how vector <cmpop> vector behaves 
> >>>>>>>>>> i.e. this comparison
> >>>>>>>>>>       would still produce an int_vector type as it does now.
> >>>>>>>>>>
> >>>>>>>>>>         temp_int_vec = vector <cmpop> vector
> >>>>>>>>>>         bool_vec = temp_int_vec // Implicit conversion from 
> >>>>>>>>>> int_vec to bool_vec
> >>>>>>>>>>
> >>>>>>>>>>       The implicit conversion from int_vec to bool I'd define 
> >>>>>>>>>> simply to be:
> >>>>>>>>>>
> >>>>>>>>>>         bool_vec[n] = (_Bool) int_vec[n]
> >>>>>>>>>>
> >>>>>>>>>>       where the C11 standard rules apply
> >>>>>>>>>>       6.3.1.2 Boolean type  When any scalar value is converted to 
> >>>>>>>>>> _Bool, the result
> >>>>>>>>>>       is 0 if the value compares equal to 0; otherwise, the result 
> >>>>>>>>>> is 1.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> [1] https://lists.llvm.org/pipermail/cfe-dev/2020-May/065434.html
> >>>>>>>>>> [2] https://reviews.llvm.org/D88905
> >>>>>>>>>> [3] https://reviews.llvm.org/D81083
> >>>>>>>>>>
> >>>>>>>>>> Thoughts?
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Tejas.
> >>>>>>
> >>>>
> >>
>
