On Mon, Jul 8, 2024 at 11:27 AM Tejas Belagod <tejas.bela...@arm.com> wrote:
>
> Hi,
>
> Sorry to have dropped the ball on
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html, but
> here I've tried to pick it up again and write up a strawman proposal for
> elevating __attribute__((vector_mask)) to the FE from GIMPLE.
>
>
> Thanks,
> Tejas.
>
> Motivation
> ----------
>
> The idea of packed boolean vectors came about when we wanted to support
> C/C++ operators on SVE ACLE types. The current vector boolean type that
> the ACLE specifies does not adequately capture the lane size of the
> vector type it was derived from. Consider this simple, albeit
> unrealistic, example:
>
>    bool foo (svint32_t a, svint32_t b)
>    {
>      svbool_t p = a > b;
>
>      // Here p[2] is not the same as a[2] > b[2].
>      return p[2];
>    }
>
> In the above example, because svbool_t has a fixed layout of one lane
> per byte, p[i] does not return the bool value corresponding to
> a[i] > b[i]. This necessitates a 'typed' vector boolean type that
> unambiguously represents the result of operations on a given vector
> type.
>
> __attribute__((vector_mask))
> -----------------------------
>
> Note: If interested in historical discussions refer to:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html
>
> We define a new attribute which, when applied to a base data vector
> type, produces a new boolean vector type that represents the boolean
> result of operations on the corresponding base vector type. The syntax
> is as follows:
>
>    typedef int v8si __attribute__((vector_size (8 * sizeof (int))));
>    typedef v8si v8sib __attribute__((vector_mask));
>
> Here the 'base' data vector type is v8si, i.e. a vector of 8 ints.
>
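> As a sketch of the intended usage under the proposed semantics (this is
> not current FE behaviour), a comparison of two v8si values would yield
> a v8sib directly:
>
>    v8sib cmp_gt (v8si a, v8si b)
>    {
>      // The result type is tied to the lane shape of the v8si operands.
>      return a > b;
>    }
>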
> Rules
> -----
>
> • The layout/size of the boolean vector type is implementation-defined
> for its base data vector type.
>
> • Two boolean vector types whose base data vector types have the same
> number of elements and lane-width have the same layout and size.
>
> • Consequently, two boolean vector types whose base data vector types
> have a different number of elements or a different lane-size have
> different layouts.
>
> This aligns with the GNU vector extensions, which generate integer vectors as
> a result of comparisons - "The result of the comparison is a vector of
> the same width and number of elements as the comparison operands with a
> signed integral element type." according to
>     https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html.

Without having had the time to re-review this all in detail, I think the
GNU vector extension does not expose the result of the comparison as the
machine would produce it; instead a comparison "decays" to a conditional:

typedef int v4si __attribute__((vector_size(16)));

v4si a;
v4si b;

void foo()
{
  auto r = a < b;
}

produces, with C23:

  vector(4) int r = VEC_COND_EXPR < a < b , { -1, -1, -1, -1 } , { 0, 0, 0, 0 } > ;

In fact on x86_64 with AVX and AVX512 you have two different "machine
produced" mask types and the above could either produce an AVX mask with
32-bit elements or an AVX512 mask with 1-bit elements.

Not exposing "native" mask types requires the compiler to optimize subsequent
uses and makes generic vectors difficult to combine with, for example, AVX512
intrinsics (where masks are just 'int').  Across an ABI boundary it's also
even more difficult to optimize mask transitions.
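
As a rough sketch (the helper name to_mask16 is made up here, and this is
only one way to do it), bridging a generic vector comparison to an AVX512
mask intrinsic currently has to be done by hand:

#include <immintrin.h>

typedef int v16si __attribute__((vector_size (64)));

/* Recover a bit-per-lane __mmask16 from the element-wide 0/-1 result of
   a generic vector comparison (requires -mavx512f).  */
static __mmask16 to_mask16 (v16si cmp)
{
  return _mm512_cmpneq_epi32_mask ((__m512i) cmp, _mm512_setzero_si512 ());
}

__m512i blend (v16si a, v16si b, __m512i x, __m512i y)
{
  __mmask16 k = to_mask16 (a < b);   /* generic comparison -> AVX512 mask */
  return _mm512_mask_blend_epi32 (k, x, y);
}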

But it at least allows portable code, and it does not suffer from users trying
to expose machine representations of masks as input to generic vector code,
with all the problems of constant folding not only requiring self-consistent
code within the compiler but also compatibility with user-produced constant
masks.

That said, I somewhat question the need to expose the target mask layout
to users for GCC's generic vector extension.

> Producers and Consumers of PBV
> ------------------------------
>
> With GNU vector extensions, comparisons produce boolean vectors;
> conditional and bitwise operators consume them. Comparison producers
> generate signed integer vectors of the same lane-width as the operands
> of the comparison operator. This means conditionals and bitwise
> operators cannot be applied to a mix of vectors that result from
> different-width operands. Eg.
>
>    v8si foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f)
>    {
>      return a > b || c > d; // error!
>      return a > b || e < f; // OK - no explicit conversion needed.
>      return a > b || __builtin_convertvector (c > d, v8si); // OK.
>      return a | b && c | d; // error!
>      return a | b && __builtin_convertvector (c | d, v8si); // OK.
>    }
>
> __builtin_convertvector () needs to be applied to convert vectors to the
> type one wants to do the comparison in. IoW, the integer vectors that
> represent boolean vectors are 'strictly-typed'. If we extend these rules
> to vector_mask, this will look like:
>
>    typedef v8si v8sib __attribute__((vector_mask));
>    typedef v8hi v8hib __attribute__((vector_mask));
>    typedef v8sf v8sfb __attribute__((vector_mask));
>
>    v8sib foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f)
>    {
>      v8sib psi = a > b;
>      v8hib phi = c > d;
>      v8sfb psf = e < f;
>
>      return psi || phi; // error!
>      return psi || psf; // OK - no explicit conversion needed.
>      return psi || __builtin_convertvector (phi, v8sib); // OK.
>      return psi | phi; // error!
>      return psi | __builtin_convertvector (phi, v8sib); // OK.
>      return psi | psf; // OK - no explicit conversion needed.
>    }
>
> Now according to the rules explained above, v8sib and v8hib will have
> different layouts (which is why they can't be used directly as operands
> of operations without conversion). OTOH, the same rules dictate that
> v8sib and v8sfb - where v8sfb is the mask type derived from the float
> base data vector v8sf, which has the same number of elements and
> lane-width as v8si - have the same layout and can hence be used as
> operands of operators without explicit conversion. This aligns with the
> GNU vector extension rules, where a comparison of two v8sf vectors
> results in a v8si with the same lane-width and number of elements as a
> comparison of two v8si vectors would.
>
> Application of vector_mask to sizeless types
> --------------------------------------------
>
> __attribute__((vector_mask)) has the advantage that it can be applied to
> sizeless types seamlessly.  When __attribute__((vector_mask)) is applied
> to a data vector that is a sizeless type, the resulting vector mask also
> becomes a sizeless type.
> Eg.
>
>    typedef svint16_t svpred16_t __attribute__((vector_mask));
>
> This is equivalent of
>
>    typedef vNhi vNhib __attribute__((vector_mask));
>
> where N could be 8, 16, 32 etc.
>
> The resulting type is a scalable boolean vector type that behaves like
> svint8_t. While svint8_t can represent a scalable bool vector, we also
> need a scalable scalar type to represent the bit-mask variant of the
> opaque type that represents the bool vector. I haven't thought this
> through, but I suspect it will be implemented as a 'typed' variant of
> svbool_t.
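>
> As a hypothetical sketch (svpred16_t is only the name introduced above,
> not an existing ACLE type), the sizeless case would read:
>
>    #include <arm_sve.h>
>
>    typedef svint16_t svpred16_t __attribute__((vector_mask));
>
>    svpred16_t cmp16 (svint16_t a, svint16_t b)
>    {
>      // The result is a sizeless, typed boolean vector.
>      return a > b;
>    }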
>
> ABI
> ---
>
> Given the new opaque type, we need rules that define its PCS, storage
> layout in aggregates, and alignment.
>
> PCS
> ---
>
> GNU vector extension type parameters are always passed on the stack.
> Similarly, parameters of vector_mask types derived from GNU base data
> vector types will also be passed on the stack. The format used on the
> stack will always be a canonical format - an opaque type whose internal
> representation can be implementation-defined.
>
> The canonical form of the argument could be a boolean vector. This
> boolean vector will be passed on the stack just like other GNU vectors.
> A vector bool is convenient for a callee to synthesize into a predicate
> (irrespective of the target, i.e. NEON, SVE, AVX) using target
> instructions.
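>
> A minimal sketch of what this means at a call site (assuming the
> stack-passing rule proposed here; the function names are made up):
>
>    int count_true (v8sib p);       // p is passed on the stack in the
>                                    // canonical boolean-vector form
>
>    int use (v8si a, v8si b)
>    {
>      // The comparison result is materialised into the canonical form
>      // for the call.
>      return count_true (a > b);
>    }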
>
> If the base data vector is an ACLE type and the canonical bool vector
> we choose is svint8_t or a typed svbool_t, we could apply the same ABI
> rules as for that type.
>
> Alignment
> ---------
>
> For boolean vectors in memory, the alignment will be the natural
> alignment as defined by AAPCS64, i.e. 8 or 16 bytes for Short Vectors
> and 16 bytes for scalable vectors.
>
> Aggregates
> ----------
>
> For fixed-size vectors, the type resulting from applying
> __attribute__((vector_mask)) is a vector of booleans, IoW a vNqi.
> Therefore the same rules apply as would apply to a GNU vector with
> 8-bit elements of the same size in an aggregate.  A scalable GNU
> boolean vector in an aggregate acts as a Pure Scalable Type like
> svint8_t, and the ABI rules from Section 5.10 of AAPCS64 apply.
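>
> For illustration (the layout is what the rules above imply, not an
> existing ABI guarantee):
>
>    struct S
>    {
>      v8sib p;     // laid out like a GNU vector of 8-bit elements of
>                   // the same size, with natural alignment
>      int tail;
>    };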
>
> Operation Semantics
> -------------------
>
> What should be the data structure of the vector mask type? This seems to
> be the main consideration. As suggested by Richard in
> https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html the idea
> is to have an opaque type to have control over operations and
> observability. This means that the internal representation can be a bit
> mask, but based on the operator being applied to it, the mask can
> 'decay' to another operator-friendly data structure.
>
> vector_mask has 2 forms that are chosen based on the context: it lives
> as either a bitmask or a vector bool. Here we describe its behaviour in
> various contexts.
>
> Arithmetic ops
> --------------
>
> These don't apply as the values are essentially binary.
>
> Bitwise ops -  &, ^, |, ~, >>, <<
> ---------------------------------
>
> Here vector_mask acts as a scalar bitmask. Applying bitwise ops is like
> another scalar operation.
>
> If p1 and p2 are vector_mask types of type:
>
>         typedef v8si v8sib __attribute__((vector_mask));
>
> Bitwise &, | and ^
> ------------------
>
>    p1 & p2
>
> Here p1 and p2 act as integer-type bitmasks where each bit represents a
> vector lane of the data vector type, with the LSBit representing the
> lowest-numbered lane and the MSBit representing the highest-numbered
> lane.
>
>    p1 & <scalar immediate>
>
> Here the immediate scalar is implicitly cast to a vector_mask type and
> the binary op is applied accordingly.
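>
> Putting the above together in a sketch (proposed semantics, not
> implemented yet):
>
>    v8sib combine (v8sib p1, v8sib p2)
>    {
>      v8sib both = p1 & p2;    // bitwise AND of the two lane bitmasks
>      v8sib low2 = p1 & 0x3;   // 0x3 is implicitly converted to a v8sib
>      return both | low2;
>    }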
>
> Bitwise ~:
>
>    ~p1
>
> Treats p1 as a bitmask and inverts all the bits of the bitmask.
>
> Bitwise >>, << :
>
>    p1 >> <scalar immediate>
>    p1 >> <scalar int32 variable>
>
> Treats p1 as a bitmask. The shift operand has to be a signed int32
> immediate or variable. If the value is negative, the direction of the
> shift is inverted. Behaviour for any shift amount outside the range
> 0..nelems-1 is undefined.
>
>    p1 >> p2 or p1 << p2
>
> is not allowed.
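>
> A small sketch of the proposed shift behaviour:
>
>    v8sib shift_up (v8sib p1)
>    {
>      // With the LSBit as lane 0, lane i of the result corresponds to
>      // lane i-1 of p1; the shift amount must be within 0..nelems-1.
>      return p1 << 1;
>    }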
>
> Logical ops - ==, !=, >, <, >=, <=
> ----------------------------------
>
> The following ops treat vector_mask as bitmask:
>    p1 == p2
>    p1 != p2
>    p1 == <scalar immediate>
>    p1 != <scalar immediate>
>
> The result of these operations is a bool. Note that the scalar
> immediates will be implicitly converted to the type of the LHS p1. Eg.
> if p1 is v8sib,
>
>    p1 == 0x3
>
> will mean that 0x3 represents a v8sib in which the two lowest-numbered
> lanes are true and the rest are false.
>
>  >, <, >=, <= do not apply to the vector_mask.
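>
> A sketch of the proposed equality semantics:
>
>    _Bool same_or_low2 (v8sib p1, v8sib p2)
>    {
>      // Whole-mask comparisons yield a scalar bool; 0x3 is implicitly
>      // converted to a v8sib with only the two lowest lanes set.
>      return p1 == p2 || p1 == 0x3;
>    }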
>
> Ternary operator ?:
> -------------------
>
>    p1 <logicalop> p2 ? s1 : s2;
>
> is allowed and p1 and p2 are treated as bitmasks.
>
> Conditional operators ||, &&, !
> ------------------------------
>
> Here vector_mask is used as a bitmask scalar. So
>
>    p1 != 0 || p2 == 0
>
> treats p1 and p2 as scalar bitmasks. Similarly for && and !.
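>
> Eg., as a sketch:
>
>    _Bool both_empty (v8sib p1, v8sib p2)
>    {
>      // p1 and p2 are treated as scalar bitmasks by ! and &&.
>      return !p1 && !p2;
>    }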
>
> Assignment ops =, <<=, >>=
> --------------------------
>
> The assignment operator is straightforward - it copies the RHS into the
> LHS. Eg.
>
>    p1 = p2
>
> Copies the value of p2 into p1. If the types are different, there is no
> implicit conversion from one to the other (except in cases mentioned
> below). One will have to explicitly convert using
> __builtin_convertvector (). So if p1 and p2 are different and if one
> wants to copy p2 to p1, one has to write
>
>    p1 = __builtin_convertvector (p2, typeof (p1));
>
> __builtin_convertvector is implementation-defined. It is essential to
> note that p1 and p2 must have the same number of lanes irrespective of
> the lane-size. Also, explicit conversion is not required if p1 and p2
> have the same lane-size along with the same number of elements. So, for
> eg., if p1 is v8sib and p2 is v8sfb, there is no explicit conversion
> required. The same holds for v8sib and v8uib.
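>
> As a sketch of these conversion rules:
>
>    void convert (v8sib psi, v8sfb psf, v8hib phi)
>    {
>      psi = psf;                                   // OK: same nelems and lane-size
>      psi = __builtin_convertvector (phi, v8sib);  // explicit: lane-sizes differ
>      /* psi = phi;  would be rejected - no implicit conversion.  */
>    }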
>
> <<= and >>= follow the shift semantics described above, combined with
> assignment.
>
> Increment Ops ++, --
> ---------------------
>
> NA
>
> Address-of &
> ------------
>
> Taking the address of a vector_mask yields a (vector bool *).
>
> sizeof ()
> --------
>
> sizeof (vector_mask) = sizeof (vector bool)
>
> alignof ()
> ----------
>
> See Alignment section above
>
> Typecast and implicit conversions
> ---------------------------------
>
> A typecast from one vector_mask type to another vector_mask type is
> only possible using __builtin_convertvector (), and, as explained
> above, only when the lane-sizes are different. It is not possible to
> convert between vectors with different nelems either way.
>
> Implicit conversions between two vector_masks with the same nelems are
> possible only if the lane-sizes are the same.
>
> Literals and Initialization
> ---------------------------
>
> There are two ways to initialize vector_mask objects - bitmask form and
> constant array form. Eg.
>
>    typedef v4si v4sib __attribute__((vector_mask));
>
>    void foo ()
>    {
>      v4sib p1 = 0xf;
>
>      /* Do something. */
>
>      p1 = (v4sib){1, 1, 1, 0};
>
>      ...
>    }
>
> The behaviour is undefined if values other than 1 or 0 are used in the
> constant array initializer.
>
> C++:
> ---
>
> static_cast<target_type> (<source_expression>)
>
> LLVM allows static_cast<> where both vector sizes are the same, but the
> semantics are equivalent to reinterpret_cast<>. GNU does not allow
> static_cast<> irrespective of source and target shapes.
>
> To be consistent, leave it unsupported for vector_mask too.
>
> dynamic_cast <target_type> (<source_expr>)
> NA
>
> reinterpret_cast<>
> Semantics are the same as Clang's static_cast<>, i.e. reinterpret the
> type if both source and target vector types are the same size.
>
> const_cast<>
>
> Applies constness to a vector mask type pointer.
>
>    #include <inttypes.h>
>
>    typedef int32_t v16si __attribute__((__vector_size__(64)));
>    typedef v16si v16sib __attribute__((vector_mask));
>
>    __attribute__((noinline))
>    const v16sib * foo (v16sib * a)
>    {
>      return const_cast<v16sib *> (a);
>    }
>
> new & delete
>
> For new, vector_mask types will return a pointer to the vector_mask
> type and allocate sizeof (vector bool) bytes, matching the size of the
> vector bool array. For eg.
>
>    typedef v16si v16sib __attribute__((vector_mask));
>
>    v16sib * foo()
>    {
>      return new v16sib;
>    }
>
> Here new allocates sizeof (vector bool (16)), i.e. 16 bytes, and foo
> returns a pointer to that allocation.
>
> __attribute__((vector_mask))'s conflation with GIMPLE
> ------------------------------------------------------
>
> __attribute__((vector_mask)) is a feature that has been elevated from
> GIMPLE to the FE. In GIMPLE, the semantics are loosely-typed and
> target-dependent, i.e. differently-shaped vector mask types are allowed
> to work with binary ops depending on which target we're compiling for. Eg.
>
>    typedef v8si v8sib __attribute__((vector_mask));
>    typedef v8hi v8hib __attribute__((vector_mask));
>
>    __GIMPLE v8sib foo (v8si a, v8si b, v8hi c, v8hi d)
>    {
>      v8sib psi = a > b;
>      v8hib phi = c > d;
>
>      return psi | phi; // OK on amdgcn, but errors on aarch64!
>    }
>
> This dichotomy is acceptable as long as GIMPLE semantics don't change,
> and because the FE semantics are proposed to be more restrictive, they
> become a subset of the functionality of the GIMPLE semantics. This is
> the current starting point, but going forward, if there are scenarios
> where we have to diverge from GIMPLE semantics, we will have to discuss
> them on a case-by-case basis.
>
>
