On 7/8/24 4:45 PM, Richard Biener wrote:
On Mon, Jul 8, 2024 at 11:27 AM Tejas Belagod <tejas.bela...@arm.com> wrote:

Hi,

Sorry to have dropped the ball on
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html, but
here I've tried to pick it up again and write up a strawman proposal for
elevating __attribute__((vector_mask)) to the FE from GIMPLE.


Thanks,
Tejas.

Motivation
----------

The idea of packed boolean vectors came about when we wanted to support
C/C++ operators on SVE ACLE types. The current vector boolean type that
ACLE specifies does not adequately disambiguate vector lane sizes which
they were derived off of. Consider this simple, albeit unrealistic, example:

    bool foo (svint32_t a, svint32_t b)
    {
      svbool_t p = a > b;

      // Here p[2] is not the same as a[2] > b[2].
      return p[2];
    }

In the above example, because svbool_t has a fixed 1-lane-per-byte, p[i]
does not return the bool value corresponding to a[i] > b[i]. This
necessitates a 'typed' vector boolean value that unambiguously
represents results of operations
of the same type.

__attribute__((vector_mask))
-----------------------------

Note: If interested in historical discussions refer to:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html

We define this new attribute which when applied to a base data vector
produces a new boolean vector type that represents a boolean type that
is produced as a result of operations on the corresponding base vector
type. The following is the syntax.

    typedef int v8si __attribute__((vector_size (8 * sizeof (int)));
    typedef v8si v8sib __attribute__((vector_mask));

Here the 'base' data vector type is v8si or a vector of 8 integers.

Rules

• The layout/size of the boolean vector type is implementation-defined
for its base data vector type.

• Two boolean vector types who's base data vector types have same number
of elements and lane-width have the same layout and size.

• Consequently, two boolean vectors who's base data vector types have
different number of elements or different lane-size have different layouts.

This aligns with gnu vector extensions that generate integer vectors as
a result of comparisons - "The result of the comparison is a vector of
the same width and number of elements as the comparison operands with a
signed integral element type." according to
     https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html.

Without having the time to re-review this all in detail I think the GNU
vector extension does not expose the result of the comparison as the
machine would produce it but instead a comparison "decays" to
a conditional:

typedef int v4si __attribute__((vector_size(16)));

v4si a;
v4si b;

void foo()
{
   auto r = a < b;
}

produces, with C23:

   vector(4) int r =  VEC_COND_EXPR < a < b , { -1, -1, -1, -1 } , { 0,
0, 0, 0 } > ;

In fact on x86_64 with AVX and AVX512 you have two different "machine
produced" mask types and the above could either produce a AVX mask with
32bit elements or a AVX512 mask with 1bit elements.

Not exposing "native" mask types requires the compiler optimizing subsequent
uses and makes generic vectors difficult to combine with for example AVX512
intrinsics (where masks are just 'int').  Across an ABI boundary it's also
even more difficult to optimize mask transitions.

But it at least allows portable code and it does not suffer from users trying to
expose machine representations of masks as input to generic vector code
with all the problems of constant folding not only requiring self-consistent
code within the compiler but compatibility with user produced constant masks.

That said, I somewhat question the need to expose the target mask layout
to users for GCCs generic vector extension.


Thanks for your feedback.

IIUC, I can imagine how having a GNU vector extension exposing the target vector mask layout can pose a challenge - maybe making it a generic GNU vector extension was too ambitious. I wonder if there's value in pursuing these alternate paths?

1. Can implementing this extension in a 'generic' way i.e. possibly not implement it with a target mask, but just a generic int vector, still maintain the consistency of GNU predicate vectors within the compiler? I know it may not seem very different from how boolean vectors are currently implemented (as in your above example), but, having the __attribute__((vector_mask)) as a 'property' of the object makes it useful to optimize its uses to target predicates in subsequent stages of the compiler.

2. Restricting __attribute__((vector_mask)) to apply only to target intrinsic types? Eg.

On SVE something like:
typedef svint16_t svpred16_t __attribute__((vector_mask)); // OK.

On AVX, something like:
typedef __m256i __mask32 __attribute__((vector_mask)); // OK - though this would require more fine-grained defn of lane-size to mask-bits mapping.

Would not be allowed on GNU Vector Extensions:
typedef v4si v4sib __attribute__((vector_mask)); // Error - vector_mask can't be a generic GNU vector extension!


Thanks,
Tejas.


Producers and Consumers of PBV
------------------------------

With GNU vector extensions, comparisons produce boolean vectors;
conditional and bitwise operators consume them. Comparison producers
generate signed integer vectors of the same lane-width as the operands
of the comparison operator. This means conditionals and bitwise
operators cannot be applied to mixed vectors that are a result of
different width operands. Eg.

    v8hi foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f)
    {
      return a > b || c > d; // error!
      return a > b || e < f; // OK - no explicit conversion needed.
      return a > b || __builtin_convertvector (c > d, v8si); // OK.
      return a | b && c | d; // error!
      return a | b && __builtin_convertvector (c | d, v8si); // OK.
    }

__builtin_convertvector () needs to be applied to convert vectors to the
type one wants to do the comparison in. IoW, the integer vectors that
represent boolean vectors are 'strictly-typed'. If we extend these rules
to vector_mask, this will look like:

    typedef v8sib v8si __attribute__((vector_mask));
    typedef v8hib v8hi __attribute__((vector_mask));
    typedef v8sfb v8sf __attribute__((vector_mask));

    v8sib foo (v8si a, v8si b, v8hi c, v8hi d, v8sf e, v8sf f)
    {
      v8sib psi = a > b;
      v8hib phi = c > d;
      v8sfb psf = e < f;

      return psi || phi; // error!
      return psi || psf; // OK - no explicit conversion needed.
      return psi || __builtin_convertvector (phi, v8sib); // OK.
      return psi | phi; // error!
      return psi | __builtin_convertvector (phi, v8sib); // OK.
      return psi | psf; // OK - no explicit conversion needed.
    }

Now according to the rules explained above, v8sib and v8hib will have
different layouts (which is why they can't be used directly without
conversion if used as operands of operations). OTOH, the same rules
dictate that the layout of, say v8sib and v8sfb, where v8sfb is the
float base data vector equivalent of v8sib which when applied ensure
that v8sib and v8sfb have the same layout and hence can be used as
operands of operators without explicit conversion. This aligns with the
GNU vector extensions rules where comparison of 2 v8sf vectors results
in a v8si of the same lane-width and number of elements as that would
result in comparison of 2 v8si vectors.

Application of vector_mask to sizeless types
--------------------------------------------

__attribute__((vector_mask)) has the advantage that it can be applied to
sizeless types seamlessly.  When __attribute__((vector_mask)) is applied
to a data vector that is a sizeless type, the resulting vector mask also
becomes a sizeless type.
Eg.

    typedef svpred16_t svint16_t __attribute__((vector_mask));

This is equivalent of

    typedef vNhib vNhi __attribute__((vector_mask));

where N could be 8, 16, 32 etc.

The resulting type is a scalable boolean vector type, i.e svint8_t. The
resulting boolean vector type has the same behavior as the scalar type
svint8_t. While svint8_t can represent a scalable bool vector, we need a
scalable scalar type to represent the bit-mask variant of the opaque
type that represents the bool vector. I haven't thought this through,
but I suspect it will be implemented as a 'typed' variant of svbool_t.

ABI
---

Given the new opaque type, it needs rules that define PCS, storage
layout in aggregates and alignment.

PCS
---

GNU vector extension type parameters are always passed on the stack.
Similarly vector_mask applied to GNU base data vector type parameters
will also be passed on  the stack. The format to pass on the stack will
always be a canonical format - an opaque type where the internal
representation can be implementation-defined.

The canonical form of the argument could be a boolean vector. This
boolean vector will be passed on the stack just like other GNU vectors.
vector bool is convenient for a callee to synthesize into a predicate
(irrespective of the target i.e. NEON, SVE, AVX) using target instructions.

If the base data vector is an ACLE type, if the canonical bool vector we
choose is svint8_t or a typed svbool_t we could apply the same rules as
ABI for the said type.

Alignment
---------

For boolean vector in memory, their alignment will be the natural
alignment as defined by the AAPCS64 i.e. 8 and 16 bytes for Short
Vectors and 16 bytes for scalable vectors.

Aggregates
----------

For fixed size vectors, the type resulting from applying
__attribute__((vector_mask)) is a vector of booleans IoW a vNqi.
Therefore the same rules apply as would apply to a GNU vector with 8-bit
elements of the same size in an aggregate.  For scalable GNU boolean
vectors in aggregates, it acts as a Pure scalable type svint8_t and the
ABI rules from Section 5.10 of AAPCS64 apply.

Operation Semantics
-------------------

What should be the data structure of the vector mask type? This seems to
be the main consideration. As suggested by Richard in
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625535.html the idea
is to have an opaque type to have control over operations and
observability. This means that the internal representation can be a bit
mask, but based on the operator being applied to it, the mask can
'decay' to another operator-friendly data structure.

vector_mask has 2 forms that is chosen based on the context. It lives as
a mask and a vector bool. Here we describe its behaviour in various
contexts.

Arithmetic ops
--------------

These don't apply as the values are essentially binary.

Bitwise ops -  &, ^, |, ~, >>, <<
---------------------------------

Here vector_mask acts as a scalar bitmask. Applying bitwise ops is like
another scalar operation.

If p1 and p2 are vector_mask types of type:

         typedef v8sib v8si __attribute__((vector_mask));

Bitwise &, | and ^
------------------

    p1 & p2

Here p1 and p2 act as integer type bitmasks where each bit represents a
vector lane of the data vector type. LSBit representing the lowest
numbered lane and MSBit representing the highest numbered lane.

    p1 & <scalar immediate>

Here the immediate scalar is implicitly cast to a vector_mask type and
the binary op is applied accordingly.

Bitwise ~:

    ~p1

Treats p1 as a bitmask and inverts all the bits of the bitmask.

Bitwise >>, << :

    p1 >> <scalar immediate>
    p1 >> <scalar int32 variable>

Treats p1 as a bitmask. The shifter operand has to be a signed int32
immediate. If the immediate is negative, the direction of the shift is
inverted. Behaviour for any value outside the range of 0..nelems-1 is
undefined.

    p1 >> p2 or p1 << p2

is not allowed.

Logical ops - ==, !=, >, <, >=, <=
----------------------------------

The following ops treat vector_mask as bitmask:
    p1 == p2
    p1 != p2
    p1 == <scalar immediate>
    p1 != <scalar immediate>

The result of these operations is a bool. Note that the scalar
immediates will be implicitly converted to the LHS type of p1. Eg. if p1
is v8sib,

    p1 == 0x3

will mean that 0x3 will represent lower numbered 2 lanes of v8sib are
true and the rest are false.

  >, <, >=, <= do not apply to the vector_mask.

Ternary operator ?:
-------------------

    p1 <logicalop> p2 ? s1 : s2;

is allowed and p1 and p2 are treated as bitmasks.

Conditional operators ||, && !
------------------------------

Here vector_mask is used as a bitmask scalar. So

    p1 != 0 || p2 == 0

treats p1 and p2 as scalar bitmasks. Similarly for && and !.

Assignment ops =, <<=, >>=
--------------------------

The assignment operator is straightforward - it does a copy of the RHS
into a p1. Eg.

    p1 = p2

Copies the value of p2 into p1. If the types are different, there is no
implicit conversion from one to the other (except in cases mentioned
below). One will have to explicitly convert using
__builtin_convertvector (). So if p1 and p2 are different and if one
wants to copy p2 to p1, one has to write

    p1 = __builtin_convertvector (p2, typeof (p1));

__builtin_convertvector is implementation-defined. It is essential to
note p1 and p2 must have the same number of lanes irrespective of the
lane-size. Also, explicit conversion is not required if the lane-sizes
are the same for p1 and p2 along with the same number of elements. So
for eg. if p1 is v8sib and p2 is v8sfb, there is no explicit conversion
required. Same for v8sib and v8uib.

<<= and >>= have similar operations.

Increment Ops ++, --
---------------------

NA

Address-of &
------------

Taking address of a vector_mask returns (vector bool *).

sizeof ()
--------

sizeof (vector_mask) = sizeof (vector bool)

alignof ()
----------

See Alignment section above

Typecast and implicit conversions
---------------------------------

typecast from one vector_mask type to another vector_mask type is only
possible using __builtin_convertvector () if, as explained above, the
lane-size are different. It is not possible to convert between vectors
of different nelems either way.

Implicit conversions between two same-nelem vector_masks are possible
only if the lane-sizes are same.

Literals and Initialization
---------------------------

There are two ways to initialize vector_mask objects - bitmask form and
constant array form. Eg.  typedef v4si v4si __attribute__((vector_mask));

    void foo ()
    {
      v4sib p1 = 0xf;

      /* Do something. */

      p1 = {1, 1, 1, 0};

      ...
    }

The behaviour is undefined values other than 1 or 0 are used in the
constant array initializer.

C++:
---

static_cast<target_type> (<source_expression>)

LLVM allows static_cast<> where both vector sizes are same, but the
semantics are equal to reinterpret_cast<>. GNU does not allow
static_cast<> irrespective of source and target shapes.

To be consistent, leave it unsupported for vector_mask too.

dynamic_cast <target_type> (<source_expr>)
NA

reinterpret_cast<>
Semantics are same as Clang's static_cast<> i.e. reinterpret the types
if both source and target type vectors are same size.

const_cast<>

Applies constness to a vector mask type pointer.

    #include <inttypes.h>

    typedef int32_t v16si __attribute__((__vector_size__(64)));
    typedef v16si v16sib __attribute__((vector_mask));

    __attribute__((noinline))
    const v16sib * foo (v16sib * a)
    {
      return const_cast<v16sib *> (a);
    }

new & delete

For new, vector_mask types will return a pointer to vector_mask type and
allocate sizeof (vector bool) depending on the size of the vector bool
array in
bytes. For eg.  typedef v16sib v16si __attribute__((vector_mask));

    v16sib * foo()
    {
      return new v16sib;
    }

foo returns sizeof (vector bool (16))) i.e. 16 bytes.

__attribute__((vector_mask)) 's conflation with GIMPLE
------------------------------------------------------

__attribute__((vector_mask)) is a feature that has been elevated from
GIMPLE to the FE. In GIMPLE, the semantics are loosely-typed and
target-dependent i.e. different-shared vector mask types are allowed to
work with binary ops depending on which target we're compiling for. Eg.

    typedef v8sib v8si __attribute__((vector_mask));
    typedef v8hib v8hi __attribute__((vector_mask));

    __GIMPLE v8sib foo (v8si a, v8si b, v8hi c, v8hi d)
    {
      v8sib psi = a > b;
      v8hib phi = c > d;

      return psi | phi; // OK on amdgcn, but errors on aarch64!
    }

This dichotomy is acceptable as long as GIMPLE semantics don't change
and because the FE semantics are proposed to be more restrictive, its
becomes a subset of the functionality of GIMPLE semantics. This is the
current starting point, but going forward if there are scenarios where
we have to diverge from GIMPLE semantics, we have to discuss that on a
case-by-case basis.



Reply via email to