Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

Matthias Kretz Wed, 27 Mar 2024 04:53:18 -0700

Hi Richard,

sorry for not answering sooner. I took action on your mail but failed to also 
give feedback. Now, in light of your veto of Srinivas' patch, I wanted to use 
the opportunity to pick this up again.

On Dienstag, 23. Januar 2024 21:57:23 CET Richard Sandiford wrote:
> However, we also support different vector lengths for streaming SVE
> (running in "streaming" mode on SME) and non-streaming SVE (running
> in "non-streaming" mode on the core).  Having two different lengths is
> expected to be the common case, rather than a theoretical curiosity.

I read up on this after you mentioned this for the first time. As a WG21 
member I find the approach troublesome - but that's a bit off-topic for this 
thread.

The big issue here is that, IIUC, a user (and the simd library) cannot do the 
right thing at the moment. There simply isn't enough context information 
available when parsing the <experimental/simd> header. I.e. on definition of 
the class template there's no facility to take target_clones or SME 
"streaming" mode into account. Consequently, if we want the library to be fit 
for SME, then we need more language extension(s) to make it work.

I guess I'm looking for a way to declare types that are different depending on 
whether they are used in streaming mode or non-streaming mode (making them 
ill-formed to use in functions marked arm_streaming_compatible).

From reading through
https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode
I don't see any discussion of member functions or ctor/dtor, static and 
non-static data members, etc.

The big issue I see here is that currently all of std::* is declared without an 
arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't use anything 
from the standard library in streaming mode. Since that also applies to 
std::experimental::simd, we're not creating a new footgun, only missing out on 
potential users?

Some more thoughts on target_clones/streaming SVE language extension 
evolution:

  void nonstreaming_fn(void) {
    constexpr int width = __arm_sve_bits(); // e.g. 512
    constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
      // vector_size attribute works with bytes, not bits)
  }

  __attribute__((arm_locally_streaming))
  void streaming_fn(void) {
    constexpr int width = __arm_sve_bits(); // e.g. 128
    constexpr int width2 = __builtin_vector_size(); // e.g. 16
  }

  __attribute__((target_clones("sse4.2,avx2")))
  void cloned_fn(void) {
    constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
      // and 32 in the avx2 clone
  }

... as a starting point for exploration. Given this, I'd still have to resort 
to a macro to define a "native" simd type:

#define NATIVE_SIMD(T) \
  std::experimental::simd<T, _SveAbi<__arm_sve_bits() / CHAR_BIT, \
                                     __arm_sve_bits() / CHAR_BIT>>

Getting rid of the macro seems to be even harder.

A declaration of an alias like

template <typename T>
using SveSimd = std::experimental::simd<T, _SveAbi<__arm_sve_bits() / CHAR_BIT,
                                                   __arm_sve_bits() / CHAR_BIT>>;

would have to delay "invoking" __arm_sve_bits() until it knows its context:

  void nonstreaming_fn(void) {
    static_assert(sizeof(SveSimd<float>) == 64);
  }

  __attribute__((arm_locally_streaming))
  void streaming_fn(void) {
    static_assert(sizeof(SveSimd<float>) == 16);
    nonstreaming_fn(); // fine
  }
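
Until such an extension exists, one approximation is to make the mode an 
explicit template parameter, so that the two modes name two distinct types. 
The following is only a sketch: sve_mode, sve_bytes and sve_simd are 
hypothetical names (not ACLE or libstdc++), and the widths are just the example 
values from above (512-bit non-streaming, 128-bit streaming).

```cpp
#include <cstddef>

// Hypothetical tag for the two execution modes; not part of ACLE or libstdc++.
enum class sve_mode { non_streaming, streaming };

// Assumed register sizes in bytes, taken from the example values above.
// Real values depend on the target and are not known to the header today.
template <sve_mode M>
constexpr std::size_t sve_bytes = (M == sve_mode::non_streaming) ? 64 : 16;

// Stand-in for simd<T, _SveAbi<...>>: exactly one register's worth of T.
template <typename T, sve_mode M>
struct sve_simd {
  static constexpr std::size_t size() { return sve_bytes<M> / sizeof(T); }
  alignas(sve_bytes<M>) T data[sve_bytes<M> / sizeof(T)];
};

// The mode is part of the type, so sizeof differs without any context lookup.
static_assert(sizeof(sve_simd<float, sve_mode::non_streaming>) == 64, "");
static_assert(sizeof(sve_simd<float, sve_mode::streaming>) == 16, "");
```

The obvious drawback is that the user has to spell the mode at every use, which 
is exactly what a context-dependent alias would avoid.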

This gets even worse for target_clones, where

  void f() {
    sizeof(std::simd<float>) == ?
  }

  __attribute__((target_clones("sse4.2,avx2")))
  void g() {
    f();
  }

the compiler *must* virally apply target_clones to all functions it calls. And 
member functions must either also get cloned as functions, or the whole type 
must be cloned (as in the std::simd case, where the sizeof needs to change). 😳

> When would NumberOfUsedBytes < SizeofRegister be used for SVE?  Would it
> be for storing narrower elements in wider containers?  If the interface
> supports that then, yeah, two parameters would probably be safer.
> 
> Or were you thinking about emulating narrower vectors with wider registers
> using predication?  I suppose that's possible too, and would be similar in
> spirit to using SVE to optimise Advanced SIMD std::simd types.
> But mightn't it cause confusion if sizeof applied to a "16-byte"
> vector actually gives 32?

Yes, the idea is to e.g. use one SVE register instead of two NEON registers 
for a "float, 8" with SVE512.

The user never asks for a "16-byte" vector. The user asks for a value-type and 
a number of elements. Sure, the wasteful "padding" might come as a surprise, 
but it's totally within the spec to implement it like this.
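
To make the padding concrete, here is a minimal sketch of the "float, 8" in one 
SVE512 register case. padded_simd is a hypothetical stand-in, not the actual 
libstdc++ type; its second and third parameters mirror the NumberOfUsedBytes / 
SizeofRegister split discussed above.

```cpp
#include <cstddef>

// Hypothetical stand-in for a simd<T, N> that lives in one wider register.
// RegisterBytes models the register size (64 for SVE512).
template <typename T, std::size_t N, std::size_t RegisterBytes>
struct padded_simd {
  static_assert(N * sizeof(T) <= RegisterBytes,
                "all elements must fit into one register");

  // Number of elements the user asked for.
  static constexpr std::size_t size() { return N; }
  // Bytes that actually carry elements; the rest of the register is padding.
  static constexpr std::size_t used_bytes() { return N * sizeof(T); }

  // Storage always covers the full register.
  alignas(RegisterBytes) T storage[RegisterBytes / sizeof(T)];
};

using float8 = padded_simd<float, 8, 64>;
static_assert(float8::size() == 8, "");
static_assert(float8::used_bytes() == 32, "");
static_assert(sizeof(float8) == 64, "");  // twice the "logical" 32 bytes
```

In a real implementation the unused upper half would be masked off via 
predication rather than merely left unused.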

> I assume std::experimental::native_simd<int> has to have the same
> meaning everywhere for ODR reasons?

No. Only std::experimental::simd<int> has to be "ABI stable". And note that in 
the C++ spec there's no such thing as compiling and linking TUs with different 
compiler flags. That's plain UB. The committee still cares about it, but 
getting this "right" cannot be part of the standard and must be defined by 
implementers.
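
A rough sketch of why a flag-dependent "native" alias need not clash across 
TUs: if the width is part of the type, differently compiled TUs simply name 
different types. Everything below is illustrative; vec, native_vec and 
compatible_vec are hypothetical names, the macro-based selection is an 
assumption and not libstdc++'s actual logic, and 16 bytes as the stable width 
is assumed from the vector_size(16) case.

```cpp
#include <cstddef>

// Illustrative selection of a "native" register size in bytes from predefined
// target macros. Internal linkage on purpose: the value may differ per TU.
#if defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS > 0
constexpr std::size_t native_bytes = __ARM_FEATURE_SVE_BITS / 8;
#elif defined(__AVX512F__)
constexpr std::size_t native_bytes = 64;
#elif defined(__AVX__)
constexpr std::size_t native_bytes = 32;
#else
constexpr std::size_t native_bytes = 16;
#endif

// Assumed flag-independent width for the "ABI stable" type.
constexpr std::size_t compatible_bytes = 16;

// The width is a template argument, hence part of the type's identity.
template <typename T, std::size_t Bytes>
struct vec {
  alignas(Bytes) T data[Bytes / sizeof(T)];
};

template <typename T> using native_vec = vec<T, native_bytes>;          // may differ per TU
template <typename T> using compatible_vec = vec<T, compatible_bytes>;  // same in every TU

static_assert(sizeof(compatible_vec<int>) == 16, "");
static_assert(sizeof(native_vec<int>) >= sizeof(compatible_vec<int>), "");
```

Passing native_vec<int> across a TU boundary compiled with different flags then 
fails to link instead of silently mismatching, while compatible_vec<int> keeps 
working.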

>  If so, it'll need to be an
> Advanced SIMD vector for AArch64 (but using SVE to optimise certain
> operations under the hood where available).  I don't think we could
> support anything else.

simd<int> on AArch64 uses [[gnu::vector_size(16)]].

> Even if SVE registers are larger than 128 bits, we can't require
> all code in the binary to be recompiled with that knowledge.
> 
> I suppose that breaks the "largest" requirement, but due to the
> streaming/non-streaming distinction I mentioned above, there isn't
> really a single, universal "largest" in this context.

There is, but it's context-dependent. I'd love to make this work.

> SVE and Advanced SIMD are architected to use the same registers
> (i.e. SVE registers architecturally extend Advanced SIMD registers).
> In Neoverse V1 (SVE256) they are the same physical register as well.
> I believe the same is true for A64FX.

That's good to know. 👍

> FWIW, GCC has already started using SVE in this way.  E.g. SVE provides
> a wider range of immediate constants for logic operations, so we now use
> them for Advanced SIMD logic where beneficial.

I will consider these optimizations (when necessary in the library) for the
C++26 implementation.

Best,
  Matthias

-- 
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────
