Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-03-28 Thread Matthias Kretz
On Wednesday, 27 March 2024 14:34:52 CET Richard Sandiford wrote:
> Matthias Kretz  writes:
> > The big issue here is that, IIUC, a user (and the simd library) cannot do
> > the right thing at the moment. There simply isn't enough context
> > information available when parsing the <experimental/simd> header. I.e.
> > on definition of the class template there's no facility to take
> > target_clones or SME "streaming" mode into account. Consequently, if we
> > want the library to be fit for SME, then we need more language
> > extension(s) to make it work.
> 
> Yeah.  I think the same applies to plain SVE.

With "plain SVE" you mean the *scalable* part of it, right? BTW, I've 
experimented with implementing simd basically as

template <typename T, int N>
class simd
{
  alignas(std::bit_ceil(sizeof(T) * N)) T data[N];
  // ...
};

See here: https://compiler-explorer.com/z/WW6KqanTW

Maybe the compiler can get better at optimizing this approach. But for now 
it's not a solution for a *scalable* variant, because all code would be 
load/store bound from the get-go.

@Srinivas: See the guard variables for __index0123? They need to go. I believe 
you can and should declare them `constexpr`.

>  It seems reasonable to
> have functions whose implementation is specialised for a specific SVE
> length, with that function being selected at runtime where appropriate.
> Those functions needn't (in principle) be in separate TUs.  The “best”
> definition of native then becomes a per-function property rather
> than a per-TU property.

Hmm, I never considered this; but can one actually write fixed-length SVE code 
if -msve-vector-bits is not given? Then it's certainly possible to write a 
single TU with a runtime dispatch for all the different SVE widths. (This is 
less interesting on x86, where we need to dispatch on ISA extensions *and* 
vector width. It's much simpler (and safer) to compile a TU multiple times, 
restricted to a certain set of ISA extensions, and then dispatch to the right 
translation unit from some generic code section.)
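
The multiple-TU scheme can be sketched as follows; the two implementations and 
the feature probe are illustrative stand-ins (in reality each variant would 
live in a TU compiled with its own -m flags, and the probe would be something 
like GCC's __builtin_cpu_supports):

```cpp
#include <cstddef>

// Illustrative stand-ins: both variants are defined inline here so the
// sketch is self-contained; normally they come from separate TUs.
namespace detail
{
  inline int sum_generic(const int* p, std::size_t n)
  {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
      s += p[i];
    return s;
  }

  // Stand-in for the variant from the TU compiled with, e.g., -mavx2.
  inline int sum_wide(const int* p, std::size_t n)
  { return sum_generic(p, n); }

  // Hypothetical probe; a real one would use __builtin_cpu_supports("avx2")
  // or read hwcaps.
  inline bool cpu_has_wide_vectors() { return false; }
}

// The generic code section dispatches once, at load time.
inline int (*const sum)(const int*, std::size_t)
  = detail::cpu_has_wide_vectors() ? detail::sum_wide : detail::sum_generic;
```

Only the dispatcher lives in the generic section; the `detail::sum_*` variants 
would be the same source compiled with different flags.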

> As you note later, I think the same thing would apply to x86_64.

Yes. I don't think "same" is the case (yet), but it's very similar. Once ARM is 
at SVE9 and binaries need to support HW from SVE2 up to SVE9, it gets closer 
to "same".

> > The big issue I see here is that currently all of std::* is declared
> > without an arm_streaming or arm_streaming_compatible. Thus, IIUC, you
> > can't use anything from the standard library in streaming mode. Since
> > that also applies to std::experimental::simd, we're not creating a new
> > footgun, only missing out on potential users?
> 
> Kind-of.  However, we can inline a non-streaming function into a streaming
> function if that doesn't change defined behaviour.  And that's important
> in practice for C++, since most trivial inline functions will not be
> marked streaming-compatible despite being so in practice.

Ah good to know that it takes a pragmatic approach here. But I imagine this 
could become a source of confusion to users.

> > [...]
> > the compiler *must* virally apply target_clones to all functions it calls.
> > And member functions must either also get cloned as functions, or the
> > whole type must be cloned (as in the std::simd case, where the sizeof
> > needs to change). 
> Yeah, tricky :)
> 
> It's also not just about vector widths.  The target-clones case also has
> the problem that you cannot detect at include time which features are
> available.  E.g. “do I have SVE2-specific instructions?” becomes a
> contextual question rather than a global question.
> 
> Fortunately, this should just be a missed optimisation.  But it would be
> nice if uses of std::simd in SVE2 clones could take advantage of SVE2-only
> instructions, even if SVE2 wasn't enabled at include time.

Exactly. Even if we solve the scalable vector-length question, the 
target_clones question stays relevant.

So far my best answer, for x86 at least, is to compile the SIMD code multiple 
times into different shared libraries. And then let the dynamic linker pick 
the right library variant depending on the CPU. I'd be happy to have something 
simpler and working right out of the box.

Best,
  Matthias

-- 
──
 Dr. Matthias Kretz   https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research   https://gsi.de
 std::simd
──





Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-03-27 Thread Richard Sandiford
Matthias Kretz  writes:
> Hi Richard,
>
> sorry for not answering sooner. I took action on your mail but failed to also 
> give feedback. Now in light of your veto of Srinivas' patch I wanted to use 
> the 
> opportunity to pick this up again.
>
> On Tuesday, 23 January 2024 21:57:23 CET Richard Sandiford wrote:
>> However, we also support different vector lengths for streaming SVE
>> (running in "streaming" mode on SME) and non-streaming SVE (running
>> in "non-streaming" mode on the core).  Having two different lengths is
>> expected to be the common case, rather than a theoretical curiosity.
>
> I read up on this after you mentioned this for the first time. As a WG21 
> member I find the approach troublesome - but that's a bit off-topic for this 
> thread.
>
> The big issue here is that, IIUC, a user (and the simd library) cannot do the 
> right thing at the moment. There simply isn't enough context information 
> available when parsing the <experimental/simd> header. I.e. on definition of 
> the class template there's no facility to take target_clones or SME 
> "streaming" mode into account. Consequently, if we want the library to be fit 
> for SME, then we need more language extension(s) to make it work.

Yeah.  I think the same applies to plain SVE.  It seems reasonable to
have functions whose implementation is specialised for a specific SVE
length, with that function being selected at runtime where appropriate.
Those functions needn't (in principle) be in separate TUs.  The “best”
definition of native then becomes a per-function property rather
than a per-TU property.

As you note later, I think the same thing would apply to x86_64.

> I guess I'm looking for a way to declare types that are different depending 
> on 
> whether they are used in streaming mode or non-streaming mode (making them 
> ill-formed to use in functions marked arm_streaming_compatible).
>
> From reading through https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode I don't see any discussion of 
> member functions or ctor/dtor, static and non-static data members, etc.
>
> The big issue I see here is that currently all of std::* is declared without 
> an 
> arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't use anything 
> from the standard library in streaming mode. Since that also applies to 
> std::experimental::simd, we're not creating a new footgun, only missing out 
> on 
> potential users?

Kind-of.  However, we can inline a non-streaming function into a streaming
function if that doesn't change defined behaviour.  And that's important
in practice for C++, since most trivial inline functions will not be
marked streaming-compatible despite being so in practice.

It's UB to pass and return SVE vectors across streaming/non-streaming
boundaries unless the two VLs are equal.  It's therefore valid to inline
such functions into streaming functions *unless* the callee uses
non-streaming-only instructions such as gather loads.

Because of that, someone trying to use std::experimental::simd in SME
functions is likely to succeed, at least in simple cases.

> Some more thoughts on target_clones/streaming SVE language extension 
> evolution:
>
>   void nonstreaming_fn(void) {
> constexpr int width = __arm_sve_bits(); // e.g. 512
> constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
>   // vector_size attribute works with bytes, not bits)
>   }
>
>   __attribute__((arm_locally_streaming))
>   void streaming_fn(void) {
> constexpr int width = __arm_sve_bits(); // e.g. 128
> constexpr int width2 = __builtin_vector_size(); // e.g. 16
>   }
>
>   __attribute__((target_clones("sse4.2,avx2")))
>   void cloned_fn(void) {
> constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
>   // and 32 in the avx2 clone
>   }
>
> ... as a starting point for exploration. Given this, I'd still have to resort 
> to a macro to define a "native" simd type:
>
> #define NATIVE_SIMD(T) std::experimental::simd<T, _Sve<__arm_sve_bits() / CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>
>
> Getting rid of the macro seems to be even harder.

Yeah.  The constexprs in the AArch64 functions would only be compile-time
constants in to-be-defined circumstances, using some mechanism to specify
the streaming and non-streaming vector lengths at compile time.  But that's
a premise of the whole discussion, just noting it for the record in case
anyone reading this later jumps in at this point.

> A declaration of an alias like
>
> template <typename T>
> using SveSimd = std::experimental::simd<T, _Sve<__arm_sve_bits() / CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>;
>
> would have to delay "invoking" __arm_sve_bits() until it knows its context:
>
>   void nonstreaming_fn(void) {
> static_assert(sizeof(SveSimd<float>) == 64);
>   }
>
>   __attribute__((arm_locally_streaming))
>   void streaming_fn(void) {
> static_assert(sizeof(SveSimd<float>) == 16);
> nonstreaming_fn(); // fine
>   }
>
> This gets even worse for target_clones, where
>
>   void f() {
> 

Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-03-27 Thread Matthias Kretz
Hi Richard,

sorry for not answering sooner. I took action on your mail but failed to also 
give feedback. Now in light of your veto of Srinivas' patch I wanted to use the 
opportunity to pick this up again.

On Tuesday, 23 January 2024 21:57:23 CET Richard Sandiford wrote:
> However, we also support different vector lengths for streaming SVE
> (running in "streaming" mode on SME) and non-streaming SVE (running
> in "non-streaming" mode on the core).  Having two different lengths is
> expected to be the common case, rather than a theoretical curiosity.

I read up on this after you mentioned this for the first time. As a WG21 
member I find the approach troublesome - but that's a bit off-topic for this 
thread.

The big issue here is that, IIUC, a user (and the simd library) cannot do the 
right thing at the moment. There simply isn't enough context information 
available when parsing the <experimental/simd> header. I.e. on definition of 
the class template there's no facility to take target_clones or SME 
"streaming" mode into account. Consequently, if we want the library to be fit 
for SME, then we need more language extension(s) to make it work.

I guess I'm looking for a way to declare types that are different depending on 
whether they are used in streaming mode or non-streaming mode (making them 
ill-formed to use in functions marked arm_streaming_compatible).

From reading through https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode I don't see any discussion of 
member functions or ctor/dtor, static and non-static data members, etc.

The big issue I see here is that currently all of std::* is declared without an 
arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't use anything 
from the standard library in streaming mode. Since that also applies to 
std::experimental::simd, we're not creating a new footgun, only missing out on 
potential users?

Some more thoughts on target_clones/streaming SVE language extension 
evolution:

  void nonstreaming_fn(void) {
constexpr int width = __arm_sve_bits(); // e.g. 512
constexpr int width2 = __builtin_vector_size(); // e.g. 64 (the
  // vector_size attribute works with bytes, not bits)
  }

  __attribute__((arm_locally_streaming))
  void streaming_fn(void) {
constexpr int width = __arm_sve_bits(); // e.g. 128
constexpr int width2 = __builtin_vector_size(); // e.g. 16
  }

  __attribute__((target_clones("sse4.2,avx2")))
  void cloned_fn(void) {
constexpr int width = __builtin_vector_size(); // 16 in the sse4.2 clone
  // and 32 in the avx2 clone
  }

... as a starting point for exploration. Given this, I'd still have to resort 
to a macro to define a "native" simd type:

#define NATIVE_SIMD(T) std::experimental::simd<T, _Sve<__arm_sve_bits() / CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>

Getting rid of the macro seems to be even harder.

A declaration of an alias like

template <typename T>
using SveSimd = std::experimental::simd<T, _Sve<__arm_sve_bits() / CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>;

would have to delay "invoking" __arm_sve_bits() until it knows its context:

  void nonstreaming_fn(void) {
static_assert(sizeof(SveSimd<float>) == 64);
  }

  __attribute__((arm_locally_streaming))
  void streaming_fn(void) {
static_assert(sizeof(SveSimd<float>) == 16);
nonstreaming_fn(); // fine
  }

This gets even worse for target_clones, where

  void f() {
sizeof(std::simd<float>) == ?
  }

  __attribute__((target_clones("sse4.2,avx2")))
  void g() {
f();
  }

the compiler *must* virally apply target_clones to all functions it calls. And 
member functions must either also get cloned as functions, or the whole type 
must be cloned (as in the std::simd case, where the sizeof needs to change). 
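
A portable way to see why making the width a template parameter sidesteps the 
cloning problem: once the register size is part of the type, each clone is 
simply a different instantiation with its own sizeof (the names below are 
illustrative, not SVE-specific):

```cpp
#include <cstddef>

// Illustrative: a "native" simd whose register size is a type parameter,
// so no single type's sizeof ever has to change per target_clones copy.
template <std::size_t RegBytes>
struct native_simd_like
{
  alignas(RegBytes) unsigned char data[RegBytes];
};

static_assert(sizeof(native_simd_like<16>) == 16);  // "SSE-sized" instantiation
static_assert(sizeof(native_simd_like<32>) == 32);  // "AVX-sized" instantiation
```

Each clone then mangles differently, so member functions of the two 
instantiations can coexist in one binary without viral cloning of the type.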


> When would NumberOfUsedBytes < SizeofRegister be used for SVE?  Would it
> be for storing narrower elements in wider containers?  If the interface
> supports that then, yeah, two parameters would probably be safer.
> 
> Or were you thinking about emulating narrower vectors with wider registers
> using predication?  I suppose that's possible too, and would be similar in
> spirit to using SVE to optimise Advanced SIMD std::simd types.
> But mightn't it cause confusion if sizeof applied to a "16-byte"
> vector actually gives 32?

Yes, the idea is to e.g. use one SVE register instead of two NEON registers 
for a simd<float, 8> with SVE512.

The user never asks for a "16-byte" vector. The user asks for a value type and 
a number of elements. Sure, the wasteful "padding" might come as a surprise, 
but it's totally within the spec to implement it like this.
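
A layout sketch of that padding (names are hypothetical; RegBytes stands in 
for the SVE register size, so <float, 8> in a 512-bit register occupies the 
whole 64 bytes):

```cpp
// Illustrative: N requested lanes stored in a full RegBytes-sized register;
// the trailing lanes are padding.
template <typename T, int N, int RegBytes>
struct padded_simd
{
  static_assert(sizeof(T) * N <= RegBytes, "payload must fit the register");
  alignas(RegBytes) T data[RegBytes / sizeof(T)];
};

// The user asked for 8 floats (32 bytes of payload), but sizeof reports the
// full register -- the surprise discussed above, yet within the spec.
static_assert(sizeof(padded_simd<float, 8, 64>) == 64);
```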


> I assume std::experimental::native_simd<T> has to have the same
> meaning everywhere for ODR reasons?

No. Only std::experimental::simd<T> has to be "ABI stable". And note that in 
the C++ spec there's no such thing as compiling and linking TUs with different 
compiler flags. That's plain UB. The committee still cares about it, but 
getting this "right" cannot be part of the standard and must be defined by 
implementers.

>  If so, it'll need to be an
> Advanced SIMD vector for AArch64 (but using SVE to optimise 

Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-01-23 Thread Richard Sandiford
Matthias Kretz  writes:
> On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
>> Thanks for the patch and sorry for the slow review.
>
> Sorry for my slow reaction. I needed a long vacation. For now I'll focus on 
> the design question wrt. multi-arch compilation.
>
>> I can only comment on the usage of SVE, rather than on the scaffolding
>> around it.  Hopefully Jonathan or others can comment on the rest.
>
> That's very useful!
>
>> The main thing that worries me is:
>> 
>> #if _GLIBCXX_SIMD_HAVE_SVE
>> constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS/8;
>> #else
>> constexpr inline int __sve_vectorized_size_bytes = 0;
>> #endif
>> 
>> Although -msve-vector-bits is currently a per-TU setting, that isn't
>> necessarily going to be the case in future.
>
> This is a topic that I care about a lot... as simd user, implementer, and 
> WG21 
> proposal author. Are you thinking of a plan to implement the target_clones 
> function attribute for different SVE lengths? Or does it go further than 
> that? 
> PR83875 is raising the same issue and solution ideas for x86. If I understand 
> your concern correctly, then the issue you're raising exists in the same form 
> for x86.
>
> If anyone is interested in working on a "translation phase 7 replacement" for 
> compiler flags macros I'd be happy to give some input of what I believe is 
> necessary to make target_clones work with std(x)::simd. This seems to be 
> about 
> constant expressions that return compiler-internal state - probably similar 
> to 
> how static reflection needs to work.
>
> For a sketch of a direction: what I'm already doing in 
> std::experimental::simd, is to tag all non-always_inline function names with 
> a 
> bitflag, representing a relevant subset of -f and -m flags. That way, I'm 
> guarding against surprises on linking TUs compiled with different flags.

Yeah, target_clones could be one use for switching lengths.  In that
case the choice would be based on a load-time check of the vector length.

However, we also support different vector lengths for streaming SVE
(running in "streaming" mode on SME) and non-streaming SVE (running
in "non-streaming" mode on the core).  Having two different lengths is
expected to be the common case, rather than a theoretical curiosity.

Software also provides a "streaming compatible" mode in which code can
run either in streaming or non-streaming mode and which restricts itself
to the common subset of non-streaming and streaming features (which is
large).

So it isn't really safe to treat the vector length as a program or
process invariant, or to assume that all active uses of std::simd
in a binary are for the same vector length.  It should in principle
be possible to use std::simd variants for non-streaming code and
std::simd variants for streaming code even if the lengths are
different.

So in the target_clones case, the clone would have to apply to a
non-streaming function and switch based on the non-streaming
vector length, or apply to a streaming function and switch based
on the streaming vector length.  (Neither of those are supported yet,
of course.)

I think there are probably also ODR problems with hard-coding a vector
length, instead of treating it as a template parameter.

Unfortunately it isn't currently possible to write:

template<int N>
using T = svint32_t __attribute__((arm_sve_vector_bits(N)));

due to PRs 58855/68703/88600.  But I think we should fix that. :)

>> Ideally it would be
>> possible to define different implementations of a function with
>> different (fixed) vector lengths within the same TU.  The value at
>> the point that the header file is included is then somewhat arbitrary.
>> 
>> So rather than have:
>> >  using __neon128 = _Neon<16>;
>> >  using __neon64 = _Neon<8>;
>> > 
>> > +using __sve = _Sve<>;
>> 
>> would it be possible instead to have:
>> 
>>   using __sve128 = _Sve<128>;
>>   using __sve256 = _Sve<256>;
>>   ...etc...
>> 
>> ?  Code specialised for 128-bit vectors could then use __sve128 and
>> code specialised for 256-bit vectors could use __sve256.
>
> Hmm, as things stand we'd need two numbers, IIUC:
> _Sve<NumberOfUsedBytes, SizeofRegister>
>
> On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm 
> registers 
> (and -mavx512f), 17-32 implies ymm, and <=16 implies xmm (except where it 
> doesn't ;) ).

When would NumberOfUsedBytes < SizeofRegister be used for SVE?  Would it
be for storing narrower elements in wider containers?  If the interface
supports that then, yeah, two parameters would probably be safer.

Or were you thinking about emulating narrower vectors with wider registers
using predication?  I suppose that's possible too, and would be similar in
spirit to using SVE to optimise Advanced SIMD std::simd types.
But mightn't it cause confusion if sizeof applied to a "16-byte"
vector actually gives 32?

>> Perhaps that's not possible as things stand, but it'd be interesting
>> to know the exact failure mode if so.  Either way, it would be 

Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-01-18 Thread Matthias Kretz
On Thursday, 18 January 2024 08:40:48 CET Andrew Pinski wrote:
> On Wed, Jan 17, 2024 at 11:28 PM Matthias Kretz  wrote:
> > template <typename T>
> > struct Point
> > {
> >   T x, y, z;
> >   
> >   T distance_to_origin() {
> > return sqrt(x * x + y * y + z * z);
> >   }
> > };
> > 
> > Point<float> is one point in 3D space, Point<simd<float>> stores multiple
> > points in 3D space and can work on them in parallel.
> > 
> > This implies that simd<T> must have a sizeof. C++ is unlikely to get
> > sizeless types (the discussions were long, there were many papers, ...).
> > Should sizeless types in C++ ever happen, then composition is likely going
> > to be constrained to the last data member.
> 
> Even this is a bad design in general for simd. It means the code needs
> to know the size.

Yes and no. The developer writes size-agnostic code. The person compiling the 
code chooses the size (via -m flags) and thus the compiler sees fixed-size 
code.

> Also AoS vs SoA is always an interesting point here. In some cases you
> want an array of structs
> for speed, and Point<simd<float>> does not work there at all. I guess
> this is all water under the bridge with how folks design code.
> You are basically pushing the AoSoA idea here, which is a much worse idea
> than before.

I like to call it "array of vectorized struct" (AoVS) instead of AoSoA to 
emphasize the compiler-flags dependent memory layout.

I've been doing a lot of heterogeneous SIMD programming since 2009, starting 
with an outer loop vectorization across many TUs of a high-energy physics code 
targeting Intel Larrabee (pre-AVX512 ISA) and SSE2 with one source. In all 
these years my experience has been that, if the problem allows, AoVS is best 
in terms of performance and code generality & readability. I'd be interested 
to learn why you think differently.
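
The AoVS layout can be made concrete with a portable stand-in for simd<float> 
(vfloat4 is illustrative, fixed at four lanes):

```cpp
#include <cstddef>  // offsetof

// Four-lane stand-in for simd<float>; in real AoVS code the lane count
// follows from the compiler flags.
struct vfloat4 { float lane[4]; };

template <typename T>
struct Point { T x, y, z; };

// AoS:  x0 y0 z0 | x1 y1 z1 | ...           (one struct per point)
// AoVS: x0 x1 x2 x3 | y0 y1 y2 y3 | z0 ...  (one struct per 4 points)
using AoS  = Point<float>;
using AoVS = Point<vfloat4>;

static_assert(sizeof(AoS) == 3 * sizeof(float));
static_assert(sizeof(AoVS) == 3 * sizeof(vfloat4));
static_assert(offsetof(AoVS, y) == 4 * sizeof(float));  // y-lanes contiguous
```

Within each AoVS element the lanes are contiguous and naturally aligned for 
vector loads, while across elements the data still streams linearly like AoS.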

> That being said, sometimes it is not a vector of N elements you want to
> work on but rather 1/2/3 vectors of N elements. Seems like this is
> just pushing the idea of one vector of one type of element, which
> again is the wrong push.

I might have misunderstood. You're saying that sometimes I want a wider 
simd<T, N> than my target CPU has registers for? Yes! The 
std::experimental::simd spec and implementation isn't good enough in that area 
yet, but the C++26 paper(s) and my prototype implementation provide perfect 
SIMD + ILP translation of the expressed data-parallelism.

> Moreover, I guess pushing one idea of SIMD is worse than pushing
> any idea of SIMD. For mathematical code, it is better for the compiler
> to do the vectorization than for the user to try to be semi-portable between
> different targets.

I guess I agree with that statement. But I wouldn't, in general, call the use 
of simd<T> "the user try[ing] to be semi-portable". In my experience, working 
on physics code - a lot of math - using simd<T> (as intended) is better in 
terms of performance and performance portability. As always, abuse is possible 
...

> This is what was learned on Fortran, but I guess
> some folks in the C++ world like to expose the underlying HW instead of
> thinking high level here.

The C++ approach is to "leave no room for a lower-level language" while 
designing for high-level abstractions / usage.

> > With the above as our design constraints, SVE at first seems to be a bad
> > fit for implementing std::simd. However, if (at least initially) we accept
> > the need for different binaries for different SVE implementations, then
> > you
> > can look at the "scalable" part of SVE as an efficient way of reducing the
> > number of opcodes necessary for supporting all kinds of different vector
> > lengths. But otherwise you can treat it as fixed-size registers - which it
> > is for a given CPU. In the case of a multi-CPU shared-memory system (e.g.
> > RDMA between different ARM implementations) all you need is a different
> > name for incompatible types. So std::simd<T> on SVE256 must have a
> > different name on SVE512. Same for std::simd<T, N> (which is currently
> > not the case with Srinivas' patch, I think, and needs to be resolved).
> 
> For SVE that is a bad design. It means the code is not portable at all.

When you say "code" you mean "source code", not binaries, right? I don't see 
how that follows.

- Matthias

-- 
──
 Dr. Matthias Kretz   https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research   https://gsi.de
 std::simd
──





Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-01-17 Thread Andrew Pinski
On Wed, Jan 17, 2024 at 11:28 PM Matthias Kretz  wrote:
>
> On Thursday, 4 January 2024 10:10:12 CET Andrew Pinski wrote:
> > I really doubt this would work in the end. Because HW which is 128 bits
> > only can't support -msve-vector-bits=2048. I am thinking
> > std::experimental::simd is not the right way of supporting this.
> > Really the route the standard should be heading towards is
> > non-constant at compile time sized vectors instead, and then you could use
> > the constant sized ones to emulate the variable length ones.
>
> I don't follow. "non-constant at compile time sized vectors" implies
> sizeless (no constexpr sizeof), no?
> Let me try to explain where I'm coming from. One of the motivating use
> cases for simd types is composition. Like this:
>
> template <typename T>
> struct Point
> {
>   T x, y, z;
>
>   T distance_to_origin() {
> return sqrt(x * x + y * y + z * z);
>   }
> };
>
> Point<float> is one point in 3D space, Point<simd<float>> stores multiple
> points in 3D space and can work on them in parallel.
>
> This implies that simd<T> must have a sizeof. C++ is unlikely to get
> sizeless types (the discussions were long, there were many papers, ...).
> Should sizeless types in C++ ever happen, then composition is likely going
> to be constrained to the last data member.

Even this is a bad design in general for simd. It means the code needs
to know the size.
Also AoS vs SoA is always an interesting point here. In some cases you
want an array of structs
for speed, and Point<simd<float>> does not work there at all. I guess
this is all water under the bridge with how folks design code.
You are basically pushing the AoSoA idea here, which is a much worse idea than before.

That being said, sometimes it is not a vector of N elements you want to
work on but rather 1/2/3 vectors of N elements. Seems like this is
just pushing the idea of one vector of one type of element, which
again is the wrong push.
Moreover, I guess pushing one idea of SIMD is worse than pushing
any idea of SIMD. For mathematical code, it is better for the compiler
to do the vectorization than for the user to try to be semi-portable between
different targets. This is what was learned on Fortran, but I guess
some folks in the C++ world like to expose the underlying HW instead of
thinking high level here.

Thanks,
Andrew Pinski

>
> With the above as our design constraints, SVE at first seems to be a bad
> fit for implementing std::simd. However, if (at least initially) we accept
> the need for different binaries for different SVE implementations, then you
> can look at the "scalable" part of SVE as an efficient way of reducing the
> number of opcodes necessary for supporting all kinds of different vector
> lengths. But otherwise you can treat it as fixed-size registers - which it
> is for a given CPU. In the case of a multi-CPU shared-memory system (e.g.
> RDMA between different ARM implementations) all you need is a different
> name for incompatible types. So std::simd<T> on SVE256 must have a
> different name on SVE512. Same for std::simd<T, N> (which is currently
> not the case with Srinivas' patch, I think, and needs to be resolved).

For SVE that is a bad design. It means the code is not portable at all.

>
> > I think we should not depend on __ARM_FEATURE_SVE_BITS being set here
> > and being meaningful in any way.
>
> I'd love to. In the same way I'd love to *not depend* on __AVX__,
> __AVX512F__ etc.
>
> - Matthias
>
> --
> ──
>  Dr. Matthias Kretz   https://mattkretz.github.io
>  GSI Helmholtz Center for Heavy Ion Research   https://gsi.de
>  std::simd
> ──


Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-01-17 Thread Matthias Kretz
On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
> Thanks for the patch and sorry for the slow review.

Sorry for my slow reaction. I needed a long vacation. For now I'll focus on 
the design question wrt. multi-arch compilation.

> I can only comment on the usage of SVE, rather than on the scaffolding
> around it.  Hopefully Jonathan or others can comment on the rest.

That's very useful!

> The main thing that worries me is:
> 
> #if _GLIBCXX_SIMD_HAVE_SVE
> constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS/8;
> #else
> constexpr inline int __sve_vectorized_size_bytes = 0;
> #endif
> 
> Although -msve-vector-bits is currently a per-TU setting, that isn't
> necessarily going to be the case in future.

This is a topic that I care about a lot... as simd user, implementer, and WG21 
proposal author. Are you thinking of a plan to implement the target_clones 
function attribute for different SVE lengths? Or does it go further than that? 
PR83875 is raising the same issue and solution ideas for x86. If I understand 
your concern correctly, then the issue you're raising exists in the same form 
for x86.

If anyone is interested in working on a "translation phase 7 replacement" for 
compiler flags macros I'd be happy to give some input of what I believe is 
necessary to make target_clones work with std(x)::simd. This seems to be about 
constant expressions that return compiler-internal state - probably similar to 
how static reflection needs to work.

For a sketch of a direction: what I'm already doing in 
std::experimental::simd, is to tag all non-always_inline function names with a 
bitflag, representing a relevant subset of -f and -m flags. That way, I'm 
guarding against surprises on linking TUs compiled with different flags.
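
The tagging idea can be sketched with an inline namespace chosen from 
ABI-relevant macros (the namespace names here are made up; the actual 
libstdc++ mechanism differs in detail):

```cpp
// Hedged sketch: the inline namespace changes the mangled names of everything
// inside it, so TUs compiled with incompatible flags fail to link instead of
// silently mixing incompatible definitions.
#if defined __AVX2__
inline namespace __simd_flags_avx2 {
#else
inline namespace __simd_flags_generic {
#endif

struct simd_impl  // placeholder for a non-always_inline implementation detail
{
  int lanes;
};

}  // end of flag-tagged inline namespace
```

Callers still write `simd_impl`, but a TU built with -mavx2 and one built 
without now reference differently mangled symbols.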

> Ideally it would be
> possible to define different implementations of a function with
> different (fixed) vector lengths within the same TU.  The value at
> the point that the header file is included is then somewhat arbitrary.
> 
> So rather than have:
> >  using __neon128 = _Neon<16>;
> >  using __neon64 = _Neon<8>;
> > 
> > +using __sve = _Sve<>;
> 
> would it be possible instead to have:
> 
>   using __sve128 = _Sve<128>;
>   using __sve256 = _Sve<256>;
>   ...etc...
> 
> ?  Code specialised for 128-bit vectors could then use __sve128 and
> code specialised for 256-bit vectors could use __sve256.

Hmm, as things stand we'd need two numbers, IIUC:
_Sve<NumberOfUsedBytes, SizeofRegister>

On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm registers 
(and -mavx512f), 17-32 implies ymm, and <=16 implies xmm (except where it 
doesn't ;) ).

> Perhaps that's not possible as things stand, but it'd be interesting
> to know the exact failure mode if so.  Either way, it would be good to
> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
> and instead rely on information derived from template parameters.

The TS spec requires std::experimental::native_simd<T> to basically give you 
the largest, most efficient, full SIMD register. (And it can't be a sizeless 
type because they don't exist in C++). So how would you do that without 
looking at __ARM_FEATURE_SVE_BITS in the simd implementation?


> It should be possible to use SVE to optimise some of the __neon*
> implementations, which has the advantage of working even for VLA SVE.
> That's obviously a separate patch, though.  Just saying for the record.

I learned that NVidia Grace CPUs alias NEON and SVE registers. But I must 
assume that other SVE implementations (especially those with 
__ARM_FEATURE_SVE_BITS > 128) don't do that and might incur a significant 
latency when going from a NEON register to an SVE register and back (which 
each requires a store-load, IIUC). So are you thinking of implementing 
everything via SVE? That would break ABI, no?

- Matthias

-- 
──
 Dr. Matthias Kretz   https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research   https://gsi.de
 std::simd
──





Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-01-17 Thread Matthias Kretz
On Thursday, 4 January 2024 10:10:12 CET Andrew Pinski wrote:
I really doubt this would work in the end. Because HW which is 128 bits
only can't support -msve-vector-bits=2048. I am thinking
std::experimental::simd is not the right way of supporting this.
Really the route the standard should be heading towards is
non-constant at compile time sized vectors instead, and then you could use
the constant sized ones to emulate the variable length ones.

I don't follow. "non-constant at compile time sized vectors" implies 
sizeless (no constexpr sizeof), no?
Let me try to explain where I'm coming from. One of the motivating use 
cases for simd types is composition. Like this:

template <typename T>
struct Point
{
  T x, y, z;

  T distance_to_origin() {
return sqrt(x * x + y * y + z * z);
  }
};

Point<float> is one point in 3D space, Point<simd<float>> stores multiple 
points in 3D space and can work on them in parallel.

This implies that simd<T> must have a sizeof. C++ is unlikely to get 
sizeless types (the discussions were long, there were many papers, ...). 
Should sizeless types in C++ ever happen, then composition is likely going 
to be constrained to the last data member.
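A minimal, portable sketch of this composition (simd_like is a hypothetical array-based stand-in for the library type, included only to show why a sizeof is needed):

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a fixed-size simd<T, N>: a plain array with
// element-wise operators. A real implementation maps this to registers,
// but the composition property only requires that sizeof() exists.
template <typename T, std::size_t N>
struct simd_like
{
  std::array<T, N> data{};

  friend simd_like operator*(simd_like a, simd_like b)
  {
    simd_like r;
    for (std::size_t i = 0; i < N; ++i) r.data[i] = a.data[i] * b.data[i];
    return r;
  }
  friend simd_like operator+(simd_like a, simd_like b)
  {
    simd_like r;
    for (std::size_t i = 0; i < N; ++i) r.data[i] = a.data[i] + b.data[i];
    return r;
  }
};

// Element-wise sqrt, found via ADL from Point::distance_to_origin.
template <typename T, std::size_t N>
simd_like<T, N> sqrt(simd_like<T, N> v)
{
  for (auto& x : v.data) x = std::sqrt(x);
  return v;
}

// The same Point<T> works for a scalar T and for a simd-like T: the
// member layout composes because simd_like has a fixed sizeof.
template <typename T>
struct Point
{
  T x, y, z;
  T distance_to_origin() { return sqrt(x * x + y * y + z * z); }
};
```

With a sizeless T this struct could not be laid out at all, which is the constraint described above.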

With the above as our design constraints, SVE at first seems to be a bad 
fit for implementing std::simd. However, if (at least initially) we accept 
the need for different binaries for different SVE implementations, then you 
can look at the "scalable" part of SVE as an efficient way of reducing the 
number of opcodes necessary for supporting all kinds of different vector 
lengths. But otherwise you can treat it as fixed-size registers - which it 
is for a given CPU. In the case of a multi-CPU shared-memory system (e.g. 
RDMA between different ARM implementations) all you need is a different 
name for incompatible types. So std::simd on SVE256 must have a 
different name on SVE512. Same for std::simd (which is currently 
not the case with Srinivas' patch, I think, and needs to be resolved).
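The "different name for incompatible layouts" idea can be modeled portably by baking the build-time width into the type (a sketch with hypothetical names, not the patch's actual types):

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical: encode the fixed vector width in bits into the type.
// An object built for SVE256 then has a different (mangled) type than
// one built for SVE512, so mixing them across TUs or over RDMA is a
// compile-time or link-time error rather than silent corruption.
template <typename T, int VectorBits>
struct fixed_simd
{
  static constexpr int size() { return VectorBits / (8 * sizeof(T)); }
  alignas(VectorBits / 8) T data[VectorBits / (8 * sizeof(T))];
};

using simd_f32_sve256 = fixed_simd<float, 256>;
using simd_f32_sve512 = fixed_simd<float, 512>;
```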

> I think we should not depend on __ARM_FEATURE_SVE_BITS being set here
> and being meaningful in any way.

I'd love to. In the same way I'd love to *not depend* on __AVX__, 
__AVX512F__ etc.

- Matthias

-- 
──
 Dr. Matthias Kretz   https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research   https://gsi.de
 std::simd
──


Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-01-04 Thread Andrew Pinski
On Thu, Jan 4, 2024 at 1:03 AM Srinivas Yadav
 wrote:
>
>
> Hi,
>
> Thanks a lot for the review. Sorry for the very late reply.
>
> The following are my comments on the feedback.
>
>>
>> The main thing that worries me is:
>>
>>   #if _GLIBCXX_SIMD_HAVE_SVE
>>   constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS 
>> / 8;
>>   #else
>>   constexpr inline int __sve_vectorized_size_bytes = 0;
>>   #endif
>>
>> Although -msve-vector-bits is currently a per-TU setting, that isn't
>> necessarily going to be the case in future.  Ideally it would be
>> possible to define different implementations of a function with
>> different (fixed) vector lengths within the same TU.  The value at
>> the point that the header file is included is then somewhat arbitrary.
>>
>> So rather than have:
>>
>> >  using __neon128 = _Neon<16>;
>> >  using __neon64 = _Neon<8>;
>> > +using __sve = _Sve<>;
>>
>> would it be possible instead to have:
>>
>>   using __sve128 = _Sve<128>;
>>   using __sve256 = _Sve<256>;
>>   ...etc...
>>
>> ?  Code specialised for 128-bit vectors could then use __sve128 and
>> code specialised for 256-bit vectors could use __sve256.
>
>
> I think it is very hard to maintain specialization for each vector length, as 
> lengths for
> SVE can vary from 128 to 2048 in 128 bit increments. Hence, there would be 
> many
> specializations. Also, we only embed the vector length information in the 
> type specialization.
> Hence, one generic __sve abi type would be sufficient, which holds the max 
> vector length (in bytes)
> derived from __sve_vectorized_size_bytes (which is again derived from
> __ARM_FEATURE_SVE_BITS in simd.h).

I really doubt this would work in the end. Because HW which is 128 bits
only can't support -msve-vector-bits=2048. I am thinking
std::experimental::simd is not the right way of supporting this.
Really the route the standard should be heading towards is non
constant at compile time sized vectors instead and then you could use
the constant sized ones to emulate the Variable length ones.
I think we should not depend on __ARM_FEATURE_SVE_BITS being set here
and being meaningful in any way.

Thanks,
Andrew Pinski

>
>> Perhaps that's not possible as things stand, but it'd be interesting
>> to know the exact failure mode if so.  Either way, it would be good to
>> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
>> and instead rely on information derived from template parameters.
>>
> I am currently directly using __ARM_FEATURE_SVE_BITS in simd_sve.h for 
> defining specific vector
> types of basic arithmetic types as int, float, double etc...
> The __ARM_FEATURE_SVE_BITS macro is used in the context of
> arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS) that requires an integer constant 
> expression.
> Hence, it's not quite usable with information derived from template parameters.
> May I please know what is the exact issue if __ARM_FEATURE_SVE_BITS is 
> directly used?
>
> Other possible solution is to define a local macro in simd.h as
> #define _GLIBCXX_ARM_FEATURE_SVE_BITS __ARM_FEATURE_SVE_BITS
> and use _GLIBCXX_ARM_FEATURE_SVE_BITS at all occurrences of 
> __ARM_FEATURE_SVE_BITS.
>
>
>> It should be possible to use SVE to optimise some of the __neon*
>> implementations, which has the advantage of working even for VLA SVE.
>> That's obviously a separate patch, though.  Just saying for the record.
>>
> Yes, that's an interesting thought. Noted. Thanks!
>
>
>> On:
>>
>> inline static __sve_bool_type
>> __sve_active_mask()
>> { return svwhilelt_b8(int8_t(0), int8_t(_Np)); };
>>
>> I think it'd be more canonical to use svptrue_b8().  (The inputs to
>> svwhilelt aren't naturally tied to the size of the output elements.
>> This particular call will use WHILELE with 32-bit registers.  So if
>> you do continue to use svwhile*, it'd be good to remove the casts on
>> the inputs.)
>>
>> It'd also be good to use svptrue_b8() for all element sizes where possible,
>> and only use element-specific masks where absolutely necessary.  This helps
>> to promote reuse of predicates.
>>
>> Or does the API require that every mask value is internally "canonical",
>> i.e.  every element must be 0 or 1, regardless of the element size?
>> If so, that could introduce some inefficiency on SVE, where upper bits
>> are (by design) usually ignored.
>>
> I am not sure if svptrue_b8() would satisfy the requirement here. As 
> currently,
> we use svwhilelt_b8(0, _Np) to make a mask with only [0 to _Np) elements 
> active (1)
> and remaining  elements to be inactive (0).
> I would prefer to continue to use svwhile* and remove the casts to the inputs.
>
>
>> On:
>>
>>   template 
>> struct __sve_vector_type
>> : __sve_vector_type<__get_sve_value_type_t, _Np>
>> {};
>>
>>   template 
>> struct __sve_vector_type
>> : __sve_vector_type<__get_sve_value_type_t, _Np>
>> {};
>>
>>   template 
>> struct __sve_vector_type
>> : 

Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2024-01-03 Thread Srinivas Yadav
Hi,

Thanks a lot for the review. Sorry for the very late reply.

The following are my comments on the feedback.


> The main thing that worries me is:
>
>   #if _GLIBCXX_SIMD_HAVE_SVE
>   constexpr inline int __sve_vectorized_size_bytes =
> __ARM_FEATURE_SVE_BITS / 8;
>   #else
>   constexpr inline int __sve_vectorized_size_bytes = 0;
>   #endif
>
> Although -msve-vector-bits is currently a per-TU setting, that isn't
> necessarily going to be the case in future.  Ideally it would be
> possible to define different implementations of a function with
> different (fixed) vector lengths within the same TU.  The value at
> the point that the header file is included is then somewhat arbitrary.
>
> So rather than have:
>
> >  using __neon128 = _Neon<16>;
> >  using __neon64 = _Neon<8>;
> > +using __sve = _Sve<>;
>
> would it be possible instead to have:
>
>   using __sve128 = _Sve<128>;
>   using __sve256 = _Sve<256>;
>   ...etc...
>
> ?  Code specialised for 128-bit vectors could then use __sve128 and
> code specialised for 256-bit vectors could use __sve256.
>

I think it is very hard to maintain specialization for each vector length,
as lengths for
SVE can vary from 128 to 2048 in 128 bit increments. Hence, there would be
many
specializations. Also, we only embed the vector length information in the
type specialization.
Hence, one generic __sve abi type would be sufficient, which holds the max
vector length (in bytes)
derived from __sve_vectorized_size_bytes (which is again derived from
__ARM_FEATURE_SVE_BITS in simd.h).

> Perhaps that's not possible as things stand, but it'd be interesting
> to know the exact failure mode if so.  Either way, it would be good to
> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
> and instead rely on information derived from template parameters.

I am currently directly using __ARM_FEATURE_SVE_BITS in simd_sve.h for
defining specific vector
types of basic arithmetic types as int, float, double etc...
The __ARM_FEATURE_SVE_BITS macro is used in the context of
arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS) that requires an integer
constant expression.
Hence, it's not quite usable with information derived from template
parameters.
May I please know what is the exact issue if __ARM_FEATURE_SVE_BITS is
directly used?

Other possible solution is to define a local macro in simd.h as
#define _GLIBCXX_ARM_FEATURE_SVE_BITS __ARM_FEATURE_SVE_BITS
and use _GLIBCXX_ARM_FEATURE_SVE_BITS at all occurrences of
__ARM_FEATURE_SVE_BITS.


> It should be possible to use SVE to optimise some of the __neon*
> implementations, which has the advantage of working even for VLA SVE.
> That's obviously a separate patch, though.  Just saying for the record.

Yes, that's an interesting thought. Noted. Thanks!


> On:
>
> inline static __sve_bool_type
> __sve_active_mask()
> { return svwhilelt_b8(int8_t(0), int8_t(_Np)); };
>
> I think it'd be more canonical to use svptrue_b8().  (The inputs to
> svwhilelt aren't naturally tied to the size of the output elements.
> This particular call will use WHILELE with 32-bit registers.  So if
> you do continue to use svwhile*, it'd be good to remove the casts on
> the inputs.)
>
> It'd also be good to use svptrue_b8() for all element sizes where possible,
> and only use element-specific masks where absolutely necessary.  This helps
> to promote reuse of predicates.
>
> Or does the API require that every mask value is internally "canonical",
> i.e.  every element must be 0 or 1, regardless of the element size?
> If so, that could introduce some inefficiency on SVE, where upper bits
> are (by design) usually ignored.
>
I am not sure if svptrue_b8() would satisfy the requirement here. As currently,
we use svwhilelt_b8(0, _Np) to make a mask with only [0 to _Np) elements
active (1)
and remaining  elements to be inactive (0).
I would prefer to continue to use svwhile* and remove the casts to the
inputs.


> On:
>
>   template 
> struct __sve_vector_type
> : __sve_vector_type<__get_sve_value_type_t, _Np>
> {};
>
>   template 
> struct __sve_vector_type
> : __sve_vector_type<__get_sve_value_type_t, _Np>
> {};
>
>   template 
> struct __sve_vector_type
> : __sve_vector_type<__get_sve_value_type_t, _Np>
> {};
>
>   template 
> struct __sve_vector_type
> : __sve_vector_type<__get_sve_value_type_t, _Np>
> {};
>
>   template 
> struct __sve_vector_type
> : __sve_vector_type<__get_sve_value_type_t, _Np>
> {};
>
>   template 
> struct __sve_vector_type
> : __sve_vector_type<__get_sve_value_type_t,
> _Np>
> {};
>
> Does this ensure complete coverage without duplicate definitions?  It seems
> to be hard-coding a specific choice of stdint.h type (e.g. by including
> char but not unsigned char, and long long int but not long int).
> Unfortunately the stdint.h definitions vary between newlib and glibc,
> for example.

I have overloaded all the types that are present in simd testsuite.

Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2023-12-11 Thread Richard Sandiford
Richard Sandiford  writes:
>   template 
> struct _SveMaskWrapper
> {
>   ...
>
>   _GLIBCXX_SIMD_INTRINSIC constexpr value_type
>   operator[](size_t __i) const
>   {
> return _BuiltinSveMaskType::__sve_mask_active_count(
> _BuiltinSveVectorType::__sve_active_mask(),
> svand_z(_BuiltinSveVectorType::__sve_active_mask(),
> svcmpeq(_BuiltinSveVectorType::__sve_active_mask(),
> _BuiltinSveMaskType::__index0123,
> typename 
> _BuiltinSveMaskType::__sve_mask_uint_type(__i)),
> _M_data));
>   }
>
> A simpler way to do this might be to use svdup_u_z (_M_data, 1, 0)
> and then svorrv on the result.

Sorry for the nonsense comment.  I was thinking of a different operation.
But GCC knows how to index a fixed-length data vector, so it might be
worth using:

  svdup_u_z (_M_data, 1)[__i]
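A portable model of what that sequence computes, with the predicate modeled as a bool array of the simd width (illustration only, not the SVE intrinsics themselves):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model of the suggested mask operator[]: expand the predicate to a 0/1
// data vector (what svdup_u8_z(_M_data, 1) does, with inactive lanes
// zeroed), then index the fixed-length result.
template <std::size_t N>
std::uint8_t mask_element(const std::array<bool, N>& pred, std::size_t i)
{
  std::array<std::uint8_t, N> expanded{};  // zero-initialised, like "_z"
  for (std::size_t k = 0; k < N; ++k)
    expanded[k] = pred[k] ? 1 : 0;         // svdup_..._z(pred, 1)
  return expanded[i];                      // vector subscript on a
}                                          // fixed-length vector
```

This avoids the whilelt/cmpeq/active-count dance quoted above.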

Thanks,
Richard


Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd

2023-12-10 Thread Richard Sandiford
Thanks for the patch and sorry for the slow review.

I can only comment on the usage of SVE, rather than on the scaffolding
around it.  Hopefully Jonathan or others can comment on the rest.

The main thing that worries me is:

  #if _GLIBCXX_SIMD_HAVE_SVE
  constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
  #else
  constexpr inline int __sve_vectorized_size_bytes = 0;
  #endif

Although -msve-vector-bits is currently a per-TU setting, that isn't
necessarily going to be the case in future.  Ideally it would be
possible to define different implementations of a function with
different (fixed) vector lengths within the same TU.  The value at
the point that the header file is included is then somewhat arbitrary.

So rather than have:

>  using __neon128 = _Neon<16>;
>  using __neon64 = _Neon<8>;
> +using __sve = _Sve<>;

would it be possible instead to have:

  using __sve128 = _Sve<128>;
  using __sve256 = _Sve<256>;
  ...etc...

?  Code specialised for 128-bit vectors could then use __sve128 and
code specialised for 256-bit vectors could use __sve256.

Perhaps that's not possible as things stand, but it'd be interesting
to know the exact failure mode if so.  Either way, it would be good to
remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
and instead rely on information derived from template parameters.

It should be possible to use SVE to optimise some of the __neon*
implementations, which has the advantage of working even for VLA SVE.
That's obviously a separate patch, though.  Just saying for the record.

On:

inline static __sve_bool_type
__sve_active_mask()
{ return svwhilelt_b8(int8_t(0), int8_t(_Np)); };

I think it'd be more canonical to use svptrue_b8().  (The inputs to
svwhilelt aren't naturally tied to the size of the output elements.
This particular call will use WHILELE with 32-bit registers.  So if
you do continue to use svwhile*, it'd be good to remove the casts on
the inputs.)

It'd also be good to use svptrue_b8() for all element sizes where possible,
and only use element-specific masks where absolutely necessary.  This helps
to promote reuse of predicates.

Or does the API require that every mask value is internally "canonical",
i.e.  every element must be 0 or 1, regardless of the element size?
If so, that could introduce some inefficiency on SVE, where upper bits
are (by design) usually ignored.
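For readers unfamiliar with the two predicate constructors, here is a portable model of their semantics (bool arrays standing in for SVE predicates; names mirror the intrinsics):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// svptrue_b8(): every lane active.
template <std::size_t N>
std::array<bool, N> ptrue()
{
  std::array<bool, N> p;
  p.fill(true);
  return p;
}

// svwhilelt_b8(0, n): lanes [0, n) active, the rest inactive. When n
// covers the whole register, the result coincides with ptrue, which is
// why svptrue_b8() is preferable for "full" masks: one predicate value
// can then be reused across element sizes.
template <std::size_t N>
std::array<bool, N> whilelt(std::size_t n)
{
  std::array<bool, N> p{};
  for (std::size_t i = 0; i < N && i < n; ++i)
    p[i] = true;
  return p;
}
```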

On:

  template 
struct __sve_vector_type
: __sve_vector_type<__get_sve_value_type_t, _Np>
{};

  template 
struct __sve_vector_type
: __sve_vector_type<__get_sve_value_type_t, _Np>
{};

  template 
struct __sve_vector_type
: __sve_vector_type<__get_sve_value_type_t, _Np>
{};

  template 
struct __sve_vector_type
: __sve_vector_type<__get_sve_value_type_t, _Np>
{};

  template 
struct __sve_vector_type
: __sve_vector_type<__get_sve_value_type_t, _Np>
{};

  template 
struct __sve_vector_type
: __sve_vector_type<__get_sve_value_type_t, _Np>
{};

Does this ensure complete coverage without duplicate definitions?  It seems
to be hard-coding a specific choice of stdint.h type (e.g. by including
char but not unsigned char, and long long int but not long int).
Unfortunately the stdint.h definitions vary between newlib and glibc,
for example.
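One way to sidestep the newlib/glibc divergence is to avoid naming stdint.h types at all and instead map every integral type to a canonical one by size and signedness (a sketch, not the patch's approach):

```cpp
#include <cstdint>
#include <type_traits>

// Sketch: map any integral type to a canonical fixed-width type via
// sizeof and signedness. char, signed char, long, long long, etc. all
// land on the same specialisation as their same-width stdint.h
// counterpart, regardless of which underlying type the libc chose.
template <typename T>
using canonical_int_t = std::conditional_t<
  std::is_signed_v<T>,
  std::conditional_t<sizeof(T) == 1, std::int8_t,
    std::conditional_t<sizeof(T) == 2, std::int16_t,
      std::conditional_t<sizeof(T) == 4, std::int32_t, std::int64_t>>>,
  std::conditional_t<sizeof(T) == 1, std::uint8_t,
    std::conditional_t<sizeof(T) == 2, std::uint16_t,
      std::conditional_t<sizeof(T) == 4, std::uint32_t, std::uint64_t>>>>;
```

The base template of __sve_vector_type could then forward through canonical_int_t once, instead of enumerating type names.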

inline static type
__sve_mask_first_true()
{ return svptrue_pat_b8(SV_VL1); }

SV_VL1 produces the same predicate result for all element widths,
so I don't think this needs to be parameterised.

  template <>
struct __sve_mask_type<8>
{
  ...

  inline static void
  called()
  {}

What's the called() function for?

  template 
_GLIBCXX_SIMD_INTRINSIC constexpr auto
__sve_reinterpret_cast(_From __v)
{
  if constexpr (std::is_same_v<_To, int32_t>)
return svreinterpret_s32(__v);
  else if constexpr (std::is_same_v<_To, int64_t>)
return svreinterpret_s64(__v);
  else if constexpr (std::is_same_v<_To, float32_t>)
return svreinterpret_f32(__v);
  else if constexpr (std::is_same_v<_To, float64_t>)
return svreinterpret_f64(__v);
}

Why does this only need to handle these 4 types?  Think it'd be
worth a comment.
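For fixed-size vectors, one width-generic alternative to the if-constexpr ladder is a bytewise bit-pattern copy, which covers the integer and unsigned variants the ladder omits (a sketch, shown on scalars; the SVE fixed-length types would need the same sizeof check):

```cpp
#include <cstdint>
#include <cstring>

// Generic reinterpret: copy the bit pattern of one trivially-copyable
// object into another of the same size. std::memcpy is the portable,
// strict-aliasing-safe way to do this pre-C++20 (std::bit_cast later).
template <typename To, typename From>
To reinterpret_bits(const From& from)
{
  static_assert(sizeof(To) == sizeof(From), "widths must match");
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
}
```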

  template 
struct _SveMaskWrapper
{
  ...

  _GLIBCXX_SIMD_INTRINSIC constexpr value_type
  operator[](size_t __i) const
  {
return _BuiltinSveMaskType::__sve_mask_active_count(
_BuiltinSveVectorType::__sve_active_mask(),
svand_z(_BuiltinSveVectorType::__sve_active_mask(),
svcmpeq(_BuiltinSveVectorType::__sve_active_mask(),
_BuiltinSveMaskType::__index0123,
typename 
_BuiltinSveMaskType::__sve_mask_uint_type(__i)),
_M_data));
  }

A simpler way to do this might be to use svdup_u_z (_M_data, 1, 0)
and then svorrv on the result.

  struct _CommonImplSve
  {