Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
On Mittwoch, 27. März 2024 14:34:52 CET Richard Sandiford wrote:
> Matthias Kretz writes:
> > The big issue here is that, IIUC, a user (and the simd library) cannot
> > do the right thing at the moment. There simply isn't enough context
> > information available when parsing the header. I.e. on definition of the
> > class template there's no facility to take target_clones or SME
> > "streaming" mode into account. Consequently, if we want the library to
> > be fit for SME, then we need more language extension(s) to make it work.
>
> Yeah. I think the same applies to plain SVE.

With "plain SVE" you mean the *scalable* part of it, right?

BTW, I've experimented with implementing simd basically as

  template <typename T, int N>
  class simd
  {
    alignas(bit_ceil(sizeof(T) * N)) T data[N];
  };

See here: https://compiler-explorer.com/z/WW6KqanTW

Maybe the compiler can get better at optimizing this approach. But for now
it's not a solution for a *scalable* variant, because all code would be
load/store bound from the get-go.

@Srinivas: See the guard variables for __index0123? They need to go. I
believe you can and should declare them `constexpr`.

> It seems reasonable to have functions whose implementation is specialised
> for a specific SVE length, with that function being selected at runtime
> where appropriate. Those functions needn't (in principle) be in separate
> TUs. The “best” definition of native then becomes a per-function property
> rather than a per-TU property.

Hmm, I never considered this; but can one actually write fixed-length SVE
code if -msve-vector-bits is not given? Then it's certainly possible to
write a single TU with a runtime dispatch for all different SVE widths.
(This is less interesting on x86, where we need to dispatch on ISA
extensions *and* vector width. It's much simpler (and safer) to compile a TU
multiple times, restricted to a certain set of ISA extensions, and then
dispatch to the right translation from some general code section.)
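The array-backed approach mentioned above can be sketched as follows. This is an illustrative toy, not the libstdc++ implementation: the generator constructor, `next_pow2` helper, and operator set are assumptions added to make the `constexpr __index0123` point concrete.

```cpp
#include <cstddef>

// Stand-in for std::bit_ceil, to keep the sketch C++17-compatible.
constexpr std::size_t next_pow2(std::size_t x)
{ std::size_t r = 1; while (r < x) r *= 2; return r; }

template <typename T, int N>
class simd
{
  // Over-align the array so a whole-register load/store is possible.
  alignas(next_pow2(sizeof(T) * N)) T data[N] = {};

public:
  constexpr simd() = default;

  // Broadcast constructor.
  constexpr simd(T value)
  { for (int i = 0; i < N; ++i) data[i] = value; }

  // Generator constructor; the int tag only disambiguates from broadcast.
  template <typename Gen>
  constexpr simd(Gen&& gen, int)
  { for (int i = 0; i < N; ++i) data[i] = gen(i); }

  constexpr T operator[](int i) const { return data[i]; }

  friend constexpr simd operator+(simd a, simd b)
  {
    simd r;
    for (int i = 0; i < N; ++i) r.data[i] = a.data[i] + b.data[i];
    return r;
  }
};

// Declared constexpr, so no guard variable needs to be emitted -- the point
// raised about __index0123 above.
template <typename T, int N>
constexpr simd<T, N> __index0123{[](int i) { return T(i); }, 0};
```

Because everything lives in a plain array, the optimizer has to prove it can keep `data` in a vector register; that is exactly why this variant tends to be load/store bound today.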
> As you note later, I think the same thing would apply to x86_64.

Yes. I don't think "same" is the case (yet), but it's very similar. Once ARM
is at SVE9 and binaries need to support HW from SVE2 up to SVE9 it gets
closer to "same".

> > The big issue I see here is that currently all of std::* is declared
> > without an arm_streaming or arm_streaming_compatible. Thus, IIUC, you
> > can't use anything from the standard library in streaming mode. Since
> > that also applies to std::experimental::simd, we're not creating a new
> > footgun, only missing out on potential users?
>
> Kind-of. However, we can inline a non-streaming function into a streaming
> function if that doesn't change defined behaviour. And that's important
> in practice for C++, since most trivial inline functions will not be
> marked streaming-compatible despite being so in practice.

Ah, good to know that it takes a pragmatic approach here. But I imagine this
could become a source of confusion for users.

> > [...]
> > the compiler *must* virally apply target_clones to all functions it
> > calls. And member functions must either also get cloned as functions,
> > or the whole type must be cloned (as in the std::simd case, where the
> > sizeof needs to change).
>
> Yeah, tricky :)
>
> It's also not just about vector widths. The target-clones case also has
> the problem that you cannot detect at include time which features are
> available. E.g. “do I have SVE2-specific instructions?” becomes a
> contextual question rather than a global question.
>
> Fortunately, this should just be a missed optimisation. But it would be
> nice if uses of std::simd in SVE2 clones could take advantage of SVE2-only
> instructions, even if SVE2 wasn't enabled at include time.

Exactly. Even if we solve the scalable vector-length question, the
target_clones question stays relevant. So far my best answer, for x86 at
least, is to compile the SIMD code multiple times into different shared
libraries.
And then let the dynamic linker pick the right library variant depending on
the CPU. I'd be happy to have something simpler that works right out of the
box.

Best,
Matthias

--
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────
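The "compile the SIMD code multiple times, let the loader pick" idea above can be sketched with an explicit runtime resolver. Everything here is illustrative: `wide_impl`/`narrow_impl` stand in for per-ISA builds of the same TU, and `cpu_has_wide_vectors()` for a CPUID/hwcaps query; the hwcaps-based shared-library selection does essentially the same resolution at load time.

```cpp
#include <cstddef>

// Hypothetical per-ISA translations of the same source; in the shared-library
// scheme these would live in differently compiled .so files.
namespace wide_impl
{
  inline void scale(float* x, std::size_t n, float f)
  { for (std::size_t i = 0; i < n; ++i) x[i] *= f; }
}

namespace narrow_impl
{
  inline void scale(float* x, std::size_t n, float f)
  { for (std::size_t i = 0; i < n; ++i) x[i] *= f; }
}

inline bool cpu_has_wide_vectors()
{
#if defined(__x86_64__) || defined(__i386__)
  return __builtin_cpu_supports("avx2");   // GCC/Clang builtin, x86 only
#else
  return false;                            // conservative fallback elsewhere
#endif
}

// Resolve once on first use, then call through the cached pointer.
inline void scale(float* x, std::size_t n, float f)
{
  static void (*const impl)(float*, std::size_t, float)
    = cpu_has_wide_vectors() ? wide_impl::scale : narrow_impl::scale;
  impl(x, n, f);
}
```

The per-call indirection is the price of doing this in user code; IFUNC resolution or hwcaps library selection moves that cost to load time.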
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Matthias Kretz writes:
> Hi Richard,
>
> sorry for not answering sooner. I took action on your mail but failed to
> also give feedback. Now in light of your veto of Srinivas' patch I wanted
> to use the opportunity to pick this up again.
>
> On Dienstag, 23. Januar 2024 21:57:23 CET Richard Sandiford wrote:
>> However, we also support different vector lengths for streaming SVE
>> (running in "streaming" mode on SME) and non-streaming SVE (running
>> in "non-streaming" mode on the core). Having two different lengths is
>> expected to be the common case, rather than a theoretical curiosity.
>
> I read up on this after you mentioned this for the first time. As a WG21
> member I find the approach troublesome - but that's a bit off-topic for
> this thread.
>
> The big issue here is that, IIUC, a user (and the simd library) cannot do
> the right thing at the moment. There simply isn't enough context
> information available when parsing the header. I.e. on definition of the
> class template there's no facility to take target_clones or SME
> "streaming" mode into account. Consequently, if we want the library to be
> fit for SME, then we need more language extension(s) to make it work.

Yeah. I think the same applies to plain SVE. It seems reasonable to have
functions whose implementation is specialised for a specific SVE length,
with that function being selected at runtime where appropriate. Those
functions needn't (in principle) be in separate TUs. The “best” definition
of native then becomes a per-function property rather than a per-TU
property.

As you note later, I think the same thing would apply to x86_64.

> I guess I'm looking for a way to declare types that are different
> depending on whether they are used in streaming mode or non-streaming
> mode (making them ill-formed to use in functions marked
> arm_streaming_compatible).
> From reading through
> https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode
> I don't see any discussion of member functions or ctor/dtor, static and
> non-static data members, etc.
>
> The big issue I see here is that currently all of std::* is declared
> without an arm_streaming or arm_streaming_compatible. Thus, IIUC, you
> can't use anything from the standard library in streaming mode. Since
> that also applies to std::experimental::simd, we're not creating a new
> footgun, only missing out on potential users?

Kind-of. However, we can inline a non-streaming function into a streaming
function if that doesn't change defined behaviour. And that's important in
practice for C++, since most trivial inline functions will not be marked
streaming-compatible despite being so in practice.

It's UB to pass and return SVE vectors across streaming/non-streaming
boundaries unless the two VLs are equal. It's therefore valid to inline
such functions into streaming functions *unless* the callee uses
non-streaming-only instructions such as gather loads.

Because of that, someone trying to use std::experimental::simd in SME
functions is likely to succeed, at least in simple cases.

> Some more thoughts on target_clones/streaming SVE language extension
> evolution:
>
> void nonstreaming_fn(void) {
>   constexpr int width = __arm_sve_bits();          // e.g. 512
>   constexpr int width2 = __builtin_vector_size();  // e.g. 64 (the
>     // vector_size attribute works with bytes, not bits)
> }
>
> __attribute__((arm_locally_streaming))
> void streaming_fn(void) {
>   constexpr int width = __arm_sve_bits();          // e.g. 128
>   constexpr int width2 = __builtin_vector_size();  // e.g. 16
> }
>
> __attribute__((target_clones("sse4.2,avx2")))
> void cloned_fn(void) {
>   constexpr int width = __builtin_vector_size();   // 16 in the sse4.2
>     // clone and 32 in the avx2 clone
> }
>
> ... as a starting point for exploration.
> Given this, I'd still have to resort to a macro to define a "native" simd
> type:
>
> #define NATIVE_SIMD(T) std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>
>
> Getting rid of the macro seems to be even harder.

Yeah. The constexprs in the AArch64 functions would only be compile-time
constants in to-be-defined circumstances, using some mechanism to specify
the streaming and non-streaming vector lengths at compile time. But that's
a premise of the whole discussion, just noting it for the record in case
anyone reading this later jumps in at this point.

> A declaration of an alias like
>
> template <typename T>
> using SveSimd = std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>;
>
> would have to delay "invoking" __arm_sve_bits() until it knows its
> context:
>
> void nonstreaming_fn(void) {
>   static_assert(sizeof(SveSimd) == 64);
> }
>
> __attribute__((arm_locally_streaming))
> void streaming_fn(void) {
>   static_assert(sizeof(SveSimd) == 16);
>   nonstreaming_fn(); // fine
> }
>
> This gets even worse for target_clones, where
>
> void f() {
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Hi Richard,

sorry for not answering sooner. I took action on your mail but failed to
also give feedback. Now in light of your veto of Srinivas' patch I wanted
to use the opportunity to pick this up again.

On Dienstag, 23. Januar 2024 21:57:23 CET Richard Sandiford wrote:
> However, we also support different vector lengths for streaming SVE
> (running in "streaming" mode on SME) and non-streaming SVE (running
> in "non-streaming" mode on the core). Having two different lengths is
> expected to be the common case, rather than a theoretical curiosity.

I read up on this after you mentioned this for the first time. As a WG21
member I find the approach troublesome - but that's a bit off-topic for
this thread.

The big issue here is that, IIUC, a user (and the simd library) cannot do
the right thing at the moment. There simply isn't enough context
information available when parsing the header. I.e. on definition of the
class template there's no facility to take target_clones or SME
"streaming" mode into account. Consequently, if we want the library to be
fit for SME, then we need more language extension(s) to make it work.

I guess I'm looking for a way to declare types that are different depending
on whether they are used in streaming mode or non-streaming mode (making
them ill-formed to use in functions marked arm_streaming_compatible).

From reading through
https://arm-software.github.io/acle/main/acle.html#controlling-the-use-of-streaming-mode
I don't see any discussion of member functions or ctor/dtor, static and
non-static data members, etc.

The big issue I see here is that currently all of std::* is declared
without an arm_streaming or arm_streaming_compatible. Thus, IIUC, you can't
use anything from the standard library in streaming mode. Since that also
applies to std::experimental::simd, we're not creating a new footgun, only
missing out on potential users?
Some more thoughts on target_clones/streaming SVE language extension
evolution:

void nonstreaming_fn(void) {
  constexpr int width = __arm_sve_bits();          // e.g. 512
  constexpr int width2 = __builtin_vector_size();  // e.g. 64 (the
    // vector_size attribute works with bytes, not bits)
}

__attribute__((arm_locally_streaming))
void streaming_fn(void) {
  constexpr int width = __arm_sve_bits();          // e.g. 128
  constexpr int width2 = __builtin_vector_size();  // e.g. 16
}

__attribute__((target_clones("sse4.2,avx2")))
void cloned_fn(void) {
  constexpr int width = __builtin_vector_size();   // 16 in the sse4.2 clone
    // and 32 in the avx2 clone
}

... as a starting point for exploration. Given this, I'd still have to
resort to a macro to define a "native" simd type:

#define NATIVE_SIMD(T) std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>

Getting rid of the macro seems to be even harder. A declaration of an alias
like

template <typename T>
using SveSimd = std::experimental::simd CHAR_BITS, __arm_sve_bits() / CHAR_BITS>>;

would have to delay "invoking" __arm_sve_bits() until it knows its context:

void nonstreaming_fn(void) {
  static_assert(sizeof(SveSimd) == 64);
}

__attribute__((arm_locally_streaming))
void streaming_fn(void) {
  static_assert(sizeof(SveSimd) == 16);
  nonstreaming_fn(); // fine
}

This gets even worse for target_clones, where

void f() {
  sizeof(std::simd) == ?
}

__attribute__((target_clones("sse4.2,avx2")))
void g() {
  f();
}

the compiler *must* virally apply target_clones to all functions it calls.
And member functions must either also get cloned as functions, or the whole
type must be cloned (as in the std::simd case, where the sizeof needs to
change).

> When would NumberOfUsedBytes < SizeofRegister be used for SVE? Would it
> be for storing narrower elements in wider containers? If the interface
> supports that then, yeah, two parameters would probably be safer.
>
> Or were you thinking about emulating narrower vectors with wider
> registers using predication?
> I suppose that's possible too, and would be similar in spirit to using
> SVE to optimise Advanced SIMD std::simd types. But mightn't it cause
> confusion if sizeof applied to a "16-byte" vector actually gives 32?

Yes, the idea is to e.g. use one SVE register instead of two NEON registers
for a "float, 8" with SVE512. The user never asks for a "16-byte" vector.
The user asks for a value-type and a number of elements. Sure, the wasteful
"padding" might come as a surprise, but it's totally within the spec to
implement it like this.

> I assume std::experimental::native_simd<T> has to have the same meaning
> everywhere for ODR reasons?

No. Only std::experimental::simd<T, Abi> has to be "ABI stable". And note
that in the C++ spec there's no such thing as compiling and linking TUs
with different compiler flags. That's plain UB. The committee still cares
about it, but getting this "right" cannot be part of the standard and must
be defined by implementers.

> If so, it'll need to be an Advanced SIMD vector for AArch64 (but using
> SVE to optimise
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Matthias Kretz writes:
> On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
>> Thanks for the patch and sorry for the slow review.
>
> Sorry for my slow reaction. I needed a long vacation. For now I'll focus
> on the design question wrt. multi-arch compilation.
>
>> I can only comment on the usage of SVE, rather than on the scaffolding
>> around it. Hopefully Jonathan or others can comment on the rest.
>
> That's very useful!
>
>> The main thing that worries me is:
>>
>> #if _GLIBCXX_SIMD_HAVE_SVE
>> constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
>> #else
>> constexpr inline int __sve_vectorized_size_bytes = 0;
>> #endif
>>
>> Although -msve-vector-bits is currently a per-TU setting, that isn't
>> necessarily going to be the case in future.
>
> This is a topic that I care about a lot... as simd user, implementer, and
> WG21 proposal author. Are you thinking of a plan to implement the
> target_clones function attribute for different SVE lengths? Or does it go
> further than that? PR83875 is raising the same issue and solution ideas
> for x86. If I understand your concern correctly, then the issue you're
> raising exists in the same form for x86.
>
> If anyone is interested in working on a "translation phase 7 replacement"
> for compiler flags macros, I'd be happy to give some input on what I
> believe is necessary to make target_clones work with std(x)::simd. This
> seems to be about constant expressions that return compiler-internal
> state - probably similar to how static reflection needs to work.
>
> For a sketch of a direction: what I'm already doing in
> std::experimental::simd is to tag all non-always_inline function names
> with a bitflag, representing a relevant subset of -f and -m flags. That
> way, I'm guarding against surprises when linking TUs compiled with
> different flags.

Yeah, target_clones could be one use for switching lengths.
In that case the choice would be based on a load-time check of the vector
length. However, we also support different vector lengths for streaming SVE
(running in "streaming" mode on SME) and non-streaming SVE (running in
"non-streaming" mode on the core). Having two different lengths is expected
to be the common case, rather than a theoretical curiosity.

Software also provides a "streaming compatible" mode in which code can run
either in streaming or non-streaming mode and which restricts itself to the
common subset of non-streaming and streaming features (which is large).

So it isn't really safe to treat the vector length as a program or process
invariant, or to assume that all active uses of std::simd in a binary are
for the same vector length. It should in principle be possible to use
std::simd variants for non-streaming code and std::simd variants for
streaming code even if the lengths are different.

So in the target_clones case, the clone would have to apply to a
non-streaming function and switch based on the non-streaming vector length,
or apply to a streaming function and switch based on the streaming vector
length. (Neither of those are supported yet, of course.)

I think there are probably also ODR problems with hard-coding a vector
length, instead of treating it as a template parameter. Unfortunately it
isn't currently possible to write:

  template<int N>
  using T = svint32_t __attribute__((arm_sve_vector_bits(N)));

due to PRs 58855/68703/88600. But I think we should fix that. :)

>> Ideally it would be possible to define different implementations of a
>> function with different (fixed) vector lengths within the same TU. The
>> value at the point that the header file is included is then somewhat
>> arbitrary.
>>
>> So rather than have:
>>
>> > using __neon128 = _Neon<16>;
>> > using __neon64 = _Neon<8>;
>> >
>> > +using __sve = _Sve<>;
>>
>> would it be possible instead to have:
>>
>> using __sve128 = _Sve<128>;
>> using __sve256 = _Sve<256>;
>> ...etc...
>>
>> ?
>> Code specialised for 128-bit vectors could then use __sve128 and code
>> specialised for 256-bit vectors could use __sve256.
>
> Hmm, as things stand we'd need two numbers, IIUC:
> _Sve<NumberOfUsedBytes, SizeofRegister>
>
> On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm
> registers (and -mavx512f), 17-32 implies ymm, and <=16 implies xmm
> (except where it doesn't ;) ).

When would NumberOfUsedBytes < SizeofRegister be used for SVE? Would it be
for storing narrower elements in wider containers? If the interface
supports that then, yeah, two parameters would probably be safer.

Or were you thinking about emulating narrower vectors with wider registers
using predication? I suppose that's possible too, and would be similar in
spirit to using SVE to optimise Advanced SIMD std::simd types. But mightn't
it cause confusion if sizeof applied to a "16-byte" vector actually gives
32?

>> Perhaps that's not possible as things stand, but it'd be interesting to
>> know the exact failure mode if so. Either way, it would be
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
On Thursday, 18 January 2024 08:40:48 CET Andrew Pinski wrote:
> On Wed, Jan 17, 2024 at 11:28 PM Matthias Kretz wrote:
> > template <typename T>
> > struct Point
> > {
> >   T x, y, z;
> >
> >   T distance_to_origin() {
> >     return sqrt(x * x + y * y + z * z);
> >   }
> > };
> >
> > Point<float> is one point in 3D space, Point<simd<float>> stores
> > multiple points in 3D space and can work on them in parallel.
> >
> > This implies that simd<T> must have a sizeof. C++ is unlikely to get
> > sizeless types (the discussions were long, there were many papers,
> > ...). Should sizeless types in C++ ever happen, then composition is
> > likely going to be constrained to the last data member.
>
> Even this is a bad design in general for simd. It means the code needs
> to know the size.

Yes and no. The developer writes size-agnostic code. The person compiling
the code chooses the size (via -m flags) and thus the compiler sees
fixed-size code.

> Also AoS vs SoA is always an interesting point here. In some cases you
> want an array of structs for speed and Point<simd<float>> does not work
> there at all. I guess this is all water under the bridge with how folks
> design code. You are basically pushing the AoSoA idea here, which is a
> much worse idea than before.

I like to call it "array of vectorized struct" (AoVS) instead of AoSoA to
emphasize the compiler-flags-dependent memory layout. I've been doing a lot
of heterogeneous SIMD programming since 2009, starting with an outer-loop
vectorization across many TUs of a high-energy-physics code targeting Intel
Larrabee (pre-AVX512 ISA) and SSE2 with one source. In all these years my
experience has been that, if the problem allows, AoVS is best in terms of
performance and code generality & readability. I'd be interested to learn
why you think differently.

> That being said sometimes it is not a vector of N elements you want to
> work on but rather 1/2/3 vector of N elements. Seems like this is just
> pushing the idea of one vector of one type of element, which again is the
> wrong push.
I might have misunderstood. You're saying that sometimes I want a
simd<float, 8> even though my target CPU only has 128-bit registers? Yes!
The std::experimental::simd spec and implementation isn't good enough in
that area yet, but the C++26 paper(s) and my prototype implementation
provide perfect SIMD + ILP translation of the expressed data-parallelism.

> Also more over, I guess pushing one idea of SIMD is worse than pushing
> any idea of SIMD. For Mathematical code, it is better for the compiler
> to do the vectorization than the user try to be semi-portable between
> different targets.

I guess I agree with that statement. But I wouldn't, in general, call the
use of simd "the user try[ing] to be semi-portable". In my experience,
working on physics code - a lot of math - using simd (as intended) is
better in terms of performance and performance portability. As always,
abuse is possible ...

> This is what was learned on Fortran but I guess some folks in the C++
> world like to expose the underlying HW instead of thinking high level
> here.

The C++ approach is to "leave no room for a lower-level language" while
designing for high-level abstractions / usage.

> > With the above as our design constraints, SVE at first seems to be a
> > bad fit for implementing std::simd. However, if (at least initially) we
> > accept the need for different binaries for different SVE
> > implementations, then you can look at the "scalable" part of SVE as an
> > efficient way of reducing the number of opcodes necessary for
> > supporting all kinds of different vector lengths. But otherwise you can
> > treat it as fixed-size registers - which it is for a given CPU. In the
> > case of a multi-CPU shared-memory system (e.g. RDMA between different
> > ARM implementations) all you need is a different name for incompatible
> > types. So std::simd<T> on SVE256 must have a different name than on
> > SVE512. Same for std::simd<T, N> (which is currently not the case with
> > Srinivas' patch, I think, and needs to be resolved).
> For SVE that is a bad design. It means the code is not portable at all.

When you say "code" you mean "source code", not binaries, right? I don't
see how that follows.

- Matthias

--
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────
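The composition argument in this exchange can be made concrete with a toy element type. `vecf<N>` below is a hypothetical stand-in for a real `simd<float, N>`; the point is that `Point<T>` is written once and works unchanged for one point (`Point<float>`, AoS) or N points processed together (`Point<vecf<N>>`, the AoVS layout discussed above).

```cpp
#include <array>
#include <cmath>

// Minimal stand-in for a fixed-size simd<float, N>: elementwise operations
// over an array. Illustrative only.
template <int N>
struct vecf
{
  std::array<float, N> v{};

  friend vecf operator+(vecf a, vecf b)
  { vecf r; for (int i = 0; i < N; ++i) r.v[i] = a.v[i] + b.v[i]; return r; }

  friend vecf operator*(vecf a, vecf b)
  { vecf r; for (int i = 0; i < N; ++i) r.v[i] = a.v[i] * b.v[i]; return r; }

  friend vecf sqrt(vecf a)
  { vecf r; for (int i = 0; i < N; ++i) r.v[i] = std::sqrt(a.v[i]); return r; }
};

// The Point template from the thread, unchanged.
template <typename T>
struct Point
{
  T x, y, z;

  T distance_to_origin()
  {
    using std::sqrt;  // scalar T uses std::sqrt; vecf<N> finds its friend via ADL
    return sqrt(x * x + y * y + z * z);
  }
};
```

This only compiles because `vecf<N>` has a `sizeof`, which is exactly the constraint on simd types that the thread is arguing about.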
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
On Wed, Jan 17, 2024 at 11:28 PM Matthias Kretz wrote:
>
> On Thursday, 4 January 2024 10:10:12 CET Andrew Pinski wrote:
> > I really doubt this would work in the end. Because HW which is 128bits
> > only, can't support -msve-vector-bits=2048. I am thinking
> > std::experimental::simd is not the right way of supporting this.
> > Really the route the standard should be heading towards is non
> > constant at compile time sized vectors instead and then you could use
> > the constant sized ones to emulate the variable length ones.
>
> I don't follow. "non-constant at compile time sized vectors" implies
> sizeless (no constexpr sizeof), no?
>
> Let me try to explain where I'm coming from. One of the motivating use
> cases for simd types is composition. Like this:
>
> template <typename T>
> struct Point
> {
>   T x, y, z;
>
>   T distance_to_origin() {
>     return sqrt(x * x + y * y + z * z);
>   }
> };
>
> Point<float> is one point in 3D space, Point<simd<float>> stores multiple
> points in 3D space and can work on them in parallel.
>
> This implies that simd<T> must have a sizeof. C++ is unlikely to get
> sizeless types (the discussions were long, there were many papers, ...).
> Should sizeless types in C++ ever happen, then composition is likely
> going to be constrained to the last data member.

Even this is a bad design in general for simd. It means the code needs to
know the size.

Also AoS vs SoA is always an interesting point here. In some cases you want
an array of structs for speed and Point<simd<float>> does not work there at
all. I guess this is all water under the bridge with how folks design code.
You are basically pushing the AoSoA idea here, which is a much worse idea
than before.

That being said sometimes it is not a vector of N elements you want to work
on but rather 1/2/3 vector of N elements. Seems like this is just pushing
the idea of one vector of one type of element, which again is the wrong
push.

Also more over, I guess pushing one idea of SIMD is worse than pushing any
idea of SIMD.
For Mathematical code, it is better for the compiler to do the
vectorization than the user trying to be semi-portable between different
targets. This is what was learned on Fortran, but I guess some folks in the
C++ world like to expose the underlying HW instead of thinking high level
here.

Thanks,
Andrew Pinski

>
> With the above as our design constraints, SVE at first seems to be a bad
> fit for implementing std::simd. However, if (at least initially) we
> accept the need for different binaries for different SVE implementations,
> then you can look at the "scalable" part of SVE as an efficient way of
> reducing the number of opcodes necessary for supporting all kinds of
> different vector lengths. But otherwise you can treat it as fixed-size
> registers - which it is for a given CPU. In the case of a multi-CPU
> shared-memory system (e.g. RDMA between different ARM implementations)
> all you need is a different name for incompatible types. So std::simd<T>
> on SVE256 must have a different name than on SVE512. Same for
> std::simd<T, N> (which is currently not the case with Srinivas' patch, I
> think, and needs to be resolved).

For SVE that is a bad design. It means the code is not portable at all.

> > I think we should not depend on __ARM_FEATURE_SVE_BITS being set here
> > and being meaningful in any way.
>
> I'd love to. In the same way I'd love to *not depend* on __AVX__,
> __AVX512F__ etc.
>
> - Matthias
>
> --
>  Dr. Matthias Kretz                           https://mattkretz.github.io
>  GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
>  std::simd
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
> Thanks for the patch and sorry for the slow review.

Sorry for my slow reaction. I needed a long vacation. For now I'll focus on
the design question wrt. multi-arch compilation.

> I can only comment on the usage of SVE, rather than on the scaffolding
> around it. Hopefully Jonathan or others can comment on the rest.

That's very useful!

> The main thing that worries me is:
>
> #if _GLIBCXX_SIMD_HAVE_SVE
> constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
> #else
> constexpr inline int __sve_vectorized_size_bytes = 0;
> #endif
>
> Although -msve-vector-bits is currently a per-TU setting, that isn't
> necessarily going to be the case in future.

This is a topic that I care about a lot... as simd user, implementer, and
WG21 proposal author. Are you thinking of a plan to implement the
target_clones function attribute for different SVE lengths? Or does it go
further than that? PR83875 is raising the same issue and solution ideas for
x86. If I understand your concern correctly, then the issue you're raising
exists in the same form for x86.

If anyone is interested in working on a "translation phase 7 replacement"
for compiler flags macros, I'd be happy to give some input on what I
believe is necessary to make target_clones work with std(x)::simd. This
seems to be about constant expressions that return compiler-internal state
- probably similar to how static reflection needs to work.

For a sketch of a direction: what I'm already doing in
std::experimental::simd is to tag all non-always_inline function names with
a bitflag, representing a relevant subset of -f and -m flags. That way, I'm
guarding against surprises when linking TUs compiled with different flags.

> Ideally it would be possible to define different implementations of a
> function with different (fixed) vector lengths within the same TU. The
> value at the point that the header file is included is then somewhat
> arbitrary.
> So rather than have:
>
> > using __neon128 = _Neon<16>;
> > using __neon64 = _Neon<8>;
> >
> > +using __sve = _Sve<>;
>
> would it be possible instead to have:
>
> using __sve128 = _Sve<128>;
> using __sve256 = _Sve<256>;
> ...etc...
>
> ? Code specialised for 128-bit vectors could then use __sve128 and code
> specialised for 256-bit vectors could use __sve256.

Hmm, as things stand we'd need two numbers, IIUC:
_Sve<NumberOfUsedBytes, SizeofRegister>

On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm
registers (and -mavx512f), 17-32 implies ymm, and <=16 implies xmm (except
where it doesn't ;) ).

> Perhaps that's not possible as things stand, but it'd be interesting to
> know the exact failure mode if so. Either way, it would be good to remove
> direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible, and
> instead rely on information derived from template parameters.

The TS spec requires std::experimental::native_simd<T> to basically give
you the largest, most efficient, full SIMD register. (And it can't be a
sizeless type because they don't exist in C++.) So how would you do that
without looking at __ARM_FEATURE_SVE_BITS in the simd implementation?

> It should be possible to use SVE to optimise some of the __neon*
> implementations, which has the advantage of working even for VLA SVE.
> That's obviously a separate patch, though. Just saying for the record.

I learned that NVIDIA Grace CPUs alias NEON and SVE registers. But I must
assume that other SVE implementations (especially those with
__ARM_FEATURE_SVE_BITS > 128) don't do that and might incur a significant
latency when going from a NEON register to an SVE register and back (which
each require a store-load, IIUC). So are you thinking of implementing
everything via SVE? That would break ABI, no?

- Matthias

--
──────────────────────────────────────────────────────────────────────────
 Dr. Matthias Kretz                           https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research               https://gsi.de
 std::simd
──────────────────────────────────────────────────────────────────────────
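The "tag function names with a bitflag of -f and -m flags" scheme mentioned in this message can be sketched with an inline namespace. The namespace names, the flag subset, and `native_width_bytes()` are illustrative assumptions, not the actual libstdc++ encoding.

```cpp
// Encode a (tiny) subset of the -m flags into the symbol names of all
// non-always_inline functions. TUs compiled with mismatched flags then
// fail to link instead of silently violating the ODR.
#if defined(__AVX2__)
#  define SIMD_ABI_TAG abi_avx2
#elif defined(__SSE2__) || defined(__x86_64__)
#  define SIMD_ABI_TAG abi_sse2
#else
#  define SIMD_ABI_TAG abi_generic
#endif

namespace mylib
{
inline namespace SIMD_ABI_TAG   // mangled into every symbol below
{
  inline int native_width_bytes()
  {
#if defined(__AVX2__)
    return 32;
#elif defined(__SSE2__) || defined(__x86_64__)
    return 16;
#else
    return 0;
#endif
  }
}
}
```

Callers keep writing `mylib::native_width_bytes()`; the inline namespace only changes the mangled name, so the flag check costs nothing at runtime.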
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
On Thursday, 4 January 2024 10:10:12 CET Andrew Pinski wrote:
> I really doubt this would work in the end. Because HW which is 128bits
> only, can't support -msve-vector-bits=2048. I am thinking
> std::experimental::simd is not the right way of supporting this.
> Really the route the standard should be heading towards is non
> constant at compile time sized vectors instead and then you could use
> the constant sized ones to emulate the variable length ones.

I don't follow. "non-constant at compile time sized vectors" implies
sizeless (no constexpr sizeof), no?

Let me try to explain where I'm coming from. One of the motivating use
cases for simd types is composition. Like this:

template <typename T>
struct Point
{
  T x, y, z;

  T distance_to_origin() {
    return sqrt(x * x + y * y + z * z);
  }
};

Point<float> is one point in 3D space, Point<simd<float>> stores multiple
points in 3D space and can work on them in parallel.

This implies that simd<T> must have a sizeof. C++ is unlikely to get
sizeless types (the discussions were long, there were many papers, ...).
Should sizeless types in C++ ever happen, then composition is likely going
to be constrained to the last data member.

With the above as our design constraints, SVE at first seems to be a bad
fit for implementing std::simd. However, if (at least initially) we accept
the need for different binaries for different SVE implementations, then you
can look at the "scalable" part of SVE as an efficient way of reducing the
number of opcodes necessary for supporting all kinds of different vector
lengths. But otherwise you can treat it as fixed-size registers - which it
is for a given CPU. In the case of a multi-CPU shared-memory system (e.g.
RDMA between different ARM implementations) all you need is a different
name for incompatible types. So std::simd<T> on SVE256 must have a
different name than on SVE512. Same for std::simd<T, N> (which is currently
not the case with Srinivas' patch, I think, and needs to be resolved).
> I think we should not depend on __ARM_FEATURE_SVE_BITS being set here
> and being meaningful in any way.

I'd love to. In the same way I'd love to *not depend* on __AVX__, __AVX512F__ etc.

- Matthias

--
──────────────────────────────────────────────────
 Dr. Matthias Kretz                  https://mattkretz.github.io
 GSI Helmholtz Center for Heavy Ion Research      https://gsi.de
 std::simd
──────────────────────────────────────────────────
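The composition argument above can be made concrete with a small self-contained sketch. This is an editor's illustration, not the libstdc++ implementation: `simd` here is a hypothetical fixed-size stand-in with element-wise operators, just enough to show why a usable `sizeof` lets the same `Point<T>` work for one point or a batch of points.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical fixed-size stand-in for simd<T> (illustration only, not
// the libstdc++ type): a plain aggregate with a compile-time size, so
// sizeof works and it can be used as an ordinary data member.
template <typename T, std::size_t N>
struct simd
{
  std::array<T, N> data;

  friend simd operator+(simd a, simd b)
  {
    for (std::size_t i = 0; i < N; ++i) a.data[i] += b.data[i];
    return a;
  }

  friend simd operator*(simd a, simd b)
  {
    for (std::size_t i = 0; i < N; ++i) a.data[i] *= b.data[i];
    return a;
  }
};

// Element-wise sqrt overload, found by ADL from distance_to_origin().
template <typename T, std::size_t N>
simd<T, N> sqrt(simd<T, N> v)
{
  for (std::size_t i = 0; i < N; ++i) v.data[i] = std::sqrt(v.data[i]);
  return v;
}

// Composition: the same Point<T> works for a single point (T = float)
// and for a batch of points (T = simd<float, N>) processed in parallel.
template <typename T>
struct Point
{
  T x, y, z;

  T distance_to_origin()
  {
    using std::sqrt; // scalar fallback; ADL finds the simd overload
    return sqrt(x * x + y * y + z * z);
  }
};
```

Because `simd<T, N>` has a known `sizeof`, `Point<simd<float, 4>>` is itself a complete type with a known layout - exactly what a sizeless vector type could not provide as a non-final data member.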
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
On Thu, Jan 4, 2024 at 1:03 AM Srinivas Yadav wrote:
>
> Hi,
>
> Thanks a lot for the review. Sorry for the very late reply.
> The following are my comments on the feedback.
>
>> The main thing that worries me is:
>>
>>   #if _GLIBCXX_SIMD_HAVE_SVE
>>   constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
>>   #else
>>   constexpr inline int __sve_vectorized_size_bytes = 0;
>>   #endif
>>
>> Although -msve-vector-bits is currently a per-TU setting, that isn't
>> necessarily going to be the case in future. Ideally it would be
>> possible to define different implementations of a function with
>> different (fixed) vector lengths within the same TU. The value at
>> the point that the header file is included is then somewhat arbitrary.
>>
>> So rather than have:
>>
>>> using __neon128 = _Neon<16>;
>>> using __neon64 = _Neon<8>;
>>> +using __sve = _Sve<>;
>>
>> would it be possible instead to have:
>>
>>   using __sve128 = _Sve<128>;
>>   using __sve256 = _Sve<256>;
>>   ...etc...
>>
>> ? Code specialised for 128-bit vectors could then use __sve128 and
>> code specialised for 256-bit vectors could use __sve256.
>
> I think it is very hard to maintain a specialization for each vector
> length, as lengths for SVE can vary from 128 to 2048 in 128-bit
> increments. Hence, there would be many specializations. Also, we only
> embed the vector length information in the type specialization. Hence,
> one generic __sve abi type would be sufficient, which holds the max
> vector length (in bytes) derived from __sve_vectorized_bytes (which is
> again derived from __ARM_FEATURE_SVE_BITS in simd.h).

I really doubt this would work in the end, because HW which is 128 bits only can't support -msve-vector-bits=2048. I am thinking std::experimental::simd is not the right way of supporting this.

Really the route the standard should be heading towards is non-constant-at-compile-time sized vectors instead, and then you could use the constant-sized ones to emulate the variable-length ones.

I think we should not depend on __ARM_FEATURE_SVE_BITS being set here and being meaningful in any way.

Thanks,
Andrew Pinski

>> Perhaps that's not possible as things stand, but it'd be interesting
>> to know the exact failure mode if so. Either way, it would be good to
>> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
>> and instead rely on information derived from template parameters.
>
> I am currently directly using __ARM_FEATURE_SVE_BITS in simd_sve.h for
> defining specific vector types of basic arithmetic types such as int,
> float, double etc. The __ARM_FEATURE_SVE_BITS macro is used in the
> context of arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS), which requires
> an integer constant expression. Hence, it's not quite usable with
> information derived from template parameters. May I please know what
> the exact issue is if __ARM_FEATURE_SVE_BITS is directly used?
>
> Another possible solution is to define a local macro in simd.h as
>   #define _GLIBCXX_ARM_FEATURE_SVE_BITS __ARM_FEATURE_SVE_BITS
> and use _GLIBCXX_ARM_FEATURE_SVE_BITS at all occurrences of
> __ARM_FEATURE_SVE_BITS.
>
>> It should be possible to use SVE to optimise some of the __neon*
>> implementations, which has the advantage of working even for VLA SVE.
>> That's obviously a separate patch, though. Just saying for the record.
>
> Yes, that's an interesting thought. Noted. Thanks!
>
>> On:
>>
>>   inline static __sve_bool_type
>>   __sve_active_mask()
>>   { return svwhilelt_b8(int8_t(0), int8_t(_Np)); };
>>
>> I think it'd be more canonical to use svptrue_b8(). (The inputs to
>> svwhilelt aren't naturally tied to the size of the output elements.
>> This particular call will use WHILELE with 32-bit registers. So if
>> you do continue to use svwhile*, it'd be good to remove the casts on
>> the inputs.)
>>
>> It'd also be good to use svptrue_b8() for all element sizes where
>> possible, and only use element-specific masks where absolutely
>> necessary. This helps to promote reuse of predicates.
>>
>> Or does the API require that every mask value is internally "canonical",
>> i.e. every element must be 0 or 1, regardless of the element size?
>> If so, that could introduce some inefficiency on SVE, where upper bits
>> are (by design) usually ignored.
>
> I am not sure if svptrue_b8() would satisfy the requirement here, as
> currently we use svwhilelt_b8(0, _Np) to make a mask with only the
> elements in [0, _Np) active (1) and the remaining elements inactive (0).
> I would prefer to continue to use svwhile* and remove the casts on the
> inputs.
>
>> On:
>>
>>   template struct __sve_vector_type
>>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>>   {};
>>
>>   template struct __sve_vector_type
>>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>>   {};
>>
>>   template struct __sve_vector_type
>>   :
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Hi,

Thanks a lot for the review. Sorry for the very late reply. The following are my comments on the feedback.

> The main thing that worries me is:
>
>   #if _GLIBCXX_SIMD_HAVE_SVE
>   constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
>   #else
>   constexpr inline int __sve_vectorized_size_bytes = 0;
>   #endif
>
> Although -msve-vector-bits is currently a per-TU setting, that isn't
> necessarily going to be the case in future. Ideally it would be
> possible to define different implementations of a function with
> different (fixed) vector lengths within the same TU. The value at
> the point that the header file is included is then somewhat arbitrary.
>
> So rather than have:
>
>> using __neon128 = _Neon<16>;
>> using __neon64 = _Neon<8>;
>> +using __sve = _Sve<>;
>
> would it be possible instead to have:
>
>   using __sve128 = _Sve<128>;
>   using __sve256 = _Sve<256>;
>   ...etc...
>
> ? Code specialised for 128-bit vectors could then use __sve128 and
> code specialised for 256-bit vectors could use __sve256.

I think it is very hard to maintain a specialization for each vector length, as lengths for SVE can vary from 128 to 2048 in 128-bit increments. Hence, there would be many specializations. Also, we only embed the vector length information in the type specialization. Hence, one generic __sve abi type would be sufficient, which holds the max vector length (in bytes) derived from __sve_vectorized_bytes (which is again derived from __ARM_FEATURE_SVE_BITS in simd.h).

> Perhaps that's not possible as things stand, but it'd be interesting
> to know the exact failure mode if so. Either way, it would be good to
> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
> and instead rely on information derived from template parameters.

I am currently directly using __ARM_FEATURE_SVE_BITS in simd_sve.h for defining specific vector types of basic arithmetic types such as int, float, double etc. The __ARM_FEATURE_SVE_BITS macro is used in the context of arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS), which requires an integer constant expression. Hence, it's not quite usable with information derived from template parameters. May I please know what the exact issue is if __ARM_FEATURE_SVE_BITS is directly used?

Another possible solution is to define a local macro in simd.h as

  #define _GLIBCXX_ARM_FEATURE_SVE_BITS __ARM_FEATURE_SVE_BITS

and use _GLIBCXX_ARM_FEATURE_SVE_BITS at all occurrences of __ARM_FEATURE_SVE_BITS.

> It should be possible to use SVE to optimise some of the __neon*
> implementations, which has the advantage of working even for VLA SVE.
> That's obviously a separate patch, though. Just saying for the record.

Yes, that's an interesting thought. Noted. Thanks!

> On:
>
>   inline static __sve_bool_type
>   __sve_active_mask()
>   { return svwhilelt_b8(int8_t(0), int8_t(_Np)); };
>
> I think it'd be more canonical to use svptrue_b8(). (The inputs to
> svwhilelt aren't naturally tied to the size of the output elements.
> This particular call will use WHILELE with 32-bit registers. So if
> you do continue to use svwhile*, it'd be good to remove the casts on
> the inputs.)
>
> It'd also be good to use svptrue_b8() for all element sizes where
> possible, and only use element-specific masks where absolutely
> necessary. This helps to promote reuse of predicates.
>
> Or does the API require that every mask value is internally "canonical",
> i.e. every element must be 0 or 1, regardless of the element size?
> If so, that could introduce some inefficiency on SVE, where upper bits
> are (by design) usually ignored.

I am not sure if svptrue_b8() would satisfy the requirement here, as currently we use svwhilelt_b8(0, _Np) to make a mask with only the elements in [0, _Np) active (1) and the remaining elements inactive (0). I would prefer to continue to use svwhile* and remove the casts on the inputs.
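The difference between the two mask constructors under discussion can be modelled in portable C++. This is an editor's illustration with made-up names, not the SVE intrinsics themselves: `svwhilelt_b8(0, n)` yields a prefix mask with the first n of N lanes active, while `svptrue_b8()` activates every lane.

```cpp
#include <array>
#include <cstddef>

// Portable model of a prefix mask: lane i is active exactly when i < n.
// This mirrors what svwhilelt_b8(0, n) computes for an N-lane vector.
template <std::size_t N>
constexpr std::array<bool, N> whilelt_mask(std::size_t n)
{
  std::array<bool, N> mask{};
  for (std::size_t i = 0; i < N; ++i)
    mask[i] = i < n;
  return mask;
}

// Portable model of an all-true predicate, as svptrue_b8() produces.
template <std::size_t N>
constexpr std::array<bool, N> ptrue_mask()
{ return whilelt_mask<N>(N); }
```

The two agree only when n equals the full lane count, which is why the library's partial-register case (`_Np` smaller than the native width) cannot simply substitute `svptrue_b8()`.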
> On:
>
>   template struct __sve_vector_type
>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>   {};
>
>   template struct __sve_vector_type
>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>   {};
>
>   template struct __sve_vector_type
>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>   {};
>
>   template struct __sve_vector_type
>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>   {};
>
>   template struct __sve_vector_type
>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>   {};
>
>   template struct __sve_vector_type
>   : __sve_vector_type<__get_sve_value_type_t, _Np>
>   {};
>
> Does this ensure complete coverage without duplicate definitions? It seems
> to be hard-coding a specific choice of stdint.h type (e.g. by including
> char but not unsigned char, and long long int but not long int).
> Unfortunately the stdint.h definitions vary between newlib and glibc,
> for example.

I have overloaded all the types that are present in the simd testsuite.
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Richard Sandiford writes:
> template struct _SveMaskWrapper
> {
>   ...
>
>   _GLIBCXX_SIMD_INTRINSIC constexpr value_type
>   operator[](size_t __i) const
>   {
>     return _BuiltinSveMaskType::__sve_mask_active_count(
>       _BuiltinSveVectorType::__sve_active_mask(),
>       svand_z(_BuiltinSveVectorType::__sve_active_mask(),
>               svcmpeq(_BuiltinSveVectorType::__sve_active_mask(),
>                       _BuiltinSveMaskType::__index0123,
>                       typename _BuiltinSveMaskType::__sve_mask_uint_type(__i)),
>               _M_data));
>   }
>
> A simpler way to do this might be to use svdup_u_z (_M_data, 1, 0)
> and then svorrv on the result.

Sorry for the nonsense comment. I was thinking of a different operation. But GCC knows how to index a fixed-length data vector, so it might be worth using:

  svdup_u_z (_M_data, 1)[__i]

Thanks,
Richard
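The predicate-to-data-vector trick suggested here can be modelled in portable C++. This is an editor's illustration, not SVE code: a zeroing duplicate materialises the predicate as a data vector holding a value in active lanes and 0 in inactive lanes, after which reading mask element i is a plain indexed read.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Portable model of a zeroing svdup: active lanes receive `value`,
// inactive lanes receive 0. Indexing the result then yields the state
// of one mask lane, mirroring svdup_u_z (_M_data, 1)[__i] above.
template <std::size_t N>
constexpr std::array<std::uint8_t, N>
dup_z(const std::array<bool, N>& mask, std::uint8_t value)
{
  std::array<std::uint8_t, N> v{};
  for (std::size_t i = 0; i < N; ++i)
    v[i] = mask[i] ? value : 0;
  return v;
}
```

With this model, `mask[i]` is recovered as `dup_z(mask, 1)[i] != 0` - one duplicate plus one indexed read instead of the compare/and/count sequence in the quoted operator[].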
Re: [PATCH] libstdc++: add ARM SVE support to std::experimental::simd
Thanks for the patch and sorry for the slow review. I can only comment on the usage of SVE, rather than on the scaffolding around it. Hopefully Jonathan or others can comment on the rest.

The main thing that worries me is:

  #if _GLIBCXX_SIMD_HAVE_SVE
  constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS / 8;
  #else
  constexpr inline int __sve_vectorized_size_bytes = 0;
  #endif

Although -msve-vector-bits is currently a per-TU setting, that isn't necessarily going to be the case in future. Ideally it would be possible to define different implementations of a function with different (fixed) vector lengths within the same TU. The value at the point that the header file is included is then somewhat arbitrary.

So rather than have:

> using __neon128 = _Neon<16>;
> using __neon64 = _Neon<8>;
> +using __sve = _Sve<>;

would it be possible instead to have:

  using __sve128 = _Sve<128>;
  using __sve256 = _Sve<256>;
  ...etc...

? Code specialised for 128-bit vectors could then use __sve128 and code specialised for 256-bit vectors could use __sve256.

Perhaps that's not possible as things stand, but it'd be interesting to know the exact failure mode if so. Either way, it would be good to remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible, and instead rely on information derived from template parameters.

It should be possible to use SVE to optimise some of the __neon* implementations, which has the advantage of working even for VLA SVE. That's obviously a separate patch, though. Just saying for the record.

On:

  inline static __sve_bool_type
  __sve_active_mask()
  { return svwhilelt_b8(int8_t(0), int8_t(_Np)); };

I think it'd be more canonical to use svptrue_b8(). (The inputs to svwhilelt aren't naturally tied to the size of the output elements. This particular call will use WHILELE with 32-bit registers. So if you do continue to use svwhile*, it'd be good to remove the casts on the inputs.)
It'd also be good to use svptrue_b8() for all element sizes where possible, and only use element-specific masks where absolutely necessary. This helps to promote reuse of predicates.

Or does the API require that every mask value is internally "canonical", i.e. every element must be 0 or 1, regardless of the element size? If so, that could introduce some inefficiency on SVE, where upper bits are (by design) usually ignored.

On:

  template struct __sve_vector_type
  : __sve_vector_type<__get_sve_value_type_t, _Np>
  {};

  template struct __sve_vector_type
  : __sve_vector_type<__get_sve_value_type_t, _Np>
  {};

  template struct __sve_vector_type
  : __sve_vector_type<__get_sve_value_type_t, _Np>
  {};

  template struct __sve_vector_type
  : __sve_vector_type<__get_sve_value_type_t, _Np>
  {};

  template struct __sve_vector_type
  : __sve_vector_type<__get_sve_value_type_t, _Np>
  {};

  template struct __sve_vector_type
  : __sve_vector_type<__get_sve_value_type_t, _Np>
  {};

Does this ensure complete coverage without duplicate definitions? It seems to be hard-coding a specific choice of stdint.h type (e.g. by including char but not unsigned char, and long long int but not long int). Unfortunately the stdint.h definitions vary between newlib and glibc, for example.

  inline static type
  __sve_mask_first_true()
  { return svptrue_pat_b8(SV_VL1); }

SV_VL1 produces the same predicate result for all element widths, so I don't think this needs to be parameterised.

  template <> struct __sve_mask_type<8>
  {
    ...
    inline static void called() {}

What's the called() function for?
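The coverage concern about hard-coding stdint.h choices rests on a core-language fact that is easy to verify. The following editor's note is a standalone check, not part of the patch: char, signed char, and unsigned char are three distinct types, and long int stays distinct from long long int even on targets where both are 64 bits, so specialisations keyed on `char` and `long long int` do not automatically cover the others.

```cpp
#include <type_traits>

// char is a distinct type from both signed char and unsigned char,
// regardless of whether char is signed or unsigned on the target.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");
static_assert(!std::is_same<signed char, unsigned char>::value, "distinct types");

// long int and long long int are distinct types even when they share a
// width, so specialising one does not cover the other. Which of them a
// stdint.h typedef such as int64_t names differs between C libraries
// (glibc vs newlib), which is why typedef-keyed coverage is not portable.
static_assert(!std::is_same<long int, long long int>::value, "distinct types");
```

A specialisation set that wants full coverage therefore has to enumerate the built-in types themselves (or key on size and signedness), rather than a particular C library's stdint.h spellings.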
  template
  _GLIBCXX_SIMD_INTRINSIC constexpr auto
  __sve_reinterpret_cast(_From __v)
  {
    if constexpr (std::is_same_v<_To, int32_t>)
      return svreinterpret_s32(__v);
    else if constexpr (std::is_same_v<_To, int64_t>)
      return svreinterpret_s64(__v);
    else if constexpr (std::is_same_v<_To, float32_t>)
      return svreinterpret_f32(__v);
    else if constexpr (std::is_same_v<_To, float64_t>)
      return svreinterpret_f64(__v);
  }

Why does this only need to handle these 4 types? I think it'd be worth a comment.

  template struct _SveMaskWrapper
  {
    ...

    _GLIBCXX_SIMD_INTRINSIC constexpr value_type
    operator[](size_t __i) const
    {
      return _BuiltinSveMaskType::__sve_mask_active_count(
        _BuiltinSveVectorType::__sve_active_mask(),
        svand_z(_BuiltinSveVectorType::__sve_active_mask(),
                svcmpeq(_BuiltinSveVectorType::__sve_active_mask(),
                        _BuiltinSveMaskType::__index0123,
                        typename _BuiltinSveMaskType::__sve_mask_uint_type(__i)),
                _M_data));
    }

A simpler way to do this might be to use svdup_u_z (_M_data, 1, 0) and then svorrv on the result.

  struct _CommonImplSve
  {
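The `__sve_reinterpret_cast` helper quoted above uses an if-constexpr chain keyed on the destination type. The dispatch pattern itself can be sketched portably; this is an editor's illustration with made-up names, where memcpy-based helpers stand in for the svreinterpret_* intrinsics (which reinterpret register bits without any value conversion).

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// Stand-in for svreinterpret_s32: bit-for-bit copy into int32_t.
template <typename From>
std::int32_t reinterpret_s32(From v)
{
  static_assert(sizeof(From) == sizeof(std::int32_t), "size must match");
  std::int32_t r;
  std::memcpy(&r, &v, sizeof r); // no value conversion, just the bits
  return r;
}

// Stand-in for svreinterpret_f32: bit-for-bit copy into float.
template <typename From>
float reinterpret_f32(From v)
{
  static_assert(sizeof(From) == sizeof(float), "size must match");
  float r;
  std::memcpy(&r, &v, sizeof r);
  return r;
}

// The dispatch pattern: each branch is discarded at compile time unless
// the requested destination type matches, so only the matching helper
// is ever instantiated. Like the quoted helper, destination types
// outside the listed set are simply not handled.
template <typename To, typename From>
To reinterpret_sketch(From v)
{
  if constexpr (std::is_same_v<To, std::int32_t>)
    return reinterpret_s32(v);
  else if constexpr (std::is_same_v<To, float>)
    return reinterpret_f32(v);
}
```

One design note: because discarded if-constexpr branches are not instantiated, the per-type static_asserts only fire for the branch actually taken - which is also why the real helper can call intrinsics that exist only for certain element types.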