Matthias Kretz <m.kr...@gsi.de> writes:
> On Sunday, 10 December 2023 14:29:45 CET Richard Sandiford wrote:
>> Thanks for the patch and sorry for the slow review.
>
> Sorry for my slow reaction. I needed a long vacation. For now I'll focus on 
> the design question wrt. multi-arch compilation.
>
>> I can only comment on the usage of SVE, rather than on the scaffolding
>> around it.  Hopefully Jonathan or others can comment on the rest.
>
> That's very useful!
>
>> The main thing that worries me is:
>> 
>> #if _GLIBCXX_SIMD_HAVE_SVE
>> constexpr inline int __sve_vectorized_size_bytes = __ARM_FEATURE_SVE_BITS/8;
>> #else
>> constexpr inline int __sve_vectorized_size_bytes = 0;
>> #endif
>> 
>> Although -msve-vector-bits is currently a per-TU setting, that isn't
>> necessarily going to be the case in future.
>
> This is a topic that I care about a lot... as simd user, implementer,
> and WG21 proposal author. Are you thinking of a plan to implement the
> target_clones function attribute for different SVE lengths? Or does it
> go further than that? PR83875 raises the same issue (and solution
> ideas) for x86. If I understand your concern correctly, the issue
> you're raising exists in the same form for x86.
>
> If anyone is interested in working on a "translation phase 7
> replacement" for compiler-flag macros, I'd be happy to give some input
> on what I believe is necessary to make target_clones work with
> std(x)::simd. This seems to be about constant expressions that return
> compiler-internal state - probably similar to how static reflection
> needs to work.
>
> For a sketch of a direction: what I'm already doing in
> std::experimental::simd is to tag all non-always_inline function names
> with a bitflag representing a relevant subset of -f and -m flags. That
> way, I'm guarding against surprises when linking TUs compiled with
> different flags.

Yeah, target_clones could be one use for switching lengths.  In that
case the choice would be based on a load-time check of the vector length.
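
To sketch the dispatch idea (hedged: this uses GCC's function
multiversioning syntax, whose aarch64 support is itself only just
landing, and it clones on *features*; cloning on a fixed SVE vector
length is not expressible today):

// Feature-based multiversioning; the resolver picks a clone at load
// time.  Selecting between fixed SVE vector lengths would need a new
// mechanism on top of this.
__attribute__((target_clones("default", "sve", "sve2")))
void saxpy(int n, float a, const float *x, float *y)
{
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}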

However, we also support different vector lengths for streaming SVE
(running in "streaming" mode on SME) and non-streaming SVE (running
in "non-streaming" mode on the core).  Having two different lengths is
expected to be the common case, rather than a theoretical curiosity.

Software also provides a "streaming-compatible" mode, in which code can
run in either streaming or non-streaming mode but restricts itself to
the (large) common subset of non-streaming and streaming features.
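
To make the mode split concrete, here's a minimal sketch using the ACLE
SME keywords (assuming an SME-capable toolchain); svcntb() reports the
vector length of whichever mode the function runs in:

#include <arm_sve.h>

// Non-streaming vector length in bytes.
uint64_t vl_nonstreaming(void) { return svcntb(); }

// Streaming vector length in bytes; on real hardware this is commonly
// different from the non-streaming length.
uint64_t vl_streaming(void) __arm_streaming { return svcntb(); }

// Streaming-compatible: callable from either mode, reports the length
// of the mode it happens to run in.
uint64_t vl_either(void) __arm_streaming_compatible { return svcntb(); }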

So it isn't really safe to treat the vector length as a program or
process invariant, or to assume that all active uses of std::simd
in a binary are for the same vector length.  It should in principle
be possible to use std::simd variants for non-streaming code and
std::simd variants for streaming code even if the lengths are
different.

So in the target_clones case, the clone would have to apply to a
non-streaming function and switch based on the non-streaming
vector length, or apply to a streaming function and switch based
on the streaming vector length.  (Neither of those are supported yet,
of course.)

I think there are probably also ODR problems with hard-coding a vector
length, instead of treating it as a template parameter.

Unfortunately it isn't currently possible to write:

template<int N>
using T = svint32_t __attribute__((arm_sve_vector_bits(N)));

due to PRs 58855/68703/88600.  But I think we should fix that. :)
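
Until then, the usual workaround is to keep the attribute argument
non-dependent, e.g. behind a class template specialised for the one
width the TU is compiled for (a sketch, not the patch's actual code):

#include <arm_sve.h>

template<int N> struct __sve_vec_int32;

#if __ARM_FEATURE_SVE_BITS
// The attribute is rejected on a dependent N, but accepted here because
// the argument is a plain constant.  Other widths still can't coexist
// in the same TU, which is exactly the limitation above.
template<> struct __sve_vec_int32<__ARM_FEATURE_SVE_BITS>
{
  using type = svint32_t
    __attribute__((arm_sve_vector_bits(__ARM_FEATURE_SVE_BITS)));
};
#endif

template<int N>
using __sve_fixed_int32 = typename __sve_vec_int32<N>::type;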

>> Ideally it would be
>> possible to define different implementations of a function with
>> different (fixed) vector lengths within the same TU.  The value at
>> the point that the header file is included is then somewhat arbitrary.
>> 
>> So rather than have:
>> >  using __neon128 = _Neon<16>;
>> >  using __neon64 = _Neon<8>;
>> > 
>> > +using __sve = _Sve<>;
>> 
>> would it be possible instead to have:
>> 
>>   using __sve128 = _Sve<128>;
>>   using __sve256 = _Sve<256>;
>>   ...etc...
>> 
>> ?  Code specialised for 128-bit vectors could then use __sve128 and
>> code specialised for 256-bit vectors could use __sve256.
>
> Hmm, as things stand we'd need two numbers, IIUC:
> _Sve<NumberOfUsedBytes, SizeofRegister>
>
> On x86, "NumberOfUsedBytes" is sufficient, because 33-64 implies zmm
> registers (and -mavx512f), 17-32 implies ymm, and <=16 implies xmm
> (except where it doesn't ;) ).

When would NumberOfUsedBytes < SizeofRegister be used for SVE?  Would it
be for storing narrower elements in wider containers?  If the interface
supports that then, yeah, two parameters would probably be safer.

Or were you thinking about emulating narrower vectors with wider registers
using predication?  I suppose that's possible too, and would be similar in
spirit to using SVE to optimise Advanced SIMD std::simd types.
But mightn't it cause confusion if sizeof applied to a "16-byte"
vector actually gives 32?
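
For concreteness, the predication variant would look something like
this sketch (not the patch's implementation): the wider register is
used, but only the first four 32-bit lanes are ever active.

#include <arm_sve.h>

// A "16-byte" 4 x int32 addition emulated on wider SVE registers.
svint32_t __add_4xi32(svint32_t a, svint32_t b)
{
  svbool_t pg = svwhilelt_b32_s32(0, 4); // lanes 0..3 active, rest off
  return svadd_s32_m(pg, a, b);          // inactive lanes keep a's values
}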

>> Perhaps that's not possible as things stand, but it'd be interesting
>> to know the exact failure mode if so.  Either way, it would be good to
>> remove direct uses of __ARM_FEATURE_SVE_BITS from simd_sve.h if possible,
>> and instead rely on information derived from template parameters.
>
> The TS spec requires std::experimental::native_simd<int> to basically
> give you the largest, most efficient, full SIMD register. (And it
> can't be a sizeless type because they don't exist in C++.) So how
> would you do that without looking at __ARM_FEATURE_SVE_BITS in the
> simd implementation?

I assume std::experimental::native_simd<int> has to have the same
meaning everywhere for ODR reasons?  If so, it'll need to be an
Advanced SIMD vector for AArch64 (but using SVE to optimise certain
operations under the hood where available).  I don't think we could
support anything else.

Even if SVE registers are larger than 128 bits, we can't require
all code in the binary to be recompiled with that knowledge.

I suppose that breaks the "largest" requirement, but due to the
streaming/non-streaming distinction I mentioned above, there isn't
really a single, universal "largest" in this context.

>> It should be possible to use SVE to optimise some of the __neon*
>> implementations, which has the advantage of working even for VLA SVE.
>> That's obviously a separate patch, though.  Just saying for the record.
>
> I learned that NVidia Grace CPUs alias NEON and SVE registers. But I
> must assume that other SVE implementations (especially those with
> __ARM_FEATURE_SVE_BITS > 128) don't do that and might incur a
> significant latency when going from a NEON register to an SVE register
> and back (each of which requires a store-load, IIUC). So are you
> thinking of implementing everything via SVE? That would break ABI, no?

SVE and Advanced SIMD are architected to use the same registers
(i.e. SVE registers architecturally extend Advanced SIMD registers).
In Neoverse V1 (SVE256) they are the same physical register as well.
I believe the same is true for A64FX.

FWIW, GCC has already started using SVE in this way.  E.g. SVE provides
a wider range of immediate constants for logic operations, so we now use
them for Advanced SIMD logic where beneficial.
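
For illustration, the NEON/SVE bridge intrinsics expose that register
aliasing directly (assuming a toolchain that ships
arm_neon_sve_bridge.h); this is just a sketch, not what GCC emits:

#include <arm_neon.h>
#include <arm_neon_sve_bridge.h>

// Process a fixed 128-bit Advanced SIMD value with SVE instructions.
// No data movement is needed: the NEON register is the low 128 bits
// of the corresponding SVE register.
int32x4_t __abs_via_sve(int32x4_t v)
{
  svint32_t sv = svset_neonq_s32(svdup_n_s32(0), v); // reinterpret
  svbool_t pg = svptrue_pat_b32(SV_VL4);             // first 4 lanes
  return svget_neonq_s32(svabs_s32_x(pg, sv));
}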

Thanks,
Richard
