On Thu, Mar 19, 2026 at 9:25 AM Richard Biener
<[email protected]> wrote:
>
> On Thu, Mar 19, 2026 at 5:14 AM Michael Clark <[email protected]> wrote:
> >
> > On 3/19/26 03:28, Richard Biener via Gcc wrote:
> > > Yes, but for example on x86 the memory order for scatters is defined to
> > > be left-to-right.
> >
> > which side is lane zero on? left or right? ;-)
> >
> > on little-endian I'd expect right-to-left if lane 0 were on the right.
> >
> > there’s a subtle interaction between TSO and vector scatters. although a
> > vector instruction has a logical width (e.g. 512-bit), in practice it
> > decomposes into architectural subgroups (often ~128-bit) and further
> > into micro-architectural quanta (e.g. 256×2 execution, banked caches,
> > store buffers). execution and visibility therefore occur in smaller,
> > partially parallel units of execution, and different threads may observe
> > partial progress depending on micro-architectural effects.
> >
> > TSO constrains the ordering of stores as they become globally visible,
> > but it does not inherently require any ordering within a single vector
> > instruction. intra-vector lane ordering is orthogonal to TSO.
> >
> > however, x86 defines a lane order for scatters (e.g. to determine which
> > store wins on conflicts), introducing an additional ordering constraint
> > at the ISA level that is not strictly required by TSO itself. while this
> > does not imply full serialization or per-lane global visibility, it does
> > constrain the abstract behavior more than a fully relaxed model would.
>
> It's probably modeled so you can vectorize
>
>   for (int i = 0; i < n; ++i)
>     a[b[i]] = 42;
>
> because there's no way you can codify a runtime check to ensure
> b[i] do not produce the same index.  AVX512F (which introduces scatters)
> is complemented by AVX512CD to be able to also handle
>
>   for (int i = 0; i < n; ++i)
>     a[b[i]]++;
>
> because this is not only WAW but RAW conflicts which makes splitting
> the iteration domain upon conflicts necessary.
>
> That said - the *scatter* optabs assume naive vectorization of the
> first loop works, even when b[] = { 0, 1, 2, 0 }, so if GCN is not able
> to guarantee this their "vector address store" are not scatters in
> terms of what GCC assumes.  The documentation for the optabs
> does not mention this constraint.

Does not mention with explicit words, but

"
... For each element index @var{i}:
...
store element @var{i} of operand 4 to that address.
"

does not say "in random order", clearly a native speaker might suggest
a less ambiguous (if it is ambiguous at all) way to formulate this.  Possibly
pseudo-code would be most obvious.

>
> > this can be viewed as a potential performance cliff: a very strong
> > ordering constraint applied within a single instruction, despite being
> > orthogonal to the guarantees provided by TSO. I think it is too strong,
>
> True, this is one thing that can make scatters slow, in particular on
> architectures like GCN.
>
> > in practice, implementations still split, merge, and overlap work, and
> > observable behavior is shaped by store buffering, cache banking, and
> > other micro-architectural details. the effective execution width is
> > therefore dynamic and may differ from the nominal vector width.
> >
> > as a result, relying on predictable outcomes for conflicting scatter
> > lanes seems unwise to me. a more robust model is to treat scatters as
> > decomposed operations with largely unspecified internal ordering, even
> > if the ISA defines a resolution rule for conflicts.
>
> That would then still ask for a (non-UNSPEC) way to encode such
> ordering guarantee in RTL for the x86 guaranteed semantics (I suspect
> SVE as well) and RVV semantics (it has even stronger ordering guarantee
> than x86).
>
> Richard.
>
> > Michael.

Reply via email to