On 17/03/2026 12:13, Richard Biener wrote:
> On Tue, Mar 17, 2026 at 1:01 PM Andrew Stubbs <[email protected]> wrote:
>> Is there any reason why a MEM cannot take a vector of addresses, other
>> than the few cases fixed in the attached patch?
>> It would make perfect sense for AMD GCN to do this, so I would like to
>> know if such a patch would be acceptable to the maintainers, or if there
>> are likely to be technical showstoppers? (Initial testing of the
>> prototype patches seems promising).
>> I've attached 3 prototype patches to illustrate (not really for review):
>> 1. Enough middle-end changes to not ICE.
>> 2. The amdgcn backend changes to make such MEMs "legitimate", add the
>> instructions and constraints that can use them, and add support for the
>> different forms in print_operand. (There's a few bits regarding
>> vec_duplicate of offsets that are the result of some experimentation I
>> did and are not strictly in use here, but you can get the idea, I think.)
>> 3. A basic implementation of the vector atomics that motivated this
>> request in the first place, but is not strictly "part of it".
>> Obviously, none of this is for GCC 16.
> The issue is that (mem:<vectype> (reg:<vectype>)) does not play
> nicely with the idea that a (mem:...) accesses contiguous memory
> as indicated by MEM_ATTRs.
Thank you for your prompt reply!
What is the problem with MEM_ATTRs, as long as they remain correct *per
lane*, or are absent altogether?
These MEMs would not be generated naturally by expand (although that
would be nice). They would come from builtins and the likes of
gen_gather_load, atomic_load, and other backend RTL passes, so the tree
expression, size, offset, etc. will likely be unset.
If required for correctness, I could add middle-end asserts to ensure
that the problematic attributes are unset when the address has a vector
mode?
Current fields in mem_attrs:
* "tree expr"
- If known, this probably remains accurate.
* "poly_int64 offset"
& "bool offset_known_p"
- Problematic, can be disallowed.
* "poly_int64 size"
& "bool size_known_p"
- Likewise? Can be disallowed, or treated as a total.
* "alias_set_type alias"
- Likely remains accurate.
* "unsigned int align"
- Probably accurate, per lane.
* "unsigned char addrspace"
- No problem.
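For illustration, the kind of middle-end assert I have in mind
(hypothetical; the exact placement and variable names are guesses, but
the macros and the mem_attrs fields are the real ones) would be
something along these lines, wherever the attributes get set:

    /* A MEM whose address is a vector cannot describe a single
       contiguous access, so offset and size must stay unknown.  */
    if (VECTOR_MODE_P (GET_MODE (XEXP (mem, 0))))
      gcc_checking_assert (!attrs->offset_known_p
                           && !attrs->size_known_p);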
> A "proper" representation for a gather
> might be a new (vec_concat_multiple:<vector> [ (mem:<scalar> ..)
> (mem:<scalar> ..) ... ])
> or as all targets(?) do, an UNSPEC. That vec_concat_multiple could be
> called vec_gather then, but I'd not imply the MEM here. For GCN
> you'd then have nested (subreg ..) of the address vector. Quite ugly,
> considering the large number of lanes for GCN.
> Another alternative is this (or some equivalent using vec_select):
> (parallel
>   [(set (mem:SI (subreg:DI (reg:V64DI) 0))
>         (subreg:SI (reg:V64DI) 0))
>    (set (mem:SI (subreg:DI (reg:V64DI) 8))
>         (subreg:SI (reg:V64DI) 4))
>    ......
>    (set (mem:SI (subreg:DI (reg:V64DI) 504))
>         (subreg:SI (reg:V64DI) 252))])
...... which I refuse to do. Likewise for your proposed
vec_concat_multiple. It's ugly, requires special handling everywhere it
appears, completely defeats optimizers such as combine, and we need one
instruction pattern for every variety of supported base/offset combination
because none of the legitimate_address/legitimize_address machinery is
active.
And then, when all that is done, we still have to completely reimplement
the vector atomics and the other features I'm working on right now
because all of those have to have ugly unspecs too.
Oh, and we also have to write all this for V2, V4, V8, V16, and V32,
which is why we're using the briefer, but still unsatisfactory UNSPEC
patterns.
The problem with UNSPEC, besides disabling the whole legitimize_address
thing again, is that I cannot say this:
(set (unspec:V64SI [(reg:V64DI)] UNSPEC_VECSTORE)
     (reg:V64SI))
That is apparently not a valid SET destination, and it also doesn't
mention MEM (which I suspect has consequences), so we end up with this:
(set (mem:BLK (scratch))
     (unspec:BLK
       [(reg:V64DI)   /* dst addr */
        (reg:V64SI)]  /* src data */
       UNSPEC_VECSTORE))
..... which is not great (and not hypothetical; this is simplified from
what we are actually using for scatter_store, which also has its own
attributes for address space and volatile, because the original MEM is
lost).
And even if this method is more manageable, I still have to write custom
implementations of the vector versions of every memory-touching
instruction, even though each produces exactly the same machine
instruction as the equivalent scalar pattern in the end.
:-(
Thanks in advance.
Andrew
----------
Background ...
I've often said that on GCN "all loads and stores are gather/scatter",
because there's no instruction for "load a whole vector starting at this
base address". But that's not really true, because, at least in GCC
terminology, gather/scatter uses a scalar base address with a vector of
offsets and a scalar multiplier, which GCN also *cannot* do. [1]
What GCN *can* do is take a vector of arbitrary addresses and load/store
all of them in parallel. It can then add an identical scalar offset to
each address. There doesn't need to be any relationship or pattern
between the addresses (although I believe the hardware may well optimize
accesses to contiguous data). Each address refers to a single element
of data, so it really is like gluing N scalar load instructions together
into one.
So, whenever GCC tries to load a contiguous vector, or does a
gather_load or scatter_store, the backend converts this into an unspec
that carries the vector of addresses, which could be represented much
more neatly as a MEM with a vector "base".
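Concretely, using the (mem:<vectype> (reg:<vectype>)) shape from above,
a gather load would simply be:

    (set (reg:V64SI)
         (mem:V64SI (reg:V64DI)))  ; 64 lanes, one DImode address per lane

and the legitimate_address/legitimize_address machinery, combine, etc.
would all see an ordinary MEM.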
The last straw came when I wanted to implement vector atomics. The
atomic instructions have a lot of if-then-else with cache handling for
different device features, and I was looking at having to reproduce or
refactor it all to add new insns that use new unspecs similar to the
existing gather/scatter patterns, with all the different base+offset
combinations, which would mean yet more places to touch each time we
support a new device with a new cache configuration. But at the end of
it all, the actual instruction produced would be identical (apart from
there being a different value in the vector mask register).
I also anticipate that the new MEM will help with another project I'm
working on right now.
[1] The "global_load" instruction can do scalar_base+vector_offset (no
multiplier), but only in one address space that is too limited for
general use. The more useful "flat_load" instruction is strictly vector
addresses only.