Is there any reason why a MEM cannot take a vector of addresses, other than the few cases fixed in the attached patch?

It would make perfect sense for AMD GCN to do this, so I would like to know whether such a patch would be acceptable to the maintainers, or whether there are likely to be technical showstoppers. (Initial testing of the prototype patches seems promising.)

I've attached 3 prototype patches to illustrate (not really for review):

1. Enough middle-end changes to not ICE.

2. The amdgcn backend changes to make such MEMs "legitimate", add the instructions and constraints that can use them, and add support for the different forms in print_operand. (There are a few bits regarding vec_duplicate of offsets that are the result of some experimentation I did and are not strictly in use here, but you can get the idea, I think.)

3. A basic implementation of the vector atomics that motivated this request in the first place, although it is not strictly part of it.

Obviously, none of this is for GCC 16.

Thanks in advance.

Andrew
----------


Background ...

I've often said that on GCN "all loads and stores are gather/scatter", because there's no instruction for "load a whole vector starting at this base address". But that's not really true because, at least in GCC terminology, gather/scatter uses a scalar base address plus a vector of offsets scaled by a scalar multiplier, which GCN also *cannot* do. [1]

What GCN *can* do is take a vector of arbitrary addresses and load/store all of them in parallel. It can then add an identical scalar offset to each address. There doesn't need to be any relationship or pattern between the addresses (although I believe the hardware may well optimize accesses to contiguous data). Each address refers to a single element of data, so it really is like gluing together N scalar load instructions into one.

So, whenever GCC tries to load a contiguous vector, or does a gather_load or scatter_store, the backend converts this into an unspec that has the vector of addresses, which could be much more neatly represented as a MEM with a vector "base".
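Roughly, the two representations look like this (my simplified sketch; the real patterns also carry address-space and volatility operands):

    ;; Today: the load is an unspec wrapping the address vector.
    (set (reg:V64SI v8)
         (unspec:V64SI [(reg:V64DI v0)        ; 64 arbitrary addresses
                        (mem:BLK (scratch))]
                       UNSPEC_GATHER))

    ;; Proposed: the same load as an ordinary MEM with a vector "base".
    (set (reg:V64SI v8)
         (mem:V64SI (reg:V64DI v0)))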

The last straw came when I wanted to implement vector atomics. The atomic instructions have a lot of if-then-else with cache handling for different device features, and I was looking at having to reproduce or refactor it all to add new insns that use new unspecs similar to the existing gather/scatter patterns, with all the different base+offset combinations, which would mean yet more places to touch each time we support a new device with a new cache configuration. But at the end of all of it, the actual instruction produced would be identical (apart from there being a different value in the vector mask register).

I also anticipate that the new MEM will help with another project I'm working on right now.




[1] The "global_load" instruction can do scalar_base+vector_offset (no multiplier), but only in one address space that is too limited for general use. The more useful "flat_load" instruction is strictly vector addresses only.
From 4febbcd3ad0f8003644bb19d45878cd147bec407 Mon Sep 17 00:00:00 2001
From: Andrew Stubbs <[email protected]>
Date: Tue, 17 Mar 2026 10:40:56 +0000
Subject: [PATCH 1/3] rtl: Allow "(mem:<vecmode> (reg:<vecmode>))"

This patch makes the middle-end adjustments needed to allow a MEM to take a
vector of addresses in order to read/write a vector of data.

This prototype lacks large offset handling (but I've not actually encountered
an example that takes that code path).

These changes, alone, are not enough to actually allow such MEMs, since it's
unlikely that any backend reports the vector modes "legitimate".
---
 gcc/emit-rtl.cc | 8 ++++++--
 gcc/recog.cc    | 4 +++-
 gcc/rtl.h       | 2 +-
 gcc/rtlanal.cc  | 4 ++--
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/gcc/emit-rtl.cc b/gcc/emit-rtl.cc
index e41ec2283b8..0d31afda351 100644
--- a/gcc/emit-rtl.cc
+++ b/gcc/emit-rtl.cc
@@ -2382,7 +2382,7 @@ adjust_address_1 (rtx memref, machine_mode mode, poly_int64 offset,
 {
   rtx addr = XEXP (memref, 0);
   rtx new_rtx;
-  scalar_int_mode address_mode;
+  machine_mode address_mode;
   class mem_attrs attrs (*get_mem_attrs (memref)), *defattrs;
   unsigned HOST_WIDE_INT max_align;
 #ifdef POINTERS_EXTEND_UNSIGNED
@@ -2416,7 +2416,11 @@ adjust_address_1 (rtx memref, machine_mode mode, poly_int64 offset,
   /* Convert a possibly large offset to a signed value within the
      range of the target address space.  */
   address_mode = get_address_mode (memref);
-  offset = trunc_int_for_mode (offset, address_mode);
+  if (!VECTOR_MODE_P (address_mode))
+    offset = trunc_int_for_mode (offset, address_mode);
+  else
+    /* Vectors with offsets not implemented yet.  */
+    gcc_assert (known_eq (offset, 0));
 
   if (adjust_address)
     {
diff --git a/gcc/recog.cc b/gcc/recog.cc
index 48f6b45ec6d..28902c42c34 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -1626,7 +1626,9 @@ address_operand (rtx op, machine_mode mode)
 {
   /* Wrong mode for an address expr.  */
   if (GET_MODE (op) != VOIDmode
-      && ! SCALAR_INT_MODE_P (GET_MODE (op)))
+      && !(SCALAR_INT_MODE_P (GET_MODE (op))
+	   || (VECTOR_MODE_P (GET_MODE (op))
+	       && SCALAR_INT_MODE_P (GET_MODE_INNER (GET_MODE (op))))))
     return false;
 
   return memory_address_p (mode, op);
diff --git a/gcc/rtl.h b/gcc/rtl.h
index eebcc18a4f1..d9925d8662a 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -3686,7 +3686,7 @@ inline rtx single_set (const rtx_insn *insn)
   return single_set_2 (insn, PATTERN (insn));
 }
 
-extern scalar_int_mode get_address_mode (rtx mem);
+extern machine_mode get_address_mode (rtx mem);
 extern bool rtx_addr_can_trap_p (const_rtx);
 extern bool nonzero_address_p (const_rtx);
 extern bool rtx_unstable_p (const_rtx);
diff --git a/gcc/rtlanal.cc b/gcc/rtlanal.cc
index c5062ab7715..ad4f442649a 100644
--- a/gcc/rtlanal.cc
+++ b/gcc/rtlanal.cc
@@ -6299,7 +6299,7 @@ low_bitmask_len (machine_mode mode, unsigned HOST_WIDE_INT m)
 
 /* Return the mode of MEM's address.  */
 
-scalar_int_mode
+machine_mode
 get_address_mode (rtx mem)
 {
   machine_mode mode;
@@ -6307,7 +6307,7 @@ get_address_mode (rtx mem)
   gcc_assert (MEM_P (mem));
   mode = GET_MODE (XEXP (mem, 0));
   if (mode != VOIDmode)
-    return as_a <scalar_int_mode> (mode);
+    return mode;
   return targetm.addr_space.address_mode (MEM_ADDR_SPACE (mem));
 }
 
-- 
2.52.0

From 71a6f0e2fdc2832c658621b9c4eb0ceeb3a49948 Mon Sep 17 00:00:00 2001
From: Andrew Stubbs <[email protected]>
Date: Tue, 17 Mar 2026 10:46:51 +0000
Subject: [PATCH 2/3] amdgcn: Implement "(mem (reg:<vecmode>))"

This prototype patch modifies the amdgcn MEM handling to allow, recognise,
and output instructions that use vectors of addresses.

Prior to this, amdgcn was forced to use gather/scatter to load/store vectors
even though that is not really the most natural for the ISA.

I could further extend this patch to remove the gather/scatter insns and just
have the define_expand emit regular moves, using address vectors.
---
 gcc/config/gcn/constraints.md |  29 +++++++-
 gcc/config/gcn/gcn-valu.md    | 120 +++++++++++++++++++++++-----------
 gcc/config/gcn/gcn.cc         |  88 ++++++++++++++++++++++---
 3 files changed, 186 insertions(+), 51 deletions(-)

diff --git a/gcc/config/gcn/constraints.md b/gcc/config/gcn/constraints.md
index c683af420a3..68bd8d09c7d 100644
--- a/gcc/config/gcn/constraints.md
+++ b/gcc/config/gcn/constraints.md
@@ -116,7 +116,15 @@ (define_special_memory_constraint "RF"
   "Buffer memory address to flat memory."
   (and (match_code "mem")
        (match_test "AS_FLAT_P (MEM_ADDR_SPACE (op))
-		    && gcn_flat_address_p (XEXP (op, 0), mode)")))
+		    && gcn_flat_address_p (XEXP (op, 0), mode)
+		    && !VECTOR_MODE_P (GET_MODE (XEXP (op, 0)))")))
+
+(define_special_memory_constraint "Rf"
+  "Buffer memory address to flat memory."
+  (and (match_code "mem")
+       (match_test "AS_FLAT_P (MEM_ADDR_SPACE (op))
+		    && gcn_flat_address_p (XEXP (op, 0), mode)
+		    && VECTOR_MODE_P (GET_MODE (XEXP (op, 0)))")))
 
 (define_special_memory_constraint "RS"
   "Buffer memory address to scalar flat memory."
@@ -127,7 +135,14 @@ (define_special_memory_constraint "RS"
 (define_special_memory_constraint "RL"
   "Buffer memory address to LDS memory."
   (and (match_code "mem")
-       (match_test "AS_LDS_P (MEM_ADDR_SPACE (op))")))
+       (match_test "AS_LDS_P (MEM_ADDR_SPACE (op))
+		    && !VECTOR_MODE_P (GET_MODE (XEXP (op, 0)))")))
+
+(define_special_memory_constraint "Rl"
+  "Buffer memory address to LDS memory."
+  (and (match_code "mem")
+       (match_test "AS_LDS_P (MEM_ADDR_SPACE (op))
+		    && VECTOR_MODE_P (GET_MODE (XEXP (op, 0)))")))
 
 (define_special_memory_constraint "RG"
   "Buffer memory address to GDS memory."
@@ -144,4 +159,12 @@ (define_special_memory_constraint "RM"
   "Memory address to global (main) memory."
   (and (match_code "mem")
        (match_test "AS_GLOBAL_P (MEM_ADDR_SPACE (op))
-		    && gcn_global_address_p (XEXP (op, 0))")))
+		    && gcn_global_address_p (XEXP (op, 0))
+		    && !VECTOR_MODE_P (GET_MODE (XEXP (op, 0)))")))
+
+(define_special_memory_constraint "Rm"
+  "Memory address to global (main) memory."
+  (and (match_code "mem")
+       (match_test "AS_GLOBAL_P (MEM_ADDR_SPACE (op))
+		    && gcn_global_address_p (XEXP (op, 0))
+		    && VECTOR_MODE_P (GET_MODE (XEXP (op, 0)))")))
diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 9d752c717ff..ba13df43873 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -429,7 +429,10 @@ (define_expand "mov<mode>"
 	emit_insn (gen_gather<mode>_expr (operands[0], expr, a, v));
 	DONE;
       }
-    else if ((MEM_P (operands[0]) || MEM_P (operands[1])))
+    else if ((MEM_P (operands[0])
+	      && !VECTOR_MODE_P (GET_MODE (XEXP (operands[0], 0))))
+	     || (MEM_P (operands[1])
+		 && !VECTOR_MODE_P (GET_MODE (XEXP (operands[1], 0)))))
       {
         gcc_assert (!reload_completed);
 	rtx scratch = gen_reg_rtx (<VnDI>mode);
@@ -453,11 +456,17 @@ (define_insn "*mov<mode>"
 	(match_operand:V_1REG 1 "general_operand"))]
   ""
   {@ [cons: =0, 1; attrs: type, length, cdna]
-  [v  ,vA;vop1     ,4,*    ] v_mov_b32\t%0, %1
-  [v  ,B ;vop1     ,8,*    ] ^
-  [v  ,a ;vop3p_mai,8,*    ] v_accvgpr_read_b32\t%0, %1
-  [$a ,v ;vop3p_mai,8,*    ] v_accvgpr_write_b32\t%0, %1
-  [a  ,a ;vop1     ,4,cdna2] v_accvgpr_mov_b32\t%0, %1
+  [v ,vA;vop1     ,4 ,*    ] v_mov_b32\t%0, %1
+  [v ,B ;vop1     ,8 ,*    ] ^
+  [v ,Rf;flat     ,12,*    ] flat_load%o1\t%0, %A1%O1\;s_waitcnt\t0
+  [Rf,v ;flat     ,12,*    ] flat_store%s0\t%A0, %1%O0
+  [v ,Rm;flat     ,12,*    ] global_load%o1\t%0, %A1%O1\;s_waitcnt\tvmcnt(0)
+  [Rm,v ;flat     ,12,*    ] global_store%s0\t%A0, %1%O0
+  [v ,Rl;ds       ,12,*    ] ds_read%b1\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
+  [Rl,v ;ds       ,12,*    ] ds_write%b0\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
+  [v ,a ;vop3p_mai,8 ,*    ] v_accvgpr_read_b32\t%0, %1
+  [$a,v ;vop3p_mai,8 ,*    ] v_accvgpr_write_b32\t%0, %1
+  [a ,a ;vop1     ,4 ,cdna2] v_accvgpr_mov_b32\t%0, %1
   })
 
 (define_insn "mov<mode>_exec"
@@ -469,25 +478,36 @@ (define_insn "mov<mode>_exec"
    (clobber (match_scratch:<VnDI> 4))]
   "!MEM_P (operands[0]) || REG_P (operands[1])"
   {@ [cons: =0, 1, 2, 3, =4; attrs: type, length]
-  [v,vA,U0,e ,X ;vop1 ,4 ] v_mov_b32\t%0, %1
-  [v,B ,U0,e ,X ;vop1 ,8 ] v_mov_b32\t%0, %1
-  [v,v ,vA,cV,X ;vop2 ,4 ] v_cndmask_b32\t%0, %2, %1, vcc
-  [v,vA,vA,Sv,X ;vop3a,8 ] v_cndmask_b32\t%0, %2, %1, %3
-  [v,m ,U0,e ,&v;*    ,16] #
-  [m,v ,U0,e ,&v;*    ,16] #
+  [v ,vA,U0,e ,X ;vop1 ,4 ] v_mov_b32\t%0, %1
+  [v ,B ,U0,e ,X ;vop1 ,8 ] v_mov_b32\t%0, %1
+  [v ,Rf,U0,e ,X ;flat ,12] flat_load%o1\t%0, %A1%O1\;s_waitcnt\t0
+  [Rf,v ,U0,e ,X ;flat ,12] flat_store%s0\t%A0, %1%O0
+  [v ,Rm,U0,e ,X ;flat ,12] global_load%o1\t%0, %A1%O1\;s_waitcnt\tvmcnt(0)
+  [Rm,v ,U0,e ,X ;flat ,12] global_store%s0\t%A0, %1%O0
+  [v ,Rl,U0,e ,X ;ds   ,12] ds_read%b1\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
+  [Rl,v ,U0,e ,X ;ds   ,12] ds_write%b0\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
+  [v ,v ,vA,cV,X ;vop2 ,4 ] v_cndmask_b32\t%0, %2, %1, vcc
+  [v ,vA,vA,Sv,X ;vop3a,8 ] v_cndmask_b32\t%0, %2, %1, %3
+  [v ,m ,U0,e ,&v;*    ,16] #
+  [m ,v ,U0,e ,&v;*    ,16] #
   })
 
 (define_insn "*mov<mode>"
   [(set (match_operand:V_2REG 0 "nonimmediate_operand")
 	(match_operand:V_2REG 1 "general_operand"))]
   ""
-  {@ [cons: =0, 1; attrs: length, cdna]
-  [v ,vDB;16,*    ] v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1
-  [v ,a  ;16,*    ] v_accvgpr_read_b32\t%L0, %L1\;v_accvgpr_read_b32\t%H0, %H1
-  [$a,v  ;16,*    ] v_accvgpr_write_b32\t%L0, %L1\;v_accvgpr_write_b32\t%H0, %H1
-  [a ,a  ;8 ,cdna2] v_accvgpr_mov_b32\t%L0, %L1\;v_accvgpr_mov_b32\t%H0, %H1
-  }
-  [(set_attr "type" "vmult,vmult,vmult,vmult")])
+  {@ [cons: =0, 1; attrs: type, length, cdna]
+  [v ,vDB;vmult,16,*    ] v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1
+  [v ,a  ;vmult,16,*    ] v_accvgpr_read_b32\t%L0, %L1\;v_accvgpr_read_b32\t%H0, %H1
+  [$a,v  ;vmult,16,*    ] v_accvgpr_write_b32\t%L0, %L1\;v_accvgpr_write_b32\t%H0, %H1
+  [a ,a  ;vmult,8 ,cdna2] v_accvgpr_mov_b32\t%L0, %L1\;v_accvgpr_mov_b32\t%H0, %H1
+  [v ,Rf ;flat ,12,*    ] flat_load_dwordx2\t%0, %A1%O1\;s_waitcnt\t0
+  [Rf,v  ;flat ,12,*    ] flat_store_dwordx2\t%A0, %1%O0
+  [v ,Rm ;flat ,12,*    ] global_load_dwordx2\t%0, %A1%O1\;s_waitcnt\tvmcnt(0)
+  [Rm,v  ;flat ,12,*    ] global_store_dwordx2\t%A0, %1%O0
+  [v ,Rl ;ds   ,12,*    ] ds_read_b64\t%0, %A1%O1\;s_waitcnt\tlgkmcnt(0)
+  [Rl,v  ;ds   ,12,*    ] ds_write_b64\t%A0, %1%O0\;s_waitcnt\tlgkmcnt(0)
+  })
 
 (define_insn "mov<mode>_exec"
   [(set (match_operand:V_2REG 0 "nonimmediate_operand")
@@ -514,6 +534,10 @@ (define_insn "*mov<mode>_4reg"
   [v ,a  ;vmult,32,*    ]  v_accvgpr_read_b32\t%L0, %L1\; v_accvgpr_read_b32\t%H0, %H1\; v_accvgpr_read_b32\t%J0, %J1\; v_accvgpr_read_b32\t%K0, %K1
   [$a,v  ;vmult,32,*    ] v_accvgpr_write_b32\t%L0, %L1\;v_accvgpr_write_b32\t%H0, %H1\;v_accvgpr_write_b32\t%J0, %J1\;v_accvgpr_write_b32\t%K0, %K1
   [a ,a  ;vmult,32,cdna2]   v_accvgpr_mov_b32\t%L0, %L1\;  v_accvgpr_mov_b32\t%H0, %H1\;  v_accvgpr_mov_b32\t%J0, %J1\;  v_accvgpr_mov_b32\t%K0, %K1
+  [v ,Rf ;flat ,12,*    ] flat_load_dwordx4\t%0, %A1%O1\;s_waitcnt\t0
+  [Rf,v  ;flat ,12,*    ] flat_store_dwordx4\t%A0, %1%O0
+  [v ,Rm ;flat ,12,*    ] global_load_dwordx4\t%0, %A1%O1\;s_waitcnt\tvmcnt(0)
+  [Rm,v  ;flat ,12,*    ] global_store_dwordx4\t%A0, %1%O0
   })
 
 (define_insn "mov<mode>_exec"
@@ -525,11 +549,15 @@ (define_insn "mov<mode>_exec"
    (clobber (match_scratch:<VnDI> 4))]
   "!MEM_P (operands[0]) || REG_P (operands[1])"
   {@ [cons: =0, 1, 2, 3, =4; attrs: type, length]
-  [v,vDB,U0  ,e ,X ;vmult,32] v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\;v_mov_b32\t%J0, %J1\;v_mov_b32\t%K0, %K1
-  [v,v0 ,vDA0,cV,X ;vmult,32] v_cndmask_b32\t%L0, %L2, %L1, vcc\;v_cndmask_b32\t%H0, %H2, %H1, vcc\;v_cndmask_b32\t%J0, %J2, %J1, vcc\;v_cndmask_b32\t%K0, %K2, %K1, vcc
-  [v,v0 ,vDA0,Sv,X ;vmult,32] v_cndmask_b32\t%L0, %L2, %L1, %3\;v_cndmask_b32\t%H0, %H2, %H1, %3\;v_cndmask_b32\t%J0, %J2, %J1, %3\;v_cndmask_b32\t%K0, %K2, %K1, %3
-  [v,m  ,U0  ,e ,&v;*    ,32] #
-  [m,v  ,U0  ,e ,&v;*    ,32] #
+  [v ,vDB,U0  ,e ,X ;vmult,32] v_mov_b32\t%L0, %L1\;v_mov_b32\t%H0, %H1\;v_mov_b32\t%J0, %J1\;v_mov_b32\t%K0, %K1
+  [v ,v0 ,vDA0,cV,X ;vmult,32] v_cndmask_b32\t%L0, %L2, %L1, vcc\;v_cndmask_b32\t%H0, %H2, %H1, vcc\;v_cndmask_b32\t%J0, %J2, %J1, vcc\;v_cndmask_b32\t%K0, %K2, %K1, vcc
+  [v ,v0 ,vDA0,Sv,X ;vmult,32] v_cndmask_b32\t%L0, %L2, %L1, %3\;v_cndmask_b32\t%H0, %H2, %H1, %3\;v_cndmask_b32\t%J0, %J2, %J1, %3\;v_cndmask_b32\t%K0, %K2, %K1, %3
+  [v ,Rf ,U0  ,e ,X ;flat ,4 ] flat_load_dwordx4\t%0, %A1%O1\;s_waitcnt\t0
+  [Rf,v  ,U0  ,e ,X ;flat ,4 ] flat_store_dwordx4\t%A0, %1%O0
+  [v ,Rm ,U0  ,e ,X ;flat ,4 ] global_load_dwordx4\t%0, %A1%O1\;s_waitcnt\tvmcnt(0)
+  [Rm,v  ,U0  ,e ,X ;flat ,4 ] global_store_dwordx4\t%A0, %1%O0
+  [v ,m  ,U0  ,e ,&v;*    ,32] #
+  [m ,v  ,U0  ,e ,&v;*    ,32] #
   })
 
 ; A SGPR-base load looks like:
@@ -601,9 +629,13 @@ (define_split
 	(unspec:BLK [(match_dup 5) (match_dup 1) (match_dup 6) (match_dup 7)]
 		    UNSPEC_SCATTER))]
   {
-    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode, NULL,
-						       operands[0],
-						       operands[2]);
+    /* It can happen that precalculated V64DI address vectors come this
+       way, so skip expansion in that case.  */
+    operands[5] = (VECTOR_MODE_P (GET_MODE (XEXP (operands[0], 0)))
+		   ? XEXP (operands[0], 0)
+		   : gcn_expand_scalar_to_vector_address (<MODE>mode, NULL,
+							  operands[0],
+							  operands[2]));
     operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[0]));
     operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[0]));
   })
@@ -621,10 +653,14 @@ (define_split
 		     (match_dup 6) (match_dup 7) (match_dup 3)]
 		    UNSPEC_SCATTER))]
   {
-    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode,
-						       operands[3],
-						       operands[0],
-						       operands[4]);
+    /* It can happen that precalculated V64DI address vectors come this
+       way, so skip expansion in that case.  */
+    operands[5] = (VECTOR_MODE_P (GET_MODE (XEXP (operands[0], 0)))
+		   ? XEXP (operands[0], 0)
+		   : gcn_expand_scalar_to_vector_address (<MODE>mode,
+							  operands[3],
+							  operands[0],
+							  operands[4]));
     operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[0]));
     operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[0]));
   })
@@ -641,9 +677,13 @@ (define_split
 		       (mem:BLK (scratch))]
 		      UNSPEC_GATHER))]
   {
-    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode, NULL,
-						       operands[1],
-						       operands[2]);
+    /* It can happen that precalculated V64DI address vectors come this
+       way, so skip expansion in that case.  */
+    operands[5] = (VECTOR_MODE_P (GET_MODE (XEXP (operands[1], 0)))
+		   ? XEXP (operands[1], 0)
+		   : gcn_expand_scalar_to_vector_address (<MODE>mode, NULL,
+							  operands[1],
+							  operands[2]));
     operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[1]));
     operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));
   })
@@ -664,10 +704,14 @@ (define_split
 	  (match_dup 2)
 	  (match_dup 3)))]
   {
-    operands[5] = gcn_expand_scalar_to_vector_address (<MODE>mode,
-						       operands[3],
-						       operands[1],
-						       operands[4]);
+    /* It can happen that precalculated V64DI address vectors come this
+       way, so skip expansion in that case.  */
+    operands[5] = (VECTOR_MODE_P (GET_MODE (XEXP (operands[1], 0)))
+		   ? XEXP (operands[1], 0)
+		   : gcn_expand_scalar_to_vector_address (<MODE>mode,
+							  operands[3],
+							  operands[1],
+							  operands[4]));
     operands[6] = gen_rtx_CONST_INT (VOIDmode, MEM_ADDR_SPACE (operands[1]));
     operands[7] = gen_rtx_CONST_INT (VOIDmode, MEM_VOLATILE_P (operands[1]));
   })
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 3822febc721..f616b3ed13b 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -1449,6 +1449,8 @@ gcn_stepped_zero_int_parallel_p (rtx op, int step)
 /* }}}  */
 /* {{{ Addresses, pointers and moves.  */
 
+static bool gcn_vec_address_register_p (rtx, machine_mode, bool);
+
 /* Return true is REG is a valid place to store a pointer,
    for instructions that require an SGPR.
    FIXME rename. */
@@ -1462,6 +1464,10 @@ gcn_address_register_p (rtx reg, machine_mode mode, bool strict)
   if (!REG_P (reg))
     return false;
 
+  /* If the address is a vector we'll want it in a vector register.  */
+  if (VECTOR_MODE_P (GET_MODE (reg)))
+    return gcn_vec_address_register_p (reg, mode, strict);
+
   if (GET_MODE (reg) != mode)
     return false;
 
@@ -1494,7 +1500,11 @@ gcn_vec_address_register_p (rtx reg, machine_mode mode, bool strict)
   if (!REG_P (reg))
     return false;
 
-  if (GET_MODE (reg) != mode)
+  /* Vector addresses are allowed, but MODE is always specified scalar.  */
+  machine_mode scalar_reg_mode = (VECTOR_MODE_P (GET_MODE (reg))
+				  ? GET_MODE_INNER (GET_MODE (reg))
+				  : GET_MODE (reg));
+  if (scalar_reg_mode != mode)
     return false;
 
   int regno = REGNO (reg);
@@ -1620,6 +1630,30 @@ gcn_global_address_p (rtx addr)
 	  && immediate_p)
 	/* (SGPR + VGPR) + CONST  */
 	return true;
+
+      bool vec_immediate_p =
+	(GET_CODE (offset) == CONST_VECTOR
+	 && gcn_constant_p (offset)
+	 /* Signed 12/13-bit immediate.  */
+	 && INTVAL (CONST_VECTOR_ELT (offset, 0)) >= -(1 << offsetbits)
+	 && INTVAL (CONST_VECTOR_ELT (offset, 0)) < (1 << offsetbits)
+	 /* The low bits of the offset are ignored, even
+	    when they're meant to realign the pointer.  */
+	 && !(INTVAL (CONST_VECTOR_ELT (offset, 0)) & 0x3));
+
+      if (gcn_vec_address_register_p (base, DImode, false)
+	  && vec_immediate_p)
+	/* VGPR + CONST  */
+	return true;
+
+      if (GET_CODE (base) == PLUS
+	  && GET_CODE (XEXP (base, 0)) == VEC_DUPLICATE
+	  && gcn_address_register_p (XEXP (XEXP (base, 0), 0),
+				     DImode, false)
+	  && gcn_vgpr_register_operand (XEXP (base, 1), V64SImode)
+	  && vec_immediate_p)
+	/* (SGPR + VGPR) + CONST  */
+	return true;
     }
 
   return false;
@@ -1758,6 +1792,30 @@ gcn_addr_space_legitimate_address_p (machine_mode mode, rtx x, bool strict,
 		  && immediate_p)
 		/* SGPR + CONST  */
 		return true;
+
+	      bool vec_immediate_p =
+		(GET_CODE (offset) == CONST_VECTOR
+		 && gcn_constant_p (offset)
+		 /* Signed 12/13-bit immediate.  */
+		 && INTVAL (CONST_VECTOR_ELT (offset, 0)) >= -(1 << offsetbits)
+		 && INTVAL (CONST_VECTOR_ELT (offset, 0)) < (1 << offsetbits)
+		 /* The low bits of the offset are ignored, even
+		    when they're meant to realign the pointer.  */
+		 && !(INTVAL (CONST_VECTOR_ELT (offset, 0)) & 0x3));
+
+	      if (gcn_vec_address_register_p (base, DImode, strict)
+		  && vec_immediate_p)
+		/* VGPR + CONST  */
+		return true;
+
+	      if (GET_CODE (base) == PLUS
+		  && GET_CODE (XEXP (base, 0)) == VEC_DUPLICATE
+		  && gcn_address_register_p (XEXP (XEXP (base, 0), 0),
+					     DImode, strict)
+		  && gcn_vgpr_register_operand (XEXP (base, 1), V64SImode)
+		  && vec_immediate_p)
+		/* (SGPR + VGPR) + CONST  */
+		return true;
 	    }
 	}
       else
@@ -2448,8 +2506,9 @@ gcn_secondary_reload (bool in_p, rtx x, reg_class_t rclass,
 	case ADDR_SPACE_FLAT:
 	case ADDR_SPACE_FLAT_SCRATCH:
 	case ADDR_SPACE_GLOBAL:
-	  if (GET_MODE_CLASS (reload_mode) == MODE_VECTOR_INT
-	      || GET_MODE_CLASS (reload_mode) == MODE_VECTOR_FLOAT)
+	  if ((GET_MODE_CLASS (reload_mode) == MODE_VECTOR_INT
+	       || GET_MODE_CLASS (reload_mode) == MODE_VECTOR_FLOAT)
+	      && !(MEM_P (x) && VECTOR_MODE_P (GET_MODE (XEXP (x, 0)))))
 	    {
 	      sri->icode = code_for_mov_sgprbase (reload_mode);
 	      break;
@@ -7217,9 +7276,13 @@ print_operand_address (FILE *file, rtx mem)
 
 	  if (GET_CODE (base) == PLUS)
 	    {
-	      /* (SGPR + VGPR) + CONST  */
+	      /* (SGPR + VGPR) + CONST
+		 Note: the offset is printed by %O.  */
 	      vgpr_offset = XEXP (base, 1);
 	      base = XEXP (base, 0);
+
+	      if (GET_CODE (base) == VEC_DUPLICATE)
+		base = XEXP (base, 0);
 	    }
 	  else
 	    {
@@ -7255,8 +7318,6 @@ print_operand_address (FILE *file, rtx mem)
 		output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
 	    }
 	}
-      else
-	output_operand_lossage ("bad ADDR_SPACE_GLOBAL address");
     }
   else if (AS_ANY_DS_P (as))
     switch (GET_CODE (addr))
@@ -7586,12 +7647,19 @@ print_operand (FILE *file, rtx x, int code)
 		base = XEXP (x0, 0);
 
 		if (GET_CODE (base) == PLUS)
-		  /* (SGPR + VGPR) + CONST  */
-		  /* Ignore the VGPR offset for this operand.  */
-		  base = XEXP (base, 0);
+		  {
+		    /* (SGPR + VGPR) + CONST  */
+		    /* Ignore the VGPR offset for this operand.  */
+		    base = XEXP (base, 0);
+
+		    if (GET_CODE (base) == VEC_DUPLICATE)
+		      base = XEXP (base, 0);
+		  }
 
+		if (CONST_VECTOR_P (offset))
+		  offset = CONST_VECTOR_ELT (offset, 0);
 		if (CONST_INT_P (offset))
-		  const_offset = XEXP (x0, 1);
+		  const_offset = offset;
 		else if (REG_P (offset))
 		  /* SGPR + VGPR  */
 		  /* Ignore the VGPR offset for this operand.  */
-- 
2.52.0

From 5f759f0214e4267a9fe9b93ed8da8978d90f02f8 Mon Sep 17 00:00:00 2001
From: Andrew Stubbs <[email protected]>
Date: Tue, 17 Mar 2026 10:51:57 +0000
Subject: [PATCH 3/3] amdgcn: Add vector atomics

This prototype patch converts the existing scalar atomic instructions to
allow vectors, meaning we can now do 64 atomic operations on 64 different
addresses, simultaneously in one instruction.

Without "(mem (reg:<vectype>))" support, this would have had to be completely
rewritten using unspecs, leading to code duplication and additional maintenance
burden.
---
 gcc/config/gcn/gcn.md | 82 +++++++++++++++++++++++--------------------
 1 file changed, 43 insertions(+), 39 deletions(-)

diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
index 5957b29f748..137d4647f0b 100644
--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -1922,7 +1922,11 @@ (define_expand "<expander>ti3"
 ; the programmer to get this right.
 
 (define_code_iterator atomicops [plus minus and ior xor])
-(define_mode_attr X [(SI "") (DI "_X2")])
+(define_mode_attr X [(SI "") (V64SI "")
+		     (DI "_X2") (V64DI "_X2")])
+
+(define_mode_iterator ATOMICMODE
+		      [SI DI V64SI V64DI])
 
 ;; TODO compare_and_swap test_and_set inc dec
 ;; Hardware also supports min and max, but GCC does not.
@@ -1964,13 +1968,13 @@ (define_insn "*memory_barrier"
 ; reliably - they can cause hangs or incorrect results.
 ; TODO: flush caches according to memory model
 (define_insn "atomic_fetch_<bare_mnemonic><mode>"
-  [(set (match_operand:SIDI 0 "register_operand"     "=Sm, v, v")
-	(match_operand:SIDI 1 "memory_operand"	     "+RS,RF,RM"))
+  [(set (match_operand:ATOMICMODE 0 "register_operand"     "=Sm, v, v")
+	(match_operand:ATOMICMODE 1 "memory_operand"	   "+RS,RF,RM"))
    (set (match_dup 1)
-	(unspec_volatile:SIDI
-	  [(atomicops:SIDI
+	(unspec_volatile:ATOMICMODE
+	  [(atomicops:ATOMICMODE
 	    (match_dup 1)
-	    (match_operand:SIDI 2 "register_operand" " Sm, v, v"))]
+	    (match_operand:ATOMICMODE 2 "register_operand" " Sm, v, v"))]
 	   UNSPECV_ATOMIC))
    (use (match_operand 3 "const_int_operand"))]
   "0 /* Disabled.  */"
@@ -1988,11 +1992,11 @@ (define_insn "atomic_fetch_<bare_mnemonic><mode>"
 ; you might expect from a concurrent non-atomic read-modify-write.
 ; TODO: flush caches according to memory model
 (define_insn "atomic_<bare_mnemonic><mode>"
-  [(set (match_operand:SIDI 0 "memory_operand"       "+RS,RF,RM")
-	(unspec_volatile:SIDI
-	  [(atomicops:SIDI
+  [(set (match_operand:ATOMICMODE 0 "memory_operand"       "+RS,RF,RM")
+	(unspec_volatile:ATOMICMODE
+	  [(atomicops:ATOMICMODE
 	    (match_dup 0)
-	    (match_operand:SIDI 1 "register_operand" " Sm, v, v"))]
+	    (match_operand:ATOMICMODE 1 "register_operand" " Sm, v, v"))]
 	  UNSPECV_ATOMIC))
    (use (match_operand 2 "const_int_operand"))]
   "0 /* Disabled.  */"
@@ -2004,15 +2008,15 @@ (define_insn "atomic_<bare_mnemonic><mode>"
    (set_attr "flatmemaccess" "*,atomicwait,atomicwait")
    (set_attr "length" "12")])
 
-(define_mode_attr x2 [(SI "DI") (DI "TI")])
-(define_mode_attr size [(SI "4") (DI "8")])
-(define_mode_attr bitsize [(SI "32") (DI "64")])
+(define_mode_attr x2 [(SI "DI") (DI "TI") (V64SI "V64DI") (V64DI "V64TI")])
+(define_mode_attr size [(SI "4") (DI "8") (V64SI "4") (V64DI "8")])
+(define_mode_attr bitsize [(SI "32") (DI "64") (V64SI "32") (V64DI "64")])
 
 (define_expand "sync_compare_and_swap<mode>"
-  [(match_operand:SIDI 0 "register_operand")
-   (match_operand:SIDI 1 "memory_operand")
-   (match_operand:SIDI 2 "register_operand")
-   (match_operand:SIDI 3 "register_operand")]
+  [(match_operand:ATOMICMODE 0 "register_operand")
+   (match_operand:ATOMICMODE 1 "memory_operand")
+   (match_operand:ATOMICMODE 2 "register_operand")
+   (match_operand:ATOMICMODE 3 "register_operand")]
   ""
   {
     if (MEM_ADDR_SPACE (operands[1]) == ADDR_SPACE_LDS)
@@ -2036,11 +2040,11 @@ (define_expand "sync_compare_and_swap<mode>"
   })
 
 (define_insn "sync_compare_and_swap<mode>_insn"
-  [(set (match_operand:SIDI 0 "register_operand"    "=Sm, v, v")
-	(match_operand:SIDI 1 "memory_operand"      "+RS,RF,RM"))
+  [(set (match_operand:ATOMICMODE 0 "register_operand" "=Sm,   v,   v")
+	(match_operand:ATOMICMODE 1 "memory_operand"   "+RS,RfRF,RmRM"))
    (set (match_dup 1)
-	(unspec_volatile:SIDI
-	  [(match_operand:<x2> 2 "register_operand" " Sm, v, v")]
+	(unspec_volatile:ATOMICMODE
+	  [(match_operand:<x2> 2 "register_operand"    " Sm,   v,   v")]
 	  UNSPECV_ATOMIC))]
   ""
   "@
@@ -2052,14 +2056,14 @@ (define_insn "sync_compare_and_swap<mode>_insn"
    (set_attr "flatmemaccess" "*,cmpswapx2,cmpswapx2")])
 
 (define_insn "sync_compare_and_swap<mode>_lds_insn"
-  [(set (match_operand:SIDI 0 "register_operand"    "= v")
-	(unspec_volatile:SIDI
-	  [(match_operand:SIDI 1 "memory_operand"   "+RL")]
+  [(set (match_operand:ATOMICMODE 0 "register_operand"    "= v")
+	(unspec_volatile:ATOMICMODE
+	  [(match_operand:ATOMICMODE 1 "memory_operand"   "+RL")]
 	  UNSPECV_ATOMIC))
    (set (match_dup 1)
-	(unspec_volatile:SIDI
-	  [(match_operand:SIDI 2 "register_operand" "  v")
-	   (match_operand:SIDI 3 "register_operand" "  v")]
+	(unspec_volatile:ATOMICMODE
+	  [(match_operand:ATOMICMODE 2 "register_operand" "  v")
+	   (match_operand:ATOMICMODE 3 "register_operand" "  v")]
 	  UNSPECV_ATOMIC))]
   ""
   {
@@ -2072,11 +2076,11 @@ (define_insn "sync_compare_and_swap<mode>_lds_insn"
    (set_attr "length" "12")])
 
 (define_insn "atomic_load<mode>"
-  [(set (match_operand:SIDI 0 "register_operand"  "=Sm, v, v")
-	(unspec_volatile:SIDI
-	  [(match_operand:SIDI 1 "memory_operand" " RS,RF,RM")]
+  [(set (match_operand:ATOMICMODE 0 "register_operand"  "=Sm,   v,   v")
+	(unspec_volatile:ATOMICMODE
+	  [(match_operand:ATOMICMODE 1 "memory_operand" " RS,RfRF,RmRM")]
 	  UNSPECV_ATOMIC))
-   (use (match_operand:SIDI 2 "immediate_operand" "  i, i, i"))]
+   (use (match_operand:SI 2 "immediate_operand" "  i,   i,   i"))]
   ""
   {
     /* FIXME: RDNA cache instructions may be too conservative?  */
@@ -2174,11 +2178,11 @@ (define_insn "atomic_load<mode>"
    (set_attr "rdna" "no,*,*")])
 
 (define_insn "atomic_store<mode>"
-  [(set (match_operand:SIDI 0 "memory_operand"      "=RS,RF,RM")
-	(unspec_volatile:SIDI
-	  [(match_operand:SIDI 1 "register_operand" " Sm, v, v")]
+  [(set (match_operand:ATOMICMODE 0 "memory_operand"      "=RS,RfRF,RmRM")
+	(unspec_volatile:ATOMICMODE
+	  [(match_operand:ATOMICMODE 1 "register_operand" " Sm,   v,   v")]
 	  UNSPECV_ATOMIC))
-  (use (match_operand:SIDI 2 "immediate_operand"    "  i, i, i"))]
+  (use (match_operand:SI 2 "immediate_operand"    "  i,   i,   i"))]
   ""
   {
     switch (INTVAL (operands[2]))
@@ -2261,11 +2265,11 @@ (define_insn "atomic_store<mode>"
    (set_attr "rdna" "no,*,*")])
 
 (define_insn "atomic_exchange<mode>"
-  [(set (match_operand:SIDI 0 "register_operand"    "=Sm, v, v")
-        (match_operand:SIDI 1 "memory_operand"	    "+RS,RF,RM"))
+  [(set (match_operand:ATOMICMODE 0 "register_operand"    "=Sm,   v,   v")
+        (match_operand:ATOMICMODE 1 "memory_operand"	  "+RS,RfRF,RmRM"))
    (set (match_dup 1)
-	(unspec_volatile:SIDI
-	  [(match_operand:SIDI 2 "register_operand" " Sm, v, v")]
+	(unspec_volatile:ATOMICMODE
+	  [(match_operand:ATOMICMODE 2 "register_operand" " Sm,   v,   v")]
 	  UNSPECV_ATOMIC))
    (use (match_operand 3 "immediate_operand"))]
   ""
-- 
2.52.0
