[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-22 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

>  (that means addrspacecast 7 -> 8 is not invertible by 8 -> 7, right? it would 
> discard some bits, in an invisible-breakage sort of way? is there an RFC for 
> that design?)

I'm not aware of anything requiring addrspacecast to be invertible? (In 
specific, cast 7 -> 8 isn't a thing at the moment)

https://github.com/llvm/llvm-project/pull/137425
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-22 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

(You'll note that in 
https://github.com/llvm/llvm-project/pull/137425/files#diff-f904f8cd236733212015dd1988ffefcc9f79f7484ee46e3e3833d2d75fa69542R2243
 , this intrinsic gets lowered to `raw_ptr_buffer_load_lds` by "pulling apart" 
the ptr addrspace(7) - that `raw_ptr_buffer_load_lds` intrinsic is the direct 
equivalent to the v4i32 system)

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-22 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

@JonChesterfield This builtin, semantically, cannot accommodate the v4i32 usage.

When you have a v4i32, you need to also specify, as an additional argument, the 
`voffset` that gets used to index into that v4i32. This builtin doesn't have 
room for that, because it takes either a global pointer (which doesn't have a 
notion of the offset) or a buffer fat pointer (which has the offset stored in 
the low bits of the pointer and is, in some sense, a v5i32).
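As a rough IR sketch of that distinction (operand names are illustrative; the 
signatures follow the existing `raw.ptr.buffer.load.lds` intrinsic and the 
parameter list declared in this patch):

```llvm
; v4i32/addrspace(8)-style: the resource and the offset travel separately,
; so the intrinsic carries an explicit voffset operand.
call void @llvm.amdgcn.raw.ptr.buffer.load.lds(
    ptr addrspace(8) %rsrc, ptr addrspace(3) %lds,
    i32 4, i32 %voffset, i32 %soffset, i32 0, i32 0)

; Buffer fat pointer: the offset is carried in the low bits of the 160-bit
; addrspace(7) pointer itself, so the new intrinsic has no voffset slot.
call void @llvm.amdgcn.load.to.lds.p7(
    ptr addrspace(7) %fat, ptr addrspace(3) %lds,
    i32 4, i32 0, i32 0)
```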

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-22 Thread Jon Chesterfield via cfe-commits

JonChesterfield wrote:

I think we could do with an additional overload here.

Currently a bunch of code (notably CK but probably elsewhere) uses the v4i32 
version of the LDS intrinsics. I think this patch lets one use the addrspace(7) 
pointer of 128 bits alternative. So callers could transform the v4i32 into an 
addrspace(7) and then call this.

It's not very clear from the backend docs how this stuff is supposed to be 
wired up by the user. Possibly bitcast from the v4i32 into an addrspace(8) 
annotated i128, and then addrspacecast to 7 to provide an extra 32 bits of 
zero, and then onward to this builtin? Whatever the proper sequence might be, 
adding an overload which takes a v4i32 and does the conversion is likely to 
improve adoption for the new builtin.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-20 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

> I don't think we need to worry about compatibility with an intrinsic that's 
> been committed for a day

`global.load.lds` and `buffer[.ptr].load.lds` have been around for quite a 
while though, and this is just an abstraction over them

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-19 Thread Matt Arsenault via cfe-commits

arsenm wrote:

> If you want to add new intrinsics that don't have the offset and that 
> pattern-match instead, I'd be more than happy to review that - or if we want 
> to break back-compat by getting rid of the offset.

I don't think we need to worry about compatibility with an intrinsic that's 
been committed for a day 


https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-19 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

@arsenm You're right that it might be better to emit the offset, but all the 
existing intrinsics that I'm abstracting over _do_ have such a field.

If you want to add new intrinsics that don't have the offset and that 
pattern-match instead, I'd be more than happy to review that - or if we want to 
break back-compat by getting rid of the offset.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-19 Thread Matt Arsenault via cfe-commits

arsenm wrote:

I think we'd be better off omitting the offset from the intrinsic signature 

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-19 Thread Krzysztof Drewniak via cfe-commits

https://github.com/krzysz00 closed 
https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-13 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

Ping

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-09 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

Ping

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

krzysz00 wrote:

Oh, thanks for finding the context! `git blame` failed me. So ... we're having 
the discussion from that thread again, and therefore I'd like to appeal to 
precedent in the short term (regarding the immoffset parameter) in the 
interests of making some sort of progress.

If we ever fix the immoffset issue, upgrading by making the immoffset a 
constant 0 and adding it to both pointers should be fine? But that'd require a 
sufficiently robust pattern match, which I'm not sure we're convinced we have.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Shilei Tian via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

shiltian wrote:

Right, it was added in https://reviews.llvm.org/D125279.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

Re discussion on the other PR about "why is this even an intrinsic" - since 
this probably shouldn't just be in @jayfoad's DMs:

The reason I disagree with "just pattern-match it" is that you can't get the 
scheduling you want without the guarantee of the intrinsic.
 
Namely, while 
```
global_load_b32 v1, v0
ds_write_addtid_b32 v1, s0
```
is obviously
```
s_mov_b32 m0, s0
global_load_lds_b32 v0
```
if we turn that first example into
``` 
pipelined_loop: {
  global_load_b32 v2, v0
  ...
  waitcnt(lds only) + barrier
  ds_read v*, ...
  mfmas(v)
  waitcnt(lds)+s_barrier
  waitcnt(vmem) ;; and not substantially earlier please
  ds_write_addtid_b32 v2, s0
  jle pipelined_loop
}
```
for example, we really don't want that match firing, because LDS gets overwritten.
 
... *unless* we're double-buffering into LDS and so trying to do
```
pipelined_lds: {
  waitcnt(vmem,lds)+barrier
  load_lds(global1(iv), lds2)
  do_compute(lds1)
  waitcnt(vmem,lds)+barrier
  load_lds(global2(iv), lds1)
  do_compute(lds2) ;; We'd better not be waiting on LDS1 to settle at/before 
here
  iv += 2
}
```
where, if the pattern match for the addtid load fails, say by waitcnt 
insertion, that'll cause problems for the program.
 
Not to mention, because we don't have an intrinsic for ds_addtid, and because 
there are a *lot* of ways to spell the lane ID (mbcnt, workitem.id.x with 
annotations, a bunch of workitem IDs mod 64, etc etc), that'll be quite fragile
 
So in the context of GEMM stuff, I'd rather not leave this at "hope the compiler 
recognizes what we're trying to do". If the compiler can be made to recognize 
what we're trying to do reliably in the future, that'll be cool, but I can't be 
the one to write that patch and I don't think there's infinite bandwidth among 
the AMDGPU crowd for this improvement either.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits


@@ -564,6 +564,11 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
     return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+    // Should this have asan instrumentation?

krzysz00 wrote:

It might be a backend problem

However, 
https://github.com/llvm/llvm-project/blob/a9ce60ee885e25cba54c8900605563ff66e067b7/clang/lib/CodeGen/CGBuiltin.cpp#L4142
 is the handling for memcpy(), and doesn't seem to rely on the backend handling 
it. I don't know if this same handling should be happening for the intrinsic

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

krzysz00 wrote:

*big shrug* 

Is this something y'all want to try and fix here?

(Also, procedural history on the buffer intrinsics points me to 
https://reviews.llvm.org/D124884 ( @rampitec ), which leaves it rather unclear 
why the immarg was added ... I think it's because for LDS, unlike the other 
buffer instructions, you can't do voffset => (actual voffset + imm))

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Matt Arsenault via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

arsenm wrote:

The global one definitely shouldn't have the offset (given it's there, we 
should be trying to do addressing mode folding into it)

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

krzysz00 wrote:

So @shiltian, that's there for reasons I may not be aware of.

(The buffer intrinsic's `soffset` bit is probably a case where there may have 
been incorrect bounds checking at some point, and is unrelated, I think)

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

krzysz00 wrote:

This same "immediate offset" argument landed without comment back in 
https://github.com/llvm/llvm-project/pull/92962 - this intrinsic is 
API-compatible with that

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Matt Arsenault via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

arsenm wrote:

Lack of pattern matching isn't a reason to have the offset. There should be 
offset pattern matching regardless. I thought the argument for the current 
buffer intrinsic offset argument was something about the unreasonable bounds 
checking behaviors 

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Jay Foad via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

jayfoad wrote:

???

Matching addressing modes is part of the compiler's job. If you want absolute 
control over what goes in the immediate offset field you can write assembler!

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

krzysz00 wrote:

... Oh. Consider the case where you have global `p + N` and LDS `q + N`. Then the 
LDS combiner can rewrite this to `(q' + O) + N`, aka `q' + (O + N)`.

Then the two pointers won't have the same offset anymore, and so it's unclear if 
you can slide it onto the instruction immediate.
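A minimal IR sketch of that concern (values and constants are hypothetical, 
using the intrinsic signature from this patch):

```llvm
; Both sides share the immediate 16, so folding it into the instruction's
; single offset field is plausible:
call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %p,
    ptr addrspace(3) %q, i32 4, i32 16, i32 0)

; But if the LDS combiner rewrites %q to %q.base + 32, the LDS address becomes
; %q.base + (32 + 16) while the global address is still %p + 16; the two sides
; no longer share an immediate, so one hardware offset field can't express both.
```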

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits

https://github.com/krzysz00 updated 
https://github.com/llvm/llvm-project/pull/137425

>From bcb72e3d8cb2dcdb97199d32797306c5807c8442 Mon Sep 17 00:00:00 2001
From: Krzysztof Drewniak 
Date: Sat, 26 Apr 2025 00:20:22 +
Subject: [PATCH 1/4] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic

This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads
to LDS from global (address space 1) pointers and buffer fat
pointers (address space 7), since they use the same API and "gather
from a pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and
7, and updates the MLIR wrappers to use it (loosening up the
restrictions on loads to LDS along the way to match the ground truth
from target features).

It also plumbs the intrinsic through to clang.
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   1 +
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   |   4 +
 clang/lib/Sema/SemaAMDGPU.cpp |   1 +
 .../CodeGenOpenCL/builtins-amdgcn-gfx950.cl   |  30 +++
 .../builtins-amdgcn-load-to-lds.cl|  60 +
 llvm/docs/ReleaseNotes.md |   8 +
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td  |  21 ++
 .../AMDGPU/AMDGPUInstructionSelector.cpp  |   5 +
 .../AMDGPU/AMDGPULowerBufferFatPointers.cpp   |  20 ++
 .../Target/AMDGPU/AMDGPURegisterBankInfo.cpp  |   2 +
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp |   8 +-
 .../AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll  |  75 ++
 .../CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll | 220 ++
 .../lower-buffer-fat-pointers-mem-transfer.ll |  18 ++
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td |  12 +-
 mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td  |  35 ++-
 .../AMDGPUToROCDL/AMDGPUToROCDL.cpp   |  15 +-
 mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp  |  21 +-
 .../Conversion/AMDGPUToROCDL/load_lds.mlir|  67 --
 mlir/test/Dialect/LLVMIR/rocdl.mlir   |  17 +-
 mlir/test/Target/LLVMIR/rocdl.mlir|  11 +-
 21 files changed, 598 insertions(+), 53 deletions(-)
 create mode 100644 clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 39fef9e4601f8..730fd15913c11 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", "vmem-to-lds-load-insts")
 
 //===----------------------------------------------------------------------===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index ad012d98635ff..a32ef1c2a5a12 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
     return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+    return emitBuiltinWithOneOverloadedType<5>(*this, E,
+                                               Intrinsic::amdgcn_load_to_lds);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
     Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
                                    {llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index a6366aceec2a6..e6414a623b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
 
   switch (BuiltinID) {
   case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds:
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds:
   case AMDGPU::BI__builtin_amdgcn_global_load_lds: {
     constexpr const int SizeIdx = 2;
     llvm::APSInt Size;
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
index 8251d6c213e3d..4b73347ac8155 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
@@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, uint seed)
   *out = __built

[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Krzysztof Drewniak via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

krzysz00 wrote:

I'm sure the existing intrinsics expose it for a reason - probably because 
there isn't pattern-matching to strip such an offset

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Matt Arsenault via cfe-commits


@@ -0,0 +1,48 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 5
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx950 -emit-llvm -fcuda-is-device -o - %s | FileCheck %s
+
+// COM: Most tests are in the OpenCL semantics, this is just a verification for HIP
+
+#define __device__ __attribute__((device))
+#define __shared__ __attribute__((shared))
+
+typedef unsigned int u32;
+
+// CHECK-LABEL: define dso_local void @_Z20test_load_to_lds_u32PjS_(
+// CHECK-SAME: ptr noundef [[SRC:%.*]], ptr noundef [[DST:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[SRC_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// CHECK-NEXT:    [[DST_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// CHECK-NEXT:    [[SRC_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SRC_ADDR]] to ptr
+// CHECK-NEXT:    [[DST_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DST_ADDR]] to ptr
+// CHECK-NEXT:    store ptr [[SRC]], ptr [[SRC_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    store ptr [[DST]], ptr [[DST_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[SRC_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[DST_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = addrspacecast ptr [[TMP1]] to ptr addrspace(3)
+// CHECK-NEXT:    call void @llvm.amdgcn.load.to.lds.p0(ptr [[TMP0]], ptr addrspace(3) [[TMP2]], i32 4, i32 0, i32 0)
+// CHECK-NEXT:    ret void
+//
+__device__ void test_load_to_lds_u32(u32* src, __shared__ u32 *dst) {
+  __builtin_amdgcn_load_to_lds(src, dst, /*size=*/4, /*offset=*/0, /*aux=*/0);
+}
+
+// CHECK-LABEL: define dso_local void @_Z20test_load_to_lds_128PvS_(
+// CHECK-SAME: ptr noundef [[SRC:%.*]], ptr noundef [[DST:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[SRC_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// CHECK-NEXT:    [[DST_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// CHECK-NEXT:    [[SRC_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SRC_ADDR]] to ptr
+// CHECK-NEXT:    [[DST_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DST_ADDR]] to ptr
+// CHECK-NEXT:    store ptr [[SRC]], ptr [[SRC_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    store ptr [[DST]], ptr [[DST_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[SRC_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[DST_ADDR_ASCAST]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = addrspacecast ptr [[TMP1]] to ptr addrspace(3)
+// CHECK-NEXT:    call void @llvm.amdgcn.load.to.lds.p0(ptr [[TMP0]], ptr addrspace(3) [[TMP2]], i32 16, i32 0, i32 0)
+// CHECK-NEXT:    ret void
+//
+__device__ void test_load_to_lds_128(void* src, __shared__ void *dst) {
+  __builtin_amdgcn_load_to_lds(src, dst, /*size=*/16, /*offset=*/0, /*aux=*/0);
+}

arsenm wrote:

Also test where dst isn't qualified with __shared__ 

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-02 Thread Jay Foad via cfe-commits


@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//
 
+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,         // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>, // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,            // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,            // imm offset (applied to both input and LDS address)

jayfoad wrote:

What's the point of the imm offset argument? It's not semantically useful, 
right? You could just add this offset to the two pointer arguments before 
calling the intrinsic.
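That rewrite, as an IR sketch (a hypothetical transformation, using the 
signature from this patch):

```llvm
; Form with the immediate operand:
call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %src,
    ptr addrspace(3) %lds, i32 4, i32 16, i32 0)

; Semantically equivalent form with the offset pre-added to both pointers and
; the immediate left at zero:
%src.16 = getelementptr i8, ptr addrspace(1) %src, i32 16
%lds.16 = getelementptr i8, ptr addrspace(3) %lds, i32 16
call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %src.16,
    ptr addrspace(3) %lds.16, i32 4, i32 0, i32 0)
```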

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-01 Thread Krzysztof Drewniak via cfe-commits

https://github.com/krzysz00 updated 
https://github.com/llvm/llvm-project/pull/137425



  





[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-01 Thread Matt Arsenault via cfe-commits


@@ -0,0 +1,60 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
+// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown 
-target-cpu gfx900 -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown 
-target-cpu gfx942 -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown 
-target-cpu gfx1010 -emit-llvm -o - %s | FileCheck %s
+// REQUIRES: amdgpu-registered-target
+

arsenm wrote:

Should also add a HIP codegen test, builtins with pointer arguments sometimes 
interact poorly with the C++ variants that pretend address spaces don't exist 

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-05-01 Thread Matt Arsenault via cfe-commits


@@ -564,6 +564,11 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
 llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
 return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+// Should this have asan instrumentation?

arsenm wrote:

Repeating comment from the PR that duplicated this one, yes but that's a 
backend problem?

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-30 Thread Krzysztof Drewniak via cfe-commits


@@ -444,17 +444,40 @@ def ROCDL_ds_read_tr6_b96 : 
ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr6.b96">;
 def ROCDL_ds_read_tr16_b64 : ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr16.b64">;
 
 //===----------------------------------------------------------------------===//
-// Global load to LDS intrinsic (available in GFX950)
+// Load to LDS intrinsic (available in GFX9 and GFX10)
+//===----------------------------------------------------------------------===//
+
+def ROCDL_LoadToLDSOp :
+  ROCDL_IntrOp<"load.to.lds", [], [0], [], 0, 0, 1, [2, 3, 4], ["size", 
"offset", "aux"]> {
+  dag args = (ins Arg:$globalPtr,
+ Arg:$ldsPtr,
+ I32Attr:$size,
+ I32Attr:$offset,
+ I32Attr:$aux);
+  let arguments = !con(args, aliasAttrs);
+  let assemblyFormat = [{
+$globalPtr `,`  $ldsPtr `,` $size `,` $offset `,` $aux
+attr-dict `:` type($globalPtr)
+  }];
+  let extraClassDefinition = [{
+::llvm::SmallVector<::mlir::Value> $cppClass::getAccessedOperands() {
+  return {getGlobalPtr(), getLdsPtr()};
+}
+  }];
+}
 
 def ROCDL_GlobalLoadLDSOp :
-  ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1> {
+  ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1, [2, 3, 4], ["size", 
"offset", "aux"]> {

krzysz00 wrote:

I wouldn't - someone might be using it and all. Might need a prefer load.to.lds 
instead note

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-30 Thread Alan Li via cfe-commits


@@ -444,17 +444,40 @@ def ROCDL_ds_read_tr6_b96 : 
ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr6.b96">;
 def ROCDL_ds_read_tr16_b64 : ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr16.b64">;
 
 //===----------------------------------------------------------------------===//
-// Global load to LDS intrinsic (available in GFX950)
+// Load to LDS intrinsic (available in GFX9 and GFX10)
+//===----------------------------------------------------------------------===//
+
+def ROCDL_LoadToLDSOp :
+  ROCDL_IntrOp<"load.to.lds", [], [0], [], 0, 0, 1, [2, 3, 4], ["size", 
"offset", "aux"]> {
+  dag args = (ins Arg:$globalPtr,
+ Arg:$ldsPtr,
+ I32Attr:$size,
+ I32Attr:$offset,
+ I32Attr:$aux);
+  let arguments = !con(args, aliasAttrs);
+  let assemblyFormat = [{
+$globalPtr `,`  $ldsPtr `,` $size `,` $offset `,` $aux
+attr-dict `:` type($globalPtr)
+  }];
+  let extraClassDefinition = [{
+::llvm::SmallVector<::mlir::Value> $cppClass::getAccessedOperands() {
+  return {getGlobalPtr(), getLdsPtr()};
+}
+  }];
+}
 
 def ROCDL_GlobalLoadLDSOp :
-  ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1> {
+  ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1, [2, 3, 4], ["size", 
"offset", "aux"]> {

lialan wrote:

@krzysz00 should we simply remove this op?

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-29 Thread Alan Li via cfe-commits

lialan wrote:

> > I still think we need an intrinsic here because a load + an addtid store 
> > can be scheduled much differently from the asynchronous "gather to LDS" - and 
> > because we don't want this load/store to be optimized
> 
> IMO the intrinsic should only be added as a last resort if we really can't 
> get the pattern based codegen to work well enough.

Beg to differ, in this case particularly. In a downstream application, I want 
fine-grained control over the use of this particular instruction, so that it gets 
propagated down to LLVM IR without being changed or modified along the way.

Well, actual reason: we need this instruction now. :-p

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-29 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

Well, if y'all want to go add a pattern for this and eventually deprecate the 
intrinsics I'm all ears, but we're trying to use these instructions now 

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-29 Thread Jay Foad via cfe-commits

jayfoad wrote:

> I still think we need an intrinsic here because a load + an addtid store can 
> be scheduled much differently from the asynchronous "gather to LDS" - and 
> because we don't want this load/store to be optimized

IMO the intrinsic should only be added as a last resort if we really can't get 
the pattern based codegen to work well enough.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-28 Thread Krzysztof Drewniak via cfe-commits


@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, 
"V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", 
"atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", 
"atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", 
"atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")

krzysz00 wrote:

Done, I think

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-28 Thread Krzysztof Drewniak via cfe-commits

https://github.com/krzysz00 updated 
https://github.com/llvm/llvm-project/pull/137425

From 96e94b5662c613fd80f712080751076254a73524 Mon Sep 17 00:00:00 2001
From: Krzysztof Drewniak 
Date: Sat, 26 Apr 2025 00:20:22 +
Subject: [PATCH 1/2] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic

This PR adds an amdgcn.load.to.lds intrinsic that abstracts over loads
to LDS from global (address space 1) pointers and buffer fat
pointers (address space 7), since they use the same API and "gather
from a pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and
7, and updates the MLIR wrappers to use it (loosening up the
restrictions on loads to LDS along the way to match the ground truth
from target features).

It also plumbs the intrinsic through to clang.
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   1 +
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   |   4 +
 clang/lib/Sema/SemaAMDGPU.cpp |   1 +
 .../CodeGenOpenCL/builtins-amdgcn-gfx950.cl   |  30 +++
 .../builtins-amdgcn-load-to-lds.cl|  60 +
 llvm/docs/ReleaseNotes.md |   8 +
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td  |  21 ++
 .../AMDGPU/AMDGPUInstructionSelector.cpp  |   5 +
 .../AMDGPU/AMDGPULowerBufferFatPointers.cpp   |  20 ++
 .../Target/AMDGPU/AMDGPURegisterBankInfo.cpp  |   2 +
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp |   8 +-
 .../AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll  |  75 ++
 .../CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll | 220 ++
 .../lower-buffer-fat-pointers-mem-transfer.ll |  18 ++
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td |  12 +-
 mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td  |  35 ++-
 .../AMDGPUToROCDL/AMDGPUToROCDL.cpp   |  15 +-
 mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp  |  21 +-
 .../Conversion/AMDGPUToROCDL/load_lds.mlir|  67 --
 mlir/test/Dialect/LLVMIR/rocdl.mlir   |  17 +-
 mlir/test/Target/LLVMIR/rocdl.mlir|  11 +-
 21 files changed, 598 insertions(+), 53 deletions(-)
 create mode 100644 clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 39fef9e4601f8..730fd15913c11 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, 
"V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", 
"atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", 
"atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", 
"atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 
 
//===----------------------------------------------------------------------===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp 
b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index ad012d98635ff..a32ef1c2a5a12 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
 llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
 return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+return emitBuiltinWithOneOverloadedType<5>(*this, E,
+   Intrinsic::amdgcn_load_to_lds);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
 Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
{llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index a6366aceec2a6..e6414a623b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned 
BuiltinID,
 
   switch (BuiltinID) {
   case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds:
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds:
   case AMDGPU::BI__builtin_amdgcn_global_load_lds: {
 constexpr const int SizeIdx = 2;
 llvm::APSInt Size;
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
index 8251d6c213e3d..4b73347ac8155 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
@@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, 
uint seed)
   *out = __built

[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-28 Thread Krzysztof Drewniak via cfe-commits

https://github.com/krzysz00 edited 
https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-28 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

@jayfoad I still think we need an intrinsic here because a load + an addtid 
store can be scheduled much differently from the asynchronous "gather to LDS" - 
and because we don't want this load/store to be optimized

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-28 Thread Jay Foad via cfe-commits

jayfoad wrote:

> > High level question: I don't understand why you call this a "gather" 
> > operation. What do you mean by that? Isn't it semantically just a memcpy, 
> > or a (global/buffer) load followed by a (LDS) store?
> 
> The semantics of this operation (at least in the pre-gfx950 cases) are
> 
> ```
> lds_load(vector globalAddr, scalar ldsAddr) {
>lds[ldsAddr + 4 * laneId] = global[globalAddr];
> }
> ```
> 
> Note that your lane-varying global address can point all over memory, but 
> that the values to written to LDS always go at base, base + 4 bytes, base + 8 
> bytes, ... base + (wavesize - 1) * 4 bytes
> 
> From where I'm standing, this is a gather

I see. The LDS part is doing "addtid" addressing. There are other instructions 
that do this like `DS_LOAD_ADDTID_B32` and `GLOBAL_LOAD_ADDTID_B32` but I don't 
think we have any codegen support for them.

I think we _could_ add the codegen support just by pattern-matching the 
address, so `DS_LOAD_ADDTID_B32` would match something like `load ptr 
addrspace(3) (constant_base + tid *4)`.

Then buffer-load-to-lds could be pattern-matched as a regular (fat pointer) 
buffer load followed by an addtid-style LDS store, right? So no intrinsic is 
really _needed_?

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-28 Thread Pierre van Houtryve via cfe-commits

Pierre-vh wrote:

Can you please document it in the AMDGPUUsage table as well?

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-28 Thread Krzysztof Drewniak via cfe-commits

krzysz00 wrote:

@jayfoad 
> High level question: I don't understand why you call this a "gather" 
> operation. What do you mean by that? Isn't it semantically just a memcpy, or 
> a (global/buffer) load followed by a (LDS) store?

The semantics of this operation (at least in the pre-gfx950 cases) are
```
lds_load(vector globalAddr, scalar ldsAddr) {
   lds[ldsAddr + 4 * laneId] = global[globalAddr];
}
```

Note that your lane-varying global address can point all over memory, but that 
the values to written to LDS always go at base, base + 4 bytes, base + 8 bytes, 
... base + (wavesize - 1) * 4 bytes

From where I'm standing, this is a gather

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-27 Thread Jakub Kuderski via cfe-commits

kuhar wrote:

> High level question: I don't understand why you call this a "gather" 
> operation. What do you mean by that? Isn't it semantically just a memcpy, or 
> a (global/buffer) load followed by a (LDS) store?

This is more like a subgroup operation because the destination offset is 
uniform.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-27 Thread Jay Foad via cfe-commits

jayfoad wrote:

> This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to LDS 
> from global (address space 1) pointers and buffer fat pointers (address space 
> 7), since they use the same API and "gather from a pointer to LDS" is 
> something of an abstract operation.

High level question: I don't understand why you call this a "gather" operation. 
What do you mean by that? Isn't it semantically just a memcpy, or a 
(global/buffer) load followed by a (LDS) store?

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-26 Thread Shilei Tian via cfe-commits


@@ -0,0 +1,75 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py 
UTC_ARGS: --version 5
+; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx950 < %s | FileCheck 
-check-prefixes=GFX950,GFX950-SDAG %s
+; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx950 < %s | FileCheck 
-check-prefixes=GFX950,GFX950-GISEL %s
+
+; RUN: not --crash llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx942 
-filetype=null < %s 2>&1 | FileCheck -check-prefix=ERR-SDAG %s
+; RUN: not --crash llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx942 
-filetype=null < %s 2>&1 | FileCheck -check-prefix=ERR-GISEL %s
+
+; ERR-SDAG: LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.load.to.lds
+
+; ERR-GISEL: LLVM ERROR: cannot select: G_INTRINSIC_W_SIDE_EFFECTS 
intrinsic(@llvm.amdgcn.load.to.lds),
+
+;; Note: this is a bare-bones test to make sure that amdgcn.load.to.lds lowers 
to
+;; the correct intrinsic.
+
+declare void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) nocapture %gptr, ptr 
addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux)
+declare void @llvm.amdgcn.load.to.lds.p7(ptr addrspace(7) nocapture %gptr, ptr 
addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux)
+
+;-----------------------------------------------------------------------------
+; dwordx3
+;-----------------------------------------------------------------------------
+
+define amdgpu_ps void @global_load_lds_dwordx3_vaddr_saddr(ptr addrspace(1) 
nocapture %gptr, ptr addrspace(3) nocapture inreg %lptr) {
+; GFX950-LABEL: global_load_lds_dwordx3_vaddr_saddr:
+; GFX950:   ; %bb.0:
+; GFX950-NEXT:s_mov_b32 m0, s0
+; GFX950-NEXT:s_nop 0
+; GFX950-NEXT:global_load_lds_dwordx3 v[0:1], off offset:16 sc0
+; GFX950-NEXT:s_endpgm
+  call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr 
addrspace(3) %lptr, i32 12, i32 16, i32 1)
+  ret void
+}
+
+define amdgpu_ps void @buffer_load_lds_dwordx3_vaddr_saddr(ptr addrspace(7) 
nocapture inreg %gptr, i32 %off, ptr addrspace(3) nocapture inreg %lptr) {
+; GFX950-LABEL: buffer_load_lds_dwordx3_vaddr_saddr:
+; GFX950:   ; %bb.0:
+; GFX950-NEXT:v_add_u32_e32 v0, s4, v0
+; GFX950-NEXT:s_mov_b32 m0, s5
+; GFX950-NEXT:s_nop 0
+; GFX950-NEXT:buffer_load_dwordx3 v0, s[0:3], 0 offen offset:16 sc0 lds
+; GFX950-NEXT:s_endpgm
+  %gptr.off = getelementptr i8, ptr addrspace(7) %gptr, i32 %off
+  call void @llvm.amdgcn.load.to.lds.p7(ptr addrspace(7) %gptr.off, ptr 
addrspace(3) %lptr, i32 12, i32 16, i32 1)
+  ret void
+}
+
+;-----------------------------------------------------------------------------
+; dwordx4
+;-----------------------------------------------------------------------------
+
+define amdgpu_ps void @global_load_lds_dwordx4_vaddr_saddr(ptr addrspace(1) 
nocapture %gptr, ptr addrspace(3) nocapture inreg %lptr) {
+; GFX950-LABEL: global_load_lds_dwordx4_vaddr_saddr:
+; GFX950:   ; %bb.0:
+; GFX950-NEXT:s_mov_b32 m0, s0
+; GFX950-NEXT:s_nop 0
+; GFX950-NEXT:global_load_lds_dwordx4 v[0:1], off offset:16 sc0
+; GFX950-NEXT:s_endpgm
+  call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr 
addrspace(3) %lptr, i32 16, i32 16, i32 1)
+  ret void
+}
+
+define amdgpu_ps void @buffer_load_lds_dwordx4_vaddr_saddr(ptr addrspace(7) 
nocapture inreg %gptr, i32 %off, ptr addrspace(3) nocapture inreg %lptr) {
+; GFX950-LABEL: buffer_load_lds_dwordx4_vaddr_saddr:
+; GFX950:   ; %bb.0:
+; GFX950-NEXT:v_add_u32_e32 v0, s4, v0
+; GFX950-NEXT:s_mov_b32 m0, s5
+; GFX950-NEXT:s_nop 0
+; GFX950-NEXT:buffer_load_dwordx4 v0, s[0:3], 0 offen offset:16 sc0 lds
+; GFX950-NEXT:s_endpgm
+  %gptr.off = getelementptr i8, ptr addrspace(7) %gptr, i32 %off
+  call void @llvm.amdgcn.load.to.lds.p7(ptr addrspace(7) %gptr.off, ptr 
addrspace(3) %lptr, i32 16, i32 16, i32 1)
+  ret void
+}
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add 
tests below this line:
+; GFX950-GISEL: {{.*}}
+; GFX950-SDAG: {{.*}}

shiltian wrote:

The multi-prefix doesn't work well, so just use `--check-prefix`.

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-26 Thread Shilei Tian via cfe-commits

https://github.com/shiltian edited 
https://github.com/llvm/llvm-project/pull/137425




[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-26 Thread Matt Arsenault via cfe-commits


@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, 
"V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", 
"atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", 
"atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", 
"atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")

arsenm wrote:

Missing sema tests for rejection of invalid immediate values or missing 
subtarget feature
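For reference, a minimal sketch of what such a Sema test might look like (the RUN line and diagnostic text are illustrative assumptions, not copied from the patch):

```c
// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx900 -verify -S -o - %s
// A non-immediate size argument should be diagnosed, since the builtin
// signature marks it as a constant integer ('I' in "vv*v*3IUiIiIUi"):
void test_bad_size(global void *src, local void *dst, unsigned int n) {
  __builtin_amdgcn_load_to_lds(src, dst, n, 0, 0); // expected-error{{argument to '__builtin_amdgcn_load_to_lds' must be a constant integer}}
}
```

A second case compiled for a subtarget without the `vmem-to-lds-load-insts` feature would cover the missing-feature diagnostic.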

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-26 Thread Matt Arsenault via cfe-commits


@@ -102,6 +102,14 @@ Changes to the AMDGPU Backend
 
 * Bump the default `.amdhsa_code_object_version` to 6. ROCm 6.3 is required to 
run any program compiled with COV6.
 
+* Add a new `amdgcn.load.to.lds` intrinsic that wraps the existing 
global.load.lds

arsenm wrote:

Should add to the intrinsics section of AMDGPUUsage 

https://github.com/llvm/llvm-project/pull/137425


[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-25 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-backend-amdgpu

Author: Krzysztof Drewniak (krzysz00)


Changes

This PR adds an amdgcn.load.to.lds intrinsic that abstracts over loads to LDS 
from global (address space 1) pointers and buffer fat pointers (address space 
7), since they use the same API and "gather from a pointer to LDS" is something 
of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7, and 
updates the MLIR wrappers to use it (loosening up the restrictions on loads to 
LDS along the way to match the ground truth from target features).

It also plumbs the intrinsic through to clang.

(Any clang folks know why things are broken?)

---

Patch is 50.24 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/137425.diff


21 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1) 
- (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+4) 
- (modified) clang/lib/Sema/SemaAMDGPU.cpp (+1) 
- (modified) clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl (+30) 
- (added) clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl (+60) 
- (modified) llvm/docs/ReleaseNotes.md (+8) 
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+21) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+20) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+2) 
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+7-1) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll (+75) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll (+220) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll 
(+18) 
- (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+7-5) 
- (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+29-6) 
- (modified) mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp (+7-8) 
- (modified) mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp (+15-6) 
- (modified) mlir/test/Conversion/AMDGPUToROCDL/load_lds.mlir (+51-16) 
- (modified) mlir/test/Dialect/LLVMIR/rocdl.mlir (+10-7) 
- (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+7-4) 


```diff
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 39fef9e4601f8..730fd15913c11 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, 
"V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", 
"atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", 
"atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", 
"atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 
 
//===----------------------------------------------------------------------===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp 
b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index ad012d98635ff..a32ef1c2a5a12 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
 llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
 return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+return emitBuiltinWithOneOverloadedType<5>(*this, E,
+   Intrinsic::amdgcn_load_to_lds);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
 Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
{llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index a6366aceec2a6..e6414a623b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned 
BuiltinID,
 
   switch (BuiltinID) {
   case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds:
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds:
   case AMDGPU::BI__builtin_amdgcn_global_load_lds: {
 constexpr const int SizeIdx = 2;
 llvm::APSInt Size;
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
index 8251d6c213e3d..4b73347ac8155 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
@@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, 
uint seed)
   *out = __builtin_amdgcn_cvt_sr_f16_f32(*out, src, seed, 1);
 }
 
+// CH

[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-25 Thread via cfe-commits

llvmbot wrote:




@llvm/pr-subscribers-mlir

Author: Krzysztof Drewniak (krzysz00)


Changes

This PR adds an amdgcn.load.to.lds intrinsic that abstracts over loads to LDS 
from global (address space 1) pointers and buffer fat pointers (address space 
7), since they use the same API and "gather from a pointer to LDS" is something 
of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7, and 
updates the MLIR wrappers to use it (loosening up the restrictions on loads to 
LDS along the way to match the ground truth from target features).

It also plumbs the intrinsic through to clang.

(Any clang folks know why things are broken?)

---

Patch is 50.24 KiB, truncated to 20.00 KiB below, full version: 
https://github.com/llvm/llvm-project/pull/137425.diff


21 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1) 
- (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+4) 
- (modified) clang/lib/Sema/SemaAMDGPU.cpp (+1) 
- (modified) clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl (+30) 
- (added) clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl (+60) 
- (modified) llvm/docs/ReleaseNotes.md (+8) 
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+21) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+20) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+2) 
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+7-1) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll (+75) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll (+220) 
- (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll 
(+18) 
- (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+7-5) 
- (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+29-6) 
- (modified) mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp (+7-8) 
- (modified) mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp (+15-6) 
- (modified) mlir/test/Conversion/AMDGPUToROCDL/load_lds.mlir (+51-16) 
- (modified) mlir/test/Dialect/LLVMIR/rocdl.mlir (+10-7) 
- (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+7-4) 


``diff
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 39fef9e4601f8..730fd15913c11 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, 
"V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", 
"atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", 
"atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", 
"atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 
 
//===--===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp 
b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index ad012d98635ff..a32ef1c2a5a12 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
 llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
 return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+return emitBuiltinWithOneOverloadedType<5>(*this, E,
+   Intrinsic::amdgcn_load_to_lds);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
 Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
{llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index a6366aceec2a6..e6414a623b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned 
BuiltinID,
 
   switch (BuiltinID) {
   case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds:
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds:
   case AMDGPU::BI__builtin_amdgcn_global_load_lds: {
 constexpr const int SizeIdx = 2;
 llvm::APSInt Size;
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl 
b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
index 8251d6c213e3d..4b73347ac8155 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
@@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, 
uint seed)
   *out = __builtin_amdgcn_cvt_sr_f16_f32(*out, src, seed, 1);
 }
 
+// CHECK-LABEL:

[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)

2025-04-25 Thread Krzysztof Drewniak via cfe-commits

https://github.com/krzysz00 created 
https://github.com/llvm/llvm-project/pull/137425

This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to LDS 
from global (address space 1) pointers and buffer fat pointers (address space 
7), since they use the same API and "gather from a pointer to LDS" is something 
of an abstract operation.

This commit adds the intrinsic and its lowerings for address spaces 1 and 7, and 
updates the MLIR wrappers to use it (loosening the restrictions on loads to 
LDS along the way to match the ground truth from target features).

It also plumbs the intrinsic through to clang.

(Any clang folks know why things are broken?)
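The address space 7 lowering works by "pulling apart" the buffer fat pointer into a v4i32 resource descriptor plus a 32-bit offset and emitting `raw_ptr_buffer_load_lds`, the direct equivalent of the old v4i32-based API. Here is a rough Python sketch of that split, under the assumption discussed in the review thread that the offset sits in the low 32 bits of a 160-bit fat pointer with the 128-bit resource descriptor above it; the exact field layout used here is an assumption for exposition.

```python
def split_fat_pointer(p7: int):
    """Model splitting a 160-bit buffer fat pointer (addrspace 7) into a
    128-bit resource descriptor (four 32-bit words, low word first) and a
    32-bit offset. Assumes the offset lives in the low 32 bits."""
    offset = p7 & 0xFFFFFFFF          # low 32 bits: the byte offset
    rsrc = p7 >> 32                   # high 128 bits: the resource descriptor
    rsrc_v4i32 = [(rsrc >> (32 * i)) & 0xFFFFFFFF for i in range(4)]
    return rsrc_v4i32, offset
```

This is why the builtin can take a single `ptr addrspace(7)` argument where the v4i32 API needed a separate `voffset` operand: the offset travels inside the pointer itself.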

>From 96e94b5662c613fd80f712080751076254a73524 Mon Sep 17 00:00:00 2001
From: Krzysztof Drewniak 
Date: Sat, 26 Apr 2025 00:20:22 +
Subject: [PATCH] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic

This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads
to LDS from global (address space 1) pointers and buffer fat
pointers (address space 7), since they use the same API and "gather
from a pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for address spaces 1
and 7, and updates the MLIR wrappers to use it (loosening the
restrictions on loads to LDS along the way to match the ground truth
from target features).

It also plumbs the intrinsic through to clang.
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   1 +
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   |   4 +
 clang/lib/Sema/SemaAMDGPU.cpp |   1 +
 .../CodeGenOpenCL/builtins-amdgcn-gfx950.cl   |  30 +++
 .../builtins-amdgcn-load-to-lds.cl|  60 +
 llvm/docs/ReleaseNotes.md |   8 +
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td  |  21 ++
 .../AMDGPU/AMDGPUInstructionSelector.cpp  |   5 +
 .../AMDGPU/AMDGPULowerBufferFatPointers.cpp   |  20 ++
 .../Target/AMDGPU/AMDGPURegisterBankInfo.cpp  |   2 +
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp |   8 +-
 .../AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll  |  75 ++
 .../CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll | 220 ++
 .../lower-buffer-fat-pointers-mem-transfer.ll |  18 ++
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td |  12 +-
 mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td  |  35 ++-
 .../AMDGPUToROCDL/AMDGPUToROCDL.cpp   |  15 +-
 mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp  |  21 +-
 .../Conversion/AMDGPUToROCDL/load_lds.mlir|  67 --
 mlir/test/Dialect/LLVMIR/rocdl.mlir   |  17 +-
 mlir/test/Target/LLVMIR/rocdl.mlir|  11 +-
 21 files changed, 598 insertions(+), 53 deletions(-)
 create mode 100644 clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def 
b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 39fef9e4601f8..730fd15913c11 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, 
"V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", 
"atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", 
"atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", 
"atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", 
"vmem-to-lds-load-insts")
 
 
//===--===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp 
b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index ad012d98635ff..a32ef1c2a5a12 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned 
BuiltinID,
 llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
 return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+return emitBuiltinWithOneOverloadedType<5>(*this, E,
+   Intrinsic::amdgcn_load_to_lds);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
 Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
{llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index a6366aceec2a6..e6414a623b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned 
BuiltinID,
 
   switch (BuiltinID) {
   case AMDGPU::BI__bui