[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
krzysz00 wrote: > (that means addrspacecast 7-> 8 is not invertible by 8-> 7, right? it would > discard some bits, in invisible breakage sort of way? is there an RFC for > that design?) I'm not aware of anything requiring addrspacecast to be invertible? (Specifically, cast 7 -> 8 isn't a thing at the moment) https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
krzysz00 wrote: (You'll note that in https://github.com/llvm/llvm-project/pull/137425/files#diff-f904f8cd236733212015dd1988ffefcc9f79f7484ee46e3e3833d2d75fa69542R2243 , this intrinsic gets lowered to `raw_ptr_buffer_load_lds` by "pulling apart" the ptr addrspace(7) - that `raw_ptr_buffer_load_lds` intrinsic is the direct equivalent to the v4i32 system)
krzysz00 wrote: @JonChesterfield This builtin, semantically, cannot accommodate the v4i32 usage. When you have a v4i32, you need to also specify, as an additional argument, the `voffset` that gets used to index into that v4i32. This builtin doesn't have room for that, because it takes either a global pointer (which doesn't have a notion of the offset) or a buffer fat pointer (which has the offset stored in the low bits of the pointer and is, in some sense, a v5i32).
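To make the signature mismatch concrete, here is a sketch in LLVM IR. The operand shapes follow the upstream intrinsic definitions, but the value names are illustrative, not taken from the patch:

```llvm
; v4i32 resource form: the offset into the resource is a separate voffset
; operand (rsrc, lds ptr, size, voffset, soffset, imm offset, aux)
call void @llvm.amdgcn.raw.buffer.load.lds(<4 x i32> %rsrc, ptr addrspace(3) %lds,
                                           i32 4, i32 %voffset, i32 %soffset,
                                           i32 0, i32 0)

; pointer form of the new intrinsic: there is no voffset operand; for
; ptr addrspace(7) the offset lives in the low 32 bits of the pointer itself
call void @llvm.amdgcn.load.to.lds.p7(ptr addrspace(7) %p, ptr addrspace(3) %lds,
                                      i32 4, i32 0, i32 0)
```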
JonChesterfield wrote: I think we could do with an additional overload here. Currently a bunch of code (notably CK but probably elsewhere) uses the v4i32 version of the LDS intrinsics. I think this patch lets one use the addrspace(7) pointer of 128 bits alternative. So callers could transform the v4i32 into an addrspace(7) and then call this. It's not very clear from the backend docs how this stuff is supposed to be wired up by the user. Possibly bitcast from the 4i32 into an addrspace(8) annotated i128, and then addrspacecast to 7 to provide an extra 32 bits of zero, and then onward to this builtin? Whatever the proper sequence might be, adding an overload which takes a v4i32 and does the conversion is likely to improve adoption for the new builtin.
krzysz00 wrote: > I don't think we need to worry about compatibility with an intrinsic that's > been committed for a day `global.load.lds` and `buffer[.ptr].load.lds` have been around for quite a while though, and this is just an abstraction over them
arsenm wrote: > If you want to add new intrinsics that don't have the offset and that > pattern-match instead, I'd be more than happy to review that - or if we want > to break back-compat by getting rid of the offset. I don't think we need to worry about compatibility with an intrinsic that's been committed for a day
krzysz00 wrote: @arsenm You're right that it might be better to omit the offset, but all the existing intrinsics that I'm abstracting over _do_ have such a field. If you want to add new intrinsics that don't have the offset and that pattern-match instead, I'd be more than happy to review that - or if we want to break back-compat by getting rid of the offset.
arsenm wrote: I think we'd be better off omitting the offset from the intrinsic signature
https://github.com/krzysz00 closed https://github.com/llvm/llvm-project/pull/137425
krzysz00 wrote: Ping
krzysz00 wrote: Ping
@@ -2641,6 +2641,28 @@ def int_amdgcn_perm :
 // GFX9 Intrinsics
 //===----------------------------------------------------------------------===//

+/// This is a general-purpose intrinsic for all operations that take a pointer,
+/// a base location in LDS, and a data size, and use it to perform a gather to LDS.
+/// This allows abstracting over both global pointers (address space 1) and
+/// the buffer-resource-wrapper pointers (address space 7 and 9).
+/// TODO: add support for address space 5 and scratch_load_lds.
+class AMDGPULoadToLDS :
+  Intrinsic <
+    [],
+    [llvm_anyptr_ty,          // Base pointer to load from. Varies per lane.
+     LLVMQualPointerType<3>,  // LDS base pointer to store to. Must be wave-uniform.
+     llvm_i32_ty,             // Data byte size: 1/2/4 (/12/16 for gfx950)
+     llvm_i32_ty,             // imm offset (applied to both input and LDS address)

krzysz00 wrote: Oh, thanks for finding the context! `git blame` failed me. So ... we're having the discussion from that thread again, and therefore I'd like to appeal to precedent in the short term (regarding the immoffset parameter) in the interests of making some sort of progress. If we ever fix the immoffset issue, upgrading into making the immoffset a constant 0 and adding it to both pointers should be fine? But that'd require a sufficiently robust pattern match, which I'm not sure we're convinced of.
shiltian wrote: Right, it was added in https://reviews.llvm.org/D125279.
krzysz00 wrote: Re discussion on the other PR about "why is this even an intrinsic" - since this probably shouldn't just be in @jayfoad's DMs: The reason I disagree with "just pattern-match it" is that you can't get the scheduling you want without the guarantee of the intrinsic. Namely, while
```
global_load_b32 v1, v0
ds_write_addtid_b32 v1, s0
```
is obviously equivalent to
```
s_mov_b32 m0, s0
global_load_lds_b32 v0
```
if we turn that first example into
```
pipelined_loop: {
  global_load_b32 v2, v0
  ...
  waitcnt(lds only) + barrier
  ds_read v*, ...
  mfmas(v)
  waitcnt(lds)+s_barrier
  waitcnt(vmem) ;; and not substantially earlier please
  ds_write_addtid_b32 v2, s0
  jle pipelined_loop
}
```
for example, we really don't want that match firing, because LDS gets overwritten. ... *unless* we're double-buffering into LDS and so trying to do
```
pipelined_lds: {
  waitcnt(vmem,lds)+barrier
  load_lds(global1(iv), lds2)
  do_compute(lds1)
  waitcnt(vmem,lds)+barrier
  load_lds(global2(iv), lds1)
  do_compute(lds2) ;; We'd better not be waiting on LDS1 to settle at/before here
  iv += 2
}
```
where, if the pattern match for the addtid load fails, say because of waitcnt insertion, that'll cause problems for the program. Not to mention, because we don't have an intrinsic for ds_addtid, and because there are a *lot* of ways to spell the lane ID (mbcnt, workitem.id.x with annotations, a bunch of workitem IDs mod 64, etc etc), that match will be quite fragile. So in the context of GEMM stuff, I'd rather not leave this at "hope the compiler recognizes what we're trying to do". If the compiler can be made to recognize what we're trying to do reliably in the future, that'll be cool, but I can't be the one to write that patch and I don't think there's infinite bandwidth among the AMDGPU crowd for this improvement either.
@@ -564,6 +564,11 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy}); return Builder.CreateCall(F, {Addr}); } + case AMDGPU::BI__builtin_amdgcn_load_to_lds: { +// Should this have asan instrumentation? krzysz00 wrote: It might be a backend problem. However, https://github.com/llvm/llvm-project/blob/a9ce60ee885e25cba54c8900605563ff66e067b7/clang/lib/CodeGen/CGBuiltin.cpp#L4142 is the handling for memcpy(), and it doesn't seem to rely on the backend handling it. I don't know if this same handling should be happening for the intrinsic.
krzysz00 wrote: *big shrug* Is this something y'all want to try and fix here? (Also, procedural history on the buffer intrinsics points me to https://reviews.llvm.org/D124884 ( @rampitec ), which leaves it rather unclear why the immarg was added ... I think it's because for LDS, unlike the other buffer instructions, you can't do voffset => (actual voffset + imm))
arsenm wrote: The global one definitely shouldn't have the offset (given it's there, we should be trying to do addressing mode folding into it)
krzysz00 wrote: So @shiltian, I figure it's there for reasons I may not be aware of. (The buffer intrinsic's `soffset` bit is probably a case where there may have been incorrect bounds checking at some point, and is unrelated, I think)
krzysz00 wrote: This same "immediate offset" argument landed without comment back in https://github.com/llvm/llvm-project/pull/92962 - this intrinsic is API-compatible with that.
arsenm wrote: Lack of pattern matching isn't a reason to have the offset. There should be offset pattern matching regardless. I thought the argument for the current buffer intrinsic offset argument was something about the unreasonable bounds checking behaviors
jayfoad wrote: ??? Matching addressing modes is part of the compiler's job. If you want absolute control over what goes in the immediate offset field you can write assembler!
krzysz00 wrote: ... Oh. Consider the case that you have global p + N and LDS q + N. Then the LDS combiner can rewrite this to (q' + O) + N, aka q' + (O + N). Then the two pointers won't have the same offset anymore and so it's unclear if you can slide it onto the instruction immediate.
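A sketch of that hazard in LLVM IR (value names and the offset constants are illustrative, not from the patch):

```llvm
; before combining: both addresses share the constant offset 16, so 16
; could be folded into the instruction's shared immediate-offset field
%g = getelementptr i8, ptr addrspace(1) %p, i32 16
%l = getelementptr i8, ptr addrspace(3) %q, i32 16

; after an LDS combine rewrites %q into (%q2 + 8), the LDS address becomes
; %q2 + 24 while the global address is still %p + 16; the two offsets have
; diverged, so 16 can no longer be slid onto the single shared immediate
```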
https://github.com/krzysz00 updated https://github.com/llvm/llvm-project/pull/137425

>From bcb72e3d8cb2dcdb97199d32797306c5807c8442 Mon Sep 17 00:00:00 2001
From: Krzysztof Drewniak
Date: Sat, 26 Apr 2025 00:20:22 +
Subject: [PATCH 1/4] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic

This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to LDS from global (address space 1) pointers and buffer fat pointers (address space 7), since they use the same API and "gather from a pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for address spaces 1 and 7, and updates the MLIR wrappers to use it (loosening up the restrictions on loads to LDS along the way to match the ground truth from target features). It also plumbs the intrinsic through to clang.

---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   1 +
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   |   4 +
 clang/lib/Sema/SemaAMDGPU.cpp                 |   1 +
 .../CodeGenOpenCL/builtins-amdgcn-gfx950.cl   |  30 +++
 .../builtins-amdgcn-load-to-lds.cl            |  60 +
 llvm/docs/ReleaseNotes.md                     |   8 +
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td      |  21 ++
 .../AMDGPU/AMDGPUInstructionSelector.cpp      |   5 +
 .../AMDGPU/AMDGPULowerBufferFatPointers.cpp   |  20 ++
 .../Target/AMDGPU/AMDGPURegisterBankInfo.cpp  |   2 +
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp     |   8 +-
 .../AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll  |  75 ++
 .../CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll | 220 ++
 .../lower-buffer-fat-pointers-mem-transfer.ll |  18 ++
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td |  12 +-
 mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td  |  35 ++-
 .../AMDGPUToROCDL/AMDGPUToROCDL.cpp           |  15 +-
 mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp  |  21 +-
 .../Conversion/AMDGPUToROCDL/load_lds.mlir    |  67 --
 mlir/test/Dialect/LLVMIR/rocdl.mlir           |  17 +-
 mlir/test/Target/LLVMIR/rocdl.mlir            |  11 +-
 21 files changed, 598 insertions(+), 53 deletions(-)
 create mode 100644 clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl
 create mode 100644
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index 39fef9e4601f8..730fd15913c11 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts") +TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") //===--===// diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp index ad012d98635ff..a32ef1c2a5a12 100644 --- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp +++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp @@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy}); return Builder.CreateCall(F, {Addr}); } + case AMDGPU::BI__builtin_amdgcn_load_to_lds: { +return emitBuiltinWithOneOverloadedType<5>(*this, E, + Intrinsic::amdgcn_load_to_lds); + } case AMDGPU::BI__builtin_amdgcn_get_fpenv: { Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv, {llvm::Type::getInt64Ty(getLLVMContext())}); diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp index a6366aceec2a6..e6414a623b929 100644 --- a/clang/lib/Sema/SemaAMDGPU.cpp +++ b/clang/lib/Sema/SemaAMDGPU.cpp @@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID, switch 
(BuiltinID) { case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds: + case AMDGPU::BI__builtin_amdgcn_load_to_lds: case AMDGPU::BI__builtin_amdgcn_global_load_lds: { constexpr const int SizeIdx = 2; llvm::APSInt Size; diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl index 8251d6c213e3d..4b73347ac8155 100644 --- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl +++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl @@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, uint seed) *out = __built
krzysz00 wrote: I'm sure the existing intrinsics expose it for a reason - probably because there isn't pattern-matching to strip such an offset
@@ -0,0 +1,48 @@ +// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 5 +// REQUIRES: amdgpu-registered-target +// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -target-cpu gfx950 -emit-llvm -fcuda-is-device -o - %s | FileCheck %s + +// COM: Most tests are in the OpenCL semastics, this is just a verification for HIP + +#define __device__ __attribute__((device)) +#define __shared__ __attribute__((shared)) + +typedef unsigned int u32; + +// CHECK-LABEL: define dso_local void @_Z20test_load_to_lds_u32PjS_( +// CHECK-SAME: ptr noundef [[SRC:%.*]], ptr noundef [[DST:%.*]]) #[[ATTR0:[0-9]+]] { +// CHECK-NEXT: [[ENTRY:.*:]] +// CHECK-NEXT:[[SRC_ADDR:%.*]] = alloca ptr, align 8, addrspace(5) +// CHECK-NEXT:[[DST_ADDR:%.*]] = alloca ptr, align 8, addrspace(5) +// CHECK-NEXT:[[SRC_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SRC_ADDR]] to ptr +// CHECK-NEXT:[[DST_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DST_ADDR]] to ptr +// CHECK-NEXT:store ptr [[SRC]], ptr [[SRC_ADDR_ASCAST]], align 8 +// CHECK-NEXT:store ptr [[DST]], ptr [[DST_ADDR_ASCAST]], align 8 +// CHECK-NEXT:[[TMP0:%.*]] = load ptr, ptr [[SRC_ADDR_ASCAST]], align 8 +// CHECK-NEXT:[[TMP1:%.*]] = load ptr, ptr [[DST_ADDR_ASCAST]], align 8 +// CHECK-NEXT:[[TMP2:%.*]] = addrspacecast ptr [[TMP1]] to ptr addrspace(3) +// CHECK-NEXT:call void @llvm.amdgcn.load.to.lds.p0(ptr [[TMP0]], ptr addrspace(3) [[TMP2]], i32 4, i32 0, i32 0) +// CHECK-NEXT:ret void +// +__device__ void test_load_to_lds_u32(u32* src, __shared__ u32 *dst) { + __builtin_amdgcn_load_to_lds(src, dst, /*size=*/4, /*offset=*/0, /*aux=*/0); +} + +// CHECK-LABEL: define dso_local void @_Z20test_load_to_lds_128PvS_( +// CHECK-SAME: ptr noundef [[SRC:%.*]], ptr noundef [[DST:%.*]]) #[[ATTR0]] { +// CHECK-NEXT: [[ENTRY:.*:]] +// CHECK-NEXT:[[SRC_ADDR:%.*]] = alloca ptr, align 8, addrspace(5) +// CHECK-NEXT:[[DST_ADDR:%.*]] = alloca ptr, align 8, addrspace(5) +// CHECK-NEXT:[[SRC_ADDR_ASCAST:%.*]] = 
addrspacecast ptr addrspace(5) [[SRC_ADDR]] to ptr +// CHECK-NEXT:[[DST_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[DST_ADDR]] to ptr +// CHECK-NEXT:store ptr [[SRC]], ptr [[SRC_ADDR_ASCAST]], align 8 +// CHECK-NEXT:store ptr [[DST]], ptr [[DST_ADDR_ASCAST]], align 8 +// CHECK-NEXT:[[TMP0:%.*]] = load ptr, ptr [[SRC_ADDR_ASCAST]], align 8 +// CHECK-NEXT:[[TMP1:%.*]] = load ptr, ptr [[DST_ADDR_ASCAST]], align 8 +// CHECK-NEXT:[[TMP2:%.*]] = addrspacecast ptr [[TMP1]] to ptr addrspace(3) +// CHECK-NEXT:call void @llvm.amdgcn.load.to.lds.p0(ptr [[TMP0]], ptr addrspace(3) [[TMP2]], i32 16, i32 0, i32 0) +// CHECK-NEXT:ret void +// +__device__ void test_load_to_lds_128(void* src, __shared__ void *dst) { + __builtin_amdgcn_load_to_lds(src, dst, /*size=*/16, /*offset=*/0, /*aux=*/0); +} arsenm wrote: Also test where dst isn't qualified with __shared__
jayfoad wrote: What's the point of the imm offset argument? It's not semantically useful, right? You could just add this offset to the two pointer arguments before calling the intrinsic.
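The equivalence being suggested, sketched in LLVM IR against the operand order in this patch (pointer, LDS pointer, size, imm offset, aux); value names are illustrative:

```llvm
; with the imm offset operand:
call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %p, ptr addrspace(3) %q,
                                      i32 4, i32 16, i32 0)

; semantically the same load, with the offset pre-added to both pointers:
%p16 = getelementptr i8, ptr addrspace(1) %p, i32 16
%q16 = getelementptr i8, ptr addrspace(3) %q, i32 16
call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %p16, ptr addrspace(3) %q16,
                                      i32 4, i32 0, i32 0)
```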
https://github.com/krzysz00 updated https://github.com/llvm/llvm-project/pull/137425
@@ -0,0 +1,60 @@ +// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py +// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown -target-cpu gfx900 -emit-llvm -o - %s | FileCheck %s +// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown -target-cpu gfx942 -emit-llvm -o - %s | FileCheck %s +// RUN: %clang_cc1 -cl-std=CL2.0 -O0 -triple amdgcn-unknown-unknown -target-cpu gfx1010 -emit-llvm -o - %s | FileCheck %s +// REQUIRES: amdgpu-registered-target + arsenm wrote: Should also add a HIP codegen test, builtins with pointer arguments sometimes interact poorly with the C++ variants that pretend address spaces don't exist https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
@@ -564,6 +564,11 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy}); return Builder.CreateCall(F, {Addr}); } + case AMDGPU::BI__builtin_amdgcn_load_to_lds: { +// Should this have asan instrumentation? arsenm wrote: Repeating comment from the PR that duplicated this one, yes but that's a backend problem? https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
@@ -444,17 +444,40 @@ def ROCDL_ds_read_tr6_b96 : ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr6.b96">; def ROCDL_ds_read_tr16_b64 : ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr16.b64">; //===-===// -// Global load to LDS intrinsic (available in GFX950) +// Load to LDS intrinsic (available in GFX9 and GFX10) +//===-===// + +def ROCDL_LoadToLDSOp : + ROCDL_IntrOp<"load.to.lds", [], [0], [], 0, 0, 1, [2, 3, 4], ["size", "offset", "aux"]> { + dag args = (ins Arg:$globalPtr, + Arg:$ldsPtr, + I32Attr:$size, + I32Attr:$offset, + I32Attr:$aux); + let arguments = !con(args, aliasAttrs); + let assemblyFormat = [{ +$globalPtr `,` $ldsPtr `,` $size `,` $offset `,` $aux +attr-dict `:` type($globalPtr) + }]; + let extraClassDefinition = [{ +::llvm::SmallVector<::mlir::Value> $cppClass::getAccessedOperands() { + return {getGlobalPtr(), getLdsPtr()}; +} + }]; +} def ROCDL_GlobalLoadLDSOp : - ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1> { + ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1, [2, 3, 4], ["size", "offset", "aux"]> { krzysz00 wrote: I wouldn't - someone might be using it and all. Might need a prefer load.to.lds instead note https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
@@ -444,17 +444,40 @@ def ROCDL_ds_read_tr6_b96 : ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr6.b96">; def ROCDL_ds_read_tr16_b64 : ROCDL_LDS_Read_Tr_IntrOp<"ds.read.tr16.b64">; //===-===// -// Global load to LDS intrinsic (available in GFX950) +// Load to LDS intrinsic (available in GFX9 and GFX10) +//===-===// + +def ROCDL_LoadToLDSOp : + ROCDL_IntrOp<"load.to.lds", [], [0], [], 0, 0, 1, [2, 3, 4], ["size", "offset", "aux"]> { + dag args = (ins Arg:$globalPtr, + Arg:$ldsPtr, + I32Attr:$size, + I32Attr:$offset, + I32Attr:$aux); + let arguments = !con(args, aliasAttrs); + let assemblyFormat = [{ +$globalPtr `,` $ldsPtr `,` $size `,` $offset `,` $aux +attr-dict `:` type($globalPtr) + }]; + let extraClassDefinition = [{ +::llvm::SmallVector<::mlir::Value> $cppClass::getAccessedOperands() { + return {getGlobalPtr(), getLdsPtr()}; +} + }]; +} def ROCDL_GlobalLoadLDSOp : - ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1> { + ROCDL_IntrOp<"global.load.lds", [], [], [], 0, 0, 1, [2, 3, 4], ["size", "offset", "aux"]> { lialan wrote: @krzysz00 should we simply remove this op? https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
lialan wrote:

> > I still think we need an intrinsic here because a load + an addtid store can be scheduled much different from the asynchronous "gather to LDS" - and because we don't want this load/store to not be optimized
>
> IMO the intrinsic should only be added as a last resort if we really can't get the pattern based codegen to work well enough.

I beg to differ, particularly in this case. In a downstream application, I want fine-grained control over using this particular instruction, so that it gets propagated down to LLVM IR without being changed or modified along the way.

Well, the actual reason: we need this instruction now. :-p

https://github.com/llvm/llvm-project/pull/137425
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
krzysz00 wrote: Well, if y'all want to go add a pattern for this and eventually deprecate the intrinsics I'm all ears, but we're trying to use these instructions now https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
jayfoad wrote: > I still think we need an intrinsic here because a load + an addtid store can > be scheduled much different from the asynchronous "gather to LDS" - and > because we don't want this load/store to not be optimized IMO the intrinsic should only be added as a last resort if we really can't get the pattern based codegen to work well enough. https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts") +TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") krzysz00 wrote: Done, I think https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
https://github.com/krzysz00 updated https://github.com/llvm/llvm-project/pull/137425

>From 96e94b5662c613fd80f712080751076254a73524 Mon Sep 17 00:00:00 2001
From: Krzysztof Drewniak
Date: Sat, 26 Apr 2025 00:20:22 +
Subject: [PATCH 1/2] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic

This PR adds an amdgcn.load.to.lds intrinsic that abstracts over loads to LDS from global (address space 1) pointers and buffer fat pointers (address space 7), since they use the same API and "gather from a pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7, and updates the MLIR wrappers to use it (loosening up the restrictions on loads to LDS along the way to match the ground truth from target features). It also plumbs the intrinsic through to clang.

--- clang/include/clang/Basic/BuiltinsAMDGPU.def | 1 + clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp | 4 + clang/lib/Sema/SemaAMDGPU.cpp | 1 + .../CodeGenOpenCL/builtins-amdgcn-gfx950.cl | 30 +++ .../builtins-amdgcn-load-to-lds.cl| 60 + llvm/docs/ReleaseNotes.md | 8 + llvm/include/llvm/IR/IntrinsicsAMDGPU.td | 21 ++ .../AMDGPU/AMDGPUInstructionSelector.cpp | 5 + .../AMDGPU/AMDGPULowerBufferFatPointers.cpp | 20 ++ .../Target/AMDGPU/AMDGPURegisterBankInfo.cpp | 2 + llvm/lib/Target/AMDGPU/SIISelLowering.cpp | 8 +- .../AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll | 75 ++ .../CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll | 220 ++ .../lower-buffer-fat-pointers-mem-transfer.ll | 18 ++ mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td | 12 +- mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td | 35 ++- .../AMDGPUToROCDL/AMDGPUToROCDL.cpp | 15 +- mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp | 21 +- .../Conversion/AMDGPUToROCDL/load_lds.mlir| 67 -- mlir/test/Dialect/LLVMIR/rocdl.mlir | 17 +- mlir/test/Target/LLVMIR/rocdl.mlir| 11 +- 21 files changed, 598 insertions(+), 53 deletions(-) create mode 100644 clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl create mode 100644
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index 39fef9e4601f8..730fd15913c11 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts") +TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") //===--===// diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp index ad012d98635ff..a32ef1c2a5a12 100644 --- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp +++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp @@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy}); return Builder.CreateCall(F, {Addr}); } + case AMDGPU::BI__builtin_amdgcn_load_to_lds: { +return emitBuiltinWithOneOverloadedType<5>(*this, E, + Intrinsic::amdgcn_load_to_lds); + } case AMDGPU::BI__builtin_amdgcn_get_fpenv: { Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv, {llvm::Type::getInt64Ty(getLLVMContext())}); diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp index a6366aceec2a6..e6414a623b929 100644 --- a/clang/lib/Sema/SemaAMDGPU.cpp +++ b/clang/lib/Sema/SemaAMDGPU.cpp @@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID, switch 
(BuiltinID) { case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds: + case AMDGPU::BI__builtin_amdgcn_load_to_lds: case AMDGPU::BI__builtin_amdgcn_global_load_lds: { constexpr const int SizeIdx = 2; llvm::APSInt Size; diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl index 8251d6c213e3d..4b73347ac8155 100644 --- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl +++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl @@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, uint seed) *out = __built
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
https://github.com/krzysz00 edited https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
krzysz00 wrote:

@jayfoad I still think we need an intrinsic here because a load + an addtid store can be scheduled much differently from the asynchronous "gather to LDS" - and because we don't want this load/store sequence to be optimized in ways that break the pattern.

https://github.com/llvm/llvm-project/pull/137425
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
jayfoad wrote: > > High level question: I don't understand why you call this a "gather" > > operation. What do you mean by that? Isn't it semantically just a memcpy, > > or a (global/buffer) load followed by a (LDS) store? > > The semantics of this operation (at least in the pre-gfx950 cases) are > > ``` > lds_load(vector globalAddr, scalar ldsAddr) { >lds[ldsAddr + 4 * laneId] = global[globalAddr]; > } > ``` > > Note that your lane-varying global address can point all over memory, but > that the values to written to LDS always go at base, base + 4 bytes, base + 8 > bytes, ... base + (wavesize - 1) * 4 bytes > > From where I'm standing, this is a gather I see. The LDS part is doing "addtid" addressing. There are other instructions that do this like `DS_LOAD_ADDTID_B32` and `GLOBAL_LOAD_ADDTID_B32` but I don't think we have any codegen support for them. I think we _could_ add the codegen support just by pattern-matching the address, so `DS_LOAD_ADDTID_B32` would match something like `load ptr addrspace(3) (constant_base + tid *4)`. Then buffer-load-to-lds could be pattern-matched as a regular (fat pointer) buffer load followed by an addtid-style LDS store, right? So no intrinsic is really _needed_? https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
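The pattern-matching idea above can be sketched in the abstract. The following toy matcher works over a made-up tuple encoding of address expressions; it is purely illustrative and bears no relation to LLVM's actual DAG/GlobalISel matching APIs, but it shows the shape of the `base + tid * 4` check an addtid pattern would perform.

```python
# Toy "addtid" address matcher. Addresses are encoded as nested tuples,
# e.g. ("add", 256, ("mul", "tid", 4)) means constant_base + tid * 4.
def match_addtid(expr):
    # Recognize ("add", <constant base>, ("mul", "tid", 4)) and return the base.
    if (isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "add"
            and isinstance(expr[1], int)
            and expr[2] == ("mul", "tid", 4)):
        return expr[1]
    return None

assert match_addtid(("add", 256, ("mul", "tid", 4))) == 256   # matches
assert match_addtid(("add", 256, ("mul", "tid", 8))) is None  # wrong stride
assert match_addtid("tid") is None                            # not an add at all
```

A real implementation would of course have to handle commuted operands, zero bases, and stride-by-access-size generalization; this only illustrates why the match is plausible in principle.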
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
Pierre-vh wrote: Can you please document it in the AMDGPUUsage table as well? https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
krzysz00 wrote:

@jayfoad

> High level question: I don't understand why you call this a "gather" operation. What do you mean by that? Isn't it semantically just a memcpy, or a (global/buffer) load followed by a (LDS) store?

The semantics of this operation (at least in the pre-gfx950 cases) are

```
lds_load(vector globalAddr, scalar ldsAddr) {
  lds[ldsAddr + 4 * laneId] = global[globalAddr];
}
```

Note that your lane-varying global address can point all over memory, but that the values written to LDS always go at base, base + 4 bytes, base + 8 bytes, ... base + (wavesize - 1) * 4 bytes

From where I'm standing, this is a gather

https://github.com/llvm/llvm-project/pull/137425
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
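As a sanity check on that description, here is a toy Python model of the gather semantics (the wavesize, addresses, and flat-dictionary "memories" are all invented for illustration): per-lane global reads can be arbitrarily scattered, while the LDS writes land in consecutive dword slots from a uniform base.

```python
WAVE = 8                      # toy wavesize, not a real hardware value
LDS_BASE = 0x100              # wave-uniform LDS byte address

# "Global memory": value at address a is a * 2, just so results are checkable.
global_mem = {addr: addr * 2 for addr in range(256)}

# Lane-varying global addresses, deliberately scattered all over memory.
addrs = [200, 3, 77, 150, 9, 64, 255, 31]

lds = {}
for lane in range(WAVE):
    # Scattered reads, but contiguous dword-strided writes: a gather.
    lds[LDS_BASE + 4 * lane] = global_mem[addrs[lane]]

# The LDS destinations are exactly base, base+4, ..., base+(WAVE-1)*4.
assert sorted(lds) == [LDS_BASE + 4 * i for i in range(WAVE)]
```

This also shows why it is not just a memcpy: the source side is a per-lane indexed read, and only the destination side is contiguous.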
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
kuhar wrote: > High level question: I don't understand why you call this a "gather" > operation. What do you mean by that? Isn't it semantically just a memcpy, or > a (global/buffer) load followed by a (LDS) store? This is more like a subgroup operation because the destination offset is uniform. https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
jayfoad wrote: > This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to LDS > from global (address space 1) pointers and buffer fat pointers (address space > 7), since they use the same API and "gather from a pointer to LDS" is > something of an abstract operation. High level question: I don't understand why you call this a "gather" operation. What do you mean by that? Isn't it semantically just a memcpy, or a (global/buffer) load followed by a (LDS) store? https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
@@ -0,0 +1,75 @@ +; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 +; RUN: llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx950 < %s | FileCheck -check-prefixes=GFX950,GFX950-SDAG %s +; RUN: llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx950 < %s | FileCheck -check-prefixes=GFX950,GFX950-GISEL %s + +; RUN: not --crash llc -global-isel=0 -mtriple=amdgcn -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck -check-prefix=ERR-SDAG %s +; RUN: not --crash llc -global-isel=1 -mtriple=amdgcn -mcpu=gfx942 -filetype=null < %s 2>&1 | FileCheck -check-prefix=ERR-GISEL %s + +; ERR-SDAG: LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.load.to.lds + +; ERR-GISEL: LLVM ERROR: cannot select: G_INTRINSIC_W_SIDE_EFFECTS intrinsic(@llvm.amdgcn.load.to.lds), + +;; Note: this is a bare-bones test to make sure that amdgcn.load.to.lds lowers to +;; the correct intrinsic. + +declare void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux) +declare void @llvm.amdgcn.load.to.lds.p7(ptr addrspace(7) nocapture %gptr, ptr addrspace(3) nocapture %lptr, i32 %size, i32 %offset, i32 %aux) + +;-y +; dwordx3 +;- + +define amdgpu_ps void @global_load_lds_dwordx3_vaddr_saddr(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) nocapture inreg %lptr) { +; GFX950-LABEL: global_load_lds_dwordx3_vaddr_saddr: +; GFX950: ; %bb.0: +; GFX950-NEXT:s_mov_b32 m0, s0 +; GFX950-NEXT:s_nop 0 +; GFX950-NEXT:global_load_lds_dwordx3 v[0:1], off offset:16 sc0 +; GFX950-NEXT:s_endpgm + call void @llvm.amdgcn.load.to.lds.p1(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 12, i32 16, i32 1) + ret void +} + +define amdgpu_ps void @buffer_load_lds_dwordx3_vaddr_saddr(ptr addrspace(7) nocapture inreg %gptr, i32 %off, ptr addrspace(3) nocapture inreg %lptr) { +; GFX950-LABEL: buffer_load_lds_dwordx3_vaddr_saddr: +; GFX950: ; %bb.0: +; GFX950-NEXT:v_add_u32_e32 v0, s4, v0 +; GFX950-NEXT:s_mov_b32 m0, s5 +; 
GFX950-NEXT:s_nop 0 +; GFX950-NEXT:buffer_load_dwordx3 v0, s[0:3], 0 offen offset:16 sc0 lds +; GFX950-NEXT:s_endpgm + %gptr.off = getelementptr i8, ptr addrspace(7) %gptr, i32 %off + call void @llvm.amdgcn.load.to.lds.p7(ptr addrspace(7) %gptr.off, ptr addrspace(3) %lptr, i32 12, i32 16, i32 1) + ret void +} + +;- +; dwordx4 +;- + +define amdgpu_ps void @global_load_lds_dwordx4_vaddr_saddr(ptr addrspace(1) nocapture %gptr, ptr addrspace(3) nocapture inreg %lptr) { +; GFX950-LABEL: global_load_lds_dwordx4_vaddr_saddr: +; GFX950: ; %bb.0: +; GFX950-NEXT:s_mov_b32 m0, s0 +; GFX950-NEXT:s_nop 0 +; GFX950-NEXT:global_load_lds_dwordx4 v[0:1], off offset:16 sc0 +; GFX950-NEXT:s_endpgm + call void @llvm.amdgcn.global.load.lds(ptr addrspace(1) %gptr, ptr addrspace(3) %lptr, i32 16, i32 16, i32 1) + ret void +} + +define amdgpu_ps void @buffer_load_lds_dwordx4_vaddr_saddr(ptr addrspace(7) nocapture inreg %gptr, i32 %off, ptr addrspace(3) nocapture inreg %lptr) { +; GFX950-LABEL: buffer_load_lds_dwordx4_vaddr_saddr: +; GFX950: ; %bb.0: +; GFX950-NEXT:v_add_u32_e32 v0, s4, v0 +; GFX950-NEXT:s_mov_b32 m0, s5 +; GFX950-NEXT:s_nop 0 +; GFX950-NEXT:buffer_load_dwordx4 v0, s[0:3], 0 offen offset:16 sc0 lds +; GFX950-NEXT:s_endpgm + %gptr.off = getelementptr i8, ptr addrspace(7) %gptr, i32 %off + call void @llvm.amdgcn.load.to.lds.p7(ptr addrspace(7) %gptr.off, ptr addrspace(3) %lptr, i32 16, i32 16, i32 1) + ret void +} +;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line: +; GFX950-GISEL: {{.*}} +; GFX950-SDAG: {{.*}} shiltian wrote: The multi-prefix doesn't work well, so just use `--check-prefix`. https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
https://github.com/shiltian edited https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
https://github.com/shiltian edited https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts") +TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") arsenm wrote: Missing sema tests for rejection of invalid immediate values or missing subtarget feature https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
@@ -102,6 +102,14 @@ Changes to the AMDGPU Backend * Bump the default `.amdhsa_code_object_version` to 6. ROCm 6.3 is required to run any program compiled with COV6. +* Add a new `amdgcn.load.to.lds` intrinsic that wraps the existing global.load.lds arsenm wrote: Should add to the intrinsics section of AMDGPUUsage https://github.com/llvm/llvm-project/pull/137425 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
llvmbot wrote: @llvm/pr-subscribers-backend-amdgpu Author: Krzysztof Drewniak (krzysz00) Changes This PR adds a amdgns_load_to_lds intrinsic that abstracts over loads to LDS from global (address space 1) pointers and buffer fat pointers (address space 7), since they use the saem API and "gather from a pointer to LDS" is something of an abstract operation. This commet adds the intrinsic and its lowerings for addrspaces 1 and 7, and updates the MLIR wrappers to use it (loosening up the restrictions on loads to LDS along the way to match the ground truth from target features). It also plumbs the intrinsic through to clang. (Any clang folks know why things are broken?) --- Patch is 50.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137425.diff 21 Files Affected: - (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1) - (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+4) - (modified) clang/lib/Sema/SemaAMDGPU.cpp (+1) - (modified) clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl (+30) - (added) clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl (+60) - (modified) llvm/docs/ReleaseNotes.md (+8) - (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+21) - (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5) - (modified) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+20) - (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+2) - (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+7-1) - (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll (+75) - (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll (+220) - (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll (+18) - (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+7-5) - (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+29-6) - (modified) mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp (+7-8) - (modified) 
mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp (+15-6) - (modified) mlir/test/Conversion/AMDGPUToROCDL/load_lds.mlir (+51-16) - (modified) mlir/test/Dialect/LLVMIR/rocdl.mlir (+10-7) - (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+7-4) ``diff diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def index 39fef9e4601f8..730fd15913c11 100644 --- a/clang/include/clang/Basic/BuiltinsAMDGPU.def +++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def @@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts") TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts") +TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", "vmem-to-lds-load-insts") //===--===// diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp index ad012d98635ff..a32ef1c2a5a12 100644 --- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp +++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp @@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID, llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy}); return Builder.CreateCall(F, {Addr}); } + case AMDGPU::BI__builtin_amdgcn_load_to_lds: { +return emitBuiltinWithOneOverloadedType<5>(*this, E, + Intrinsic::amdgcn_load_to_lds); + } case AMDGPU::BI__builtin_amdgcn_get_fpenv: { Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv, {llvm::Type::getInt64Ty(getLLVMContext())}); diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp index a6366aceec2a6..e6414a623b929 100644 --- a/clang/lib/Sema/SemaAMDGPU.cpp +++ 
b/clang/lib/Sema/SemaAMDGPU.cpp @@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID, switch (BuiltinID) { case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds: + case AMDGPU::BI__builtin_amdgcn_load_to_lds: case AMDGPU::BI__builtin_amdgcn_global_load_lds: { constexpr const int SizeIdx = 2; llvm::APSInt Size; diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl index 8251d6c213e3d..4b73347ac8155 100644 --- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl +++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl @@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, uint seed) *out = __builtin_amdgcn_cvt_sr_f16_f32(*out, src, seed, 1); } +// CH
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
llvmbot wrote:

@llvm/pr-subscribers-mlir-llvm

Author: Krzysztof Drewniak (krzysz00)

Changes

This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to LDS from global (address space 1) pointers and buffer fat pointers (address space 7), since they use the same API and "gather from a pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7, and updates the MLIR wrappers to use it (loosening up the restrictions on loads to LDS along the way to match the ground truth from target features).

It also plumbs the intrinsic through to clang. (Any clang folks know why things are broken?)

---

Patch is 50.24 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137425.diff

21 Files Affected:

- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+1)
- (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+4)
- (modified) clang/lib/Sema/SemaAMDGPU.cpp (+1)
- (modified) clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl (+30)
- (added) clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl (+60)
- (modified) llvm/docs/ReleaseNotes.md (+8)
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+21)
- (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5)
- (modified) llvm/lib/Target/AMDGPU/AMDGPULowerBufferFatPointers.cpp (+20)
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+2)
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+7-1)
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll (+75)
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll (+220)
- (modified) llvm/test/CodeGen/AMDGPU/lower-buffer-fat-pointers-mem-transfer.ll (+18)
- (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+7-5)
- (modified) mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td (+29-6)
- (modified) mlir/lib/Conversion/AMDGPUToROCDL/AMDGPUToROCDL.cpp (+7-8)
- (modified) mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp (+15-6)
- (modified) mlir/test/Conversion/AMDGPUToROCDL/load_lds.mlir (+51-16)
- (modified) mlir/test/Dialect/LLVMIR/rocdl.mlir (+10-7)
- (modified) mlir/test/Target/LLVMIR/rocdl.mlir (+7-4)

```diff
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 39fef9e4601f8..730fd15913c11 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", "vmem-to-lds-load-insts")
 //===--===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index ad012d98635ff..a32ef1c2a5a12 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
     return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+    return emitBuiltinWithOneOverloadedType<5>(*this, E,
+                                               Intrinsic::amdgcn_load_to_lds);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
     Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
                                    {llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index a6366aceec2a6..e6414a623b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
   switch (BuiltinID) {
   case AMDGPU::BI__builtin_amdgcn_raw_ptr_buffer_load_lds:
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds:
   case AMDGPU::BI__builtin_amdgcn_global_load_lds: {
     constexpr const int SizeIdx = 2;
     llvm::APSInt Size;
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
index 8251d6c213e3d..4b73347ac8155 100644
--- a/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl
@@ -1766,6 +1766,36 @@ void test_cvt_sr_f16_f32(global half2 *out, float src, uint seed)
   *out = __builtin_amdgcn_cvt_sr_f16_f32(*out, src, seed, 1);
 }
+// CHECK-L
```
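[Based on the builtin prototype added above (`"vv*v*3IUiIiIUi"`: a generic source pointer, an LDS destination pointer, and three integer-constant arguments), a call site could look like the following sketch. This is an illustrative guess at usage, not code from the patch; the argument roles (size in bytes, offset, aux/cache-policy bits) are assumed to mirror the existing `__builtin_amdgcn_global_load_lds` builtin, and it only compiles with an AMDGPU toolchain targeting hardware with the `vmem-to-lds-load-insts` feature:]

```c
// Hypothetical OpenCL C usage of the new builtin (illustrative only).
// src may be a global pointer or a buffer fat pointer; the builtin
// abstracts over both. size/offset/aux must be integer constants.
kernel void stage_to_lds(global int *src, local int *dst) {
  // Gather 4 bytes per lane from src into the LDS buffer dst,
  // with no extra offset and default cache policy.
  __builtin_amdgcn_load_to_lds(src, dst, /*size=*/4, /*offset=*/0, /*aux=*/0);
  barrier(CLK_LOCAL_MEM_FENCE); // wait for the LDS data before use
}
```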
[clang] [llvm] [mlir] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic (PR #137425)
https://github.com/krzysz00 created https://github.com/llvm/llvm-project/pull/137425

This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to LDS from global (address space 1) pointers and buffer fat pointers (address space 7), since they use the same API and "gather from a pointer to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7, and updates the MLIR wrappers to use it (loosening up the restrictions on loads to LDS along the way to match the ground truth from target features).

It also plumbs the intrinsic through to clang. (Any clang folks know why things are broken?)

>From 96e94b5662c613fd80f712080751076254a73524 Mon Sep 17 00:00:00 2001
From: Krzysztof Drewniak
Date: Sat, 26 Apr 2025 00:20:22 +
Subject: [PATCH] [AMDGPU] Add a new amdgcn.load.to.lds intrinsic

This PR adds an amdgcn_load_to_lds intrinsic that abstracts over loads to
LDS from global (address space 1) pointers and buffer fat pointers
(address space 7), since they use the same API and "gather from a pointer
to LDS" is something of an abstract operation.

This commit adds the intrinsic and its lowerings for addrspaces 1 and 7,
and updates the MLIR wrappers to use it (loosening up the restrictions on
loads to LDS along the way to match the ground truth from target
features).

It also plumbs the intrinsic through to clang.
---
 clang/include/clang/Basic/BuiltinsAMDGPU.def  |   1 +
 clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp   |   4 +
 clang/lib/Sema/SemaAMDGPU.cpp                 |   1 +
 .../CodeGenOpenCL/builtins-amdgcn-gfx950.cl   |  30 +++
 .../builtins-amdgcn-load-to-lds.cl            |  60 +
 llvm/docs/ReleaseNotes.md                     |   8 +
 llvm/include/llvm/IR/IntrinsicsAMDGPU.td      |  21 ++
 .../AMDGPU/AMDGPUInstructionSelector.cpp      |   5 +
 .../AMDGPU/AMDGPULowerBufferFatPointers.cpp   |  20 ++
 .../Target/AMDGPU/AMDGPURegisterBankInfo.cpp  |   2 +
 llvm/lib/Target/AMDGPU/SIISelLowering.cpp     |   8 +-
 .../AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll  |  75 ++
 .../CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll | 220 ++
 .../lower-buffer-fat-pointers-mem-transfer.ll |  18 ++
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td |  12 +-
 mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td  |  35 ++-
 .../AMDGPUToROCDL/AMDGPUToROCDL.cpp           |  15 +-
 mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp  |  21 +-
 .../Conversion/AMDGPUToROCDL/load_lds.mlir    |  67 --
 mlir/test/Dialect/LLVMIR/rocdl.mlir           |  17 +-
 mlir/test/Target/LLVMIR/rocdl.mlir            |  11 +-
 21 files changed, 598 insertions(+), 53 deletions(-)
 create mode 100644 clang/test/CodeGenOpenCL/builtins-amdgcn-load-to-lds.cl
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.gfx950.ll
 create mode 100644 llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.to.lds.ll

diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index 39fef9e4601f8..730fd15913c11 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -257,6 +257,7 @@ TARGET_BUILTIN(__builtin_amdgcn_flat_atomic_fadd_v2bf16, "V2sV2s*0V2s", "t", "at
 TARGET_BUILTIN(__builtin_amdgcn_global_atomic_fadd_v2bf16, "V2sV2s*1V2s", "t", "atomic-global-pk-add-bf16-inst")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2bf16, "V2sV2s*3V2s", "t", "atomic-ds-pk-add-16-insts")
 TARGET_BUILTIN(__builtin_amdgcn_ds_atomic_fadd_v2f16, "V2hV2h*3V2h", "t", "atomic-ds-pk-add-16-insts")
+TARGET_BUILTIN(__builtin_amdgcn_load_to_lds, "vv*v*3IUiIiIUi", "t", "vmem-to-lds-load-insts")
 TARGET_BUILTIN(__builtin_amdgcn_global_load_lds, "vv*1v*3IUiIiIUi", "t", "vmem-to-lds-load-insts")
 //===--===//
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index ad012d98635ff..a32ef1c2a5a12 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -564,6 +564,10 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
     llvm::Function *F = CGM.getIntrinsic(IID, {LoadTy});
     return Builder.CreateCall(F, {Addr});
   }
+  case AMDGPU::BI__builtin_amdgcn_load_to_lds: {
+    return emitBuiltinWithOneOverloadedType<5>(*this, E,
+                                               Intrinsic::amdgcn_load_to_lds);
+  }
   case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
     Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
                                    {llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index a6366aceec2a6..e6414a623b929 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -36,6 +36,7 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
   switch (BuiltinID) {
   case AMDGPU::BI__bui