[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
cdevadas wrote:

### Merge activity

* **Jul 23, 4:02 AM EDT**: @cdevadas started a stack merge that includes this pull request via [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96163).

https://github.com/llvm/llvm-project/pull/96163
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote:

### Merge activity

* **Jul 23, 4:02 AM EDT**: @cdevadas started a stack merge that includes this pull request via [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96162).

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
cdevadas wrote:

The latest patch further optimizes the PatFrag and the patterns written on top of it by using OtherPredicates. The lit test changes in the latest patch undo a missed optimization I incorrectly introduced earlier in this PR for GFX7; it is now fixed and matches the default behavior of the current compiler.

https://github.com/llvm/llvm-project/pull/96163
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote:

> I still think it is terribly surprising all of the test diff shows up in this commit, and not the selection case

Because the selection support is done in the next PR of the review stack, https://github.com/llvm/llvm-project/pull/96162. This patch takes care of choosing the right opcode while merging the loads.

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas edited https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -658,17 +658,17 @@ define amdgpu_kernel void @image_bvh_intersect_ray_nsa_reassign(ptr %p_node_ptr,
 ;
 ; GFX1013-LABEL: image_bvh_intersect_ray_nsa_reassign:
 ; GFX1013:       ; %bb.0:
-; GFX1013-NEXT:    s_load_dwordx8 s[0:7], s[0:1], 0x24
+; GFX1013-NEXT:    s_load_dwordx8 s[4:11], s[0:1], 0x24

cdevadas wrote:

> I guess this code changes because xnack is enabled by default for GFX10.1?

Yes.

> Is there anything we could do to add known alignment info here, to avoid the code pessimization?

I'm not sure what can be done for it.

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
cdevadas wrote:

Ping

https://github.com/llvm/llvm-project/pull/96163
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote:

Ping

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half>

 define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) {
 ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret:
 ; GFX940:       ; %bb.0:
-; GFX940-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24

cdevadas wrote:

Opened https://github.com/llvm/llvm-project/issues/97715.

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas edited https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half>

 define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) {
 ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret:
 ; GFX940:       ; %bb.0:
-; GFX940-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24

cdevadas wrote:

Unfortunately, that's not happening. The IR load-store vectorizer doesn't combine the two loads: I still see both loads after the IR vectorizer, and they become two loads in the selected code. Could this be because the alignments of the two loads differ, so the IR vectorizer conservatively skips them?

*** IR Dump before Selection ***

define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) #0 {
  %local_atomic_fadd_v2bf16_noret.kernarg.segment = call nonnull align 16 dereferenceable(44) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
  %ptr.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %local_atomic_fadd_v2bf16_noret.kernarg.segment, i64 36, !amdgpu.uniform !0
  **%ptr.load = load ptr addrspace(3), ptr addrspace(4) %ptr.kernarg.offset**, align 4, !invariant.load !0
  %data.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %local_atomic_fadd_v2bf16_noret.kernarg.segment, i64 40, !amdgpu.uniform !0
  **%data.load = load <2 x i16>, ptr addrspace(4) %data.kernarg.offset**, align 8, !invariant.load !0
  %ret = call <2 x i16> @llvm.amdgcn.ds.fadd.v2bf16(ptr addrspace(3) %ptr.load, <2 x i16> %data.load)
  ret void
}

# *** IR Dump After selection ***:
# Machine code for function local_atomic_fadd_v2bf16_noret: IsSSA, TracksLiveness
Function Live Ins: $sgpr0_sgpr1 in %1

bb.0 (%ir-block.0):
  liveins: $sgpr0_sgpr1
  %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
  %3:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %1:sgpr_64(p4), 36, 0 :: (dereferenceable invariant load (s32) from %ir.ptr.kernarg.offset, addrspace 4)
  %4:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %1:sgpr_64(p4), 40, 0 :: (dereferenceable invariant load (s32) from %ir.data.kernarg.offset, align 8, addrspace 4)
  %5:vgpr_32 = COPY %3:sreg_32_xm0_xexec
  %6:vgpr_32 = COPY %4:sreg_32_xm0_xexec
  DS_PK_ADD_BF16 killed %5:vgpr_32, killed %6:vgpr_32, 0, 0, implicit $m0, implicit $exec
  S_ENDPGM 0

https://github.com/llvm/llvm-project/pull/96162
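For context, one reading of why the merge fails (my inference, not stated in the thread): merging the pair would create a single 8-byte access at kernarg byte offset 36, where only 4-byte alignment is provable.

```cpp
// The loads above sit at kernarg offsets 36 (align 4) and 40 (align 8).
// A merged load would span bytes [36, 44) and need natural 8-byte
// alignment, but 36 % 8 == 4, so the vectorizer cannot prove it.
constexpr unsigned MergedOffset = 36;
constexpr unsigned MergedSize = 8;
static_assert(MergedOffset % MergedSize != 0,
              "the merged access would be under-aligned");
```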
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas edited https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half>

 define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) {
 ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret:
 ; GFX940:       ; %bb.0:
-; GFX940-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x24
+; GFX940-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0x24

cdevadas wrote:

Earlier I wrongly used the dword size (Width) in the alignment check here, as Jay pointed out. I have now fixed it to use the byte size when comparing against the existing alignment of the first load.
https://github.com/llvm/llvm-project/pull/96162/commits/e7e6cbc4abd476a038fd7836e5078565e73d1fe9#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1730

https://github.com/llvm/llvm-project/pull/96162
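For reference, a minimal sketch of the corrected comparison (identifier names are my assumption; the exact code is in the linked commit):

```cpp
// Width counts dwords, so convert it to bytes before comparing against
// the alignment recorded on the first load's memory operand.
const unsigned SizeInBytes = Width * 4;
const MachineMemOperand *FirstMMO = *CI.I->memoperands_begin();
const bool IsUnderAligned = FirstMMO->getAlign().value() < SizeInBytes;
```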
[llvm-branch-commits] [llvm] [AMDGPU] Enable atomic optimizer for 64 bit divergent values (PR #96934)
@@ -178,6 +178,20 @@ bool AMDGPUAtomicOptimizerImpl::run(Function &F) {
   return Changed;
 }

+static bool shouldOptimizeForType(Type *Ty) {
+  switch (Ty->getTypeID()) {
+  case Type::FloatTyID:
+  case Type::DoubleTyID:
+    return true;
+  case Type::IntegerTyID: {
+    if (Ty->getIntegerBitWidth() == 32 || Ty->getIntegerBitWidth() == 64)

cdevadas wrote:

Get Ty->getIntegerBitWidth() just once outside?

https://github.com/llvm/llvm-project/pull/96934
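A suggestion-style sketch of that cleanup (a fragment of the case from the diff above; the rest of the function is unchanged):

```cpp
case Type::IntegerTyID: {
  // Query the bit width once instead of twice in the condition.
  const unsigned BitWidth = Ty->getIntegerBitWidth();
  if (BitWidth == 32 || BitWidth == 64)
    return true;
  // ... remainder of the case as in the patch
}
```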
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI,
       return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM;
     }
   case S_LOAD_IMM:
-    switch (Width) {
-    default:
-      return 0;
-    case 2:
-      return AMDGPU::S_LOAD_DWORDX2_IMM;
-    case 3:
-      return AMDGPU::S_LOAD_DWORDX3_IMM;
-    case 4:
-      return AMDGPU::S_LOAD_DWORDX4_IMM;
-    case 8:
-      return AMDGPU::S_LOAD_DWORDX8_IMM;
+    // For targets that support XNACK replay, use the constrained load opcode.
+    if (STI && STI->hasXnackReplay()) {
+      switch (Width) {

cdevadas wrote:

> > currently the alignment is picked from the first MMO and that'd definitely be smaller than the natural align requirement for the new load
>
> You don't know that - the alignment in the first MMO will be whatever alignment the compiler could deduce, which could be large, e.g. if the pointer used for the first load was known to have a large alignment.

Are you suggesting we check the alignment in the first MMO to see whether it still satisfies the preferred alignment for the merged load, and use the _ec form when the alignment is found to be smaller than the expected value?

https://github.com/llvm/llvm-project/pull/96162
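If I read the suggestion right, the policy would look roughly like this for the Width == 2 case (a sketch with assumed names; the _ec opcodes are the constrained forms this stack introduces):

```cpp
// Keep the plain opcode when the deduced alignment already covers the
// merged load's natural alignment; otherwise fall back to the constrained
// (_ec) form so an XNACK replay cannot clobber registers that overlap
// the address operand.
const unsigned NaturalAlign = PowerOf2Ceil(Width * 4);
const bool NeedConstrained =
    STI->hasXnackReplay() && FirstMMO->getAlign().value() < NaturalAlign;
return NeedConstrained ? AMDGPU::S_LOAD_DWORDX2_IMM_ec
                       : AMDGPU::S_LOAD_DWORDX2_IMM;
```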
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo &CI,
       return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM;
     }
   case S_LOAD_IMM:
-    switch (Width) {
-    default:
-      return 0;
-    case 2:
-      return AMDGPU::S_LOAD_DWORDX2_IMM;
-    case 3:
-      return AMDGPU::S_LOAD_DWORDX3_IMM;
-    case 4:
-      return AMDGPU::S_LOAD_DWORDX4_IMM;
-    case 8:
-      return AMDGPU::S_LOAD_DWORDX8_IMM;
+    // For targets that support XNACK replay, use the constrained load opcode.
+    if (STI && STI->hasXnackReplay()) {
+      switch (Width) {

cdevadas wrote:

I guess the merged load is currently always under-aligned: while combining the MMOs, the alignment is picked from the first MMO, and that'd definitely be smaller than the natural align requirement for the new load. That's one reason I conservatively want to emit the _ec equivalent when XNACK is enabled.

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -867,13 +867,104 @@ def SMRDBufferImm : ComplexPattern;
 def SMRDBufferImm32 : ComplexPattern;
 def SMRDBufferSgprImm : ComplexPattern;

+class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{
+  // Returns true if it is a naturally aligned multi-dword load.
+  LoadSDNode *Ld = cast<LoadSDNode>(N);
+  unsigned Size = Ld->getMemoryVT().getStoreSize();
+  return (Size <= 4) || (Ld->getAlign().value() >= PowerOf2Ceil(Size));

cdevadas wrote:

Isn't it 12 >= 16 or 16 >= 16? The PowerOf2Ceil is intended to catch the first case, where the specified Align is smaller than the natural alignment.

https://github.com/llvm/llvm-project/pull/96163
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -886,26 +977,17 @@ multiclass SMRD_Pattern {
   def : GCNPat <
     (smrd_load (SMRDSgpr i64:$sbase, i32:$soffset)),
     (vt (!cast(Instr#"_SGPR") $sbase, $soffset, 0))> {
-    let OtherPredicates = [isNotGFX9Plus];
-  }
-  def : GCNPat <
-    (smrd_load (SMRDSgpr i64:$sbase, i32:$soffset)),
-    (vt (!cast(Instr#"_SGPR_IMM") $sbase, $soffset, 0, 0))> {
-    let OtherPredicates = [isGFX9Plus];
+    let OtherPredicates = [isGFX6GFX7];
   }

-  // 4. SGPR+IMM offset
+  // 4. No offset
   def : GCNPat <
-    (smrd_load (SMRDSgprImm i64:$sbase, i32:$soffset, i32:$offset)),
-    (vt (!cast(Instr#"_SGPR_IMM") $sbase, $soffset, $offset, 0))> {
-    let OtherPredicates = [isGFX9Plus];
+    (vt (smrd_load (i64 SReg_64:$sbase))),
+    (vt (!cast(Instr#"_IMM") i64:$sbase, 0, 0))> {
+    let OtherPredicates = [isGFX6GFX7];
   }

-  // 5. No offset
-  def : GCNPat <
-    (vt (smrd_load (i64 SReg_64:$sbase))),
-    (vt (!cast(Instr#"_IMM") i64:$sbase, 0, 0))
-  >;
+  defm : SMRD_Align_Pattern;

cdevadas wrote:

I was using the predicate for GFX8+, which has XNACK replay support. I should instead check whether XNACK is actually enabled. Will change it.

https://github.com/llvm/llvm-project/pull/96163
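A sketch of what checking XNACK directly could look like inside the PatFrag predicate body (hypothetical; the actual patch may express this through OtherPredicates instead):

```cpp
// Naturally aligned multi-dword loads never need the constrained form;
// otherwise only targets with XNACK actually enabled (not merely
// supported, as on GFX8+) require it.
LoadSDNode *Ld = cast<LoadSDNode>(N);
unsigned Size = Ld->getMemoryVT().getStoreSize();
bool NaturallyAligned =
    Size <= 4 || Ld->getAlign().value() >= PowerOf2Ceil(Size);
return NaturallyAligned || !Subtarget->isXNACKEnabled();
```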
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
@@ -867,13 +867,104 @@ def SMRDBufferImm : ComplexPattern;
 def SMRDBufferImm32 : ComplexPattern;
 def SMRDBufferSgprImm : ComplexPattern;

+class SMRDAlignedLoadPat : PatFrag <(ops node:$ptr), (Op node:$ptr), [{
+  // Returns true if it is a naturally aligned multi-dword load.
+  LoadSDNode *Ld = cast<LoadSDNode>(N);
+  unsigned Size = Ld->getMemoryVT().getStoreSize();
+  return (Size <= 4) || (Ld->getAlign().value() >= PowerOf2Ceil(Size));

cdevadas wrote:

For the DWORDX3 case: even though the size is 12 bytes, its natural alignment is 16 bytes.

https://github.com/llvm/llvm-project/pull/96163
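Worked through for the two sizes under discussion (PowerOf2Ceil is the LLVM MathExtras helper used in the predicate above):

```cpp
#include "llvm/Support/MathExtras.h"
#include <cassert>

// DWORDX3: store size 12, natural alignment PowerOf2Ceil(12) == 16, so an
// align(8) dwordx3 load fails the check (8 < 16) while align(16) passes.
// DWORDX4: store size 16, PowerOf2Ceil(16) == 16, so align(16) is exact.
void checkNaturalAlignments() {
  assert(llvm::PowerOf2Ceil(12) == 16 && llvm::PowerOf2Ceil(16) == 16);
}
```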
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote:

> This looks like it is affecting codegen even when xnack is disabled? That should not happen.

It shouldn't. I put the xnack replay subtarget check before using *_ec equivalents. See the code here:
https://github.com/llvm/llvm-project/commit/65eb44327cf32a83dbbf13eb70f9d8c03f3efaef#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1735

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas ready_for_review https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
https://github.com/cdevadas ready_for_review https://github.com/llvm/llvm-project/pull/96163
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote:

> [!WARNING]
> This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96162).
> [Learn more](https://graphite.dev/docs/merge-pull-requests)

* **#96163** (https://app.graphite.dev/github/pr/llvm/llvm-project/96163)
* **#96162** (https://app.graphite.dev/github/pr/llvm/llvm-project/96162)
* **#96161** (https://app.graphite.dev/github/pr/llvm/llvm-project/96161)
* `main`

This stack of pull requests is managed by Graphite. [Learn more about stacking](https://stacking.dev/).

https://github.com/llvm/llvm-project/pull/96162
[llvm-branch-commits] [llvm] [AMDGPU] Codegen support for constrained multi-dword sloads (PR #96163)
cdevadas wrote:

> [!WARNING]
> This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96163).
> [Learn more](https://graphite.dev/docs/merge-pull-requests)

* **#96163** (https://app.graphite.dev/github/pr/llvm/llvm-project/96163)
* **#96162** (https://app.graphite.dev/github/pr/llvm/llvm-project/96162)
* **#96161** (https://app.graphite.dev/github/pr/llvm/llvm-project/96161)
* `main`

This stack of pull requests is managed by Graphite. [Learn more about stacking](https://stacking.dev/).

https://github.com/llvm/llvm-project/pull/96163
[llvm-branch-commits] [clang] [llvm] AMDGPU: Remove ds atomic fadd intrinsics (PR #95396)
@@ -2331,40 +2337,74 @@ static Value *upgradeARMIntrinsicCall(StringRef Name, CallBase *CI, Function *F,
   llvm_unreachable("Unknown function for ARM CallBase upgrade.");
 }

+// These are expected to have have the arguments:

cdevadas wrote:

```suggestion
// These are expected to have the arguments:
```

https://github.com/llvm/llvm-project/pull/95396
[llvm-branch-commits] [llvm] ff8a1ca - [AMDGPU] Fix the inconsistency in soffset for MUBUF stack accesses.
Author: Christudasan Devadasan
Date: 2021-01-22T14:20:59+05:30
New Revision: ff8a1cae181438b97937848060da1efb67117ea4

URL: https://github.com/llvm/llvm-project/commit/ff8a1cae181438b97937848060da1efb67117ea4
DIFF: https://github.com/llvm/llvm-project/commit/ff8a1cae181438b97937848060da1efb67117ea4.diff

LOG: [AMDGPU] Fix the inconsistency in soffset for MUBUF stack accesses.

During instruction selection, there was an inconsistency in choosing the initial soffset value. With certain early passes, this value was getting modified, which required an additional fixup during eliminateFrameIndex to make all cases work. This whole transformation looks trivial and can be handled better.

This patch clearly defines the initial value for soffset and keeps it unchanged before eliminateFrameIndex. The initial value must be zero for MUBUF with a frame index. The non-frame-index MUBUF forms that use a raw offset from SP will have the stack register for soffset. During frame elimination, the soffset remains zero for entry functions with zero dynamic allocas and no callsites, or else is updated to the appropriate frame/stack register.

Also did some code cleanup and made all asserts around soffset stricter to match.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D95071

Added:

Modified:
    llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
    llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
    llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
    llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
    llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-load-private.mir
    llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-store-private.mir
    llvm/test/CodeGen/AMDGPU/amdpal-callable.ll
    llvm/test/CodeGen/AMDGPU/fold-fi-mubuf.mir
    llvm/test/CodeGen/AMDGPU/stack-pointer-offset-relative-frameindex.ll

Removed:

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
index 3c66745c0e70..340f4ac6f57a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelDAGToDAG.cpp
@@ -1523,7 +1523,9 @@ std::pair AMDGPUDAGToDAGISel::foldFrameIndex(SDValue N) const
       FI ? CurDAG->getTargetFrameIndex(FI->getIndex(), FI->getValueType(0)) : N;
   // We rebase the base address into an absolute stack address and hence
-  // use constant 0 for soffset.
+  // use constant 0 for soffset. This value must be retained until
+  // frame elimination and eliminateFrameIndex will choose the appropriate
+  // frame register if need be.
   return std::make_pair(TFI, CurDAG->getTargetConstant(0, DL, MVT::i32));
 }

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 7255a061b26b..bd577a6fb8c5 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -3669,13 +3669,9 @@ AMDGPUInstructionSelector::selectMUBUFScratchOffen(MachineOperand &Root) const {
        MIB.addReg(HighBits);
      },
      [=](MachineInstrBuilder &MIB) { // soffset
-       const MachineMemOperand *MMO = *MI->memoperands_begin();
-       const MachinePointerInfo &PtrInfo = MMO->getPointerInfo();
-
-       if (isStackPtrRelative(PtrInfo))
-         MIB.addReg(Info->getStackPtrOffsetReg());
-       else
-         MIB.addImm(0);
+       // Use constant zero for soffset and rely on eliminateFrameIndex
+       // to choose the appropriate frame register if need be.
+       MIB.addImm(0);
      },
      [=](MachineInstrBuilder &MIB) { // offset
        MIB.addImm(Offset & 4095);
@@ -3722,15 +3718,9 @@ AMDGPUInstructionSelector::selectMUBUFScratchOffen(MachineOperand &Root) const {
        MIB.addReg(VAddr);
      },
      [=](MachineInstrBuilder &MIB) { // soffset
-       // If we don't know this private access is a local stack object, it
-       // needs to be relative to the entry point's scratch wave offset.
-       // TODO: Should split large offsets that don't fit like above.
-       // TODO: Don't use scratch wave offset just because the offset
-       // didn't fit.
-       if (!Info->isEntryFunction() && FI.hasValue())
-         MIB.addReg(Info->getStackPtrOffsetReg());
-       else
-         MIB.addImm(0);
+       // Use constant zero for soffset and rely on eliminateFrameIndex
+       // to choose the appropriate frame register if need be.
+       MIB.addImm(0);
      },
      [=](MachineInstrBuilder &MIB) { // offset
        MIB.addImm(Offset);

diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index d22bdb791535..d5fa9afded27 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
[llvm-branch-commits] [llvm] c971bcd - [AMDGPU] Test clean up (NFC)
Author: Christudasan Devadasan
Date: 2021-01-22T13:38:52+05:30
New Revision: c971bcd2102b905e6469463fb8309ab3f7b2b8f2

URL: https://github.com/llvm/llvm-project/commit/c971bcd2102b905e6469463fb8309ab3f7b2b8f2
DIFF: https://github.com/llvm/llvm-project/commit/c971bcd2102b905e6469463fb8309ab3f7b2b8f2.diff

LOG: [AMDGPU] Test clean up (NFC)

Added:

Modified:
    llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll

Removed:

diff --git a/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll b/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
index 4fb2625a5221..5e4b5f70de0b 100644
--- a/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
+++ b/llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll
@@ -16,7 +16,7 @@
 ; so eliminateFrameIndex would not adjust the access to use the
 ; correct FP offset.

-define amdgpu_kernel void @local_stack_offset_uses_sp(i64 addrspace(1)* %out, i8 addrspace(1)* %in) {
+define amdgpu_kernel void @local_stack_offset_uses_sp(i64 addrspace(1)* %out) {
 ; MUBUF-LABEL: local_stack_offset_uses_sp:
 ; MUBUF:       ; %bb.0: ; %entry
 ; MUBUF-NEXT:    s_load_dwordx2 s[4:5], s[4:5], 0x0
@@ -106,7 +106,7 @@ entry:
   ret void
 }

-define void @func_local_stack_offset_uses_sp(i64 addrspace(1)* %out, i8 addrspace(1)* %in) {
+define void @func_local_stack_offset_uses_sp(i64 addrspace(1)* %out) {
 ; MUBUF-LABEL: func_local_stack_offset_uses_sp:
 ; MUBUF:       ; %bb.0: ; %entry
 ; MUBUF-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
[llvm-branch-commits] [llvm] ae25a39 - AMDGPU/GlobalISel: Enable sret demotion
Author: Christudasan Devadasan
Date: 2021-01-08T10:56:35+05:30
New Revision: ae25a397e9de833ffbd5d8e3b480086404625cb7

URL: https://github.com/llvm/llvm-project/commit/ae25a397e9de833ffbd5d8e3b480086404625cb7
DIFF: https://github.com/llvm/llvm-project/commit/ae25a397e9de833ffbd5d8e3b480086404625cb7.diff

LOG: AMDGPU/GlobalISel: Enable sret demotion

Added:

Modified:
    llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
    llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
    llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
    llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
    llvm/test/CodeGen/AMDGPU/GlobalISel/function-returns.ll
    llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-call-return-values.ll

Removed:

diff --git a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
index dff73d185114..57ff3900ef25 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
@@ -358,8 +358,8 @@ class CallLowering {
   /// described by \p Outs can fit into the return registers. If false
   /// is returned, an sret-demotion is performed.
   virtual bool canLowerReturn(MachineFunction &MF, CallingConv::ID CallConv,
-                              SmallVectorImpl<BaseArgInfo> &Outs, bool IsVarArg,
-                              LLVMContext &Context) const {
+                              SmallVectorImpl<BaseArgInfo> &Outs,
+                              bool IsVarArg) const {
     return true;
   }

diff --git a/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp b/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
index e1591a4bf19b..a6d4ea76 100644
--- a/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
@@ -95,8 +95,7 @@ bool CallLowering::lowerCall(MachineIRBuilder &MIRBuilder, const CallBase &CB,
   SmallVector<BaseArgInfo, 4> SplitArgs;
   getReturnInfo(CallConv, RetTy, CB.getAttributes(), SplitArgs, DL);
-  Info.CanLowerReturn =
-      canLowerReturn(MF, CallConv, SplitArgs, IsVarArg, RetTy->getContext());
+  Info.CanLowerReturn = canLowerReturn(MF, CallConv, SplitArgs, IsVarArg);

   if (!Info.CanLowerReturn) {
     // Callee requires sret demotion.
@@ -592,8 +591,7 @@ bool CallLowering::checkReturnTypeForCallConv(MachineFunction &MF) const {
   SmallVector<BaseArgInfo, 4> SplitArgs;
   getReturnInfo(CallConv, ReturnType, F.getAttributes(), SplitArgs,
                 MF.getDataLayout());
-  return canLowerReturn(MF, CallConv, SplitArgs, F.isVarArg(),
-                        ReturnType->getContext());
+  return canLowerReturn(MF, CallConv, SplitArgs, F.isVarArg());
 }

 bool CallLowering::analyzeArgInfo(CCState &CCState,

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
index 3b6e263ee6d8..b86052e3a14e 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
@@ -20,6 +20,7 @@
 #include "SIMachineFunctionInfo.h"
 #include "SIRegisterInfo.h"
 #include "llvm/CodeGen/Analysis.h"
+#include "llvm/CodeGen/FunctionLoweringInfo.h"
 #include "llvm/CodeGen/GlobalISel/MachineIRBuilder.h"
 #include "llvm/IR/IntrinsicsAMDGPU.h"
@@ -420,6 +421,22 @@ static void unpackRegsToOrigType(MachineIRBuilder &B,
   B.buildUnmerge(UnmergeResults, UnmergeSrc);
 }

+bool AMDGPUCallLowering::canLowerReturn(MachineFunction &MF,
+                                        CallingConv::ID CallConv,
+                                        SmallVectorImpl<BaseArgInfo> &Outs,
+                                        bool IsVarArg) const {
+  // For shaders. Vector types should be explicitly handled by CC.
+  if (AMDGPU::isEntryFunctionCC(CallConv))
+    return true;
+
+  SmallVector<CCValAssign, 16> ArgLocs;
+  const SITargetLowering &TLI = *getTLI<SITargetLowering>();
+  CCState CCInfo(CallConv, IsVarArg, MF, ArgLocs,
+                 MF.getFunction().getContext());
+
+  return checkReturn(CCInfo, Outs, TLI.CCAssignFnForReturn(CallConv, IsVarArg));
+}
+
 /// Lower the return value for the already existing \p Ret. This assumes that
 /// \p B's insertion point is correct.
 bool AMDGPUCallLowering::lowerReturnVal(MachineIRBuilder &B,
@@ -533,7 +550,9 @@ bool AMDGPUCallLowering::lowerReturn(MachineIRBuilder &B, const Value *Val,
     Ret.addUse(ReturnAddrVReg);
   }

-  if (!lowerReturnVal(B, Val, VRegs, Ret))
+  if (!FLI.CanLowerReturn)
+    insertSRetStores(B, Val->getType(), VRegs, FLI.DemoteRegister);
+  else if (!lowerReturnVal(B, Val, VRegs, Ret))
     return false;

   if (ReturnOpc == AMDGPU::S_SETPC_B64_return) {
@@ -872,6 +891,11 @@ bool AMDGPUCallLowering::lowerFormalArguments(
   unsigned Idx = 0;
   unsigned PSInputNum = 0;

+  // Insert the hidden sret parameter if the return value won't fit in the
+  // return registers.
+  if (!FLI.CanLowerReturn)
+    insertSRetIncomingArgument(F, SplitArgs, FLI.DemoteRegister, MRI, DL);
+
   for (auto &Arg : F.args()) {
     if (DL.getTypeStoreSize(Arg.getType()) == 0)
       continue;
@@
[llvm-branch-commits] [llvm] d68458b - [GlobalISel] Base implementation for sret demotion.
Author: Christudasan Devadasan
Date: 2021-01-06T10:30:50+05:30
New Revision: d68458bd56d9d55b05fca5447891aa8752d70509

URL: https://github.com/llvm/llvm-project/commit/d68458bd56d9d55b05fca5447891aa8752d70509
DIFF: https://github.com/llvm/llvm-project/commit/d68458bd56d9d55b05fca5447891aa8752d70509.diff

LOG: [GlobalISel] Base implementation for sret demotion.

If the return values can't be lowered to registers, SelectionDAG performs the sret demotion. This patch contains the basic implementation for the same in the GlobalISel pipeline. Furthermore, targets should bring relevant changes during lowerFormalArguments, lowerReturn and lowerCall to make use of this feature.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D92953

Added:

Modified:
    llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
    llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
    llvm/lib/CodeGen/GlobalISel/IRTranslator.cpp
    llvm/lib/Target/AArch64/GISel/AArch64CallLowering.cpp
    llvm/lib/Target/AArch64/GISel/AArch64CallLowering.h
    llvm/lib/Target/AMDGPU/AMDGPUCallLowering.cpp
    llvm/lib/Target/AMDGPU/AMDGPUCallLowering.h
    llvm/lib/Target/ARM/ARMCallLowering.cpp
    llvm/lib/Target/ARM/ARMCallLowering.h
    llvm/lib/Target/Mips/MipsCallLowering.cpp
    llvm/lib/Target/Mips/MipsCallLowering.h
    llvm/lib/Target/PowerPC/GISel/PPCCallLowering.cpp
    llvm/lib/Target/PowerPC/GISel/PPCCallLowering.h
    llvm/lib/Target/RISCV/RISCVCallLowering.cpp
    llvm/lib/Target/RISCV/RISCVCallLowering.h
    llvm/lib/Target/X86/X86CallLowering.cpp
    llvm/lib/Target/X86/X86CallLowering.h
    llvm/tools/llvm-exegesis/lib/Assembler.cpp

Removed:

diff --git a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
index dbd7e00c429a..dff73d185114 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/CallLowering.h
@@ -19,6 +19,7 @@
 #include "llvm/CodeGen/CallingConvLower.h"
 #include "llvm/CodeGen/MachineOperand.h"
 #include "llvm/CodeGen/TargetCallingConv.h"
+#include "llvm/IR/Attributes.h"
 #include "llvm/IR/CallingConv.h"
 #include "llvm/IR/Type.h"
 #include "llvm/Support/ErrorHandling.h"
@@ -31,6 +32,7 @@ namespace llvm {
 class CallBase;
 class DataLayout;
 class Function;
+class FunctionLoweringInfo;
 class MachineIRBuilder;
 struct MachinePointerInfo;
 class MachineRegisterInfo;
@@ -42,21 +44,30 @@ class CallLowering {
   virtual void anchor();
 public:
-  struct ArgInfo {
+  struct BaseArgInfo {
+    Type *Ty;
+    SmallVector<ISD::ArgFlagsTy, 4> Flags;
+    bool IsFixed;
+
+    BaseArgInfo(Type *Ty,
+                ArrayRef<ISD::ArgFlagsTy> Flags = ArrayRef<ISD::ArgFlagsTy>(),
+                bool IsFixed = true)
+        : Ty(Ty), Flags(Flags.begin(), Flags.end()), IsFixed(IsFixed) {}
+
+    BaseArgInfo() : Ty(nullptr), IsFixed(false) {}
+  };
+
+  struct ArgInfo : public BaseArgInfo {
     SmallVector<Register, 4> Regs;
     // If the argument had to be split into multiple parts according to the
     // target calling convention, then this contains the original vregs
     // if the argument was an incoming arg.
     SmallVector<Register, 2> OrigRegs;
-    Type *Ty;
-    SmallVector<ISD::ArgFlagsTy, 4> Flags;
-    bool IsFixed;

     ArgInfo(ArrayRef<Register> Regs, Type *Ty,
             ArrayRef<ISD::ArgFlagsTy> Flags = ArrayRef<ISD::ArgFlagsTy>(),
             bool IsFixed = true)
-        : Regs(Regs.begin(), Regs.end()), Ty(Ty),
-          Flags(Flags.begin(), Flags.end()), IsFixed(IsFixed) {
+        : BaseArgInfo(Ty, Flags, IsFixed), Regs(Regs.begin(), Regs.end()) {
       if (!Regs.empty() && Flags.empty())
         this->Flags.push_back(ISD::ArgFlagsTy());
       // FIXME: We should have just one way of saying "no register".
@@ -65,7 +76,7 @@ class CallLowering {
                "only void types should have no register");
     }

-    ArgInfo() : Ty(nullptr), IsFixed(false) {}
+    ArgInfo() : BaseArgInfo() {}
   };

   struct CallLoweringInfo {
@@ -101,6 +112,15 @@ class CallLowering {

     /// True if the call is to a vararg function.
     bool IsVarArg = false;
+
+    /// True if the function's return value can be lowered to registers.
+    bool CanLowerReturn = true;
+
+    /// VReg to hold the hidden sret parameter.
+    Register DemoteRegister;
+
+    /// The stack index for sret demotion.
+    int DemoteStackIndex;
   };

   /// Argument handling is mostly uniform between the four places that
@@ -292,20 +312,73 @@ class CallLowering {
     return false;
   }

+  /// Load the returned value from the stack into virtual registers in \p VRegs.
+  /// It uses the frame index \p FI and the start offset from \p DemoteReg.
+  /// The loaded data size will be determined from \p RetTy.
+  void insertSRetLoads(MachineIRBuilder &MIRBuilder, Type *RetTy,
+                       ArrayRef<Register> VRegs, Register DemoteReg,
+                       int FI) const;
+
+  /// Store the return value given by \p VRegs into stack starting at the offset
+  ///
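For readers new to sret demotion, the transformation these two commits wire into GlobalISel is, in C++ terms, roughly the following (illustrative only; the real pass works on virtual registers and stack slots, not source code):

```cpp
// Before demotion: a return value too large for the return registers.
struct Big { int v[32]; };
Big makeBig();               // cannot be returned entirely in registers

// After demotion: the caller allocates a stack slot and passes its address
// as a hidden pointer argument; the callee stores the value through it.
void makeBig(Big *sretSlot); // hidden sret out-parameter
```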