[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote: ### Merge activity * **Jul 23, 4:02 AM EDT**: @cdevadas started a stack merge that includes this pull request via [Graphite](https://app.graphite.dev/github/pr/llvm/llvm-project/96162). https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/jayfoad approved this pull request. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -6,7 +6,7 @@ declare i32 @llvm.amdgcn.global.atomic.csub(ptr addrspace(1), i32) ; GCN-LABEL: {{^}}global_atomic_csub_rtn: ; PREGFX12: global_atomic_csub v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9:]+}}, s{{\[[0-9]+:[0-9]+\]}} glc -; GFX12PLUS: global_atomic_sub_clamp_u32 v0, v0, v1, s[0:1] th:TH_ATOMIC_RETURN +; GFX12PLUS: global_atomic_sub_clamp_u32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}} th:TH_ATOMIC_RETURN jayfoad wrote: You shouldn't need any changes in this file. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
arsenm wrote: > > I still think it is terribly surprising all of the test diff shows up in > > this commit, and not the selection case > > Because the selection support is done in the next PR of the review stack, > #96162. This patch takes care of choosing the right opcode while merging the > loads. It's more the lack of vectorization of these loads in the IR that's surprising. Ideally we wouldn't have to rely on any of this post-isel load merging https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote: > I still think it is terribly surprising all of the test diff shows up in this > commit, and not the selection case Because the selection support is done in the next PR of the review stack, https://github.com/llvm/llvm-project/pull/96162. This patch takes care of choosing the right opcode while merging the loads. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/arsenm approved this pull request. I still think it is terribly surprising all of the test diff shows up in this commit, and not the selection case https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas edited https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -658,17 +658,17 @@ define amdgpu_kernel void @image_bvh_intersect_ray_nsa_reassign(ptr %p_node_ptr, ; ; GFX1013-LABEL: image_bvh_intersect_ray_nsa_reassign: ; GFX1013: ; %bb.0: -; GFX1013-NEXT:s_load_dwordx8 s[0:7], s[0:1], 0x24 +; GFX1013-NEXT:s_load_dwordx8 s[4:11], s[0:1], 0x24 cdevadas wrote: > I guess this code changes because xnack is enabled by default for GFX10.1? Yes. > Is there anything we could do to add known alignment info here, to avoid the > code pessimization? I'm not sure what can be done for it. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1212,8 +1228,14 @@ void SILoadStoreOptimizer::copyToDestRegs( // Copy to the old destination registers. const MCInstrDesc = TII->get(TargetOpcode::COPY); - const auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); - const auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); + auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + + // The constrained sload instructions in S_LOAD_IMM class will have + // `early-clobber` flag in the dst operand. Remove the flag before using the + // MOs in copies. + Dest0->setIsEarlyClobber(false); + Dest1->setIsEarlyClobber(false); jayfoad wrote: It's a bit ugly to modify in-place the operands of `CI.I` and `Paired.I`. But I guess it is harmless since they will be erased soon, when the merged load instruction is created. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -658,17 +658,17 @@ define amdgpu_kernel void @image_bvh_intersect_ray_nsa_reassign(ptr %p_node_ptr, ; ; GFX1013-LABEL: image_bvh_intersect_ray_nsa_reassign: ; GFX1013: ; %bb.0: -; GFX1013-NEXT:s_load_dwordx8 s[0:7], s[0:1], 0x24 +; GFX1013-NEXT:s_load_dwordx8 s[4:11], s[0:1], 0x24 jayfoad wrote: I guess this code changes because xnack is enabled by default for GFX10.1? Is there anything we could do to add known alignment info here, to avoid the code pessimization? https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote: Ping https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half> define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) { ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret: ; GFX940: ; %bb.0: -; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24 +; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24 cdevadas wrote: Opened https://github.com/llvm/llvm-project/issues/97715. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half> define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) { ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret: ; GFX940: ; %bb.0: -; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24 +; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24 arsenm wrote: Can you open an issue for this https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half> define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) { ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret: ; GFX940: ; %bb.0: -; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24 +; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24 arsenm wrote: LSV should have gotten this case, I don't see why it didn't. Someone should look into this https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas edited https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas edited https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half> define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) { ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret: ; GFX940: ; %bb.0: -; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24 +; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24 cdevadas wrote: Unfortunately, that's not happening. The IR load-store-vectorizer doesn't combine the two loads. I still see the two loads after the IR vectorizer and they become two loads in the selected code. Can this happen because the alignment for the two loads differ and the IR vectorizer safely ignores them? *** IR Dump before Selection *** define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) #0 { %local_atomic_fadd_v2bf16_noret.kernarg.segment = call nonnull align 16 dereferenceable(44) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr() %ptr.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %local_atomic_fadd_v2bf16_noret.kernarg.segment, i64 36, !amdgpu.uniform !0 **%ptr.load = load ptr addrspace(3), ptr addrspace(4) %ptr.kernarg.offset**, align 4, !invariant.load !0 %data.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %local_atomic_fadd_v2bf16_noret.kernarg.segment, i64 40, !amdgpu.uniform !0 **%data.load = load <2 x i16>, ptr addrspace(4) %data.kernarg.offset**, align 8, !invariant.load !0 %ret = call <2 x i16> @llvm.amdgcn.ds.fadd.v2bf16(ptr addrspace(3) %ptr.load, <2 x i16> %data.load) ret void } # *** IR Dump After selection ***: # Machine code for function local_atomic_fadd_v2bf16_noret: IsSSA, TracksLiveness Function Live Ins: $sgpr0_sgpr1 in %1 bb.0 (%ir-block.0): liveins: $sgpr0_sgpr1 %1:sgpr_64(p4) = COPY $sgpr0_sgpr1 %3:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %1:sgpr_64(p4), 36, 0 :: (dereferenceable invariant load (s32) from %ir.ptr.kernarg.offset, addrspace 4) %4:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %1:sgpr_64(p4), 40, 0 :: (dereferenceable invariant load (s32) from %ir.data.kernarg.offset, align 8, addrspace 4) %5:vgpr_32 = COPY %3:sreg_32_xm0_xexec %6:vgpr_32 = COPY %4:sreg_32_xm0_xexec DS_PK_ADD_BF16 killed %5:vgpr_32, killed %6:vgpr_32, 0, 0, implicit $m0, implicit $exec S_ENDPGM 0 https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half> define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) { ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret: ; GFX940: ; %bb.0: -; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24 +; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24 arsenm wrote: But surely we aren't merging *this* many scalar loads in MIR? The IR vectorizer should have gotten most of these? https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas edited https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half> define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) { ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret: ; GFX940: ; %bb.0: -; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24 +; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24 cdevadas wrote: Earlier I wrongly used the dword size (Width) in the the alignment check here as Jay pointed out. Now, I fixed it to use Byte size while comparing it with the existing alignment of the first load. https://github.com/llvm/llvm-project/pull/96162/commits/e7e6cbc4abd476a038fd7836e5078565e73d1fe9#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1730 https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -183,10 +183,10 @@ define <2 x half> @local_atomic_fadd_v2f16_rtn(ptr addrspace(3) %ptr, <2 x half> define amdgpu_kernel void @local_atomic_fadd_v2bf16_noret(ptr addrspace(3) %ptr, <2 x i16> %data) { ; GFX940-LABEL: local_atomic_fadd_v2bf16_noret: ; GFX940: ; %bb.0: -; GFX940-NEXT:s_load_dwordx2 s[0:1], s[0:1], 0x24 +; GFX940-NEXT:s_load_dwordx2 s[2:3], s[0:1], 0x24 arsenm wrote: Why does this patch have so many test diffs? Before this patch, we would have just missed out on a few folds after the _ec variants were introduced? https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1700,19 +1722,29 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , case 8: return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } - case S_LOAD_IMM: + case S_LOAD_IMM: { +// If XNACK is enabled, use the constrained opcodes when the first load is +// under-aligned. +const MachineMemOperand *MMO = *CI.I->memoperands_begin(); +bool NeedsConstrainedOpc = +STM->isXNACKEnabled() && MMO->getAlign().value() < (Width << 2); jayfoad wrote: ```suggestion STM->isXNACKEnabled() && MMO->getAlign().value() < Width * 4; ``` https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1700,19 +1725,30 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , case 8: return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } - case S_LOAD_IMM: + case S_LOAD_IMM: { +// If XNACK is enabled, use the constrained opcodes when the first load is +// under-aligned. +const MachineMemOperand *MMO = *CI.I->memoperands_begin(); +auto NeedsConstrainedOpc = [, Width](const GCNSubtarget ) { + return ST.isXNACKEnabled() && MMO->getAlign().value() < Width; jayfoad wrote: This doesn't look right since `Width` is in units of dwords here. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1212,8 +1228,17 @@ void SILoadStoreOptimizer::copyToDestRegs( // Copy to the old destination registers. const MCInstrDesc = TII->get(TargetOpcode::COPY); - const auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); - const auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + auto *Dest0 = TII->getNamedOperand(*CI.I, OpName); + auto *Dest1 = TII->getNamedOperand(*Paired.I, OpName); + + // The constrained sload instructions in S_LOAD_IMM class will have + // `early-clobber` flag in the dst operand. Remove the flag before using the + // MOs in copies. + if (Dest0->isEarlyClobber()) +Dest0->setIsEarlyClobber(false); + + if (Dest1->isEarlyClobber()) +Dest1->setIsEarlyClobber(false); jayfoad wrote: ```suggestion Dest0->setIsEarlyClobber(false); Dest1->setIsEarlyClobber(false); ``` https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1700,19 +1725,30 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , case 8: return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } - case S_LOAD_IMM: + case S_LOAD_IMM: { +// If XNACK is enabled, use the constrained opcodes when the first load is +// under-aligned. +const MachineMemOperand *MMO = *CI.I->memoperands_begin(); +auto NeedsConstrainedOpc = [, Width](const GCNSubtarget ) { jayfoad wrote: This doesn't need to be a lambda. It is always called, with identical arguments. Just calculate the result as a `bool` here. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { + switch (Width) { rampitec wrote: > > currently the alignment is picked from the first MMO and that'd definitely > > be smaller than the natural align requirement for the new load > > You don't know that - the alignment in the first MMO will be whatever > alignment the compiler could deduce, which could be large, e.g. if the > pointer used for the first load was known to have a large alignment. Moreover, it can easily be as large as a page. In a case of scalar load and kernarg. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { + switch (Width) { cdevadas wrote: > > currently the alignment is picked from the first MMO and that'd definitely > > be smaller than the natural align requirement for the new load > > You don't know that - the alignment in the first MMO will be whatever > alignment the compiler could deduce, which could be large, e.g. if the > pointer used for the first load was known to have a large alignment. Are you suggesting to check the alignment in the first MMO and see if it is still the preferred alignment for the merge-load? Use the _ec if the alignment is found to be smaller than the expected value. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/jayfoad edited https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { + switch (Width) { jayfoad wrote: > currently the alignment is picked from the first MMO and that'd definitely be > smaller than the natural align requirement for the new load You don't know that - the alignment in the first MMO will be whatever alignment the compiler could deduce, which could be large. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { + switch (Width) { cdevadas wrote: I guess currently the merged load is always under-aligned. While combining the MMOs, currently the alignment is picked from the first MMO and that'd definitely be smaller than the natural align requirement for the new load. That's one reason I conservatively want to emit _ec equivalent when XNACK is enabled. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { + switch (Width) { rampitec wrote: You can check alignment on the first load if MMO is available and avoid producing _ec version if it is sufficient. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -967,6 +967,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo, bool hasLDSFPAtomicAddF32() const { return GFX8Insts; } bool hasLDSFPAtomicAddF64() const { return GFX90AInsts; } + bool hasXnackReplay() const { return GFX8Insts; } jayfoad wrote: We already have a field SupportsXNACK for this which is hooked up to the "xnack-support" target feature. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
jayfoad wrote: > > This looks like it is affecting codegen even when xnack is disabled? That > > should not happen. > > It shouldn't. I put the xnack replay subtarget check before using *_ec > equivalents. See the code here: > [65eb443#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1735](https://github.com/llvm/llvm-project/commit/65eb44327cf32a83dbbf13eb70f9d8c03f3efaef#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1735) You're checking `STI->hasXnackReplay()` which is true on all GFX8+ targets. You should be checking whether xnack support is enabled with `STI->isXNACKEnabled()`. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { + switch (Width) { arsenm wrote: One switch and move the condition inside each size case? https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
@@ -1701,17 +1732,33 @@ unsigned SILoadStoreOptimizer::getNewOpcode(const CombineInfo , return AMDGPU::S_BUFFER_LOAD_DWORDX8_SGPR_IMM; } case S_LOAD_IMM: -switch (Width) { -default: - return 0; -case 2: - return AMDGPU::S_LOAD_DWORDX2_IMM; -case 3: - return AMDGPU::S_LOAD_DWORDX3_IMM; -case 4: - return AMDGPU::S_LOAD_DWORDX4_IMM; -case 8: - return AMDGPU::S_LOAD_DWORDX8_IMM; +// For targets that support XNACK replay, use the constrained load opcode. +if (STI && STI->hasXnackReplay()) { arsenm wrote: STI should never be null. The conservative default would be to assume ec if it were possible https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
arsenm wrote: I'm still not sure why we have so much in this pass. The load and store vectorization should have happened in the IR. This pass originally was for the multi offset DS instructions https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote: > This looks like it is affecting codegen even when xnack is disabled? That > should not happen. It shouldn't. I put the xnack replay subtarget check before using *_ec equivalents. See the code here: https://github.com/llvm/llvm-project/commit/65eb44327cf32a83dbbf13eb70f9d8c03f3efaef#diff-35f4d1b6c4c17815f6989f86abbac2e606ca760f9d93f501ff503449048bf760R1735 https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
jayfoad wrote: This looks like it is affecting codegen even when xnack is disabled? That should not happen. https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
llvmbot wrote: @llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-llvm-globalisel Author: Christudasan Devadasan (cdevadas) Changes Consider the constrained multi-dword loads while merging individual loads to a single multi-dword load. --- Patch is 1023.60 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96162.diff 116 Files Affected: - (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+1) - (modified) llvm/lib/Target/AMDGPU/SILoadStoreOptimizer.cpp (+63-16) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/cvt_f32_ubyte.ll (+84-85) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp-atomics-gfx940.ll (+6-6) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.global.atomic.csub.ll (+6-6) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+51-51) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.set.inactive.ll (+50-50) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.update.dpp.ll (+42-42) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll (+102-102) - (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll (+65-65) - (modified) llvm/test/CodeGen/AMDGPU/add.v2i16.ll (+21-21) - (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+204-204) - (modified) llvm/test/CodeGen/AMDGPU/atomics_cond_sub.ll (+24-24) - (modified) llvm/test/CodeGen/AMDGPU/bfe-patterns.ll (+20-20) - (modified) llvm/test/CodeGen/AMDGPU/bfm.ll (+5-5) - (modified) llvm/test/CodeGen/AMDGPU/bitreverse.ll (+157-157) - (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+25-25) - (modified) llvm/test/CodeGen/AMDGPU/calling-conventions.ll (+25-25) - (modified) llvm/test/CodeGen/AMDGPU/cluster_stores.ll (+60-60) - (modified) llvm/test/CodeGen/AMDGPU/combine-cond-add-sub.ll (+10-10) - (modified) llvm/test/CodeGen/AMDGPU/ctlz.ll (+227-227) - (modified) llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll (+223-223) - (modified) llvm/test/CodeGen/AMDGPU/ctpop16.ll (+52-48) - (modified) llvm/test/CodeGen/AMDGPU/ctpop64.ll (+24-24) - (modified) llvm/test/CodeGen/AMDGPU/cttz.ll (+185-185) - (modified) llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll (+212-212) - (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+331-334) - (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+24-24) - (modified) llvm/test/CodeGen/AMDGPU/ds_read2.ll (+8-8) - (modified) llvm/test/CodeGen/AMDGPU/extract_vector_elt-f16.ll (+25-25) - (modified) llvm/test/CodeGen/AMDGPU/fabs.f16.ll (+6-6) - (modified) llvm/test/CodeGen/AMDGPU/fabs.ll (+17-17) - (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.ll (+3-3) - (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+31-32) - (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f32.ll (+33-33) - (modified) llvm/test/CodeGen/AMDGPU/fdiv.ll (+143-143) - (modified) llvm/test/CodeGen/AMDGPU/flat_atomics.ll (+20-20) - (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+56-56) - (modified) llvm/test/CodeGen/AMDGPU/fma-combine.ll (+88-88) - (modified) llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll (+9-9) - (modified) llvm/test/CodeGen/AMDGPU/fmuladd.f16.ll (+240-240) - (modified) llvm/test/CodeGen/AMDGPU/fnearbyint.ll (+9-9) - (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+23-23) - (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.f16.ll (+6-6) - (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.ll (+17-17) - (modified) llvm/test/CodeGen/AMDGPU/fneg.ll (+13-13) - (modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx1200.ll (+26-26) - (modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx940.ll (+22-22) - (modified) llvm/test/CodeGen/AMDGPU/fp-classify.ll (+13-13) - (modified) llvm/test/CodeGen/AMDGPU/fp-min-max-buffer-atomics.ll (+33-33) - (modified) llvm/test/CodeGen/AMDGPU/fp-min-max-buffer-ptr-atomics.ll (+46-46) - (modified) llvm/test/CodeGen/AMDGPU/fp16_to_fp32.ll (+20-20) - (modified) llvm/test/CodeGen/AMDGPU/fp16_to_fp64.ll (+20-20) - (modified) llvm/test/CodeGen/AMDGPU/fp32_to_fp16.ll (+20-20) - (modified) llvm/test/CodeGen/AMDGPU/fp64-min-max-buffer-atomics.ll (+24-24) - (modified) llvm/test/CodeGen/AMDGPU/fp64-min-max-buffer-ptr-atomics.ll (+38-38) - (modified) llvm/test/CodeGen/AMDGPU/fp_to_sint.ll (+19-19) - (modified) llvm/test/CodeGen/AMDGPU/fp_to_uint.ll (+19-19) - (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+44-44) - (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+20-20) - (modified) llvm/test/CodeGen/AMDGPU/global_atomics.ll (+24-24) - (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+98-98) - (modified) llvm/test/CodeGen/AMDGPU/half.ll (+41-41) - (modified) llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll (+88-88) - (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+124-124) - (modified) llvm/test/CodeGen/AMDGPU/kernel-args.ll (+53-53) - (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.cvt.pkrtz.ll (+15-15) - (modified)
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
https://github.com/cdevadas ready_for_review https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits
[llvm-branch-commits] [llvm] [AMDGPU][SILoadStoreOptimizer] Merge constrained sloads (PR #96162)
cdevadas wrote: > [!WARNING] > This pull request is not mergeable via GitHub because a downstack PR is > open. Once all requirements are satisfied, merge this PR as a stack href="https://app.graphite.dev/github/pr/llvm/llvm-project/96162?utm_source=stack-comment-downstack-mergeability-warning; > >on Graphite. > https://graphite.dev/docs/merge-pull-requests;>Learn more * **#96163** https://app.graphite.dev/github/pr/llvm/llvm-project/96163?utm_source=stack-comment-icon; target="_blank">https://static.graphite.dev/graphite-32x32-black.png; alt="Graphite" width="10px" height="10px"/> * **#96162** https://app.graphite.dev/github/pr/llvm/llvm-project/96162?utm_source=stack-comment-icon; target="_blank">https://static.graphite.dev/graphite-32x32-black.png; alt="Graphite" width="10px" height="10px"/> * **#96161** https://app.graphite.dev/github/pr/llvm/llvm-project/96161?utm_source=stack-comment-icon; target="_blank">https://static.graphite.dev/graphite-32x32-black.png; alt="Graphite" width="10px" height="10px"/> * `main` This stack of pull requests is managed by Graphite. https://stacking.dev/?utm_source=stack-comment;>Learn more about stacking. Join @cdevadas and the rest of your teammates on https://graphite.dev?utm-source=stack-comment;>https://static.graphite.dev/graphite-32x32-black.png; alt="Graphite" width="11px" height="11px"/> Graphite https://github.com/llvm/llvm-project/pull/96162 ___ llvm-branch-commits mailing list llvm-branch-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-branch-commits