[llvm-bugs] [Bug 169552] [AMDGPU] regression from `SIFixSGPRCopies::processPHINode` change

LLVM Bugs via llvm-bugs Fri, 28 Nov 2025 21:33:40 -0800

Issue	169552
Summary	[AMDGPU] regression from `SIFixSGPRCopies::processPHINode` change
Labels	new issue
Assignees
Reporter	raiseirql

    The recent change to `SIFixSGPRCopies::processPHINode` in #169038 addressed the problem that we reported in #168761, but this caused a regression in our matmul kernel where PHI nodes that had `av_class_128` were then converted to `vreg_128` with copies introduced to move back to `areg_128` registers.


I put a repro case of the kernel I'm looking at here: https://gist.github.com/raiseirql/ebe539468b5a741ff561a85832205a3c

Should the change in #169038 be refined or should a later pass have cleaned up the codegen to remove the copies.

In order to get AGPRs to be used due to other recent backend changes, we have needed to apply this patch. @arsenm had been looking into how to fix this properly.
```
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index b398db4f7caf..42af5c19cae2 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -90,7 +90,11 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
     auto [MinNumAGPRAttr, MaxNumAGPRAttr] =
 AMDGPU::getIntegerPairAttribute(F, "amdgpu-agpr-alloc", {~0u, ~0u},
 /*_OnlyFirstRequired_=*/true);
- MinNumAGPRs = MinNumAGPRAttr;
+    MinNumAGPRs = ~0u;
+    if (MFMAVGPRForm ||
+         (ST.getMaxNumVGPRs(F) <= ST.getAddressableNumArchVGPRs() &&
+ !mayUseAGPRs(F)))
+      MinNumAGPRs = 0;
   }

   if (AMDGPU::isChainCC(CC)) {
``` 

Before `si-fix-sgpr-copies` in the matmul inner loop. Same basic pattern after `si-fix-sgpr-copies` without #169038.
```llvm
bb.2 (%ir-block.634):
; predecessors: %bb.1

...
 %706:areg_128_align2 = PHI %352:av_128_align2, %bb.1
  %707:areg_128_align2 = PHI %353:av_128_align2, %bb.1
...
  %1147:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %778:av_128_align2, %770:av_128_align2, %706:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
 %1148:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %779:av_128_align2, %770:av_128_align2, %707:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
``` 

After `si-fix-sgpr-copies` with #169038:
```llvm
  %2483:vgpr_32 = COPY %791:av_32, implicit $exec
 %2485:vgpr_32 = COPY %791:av_32, implicit $exec
  %2487:vgpr_32 = COPY %791:av_32, implicit $exec
  %2489:vgpr_32 = COPY %791:av_32, implicit $exec
  %2491:vgpr_32 = COPY %791:av_32, implicit $exec
  %2493:vgpr_32 = COPY %791:av_32, implicit $exec
  %2495:vgpr_32 = COPY %791:av_32, implicit $exec
  %2497:vgpr_32 = COPY %791:av_32, implicit $exec

bb.1 (%ir-block.184):
; predecessors: %bb.0, %bb.1
  successors: %bb.2(0x04000000), %bb.1(0x7c000000); %bb.2(3.12%), %bb.1(96.88%)
...
 %328:av_32 = PHI %2483:vgpr_32, %bb.0, %2484:vgpr_32, %bb.1
  %329:av_32 = PHI %2485:vgpr_32, %bb.0, %2486:vgpr_32, %bb.1
  %330:av_32 = PHI %2487:vgpr_32, %bb.0, %2488:vgpr_32, %bb.1
  %331:av_32 = PHI %2489:vgpr_32, %bb.0, %2490:vgpr_32, %bb.1
  %332:av_32 = PHI %2491:vgpr_32, %bb.0, %2492:vgpr_32, %bb.1
  %333:av_32 = PHI %2493:vgpr_32, %bb.0, %2494:vgpr_32, %bb.1
  %334:av_32 = PHI %2495:vgpr_32, %bb.0, %2496:vgpr_32, %bb.1
 %335:av_32 = PHI %2497:vgpr_32, %bb.0, %2498:vgpr_32, %bb.1
...
 %908:areg_128_align2 = REG_SEQUENCE %332:av_32, %subreg.sub0, %333:av_32, %subreg.sub1, %334:av_32, %subreg.sub2, %335:av_32, %subreg.sub3
 %909:areg_128_align2 = REG_SEQUENCE %328:av_32, %subreg.sub0, %329:av_32, %subreg.sub1, %330:av_32, %subreg.sub2, %331:av_32, %subreg.sub3
...
 %988:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %55:av_128_align2, %71:av_128_align2, %908:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
  %989:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %54:av_128_align2, %71:av_128_align2, %909:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
``` 

The end effect of all this is the matmul kernel for a test shape of 8Kx8Kx8K goes from 0.85ms to over 15ms. The inner loop is has several scratch reads and copies between vgpr/agpr:
```assembly
        v_mfma_f32_16x16x32_bf16 a[216:219], v[162:165], v[54:57], a[216:219]
        s_nop 4
        v_accvgpr_read_b32 v241, a135
        v_accvgpr_read_b32 v240, a134
        v_accvgpr_read_b32 v239, a133
        v_mfma_f32_16x16x32_bf16 a[152:155], v[162:165], v[46:49], a[172:175]
        v_accvgpr_read_b32 v238, a132
 v_mfma_f32_16x16x32_bf16 a[128:131], v[162:165], v[50:53], a[128:131]
 scratch_load_dwordx4 v[166:169], off, off offset:144
``` 

cc @jayfoad

_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs

[llvm-bugs] [Bug 169552] [AMDGPU] regression from `SIFixSGPRCopies::processPHINode` change

Reply via email to