| Issue | 169552 |
| Summary | [AMDGPU] regression from `SIFixSGPRCopies::processPHINode` change |
| Labels | new issue |
| Assignees | |
| Reporter | raiseirql |
The recent change to `SIFixSGPRCopies::processPHINode` in #169038 addressed the problem we reported in #168761, but it introduced a regression in our matmul kernel: PHI nodes that previously had the `av_128` register class are now converted to `vreg_128`, with copies inserted to move the values back into `areg_128` registers.
I put a repro case of the kernel I'm looking at here: https://gist.github.com/raiseirql/ebe539468b5a741ff561a85832205a3c
Should the change in #169038 be refined, or should a later pass have cleaned up the codegen to remove the copies?
Note that, due to other recent backend changes, we have needed to apply the patch below to get AGPRs used at all. @arsenm has been looking into how to fix this properly.
```
diff --git a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
index b398db4f7caf..42af5c19cae2 100644
--- a/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
@@ -90,7 +90,11 @@ SIMachineFunctionInfo::SIMachineFunctionInfo(const Function &F,
     auto [MinNumAGPRAttr, MaxNumAGPRAttr] =
         AMDGPU::getIntegerPairAttribute(F, "amdgpu-agpr-alloc", {~0u, ~0u},
                                         /*OnlyFirstRequired=*/true);
-    MinNumAGPRs = MinNumAGPRAttr;
+    MinNumAGPRs = ~0u;
+    if (MFMAVGPRForm ||
+        (ST.getMaxNumVGPRs(F) <= ST.getAddressableNumArchVGPRs() &&
+         !mayUseAGPRs(F)))
+      MinNumAGPRs = 0;
   }
   if (AMDGPU::isChainCC(CC)) {
```
Here is the matmul inner loop before `si-fix-sgpr-copies`; the same basic pattern remains after `si-fix-sgpr-copies` when #169038 is not applied:
```llvm
bb.2 (%ir-block.634):
; predecessors: %bb.1
...
%706:areg_128_align2 = PHI %352:av_128_align2, %bb.1
%707:areg_128_align2 = PHI %353:av_128_align2, %bb.1
...
%1147:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %778:av_128_align2, %770:av_128_align2, %706:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%1148:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %779:av_128_align2, %770:av_128_align2, %707:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
```
After `si-fix-sgpr-copies` with #169038:
```llvm
%2483:vgpr_32 = COPY %791:av_32, implicit $exec
%2485:vgpr_32 = COPY %791:av_32, implicit $exec
%2487:vgpr_32 = COPY %791:av_32, implicit $exec
%2489:vgpr_32 = COPY %791:av_32, implicit $exec
%2491:vgpr_32 = COPY %791:av_32, implicit $exec
%2493:vgpr_32 = COPY %791:av_32, implicit $exec
%2495:vgpr_32 = COPY %791:av_32, implicit $exec
%2497:vgpr_32 = COPY %791:av_32, implicit $exec
bb.1 (%ir-block.184):
; predecessors: %bb.0, %bb.1
successors: %bb.2(0x04000000), %bb.1(0x7c000000); %bb.2(3.12%), %bb.1(96.88%)
...
%328:av_32 = PHI %2483:vgpr_32, %bb.0, %2484:vgpr_32, %bb.1
%329:av_32 = PHI %2485:vgpr_32, %bb.0, %2486:vgpr_32, %bb.1
%330:av_32 = PHI %2487:vgpr_32, %bb.0, %2488:vgpr_32, %bb.1
%331:av_32 = PHI %2489:vgpr_32, %bb.0, %2490:vgpr_32, %bb.1
%332:av_32 = PHI %2491:vgpr_32, %bb.0, %2492:vgpr_32, %bb.1
%333:av_32 = PHI %2493:vgpr_32, %bb.0, %2494:vgpr_32, %bb.1
%334:av_32 = PHI %2495:vgpr_32, %bb.0, %2496:vgpr_32, %bb.1
%335:av_32 = PHI %2497:vgpr_32, %bb.0, %2498:vgpr_32, %bb.1
...
%908:areg_128_align2 = REG_SEQUENCE %332:av_32, %subreg.sub0, %333:av_32, %subreg.sub1, %334:av_32, %subreg.sub2, %335:av_32, %subreg.sub3
%909:areg_128_align2 = REG_SEQUENCE %328:av_32, %subreg.sub0, %329:av_32, %subreg.sub1, %330:av_32, %subreg.sub2, %331:av_32, %subreg.sub3
...
%988:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %55:av_128_align2, %71:av_128_align2, %908:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%989:areg_128_align2 = V_MFMA_F32_16X16X32_BF16_mac_e64 %54:av_128_align2, %71:av_128_align2, %909:areg_128_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
```
The net effect of all this is that the matmul kernel, for a test shape of 8Kx8Kx8K, goes from 0.85ms to over 15ms. The inner loop now has several scratch reads and copies between VGPRs and AGPRs:
```assembly
v_mfma_f32_16x16x32_bf16 a[216:219], v[162:165], v[54:57], a[216:219]
s_nop 4
v_accvgpr_read_b32 v241, a135
v_accvgpr_read_b32 v240, a134
v_accvgpr_read_b32 v239, a133
v_mfma_f32_16x16x32_bf16 a[152:155], v[162:165], v[46:49], a[172:175]
v_accvgpr_read_b32 v238, a132
v_mfma_f32_16x16x32_bf16 a[128:131], v[162:165], v[50:53], a[128:131]
scratch_load_dwordx4 v[166:169], off, off offset:144
```
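For scale, here is a quick back-of-the-envelope throughput estimate from the numbers above (assuming the standard 2*M*N*K FLOP count for a dense matmul; the target GPU is not specified, so these are device-agnostic figures):

```python
# Rough throughput estimate for the 8Kx8Kx8K matmul regression.
M = N = K = 8192
flops = 2 * M * N * K  # standard dense-matmul FLOP count

t_before = 0.85e-3  # seconds per kernel, before the regression
t_after = 15e-3     # seconds per kernel, after the regression

tflops_before = flops / t_before / 1e12
tflops_after = flops / t_after / 1e12
slowdown = t_after / t_before

print(f"before: {tflops_before:.0f} TFLOP/s")
print(f"after:  {tflops_after:.0f} TFLOP/s")
print(f"slowdown: {slowdown:.1f}x")
```

That is roughly a 17.6x slowdown, consistent with the inner loop spilling accumulators to scratch instead of keeping them resident in AGPRs.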
cc @jayfoad
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs