[kwin] [Bug 479845] New: The current Wayland GPU recovery experience (AMD) is not ideal

fililip Mon, 15 Jan 2024 05:50:45 -0800

https://bugs.kde.org/show_bug.cgi?id=479845


            Bug ID: 479845
           Summary: The current Wayland GPU recovery experience (AMD) is
                    not ideal
    Classification: Plasma
           Product: kwin
           Version: 5.92.0
          Platform: Arch Linux
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: wayland-generic
          Assignee: kwin-bugs-n...@kde.org
          Reporter: t...@nitrosubs.live
  Target Milestone: ---

SUMMARY
GPU recovery on Wayland (amdgpu) now either takes too long, fails to recover
(forcing a compositor restart), or hangs the input system, which forces me to
SSH into the machine and SIGKILL kwin_wayland to restart the compositor.

That is with one display attached (1080p 165Hz VRR). With two (1080p 165Hz VRR
and 1080p 60Hz non-VRR), the compositor recovers only very rarely. It either
hangs the input system (requiring SSH access) or restarts itself, which makes
apps lose progress (presumably those that do not support compositor handoff,
since Konsole stays up just fine). When that happens, dmesg shows no gfxhub
page faults, but it does show two gfx timeouts and DRM commit failures.

What's more, it used to work just fine on KWin 5.27.5 and Mesa 23.2 -
everything happened fast enough, and it worked fine even with two displays.
(Unfortunately, back then it was possible for a faulty app to reset the card in
a way that did not aid recovery, and there was some kind of VRAM leak.)

dmesg log after the reset completes (one display):
[  377.569608] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread
kwin_wayla:cs0 pid 4068)
[  377.569613] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address
0x0000800010000000 from client 0x1b (UTCL2)
[  377.569615] amdgpu 0000:0b:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[  377.569616] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: SQC
(data) (0xa)
[  377.569617] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x1
[  377.569617] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  377.569618] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  377.569618] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  377.569619] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  377.569622] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread
kwin_wayla:cs0 pid 4068)
[  377.569624] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address
0x0000800010000000 from client 0x1b (UTCL2)
[  377.569625] amdgpu 0000:0b:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  377.569625] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB
(0x0)
[  377.569626] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  377.569626] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  377.569627] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  377.569627] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  377.569628] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  377.569631] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread
kwin_wayla:cs0 pid 4068)
[  377.569632] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address
0x0000800010000000 from client 0x1b (UTCL2)
[  377.569633] amdgpu 0000:0b:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  377.569634] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB
(0x0)
[  377.569634] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  377.569635] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  377.569635] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  377.569636] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  377.569636] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.011857] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread
kwin_wayla:cs0 pid 4068)
[  388.011882] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address
0x0000800010000000 from client 0x1b (UTCL2)
[  388.011894] amdgpu 0000:0b:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[  388.011900] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: SQC
(data) (0xa)
[  388.011905] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x1
[  388.011909] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.011913] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  388.011916] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.011919] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.011932] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread
kwin_wayla:cs0 pid 4068)
[  388.011942] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address
0x0000800010000000 from client 0x1b (UTCL2)
[  388.011949] amdgpu 0000:0b:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  388.011953] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB
(0x0)
[  388.011958] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  388.011961] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.011965] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  388.011968] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.011971] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.011980] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread
kwin_wayla:cs0 pid 4068)
[  388.011988] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address
0x0000800010000000 from client 0x1b (UTCL2)
[  388.011993] amdgpu 0000:0b:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  388.011997] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB
(0x0)
[  388.012001] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  388.012004] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.012007] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  388.012010] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.012013] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.012022] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0
ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread
kwin_wayla:cs0 pid 4068)
[  388.012029] amdgpu 0000:0b:00.0: amdgpu:   in page starting at address
0x0000800010000000 from client 0x1b (UTCL2)
[  388.012034] amdgpu 0000:0b:00.0: amdgpu:
GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  388.012037] amdgpu 0000:0b:00.0: amdgpu:      Faulty UTCL2 client ID: CB/DB
(0x0)
[  388.012040] amdgpu 0000:0b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  388.012044] amdgpu 0000:0b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  388.012047] amdgpu 0000:0b:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[  388.012050] amdgpu 0000:0b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  388.012053] amdgpu 0000:0b:00.0: amdgpu:      RW: 0x0
[  388.012062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0
timeout, but soft recovered
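
For reading the raw GCVM_L2_PROTECTION_FAULT_STATUS values in the log above, here
is a minimal Python sketch that splits the register into the fields amdgpu
prints. The bit layout in `FIELDS` is an assumption (the report does not state
it); it was picked because it reproduces the decoded values shown above for
0x00201431. The name `decode_fault_status` is hypothetical.

```python
# Assumed (shift, width) layout of GCVM_L2_PROTECTION_FAULT_STATUS.
# Not taken from the report; chosen because it reproduces the fields
# amdgpu printed next to 0x00201431 in the log above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # the "Faulty UTCL2 client ID"
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict[str, int]:
    """Extract each bit field from a raw fault status value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    for name, value in decode_fault_status(0x00201431).items():
        print(f"{name}: {value:#x}")
```

Under this layout, the first fault in each burst decodes to client ID 0xa with
PERMISSION_FAULTS 0x3, and the follow-up entries with status 0x00000000 decode
to all-zero fields, matching what the driver logged.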

With two displays, the GFX ring rarely soft-recovers. When soft recovery fails
(alongside DRM/CRTC commit failure messages), an additional reset follows that
destroys the session entirely, and SSHing into the machine is then *required*.

STEPS TO REPRODUCE
1. Start the Plasma desktop
2. Invoke `sudo cat /sys/kernel/debug/dri/X/amdgpu_gpu_recover` or get an app
to cause a gfx timeout
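
To put a number on "too slowly", step 2 can be wrapped in a small timer. This
is only a sketch: `timed_read` is a hypothetical helper, the debugfs path must
be passed in full (the `X` above depends on the system), it needs root, and it
assumes the read blocks until the kernel-side reset finishes. It does not
measure how long kwin_wayland itself takes to come back.

```python
import sys
import time
from pathlib import Path


def timed_read(path: str) -> tuple[float, str]:
    """Read a file once and return (elapsed seconds, contents)."""
    start = time.monotonic()
    text = Path(path).read_text()
    return time.monotonic() - start, text


if __name__ == "__main__" and len(sys.argv) == 2:
    # Pass the full path of the amdgpu_gpu_recover node from step 2.
    # Reading that node triggers a real GPU reset, so save your work first.
    elapsed, text = timed_read(sys.argv[1])
    print(f"read returned after {elapsed:.2f} s: {text.strip()}")
```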

OBSERVED RESULT
Recovery either occurs too slowly or doesn't occur at all, as described in
"Summary".

EXPECTED RESULT
The desktop should recover correctly and all robust apps should continue to
function.

SOFTWARE/OS VERSIONS
Operating System: Arch Linux 
KDE Plasma Version: 5.92.0
KDE Frameworks Version: 5.248.0
Qt Version: 6.7.0
Kernel Version: 6.7.0-zen3-1-zen (64-bit)
Graphics Processor: AMD Radeon RX 6600 XT

ADDITIONAL INFORMATION
This Mesa issue that I reported around 2 months ago might be relevant:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/10124, but I'm unsure whether
it's actually a Mesa problem, or a KWin problem.

Additionally, the Xwayland hang caused by GPU resets
(https://bugs.kde.org/show_bug.cgi?id=475322) is still present (running vkcube
after a reset hangs the session), but that is its own issue.

-- 
You are receiving this mail because:
You are watching all bug changes.
