https://bugs.kde.org/show_bug.cgi?id=479845
Bug ID: 479845
Summary: The current Wayland GPU recovery experience (AMD) is not ideal
Classification: Plasma
Product: kwin
Version: 5.92.0
Platform: Arch Linux
OS: Linux
Status: REPORTED
Severity: normal
Priority: NOR
Component: wayland-generic
Assignee: kwin-bugs-n...@kde.org
Reporter: t...@nitrosubs.live
Target Milestone: ---

SUMMARY
GPU recovery on Wayland (amdgpu) now either works too slowly, doesn't actually recover (forcing a compositor restart), or hangs the input system, which forces me to SSH into the machine and SIGKILL kwin_wayland to restart the compositor. That is with one display attached (1080p 165Hz VRR).

With two displays (1080p 165Hz VRR and 1080p 60Hz non-VRR), the compositor does not recover at all (or does, but very rarely) and either hangs the input system (again requiring SSH) or restarts itself, so apps lose their progress (presumably those that do not support compositor handoff, since Konsole stays up just fine). When that happens, dmesg shows no gfxhub page faults, but two gfx timeouts and DRM commit failures.

What's more, it used to work just fine on KWin 5.27.5 and Mesa 23.2 - recovery was fast enough, and it worked fine even with two displays. (Unfortunately, back then it was possible for a faulty app to reset the card in a way that did not aid recovery, and there was some kind of VRAM leak.)

dmesg log after the reset completes (one display):

[ 377.569608] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[ 377.569613] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[ 377.569615] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[ 377.569616] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[ 377.569617] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x1
[ 377.569617] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 377.569618] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 377.569618] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 377.569619] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 377.569622] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[ 377.569624] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[ 377.569625] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 377.569625] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 377.569626] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 377.569626] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 377.569627] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 377.569627] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 377.569628] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 377.569631] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[ 377.569632] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[ 377.569633] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 377.569634] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 377.569634] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 377.569635] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 377.569635] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 377.569636] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 377.569636] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 388.011857] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[ 388.011882] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[ 388.011894] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201431
[ 388.011900] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[ 388.011905] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x1
[ 388.011909] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 388.011913] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 388.011916] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 388.011919] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 388.011932] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[ 388.011942] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[ 388.011949] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 388.011953] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 388.011958] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 388.011961] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 388.011965] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 388.011968] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 388.011971] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 388.011980] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[ 388.011988] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[ 388.011993] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 388.011997] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 388.012001] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 388.012004] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 388.012007] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 388.012010] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 388.012013] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 388.012022] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32770, for process kwin_wayland pid 4036 thread kwin_wayla:cs0 pid 4068)
[ 388.012029] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x0000800010000000 from client 0x1b (UTCL2)
[ 388.012034] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 388.012037] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: CB/DB (0x0)
[ 388.012040] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
[ 388.012044] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
[ 388.012047] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x0
[ 388.012050] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 388.012053] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
[ 388.012062] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

With two displays, the GFX ring has a very low chance of soft recovery, and when that happens (alongside DRM/CRTC commit failure messages), causes an additional reset which destroys the session entirely - then it *requires* SSHing into the machine.

STEPS TO REPRODUCE
1. Start the Plasma desktop
2. Invoke `sudo cat /sys/kernel/debug/dri/X/amdgpu_gpu_recover` or get an app to cause a gfx timeout

OBSERVED RESULT
Recovery either occurs too slowly or doesn't occur at all, as described in "Summary".

EXPECTED RESULT
The desktop should recover correctly and all robust apps should continue to function.

SOFTWARE/OS VERSIONS
Operating System: Arch Linux
KDE Plasma Version: 5.92.0
KDE Frameworks Version: 5.248.0
Qt Version: 6.7.0
Kernel Version: 6.7.0-zen3-1-zen (64-bit)
Graphics Processor: AMD Radeon RX 6600 XT

ADDITIONAL INFORMATION
This Mesa issue that I reported around 2 months ago might be relevant: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10124, but I'm unsure whether it's actually a Mesa problem or a KWin problem.

Additionally, the Xwayland hang caused by GPU resets (https://bugs.kde.org/show_bug.cgi?id=475322) is still present (running vkcube after a reset hangs the session), but that is its own issue.
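
For anyone comparing logs from the one-display and two-display cases, the dmesg excerpt in this report can be tallied with a short script. The following is only a minimal sketch: the function name `summarize_amdgpu_log` is hypothetical, and both regular expressions are modeled solely on the message wording quoted in this report, so other kernel versions may need different patterns.

```python
import re
from collections import Counter

# Patterns modeled only on the log lines quoted in this report (assumption:
# other kernel versions may word these messages differently).
PAGE_FAULT = re.compile(r"\[gfxhub\] page fault \(.*?for process (\S+) pid (\d+)")
RING_TIMEOUT = re.compile(r"ring (\S+) timeout(, but soft recovered)?")


def summarize_amdgpu_log(text):
    """Count gfxhub page faults per (process, pid) and list ring timeouts.

    Each timeout is reported as (ring_name, soft_recovered).
    """
    faults = Counter()
    timeouts = []
    for line in text.splitlines():
        match = PAGE_FAULT.search(line)
        if match:
            faults[(match.group(1), int(match.group(2)))] += 1
            continue
        match = RING_TIMEOUT.search(line)
        if match:
            # Group 2 is present only when the kernel reports soft recovery.
            timeouts.append((match.group(1), match.group(2) is not None))
    return {"page_faults": dict(faults), "timeouts": timeouts}
```

Feeding it a saved log, for example `summarize_amdgpu_log(open("dmesg.txt").read())` with `dmesg.txt` as a placeholder file name, returns the page-fault count per process and whether each ring timeout was soft recovered, which makes it easier to see that the two-display case shows timeouts without page faults.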