Bug#1068467: libgl1-mesa-dri: GPU hangs and resets while playing 3D games on Framework Laptop 13, AMD Ryzen 7640U
On Fri, 05 Apr 2024 11:36:32 -0600 Ivan Stanton wrote:
> Package: libgl1-mesa-dri
> Version: 23.3.5-1
> Severity: important
>
> Dear Maintainer,
>
> I and some others have been unable to play 3D games or run GPU-intensive
> software on the Framework Laptop 13 AMD 7040 Edition due to GPU resets
> occurring while doing so. I've previously reported this to the Framework
> Community forums:
>
> https://community.frame.work/t/solved-debian-12-on-laptop-13-ryzen-7640u-gpu-hangs-in-some-games/
>
> And others have reported similar issues:
>
> https://community.frame.work/t/vram-is-lost-due-to-gpu-reset-followed-by-a-crash/
>
> * What led up to the situation?
>   I attempted to play the Steam version of Garry's Mod. This also occurred
>   with The Stanley Parable: Ultra Deluxe (Steam), DSDA Doom (from the
>   Debian repo) and Xonotic (from flathub). All 3D games seem to be
>   affected, and possibly other GPU-intensive applications.
>
> * What exactly did you do (or not do) that was effective (or ineffective)?
>   I first encountered this bug on bookworm, with mesa 22.3.6-1+deb12u1.
>   Upgrading linux-firmware, both from upstream and from testing, had no
>   effect. Upgrading the kernel from backports had no effect. Upgrading
>   mesa, using the packages from trixie, made the crashes less frequent
>   but did not resolve the issue. After some A/B testing, the crash seems
>   to be resolved only by both upgrading mesa and setting the kernel
>   parameter amdgpu.sg_display=0, which judging by the kernel
>   documentation, I should not have to set unless there is a bug. It would
>   also be nice to get this fixed for Debian Stable users, if possible.
>
> * What was the outcome of this action?
>   A few seconds into the game, the display froze (though audio kept
>   playing). After a few seconds, it flickered and the graphics became
>   partially corrupted. About a minute later, I was kicked to the login
>   screen.
>
> * What outcome did you expect instead?
>   Game continues playing without any graphical glitches or freezes.
>
> I'm not an expert on the GNU/Linux graphics stack and I haven't reported
> a bug to Debian in a while, so apologies if I got something wrong.
>
> Here's an extract of dmesg from one occurrence of the bug:
>
> [ 62.824231] amdgpu :c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32787, for process dsda-doom pid 2910 thread dsda-doom:cs0 pid 2926)
> [ 62.824267] amdgpu :c1:00.0: amdgpu: in page starting at address 0x00409b40c000 from client 10
> [ 62.824285] amdgpu :c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
> [ 62.824297] amdgpu :c1:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
> [ 62.824310] amdgpu :c1:00.0: amdgpu: MORE_FAULTS: 0x0
> [ 62.824321] amdgpu :c1:00.0: amdgpu: WALKER_ERROR: 0x0
> [ 62.824331] amdgpu :c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
> [ 62.824340] amdgpu :c1:00.0: amdgpu: MAPPING_ERROR: 0x0
> [ 62.824349] amdgpu :c1:00.0: amdgpu: RW: 0x0
> [ 72.941268] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0

I haven't been able to replicate this exact behavior since upgrading to
Framework's BIOS version 3.05b and disabling all of my previous workarounds,
but I did get this log from a regular app crash that was similar:

[75883.804346] amdgpu :c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32807, for process Discord pid 8547 thread Discord:cs0 pid 8579)
[75883.804356] amdgpu :c1:00.0: amdgpu: in page starting at address 0x4d023e345000 from client 10
[75883.804359] amdgpu :c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
[75883.804361] amdgpu :c1:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[75883.804363] amdgpu :c1:00.0: amdgpu: MORE_FAULTS: 0x0
[75883.804365] amdgpu :c1:00.0: amdgpu: WALKER_ERROR: 0x0
[75883.804368] amdgpu :c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[75883.804370] amdgpu :c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[75883.804371] amdgpu :c1:00.0: amdgpu: RW: 0x0
[75893.925804] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

In this case, KWin reloaded due to a graphics reset, instead of logging me
out, and I did not experience freezing.
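For anyone comparing their own dmesg output against the excerpts in this report, here is a rough Python sketch that pulls the faulting process, its pid, and the "Faulty UTCL2 client ID" out of such lines. The two regular expressions and the name summarize_faults are assumptions modeled only on the log lines quoted in this thread, not on any documented kernel log format, so they may need adjusting for other kernels or for process names that contain spaces.

```python
import re

# Patterns are guesses based on the dmesg excerpts in this bug report only.
FAULT_RE = re.compile(
    r"\[gfxhub\] page fault \(.*?for process (?P<process>\S+) pid (?P<pid>\d+)"
)
CLIENT_RE = re.compile(
    r"Faulty UTCL2 client ID: (?P<client>.+?) \((?P<code>0x[0-9a-fA-F]+)\)"
)


def summarize_faults(log_text):
    """Return one dict per page fault, paired with the client ID line that follows it."""
    faults = []
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append(
                {"process": match["process"], "pid": int(match["pid"]), "client": None}
            )
            continue
        match = CLIENT_RE.search(line)
        # Attach the client ID to the most recent fault that does not have one yet.
        if match and faults and faults[-1]["client"] is None:
            faults[-1]["client"] = match["client"]
    return faults


# Three lines copied from the Discord crash log above.
SAMPLE_LOG = """\
[75883.804346] amdgpu :c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32807, for process Discord pid 8547 thread Discord:cs0 pid 8579)
[75883.804356] amdgpu :c1:00.0: amdgpu: in page starting at address 0x4d023e345000 from client 10
[75883.804361] amdgpu :c1:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
"""

if __name__ == "__main__":
    for fault in summarize_faults(SAMPLE_LOG):
        print(fault)
```

Feeding it the full text of a saved dmesg dump instead of SAMPLE_LOG should list every process that triggered a fault, which may help when checking whether only games are affected.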
Bug#1068467: libgl1-mesa-dri: GPU hangs and resets while playing 3D games on Framework Laptop 13, AMD Ryzen 7640U
Package: libgl1-mesa-dri
Version: 23.3.5-1
Severity: important

Dear Maintainer,

I and some others have been unable to play 3D games or run GPU-intensive
software on the Framework Laptop 13 AMD 7040 Edition due to GPU resets
occurring while doing so. I've previously reported this to the Framework
Community forums:

https://community.frame.work/t/solved-debian-12-on-laptop-13-ryzen-7640u-gpu-hangs-in-some-games/

And others have reported similar issues:

https://community.frame.work/t/vram-is-lost-due-to-gpu-reset-followed-by-a-crash/

* What led up to the situation?
  I attempted to play the Steam version of Garry's Mod. This also occurred
  with The Stanley Parable: Ultra Deluxe (Steam), DSDA Doom (from the Debian
  repo) and Xonotic (from flathub). All 3D games seem to be affected, and
  possibly other GPU-intensive applications.

* What exactly did you do (or not do) that was effective (or ineffective)?
  I first encountered this bug on bookworm, with mesa 22.3.6-1+deb12u1.
  Upgrading linux-firmware, both from upstream and from testing, had no
  effect. Upgrading the kernel from backports had no effect. Upgrading mesa,
  using the packages from trixie, made the crashes less frequent but did not
  resolve the issue. After some A/B testing, the crash seems to be resolved
  only by both upgrading mesa and setting the kernel parameter
  amdgpu.sg_display=0, which judging by the kernel documentation, I should
  not have to set unless there is a bug. It would also be nice to get this
  fixed for Debian Stable users, if possible.

* What was the outcome of this action?
  A few seconds into the game, the display froze (though audio kept
  playing). After a few seconds, it flickered and the graphics became
  partially corrupted. About a minute later, I was kicked to the login
  screen.

* What outcome did you expect instead?
  Game continues playing without any graphical glitches or freezes.
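Since the workaround depends on amdgpu.sg_display=0 actually reaching the kernel, a quick check of the running command line can rule out a misapplied boot setting during A/B testing. The following is only a minimal Python sketch: the helper names has_kernel_param and read_cmdline are hypothetical, and it assumes the parameter was passed on the kernel command line (readable at /proc/cmdline) rather than through a modprobe configuration file.

```python
def has_kernel_param(cmdline, name, value):
    """Return True if `name=value` appears as a whole token in a kernel command line."""
    return f"{name}={value}" in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


if __name__ == "__main__":
    try:
        cmdline = read_cmdline()
    except OSError as error:
        # /proc/cmdline is Linux-specific; report and stop on other systems.
        print(f"could not read kernel command line: {error}")
    else:
        if has_kernel_param(cmdline, "amdgpu.sg_display", "0"):
            print("amdgpu.sg_display=0 is set on the kernel command line")
        else:
            print("amdgpu.sg_display=0 is NOT set on the kernel command line")
```

Matching whole whitespace-separated tokens, instead of searching for a substring, avoids a false positive on a value such as amdgpu.sg_display=01.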

I'm not an expert on the GNU/Linux graphics stack and I haven't reported a
bug to Debian in a while, so apologies if I got something wrong.

Here's an extract of dmesg from one occurrence of the bug:

[ 62.824231] amdgpu :c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32787, for process dsda-doom pid 2910 thread dsda-doom:cs0 pid 2926)
[ 62.824267] amdgpu :c1:00.0: amdgpu: in page starting at address 0x00409b40c000 from client 10
[ 62.824285] amdgpu :c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
[ 62.824297] amdgpu :c1:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[ 62.824310] amdgpu :c1:00.0: amdgpu: MORE_FAULTS: 0x0
[ 62.824321] amdgpu :c1:00.0: amdgpu: WALKER_ERROR: 0x0
[ 62.824331] amdgpu :c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 62.824340] amdgpu :c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[ 62.824349] amdgpu :c1:00.0: amdgpu: RW: 0x0
[ 72.941268] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[ 83.446602] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=7073, emitted seq=7075
[ 83.447891] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process dsda-doom pid 2910 thread dsda-doom:cs0 pid 2926
[ 83.448887] amdgpu :c1:00.0: amdgpu: GPU reset begin!
[ 83.729405] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 83.730483] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 83.949833] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 83.950689] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 84.169971] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 84.170799] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 84.390063] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 84.390888] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 84.610016] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 84.610932] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 84.828847] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 84.830204] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 85.048322] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 85.049271] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 85.267011] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 85.268422] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]]