On Fri, 05 Apr 2024 11:36:32 -0600 Ivan Stanton <northivanas...@gmail.com> wrote:
> Package: libgl1-mesa-dri
> Version: 23.3.5-1
> Severity: important
>
> Dear Maintainer,
>
> I and some others have been unable to play 3D games or run GPU-intensive
> software on the Framework Laptop 13 AMD 7040 Edition due to GPU resets
> occurring while doing so. I've previously reported this to the Framework
> Community forums:
>
> https://community.frame.work/t/solved-debian-12-on-laptop-13-ryzen-7640u-gpu-hangs-in-some-games/
>
> And others have reported similar issues:
>
> https://community.frame.work/t/vram-is-lost-due-to-gpu-reset-followed-by-a-crash/
>
> * What led up to the situation?
>
> I attempted to play the Steam version of Garry's Mod. This also occurred
> with The Stanley Parable: Ultra Deluxe (Steam), DSDA Doom (from the Debian
> repo) and Xonotic (from Flathub). All 3D games seem to be affected, and
> possibly other GPU-intensive applications.
>
> * What exactly did you do (or not do) that was effective (or
>   ineffective)?
>
> I first encountered this bug on bookworm, with mesa 22.3.6-1+deb12u1.
> Upgrading linux-firmware, both from upstream and from testing, had no
> effect. Upgrading the kernel from backports had no effect. Upgrading mesa,
> using the packages from trixie, made the crashes less frequent but did not
> resolve the issue. After some A/B testing, the crash seems to be resolved
> only by both upgrading mesa and setting the kernel parameter
> amdgpu.sg_display=0, which, judging by the kernel documentation, I should
> not have to set unless there is a bug. It would also be nice to get this
> fixed for Debian Stable users, if possible.
>
> * What was the outcome of this action?
>
> A few seconds into the game, the display froze (though audio kept
> playing). After a few seconds, it flickered and the graphics became
> partially corrupted. About a minute later, I was kicked to the login
> screen.
>
> * What outcome did you expect instead?
>
> The game keeps playing without any graphical glitches or freezes.
>
> I'm not an expert on the GNU/Linux graphics stack and I haven't reported a
> bug to Debian in a while, so apologies if I got something wrong.
>
> Here's an extract of dmesg from one occurrence of the bug:
>
> [ 62.824231] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32787, for process dsda-doom pid 2910 thread dsda-doom:cs0 pid 2926)
> [ 62.824267] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x000000409b40c000 from client 10
> [ 62.824285] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
> [ 62.824297] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
> [ 62.824310] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x0
> [ 62.824321] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x0
> [ 62.824331] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
> [ 62.824340] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x0
> [ 62.824349] amdgpu 0000:c1:00.0: amdgpu: RW: 0x0
> [ 72.941268] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0
I haven't been able to reproduce this exact behavior since upgrading to Framework's BIOS version 3.05b and disabling all of my previous workarounds, but I did get a similar log from a regular application crash:

[75883.804346] amdgpu 0000:c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32807, for process Discord pid 8547 thread Discord:cs0 pid 8579)
[75883.804356] amdgpu 0000:c1:00.0: amdgpu: in page starting at address 0x00004d023e345000 from client 10
[75883.804359] amdgpu 0000:c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS: 0x00201430
[75883.804361] amdgpu 0000:c1:00.0: amdgpu: Faulty UTCL2 client ID: SQC (data) (0xa)
[75883.804363] amdgpu 0000:c1:00.0: amdgpu: MORE_FAULTS: 0x0
[75883.804365] amdgpu 0000:c1:00.0: amdgpu: WALKER_ERROR: 0x0
[75883.804368] amdgpu 0000:c1:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[75883.804370] amdgpu 0000:c1:00.0: amdgpu: MAPPING_ERROR: 0x0
[75883.804371] amdgpu 0000:c1:00.0: amdgpu: RW: 0x0
[75893.925804] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

In this case, KWin reloaded after a graphics reset instead of logging me out, and I did not experience any freezing.
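For comparing the two logs above, the raw GCVM_L2_PROTECTION_FAULT_STATUS words can be split into the fields the driver prints. This is a minimal Python sketch, assuming the GFX10/GFX11 bit layout of that register from the amdgpu kernel headers (the masks are an assumption, not taken from this report); `decode_fault_status` and `CLIENT_NAMES` are hypothetical helper names for illustration only.

```python
"""Decode an amdgpu GCVM_L2_PROTECTION_FAULT_STATUS value.

Assumption: the field masks below follow the GFX10/GFX11 register layout
in the amdgpu kernel headers; they are not taken from this bug report.
"""

# (field name, mask, shift) for the assumed register layout
FIELDS = [
    ("MORE_FAULTS", 0x00000001, 0),
    ("WALKER_ERROR", 0x0000000E, 1),
    ("PERMISSION_FAULTS", 0x000000F0, 4),
    ("MAPPING_ERROR", 0x00000100, 8),
    ("CID", 0x0000FE00, 9),
    ("RW", 0x00040000, 18),
    ("VMID", 0x00F00000, 20),
]

# Only the client IDs that appear in the two logs; the driver's table is larger.
CLIENT_NAMES = {0x8: "TCP", 0xA: "SQC (data)"}


def decode_fault_status(status: int) -> dict:
    """Split a 32-bit status word into its named fields (hypothetical helper)."""
    return {name: (status & mask) >> shift for name, mask, shift in FIELDS}


if __name__ == "__main__":
    # Status words copied from the dsda-doom and Discord logs.
    for status in (0x00601030, 0x00201430):
        fields = decode_fault_status(status)
        client = CLIENT_NAMES.get(fields["CID"], "unknown")
        print(f"0x{status:08x}: {fields} client={client}")
```

Under that assumed layout, both words decode to PERMISSION_FAULTS = 0x3 with no walker or mapping error, and the VMID field agrees with the vmid printed in each page-fault line (6 and 2); only the faulting client differs (TCP vs. SQC data).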