Bug#1068467: libgl1-mesa-dri: GPU hangs and resets while playing 3D games on Framework Laptop 13, AMD Ryzen 7640U

2024-04-16 Thread Ivan Stanton
On Fri, 05 Apr 2024 11:36:32 -0600 Ivan Stanton wrote:
> Package: libgl1-mesa-dri
> Version: 23.3.5-1
> Severity: important
> 
> Dear Maintainer,
> 
> I and some others have been unable to play 3D games or run GPU-intensive
> software on the Framework Laptop 13 AMD 7040 Edition due to GPU resets
> occurring while doing so. I've previously reported this to the Framework
> Community forums:
> 
> https://community.frame.work/t/solved-debian-12-on-laptop-13-ryzen-7640u-gpu-hangs-in-some-games/
> 
> And others have reported similar issues:
> 
> https://community.frame.work/t/vram-is-lost-due-to-gpu-reset-followed-by-a-crash/
> 
>* What led up to the situation? I attempted to play the Steam version of
> Garry's Mod. This also occurred with The Stanley Parable: Ultra Deluxe (Steam),
> DSDA Doom (from the Debian repo) and Xonotic (from flathub). All 3D games seem
> to be affected, and possibly other GPU-intensive applications.
>* What exactly did you do (or not do) that was effective (or
>  ineffective)? I first encountered this bug on bookworm, with mesa
> 22.3.6-1+deb12u1. Upgrading linux-firmware, both from upstream and from
> testing, had no effect. Upgrading the kernel from backports had no effect.
> Upgrading mesa, using the packages from trixie, made the crashes less frequent
> but did not resolve the issue. After some A/B testing, the crash seems to be
> resolved only by both upgrading mesa and setting the kernel parameter
> amdgpu.sg_display=0, which judging by the kernel documentation, I should not
> have to set unless there is a bug. It would also be nice to get this fixed for
> Debian Stable users, if possible.
>* What was the outcome of this action? A few seconds into the game, the
> display froze (though audio kept playing). After a few seconds, it flickered
> and the graphics became partially corrupted. About a minute later, I was 
> kicked to the login screen.
>* What outcome did you expect instead? Game continues playing without any
> graphical glitches or freezes.
> 
> I'm not an expert on the GNU/Linux graphics stack and I haven't reported a bug
> to Debian in a while, so apologies if I got something wrong.
> 
> Here's an extract of dmesg from one occurrence of the bug:
> 
> [   62.824231] amdgpu :c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32787, for process dsda-doom pid 2910 thread dsda-doom:cs0 pid 2926)
> [   62.824267] amdgpu :c1:00.0: amdgpu:   in page starting at address 0x00409b40c000 from client 10
> [   62.824285] amdgpu :c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
> [   62.824297] amdgpu :c1:00.0: amdgpu:  Faulty UTCL2 client ID: TCP (0x8)
> [   62.824310] amdgpu :c1:00.0: amdgpu:  MORE_FAULTS: 0x0
> [   62.824321] amdgpu :c1:00.0: amdgpu:  WALKER_ERROR: 0x0
> [   62.824331] amdgpu :c1:00.0: amdgpu:  PERMISSION_FAULTS: 0x3
> [   62.824340] amdgpu :c1:00.0: amdgpu:  MAPPING_ERROR: 0x0
> [   62.824349] amdgpu :c1:00.0: amdgpu:  RW: 0x0
> [   72.941268] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0

I haven't been able to replicate this exact behavior since upgrading to 
Framework's BIOS version 3.05b and disabling all of my previous workarounds, 
but I did get this log from a regular app crash that was similar:

[75883.804346] amdgpu :c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:2 pasid:32807, for process Discord pid 8547 thread Discord:cs0 pid 8579)
[75883.804356] amdgpu :c1:00.0: amdgpu:   in page starting at address 0x4d023e345000 from client 10
[75883.804359] amdgpu :c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201430
[75883.804361] amdgpu :c1:00.0: amdgpu:  Faulty UTCL2 client ID: SQC (data) (0xa)
[75883.804363] amdgpu :c1:00.0: amdgpu:  MORE_FAULTS: 0x0
[75883.804365] amdgpu :c1:00.0: amdgpu:  WALKER_ERROR: 0x0
[75883.804368] amdgpu :c1:00.0: amdgpu:  PERMISSION_FAULTS: 0x3
[75883.804370] amdgpu :c1:00.0: amdgpu:  MAPPING_ERROR: 0x0
[75883.804371] amdgpu :c1:00.0: amdgpu:  RW: 0x0
[75893.925804] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

In this case, KWin reloaded due to a graphics reset, instead of logging me 
out, and I did not experience freezing.
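
For anyone comparing fault logs, here is a minimal Python sketch that splits a GCVM_L2_PROTECTION_FAULT_STATUS word into the fields the driver prints underneath it. The bit positions in FIELDS are an assumption: they were chosen because they reproduce the values shown in the two logs above (PERMISSION_FAULTS 0x3, client IDs 0x8 and 0xa, vmid 6 and 2) and have not been checked against the register headers for this GPU. The function name decode_fault_status is made up for the example.

```python
# Assumed layout of GCVM_L2_PROTECTION_FAULT_STATUS as (low bit, width).
# These positions are inferred from the decoded values in the logs above,
# not taken from the amdgpu register headers.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),    # the "Faulty UTCL2 client ID" number
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status: int) -> dict:
    """Extract each assumed bitfield from a raw fault status word."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # The two status words quoted in this thread.
    for raw in (0x00601030, 0x00201430):
        fields = decode_fault_status(raw)
        print(f"{raw:#010x}: " + ", ".join(f"{k}={v:#x}" for k, v in fields.items()))
```

Under that layout both faults are permission faults on a read (RW 0x0) with no walker or mapping error. They differ only in the faulting client and the VMID.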



2024-04-05 Thread Ivan Stanton
Package: libgl1-mesa-dri
Version: 23.3.5-1
Severity: important

Dear Maintainer,

I and some others have been unable to play 3D games or run GPU-intensive
software on the Framework Laptop 13 AMD 7040 Edition due to GPU resets
occurring while doing so. I've previously reported this to the Framework
Community forums:

https://community.frame.work/t/solved-debian-12-on-laptop-13-ryzen-7640u-gpu-hangs-in-some-games/

And others have reported similar issues:

https://community.frame.work/t/vram-is-lost-due-to-gpu-reset-followed-by-a-crash/

   * What led up to the situation? I attempted to play the Steam version of
Garry's Mod. This also occurred with The Stanley Parable: Ultra Deluxe (Steam),
DSDA Doom (from the Debian repo) and Xonotic (from flathub). All 3D games seem
to be affected, and possibly other GPU-intensive applications.
   * What exactly did you do (or not do) that was effective (or
 ineffective)? I first encountered this bug on bookworm, with mesa
22.3.6-1+deb12u1. Upgrading linux-firmware, both from upstream and from
testing, had no effect. Upgrading the kernel from backports had no effect.
Upgrading mesa, using the packages from trixie, made the crashes less frequent
but did not resolve the issue. After some A/B testing, the crash seems to be
resolved only by both upgrading mesa and setting the kernel parameter
amdgpu.sg_display=0, which judging by the kernel documentation, I should not
have to set unless there is a bug. It would also be nice to get this fixed for
Debian Stable users, if possible.
   * What was the outcome of this action? A few seconds into the game, the
display froze (though audio kept playing). After a few seconds, it flickered
and the graphics became partially corrupted. About a minute later, I was 
kicked to the login screen.
   * What outcome did you expect instead? Game continues playing without any
graphical glitches or freezes.
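
When A/B testing the workaround above, it helps to confirm that amdgpu.sg_display=0 is actually in effect on the running system. The following is a small Python sketch, not an official tool. It looks for the parameter on the kernel command line in /proc/cmdline and then tries to read /sys/module/amdgpu/parameters/sg_display. That sysfs path is an assumption about how the running kernel exposes the module parameter, and the helper names are made up for the example.

```python
from pathlib import Path


def sg_display_from_cmdline(cmdline: str):
    """Return the amdgpu.sg_display value found in a command-line string, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "amdgpu.sg_display":
            value = val  # keep the last occurrence seen
    return value


def read_text_or_none(path: str):
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("kernel command line:", sg_display_from_cmdline(cmdline))
    # Assumed sysfs location; prints None if the kernel does not expose it.
    print("module parameter   :",
          read_text_or_none("/sys/module/amdgpu/parameters/sg_display"))
```

If the first value is None, the parameter was not passed at boot. A module options file could still set it, which is why the second check is included.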

I'm not an expert on the GNU/Linux graphics stack and I haven't reported a bug 
to Debian in a while, so apologies if I got something wrong.

Here's an extract of dmesg from one occurrence of the bug:

[   62.824231] amdgpu :c1:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:6 pasid:32787, for process dsda-doom pid 2910 thread dsda-doom:cs0 pid 2926)
[   62.824267] amdgpu :c1:00.0: amdgpu:   in page starting at address 0x00409b40c000 from client 10
[   62.824285] amdgpu :c1:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00601030
[   62.824297] amdgpu :c1:00.0: amdgpu:  Faulty UTCL2 client ID: TCP (0x8)
[   62.824310] amdgpu :c1:00.0: amdgpu:  MORE_FAULTS: 0x0
[   62.824321] amdgpu :c1:00.0: amdgpu:  WALKER_ERROR: 0x0
[   62.824331] amdgpu :c1:00.0: amdgpu:  PERMISSION_FAULTS: 0x3
[   62.824340] amdgpu :c1:00.0: amdgpu:  MAPPING_ERROR: 0x0
[   62.824349] amdgpu :c1:00.0: amdgpu:  RW: 0x0
[   72.941268] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[   83.446602] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=7073, emitted seq=7075
[   83.447891] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process dsda-doom pid 2910 thread dsda-doom:cs0 pid 2926
[   83.448887] amdgpu :c1:00.0: amdgpu: GPU reset begin!
[   83.729405] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   83.730483] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   83.949833] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   83.950689] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   84.169971] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   84.170799] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   84.390063] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   84.390888] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   84.610016] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   84.610932] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   84.828847] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   84.830204] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   85.048322] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   85.049271] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[   85.267011] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[   85.268422] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]]