https://bugs.freedesktop.org/show_bug.cgi?id=109403
Bug ID: 109403
Summary: amdgpu randomly hangs while streaming or when CPU is
busy on X399 with TR 1950X
Product: DRI
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: NEW
Severity: normal
Priority: medium
Component: DRM/AMDgpu
Assignee: dri-devel@lists.freedesktop.org
Reporter: 1...@provod.gl
I've been experiencing random GPU hangs since I upgraded to Threadripper about
a year ago.
Specs:
- Motherboard: ASUS Prime X399-A, all BIOS versions from stock up to the current
0808
- CPU: Threadripper 1950X, 32 threads
- GPU: MSI Radeon RX Vega 64 Air Boost 8G OC (the hangs also happened with an ASUS R9
Fury X in the same machine; this GPU was generally stable in the previous box)
- Displays:
- 2x DELL U2412M 1920x1200x60 (DP)
- 1x ASUS MG279Q 2560x1440x144 (DP)
- Kernel versions: 4.20, 5.0-rc2 (has been happening since at least 4.14;
earlier versions weren't tried).
- linux-firmware: 20181218
- Mesa: 18.3.1
- X: 1.20.3
- libdrm: 2.4.96
- Possibly relevant kernel options: amd_iommu=on
vfio-pci.ids=10de:1005,10de:0e1a,1912:0014,1106:3483 iommu=pt
vfio-pci.disable_vga=1 hpet=disable nohpet amdgpu.ppfeaturemask=0xfffd7fff
amdgpu.gpu_recovery=1 pcie_aspm=off
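For anyone trying to compare setups, here is a rough sketch of how the options above can be split apart and how the ppfeaturemask value can be decoded into cleared bit positions. The helper names (parse_cmdline, cleared_bits) are made up for illustration, and the 32-bit width is an assumption. Which power features those bits correspond to depends on the kernel version, so the sketch does not try to name them.

```python
# Sketch only: split a kernel command line into options and list which
# bits of amdgpu.ppfeaturemask are cleared. Helper names are hypothetical.

# The options exactly as listed in this report.
CMDLINE = (
    "amd_iommu=on vfio-pci.ids=10de:1005,10de:0e1a,1912:0014,1106:3483 "
    "iommu=pt vfio-pci.disable_vga=1 hpet=disable nohpet "
    "amdgpu.ppfeaturemask=0xfffd7fff amdgpu.gpu_recovery=1 pcie_aspm=off"
)


def parse_cmdline(cmdline):
    """Return a dict of option -> value (None for bare flags like nohpet)."""
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


def cleared_bits(mask, width=32):
    """Return the bit positions that are 0 in a mask of the given width."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


options = parse_cmdline(CMDLINE)
mask = int(options["amdgpu.ppfeaturemask"], 16)
print(cleared_bits(mask))  # prints [15, 17]
```

On a running system the same string could be read from /proc/cmdline instead of being hard-coded.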
The problem usually manifests itself like this:
1. Screen suddenly freezes (sometimes it is possible to move the mouse cursor for a
few seconds, but it eventually freezes too)
2. GPU fan speeds up and remains high
3. Every process that talks to GPU freezes and becomes impossible to kill.
4. I can still SSH into the machine, and everything besides the GPU works OK.
5. dmesg contains a message like this:
[Jan21 00:03] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
gfx timeout, signaled seq=17188686, emitted seq=17188689
[ +0.32] [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
Process information: process X pid 9315 thread X:cs0 pid 9335
or with a bit more stuff happening before:
[Jan18 19:43] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[ +0.03] amdgpu :44:00.0: in page starting at
address 0x800010607000 from 27
[ +0.02] amdgpu :44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x0060153D
[ +0.05] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[ +0.02] amdgpu :44:00.0: in page starting at
address 0x800010609000 from 27
[ +0.01] amdgpu :44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x
[ +0.04] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[ +0.01] amdgpu :44:00.0: in page starting at
address 0x800010607000 from 27
[ +0.02] amdgpu :44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x
[ +0.04] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[ +0.01] amdgpu :44:00.0: in page starting at
address 0x800010609000 from 27
[ +0.01] amdgpu :44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x
[ +0.04] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[ +0.02] amdgpu :44:00.0: in page starting at
address 0x800010607000 from 27
[ +0.01] amdgpu :44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x
[ +0.04] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[ +0.01] amdgpu :44:00.0: in page starting at
address 0x800010609000 from 27
[ +0.01] amdgpu :44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x
[ +0.04] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[ +0.01] amdgpu :44:00.0: in page starting at
address 0x800010607000 from 27
[ +0.01] amdgpu :44:00.0:
VM_L2_PROTECTION_FAULT_STATUS:0x
[ +0.04] amdgpu :44:00.0: [gfxhub] VMC page fault
(src_id:0 ring:158 vmid:6 pasid:32771, for process superposition pid 11225
thread superposit:cs0 pid 11308)
[