Hello amd-gfx,

I am reporting a reproducible amdgpu failure on gfx1150 (Strix / Radeon 880M/890M) observed on Linux 6.19-rc2. The issue appears to be a real GPU VM / illegal access fault that reliably escalates into an unrecoverable reset.
This is not related to ROCm or user compute workloads.

---

Hardware:

* APU: AMD Strix (gfx1150)
* GPU: Radeon 880M / 890M (integrated)
* SMU: smu_v14_0_0
* Platform: x86_64 desktop
* Firmware: standard linux-firmware (no custom blobs)

Kernel:

* Linux 6.19.0-rc2
* amdgpu built as module
* DRM AMD DC enabled
* Default kernel configuration for modern AMD APU (no unusual options)

---

Observed failure (6.19-rc2):

During a long-running but otherwise normal graphics/compute workload, the kernel logs the following:

```
amdgpu 0000:c5:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0)
amdgpu 0000:c5:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
amdgpu 0000:c5:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
amdgpu 0000:c5:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
amdgpu 0000:c5:00.0: amdgpu: WALKER_ERROR: 0x1
amdgpu 0000:c5:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:c5:00.0: amdgpu: MAPPING_ERROR: 0x1
```

Shortly after, MES stops responding:

```
amdgpu: MES failed to respond to msg=MISC (WAIT_REG_MEM)
amdgpu: failed to reg_write_reg_wait
```

The driver then attempts recovery/reset.

Reset / recovery behavior:

On gfx1150, recovery is not survivable:

* VPE queue reset fails
* Driver falls back to MODE2 reset
* SMU resumes successfully
* MES fails to respond when re-adding queues
* gfx_v11_0 resume fails with -110 (ETIMEDOUT)

Example reset log excerpt (also reproducible on 6.17.x):

```
amdgpu: GPU reset begin!
amdgpu: VPE queue reset failed
amdgpu: MODE2 reset
amdgpu: SMU is resumed successfully
amdgpu: MES failed to respond to msg=ADD_QUEUE
amdgpu: resume of IP block <gfx_v11_0> failed -110
amdgpu: GPU reset end with ret = -110
```

In practice this leaves the system unusable and often requires a power cycle.

---

Additional notes:

* This is reproducible on an otherwise idle system using `amd-smi reset --gpureset`.
* The same reset failure occurs on 6.17.10, so reset/recovery for gfx1150 appears incomplete independent of the 6.19 regression.
* 6.19-rc2 increases the frequency of hitting recovery due to the CPC/gfxhub illegal access fault.
* This report focuses on the *trigger* (illegal access / page fault), not the reset issue itself.

---

Summary:

* The gfxhub CPC page fault at VA 0x0 appears to be a real bug in 6.19-rc2.
* Any recovery attempt on gfx1150 currently escalates into an unrecoverable state.
* Avoiding recovery (e.g. by disabling CWSR) avoids crashes but masks the underlying fault.

Please let me know if additional traces, bisect testing, or instrumentation would be helpful.

Thank you for your time.

Best regards,
Harris Landgarten
516 643-1286
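
---

Appendix (fault status decode):

For anyone cross-checking the page fault log in this report, the raw GCVM_L2_PROTECTION_FAULT_STATUS value can be split into its fields by hand. The following is only a minimal Python sketch: the bit positions in `FIELDS` are an assumption made for illustration and have not been checked against the gfx1150 register headers, and the helper name `decode_fault_status` is hypothetical. Under that assumed layout, 0x00000B32 yields client ID 0x5, WALKER_ERROR 0x1, PERMISSION_FAULTS 0x3 and MAPPING_ERROR 0x1, which agrees with the decoded lines the driver printed.

```python
# Assumed (shift, width) layout of GCVM_L2_PROTECTION_FAULT_STATUS.
# Illustrative only: not taken from the gfx1150 register headers.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),   # UTCL2 client ID
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Extract each assumed bitfield from the raw status value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Value copied from the fault log in this report.
    for name, value in decode_fault_status(0x00000B32).items():
        print(f"{name}: {value:#x}")
```

If the assumed layout is wrong for this ASIC, the fields printed by the driver itself in the log remain the authoritative values.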
