This RFC patch series introduces a new DRM wedge recovery method 'DRM_WEDGE_RECOVERY_COLD_RESET' for handling critical errors that cannot be recovered through existing software-based mechanisms.

Background
----------
Current recovery methods (driver rebind, bus reset, FLR) are effective for
most error scenarios. However, certain critical errors affect device-level
persistent state that survives warm resets and software recovery attempts.
These errors require complete device power cycling to restore functionality.

Proposed Solution
-----------------
This series adds DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) as a new recovery
method to the DRM wedging framework. When this method is set, it signals to
userspace that only a complete device cold reset (power cycle) can restore
normal operation.

Example uevent received:

  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0

Testing
-------
The debugfs interface allows testing the cold reset recovery path:

  echo 1 > /sys/kernel/debug/dri/N/trigger_critical_error

This triggers the critical error handler, wedges the device with cold reset
method, and sends the appropriate uevent to userspace.

Cc: André Almeida <[email protected]>
Cc: Christian König <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Simona Vetter <[email protected]>
Cc: Maxime Ripard <[email protected]>

Mallesh Koujalagi (4):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Add handler for critical errors which require cold-reset
  drm/xe/debugfs: Add interface to trigger critical error handler

 Documentation/gpu/drm-uapi.rst   | 73 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/drm_drv.c        |  2 +
 drivers/gpu/drm/xe/xe_debugfs.c  | 38 +++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.c | 28 ++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h |  1 +
 include/drm/drm_device.h         |  1 +
 6 files changed, 142 insertions(+), 1 deletion(-)

-- 
2.34.1
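
A userspace consumer could detect the new method roughly as sketched below.
This is only a minimal sketch and not part of the series. It assumes the
uevent payload arrives as a sequence of NUL-terminated KEY=VALUE strings (as
on a kobject uevent netlink socket) and that WEDGED may carry a
comma-separated list of methods. The helper names uevent_get(),
wedged_has_method() and needs_cold_reset() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the value of `key` in a uevent payload, or NULL if absent.
 * Assumption: `buf` holds NUL-terminated "KEY=VALUE" entries back to back.
 */
const char *uevent_get(const char *buf, size_t len, const char *key)
{
	size_t klen = strlen(key);
	size_t off = 0;

	while (off < len) {
		const char *entry = buf + off;
		const char *end = memchr(entry, '\0', len - off);
		size_t elen;

		if (!end)	/* unterminated trailing entry: ignore it */
			break;
		elen = (size_t)(end - entry);
		if (elen > klen && entry[klen] == '=' &&
		    strncmp(entry, key, klen) == 0)
			return entry + klen + 1;
		off += elen + 1;
	}
	return NULL;
}

/* True if the comma-separated `list` contains exactly `method`. */
bool wedged_has_method(const char *list, const char *method)
{
	size_t mlen = strlen(method);

	while (list && *list) {
		size_t n = strcspn(list, ",");

		if (n == mlen && strncmp(list, method, n) == 0)
			return true;
		list += n;
		if (*list == ',')
			list++;
	}
	return false;
}

/* True if the event reports that only a cold reset can recover the device. */
bool needs_cold_reset(const char *buf, size_t len)
{
	return wedged_has_method(uevent_get(buf, len, "WEDGED"), "cold-reset");
}
```

A listener would call needs_cold_reset() on each received drm uevent and,
when it returns true, skip rebind or bus-reset attempts and hand the device
over to whatever platform mechanism performs the power cycle.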
