This RFC patch series introduces a new DRM wedge recovery method,
'DRM_WEDGE_RECOVERY_COLD_RESET', for critical errors that existing
software-based recovery mechanisms cannot resolve.

Background
----------
Current recovery methods (driver rebind, bus reset, function-level
reset) are effective for most error scenarios. However, certain critical
errors affect persistent device-level state that survives warm resets
and software recovery attempts. Such errors require a full device power
cycle to restore functionality.

Proposed Solution
-----------------
This series adds DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) as a new
recovery method to the DRM wedging framework. When this method is set,
it signals to userspace that only a complete device cold reset (power
cycle) can restore normal operation.
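
As a rough illustration of how a recovery mask maps to the WEDGED
string, here is a small userspace-style sketch in Python. Only the
cold-reset bit, BIT(4), comes from this series. The other bit positions
and names are assumptions based on the existing definitions in
include/drm/drm_device.h and should be checked against the tree in use.
The helper names (BIT, WEDGE_RECOVERY_NAMES, wedged_value) are
hypothetical.

```python
def BIT(n):
    """Mimic the kernel BIT() macro."""
    return 1 << n

# (bit, name) pairs ordered from least to most side effects.
# Entries other than cold-reset are assumed from the existing header.
WEDGE_RECOVERY_NAMES = [
    (BIT(0), "none"),
    (BIT(1), "rebind"),
    (BIT(2), "bus-reset"),
    (BIT(3), "vendor-specific"),
    (BIT(4), "cold-reset"),  # proposed by this series
]

def wedged_value(mask):
    """Join the names of all methods set in mask with commas."""
    return ",".join(name for bit, name in WEDGE_RECOVERY_NAMES if mask & bit)
```

With these assumptions, a mask containing only BIT(4) would yield the
"cold-reset" value that appears in the uevent.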

Example uevent received:
  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0
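
A userspace consumer could react to this uevent roughly as sketched
below. This is a minimal Python example that works on the raw KEY=VALUE
text of the event. It assumes WEDGED may carry a comma-separated list of
methods, as with the existing recovery methods. The function names are
hypothetical and not part of any existing tool.

```python
def parse_uevent(payload):
    """Turn KEY=VALUE lines of a uevent into a dict."""
    props = {}
    for line in payload.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines without '='
            props[key] = value
    return props

def wedge_methods(props):
    """Return the recovery methods listed in WEDGED for a drm event."""
    if props.get("SUBSYSTEM") != "drm":
        return []
    return [m for m in props.get("WEDGED", "").split(",") if m]

def needs_cold_reset(props):
    """True if the event lists cold-reset among its recovery methods."""
    return "cold-reset" in wedge_methods(props)
```

A consumer that sees needs_cold_reset() return true would skip rebind
or bus reset and instead report that the device needs a power cycle.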

Testing
-------
The debugfs interface allows testing the cold reset recovery path:

  echo 1 > /sys/kernel/debug/dri/N/trigger_critical_error

This triggers the critical error handler, wedges the device with the
cold-reset method, and sends the corresponding uevent to userspace.

Cc: André Almeida <[email protected]>
Cc: Christian König <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Simona Vetter <[email protected]>
Cc: Maxime Ripard <[email protected]>

Mallesh Koujalagi (4):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Add handler for critical errors which require cold-reset
  drm/xe/debugfs: Add interface to trigger critical error handler

 Documentation/gpu/drm-uapi.rst   | 73 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/drm_drv.c        |  2 +
 drivers/gpu/drm/xe/xe_debugfs.c  | 38 +++++++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.c | 28 ++++++++++++
 drivers/gpu/drm/xe/xe_hw_error.h |  1 +
 include/drm/drm_device.h         |  1 +
 6 files changed, 142 insertions(+), 1 deletion(-)

-- 
2.34.1
