This series builds on top of "Introduce Xe Uncorrectable Error Handling" [1]
and adds support for handling errors that require a complete device power
cycle (cold reset) to recover.

Certain error conditions leave the device in a persistent hardware error
state that cannot be cleared through existing recovery mechanisms such as
driver reload or PCIe reset. In these cases, functionality can only be
restored by performing a cold reset.

To support this, the series introduces a new DRM wedging recovery method,
DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)). When a device is wedged with this
method, the DRM core notifies userspace via a uevent that a cold reset is
required. This allows userspace to take appropriate action to power-cycle
the device.

Example uevent received:

  SUBSYSTEM=drm
  WEDGED=cold-reset
  DEVPATH=/devices/.../drm/card0

A detailed description is in the commit messages.

[1] https://patchwork.freedesktop.org/series/160482/

This patch series introduces a call to xe_punit_error_handler() from within
handle_soc_internal_errors() when PUNIT errors are detected.

v2:
- Add use case: Handling errors from power management unit, which
  requires a complete power cycle to recover. (Christian)
- Add several instead of number to avoid update. (Jani)

v3:
- Update any scenario that requires cold-reset. (Riana)
- Update document with generic scenario. (Riana)
- Consistent with terminology. (Raag)
- Remove already covered information.
- Use PUNIT instead of PMU. (Riana)
- Use consistent wording.
- Remove log. (Raag)

Cc: André Almeida <[email protected]>
Cc: Christian König <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Simona Vetter <[email protected]>
Cc: Maxime Ripard <[email protected]>

Mallesh Koujalagi (3):
  drm: Add DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
  drm/xe: Handle PUNIT errors by requesting cold-reset recovery

Riana Tauro (1):
  Introduce Xe Uncorrectable Error Handling

 Documentation/gpu/drm-uapi.rst                |  60 +++-
 drivers/gpu/drm/drm_drv.c                     |   2 +
 drivers/gpu/drm/xe/Makefile                   |   2 +
 drivers/gpu/drm/xe/xe_device.c                |  10 +
 drivers/gpu/drm/xe/xe_device.h                |  15 +
 drivers/gpu/drm/xe/xe_device_types.h          |   6 +
 drivers/gpu/drm/xe/xe_gt.c                    |  14 +-
 drivers/gpu/drm/xe/xe_guc_submit.c            |   9 +-
 drivers/gpu/drm/xe/xe_pci.c                   |   3 +
 drivers/gpu/drm/xe/xe_pci_error.c             | 118 ++++++
 drivers/gpu/drm/xe/xe_ras.c                   | 337 ++++++++++++++++++
 drivers/gpu/drm/xe/xe_ras.h                   |  17 +
 drivers/gpu/drm/xe/xe_ras_types.h             | 203 +++++++++++
 drivers/gpu/drm/xe/xe_survivability_mode.c    |  12 +-
 drivers/gpu/drm/xe/xe_sysctrl_mailbox_types.h |  13 +
 include/drm/drm_device.h                      |   1 +
 16 files changed, 813 insertions(+), 9 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_pci_error.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.c
 create mode 100644 drivers/gpu/drm/xe/xe_ras.h
 create mode 100644 drivers/gpu/drm/xe/xe_ras_types.h

-- 
2.34.1
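
For illustration only, a userspace consumer of the uevent described above
might detect the cold-reset request roughly as sketched here. This is not
part of the series. The helper name uevent_wants_cold_reset() is
hypothetical, and the sketch assumes the payload is a sequence of
NUL-terminated KEY=value strings and that WEDGED= may carry a
comma-separated list of methods, as with the existing wedge recovery
methods. How the device is then actually power-cycled is platform specific
and left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if a uevent payload contains a WEDGED=
 * property whose method list includes "cold-reset".
 *
 * Assumption: buf holds NUL-terminated "KEY=value" strings back to back,
 * and the WEDGED value may be a comma-separated list of methods.
 */
static bool uevent_wants_cold_reset(const char *buf, size_t len)
{
	static const char key[] = "WEDGED=";
	static const char method[] = "cold-reset";
	const size_t klen = sizeof(key) - 1;
	const size_t mlen = sizeof(method) - 1;
	size_t off = 0;

	while (off < len) {
		const char *prop = buf + off;
		const char *nul = memchr(prop, '\0', len - off);
		size_t plen = nul ? (size_t)(nul - prop) : len - off;

		if (plen >= klen && !memcmp(prop, key, klen)) {
			const char *val = prop + klen;
			const char *end = prop + plen;

			/* Walk the comma-separated method list. */
			while (val < end) {
				const char *comma = memchr(val, ',', (size_t)(end - val));
				size_t tlen = comma ? (size_t)(comma - val)
						    : (size_t)(end - val);

				/* Exact token match, not a prefix match. */
				if (tlen == mlen && !memcmp(val, method, mlen))
					return true;
				if (!comma)
					break;
				val = comma + 1;
			}
		}
		off += plen + 1;	/* skip the property and its NUL */
	}
	return false;
}
```

A caller would pass the raw buffer read from its uevent source together
with the number of bytes received, and schedule a platform-specific power
cycle when the helper returns true.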
