On 2/10/2026 10:28 AM, Riana Tauro wrote:
On 2/5/2026 2:00 PM, Raag Jadav wrote:
On Mon, Feb 02, 2026 at 12:13:59PM +0530, Riana Tauro wrote:
Initialize DRM RAS in hw error init. Map the UAPI error severities
to the hardware error severities and refactor the file.
Signed-off-by: Riana Tauro <[email protected]>
---
drivers/gpu/drm/xe/xe_drm_ras_types.h | 8 ++++
drivers/gpu/drm/xe/xe_hw_error.c | 68 ++++++++++++++++-----------
2 files changed, 48 insertions(+), 28 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_drm_ras_types.h b/drivers/gpu/drm/xe/xe_drm_ras_types.h
index 0ac4ae324f37..beed48811d6a 100644
--- a/drivers/gpu/drm/xe/xe_drm_ras_types.h
+++ b/drivers/gpu/drm/xe/xe_drm_ras_types.h
@@ -11,6 +11,14 @@
struct drm_ras_node;
+/* Error categories reported by hardware */
+enum hardware_error {
+ HARDWARE_ERROR_CORRECTABLE = 0,
+ HARDWARE_ERROR_NONFATAL = 1,
+ HARDWARE_ERROR_FATAL = 2,
I'd align "= x" using tabs for readability.
Will drop the explicit values except for the first one.
+ HARDWARE_ERROR_MAX,
Guaranteed last member, so redundant comma.
Will fix it
+};
Also, just curious. Are these expected to be reused anywhere?
If not, they're probably better off in the .c file.
These are used in the register header files as well as the .c file, so I added them here.
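For reference, a minimal standalone sketch of how the enum might look with both comments above applied (only the first value explicit, no comma after the last member). The static_assert is an illustration added here and is not part of the patch.

```c
#include <assert.h>

/* Error categories reported by hardware */
enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
	HARDWARE_ERROR_MAX
};

/* Illustration only: implicit numbering still yields three categories. */
static_assert(HARDWARE_ERROR_MAX == 3,
	      "expected three hardware error categories");
```

Since the members number implicitly from 0, HARDWARE_ERROR_MAX stays usable as an array bound without spelling out each value.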
...
@@ -86,8 +78,8 @@ static void csc_hw_error_handler(struct xe_tile *tile, const enum hardware_error
fw_err = xe_mmio_read32(mmio, HEC_UNCORR_FW_ERR_DW0(base));
for_each_set_bit(err_bit, &fw_err, HEC_UNCORR_FW_ERR_BITS) {
drm_err_ratelimited(&xe->drm, HW_ERR
- "%s: HEC Uncorrected FW %s error reported, bit[%d] is set\n",
- hw_err_str, hec_uncorrected_fw_errors[err_bit],
+ "HEC FW %s error reported, bit[%d] is set\n",
+ hec_uncorrected_fw_errors[err_bit],
So we're dropping severity_str here? Did I miss something?
I removed it because "Uncorrected" was already in the log message, but then
removed that as well by mistake.
Will fix this. Thanks for catching it.
err_bit);
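To illustrate the point above, here is a userspace sketch of a message that keeps a severity prefix. The helper names (sketch_severity_str, sketch_format_hec_msg) and the severity strings are hypothetical, chosen only for this example. The actual fix may look different.

```c
#include <stddef.h>
#include <stdio.h>

enum hardware_error {
	HARDWARE_ERROR_CORRECTABLE = 0,
	HARDWARE_ERROR_NONFATAL,
	HARDWARE_ERROR_FATAL,
	HARDWARE_ERROR_MAX
};

/* Hypothetical helper: the strings below are assumptions, not the driver's. */
static const char *sketch_severity_str(enum hardware_error sev)
{
	switch (sev) {
	case HARDWARE_ERROR_CORRECTABLE:
		return "correctable";
	case HARDWARE_ERROR_NONFATAL:
		return "nonfatal";
	case HARDWARE_ERROR_FATAL:
		return "fatal";
	default:
		return "unknown";
	}
}

/*
 * Hypothetical helper: formats the HEC FW message with the severity kept
 * as a prefix. Returns the snprintf() result, or -1 on a bad buffer.
 */
static int sketch_format_hec_msg(char *buf, size_t len,
				 enum hardware_error sev,
				 const char *fw_err_name, int err_bit)
{
	if (!buf || len == 0)
		return -1;

	return snprintf(buf, len,
			"%s: HEC FW %s error reported, bit[%d] is set\n",
			sketch_severity_str(sev), fw_err_name, err_bit);
}
```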
...
+static int hw_error_info_init(struct xe_device *xe)
+{
+ int ret;
+
+ if (xe->info.platform != XE_PVC)
+ return 0;
+
+ ret = xe_drm_ras_allocate_nodes(xe);
Why not just
return xe_drm_ras_allocate_nodes(xe);
Tidy? ;)
okay
+ if (ret)
+ return ret;
+
+ return 0;
+}
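A standalone sketch of the tidied helper, for illustration. The struct, the enum, and the allocator below are simplified stand-ins so the snippet compiles on its own. They are not the real xe definitions, and the fake_alloc_result field is hypothetical.

```c
/* Stand-in types: simplified placeholders, not the real xe definitions. */
enum xe_platform { XE_PLATFORM_OTHER_SKETCH, XE_PVC };

struct xe_device {
	struct {
		enum xe_platform platform;
	} info;
	int fake_alloc_result;	/* hypothetical: lets the stub simulate failure */
};

/* Stub standing in for the real allocator; returns 0 or a negative value. */
static int xe_drm_ras_allocate_nodes(struct xe_device *xe)
{
	return xe->fake_alloc_result;
}

static int hw_error_info_init(struct xe_device *xe)
{
	if (xe->info.platform != XE_PVC)
		return 0;

	/* Propagate the allocator's return value directly. */
	return xe_drm_ras_allocate_nodes(xe);
}
```

Returning the call directly drops the local variable and the redundant "if (ret) return ret; return 0;" tail without changing behavior.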
+
/*
* Process hardware errors during boot
*/
@@ -172,11 +179,16 @@ static void process_hw_errors(struct xe_device *xe)
void xe_hw_error_init(struct xe_device *xe)
{
struct xe_tile *tile = xe_device_get_root_tile(xe);
+ int ret;
if (!IS_DGFX(xe) || IS_SRIOV_VF(xe))
return;
INIT_WORK(&tile->csc_hw_error_work, csc_hw_error_work);
+ ret = hw_error_info_init(xe);
+ if (ret)
+ drm_warn(&xe->drm, "Failed to allocate DRM RAS nodes\n");
This is unlikely to be caused by a hardware limitation, so I think
drm_err() would be more appropriate here.
okay will fix it
Thanks
Riana
Raag
+
process_hw_errors(xe);
}
--
2.47.1