From: Tomer Tayar <tta...@habana.ai>

The mechanism that aborts device reset after consecutive fatal errors
currently covers only fatal errors reported by FW.
A non-responsive FW and consecutive heartbeat failures are also
considered fatal, so add them to this mechanism to avoid recurring
device resets in such a case.

Signed-off-by: Tomer Tayar <tta...@habana.ai>
Reviewed-by: Oded Gabbay <ogab...@kernel.org>
Signed-off-by: Oded Gabbay <ogab...@kernel.org>
---
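Note (not part of the commit): the check described above can be modeled
outside the driver roughly as follows. This is only a sketch under stated
assumptions. The enum values, struct reset_info, record_trigger() and
check_consecutive_fatal() are hypothetical stand-ins, and the way the
"repeated" flag gets set here is an assumption, not the driver's actual
bookkeeping.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's reset trigger flags. */
enum reset_trigger {
	TRIGGER_NONE,
	TRIGGER_FW_FATAL_ERR,
	TRIGGER_HEARTBEAT,
	TRIGGER_OTHER,
};

/* Hypothetical, reduced version of the reset bookkeeping. */
struct reset_info {
	enum reset_trigger prev_reset_trigger;
	bool reset_trigger_repeated;
};

/*
 * Assumption: a trigger counts as "repeated" when it equals the trigger
 * recorded for the previous reset.
 */
static void record_trigger(struct reset_info *info, enum reset_trigger cur)
{
	info->reset_trigger_repeated = (cur == info->prev_reset_trigger);
	info->prev_reset_trigger = cur;
}

/*
 * Return -EIO when the same fatal trigger (FW fatal error or heartbeat
 * failure) caused two back to back resets, 0 otherwise.
 */
static int check_consecutive_fatal(const struct reset_info *info)
{
	if (info->reset_trigger_repeated &&
	    (info->prev_reset_trigger == TRIGGER_FW_FATAL_ERR ||
	     info->prev_reset_trigger == TRIGGER_HEARTBEAT))
		return -EIO;

	return 0;
}
```

In this model a single heartbeat failure still allows the reset to proceed,
a second one in a row returns -EIO, and repeated non-fatal triggers are
never blocked.
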
 drivers/accel/habanalabs/common/device.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 15891de6cf39..581fc99ad89b 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1769,14 +1769,16 @@ int hl_device_reset(struct hl_device *hdev, u32 flags)
                hdev->device_cpu_disabled = false;
                hdev->reset_info.hard_reset_pending = false;
 
+               /*
+                * Put the device in an unusable state if there are 2 back to back resets due to
+                * fatal errors.
+                */
                if (hdev->reset_info.reset_trigger_repeated &&
-                               (hdev->reset_info.prev_reset_trigger ==
-                                               HL_DRV_RESET_FW_FATAL_ERR)) {
-                       /* if there 2 back to back resets from FW,
-                        * ensure driver puts the driver in a unusable state
-                        */
+                               (hdev->reset_info.prev_reset_trigger == HL_DRV_RESET_FW_FATAL_ERR ||
+                                               hdev->reset_info.prev_reset_trigger ==
+                                                               HL_DRV_RESET_HEARTBEAT)) {
                        dev_crit(hdev->dev,
-                               "%s Consecutive FW fatal errors received, stopping hard reset\n",
+                               "%s Consecutive fatal errors, stopping hard reset\n",
                                dev_name(&(hdev)->pdev->dev));
                        rc = -EIO;
                        goto out_err;
-- 
2.34.1
