On 3/11/2021 5:26 AM, Jakub Kicinski wrote:
Pending vendors adding the right reporters. <<

Would you like Nvidia to reply with the remedy per reporter or to actually prepare the patch?


Extend the applicability of devlink health reporters
beyond what can be locally remedied. Add failure modes
which require re-flashing the NVM image or HW changes.

The expectation is that driver will call
devlink_health_reporter_state_update() to put hardware
health reporters into bad state.

Signed-off-by: Jakub Kicinski <k...@kernel.org>
---
  include/uapi/linux/devlink.h | 7 +++++++
  net/core/devlink.c           | 3 +--
  2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 8cd1508b525b..f623bbc63489 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -617,10 +617,17 @@ enum devlink_port_fn_opstate {
   * @DL_HEALTH_STATE_ERROR: error state, running health reporter's recovery
   *                    may fix the issue, otherwise user needs to try
   *                    power cycling or other forms of reset
+ * @DL_HEALTH_STATE_BAD_IMAGE: device's non-volatile memory needs
+ *                     to be re-written, usually due to block corruption
+ * @DL_HEALTH_STATE_BAD_HW: hardware errors detected, device, host
+ *                     or the connection between the two may be at fault
   */
  enum devlink_health_state {
        DL_HEALTH_STATE_HEALTHY,
        DL_HEALTH_STATE_ERROR,
+
+       DL_HEALTH_STATE_BAD_IMAGE,
+       DL_HEALTH_STATE_BAD_HW,
  };
/**
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 09d77d43ff63..4a9fa6288a4a 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -6527,8 +6527,7 @@ void
  devlink_health_reporter_state_update(struct devlink_health_reporter *reporter,
                                     enum devlink_health_state state)
  {
-       if (WARN_ON(state != DL_HEALTH_STATE_HEALTHY &&
-                   state != DL_HEALTH_STATE_ERROR))
+       if (WARN_ON(state > DL_HEALTH_STATE_BAD_HW))
                return;
if (reporter->health_state == state)


devlink_health_reporter_recover() requires an update as well.
something like:

@@ -6346,8 +6346,15 @@ devlink_health_reporter_recover(struct devlink_health_reporter *reporter,
 {
        int err;

-   if (reporter->health_state == DL_HEALTH_STATE_HEALTHY)
+ switch (reporter->health_state) {
+ case DL_HEALTH_STATE_HEALTHY:
                return 0;
+ case DL_HEALTH_STATE_ERROR:
+         break;
+ case DL_HEALTH_STATE_BAD_IMAGE:
+ case DL_HEALTH_STATE_BAD_HW:
+         return -EOPNOTSUPP;
+ }

        if (!reporter->ops->recover)
                return -EOPNOTSUPP;

Reply via email to