On 9/13/2018 4:24 PM, Andrew Lunn wrote:
On Thu, Sep 13, 2018 at 03:49:37PM +0300, Eran Ben Elisha wrote:


On 9/13/2018 3:08 PM, Andrew Lunn wrote:
        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
            Sets TX_COMP_ERROR sensor parameters for a specific device.

I hope the real sensors have more understandable names. If I remember
correctly, the same sort of comment was given for resource
management. It was pretty unclear what the resource names actually
meant. Is an average user going to have any idea how to actually use
these sensors and actions?

Well, hopefully. The whole point is to have it fully controlled by the user.
However, names used in the command should be short. I guess we will have to
document them (the challenge is finding names that fit multiple vendors).


Can you give more examples of sensors? We should understand whether there
is any overlap with hwmon.

To restate: we will have SW sensors as well, not only HW
sensors.

This is what I had in mind:
1. command interface error
2. command interface timeout
3. stuck TX queue (like tx_timeout)
4. stuck TX completion queue (driver did not process packets in a reasonable
time period)
5. stuck RX queue
6. RX completion error
7. TX completion error
8. HW / FW catastrophic error report
9. completion queue overrun
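
To make the naming question concrete, here is a minimal sketch in C of how the list above could map short command names to documented descriptions. Only TX_COMP_ERROR appears as a name in this thread; every other identifier, and the table layout itself, is a hypothetical illustration and not the proposed kernel or devlink API.

```c
#include <string.h>

/* Hypothetical sensor IDs mirroring the nine items listed above.
 * Only TX_COMP_ERROR is a name taken from the thread. */
enum health_sensor {
	SENSOR_CMD_ERROR,
	SENSOR_CMD_TIMEOUT,
	SENSOR_TX_STUCK,
	SENSOR_TX_COMP_STUCK,
	SENSOR_RX_STUCK,
	SENSOR_RX_COMP_ERROR,
	SENSOR_TX_COMP_ERROR,
	SENSOR_FW_FATAL,
	SENSOR_CQ_OVERRUN,
	SENSOR_COUNT
};

/* Short name for the command line, longer text for documentation. */
struct sensor_info {
	const char *name;
	const char *desc;
};

static const struct sensor_info sensor_table[SENSOR_COUNT] = {
	[SENSOR_CMD_ERROR]     = { "CMD_ERROR",     "command interface error" },
	[SENSOR_CMD_TIMEOUT]   = { "CMD_TIMEOUT",   "command interface timeout" },
	[SENSOR_TX_STUCK]      = { "TX_STUCK",      "stuck TX queue" },
	[SENSOR_TX_COMP_STUCK] = { "TX_COMP_STUCK", "stuck TX completion queue" },
	[SENSOR_RX_STUCK]      = { "RX_STUCK",      "stuck RX queue" },
	[SENSOR_RX_COMP_ERROR] = { "RX_COMP_ERROR", "RX completion error" },
	[SENSOR_TX_COMP_ERROR] = { "TX_COMP_ERROR", "TX completion error" },
	[SENSOR_FW_FATAL]      = { "FW_FATAL",      "HW / FW catastrophic error report" },
	[SENSOR_CQ_OVERRUN]    = { "CQ_OVERRUN",    "completion queue overrun" },
};

/* Return the sensor ID for a short name, or -1 if it is unknown. */
int sensor_find(const char *name)
{
	for (int i = 0; i < SENSOR_COUNT; i++) {
		if (strcmp(sensor_table[i].name, name) == 0)
			return i;
	}
	return -1;
}
```

Keeping the description next to the short name is one way to keep the command terse while still giving the user something readable; whether a single table can fit multiple vendors is exactly the open question.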

Hi Eran

I'm having trouble differentiating between these SW sensors and bugs
which need fixing. What causes a command interface error? Sending it a
command it does not understand? A wrongly formatted command? A command
the version of the firmware does not support? These all sound just
like plain old bugs which need fixing, not something which needs a
framework to detect them and try to recover from them by resetting
something.

Such issues do exist in production environments and need to be handled, even if the root cause is a bug that will be fixed in the latest release. My feature should help developers and administrators control and recover their live systems through auto-correction and logging support.
The goals are:
- Provide alerts and debug information
- Self-healing
- If a problem needs vendor support, provide a way to gather all the needed debugging information.
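
As a rough illustration of how these goals could fit together, here is a minimal sketch in C of a per-sensor policy with the reset and dump actions from the command example earlier in the thread. The struct and function names are hypothetical, the counters stand in for real work, and running the dump before the reset is an assumption made here so that the faulty state is captured first. It is not the actual devlink-health implementation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-sensor policy, modeled on
 * "action reset off action dump on". */
struct sensor_policy {
	const char *name;
	bool do_reset;
	bool do_dump;
};

/* Counters stand in for the real alert, dump and reset work. */
struct health_stats {
	unsigned int alerts;
	unsigned int dumps;
	unsigned int resets;
};

/* Handle one sensor event: always alert, then apply the enabled
 * actions. The dump comes first (an assumption of this sketch) so
 * that the state is saved before a reset would clear it. */
void health_report(const struct sensor_policy *p, struct health_stats *st)
{
	fprintf(stderr, "health: sensor %s triggered\n", p->name);
	st->alerts++;

	if (p->do_dump)
		st->dumps++;	/* gather debugging information */

	if (p->do_reset)
		st->resets++;	/* self-healing attempt */
}
```

With a policy of { "TX_COMP_ERROR", false, true }, one call to health_report() raises one alert and one dump and performs no reset.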


I would have expected that all the issues are about physical
properties. Something similar to SMART for hard disks. The power
supplies are starting to droop, suggesting they might die soon. The
tacho on the fan suggests the fan is not rotating as fast as it
should, so the motor is going to die soon. An SFP is giving i2c
errors, suggesting it is not seated correctly. The card as a whole is
overheating, despite the fan working, suggesting the ambient
temperature is just too high.

AFAIU, the kind of sensors you suggest here require a manual fix: physically accessing the setup, replacing HW, installing a new fan, etc. Monitoring such events is easy; the driver can just log the HW events to dmesg and stop there. None of these is the kind of real networking issue I would like to handle with devlink-health.

Eran


        Andrew
