On 9/13/2018 8:36 PM, Jakub Kicinski wrote:
On Thu, 13 Sep 2018 11:18:15 +0300, Eran Ben Elisha wrote:
The health spec is targeted for Real Time Alerting, in order to know when
something bad has happened to a PCI device.

By spec you mean some standards body spec you implement or this
proposal is a spec?

This proposal is a spec


- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
   information.

Health contains sensors that monitor the health state and detect malfunctions.
Once a sensor is triggered, actions such as logging and correction can be taken.

The sensors are divided into the following groups:
- Hardware sensor - a sensor which is triggered by the device due to
   malfunction.
- Software sensor - a sensor which is triggered by the software due to
   malfunction.
Both groups of sensors can be triggered by an error event or by a periodic
check.

Actions are the way to handle sensor events. An action belongs to one of the
following groups:
- Dump - SW trace, SW dump, HW trace, HW dump
- Reset - Surgical correction (e.g. modify Q, flush Q, reset of device, etc.)
Actions can be performed by SW or HW.

The user can enable or disable sensors and the sensor2action mapping.
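
The model described above (sensors in two groups, actions in two groups, and a user-controlled sensor2action mapping) can be sketched roughly as follows. This is only an illustration of the concept, not the proposed devlink API: the class names, the method names, and the "tx_timeout" sensor name are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class SensorGroup(Enum):
    HARDWARE = "hw"  # triggered by the device due to malfunction
    SOFTWARE = "sw"  # triggered by the software due to malfunction


class Action(Enum):
    DUMP = "dump"    # SW trace, SW dump, HW trace, HW dump
    RESET = "reset"  # surgical correction, up to a reset of the device


@dataclass
class Sensor:
    name: str                 # hypothetical sensor name
    group: SensorGroup
    enabled: bool = True
    actions: list = field(default_factory=list)  # sensor2action mapping


class Health:
    """Toy registry of sensors; all names here are assumptions."""

    def __init__(self):
        self.sensors = {}

    def add_sensor(self, sensor):
        self.sensors[sensor.name] = sensor

    def set_enabled(self, name, enabled):
        self.sensors[name].enabled = enabled

    def map_actions(self, name, actions):
        self.sensors[name].actions = list(actions)

    def trigger(self, name):
        """Return the actions to run for a triggered sensor, in order."""
        sensor = self.sensors[name]
        if not sensor.enabled:
            return []  # a disabled sensor triggers nothing
        return list(sensor.actions)


health = Health()
health.add_sensor(Sensor("tx_timeout", SensorGroup.SOFTWARE))
health.map_actions("tx_timeout", [Action.DUMP, Action.RESET])
pending = health.trigger("tx_timeout")  # dump first, then reset
```

Keeping the mapping per sensor means that disabling a sensor and clearing its mapping stay two separate knobs, which matches the "sensors and sensor2action mapping" wording above.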

This RFC man page patch describes the suggested API of devlink-health in order
to control sensors and actions.

I like the idea of configuring response to events like this, although
I'm not sure the name sensor is appropriate here - perhaps exception or
error would be better?

I was trying to avoid a negative description. I called it sensor to avoid restricting the API to errors / exceptions only. I got the same type of comment from Andrew as well (devlink-health -> devlink-bug).

But if other vendors' driver developers don't see how it can be expanded to sensors which are not errors, then I guess we can refactor the names.

Are there going to be values reported?

It depends on the sensor. If it has data that would help in debugging, then I assume yes, via the dumps.


I'm not so sure about HW sensors in relation to existing HWMON
infrastructure...  I assume you're targeting things like say some HW
engine/block reporting it encountered an error?  Sounds good, too.

yes, exactly.


Are the actions all envisioned to be performed by the driver?
Firmware?  Hardware?  I guess that distinction can be added later.
For FW/HW actions we would go back to the problem of persistence of
the setting since it was only implemented for params :S

The problem is not with the FW action; the problem is when you try to set the sensor2action mapping for the FW/HW. This will need a persistent configuration mode. Sensor2action in SW shall be in run-time mode (at least as a start).
But it sounds like this needs some more tuning to make it clear.
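
To make the distinction concrete, here is a rough sketch of how a mapping request could be validated against the configuration mode. It assumes the health API would reuse the "runtime" and "permanent" configuration-mode names from devlink-param; that reuse, and every name in the code, is an assumption for illustration rather than part of the proposal.

```python
from enum import Enum


class Executor(Enum):
    SW = "sw"  # action performed by the driver
    FW = "fw"  # action performed by firmware
    HW = "hw"  # action performed by hardware


class Cmode(Enum):
    RUNTIME = "runtime"      # takes effect immediately, not persistent
    PERMANENT = "permanent"  # stored persistently for the device


def required_cmode(executor):
    """SW mappings are run-time; FW/HW mappings need persistence."""
    return Cmode.RUNTIME if executor is Executor.SW else Cmode.PERMANENT


def set_mapping(table, sensor, action, executor, cmode):
    """Record a sensor2action mapping, rejecting the wrong mode."""
    expected = required_cmode(executor)
    if cmode is not expected:
        raise ValueError(
            f"{executor.value} mapping needs cmode {expected.value}"
        )
    table[(sensor, executor)] = (action, cmode)
    return table


mappings = {}
set_mapping(mappings, "tx_timeout", "dump", Executor.SW, Cmode.RUNTIME)
```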


Is the dump option going to tie back into region snapshots?

Not necessarily; dumping SW objects can be helpful as well.
