On 01/07/2025 13:44, Riana Tauro wrote:
On 7/1/2025 9:32 PM, Raag Jadav wrote:
On Tue, Jul 01, 2025 at 04:35:42PM +0200, Christian König wrote:
On 01.07.25 16:23, Raag Jadav wrote:
On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
On 7/1/2025 5:07 PM, Riana Tauro wrote:
On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
On 27.06.25 23:38, Rodrigo Vivi wrote:
Or at least print a big warning into the system log?
I mean a firmware update is usually something which
the system administrator triggers very explicitly
because when it fails for some reason (e.g.
unexpected reset, power outage or whatever) it can
sometimes brick the HW.
I think it's rather brave to do this automatically.
Are you sure we don't talk past each other on the
meaning of the wedge event?
The goal is not to do that automatically, but to raise the uevent to the
admin with enough information that they can decide on the right corrective
action.
Christian, Andre, any concerns with this still?
Well, that sounds not quite the correct use case for wedge events.
See the wedge event is made for automation.
I respectfully disagree with this statement.
The wedged state in i915 and xe, then ported to drm, was never just about
automation. Of course, the unbind + flr + rebind is one that the driver cannot
do by itself, hence needs automation. But wedge cases were also very useful
in other situations, like keeping the device in the failure state for
debugging (without automation) or keeping other critical things up, like
display with SW rendering (again, nothing about automation).
Granted, automation is probably not the right term.
What I wanted to say is that the wedge event should not replace
information in the system log.
For example, to allow a process supervising containers to get the
device working again and restart the container which used it, or
gather crash logs, etc.
When you want to notify the system administrator which manual
intervention is necessary then I would just write that into the
system log and raise a device event with WEDGED=unknown.
What we could potentially do is to separate between
WEDGED=unknown and WEDGED=manual, e.g. between driver has no
idea what to do and driver printed useful info into the system
log.
Well, you are right here. Even our official documentation in drm-uapi.rst
already says that firmware flashing should be a case for 'unknown'.
I had added a specific method since we know a firmware flash will recover
the error. Sure, will change it.
In the current code, there is no recovery method named "unknown" even
though the document mentions it
https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534
Since we are adding something new, can it be "manual" instead of unknown?
Okay, missed it. It's in the drm_dev_wedged_event function. Will use unknown.
Let's go with that then. And use other hints like logs and sysfs so the
admin has better information about what to do.
But creating an event with WEDGED=firmware-flash just sounds too
specific; when we go down that route we might soon have
WEDGED=change-bios-setting, WEDGED=....
Well, I agree that we shouldn't explode the options exponentially here.
Although I believe that firmware flashing should be common in many cases
and could be a candidate for another indication.
But let's move on with WEDGED='unknown' for this case.
I understand that WEDGED=firmware-flash can't be handled in a generic way
for all drivers, but it is simply not the same as WEDGED=unknown since the
driver knows something specific needs to be done here.
I'm wondering if we could add a WEDGED=vendor-specific method for such
cases?
Works for me as well.
My main concern was that we should not start to invent specific wedge
events for all kinds of different problems.
On the other hand, what's the additional value in distinguishing between
unknown and vendor-specific?
In other words even if the necessary handling is unknown to the wedge
event, the administrator could and should still examine the logs to
see what to do.
They're somewhat similar except the consumer can execute vendor specific
triggers (look at some sys/proc entries or logs) based on device id that
the consumer is already familiar with as defined by the vendor, and could
potentially be automated.
Unknown is basically "I'm clueless and good luck with your
investigation".
So the distinction is whether the driver is able to provide a definition
for its vendor specific cases and how well documented they are.
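
To make the distinction above concrete, here is a minimal sketch of how a
userspace consumer might treat the different WEDGED values. It is only an
illustration under assumptions: "vendor-specific" is the value proposed in
this thread, the VENDOR_PLAYBOOKS table, its device id and its text are
hypothetical, and the uevent environment is passed in as a plain dict rather
than read from a real udev monitor.

```python
# Sketch only: a supervising process deciding what to do with a wedge uevent.
# "vendor-specific" is the value proposed in this thread; the playbook table
# below (including the device id) is hypothetical.

# Methods a generic consumer can act on without vendor knowledge.
GENERIC_METHODS = {"none", "rebind", "bus-reset"}

# Hypothetical vendor playbooks keyed by (PCI vendor id, PCI device id).
VENDOR_PLAYBOOKS = {
    ("0x8086", "0xffff"): "follow vendor documentation, check driver logs",
}


def parse_wedged(env):
    """Return the recovery methods listed in WEDGED=<m1>[,<m2>...]."""
    value = env.get("WEDGED")
    if value is None:
        return []
    return [method for method in value.split(",") if method]


def decide(env, pci_id):
    """Map a wedge uevent to a coarse action string."""
    methods = parse_wedged(env)
    if not methods:
        return "ignore: not a wedge event"
    for method in methods:
        if method in GENERIC_METHODS:
            return f"automate: {method}"
    if "vendor-specific" in methods:
        playbook = VENDOR_PLAYBOOKS.get(pci_id)
        if playbook is not None:
            return f"vendor: {playbook}"
        return "manual: no playbook for this device, check logs"
    # 'unknown' or any unrecognised value: only the logs can help.
    return "manual: check logs"
```

With this shape, 'unknown' always ends at "check logs", while a
vendor-specific event can be routed by device id to whatever procedure the
vendor has documented, which is the part that could potentially be automated.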
Agree with Raag. We know which recovery method works here. Rather than
using 'unknown', 'manual/vendor' macro seems better with vendor specific
documentation for recovery.
That makes sense for me as well, thanks!
Thanks
Riana
Raag