Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent

André Almeida Tue, 01 Jul 2025 10:16:23 -0700

On 01/07/2025 13:44, Riana Tauro wrote:



On 7/1/2025 9:32 PM, Raag Jadav wrote:

On Tue, Jul 01, 2025 at 04:35:42PM +0200, Christian König wrote:

On 01.07.25 16:23, Raag Jadav wrote:

On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:

On 7/1/2025 5:07 PM, Riana Tauro wrote:

On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:

On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:

On 27.06.25 23:38, Rodrigo Vivi wrote:

Or at least print a big warning into the system log?

I mean a firmware update is usually something which
the system administrator triggers very explicitly
because when it fails for some reason (e.g.
unexpected reset, power outage or whatever) it can
sometimes brick the HW.

I think it's rather brave to do this automatically.
Are you sure we don't talk past each other on the
meaning of the wedge event?


The goal is not to do that automatically, but raise the uevent to the
admin with enough information that they can decide for the right
correctable action.


Christian, Andre, any concerns with this still?


Well, that sounds not quite the correct use case for wedge events.

See the wedge event is made for automation.


I respectfully disagree with this statement.

The wedged state in i915 and xe, then ported to drm, was never just about
automation. Of course, the unbind + flr + rebind is one that driver cannot
do by itself, hence needs automation. But wedge cases were also very useful
in other situations like keeping the device in the failure stage for
debugging (without automation) or keeping other critical things up like
display with SW rendering (again, nothing about automation).


Granted, automation is probably not the right term.

What I wanted to say is that the wedge event should not replace
information in the system log.

For example to allow a process supervising containers get the
device working again and re-start the container which used it or
gather crash log etc .....

When you want to notify the system administrator which manual
intervention is necessary then I would just write that into the
system log and raise a device event with WEDGED=unknown.

What we could potentially do is to separate between
WEDGED=unknown and WEDGED=manual, e.g. between driver has no
idea what to do and driver printed useful info into the system
log.

Well, you are right here. Even our official documentation in drm-uapi.rst
already tells that firmware flashing should be a case for 'unknown'.


I had added specific method since we know firmware flash will recover
the error.  Sure will change it.

In the current code, there is no recovery method named "unknown" even
though the document mentions it

https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534

Since we are adding something new, can it be "manual" instead of unknown?

Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown

Let's go with that then. And use other hints like logs and sysfs so the
admin has better information on what to do.

But creating an event with WEDGED=firmware-flash just sounds too
specific, when we go down that route we might soon have
WEDGE=change-bios-setting, WEDGE=....

Well, I agree that we shouldn't explode the options exponentially here.

Although I believe that firmware flashing should be a common case in many
cases and could be a candidate for another indication.

But let's move on with WEDGE='unknown' for this case.

I understand that WEDGED=firmware-flash can't be handled in a generic way
for all drivers but it is simply not the same as WEDGED=unknown since the
driver knows something specific needs to be done here.

I'm wondering if we could add a WEDGED=vendor-specific method for such
cases?


Works for me as well.

My main concern was that we should not start to invent specific wedge
events for all kind of different problems.

On the other hand what's the additional value to distinguish between
unknown and vendor-specific?

In other words even if the necessary handling is unknown to the wedge
event, the administrator could and should still examine the logs to see
what to do.


They're somewhat similar except the consumer can execute vendor specific
triggers (look at some sys/proc entries or logs) based on device id that
the consumer is already familiar with as defined by the vendor, and could
potentially be automated.

Unknown is basically "I'm clueless and good luck with your investigation".

So the distinction is whether the driver is able to provide definition for
its vendor specific cases and how well documented they are.
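
To make the consumer-side distinction above concrete, here is a minimal
Python sketch of how a uevent consumer might branch on the WEDGED value.
It is only an illustration under assumptions: the function names and the
returned action strings are hypothetical, "vendor-specific" is the value
proposed in this thread and not an existing method, and the WEDGED value is
assumed to be a comma-separated list of methods as described in
drm-uapi.rst.

```python
# Hypothetical consumer-side sketch; not taken from any existing tool.
# Methods a supervisor could plausibly attempt without human input.
AUTOMATIC = ("rebind", "bus-reset")


def parse_wedged(env):
    """Split the WEDGED value of a uevent environment into a method list."""
    value = env.get("WEDGED", "")
    return [m.strip() for m in value.split(",") if m.strip()]


def choose_action(env):
    """Pick a (hypothetical) action label for a wedged uevent."""
    methods = parse_wedged(env)
    if not methods:
        return "ignore"  # not a wedged event
    for method in methods:
        if method in AUTOMATIC:
            # Take the first automatable method the driver listed.
            return "auto:" + method
    if "vendor-specific" in methods:
        # Proposed value: the consumer looks up a vendor-documented
        # procedure, e.g. keyed by device id.
        return "vendor-hook"
    # Anything else, including "unknown": leave it to the administrator.
    return "notify-admin"
```

With this sketch, an event carrying WEDGED=unknown ends up with the
administrator, while WEDGED=vendor-specific can be routed to a handler the
vendor has documented, which is the distinction being argued for here.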

Agree with Raag. We know which recovery method works here. Rather than
using 'unknown', 'manual/vendor' macro seems better with vendor specific
documentation for recovery.


That makes sense for me as well, thanks!

Thanks
Riana


Raag
