On 7/1/2025 5:07 PM, Riana Tauro wrote:
Hi Rodrigo/Christian
On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
On 27.06.25 23:38, Rodrigo Vivi wrote:
Or at least print a big warning into the system log?
I mean a firmware update is usually something which the system
administrator triggers very explicitly because when it fails for
some reason (e.g. unexpected reset, power outage or whatever) it
can sometimes brick the HW.
I think it's rather brave to do this automatically. Are you sure
we aren't talking past each other on the meaning of the wedge event?
The goal is not to do that automatically, but to raise the uevent to
the admin with enough information that they can decide on the right
corrective action.
Christian, Andre, any concerns with this still?
Well, that doesn't sound like quite the right use case for wedge events.
See, the wedge event is made for automation.
I respectfully disagree with this statement.
The wedged state in i915 and xe, then ported to drm, was never just about
automation. Of course, the unbind + flr + rebind is one that the driver
cannot do by itself, hence needs automation. But wedge cases were also
very useful in other situations, like keeping the device in the failure
state for debugging (without automation) or keeping other critical things
up, like display with SW rendering (again, nothing about automation).
For example, to allow a process supervising containers to get the device
working again and restart the container which used it, or to gather
crash logs, etc.
When you want to notify the system administrator which manual
intervention is necessary then I would just write that into the
system log and raise a device event with WEDGED=unknown.
What we could potentially do is to distinguish between WEDGED=unknown
and WEDGED=manual, i.e. between "driver has no idea what to do" and
"driver printed useful info into the system log".
Well, you are right here. Even our official documentation in drm-uapi.rst
already says that firmware flashing should be a case for 'unknown'.
I had added a specific method since we know a firmware flash will recover
from the error. Sure, will change it.
In the current code, there is no recovery method named "unknown" even
though the document mentions it:
https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534
Since we are adding something new, can it be "manual" instead of unknown?
Okay, I missed it. It's in the drm_dev_wedged_event function. Will use
unknown.
Thanks
Riana
Let's go with that then, and use other hints like logs and sysfs so the
admin has better information on what to do.
But creating an event with WEDGED=firmware-flash just sounds too
specific. If we go down that route we might soon have
WEDGED=change-bios-setting, WEDGED=....
Well, I agree that we shouldn't explode the options exponentially here.
Although I believe that firmware flashing should be a common case in many
situations and could be a candidate for another indication.
But let's move on with WEDGED='unknown' for this case.
Thanks,
Rodrigo.
Regards,
Christian.
Thanks,
Rodrigo.
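
For reference, the consumer side of what the thread settles on could look
roughly like the sketch below: a supervisor reads the WEDGED= property of
the uevent, automates only the methods it knows, and defers to the admin
(who then checks the system log and sysfs) for "unknown". This is only an
illustration, not code from the series. The helper names parse_wedged and
choose_action, the AUTOMATABLE tuple and the "notify-admin" label are all
hypothetical, and the comma-separated method list ("none", "rebind",
"bus-reset") is assumed from my reading of drm-uapi.rst.

```python
# Methods this hypothetical consumer knows how to carry out by itself.
# Assumed from drm-uapi.rst; not taken from the patch under discussion.
AUTOMATABLE = ("none", "rebind", "bus-reset")


def parse_wedged(value):
    """Split a WEDGED= property value into a list of method names."""
    return [m.strip() for m in value.split(",") if m.strip()]


def choose_action(env):
    """Pick what to do for one uevent, given its properties as a dict.

    Returns None when the event carries no WEDGED property at all.
    """
    if "WEDGED" not in env:
        return None
    for method in parse_wedged(env["WEDGED"]):
        if method in AUTOMATABLE:
            # Take the first listed method we can handle.
            return method
    # "unknown" (or anything unrecognised): do not automate, defer to
    # the administrator, who can consult the system log and sysfs.
    return "notify-admin"
```

With this policy, a firmware-flash case reported as WEDGED=unknown never
triggers an automatic action; it only results in a notification.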