Hi Rodrigo/Christian
On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
On 27.06.25 23:38, Rodrigo Vivi wrote:
Or at least print a big warning into the system log?
I mean a firmware update is usually something which the system administrator
triggers very explicitly because when it fails for some reason (e.g. unexpected
reset, power outage or whatever) it can sometimes brick the HW.
I think it's rather brave to do this automatically. Are you sure we aren't
talking past each other on the meaning of the wedge event?
The goal is not to do that automatically, but to raise the uevent to the admin
with enough information that they can decide on the right corrective
action.
Christian, Andre, any concerns with this still?
Well, that doesn't sound like quite the right use case for wedge events.
See, the wedge event is made for automation.
I respectfully disagree with this statement.
The wedged state in i915 and xe, then ported to drm, was never just about
automation. Of course, the unbind + flr + rebind is one that the driver cannot
do by itself, hence needs automation. But wedge cases were also very useful
in other situations like keeping the device in the failure state for debugging
(without automation) or keeping other critical things up like display with SW
rendering (again, nothing about automation).
For example, to allow a process supervising containers to get the device
working again and restart the container which used it, or gather crash logs, etc.
When you want to notify the system administrator that manual intervention is
necessary then I would just write that into the system log and raise a device
event with WEDGED=unknown.
What we could potentially do is to separate between WEDGED=unknown and
WEDGED=manual, e.g. between driver has no idea what to do and driver printed
useful info into the system log.
Well, you are right here. Even our official documentation in drm-uapi.rst
already says that firmware flashing should be a case for 'unknown'.
I had added a specific method since we know a firmware flash will recover
from the error. Sure, I will change it.
In the current code, there is no recovery method named "unknown" even
though the documentation mentions it:
https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/drm_drv.c#L534
Since we are adding something new, can it be "manual" instead of unknown?
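To illustrate how a userspace consumer might tell these cases apart, here is
a minimal Python sketch. It assumes the uevent carries a property of the form
WEDGED=<comma-separated methods>, with "rebind" and "bus-reset" as the
scriptable methods; "manual" is only the value proposed in this thread, and
classify_wedged() and its return labels are hypothetical names.

```python
# Sketch only: classify the WEDGED property of a drm wedged uevent.
# "manual" is the value proposed in this thread, not an existing one.
# classify_wedged() and the returned labels are hypothetical.

# Methods a supervisor could script without a human (assumed set).
AUTOMATABLE = {"rebind", "bus-reset"}


def classify_wedged(prop):
    """Return (kind, methods) for a property such as 'WEDGED=rebind,bus-reset'."""
    key, sep, value = prop.strip().partition("=")
    if key != "WEDGED" or not sep:
        raise ValueError("not a WEDGED property: %r" % (prop,))

    methods = [m for m in value.split(",") if m]

    # Every listed method is scriptable: automation can take over.
    if methods and all(m in AUTOMATABLE for m in methods):
        return "automate", methods

    # Driver logged useful hints; an admin has to act on them.
    if methods == ["manual"]:
        return "manual", methods

    # "unknown" or anything unrecognized: driver gives no guidance.
    return "unknown", methods


if __name__ == "__main__":
    for line in ("WEDGED=rebind,bus-reset", "WEDGED=manual", "WEDGED=unknown"):
        print(line, "->", classify_wedged(line)[0])
```

With this split, a consumer would only page the admin (and point at the
system log) for "manual", and treat "unknown" as having no guidance at all.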
Thanks
Riana
Let's go with that then. And use other hints like logs and sysfs so the admin
has better information about what to do.
But creating an event with WEDGED=firmware-flash just sounds too specific; when
we go down that route we might soon have WEDGED=change-bios-setting, WEDGED=....
Well, I agree that we shouldn't explode the options exponentially here.
Although I believe that firmware flashing should be a common case in many
situations and could be a candidate for another indication.
But let's move on with WEDGED='unknown' for this case.
Thanks,
Rodrigo.
Regards,
Christian.
Thanks,
Rodrigo.