Re: Handling pageflip timeouts

2024-03-20 Thread Simon Ser
Note, the kernel already sends synthetic page-flip events when a CRTC goes
from on → off. I think it would make sense to do the same for all pending
page-flips before the device is destroyed in the kernel.


Re: Handling pageflip timeouts

2024-03-20 Thread Pekka Paalanen
On Wed, 13 Mar 2024 15:45:47 +0100
Xaver Hugl wrote:

> Hi all,
> 
> This was already discussed on IRC, but I think this should be on the
> mailing list as well and get some more official conclusion that's
> written down somewhere.
> 
> Recently I've experienced a GPU reset, which the system successfully
> recovered from, but the display was still stuck - because amdgpu hit a
> pageflip timeout, which causes the compositor to wait for a pageflip
> event that will never come. Some other experiments I did before showed
> that even if the compositor tries submitting new atomic commits after
> a timeout, those commits are rejected with EBUSY, presumably because
> the timed out pageflip is still considered "pending" on the kernel
> side.
> 
> After restarting the compositor, everything continued to work
> correctly, so this state can be recovered from. Because of that I
> think it would be useful for the kernel to act on pageflip timeouts
> differently. It should
> - signal the pageflip's completion to userspace
> - maybe have a new event for "pageflip failed" to give userspace more
> correct information in the future
> - allow new commits to happen afterwards
> 
> Another case discussed was when the device is completely removed.
> Right now, if a pageflip is pending when that happens, userspace never
> gets the event for pageflip completion, just like with the GPU reset.
> KWin ignores pending pageflips on hotunplug. Because the device is
> removed it's not a big issue, but uAPI wise I would expect a pageflip
> event to arrive for all commits that request them, no matter what -
> and if that is not possible or desirable, uAPI has to be changed, for
> example by introducing the mentioned "pageflip failed" event.

I agree.

From my point of view, after some serious failure in hardware or
driver, the main question is:

Can already open device fds continue to be used, or not?

If the intention is that they can continue to be used, then a page flip
event must be eventually delivered if one was expected under normal
circumstances. Otherwise userspace cannot continue. Or, if userspace is
supposed to employ its own timeout for waiting for the event, then
that is new'ish UAPI, and the device must stop returning EBUSY for new
commits.
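
To make the "userspace employs its own timeout" option concrete, here is a
minimal sketch of waiting for an event on a file descriptor with a deadline.
The names wait_for_flip_event and enum flip_wait_result are hypothetical and
exist only for this illustration; nothing here is existing DRM UAPI. The
sketch only shows the waiting side and says nothing about what the kernel
should do with the timed-out flip.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical result codes for this sketch; not part of any DRM API. */
enum flip_wait_result {
    FLIP_WAIT_READY = 1,   /* fd is readable: an event can be read */
    FLIP_WAIT_TIMEOUT = 0, /* nothing arrived within timeout_ms */
    FLIP_WAIT_ERROR = -1   /* poll() failed or the fd reported an error */
};

/*
 * Wait up to timeout_ms milliseconds for fd to become readable.
 * A poll() interrupted by a signal is retried with the full timeout,
 * so the effective deadline can be longer than timeout_ms.
 */
static enum flip_wait_result wait_for_flip_event(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return FLIP_WAIT_ERROR;
    if (ret == 0)
        return FLIP_WAIT_TIMEOUT;
    if (pfd.revents & POLLIN)
        return FLIP_WAIT_READY;

    /* Only POLLERR, POLLHUP or POLLNVAL was reported. */
    return FLIP_WAIT_ERROR;
}
```

In a compositor, fd would be the DRM device fd, and on FLIP_WAIT_READY the
event would be read with libdrm's drmHandleEvent(). What to do on
FLIP_WAIT_TIMEOUT is exactly the open question: it only helps if a following
commit is not rejected with EBUSY.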

If the intention is that open device fds have become unusable, then the
kernel should follow the same policy as for hot-unplug, which is
documented at
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#device-hot-unplug

Specifically, EBUSY is an inappropriate error to return in that case.

That policy includes sending the udev event for device removal, and everything
that implies. The hardware can then come back as a new device.
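
As an illustration of why the error code matters to userspace, here is a
minimal sketch of how a compositor might react to the result of a commit.
The enum and function names are hypothetical. The sketch assumes the commit
call reports failure as a negative errno value (the convention libdrm's
drmModeAtomicCommit() uses), and that a device that is gone reports ENODEV,
which is my reading of the hot-unplug document linked above.

```c
#include <errno.h>

/* Hypothetical reactions a compositor could take; not part of any API. */
enum commit_action {
    COMMIT_OK,          /* commit accepted */
    COMMIT_RETRY_LATER, /* a previous flip is still pending */
    COMMIT_DEVICE_GONE, /* stop using this fd, wait for a new device */
    COMMIT_FAILED       /* any other error */
};

/* ret is 0 on success or a negative errno value (assumed convention). */
static enum commit_action classify_commit_result(int ret)
{
    switch (ret) {
    case 0:
        return COMMIT_OK;
    case -EBUSY:
        /* Only a sensible answer if the pending flip will complete. */
        return COMMIT_RETRY_LATER;
    case -ENODEV:
        /* Assumed hot-unplug behaviour: the device will not come back. */
        return COMMIT_DEVICE_GONE;
    default:
        return COMMIT_FAILED;
    }
}
```

With EBUSY returned forever after a timed-out flip, userspace ends up in the
"retry later" branch indefinitely, which is the stuck display described in
the original report.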

The case at hand sounds like a driver bug to me.


Thanks,
pq




Handling pageflip timeouts

2024-03-13 Thread Xaver Hugl
Hi all,

This was already discussed on IRC, but I think this should be on the
mailing list as well and get some more official conclusion that's
written down somewhere.

Recently I've experienced a GPU reset, which the system successfully
recovered from, but the display was still stuck - because amdgpu hit a
pageflip timeout, which causes the compositor to wait for a pageflip
event that will never come. Some other experiments I did before showed
that even if the compositor tries submitting new atomic commits after
a timeout, those commits are rejected with EBUSY, presumably because
the timed out pageflip is still considered "pending" on the kernel
side.

After restarting the compositor, everything continued to work
correctly, so this state can be recovered from. Because of that I
think it would be useful for the kernel to act on pageflip timeouts
differently. It should
- signal the pageflip's completion to userspace
- maybe have a new event for "pageflip failed" to give userspace more
correct information in the future
- allow new commits to happen afterwards

Another case discussed was when the device is completely removed.
Right now, if a pageflip is pending when that happens, userspace never
gets the event for pageflip completion, just like with the GPU reset.
KWin ignores pending pageflips on hotunplug. Because the device is
removed it's not a big issue, but uAPI wise I would expect a pageflip
event to arrive for all commits that request them, no matter what -
and if that is not possible or desirable, uAPI has to be changed, for
example by introducing the mentioned "pageflip failed" event.

Looking forward to some answers,
Xaver Hugl