> Then second even if the kernel can do it I'm not sure if it should do it.
>
> I mean userspace asked for a quick page flip and not some expensive CRTC/PLL 
> reprogramming. Stuff like that usually takes some time and by then the frame 
> which should be displayed by the page flip might already be stale and it 
> would be better to tell userspace that we couldn't display it on time and 
> wait for a new frame to be generated.

I would personally prefer a new "pageflip failed" event, which the
compositor can react to appropriately.
For compositors not opting into that new API, the kernel automatically
fixing things would be nice though. Even pretending the pageflip
completed and then failing the next one with EINVAL would be enough to
trigger a modeset in the case of KWin.
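
To illustrate the fallback path, here is a minimal sketch of how a
compositor might map the result of an atomic commit to a reaction. This
is not KWin's actual code. The enum and function names are hypothetical,
and the handling of errors other than EINVAL is an assumption. It only
relies on the commit call returning 0 or a negative errno value, as
libdrm's drmModeAtomicCommit does.

```c
#include <errno.h>

/* Hypothetical names for illustration; not taken from any compositor. */
enum flip_action {
    FLIP_OK,           /* commit accepted, wait for the pageflip event */
    FLIP_RETRY_LATER,  /* an earlier flip is still pending, retry next frame */
    FLIP_FULL_MODESET, /* fall back to a commit that allows a modeset */
    FLIP_REPORT_ERROR  /* anything else: log it and point at the bug tracker */
};

/* commit_result is 0 on success or a negative errno value. */
enum flip_action react_to_commit(int commit_result)
{
    switch (commit_result) {
    case 0:
        return FLIP_OK;
    case -EBUSY:
        /* Assumption: EBUSY is treated as transient. */
        return FLIP_RETRY_LATER;
    case -EINVAL:
        /* The case described above: EINVAL triggers a modeset. */
        return FLIP_FULL_MODESET;
    default:
        return FLIP_REPORT_ERROR;
    }
}
```

With something like this in place, a kernel that pretends the flip
completed and then rejects the next commit with EINVAL would land in the
FLIP_FULL_MODESET branch without any new uAPI.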

> And third, there must be a root cause of the page flip not completing.
>
> My educated guess is that we have some atomic property change or even turning 
> the CRTC off in parallel with the page flip. I mean HW rarely turns off its 
> reoccurring vblank interrupt on its own.
>
> Returning an error to userspace might actually help identify the root cause.

I know of two things that frequently trigger pageflip timeouts.

On dedicated GPUs, most of them seem to happen when a game causes a GPU reset.
In some cases, it seemed like the pageflip did complete, but the
kernel never sent the pageflip event to userspace. It also rejected
new atomic commits of the compositor with EBUSY - but a new instance
of the compositor could still do atomic commits just fine.
In other cases, triggering another GPU reset was necessary to recover,
and in yet other cases it was just broken beyond repair.
Presumably, all of them are caused by bugs in the GPU reset sequence.
As another piece of information on that, KWin only does atomic commits
once the fences of the involved buffers are signaled, and it does not
use OUT_FENCE_FD. So fence signaling should not be relevant, at least
not on the KMS uAPI level.
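
For context, waiting for a buffer fence before committing can be done
entirely in userspace, roughly like the sketch below. It assumes the
fence is available as a pollable file descriptor (a sync_file fd becomes
readable once the fence has signaled). The function name is hypothetical
and this is not how KWin is actually structured.

```c
#include <errno.h>
#include <poll.h>

/*
 * Wait until fence_fd becomes readable, i.e. the fence has signaled.
 * Returns 1 if signaled, 0 on timeout, or a negative errno value.
 * Note: after EINTR the wait restarts with the full timeout.
 */
int fence_wait(int fence_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };
    int ret;

    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0)
        return -errno;
    if (ret == 0)
        return 0; /* timed out, fence not signaled yet */
    if (pfd.revents & (POLLERR | POLLNVAL))
        return -EINVAL; /* bad or broken file descriptor */
    return 1;
}
```

Only once every involved buffer reports 1 here would the atomic commit
be submitted, so the kernel never has to wait on those fences itself.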

On APUs, the vast majority are caused by PSR (Panel Self Refresh). I
know many AMD laptop users who run with "amdgpu.dcdebugmask=0x10" to
disable PSR as a workaround, and who have never seen a pageflip timeout
since setting that option. I have even seen a freeze *without* a
pageflip timeout while testing PSR again on my laptop recently, so PSR
seems to have even bigger issues.
Pageflip timeout or not, manually triggering a GPU reset seems to be a
reliable way to recover from it.
IMO that one is bad and widespread enough that PSR should be disabled
by default on the relevant AMD hardware until it no longer causes such
problems.
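
When triaging reports, it helps to know whether a user already has that
workaround active. A minimal sketch of the check, assuming the mask
value has been obtained as a string (for example from the kernel command
line or the module parameter) and that bit 0x10 is the PSR disable bit,
as in the workaround above. The macro and function names are
hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Assumption: bit 0x10 of amdgpu.dcdebugmask disables PSR. */
#define DCDEBUGMASK_DISABLE_PSR 0x10UL

/*
 * Parse a dcdebugmask value such as "0x10" or "16".
 * Returns 1 if the PSR disable bit is set, 0 if not, -1 on parse error.
 */
int psr_disabled_by_mask(const char *mask_str)
{
    char *end = NULL;
    unsigned long mask;

    errno = 0;
    mask = strtoul(mask_str, &end, 0); /* base 0 accepts hex and decimal */
    if (errno != 0 || end == mask_str)
        return -1;

    return (mask & DCDEBUGMASK_DISABLE_PSR) ? 1 : 0;
}
```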

On the topic of whether or not this is just a thing the driver has to
fix, this isn't as exclusive to amdgpu as it might seem. i915 has some
pageflip timeout issues with apparently still unknown causes
(https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14604), and
the proprietary Nvidia driver had one some time ago that IIRC was
caused by firmware bugs.

Obviously, drivers still need to be fixed, but the bug is bad enough
for the end user that a fallback would be very helpful. If userspace
gets notified about it, we can still direct users to the relevant bug
trackers, so the underlying bugs hopefully get fixed either way.

- Xaver
