Hi,

Since its introduction, the vc4 driver has scheduled its jobs and tracked
the dependencies between them using its own internal job queue
implementation. This internal implementation is based on job lists, wait
queues and hand-rolled seqnos. Although job scheduling worked most of
the time, many GPU hangs were reported in more GPU-intensive scenarios
[1][2].

After investigating several GPU hangs, I noticed that job dependencies
weren't being tracked correctly, which could lead to synchronization
issues and GPU resets. Also, the GPU reset path had issues related to
job resubmission.
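
To illustrate what tracking dependencies means here, the following is a
user-space toy model only. It is not kernel code and uses neither the vc4
nor the DRM GPU scheduler API; every name in it (toy_fence, toy_job and
the helpers) is hypothetical. It shows the invariant a scheduler has to
enforce: a job may run only after every fence it depends on has signaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a fence: it only records whether the work finished. */
struct toy_fence {
	bool signaled;
};

/* Toy job: may run only after every fence it depends on has signaled. */
struct toy_job {
	struct toy_fence *deps[4];	/* hypothetical fixed-size dependency list */
	size_t num_deps;
	struct toy_fence done;		/* signaled when this job completes */
};

/* Record that @job must wait for @fence. Returns false if the list is full. */
bool toy_job_add_dependency(struct toy_job *job, struct toy_fence *fence)
{
	assert(job && fence);
	if (job->num_deps == sizeof(job->deps) / sizeof(job->deps[0]))
		return false;
	job->deps[job->num_deps++] = fence;
	return true;
}

/* A job is runnable only when all of its dependencies have signaled. */
bool toy_job_ready(const struct toy_job *job)
{
	for (size_t i = 0; i < job->num_deps; i++)
		if (!job->deps[i]->signaled)
			return false;
	return true;
}

/* Run the job if it is ready and signal its fence; returns whether it ran. */
bool toy_job_try_run(struct toy_job *job)
{
	if (!toy_job_ready(job))
		return false;
	job->done.signaled = true;
	return true;
}
```

In this model, a render job that depends on a bin job's fence is refused
by toy_job_try_run() until the bin job has run. Losing such a dependency
is the kind of bug that lets a job start too early.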

Considering the many issues related to the internal job queue
implementation, this series proposes switching to the DRM GPU scheduler,
which is a well-established implementation used by multiple DRM drivers.

This has many advantages:

1. Using common code: Instead of relying on a custom implementation, use
   a trusted common framework. This helps with maintainability of the
   vc4 driver. It also makes the code more readable.

2. Synchronization issues are gone: With this series, applications can
   work reliably on RPi 3. Many users reported that they weren't able to
   open applications like emulators on the device. Now, it's possible to
   play several retro games without issues.

3. GPU resets are recoverable: Even if a timeout happens, the GPU is able
   to recover successfully with minimal impact to the user.

4. PM actually works: Before this series, the GPU was active during the 
   entire runtime. After this series, the GPU is able to autosuspend and
   resume when needed.

To make the patches easier to review, I introduced the new infrastructure
piece by piece without actually plugging it in. The actual switchover
only happens in the patch "drm/vc4: Switch to DRM GPU scheduler".

For this second version, I moved all the fix patches to a different
series [3] to keep this series focused on the scheduler changes.

This series was mostly based on the design of the v3d driver as the two
drivers are very similar.

[1] https://github.com/raspberrypi/linux/issues/5780
[2] https://github.com/raspberrypi/linux/issues/3221
[3] https://lore.kernel.org/dri-devel/[email protected]/T/

Best regards,
- Maíra

---
v1 -> v2: https://lore.kernel.org/r/[email protected]

- Moved all miscellaneous fixes and improvements to a separate series:
  https://lore.kernel.org/dri-devel/[email protected]/T/
- [1/7] Add Melissa's R-b (Melissa Wen)
- [2/7] Squash "[PATCH 04/11] drm/vc4: Introduce vc4_job structures for DRM
  scheduler integration" and "[PATCH 05/11] drm/vc4: Add DRM GPU scheduler
  infrastructure" (Tvrtko Ursulin)
- [2/7] Centralize the initialization of queues in vc4_sched_init()
  (Melissa Wen)
- [2/7] Handle error when vc4_fence_create() fails (Melissa Wen)
- [2/7] Protect vc4->render_job when updating the pointer in `run_job()`
  (Melissa Wen)
- [2/7] Handle error when drm_sched_entity_init() fails (Tvrtko Ursulin)
- [2/7] Clarify comment in sched_lock (Tvrtko Ursulin)
- [2/7] Remove fence_lock as dma-fences now support a built-in lock
  (Tvrtko Ursulin)
- [2/7] Use spin_(un)lock_irq in `run_job()` callbacks (Tvrtko Ursulin)
- [3/7] Add a comment explaining why we don't need to unreference BOs in
  case of failure in vc4_lookup_bos()
- [3/7] Several stylistic adjustments to vc4_get_bcl() (Tvrtko Ursulin)
- [3/7] s/kvmalloc_array/kvmalloc in vc4_get_bcl() (Tvrtko Ursulin)
- [3/7] Remove all error messages related to allocation failures
  (Tvrtko Ursulin)
- [3/7] Rename vc4_job_init() to vc4_job_alloc() (Tvrtko Ursulin)
- [3/7] Address cases in which in_sync or out_sync <= 0 (Tvrtko Ursulin)
- [3/7] Replace vc4_attach_fences_and_unlock_reservation() with
  vc4_attach_fences() + drm_exec_fini() (Tvrtko Ursulin)
- [3/7] Don't clean-up the BIN job if it has already been pushed.
- [4/7] NEW PATCH: "drm/vc4: Refcount vc4_file for safe access by jobs"
  (Tvrtko Ursulin)
- [5/7] Use vc4_file refcount to get a fd reference.
- [5/7] Add comment explaining why we use dma_fence_get_rcu() in
  vc4_wait_seqno_ioctl() (Tvrtko Ursulin)
- [7/7] Return "unknown" instead of NULL when the fence type is unknown
  (Tvrtko Ursulin)

---
Maíra Canal (7):
      drm/vc4: Move vc4_wait_bo_ioctl() to vc4_bo.c
      drm/vc4: Add DRM GPU scheduler infrastructure and job structures
      drm/vc4: Add new job submission implementation
      drm/vc4: Refcount vc4_file for safe access by jobs
      drm/vc4: Add per-file descriptor seqno tracking
      drm/vc4: Switch to DRM GPU scheduler
      drm/vc4: Use unique fence timeline names per queue

 drivers/gpu/drm/vc4/Kconfig         |   1 +
 drivers/gpu/drm/vc4/Makefile        |   2 +
 drivers/gpu/drm/vc4/vc4_bo.c        |  33 ++
 drivers/gpu/drm/vc4/vc4_drv.c       |  49 +-
 drivers/gpu/drm/vc4/vc4_drv.h       | 234 ++++-----
 drivers/gpu/drm/vc4/vc4_fence.c     |  34 +-
 drivers/gpu/drm/vc4/vc4_gem.c       | 961 ++----------------------------------
 drivers/gpu/drm/vc4/vc4_irq.c       | 132 +----
 drivers/gpu/drm/vc4/vc4_render_cl.c |  17 +-
 drivers/gpu/drm/vc4/vc4_sched.c     | 337 +++++++++++++
 drivers/gpu/drm/vc4/vc4_submit.c    | 581 ++++++++++++++++++++++
 drivers/gpu/drm/vc4/vc4_v3d.c       |  24 +-
 drivers/gpu/drm/vc4/vc4_validate.c  |  21 +-
 13 files changed, 1243 insertions(+), 1183 deletions(-)
---
base-commit: 4b9c36c83b34f710da9573291404f6a2246251c1
change-id: 20260121-vc4-drm-scheduler-03cd8670b3f6
