Hello,

This is an attempt at adding a GEM shrinker to panthor so the system
can finally reclaim GPU memory.

This implementation is loosely based on the MSM shrinker (which is why
I added the MSM maintainers in Cc), and it relies on the drm_gpuvm
eviction/validation infrastructure.

I've only done very basic IGT-based [1] and chromium-based (opening
a lot of tabs on Aquarium until the system starts reclaiming and
swapping out GPU buffers) testing, but I'm posting this early so I can get
preliminary feedback on the implementation. If someone knows about
better tools/ways to test the shrinker, please let me know.

A few words about some design/implementation choices:
- No MADVISE support because I want to see if we can live with just
  transparent reclaim
- We considered basing this implementation on the generic shrinker work
  started by Dmitry [2], but
  1. with the activeness/idleness tracking happening at the VM
     granularity, having per-BO LRUs would cause a lot of
     list_move()s that are not really needed (the VM as a whole
     becomes active/idle, so we don't need to track individual BOs)
  2. Thomas Zimmermann recently suggested that we should have our
     own GEM implementation instead of trying to add this extra reclaim
     complexity to gem-shmem. There are some plans to create a
     gem-uma (Unified Memory Architecture) lib that would do more
     than gem-shmem but in a way that doesn't force all its users
     to pay the overhead (size overhead of the gem object, mostly)
     for features they don't use. The patch "Part ways with
     drm_gem_shmem_object" shows what this component-based lib
     API could look like if it were to be extracted
- At the moment we only support swapout, but we could add an
  extra flag to specify when buffer content doesn't need to be
  preserved to avoid the swapout/swapin dance. First candidate for
  this DISCARD_ON_RECLAIM flag would probably be the tiler heap chunks.
- Reclaim uses _try_lock() all the way because of the various lock order
  inversions between the reclaim path and submission paths. That means
  we don't try very hard to reclaim hot GPU buffers, but the locking is
  such a mess that I don't really see a better option to be honest.
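
To illustrate the trylock-only approach from the last point, here is a
minimal user-space sketch. It is not the panthor code: struct sketch_bo,
sketch_bo_trylock(), sketch_bo_unlock() and sketch_reclaim_scan() are
made-up names, and an atomic flag stands in for whatever lock protects a
buffer. The only point is that contended buffers are skipped instead of
waited on.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a GPU buffer object (not a panthor struct). */
struct sketch_bo {
	atomic_flag locked;	/* stands in for the lock protecting the BO */
	size_t pages;		/* resident pages that reclaim could free */
};

/* Returns true if the lock was taken, false if it is already held. */
static bool sketch_bo_trylock(struct sketch_bo *bo)
{
	return !atomic_flag_test_and_set_explicit(&bo->locked,
						  memory_order_acquire);
}

static void sketch_bo_unlock(struct sketch_bo *bo)
{
	atomic_flag_clear_explicit(&bo->locked, memory_order_release);
}

/*
 * Walk the candidates and reclaim only those whose lock can be taken
 * without blocking. Buffers that are locked (assumed hot) are skipped,
 * so the scan never waits on a lock held by another path.
 * Returns the number of pages freed.
 */
size_t sketch_reclaim_scan(struct sketch_bo *bos, size_t count,
			   size_t nr_to_scan)
{
	size_t freed = 0;

	for (size_t i = 0; i < count && freed < nr_to_scan; i++) {
		if (!sketch_bo_trylock(&bos[i]))
			continue;	/* contended: give up on this BO */

		freed += bos[i].pages;
		bos[i].pages = 0;
		sketch_bo_unlock(&bos[i]);
	}

	return freed;
}
```

The cost of this pattern is the one mentioned above: a buffer that is
locked when the scan runs is simply not reclaimed during that pass.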

Regards,

Boris

[1] https://gitlab.freedesktop.org/bbrezillon/igt-gpu-tools/-/commit/fc76934a5579767d2aabe787d40e38a17c3f4ea4
[2] https://lkml.org/lkml/2024/1/5/665

Akash Goel (1):
  drm/panthor: Add a GEM shrinker

Boris Brezillon (8):
  drm/gem: Consider GEM object reclaimable if shrinking fails
  drm/gpuvm: Validate BOs in the extobj list when VM is resv protected
  drm/panthor: Move panthor_gems_debugfs_init() to panthor_gem.c
  drm/panthor: Group panthor_kernel_bo_xxx() helpers
  drm/panthor: Part ways with drm_gem_shmem_object
  drm/panthor: Lazily allocate pages on mmap()
  drm/panthor: Split panthor_vm_prepare_map_op_ctx() to prepare for
    reclaim
  drm/panthor: Track the number of mmap on a BO

 drivers/gpu/drm/drm_gem.c                |   10 +
 drivers/gpu/drm/drm_gpuvm.c              |   23 +-
 drivers/gpu/drm/panthor/Kconfig          |    1 -
 drivers/gpu/drm/panthor/panthor_device.c |   11 +-
 drivers/gpu/drm/panthor/panthor_device.h |   73 ++
 drivers/gpu/drm/panthor/panthor_drv.c    |   33 +-
 drivers/gpu/drm/panthor/panthor_fw.c     |   16 +-
 drivers/gpu/drm/panthor/panthor_gem.c    | 1387 ++++++++++++++++++----
 drivers/gpu/drm/panthor/panthor_gem.h    |  135 ++-
 drivers/gpu/drm/panthor/panthor_mmu.c    |  451 +++++--
 drivers/gpu/drm/panthor/panthor_mmu.h    |    8 +
 drivers/gpu/drm/panthor/panthor_sched.c  |    9 +-
 include/drm/drm_gpuvm.h                  |    6 +
 13 files changed, 1829 insertions(+), 334 deletions(-)

-- 
2.52.0
