On Wed, 21 Jan 2026 11:11:01 +0000
Akash Goel <[email protected]> wrote:

> Hi Boris, Steve
> 
> 
> On 1/12/26 14:17, Boris Brezillon wrote:
> > On Mon, 12 Jan 2026 12:06:17 +0000
> > Steven Price <[email protected]> wrote:
> >   
> >> On 09/01/2026 13:07, Boris Brezillon wrote:  
> >>> While drm_gem_shmem_object does most of the job we need it to do, the
> >>> way sub-resources (pages, sgt, vmap) are handled and their lifetimes
> >>> gets in the way of BO reclaim. There has been attempts to address
> >>> that [1], but in the meantime, new gem_shmem users were introduced
> >>> (accel drivers), and some of them manually free some of these resources.
> >>> This makes things harder to control/sanitize/validate.
> >>>
> >>> Thomas Zimmerman is not a huge fan of enforcing lifetimes of sub-resources
> >>> and forcing gem_shmem users to go through new gem_shmem helpers when they
> >>> need manual control of some sort, and I believe this is a dead end if
> >>> we don't force users to follow some stricter rules through carefully
> >>> designed helpers, because there will always be one user doing crazy things
> >>> with gem_shmem_object internals, which ends up tripping out the common
> >>> helpers when they are called.
> >>>
> >>> The consensus we reached was that we would be better off forking
> >>> gem_shmem in panthor. So here we are, parting ways with gem_shmem. The
> >>> current transition tries to minimize the changes, but there are still
> >>> some aspects that are different, the main one being that we no longer
> >>> have a pages_use_count, and pages stays around until the GEM object is
> >>> destroyed (or when evicted once we've added a shrinker). The sgt also
> >>> no longer retains pages. This is losely based on how msm does things by
> >>> the way.  
> >>
> >>  From a reviewing perspective it's a little tricky trying to match up the
> >> implementation to shmem because of these changes. I don't know how
> >> difficult it would be to split the changes to a patch which literally
> >> copies (with renames) from shmem, followed by simplifying out the parts
> >> we don't want.  
> > 
> > It's a bit annoying as the new implementation is not based on shmem at
> > all, but if you think it helps the review, I can try what you're
> > suggesting. I mean, I'm not convinced it will be significantly easier
> > to review with this extra step, since the new logic is different enough
> > (especially when it comes to resource refcounting) that it needs a
> > careful review anyway (which you started doing here).
> >   
> >>
> >> Of course the main issue is going to be getting some proper testing of
> >> this (especially with the shrinker added).  
> > 
> > For the shrinker, the best I can propose for now is extending the
> > IGT tests I've added. For close-to-real-usecases testing of the shmem ->
> > custom transition (this commit), making sure the g610 jobs we have in
> > mesa CI still passes is a start. If you have other ideas, I'd be happy
> > to give them a try.
> >   
> >>  
> >>>
> >>> If there's any interest in sharing code (probably with msm, since the
> >>> panthor shrinker is going to be losely based on the msm implementation),
> >>> we can always change gears and do that once we have everything
> >>> working/merged.
> >>>
> >>> [1]https://patchwork.kernel.org/project/dri-devel/patch/[email protected]/
> >>>
> >>> Signed-off-by: Boris Brezillon <[email protected]>
> >>> ---
> >>>   drivers/gpu/drm/panthor/Kconfig         |   1 -
> >>>   drivers/gpu/drm/panthor/panthor_drv.c   |   7 +-
> >>>   drivers/gpu/drm/panthor/panthor_fw.c    |  16 +-
> >>>   drivers/gpu/drm/panthor/panthor_gem.c   | 696 ++++++++++++++++++++----
> >>>   drivers/gpu/drm/panthor/panthor_gem.h   |  62 ++-
> >>>   drivers/gpu/drm/panthor/panthor_mmu.c   |  49 +-
> >>>   drivers/gpu/drm/panthor/panthor_sched.c |   9 +-
> >>>   7 files changed, 666 insertions(+), 174 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/panthor/Kconfig 
> >>> b/drivers/gpu/drm/panthor/Kconfig
> >>> index 55b40ad07f3b..911e7f4810c3 100644
> >>> --- a/drivers/gpu/drm/panthor/Kconfig
> >>> +++ b/drivers/gpu/drm/panthor/Kconfig
> >>> @@ -8,7 +8,6 @@ config DRM_PANTHOR
> >>>           depends on MMU
> >>>           select DEVFREQ_GOV_SIMPLE_ONDEMAND
> >>>           select DRM_EXEC
> >>> - select DRM_GEM_SHMEM_HELPER
> >>>           select DRM_GPUVM
> >>>           select DRM_SCHED
> >>>           select IOMMU_IO_PGTABLE_LPAE
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> >>> b/drivers/gpu/drm/panthor/panthor_drv.c
> >>> index 52c27a60c84a..90e9abc22d9e 100644
> >>> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> >>> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> >>> @@ -19,6 +19,7 @@
> >>>   #include <drm/drm_debugfs.h>
> >>>   #include <drm/drm_drv.h>
> >>>   #include <drm/drm_exec.h>
> >>> +#include <drm/drm_file.h>
> >>>   #include <drm/drm_ioctl.h>
> >>>   #include <drm/drm_print.h>
> >>>   #include <drm/drm_syncobj.h>
> >>> @@ -1457,7 +1458,7 @@ static int panthor_ioctl_bo_query_info(struct 
> >>> drm_device *ddev, void *data,
> >>>           args->create_flags = bo->flags;
> >>>   
> >>>           args->extra_flags = 0;
> >>> - if (drm_gem_is_imported(&bo->base.base))
> >>> + if (drm_gem_is_imported(&bo->base))
> >>>                   args->extra_flags |= DRM_PANTHOR_BO_IS_IMPORTED;
> >>>   
> >>>           drm_gem_object_put(obj);
> >>> @@ -1671,8 +1672,7 @@ static const struct drm_driver panthor_drm_driver = 
> >>> {
> >>>           .major = 1,
> >>>           .minor = 7,
> >>>   
> >>> - .gem_create_object = panthor_gem_create_object,
> >>> - .gem_prime_import_sg_table = drm_gem_shmem_prime_import_sg_table,
> >>> + .gem_prime_import_sg_table = panthor_gem_prime_import_sg_table,
> >>>           .gem_prime_import = panthor_gem_prime_import,
> >>>   #ifdef CONFIG_DEBUG_FS
> >>>           .debugfs_init = panthor_debugfs_init,
> >>> @@ -1822,3 +1822,4 @@ module_exit(panthor_exit);
> >>>   MODULE_AUTHOR("Panthor Project Developers");
> >>>   MODULE_DESCRIPTION("Panthor DRM Driver");
> >>>   MODULE_LICENSE("Dual MIT/GPL");
> >>> +MODULE_IMPORT_NS("DMA_BUF");
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c 
> >>> b/drivers/gpu/drm/panthor/panthor_fw.c
> >>> index a64ec8756bed..f135cf2130b8 100644
> >>> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> >>> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> >>> @@ -627,7 +627,6 @@ static int panthor_fw_load_section_entry(struct 
> >>> panthor_device *ptdev,
> >>>                   u32 cache_mode = hdr.flags & 
> >>> CSF_FW_BINARY_IFACE_ENTRY_CACHE_MODE_MASK;
> >>>                   struct panthor_gem_object *bo;
> >>>                   u32 vm_map_flags = 0;
> >>> -         struct sg_table *sgt;
> >>>                   u64 va = hdr.va.start;
> >>>   
> >>>                   if (!(hdr.flags & CSF_FW_BINARY_IFACE_ENTRY_WR))
> >>> @@ -665,11 +664,12 @@ static int panthor_fw_load_section_entry(struct 
> >>> panthor_device *ptdev,
> >>>                   panthor_fw_init_section_mem(ptdev, section);
> >>>   
> >>>                   bo = to_panthor_bo(section->mem->obj);
> >>> -         sgt = drm_gem_shmem_get_pages_sgt(&bo->base);
> >>> -         if (IS_ERR(sgt))
> >>> -                 return PTR_ERR(sgt);
> >>>   
> >>> -         dma_sync_sgtable_for_device(ptdev->base.dev, sgt, 
> >>> DMA_TO_DEVICE);
> >>> +         /* An sgt should have been requested when the kernel BO was 
> >>> GPU-mapped. */
> >>> +         if (drm_WARN_ON_ONCE(&ptdev->base, !bo->dmap.sgt))
> >>> +                 return -EINVAL;
> >>> +
> >>> +         dma_sync_sgtable_for_device(ptdev->base.dev, bo->dmap.sgt, 
> >>> DMA_TO_DEVICE);
> >>>           }
> >>>   
> >>>           if (hdr.va.start == CSF_MCU_SHARED_REGION_START)
> >>> @@ -729,8 +729,10 @@ panthor_reload_fw_sections(struct panthor_device 
> >>> *ptdev, bool full_reload)
> >>>                           continue;
> >>>   
> >>>                   panthor_fw_init_section_mem(ptdev, section);
> >>> -         sgt = 
> >>> drm_gem_shmem_get_pages_sgt(&to_panthor_bo(section->mem->obj)->base);
> >>> -         if (!drm_WARN_ON(&ptdev->base, IS_ERR_OR_NULL(sgt)))
> >>> +
> >>> +         /* An sgt should have been requested when the kernel BO was 
> >>> GPU-mapped. */
> >>> +         sgt = to_panthor_bo(section->mem->obj)->dmap.sgt;
> >>> +         if (!drm_WARN_ON_ONCE(&ptdev->base, !sgt))
> >>>                           dma_sync_sgtable_for_device(ptdev->base.dev, 
> >>> sgt, DMA_TO_DEVICE);
> >>>           }
> >>>   }
> >>> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c 
> >>> b/drivers/gpu/drm/panthor/panthor_gem.c
> >>> index 4b3d82f001d8..0e52c7a07c87 100644
> >>> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> >>> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> >>> @@ -8,9 +8,11 @@
> >>>   #include <linux/dma-mapping.h>
> >>>   #include <linux/err.h>
> >>>   #include <linux/slab.h>
> >>> +#include <linux/vmalloc.h>
> >>>   
> >>>   #include <drm/drm_debugfs.h>
> >>>   #include <drm/drm_file.h>
> >>> +#include <drm/drm_prime.h>
> >>>   #include <drm/drm_print.h>
> >>>   #include <drm/panthor_drm.h>
> >>>   
> >>> @@ -44,7 +46,7 @@ static void panthor_gem_debugfs_bo_init(struct 
> >>> panthor_gem_object *bo)
> >>>   
> >>>   static void panthor_gem_debugfs_bo_add(struct panthor_gem_object *bo)
> >>>   {
> >>> - struct panthor_device *ptdev = container_of(bo->base.base.dev,
> >>> + struct panthor_device *ptdev = container_of(bo->base.dev,
> >>>                                                       struct 
> >>> panthor_device, base);
> >>>   
> >>>           bo->debugfs.creator.tgid = current->group_leader->pid;
> >>> @@ -57,7 +59,7 @@ static void panthor_gem_debugfs_bo_add(struct 
> >>> panthor_gem_object *bo)
> >>>   
> >>>   static void panthor_gem_debugfs_bo_rm(struct panthor_gem_object *bo)
> >>>   {
> >>> - struct panthor_device *ptdev = container_of(bo->base.base.dev,
> >>> + struct panthor_device *ptdev = container_of(bo->base.dev,
> >>>                                                       struct 
> >>> panthor_device, base);
> >>>   
> >>>           if (list_empty(&bo->debugfs.node))
> >>> @@ -80,9 +82,9 @@ static void panthor_gem_debugfs_bo_init(struct 
> >>> panthor_gem_object *bo) {}
> >>>   #endif
> >>>   
> >>>   static bool
> >>> -should_map_wc(struct panthor_gem_object *bo, struct panthor_vm 
> >>> *exclusive_vm)
> >>> +should_map_wc(struct panthor_gem_object *bo)
> >>>   {
> >>> - struct panthor_device *ptdev = container_of(bo->base.base.dev, struct 
> >>> panthor_device, base);
> >>> + struct panthor_device *ptdev = container_of(bo->base.dev, struct 
> >>> panthor_device, base);
> >>>   
> >>>           /* We can't do uncached mappings if the device is coherent,
> >>>            * because the zeroing done by the shmem layer at page 
> >>> allocation
> >>> @@ -112,6 +114,208 @@ should_map_wc(struct panthor_gem_object *bo, struct 
> >>> panthor_vm *exclusive_vm)
> >>>           return true;
> >>>   }
> >>>   
> >>> +static void
> >>> +panthor_gem_backing_cleanup(struct panthor_gem_object *bo)
> >>> +{
> >>> + if (!bo->backing.pages)
> >>> +         return;
> >>> +
> >>> + drm_gem_put_pages(&bo->base, bo->backing.pages, true, false);
> >>> + bo->backing.pages = NULL;
> >>> +}
> >>> +
> >>> +static int
> >>> +panthor_gem_backing_get_pages_locked(struct panthor_gem_object *bo)
> >>> +{
> >>> + dma_resv_assert_held(bo->base.resv);
> >>> +
> >>> + if (bo->backing.pages)
> >>> +         return 0;
> >>> +
> >>> + bo->backing.pages = drm_gem_get_pages(&bo->base);
> >>> + if (IS_ERR(bo->backing.pages)) {
> >>> +         drm_dbg_kms(bo->base.dev, "Failed to get pages (%pe)\n",
> >>> +                     bo->backing.pages);
> >>> +         return PTR_ERR(bo->backing.pages);  
> >>
> >> This leaves bo->backing.pages set to the error value, which means a
> >> future call to panthor_gem_backing_get_pages_locked() for the same
> >> object will return success. Unless there's some 'poisoning' that I
> >> haven't spotted this looks like a bug.  
> > 
> > That's a bug, I'll fix it.
> >   
> 
> I ran into this bug, during the stress testing, which resulted in a crash.
> 
> 
> Unable to handle kernel paging request at virtual address fffffffffffffff4
> KASAN: maybe wild-memory-access in range 
> [0x0003ffffffffffa0-0x0003ffffffffffa7]
> Mem abort info:
>    ESR = 0x0000000096000006
>    EC = 0x25: DABT (current EL), IL = 32 bits
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
>    FSC = 0x06: level 2 translation fault
> Data abort info:
>    ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
>    CM = 0, WnR = 0, TnD = 0, TagAccess = 0
>    GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000084026000
> [fffffffffffffff4] pgd=0000000000000000, p4d=0000000088a6d403, 
> pud=0000000088a6e403, pmd=0000000000000000
> pc : nonblocking_page_setup+0x90/0x108 [panthor]
> lr : nonblocking_page_setup+0x8c/0x108 [panthor]
> sp : ffff80008bdd7a80
> Call trace:
>   nonblocking_page_setup+0x90/0x108 [panthor] (P)
>   panthor_gem_fault+0x84/0x218 [panthor]
>   __do_fault+0x78/0x3d0
>   __handle_mm_fault+0xe20/0x23a0
>   handle_mm_fault+0x10c/0x438
>   do_page_fault+0x234/0x958
>   do_translation_fault+0xa0/0xd8
>   do_mem_abort+0x68/0x100
>   el0_da+0x54/0x1d8
>   el0t_64_sync_handler+0xd0/0xe8
>   el0t_64_sync+0x198/0x1a0
> 
> 
> First the call to panthor_gem_backing_get_pages_locked(), from 
> blocking_page_setup(), failed due to low memory.
> 
> panthor_gem_fault() returned VM_FAULT_RETRY as mmap_lock_held was 0.
> 
> Then the crash happened inside nonblocking_page_setup() when
> panthor_gem_fault() was called for the 2nd time.
> 
> 
> Made the following change locally to avoid the issue.
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c 
> b/drivers/gpu/drm/panthor/panthor_gem.c
> index 6e91c5022d0d..b31a4606a8c6 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -234,7 +234,7 @@ panthor_gem_backing_get_pages_locked(struct 
> panthor_gem_object *bo)
>          if (IS_ERR(bo->backing.pages)) {
>                  drm_dbg_kms(bo->base.dev, "Failed to get pages (%pe)\n",
>                              bo->backing.pages);
> -               return PTR_ERR(bo->backing.pages);
> +               return PTR_ERR(xchg(&bo->backing.pages, NULL));

That's clever, but after searching for such patterns in the linux code
base, I couldn't find a match (xchg only seems to be used for atomic
updates), so I'd be tempted to stick to the usual

                ret = PTR_ERR(bo->backing.pages);
                bo->backing.pages = NULL;
                return ret;

or

        pages = xxx()
        if (IS_ERR(pages)) {
                drm_dbg_kms(bo->base.dev, "Failed to get pages (%pe)\n",
                            bo->backing.pages);
                return PTR_ERR(pages);
        }

        bo->backing.pages = pages;

Anyway, I'm adding this to my fixup stack.

Thanks,

Boris

Reply via email to