On Tue, 2025-07-22 at 01:45 -0700, Matthew Brost wrote:
> On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote:
> > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote:
> > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote:
> > > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote:
> > > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote:
> > > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote:
> > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote:
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > index bfea608a7106..997a2cc1a635 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > > > > > > >  
> > > > > > > >  	entity->oldest_job_waiting = ts;
> > > > > > > >  
> > > > > > > > -	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> > > > > > > > -		      drm_sched_entity_compare_before);
> > > > > > > > +	if (!entity->stopped) {
> > > > > > > > +		rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> > > > > > > > +			      drm_sched_entity_compare_before);
> > > > > > > > +	}
> > > > > > > 
> > > > > > > If this is a race, then this patch here is broken, too, because you're
> > > > > > > checking the 'stopped' boolean as the callers of that function do, too
> > > > > > > – just later. :O
> > > > > > > 
> > > > > > > It could still race, just less likely.
> > > > > > > 
> > > > > > > The proper way to fix it would then be to address the issue where the
> > > > > > > locking is supposed to happen. Let's look at, for example,
> > > > > > > drm_sched_entity_push_job():
> > > > > > > 
> > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > > > > > {
> > > > > > > 	(Bla bla bla)
> > > > > > > 
> > > > > > > 	…………
> > > > > > > 
> > > > > > > 	/* first job wakes up scheduler */
> > > > > > > 	if (first) {
> > > > > > > 		struct drm_gpu_scheduler *sched;
> > > > > > > 		struct drm_sched_rq *rq;
> > > > > > > 
> > > > > > > 		/* Add the entity to the run queue */
> > > > > > > 		spin_lock(&entity->lock);
> > > > > > > 		if (entity->stopped) {			<---- Aha!
> > > > > > > 			spin_unlock(&entity->lock);
> > > > > > > 
> > > > > > > 			DRM_ERROR("Trying to push to a killed entity\n");
> > > > > > > 			return;
> > > > > > > 		}
> > > > > > > 
> > > > > > > 		rq = entity->rq;
> > > > > > > 		sched = rq->sched;
> > > > > > > 
> > > > > > > 		spin_lock(&rq->lock);
> > > > > > > 		drm_sched_rq_add_entity(rq, entity);
> > > > > > > 
> > > > > > > 		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > > > > > 			drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);	<---- bumm!
> > > > > > > 
> > > > > > > 		spin_unlock(&rq->lock);
> > > > > > > 		spin_unlock(&entity->lock);
> > > > > > > 
> > > > > > > But the locks are still being held here. So that "shouldn't be
> > > > > > > happening"(tm).
> > > > > > > 
> > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini()
> > > > > > > stop entities. The former holds the appropriate locks, but
> > > > > > > drm_sched_fini() doesn't. So that looks like a hot candidate to me.
> > > > > > > Opinions?
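To make the locking expectation above concrete, here is a rough, untested
sketch of how drm_sched_rq_update_fifo_locked() could assert its contract
instead of silently skipping stopped entities. The signature, the ktime_t
parameter and the function tail are assumed from the quoted diff and caller,
not copied from the actual source:

/*
 * Rough sketch only: make the locking contract explicit and warn when a
 * stopped entity is about to be (re)inserted, rather than hiding the
 * broken caller behind a silent check.
 */
void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
				     struct drm_sched_rq *rq, ktime_t ts)
{
	/* Callers must hold both locks, as drm_sched_entity_push_job() does. */
	lockdep_assert_held(&entity->lock);
	lockdep_assert_held(&rq->lock);

	/* A stopped entity must never end up in the FIFO tree again. */
	if (WARN_ON_ONCE(entity->stopped))
		return;

	entity->oldest_job_waiting = ts;

	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
		      drm_sched_entity_compare_before);
}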
> > > > > > > 
> > > > > > > On the other hand, aren't drivers prohibited from calling
> > > > > > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the
> > > > > > > fuzzer does that, then it's not the scheduler's fault.
> > > > > 
> > > > > Exactly, this is the first question to ask.
> > > > > 
> > > > > And I think it's even more restrictive:
> > > > > 
> > > > > In drm_sched_fini()
> > > > > 
> > > > > 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > > > 		struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > > 
> > > > > 		spin_lock(&rq->lock);
> > > > > 		list_for_each_entry(s_entity, &rq->entities, list)
> > > > > 			/*
> > > > > 			 * Prevents reinsertion and marks job_queue as idle,
> > > > > 			 * it will be removed from the rq in
> > > > > 			 * drm_sched_entity_fini() eventually
> > > > > 			 */
> > > > > 			s_entity->stopped = true;
> > > > > 		spin_unlock(&rq->lock);
> > > > > 		kfree(sched->sched_rq[i]);
> > > > > 	}
> > > > > 
> > > > > In drm_sched_entity_kill()
> > > > > 
> > > > > 	static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> > > > > 	{
> > > > > 		struct drm_sched_job *job;
> > > > > 		struct dma_fence *prev;
> > > > > 
> > > > > 		if (!entity->rq)
> > > > > 			return;
> > > > > 
> > > > > 		spin_lock(&entity->lock);
> > > > > 		entity->stopped = true;
> > > > > 		drm_sched_rq_remove_entity(entity->rq, entity);
> > > > > 		spin_unlock(&entity->lock);
> > > > > 
> > > > > 		[...]
> > > > > 	}
> > > > > 
> > > > > If this runs concurrently, this is a UAF as well.
> > > > > 
> > > > > Personally, I have always been working with the assumption that
> > > > > entities have to be torn down *before* the scheduler, but those
> > > > > lifetimes are not documented properly.
> > > > 
> > > > Yes, this is my assumption too. I would even take it further: an entity
> > > > shouldn't be torn down until all jobs associated with it are freed as
> > > > well. I think this would solve a lot of issues I've seen on the list
> > > > related to UAF, teardown, etc.
> > > 
> > > That's kind of impossible with the new teardown design, because
> > > drm_sched_fini() ensures that all jobs are freed on teardown. And
> > > drm_sched_fini() wouldn't be called before all jobs are gone,
> > > effectively resulting in a chicken-and-egg problem, or rather: the
> > > driver implementing its own solution for teardown.
> > 
> > I've read this four times and I'm still generally confused.
> > 
> > "drm_sched_fini() ensures that all jobs are freed on teardown" — Yes,
> > that's how a refcounting-based solution works. drm_sched_fini() would
> > never be called if there were pending jobs.
> > 
> > "drm_sched_fini() wouldn't be called before all jobs are gone" — See
> > above.
> > 
> > "effectively resulting in a chicken-and-egg problem" — A job is created
> > after the scheduler, and it holds a reference to the scheduler until
> > it's freed. I don't see how this idiom applies.
> > 
> > "the driver implementing its own solution for teardown" — It's just
> > following the basic lifetime rules I outlined below. Perhaps Xe was
> > ahead of its time, but the number of DRM scheduler blowups we've had is
> > zero — maybe a strong indication that this design is correct.
> 
> Sorry, self-reply.
> 
> To expand on this: the reason Xe implemented a refcount-based teardown
> solution is that the internals of the DRM scheduler during teardown
> looked wildly scary. A lower layer should not impose its will on upper
> layers. I think that's the root cause of all the problems I've listed.
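To illustrate the refcounting scheme Matt describes (jobs pin the entity, the
entity outlives its jobs, and the scheduler outlives the entities), here is a
hypothetical driver-side sketch using plain krefs; my_entity, my_job and the
callbacks are made-up names, and this is not Xe's actual code:

/*
 * Hypothetical driver objects modelling the lifetime chain
 * job -> entity -> scheduler. Illustration only.
 */
struct my_entity {
	struct drm_sched_entity base;
	struct kref refcount;		/* one reference per in-flight job */
};

struct my_job {
	struct drm_sched_job base;
	struct my_entity *entity;	/* counted ref, kref_get() at job creation */
};

static void my_entity_release(struct kref *kref)
{
	struct my_entity *e = container_of(kref, struct my_entity, refcount);

	/*
	 * The last job is gone, so the entity can be torn down (in practice
	 * this would be deferred to process context); only after *all*
	 * entities are gone would the driver call drm_sched_fini().
	 */
	drm_sched_entity_fini(&e->base);
	kfree(e);
}

/* ->free_job() callback: freeing the job drops its entity reference. */
static void my_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	drm_sched_job_cleanup(sched_job);
	kref_put(&job->entity->refcount, my_entity_release);
	kfree(job);
}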
> 
> In my opinion, we should document the lifetime rules I've outlined, fix
> all drivers accordingly, and assert these rules in the scheduler layer.
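A minimal sketch of what asserting that rule in the scheduler layer could
look like, reusing the drm_sched_fini() loop quoted above; the warning text
and exact placement are illustrative assumptions, not an actual patch:

	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
		struct drm_sched_rq *rq = sched->sched_rq[i];

		spin_lock(&rq->lock);
		/*
		 * If entities must be torn down before the scheduler, any
		 * entity still sitting on a run queue here is a driver bug;
		 * complain loudly instead of silently stopping it.
		 */
		WARN(!list_empty(&rq->entities),
		     "drm_sched_fini() called with entities still attached\n");
		spin_unlock(&rq->lock);
		kfree(sched->sched_rq[i]);
	}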
Everyone had a separate solution for that. Nouveau used a waitqueue.
That's what happens when there's no centralized mechanism for solving a
problem.

Did you see the series we recently merged which fixes the memory leaks
in drm/sched? It had been around for quite some time.

https://lore.kernel.org/dri-devel/20250701132142.76899-3-pha...@kernel.org/


P.

> 
> Matt
> 
> > Matt
> > 
> > > P.
> > > 
> > > > > 
> > > > > There are two solutions:
> > > > > 
> > > > > (1) Strictly require all entities to be torn down before
> > > > >     drm_sched_fini(), i.e. stick to the natural ownership and
> > > > >     lifetime rules here (see below).
> > > > > 
> > > > > (2) Actually protect *any* changes of the relevant fields of the
> > > > >     entity structure with the entity lock.
> > > > > 
> > > > > While (2) seems rather obvious, we run into lock inversion with this
> > > > > approach, as you note below as well. And I think drm_sched_fini()
> > > > > should not mess with entities anyways.
> > > > > 
> > > > > The ownership here seems obvious:
> > > > > 
> > > > > The scheduler *owns* a resource that is used by entities.
> > > > > Consequently, entities are not allowed to out-live the scheduler.
> > > > > 
> > > > > Surely, the current implementation, which just takes the resource
> > > > > away from the entity under the hood, can work as well with
> > > > > appropriate locking, but that's a mess.
> > > > > 
> > > > > If the resource *really* needs to be shared for some reason (which I
> > > > > don't see), shared ownership, i.e. reference counting, is much less
> > > > > error prone.
> > > > 
> > > > Yes, Xe solves all of this via reference counting (jobs refcount the
> > > > entity). It's a bit easier in Xe since the scheduler and entities are
> > > > the same object due to their 1:1 relationship. But even in non-1:1
> > > > relationships, an entity could refcount the scheduler. The teardown
> > > > sequence would then be: all jobs complete on the entity → tear down
> > > > the entity → all entities torn down → tear down the scheduler.
> > > > 
> > > > Matt
> > > 
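For completeness, the teardown order described at the end of the quote (all
jobs finished, then the entities, then the scheduler) could look roughly like
this on the driver side; my_device, my_dev_flush_jobs(), the per-device
entity list with its node member, and the reuse of the hypothetical my_entity
from the sketch further up are all invented for the example:

/* Hypothetical driver teardown following the order described above. */
static void my_driver_fini(struct my_device *mdev)
{
	struct my_entity *e, *tmp;

	/* 1) Wait for (or cancel) all in-flight jobs first. */
	my_dev_flush_jobs(mdev);

	/* 2) Tear down every entity before touching the scheduler. */
	list_for_each_entry_safe(e, tmp, &mdev->entities, node)
		drm_sched_entity_destroy(&e->base);

	/* 3) Only now tear down the scheduler itself. */
	drm_sched_fini(&mdev->sched);
}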