On Wed, Jul 23, 2025 at 09:13:34PM -0700, Matthew Brost wrote:
> On Wed, Jul 23, 2025 at 08:56:01AM +0200, Philipp Stanner wrote:
> > On Tue, 2025-07-22 at 01:45 -0700, Matthew Brost wrote:
> > > On Tue, Jul 22, 2025 at 01:07:29AM -0700, Matthew Brost wrote:
> > > > On Tue, Jul 22, 2025 at 09:37:11AM +0200, Philipp Stanner wrote:
> > > > > On Mon, 2025-07-21 at 11:07 -0700, Matthew Brost wrote:
> > > > > > On Mon, Jul 21, 2025 at 12:14:31PM +0200, Danilo Krummrich wrote:
> > > > > > > On Mon Jul 21, 2025 at 10:16 AM CEST, Philipp Stanner wrote:
> > > > > > > > On Mon, 2025-07-21 at 09:52 +0200, Philipp Stanner wrote:
> > > > > > > > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote:
> > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > index bfea608a7106..997a2cc1a635 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > @@ -172,8 +172,10 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
> > > > > > > > > >
> > > > > > > > > >  	entity->oldest_job_waiting = ts;
> > > > > > > > > >
> > > > > > > > > > -	rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> > > > > > > > > > -		      drm_sched_entity_compare_before);
> > > > > > > > > > +	if (!entity->stopped) {
> > > > > > > > > > +		rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
> > > > > > > > > > +			      drm_sched_entity_compare_before);
> > > > > > > > > > +	}
> > > > > > > > >
> > > > > > > > > If this is a race, then this patch here is broken, too, because you're checking the 'stopped' boolean as the callers of that function do, too – just later. :O
> > > > > > > > >
> > > > > > > > > Could still race, just less likely.
> > > > > > > > >
> > > > > > > > > The proper way to fix it would then be to address the issue where the locking is supposed to happen. Let's look at, for example, drm_sched_entity_push_job():
> > > > > > > > >
> > > > > > > > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > > > > > > > {
> > > > > > > > > 	(Bla bla bla)
> > > > > > > > >
> > > > > > > > > 	…………
> > > > > > > > >
> > > > > > > > > 	/* first job wakes up scheduler */
> > > > > > > > > 	if (first) {
> > > > > > > > > 		struct drm_gpu_scheduler *sched;
> > > > > > > > > 		struct drm_sched_rq *rq;
> > > > > > > > >
> > > > > > > > > 		/* Add the entity to the run queue */
> > > > > > > > > 		spin_lock(&entity->lock);
> > > > > > > > > 		if (entity->stopped) {			<---- Aha!
> > > > > > > > > 			spin_unlock(&entity->lock);
> > > > > > > > >
> > > > > > > > > 			DRM_ERROR("Trying to push to a killed entity\n");
> > > > > > > > > 			return;
> > > > > > > > > 		}
> > > > > > > > >
> > > > > > > > > 		rq = entity->rq;
> > > > > > > > > 		sched = rq->sched;
> > > > > > > > >
> > > > > > > > > 		spin_lock(&rq->lock);
> > > > > > > > > 		drm_sched_rq_add_entity(rq, entity);
> > > > > > > > >
> > > > > > > > > 		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > > > > > > > > 			drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);	<---- bumm!
> > > > > > > > >
> > > > > > > > > 		spin_unlock(&rq->lock);
> > > > > > > > > 		spin_unlock(&entity->lock);
> > > > > > > > >
> > > > > > > > > But the locks are still being held. So that "shouldn't be happening"(tm).
> > > > > > > > >
> > > > > > > > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini() stop entities. The former holds appropriate locks, but drm_sched_fini() doesn't. So that looks like a hot candidate to me. Opinions?
> > > > > > > > >
> > > > > > > > > On the other hand, aren't drivers prohibited from calling drm_sched_entity_push_job() after calling drm_sched_fini()? If the fuzzer does that, then it's not the scheduler's fault.
> > > > > > >
> > > > > > > Exactly, this is the first question to ask.
> > > > > > >
> > > > > > > And I think it's even more restrictive:
> > > > > > >
> > > > > > > In drm_sched_fini()
> > > > > > >
> > > > > > > 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > > > > > 		struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > > > >
> > > > > > > 		spin_lock(&rq->lock);
> > > > > > > 		list_for_each_entry(s_entity, &rq->entities, list)
> > > > > > > 			/*
> > > > > > > 			 * Prevents reinsertion and marks job_queue as idle,
> > > > > > > 			 * it will be removed from the rq in drm_sched_entity_fini() eventually
> > > > > > > 			 */
> > > > > > > 			s_entity->stopped = true;
> > > > > > > 		spin_unlock(&rq->lock);
> > > > > > > 		kfree(sched->sched_rq[i]);
> > > > > > > 	}
> > > > > > >
> > > > > > > In drm_sched_entity_kill()
> > > > > > >
> > > > > > > static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> > > > > > > {
> > > > > > > 	struct drm_sched_job *job;
> > > > > > > 	struct dma_fence *prev;
> > > > > > >
> > > > > > > 	if (!entity->rq)
> > > > > > > 		return;
> > > > > > >
> > > > > > > 	spin_lock(&entity->lock);
> > > > > > > 	entity->stopped = true;
> > > > > > > 	drm_sched_rq_remove_entity(entity->rq, entity);
> > > > > > > 	spin_unlock(&entity->lock);
> > > > > > >
> > > > > > > 	[...]
> > > > > > > }
> > > > > > >
> > > > > > > If this runs concurrently, this is a UAF as well.
> > > > > > >
> > > > > > > Personally, I have always been working with the assumption that entities have to be torn down *before* the scheduler, but those lifetimes are not documented properly.
> > > > > >
> > > > > > Yes, this is my assumption too. I would even take it further: an entity shouldn't be torn down until all jobs associated with it are freed as well. I think this would solve a lot of issues I've seen on the list related to UAF, teardown, etc.
> > > > >
> > > > > That's kind of impossible with the new teardown design, because drm_sched_fini() ensures that all jobs are freed on teardown. And drm_sched_fini() wouldn't be called before all jobs are gone, effectively resulting in a chicken-and-egg problem, or rather: the driver implementing its own solution for teardown.
> > > >
> > > > I've read this four times and I'm still generally confused.
> > > >
> > > > "drm_sched_fini ensures that all jobs are freed on teardown" — Yes, that's how a refcounting-based solution works. drm_sched_fini would never be called if there were pending jobs.
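As a minimal illustration of that refcounting model (my_sched, my_job, and my_job_free() are hypothetical driver-side names, not existing drm_sched API), each job pins the scheduler, so drm_sched_fini() cannot run while jobs are still pending:

#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>

#include <drm/gpu_scheduler.h>

/* Hypothetical driver-side scheduler wrapper: the scheduler is refcounted. */
struct my_sched {
	struct drm_gpu_scheduler base;
	struct kref refcount;
};

/* Hypothetical driver-side job: it holds a reference on the scheduler. */
struct my_job {
	struct drm_sched_job base;
	struct my_sched *sched;
};

static void my_sched_release(struct kref *kref)
{
	struct my_sched *s = container_of(kref, struct my_sched, refcount);

	/* Last reference dropped: no job can still reach the scheduler. */
	drm_sched_fini(&s->base);
	kfree(s);
}

/* Called from the driver's free_job() callback. */
static void my_job_free(struct my_job *job)
{
	struct my_sched *s = job->sched;

	drm_sched_job_cleanup(&job->base);
	kfree(job);
	kref_put(&s->refcount, my_sched_release);
}

The same chain can be extended one level (jobs pin the entity, the entity pins the scheduler) to match the lifetime rules recapped further down in the thread.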
> > > >
> > > > "drm_sched_fini() wouldn't be called before all jobs are gone" — See above.
> > > >
> > > > "effectively resulting in a chicken-and-egg problem" — A job is created after the scheduler, and it holds a reference to the scheduler until it's freed. I don't see how this idiom applies.
> > > >
> > > > "the driver implementing its own solution for teardown" — It's just following the basic lifetime rules I outlined below. Perhaps Xe was ahead of its time, but the number of DRM scheduler blowups we've had is zero — maybe a strong indication that this design is correct.
> > >
> > > Sorry—self-reply.
> > >
> > > To expand on this: the reason Xe implemented a refcount-based teardown solution is because the internals of the DRM scheduler during teardown looked wildly scary. A lower layer should not impose its will on upper layers. I think that's the root cause of all the problems I've listed.
> > >
> > > In my opinion, we should document the lifetime rules I've outlined, fix all drivers accordingly, and assert these rules in the scheduler layer.
> >
> > Everyone had a separate solution for that. Nouveau used a waitqueue. That's what happens when there's no centralized mechanism for solving a problem.
>
> Right, this is essentially my point — I think refcounting on the driver side is what the long-term solution really needs to be.
>
> To recap the basic rules:
>
> - Entities should not be finalized or freed until all jobs associated with them are freed.
> - Schedulers should not be finalized or freed until all associated entities are finalized.
> - Jobs should hold a reference to the entity.
> - Entities should hold a reference to the scheduler.
>
> I understand this won't happen overnight — or perhaps ever — but adopting this model would solve a lot of problems across the subsystem and reduce a significant amount of complexity in the DRM scheduler. I'll also acknowledge that part of this is my fault — years ago, I worked around the teardown problems in the scheduler (by implementing the refcount model above) rather than proposing a common, unified solution and clear lifetime rules.
>
> For drivers with a 1:1 entity-to-scheduler relationship, teardown becomes fairly simple: set the TDR timeout to zero and naturally let the remaining jobs flush out via TDR + the timedout_job callback, which signals the job's fence. free_job is called after that.
>
> For non-1:1 setups, we could introduce something like drm_sched_entity_kill, which would move all jobs on the pending list of a given entity to a kill list. A worker could then process that kill list — calling timedout_job and signaling the associated fences. Similarly, any jobs that had unresolved dependencies could be immediately added to the kill list. The kill list would have to be checked in drm_sched_free_job_work too.
s/added to the kill list/added to the kill list after calling run_job/

Matt

>
> This would ensure that all jobs submitted would go through the full lifecycle:
>
> - run_job is called
> - free_job is called
> - If the fence returned from run_job needs to be artificially signaled, timedout_job is called
>
> We can add the infrastructure for this and, once all drivers adhere to this model, clean up the ugliness in the scheduler related to teardown and all the races here.
>
> > Did you see the series we recently merged which repairs the memory leaks of drm/sched? It had been around for quite some time.
> >
> > https://lore.kernel.org/dri-devel/20250701132142.76899-3-pha...@kernel.org/
>
> I would say this is just hacking around the fundamental issues with the lifetime of these objects. Do you see anything in Nouveau that would prevent the approach I described above from working?
>
> Also, what if jobs have dependencies that aren't even on the pending list yet? This further illustrates the problems with trying to finalize objects while child objects (entities, jobs) are still around.
>
> Matt
>
> >
> > P.
> >
> > >
> > > Matt
> > >
> > > > Matt
> > > >
> > > > > P.
> > > > >
> > > > > > >
> > > > > > > There are two solutions:
> > > > > > >
> > > > > > > (1) Strictly require all entities to be torn down before drm_sched_fini(), i.e. stick to the natural ownership and lifetime rules here (see below).
> > > > > > >
> > > > > > > (2) Actually protect *any* changes of the relevant fields of the entity structure with the entity lock.
> > > > > > >
> > > > > > > While (2) seems rather obvious, we run into lock inversion with this approach, as you note below as well. And I think drm_sched_fini() should not mess with entities anyways.
> > > > > > >
> > > > > > > The ownership here seems obvious:
> > > > > > >
> > > > > > > The scheduler *owns* a resource that is used by entities. Consequently, entities are not allowed to outlive the scheduler.
> > > > > > >
> > > > > > > Surely, the current implementation to just take the resource away from the entity under the hood can work as well with appropriate locking, but that's a mess.
> > > > > > >
> > > > > > > If the resource *really* needs to be shared for some reason (which I don't see), shared ownership, i.e. reference counting, is much less error-prone.
> > > > > >
> > > > > > Yes, Xe solves all of this via reference counting (jobs refcount the entity). It's a bit easier in Xe since the scheduler and entities are the same object due to their 1:1 relationship. But even in non-1:1 relationships, an entity could refcount the scheduler. The teardown sequence would then be: all jobs complete on the entity → teardown the entity → all entities torn down → teardown the scheduler.
> > > > > >
> > > > > > Matt
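To make the non-1:1 kill-list idea above a bit more concrete, here is a rough sketch. Everything named my_* (the entity wrapper, my_entity_kill(), the kill worker) is hypothetical and not existing drm_sched API; only the drm_sched_* types, list helpers, and workqueue calls are real kernel interfaces. Locking against the scheduler's own pending-list handling is glossed over, and whether calling timedout_job from such a worker is acceptable for every driver is an open question:

#include <linux/container_of.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

#include <drm/gpu_scheduler.h>

/* Hypothetical driver-side entity wrapper carrying a kill list. */
struct my_entity {
	struct drm_sched_entity base;
	struct list_head kill_list;
	spinlock_t kill_lock;
	struct work_struct kill_work;
};

/* Move all of this entity's jobs from the pending list onto the kill list. */
static void my_entity_kill(struct my_entity *e, struct list_head *pending)
{
	struct drm_sched_job *job, *tmp;

	spin_lock(&e->kill_lock);
	list_for_each_entry_safe(job, tmp, pending, list)
		list_move_tail(&job->list, &e->kill_list);
	spin_unlock(&e->kill_lock);

	schedule_work(&e->kill_work);
}

/* Worker: flush out killed jobs so each one still sees its full lifecycle. */
static void my_entity_kill_worker(struct work_struct *w)
{
	struct my_entity *e = container_of(w, struct my_entity, kill_work);
	struct drm_sched_job *job;

	for (;;) {
		spin_lock(&e->kill_lock);
		job = list_first_entry_or_null(&e->kill_list,
					       struct drm_sched_job, list);
		if (job)
			list_del_init(&job->list);
		spin_unlock(&e->kill_lock);

		if (!job)
			break;

		/* Let the driver signal the job's fence, as TDR would; free_job follows later. */
		job->sched->ops->timedout_job(job);
	}
}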