On Tue, Jun 17, 2025 at 05:08:57PM +0200, Danilo Krummrich wrote: > On Tue, Jun 17, 2025 at 04:25:09PM +0200, Simona Vetter wrote: > > On Tue, Jun 17, 2025 at 04:10:40PM +0200, Danilo Krummrich wrote: > > > On Tue, Jun 17, 2025 at 03:51:33PM +0200, Simona Vetter wrote: > > > > On Thu, Jun 12, 2025 at 04:49:54PM +0200, Philipp Stanner wrote: > > > > > + * NOTE that sharing &struct drm_sched_init_args.submit_wq with the > > > > > driver > > > > > + * theoretically can deadlock. It must be guaranteed that submit_wq > > > > > never has > > > > > + * more than max_active - 1 active tasks, or if max_active tasks are > > > > > reached at > > > > > + * least one of them does not execute operations that may block on > > > > > dma_fences > > > > > + * that potentially make progress through this scheduler instance. > > > > > Otherwise, > > > > > + * it is possible that all max_active tasks end up waiting on a > > > > > dma_fence (that > > > > > + * can only make progress through this schduler instance), while the > > > > > + * scheduler's queued work waits for at least one of the max_active > > > > > tasks to > > > > > + * finish. Thus, this can result in a deadlock. > > > > > > > > Uh if you have an ordered wq you deadlock with just one misuse. I'd just > > > > explain that the wq must provide sufficient forward-progress guarantees > > > > for the scheduler, specifically that it's on the dma_fence signalling > > > > critical path and leave the concrete examples for people to figure out > > > > when the design a specific locking scheme. > > > > > > This isn't a concrete example, is it? It's exactly what you say in > > > slightly > > > different words, with the addition of highlighting the impact of the > > > workqueue's > > > max_active configuration. > > > > > > I think that's relevant, because N - 1 active tasks can be on the > > > dma_fence > > > signalling critical path without issues. > > > > > > We could change > > > > > > "if max_active tasks are reached at least one of them must not execute > > > operations that may block on dma_fences that potentially make progress > > > through this scheduler instance" > > > > > > to > > > > > > "if max_active tasks are reached at least one of them must not be on the > > > dma_fence signalling critical path" > > > > > > which is a bit more to the point I think. > > > > My point was to more state that the wq must be suitable for the scheduler > > jobs as the general issue, and specifically then also highlight the > > dma_fence concurrency issue. > > Sure, there are more guarantees the driver has to uphold, but this is one of > them, so I think we should explain it. > > > But it's not the only one, you can have driver locks and other fun involved > > here too. > > Yeah, but it boils down to the same issue, e.g. if a driver takes a lock in > active work, and this lock is taken elsewhere for activities that violate the > dma_fence signalling critical path. > > All the cases I have in mind boil down to that we potentially, either directly > or indirectly (through some synchronization primitive), wait for things we are > not allowed to wait for in the dma_fence signalling critical path. > > Or do you mean something different?
You could also grab a mutex in those driver work items that is held while waiting for a dma_fence somewhere. The dma_fence annotations should catch that, but at least in my reading of the text here it's not covered. But my main point is what I explain below, the text fails to clearly address the issues that all current drivers (maybe all reasonable drivers, but sometimes I lack imagination) can encounter, so to me it's too generic and not that directly applicable in practice. And then on the other hand dma_fence is definitely the big thing, but fundamtentally you tie anything you schedule on those wq to the drm/scheduler design in it's entirety. So for the general rule, it's not generic enough for my taste. > > Also since all the paragraphs above talk about ordered wq as the example > > where specifying your own wq makes sense, it's a bit confusing to now > > suddenly only talk about the concurrent wq case without again mentioned > > that the ordered wq case is really limited. > > I mean, it talks about both cases in a generic way, i.e. if you set > max_active == 1 in the text it covers the ordered case. > > Or do you mean to say that we should *only* allow ordered workqueues to be > shared with the driver? Both examples talk about ordered wq, they don't make any sense with max_active > 1, and I can't come up with any example that would. So yeah I'm leaning in that direction, at least in the docs. Only discussing max_active > 1 for this specific issue is imo very confusing and not helping much. I guess we could even WARN_ON if a provided wq is not ordered, because that does smell funky for sure. Cheers, Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
