On Tue, Mar 17, 2026 at 09:43:20PM +0100, Boris Brezillon wrote: > Hi Matthew, > > Just a few drive-by comments. > > On Tue, 17 Mar 2026 11:14:36 -0700 > Matthew Brost <[email protected]> wrote: > > > > > > Timeout Detection and Recovery (TDR): a per-queue delayed work item > > > > > fires when the head pending job exceeds q->job.timeout jiffies, > > > > > calling > > > > > ops->timedout_job(). drm_dep_queue_trigger_timeout() forces immediate > > > > > expiry for device teardown. > > > > > > > > > > IRQ-safe completion: queues flagged > > > > > DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE > > > > > allow drm_dep_job_done() to be called from hardirq context (e.g. a > > > > > dma_fence callback). Dependency cleanup is deferred to process context > > > > > after ops->run_job() returns to avoid calling xa_destroy() from IRQ. > > > > > > > > > > Zombie-state guard: workers use kref_get_unless_zero() on entry and > > > > > bail immediately if the queue refcount has already reached zero and > > > > > async teardown is in flight, preventing use-after-free. > > > > > > > > In rust, when you queue work, you have to pass a reference-counted > > > > pointer > > > > (Arc<T>). We simply never have this problem in a Rust design. If there > > > > is work > > > > queued, the queue is alive. > > > > > > > > By the way, why can’t we simply require synchronous teardowns? > > > > Consider the case where the DRM dep queue’s refcount drops to zero, but > > the device firmware still holds references to the associated queue. > > These are resources that must be torn down asynchronously. In Xe, I need > > to send two asynchronous firmware commands before I can safely remove > > the memory associated with the queue (faulting on this kind of global > > memory will take down the device) and recycle the firmware ID tied to > > the queue. These async commands are issued on the driver side, on the > > DRM dep queue’s workqueue as well. > > Asynchronous teardown is okay, but I'm not too sure using the refcnt to > know that the queue is no longer usable is the way to go. To me the > refcnt is what determines when the SW object is no longer referenced by > any other item in the code, and a work item acting on the queue counts > as one owner of this queue. If you want to cancel the work in order to > speed up the destruction of the queue, you can call > {cancel,disable}_work[_sync](), and have the ref dropped if the > cancel/disable was effective. Multi-step teardown is also an option, > but again, the state of the queue shouldn't be determined from its > refcnt IMHO. > > > > > Now consider a scenario where something goes wrong and those firmware > > commands never complete, and a device reset is required to recover. The > > driver’s per-queue tracking logic stops all queues (including zombie > > ones), determines which commands were lost, cleans up the side effects > > of that lost state, and then restarts all queues. That is how we would > > end up in this work item with a zombie queue. The restart logic could > > probably be made smart enough to avoid queueing work for zombie queues, > > but in my opinion it’s safe enough to use kref_get_unless_zero() in the > > work items. > > Well, that only works for single-step teardown, or when you enter the > last step. At which point, I'm not too sure it's significantly better > than encoding the state of the queue through a separate field, and have > the job queue logic reject new jobs if the queue is no longer usable > (shouldn't even be exposed to userland at this point though). >
'shouldn't even be exposed to userland at this point though' - Yes. The philosophy of ref counting design is roughly: - When queue is created by userland call drm_dep_queue_init - All jobs hold a ref to drm_dep_queue - When userland close queue, remove it from the FD, call drm_dep_queue_put + initiatie teardown (I'd recommned just setting TDR to immediately fire, kick off queue from device in first fire + signal all fences). - Queue refcount goes to zero optionality implement drm_dep_queue_ops.fini to keep the drm_dep_queue (and object it is embedded) around for a bit longer if additional firmware / device side resources are still around, call drm_dep_queue_fini when this part completes. If drm_dep_queue_ops.fini isn't implement the core implementation just calls drm_dep_queue_fini. - Work item releases drm_dep_queue out dma-fence signaling path for safe memory release (e.g., take dma-resv locks). > > > > It should also be clear that a DRM dep queue is primarily intended to be > > embedded inside the driver’s own queue object, even though it is valid > > to use it as a standalone object. The async teardown flows are also > > optional features. > > > > Let’s also consider a case where you do not need the async firmware > > flows described above, but the DRM dep queue is still embedded in a > > driver-side object that owns memory via dma-resv. The final queue put > > may occur in IRQ context (DRM dep avoids kicking a worker just to drop a > > refi as opt in), or in the reclaim path (any scheduler workqueue is the > > reclaim path). In either case, you cannot free memory there taking a > > dma-resv lock, which is why all DRM dep queues ultimately free their > > resources in a work item outside of reclaim. Many drivers already follow > > this pattern, but in DRM dep this behavior is built-in. > > I agree deferred cleanup is the way to go. > +1. Yes. I spot a bunch of drivers that open code this part driver side, Xe included. > > > > So I don’t think Rust natively solves these types of problems, although > > I’ll concede that it does make refcounting a bit more sane. > > Rust won't magically defer the cleanup, nor will it dictate how you want > to do the queue teardown, those are things you need to implement. But it > should give visibility about object lifetimes, and guarantee that an > object that's still visible to some owners is usable (the notion of > usable is highly dependent on the object implementation). > > Just a purely theoretical example of a multi-step queue teardown that > might be possible to encode in rust: > > - MyJobQueue<Usable>: The job queue is currently exposed and usable. > There's a ::destroy() method consuming 'self' and returning a > MyJobQueue<Destroyed> object > - MyJobQueue<Destroyed>: The user asked for the workqueue to be > destroyed. No new job can be pushed. Existing jobs that didn't make > it to the FW queue are cancelled, jobs that are in-flight are > cancelled if they can, or are just waited upon if they can't. When > the whole destruction step is done, ::destroyed() is called, it > consumes 'self' and returns a MyJobQueue<Inactive> object. > - MyJobQueue<Inactive>: The queue is no longer active (HW doesn't have > any resources on this queue). It's ready to be cleaned up. > ::cleanup() (or just ::drop()) defers the cleanup of some inner > object that has been passed around between the various > MyJobQueue<State> wrappers. > > Each of the state transition can happen asynchronously. A state > transition consumes the object in one state, and returns a new object > in its new state. None of the transition involves dropping a refcnt, > ownership is just transferred. The final MyJobQueue<Inactive> object is > the object we'll defer cleanup on. > > It's a very high-level view of one way this can be implemented (I'm > sure there are others, probably better than my suggestion) in order to > make sure the object doesn't go away without the compiler enforcing > proper state transitions. > I'm sure Rust can implement this. My point about Rust is it doesn't magically solve hard software arch probles, but I will admit the ownership model, way it can enforce locking at compile time is pretty cool. > > > > > +/** > > > > > + * DOC: DRM dependency fence > > > > > + * > > > > > + * Each struct drm_dep_job has an associated struct drm_dep_fence > > > > > that > > > > > + * provides a single dma_fence (@finished) signalled when the > > > > > hardware > > > > > + * completes the job. > > > > > + * > > > > > + * The hardware fence returned by &drm_dep_queue_ops.run_job is > > > > > stored as > > > > > + * @parent. @finished is chained to @parent via > > > > > drm_dep_job_done_cb() and > > > > > + * is signalled once @parent signals (or immediately if run_job() > > > > > returns > > > > > + * NULL or an error). > > > > > > > > I thought this fence proxy mechanism was going away due to recent work > > > > being > > > > carried out by Christian? > > > > > > > > Consider the case where a driver’s hardware fence is implemented as a > > dma-fence-array or dma-fence-chain. You cannot install these types of > > fences into a dma-resv or into syncobjs, so a proxy fence is useful > > here. > > Hm, so that's a driver returning a dma_fence_array/chain through > ::run_job()? Why would we not want to have them directly exposed and > split up into singular fence objects at resv insertion time (I don't > think syncobjs care, but I might be wrong). I mean, one of the point You can stick dma-fence-arrays in syncobjs, but not chains. Neither dma-fence-arrays/chain can go into dma-resv. Hence why disconnecting a job's finished fence from hardware fence IMO is good idea to keep so gives drivers flexiblity on the hardware fences. e.g., If this design didn't have a job's finished fence, I'd have to open code one Xe side. > behind the container extraction is so fences coming from the same > context/timeline can be detected and merged. If you insert the > container through a proxy, you're defeating the whole fence merging > optimization. Right. Finished fences have single timeline too... > > The second thing is that I'm not sure drivers were ever supposed to > return fence containers in the first place, because the whole idea > behind a fence context is that fences are emitted/signalled in > seqno-order, and if the fence is encoding the state of multiple > timelines that progress at their own pace, it becomes tricky to control > that. I guess if it's always the same set of timelines that are > combined, that would work. Xe does this is definitely works. We submit to multiple rings, when all rings signal a seqno, a chain or array signals -> finished fence signals. The queues used in this manor can only submit multiple ring jobs so the finished fence timeline stays intact. If you could a multiple rings followed by a single ring submission on the same queue, yes this could break. > > > One example is when a single job submits work to multiple rings > > that are flipped in hardware at the same time. > > We do have that in Panthor, but that's all explicit: in a single > SUBMIT, you can have multiple jobs targeting different queues, each of > them having their own set of deps/signal ops. The combination of all the > signal ops into a container is left to the UMD. It could be automated > kernel side, but that would be a flag on the SIGNAL op leading to the > creation of a fence_array containing fences from multiple submitted > jobs, rather than the driver combining stuff in the fence it returns in > ::run_job(). See above. We have a dedicated queue type for these type of submissions and single job that submits to the all rings. We had multiple queue / jobs in the i915 to implemented this but it turns out it is much cleaner with a single queue / singler job / multiple rings model. > > > > > Another case is late arming of hardware fences in run_job (which many > > drivers do). The proxy fence is immediately available at arm time and > > can be installed into dma-resv or syncobjs even though the actual > > hardware fence is not yet available. I think most drivers could be > > refactored to make the hardware fence immediately available at run_job, > > though. > > Yep, I also think we can arm the driver fence early in the case of > JobQueue. The reason it couldn't be done before is because the > scheduler was in the middle, deciding which entity to pull the next job > from, which was changing the seqno a job driver-fence would be assigned > (you can't guess that at queue time in that case). > Xe doesn't need to late arming, but it look like multiple drivers to implement the late arming which may be required (?). > [...] > > > > > > + * **Reference counting** > > > > > + * > > > > > + * Jobs and queues are both reference counted. > > > > > + * > > > > > + * A job holds a reference to its queue from drm_dep_job_init() until > > > > > + * drm_dep_job_put() drops the job's last reference and its release > > > > > callback > > > > > + * runs. This ensures the queue remains valid for the entire > > > > > lifetime of any > > > > > + * job that was submitted to it. > > > > > + * > > > > > + * The queue holds its own reference to a job for as long as the job > > > > > is > > > > > + * internally tracked: from the moment the job is added to the > > > > > pending list > > > > > + * in drm_dep_queue_run_job() until drm_dep_job_done() kicks the > > > > > put_job > > > > > + * worker, which calls drm_dep_job_put() to release that reference. > > > > > > > > Why not simply keep track that the job was completed, instead of > > > > relinquishing > > > > the reference? We can then release the reference once the job is > > > > cleaned up > > > > (by the queue, using a worker) in process context. > > > > I think that’s what I’m doing, while also allowing an opt-in path to > > drop the job reference when it signals (in IRQ context) > > Did you mean in !IRQ (or !atomic) context here? Feels weird to not > defer the cleanup when you're in an IRQ/atomic context, but defer it > when you're in a thread context. > The put of a job in this design can be from an IRQ context (opt-in) feature. xa_destroy blows up if it is called from an IRQ context, although maybe that could be workaround. > > so we avoid > > switching to a work item just to drop a ref. That seems like a > > significant win in terms of CPU cycles. > > Well, the cleanup path is probably not where latency matters the most. Agree. But I do think avoiding a CPU context switch (work item) for a very lightweight job cleanup (usually just drop refs) will save of CPU cycles, thus also things like power, etc... > It's adding scheduling overhead, sure, but given all the stuff we defer > already, I'm not too sure we're at saving a few cycles to get the > cleanup done immediately. What's important to have is a way to signal > fences in an atomic context, because this has an impact on latency. > Yes. The signaling happens first then drm_dep_job_put if IRQ opt-in. > [...] > > > > > > + /* > > > > > + * Drop all input dependency fences now, in process context, before > > > > > the > > > > > + * final job put. Once the job is on the pending list its last > > > > > reference > > > > > + * may be dropped from a dma_fence callback (IRQ context), where > > > > > calling > > > > > + * xa_destroy() would be unsafe. > > > > > + */ > > > > > > > > I assume that “pending” is the list of jobs that have been handed to > > > > the driver > > > > via ops->run_job()? > > > > > > > > Can’t this problem be solved by not doing anything inside a dma_fence > > > > callback > > > > other than scheduling the queue worker? > > > > > > > > Yes, this code is required to support dropping job refs directly in the > > dma-fence callback (an opt-in feature). Again, this seems like a > > significant win in terms of CPU cycles, although I haven’t collected > > data yet. > > If it significantly hurts the perf, I'd like to understand why, because > to me it looks like pure-cleanup (no signaling involved), and thus no > other process waiting for us to do the cleanup. The only thing that > might have an impact is how fast you release the resources, and given > it's only a partial cleanup (xa_destroy() still has to be deferred), I'd > like to understand which part of the immediate cleanup is causing a > contention (basically which kind of resources the system is starving of) > It was more of once we moved to a ref counted model, it is pretty trivial allow drm_dep_job_put when the fence is signaling. It doesn't really add any complexity either, thus why I added it is. Matt > Regards, > > Boris
