dep: Add DRM dependency queue layer

Boris Brezillon Tue, 24 Mar 2026 01:50:10 -0700

On Mon, 23 Mar 2026 10:08:53 -0700
Matthew Brost <[email protected]> wrote:


> On Mon, Mar 23, 2026 at 10:55:04AM +0100, Boris Brezillon wrote:
> > Hi Matthew,
> > 
> > On Sun, 22 Mar 2026 21:50:07 -0700
> > Matthew Brost <[email protected]> wrote:
> >   
> > > > > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c 
> > > > > > > b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..2d012b29a5fc
> > > > > > > --- /dev/null
> > > > > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c
> > > > > > > @@ -0,0 +1,675 @@
> > > > > > > +// SPDX-License-Identifier: MIT
> > > > > > > +/*
> > > > > > > + * Copyright 2015 Advanced Micro Devices, Inc.
> > > > > > > + *
> > > > > > > + * Permission is hereby granted, free of charge, to any person 
> > > > > > > obtaining a
> > > > > > > + * copy of this software and associated documentation files (the 
> > > > > > > "Software"),
> > > > > > > + * to deal in the Software without restriction, including 
> > > > > > > without limitation
> > > > > > > + * the rights to use, copy, modify, merge, publish, distribute, 
> > > > > > > sublicense,
> > > > > > > + * and/or sell copies of the Software, and to permit persons to 
> > > > > > > whom the
> > > > > > > + * Software is furnished to do so, subject to the following 
> > > > > > > conditions:
> > > > > > > + *
> > > > > > > + * The above copyright notice and this permission notice shall 
> > > > > > > be included in
> > > > > > > + * all copies or substantial portions of the Software.
> > > > > > > + *
> > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY 
> > > > > > > KIND, EXPRESS OR
> > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF 
> > > > > > > MERCHANTABILITY,
> > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO 
> > > > > > > EVENT SHALL
> > > > > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, 
> > > > > > > DAMAGES OR
> > > > > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR 
> > > > > > > OTHERWISE,
> > > > > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR 
> > > > > > > THE USE OR
> > > > > > > + * OTHER DEALINGS IN THE SOFTWARE.
> > > > > > > + *
> > > > > > > + * Copyright © 2026 Intel Corporation
> > > > > > > + */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * DOC: DRM dependency job
> > > > > > > + *
> > > > > > > + * A struct drm_dep_job represents a single unit of GPU work 
> > > > > > > associated with
> > > > > > > + * a struct drm_dep_queue. The lifecycle of a job is:
> > > > > > > + *
> > > > > > > + * 1. **Allocation**: the driver allocates memory for the job 
> > > > > > > (typically by
> > > > > > > + *    embedding struct drm_dep_job in a larger structure) and 
> > > > > > > calls
> > > > > > > + *    drm_dep_job_init() to initialise it. On success the job 
> > > > > > > holds one
> > > > > > > + *    kref reference and a reference to its queue.
> > > > > > > + *
> > > > > > > + * 2. **Dependency collection**: the driver calls 
> > > > > > > drm_dep_job_add_dependency(),
> > > > > > > + *    drm_dep_job_add_syncobj_dependency(), 
> > > > > > > drm_dep_job_add_resv_dependencies(),
> > > > > > > + *    or drm_dep_job_add_implicit_dependencies() to register 
> > > > > > > dma_fence objects
> > > > > > > + *    that must be signalled before the job can run. Duplicate 
> > > > > > > fences from the
> > > > > > > + *    same fence context are deduplicated automatically.
> > > > > > > + *
> > > > > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's 
> > > > > > > finished fence,
> > > > > > > + *    consuming a sequence number from the queue. After arming,
> > > > > > > + *    drm_dep_job_finished_fence() returns a valid fence that 
> > > > > > > may be passed to
> > > > > > > + *    userspace or used as a dependency by other jobs.
> > > > > > > + *
> > > > > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the 
> > > > > > > queue. The
> > > > > > > + *    queue takes a reference that it holds until the job's 
> > > > > > > finished fence
> > > > > > > + *    signals and the job is freed by the put_job worker.
> > > > > > > + *
> > > > > > > + * 5. **Completion**: when the job's hardware work finishes its 
> > > > > > > finished fence
> > > > > > > + *    is signalled and drm_dep_job_put() is called by the queue. 
> > > > > > > The driver
> > > > > > > + *    must release any driver-private resources in 
> > > > > > > &drm_dep_job_ops.release.
> > > > > > > + *
> > > > > > > + * Reference counting uses drm_dep_job_get() / 
> > > > > > > drm_dep_job_put(). The
> > > > > > > + * internal drm_dep_job_fini() tears down the dependency xarray 
> > > > > > > and fence
> > > > > > > + * objects before the driver's release callback is invoked.
> > > > > > > + */
> > > > > > > +
> > > > > > > +#include <linux/dma-resv.h>
> > > > > > > +#include <linux/kref.h>
> > > > > > > +#include <linux/slab.h>
> > > > > > > +#include <drm/drm_dep.h>
> > > > > > > +#include <drm/drm_file.h>
> > > > > > > +#include <drm/drm_gem.h>
> > > > > > > +#include <drm/drm_syncobj.h>
> > > > > > > +#include "drm_dep_fence.h"
> > > > > > > +#include "drm_dep_job.h"
> > > > > > > +#include "drm_dep_queue.h"
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_init() - initialise a dep job
> > > > > > > + * @job: dep job to initialise
> > > > > > > + * @args: initialisation arguments
> > > > > > > + *
> > > > > > > + * Initialises @job with the queue, ops and credit count from 
> > > > > > > @args.  Acquires
> > > > > > > + * a reference to @args->q via drm_dep_queue_get(); this 
> > > > > > > reference is held for
> > > > > > > + * the lifetime of the job and released by drm_dep_job_release() 
> > > > > > > when the last
> > > > > > > + * job reference is dropped.
> > > > > > > + *
> > > > > > > + * Resources are released automatically when the last reference 
> > > > > > > is dropped
> > > > > > > + * via drm_dep_job_put(), which must be called to release the 
> > > > > > > job; drivers
> > > > > > > + * must not free the job directly.
> > > > > > > + *
> > > > > > > + * Context: Process context. Allocates memory with GFP_KERNEL.
> > > > > > > + * Return: 0 on success, -%EINVAL if credits is 0,
> > > > > > > + *   -%ENOMEM on fence allocation failure.
> > > > > > > + */
> > > > > > > +int drm_dep_job_init(struct drm_dep_job *job,
> > > > > > > +              const struct drm_dep_job_init_args *args)
> > > > > > > +{
> > > > > > > + if (unlikely(!args->credits)) {
> > > > > > > +         pr_err("drm_dep: %s: credits cannot be 0\n", __func__);
> > > > > > > +         return -EINVAL;
> > > > > > > + }
> > > > > > > +
> > > > > > > + memset(job, 0, sizeof(*job));
> > > > > > > +
> > > > > > > + job->dfence = drm_dep_fence_alloc();
> > > > > > > + if (!job->dfence)
> > > > > > > +         return -ENOMEM;
> > > > > > > +
> > > > > > > + job->ops = args->ops;
> > > > > > > + job->q = drm_dep_queue_get(args->q);
> > > > > > > + job->credits = args->credits;
> > > > > > > +
> > > > > > > + kref_init(&job->refcount);
> > > > > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC);
> > > > > > > + INIT_LIST_HEAD(&job->pending_link);
> > > > > > > +
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +EXPORT_SYMBOL(drm_dep_job_init);
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_drop_dependencies() - release all input 
> > > > > > > dependency fences
> > > > > > > + * @job: dep job whose dependency xarray to drain
> > > > > > > + *
> > > > > > > + * Walks @job->dependencies, puts each fence, and destroys the 
> > > > > > > xarray.
> > > > > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC 
> > > > > > > sentinel —
> > > > > > > + * i.e. slots that were pre-allocated but never replaced — are 
> > > > > > > silently
> > > > > > > + * skipped; the sentinel carries no reference.  Called from
> > > > > > > + * drm_dep_queue_run_job() in process context immediately after
> > > > > > > + * @ops->run_job() returns, before the final drm_dep_job_put().  
> > > > > > > Releasing
> > > > > > > + * dependencies here — while still in process context — avoids 
> > > > > > > calling
> > > > > > > + * xa_destroy() from IRQ context if the job's last reference is 
> > > > > > > later
> > > > > > > + * dropped from a dma_fence callback.
> > > > > > > + *
> > > > > > > + * Context: Process context.
> > > > > > > + */
> > > > > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job)
> > > > > > > +{
> > > > > > > + struct dma_fence *fence;
> > > > > > > + unsigned long index;
> > > > > > > +
> > > > > > > + xa_for_each(&job->dependencies, index, fence) {
> > > > > > > +         if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC))
> > > > > > > +                 continue;
> > > > > > > +         dma_fence_put(fence);
> > > > > > > + }
> > > > > > > + xa_destroy(&job->dependencies);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * drm_dep_job_fini() - clean up a dep job
> > > > > > > + * @job: dep job to clean up
> > > > > > > + *
> > > > > > > + * Cleans up the dep fence and drops the queue reference held by 
> > > > > > > @job.
> > > > > > > + *
> > > > > > > + * If the job was never armed (e.g. init failed before 
> > > > > > > drm_dep_job_arm()),
> > > > > > > + * the dependency xarray is also released here.  For armed jobs 
> > > > > > > the xarray
> > > > > > > + * has already been drained by drm_dep_job_drop_dependencies() 
> > > > > > > in process
> > > > > > > + * context immediately after run_job(), so it is left untouched 
> > > > > > > to avoid
> > > > > > > + * calling xa_destroy() from IRQ context.
> > > > > > > + *
> > > > > > > + * Warns if @job is still linked on the queue's pending list, 
> > > > > > > which would
> > > > > > > + * indicate a bug in the teardown ordering.
> > > > > > > + *
> > > > > > > + * Context: Any context.
> > > > > > > + */
> > > > > > > +static void drm_dep_job_fini(struct drm_dep_job *job)
> > > > > > > +{
> > > > > > > + bool armed = drm_dep_fence_is_armed(job->dfence);
> > > > > > > +
> > > > > > > + WARN_ON(!list_empty(&job->pending_link));
> > > > > > > +
> > > > > > > + drm_dep_fence_cleanup(job->dfence);
> > > > > > > + job->dfence = NULL;
> > > > > > > +
> > > > > > > + /*
> > > > > > > +  * Armed jobs have their dependencies drained by
> > > > > > > +  * drm_dep_job_drop_dependencies() in process context after 
> > > > > > > run_job().      
> > > > > > 
> > > > > > Just want to clear the confusion and make sure I get this right at 
> > > > > > the
> > > > > > same time. To me, "process context" means a user thread entering 
> > > > > > some
> > > > > > syscall(). What you call "process context" is more a "thread 
> > > > > > context" to
> > > > > > me. I'm actually almost certain it's always a kernel thread (a 
> > > > > > workqueue
> > > > > > worker thread to be accurate) that executes the drop_deps() after a
> > > > > > run_job().      
> > > > > 
> > > > > Some of context comments likely could be cleaned up. 'process context'
> > > > > here either in user context (bypass path) or run job work item.
> > > > >     
> > > > > >       
> > > > > > > +  * Skip here to avoid calling xa_destroy() from IRQ context.
> > > > > > > +  */
> > > > > > > + if (!armed)
> > > > > > > +         drm_dep_job_drop_dependencies(job);      
> > > > > > 
> > > > > > Why do we need to make a difference here. Can't we just assume that 
> > > > > > the
> > > > > > hole drm_dep_job_fini() call is unsafe in atomic context, and have a
> > > > > > work item embedded in the job to defer its destruction when _put() 
> > > > > > is
> > > > > > called in a context where the destruction is not allowed?
> > > > > >       
> > > > > 
> > > > > We already touched on this, but the design currently allows the last 
> > > > > job
> > > > > put from dma-fence signaling path (IRQ).    
> > > > 
> > > > It's not much about the last _put and more about what happens in the
> > > > _release() you pass to kref_put(). My point being, if you assume
> > > > something in _release() is not safe to be done in an atomic context,
> > > > and _put() is assumed to be called from any context, you might as well  
> > > >   
> > > 
> > > No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job
> > > put (including release) is IRQ-safe. If the documentation isn’t clear, I
> > > can clean that up. Some of my comments here [1] try to explain this
> > > further.
> > > 
> > > Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a
> > > dma-fence whose release must be IRQ-safe, so there is precedent for
> > > this. I didn’t want to unilaterally require that all job releases be
> > > IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence
> > > the flag.
> > > 
> > > The difference between non-IRQ-safe and IRQ-safe job release is only
> > > about 12 lines of code.  
> > 
> > It's not just about the number of lines of code added to the core to
> > deal with that case, but also complexity of the API that results from
> > these various modes.
> >   
> 
> Fair enough.
> 
> > > I figured that if we’re going to invest the time
> > > and effort to replace DRM sched, we should aim for the best possible
> > > implementation. Any driver can opt-in here and immediately get less CPU
> > > ultization and power savings. I will try to figure out how to measure
> > > this and get some number here.  
> > 
> > That's key here. My gut feeling is that we have so much deferred
> > already that adding one more work to the workqueue is not going to
> > hurt in term of scheduling overhead (no context switch if it's
> > scheduled on the same workqueue). Job cleanup is just the phase  
> 
> Signaling of fences in many drivers occurs in hard IRQ context rather
> than in a work queue. I agree that if you are signaling fences from a
> work queue, the overhead of another work item is minimal.

I'm talking about the drm_dep_queue_run_job_queue() call which in turn
calls queue_work() when a job gets reported as done. That, I think, is
the most likely path, isn't it?

> 
> > following the job_done() event, which also requires a deferred work to
> > check progress on the queue anyway. And if you move the entirety of the  
> 
> Yes, I see Panthor signals fences from a work queue by looking at the
> seqnos, but again, in many drivers this flow is IRQ-driven for fence
> signaling latency reasons.

I'm not talking about the signalling of the done_fence, but the work
that's used to check progress on a job queue
(drm_dep_queue_run_job_queue()).

> > >   
> > > > > 
> > > > > Once arm() is called there is a guarnette the run_job path is called
> > > > > either via bypass or run job work item.    
> > > > 
> > > > Sure.
> > > >     
> > > 
> > > Let’s not gloss over this—this is actually a huge difference from DRM
> > > sched. One of the biggest problems I found with DRM sched is that if you
> > > call arm(), run_job() may or may not be called. Without this guarantee,
> > > you can’t do driver-side bookkeeping in arm() that is later released in
> > > run_job(), which would otherwise simplify the driver design.  
> > 
> > You can do driver-side book-keeping after all jobs have been
> > successfully initialized, which include arming their fences. The key
> > turning point is when you start exposing those armed fences, not
> > when you arm them. See below.
> >   
> 
> There is still the seqno critical section which starts at arm() and
> closed at push() or drop of fence.

That's orthogonal to the rule that says nothing after _arm() can
fail, I think. To guarantee proper job ordering, you need extra locking
(at the moment, we rely on the VM resv lock to serialize this in
Panthor).

> > > 
> > > In my opinion, it’s best and safest to enforce a no-failure policy
> > > between arm() and push().  
> > 
> > I don't think it's safer, it's just the semantics that have been
> > defined by drm_sched/dma_fence and that we keep forcing ourselves
> > into. I'd rather have a well defined dma_fence state that says "that's
> > it, I'm exposed, you have to signal me now", than this half-enforced
> > arm()+push() model.
> >   
> 
> So what is the suggestion here — move the asserts I have from arm() to
> something like begin_push()? We could add a dma-fence state toggle there
> as well if we can get that part merged into dma-fence. Or should we just
> drop the asserts/lockdep checks between arm() and push() completely? I’m
> open to either approach here.

If we can have that INACTIVE flag added, and the associated
dma_fence_init[64]_inactive() variants, I would say, we call
dma_fence_init[64]_inactive() in _arm(), and we call
dma_fence_set_active() in _push(). It'd still be valuable to have some
sort of delimitation for the submission through some block-like macro
with an associated context to which we can attach states and allow for
more (optional?) runtime-checks.

> 
> > > 
> > > FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage,
> > > which had the exact issue I described above. I pointed it out and got a
> > > reply saying, “well, this is an API issue, right?”—and they were
> > > correct, it is an API issue.
> > >   
> > > > >     
> > > > > > 
> > > > > > In general, I wonder if we should distinguish between "armed" and
> > > > > > "publicly exposed" to help deal with this intra-batch dep thing 
> > > > > > without
> > > > > > resorting to reservation and other tricks like that.
> > > > > >       
> > > > > 
> > > > > I'm not exactly sure what you suggesting but always open to ideas.    
> > > > 
> > > > Right now _arm() is what does the dma_fence_init(). But there's an
> > > > extra step between initializing the fence object and making it
> > > > visible to the outside world. In order for the dep to be added to the
> > > > job, you need the fence to be initialized, but that's not quite
> > > > external visibility, because the job is still very much a driver
> > > > object, and if something fails, the rollback mechanism makes it so all
> > > > the deps are dropped on the floor along the job that's being destroyed.
> > > > So we won't really wait on this fence that's never going to be
> > > > signalled.
> > > > 
> > > > I see what's appealing in pretending that _arm() == externally-visible,
> > > > but it's also forcing us to do extra pre-alloc (or other pre-init)
> > > > operations that would otherwise not be required in the submit path. Not
> > > > a hill I'm willing to die on, but I just thought I'd mention the fact I
> > > > find it weird that we put extra constraints on ourselves that are not
> > > > strictly needed, just because we fail to properly flag the dma_fence
> > > > visibility transitions.    
> > > 
> > > See the dma-resv example above. I’m not willing to die on this hill
> > > either, but again, in my opinion, for safety and as an API-level
> > > contract, enforcing arm() as a no-failure point makes sense. It prevents
> > > drivers from doing anything dangerous like the dma-resv example, which
> > > is an extremely subtle bug.  
> > 
> > That's a valid point, but you're not really enforcing things at
> > compile/run-time it's just "don't do this/that" in the docs. If you
> > encode the is_active() state at the dma_fence level, properly change
> > the fence state anytime it's about to be added to a public container,
> > and make it so an active fence that's released without being signalled
> > triggers a WARN_ON(), you've achieved more. Once you've done that, you
> > can also relax the rule that says that "an armed fence has to be
> > signalled" to "a fence that's active has to be signalled". With this,
> > the pre-alloc for intra-batch deps in your drm_dep_job::deps xarray is
> > no longer required, because you would be able to store inactive fences  
> 
> I wouldn’t go that far or say it’s that simple. This would require a
> fairly large refactor of Xe’s VM bind pipeline to call arm() earlier,
> and I’m not even sure it would be possible. Between arm() and push(),
> the seqno critical section still remains and requires locking, in
> particulay the tricky case is kernel binds (e.g., page fault handling)
> which use the same queue. Multiple threads can issue kernel binds
> concurrently, as our page fault handler is multi-threaded, similar
> to the CPU page fault handler, so critical section between arm() and
> push() is very late in the pipeline tightly protected by a lock.

This sounds like a different issue to me. That's the constraint that
says _arm() and _push() ordering needs to be preserved to guarantee
that jobs are properly ordered on the job queue. But that's orthogonal
to the rule that says nothing between _arm() and _push() on a given job
can fail. Let's take the Panthor case as an example:

        for_each_job_in_batch() {
                // This acquires the VM resv lock, and all BO locks
                // Because queues target a specific VM and all jobs
                // in the a SUBMIT must target the same VM, this
                // guarantees that seqno allocation happening further
                // down (when _arm() is called) won't be interleaved
                // with other concurrent submissions to the same queues.
                lock_and_prepare_resvs()

                <--- Seqno critical section starts here
        }

        for_each_job_in_batch() {
                // If something fails here, we drop all the jobs that
                // are part of this SUBMIT, and the resv locks are
                // released as part of the rollback. This means we
                // consumed but didn't use the seqnos, thus creating
                // a whole on the timeline, which is armless, as long
                // as those seqnos are not recycled.
                ret = faillible_stuf()
                if (ret)
                        goto rollback;
 
                arm(job)
        }
 
        // Nothing can fail after this point
 
        for_each_job_in_batch() {
                // resv locks are released here, unblocking other
                // concurrent submissions
                update_resvs(job->done_fence)

                <--- Seqno critical section ends here in case of success

                push(job)
        }
 
        update_submit_syncobjs();

rollback:
        unlock_resvs()
        <--- Seqno critical section ends here in case of success
...

How wide your critical seqno section is is up to each driver, really.

> 
> > there, as long as they become active before the job is pushed.
> >   
> > >   
> > > > 
> > > > On the rust side it would be directly described through the type
> > > > system (see the Visibility attribute in Daniel's branch[1]). On C side,
> > > > this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever
> > > > name you want to give it). Any operation pushing the fence to public
> > > > container (dma_resv, syncobj, sync_file, ...) would be rejected when
> > > > that flag is set. At _push() time, we'd clear that flag with a
> > > > dma_fence_set_active() helper, which would reflect the fact the fence
> > > > can now be observed and exposed to the outside world.
> > > >     
> > > 
> > > Timeline squashing is problematic due to the DMA_FENCE_FLAG_INACTIVE
> > > flag. When adding a fence to dma-resv, fences that belong to the same
> > > timeline are immediately squashed. A later transition of the fence state
> > > completely breaks this behavior.  
> > 
> > That's exactly my point: as soon as you want to insert the fence to a
> > public container, you have to make it "active", so it will never be
> > rolled back to the previous entry in the resv. Similarly, a
> > wait/add_callback() on an inactive fence should be rejected.
> >   
> 
> This is bit bigger dma-fence / treewide level change but in general I
> believe this is a good idea.

I agree it's a bit more work. It implies patching containers to reject
insertion when the INACTIVE flag is set. If we keep !INACTIVE as the
default (__dma_fence_init(INACTIVE) being an opt-in), fence emitters can
be moved to this model progressively though.

Re: [RFC PATCH 02/12] drm/dep: Add DRM dependency queue layer

Reply via email to