On Mon, 23 Mar 2026 10:08:53 -0700 Matthew Brost <[email protected]> wrote:
> On Mon, Mar 23, 2026 at 10:55:04AM +0100, Boris Brezillon wrote: > > Hi Matthew, > > > > On Sun, 22 Mar 2026 21:50:07 -0700 > > Matthew Brost <[email protected]> wrote: > > > > > > > > > diff --git a/drivers/gpu/drm/dep/drm_dep_job.c > > > > > > > b/drivers/gpu/drm/dep/drm_dep_job.c > > > > > > > new file mode 100644 > > > > > > > index 000000000000..2d012b29a5fc > > > > > > > --- /dev/null > > > > > > > +++ b/drivers/gpu/drm/dep/drm_dep_job.c > > > > > > > @@ -0,0 +1,675 @@ > > > > > > > +// SPDX-License-Identifier: MIT > > > > > > > +/* > > > > > > > + * Copyright 2015 Advanced Micro Devices, Inc. > > > > > > > + * > > > > > > > + * Permission is hereby granted, free of charge, to any person > > > > > > > obtaining a > > > > > > > + * copy of this software and associated documentation files (the > > > > > > > "Software"), > > > > > > > + * to deal in the Software without restriction, including > > > > > > > without limitation > > > > > > > + * the rights to use, copy, modify, merge, publish, distribute, > > > > > > > sublicense, > > > > > > > + * and/or sell copies of the Software, and to permit persons to > > > > > > > whom the > > > > > > > + * Software is furnished to do so, subject to the following > > > > > > > conditions: > > > > > > > + * > > > > > > > + * The above copyright notice and this permission notice shall > > > > > > > be included in > > > > > > > + * all copies or substantial portions of the Software. > > > > > > > + * > > > > > > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY > > > > > > > KIND, EXPRESS OR > > > > > > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > > > > > > > MERCHANTABILITY, > > > > > > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO > > > > > > > EVENT SHALL > > > > > > > + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, > > > > > > > DAMAGES OR > > > > > > > + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR > > > > > > > OTHERWISE, > > > > > > > + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR > > > > > > > THE USE OR > > > > > > > + * OTHER DEALINGS IN THE SOFTWARE. > > > > > > > + * > > > > > > > + * Copyright © 2026 Intel Corporation > > > > > > > + */ > > > > > > > + > > > > > > > +/** > > > > > > > + * DOC: DRM dependency job > > > > > > > + * > > > > > > > + * A struct drm_dep_job represents a single unit of GPU work > > > > > > > associated with > > > > > > > + * a struct drm_dep_queue. The lifecycle of a job is: > > > > > > > + * > > > > > > > + * 1. **Allocation**: the driver allocates memory for the job > > > > > > > (typically by > > > > > > > + * embedding struct drm_dep_job in a larger structure) and > > > > > > > calls > > > > > > > + * drm_dep_job_init() to initialise it. On success the job > > > > > > > holds one > > > > > > > + * kref reference and a reference to its queue. > > > > > > > + * > > > > > > > + * 2. **Dependency collection**: the driver calls > > > > > > > drm_dep_job_add_dependency(), > > > > > > > + * drm_dep_job_add_syncobj_dependency(), > > > > > > > drm_dep_job_add_resv_dependencies(), > > > > > > > + * or drm_dep_job_add_implicit_dependencies() to register > > > > > > > dma_fence objects > > > > > > > + * that must be signalled before the job can run. Duplicate > > > > > > > fences from the > > > > > > > + * same fence context are deduplicated automatically. > > > > > > > + * > > > > > > > + * 3. **Arming**: drm_dep_job_arm() initialises the job's > > > > > > > finished fence, > > > > > > > + * consuming a sequence number from the queue. After arming, > > > > > > > + * drm_dep_job_finished_fence() returns a valid fence that > > > > > > > may be passed to > > > > > > > + * userspace or used as a dependency by other jobs. > > > > > > > + * > > > > > > > + * 4. **Submission**: drm_dep_job_push() submits the job to the > > > > > > > queue. The > > > > > > > + * queue takes a reference that it holds until the job's > > > > > > > finished fence > > > > > > > + * signals and the job is freed by the put_job worker. > > > > > > > + * > > > > > > > + * 5. **Completion**: when the job's hardware work finishes its > > > > > > > finished fence > > > > > > > + * is signalled and drm_dep_job_put() is called by the queue. > > > > > > > The driver > > > > > > > + * must release any driver-private resources in > > > > > > > &drm_dep_job_ops.release. > > > > > > > + * > > > > > > > + * Reference counting uses drm_dep_job_get() / > > > > > > > drm_dep_job_put(). The > > > > > > > + * internal drm_dep_job_fini() tears down the dependency xarray > > > > > > > and fence > > > > > > > + * objects before the driver's release callback is invoked. > > > > > > > + */ > > > > > > > + > > > > > > > +#include <linux/dma-resv.h> > > > > > > > +#include <linux/kref.h> > > > > > > > +#include <linux/slab.h> > > > > > > > +#include <drm/drm_dep.h> > > > > > > > +#include <drm/drm_file.h> > > > > > > > +#include <drm/drm_gem.h> > > > > > > > +#include <drm/drm_syncobj.h> > > > > > > > +#include "drm_dep_fence.h" > > > > > > > +#include "drm_dep_job.h" > > > > > > > +#include "drm_dep_queue.h" > > > > > > > + > > > > > > > +/** > > > > > > > + * drm_dep_job_init() - initialise a dep job > > > > > > > + * @job: dep job to initialise > > > > > > > + * @args: initialisation arguments > > > > > > > + * > > > > > > > + * Initialises @job with the queue, ops and credit count from > > > > > > > @args. Acquires > > > > > > > + * a reference to @args->q via drm_dep_queue_get(); this > > > > > > > reference is held for > > > > > > > + * the lifetime of the job and released by drm_dep_job_release() > > > > > > > when the last > > > > > > > + * job reference is dropped. > > > > > > > + * > > > > > > > + * Resources are released automatically when the last reference > > > > > > > is dropped > > > > > > > + * via drm_dep_job_put(), which must be called to release the > > > > > > > job; drivers > > > > > > > + * must not free the job directly. > > > > > > > + * > > > > > > > + * Context: Process context. Allocates memory with GFP_KERNEL. > > > > > > > + * Return: 0 on success, -%EINVAL if credits is 0, > > > > > > > + * -%ENOMEM on fence allocation failure. > > > > > > > + */ > > > > > > > +int drm_dep_job_init(struct drm_dep_job *job, > > > > > > > + const struct drm_dep_job_init_args *args) > > > > > > > +{ > > > > > > > + if (unlikely(!args->credits)) { > > > > > > > + pr_err("drm_dep: %s: credits cannot be 0\n", __func__); > > > > > > > + return -EINVAL; > > > > > > > + } > > > > > > > + > > > > > > > + memset(job, 0, sizeof(*job)); > > > > > > > + > > > > > > > + job->dfence = drm_dep_fence_alloc(); > > > > > > > + if (!job->dfence) > > > > > > > + return -ENOMEM; > > > > > > > + > > > > > > > + job->ops = args->ops; > > > > > > > + job->q = drm_dep_queue_get(args->q); > > > > > > > + job->credits = args->credits; > > > > > > > + > > > > > > > + kref_init(&job->refcount); > > > > > > > + xa_init_flags(&job->dependencies, XA_FLAGS_ALLOC); > > > > > > > + INIT_LIST_HEAD(&job->pending_link); > > > > > > > + > > > > > > > + return 0; > > > > > > > +} > > > > > > > +EXPORT_SYMBOL(drm_dep_job_init); > > > > > > > + > > > > > > > +/** > > > > > > > + * drm_dep_job_drop_dependencies() - release all input > > > > > > > dependency fences > > > > > > > + * @job: dep job whose dependency xarray to drain > > > > > > > + * > > > > > > > + * Walks @job->dependencies, puts each fence, and destroys the > > > > > > > xarray. > > > > > > > + * Any slots still holding a %DRM_DEP_JOB_FENCE_PREALLOC > > > > > > > sentinel — > > > > > > > + * i.e. slots that were pre-allocated but never replaced — are > > > > > > > silently > > > > > > > + * skipped; the sentinel carries no reference. Called from > > > > > > > + * drm_dep_queue_run_job() in process context immediately after > > > > > > > + * @ops->run_job() returns, before the final drm_dep_job_put(). > > > > > > > Releasing > > > > > > > + * dependencies here — while still in process context — avoids > > > > > > > calling > > > > > > > + * xa_destroy() from IRQ context if the job's last reference is > > > > > > > later > > > > > > > + * dropped from a dma_fence callback. > > > > > > > + * > > > > > > > + * Context: Process context. > > > > > > > + */ > > > > > > > +void drm_dep_job_drop_dependencies(struct drm_dep_job *job) > > > > > > > +{ > > > > > > > + struct dma_fence *fence; > > > > > > > + unsigned long index; > > > > > > > + > > > > > > > + xa_for_each(&job->dependencies, index, fence) { > > > > > > > + if (unlikely(fence == DRM_DEP_JOB_FENCE_PREALLOC)) > > > > > > > + continue; > > > > > > > + dma_fence_put(fence); > > > > > > > + } > > > > > > > + xa_destroy(&job->dependencies); > > > > > > > +} > > > > > > > + > > > > > > > +/** > > > > > > > + * drm_dep_job_fini() - clean up a dep job > > > > > > > + * @job: dep job to clean up > > > > > > > + * > > > > > > > + * Cleans up the dep fence and drops the queue reference held by > > > > > > > @job. > > > > > > > + * > > > > > > > + * If the job was never armed (e.g. init failed before > > > > > > > drm_dep_job_arm()), > > > > > > > + * the dependency xarray is also released here. For armed jobs > > > > > > > the xarray > > > > > > > + * has already been drained by drm_dep_job_drop_dependencies() > > > > > > > in process > > > > > > > + * context immediately after run_job(), so it is left untouched > > > > > > > to avoid > > > > > > > + * calling xa_destroy() from IRQ context. > > > > > > > + * > > > > > > > + * Warns if @job is still linked on the queue's pending list, > > > > > > > which would > > > > > > > + * indicate a bug in the teardown ordering. > > > > > > > + * > > > > > > > + * Context: Any context. > > > > > > > + */ > > > > > > > +static void drm_dep_job_fini(struct drm_dep_job *job) > > > > > > > +{ > > > > > > > + bool armed = drm_dep_fence_is_armed(job->dfence); > > > > > > > + > > > > > > > + WARN_ON(!list_empty(&job->pending_link)); > > > > > > > + > > > > > > > + drm_dep_fence_cleanup(job->dfence); > > > > > > > + job->dfence = NULL; > > > > > > > + > > > > > > > + /* > > > > > > > + * Armed jobs have their dependencies drained by > > > > > > > + * drm_dep_job_drop_dependencies() in process context after > > > > > > > run_job(). > > > > > > > > > > > > Just want to clear the confusion and make sure I get this right at > > > > > > the > > > > > > same time. To me, "process context" means a user thread entering > > > > > > some > > > > > > syscall(). What you call "process context" is more a "thread > > > > > > context" to > > > > > > me. I'm actually almost certain it's always a kernel thread (a > > > > > > workqueue > > > > > > worker thread to be accurate) that executes the drop_deps() after a > > > > > > run_job(). > > > > > > > > > > Some of context comments likely could be cleaned up. 'process context' > > > > > here either in user context (bypass path) or run job work item. > > > > > > > > > > > > > > > > > > + * Skip here to avoid calling xa_destroy() from IRQ context. > > > > > > > + */ > > > > > > > + if (!armed) > > > > > > > + drm_dep_job_drop_dependencies(job); > > > > > > > > > > > > Why do we need to make a difference here. Can't we just assume that > > > > > > the > > > > > > hole drm_dep_job_fini() call is unsafe in atomic context, and have a > > > > > > work item embedded in the job to defer its destruction when _put() > > > > > > is > > > > > > called in a context where the destruction is not allowed? > > > > > > > > > > > > > > > > We already touched on this, but the design currently allows the last > > > > > job > > > > > put from dma-fence signaling path (IRQ). > > > > > > > > It's not much about the last _put and more about what happens in the > > > > _release() you pass to kref_put(). My point being, if you assume > > > > something in _release() is not safe to be done in an atomic context, > > > > and _put() is assumed to be called from any context, you might as well > > > > > > > > > > No. DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE indicates that the entire job > > > put (including release) is IRQ-safe. If the documentation isn’t clear, I > > > can clean that up. Some of my comments here [1] try to explain this > > > further. > > > > > > Setting DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE makes a job analogous to a > > > dma-fence whose release must be IRQ-safe, so there is precedent for > > > this. I didn’t want to unilaterally require that all job releases be > > > IRQ-safe, as that would conflict with existing DRM scheduler jobs—hence > > > the flag. > > > > > > The difference between non-IRQ-safe and IRQ-safe job release is only > > > about 12 lines of code. > > > > It's not just about the number of lines of code added to the core to > > deal with that case, but also complexity of the API that results from > > these various modes. > > > > Fair enough. > > > > I figured that if we’re going to invest the time > > > and effort to replace DRM sched, we should aim for the best possible > > > implementation. Any driver can opt-in here and immediately get less CPU > > > ultization and power savings. I will try to figure out how to measure > > > this and get some number here. > > > > That's key here. My gut feeling is that we have so much deferred > > already that adding one more work to the workqueue is not going to > > hurt in term of scheduling overhead (no context switch if it's > > scheduled on the same workqueue). Job cleanup is just the phase > > Signaling of fences in many drivers occurs in hard IRQ context rather > than in a work queue. I agree that if you are signaling fences from a > work queue, the overhead of another work item is minimal. I'm talking about the drm_dep_queue_run_job_queue() call which in turn calls queue_work() when a job gets reported as done. That, I think, is the most likely path, isn't it? > > > following the job_done() event, which also requires a deferred work to > > check progress on the queue anyway. And if you move the entirety of the > > Yes, I see Panthor signals fences from a work queue by looking at the > seqnos, but again, in many drivers this flow is IRQ-driven for fence > signaling latency reasons. I'm not talking about the signalling of the done_fence, but the work that's used to check progress on a job queue (drm_dep_queue_run_job_queue()). > > > > > > > > > > > > > Once arm() is called there is a guarnette the run_job path is called > > > > > either via bypass or run job work item. > > > > > > > > Sure. > > > > > > > > > > Let’s not gloss over this—this is actually a huge difference from DRM > > > sched. One of the biggest problems I found with DRM sched is that if you > > > call arm(), run_job() may or may not be called. Without this guarantee, > > > you can’t do driver-side bookkeeping in arm() that is later released in > > > run_job(), which would otherwise simplify the driver design. > > > > You can do driver-side book-keeping after all jobs have been > > successfully initialized, which include arming their fences. The key > > turning point is when you start exposing those armed fences, not > > when you arm them. See below. > > > > There is still the seqno critical section which starts at arm() and > closed at push() or drop of fence. That's orthogonal to the rule that says nothing after _arm() can fail, I think. To guarantee proper job ordering, you need extra locking (at the moment, we rely on the VM resv lock to serialize this in Panthor). > > > > > > In my opinion, it’s best and safest to enforce a no-failure policy > > > between arm() and push(). > > > > I don't think it's safer, it's just the semantics that have been > > defined by drm_sched/dma_fence and that we keep forcing ourselves > > into. I'd rather have a well defined dma_fence state that says "that's > > it, I'm exposed, you have to signal me now", than this half-enforced > > arm()+push() model. > > > > So what is the suggestion here — move the asserts I have from arm() to > something like begin_push()? We could add a dma-fence state toggle there > as well if we can get that part merged into dma-fence. Or should we just > drop the asserts/lockdep checks between arm() and push() completely? I’m > open to either approach here. If we can have that INACTIVE flag added, and the associated dma_fence_init[64]_inactive() variants, I would say, we call dma_fence_init[64]_inactive() in _arm(), and we call dma_fence_set_active() in _push(). It'd still be valuable to have some sort of delimitation for the submission through some block-like macro with an associated context to which we can attach states and allow for more (optional?) runtime-checks. > > > > > > > FWIW, this came up while I was reviewing AMDXDNA’s DRM scheduler usage, > > > which had the exact issue I described above. I pointed it out and got a > > > reply saying, “well, this is an API issue, right?”—and they were > > > correct, it is an API issue. > > > > > > > > > > > > > > > > > > > > In general, I wonder if we should distinguish between "armed" and > > > > > > "publicly exposed" to help deal with this intra-batch dep thing > > > > > > without > > > > > > resorting to reservation and other tricks like that. > > > > > > > > > > > > > > > > I'm not exactly sure what you suggesting but always open to ideas. > > > > > > > > Right now _arm() is what does the dma_fence_init(). But there's an > > > > extra step between initializing the fence object and making it > > > > visible to the outside world. In order for the dep to be added to the > > > > job, you need the fence to be initialized, but that's not quite > > > > external visibility, because the job is still very much a driver > > > > object, and if something fails, the rollback mechanism makes it so all > > > > the deps are dropped on the floor along the job that's being destroyed. > > > > So we won't really wait on this fence that's never going to be > > > > signalled. > > > > > > > > I see what's appealing in pretending that _arm() == externally-visible, > > > > but it's also forcing us to do extra pre-alloc (or other pre-init) > > > > operations that would otherwise not be required in the submit path. Not > > > > a hill I'm willing to die on, but I just thought I'd mention the fact I > > > > find it weird that we put extra constraints on ourselves that are not > > > > strictly needed, just because we fail to properly flag the dma_fence > > > > visibility transitions. > > > > > > See the dma-resv example above. I’m not willing to die on this hill > > > either, but again, in my opinion, for safety and as an API-level > > > contract, enforcing arm() as a no-failure point makes sense. It prevents > > > drivers from doing anything dangerous like the dma-resv example, which > > > is an extremely subtle bug. > > > > That's a valid point, but you're not really enforcing things at > > compile/run-time it's just "don't do this/that" in the docs. If you > > encode the is_active() state at the dma_fence level, properly change > > the fence state anytime it's about to be added to a public container, > > and make it so an active fence that's released without being signalled > > triggers a WARN_ON(), you've achieved more. Once you've done that, you > > can also relax the rule that says that "an armed fence has to be > > signalled" to "a fence that's active has to be signalled". With this, > > the pre-alloc for intra-batch deps in your drm_dep_job::deps xarray is > > no longer required, because you would be able to store inactive fences > > I wouldn’t go that far or say it’s that simple. This would require a > fairly large refactor of Xe’s VM bind pipeline to call arm() earlier, > and I’m not even sure it would be possible. Between arm() and push(), > the seqno critical section still remains and requires locking, in > particulay the tricky case is kernel binds (e.g., page fault handling) > which use the same queue. Multiple threads can issue kernel binds > concurrently, as our page fault handler is multi-threaded, similar > to the CPU page fault handler, so critical section between arm() and > push() is very late in the pipeline tightly protected by a lock. This sounds like a different issue to me. That's the constraint that says _arm() and _push() ordering needs to be preserved to guarantee that jobs are properly ordered on the job queue. But that's orthogonal to the rule that says nothing between _arm() and _push() on a given job can fail. Let's take the Panthor case as an example: for_each_job_in_batch() { // This acquires the VM resv lock, and all BO locks // Because queues target a specific VM and all jobs // in the a SUBMIT must target the same VM, this // guarantees that seqno allocation happening further // down (when _arm() is called) won't be interleaved // with other concurrent submissions to the same queues. lock_and_prepare_resvs() <--- Seqno critical section starts here } for_each_job_in_batch() { // If something fails here, we drop all the jobs that // are part of this SUBMIT, and the resv locks are // released as part of the rollback. This means we // consumed but didn't use the seqnos, thus creating // a whole on the timeline, which is armless, as long // as those seqnos are not recycled. ret = faillible_stuf() if (ret) goto rollback; arm(job) } // Nothing can fail after this point for_each_job_in_batch() { // resv locks are released here, unblocking other // concurrent submissions update_resvs(job->done_fence) <--- Seqno critical section ends here in case of success push(job) } update_submit_syncobjs(); rollback: unlock_resvs() <--- Seqno critical section ends here in case of success ... How wide your critical seqno section is is up to each driver, really. > > > there, as long as they become active before the job is pushed. > > > > > > > > > > > > > On the rust side it would be directly described through the type > > > > system (see the Visibility attribute in Daniel's branch[1]). On C side, > > > > this could take the form of a new DMA_FENCE_FLAG_INACTIVE (or whichever > > > > name you want to give it). Any operation pushing the fence to public > > > > container (dma_resv, syncobj, sync_file, ...) would be rejected when > > > > that flag is set. At _push() time, we'd clear that flag with a > > > > dma_fence_set_active() helper, which would reflect the fact the fence > > > > can now be observed and exposed to the outside world. > > > > > > > > > > Timeline squashing is problematic due to the DMA_FENCE_FLAG_INACTIVE > > > flag. When adding a fence to dma-resv, fences that belong to the same > > > timeline are immediately squashed. A later transition of the fence state > > > completely breaks this behavior. > > > > That's exactly my point: as soon as you want to insert the fence to a > > public container, you have to make it "active", so it will never be > > rolled back to the previous entry in the resv. Similarly, a > > wait/add_callback() on an inactive fence should be rejected. > > > > This is bit bigger dma-fence / treewide level change but in general I > believe this is a good idea. I agree it's a bit more work. It implies patching containers to reject insertion when the INACTIVE flag is set. If we keep !INACTIVE as the default (__dma_fence_init(INACTIVE) being an opt-in), fence emitters can be moved to this model progressively though.
