On 09/01/2026 14:31, Christian König wrote:
On 1/9/26 15:06, Tvrtko Ursulin wrote:
On 07/01/2026 14:11, Philipp Stanner wrote:
Happy 2026, Tvrtko!


On Wed, 2026-01-07 at 13:44 +0000, Tvrtko Ursulin wrote:
drm_sched_entity_is_idle() is called when flushing the entity before
destroying it and currently decides on the idle status based either on
the number of jobs in its queue being zero, or on whether the entity is
not part of any run-queue.

If an entity is not part of a run-queue it is implied it can have no jobs in
its queue, from which it follows that it is redundant to look at both and
we can simplify to only look at the queue.

The list_empty() check was added in
a875f58e237a ("drm/scheduler: stop setting rq to NULL")
where it replaced the entity->rq check which was added in
741f01e636b7 ("drm/scheduler: Avoid using wait_event_killable for dying process (V4)").

Since explicit handling of the submit race involving exiting entities was
added soon after, via entity->stopped, in
62347a33001c ("drm/scheduler: Add stopped flag to drm_sched_entity")
it indeed looks like we are safe to remove the list_empty() check.

This means we can remove the memory barrier as well, and the fewer memory
barriers the better.

I am not convinced that this change is worth it.

Not too long ago we discussed that the spsc_queue should be removed and
replaced by some sort of list, with proper locks. Christian has agreed
that this should fly.

The spsc queue has only 1 user in the kernel and it's super hard to
understand how it's supposed to work and when and why the barriers and
READ_ONCE()s are necessary (that's, of course, also not documented in
the queue).

Until that is done I don't really want to touch any of those memory
barriers.
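
To illustrate the kind of subtlety in question, here is a rough sketch of the
lockless push such a queue does. This is the general pattern only, deliberately
simplified, and not the actual include/drm/spsc_queue.h code:

struct sketch_node {
	struct sketch_node *next;
};

struct sketch_queue {
	struct sketch_node *head;	/* consumer reads from here */
	struct sketch_node **tail;	/* producer end, starts as &head */
	atomic_t count;
};

/* Producer side of an intrusive lockless queue, illustrative only. */
static void sketch_push(struct sketch_queue *q, struct sketch_node *n)
{
	struct sketch_node **tail;

	n->next = NULL;

	/* Order the node's initialization before the store which makes
	 * it reachable below.
	 */
	smp_wmb();

	/* Swing the tail to the new node, then link the node in at the
	 * previous tail. A consumer that observes the WRITE_ONCE() must
	 * also observe the initialized node, which is exactly the kind
	 * of pairing that is easy to get wrong and is not documented in
	 * the real queue.
	 */
	tail = xchg(&q->tail, &n->next);
	WRITE_ONCE(*tail, n);
	atomic_inc(&q->count);
}

The consumer side, which has to cope with racing against a concurrent push to
the last node, is even more delicate and is omitted here.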

(I had a branch with spsc gone more than a year ago but I abandoned it for now
since it contained some other, overly ambitious, changes. So no complaints from
me. Who is doing it, BTW?)

Back to the point - this patch can wait, no problem. Let me explain the context
though.

I wanted to get rid of looking at the list_empty here because I have a branch 
which improves the flow for the 1:1 sched:entity drivers.

Why are the two related? If you remember, in the fair scheduler series all the
run-queue stuff is nicely grouped in sched_rq.c and encapsulated in the rq API,
which made it possible to follow up with virtualizing the rq operations.

Yet another relevant thing is the patch I sent this week which removes the
special case where an entity can be initialized with no schedulers.

If my memory does not completely fail me, the patch to reject initializing
entities with no scheduler is actually a prerequisite of this patch here.

I look at it like this - "Can drm_sched_entity_is_idle() ever return true today if the entity is unlinked but the other conditions are not true?"

I think it cannot, because if the entity is unlinked then the spsc queue count must be zero. Therefore list_empty() is redundant. In fact...

The list_empty() is most likely only there because we had entities initialized 
without a scheduler.

... this is where I noted the archaeological digging I did in the commit message. It appears list_empty() only replaced the entity->rq == NULL check, but list_empty() was then found to be insufficient and entity->stopped was added in response.

So in my mind the only question is whether some barrier is needed or whether a READ_ONCE() on entity->stopped is enough.

In other words: can drm_sched_entity_kill() run in parallel with drm_sched_entity_flush(), and can the latter fail to see entity->stopped == true when woken up in wait_event_*()?

drm_sched_entity_kill() sets entity->stopped, from drm_sched_entity_fini(), on what can be the driver's entity teardown path, while drm_sched_entity_flush() can be called from drm fd close (amdgpu and msm), but fd close should happen after any driver-specific entity destroy.
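
For reference, an abridged sketch of the two paths (the function names are as
in sched_entity.c, but the entity lock, timeout handling and error paths are
stripped, so this is not the verbatim kernel code):

/* Teardown side, cf. drm_sched_entity_kill(), abridged: */
static void kill_path(struct drm_sched_entity *entity)
{
	entity->stopped = true;		/* the store in question */
}

/* Flush side, cf. drm_sched_entity_flush(), abridged: */
static void flush_path(struct drm_sched_entity *entity)
{
	struct drm_gpu_scheduler *sched = entity->rq->sched;

	/* wait_event_*() re-evaluates the condition after every
	 * wake-up, and the set_current_state()/wake_up() pairing
	 * inside it provides the ordering, so a waiter woken after
	 * the store above should observe entity->stopped == true
	 * on the re-check.
	 */
	wait_event_killable(sched->job_scheduled,
			    drm_sched_entity_is_idle(entity));
}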

So it does not look like it, but maybe I am missing something.

Anyway, it is not a terribly important cleanup so as long as there is doubt it can stay parked.

Regards,

Tvrtko


If we combine all three of these prerequisites, my branch allows the fully
invariant sched:entity and 1:1:1 sched:rq:entity mapping. Run-queue vfuncs for
the 1:1 drivers become mostly no-ops (remove and pop entity are not needed at
all, while add just checks for entity->stopped). So all the list and tree
management needed by M:N drivers simply does not happen.
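
Purely as a hypothetical illustration of that shape (these names do not exist
upstream and are made up for this sketch):

/* Hypothetical run-queue vtable, for illustration only. */
struct drm_sched_rq_ops {
	void (*add_entity)(struct drm_sched_rq *rq,
			   struct drm_sched_entity *entity);
	void (*remove_entity)(struct drm_sched_rq *rq,
			      struct drm_sched_entity *entity);
	struct drm_sched_entity *(*pop_entity)(struct drm_sched_rq *rq);
};

/* For a 1:1 driver, add would reduce to the stopped check, while
 * remove and pop would not need to be wired up at all.
 */
static void sketch_1to1_add_entity(struct drm_sched_rq *rq,
				   struct drm_sched_entity *entity)
{
	WARN_ON_ONCE(READ_ONCE(entity->stopped));
	/* No list or tree management needed in the 1:1 case. */
}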

That sounds sane to me, but I would do one step at a time.

Regards,
Christian.


In that branch 1:1 entities do not take part in the rq->entities_list so, going
back to this patch, the list_empty() check will be in the way.

Anyway, just for context; I will park this cleanup for now but you can mull
over whether the bigger picture sounds interesting to you.

Regards,

Tvrtko


While at it, we add a READ_ONCE() annotation on the entity->stopped check to
mark the unlocked read.

This would effectively legalize the lockless access. But it is illegal
and undefined behavior. See drm_sched_fini().

P.



Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Christian König <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Philipp Stanner <[email protected]>
---
   drivers/gpu/drm/scheduler/sched_entity.c | 10 ++--------
   1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index bb7e5fc47f99..2ed84504cfd6 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -137,14 +137,8 @@ EXPORT_SYMBOL(drm_sched_entity_modify_sched);
 
 static bool drm_sched_entity_is_idle(struct drm_sched_entity *entity)
 {
-	rmb(); /* for list_empty to work without lock */
-
-	if (list_empty(&entity->list) ||
-	    spsc_queue_count(&entity->job_queue) == 0 ||
-	    entity->stopped)
-		return true;
-
-	return false;
+	return spsc_queue_count(&entity->job_queue) == 0 ||
+	       READ_ONCE(entity->stopped);
 }
 
 /**



