On Mon, 2025-09-01 at 15:14 +0200, Pierre-Eric Pelloux-Prayer wrote:
> 
> 
> Le 25/08/2025 à 15:13, Philipp Stanner a écrit :
> > On Fri, 2025-08-22 at 15:43 +0200, Pierre-Eric Pelloux-Prayer wrote:
> > > Currently, the scheduler score is incremented when a job is pushed to an
> > > entity and when an entity is attached to the scheduler.
> > 
> > It's indeed awkward why attaching is treated equivalently to job
> > submission.
> > 
> > Can you expand the documentation for drm_sched_init_args a bit so that
> > it gets clearer what the score is supposed to do?
> 
> drm_sched_init_args.score is the feature allowing multiple schedulers to share a
> score so I suppose you meant drm_gpu_scheduler.score?
I mean both.

struct drm_sched_init_args has a @score which is passed by the driver
during init and will then be stored in drm_gpu_scheduler.score.

The docu should be improved for both.

> 
> The doc currently says "score to help loadbalancer pick a idle sched" which is a
> bit vague. It could be modified to become:
> 
> @score: holds the number of yet-to-be-completed jobs pushed to each
> scheduler.
> It's used when load balancing between different schedulers.

"to each scheduler" reads a bit as if each scheduler has an individual
score. Is it used that way, though?

I think it's a pointer because all schedulers share this atomic. Thus,
it "holds the total number of yet-to-be-completed jobs for all
schedulers sharing this atomic", right?

> 
> What do you think?
> 
> > > 
> > > This leads to some bad scheduling decision where the score value is
> > > largely made of idle entities.
> > > 
> > > For instance, a scenario with 2 schedulers and where 10 entities submit
> > > a single job, then do nothing, each scheduler will probably end up with
> > > a score of 5.
> > > Now, 5 userspace apps exit, so their entities will be dropped.
> > 
> > "entities will be dropped" == "drm_sched_entity_kill() gets called",
> > right?
> 
> Yes.

Then nit: better say "their entities will be killed" instead of
dropped.

> 
> > > In
> > > the worst case, these apps' entities where all attached to the same
> > 
> > s/where/were
> > 
> > or better yet: "could be"
> 
> Will fix, thanks.
> 
> > > scheduler and we end up with score=5 (the 5 remaining entities) and
> > > score=0, despite the 2 schedulers being idle.
> > 
> > Sounds indeed like a (small) problem to me.
> > 
> > > When new entities show up, they will all select the second scheduler
> > > based on its low score value, instead of alternating between the 2.
> > > 
> > > Some amdgpu rings depended on this feature, but the previous commit
> > > implemented the same thing in amdgpu directly so it can be safely
> > > removed from drm/sched.
> > 
> > Can we be that sure that other drivers don't depend on it, though? I
> > suspect it's likely that it's just amdgpu, but…
> 
> Aside from the new "rocket" as pointed out by Tvrtko, amdgpu is the only
> driver passing more than one scheduler to entities, so they're the only
> ones that could be affected.
> 
> I verified amdgpu and Tvrtko pinged the rocket maintainers in the other
> thread.

Very good! Then let's give those guys a few days to jump into the
discussion.

> > 
> > BTW, since you're cleaning up related stuff currently: I saw that it
> > seems that the only driver that sets &struct drm_sched_init_args.score
> > is amdgpu. Would be cool if you can take a look whether that's still
> > needed.
> 
> It cannot really be removed yet as it's useful when a single hardware block is
> exposed through different schedulers (so pushing jobs to one of the schedulers
> should increase the load of the underlying hw).

OK.
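To spell that use case out for anyone reading along, it would look
roughly like the sketch below. This is illustrative only, not actual
amdgpu code: my_device, my_sched_ops and the ring/sched fields are
made up, and the numeric values are arbitrary; only the
drm_sched_init_args fields themselves are the upstream ones. Two
schedulers fronting the same hardware block are initialized with the
same atomic_t through drm_sched_init_args.score, so pushing a job to
either of them bumps the common counter:

/* Hypothetical driver: one hardware block exposed as two rings. */
static atomic_t shared_hw_score = ATOMIC_INIT(0);

static int my_init_ring_scheds(struct my_device *mdev)
{
        struct drm_sched_init_args args = {
                .ops          = &my_sched_ops,
                .num_rqs      = DRM_SCHED_PRIORITY_COUNT,
                .credit_limit = 64,
                .timeout      = MAX_SCHEDULE_TIMEOUT,
                .score        = &shared_hw_score, /* shared by both rings */
                .name         = "ring0",
                .dev          = mdev->dev,
        };
        int ret;

        ret = drm_sched_init(&mdev->ring0_sched, &args);
        if (ret)
                return ret;

        args.name = "ring1";
        return drm_sched_init(&mdev->ring1_sched, &args);
}

The load balancer then sees the combined number of unfinished jobs on
that hardware block no matter which of the two schedulers it reads the
score from.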
Thx
P.

> 
> Thanks,
> Pierre-Eric
> 
> > 
> > P.
> > 
> > > 
> > > Signed-off-by: Pierre-Eric Pelloux-Prayer
> > > <[email protected]>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 2 --
> > >   1 file changed, 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 5a550fd76bf0..e6d232a8ec58 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -206,7 +206,6 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
> > >   	if (!list_empty(&entity->list))
> > >   		return;
> > >   
> > > -	atomic_inc(rq->sched->score);
> > >   	list_add_tail(&entity->list, &rq->entities);
> > >   }
> > >   
> > > @@ -228,7 +227,6 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq
> > > *rq,
> > >   
> > >   	spin_lock(&rq->lock);
> > >   
> > > -	atomic_dec(rq->sched->score);
> > >   	list_del_init(&entity->list);
> > >   
> > >   	if (rq->current_entity == entity)
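
For anyone else following the thread: the score is consumed when an
entity picks a scheduler from its sched_list, roughly like the sketch
below. This is a simplified sketch from memory, not the exact
drm_sched_pick_best() code, just to show why a stale attach-time bump
skews the choice:

static struct drm_gpu_scheduler *
pick_least_loaded(struct drm_gpu_scheduler **sched_list,
                  unsigned int num_sched_list)
{
        struct drm_gpu_scheduler *picked = NULL;
        unsigned int min_score = UINT_MAX, i;

        for (i = 0; i < num_sched_list; i++) {
                /* sched->score may be an atomic_t shared between scheds */
                unsigned int score = atomic_read(sched_list[i]->score);

                if (!sched_list[i]->ready)
                        continue;

                if (score < min_score) {
                        min_score = score;
                        picked = sched_list[i];
                }
        }

        return picked;
}

With the attach/remove bumps removed by this patch, the score only
reflects jobs that were pushed and not yet completed, so two idle
schedulers compare as equal again instead of dragging around the
entity counts of long-gone apps.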
