Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

Luben Tuikov Wed, 01 Nov 2023 16:03:29 -0700

On 2023-10-31 22:23, Danilo Krummrich wrote:
> Hi Luben,
> 

[snip]
>>> @@ -187,12 +251,14 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq 
>>> *rq,
>>>  /**
>>>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a 
>>> job to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Try to find a ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_gpu_scheduler *sched,
>>> +                         struct drm_sched_rq *rq)
>>>  {
>>>     struct drm_sched_entity *entity;
>>>  
>>> @@ -202,6 +268,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>     if (entity) {
>>>             list_for_each_entry_continue(entity, &rq->entities, list) {
>>>                     if (drm_sched_entity_is_ready(entity)) {
>>> +                           /* If we can't queue yet, preserve the current
>>> +                            * entity in terms of fairness.
>>> +                            */
>>> +                           if (!drm_sched_can_queue(sched, entity)) {
>>> +                                   spin_unlock(&rq->lock);
>>> +                                   return ERR_PTR(-ENOSPC);
>>> +                           }
>>> +
>>>                             rq->current_entity = entity;
>>>                             reinit_completion(&entity->entity_idle);
>>>                             spin_unlock(&rq->lock);
>>> @@ -211,8 +285,15 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>     }
>>>  
>>>     list_for_each_entry(entity, &rq->entities, list) {
>>> -
>>>             if (drm_sched_entity_is_ready(entity)) {
>>> +                   /* If we can't queue yet, preserve the current entity in
>>> +                    * terms of fairness.
>>> +                    */
>>> +                   if (!drm_sched_can_queue(sched, entity)) {
>>> +                           spin_unlock(&rq->lock);
>>> +                           return ERR_PTR(-ENOSPC);
>>> +                   }
>>> +
>>>                     rq->current_entity = entity;
>>>                     reinit_completion(&entity->entity_idle);
>>>                     spin_unlock(&rq->lock);
>>> @@ -231,12 +312,14 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>  /**
>>>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job 
>>> to run
>>>   *
>>> + * @sched: the gpu scheduler
>>>   * @rq: scheduler run queue to check.
>>>   *
>>>   * Find oldest waiting ready entity, returns NULL if none found.
>>>   */
>>>  static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_gpu_scheduler *sched,
>>> +                           struct drm_sched_rq *rq)
>>>  {
>>>     struct rb_node *rb;
>>>  
>>> @@ -246,6 +329,14 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq 
>>> *rq)
>>>  
>>>             entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>             if (drm_sched_entity_is_ready(entity)) {
>>> +                   /* If we can't queue yet, preserve the current entity in
>>> +                    * terms of fairness.
>>> +                    */
>>> +                   if (!drm_sched_can_queue(sched, entity)) {
>>> +                           spin_unlock(&rq->lock);
>>> +                           return ERR_PTR(-ENOSPC);
>>> +                   }
>>> +
>>
>> So, this kinda did this abrupt "return" in v2, then it was fixed fine in v3,
>> and now I see we return and return an error now, which doesn't seem to be 
>> used
>> by anyone. In fact, in drm_sched_select_entity(), we do this,
>>
>> -    return entity;
>> +    return IS_ERR(entity) ? NULL : entity;
>>
>> So, it's perhaps best to leave this as it was in v3, and if in the future
>> we need to distinguish between the type of error, then that future patch
>> would do that and also show how this is used with new code and logic.
>>
>> There is little value to over-engineer this right now, given that in
>> the future, the logic may be more esoteric than we can think of.
> 
> Ok, maybe what I do here is a little bit subtle and requires a comment. Let me
> explain.
> 
> The reason I return an ERR_PTR() instead of NULL is to indicate to
> drm_sched_select_entity() to break out of the runqueue loop
> (drm_sched_select_entity() breaks the loop when the returned entity is not
> NULL).
> 
> Without that, we'd continue the runqueue loop in drm_sched_select_entity() and
> retry with the next lower priority. This could lead to prioritiy inversion of
> the runqueues, since a lower priority runqueue with jobs with less credits 
> could
> stall a higher priority runqueue with jobs with more credits.
> 
> Hence, in drm_sched_select_entity() we need to be able to distinguish between
> drm_sched_rq_select_entity_fifo()/drm_sched_rq_select_entity_rr() can't find 
> an
> entity and they *can* find an entity, but it's job doesn't fit on the ring.
> 
> I think what makes it subtle in this patch is that in 
> drm_sched_select_entity()
> the condition already fits with
> 
>       if (entity)
>               break;
> 
> because we want to break this loop when we actually found an entity, or when
> there is no space on the ring buffer, but we want to keep checking the other
> runqueues if entity is NULL.


Okay, that's fine, but please update the head comment of
drm_sched_rq_select_entity_{rr,fifo}() and of 
drm_sched_select_entity() to explain what is being returned.

This invariably adds to the documentation which some want added to DRM--let's
first all start documenting code which we add ourselves.

I'd imagine this would look something like this, for each _{rr,fifo},
respectively, (remove content with braces {}, x4),

/**
 * drm_sched_rq_select_entity_{LT: rr,fifo} - Select an entity which provides a 
job to run
 * @sched: the gpu scheduler
 * @rq: scheduler run queue to check.
 *
 * Try to find a ready entity, returns NULL if none found. {LT: RR}
 * Find oldest waiting ready entity, returns NULL if none found. {LT: FIFO}
 *
 * Return an entity if one is found; return an error-pointer (!NULL) if an 
entity
 * was ready, but the scheduler had insufficient credits to accommodate its job;
 * return NULL, if no ready entity was found. {LT: for both RR and FIFO.}
 */

And,

/**
 * drm_sched_select_entity - Select the next entity to process
 * @sched: the scheduler instance
 *
 * Return an entity to process or NULL if none are found.
 *
 * Note, that we break out of the for-loop when "entity"
 * is non-null, which can also be an error-pointer--this assures
 * we don't process lower priority run-queues. See comments
 * in the respectively called functions.
 */

[snip]
>>> +
>>> +   /**
>>> +    * @update_job_credits: Called once the scheduler is considering this
>>> +    * job for execution.
>>
>> "once" --> "when", a close cousin of "once", but clearer in code comments.
>> It is called in fact as many times as the job is considered to be pushed down
>> to hardware, if we couldn't previously.
> 
> Sure, gonna change that.
> 
>>
>>> +    *
>>> +    * This callback returns the number of credits this job would take if
>>
>> Too many repetitions of "this". Instead of "this job", say "the job".
> 
> Gonna fix.

Great.

Could you please rebase your patch on top of drm-misc-next--Matthew's Xe changes
went in this afternoon.

Thanks!
-- 
Regards,
Luben

OpenPGP_0x4C15479431A334AF.asc
Description: OpenPGP public key

OpenPGP_signature.asc
Description: OpenPGP digital signature

Re: [PATCH drm-misc-next v4] drm/sched: implement dynamic job-flow control

Reply via email to