On 2021-08-31 08:59, Daniel Vetter wrote:
> Can we please have some actual commit message here, with detailed
> explanation of the race/bug/whatever, how you fix it and why this is the
> best option?

I agree with Daniel: a narrative commit message is much easier for humans
to digest. The "[what]"/"[why]"/"[how]" and "issue"/"fix" format is dry and
uninformative, and leaves much to be desired.

Regards,
Luben

>
> On Tue, Aug 31, 2021 at 06:35:39PM +0800, Monk Liu wrote:
>> tested-by: jingwen chen <jingwen.c...@amd.com>
>> Signed-off-by: Monk Liu <monk....@amd.com>
>> Signed-off-by: jingwen chen <jingwen.c...@amd.com>
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 24 ++++--------------------
>>  1 file changed, 4 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index ecf8140..894fdb24 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -319,19 +319,17 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>      sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>  
>>      /* Protects against concurrent deletion in drm_sched_get_cleanup_job */
>> +    if (!__kthread_should_park(sched->thread))
> This is a __ function, i.e. considered internal, and it's lockless atomic,
> i.e. unordered. And you're not explaining why this works.
>
> Iow it's probably buggy, and just unconditionally parking the kthread
> is probably the right thing to do. If it's not the right thing to do,
> there's a bug here for sure.
> -Daniel
>
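To make Daniel's point concrete (my reading of kernel/kthread.c, worth
double-checking): __kthread_should_park() is a plain test_bit() on
KTHREAD_SHOULD_PARK with no ordering guarantees, so the check-then-park
here can interleave with a concurrent park/unpark from another path, e.g.:

    CPU0: drm_sched_job_timedout()       CPU1: concurrent reset path
    ---------------------------------    ----------------------------
    __kthread_should_park() -> false
                                         kthread_park(sched->thread)
    kthread_park(sched->thread)
      -> WARN_ON_ONCE + -EBUSY in
         kthread_park(), park state
         is now ambiguous

The inverse interleaving is worse: the check sees "should park" already
set, skips parking, the other side unparks the thread, and
drm_sched_get_cleanup_job() is then free to delete the very job fetched
just below.
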
>> +            kthread_park(sched->thread);
>> +
>>      spin_lock(&sched->job_list_lock);
>>      job = list_first_entry_or_null(&sched->pending_list,
>>                                     struct drm_sched_job, list);
>>  
>>      if (job) {
>> -            /*
>> -             * Remove the bad job so it cannot be freed by concurrent
>> -             * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>> -             * is parked at which point it's safe.
>> -             */
>> -            list_del_init(&job->list);
>>              spin_unlock(&sched->job_list_lock);
>>  
>> +            /* vendor's timeout_job should call drm_sched_start() */
>>              status = job->sched->ops->timedout_job(job);
>>  
>>              /*
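
Side note for readers following the contract in that comment: on the
driver side it looks roughly like the sketch below. This is illustrative
only; the my_driver_* names are made up, and whether to resubmit is
driver policy:

    static enum drm_gpu_sched_stat
    my_driver_timedout_job(struct drm_sched_job *job)
    {
            struct drm_gpu_scheduler *sched = job->sched;

            drm_sched_stop(sched, job);      /* park thread, detach done-callbacks */
            my_driver_reset_hw(sched);       /* hypothetical HW reset */
            drm_sched_resubmit_jobs(sched);  /* re-arm unsignaled jobs */
            drm_sched_start(sched, true);    /* restore callbacks, unpark thread */

            return DRM_GPU_SCHED_STAT_NOMINAL;
    }

Note that drm_sched_stop() itself parks the thread (next hunk), which is
presumably why this patch guards its own kthread_park() with
__kthread_should_park() in the first place.
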
>> @@ -393,20 +391,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>      kthread_park(sched->thread);
>>  
>>      /*
>> -     * Reinsert back the bad job here - now it's safe as
>> -     * drm_sched_get_cleanup_job cannot race against us and release the
>> -     * bad job at this point - we parked (waited for) any in progress
>> -     * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>> -     * now until the scheduler thread is unparked.
>> -     */
>> -    if (bad && bad->sched == sched)
>> -            /*
>> -             * Add at the head of the queue to reflect it was the earliest
>> -             * job extracted.
>> -             */
>> -            list_add(&bad->list, &sched->pending_list);
>> -
>> -    /*
>>       * Iterate the job list from later to earlier one and either deactivate
>>       * their HW callbacks or remove them from pending list if they already
>>       * signaled.
>> -- 
>> 2.7.4
>>
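
For completeness, the concurrent deletion both versions defend against
lives in drm_sched_get_cleanup_job(), which (paraphrased from memory,
please check the tree) does roughly:

    spin_lock(&sched->job_list_lock);
    job = list_first_entry_or_null(&sched->pending_list,
                                   struct drm_sched_job, list);
    if (job && dma_fence_is_signaled(&job->s_fence->finished))
            list_del_init(&job->list);   /* job is freed shortly after */
    else
            job = NULL;
    spin_unlock(&sched->job_list_lock);

Since that only runs on the scheduler thread, parking the thread before
the timeout handler touches pending_list does make the old
remove/reinsert dance unnecessary, provided the parking itself is
race-free, which is exactly what Daniel is questioning above.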
