On Wed, Sep 13, 2023 at 10:56:10PM -0400, Luben Tuikov wrote:
> On 2023-09-11 22:16, Matthew Brost wrote:
> > If the TDR is set to a value, it can fire before a job is submitted in
> > drm_sched_main. The job should be always be submitted before the TDR
> > fires, fix this ordering.
> > 
> > v2:
> >   - Add to pending list before run_job, start TDR after (Luben, Boris)
> > 
> > Signed-off-by: Matthew Brost <matthew.br...@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index c627d3e6494a..9dbfab7be2c6 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -498,7 +498,6 @@ static void drm_sched_job_begin(struct drm_sched_job 
> > *s_job)
> >  
> >     spin_lock(&sched->job_list_lock);
> >     list_add_tail(&s_job->list, &sched->pending_list);
> > -   drm_sched_start_timeout(sched);
> >     spin_unlock(&sched->job_list_lock);
> >  }
> >  
> > @@ -1234,6 +1233,7 @@ static void drm_sched_run_job_work(struct work_struct 
> > *w)
> >             fence = sched->ops->run_job(sched_job);
> >             complete_all(&entity->entity_idle);
> >             drm_sched_fence_scheduled(s_fence, fence);
> > +           drm_sched_start_timeout_unlocked(sched);
> >  
> >             if (!IS_ERR_OR_NULL(fence)) {
> >                     /* Drop for original kref_init of the fence */
> 
> So, sched->ops->run_job(), is a "job inflection point" from the point of view 
> of
> the DRM scheduler. After that call, DRM has relinquished control of the job 
> to the
> firmware/hardware.
> 
> Putting the job in the pending list, before submitting it to down to the 
> firmware/hardware,
> goes along with starting a timeout timer for the job. The timeout always 
> includes
> time for the firmware/hardware to get it prepped, as well as time for the 
> actual
> execution of the job (task). Thus, we want to do this:
>       1. Put the job in pending list. "Pending list" means "pends in 
> hardware".
>       2. Start a timeout timer for the job.
>       3. Start executing the job/task. This usually involves giving it to 
> firmware/hardware,
>          i.e. ownership of the job/task changes to another domain. In our 
> case this is accomplished
>          by calling sched->ops->run_job().
> Perhaps move drm_sched_start_timeout() closer to sched->ops->run_job() from 
> above and/or increase
> the timeout value?

I disagree. It is clear race if the timeout starts before run_job() that
the TDR can fire before run_job() is called. The entire point of this
patch is to seal this race by starting the TDR after run_job() is
called.

Matt

> -- 
> Regards,
> Luben
> 

Reply via email to