Re: [Intel-xe] [PATCH v4 02/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-10-11 Thread Luben Tuikov
On 2023-10-06 19:43, Matthew Brost wrote:
> On Fri, Oct 06, 2023 at 03:14:04PM +, Matthew Brost wrote:
>> On Fri, Oct 06, 2023 at 08:59:15AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 05/10/2023 05:13, Luben Tuikov wrote:
 On 2023-10-04 23:33, Matthew Brost wrote:
> On Tue, Sep 26, 2023 at 11:32:10PM -0400, Luben Tuikov wrote:
>> Hi,
>>
>> On 2023-09-19 01:01, Matthew Brost wrote:
>>> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>>> seems a bit odd but let us explain the reasoning below.
>>>
>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>> guaranteed to match the completion order, even when targeting the same
>>> hardware engine. This is because in XE we have a firmware scheduler, the
>>> GuC, which is allowed to reorder, timeslice, and preempt submissions. If a
>>> drm_gpu_scheduler is shared across multiple drm_sched_entity, the TDR falls
>>> apart as the TDR expects submission order == completion order. Using a
>>> dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
>>>
>>> 2. In XE submissions are done via programming a ring buffer (circular
>>> buffer). A drm_gpu_scheduler provides a limit on the number of in-flight
>>> jobs, so if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
>>> control on the ring for free.
>>>
>>> A problem with this design is that a drm_gpu_scheduler currently uses a
>>> kthread for submission / job cleanup. This doesn't scale if a large
>>> number of drm_gpu_schedulers are used. To work around the scaling issue,
>>> use a worker rather than a kthread for submission / job cleanup.
>>>
>>> v2:
>>>- (Rob Clark) Fix msm build
>>>- Pass in run work queue
>>> v3:
>>>- (Boris) don't have loop in worker
>>> v4:
>>>- (Tvrtko) break out submit ready, stop, start helpers into own patch
>>> v5:
>>>- (Boris) default to ordered work queue
>>>
>>> Signed-off-by: Matthew Brost 
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
>>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c|   2 +-
>>>   drivers/gpu/drm/lima/lima_sched.c  |   2 +-
>>>   drivers/gpu/drm/msm/msm_ringbuffer.c   |   2 +-
>>>   drivers/gpu/drm/nouveau/nouveau_sched.c|   2 +-
>>>   drivers/gpu/drm/panfrost/panfrost_job.c|   2 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c | 118 ++---
>>>   drivers/gpu/drm/v3d/v3d_sched.c|  10 +-
>>>   include/drm/gpu_scheduler.h|  14 ++-
>>>   9 files changed, 79 insertions(+), 75 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index e366f61c3aed..16f3cfe1574a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2279,7 +2279,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>> break;
>>> }
>>> -   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>> +   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, NULL,
>>>ring->num_hw_submission, 0,
>>>timeout, adev->reset_domain->wq,
>>>ring->sched_score, ring->name,
>>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> index 345fec6cb1a4..618a804ddc34 100644
>>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>>> @@ -134,7 +134,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>>>   {
>>> int ret;
>>> -   ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
>>> +   ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
>>>  etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>>>  msecs_to_jiffies(500), NULL, NULL,
>>>  dev_name(gpu->dev), gpu->dev);
>>> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
>>> index ffd91a5ee299..8d858aed0e56 100644
>>> --- a/drivers/gpu/drm/lima/lima_sched.c
>>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>>> @@ -488,7 +488,7 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
>>> INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
>>> -   return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
>>> +   return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
>>>  

Re: [Intel-xe] [PATCH v4 02/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-10-09 Thread Tvrtko Ursulin

Re: [Intel-xe] [PATCH v4 02/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-10-06 Thread Matthew Brost