Re: [PATCH 1/8] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-08-03 Thread Tvrtko Ursulin



On 03/08/2023 15:43, Matthew Brost wrote:

On Thu, Aug 03, 2023 at 11:11:13AM +0100, Tvrtko Ursulin wrote:


On 01/08/2023 21:50, Matthew Brost wrote:

In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to match the completion order, even when targeting the same
hardware engine. This is because in XE we have a firmware scheduler, the
GuC, which is allowed to reorder, timeslice, and preempt submissions. If a
shared drm_gpu_scheduler is used across multiple drm_sched_entity, the TDR
falls apart as the TDR expects submission order == completion order. Using
a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.

2. In XE submissions are done by programming a ring buffer (circular
buffer). A drm_gpu_scheduler provides a limit on the number of jobs; if
that limit is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow control on
the ring for free.
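As a small worked illustration of that flow control point, with made-up
numbers (these are not values from the Xe driver):

/* Hypothetical numbers, purely illustrative -- not taken from Xe.  If each
 * job consumes at most MAX_SIZE_PER_JOB bytes of ring space, capping the
 * scheduler at RING_SIZE / MAX_SIZE_PER_JOB in-flight jobs means the ring
 * buffer can never overflow, so no separate flow control is needed.
 */
#define RING_SIZE		(16 * 1024)	/* bytes in the ring buffer */
#define MAX_SIZE_PER_JOB	1024		/* worst-case bytes per job */

static const unsigned int hw_submission_limit = RING_SIZE / MAX_SIZE_PER_JOB; /* 16 */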

A problem with this design is that a drm_gpu_scheduler currently uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_scheduler instances are used. To work around the scaling
issue, use a worker rather than a kthread for submission / job cleanup.
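A minimal sketch of the kthread-to-worker conversion described above, using
hypothetical names (my_sched, my_sched_submit_work) rather than code from
the actual patch:

#include <linux/workqueue.h>

/* Before: each scheduler owned a dedicated kthread running a loop.
 * After: each scheduler owns a work item queued on a (possibly shared)
 * workqueue, so a large number of schedulers no longer means a large
 * number of kernel threads.
 */
struct my_sched {
	struct workqueue_struct *run_wq;	/* caller-supplied or system_wq */
	struct work_struct work_run;		/* replaces the kthread */
};

static void my_sched_submit_work(struct work_struct *w)
{
	struct my_sched *sched = container_of(w, struct my_sched, work_run);

	/* pick one ready job and push it to the hardware/firmware ... */
}

static void my_sched_init(struct my_sched *sched, struct workqueue_struct *wq)
{
	sched->run_wq = wq ?: system_wq;	/* default to the shared system wq */
	INIT_WORK(&sched->work_run, my_sched_submit_work);
}

/* Kick submission; cheap to call whenever a job becomes ready. */
static void my_sched_wakeup(struct my_sched *sched)
{
	queue_work(sched->run_wq, &sched->work_run);
}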

v2:
- (Rob Clark) Fix msm build
- Pass in run work queue
v3:
- (Boris) don't have loop in worker

Signed-off-by: Matthew Brost 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  14 +-
   drivers/gpu/drm/etnaviv/etnaviv_sched.c |   2 +-
   drivers/gpu/drm/lima/lima_sched.c   |   2 +-
   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
   drivers/gpu/drm/msm/msm_ringbuffer.c|   2 +-
   drivers/gpu/drm/panfrost/panfrost_job.c |   2 +-
   drivers/gpu/drm/scheduler/sched_main.c  | 136 +++-
   drivers/gpu/drm/v3d/v3d_sched.c |  10 +-
   include/drm/gpu_scheduler.h |  14 +-
   10 files changed, 113 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index f60753f97ac5..9c2a10aeb0b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
struct amdgpu_ring *ring = adev->rings[i];
-   if (!ring || !ring->sched.thread)
+   if (!ring || !ring->sched.ready)
continue;
-   kthread_park(ring->sched.thread);
+   drm_sched_run_wq_stop(&ring->sched);


It would be good to split out adding of these wrappers (including adding one
for ring->sched.thread/ready) to a standalone preceding patch. That way at
least some mechanical changes to various drivers would be separated from
functional changes.



Sure.
  

Also, perhaps do not have the wq in the name if it is not really needed to
be verbose about the underlying implementation like that? Something like
drm_sched_run/pause would do. Or even __drm_sched_start/stop, dunno, just
an idea.



Sure.
  

}
seq_printf(m, "run ib test:\n");
@@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
struct amdgpu_ring *ring = adev->rings[i];
-   if (!ring || !ring->sched.thread)
+   if (!ring || !ring->sched.ready)
continue;
-   kthread_unpark(ring->sched.thread);
+   drm_sched_run_wq_start(&ring->sched);
}
up_write(&adev->reset_domain->sem);
@@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
ring = adev->rings[val];
-   if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
+   if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
return -EINVAL;
/* the last preemption failed */
@@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
goto pro_end;
/* stop the scheduler */
-   kthread_park(ring->sched.thread);
+   drm_sched_run_wq_stop(&ring->sched);
/* preempt the IB */
r = amdgpu_ring_preempt_ib(ring);
@@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
   failure:
/* restart the scheduler */
-   kthread_unpark(ring->sched.thread);
+   drm_sched_run_wq_start(&ring->sched);
up_read(&adev->reset_domain->sem);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fac9312b1695..00c9c03c8f94 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2364,7 +2364,7 @@ static int 

Re: [PATCH 1/8] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-08-03 Thread Tvrtko Ursulin



On 03/08/2023 15:56, Christian König wrote:

On 03.08.23 16:43, Matthew Brost wrote:

On Thu, Aug 03, 2023 at 11:11:13AM +0100, Tvrtko Ursulin wrote:

On 01/08/2023 21:50, Matthew Brost wrote:
[SNIP]

   sched->ops = ops;
   sched->hw_submission_limit = hw_submission;
   sched->name = name;
+    sched->run_wq = run_wq ? : system_wq;
I still think it is not nice to implicitly move everyone over to the shared
system wq. Maybe even more so now with one-at-a-time execution, since the
effect on latency can be even greater.


No one that has a stake in this has pushed back that I can recall. Open
to feedback from stakeholders (maintainers of drivers that use the drm
scheduler).

No objections to using the system_wq here. Drivers can still pass in 
their own or simply use system_highpri_wq instead.


In addition to that, the system_wq isn't single threaded; it will create
as many threads as needed to fully utilize all CPUs.
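For illustration, one way a driver could avoid the shared system_wq is to
allocate its own ordered (single-threaded) workqueue and hand it to the
scheduler at init time; a drm_sched_init() that takes a workqueue argument
is what this series proposes, so that call is only sketched in a comment:

#include <linux/workqueue.h>

static struct workqueue_struct *my_driver_create_submit_wq(void)
{
	/* Ordered: at most one work item executes at a time, in FIFO order,
	 * which keeps submission latency independent of unrelated system work.
	 * Later: drm_sched_init(&sched, &ops, ..., submit_wq, ...) as proposed
	 * by this series (exact parameter order assumed, not upstream API).
	 */
	return alloc_ordered_workqueue("my-drm-submit", WQ_HIGHPRI);
}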



  The i915 doesn't use the DRM scheduler last time I looked.
Has that changed?
Have you considered kthread_work as a backend? Maybe it would work to have
callers pass in a kthread_worker they create, or provide a drm_sched helper
to create one, which would then be passed to drm_sched_init.

That would enable per driver kthread_worker, or per device, or whatever
granularity each driver would want/need/desire.

driver init:
struct drm_sched_worker *worker = drm_sched_create_worker(...);

queue/whatever init:
drm_sched_init(.., worker, ...);


This idea doesn't seem to work for a variety of reasons. Will type it out
if needed but not going to spend time on this unless someone with a
stake raises this as an issue.


Agree completely. kthread_work is for real time workers IIRC.


AFAIK it is indicated if one needs to tweak the kthread priority, but 
that is not the only use case.


I am curious to know why the idea does not work for a variety of reasons.


You could create one inside drm_sched_init if not passed in, which would
keep the behaviour for existing drivers more similar - they would still
have a 1:1 kthread context for their exclusive use.


Part of the idea of a work queue is that a user can't directly create a
kthread via an IOCTL (XE_EXEC_QUEUE_CREATE). What you are suggesting
exposes this issue.


Yeah, preventing that is indeed a very good idea.


Nope, I wasn't suggesting that at all.

I was suggesting as many kthread_workers (these are threads) as the 
implementation wants. Xe can create one per device. Someone else can 
create one per hw engine, whatever.


One kthread_*work* per entity does not mean one thread per
XE_EXEC_QUEUE_CREATE. A kthread_work is just a unit of work executed by
the kthread_worker thread. It is the same conceptual relationship as
between a workqueue and a work item.
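Roughly, the kthread_worker pattern being described looks like this (a
sketch with assumed names, not code from any posted patch):

#include <linux/err.h>
#include <linux/kthread.h>

/* One kthread_worker == one dedicated kernel thread.  Many kthread_work
 * items (e.g. one per scheduler/entity) can be queued onto the same worker,
 * analogous to many work items sharing one workqueue.
 */
static void my_submit_fn(struct kthread_work *work)
{
	/* submit one job, or re-queue the work if more is pending ... */
}

static int my_kthread_worker_example(void)
{
	struct kthread_worker *worker;
	struct kthread_work work;

	worker = kthread_create_worker(0, "my-drm-submit");	/* per device, per engine, ... */
	if (IS_ERR(worker))
		return PTR_ERR(worker);

	kthread_init_work(&work, my_submit_fn);
	kthread_queue_work(worker, &work);	/* a unit of work, not a new thread */

	kthread_flush_worker(worker);
	kthread_destroy_worker(worker);
	return 0;
}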


The difference is that it may work better for a single-shot re-arming
design, if a regression in submission latency concerns any stakeholders.


And I *think* self-re-arming would be less problematic latency-wise, since
a kthread_worker consumes everything queued without relinquishing control,
and the execution context would be guaranteed not to be shared with random
system stuff.


So this is essentially so we can use a loop? Seems like a lot of effort for
what is pure speculation. Again, if a stakeholder raises an issue we can
address it then.


Instead of a loop what you usually do in the worker is to submit one 
item (if possible) and then re-queue yourself if there is more work to do.


This way you give others a chance to run as well and/or cancel the work
etc...
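A sketch of that pattern, reusing the hypothetical struct my_sched from
the sketch earlier in the thread (the helpers are assumed; this mirrors
the v3 "don't have loop in worker" changelog entry):

static void my_sched_run_job_work(struct work_struct *w)
{
	struct my_sched *sched = container_of(w, struct my_sched, work_run);
	struct my_job *job;

	job = my_sched_pick_ready_job(sched);		/* assumed helper */
	if (!job)
		return;

	my_sched_push_to_hw(sched, job);		/* assumed helper */

	/* Re-arm instead of looping: other work items on the same workqueue
	 * get a chance to run in between and the work stays cancellable.
	 */
	if (my_sched_has_ready_job(sched))		/* assumed helper */
		queue_work(sched->run_wq, &sched->work_run);
}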


Yeah, I was pointing out that a loop in the worker was bad months ago (or
more), so it is not about that. My point here is whether it can be done
better than silently converting everyone to system_wq.


Hence my proposal is to *keep* closer to the thread semantics for everyone
and at the same time _allow_ the option of a custom workqueue/whatever.


Where is the problem there?

Regards,

Tvrtko


Re: [PATCH 1/8] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-08-03 Thread Christian König

On 03.08.23 16:43, Matthew Brost wrote:

On Thu, Aug 03, 2023 at 11:11:13AM +0100, Tvrtko Ursulin wrote:

On 01/08/2023 21:50, Matthew Brost wrote:
[SNIP]

sched->ops = ops;
sched->hw_submission_limit = hw_submission;
sched->name = name;
+   sched->run_wq = run_wq ? : system_wq;

I still think it is not nice to implicitly move everyone over to the shared
system wq. Maybe even more so now with one-at-a-time execution, since the
effect on latency can be even greater.


No one that has a stake in this has pushed back that I can recall. Open
to feedback from stakeholders (maintainers of drivers that use the drm
scheduler).


No objections to using the system_wq here. Drivers can still pass in 
their own or simply use system_highpri_wq instead.


In addition to that, the system_wq isn't single threaded; it will create
as many threads as needed to fully utilize all CPUs.



  The i915 doesn't use the DRM scheduler last time I looked.
Has that changed?
  

Have you considered kthread_work as a backend? Maybe it would work to have
callers pass in a kthread_worker they create, or provide a drm_sched helper
to create one, which would then be passed to drm_sched_init.

That would enable per driver kthread_worker, or per device, or whatever
granularity each driver would want/need/desire.

driver init:
struct drm_sched_worker *worker = drm_sched_create_worker(...);

queue/whatever init:
drm_sched_init(.., worker, ...);


This idea doesn't seem to work for a variety of reasons. Will type it out
if needed but not going to spend time on this unless someone with a
stake raises this as an issue.


Agree completely. kthread_work is for real time workers IIRC.

  

You could create one inside drm_sched_init if not passed in, which would
keep the behaviour for existing drivers more similar - they would still have
a 1:1 kthread context for their exclusive use.


Part of the idea of a work queue is that a user can't directly create a
kthread via an IOCTL (XE_EXEC_QUEUE_CREATE). What you are suggesting
exposes this issue.


Yeah, preventing that is indeed a very good idea.




And I *think* self-re-arming would be less problematic latency-wise, since
a kthread_worker consumes everything queued without relinquishing control,
and the execution context would be guaranteed not to be shared with random
system stuff.


So this is essentially so we can use a loop? Seems like a lot of effort for
what is pure speculation. Again, if a stakeholder raises an issue we can
address it then.


Instead of a loop what you usually do in the worker is to submit one 
item (if possible) and then re-queue yourself if there is more work to do.


This way you give others a chance to run as well and/or cancel the work etc...

Christian.



Matt


Regards,

Tvrtko





Re: [PATCH 1/8] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-08-03 Thread Matthew Brost
On Thu, Aug 03, 2023 at 11:11:13AM +0100, Tvrtko Ursulin wrote:
> 
> On 01/08/2023 21:50, Matthew Brost wrote:
> > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > seems a bit odd but let us explain the reasoning below.
> > 
> > 1. In XE the submission order from multiple drm_sched_entity is not
> > guaranteed to match the completion order, even when targeting the same
> > hardware engine. This is because in XE we have a firmware scheduler, the
> > GuC, which is allowed to reorder, timeslice, and preempt submissions. If a
> > shared drm_gpu_scheduler is used across multiple drm_sched_entity, the TDR
> > falls apart as the TDR expects submission order == completion order. Using
> > a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> > 
> > 2. In XE submissions are done by programming a ring buffer (circular
> > buffer). A drm_gpu_scheduler provides a limit on the number of jobs; if
> > that limit is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow control on
> > the ring for free.
> > 
> > A problem with this design is that a drm_gpu_scheduler currently uses a
> > kthread for submission / job cleanup. This doesn't scale if a large
> > number of drm_gpu_scheduler instances are used. To work around the scaling
> > issue, use a worker rather than a kthread for submission / job cleanup.
> > 
> > v2:
> >- (Rob Clark) Fix msm build
> >- Pass in run work queue
> > v3:
> >- (Boris) don't have loop in worker
> > 
> > Signed-off-by: Matthew Brost 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  14 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c |   2 +-
> >   drivers/gpu/drm/lima/lima_sched.c   |   2 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c|   2 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c |   2 +-
> >   drivers/gpu/drm/scheduler/sched_main.c  | 136 +++-
> >   drivers/gpu/drm/v3d/v3d_sched.c |  10 +-
> >   include/drm/gpu_scheduler.h |  14 +-
> >   10 files changed, 113 insertions(+), 89 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > index f60753f97ac5..9c2a10aeb0b3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > @@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct 
> > seq_file *m, void *unused)
> > for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> > struct amdgpu_ring *ring = adev->rings[i];
> > -   if (!ring || !ring->sched.thread)
> > +   if (!ring || !ring->sched.ready)
> > continue;
> > -   kthread_park(ring->sched.thread);
> > +   drm_sched_run_wq_stop(&ring->sched);
> 
> It would be good to split out adding of these wrappers (including adding one
> for ring->sched.thread/ready) to a standalone preceding patch. That way at
> least some mechanical changes to various drivers would be separated from
> functional changes.
>

Sure.
 
> Also, perhaps do not have the wq in the name if it is not really needed to
> be verbose about the underlying implementation like that? Something like
> drm_sched_run/pause would do. Or even __drm_sched_start/stop, dunno, just an idea.
>

Sure.
 
> > }
> > seq_printf(m, "run ib test:\n");
> > @@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct 
> > seq_file *m, void *unused)
> > for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> > struct amdgpu_ring *ring = adev->rings[i];
> > -   if (!ring || !ring->sched.thread)
> > +   if (!ring || !ring->sched.ready)
> > continue;
> > -   kthread_unpark(ring->sched.thread);
> > +   drm_sched_run_wq_start(&ring->sched);
> > }
> > up_write(&adev->reset_domain->sem);
> > @@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> > val)
> > ring = adev->rings[val];
> > -   if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
> > +   if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
> > return -EINVAL;
> > /* the last preemption failed */
> > @@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> > val)
> > goto pro_end;
> > /* stop the scheduler */
> > -   kthread_park(ring->sched.thread);
> > +   drm_sched_run_wq_stop(&ring->sched);
> > /* preempt the IB */
> > r = amdgpu_ring_preempt_ib(ring);
> > @@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> > val)
> >   failure:
> > /* restart the scheduler */
> > -   kthread_unpark(ring->sched.thread);
> > +   drm_sched_run_wq_start(&ring->sched);
> > up_read(&adev->reset_domain->sem);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > 

Re: [PATCH 1/8] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-08-03 Thread Tvrtko Ursulin



On 01/08/2023 21:50, Matthew Brost wrote:

In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to match the completion order, even when targeting the same
hardware engine. This is because in XE we have a firmware scheduler, the
GuC, which is allowed to reorder, timeslice, and preempt submissions. If a
shared drm_gpu_scheduler is used across multiple drm_sched_entity, the TDR
falls apart as the TDR expects submission order == completion order. Using
a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.

2. In XE submissions are done by programming a ring buffer (circular
buffer). A drm_gpu_scheduler provides a limit on the number of jobs; if
that limit is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow control on
the ring for free.

A problem with this design is that a drm_gpu_scheduler currently uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_scheduler instances are used. To work around the scaling
issue, use a worker rather than a kthread for submission / job cleanup.

v2:
   - (Rob Clark) Fix msm build
   - Pass in run work queue
v3:
   - (Boris) don't have loop in worker

Signed-off-by: Matthew Brost 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  14 +-
  drivers/gpu/drm/etnaviv/etnaviv_sched.c |   2 +-
  drivers/gpu/drm/lima/lima_sched.c   |   2 +-
  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
  drivers/gpu/drm/msm/msm_ringbuffer.c|   2 +-
  drivers/gpu/drm/panfrost/panfrost_job.c |   2 +-
  drivers/gpu/drm/scheduler/sched_main.c  | 136 +++-
  drivers/gpu/drm/v3d/v3d_sched.c |  10 +-
  include/drm/gpu_scheduler.h |  14 +-
  10 files changed, 113 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index f60753f97ac5..9c2a10aeb0b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
struct amdgpu_ring *ring = adev->rings[i];
  
-		if (!ring || !ring->sched.thread)

+   if (!ring || !ring->sched.ready)
continue;
-   kthread_park(ring->sched.thread);
+   drm_sched_run_wq_stop(&ring->sched);


It would be good to split out adding of these wrappers (including adding 
one for ring->sched.thread/ready) to a standalone preceding patch. That 
way at least some mechanical changes to various drivers would be 
separated from functional changes.


Also, perhaps do not have the wq in the name if it is not really needed 
to be verbose about the underlying implementation like that? Something 
like drm_sched_run/pause would do. Or even __drm_sched_start/stop, dunno, 
just an idea.



}
  
  	seq_printf(m, "run ib test:\n");

@@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
*m, void *unused)
for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
struct amdgpu_ring *ring = adev->rings[i];
  
-		if (!ring || !ring->sched.thread)

+   if (!ring || !ring->sched.ready)
continue;
-   kthread_unpark(ring->sched.thread);
+   drm_sched_run_wq_start(&ring->sched);
}
  
	up_write(&adev->reset_domain->sem);

@@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
  
  	ring = adev->rings[val];
  
-	if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)

+   if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
return -EINVAL;
  
  	/* the last preemption failed */

@@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
goto pro_end;
  
  	/* stop the scheduler */

-   kthread_park(ring->sched.thread);
+   drm_sched_run_wq_stop(&ring->sched);
  
  	/* preempt the IB */

r = amdgpu_ring_preempt_ib(ring);
@@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 val)
  
  failure:

/* restart the scheduler */
-   kthread_unpark(ring->sched.thread);
+   drm_sched_run_wq_start(&ring->sched);
  
	up_read(&adev->reset_domain->sem);
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index fac9312b1695..00c9c03c8f94 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2364,7 +2364,7 @@ static int amdgpu_device_init_schedulers(struct 
amdgpu_device *adev)
break;
}
  
-		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,

+