Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-17 Thread Matthew Brost
On Thu, Jan 12, 2023 at 04:39:32PM -0800, John Harrison wrote:
> On 1/11/2023 14:56, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost 
> > wrote:
> > 
> > On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> > > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> > > tvrtko.ursu...@linux.intel.com> wrote:
> > >
> > > >
> > [snip]
> > > >
> > > > Typically is the key here. But I am not sure it is good
> > enough. Consider
> > > > this example - Intel Flex 170:
> > > >
> > > >   * Delivers up to 36 streams 1080p60 transcode throughput per
> > card.
> > > >   * When scaled to 10 cards in a 4U server configuration, it
> > can support
> > > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> > > >
> > >
> > > I had a feeling it was going to be media 
> > >
> > 
> > Yea, wondering if the media UMD can be rewritten to use fewer
> > xe_engines - it is a massive rewrite for VM bind + no implicit
> > dependencies, so let's just pile on some more work?
> > 
> > 
> > It could probably use fewer than it does today.  It currently creates
> > and throws away contexts like crazy, or did last I looked at it. 
> > However, the nature of media encode is that it often spreads across two
> > or three different types of engines.  There's not much you can do to
> > change that.
> And as per Tvrtko's example, you get media servers that transcode huge
> numbers of tiny streams in parallel. Almost no work per frame but 100s of
> independent streams being run concurrently. That means many 100s of contexts
> all trying to run at 30fps. I recall a specific bug about thundering herds -
> hundreds (thousands?) of waiting threads all being woken up at once because
> some request had completed.
> 
> > >
> > > > One transcode stream from my experience typically is 3-4 GPU
> > contexts
> > > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from
> > a single
> > > > CPU thread. 4 contexts * 36 streams = 144 active contexts.
> > Multiply by
> > > > 60fps = 8640 jobs submitted and completed per second.
> > > >
> > > > 144 active contexts in the proposed scheme possibly means 144
> > > > kernel worker threads spawned (driven by 36 transcode CPU
> > > > threads). (I don't think the pools would scale down given all are
> > > > constantly pinged at 60fps.)
> > > >
> > > > And then each of 144 threads goes to grab the single GuC CT
> > mutex. First
> > > > threads are being made schedulable, then put to sleep as mutex
> > > > contention is hit, then woken again as mutexes are getting
> > released,
> > > > rinse, repeat.
> > > >
> > >
> > > Why is every submission grabbing the GuC CT mutex? I've not read
> > the GuC
> > > back-end yet but I was under the impression that most run_job()
> > would be
> > > just shoving another packet into a ring buffer.  If we have to
> > send the GuC
> > > a message on the control ring every single time we submit a job,
> > that's
> > > pretty horrible.
> > >
> > 
> > Run job writes the ring buffer and moves the tail as the first step (no
> > lock required). Next it needs to tell the GuC the xe_engine LRC tail has
> > moved; this is done from a single Host to GuC channel, which is a
> > circular buffer, with writes to the channel protected by the mutex.
> > There are a few more nuances too, but in practice there is always space
> > in the channel, so the time the mutex needs to be held is really, really
> > small (check cached credits, write 3 dwords in payload, write 1 dword to
> > move the tail). I also believe mutexes in Linux are hybrid, where they
> > spin for a little bit before sleeping, and certainly if there is space
> > in the channel we shouldn't sleep on mutex contention.
> > 
> > 
> > Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we
> > had some way to batch things up a bit so we only have to poke the GuC
> > channel once for every batch of things rather than once per job.  That's
> > maybe something we can look into as a future improvement; not
> > fundamental.
> > 
> > Generally, though, it sounds like contention could be a real problem if
> > we end up ping-ponging that lock between cores.  It's going to depend on
> > how much work it takes to get the next ready thing vs. the cost of that
> > atomic.  But, also, anything we do is going to potentially run into
> > contention problems.  *shrug*  If we were going to go for
> > one-per-HW-engine, we may as well go one-per-device and then we wouldn't
> > need the lock.  Off the top of my head, that doesn't sound great either
> > but IDK.
> > 
> > As far as this being horrible, well, I didn't design the GuC and this is
> > how it is implemented for KMD based 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread John Harrison

On 1/11/2023 14:56, Jason Ekstrand wrote:
On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost 
 wrote:


On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> tvrtko.ursu...@linux.intel.com> wrote:
>
> >
[snip]
> >
> > Typically is the key here. But I am not sure it is good
enough. Consider
> > this example - Intel Flex 170:
> >
> >   * Delivers up to 36 streams 1080p60 transcode throughput per
card.
> >   * When scaled to 10 cards in a 4U server configuration, it
can support
> > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> >
>
> I had a feeling it was going to be media 
>

Yea, wondering if the media UMD can be rewritten to use fewer
xe_engines - it is a massive rewrite for VM bind + no implicit
dependencies, so let's just pile on some more work?


It could probably use fewer than it does today.  It currently creates 
and throws away contexts like crazy, or did last I looked at it.  
However, the nature of media encode is that it often spreads across 
two or three different types of engines.  There's not much you can do 
to change that.
And as per Tvrtko's example, you get media servers that transcode huge 
numbers of tiny streams in parallel. Almost no work per frame but 100s 
of independent streams being run concurrently. That means many 100s of 
contexts all trying to run at 30fps. I recall a specific bug about 
thundering herds - hundreds (thousands?) of waiting threads all being 
woken up at once because some request had completed.



>
> > One transcode stream from my experience typically is 3-4 GPU
contexts
> > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from
a single
> > CPU thread. 4 contexts * 36 streams = 144 active contexts.
Multiply by
> > 60fps = 8640 jobs submitted and completed per second.
> >
> > 144 active contexts in the proposed scheme possibly means 144
> > kernel worker threads spawned (driven by 36 transcode CPU
> > threads). (I don't think the pools would scale down given all are
> > constantly pinged at 60fps.)
> >
> > And then each of 144 threads goes to grab the single GuC CT
mutex. First
> > threads are being made schedulable, then put to sleep as mutex
> > contention is hit, then woken again as mutexes are getting
released,
> > rinse, repeat.
> >
>
> Why is every submission grabbing the GuC CT mutex? I've not read
the GuC
> back-end yet but I was under the impression that most run_job()
would be
> just shoving another packet into a ring buffer.  If we have to
send the GuC
> a message on the control ring every single time we submit a job,
that's
> pretty horrible.
>

Run job writes the ring buffer and moves the tail as the first step (no
lock required). Next it needs to tell the GuC the xe_engine LRC tail has
moved; this is done from a single Host to GuC channel, which is a
circular buffer, with writes to the channel protected by the mutex.
There are a few more nuances too, but in practice there is always space
in the channel, so the time the mutex needs to be held is really, really
small (check cached credits, write 3 dwords in payload, write 1 dword to
move the tail). I also believe mutexes in Linux are hybrid, where they
spin for a little bit before sleeping, and certainly if there is space
in the channel we shouldn't sleep on mutex contention.


Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we 
had some way to batch things up a bit so we only have to poke the GuC 
channel once for every batch of things rather than once per job.  
That's maybe something we can look into as a future improvement; not 
fundamental.


Generally, though, it sounds like contention could be a real problem 
if we end up ping-ponging that lock between cores.  It's going to 
depend on how much work it takes to get the next ready thing vs. the 
cost of that atomic.  But, also, anything we do is going to 
potentially run into contention problems.  *shrug*  If we were going 
to go for one-per-HW-engine, we may as well go one-per-device and then 
we wouldn't need the lock.  Off the top of my head, that doesn't sound 
great either but IDK.


As far as this being horrible, well, I didn't design the GuC and this is
how it is implemented for KMD based submission. We also have 256
doorbells, so we wouldn't need a lock, but I think there are other
issues with that design too which need to be worked out in the Xe2 / Xe3
timeframe.


Yeah, not blaming you.  Just surprised, that's all.  How does it work 
for userspace submission?  What would it look like if the kernel 
emulated userspace submission?  Is that even possible?


What are these doorbell things?  How do they play into it?
Basically a bank of MMIO space reserved 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Tvrtko Ursulin




On 11/01/2023 19:40, Matthew Brost wrote:

On Wed, Jan 11, 2023 at 08:50:37AM +, Tvrtko Ursulin wrote:


[snip]


This example is where it would hurt on large systems. Imagine only an even
wider media transcode card...

Second example: only a single engine class used (3d desktop?) but with a
bunch of not-runnable jobs queued and waiting on a fence to signal.
Implicit or explicit dependencies don't matter. Then the fence signals
and the callbacks run. N work items get scheduled, but they all submit
to the same HW engine. So we end up with:

         /-- wi1 --\
        /    ..     \
  cb --+--- wi.. ----+-- rq1 -- .. -- rqN
        \    ..     /
         \-- wiN --/


All that we have achieved is waking up N CPUs to contend on the same lock
and effectively insert the job into the same single HW queue. I don't see
any positives there.



I've said this before: the CT channel in practice isn't going to be full,
so the section of code protected by the mutex is really, really small.
The mutex really shouldn't ever have contention. Also, does a mutex spin
for a small period of time before going to sleep? I seem to recall some
type of core lock did this; if we can use a lock that spins for a short
period of time this argument falls apart.


This argument already fell apart when we established it's the system_wq 
and not the unbound one. So a digression only - it did not fall apart 
because of the CT channel never being congested; there would still be 
the question of what's the point of waking up N CPUs when there is a 
single work channel in the backend.


You would have been able to bypass all that by inserting work items 
directly, not via the scheduler workers. I thought that was what Jason 
was implying when he mentioned that a better frontend/backend drm 
scheduler split was considered at some point.


Because for 1:1:1, where GuC is truly 1, it does seem it would work 
better if that sort of a split would enable you to queue directly into 
the backend bypassing the kthread/worker / wait_on wake_up dance.


Would that work? From drm_sched_entity_push_job directly to the backend 
- not waking up but *calling* the equivalent of drm_sched_main.



Right, that's all solid I think. My takeaway is that frontend priority
sorting and that stuff isn't needed and that is okay. And that there are
multiple options to maybe improve the drm scheduler, like the
aforementioned making it deal with out of order completion, or splitting
it into functional components, or the frontend/backend split you
suggested. For most of them the cost vs benefit is more or less unclear,
as is how much effort was invested to look into them.

One thing I missed from this explanation is how drm_scheduler per engine
class interferes with the high level concepts. And I did not manage to pick
up on what exactly is the TDR problem in that case. Maybe the two are one
and the same.

Bottom line is I still have the concern that conversion to kworkers has an
opportunity to regress. Possibly more opportunity for some Xe use cases than
to affect other vendors, since they would still be using per physical engine
/ queue scheduler instances.



We certainly don't want to affect other vendors but I haven't yet heard
any push back from other vendors. I don't think speculating about
potential problems is helpful.


I haven't had any push back on the drm cgroup controller either. :D


And to put my money where my mouth is, I will try to put testing Xe
inside the full blown ChromeOS environment into my team's plans. It
would probably also be beneficial if the Xe team could take a look at
real world behaviour of the extreme transcode use cases too, if the
stack is ready for that and all. It would be better to know earlier
rather than later if there is a fundamental issue.



We don't have a media UMD yet, so it will be tough to test at this point
in time. Also not sure when Xe is going to be POR for a Chrome product
either, so porting Xe into ChromeOS likely isn't a top priority for your
team. I know from experience that porting things into ChromeOS isn't
trivial as I've supported several of these efforts. Not saying don't do
this, just mentioning the realities of what you are suggesting.


I know, I only said I'd put it in the plans, not that it will happen 
tomorrow.


Regards,

Tvrtko


Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Tvrtko Ursulin




On 11/01/2023 17:52, Matthew Brost wrote:

On Wed, Jan 11, 2023 at 09:09:45AM +, Tvrtko Ursulin wrote:


[snip]


Anyway, since you are not buying any arguments on paper perhaps you are more
open towards testing. If you would adapt gem_wsim for Xe you would be able
to spawn N simulated transcode sessions on any Gen11+ machine and try it
out.

For example:

gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600

That will run 36 parallel transcoding sessions for 600 frames each. No
client setup needed whatsoever apart from compiling IGT.

In the past that was quite a handy tool to identify scheduling issues, or
validate changes against. All workloads with the media prefix have actually
been hand crafted by looking at what real media pipelines do with real data.
Few years back at least.



Porting this is non-trivial as this is 2.5k. Also in Xe we are trending
towards using UMD benchmarks to determine if there are performance
problems, as in i915 we had tons of microbenchmarks / IGT benchmarks
that we found meant absolutely nothing. Can't say if this benchmark
falls into that category.


I explained what it does, so it was supposed to be obvious it is not a 
microbenchmark.


2.5k what, lines of code? Difficulty of adding Xe support does not scale 
with LOC but with how much it uses the kernel API. You'd essentially 
need to handle context/engine creation and different execbuf.


It's not trivial no, but it would save you downloading gigabytes of test 
streams, building a bunch of tools and libraries etc, and so overall in 
my experience it *significantly* improves the driver development 
turn-around time.



We have VK and compute benchmarks running and haven't found any major
issues yet. The media UMD hasn't been ported because of the VM bind
dependency, so I can't say if there are any issues with the media UMD +
Xe.

What I can do is hack up xe_exec_threads to really hammer Xe - change it
to 128x xe_engines + 8k execs per thread. Each exec is super simple, it
just stores a dword. It creates a thread per hardware engine, so on TGL
this is 5x threads.

Results below:
root@DUT025-TGLU:mbrost# xe_exec_threads --r threads-basic
IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
Starting subtest: threads-basic
Subtest threads-basic: SUCCESS (1.215s)
root@DUT025-TGLU:mbrost# dumptrace | grep job | wc
  40960  491520 7401728
root@DUT025-TGLU:mbrost# dumptrace | grep engine | wc
    645    7095   82457

So with 640 xe_engines (5 are VM engines) it takes 1.215 seconds of test
time to run 40960 execs. That seems to indicate we do not have a
scheduling problem.

This is 8 core (or at least 8 threads) TGL:

root@DUT025-TGLU:mbrost# cat /proc/cpuinfo
...
processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model   : 140
model name  : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
stepping: 1
microcode   : 0x3a
cpu MHz : 2344.098
cache size  : 12288 KB
physical id : 0
siblings: 8
core id : 3
cpu cores   : 4
...

Enough data to be convinced there is no issue with this design? I can
also hack up Xe to use fewer GPU schedulers with kthreads, but again
that isn't trivial and doesn't seem necessary based on these results.


Not yet. It's not only about how many somethings per second you can do. 
It is also about what effect it has on the rest of the system.


Anyway I think you said in different sub-thread you will move away from 
system_wq, so we can close this one. With that plan at least I don't 
have to worry my mouse will stutter and audio glitch while Xe is 
churning away.


Regards,

Tvrtko


Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 16:38:18 +0100
Daniel Vetter  wrote:

> > >
> > > Also if you do the allocation in ->prepare_job with dma_fence and not
> > > run_job, then I think we can sort out fairness issues (if they do pop up) in
> > > the drm/sched code instead of having to think about this in each driver.  
> >
> > By allocation, you mean assigning a FW slot ID? If we do this allocation
> > in ->prepare_job(), couldn't we mess up ordering? Like,
> > lower-prio/later-queuing entity being scheduled before its pairs,
> > because there's no guarantee on the job completion order (and thus the
> > queue idleness order). I mean, completion order depends on the kind of
> > job being executed by the queues, the time the FW actually lets the
> > queue execute things and probably other factors. You can use metrics
> > like the position in the LRU list + the amount of jobs currently
> > queued to a group to guess which one will be idle first, but that's
> > just a guess. And I'm not sure I see what doing this slot selection in  
> > ->prepare_job() would bring us compared to doing it in ->run_job(),  
> > where we can just pick the least recently used slot.  
> 
> In ->prepare_job you can let the scheduler code do the stalling (and
> ensure fairness), in ->run_job it's your job.

Yeah returning a fence in ->prepare_job() to wait for a FW slot to
become idle sounds good. This fence would be signaled when one of the
slots becomes idle. But I'm wondering why we'd want to select the slot
so early. Can't we just do the selection in ->run_job()? After all, if
the fence has been signaled, that means we'll find at least one slot
that's ready when we hit ->run_job(), and we can select it at that
point.

> The current RFC doesn't
> really bother much with getting this very right, but if the scheduler
> code tries to make sure it pushes higher-prio stuff in first before
> others, you should get the right outcome.

Okay, so I'm confused again. We said we had a 1:1
drm_gpu_scheduler:drm_sched_entity mapping, meaning that entities are
isolated from each other. I can see how I could place the dma_fence
returned by ->prepare_job() in a driver-specific per-priority list, so
the driver can pick the highest-prio/first-inserted entry and signal the
associated fence when a slot becomes idle. But I have a hard time
seeing how common code could do that if it doesn't see the other
entities. Right now, drm_gpu_scheduler only selects the best entity
among the registered ones, and there's only one entity per
drm_gpu_scheduler in this case.

> 
> The more important functional issue is that you must only allocate the
> fw slot after all dependencies have signalled.

Sure, but it doesn't have to be a specific FW slot, it can be any FW
slot, as long as we don't signal more fences than we have slots
available, right?

> Otherwise you might get
> a nice deadlock, where job A is waiting for the fw slot of B to become
> free, and B is waiting for A to finish.

Got that part, and that's ensured by the fact we wait for all
regular deps before returning the FW-slot-available dma_fence in
->prepare_job(). This exact same fence will be signaled when a slot
becomes idle.

> 
> > > Few fw sched slots essentially just make fw scheduling unfairness more
> > > prominent than with others, but I don't think it's fundamentally something
> > > else really.
> > >
> > > If every ctx does that and the lru isn't too busted, they should then form
> > > a nice orderly queue and cycle through the fw scheduler, while still being
> > > able to get some work done. It's essentially the exact same thing that
> > > happens with ttm vram eviction, when you have a total working set where
> > > each process fits in vram individually, but in total they're too big and
> > > you need to cycle things through.  
> >
> > I see.
> >  
> > >  
> > > > > I'll need to make sure this still works with the concept of group 
> > > > > (it's
> > > > > not a single queue we schedule, it's a group of queues, meaning that 
> > > > > we
> > > > > have N fences to watch to determine if the slot is busy or not, but
> > > > > that should be okay).  
> > > >
> > > > Oh, there's one other thing I forgot to mention: the FW scheduler is
> > > > not entirely fair, it does take the slot priority (which has to be
> > > > unique across all currently assigned slots) into account when
> > > > scheduling groups. So, ideally, we'd want to rotate group priorities
> > > > when they share the same drm_sched_priority (probably based on the
> > > > position in the LRU).  
> > >
> > > Hm that will make things a bit more fun I guess, especially with your
> > > constraint to not update this too often. How strict is that priority
> > > difference? If it's a lot, we might need to treat this more like execlist
> > > and less like a real fw scheduler ...  
> >
> > Strict as in, if two groups with same priority try to request an
> > overlapping set of resources (cores or tilers), it can deadlock, so
> > pretty strict I would say :-).  

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Daniel Vetter
On Thu, 12 Jan 2023 at 13:08, Boris Brezillon
 wrote:
> On Thu, 12 Jan 2023 11:42:57 +0100
> Daniel Vetter  wrote:
>
> > On Thu, Jan 12, 2023 at 11:25:53AM +0100, Boris Brezillon wrote:
> > > On Thu, 12 Jan 2023 11:11:03 +0100
> > > Boris Brezillon  wrote:
> > >
> > > > On Thu, 12 Jan 2023 10:32:18 +0100
> > > > Daniel Vetter  wrote:
> > > >
> > > > > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > > > > > Hi Daniel,
> > > > > >
> > > > > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > > > > Daniel Vetter  wrote:
> > > > > >
> > > > > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hi Daniel,
> > > > > > > >
> > > > > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > > > > Daniel Vetter  wrote:
> > > > > > > >
> > > > > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon 
> > > > > > > > > wrote:
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > > > > Jason Ekstrand  wrote:
> > > > > > > > > >
> > > > > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > > > > > 
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris 
> > > > > > > > > > > > Brezillon wrote:
> > > > > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > > > > Boris Brezillon  
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has 
> > > > > > > > > > > > > > > > made to have a 1 to 1
> > > > > > > > > > > > > > > > mapping between a drm_gpu_scheduler and 
> > > > > > > > > > > > > > > > drm_sched_entity. At first
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > > seems a bit odd but let us explain the 
> > > > > > > > > > > > > > > > reasoning below.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. In XE the submission order from multiple 
> > > > > > > > > > > > > > > > drm_sched_entity is not
> > > > > > > > > > > > > > > > guaranteed to be the same completion even if 
> > > > > > > > > > > > > > > > targeting the same
> > > > > > > > > > > > hardware
> > > > > > > > > > > > > > > > engine. This is because in XE we have a 
> > > > > > > > > > > > > > > > firmware scheduler, the
> > > > > > > > > > > > GuC,
> > > > > > > > > > > > > > > > which allowed to reorder, timeslice, and 
> > > > > > > > > > > > > > > > preempt submissions. If a
> > > > > > > > > > > > using
> > > > > > > > > > > > > > > > shared drm_gpu_scheduler across multiple 
> > > > > > > > > > > > > > > > drm_sched_entity, the TDR
> > > > > > > > > > > > falls
> > > > > > > > > > > > > > > > apart as the TDR expects submission order == 
> > > > > > > > > > > > > > > > completion order.
> > > > > > > > > > > > Using a
> > > > > > > > > > > > > > > > dedicated drm_gpu_scheduler per 
> > > > > > > > > > > > > > > > drm_sched_entity solve this
> > > > > > > > > > > > problem.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Oh, that's interesting. I've been trying to solve 
> > > > > > > > > > > > > > > the same sort of
> > > > > > > > > > > > > > > issues to support Arm's new Mali GPU which is 
> > > > > > > > > > > > > > > relying on a
> > > > > > > > > > > > FW-assisted
> > > > > > > > > > > > > > > scheduling scheme (you give the FW N streams to 
> > > > > > > > > > > > > > > execute, and it does
> > > > > > > > > > > > > > > the scheduling between those N command streams, 
> > > > > > > > > > > > > > > the kernel driver
> > > > > > > > > > > > > > > does timeslice scheduling to update the command 
> > > > > > > > > > > > > > > streams passed to the
> > > > > > > > > > > > > > > FW). I must admit I gave up on using drm_sched at 
> > > > > > > > > > > > > > > some point, mostly
> > > > > > > > > > > > > > > because the integration with drm_sched was 
> > > > > > > > > > > > > > > painful, but also because
> > > > > > > > > > > > I
> > > > > > > > > > > > > > > felt trying to bend drm_sched to make it interact 
> > > > > > > > > > > > > > > with a
> > > > > > > > > > > > > > > timeslice-oriented scheduling model wasn't really 
> > > > > > > > > > > > > > > future proof.
> > > > > > > > > > > > Giving
> > > > > > > > > > > > > > > drm_sched_entity exlusive access to a 
> > > > > > > > > > > > > > > drm_gpu_scheduler probably
> > > > > > > > > > > > might
> > > > > > > > > > > > > > > help for a few things (didn't think it through 
> > > > > > > > > > > > > > > yet), but I feel it's
> > > > > > > > > > > > > > > coming short on other aspects 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 11:42:57 +0100
Daniel Vetter  wrote:

> On Thu, Jan 12, 2023 at 11:25:53AM +0100, Boris Brezillon wrote:
> > On Thu, 12 Jan 2023 11:11:03 +0100
> > Boris Brezillon  wrote:
> >   
> > > On Thu, 12 Jan 2023 10:32:18 +0100
> > > Daniel Vetter  wrote:
> > >   
> > > > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > > > > Hi Daniel,
> > > > > 
> > > > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > > > Daniel Vetter  wrote:
> > > > >   
> > > > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > > >  wrote:  
> > > > > > >
> > > > > > > Hi Daniel,
> > > > > > >
> > > > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > > > Daniel Vetter  wrote:
> > > > > > >
> > > > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon 
> > > > > > > > wrote:
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > > > Jason Ekstrand  wrote:
> > > > > > > > >
> > > > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > > > > 
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has 
> > > > > > > > > > > > > > > made to have a 1 to 1
> > > > > > > > > > > > > > > mapping between a drm_gpu_scheduler and
> > > > > > > > > > > > > > > drm_sched_entity. At first this seems a bit
> > > > > > > > > > > > > > > odd, but let us explain the reasoning below.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. In XE the submission order from multiple
> > > > > > > > > > > > > > > drm_sched_entity is not guaranteed to be the
> > > > > > > > > > > > > > > same as the completion order, even if targeting
> > > > > > > > > > > > > > > the same hardware engine. This is because in XE
> > > > > > > > > > > > > > > we have a firmware scheduler, the GuC, which is
> > > > > > > > > > > > > > > allowed to reorder, timeslice, and preempt
> > > > > > > > > > > > > > > submissions. If using a shared
> > > > > > > > > > > > > > > drm_gpu_scheduler across multiple
> > > > > > > > > > > > > > > drm_sched_entity, the TDR falls apart, as the
> > > > > > > > > > > > > > > TDR expects submission order == completion
> > > > > > > > > > > > > > > order. Using a dedicated drm_gpu_scheduler per
> > > > > > > > > > > > > > > drm_sched_entity solves this problem.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Oh, that's interesting. I've been trying to solve
> > > > > > > > > > > > > > the same sort of issues to support Arm's new Mali
> > > > > > > > > > > > > > GPU, which relies on a FW-assisted scheduling
> > > > > > > > > > > > > > scheme (you give the FW N streams to execute, and
> > > > > > > > > > > > > > it does the scheduling between those N command
> > > > > > > > > > > > > > streams; the kernel driver does timeslice
> > > > > > > > > > > > > > scheduling to update the command streams passed
> > > > > > > > > > > > > > to the FW). I must admit I gave up on using
> > > > > > > > > > > > > > drm_sched at some point, mostly because the
> > > > > > > > > > > > > > integration with drm_sched was painful, but also
> > > > > > > > > > > > > > because I felt trying to bend drm_sched to make
> > > > > > > > > > > > > > it interact with a timeslice-oriented scheduling
> > > > > > > > > > > > > > model wasn't really future proof. Giving
> > > > > > > > > > > > > > drm_sched_entity exclusive access to a
> > > > > > > > > > > > > > drm_gpu_scheduler might help for a few things
> > > > > > > > > > > > > > (didn't think it through yet), but I feel it's
> > > > > > > > > > > > > > coming short on other aspects we have to deal with

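The TDR argument quoted above (a timeout handler that assumes submission order == completion order per scheduler) can be illustrated with a toy model. This is plain Python with invented names, not actual drm_sched code; it only demonstrates why per-entity queues restore the FIFO invariant once a firmware scheduler reorders work across entities.

```python
def tdr_ok(queues, completion_order):
    """Check the FIFO-TDR invariant per scheduler queue: jobs on a
    given queue must complete in the order they were submitted."""
    pos = {job: i for i, job in enumerate(completion_order)}
    return all(
        all(pos[a] < pos[b] for a, b in zip(q, q[1:]))
        for q in queues
    )

# The GuC timeslices/preempts and completes across entities out of order.
completion = ["B1", "A1", "B2", "A2"]

shared = [["A1", "B1", "A2", "B2"]]        # one scheduler, two entities
per_entity = [["A1", "A2"], ["B1", "B2"]]  # one scheduler per entity

print(tdr_ok(shared, completion))      # False: invariant broken
print(tdr_ok(per_entity, completion))  # True: each queue is still FIFO
```

With one shared scheduler the reorder across entities breaks the invariant; with a dedicated scheduler per entity each queue stays FIFO, which is the 1:1 argument in a nutshell.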
Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Daniel Vetter
On Thu, Jan 12, 2023 at 11:25:53AM +0100, Boris Brezillon wrote:
> On Thu, 12 Jan 2023 11:11:03 +0100
> Boris Brezillon  wrote:
> 
> > On Thu, 12 Jan 2023 10:32:18 +0100
> > Daniel Vetter  wrote:
> > 
> > > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:  
> > > > Hi Daniel,
> > > > 
> > > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > > Daniel Vetter  wrote:
> > > > 
> > > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > >  wrote:
> > > > > >
> > > > > > Hi Daniel,
> > > > > >
> > > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > > Daniel Vetter  wrote:
> > > > > >  
> > > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:  
> > > > > > > 
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > > Jason Ekstrand  wrote:
> > > > > > > >  
> > > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > > > 
> > > > > > > > > wrote:
> > > > > > > > >  
> > > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon 
> > > > > > > > > > wrote:  
> > > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > >  
> > > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > > >  
> > > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has
> > > > > > > > > > > > > > been made to have a 1 to 1 mapping between a
> > > > > > > > > > > > > > drm_gpu_scheduler and drm_sched_entity. At first
> > > > > > > > > > > > > > this seems a bit odd, but let us explain the
> > > > > > > > > > > > > > reasoning below.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. In XE the submission order from multiple
> > > > > > > > > > > > > > drm_sched_entity is not guaranteed to be the same
> > > > > > > > > > > > > > as the completion order, even if targeting the
> > > > > > > > > > > > > > same hardware engine. This is because in XE we
> > > > > > > > > > > > > > have a firmware scheduler, the GuC, which is
> > > > > > > > > > > > > > allowed to reorder, timeslice, and preempt
> > > > > > > > > > > > > > submissions. If using a shared drm_gpu_scheduler
> > > > > > > > > > > > > > across multiple drm_sched_entity, the TDR falls
> > > > > > > > > > > > > > apart, as the TDR expects submission order ==
> > > > > > > > > > > > > > completion order. Using a dedicated
> > > > > > > > > > > > > > drm_gpu_scheduler per drm_sched_entity solves
> > > > > > > > > > > > > > this problem.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Oh, that's interesting. I've been trying to solve
> > > > > > > > > > > > > the same sort of issues to support Arm's new Mali
> > > > > > > > > > > > > GPU, which relies on a FW-assisted scheduling
> > > > > > > > > > > > > scheme (you give the FW N streams to execute, and
> > > > > > > > > > > > > it does the scheduling between those N command
> > > > > > > > > > > > > streams; the kernel driver does timeslice
> > > > > > > > > > > > > scheduling to update the command streams passed to
> > > > > > > > > > > > > the FW). I must admit I gave up on using drm_sched
> > > > > > > > > > > > > at some point, mostly because the integration with
> > > > > > > > > > > > > drm_sched was painful, but also because I felt
> > > > > > > > > > > > > trying to bend drm_sched to make it interact with
> > > > > > > > > > > > > a timeslice-oriented scheduling model wasn't really
> > > > > > > > > > > > > future proof. Giving drm_sched_entity exclusive
> > > > > > > > > > > > > access to a drm_gpu_scheduler might help for a few
> > > > > > > > > > > > > things (didn't think it through yet), but I feel
> > > > > > > > > > > > > it's coming short on other aspects we have to deal
> > > > > > > > > > > > > with on Arm GPUs.
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and
> > > > > > > > > > > > how it instantiates the drm_sched_entity and
> > > > > > > > > > > > drm_gpu_scheduler, and I think I have a better

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 11:11:03 +0100
Boris Brezillon  wrote:

> On Thu, 12 Jan 2023 10:32:18 +0100
> Daniel Vetter  wrote:
> 
> > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:  
> > > Hi Daniel,
> > > 
> > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > Daniel Vetter  wrote:
> > > 
> > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > >  wrote:
> > > > >
> > > > > Hi Daniel,
> > > > >
> > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > Daniel Vetter  wrote:
> > > > >  
> > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > > > > >   
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > Jason Ekstrand  wrote:
> > > > > > >  
> > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > > 
> > > > > > > > wrote:
> > > > > > > >  
> > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon 
> > > > > > > > > wrote:  
> > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > >  
> > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > >  
> > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > > >  
> > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been
> > > > > > > > > > > > > made to have a 1 to 1 mapping between a
> > > > > > > > > > > > > drm_gpu_scheduler and drm_sched_entity. At first
> > > > > > > > > > > > > this seems a bit odd, but let us explain the
> > > > > > > > > > > > > reasoning below.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. In XE the submission order from multiple
> > > > > > > > > > > > > drm_sched_entity is not guaranteed to be the same
> > > > > > > > > > > > > as the completion order, even if targeting the same
> > > > > > > > > > > > > hardware engine. This is because in XE we have a
> > > > > > > > > > > > > firmware scheduler, the GuC, which is allowed to
> > > > > > > > > > > > > reorder, timeslice, and preempt submissions. If
> > > > > > > > > > > > > using a shared drm_gpu_scheduler across multiple
> > > > > > > > > > > > > drm_sched_entity, the TDR falls apart, as the TDR
> > > > > > > > > > > > > expects submission order == completion order. Using
> > > > > > > > > > > > > a dedicated drm_gpu_scheduler per drm_sched_entity
> > > > > > > > > > > > > solves this problem.
> > > > > > > > > > > >
> > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the
> > > > > > > > > > > > same sort of issues to support Arm's new Mali GPU,
> > > > > > > > > > > > which relies on a FW-assisted scheduling scheme (you
> > > > > > > > > > > > give the FW N streams to execute, and it does the
> > > > > > > > > > > > scheduling between those N command streams; the
> > > > > > > > > > > > kernel driver does timeslice scheduling to update the
> > > > > > > > > > > > command streams passed to the FW). I must admit I
> > > > > > > > > > > > gave up on using drm_sched at some point, mostly
> > > > > > > > > > > > because the integration with drm_sched was painful,
> > > > > > > > > > > > but also because I felt trying to bend drm_sched to
> > > > > > > > > > > > make it interact with a timeslice-oriented scheduling
> > > > > > > > > > > > model wasn't really future proof. Giving
> > > > > > > > > > > > drm_sched_entity exclusive access to a
> > > > > > > > > > > > drm_gpu_scheduler might help for a few things (didn't
> > > > > > > > > > > > think it through yet), but I feel it's coming short
> > > > > > > > > > > > on other aspects we have to deal with on Arm GPUs.
> > > > > > > > > > >
> > > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how
> > > > > > > > > > > it instantiates the drm_sched_entity and
> > > > > > > > > > > drm_gpu_scheduler, and I think I have a better
> > > > > > > > > > > understanding of how you get away with using drm_sched
> > > > > > > > > > > while still controlling how scheduling is really done.
> > > > > > > > > > > Here drm_gpu_scheduler is just a dummy abstraction that
> > > > > > > > > > > lets you use the drm_sched

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 10:32:18 +0100
Daniel Vetter  wrote:

> On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > Hi Daniel,
> > 
> > On Wed, 11 Jan 2023 22:47:02 +0100
> > Daniel Vetter  wrote:
> >   
> > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > >  wrote:  
> > > >
> > > > Hi Daniel,
> > > >
> > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > Daniel Vetter  wrote:
> > > >
> > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > > > > > Hi Jason,
> > > > > >
> > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > Jason Ekstrand  wrote:
> > > > > >
> > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > 
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon 
> > > > > > > > wrote:
> > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello Matthew,
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been
> > > > > > > > > > > > made to have a 1 to 1 mapping between a
> > > > > > > > > > > > drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > > > > > > > seems a bit odd, but let us explain the reasoning
> > > > > > > > > > > > below.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. In XE the submission order from multiple
> > > > > > > > > > > > drm_sched_entity is not guaranteed to be the same as
> > > > > > > > > > > > the completion order, even if targeting the same
> > > > > > > > > > > > hardware engine. This is because in XE we have a
> > > > > > > > > > > > firmware scheduler, the GuC, which is allowed to
> > > > > > > > > > > > reorder, timeslice, and preempt submissions. If using
> > > > > > > > > > > > a shared drm_gpu_scheduler across multiple
> > > > > > > > > > > > drm_sched_entity, the TDR falls apart, as the TDR
> > > > > > > > > > > > expects submission order == completion order. Using a
> > > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity
> > > > > > > > > > > > solves this problem.
> > > > > > > > > > >
> > > > > > > > > > > Oh, that's interesting. I've been trying to solve the
> > > > > > > > > > > same sort of issues to support Arm's new Mali GPU,
> > > > > > > > > > > which relies on a FW-assisted scheduling scheme (you
> > > > > > > > > > > give the FW N streams to execute, and it does the
> > > > > > > > > > > scheduling between those N command streams; the kernel
> > > > > > > > > > > driver does timeslice scheduling to update the command
> > > > > > > > > > > streams passed to the FW). I must admit I gave up on
> > > > > > > > > > > using drm_sched at some point, mostly because the
> > > > > > > > > > > integration with drm_sched was painful, but also
> > > > > > > > > > > because I felt trying to bend drm_sched to make it
> > > > > > > > > > > interact with a timeslice-oriented scheduling model
> > > > > > > > > > > wasn't really future proof. Giving drm_sched_entity
> > > > > > > > > > > exclusive access to a drm_gpu_scheduler might help for
> > > > > > > > > > > a few things (didn't think it through yet), but I feel
> > > > > > > > > > > it's coming short on other aspects we have to deal with
> > > > > > > > > > > on Arm GPUs.
> > > > > > > > > >
> > > > > > > > > > Ok, so I just had a quick look at the Xe driver and how
> > > > > > > > > > it instantiates the drm_sched_entity and
> > > > > > > > > > drm_gpu_scheduler, and I think I have a better
> > > > > > > > > > understanding of how you get away with using drm_sched
> > > > > > > > > > while still controlling how scheduling is really done.
> > > > > > > > > > Here drm_gpu_scheduler is just a dummy abstraction that
> > > > > > > > > > lets you use the drm_sched job queuing/dep/tracking
> > > > > > > > > > mechanism. The whole run-queue
> > > > > > > >
> > > > > > > > You nailed it here, we use the DRM scheduler for queuing
> > > > > > > > jobs, dependency tracking and releasing jobs to be scheduled
> > > > > > > > when dependencies are met, and lastly a tracking mechanism of
> > > > > > > > in-flight jobs that need to

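The "drm_sched as a queuing/dependency-tracking front end" role described in the exchange above can be sketched as a toy model. This is invented-name Python, not the real drm_sched API: jobs are released only once their dependencies have signalled, and the only scheduling policy applied is not overflowing the submission ring.

```python
class Job:
    """Minimal stand-in for a scheduler job with dependencies."""
    def __init__(self, name, deps=()):
        self.name, self.deps, self.done = name, set(deps), False

def release_ready(pending, ring, ring_size):
    """Move jobs whose deps have all signalled onto the ring, bounded
    only by ring capacity — no other scheduling decision is made."""
    for job in list(pending):
        if len(ring) >= ring_size:
            break  # the one real policy: don't overflow the ring buffer
        if all(d.done for d in job.deps):
            pending.remove(job)
            ring.append(job)

a = Job("a")
b = Job("b", deps=[a])
pending, ring = [a, b], []

release_ready(pending, ring, ring_size=8)
print([j.name for j in ring])  # ['a']  (b still waits on a)

a.done = True                  # a's fence signals
release_ready(pending, ring, ring_size=8)
print([j.name for j in ring])  # ['a', 'b']
```

The actual scheduling between the released jobs is left entirely to the firmware, matching the description of what Xe delegates to the GuC.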
Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Daniel Vetter
On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> Hi Daniel,
> 
> On Wed, 11 Jan 2023 22:47:02 +0100
> Daniel Vetter  wrote:
> 
> > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> >  wrote:
> > >
> > > Hi Daniel,
> > >
> > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > Daniel Vetter  wrote:
> > >  
> > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:  
> > > > > Hi Jason,
> > > > >
> > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > Jason Ekstrand  wrote:
> > > > >  
> > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > 
> > > > > > wrote:
> > > > > >  
> > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > Boris Brezillon  wrote:
> > > > > > > >  
> > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > >  
> > > > > > > > > > Hello Matthew,
> > > > > > > > > >
> > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > >  
> > > > > > > > > > > In XE, the new Intel GPU driver, a choice has been
> > > > > > > > > > > made to have a 1 to 1 mapping between a
> > > > > > > > > > > drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > > > > > > seems a bit odd, but let us explain the reasoning
> > > > > > > > > > > below.
> > > > > > > > > > >
> > > > > > > > > > > 1. In XE the submission order from multiple
> > > > > > > > > > > drm_sched_entity is not guaranteed to be the same as
> > > > > > > > > > > the completion order, even if targeting the same
> > > > > > > > > > > hardware engine. This is because in XE we have a
> > > > > > > > > > > firmware scheduler, the GuC, which is allowed to
> > > > > > > > > > > reorder, timeslice, and preempt submissions. If using a
> > > > > > > > > > > shared drm_gpu_scheduler across multiple
> > > > > > > > > > > drm_sched_entity, the TDR falls apart, as the TDR
> > > > > > > > > > > expects submission order == completion order. Using a
> > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solves
> > > > > > > > > > > this problem.
> > > > > > > > > >
> > > > > > > > > > Oh, that's interesting. I've been trying to solve the
> > > > > > > > > > same sort of issues to support Arm's new Mali GPU, which
> > > > > > > > > > relies on a FW-assisted scheduling scheme (you give the
> > > > > > > > > > FW N streams to execute, and it does the scheduling
> > > > > > > > > > between those N command streams; the kernel driver does
> > > > > > > > > > timeslice scheduling to update the command streams passed
> > > > > > > > > > to the FW). I must admit I gave up on using drm_sched at
> > > > > > > > > > some point, mostly because the integration with drm_sched
> > > > > > > > > > was painful, but also because I felt trying to bend
> > > > > > > > > > drm_sched to make it interact with a timeslice-oriented
> > > > > > > > > > scheduling model wasn't really future proof. Giving
> > > > > > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler
> > > > > > > > > > might help for a few things (didn't think it through
> > > > > > > > > > yet), but I feel it's coming short on other aspects we
> > > > > > > > > > have to deal with on Arm GPUs.
> > > > > > > > >
> > > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler,
> > > > > > > > > and I think I have a better understanding of how you get
> > > > > > > > > away with using drm_sched while still controlling how
> > > > > > > > > scheduling is really done. Here drm_gpu_scheduler is just a
> > > > > > > > > dummy abstraction that lets you use the drm_sched job
> > > > > > > > > queuing/dep/tracking mechanism. The whole run-queue
> > > > > > >
> > > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > > dependency tracking and releasing jobs to be scheduled when
> > > > > > > dependencies are met, and lastly a tracking mechanism of
> > > > > > > in-flight jobs that need to be cleaned up if an error occurs.
> > > > > > > It doesn't actually do any scheduling aside from the most basic
> > > > > > > level of not overflowing the submission ring buffer. In this
> > > > > > > sense, a 1 to 1 relationship between entity and scheduler fits
> > > > > > > quite well.
> > > > > > >
> > > > > >
> > > > > > Yeah, I think there's an annoying

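The FW-assisted scheme Boris describes above (the firmware only runs the N resident streams it was handed; the kernel's "scheduler" timeslices contexts in and out of those slots) can be modelled as a toy rotation. Plain Python with invented names — not Mali or Panfrost code:

```python
def timeslice(slots, waiting, n):
    """One kernel timeslice tick: if contexts are waiting and all N FW
    slots are occupied, preempt the oldest resident stream back to the
    wait list, then admit waiters until the slots are full again."""
    if waiting and len(slots) >= n:
        waiting.append(slots.pop(0))   # evict oldest resident stream
    while waiting and len(slots) < n:
        slots.append(waiting.pop(0))   # hand the freed slot to a waiter
    return slots, waiting

# N = 2 firmware slots, four contexts contending for them.
slots, waiting = ["c0", "c1"], ["c2", "c3"]
slots, waiting = timeslice(slots, waiting, n=2)
print(slots, waiting)  # ['c1', 'c2'] ['c3', 'c0']
```

With ~1k firmware queues (the GuC case mentioned in the thread) this rotation is rarely needed, which is why the kernel side can stay a thin pass-through there while Arm's small-N case forces real timeslice decisions in the kernel.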
Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
Hi Daniel,

On Wed, 11 Jan 2023 22:47:02 +0100
Daniel Vetter  wrote:

> On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
>  wrote:
> >
> > Hi Daniel,
> >
> > On Mon, 9 Jan 2023 21:40:21 +0100
> > Daniel Vetter  wrote:
> >  
> > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:  
> > > > Hi Jason,
> > > >
> > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > Jason Ekstrand  wrote:
> > > >  
> > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > wrote:
> > > > >  
> > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > Boris Brezillon  wrote:
> > > > > > >  
> > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > Boris Brezillon  wrote:
> > > > > > > >  
> > > > > > > > > Hello Matthew,
> > > > > > > > >
> > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > Matthew Brost  wrote:
> > > > > > > > >  
> > > > > > > > > > In XE, the new Intel GPU driver, a choice has been made
> > > > > > > > > > to have a 1 to 1 mapping between a drm_gpu_scheduler and
> > > > > > > > > > drm_sched_entity. At first this seems a bit odd, but let
> > > > > > > > > > us explain the reasoning below.
> > > > > > > > > >
> > > > > > > > > > 1. In XE the submission order from multiple
> > > > > > > > > > drm_sched_entity is not guaranteed to be the same as the
> > > > > > > > > > completion order, even if targeting the same hardware
> > > > > > > > > > engine. This is because in XE we have a firmware
> > > > > > > > > > scheduler, the GuC, which is allowed to reorder,
> > > > > > > > > > timeslice, and preempt submissions. If using a shared
> > > > > > > > > > drm_gpu_scheduler across multiple drm_sched_entity, the
> > > > > > > > > > TDR falls apart, as the TDR expects submission order ==
> > > > > > > > > > completion order. Using a dedicated drm_gpu_scheduler per
> > > > > > > > > > drm_sched_entity solves this problem.
> > > > > > > > >
> > > > > > > > > Oh, that's interesting. I've been trying to solve the same
> > > > > > > > > sort of issues to support Arm's new Mali GPU, which relies
> > > > > > > > > on a FW-assisted scheduling scheme (you give the FW N
> > > > > > > > > streams to execute, and it does the scheduling between
> > > > > > > > > those N command streams; the kernel driver does timeslice
> > > > > > > > > scheduling to update the command streams passed to the FW).
> > > > > > > > > I must admit I gave up on using drm_sched at some point,
> > > > > > > > > mostly because the integration with drm_sched was painful,
> > > > > > > > > but also because I felt trying to bend drm_sched to make it
> > > > > > > > > interact with a timeslice-oriented scheduling model wasn't
> > > > > > > > > really future proof. Giving drm_sched_entity exclusive
> > > > > > > > > access to a drm_gpu_scheduler might help for a few things
> > > > > > > > > (didn't think it through yet), but I feel it's coming short
> > > > > > > > > on other aspects we have to deal with on Arm GPUs.
> > > > > > > >
> > > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and
> > > > > > > > I think I have a better understanding of how you get away
> > > > > > > > with using drm_sched while still controlling how scheduling
> > > > > > > > is really done. Here drm_gpu_scheduler is just a dummy
> > > > > > > > abstraction that lets you use the drm_sched job
> > > > > > > > queuing/dep/tracking mechanism. The whole run-queue
> > > > > >
> > > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > > dependency tracking and releasing jobs to be scheduled when
> > > > > > dependencies are met, and lastly a tracking mechanism of
> > > > > > in-flight jobs that need to be cleaned up if an error occurs. It
> > > > > > doesn't actually do any scheduling aside from the most basic
> > > > > > level of not overflowing the submission ring buffer. In this
> > > > > > sense, a 1 to 1 relationship between entity and scheduler fits
> > > > > > quite well.
> > > > > >
> > > > >
> > > > > Yeah, I think there's an annoying difference between what
> > > > > AMD/NVIDIA/Intel want here and what you need for Arm thanks to the
> > > > > number of FW queues available. I don't remember the exact number of
> > > > > GuC queues but it's at least 1k. This puts it in an entirely
> > > > > different class from what you have

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Jason Ekstrand
On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost 
wrote:

> On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> > tvrtko.ursu...@linux.intel.com> wrote:
> >
> > >
> > > On 10/01/2023 14:08, Jason Ekstrand wrote:
> > > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > > >  tvrtko.ursu...@linux.intel.com>> wrote:
> > > >
> > > >
> > > >
> > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > >
> > > > [snip]
> > > >
> > > >  >  >>> AFAICT it proposes to have 1:1 between *userspace*
> > > created
> > > >  > contexts (per
> > > >  >  >>> context _and_ engine) and drm_sched. I am not sure
> > > avoiding
> > > >  > invasive changes
> > > >  >  >>> to the shared code is in the spirit of the overall
> idea
> > > > and instead
> > > >  >  >>> opportunity should be used to look at ways to
> > > > refactor/improve
> > > >  > drm_sched.
> > > >  >
> > > >  >
> > > >  > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > > > all or
> > > >  > really needs to drive a re-factor.  (More on that later.)
> > > > There's only
> > > >  > one real issue which is that it fires off potentially a lot of
> > > > kthreads.
> > > >  > Even that's not that bad given that kthreads are pretty light
> and
> > > > you're
> > > >  > not likely to have more kthreads than userspace threads which
> are
> > > > much
> > > >  > heavier.  Not ideal, but not the end of the world either.
> > > > Definitely
> > > >  > something we can/should optimize but if we went through with
> Xe
> > > > without
> > > >  > this patch, it would probably be mostly ok.
> > > >  >
> > > >  >  >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > >  >  >>
> > > >  >  >> I'm not really prepared to make large changes to DRM
> > > > scheduler
> > > >  > at the
> > > >  >  >> moment for Xe as they are not really required nor does
> > > Boris
> > > >  > seem to think they
> > > >  >  >> will be required for his work either. I am interested
> to
> > > see
> > > >  > what Boris
> > > >  >  >> comes up with.
> > > >  >  >>
> > > >  >  >>> Even on the low level, the idea to replace drm_sched
> > > threads
> > > >  > with workers
> > > >  >  >>> has a few problems.
> > > >  >  >>>
> > > >  >  >>> To start with, the pattern of:
> > > >  >  >>>
> > > >  >  >>>while (not_stopped) {
> > > >  >  >>> keep picking jobs
> > > >  >  >>>}
> > > >  >  >>>
> > > >  >  >>> Feels fundamentally in disagreement with workers
> (while
> > > >  > obviously fits
> > > >  >  >>> perfectly with the current kthread design).
> > > >  >  >>
> > > >  >  >> The while loop breaks and the worker exits if no jobs are
> > > ready.
> > > >  >
> > > >  >
> > > >  > I'm not very familiar with workqueues. What are you saying
> would
> > > fit
> > > >  > better? One scheduling job per work item rather than one big
> work
> > > > item
> > > >  > which handles all available jobs?
> > > >
> > > > Yes and no, it indeed IMO does not fit to have a work item which
> is
> > > > potentially unbound in runtime. But it is a bit moot conceptual
> > > > mismatch
> > > > because it is a worst case / theoretical, and I think due more
> > > > fundamental concerns.
> > > >
> > > > If we have to go back to the low level side of things, I've
> picked
> > > this
> > > > random spot to consolidate what I have already mentioned and
> perhaps
> > > > expand.
> > > >
> > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > >
> > > > """
> > > > Generally, work items are not expected to hog a CPU and consume
> many
> > > > cycles. That means maintaining just enough concurrency to prevent
> > > work
> > > > processing from stalling should be optimal.
> > > > """
> > > >
> > > > For unbound queues:
> > > > """
> > > > The responsibility of regulating concurrency level is on the
> users.
> > > > """
> > > >
> > > > Given the unbound queues will be spawned on demand to service all
> > > > queued
> > > > work items (more interesting when mixing up with the
> > > > system_unbound_wq),
> > > > in the proposed design the number of instantiated worker threads
> does
> > > > not correspond to the number of user threads (as you have
> elsewhere
> > > > stated), but pessimistically to the number of active user
> contexts.
> > > >
> > > >
> > > > Those are pretty much the same in practice.  Rather, user threads is
> > > > typically an upper bound on the number of contexts.  Yes, a single
> user
> > > > thread could have a bunch of contexts but basically nothing 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Matthew Brost
On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> tvrtko.ursu...@linux.intel.com> wrote:
> 
> >
> > On 10/01/2023 14:08, Jason Ekstrand wrote:
> > > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > > mailto:tvrtko.ursu...@linux.intel.com>>
> >
> > > wrote:
> > >
> > >
> > >
> > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > >
> > > [snip]
> > >
> > >  >  >>> AFAICT it proposes to have 1:1 between *userspace*
> > created
> > >  > contexts (per
> > >  >  >>> context _and_ engine) and drm_sched. I am not sure
> > avoiding
> > >  > invasive changes
> > >  >  >>> to the shared code is in the spirit of the overall idea
> > > and instead
> > >  >  >>> opportunity should be used to look at ways to
> > > refactor/improve
> > >  > drm_sched.
> > >  >
> > >  >
> > >  > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > > all or
> > >  > really needs to drive a re-factor.  (More on that later.)
> > > There's only
> > >  > one real issue which is that it fires off potentially a lot of
> > > kthreads.
> > >  > Even that's not that bad given that kthreads are pretty light and
> > > you're
> > >  > not likely to have more kthreads than userspace threads which are
> > > much
> > >  > heavier.  Not ideal, but not the end of the world either.
> > > Definitely
> > >  > something we can/should optimize but if we went through with Xe
> > > without
> > >  > this patch, it would probably be mostly ok.
> > >  >
> > >  >  >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >  >  >>
> > >  >  >> I'm not really prepared to make large changes to DRM
> > > scheduler
> > >  > at the
> > >  >  >> moment for Xe as they are not really required nor does
> > Boris
> > >  > seem to think they
> > >  >  >> will be required for his work either. I am interested to
> > see
> > >  > what Boris
> > >  >  >> comes up with.
> > >  >  >>
> > >  >  >>> Even on the low level, the idea to replace drm_sched
> > threads
> > >  > with workers
> > >  >  >>> has a few problems.
> > >  >  >>>
> > >  >  >>> To start with, the pattern of:
> > >  >  >>>
> > >  >  >>>while (not_stopped) {
> > >  >  >>> keep picking jobs
> > >  >  >>>}
> > >  >  >>>
> > >  >  >>> Feels fundamentally in disagreement with workers (while
> > >  > obviously fits
> > >  >  >>> perfectly with the current kthread design).
> > >  >  >>
> > >  >  >> The while loop breaks and the worker exits if no jobs are
> > ready.
> > >  >
> > >  >
> > >  > I'm not very familiar with workqueues. What are you saying would
> > fit
> > >  > better? One scheduling job per work item rather than one big work
> > > item
> > >  > which handles all available jobs?
> > >
> > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > potentially unbound in runtime. But it is a bit moot conceptual
> > > mismatch
> > > because it is a worst case / theoretical, and I think due more
> > > fundamental concerns.
> > >
> > > If we have to go back to the low level side of things, I've picked
> > this
> > > random spot to consolidate what I have already mentioned and perhaps
> > > expand.
> > >
> > > To start with, let me pull out some thoughts from workqueue.rst:
> > >
> > > """
> > > Generally, work items are not expected to hog a CPU and consume many
> > > cycles. That means maintaining just enough concurrency to prevent
> > work
> > > processing from stalling should be optimal.
> > > """
> > >
> > > For unbound queues:
> > > """
> > > The responsibility of regulating concurrency level is on the users.
> > > """
> > >
> > > Given the unbound queues will be spawned on demand to service all
> > > queued
> > > work items (more interesting when mixing up with the
> > > system_unbound_wq),
> > > in the proposed design the number of instantiated worker threads does
> > > not correspond to the number of user threads (as you have elsewhere
> > > stated), but pessimistically to the number of active user contexts.
> > >
> > >
> > > Those are pretty much the same in practice.  Rather, user threads is
> > > typically an upper bound on the number of contexts.  Yes, a single user
> > > thread could have a bunch of contexts but basically nothing does that
> > > except IGT.  In real-world usage, it's at most one context per user
> > thread.
> >
> > Typically is the key here. But I am not sure it is good enough. Consider
> > this example - Intel Flex 170:
> >
> >   * Delivers up to 36 streams 1080p60 transcode throughput per card.
> >   * When scaled to 10 cards in a 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Jason Ekstrand
On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
tvrtko.ursu...@linux.intel.com> wrote:

>
> On 10/01/2023 14:08, Jason Ekstrand wrote:
> > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > mailto:tvrtko.ursu...@linux.intel.com>>
>
> > wrote:
> >
> >
> >
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> >
> > [snip]
> >
> >  >  >>> AFAICT it proposes to have 1:1 between *userspace*
> created
> >  > contexts (per
> >  >  >>> context _and_ engine) and drm_sched. I am not sure
> avoiding
> >  > invasive changes
> >  >  >>> to the shared code is in the spirit of the overall idea
> > and instead
> >  >  >>> opportunity should be used to look at ways to
> > refactor/improve
> >  > drm_sched.
> >  >
> >  >
> >  > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > all or
> >  > really needs to drive a re-factor.  (More on that later.)
> > There's only
> >  > one real issue which is that it fires off potentially a lot of
> > kthreads.
> >  > Even that's not that bad given that kthreads are pretty light and
> > you're
> >  > not likely to have more kthreads than userspace threads which are
> > much
> >  > heavier.  Not ideal, but not the end of the world either.
> > Definitely
> >  > something we can/should optimize but if we went through with Xe
> > without
> >  > this patch, it would probably be mostly ok.
> >  >
> >  >  >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >  >  >>
> >  >  >> I'm not really prepared to make large changes to DRM
> > scheduler
> >  > at the
> >  >  >> moment for Xe as they are not really required nor does
> Boris
> >  > seem to think they
> >  >  >> will be required for his work either. I am interested to
> see
> >  > what Boris
> >  >  >> comes up with.
> >  >  >>
> >  >  >>> Even on the low level, the idea to replace drm_sched
> threads
> >  > with workers
> >  >  >>> has a few problems.
> >  >  >>>
> >  >  >>> To start with, the pattern of:
> >  >  >>>
> >  >  >>>while (not_stopped) {
> >  >  >>> keep picking jobs
> >  >  >>>}
> >  >  >>>
> >  >  >>> Feels fundamentally in disagreement with workers (while
> >  > obviously fits
> >  >  >>> perfectly with the current kthread design).
> >  >  >>
> >  >  >> The while loop breaks and the worker exits if no jobs are
> ready.
> >  >
> >  >
> >  > I'm not very familiar with workqueues. What are you saying would
> fit
> >  > better? One scheduling job per work item rather than one big work
> > item
> >  > which handles all available jobs?
> >
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit moot conceptual
> > mismatch
> > because it is a worst case / theoretical, and I think due more
> > fundamental concerns.
> >
> > If we have to go back to the low level side of things, I've picked
> this
> > random spot to consolidate what I have already mentioned and perhaps
> > expand.
> >
> > To start with, let me pull out some thoughts from workqueue.rst:
> >
> > """
> > Generally, work items are not expected to hog a CPU and consume many
> > cycles. That means maintaining just enough concurrency to prevent
> work
> > processing from stalling should be optimal.
> > """
> >
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> >
> > Given the unbound queues will be spawned on demand to service all
> > queued
> > work items (more interesting when mixing up with the
> > system_unbound_wq),
> > in the proposed design the number of instantiated worker threads does
> > not correspond to the number of user threads (as you have elsewhere
> > stated), but pessimistically to the number of active user contexts.
> >
> >
> > Those are pretty much the same in practice.  Rather, user threads is
> > typically an upper bound on the number of contexts.  Yes, a single user
> > thread could have a bunch of contexts but basically nothing does that
> > except IGT.  In real-world usage, it's at most one context per user
> thread.
>
> Typically is the key here. But I am not sure it is good enough. Consider
> this example - Intel Flex 170:
>
>   * Delivers up to 36 streams 1080p60 transcode throughput per card.
>   * When scaled to 10 cards in a 4U server configuration, it can support
> up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
>

I had a feeling it was going to be media 


> One transcode stream from my experience typically is 3-4 GPU contexts
> (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single
> CPU thread. 4 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Daniel Vetter
On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
 wrote:
>
> Hi Daniel,
>
> On Mon, 9 Jan 2023 21:40:21 +0100
> Daniel Vetter  wrote:
>
> > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > > Hi Jason,
> > >
> > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > Jason Ekstrand  wrote:
> > >
> > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > wrote:
> > > >
> > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:
> > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > Boris Brezillon  wrote:
> > > > > >
> > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > Boris Brezillon  wrote:
> > > > > > >
> > > > > > > > Hello Matthew,
> > > > > > > >
> > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > Matthew Brost  wrote:
> > > > > > > >
> > > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 
> > > > > > > > > 1 to 1
> > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At 
> > > > > > > > > first
> > > > > this
> > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > >
> > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity 
> > > > > > > > > is not
> > > > > > > > > guaranteed to match the completion order even if targeting the 
> > > > > > > > > same
> > > > > hardware
> > > > > > > > > engine. This is because in XE we have a firmware scheduler, 
> > > > > > > > > the
> > > > > GuC,
> > > > > > > > > which is allowed to reorder, timeslice, and preempt submissions. 
> > > > > > > > > If
> > > > > using a
> > > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, 
> > > > > > > > > the TDR
> > > > > falls
> > > > > > > > > apart as the TDR expects submission order == completion order.
> > > > > Using a
> > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solves this
> > > > > problem.
> > > > > > > >
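[Editor's note: the TDR point in (1) above can be illustrated with a toy model, purely hypothetical and not driver code: a timeout handler that assumes FIFO completion misfires as soon as the firmware reorders two entities sharing one scheduler.]

```python
from collections import deque

def fifo_tdr_fired(pending, done):
    # Naive TDR for a shared scheduler: it assumes submission order ==
    # completion order, so if anything completed, the queue head must
    # have completed too; otherwise the head is declared hung.
    return bool(done) and pending[0] not in done

# Two entities share one scheduler; the GuC timeslices and runs e1 first.
pending = deque(["e0-job", "e1-job"])   # submission order
done = {"e1-job"}                        # completion order differs
print(fifo_tdr_fired(pending, done))     # True: e0 falsely flagged as hung
```

With a dedicated scheduler per entity, each pending queue really is FIFO again and the same check no longer misfires.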
> > > > > > > > Oh, that's interesting. I've been trying to solve the same sort 
> > > > > > > > of
> > > > > > > > issues to support Arm's new Mali GPU which is relying on a
> > > > > FW-assisted
> > > > > > > > scheduling scheme (you give the FW N streams to execute, and it 
> > > > > > > > does
> > > > > > > > the scheduling between those N command streams, the kernel 
> > > > > > > > driver
> > > > > > > > does timeslice scheduling to update the command streams passed 
> > > > > > > > to the
> > > > > > > > FW). I must admit I gave up on using drm_sched at some point, 
> > > > > > > > mostly
> > > > > > > > because the integration with drm_sched was painful, but also 
> > > > > > > > because
> > > > > I
> > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > timeslice-oriented scheduling model wasn't really future proof.
> > > > > Giving
> > > > > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler probably
> > > > > might
> > > > > > > > help for a few things (didn't think it through yet), but I feel 
> > > > > > > > it's
> > > > > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > > > > >
> > > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I 
> > > > > > > think I
> > > > > > > have a better understanding of how you get away with using 
> > > > > > > drm_sched
> > > > > > > while still controlling how scheduling is really done. Here
> > > > > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > > >
> > > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > > dependency tracking and releasing jobs to be scheduled when 
> > > > > dependencies
> > > > > are met, and lastly a tracking mechanism for in-flight jobs that need 
> > > > > to
> > > > > be cleaned up if an error occurs. It doesn't actually do any 
> > > > > scheduling
> > > > > aside from the most basic level of not overflowing the submission ring
> > > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > > scheduler fits quite well.
> > > > >
> > > >
> > > > Yeah, I think there's an annoying difference between what 
> > > > AMD/NVIDIA/Intel
> > > > want here and what you need for Arm thanks to the number of FW queues
> > > > available. I don't remember the exact number of GuC queues but it's at
> > > > least 1k. This puts it in an entirely different class from what you 
> > > > have on
> > > > Mali. Roughly, there's about three categories here:
> > > >
> > > >  1. Hardware where the kernel is placing jobs on actual HW rings. This 
> > > > is
> > > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > > categorization.)
> > > >
> > > >  2. Hardware (or firmware) with a very limited number of queues where
> > > > you're going to have to juggle in the kernel in order to 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Matthew Brost
On Wed, Jan 11, 2023 at 08:50:37AM +, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 14:08, Jason Ekstrand wrote:
> > On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin
> > mailto:tvrtko.ursu...@linux.intel.com>>
> > wrote:
> > 
> > 
> > 
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> > [snip]
> > 
> >  >      >>> AFAICT it proposes to have 1:1 between *userspace* created
> >  >     contexts (per
> >  >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
> >  >     invasive changes
> >  >      >>> to the shared code is in the spirit of the overall idea
> > and instead
> >  >      >>> opportunity should be used to look at ways to
> > refactor/improve
> >  >     drm_sched.
> >  >
> >  >
> >  > Maybe?  I'm not convinced that what Xe is doing is an abuse at
> > all or
> >  > really needs to drive a re-factor.  (More on that later.)
> > There's only
> >  > one real issue which is that it fires off potentially a lot of
> > kthreads.
> >  > Even that's not that bad given that kthreads are pretty light and
> > you're
> >  > not likely to have more kthreads than userspace threads which are
> > much
> >  > heavier.  Not ideal, but not the end of the world either.
> > Definitely
> >  > something we can/should optimize but if we went through with Xe
> > without
> >  > this patch, it would probably be mostly ok.
> >  >
> >  >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >  >      >>
> >  >      >> I'm not really prepared to make large changes to DRM
> > scheduler
> >  >     at the
> >  >      >> moment for Xe as they are not really required nor does Boris
> >  >     seem to think they
> >  >      >> will be required for his work either. I am interested to see
> >  >     what Boris
> >  >      >> comes up with.
> >  >      >>
> >  >      >>> Even on the low level, the idea to replace drm_sched threads
> >  >     with workers
> >  >      >>> has a few problems.
> >  >      >>>
> >  >      >>> To start with, the pattern of:
> >  >      >>>
> >  >      >>>    while (not_stopped) {
> >  >      >>>     keep picking jobs
> >  >      >>>    }
> >  >      >>>
> >  >      >>> Feels fundamentally in disagreement with workers (while
> >  >     obviously fits
> >  >      >>> perfectly with the current kthread design).
> >  >      >>
> >  >      >> The while loop breaks and the worker exits if no jobs are ready.
> >  >
> >  >
> >  > I'm not very familiar with workqueues. What are you saying would fit
> >  > better? One scheduling job per work item rather than one big work
> > item
> >  > which handles all available jobs?
> > 
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit moot conceptual
> > mismatch
> > because it is a worst case / theoretical, and I think due more
> > fundamental concerns.
> > 
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps
> > expand.
> > 
> > To start with, let me pull out some thoughts from workqueue.rst:
> > 
> > """
> > Generally, work items are not expected to hog a CPU and consume many
> > cycles. That means maintaining just enough concurrency to prevent work
> > processing from stalling should be optimal.
> > """
> > 
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> > 
> > Given the unbound queues will be spawned on demand to service all
> > queued
> > work items (more interesting when mixing up with the
> > system_unbound_wq),
> > in the proposed design the number of instantiated worker threads does
> > not correspond to the number of user threads (as you have elsewhere
> > stated), but pessimistically to the number of active user contexts.
> > 
> > 
> > Those are pretty much the same in practice.  Rather, user threads is
> > typically an upper bound on the number of contexts.  Yes, a single user
> > thread could have a bunch of contexts but basically nothing does that
> > except IGT.  In real-world usage, it's at most one context per user
> > thread.
> 
> Typically is the key here. But I am not sure it is good enough. Consider
> this example - Intel Flex 170:
> 
>  * Delivers up to 36 streams 1080p60 transcode throughput per card.
>  * When scaled to 10 cards in a 4U server configuration, it can support up
> to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> 
> One transcode stream from my experience typically is 3-4 GPU contexts
> (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single CPU
> thread. 4 contexts * 36 streams = 144 active contexts. Multiply by 60fps =
> 8640 jobs submitted and 
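[Editor's note: for reference, the worst-case arithmetic quoted above, with the Flex 170 numbers from earlier in the thread, works out as follows.]

```python
contexts_per_stream = 4    # buffer travels vcs -> rcs -> vcs, maybe vecs
streams_per_card = 36      # 1080p60 transcode streams per card
fps = 60

active_contexts = contexts_per_stream * streams_per_card
jobs_per_second = active_contexts * fps
print(active_contexts, jobs_per_second)   # 144 8640
```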

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Matthew Brost
On Wed, Jan 11, 2023 at 10:52:54AM -0800, John Harrison wrote:
> On 1/11/2023 10:07, Matthew Brost wrote:
> > On Wed, Jan 11, 2023 at 09:17:01AM +, Tvrtko Ursulin wrote:
> > > On 10/01/2023 19:01, Matthew Brost wrote:
> > > > On Tue, Jan 10, 2023 at 04:50:55PM +, Tvrtko Ursulin wrote:
> > > > > On 10/01/2023 15:55, Matthew Brost wrote:
> > > > > > On Tue, Jan 10, 2023 at 12:19:35PM +, Tvrtko Ursulin wrote:
> > > > > > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > > > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > > > > > 
> > > > > > > > [snip]
> > > > > > > > 
> > > > > > > > > >>> AFAICT it proposes to have 1:1 between 
> > > > > > > > > *userspace* created
> > > > > > > > >    contexts (per
> > > > > > > > > >>> context _and_ engine) and drm_sched. I am not 
> > > > > > > > > sure avoiding
> > > > > > > > >    invasive changes
> > > > > > > > > >>> to the shared code is in the spirit of the 
> > > > > > > > > overall idea and
> > > > > > > > > instead
> > > > > > > > > >>> opportunity should be used to look at ways to 
> > > > > > > > > refactor/improve
> > > > > > > > >    drm_sched.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse 
> > > > > > > > > at all
> > > > > > > > > or really needs to drive a re-factor.  (More on that later.)
> > > > > > > > > There's only one real issue which is that it fires off 
> > > > > > > > > potentially a
> > > > > > > > > lot of kthreads. Even that's not that bad given that kthreads 
> > > > > > > > > are
> > > > > > > > > pretty light and you're not likely to have more kthreads than
> > > > > > > > > userspace threads which are much heavier.  Not ideal, but not 
> > > > > > > > > the
> > > > > > > > > end of the world either.  Definitely something we can/should
> > > > > > > > > optimize but if we went through with Xe without this patch, 
> > > > > > > > > it would
> > > > > > > > > probably be mostly ok.
> > > > > > > > > 
> > > > > > > > > >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > > > > > > >>
> > > > > > > > > >> I'm not really prepared to make large changes to 
> > > > > > > > > DRM scheduler
> > > > > > > > >    at the
> > > > > > > > > >> moment for Xe as they are not really required nor 
> > > > > > > > > does Boris
> > > > > > > > >    seem to think they
> > > > > > > > > >> will be required for his work either. I am 
> > > > > > > > > interested to see
> > > > > > > > >    what Boris
> > > > > > > > > >> comes up with.
> > > > > > > > > >>
> > > > > > > > > >>> Even on the low level, the idea to replace 
> > > > > > > > > drm_sched threads
> > > > > > > > >    with workers
> > > > > > > > > >>> has a few problems.
> > > > > > > > > >>>
> > > > > > > > > >>> To start with, the pattern of:
> > > > > > > > > >>>
> > > > > > > > > >>>    while (not_stopped) {
> > > > > > > > > >>>     keep picking jobs
> > > > > > > > > >>>    }
> > > > > > > > > >>>
> > > > > > > > > >>> Feels fundamentally in disagreement with workers 
> > > > > > > > > (while
> > > > > > > > >    obviously fits
> > > > > > > > > >>> perfectly with the current kthread design).
> > > > > > > > > >>
> > > > > > > > > > >> The while loop breaks and the worker exits if no jobs 
> > > > > > > > > are ready.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I'm not very familiar with workqueues. What are you saying 
> > > > > > > > > would fit
> > > > > > > > > better? One scheduling job per work item rather than one big 
> > > > > > > > > work
> > > > > > > > > item which handles all available jobs?
> > > > > > > > Yes and no, it indeed IMO does not fit to have a work item 
> > > > > > > > which is
> > > > > > > > potentially unbound in runtime. But it is a bit moot conceptual 
> > > > > > > > mismatch
> > > > > > > > because it is a worst case / theoretical, and I think due more
> > > > > > > > fundamental concerns.
> > > > > > > > 
> > > > > > > > If we have to go back to the low level side of things, I've 
> > > > > > > > picked this
> > > > > > > > random spot to consolidate what I have already mentioned and 
> > > > > > > > perhaps
> > > > > > > > expand.
> > > > > > > > 
> > > > > > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > > > > > 
> > > > > > > > """
> > > > > > > > Generally, work items are not expected to hog a CPU and consume 
> > > > > > > > many
> > > > > > > > cycles. That means maintaining just enough concurrency to 
> > > > > > > > prevent work
> > > > > > > > processing from stalling should be optimal.
> > > > > > > > """
> > > > > > > > 
> > > > > > > > For unbound queues:
> > > > > > > > """
> > > > > > > > The responsibility of regulating concurrency level is on the 
> > > > > > > > users.
> > > > > > > > """
> > > > > > 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread John Harrison

On 1/11/2023 10:07, Matthew Brost wrote:

On Wed, Jan 11, 2023 at 09:17:01AM +, Tvrtko Ursulin wrote:

On 10/01/2023 19:01, Matthew Brost wrote:

On Tue, Jan 10, 2023 at 04:50:55PM +, Tvrtko Ursulin wrote:

On 10/01/2023 15:55, Matthew Brost wrote:

On Tue, Jan 10, 2023 at 12:19:35PM +, Tvrtko Ursulin wrote:

On 10/01/2023 11:28, Tvrtko Ursulin wrote:

On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]


    >>> AFAICT it proposes to have 1:1 between *userspace* created
       contexts (per
    >>> context _and_ engine) and drm_sched. I am not sure avoiding
       invasive changes
    >>> to the shared code is in the spirit of the overall idea and
instead
    >>> opportunity should be used to look at ways to refactor/improve
       drm_sched.


Maybe?  I'm not convinced that what Xe is doing is an abuse at all
or really needs to drive a re-factor.  (More on that later.)
There's only one real issue which is that it fires off potentially a
lot of kthreads. Even that's not that bad given that kthreads are
pretty light and you're not likely to have more kthreads than
userspace threads which are much heavier.  Not ideal, but not the
end of the world either.  Definitely something we can/should
optimize but if we went through with Xe without this patch, it would
probably be mostly ok.

    >> Yes, it is 1:1 *userspace* engines and drm_sched.
    >>
    >> I'm not really prepared to make large changes to DRM scheduler
       at the
    >> moment for Xe as they are not really required nor does Boris
       seem to think they
    >> will be required for his work either. I am interested to see
       what Boris
    >> comes up with.
    >>
    >>> Even on the low level, the idea to replace drm_sched threads
       with workers
    >>> has a few problems.
    >>>
    >>> To start with, the pattern of:
    >>>
    >>>    while (not_stopped) {
    >>>     keep picking jobs
    >>>    }
    >>>
    >>> Feels fundamentally in disagreement with workers (while
       obviously fits
    >>> perfectly with the current kthread design).
    >>
    >> The while loop breaks and the worker exits if no jobs are ready.


I'm not very familiar with workqueues. What are you saying would fit
better? One scheduling job per work item rather than one big work
item which handles all available jobs?

Yes and no, it indeed IMO does not fit to have a work item which is
potentially unbound in runtime. But it is a bit moot conceptual mismatch
because it is a worst case / theoretical, and I think due more
fundamental concerns.
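[Editor's note: the two patterns under discussion can be sketched as a toy simulation, illustrative only. The kthread design parks inside its own while loop; the work-item version drains whatever is ready and then returns, getting requeued when new jobs arrive, so the worker simply exits when no jobs are ready.]

```python
import queue

jobs = queue.Queue()
executed = []

def run_job_work_item():
    # Work-item style: drain everything currently ready, then return.
    # The item is requeued (schedule_work()-style) on new submission,
    # so no thread sits in an unbounded while (not_stopped) loop.
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return          # worker exits: nothing ready
        executed.append(job)

for n in range(3):
    jobs.put(n)
run_job_work_item()          # drains 0, 1, 2 and returns
jobs.put(3)
run_job_work_item()          # requeued on new submission; drains 3
```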

If we have to go back to the low level side of things, I've picked this
random spot to consolidate what I have already mentioned and perhaps
expand.

To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many
cycles. That means maintaining just enough concurrency to prevent work
processing from stalling should be optimal.
"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued
work items (more interesting when mixing up with the system_unbound_wq),
in the proposed design the number of instantiated worker threads does
not correspond to the number of user threads (as you have elsewhere
stated), but pessimistically to the number of active user contexts. That
is the number which drives the maximum number of not-runnable jobs that
can become runnable at once, and hence spawn that many work items, and
in turn unbound worker threads.

Several problems there.

It is fundamentally pointless to have potentially that many more threads
than the number of CPU cores - it simply creates a scheduling storm.

To make matters worse, if I follow the code correctly, all these per user
context worker threads / work items end up contending on the same lock or
circular buffer, both of which are one instance per GPU:

guc_engine_run_job
-> submit_engine
   a) wq_item_append
   -> wq_wait_for_space
 -> msleep

a) is dedicated per xe_engine

Hah, true. What is it for then? I thought throttling the LRCA ring is done via:


This is a per guc_id 'work queue' which is used for parallel submission
(e.g. multiple LRC tail values need to be written atomically by the GuC).
Again in practice there should always be space.

Speaking of guc_id, where does blocking happen when none are available in
the non-parallel case?


We have 64k guc_ids on native, 1k guc_ids with 64 VFs. Either way we
think that is more than enough and can just reject xe_engine creation if
we run out of guc_ids. If this proves to be false, we can fix it, but the
guc_id stealing in the i915 is rather complicated and hopefully not needed.

We will limit the number of guc_ids allowed per user pid to a reasonable
number to prevent a DoS. Elevated pids (e.g. IGTs) will be able to do
whatever they 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Matthew Brost
On Wed, Jan 11, 2023 at 09:17:01AM +, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 19:01, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 04:50:55PM +, Tvrtko Ursulin wrote:
> > > 
> > > On 10/01/2023 15:55, Matthew Brost wrote:
> > > > On Tue, Jan 10, 2023 at 12:19:35PM +, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > > > 
> > > > > > 
> > > > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > > > 
> > > > > > [snip]
> > > > > > 
> > > > > > >    >>> AFAICT it proposes to have 1:1 between *userspace* 
> > > > > > > created
> > > > > > >       contexts (per
> > > > > > >    >>> context _and_ engine) and drm_sched. I am not sure 
> > > > > > > avoiding
> > > > > > >       invasive changes
> > > > > > >    >>> to the shared code is in the spirit of the overall 
> > > > > > > idea and
> > > > > > > instead
> > > > > > >    >>> opportunity should be used to look at way to 
> > > > > > > refactor/improve
> > > > > > >       drm_sched.
> > > > > > > 
> > > > > > > 
> > > > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > > > > > or really needs to drive a re-factor.  (More on that later.)
> > > > > > > There's only one real issue which is that it fires off 
> > > > > > > potentially a
> > > > > > > lot of kthreads. Even that's not that bad given that kthreads are
> > > > > > > pretty light and you're not likely to have more kthreads than
> > > > > > > userspace threads which are much heavier.  Not ideal, but not the
> > > > > > > end of the world either.  Definitely something we can/should
> > > > > > > optimize but if we went through with Xe without this patch, it 
> > > > > > > would
> > > > > > > probably be mostly ok.
> > > > > > > 
> > > > > > >    >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > > > >    >>
> > > > > > >    >> I'm not really prepared to make large changes to DRM 
> > > > > > > scheduler
> > > > > > >       at the
> > > > > > >    >> moment for Xe as they are not really required nor does 
> > > > > > > Boris
> > > > > > >       seem they
> > > > > > >    >> will be required for his work either. I am interested 
> > > > > > > to see
> > > > > > >       what Boris
> > > > > > >    >> comes up with.
> > > > > > >    >>
> > > > > > >    >>> Even on the low level, the idea to replace drm_sched 
> > > > > > > threads
> > > > > > >       with workers
> > > > > > >    >>> has a few problems.
> > > > > > >    >>>
> > > > > > >    >>> To start with, the pattern of:
> > > > > > >    >>>
> > > > > > >    >>>    while (not_stopped) {
> > > > > > >    >>>     keep picking jobs
> > > > > > >    >>>    }
> > > > > > >    >>>
> > > > > > >    >>> Feels fundamentally in disagreement with workers (while
> > > > > > >       obviously fits
> > > > > > >    >>> perfectly with the current kthread design).
> > > > > > >    >>
> > > > > > >    >> The while loop breaks and worker exists if no jobs are 
> > > > > > > ready.
> > > > > > > 
> > > > > > > 
> > > > > > > I'm not very familiar with workqueues. What are you saying would 
> > > > > > > fit
> > > > > > > better? One scheduling job per work item rather than one big work
> > > > > > > item which handles all available jobs?
> > > > > > 
> > > > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > > > potentially unbound in runtime. But it is a bit moot conceptual 
> > > > > > mismatch
> > > > > > because it is a worst case / theoretical, and I think due more
> > > > > > fundamental concerns.
> > > > > > 
> > > > > > If we have to go back to the low level side of things, I've picked 
> > > > > > this
> > > > > > random spot to consolidate what I have already mentioned and perhaps
> > > > > > expand.
> > > > > > 
> > > > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > > > 
> > > > > > """
> > > > > > Generally, work items are not expected to hog a CPU and consume many
> > > > > > cycles. That means maintaining just enough concurrency to prevent 
> > > > > > work
> > > > > > processing from stalling should be optimal.
> > > > > > """
> > > > > > 
> > > > > > For unbound queues:
> > > > > > """
> > > > > > The responsibility of regulating concurrency level is on the users.
> > > > > > """
> > > > > > 
> > > > > > Given the unbound queues will be spawned on demand to service all 
> > > > > > queued
> > > > > > work items (more interesting when mixing up with the 
> > > > > > system_unbound_wq),
> > > > > > in the proposed design the number of instantiated worker threads 
> > > > > > does
> > > > > > not correspond to the number of user threads (as you have elsewhere
> > > > > > stated), but pessimistically to the number of active user contexts. 
> > > > > > That
> > > > > > is the number which drives the maximum number of not-runnable jobs 
> > > > > > that
> > > > > > can become runnable at once, and hence spawn 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Matthew Brost
On Wed, Jan 11, 2023 at 09:09:45AM +, Tvrtko Ursulin wrote:
> 
> On 11/01/2023 01:13, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 04:39:00PM +, Matthew Brost wrote:
> > > On Tue, Jan 10, 2023 at 11:28:08AM +, Tvrtko Ursulin wrote:
> > > > 
> > > > 
> > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > >   >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > >  contexts (per
> > > > >   >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > >  invasive changes
> > > > >   >>> to the shared code is in the spirit of the overall idea and 
> > > > > instead
> > > > >   >>> opportunity should be used to look at way to 
> > > > > refactor/improve
> > > > >  drm_sched.
> > > > > 
> > > > > 
> > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > > > > really needs to drive a re-factor.  (More on that later.)  There's 
> > > > > only
> > > > > one real issue which is that it fires off potentially a lot of 
> > > > > kthreads.
> > > > > Even that's not that bad given that kthreads are pretty light and 
> > > > > you're
> > > > > not likely to have more kthreads than userspace threads which are much
> > > > > heavier.  Not ideal, but not the end of the world either.  Definitely
> > > > > something we can/should optimize but if we went through with Xe 
> > > > > without
> > > > > this patch, it would probably be mostly ok.
> > > > > 
> > > > >   >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > >   >>
> > > > >   >> I'm not really prepared to make large changes to DRM 
> > > > > scheduler
> > > > >  at the
> > > > >   >> moment for Xe as they are not really required nor does Boris
> > > > >  seem they
> > > > >   >> will be required for his work either. I am interested to see
> > > > >  what Boris
> > > > >   >> comes up with.
> > > > >   >>
> > > > >   >>> Even on the low level, the idea to replace drm_sched threads
> > > > >  with workers
> > > > >   >>> has a few problems.
> > > > >   >>>
> > > > >   >>> To start with, the pattern of:
> > > > >   >>>
> > > > >   >>>    while (not_stopped) {
> > > > >   >>>     keep picking jobs
> > > > >   >>>    }
> > > > >   >>>
> > > > >   >>> Feels fundamentally in disagreement with workers (while
> > > > >  obviously fits
> > > > >   >>> perfectly with the current kthread design).
> > > > >   >>
> > > > >   >> The while loop breaks and worker exists if no jobs are ready.
> > > > > 
> > > > > 
> > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > better? One scheduling job per work item rather than one big work item
> > > > > which handles all available jobs?
> > > > 
> > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > > > because it is a worst case / theoretical, and I think due more 
> > > > fundamental
> > > > concerns.
> > > > 
> > > > If we have to go back to the low level side of things, I've picked this
> > > > random spot to consolidate what I have already mentioned and perhaps 
> > > > expand.
> > > > 
> > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > 
> > > > """
> > > > Generally, work items are not expected to hog a CPU and consume many 
> > > > cycles.
> > > > That means maintaining just enough concurrency to prevent work 
> > > > processing
> > > > from stalling should be optimal.
> > > > """
> > > > 
> > > > For unbound queues:
> > > > """
> > > > The responsibility of regulating concurrency level is on the users.
> > > > """
> > > > 
> > > > Given the unbound queues will be spawned on demand to service all queued
> > > > work items (more interesting when mixing up with the 
> > > > system_unbound_wq), in
> > > > the proposed design the number of instantiated worker threads does not
> > > > correspond to the number of user threads (as you have elsewhere 
> > > > stated), but
> > > > pessimistically to the number of active user contexts. That is the 
> > > > number
> > > > which drives the maximum number of not-runnable jobs that can become
> > > > runnable at once, and hence spawn that many work items, and in turn 
> > > > unbound
> > > > worker threads.
> > > > 
> > > > Several problems there.
> > > > 
> > > > It is fundamentally pointless to have potentially that many more threads
> > > > than the number of CPU cores - it simply creates a scheduling storm.
> > > > 
> > > 
> > > We can use a different work queue if this is an issue, have a FIXME
> > > which indicates we should allow the user to pass in the work queue.
> > > 
> > > > Unbound workers have no CPU / cache locality either and no connection 
> > > > with
> > > > the CPU scheduler to optimize scheduling patterns. This may matter 
> > > > either on
> > > > large systems or on 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Tvrtko Ursulin



On 10/01/2023 19:01, Matthew Brost wrote:

On Tue, Jan 10, 2023 at 04:50:55PM +, Tvrtko Ursulin wrote:


On 10/01/2023 15:55, Matthew Brost wrote:

On Tue, Jan 10, 2023 at 12:19:35PM +, Tvrtko Ursulin wrote:


On 10/01/2023 11:28, Tvrtko Ursulin wrote:



On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]


   >>> AFAICT it proposes to have 1:1 between *userspace* created
      contexts (per
   >>> context _and_ engine) and drm_sched. I am not sure avoiding
      invasive changes
   >>> to the shared code is in the spirit of the overall idea and
instead
   >>> opportunity should be used to look at way to refactor/improve
      drm_sched.


Maybe?  I'm not convinced that what Xe is doing is an abuse at all
or really needs to drive a re-factor.  (More on that later.)
There's only one real issue which is that it fires off potentially a
lot of kthreads. Even that's not that bad given that kthreads are
pretty light and you're not likely to have more kthreads than
userspace threads which are much heavier.  Not ideal, but not the
end of the world either.  Definitely something we can/should
optimize but if we went through with Xe without this patch, it would
probably be mostly ok.

   >> Yes, it is 1:1 *userspace* engines and drm_sched.
   >>
   >> I'm not really prepared to make large changes to DRM scheduler
      at the
   >> moment for Xe as they are not really required nor does Boris
      seem they
   >> will be required for his work either. I am interested to see
      what Boris
   >> comes up with.
   >>
   >>> Even on the low level, the idea to replace drm_sched threads
      with workers
   >>> has a few problems.
   >>>
   >>> To start with, the pattern of:
   >>>
   >>>    while (not_stopped) {
   >>>     keep picking jobs
   >>>    }
   >>>
   >>> Feels fundamentally in disagreement with workers (while
      obviously fits
   >>> perfectly with the current kthread design).
   >>
   >> The while loop breaks and worker exits if no jobs are ready.


I'm not very familiar with workqueues. What are you saying would fit
better? One scheduling job per work item rather than one big work
item which handles all available jobs?


Yes and no, it indeed IMO does not fit to have a work item which is
potentially unbound in runtime. But it is a bit moot conceptual mismatch
because it is a worst case / theoretical, and I think due more
fundamental concerns.

If we have to go back to the low level side of things, I've picked this
random spot to consolidate what I have already mentioned and perhaps
expand.

To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many
cycles. That means maintaining just enough concurrency to prevent work
processing from stalling should be optimal.
"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued
work items (more interesting when mixing up with the system_unbound_wq),
in the proposed design the number of instantiated worker threads does
not correspond to the number of user threads (as you have elsewhere
stated), but pessimistically to the number of active user contexts. That
is the number which drives the maximum number of not-runnable jobs that
can become runnable at once, and hence spawn that many work items, and
in turn unbound worker threads.

Several problems there.

It is fundamentally pointless to have potentially that many more threads
than the number of CPU cores - it simply creates a scheduling storm.


To make matters worse, if I follow the code correctly, all these per user
context worker thread / work items end up contending on the same lock or
circular buffer, both are one instance per GPU:

guc_engine_run_job
   -> submit_engine
  a) wq_item_append
  -> wq_wait_for_space
-> msleep


a) is dedicated per xe_engine


Hah true, what its for then? I thought throttling the LRCA ring is done via:



This is a per guc_id 'work queue' which is used for parallel submission
(e.g. multiple LRC tail values need to written atomically by the GuC).
Again in practice there should always be space.


Speaking of guc id, where does blocking when none are available happen 
in the non parallel case?



   drm_sched_init(&e->sched, &drm_sched_ops,
  e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,

Is there something more to throttle other than the ring? It is throttling
something using msleeps..


Also you missed the step of programming the ring which is dedicated per 
xe_engine


I was trying to quickly find places which serialize on something in the
backend; ringbuffer emission did not seem to do that, but maybe I missed
something.



xe_ring_ops vfunc emit_job is called to write the ring.


Right, but does it serialize between different contexts? I didn't spot 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Tvrtko Ursulin



On 11/01/2023 01:13, Matthew Brost wrote:

On Tue, Jan 10, 2023 at 04:39:00PM +, Matthew Brost wrote:

On Tue, Jan 10, 2023 at 11:28:08AM +, Tvrtko Ursulin wrote:



On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]


  >>> AFAICT it proposes to have 1:1 between *userspace* created
 contexts (per
  >>> context _and_ engine) and drm_sched. I am not sure avoiding
 invasive changes
  >>> to the shared code is in the spirit of the overall idea and instead
  >>> opportunity should be used to look at way to refactor/improve
 drm_sched.


Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
really needs to drive a re-factor.  (More on that later.)  There's only
one real issue which is that it fires off potentially a lot of kthreads.
Even that's not that bad given that kthreads are pretty light and you're
not likely to have more kthreads than userspace threads which are much
heavier.  Not ideal, but not the end of the world either.  Definitely
something we can/should optimize but if we went through with Xe without
this patch, it would probably be mostly ok.

  >> Yes, it is 1:1 *userspace* engines and drm_sched.
  >>
  >> I'm not really prepared to make large changes to DRM scheduler
 at the
  >> moment for Xe as they are not really required nor does Boris
 seem they
  >> will be required for his work either. I am interested to see
 what Boris
  >> comes up with.
  >>
  >>> Even on the low level, the idea to replace drm_sched threads
 with workers
  >>> has a few problems.
  >>>
  >>> To start with, the pattern of:
  >>>
  >>>    while (not_stopped) {
  >>>     keep picking jobs
  >>>    }
  >>>
  >>> Feels fundamentally in disagreement with workers (while
 obviously fits
  >>> perfectly with the current kthread design).
  >>
  >> The while loop breaks and worker exits if no jobs are ready.


I'm not very familiar with workqueues. What are you saying would fit
better? One scheduling job per work item rather than one big work item
which handles all available jobs?


Yes and no, it indeed IMO does not fit to have a work item which is
potentially unbound in runtime. But it is a bit moot conceptual mismatch
because it is a worst case / theoretical, and I think due more fundamental
concerns.

If we have to go back to the low level side of things, I've picked this
random spot to consolidate what I have already mentioned and perhaps expand.

To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many cycles.
That means maintaining just enough concurrency to prevent work processing
from stalling should be optimal.
"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued
work items (more interesting when mixing up with the system_unbound_wq), in
the proposed design the number of instantiated worker threads does not
correspond to the number of user threads (as you have elsewhere stated), but
pessimistically to the number of active user contexts. That is the number
which drives the maximum number of not-runnable jobs that can become
runnable at once, and hence spawn that many work items, and in turn unbound
worker threads.

Several problems there.

It is fundamentally pointless to have potentially that many more threads
than the number of CPU cores - it simply creates a scheduling storm.



We can use a different work queue if this is an issue, have a FIXME
which indicates we should allow the user to pass in the work queue.


Unbound workers have no CPU / cache locality either and no connection with
the CPU scheduler to optimize scheduling patterns. This may matter either on
large systems or on small ones. Whereas the current design allows for
scheduler to notice userspace CPU thread keeps waking up the same drm
scheduler kernel thread, and so it can keep them on the same CPU, the
unbound workers lose that ability and so 2nd CPU might be getting woken up
from low sleep for every submission.



I guess I don't understand kthread vs. workqueue scheduling internals.
  


Looked into this and we are not using unbound workers; rather we are just
using the system_wq, which is indeed bound. Again we can change this so a
user can just pass in a workqueue too. After doing a bit of research,
bound workers allow the scheduler to use locality to avoid the exact
problem you are describing.

TL;DR I'm not buying any of these arguments although it is possible I am
missing something.


Well you told me it's using unbound.. message id 
y7dejcuc1arhb...@dut025-tglu.fm.intel.com:


"""
Right now the system_unbound_wq is used which does have a limit on the
number of threads, right? I do have a FIXME to allow a worker to be
passed in similar to TDR.
"""

With bound workers you will indeed get CPU locality. I am 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-11 Thread Tvrtko Ursulin



On 10/01/2023 14:08, Jason Ekstrand wrote:
On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin 
mailto:tvrtko.ursu...@linux.intel.com>> 
wrote:




On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]

 >      >>> AFAICT it proposes to have 1:1 between *userspace* created
 >     contexts (per
 >      >>> context _and_ engine) and drm_sched. I am not sure avoiding
 >     invasive changes
 >      >>> to the shared code is in the spirit of the overall idea
and instead
 >      >>> opportunity should be used to look at way to
refactor/improve
 >     drm_sched.
 >
 >
 > Maybe?  I'm not convinced that what Xe is doing is an abuse at
all or
 > really needs to drive a re-factor.  (More on that later.) 
There's only

 > one real issue which is that it fires off potentially a lot of
kthreads.
 > Even that's not that bad given that kthreads are pretty light and
you're
 > not likely to have more kthreads than userspace threads which are
much
 > heavier.  Not ideal, but not the end of the world either. 
Definitely

 > something we can/should optimize but if we went through with Xe
without
 > this patch, it would probably be mostly ok.
 >
 >      >> Yes, it is 1:1 *userspace* engines and drm_sched.
 >      >>
 >      >> I'm not really prepared to make large changes to DRM
scheduler
 >     at the
 >      >> moment for Xe as they are not really required nor does Boris
 >     seem they
 >      >> will be required for his work either. I am interested to see
 >     what Boris
 >      >> comes up with.
 >      >>
 >      >>> Even on the low level, the idea to replace drm_sched threads
 >     with workers
 >      >>> has a few problems.
 >      >>>
 >      >>> To start with, the pattern of:
 >      >>>
 >      >>>    while (not_stopped) {
 >      >>>     keep picking jobs
 >      >>>    }
 >      >>>
 >      >>> Feels fundamentally in disagreement with workers (while
 >     obviously fits
 >      >>> perfectly with the current kthread design).
 >      >>
 >      >> The while loop breaks and worker exists if no jobs are ready.
 >
 >
 > I'm not very familiar with workqueues. What are you saying would fit
 > better? One scheduling job per work item rather than one big work
item
 > which handles all available jobs?

Yes and no, it indeed IMO does not fit to have a work item which is
potentially unbound in runtime. But it is a bit moot conceptual
mismatch
because it is a worst case / theoretical, and I think due more
fundamental concerns.

If we have to go back to the low level side of things, I've picked this
random spot to consolidate what I have already mentioned and perhaps
expand.

To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many
cycles. That means maintaining just enough concurrency to prevent work
processing from stalling should be optimal.
"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all
queued
work items (more interesting when mixing up with the
system_unbound_wq),
in the proposed design the number of instantiated worker threads does
not correspond to the number of user threads (as you have elsewhere
stated), but pessimistically to the number of active user contexts.


Those are pretty much the same in practice.  Rather, user threads is 
typically an upper bound on the number of contexts.  Yes, a single user 
thread could have a bunch of contexts but basically nothing does that 
except IGT.  In real-world usage, it's at most one context per user thread.


Typically is the key here. But I am not sure it is good enough. Consider 
this example - Intel Flex 170:


 * Delivers up to 36 streams 1080p60 transcode throughput per card.
 * When scaled to 10 cards in a 4U server configuration, it can support 
up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.


One transcode stream from my experience typically is 3-4 GPU contexts 
(buffer travels from vcs -> rcs -> vcs, maybe vecs) used from a single 
CPU thread. 4 contexts * 36 streams = 144 active contexts. Multiply by 
60fps = 8640 jobs submitted and completed per second.


144 active contexts in the proposed scheme possibly means 144 kernel 
worker threads spawned (driven by 36 transcode CPU threads). (I don't 
think the pools would scale down given all are constantly pinged at 60fps.)


And then each of 144 threads goes to grab the single GuC CT mutex. First 
threads are being made schedulable, then put to sleep as mutex 
contention is hit, then woken again as mutexes are getting released, 
rinse, repeat.


(And yes this 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Matthew Brost
On Tue, Jan 10, 2023 at 04:39:00PM +, Matthew Brost wrote:
> On Tue, Jan 10, 2023 at 11:28:08AM +, Tvrtko Ursulin wrote:
> > 
> > 
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> > [snip]
> > 
> > >  >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > contexts (per
> > >  >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > invasive changes
> > >  >>> to the shared code is in the spirit of the overall idea and 
> > > instead
> > >  >>> opportunity should be used to look at way to refactor/improve
> > > drm_sched.
> > > 
> > > 
> > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > > really needs to drive a re-factor.  (More on that later.)  There's only
> > > one real issue which is that it fires off potentially a lot of kthreads.
> > > Even that's not that bad given that kthreads are pretty light and you're
> > > not likely to have more kthreads than userspace threads which are much
> > > heavier.  Not ideal, but not the end of the world either.  Definitely
> > > something we can/should optimize but if we went through with Xe without
> > > this patch, it would probably be mostly ok.
> > > 
> > >  >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >  >>
> > >  >> I'm not really prepared to make large changes to DRM scheduler
> > > at the
> > >  >> moment for Xe as they are not really required nor does Boris
> > > seem they
> > >  >> will be required for his work either. I am interested to see
> > > what Boris
> > >  >> comes up with.
> > >  >>
> > >  >>> Even on the low level, the idea to replace drm_sched threads
> > > with workers
> > >  >>> has a few problems.
> > >  >>>
> > >  >>> To start with, the pattern of:
> > >  >>>
> > >  >>>    while (not_stopped) {
> > >  >>>     keep picking jobs
> > >  >>>    }
> > >  >>>
> > >  >>> Feels fundamentally in disagreement with workers (while
> > > obviously fits
> > >  >>> perfectly with the current kthread design).
> > >  >>
> > >  >> The while loop breaks and worker exists if no jobs are ready.
> > > 
> > > 
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work item
> > > which handles all available jobs?
> > 
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > because it is a worst case / theoretical, and I think due more fundamental
> > concerns.
> > 
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps expand.
> > 
> > To start with, let me pull out some thoughts from workqueue.rst:
> > 
> > """
> > Generally, work items are not expected to hog a CPU and consume many cycles.
> > That means maintaining just enough concurrency to prevent work processing
> > from stalling should be optimal.
> > """
> > 
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> > 
> > Given the unbound queues will be spawned on demand to service all queued
> > work items (more interesting when mixing up with the system_unbound_wq), in
> > the proposed design the number of instantiated worker threads does not
> > correspond to the number of user threads (as you have elsewhere stated), but
> > pessimistically to the number of active user contexts. That is the number
> > which drives the maximum number of not-runnable jobs that can become
> > runnable at once, and hence spawn that many work items, and in turn unbound
> > worker threads.
> > 
> > Several problems there.
> > 
> > It is fundamentally pointless to have potentially that many more threads
> > than the number of CPU cores - it simply creates a scheduling storm.
> > 
> 
> We can use a different work queue if this is an issue, have a FIXME
> which indicates we should allow the user to pass in the work queue.
> 
> > Unbound workers have no CPU / cache locality either and no connection with
> > the CPU scheduler to optimize scheduling patterns. This may matter either on
> > large systems or on small ones. Whereas the current design allows for
> > scheduler to notice userspace CPU thread keeps waking up the same drm
> > scheduler kernel thread, and so it can keep them on the same CPU, the
> > unbound workers lose that ability and so 2nd CPU might be getting woken up
> > from low sleep for every submission.
> >
> 
> I guess I don't understand kthread vs. workqueue scheduling internals.
>  

Looked into this and we are not using unbound workers; rather we are just
using the system_wq, which is indeed bound. Again we can change this so a
user can just pass in a workqueue too. After doing a bit of research,
bound workers allow the scheduler to use locality to avoid the exact
problem your 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Matthew Brost
On Tue, Jan 10, 2023 at 04:50:55PM +, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 15:55, Matthew Brost wrote:
> > On Tue, Jan 10, 2023 at 12:19:35PM +, Tvrtko Ursulin wrote:
> > > 
> > > On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > > > 
> > > > 
> > > > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > >   >>> AFAICT it proposes to have 1:1 between *userspace* created
> > > > >      contexts (per
> > > > >   >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > > > >      invasive changes
> > > > >   >>> to the shared code is in the spirit of the overall idea and
> > > > > instead
> > > > >   >>> opportunity should be used to look at way to 
> > > > > refactor/improve
> > > > >      drm_sched.
> > > > > 
> > > > > 
> > > > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > > > or really needs to drive a re-factor.  (More on that later.)
> > > > > There's only one real issue which is that it fires off potentially a
> > > > > lot of kthreads. Even that's not that bad given that kthreads are
> > > > > pretty light and you're not likely to have more kthreads than
> > > > > userspace threads which are much heavier.  Not ideal, but not the
> > > > > end of the world either.  Definitely something we can/should
> > > > > optimize but if we went through with Xe without this patch, it would
> > > > > probably be mostly ok.
> > > > > 
> > > > >   >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > > > >   >>
> > > > >   >> I'm not really prepared to make large changes to DRM 
> > > > > scheduler
> > > > >      at the
> > > > >   >> moment for Xe as they are not really required nor does Boris
> > > > >      seem they
> > > > >   >> will be required for his work either. I am interested to see
> > > > >      what Boris
> > > > >   >> comes up with.
> > > > >   >>
> > > > >   >>> Even on the low level, the idea to replace drm_sched threads
> > > > >      with workers
> > > > >   >>> has a few problems.
> > > > >   >>>
> > > > >   >>> To start with, the pattern of:
> > > > >   >>>
> > > > >   >>>    while (not_stopped) {
> > > > >   >>>     keep picking jobs
> > > > >   >>>    }
> > > > >   >>>
> > > > >   >>> Feels fundamentally in disagreement with workers (while
> > > > >      obviously fits
> > > > >   >>> perfectly with the current kthread design).
> > > > >   >>
> > > > >   >> The while loop breaks and worker exits if no jobs are ready.
> > > > > 
> > > > > 
> > > > > I'm not very familiar with workqueues. What are you saying would fit
> > > > > better? One scheduling job per work item rather than one big work
> > > > > item which handles all available jobs?
> > > > 
> > > > Yes and no, it indeed IMO does not fit to have a work item which is
> > > > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > > > because it is a worst case / theoretical, and I think due to more
> > > > fundamental concerns.
> > > > 
> > > > If we have to go back to the low level side of things, I've picked this
> > > > random spot to consolidate what I have already mentioned and perhaps
> > > > expand.
> > > > 
> > > > To start with, let me pull out some thoughts from workqueue.rst:
> > > > 
> > > > """
> > > > Generally, work items are not expected to hog a CPU and consume many
> > > > cycles. That means maintaining just enough concurrency to prevent work
> > > > processing from stalling should be optimal.
> > > > """
> > > > 
> > > > For unbound queues:
> > > > """
> > > > The responsibility of regulating concurrency level is on the users.
> > > > """
> > > > 
> > > > Given the unbound queues will be spawned on demand to service all queued
> > > > work items (more interesting when mixing up with the system_unbound_wq),
> > > > in the proposed design the number of instantiated worker threads does
> > > > not correspond to the number of user threads (as you have elsewhere
> > > > stated), but pessimistically to the number of active user contexts. That
> > > > is the number which drives the maximum number of not-runnable jobs that
> > > > can become runnable at once, and hence spawn that many work items, and
> > > > in turn unbound worker threads.
> > > > 
> > > > Several problems there.
> > > > 
> > > > It is fundamentally pointless to have potentially that many more threads
> > > > than the number of CPU cores - it simply creates a scheduling storm.
> > > 
> > > To make matters worse, if I follow the code correctly, all these per user
> > > context worker thread / work items end up contending on the same lock or
> > > circular buffer, both are one instance per GPU:
> > > 
> > > guc_engine_run_job
> > >   -> submit_engine
> > >  a) wq_item_append
> > >  -> wq_wait_for_space
> > >-> msleep
> > 
> > a) is dedicated per xe_engine
> 
> Hah true, what's it for then? I thought throttling the LRCA ring is done via:
> 

This 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Tvrtko Ursulin



On 10/01/2023 15:55, Matthew Brost wrote:

On Tue, Jan 10, 2023 at 12:19:35PM +, Tvrtko Ursulin wrote:


On 10/01/2023 11:28, Tvrtko Ursulin wrote:



On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]


  >>> AFAICT it proposes to have 1:1 between *userspace* created
     contexts (per
  >>> context _and_ engine) and drm_sched. I am not sure avoiding
     invasive changes
  >>> to the shared code is in the spirit of the overall idea and
instead
  >>> opportunity should be used to look at way to refactor/improve
     drm_sched.


Maybe?  I'm not convinced that what Xe is doing is an abuse at all
or really needs to drive a re-factor.  (More on that later.)
There's only one real issue which is that it fires off potentially a
lot of kthreads. Even that's not that bad given that kthreads are
pretty light and you're not likely to have more kthreads than
userspace threads which are much heavier.  Not ideal, but not the
end of the world either.  Definitely something we can/should
optimize but if we went through with Xe without this patch, it would
probably be mostly ok.

  >> Yes, it is 1:1 *userspace* engines and drm_sched.
  >>
  >> I'm not really prepared to make large changes to DRM scheduler
     at the
  >> moment for Xe as they are not really required nor does Boris
     seem they
  >> will be required for his work either. I am interested to see
     what Boris
  >> comes up with.
  >>
  >>> Even on the low level, the idea to replace drm_sched threads
     with workers
  >>> has a few problems.
  >>>
  >>> To start with, the pattern of:
  >>>
  >>>    while (not_stopped) {
  >>>     keep picking jobs
  >>>    }
  >>>
  >>> Feels fundamentally in disagreement with workers (while
     obviously fits
  >>> perfectly with the current kthread design).
  >>
  >> The while loop breaks and worker exits if no jobs are ready.


I'm not very familiar with workqueues. What are you saying would fit
better? One scheduling job per work item rather than one big work
item which handles all available jobs?


Yes and no, it indeed IMO does not fit to have a work item which is
potentially unbound in runtime. But it is a bit moot conceptual mismatch
because it is a worst case / theoretical, and I think due to more
fundamental concerns.

If we have to go back to the low level side of things, I've picked this
random spot to consolidate what I have already mentioned and perhaps
expand.

To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many
cycles. That means maintaining just enough concurrency to prevent work
processing from stalling should be optimal.
"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued
work items (more interesting when mixing up with the system_unbound_wq),
in the proposed design the number of instantiated worker threads does
not correspond to the number of user threads (as you have elsewhere
stated), but pessimistically to the number of active user contexts. That
is the number which drives the maximum number of not-runnable jobs that
can become runnable at once, and hence spawn that many work items, and
in turn unbound worker threads.

Several problems there.

It is fundamentally pointless to have potentially that many more threads
than the number of CPU cores - it simply creates a scheduling storm.


To make matters worse, if I follow the code correctly, all these per user
context worker thread / work items end up contending on the same lock or
circular buffer, both are one instance per GPU:

guc_engine_run_job
  -> submit_engine
 a) wq_item_append
 -> wq_wait_for_space
   -> msleep


a) is dedicated per xe_engine


Hah true, what's it for then? I thought throttling the LRCA ring is done via:

  drm_sched_init(&e->sched, &drm_sched_ops,
 e->lrc[0].ring.size / MAX_JOB_SIZE_BYTES,

Is there something more to throttle other than the ring? It is 
throttling something using msleeps..



Also you missed the step of programming the ring which is dedicated per 
xe_engine


I was trying to quickly find places which serialize on something in the 
backend, ringbuffer emission did not seem to do that but maybe I missed 
something.





 b) xe_guc_ct_send
 -> guc_ct_send
   -> mutex_lock(&ct->lock);
   -> later a potential msleep in h2g_has_room


Technically there is one instance per GT, not per GPU; yes, this is shared, but
in practice there will always be space in the CT channel so contention
on the lock should be rare.


Yeah, I used the term GPU to be more understandable to an outside audience.

I am somewhat disappointed that the Xe opportunity hasn't been used to 
improve upon the CT communication bottlenecks. I mean those backoff 
sleeps and lock contention. I wish there 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Matthew Brost
On Tue, Jan 10, 2023 at 11:28:08AM +, Tvrtko Ursulin wrote:
> 
> 
> On 09/01/2023 17:27, Jason Ekstrand wrote:
> 
> [snip]
> 
> >  >>> AFAICT it proposes to have 1:1 between *userspace* created
> > contexts (per
> >  >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > invasive changes
> >  >>> to the shared code is in the spirit of the overall idea and instead
> >  >>> opportunity should be used to look at way to refactor/improve
> > drm_sched.
> > 
> > 
> > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > really needs to drive a re-factor.  (More on that later.)  There's only
> > one real issue which is that it fires off potentially a lot of kthreads.
> > Even that's not that bad given that kthreads are pretty light and you're
> > not likely to have more kthreads than userspace threads which are much
> > heavier.  Not ideal, but not the end of the world either.  Definitely
> > something we can/should optimize but if we went through with Xe without
> > this patch, it would probably be mostly ok.
> > 
> >  >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >  >>
> >  >> I'm not really prepared to make large changes to DRM scheduler
> > at the
> >  >> moment for Xe as they are not really required nor does Boris
> > seem they
> >  >> will be required for his work either. I am interested to see
> > what Boris
> >  >> comes up with.
> >  >>
> >  >>> Even on the low level, the idea to replace drm_sched threads
> > with workers
> >  >>> has a few problems.
> >  >>>
> >  >>> To start with, the pattern of:
> >  >>>
> >  >>>    while (not_stopped) {
> >  >>>     keep picking jobs
> >  >>>    }
> >  >>>
> >  >>> Feels fundamentally in disagreement with workers (while
> > obviously fits
> >  >>> perfectly with the current kthread design).
> >  >>
> >  >> The while loop breaks and worker exits if no jobs are ready.
> > 
> > 
> > I'm not very familiar with workqueues. What are you saying would fit
> > better? One scheduling job per work item rather than one big work item
> > which handles all available jobs?
> 
> Yes and no, it indeed IMO does not fit to have a work item which is
> potentially unbound in runtime. But it is a bit moot conceptual mismatch
> because it is a worst case / theoretical, and I think due to more fundamental
> concerns.
> 
> If we have to go back to the low level side of things, I've picked this
> random spot to consolidate what I have already mentioned and perhaps expand.
> 
> To start with, let me pull out some thoughts from workqueue.rst:
> 
> """
> Generally, work items are not expected to hog a CPU and consume many cycles.
> That means maintaining just enough concurrency to prevent work processing
> from stalling should be optimal.
> """
> 
> For unbound queues:
> """
> The responsibility of regulating concurrency level is on the users.
> """
> 
> Given the unbound queues will be spawned on demand to service all queued
> work items (more interesting when mixing up with the system_unbound_wq), in
> the proposed design the number of instantiated worker threads does not
> correspond to the number of user threads (as you have elsewhere stated), but
> pessimistically to the number of active user contexts. That is the number
> which drives the maximum number of not-runnable jobs that can become
> runnable at once, and hence spawn that many work items, and in turn unbound
> worker threads.
> 
> Several problems there.
> 
> It is fundamentally pointless to have potentially that many more threads
> than the number of CPU cores - it simply creates a scheduling storm.
> 

We can use a different work queue if this is an issue; we have a FIXME
which indicates we should allow the user to pass in the work queue.

> Unbound workers have no CPU / cache locality either and no connection with
> the CPU scheduler to optimize scheduling patterns. This may matter either on
> large systems or on small ones. Whereas the current design allows for
> scheduler to notice userspace CPU thread keeps waking up the same drm
> scheduler kernel thread, and so it can keep them on the same CPU, the
> unbound workers lose that ability and so 2nd CPU might be getting woken up
> from low sleep for every submission.
>

I guess I don't understand kthread vs. workqueue scheduling internals.
 
> Hence, apart from being a bit of an impedance mismatch, the proposal has the
> potential to change performance and power patterns and both large and small
> machines.
>

We are going to have to test this out I suppose and play around to see
if this design has any real-world impacts. As Jason said, yeah, we probably
will need a bit of help here from others. Will CC relevant parties on
next rev. 
 
> >  >>> Secondly, it probably demands separate workers (not optional),
> > otherwise
> >  >>> behaviour of shared workqueues has either the potential to
> > explode 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Matthew Brost
On Tue, Jan 10, 2023 at 12:19:35PM +, Tvrtko Ursulin wrote:
> 
> On 10/01/2023 11:28, Tvrtko Ursulin wrote:
> > 
> > 
> > On 09/01/2023 17:27, Jason Ekstrand wrote:
> > 
> > [snip]
> > 
> > >  >>> AFAICT it proposes to have 1:1 between *userspace* created
> > >     contexts (per
> > >  >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > >     invasive changes
> > >  >>> to the shared code is in the spirit of the overall idea and
> > > instead
> > >  >>> opportunity should be used to look at way to refactor/improve
> > >     drm_sched.
> > > 
> > > 
> > > Maybe?  I'm not convinced that what Xe is doing is an abuse at all
> > > or really needs to drive a re-factor.  (More on that later.) 
> > > There's only one real issue which is that it fires off potentially a
> > > lot of kthreads. Even that's not that bad given that kthreads are
> > > pretty light and you're not likely to have more kthreads than
> > > userspace threads which are much heavier.  Not ideal, but not the
> > > end of the world either.  Definitely something we can/should
> > > optimize but if we went through with Xe without this patch, it would
> > > probably be mostly ok.
> > > 
> > >  >> Yes, it is 1:1 *userspace* engines and drm_sched.
> > >  >>
> > >  >> I'm not really prepared to make large changes to DRM scheduler
> > >     at the
> > >  >> moment for Xe as they are not really required nor does Boris
> > >     seem they
> > >  >> will be required for his work either. I am interested to see
> > >     what Boris
> > >  >> comes up with.
> > >  >>
> > >  >>> Even on the low level, the idea to replace drm_sched threads
> > >     with workers
> > >  >>> has a few problems.
> > >  >>>
> > >  >>> To start with, the pattern of:
> > >  >>>
> > >  >>>    while (not_stopped) {
> > >  >>>     keep picking jobs
> > >  >>>    }
> > >  >>>
> > >  >>> Feels fundamentally in disagreement with workers (while
> > >     obviously fits
> > >  >>> perfectly with the current kthread design).
> > >  >>
> > >  >> The while loop breaks and worker exits if no jobs are ready.
> > > 
> > > 
> > > I'm not very familiar with workqueues. What are you saying would fit
> > > better? One scheduling job per work item rather than one big work
> > > item which handles all available jobs?
> > 
> > Yes and no, it indeed IMO does not fit to have a work item which is
> > potentially unbound in runtime. But it is a bit moot conceptual mismatch
> > because it is a worst case / theoretical, and I think due to more
> > fundamental concerns.
> > 
> > If we have to go back to the low level side of things, I've picked this
> > random spot to consolidate what I have already mentioned and perhaps
> > expand.
> > 
> > To start with, let me pull out some thoughts from workqueue.rst:
> > 
> > """
> > Generally, work items are not expected to hog a CPU and consume many
> > cycles. That means maintaining just enough concurrency to prevent work
> > processing from stalling should be optimal.
> > """
> > 
> > For unbound queues:
> > """
> > The responsibility of regulating concurrency level is on the users.
> > """
> > 
> > Given the unbound queues will be spawned on demand to service all queued
> > work items (more interesting when mixing up with the system_unbound_wq),
> > in the proposed design the number of instantiated worker threads does
> > not correspond to the number of user threads (as you have elsewhere
> > stated), but pessimistically to the number of active user contexts. That
> > is the number which drives the maximum number of not-runnable jobs that
> > can become runnable at once, and hence spawn that many work items, and
> > in turn unbound worker threads.
> > 
> > Several problems there.
> > 
> > It is fundamentally pointless to have potentially that many more threads
> > than the number of CPU cores - it simply creates a scheduling storm.
> 
> To make matters worse, if I follow the code correctly, all these per user
> context worker thread / work items end up contending on the same lock or
> circular buffer, both are one instance per GPU:
> 
> guc_engine_run_job
>  -> submit_engine
> a) wq_item_append
> -> wq_wait_for_space
>   -> msleep

a) is dedicated per xe_engine

Also you missed the step of programming the ring which is dedicated per 
xe_engine

> b) xe_guc_ct_send
> -> guc_ct_send
>   -> mutex_lock(&ct->lock);
>   -> later a potential msleep in h2g_has_room

Technically there is one instance per GT, not per GPU; yes, this is shared, but
in practice there will always be space in the CT channel so contention
on the lock should be rare.

I haven't read your rather long reply yet, but also FWIW using a
workqueue has suggested by AMD (original authors of the DRM scheduler)
when we ran this design by them.

Matt 

> 
> Regards,
> 
> Tvrtko


Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Jason Ekstrand
On Tue, Jan 10, 2023 at 5:28 AM Tvrtko Ursulin <
tvrtko.ursu...@linux.intel.com> wrote:

>
>
> On 09/01/2023 17:27, Jason Ekstrand wrote:
>
> [snip]
>
> >  >>> AFAICT it proposes to have 1:1 between *userspace* created
> > contexts (per
> >  >>> context _and_ engine) and drm_sched. I am not sure avoiding
> > invasive changes
> >  >>> to the shared code is in the spirit of the overall idea and
> instead
> >  >>> opportunity should be used to look at way to refactor/improve
> > drm_sched.
> >
> >
> > Maybe?  I'm not convinced that what Xe is doing is an abuse at all or
> > really needs to drive a re-factor.  (More on that later.)  There's only
> > one real issue which is that it fires off potentially a lot of kthreads.
> > Even that's not that bad given that kthreads are pretty light and you're
> > not likely to have more kthreads than userspace threads which are much
> > heavier.  Not ideal, but not the end of the world either.  Definitely
> > something we can/should optimize but if we went through with Xe without
> > this patch, it would probably be mostly ok.
> >
> >  >> Yes, it is 1:1 *userspace* engines and drm_sched.
> >  >>
> >  >> I'm not really prepared to make large changes to DRM scheduler
> > at the
> >  >> moment for Xe as they are not really required nor does Boris
> > seem they
> >  >> will be required for his work either. I am interested to see
> > what Boris
> >  >> comes up with.
> >  >>
> >  >>> Even on the low level, the idea to replace drm_sched threads
> > with workers
> >  >>> has a few problems.
> >  >>>
> >  >>> To start with, the pattern of:
> >  >>>
> >  >>>while (not_stopped) {
> >  >>> keep picking jobs
> >  >>>}
> >  >>>
> >  >>> Feels fundamentally in disagreement with workers (while
> > obviously fits
> >  >>> perfectly with the current kthread design).
> >  >>
> >  >> The while loop breaks and worker exits if no jobs are ready.
> >
> >
> > I'm not very familiar with workqueues. What are you saying would fit
> > better? One scheduling job per work item rather than one big work item
> > which handles all available jobs?
>
> Yes and no, it indeed IMO does not fit to have a work item which is
> potentially unbound in runtime. But it is a bit moot conceptual mismatch
> because it is a worst case / theoretical, and I think due to more
> fundamental concerns.
>
> If we have to go back to the low level side of things, I've picked this
> random spot to consolidate what I have already mentioned and perhaps
> expand.
>
> To start with, let me pull out some thoughts from workqueue.rst:
>
> """
> Generally, work items are not expected to hog a CPU and consume many
> cycles. That means maintaining just enough concurrency to prevent work
> processing from stalling should be optimal.
> """
>
> For unbound queues:
> """
> The responsibility of regulating concurrency level is on the users.
> """
>
> Given the unbound queues will be spawned on demand to service all queued
> work items (more interesting when mixing up with the system_unbound_wq),
> in the proposed design the number of instantiated worker threads does
> not correspond to the number of user threads (as you have elsewhere
> stated), but pessimistically to the number of active user contexts.


Those are pretty much the same in practice.  Rather, the number of user threads
is typically an upper bound on the number of contexts.  Yes, a single user
thread could have a bunch of contexts but basically nothing does that
except IGT.  In real-world usage, it's at most one context per user thread.


> That
> is the number which drives the maximum number of not-runnable jobs that
> can become runnable at once, and hence spawn that many work items, and
> in turn unbound worker threads.
>
> Several problems there.
>
> It is fundamentally pointless to have potentially that many more threads
> than the number of CPU cores - it simply creates a scheduling storm.
>
> Unbound workers have no CPU / cache locality either and no connection
> with the CPU scheduler to optimize scheduling patterns. This may matter
> either on large systems or on small ones. Whereas the current design
> allows for scheduler to notice userspace CPU thread keeps waking up the
> same drm scheduler kernel thread, and so it can keep them on the same
> CPU, the unbound workers lose that ability and so 2nd CPU might be
> getting woken up from low sleep for every submission.
>
> Hence, apart from being a bit of an impedance mismatch, the proposal has
> the potential to change performance and power patterns and both large
> and small machines.
>

Ok, thanks for explaining the issue you're seeing in more detail.  Yes,
deferred kwork does appear to mismatch somewhat with what the scheduler
needs or at least how it's worked in the past.  How much impact will that
mismatch have?  Unclear.


> >  >>> Secondly, it probably demands separate workers (not 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Tvrtko Ursulin



On 10/01/2023 11:28, Tvrtko Ursulin wrote:



On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]


 >>> AFAICT it proposes to have 1:1 between *userspace* created
    contexts (per
 >>> context _and_ engine) and drm_sched. I am not sure avoiding
    invasive changes
 >>> to the shared code is in the spirit of the overall idea and 
instead

 >>> opportunity should be used to look at way to refactor/improve
    drm_sched.


Maybe?  I'm not convinced that what Xe is doing is an abuse at all or 
really needs to drive a re-factor.  (More on that later.)  There's 
only one real issue which is that it fires off potentially a lot of 
kthreads. Even that's not that bad given that kthreads are pretty 
light and you're not likely to have more kthreads than userspace 
threads which are much heavier.  Not ideal, but not the end of the 
world either.  Definitely something we can/should optimize but if we 
went through with Xe without this patch, it would probably be mostly ok.


 >> Yes, it is 1:1 *userspace* engines and drm_sched.
 >>
 >> I'm not really prepared to make large changes to DRM scheduler
    at the
 >> moment for Xe as they are not really required nor does Boris
    seem they
 >> will be required for his work either. I am interested to see
    what Boris
 >> comes up with.
 >>
 >>> Even on the low level, the idea to replace drm_sched threads
    with workers
 >>> has a few problems.
 >>>
 >>> To start with, the pattern of:
 >>>
 >>>    while (not_stopped) {
 >>>     keep picking jobs
 >>>    }
 >>>
 >>> Feels fundamentally in disagreement with workers (while
    obviously fits
 >>> perfectly with the current kthread design).
 >>
 >> The while loop breaks and worker exits if no jobs are ready.


I'm not very familiar with workqueues. What are you saying would fit 
better? One scheduling job per work item rather than one big work item 
which handles all available jobs?


Yes and no, it indeed IMO does not fit to have a work item which is 
potentially unbound in runtime. But it is a bit moot conceptual mismatch 
because it is a worst case / theoretical, and I think due to more 
fundamental concerns.


If we have to go back to the low level side of things, I've picked this 
random spot to consolidate what I have already mentioned and perhaps 
expand.


To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many 
cycles. That means maintaining just enough concurrency to prevent work 
processing from stalling should be optimal.

"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued 
work items (more interesting when mixing up with the system_unbound_wq), 
in the proposed design the number of instantiated worker threads does 
not correspond to the number of user threads (as you have elsewhere 
stated), but pessimistically to the number of active user contexts. That 
is the number which drives the maximum number of not-runnable jobs that 
can become runnable at once, and hence spawn that many work items, and 
in turn unbound worker threads.


Several problems there.

It is fundamentally pointless to have potentially that many more threads 
than the number of CPU cores - it simply creates a scheduling storm.


To make matters worse, if I follow the code correctly, all these per 
user context worker thread / work items end up contending on the same 
lock or circular buffer, both are one instance per GPU:


guc_engine_run_job
 -> submit_engine
a) wq_item_append
-> wq_wait_for_space
  -> msleep
b) xe_guc_ct_send
-> guc_ct_send
  -> mutex_lock(&ct->lock);
  -> later a potential msleep in h2g_has_room

Regards,

Tvrtko


Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Tvrtko Ursulin




On 09/01/2023 17:27, Jason Ekstrand wrote:

[snip]


 >>> AFAICT it proposes to have 1:1 between *userspace* created
contexts (per
 >>> context _and_ engine) and drm_sched. I am not sure avoiding
invasive changes
 >>> to the shared code is in the spirit of the overall idea and instead
 >>> opportunity should be used to look at way to refactor/improve
drm_sched.


Maybe?  I'm not convinced that what Xe is doing is an abuse at all or 
really needs to drive a re-factor.  (More on that later.)  There's only 
one real issue which is that it fires off potentially a lot of kthreads. 
Even that's not that bad given that kthreads are pretty light and you're 
not likely to have more kthreads than userspace threads which are much 
heavier.  Not ideal, but not the end of the world either.  Definitely 
something we can/should optimize but if we went through with Xe without 
this patch, it would probably be mostly ok.


 >> Yes, it is 1:1 *userspace* engines and drm_sched.
 >>
 >> I'm not really prepared to make large changes to DRM scheduler
at the
 >> moment for Xe as they are not really required nor does Boris
seem they
 >> will be required for his work either. I am interested to see
what Boris
 >> comes up with.
 >>
 >>> Even on the low level, the idea to replace drm_sched threads
with workers
 >>> has a few problems.
 >>>
 >>> To start with, the pattern of:
 >>>
 >>>    while (not_stopped) {
 >>>     keep picking jobs
 >>>    }
 >>>
 >>> Feels fundamentally in disagreement with workers (while
obviously fits
 >>> perfectly with the current kthread design).
 >>
 >> The while loop breaks and worker exits if no jobs are ready.


I'm not very familiar with workqueues. What are you saying would fit 
better? One scheduling job per work item rather than one big work item 
which handles all available jobs?


Yes and no, it indeed IMO does not fit to have a work item which is 
potentially unbound in runtime. But it is a bit moot conceptual mismatch 
because it is a worst case / theoretical, and I think due to more 
fundamental concerns.


If we have to go back to the low level side of things, I've picked this 
random spot to consolidate what I have already mentioned and perhaps expand.


To start with, let me pull out some thoughts from workqueue.rst:

"""
Generally, work items are not expected to hog a CPU and consume many 
cycles. That means maintaining just enough concurrency to prevent work 
processing from stalling should be optimal.

"""

For unbound queues:
"""
The responsibility of regulating concurrency level is on the users.
"""

Given the unbound queues will be spawned on demand to service all queued 
work items (more interesting when mixing up with the system_unbound_wq), 
in the proposed design the number of instantiated worker threads does 
not correspond to the number of user threads (as you have elsewhere 
stated), but pessimistically to the number of active user contexts. That 
is the number which drives the maximum number of not-runnable jobs that 
can become runnable at once, and hence spawn that many work items, and 
in turn unbound worker threads.


Several problems there.

It is fundamentally pointless to have potentially that many more threads 
than the number of CPU cores - it simply creates a scheduling storm.


Unbound workers have no CPU / cache locality either and no connection 
with the CPU scheduler to optimize scheduling patterns. This may matter 
either on large systems or on small ones. Whereas the current design 
allows for scheduler to notice userspace CPU thread keeps waking up the 
same drm scheduler kernel thread, and so it can keep them on the same 
CPU, the unbound workers lose that ability and so 2nd CPU might be 
getting woken up from low sleep for every submission.


Hence, apart from being a bit of an impedance mismatch, the proposal has 
the potential to change performance and power patterns and both large 
and small machines.



 >>> Secondly, it probably demands separate workers (not optional),
otherwise
 >>> behaviour of shared workqueues has either the potential to
explode number
 >>> kernel threads anyway, or add latency.
 >>>
 >>
 >> Right now the system_unbound_wq is used which does have a limit
on the
 >> number of threads, right? I do have a FIXME to allow a worker to be
 >> passed in similar to TDR.
 >>
 >> WRT to latency, the 1:1 ratio could actually have lower latency
as 2 GPU
 >> schedulers can be pushing jobs into the backend / cleaning up
jobs in
 >> parallel.
 >>
 >
 > Thought of one more point here where why in Xe we absolutely want
a 1 to
 > 1 ratio between entity and scheduler - the way we implement
timeslicing
 > for preempt fences.
 >
 > Let me try to explain.
 >
 > Preempt fences are implemented via the generic messaging
 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Boris Brezillon
Hi Daniel,

On Mon, 9 Jan 2023 21:40:21 +0100
Daniel Vetter  wrote:

> On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > Hi Jason,
> > 
> > On Mon, 9 Jan 2023 09:45:09 -0600
> > Jason Ekstrand  wrote:
> >   
> > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > wrote:
> > >   
> > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:
> > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > Boris Brezillon  wrote:
> > > > >
> > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > Boris Brezillon  wrote:
> > > > > >
> > > > > > > Hello Matthew,
> > > > > > >
> > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > Matthew Brost  wrote:
> > > > > > >
> > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 
> > > > > > > > to 1
> > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At 
> > > > > > > > first
> > > > this
> > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > >
> > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is 
> > > > > > > > not
> > > > > > > > guaranteed to be the same completion even if targeting the same 
> > > > > > > >
> > > > hardware
> > > > > > > > engine. This is because in XE we have a firmware scheduler, the 
> > > > > > > >
> > > > GuC,
> > > > > > > > which allowed to reorder, timeslice, and preempt submissions. 
> > > > > > > > If a
> > > > using
> > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the 
> > > > > > > > TDR
> > > > falls
> > > > > > > > apart as the TDR expects submission order == completion order.  
> > > > > > > >   
> > > > Using a
> > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this
> > > > problem.
> > > > > > >
> > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > issues to support Arm's new Mali GPU which is relying on a
> > > > FW-assisted
> > > > > > > scheduling scheme (you give the FW N streams to execute, and it 
> > > > > > > does
> > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > does timeslice scheduling to update the command streams passed to 
> > > > > > > the
> > > > > > > FW). I must admit I gave up on using drm_sched at some point, 
> > > > > > > mostly
> > > > > > > because the integration with drm_sched was painful, but also 
> > > > > > > because
> > > > I
> > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > timeslice-oriented scheduling model wasn't really future proof.   
> > > > > > >  
> > > > Giving
> > > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably  
> > > > > > >   
> > > > might
> > > > > > > help for a few things (didn't think it through yet), but I feel 
> > > > > > > it's
> > > > > > > coming short on other aspects we have to deal with on Arm GPUs.   
> > > > > > >  
> > > > > >
> > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I 
> > > > > > think I
> > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > while still controlling how scheduling is really done. Here
> > > > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue   
> > > > > >  
> > > >
> > > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > > are met, and lastly a tracking mechanism of inflights jobs that need to
> > > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > > aside from the most basic level of not overflowing the submission ring
> > > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > > scheduler fits quite well.
> > > >
> > > 
> > > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > > want here and what you need for Arm thanks to the number of FW queues
> > > available. I don't remember the exact number of GuC queues but it's at
> > > least 1k. This puts it in an entirely different class from what you have 
> > > on
> > > Mali. Roughly, there's about three categories here:
> > > 
> > >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > > categorization.)
> > > 
> > >  2. Hardware (or firmware) with a very limited number of queues where
> > > you're going to have to juggle in the kernel in order to run desktop 
> > > Linux.
> > > 
> > >  3. Firmware scheduling with a high queue count. In this case, you don't
> > > want the kernel scheduling anything. Just throw 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-09 Thread Daniel Vetter
On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> Hi Jason,
> 
> On Mon, 9 Jan 2023 09:45:09 -0600
> Jason Ekstrand  wrote:
> 
> > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > wrote:
> > 
> > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > Boris Brezillon  wrote:
> > > >  
> > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > Boris Brezillon  wrote:
> > > > >  
> > > > > > Hello Matthew,
> > > > > >
> > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > Matthew Brost  wrote:
> > > > > >  
> > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 
> > > > > > > 1
> > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At 
> > > > > > > first  
> > > this  
> > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > >
> > > > > > > 1. In XE the submission order from multiple drm_sched_entity is 
> > > > > > > not
> > > > > > > guaranteed to be the same completion even if targeting the same  
> > > hardware  
> > > > > > > engine. This is because in XE we have a firmware scheduler, the  
> > > GuC,  
> > > > > > > which allowed to reorder, timeslice, and preempt submissions. If 
> > > > > > > a  
> > > using  
> > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the 
> > > > > > > TDR  
> > > falls  
> > > > > > > apart as the TDR expects submission order == completion order.  
> > > Using a  
> > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this  
> > > problem.  
> > > > > >
> > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > issues to support Arm's new Mali GPU which is relying on a  
> > > FW-assisted  
> > > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > does timeslice scheduling to update the command streams passed to 
> > > > > > the
> > > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > > because the integration with drm_sched was painful, but also 
> > > > > > because  
> > > I  
> > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > timeslice-oriented scheduling model wasn't really future proof.  
> > > Giving  
> > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably  
> > > might  
> > > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > > coming short on other aspects we have to deal with on Arm GPUs.  
> > > > >
> > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > > have a better understanding of how you get away with using drm_sched
> > > > > while still controlling how scheduling is really done. Here
> > > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue  
> > >
> > > You nailed it here, we use the DRM scheduler for queuing jobs,
> > > dependency tracking and releasing jobs to be scheduled when dependencies
> > > are met, and lastly a tracking mechanism of inflights jobs that need to
> > > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > > aside from the most basic level of not overflowing the submission ring
> > > buffer. In this sense, a 1 to 1 relationship between entity and
> > > scheduler fits quite well.
> > >  
> > 
> > Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> > want here and what you need for Arm thanks to the number of FW queues
> > available. I don't remember the exact number of GuC queues but it's at
> > least 1k. This puts it in an entirely different class from what you have on
> > Mali. Roughly, there's about three categories here:
> > 
> >  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> > old Mali, Intel Haswell and earlier, and probably a bunch of others.
> > (Intel BDW+ with execlists is a weird case that doesn't fit in this
> > categorization.)
> > 
> >  2. Hardware (or firmware) with a very limited number of queues where
> > you're going to have to juggle in the kernel in order to run desktop Linux.
> > 
> >  3. Firmware scheduling with a high queue count. In this case, you don't
> > want the kernel scheduling anything. Just throw it at the firmware and let
> > it go br.  If we ever run out of queues (unlikely), the kernel can
> > temporarily pause some low-priority contexts and do some juggling or,
> > frankly, just fail userspace queue creation and tell the user to close some
> > windows.
> > 
> > The existence of this 2nd class is a bit annoying but it's where we are. I
> > think it's worth recognizing that Xe and panfrost are in different places
> > here and will require different designs. For Xe, we really are just 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-09 Thread Jason Ekstrand
On Mon, Jan 9, 2023 at 7:46 AM Tvrtko Ursulin <
tvrtko.ursu...@linux.intel.com> wrote:

>
> On 06/01/2023 23:52, Matthew Brost wrote:
> > On Thu, Jan 05, 2023 at 09:43:41PM +, Matthew Brost wrote:
> >> On Tue, Jan 03, 2023 at 01:02:15PM +, Tvrtko Ursulin wrote:
> >>>
> >>> On 02/01/2023 07:30, Boris Brezillon wrote:
>  On Fri, 30 Dec 2022 12:55:08 +0100
>  Boris Brezillon  wrote:
> 
> > On Fri, 30 Dec 2022 11:20:42 +0100
> > Boris Brezillon  wrote:
> >
> >> Hello Matthew,
> >>
> >> On Thu, 22 Dec 2022 14:21:11 -0800
> >> Matthew Brost  wrote:
> >>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> >>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first
> this
> >>> seems a bit odd but let us explain the reasoning below.
> >>>
> >>> 1. In XE the submission order from multiple drm_sched_entity is not
> >>> guaranteed to be the same completion even if targeting the same
> hardware
> >>> engine. This is because in XE we have a firmware scheduler, the
> GuC,
> >>> which allowed to reorder, timeslice, and preempt submissions. If a
> using
> >>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> falls
> >>> apart as the TDR expects submission order == completion order.
> Using a
> >>> dedicated drm_gpu_scheduler per drm_sched_entity solve this
> problem.
> >>
> >> Oh, that's interesting. I've been trying to solve the same sort of
> >> issues to support Arm's new Mali GPU which is relying on a
> FW-assisted
> >> scheduling scheme (you give the FW N streams to execute, and it does
> >> the scheduling between those N command streams, the kernel driver
> >> does timeslice scheduling to update the command streams passed to
> the
> >> FW). I must admit I gave up on using drm_sched at some point, mostly
> >> because the integration with drm_sched was painful, but also
> because I
> >> felt trying to bend drm_sched to make it interact with a
> >> timeslice-oriented scheduling model wasn't really future proof.
> Giving
> >> drm_sched_entity exlusive access to a drm_gpu_scheduler probably
> might
> >> help for a few things (didn't think it through yet), but I feel it's
> >> coming short on other aspects we have to deal with on Arm GPUs.
> >
> > Ok, so I just had a quick look at the Xe driver and how it
> > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think
> I
> > have a better understanding of how you get away with using drm_sched
> > while still controlling how scheduling is really done. Here
> > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > selection is dumb because there's only one entity ever bound to the
> > scheduler (the one that's part of the xe_guc_engine object which also
> > contains the drm_gpu_scheduler instance). I guess the main issue we'd
> > have on Arm is the fact that the stream doesn't necessarily get
> > scheduled when ->run_job() is called, it can be placed in the
> runnable
> > queue and be picked later by the kernel-side scheduler when a FW slot
> > gets released. That can probably be sorted out by manually disabling
> the
> > job timer and re-enabling it when the stream gets picked by the
> > scheduler. But my main concern remains, we're basically abusing
> > drm_sched here.
> >
> > For the Arm driver, that means turning the following sequence
> >
> > 1. wait for job deps
> > 2. queue job to ringbuf and push the stream to the runnable
> >  queue (if it wasn't queued already). Wakeup the timeslice
> scheduler
> >  to re-evaluate (if the stream is not on a FW slot already)
> > 3. stream gets picked by the timeslice scheduler and sent to the FW
> for
> >  execution
> >
> > into
> >
> > 1. queue job to entity which takes care of waiting for job deps for
> >  us
> > 2. schedule a drm_sched_main iteration
> > 3. the only available entity is picked, and the first job from this
> >  entity is dequeued. ->run_job() is called: the job is queued to
> the
> >  ringbuf and the stream is pushed to the runnable queue (if it
> wasn't
> >  queued already). Wakeup the timeslice scheduler to re-evaluate
> (if
> >  the stream is not on a FW slot already)
> > 4. stream gets picked by the timeslice scheduler and sent to the FW
> for
> >  execution
> >
> > That's one extra step we don't really need. To sum-up, yes, all the
> > job/entity tracking might be interesting to share/re-use, but I
> wonder
> > if we couldn't have that without pulling out the scheduling part of
> > drm_sched, or maybe I'm missing something, and there's something in
> > drm_gpu_scheduler you really need.
> 
>  On second 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-09 Thread Boris Brezillon
Hi Jason,

On Mon, 9 Jan 2023 09:45:09 -0600
Jason Ekstrand  wrote:

> On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> wrote:
> 
> > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > Boris Brezillon  wrote:
> > >  
> > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > Boris Brezillon  wrote:
> > > >  
> > > > > Hello Matthew,
> > > > >
> > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > Matthew Brost  wrote:
> > > > >  
> > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first  
> > this  
> > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > >
> > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > guaranteed to be the same completion even if targeting the same  
> > hardware  
> > > > > > engine. This is because in XE we have a firmware scheduler, the  
> > GuC,  
> > > > > > which allowed to reorder, timeslice, and preempt submissions. If a  
> > using  
> > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR  
> > falls  
> > > > > > apart as the TDR expects submission order == completion order.  
> > Using a  
> > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this  
> > problem.  
> > > > >
> > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > issues to support Arm's new Mali GPU which is relying on a  
> > FW-assisted  
> > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > the scheduling between those N command streams, the kernel driver
> > > > > does timeslice scheduling to update the command streams passed to the
> > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > because the integration with drm_sched was painful, but also because  
> > I  
> > > > > felt trying to bend drm_sched to make it interact with a
> > > > > timeslice-oriented scheduling model wasn't really future proof.  
> > Giving  
> > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably  
> > might  
> > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > coming short on other aspects we have to deal with on Arm GPUs.  
> > > >
> > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > have a better understanding of how you get away with using drm_sched
> > > > while still controlling how scheduling is really done. Here
> > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue  
> >
> > You nailed it here, we use the DRM scheduler for queuing jobs,
> > dependency tracking and releasing jobs to be scheduled when dependencies
> > are met, and lastly a tracking mechanism of inflights jobs that need to
> > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > aside from the most basic level of not overflowing the submission ring
> > buffer. In this sense, a 1 to 1 relationship between entity and
> > scheduler fits quite well.
> >  
> 
> Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> want here and what you need for Arm thanks to the number of FW queues
> available. I don't remember the exact number of GuC queues but it's at
> least 1k. This puts it in an entirely different class from what you have on
> Mali. Roughly, there's about three categories here:
> 
>  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> old Mali, Intel Haswell and earlier, and probably a bunch of others.
> (Intel BDW+ with execlists is a weird case that doesn't fit in this
> categorization.)
> 
>  2. Hardware (or firmware) with a very limited number of queues where
> you're going to have to juggle in the kernel in order to run desktop Linux.
> 
>  3. Firmware scheduling with a high queue count. In this case, you don't
> want the kernel scheduling anything. Just throw it at the firmware and let
> it go br.  If we ever run out of queues (unlikely), the kernel can
> temporarily pause some low-priority contexts and do some juggling or,
> frankly, just fail userspace queue creation and tell the user to close some
> windows.
> 
> The existence of this 2nd class is a bit annoying but it's where we are. I
> think it's worth recognizing that Xe and panfrost are in different places
> here and will require different designs. For Xe, we really are just using
> drm/scheduler as a front-end and the firmware does all the real scheduling.
> 
> How do we deal with class 2? That's an interesting question.  We may
> eventually want to break that off into a separate discussion and not litter
> the Xe thread but let's keep going here for a bit.  I think there are some
> pretty reasonable solutions but they're 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-09 Thread Jason Ekstrand
On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
wrote:

> On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon  wrote:
> >
> > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > Boris Brezillon  wrote:
> > >
> > > > Hello Matthew,
> > > >
> > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > Matthew Brost  wrote:
> > > >
> > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first
> this
> > > > > seems a bit odd but let us explain the reasoning below.
> > > > >
> > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > guaranteed to be the same completion even if targeting the same
> hardware
> > > > > engine. This is because in XE we have a firmware scheduler, the
> GuC,
> > > > > which allowed to reorder, timeslice, and preempt submissions. If a
> using
> > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> falls
> > > > > apart as the TDR expects submission order == completion order.
> Using a
> > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this
> problem.
> > > >
> > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > issues to support Arm's new Mali GPU which is relying on a
> FW-assisted
> > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > the scheduling between those N command streams, the kernel driver
> > > > does timeslice scheduling to update the command streams passed to the
> > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > because the integration with drm_sched was painful, but also because
> I
> > > > felt trying to bend drm_sched to make it interact with a
> > > > timeslice-oriented scheduling model wasn't really future proof.
> Giving
> > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably
> might
> > > > help for a few things (didn't think it through yet), but I feel it's
> > > > coming short on other aspects we have to deal with on Arm GPUs.
> > >
> > > Ok, so I just had a quick look at the Xe driver and how it
> > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > have a better understanding of how you get away with using drm_sched
> > > while still controlling how scheduling is really done. Here
> > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
>
> You nailed it here, we use the DRM scheduler for queuing jobs,
> dependency tracking and releasing jobs to be scheduled when dependencies
> are met, and lastly a tracking mechanism of inflights jobs that need to
> be cleaned up if an error occurs. It doesn't actually do any scheduling
> aside from the most basic level of not overflowing the submission ring
> buffer. In this sense, a 1 to 1 relationship between entity and
> scheduler fits quite well.
>

Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
want here and what you need for Arm thanks to the number of FW queues
available. I don't remember the exact number of GuC queues but it's at
least 1k. This puts it in an entirely different class from what you have on
Mali. Roughly, there's about three categories here:

 1. Hardware where the kernel is placing jobs on actual HW rings. This is
old Mali, Intel Haswell and earlier, and probably a bunch of others.
(Intel BDW+ with execlists is a weird case that doesn't fit in this
categorization.)

 2. Hardware (or firmware) with a very limited number of queues where
you're going to have to juggle in the kernel in order to run desktop Linux.

 3. Firmware scheduling with a high queue count. In this case, you don't
want the kernel scheduling anything. Just throw it at the firmware and let
it go brrr.  If we ever run out of queues (unlikely), the kernel can
temporarily pause some low-priority contexts and do some juggling or,
frankly, just fail userspace queue creation and tell the user to close some
windows.

The existence of this 2nd class is a bit annoying but it's where we are. I
think it's worth recognizing that Xe and panfrost are in different places
here and will require different designs. For Xe, we really are just using
drm/scheduler as a front-end and the firmware does all the real scheduling.
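
For this class-3 case, the 1:1 front-end pattern under discussion reduces
to creating one scheduler and one entity per userspace-visible queue. A
rough, non-buildable sketch against the drm_sched API of the time (the
xe_engine layout is illustrative and the drm_sched_init() arguments are
abbreviated):

```c
/* Sketch only: one drm_gpu_scheduler + one drm_sched_entity per
 * userspace queue, so drm_sched does dep tracking and ring-flow control
 * while the firmware (GuC) does the actual scheduling. */
struct xe_engine {                       /* illustrative name */
	struct drm_gpu_scheduler sched;  /* private to this engine */
	struct drm_sched_entity entity;
};

static int xe_engine_init(struct xe_engine *e)
{
	struct drm_gpu_scheduler *sched_list[] = { &e->sched };
	int err;

	/* args abbreviated; ops supplies ->run_job() / ->timedout_job() */
	err = drm_sched_init(&e->sched, &xe_sched_ops, ...);
	if (err)
		return err;

	/* The only entity ever bound to this scheduler, so run-queue
	 * selection is trivial and the TDR sees per-engine submission
	 * order == completion order. */
	return drm_sched_entity_init(&e->entity, DRM_SCHED_PRIORITY_NORMAL,
				     sched_list, 1, NULL);
}
```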

How do we deal with class 2? That's an interesting question.  We may
eventually want to break that off into a separate discussion and not litter
the Xe thread but let's keep going here for a bit.  I think there are some
pretty reasonable solutions but they're going to look a bit different.

The way I did this for Xe with execlists was to keep the 1:1:1 mapping
between drm_gpu_scheduler, drm_sched_entity, and userspace xe_engine.
Instead of feeding a GuC ring, though, it would feed a fixed-size execlist
ring and then there was a tiny kernel which operated entirely in IRQ

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-09 Thread Tvrtko Ursulin



On 06/01/2023 23:52, Matthew Brost wrote:

On Thu, Jan 05, 2023 at 09:43:41PM +, Matthew Brost wrote:

On Tue, Jan 03, 2023 at 01:02:15PM +, Tvrtko Ursulin wrote:


On 02/01/2023 07:30, Boris Brezillon wrote:

On Fri, 30 Dec 2022 12:55:08 +0100
Boris Brezillon  wrote:


On Fri, 30 Dec 2022 11:20:42 +0100
Boris Brezillon  wrote:


Hello Matthew,

On Thu, 22 Dec 2022 14:21:11 -0800
Matthew Brost  wrote:

In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to be the same completion even if targeting the same hardware
engine. This is because in XE we have a firmware scheduler, the GuC,
which allowed to reorder, timeslice, and preempt submissions. If a using
shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
apart as the TDR expects submission order == completion order. Using a
dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.


Oh, that's interesting. I've been trying to solve the same sort of
issues to support Arm's new Mali GPU which is relying on a FW-assisted
scheduling scheme (you give the FW N streams to execute, and it does
the scheduling between those N command streams, the kernel driver
does timeslice scheduling to update the command streams passed to the
FW). I must admit I gave up on using drm_sched at some point, mostly
because the integration with drm_sched was painful, but also because I
felt trying to bend drm_sched to make it interact with a
timeslice-oriented scheduling model wasn't really future proof. Giving
drm_sched_entity exlusive access to a drm_gpu_scheduler probably might
help for a few things (didn't think it through yet), but I feel it's
coming short on other aspects we have to deal with on Arm GPUs.


Ok, so I just had a quick look at the Xe driver and how it
instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
have a better understanding of how you get away with using drm_sched
while still controlling how scheduling is really done. Here
drm_gpu_scheduler is just a dummy abstract that let's you use the
drm_sched job queuing/dep/tracking mechanism. The whole run-queue
selection is dumb because there's only one entity ever bound to the
scheduler (the one that's part of the xe_guc_engine object which also
contains the drm_gpu_scheduler instance). I guess the main issue we'd
have on Arm is the fact that the stream doesn't necessarily get
scheduled when ->run_job() is called, it can be placed in the runnable
queue and be picked later by the kernel-side scheduler when a FW slot
gets released. That can probably be sorted out by manually disabling the
job timer and re-enabling it when the stream gets picked by the
scheduler. But my main concern remains, we're basically abusing
drm_sched here.

For the Arm driver, that means turning the following sequence

1. wait for job deps
2. queue job to ringbuf and push the stream to the runnable
 queue (if it wasn't queued already). Wakeup the timeslice scheduler
 to re-evaluate (if the stream is not on a FW slot already)
3. stream gets picked by the timeslice scheduler and sent to the FW for
 execution

into

1. queue job to entity which takes care of waiting for job deps for
 us
2. schedule a drm_sched_main iteration
3. the only available entity is picked, and the first job from this
 entity is dequeued. ->run_job() is called: the job is queued to the
 ringbuf and the stream is pushed to the runnable queue (if it wasn't
 queued already). Wakeup the timeslice scheduler to re-evaluate (if
 the stream is not on a FW slot already)
4. stream gets picked by the timeslice scheduler and sent to the FW for
 execution

That's one extra step we don't really need. To sum-up, yes, all the
job/entity tracking might be interesting to share/re-use, but I wonder
if we couldn't have that without pulling out the scheduling part of
drm_sched, or maybe I'm missing something, and there's something in
drm_gpu_scheduler you really need.
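
Step 3 of the second sequence could be expressed as a ->run_job() that
only stages the stream; all identifiers in this sketch are hypothetical
(there is no such driver code), and it is kernel code, not buildable
stand-alone:

```c
/* Sketch only; every mali_* name below is made up for illustration. */
static struct dma_fence *mali_run_job(struct drm_sched_job *sched_job)
{
	struct mali_stream *stream = to_mali_stream(sched_job);

	/* Deps are already met (drm_sched handled step 1): queue the job
	 * to the stream's ring buffer. */
	mali_ringbuf_queue(stream, sched_job);

	/* Not necessarily on a FW slot yet: leave it on the runnable
	 * queue and let the kernel-side timeslice scheduler pick it
	 * later, at which point the job timeout would be (re)armed. */
	if (!mali_stream_on_fw_slot(stream))
		mali_timeslice_sched_wakeup(stream);

	return dma_fence_get(stream->hw_fence);
}
```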


On second thought, that's probably an acceptable overhead (not even
sure the extra step I was mentioning exists in practice, because dep
fence signaled state is checked as part of the drm_sched_main
iteration, so that's basically replacing the worker I schedule to
check job deps), and I like the idea of being able to re-use drm_sched
dep-tracking without resorting to invasive changes to the existing
logic, so I'll probably give it a try.


I agree with the concerns and think that how Xe proposes to integrate with
drm_sched is a problem, or at least significantly inelegant.



Inelegant is a matter of opinion, I actually rather like this solution.

BTW this isn't my design rather this was Jason's idea.
  

AFAICT it proposes to have 1:1 between *userspace* created contexts (per
context _and_ engine) and drm_sched. 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-06 Thread Matthew Brost
On Thu, Jan 05, 2023 at 09:43:41PM +, Matthew Brost wrote:
> On Tue, Jan 03, 2023 at 01:02:15PM +, Tvrtko Ursulin wrote:
> > 
> > On 02/01/2023 07:30, Boris Brezillon wrote:
> > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > Boris Brezillon  wrote:
> > > 
> > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > Boris Brezillon  wrote:
> > > > 
> > > > > Hello Matthew,
> > > > > 
> > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > Matthew Brost  wrote:
> > > > > > In XE, the new Intel GPU driver, a choice has been made to have a
> > > > > > 1 to 1 mapping between a drm_gpu_scheduler and drm_sched_entity.
> > > > > > At first this seems a bit odd, but let us explain the reasoning below.
> > > > > > 
> > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > guaranteed to match the completion order, even when targeting the
> > > > > > same hardware engine. This is because in XE we have a firmware
> > > > > > scheduler, the GuC, which is allowed to reorder, timeslice, and
> > > > > > preempt submissions. If using a shared drm_gpu_scheduler across
> > > > > > multiple drm_sched_entity, the TDR falls apart as the TDR expects
> > > > > > submission order == completion order. Using a dedicated
> > > > > > drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > > > 
> > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > the scheduling between those N command streams, the kernel driver
> > > > > does timeslice scheduling to update the command streams passed to the
> > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > because the integration with drm_sched was painful, but also because I
> > > > > felt trying to bend drm_sched to make it interact with a
> > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > > 
> > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > have a better understanding of how you get away with using drm_sched
> > > > while still controlling how scheduling is really done. Here
> > > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > > selection is dumb because there's only one entity ever bound to the
> > > > scheduler (the one that's part of the xe_guc_engine object which also
> > > > contains the drm_gpu_scheduler instance). I guess the main issue we'd
> > > > have on Arm is the fact that the stream doesn't necessarily get
> > > > scheduled when ->run_job() is called, it can be placed in the runnable
> > > > queue and be picked later by the kernel-side scheduler when a FW slot
> > > > gets released. That can probably be sorted out by manually disabling the
> > > > job timer and re-enabling it when the stream gets picked by the
> > > > scheduler. But my main concern remains, we're basically abusing
> > > > drm_sched here.
> > > > 
> > > > For the Arm driver, that means turning the following sequence
> > > > 
> > > > 1. wait for job deps
> > > > 2. queue job to ringbuf and push the stream to the runnable
> > > > queue (if it wasn't queued already). Wakeup the timeslice scheduler
> > > > to re-evaluate (if the stream is not on a FW slot already)
> > > > 3. stream gets picked by the timeslice scheduler and sent to the FW for
> > > > execution
> > > > 
> > > > into
> > > > 
> > > > 1. queue job to entity which takes care of waiting for job deps for
> > > > us
> > > > 2. schedule a drm_sched_main iteration
> > > > 3. the only available entity is picked, and the first job from this
> > > > entity is dequeued. ->run_job() is called: the job is queued to the
> > > > ringbuf and the stream is pushed to the runnable queue (if it wasn't
> > > > queued already). Wakeup the timeslice scheduler to re-evaluate (if
> > > > the stream is not on a FW slot already)
> > > > 4. stream gets picked by the timeslice scheduler and sent to the FW for
> > > > execution
> > > > 
> > > > That's one extra step we don't really need. To sum-up, yes, all the
> > > > job/entity tracking might be interesting to share/re-use, but I wonder
> > > > if we couldn't have that without pulling out the scheduling part of
> > > > drm_sched, or maybe I'm missing something, and there's something in
> > > > drm_gpu_scheduler you really need.
> > > 
> > > On second thought, that's probably an acceptable overhead (not even
> > > sure the extra step I 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-05 Thread Matthew Brost
On Tue, Jan 03, 2023 at 01:02:15PM +, Tvrtko Ursulin wrote:
> 
> On 02/01/2023 07:30, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon  wrote:
> > 
> > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > Boris Brezillon  wrote:
> > > 
> > > > Hello Matthew,
> > > > 
> > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > Matthew Brost  wrote:
> > > > > In XE, the new Intel GPU driver, a choice has been made to have a
> > > > > 1 to 1 mapping between a drm_gpu_scheduler and drm_sched_entity.
> > > > > At first this seems a bit odd, but let us explain the reasoning below.
> > > > > 
> > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > guaranteed to match the completion order, even when targeting the
> > > > > same hardware engine. This is because in XE we have a firmware
> > > > > scheduler, the GuC, which is allowed to reorder, timeslice, and
> > > > > preempt submissions. If using a shared drm_gpu_scheduler across
> > > > > multiple drm_sched_entity, the TDR falls apart as the TDR expects
> > > > > submission order == completion order. Using a dedicated
> > > > > drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > > 
> > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > the scheduling between those N command streams, the kernel driver
> > > > does timeslice scheduling to update the command streams passed to the
> > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > because the integration with drm_sched was painful, but also because I
> > > > felt trying to bend drm_sched to make it interact with a
> > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > drm_sched_entity exclusive access to a drm_gpu_scheduler might
> > > > help for a few things (didn't think it through yet), but I feel it's
> > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > 
> > > Ok, so I just had a quick look at the Xe driver and how it
> > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > have a better understanding of how you get away with using drm_sched
> > > while still controlling how scheduling is really done. Here
> > > drm_gpu_scheduler is just a dummy abstraction that lets you use the
> > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> > > selection is dumb because there's only one entity ever bound to the
> > > scheduler (the one that's part of the xe_guc_engine object which also
> > > contains the drm_gpu_scheduler instance). I guess the main issue we'd
> > > have on Arm is the fact that the stream doesn't necessarily get
> > > scheduled when ->run_job() is called, it can be placed in the runnable
> > > queue and be picked later by the kernel-side scheduler when a FW slot
> > > gets released. That can probably be sorted out by manually disabling the
> > > job timer and re-enabling it when the stream gets picked by the
> > > scheduler. But my main concern remains, we're basically abusing
> > > drm_sched here.
> > > 
> > > For the Arm driver, that means turning the following sequence
> > > 
> > > 1. wait for job deps
> > > 2. queue job to ringbuf and push the stream to the runnable
> > > queue (if it wasn't queued already). Wakeup the timeslice scheduler
> > > to re-evaluate (if the stream is not on a FW slot already)
> > > 3. stream gets picked by the timeslice scheduler and sent to the FW for
> > > execution
> > > 
> > > into
> > > 
> > > 1. queue job to entity which takes care of waiting for job deps for
> > > us
> > > 2. schedule a drm_sched_main iteration
> > > 3. the only available entity is picked, and the first job from this
> > > entity is dequeued. ->run_job() is called: the job is queued to the
> > > ringbuf and the stream is pushed to the runnable queue (if it wasn't
> > > queued already). Wakeup the timeslice scheduler to re-evaluate (if
> > > the stream is not on a FW slot already)
> > > 4. stream gets picked by the timeslice scheduler and sent to the FW for
> > > execution
> > > 
> > > That's one extra step we don't really need. To sum-up, yes, all the
> > > job/entity tracking might be interesting to share/re-use, but I wonder
> > > if we couldn't have that without pulling out the scheduling part of
> > > drm_sched, or maybe I'm missing something, and there's something in
> > > drm_gpu_scheduler you really need.
> > 
> > On second thought, that's probably an acceptable overhead (not even
> > sure the extra step I was mentioning exists in practice, because dep
> > fence signaled state is checked as part of the drm_sched_main
> > iteration, so that's basically replacing the worker I schedule to
> > check job deps), and I like the idea of being able to re-use 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-03 Thread Boris Brezillon
Hi,

On Tue, 3 Jan 2023 13:02:15 +
Tvrtko Ursulin  wrote:

> On 02/01/2023 07:30, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon  wrote:
> >   
> >> On Fri, 30 Dec 2022 11:20:42 +0100
> >> Boris Brezillon  wrote:
> >>  
> >>> Hello Matthew,
> >>>
> >>> On Thu, 22 Dec 2022 14:21:11 -0800
> >>> Matthew Brost  wrote:
> >>>  
>  In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
>  mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>  seems a bit odd, but let us explain the reasoning below.
> 
>  1. In XE the submission order from multiple drm_sched_entity is not
>  guaranteed to match the completion order, even when targeting the same
>  hardware engine. This is because in XE we have a firmware scheduler, the
>  GuC, which is allowed to reorder, timeslice, and preempt submissions. If
>  using a shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR
>  falls apart as the TDR expects submission order == completion order. Using
>  a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> >>>
> >>> Oh, that's interesting. I've been trying to solve the same sort of
> >>> issues to support Arm's new Mali GPU which is relying on a FW-assisted
> >>> scheduling scheme (you give the FW N streams to execute, and it does
> >>> the scheduling between those N command streams, the kernel driver
> >>> does timeslice scheduling to update the command streams passed to the
> >>> FW). I must admit I gave up on using drm_sched at some point, mostly
> >>> because the integration with drm_sched was painful, but also because I
> >>> felt trying to bend drm_sched to make it interact with a
> >>> timeslice-oriented scheduling model wasn't really future proof. Giving
> >>> drm_sched_entity exclusive access to a drm_gpu_scheduler might
> >>> help for a few things (didn't think it through yet), but I feel it's
> >>> coming short on other aspects we have to deal with on Arm GPUs.  
> >>
> >> Ok, so I just had a quick look at the Xe driver and how it
> >> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> >> have a better understanding of how you get away with using drm_sched
> >> while still controlling how scheduling is really done. Here
> >> drm_gpu_scheduler is just a dummy abstraction that lets you use the
> >> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> >> selection is dumb because there's only one entity ever bound to the
> >> scheduler (the one that's part of the xe_guc_engine object which also
> >> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> >> have on Arm is the fact that the stream doesn't necessarily get
> >> scheduled when ->run_job() is called, it can be placed in the runnable
> >> queue and be picked later by the kernel-side scheduler when a FW slot
> >> gets released. That can probably be sorted out by manually disabling the
> >> job timer and re-enabling it when the stream gets picked by the
> >> scheduler. But my main concern remains, we're basically abusing
> >> drm_sched here.
> >>
> >> For the Arm driver, that means turning the following sequence
> >>
> >> 1. wait for job deps
> >> 2. queue job to ringbuf and push the stream to the runnable
> >> queue (if it wasn't queued already). Wakeup the timeslice scheduler
> >> to re-evaluate (if the stream is not on a FW slot already)
> >> 3. stream gets picked by the timeslice scheduler and sent to the FW for
> >> execution
> >>
> >> into
> >>
> >> 1. queue job to entity which takes care of waiting for job deps for
> >> us
> >> 2. schedule a drm_sched_main iteration
> >> 3. the only available entity is picked, and the first job from this
> >> entity is dequeued. ->run_job() is called: the job is queued to the
> >> ringbuf and the stream is pushed to the runnable queue (if it wasn't
> >> queued already). Wakeup the timeslice scheduler to re-evaluate (if
> >> the stream is not on a FW slot already)
> >> 4. stream gets picked by the timeslice scheduler and sent to the FW for
> >> execution
> >>
> >> That's one extra step we don't really need. To sum-up, yes, all the
> >> job/entity tracking might be interesting to share/re-use, but I wonder
> >> if we couldn't have that without pulling out the scheduling part of
> >> drm_sched, or maybe I'm missing something, and there's something in
> >> drm_gpu_scheduler you really need.  
> > 
> > On second thought, that's probably an acceptable overhead (not even
> > sure the extra step I was mentioning exists in practice, because dep
> > fence signaled state is checked as part of the drm_sched_main
> > iteration, so that's basically replacing the worker I schedule to
> > check job deps), and I like the idea of being able to re-use drm_sched
> > dep-tracking without resorting to invasive changes to the existing
> > logic, so I'll probably give it a try.  
> 
> I agree with the concerns 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-03 Thread Tvrtko Ursulin



On 02/01/2023 07:30, Boris Brezillon wrote:

On Fri, 30 Dec 2022 12:55:08 +0100
Boris Brezillon  wrote:


On Fri, 30 Dec 2022 11:20:42 +0100
Boris Brezillon  wrote:


Hello Matthew,

On Thu, 22 Dec 2022 14:21:11 -0800
Matthew Brost  wrote:
   

In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd, but let us explain the reasoning below.

1. In XE the submission order from multiple drm_sched_entity is not
guaranteed to match the completion order, even when targeting the same
hardware engine. This is because in XE we have a firmware scheduler, the
GuC, which is allowed to reorder, timeslice, and preempt submissions. If
using a shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR
falls apart as the TDR expects submission order == completion order. Using
a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
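The ordering hazard in point 1 can be made concrete with a small sketch (illustrative Python, not drm_sched code; the context and job names are invented):

```python
# Toy illustration of why a shared scheduler's TDR breaks under a firmware
# scheduler: the TDR blames the oldest job in the shared submission queue,
# which is only correct if completion order matches submission order.

# Order jobs were submitted to one shared drm_gpu_scheduler:
submission_order = ["ctxA.job0", "ctxB.job0", "ctxA.job1"]

# A firmware scheduler (e.g. the GuC) may reorder/timeslice/preempt, so the
# job actually occupying the hardware when a timeout fires can differ:
actually_running = "ctxB.job0"

# Shared-scheduler TDR: blames the head of the shared queue.
blamed_shared = submission_order[0]
print(blamed_shared, actually_running)   # ctxA.job0 ctxB.job0 -- wrong job

# Dedicated scheduler per entity: each queue is FIFO with respect to
# itself, so the head of the hung entity's own queue is the right job.
per_entity = {"ctxA": ["ctxA.job0", "ctxA.job1"], "ctxB": ["ctxB.job0"]}
blamed_dedicated = per_entity["ctxB"][0]
print(blamed_dedicated == actually_running)   # True
```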


Oh, that's interesting. I've been trying to solve the same sort of
issues to support Arm's new Mali GPU which is relying on a FW-assisted
scheduling scheme (you give the FW N streams to execute, and it does
the scheduling between those N command streams, the kernel driver
does timeslice scheduling to update the command streams passed to the
FW). I must admit I gave up on using drm_sched at some point, mostly
because the integration with drm_sched was painful, but also because I
felt trying to bend drm_sched to make it interact with a
timeslice-oriented scheduling model wasn't really future proof. Giving
drm_sched_entity exclusive access to a drm_gpu_scheduler might
help for a few things (didn't think it through yet), but I feel it's
coming short on other aspects we have to deal with on Arm GPUs.


Ok, so I just had a quick look at the Xe driver and how it
instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
have a better understanding of how you get away with using drm_sched
while still controlling how scheduling is really done. Here
drm_gpu_scheduler is just a dummy abstraction that lets you use the
drm_sched job queuing/dep/tracking mechanism. The whole run-queue
selection is dumb because there's only one entity ever bound to the
scheduler (the one that's part of the xe_guc_engine object which also
contains the drm_gpu_scheduler instance). I guess the main issue we'd
have on Arm is the fact that the stream doesn't necessarily get
scheduled when ->run_job() is called, it can be placed in the runnable
queue and be picked later by the kernel-side scheduler when a FW slot
gets released. That can probably be sorted out by manually disabling the
job timer and re-enabling it when the stream gets picked by the
scheduler. But my main concern remains, we're basically abusing
drm_sched here.

For the Arm driver, that means turning the following sequence

1. wait for job deps
2. queue job to ringbuf and push the stream to the runnable
queue (if it wasn't queued already). Wakeup the timeslice scheduler
to re-evaluate (if the stream is not on a FW slot already)
3. stream gets picked by the timeslice scheduler and sent to the FW for
execution

into

1. queue job to entity which takes care of waiting for job deps for
us
2. schedule a drm_sched_main iteration
3. the only available entity is picked, and the first job from this
entity is dequeued. ->run_job() is called: the job is queued to the
ringbuf and the stream is pushed to the runnable queue (if it wasn't
queued already). Wakeup the timeslice scheduler to re-evaluate (if
the stream is not on a FW slot already)
4. stream gets picked by the timeslice scheduler and sent to the FW for
execution

That's one extra step we don't really need. To sum-up, yes, all the
job/entity tracking might be interesting to share/re-use, but I wonder
if we couldn't have that without pulling out the scheduling part of
drm_sched, or maybe I'm missing something, and there's something in
drm_gpu_scheduler you really need.
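The drm_sched-based sequence just described can be sketched in a few lines (a minimal Python model with invented names; this is not the Arm or Xe driver code). The entity holds jobs until their dependency fences signal, and the run_job() step pushes the stream onto the runnable queue for the FW-slot scheduler to pick up:

```python
from collections import deque

runnable = deque()        # streams waiting for a FW slot (step 4's input)

class Entity:
    def __init__(self, name):
        self.name = name
        self.jobs = deque()

    def queue_job(self, job, deps):
        # step 1: the drm_sched entity tracks dependencies for us
        self.jobs.append((job, deps))

def sched_main_iteration(entity, signaled):
    # steps 2-3: dep fence state is checked as part of the main iteration;
    # run_job() then queues to the ring and marks the stream runnable
    if entity.jobs and entity.jobs[0][1] <= signaled:
        job, _ = entity.jobs.popleft()
        if entity.name not in runnable:
            runnable.append(entity.name)
        return job
    return None

ent = Entity("stream0")
ent.queue_job("job0", deps={"fenceA"})
print(sched_main_iteration(ent, signaled=set()))        # None: dep pending
print(sched_main_iteration(ent, signaled={"fenceA"}))   # job0
print(list(runnable))                                   # ['stream0']
```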


On second thought, that's probably an acceptable overhead (not even
sure the extra step I was mentioning exists in practice, because dep
fence signaled state is checked as part of the drm_sched_main
iteration, so that's basically replacing the worker I schedule to
check job deps), and I like the idea of being able to re-use drm_sched
dep-tracking without resorting to invasive changes to the existing
logic, so I'll probably give it a try.


I agree with the concerns and think that how Xe proposes to integrate 
with drm_sched is a problem, or at least significantly inelegant.


AFAICT it proposes to have a 1:1 mapping between *userspace* created contexts 
(per context _and_ engine) and drm_sched. I am not sure avoiding invasive 
changes to the shared code is in the spirit of the overall idea; instead the 
opportunity should be used to look at ways to refactor/improve 
drm_sched.


Even on the low level, the idea to replace drm_sched threads with 
workers has a few problems.


To start with, the pattern 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2022-12-28 Thread Matthew Brost
On Fri, Dec 23, 2022 at 09:42:58AM -0800, Rob Clark wrote:
> On Thu, Dec 22, 2022 at 2:29 PM Matthew Brost  wrote:
> >
> > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > seems a bit odd, but let us explain the reasoning below.
> >
> > 1. In XE the submission order from multiple drm_sched_entity is not
> > guaranteed to match the completion order, even when targeting the same
> > hardware engine. This is because in XE we have a firmware scheduler, the
> > GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> > using a shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> > falls apart as the TDR expects submission order == completion order. Using
> > a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> >
> > 2. In XE submissions are done via programming a ring buffer (circular
> > buffer); a drm_gpu_scheduler provides a limit on the number of in-flight
> > jobs, and if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
> > control on the ring for free.
> >
> > A problem with this design is that currently a drm_gpu_scheduler uses a
> > kthread for submission / job cleanup. This doesn't scale if a large
> > number of drm_gpu_schedulers are used. To work around the scaling issue,
> > use a worker rather than a kthread for submission / job cleanup.
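Point 2 of the commit message can be sketched numerically. A toy Python model under assumed values (RING_SIZE and MAX_SIZE_PER_JOB are made up; this is not the Xe implementation): if every job occupies at most MAX_SIZE_PER_JOB bytes of the ring, capping in-flight jobs at RING_SIZE / MAX_SIZE_PER_JOB means the ring can never overflow, so the scheduler's existing job limit doubles as ring flow control.

```python
RING_SIZE = 16 * 1024          # assumed ring size
MAX_SIZE_PER_JOB = 1024        # assumed worst-case per-job footprint
HW_SUBMISSION_LIMIT = RING_SIZE // MAX_SIZE_PER_JOB   # 16 jobs

class ToyScheduler:
    """Models only the job-limit aspect of a drm_gpu_scheduler."""
    def __init__(self, limit):
        self.limit = limit
        self.in_flight = 0

    def can_queue(self):
        # The scheduler only dispatches while in-flight jobs < limit,
        # which here also guarantees ring space for the next job.
        return self.in_flight < self.limit

    def run_job(self):
        assert self.can_queue(), "would overflow the ring"
        self.in_flight += 1

    def job_done(self):
        self.in_flight -= 1

sched = ToyScheduler(HW_SUBMISSION_LIMIT)
for _ in range(HW_SUBMISSION_LIMIT):
    sched.run_job()
print(sched.can_queue())   # False: ring is "full", submission must wait
sched.job_done()
print(sched.can_queue())   # True: a completion frees ring space
```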
> 
> You might want to enable CONFIG_DRM_MSM in your kconfig, I think you
> missed a part
> 
> BR,
> -R
> 

Thanks for the feedback, Rob; yes, indeed we missed updating the MSM driver. Fixed up
in our Xe repo and will be fixed in the next rev on the list.

Matt
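The scaling argument behind the patch can be illustrated outside the kernel (a hedged Python analogy, not kernel code): with one kthread per drm_gpu_scheduler, N schedulers cost N threads even when idle, whereas with a work queue each scheduler only queues a small work item and a shared pool of workers services all of them.

```python
from concurrent.futures import ThreadPoolExecutor

N_SCHEDULERS = 1000            # e.g. one scheduler per userspace context/engine

def submission_work(sched_id):
    # Stand-in for one drm_sched_main() pass: dequeue a job, call run_job().
    return sched_id * 2

# Shared pool (the "work queue"): a handful of workers drain all 1000
# schedulers' work items, instead of 1000 dedicated kthreads.
with ThreadPoolExecutor(max_workers=4) as wq:
    results = list(wq.map(submission_work, range(N_SCHEDULERS)))

print(len(results))        # 1000
print(results[:3])         # [0, 2, 4]
```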

> > Signed-off-by: Matthew Brost 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +--
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  12 +-
> >  drivers/gpu/drm/scheduler/sched_main.c  | 124 
> >  include/drm/gpu_scheduler.h |  13 +-
> >  4 files changed, 93 insertions(+), 70 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > index f60753f97ac5..9c2a10aeb0b3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> > @@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct 
> > seq_file *m, void *unused)
> > for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> > struct amdgpu_ring *ring = adev->rings[i];
> >
> > -   if (!ring || !ring->sched.thread)
> > +   if (!ring || !ring->sched.ready)
> > continue;
> > -   kthread_park(ring->sched.thread);
> > +   drm_sched_run_wq_stop(&ring->sched);
> > }
> >
> > seq_printf(m, "run ib test:\n");
> > @@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct 
> > seq_file *m, void *unused)
> > for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> > struct amdgpu_ring *ring = adev->rings[i];
> >
> > -   if (!ring || !ring->sched.thread)
> > +   if (!ring || !ring->sched.ready)
> > continue;
> > -   kthread_unpark(ring->sched.thread);
> > +   drm_sched_run_wq_start(&ring->sched);
> > }
> >
> > up_write(&adev->reset_domain->sem);
> > @@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> > val)
> >
> > ring = adev->rings[val];
> >
> > -   if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
> > +   if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
> > return -EINVAL;
> >
> > /* the last preemption failed */
> > @@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> > val)
> > goto pro_end;
> >
> > /* stop the scheduler */
> > -   kthread_park(ring->sched.thread);
> > +   drm_sched_run_wq_stop(&ring->sched);
> >
> > /* preempt the IB */
> > r = amdgpu_ring_preempt_ib(ring);
> > @@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> > val)
> >
> >  failure:
> > /* restart the scheduler */
> > -   kthread_unpark(ring->sched.thread);
> > +   drm_sched_run_wq_start(&ring->sched);
> >
> > up_read(&adev->reset_domain->sem);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 076ae400d099..9552929ccf87 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4577,7 +4577,7 @@ bool amdgpu_device_has_job_running(struct 
> > amdgpu_device *adev)
> > for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > struct amdgpu_ring *ring = adev->rings[i];
> >
> > -   if (!ring || !ring->sched.thread)
> > +   if (!ring || !ring->sched.ready)
> > continue;
> >
> > 

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2022-12-23 Thread Rob Clark
On Thu, Dec 22, 2022 at 2:29 PM Matthew Brost  wrote:
>
> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> seems a bit odd, but let us explain the reasoning below.
>
> 1. In XE the submission order from multiple drm_sched_entity is not
> guaranteed to match the completion order, even when targeting the same
> hardware engine. This is because in XE we have a firmware scheduler, the
> GuC, which is allowed to reorder, timeslice, and preempt submissions. If
> using a shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR
> falls apart as the TDR expects submission order == completion order. Using
> a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
>
> 2. In XE submissions are done via programming a ring buffer (circular
> buffer); a drm_gpu_scheduler provides a limit on the number of in-flight
> jobs, and if that limit is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
> control on the ring for free.
>
> A problem with this design is that currently a drm_gpu_scheduler uses a
> kthread for submission / job cleanup. This doesn't scale if a large
> number of drm_gpu_schedulers are used. To work around the scaling issue,
> use a worker rather than a kthread for submission / job cleanup.

You might want to enable CONFIG_DRM_MSM in your kconfig, I think you
missed a part

BR,
-R

> Signed-off-by: Matthew Brost 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  12 +-
>  drivers/gpu/drm/scheduler/sched_main.c  | 124 
>  include/drm/gpu_scheduler.h |  13 +-
>  4 files changed, 93 insertions(+), 70 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index f60753f97ac5..9c2a10aeb0b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1489,9 +1489,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
> *m, void *unused)
> for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> struct amdgpu_ring *ring = adev->rings[i];
>
> -   if (!ring || !ring->sched.thread)
> +   if (!ring || !ring->sched.ready)
> continue;
> -   kthread_park(ring->sched.thread);
> +   drm_sched_run_wq_stop(&ring->sched);
> }
>
> seq_printf(m, "run ib test:\n");
> @@ -1505,9 +1505,9 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file 
> *m, void *unused)
> for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
> struct amdgpu_ring *ring = adev->rings[i];
>
> -   if (!ring || !ring->sched.thread)
> +   if (!ring || !ring->sched.ready)
> continue;
> -   kthread_unpark(ring->sched.thread);
> +   drm_sched_run_wq_start(&ring->sched);
> }
>
> up_write(&adev->reset_domain->sem);
> @@ -1727,7 +1727,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> val)
>
> ring = adev->rings[val];
>
> -   if (!ring || !ring->funcs->preempt_ib || !ring->sched.thread)
> +   if (!ring || !ring->funcs->preempt_ib || !ring->sched.ready)
> return -EINVAL;
>
> /* the last preemption failed */
> @@ -1745,7 +1745,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> val)
> goto pro_end;
>
> /* stop the scheduler */
> -   kthread_park(ring->sched.thread);
> +   drm_sched_run_wq_stop(&ring->sched);
>
> /* preempt the IB */
> r = amdgpu_ring_preempt_ib(ring);
> @@ -1779,7 +1779,7 @@ static int amdgpu_debugfs_ib_preempt(void *data, u64 
> val)
>
>  failure:
> /* restart the scheduler */
> -   kthread_unpark(ring->sched.thread);
> +   drm_sched_run_wq_start(&ring->sched);
>
> up_read(&adev->reset_domain->sem);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 076ae400d099..9552929ccf87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4577,7 +4577,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device 
> *adev)
> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> struct amdgpu_ring *ring = adev->rings[i];
>
> -   if (!ring || !ring->sched.thread)
> +   if (!ring || !ring->sched.ready)
> continue;
>
> spin_lock(&ring->sched.job_list_lock);
> @@ -4708,7 +4708,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> *adev,
> for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> struct amdgpu_ring *ring = adev->rings[i];
>
> -   if (!ring || !ring->sched.thread)
> +   if (!ring || !ring->sched.ready)
> continue;
>
> /*clear job fence from fence drv to avoid