sched: Convert drm scheduler to use a work queue rather than kthread

Christian König Tue, 22 Aug 2023 02:51:31 -0700

On 21.08.23 at 21:46, Faith Ekstrand wrote:

On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koe...@amd.com> wrote:
    [SNIP]
    So as long as nobody from userspace comes and says we absolutely
    need to optimize this use case I would rather not do it.
This is a place where nouveau's needs are legitimately different from AMD or Intel, I think. NVIDIA's command streamer model is very different from AMD and Intel. On AMD and Intel, each EXEC turns into a single small packet (on the order of 16B) which kicks off a command buffer. There may be a bit of cache management or something around it but that's it. From there, it's userspace's job to make one command buffer chain to another until it's finally done and then do a "return", whatever that looks like.
NVIDIA's model is much more static. Each packet in the HW/FW ring is an address and a size and that much data is processed and then it grabs the next packet and processes. The result is that, if we use multiple buffers of commands, there's no way to chain them together. We just have to pass the whole list of buffers to the kernel.


So far that is actually completely identical to what AMD has.

A single EXEC ioctl / job may have 500 such addr+size packets depending on how big the command buffer is.

And that is what I don't understand. Why would you need hundreds of such addr+size packets?

This is basically identical to what AMD has (well, on newer hw there is an extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't widely used as far as I know).

Previously the limit was something like 4, which we extended because Bas came up with similar requirements for the AMD side from RADV.

But essentially those approaches with hundreds of IBs don't sound like a good idea to me.

It gets worse on pre-Turing hardware where we have to split the batch for every single DrawIndirect or DispatchIndirect.
Lest you think NVIDIA is just crazy here, it's a perfectly reasonable model if you assume that userspace is feeding the firmware. When that's happening, you just have a userspace thread that sits there and feeds the ringbuffer with whatever is next and you can marshal as much data through as you want. Sure, it'd be nice to have a 2nd level batch thing that gets launched from the FW ring and has all the individual launch commands but it's not at all necessary.
What does that mean from a gpu_scheduler PoV? Basically, it means a variable packet size.
What does this mean for implementation? IDK. One option would be to teach the scheduler about actual job sizes. Another would be to virtualize it and have another layer underneath the scheduler that does the actual feeding of the ring. Another would be to decrease the job size somewhat and then have the front-end submit as many jobs as it needs to service userspace and only put the out-fences on the last job. All the options kinda suck.
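
For illustration only, the first and third options above might be sketched roughly as follows. This is not the drm_sched API: every name here (ring_credits, job, ring_submit, ring_retire, split_count) is hypothetical, and the sketch assumes a job's cost can be expressed as the number of addr+size packets it places in the ring.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types and names; none of this is the real drm_sched API. */
struct ring_credits {
	uint32_t capacity;  /* total addr+size packet slots in the ring */
	uint32_t in_flight; /* slots held by submitted, not yet retired jobs */
};

struct job {
	uint32_t num_packets; /* addr+size packets this job pushes */
};

/* Option 1: the scheduler knows the actual job size and only submits a
 * job when the ring has enough free slots for all of its packets. */
static bool ring_can_submit(const struct ring_credits *r, const struct job *j)
{
	return j->num_packets <= r->capacity - r->in_flight;
}

static bool ring_submit(struct ring_credits *r, const struct job *j)
{
	if (!ring_can_submit(r, j))
		return false; /* caller must wait for a job to retire */
	r->in_flight += j->num_packets;
	return true;
}

/* Called when the job's fence signals: give its slots back. */
static void ring_retire(struct ring_credits *r, const struct job *j)
{
	r->in_flight -= j->num_packets;
}

/* Option 3: cap the per-job size and have the front-end split one EXEC
 * into several jobs; only the last one would carry the out-fences.
 * max_per_job must be non-zero. */
static uint32_t split_count(uint32_t num_packets, uint32_t max_per_job)
{
	return num_packets / max_per_job + (num_packets % max_per_job != 0);
}
```

With a 1024-slot ring, two 500-packet jobs fit but a further 100-packet job has to wait until one of them retires; under the split approach with a hypothetical cap of 64 packets per job, a 500-packet EXEC would become 8 jobs.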


Yeah, agree. The job size Danilo suggested is still the least painful.

Christian.


~Faith

Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
