On 21.08.23 at 21:46, Faith Ekstrand wrote:
On Mon, Aug 21, 2023 at 1:13 PM Christian König
<christian.koe...@amd.com> wrote:
[SNIP]
So as long as nobody from userspace comes and says we absolutely need to
optimize this use case I would rather not do it.
This is a place where nouveau's needs are legitimately different from
AMD or Intel, I think. NVIDIA's command streamer model is very
different from AMD and Intel. On AMD and Intel, each EXEC turns into
a single small packet (on the order of 16B) which kicks off a command
buffer. There may be a bit of cache management or something around it
but that's it. From there, it's userspace's job to make one command
buffer chain to another until it's finally done and then do a
"return", whatever that looks like.
NVIDIA's model is much more static. Each packet in the HW/FW ring is
an address and a size and that much data is processed and then it
grabs the next packet and processes. The result is that, if we use
multiple buffers of commands, there's no way to chain them together.
We just have to pass the whole list of buffers to the kernel.
So far that is actually completely identical to what AMD has.
A single EXEC ioctl / job may have 500 such addr+size packets
depending on how big the command buffer is.
And that is what I don't understand. Why would you need hundreds of such
addr+size packets?
This is basically identical to what AMD has (well on newer hw there is
an extension in the CP packets to JUMP/CALL subsequent IBs, but this
isn't widely used as far as I know).
Previously the limit was something like 4 which we extended to because
Bas came up with similar requirements for the AMD side from RADV.
But essentially, approaches with hundreds of IBs don't sound like
a good idea to me.
It gets worse on pre-Turing hardware where we have to split the batch
for every single DrawIndirect or DispatchIndirect.
Lest you think NVIDIA is just crazy here, it's a perfectly reasonable
model if you assume that userspace is feeding the firmware. When
that's happening, you just have a userspace thread that sits there and
feeds the ringbuffer with whatever is next and you can marshal as much
data through as you want. Sure, it'd be nice to have a 2nd level batch
thing that gets launched from the FW ring and has all the individual
launch commands but it's not at all necessary.
What does that mean from a gpu_scheduler PoV? Basically, it means a
variable packet size.
What does this mean for implementation? IDK. One option would be to
teach the scheduler about actual job sizes. Another would be to
virtualize it and have another layer underneath the scheduler that
does the actual feeding of the ring. Another would be to decrease the
job size somewhat and then have the front-end submit as many jobs as
it needs to service userspace and only put the out-fences on the last
job. All the options kinda suck.
Yeah, agree. The job size Danilo suggested is still the least painful.
Christian.
~Faith
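
For illustration only, here is a minimal sketch of the "teach the scheduler about actual job sizes" option discussed above. It is not code from this thread, from nouveau, or from drm_sched: every type and function name is hypothetical, and the ring size is left to the caller. The idea is simply that the ring is accounted in addr+size slots rather than in jobs, so a job made of 500 packets reserves 500 slots and a later job waits until enough slots have been freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not nouveau or drm_sched structures. */

/* One ring entry: the address and size of a chunk of commands. */
struct push_entry {
	uint64_t addr;
	uint32_t size;
};

/* One EXEC job: a variable number of ring entries. */
struct exec_job {
	const struct push_entry *pushes;
	uint32_t num_pushes;	/* ring slots this job consumes */
};

/* Ring accounting in units of slots rather than jobs. */
struct ring_credits {
	uint32_t limit;		/* total slots in the ring (assumed value) */
	uint32_t in_use;	/* slots held by jobs still in flight */
};

/* True if the job fits into the slots that are currently free. */
static bool ring_can_submit(const struct ring_credits *rc,
			    const struct exec_job *job)
{
	return job->num_pushes <= rc->limit - rc->in_use;
}

/* Reserve slots for the job; returns false if it has to wait. */
static bool ring_try_submit(struct ring_credits *rc,
			    const struct exec_job *job)
{
	if (!ring_can_submit(rc, job))
		return false;
	rc->in_use += job->num_pushes;
	return true;
}

/* Give the slots back once the job's fence has signaled. */
static void ring_complete(struct ring_credits *rc,
			  const struct exec_job *job)
{
	assert(rc->in_use >= job->num_pushes);
	rc->in_use -= job->num_pushes;
}
```

Note that in this sketch a job with more packets than the ring has slots can never be submitted, so a real implementation would have to reject it up front or split it, which is the trade-off the other options above try to address.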