On Thu, Jan 12, 2023 at 04:39:32PM -0800, John Harrison wrote: > On 1/11/2023 14:56, Jason Ekstrand wrote: > > On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost <matthew.br...@intel.com> > > wrote: > > > > On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote: > > > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin < > > > tvrtko.ursu...@linux.intel.com> wrote: > > > > > > > > > [snip] > > > > > > > > Typically is the key here. But I am not sure it is good > > enough. Consider > > > > this example - Intel Flex 170: > > > > > > > > * Delivers up to 36 streams 1080p60 transcode throughput per > > card. > > > > * When scaled to 10 cards in a 4U server configuration, it > > can support > > > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput. > > > > > > > > > > I had a feeling it was going to be media.... 😅 > > > > > > > Yea wondering the media UMD can be rewritten to use less > > xe_engines, it > > is massive rewrite for VM bind + no implicit dependencies so let's > > just > > pile on some more work? > > > > > > It could probably use fewer than it does today. It currently creates > > and throws away contexts like crazy, or did last I looked at it. > > However, the nature of media encode is that it often spreads across two > > or three different types of engines. There's not much you can do to > > change that. > And as per Tvrtko's example, you get media servers that transcode huge > numbers of tiny streams in parallel. Almost no work per frame but 100s of > independent streams being run concurrently. That means many 100s of contexts > all trying to run at 30fps. I recall a specific bug about thundering herds - > hundreds (thousands?) of waiting threads all being woken up at once because > some request had completed. > > > > > > > > One transcode stream from my experience typically is 3-4 GPU > > contexts > > > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from > > a single > > > > CPU thread. 4 contexts * 36 streams = 144 active contexts. > > Multiply by > > > > 60fps = 8640 jobs submitted and completed per second. > > > > > > > > 144 active contexts in the proposed scheme means possibly > > means 144 > > > > kernel worker threads spawned (driven by 36 transcode CPU > > threads). (I > > > > don't think the pools would scale down given all are > > constantly pinged > > > > at 60fps.) > > > > > > > > And then each of 144 threads goes to grab the single GuC CT > > mutex. First > > > > threads are being made schedulable, then put to sleep as mutex > > > > contention is hit, then woken again as mutexes are getting > > released, > > > > rinse, repeat. > > > > > > > > > > Why is every submission grabbing the GuC CT mutex? I've not read > > the GuC > > > back-end yet but I was under the impression that most run_job() > > would be > > > just shoving another packet into a ring buffer. If we have to > > send the GuC > > > a message on the control ring every single time we submit a job, > > that's > > > pretty horrible. > > > > > > > Run job writes the ring buffer and moves the tail as the first > > step (no > > lock required). Next it needs to tell the GuC the xe_engine LRC > > tail has > > moved, this is done from a single Host to GuC channel which is > > circular > > buffer, the writing of the channel protected by the mutex. There are > > little more nuances too but in practice there is always space in the > > channel so the time mutex needs to held is really, really small > > (check cached credits, write 3 dwords in payload, write 1 dword to > > move > > tail). I also believe mutexes in Linux are hybrid where they spin > > for a > > little bit before sleeping and certainly if there is space in the > > channel we shouldn't sleep mutex contention. > > > > > > Ok, that makes sense. It's maybe a bit clunky and it'd be nice if we > > had some way to batch things up a bit so we only have to poke the GuC > > channel once for every batch of things rather than once per job. That's > > maybe something we can look into as a future improvement; not > > fundamental. > > > > Generally, though, it sounds like contention could be a real problem if > > we end up ping-ponging that lock between cores. It's going to depend on > > how much work it takes to get the next ready thing vs. the cost of that > > atomic. But, also, anything we do is going to potentially run into > > contention problems. *shrug* If we were going to go for > > one-per-HW-engine, we may as well go one-per-device and then we wouldn't > > need the lock. Off the top of my head, that doesn't sound great either > > but IDK. > > > > As far as this being horrible, well didn't design the GuC and this how > > it is implemented for KMD based submission. We also have 256 doorbells > > so we wouldn't need a lock but I think are other issues with that > > design > > too which need to be worked out in the Xe2 / Xe3 timeframe. > > > > > > Yeah, not blaming you. Just surprised, that's all. How does it work > > for userspace submission? What would it look like if the kernel > > emulated userspace submission? Is that even possible? > > > > What are these doorbell things? How do they play into it? > Basically a bank of MMIO space reserved per 'entity' where a write to that > MMIO space becomes an named interrupt to GuC. You can assign each doorbell > to a specific GuC context. So writing to that doorbell address is > effectively the same as sending a SCHEDULE_CONTEXT H2G message from the KMD > for that context. But the advantage is you ring the doorbell from user land > with no call into the kernel at all. Or from within the kernel, you can do > it without needing any locks at all. Problem is, we have 64K contexts in GuC > but only 256 doorbells in the hardware. Less if using SRIOV. So the "per > 'entity'" part because somewhat questionable as to exactly what the 'entity' > is. And hence we just haven't bothered supporting them in Linux because a) > no direct submission from user land yet, and b) as Matthew says entire chain > of IOCTL from UMD to kernel to acquiring a lock and sending the H2G has > generally been fast enough. The latency only becomes an issue for ULLS > people but for them, even the doorbells from user space are too high a > latency because that still potentially involves the GuC having to do some > scheduling and context switch type action. > > John. >
I talked with Jason on IRC last week about doorbells and we came up with the idea after chatting to allocate the doorbells with a greedy algorithm which results in the first 256 xe_engine each getting their own doorbell thus avoid contention on the CT channel / lock (this is still KMD submission). Coded up a prototype for this and initial test results of xe_exec_threads /w 245 user xe_engines, 5 threads, and 40k total execs are an average of .824s vs. 923s for /w and w/o doorbells. Or in other words 49714 execs per seconds /w doorbells vs. 44353 without. This seems to indicate using doorbells can provide a performance improvement. Also Jason and I reasoned we should be able to use doorbells 99% of the time aside from maybe some wacky media use cases. I also plan on following up with the media UMD to see if we can get them to use less xe_engines. Matt > > > Also if you see my follow up response Xe is ~33k execs per second with > > the current implementation on a 8 core (or maybe 8 thread) TGL which > > seems to fine to me. > > > > > > 33k exec/sec is about 500/frame which should be fine. 500 is a lot for a > > single frame. I typically tell game devs to shoot for dozens per > > frame. The important thing is that it stays low even with hundreds of > > memory objects bound. (Xe should be just fine there.) > > > > --Jason > > > > Matt > > > > > --Jason > > > > > > > > > (And yes this backend contention is there regardless of 1:1:1, > > it would > > > > require a different re-design to solve that. But it is just a > > question > > > > whether there are 144 contending threads, or just 6 with the > > thread per > > > > engine class scheme.) > > > > > > > > Then multiply all by 10 for a 4U server use case and you get > > 1440 worker > > > > kthreads, yes 10 more CT locks, but contending on how many CPU > > cores? > > > > Just so they can grab a timeslice and maybe content on a mutex > > as the > > > > next step. > > > > > > > > This example is where it would hurt on large systems. Imagine > > only an > > > > even wider media transcode card... > > > > > > > > Second example is only a single engine class used (3d > > desktop?) but with > > > > a bunch of not-runnable jobs queued and waiting on a fence to > > signal. > > > > Implicit or explicit dependencies doesn't matter. Then the > > fence signals > > > > and call backs run. N work items get scheduled, but they all > > submit to > > > > the same HW engine. So we end up with: > > > > > > > > /-- wi1 --\ > > > > / .. .. \ > > > > cb --+--- wi.. ---+-- rq1 -- .. -- rqN > > > > \ .. .. / > > > > \-- wiN --/ > > > > > > > > > > > > All that we have achieved is waking up N CPUs to contend on > > the same > > > > lock and effectively insert the job into the same single HW > > queue. I > > > > don't see any positives there. > > > > > > > > This example I think can particularly hurt small / low power > > devices > > > > because of needless waking up of many cores for no benefit. > > Granted, I > > > > don't have a good feel on how common this pattern is in practice. > > > > > > > > > > > > > > That > > > > > is the number which drives the maximum number of > > not-runnable jobs > > > > that > > > > > can become runnable at once, and hence spawn that many > > work items, > > > > and > > > > > in turn unbound worker threads. > > > > > > > > > > Several problems there. > > > > > > > > > > It is fundamentally pointless to have potentially that > > many more > > > > > threads > > > > > than the number of CPU cores - it simply creates a > > scheduling storm. > > > > > > > > > > Unbound workers have no CPU / cache locality either and > > no connection > > > > > with the CPU scheduler to optimize scheduling patterns. > > This may > > > > matter > > > > > either on large systems or on small ones. Whereas the > > current design > > > > > allows for scheduler to notice userspace CPU thread > > keeps waking up > > > > the > > > > > same drm scheduler kernel thread, and so it can keep > > them on the same > > > > > CPU, the unbound workers lose that ability and so 2nd > > CPU might be > > > > > getting woken up from low sleep for every submission. > > > > > > > > > > Hence, apart from being a bit of a impedance mismatch, > > the proposal > > > > has > > > > > the potential to change performance and power patterns > > and both large > > > > > and small machines. > > > > > > > > > > > > > > > Ok, thanks for explaining the issue you're seeing in more > > detail. Yes, > > > > > deferred kwork does appear to mismatch somewhat with what > > the scheduler > > > > > needs or at least how it's worked in the past. How much > > impact will > > > > > that mismatch have? Unclear. > > > > > > > > > > > >>> Secondly, it probably demands separate > > workers (not > > > > > optional), > > > > > > otherwise > > > > > > >>> behaviour of shared workqueues has either > > the potential > > > > to > > > > > > explode number > > > > > > >>> kernel threads anyway, or add latency. > > > > > > >>> > > > > > > >> > > > > > > >> Right now the system_unbound_wq is used which > > does have a > > > > > limit > > > > > > on the > > > > > > >> number of threads, right? I do have a FIXME > > to allow a > > > > > worker to be > > > > > > >> passed in similar to TDR. > > > > > > >> > > > > > > >> WRT to latency, the 1:1 ratio could actually > > have lower > > > > > latency > > > > > > as 2 GPU > > > > > > >> schedulers can be pushing jobs into the backend / > > > > cleaning up > > > > > > jobs in > > > > > > >> parallel. > > > > > > >> > > > > > > > > > > > > > > Thought of one more point here where why in Xe we > > > > > absolutely want > > > > > > a 1 to > > > > > > > 1 ratio between entity and scheduler - the way > > we implement > > > > > > timeslicing > > > > > > > for preempt fences. > > > > > > > > > > > > > > Let me try to explain. > > > > > > > > > > > > > > Preempt fences are implemented via the generic > > messaging > > > > > > interface [1] > > > > > > > with suspend / resume messages. If a suspend > > messages is > > > > > received to > > > > > > > soon after calling resume (this is per entity) > > we simply > > > > > sleep in the > > > > > > > suspend call thus giving the entity a > > timeslice. This > > > > > completely > > > > > > falls > > > > > > > apart with a many to 1 relationship as now a > > entity > > > > > waiting for a > > > > > > > timeslice blocks the other entities. Could we > > work aroudn > > > > > this, > > > > > > sure but > > > > > > > just another bunch of code we'd have to add in > > Xe. Being to > > > > > > freely sleep > > > > > > > in backend without affecting other entities is > > really, > > > > really > > > > > > nice IMO > > > > > > > and I bet Xe isn't the only driver that is > > going to feel > > > > > this way. > > > > > > > > > > > > > > Last thing I'll say regardless of how anyone > > feels about > > > > > Xe using > > > > > > a 1 to > > > > > > > 1 relationship this patch IMO makes sense as I > > hope we can > > > > all > > > > > > agree a > > > > > > > workqueue scales better than kthreads. > > > > > > > > > > > > I don't know for sure what will scale better and > > for what use > > > > > case, > > > > > > combination of CPU cores vs number of GPU engines > > to keep > > > > > busy vs other > > > > > > system activity. But I wager someone is bound to > > ask for some > > > > > > numbers to > > > > > > make sure proposal is not negatively affecting > > any other > > > > drivers. > > > > > > > > > > > > > > > > > > Then let them ask. Waving your hands vaguely in the > > direction of > > > > > the > > > > > > rest of DRM and saying "Uh, someone (not me) might > > object" is > > > > > profoundly > > > > > > unhelpful. Sure, someone might. That's why it's on > > dri-devel. > > > > > If you > > > > > > think there's someone in particular who might have a > > useful > > > > > opinion on > > > > > > this, throw them in the CC so they don't miss the > > e-mail thread. > > > > > > > > > > > > Or are you asking for numbers? If so, what numbers > > are you > > > > > asking for? > > > > > > > > > > It was a heads up to the Xe team in case people weren't > > appreciating > > > > > how > > > > > the proposed change has the potential influence power > > and performance > > > > > across the board. And nothing in the follow up > > discussion made me > > > > think > > > > > it was considered so I don't think it was redundant to > > raise it. > > > > > > > > > > In my experience it is typical that such core changes > > come with some > > > > > numbers. Which is in case of drm scheduler is tricky and > > probably > > > > > requires explicitly asking everyone to test (rather than > > count on > > > > > "don't > > > > > miss the email thread"). Real products can fail to ship > > due ten mW > > > > here > > > > > or there. Like suddenly an extra core prevented from > > getting into > > > > deep > > > > > sleep. > > > > > > > > > > If that was "profoundly unhelpful" so be it. > > > > > > > > > > > > > > > With your above explanation, it makes more sense what you're > > asking. > > > > > It's still not something Matt is likely to be able to > > provide on his > > > > > own. We need to tag some other folks and ask them to test > > it out. We > > > > > could play around a bit with it on Xe but it's not exactly > > production > > > > > grade yet and is going to hit this differently from most. > > Likely > > > > > candidates are probably AMD and Freedreno. > > > > > > > > Whoever is setup to check out power and performance would be > > good to > > > > give it a spin, yes. > > > > > > > > PS. I don't think I was asking Matt to test with other > > devices. To start > > > > with I think Xe is a team effort. I was asking for more > > background on > > > > the design decision since patch 4/20 does not say anything on that > > > > angle, nor later in the thread it was IMO sufficiently addressed. > > > > > > > > > > Also, If we're talking about a design that might > > paint us into an > > > > > > Intel-HW-specific hole, that would be one thing. But > > we're not. > > > > > We're > > > > > > talking about switching which kernel threading/task > > mechanism to > > > > > use for > > > > > > what's really a very generic problem. The core Xe > > design works > > > > > without > > > > > > this patch (just with more kthreads). If we land > > this patch or > > > > > > something like it and get it wrong and it causes a > > performance > > > > > problem > > > > > > for someone down the line, we can revisit it. > > > > > > > > > > For some definition of "it works" - I really wouldn't > > suggest > > > > > shipping a > > > > > kthread per user context at any point. > > > > > > > > > > > > > > > You have yet to elaborate on why. What resources is it > > consuming that's > > > > > going to be a problem? Are you anticipating CPU affinity > > problems? Or > > > > > does it just seem wasteful? > > > > > > > > Well I don't know, commit message says the approach does not > > scale. :) > > > > > > > > > I think I largely agree that it's probably > > unnecessary/wasteful but > > > > > reducing the number of kthreads seems like a tractable > > problem to solve > > > > > regardless of where we put the gpu_scheduler object. Is > > this the right > > > > > solution? Maybe not. It was also proposed at one point > > that we could > > > > > split the scheduler into two pieces: A scheduler which owns > > the kthread, > > > > > and a back-end which targets some HW ring thing where you > > can have > > > > > multiple back-ends per scheduler. That's certainly more > > invasive from a > > > > > DRM scheduler internal API PoV but would solve the kthread > > problem in a > > > > > way that's more similar to what we have now. > > > > > > > > > > > In any case that's a low level question caused by > > the high > > > > > level design > > > > > > decision. So I'd think first focus on the high > > level - which > > > > > is the 1:1 > > > > > > mapping of entity to scheduler instance proposal. > > > > > > > > > > > > Fundamentally it will be up to the DRM > > maintainers and the > > > > > community to > > > > > > bless your approach. And it is important to > > stress 1:1 is > > > > about > > > > > > userspace contexts, so I believe unlike any other > > current > > > > > scheduler > > > > > > user. And also important to stress this > > effectively does not > > > > > make Xe > > > > > > _really_ use the scheduler that much. > > > > > > > > > > > > > > > > > > I don't think this makes Xe nearly as much of a > > one-off as you > > > > > think it > > > > > > does. I've already told the Asahi team working on > > Apple M1/2 > > > > > hardware > > > > > > to do it this way and it seems to be a pretty good > > mapping for > > > > > them. I > > > > > > believe this is roughly the plan for nouveau as > > well. It's not > > > > > the way > > > > > > it currently works for anyone because most other > > groups aren't > > > > > doing FW > > > > > > scheduling yet. In the world of FW scheduling and > > hardware > > > > > designed to > > > > > > support userspace direct-to-FW submit, I think the > > design makes > > > > > perfect > > > > > > sense (see below) and I expect we'll see more drivers > > move in this > > > > > > direction as those drivers evolve. (AMD is doing some > > customish > > > > > thing > > > > > > for how with gpu_scheduler on the front-end somehow. > > I've not dug > > > > > into > > > > > > those details.) > > > > > > > > > > > > I can only offer my opinion, which is that the > > two options > > > > > mentioned in > > > > > > this thread (either improve drm scheduler to cope > > with what is > > > > > > required, > > > > > > or split up the code so you can use just the parts of > > > > > drm_sched which > > > > > > you want - which is frontend dependency tracking) > > shouldn't > > > > be so > > > > > > readily dismissed, given how I think the idea was > > for the new > > > > > driver to > > > > > > work less in a silo and more in the community > > (not do kludges > > > > to > > > > > > workaround stuff because it is thought to be too > > hard to > > > > > improve common > > > > > > code), but fundamentally, "goto previous > > paragraph" for what > > > > I am > > > > > > concerned. > > > > > > > > > > > > > > > > > > Meta comment: It appears as if you're falling into > > the standard > > > > > i915 > > > > > > team trap of having an internal discussion about what the > > > > community > > > > > > discussion might look like instead of actually having the > > > > community > > > > > > discussion. If you are seriously concerned about > > interactions > > > > with > > > > > > other drivers or whether or setting common direction, > > the right > > > > > way to > > > > > > do that is to break a patch or two out into a > > separate RFC series > > > > > and > > > > > > tag a handful of driver maintainers. Trying to > > predict the > > > > > questions > > > > > > other people might ask is pointless. Cc them and > > asking for their > > > > > input > > > > > > instead. > > > > > > > > > > I don't follow you here. It's not an internal discussion > > - I am > > > > raising > > > > > my concerns on the design publicly. I am supposed to > > write a patch to > > > > > show something, but am allowed to comment on a RFC series? > > > > > > > > > > > > > > > I may have misread your tone a bit. It felt a bit like too many > > > > > discussions I've had in the past where people are trying to > > predict what > > > > > others will say instead of just asking them. Reading it > > again, I was > > > > > probably jumping to conclusions a bit. Sorry about that. > > > > > > > > Okay no problem, thanks. In any case we don't have to keep > > discussing > > > > it, since I wrote one or two emails ago it is fundamentally on the > > > > maintainers and community to ack the approach. I only felt > > like RFC did > > > > not explain the potential downsides sufficiently so I wanted > > to probe > > > > that area a bit. > > > > > > > > > It is "drm/sched: Convert drm scheduler to use a work > > queue rather > > > > than > > > > > kthread" which should have Cc-ed _everyone_ who use drm > > scheduler. > > > > > > > > > > > > > > > Yeah, it probably should have. I think that's mostly what > > I've been > > > > > trying to say. > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > Tvrtko > > > > > > > > > > > > P.S. And as a related side note, there are more > > areas where > > > > > drm_sched > > > > > > could be improved, like for instance priority > > handling. > > > > > > Take a look at msm_submitqueue_create / > > > > > msm_gpu_convert_priority / > > > > > > get_sched_entity to see how msm works around the > > drm_sched > > > > > hardcoded > > > > > > limit of available priority levels, in order to > > avoid having > > > > > to leave a > > > > > > hw capability unused. I suspect msm would be > > happier if they > > > > > could have > > > > > > all priority levels equal in terms of whether > > they apply only > > > > > at the > > > > > > frontend level or completely throughout the pipeline. > > > > > > > > > > > > > [1] > > > > > > > > > > > > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 > > <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1> > > > > > > > <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 > > <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1> > > > > > > > > > > > > > > > > < > > > > > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 > > <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1> > > < > > > > > > https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1 > > <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>>> > > > > > > > > > > > > > >>> What would be interesting to learn is > > whether the option > > > > of > > > > > > refactoring > > > > > > >>> drm_sched to deal with out of order > > completion was > > > > > considered > > > > > > and what were > > > > > > >>> the conclusions. > > > > > > >>> > > > > > > >> > > > > > > >> I coded this up a while back when trying to > > convert the > > > > > i915 to > > > > > > the DRM > > > > > > >> scheduler it isn't all that hard either. The > > free flow > > > > > control > > > > > > on the > > > > > > >> ring (e.g. set job limit == SIZE OF RING / > > MAX JOB SIZE) > > > > is > > > > > > really what > > > > > > >> sold me on the this design. > > > > > > > > > > > > > > > > > > You're not the only one to suggest supporting > > out-of-order > > > > > completion. > > > > > > However, it's tricky and breaks a lot of internal > > assumptions of > > > > the > > > > > > scheduler. It also reduces functionality a bit > > because it can no > > > > > longer > > > > > > automatically rate-limit HW/FW queues which are often > > > > > fixed-size. (Ok, > > > > > > yes, it probably could but it becomes a substantially > > harder > > > > > problem.) > > > > > > > > > > > > It also seems like a worse mapping to me. The goal > > here is to > > > > turn > > > > > > submissions on a userspace-facing engine/queue into > > submissions > > > > > to a FW > > > > > > queue submissions, sorting out any dma_fence > > dependencies. Matt's > > > > > > description of saying this is a 1:1 mapping between > > sched/entity > > > > > doesn't > > > > > > tell the whole story. It's a 1:1:1 mapping between > > xe_engine, > > > > > > gpu_scheduler, and GuC FW engine. Why make it a > > 1:something:1 > > > > > mapping? > > > > > > Why is that better? > > > > > > > > > > As I have stated before, what I think what would fit > > well for Xe is > > > > one > > > > > drm_scheduler per engine class. In specific terms on our > > current > > > > > hardware, one drm scheduler instance for render, > > compute, blitter, > > > > > video > > > > > and video enhance. Userspace contexts remain scheduler > > entities. > > > > > > > > > > > > > > > And this is where we fairly strongly disagree. More in a bit. > > > > > > > > > > That way you avoid the whole kthread/kworker story and > > you have it > > > > > actually use the entity picking code in the scheduler, > > which may be > > > > > useful when the backend is congested. > > > > > > > > > > > > > > > What back-end congestion are you referring to here? Running > > out of FW > > > > > queue IDs? Something else? > > > > > > > > CT channel, number of context ids. > > > > > > > > > > > > > > Yes you have to solve the out of order problem so in my > > mind that is > > > > > something to discuss. What the problem actually is (just > > TDR?), how > > > > > tricky and why etc. > > > > > > > > > > And yes you lose the handy LRCA ring buffer size > > management so you'd > > > > > have to make those entities not runnable in some other way. > > > > > > > > > > Regarding the argument you raise below - would any of > > that make the > > > > > frontend / backend separation worse and why? Do you > > think it is less > > > > > natural? If neither is true then all remains is that it > > appears extra > > > > > work to support out of order completion of entities has been > > > > discounted > > > > > in favour of an easy but IMO inelegant option. > > > > > > > > > > > > > > > Broadly speaking, the kernel needs to stop thinking about > > GPU scheduling > > > > > in terms of scheduling jobs and start thinking in terms of > > scheduling > > > > > contexts/engines. There is still some need for scheduling > > individual > > > > > jobs but that is only for the purpose of delaying them as > > needed to > > > > > resolve dma_fence dependencies. Once dependencies are > > resolved, they > > > > > get shoved onto the context/engine queue and from there the > > kernel only > > > > > really manages whole contexts/engines. This is a major > > architectural > > > > > shift, entirely different from the way i915 scheduling > > works. It's also > > > > > different from the historical usage of DRM scheduler which I > > think is > > > > > why this all looks a bit funny. > > > > > > > > > > To justify this architectural shift, let's look at where > > we're headed. > > > > > In the glorious future... > > > > > > > > > > 1. Userspace submits directly to firmware queues. The > > kernel has no > > > > > visibility whatsoever into individual jobs. At most it can > > pause/resume > > > > > FW contexts as needed to handle eviction and memory management. > > > > > > > > > > 2. Because of 1, apart from handing out the FW queue IDs > > at the > > > > > beginning, the kernel can't really juggle them that much. > > Depending on > > > > > FW design, it may be able to pause a client, give its IDs to > > another, > > > > > and then resume it later when IDs free up. What it's not > > doing is > > > > > juggling IDs on a job-by-job basis like i915 currently is. > > > > > > > > > > 3. Long-running compute jobs may not complete for days. > > This means > > > > > that memory management needs to happen in terms of > > pause/resume of > > > > > entire contexts/engines using the memory rather than based > > on waiting > > > > > for individual jobs to complete or pausing individual jobs > > until the > > > > > memory is available. > > > > > > > > > > 4. Synchronization happens via userspace memory fences > > (UMF) and the > > > > > kernel is mostly unaware of most dependencies and when a > > context/engine > > > > > is or is not runnable. Instead, it keeps as many of them > > minimally > > > > > active (memory is available, even if it's in system RAM) as > > possible and > > > > > lets the FW sort out dependencies. (There may need to be > > some facility > > > > > for sleeping a context until a memory change similar to > > futex() or > > > > > poll() for userspace threads. There are some details TBD.) > > > > > > > > > > Are there potential problems that will need to be solved > > here? Yes. Is > > > > > it a good design? Well, Microsoft has been living in this > > future for > > > > > half a decade or better and it's working quite well for > > them. It's also > > > > > the way all modern game consoles work. It really is just > > Linux that's > > > > > stuck with the same old job model we've had since the > > monumental shift > > > > > to DRI2. > > > > > > > > > > To that end, one of the core goals of the Xe project was to > > make the > > > > > driver internally behave as close to the above model as > > possible while > > > > > keeping the old-school job model as a very thin layer on > > top. As the > > > > > broader ecosystem problems (window-system support for UMF, > > for instance) > > > > > are solved, that layer can be peeled back. The core driver > > will already > > > > > be ready for it. > > > > > > > > > > To that end, the point of the DRM scheduler in Xe isn't to > > schedule > > > > > jobs. It's to resolve syncobj and dma-buf implicit sync > > dependencies > > > > > and stuff jobs into their respective context/engine queue > > once they're > > > > > ready. All the actual scheduling happens in firmware and > > any scheduling > > > > > the kernel does to deal with contention, oversubscriptions, > > too many > > > > > contexts, etc. is between contexts/engines, not individual > > jobs. Sure, > > > > > the individual job visibility is nice, but if we design > > around it, we'll > > > > > never get to the glorious future. > > > > > > > > > > I really need to turn the above (with a bit more detail) > > into a blog > > > > > post.... Maybe I'll do that this week. > > > > > > > > > > In any case, I hope that provides more insight into why Xe > > is designed > > > > > the way it is and why I'm pushing back so hard on trying to > > make it more > > > > > of a "classic" driver as far as scheduling is concerned. > > Are there > > > > > potential problems here? Yes, that's why Xe has been labeled a > > > > > prototype. Are such radical changes necessary to get to > > said glorious > > > > > future? Yes, I think they are. Will it be worth it? I > > believe so. > > > > > > > > Right, that's all solid I think. My takeaway is that frontend > > priority > > > > sorting and that stuff isn't needed and that is okay. And that > > there are > > > > multiple options to maybe improve drm scheduler, like the fore > > mentioned > > > > making it deal with out of order, or split into functional > > components, > > > > or split frontend/backend what you suggested. For most of them > > cost vs > > > > benefit is more or less not completely clear, neither how much > > effort > > > > was invested to look into them. > > > > > > > > One thing I missed from this explanation is how drm_scheduler > > per engine > > > > class interferes with the high level concepts. And I did not > > manage to > > > > pick up on what exactly is the TDR problem in that case. Maybe > > the two > > > > are one and the same. > > > > > > > > Bottom line is I still have the concern that conversion to > > kworkers has > > > > an opportunity to regress. Possibly more opportunity for some > > Xe use > > > > cases than to affect other vendors, since they would still be > > using per > > > > physical engine / queue scheduler instances. > > > > > > > > And to put my money where my mouth is I will try to put testing Xe > > > > inside the full blown ChromeOS environment in my team plans. > > It would > > > > probably also be beneficial if Xe team could take a look at > > real world > > > > behaviour of the extreme transcode use cases too. If the stack > > is ready > > > > for that and all. It would be better to know earlier rather > > than later > > > > if there is a fundamental issue. > > > > > > > > For the patch at hand, and the cover letter, it certainly > > feels it would > > > > benefit to record the past design discussion had with AMD > > folks, to > > > > explicitly copy other drivers, and to record the theoretical > > pros and > > > > cons of threads vs unbound workers as I have tried to > > highlight them. > > > > > > > > Regards, > > > > > > > > Tvrtko > > > > > >