On Thu, Jan 12, 2023 at 04:39:32PM -0800, John Harrison wrote:
> On 1/11/2023 14:56, Jason Ekstrand wrote:
> > On Wed, Jan 11, 2023 at 4:32 PM Matthew Brost <matthew.br...@intel.com>
> > wrote:
> > 
> >     On Wed, Jan 11, 2023 at 04:18:01PM -0600, Jason Ekstrand wrote:
> >     > On Wed, Jan 11, 2023 at 2:50 AM Tvrtko Ursulin <
> >     > tvrtko.ursu...@linux.intel.com> wrote:
> >     >
> >     > >
> >     [snip]
> >     > >
> >     > > Typically is the key here. But I am not sure it is good
> >     enough. Consider
> >     > > this example - Intel Flex 170:
> >     > >
> >     > >   * Delivers up to 36 streams 1080p60 transcode throughput per
> >     card.
> >     > >   * When scaled to 10 cards in a 4U server configuration, it
> >     can support
> >     > > up to 360 streams of HEVC/HEVC 1080p60 transcode throughput.
> >     > >
> >     >
> >     > I had a feeling it was going to be media.... 😅
> >     >
> > 
> >     Yea wondering the media UMD can be rewritten to use less
> >     xe_engines, it
> >     is massive rewrite for VM bind + no implicit dependencies so let's
> >     just
> >     pile on some more work?
> > 
> > 
> > It could probably use fewer than it does today.  It currently creates
> > and throws away contexts like crazy, or did last I looked at it. 
> > However, the nature of media encode is that it often spreads across two
> > or three different types of engines.  There's not much you can do to
> > change that.
> And as per Tvrtko's example, you get media servers that transcode huge
> numbers of tiny streams in parallel. Almost no work per frame but 100s of
> independent streams being run concurrently. That means many 100s of contexts
> all trying to run at 30fps. I recall a specific bug about thundering herds -
> hundreds (thousands?) of waiting threads all being woken up at once because
> some request had completed.
> 
> >     >
> >     > > One transcode stream from my experience typically is 3-4 GPU
> >     contexts
> >     > > (buffer travels from vcs -> rcs -> vcs, maybe vecs) used from
> >     a single
> >     > > CPU thread. 4 contexts * 36 streams = 144 active contexts.
> >     Multiply by
> >     > > 60fps = 8640 jobs submitted and completed per second.
> >     > >
> >     > > 144 active contexts in the proposed scheme means possibly
> >     means 144
> >     > > kernel worker threads spawned (driven by 36 transcode CPU
> >     threads). (I
> >     > > don't think the pools would scale down given all are
> >     constantly pinged
> >     > > at 60fps.)
> >     > >
> >     > > And then each of 144 threads goes to grab the single GuC CT
> >     mutex. First
> >     > > threads are being made schedulable, then put to sleep as mutex
> >     > > contention is hit, then woken again as mutexes are getting
> >     released,
> >     > > rinse, repeat.
> >     > >
> >     >
> >     > Why is every submission grabbing the GuC CT mutex? I've not read
> >     the GuC
> >     > back-end yet but I was under the impression that most run_job()
> >     would be
> >     > just shoving another packet into a ring buffer.  If we have to
> >     send the GuC
> >     > a message on the control ring every single time we submit a job,
> >     that's
> >     > pretty horrible.
> >     >
> > 
> >     Run job writes the ring buffer and moves the tail as the first
> >     step (no
> >     lock required). Next it needs to tell the GuC the xe_engine LRC
> >     tail has
> >     moved, this is done from a single Host to GuC channel which is
> >     circular
> >     buffer, the writing of the channel protected by the mutex. There are
> >     little more nuances too but in practice there is always space in the
> >     channel so the time mutex needs to held is really, really small
> >     (check cached credits, write 3 dwords in payload, write 1 dword to
> >     move
> >     tail). I also believe mutexes in Linux are hybrid where they spin
> >     for a
> >     little bit before sleeping and certainly if there is space in the
> >     channel we shouldn't sleep mutex contention.
> > 
> > 
> > Ok, that makes sense.  It's maybe a bit clunky and it'd be nice if we
> > had some way to batch things up a bit so we only have to poke the GuC
> > channel once for every batch of things rather than once per job.  That's
> > maybe something we can look into as a future improvement; not
> > fundamental.
> > 
> > Generally, though, it sounds like contention could be a real problem if
> > we end up ping-ponging that lock between cores.  It's going to depend on
> > how much work it takes to get the next ready thing vs. the cost of that
> > atomic.  But, also, anything we do is going to potentially run into
> > contention problems.  *shrug*  If we were going to go for
> > one-per-HW-engine, we may as well go one-per-device and then we wouldn't
> > need the lock.  Off the top of my head, that doesn't sound great either
> > but IDK.
> > 
> >     As far as this being horrible, well didn't design the GuC and this how
> >     it is implemented for KMD based submission. We also have 256 doorbells
> >     so we wouldn't need a lock but I think are other issues with that
> >     design
> >     too which need to be worked out in the Xe2 / Xe3 timeframe.
> > 
> > 
> > Yeah, not blaming you.  Just surprised, that's all.  How does it work
> > for userspace submission?  What would it look like if the kernel
> > emulated userspace submission?  Is that even possible?
> > 
> > What are these doorbell things?  How do they play into it?
> Basically a bank of MMIO space reserved per 'entity' where a write to that
> MMIO space becomes an named interrupt to GuC. You can assign each doorbell
> to a specific GuC context. So writing to that doorbell address is
> effectively the same as sending a SCHEDULE_CONTEXT H2G message from the KMD
> for that context. But the advantage is you ring the doorbell from user land
> with no call into the kernel at all. Or from within the kernel, you can do
> it without needing any locks at all. Problem is, we have 64K contexts in GuC
> but only 256 doorbells in the hardware. Less if using SRIOV. So the "per
> 'entity'" part because somewhat questionable as to exactly what the 'entity'
> is. And hence we just haven't bothered supporting them in Linux because a)
> no direct submission from user land yet, and b) as Matthew says entire chain
> of IOCTL from UMD to kernel to acquiring a lock and sending the H2G has
> generally been fast enough. The latency only becomes an issue for ULLS
> people but for them, even the doorbells from user space are too high a
> latency because that still potentially involves the GuC having to do some
> scheduling and context switch type action.
> 
> John.
> 

I talked with Jason on IRC last week about doorbells and we came up
with the idea after chatting to allocate the doorbells with a greedy
algorithm which results in the first 256 xe_engine each getting their
own doorbell thus avoid contention on the CT channel / lock (this is
still KMD submission).

Coded up a prototype for this and initial test results of
xe_exec_threads /w 245 user xe_engines, 5 threads, and 40k total execs
are an average of .824s vs. 923s for /w and w/o doorbells. Or in other
words 49714 execs per seconds /w doorbells vs. 44353 without. This seems
to indicate using doorbells can provide a performance improvement. Also
Jason and I reasoned we should be able to use doorbells 99% of the time
aside from maybe some wacky media use cases. I also plan on following up
with the media UMD to see if we can get them to use less xe_engines.

Matt

> 
> >     Also if you see my follow up response Xe is ~33k execs per second with
> >     the current implementation on a 8 core (or maybe 8 thread) TGL which
> >     seems to fine to me.
> > 
> > 
> > 33k exec/sec is about 500/frame which should be fine. 500 is a lot for a
> > single frame.  I typically tell game devs to shoot for dozens per
> > frame.  The important thing is that it stays low even with hundreds of
> > memory objects bound. (Xe should be just fine there.)
> > 
> > --Jason
> > 
> >     Matt
> > 
> >     > --Jason
> >     >
> >     >
> >     > (And yes this backend contention is there regardless of 1:1:1,
> >     it would
> >     > > require a different re-design to solve that. But it is just a
> >     question
> >     > > whether there are 144 contending threads, or just 6 with the
> >     thread per
> >     > > engine class scheme.)
> >     > >
> >     > > Then multiply all by 10 for a 4U server use case and you get
> >     1440 worker
> >     > > kthreads, yes 10 more CT locks, but contending on how many CPU
> >     cores?
> >     > > Just so they can grab a timeslice and maybe content on a mutex
> >     as the
> >     > > next step.
> >     > >
> >     > > This example is where it would hurt on large systems. Imagine
> >     only an
> >     > > even wider media transcode card...
> >     > >
> >     > > Second example is only a single engine class used (3d
> >     desktop?) but with
> >     > > a bunch of not-runnable jobs queued and waiting on a fence to
> >     signal.
> >     > > Implicit or explicit dependencies doesn't matter. Then the
> >     fence signals
> >     > > and call backs run. N work items get scheduled, but they all
> >     submit to
> >     > > the same HW engine. So we end up with:
> >     > >
> >     > >          /-- wi1 --\
> >     > >         / ..     .. \
> >     > >   cb --+---  wi.. ---+-- rq1 -- .. -- rqN
> >     > >         \ ..    ..  /
> >     > >          \-- wiN --/
> >     > >
> >     > >
> >     > > All that we have achieved is waking up N CPUs to contend on
> >     the same
> >     > > lock and effectively insert the job into the same single HW
> >     queue. I
> >     > > don't see any positives there.
> >     > >
> >     > > This example I think can particularly hurt small / low power
> >     devices
> >     > > because of needless waking up of many cores for no benefit.
> >     Granted, I
> >     > > don't have a good feel on how common this pattern is in practice.
> >     > >
> >     > > >
> >     > > >     That
> >     > > >     is the number which drives the maximum number of
> >     not-runnable jobs
> >     > > that
> >     > > >     can become runnable at once, and hence spawn that many
> >     work items,
> >     > > and
> >     > > >     in turn unbound worker threads.
> >     > > >
> >     > > >     Several problems there.
> >     > > >
> >     > > >     It is fundamentally pointless to have potentially that
> >     many more
> >     > > >     threads
> >     > > >     than the number of CPU cores - it simply creates a
> >     scheduling storm.
> >     > > >
> >     > > >     Unbound workers have no CPU / cache locality either and
> >     no connection
> >     > > >     with the CPU scheduler to optimize scheduling patterns.
> >     This may
> >     > > matter
> >     > > >     either on large systems or on small ones. Whereas the
> >     current design
> >     > > >     allows for scheduler to notice userspace CPU thread
> >     keeps waking up
> >     > > the
> >     > > >     same drm scheduler kernel thread, and so it can keep
> >     them on the same
> >     > > >     CPU, the unbound workers lose that ability and so 2nd
> >     CPU might be
> >     > > >     getting woken up from low sleep for every submission.
> >     > > >
> >     > > >     Hence, apart from being a bit of a impedance mismatch,
> >     the proposal
> >     > > has
> >     > > >     the potential to change performance and power patterns
> >     and both large
> >     > > >     and small machines.
> >     > > >
> >     > > >
> >     > > > Ok, thanks for explaining the issue you're seeing in more
> >     detail.  Yes,
> >     > > > deferred kwork does appear to mismatch somewhat with what
> >     the scheduler
> >     > > > needs or at least how it's worked in the past.  How much
> >     impact will
> >     > > > that mismatch have?  Unclear.
> >     > > >
> >     > > >      >      >>> Secondly, it probably demands separate
> >     workers (not
> >     > > >     optional),
> >     > > >      >     otherwise
> >     > > >      >      >>> behaviour of shared workqueues has either
> >     the potential
> >     > > to
> >     > > >      >     explode number
> >     > > >      >      >>> kernel threads anyway, or add latency.
> >     > > >      >      >>>
> >     > > >      >      >>
> >     > > >      >      >> Right now the system_unbound_wq is used which
> >     does have a
> >     > > >     limit
> >     > > >      >     on the
> >     > > >      >      >> number of threads, right? I do have a FIXME
> >     to allow a
> >     > > >     worker to be
> >     > > >      >      >> passed in similar to TDR.
> >     > > >      >      >>
> >     > > >      >      >> WRT to latency, the 1:1 ratio could actually
> >     have lower
> >     > > >     latency
> >     > > >      >     as 2 GPU
> >     > > >      >      >> schedulers can be pushing jobs into the backend /
> >     > > cleaning up
> >     > > >      >     jobs in
> >     > > >      >      >> parallel.
> >     > > >      >      >>
> >     > > >      >      >
> >     > > >      >      > Thought of one more point here where why in Xe we
> >     > > >     absolutely want
> >     > > >      >     a 1 to
> >     > > >      >      > 1 ratio between entity and scheduler - the way
> >     we implement
> >     > > >      >     timeslicing
> >     > > >      >      > for preempt fences.
> >     > > >      >      >
> >     > > >      >      > Let me try to explain.
> >     > > >      >      >
> >     > > >      >      > Preempt fences are implemented via the generic
> >     messaging
> >     > > >      >     interface [1]
> >     > > >      >      > with suspend / resume messages. If a suspend
> >     messages is
> >     > > >     received to
> >     > > >      >      > soon after calling resume (this is per entity)
> >     we simply
> >     > > >     sleep in the
> >     > > >      >      > suspend call thus giving the entity a
> >     timeslice. This
> >     > > >     completely
> >     > > >      >     falls
> >     > > >      >      > apart with a many to 1 relationship as now a
> >     entity
> >     > > >     waiting for a
> >     > > >      >      > timeslice blocks the other entities. Could we
> >     work aroudn
> >     > > >     this,
> >     > > >      >     sure but
> >     > > >      >      > just another bunch of code we'd have to add in
> >     Xe. Being to
> >     > > >      >     freely sleep
> >     > > >      >      > in backend without affecting other entities is
> >     really,
> >     > > really
> >     > > >      >     nice IMO
> >     > > >      >      > and I bet Xe isn't the only driver that is
> >     going to feel
> >     > > >     this way.
> >     > > >      >      >
> >     > > >      >      > Last thing I'll say regardless of how anyone
> >     feels about
> >     > > >     Xe using
> >     > > >      >     a 1 to
> >     > > >      >      > 1 relationship this patch IMO makes sense as I
> >     hope we can
> >     > > all
> >     > > >      >     agree a
> >     > > >      >      > workqueue scales better than kthreads.
> >     > > >      >
> >     > > >      >     I don't know for sure what will scale better and
> >     for what use
> >     > > >     case,
> >     > > >      >     combination of CPU cores vs number of GPU engines
> >     to keep
> >     > > >     busy vs other
> >     > > >      >     system activity. But I wager someone is bound to
> >     ask for some
> >     > > >      >     numbers to
> >     > > >      >     make sure proposal is not negatively affecting
> >     any other
> >     > > drivers.
> >     > > >      >
> >     > > >      >
> >     > > >      > Then let them ask.  Waving your hands vaguely in the
> >     direction of
> >     > > >     the
> >     > > >      > rest of DRM and saying "Uh, someone (not me) might
> >     object" is
> >     > > >     profoundly
> >     > > >      > unhelpful.  Sure, someone might. That's why it's on
> >     dri-devel.
> >     > > >     If you
> >     > > >      > think there's someone in particular who might have a
> >     useful
> >     > > >     opinion on
> >     > > >      > this, throw them in the CC so they don't miss the
> >     e-mail thread.
> >     > > >      >
> >     > > >      > Or are you asking for numbers?  If so, what numbers
> >     are you
> >     > > >     asking for?
> >     > > >
> >     > > >     It was a heads up to the Xe team in case people weren't
> >     appreciating
> >     > > >     how
> >     > > >     the proposed change has the potential influence power
> >     and performance
> >     > > >     across the board. And nothing in the follow up
> >     discussion made me
> >     > > think
> >     > > >     it was considered so I don't think it was redundant to
> >     raise it.
> >     > > >
> >     > > >     In my experience it is typical that such core changes
> >     come with some
> >     > > >     numbers. Which is in case of drm scheduler is tricky and
> >     probably
> >     > > >     requires explicitly asking everyone to test (rather than
> >     count on
> >     > > >     "don't
> >     > > >     miss the email thread"). Real products can fail to ship
> >     due ten mW
> >     > > here
> >     > > >     or there. Like suddenly an extra core prevented from
> >     getting into
> >     > > deep
> >     > > >     sleep.
> >     > > >
> >     > > >     If that was "profoundly unhelpful" so be it.
> >     > > >
> >     > > >
> >     > > > With your above explanation, it makes more sense what you're
> >     asking.
> >     > > > It's still not something Matt is likely to be able to
> >     provide on his
> >     > > > own.  We need to tag some other folks and ask them to test
> >     it out.  We
> >     > > > could play around a bit with it on Xe but it's not exactly
> >     production
> >     > > > grade yet and is going to hit this differently from most. 
> >     Likely
> >     > > > candidates are probably AMD and Freedreno.
> >     > >
> >     > > Whoever is setup to check out power and performance would be
> >     good to
> >     > > give it a spin, yes.
> >     > >
> >     > > PS. I don't think I was asking Matt to test with other
> >     devices. To start
> >     > > with I think Xe is a team effort. I was asking for more
> >     background on
> >     > > the design decision since patch 4/20 does not say anything on that
> >     > > angle, nor later in the thread it was IMO sufficiently addressed.
> >     > >
> >     > > >      > Also, If we're talking about a design that might
> >     paint us into an
> >     > > >      > Intel-HW-specific hole, that would be one thing.  But
> >     we're not.
> >     > > >     We're
> >     > > >      > talking about switching which kernel threading/task
> >     mechanism to
> >     > > >     use for
> >     > > >      > what's really a very generic problem.  The core Xe
> >     design works
> >     > > >     without
> >     > > >      > this patch (just with more kthreads).  If we land
> >     this patch or
> >     > > >      > something like it and get it wrong and it causes a
> >     performance
> >     > > >     problem
> >     > > >      > for someone down the line, we can revisit it.
> >     > > >
> >     > > >     For some definition of "it works" - I really wouldn't
> >     suggest
> >     > > >     shipping a
> >     > > >     kthread per user context at any point.
> >     > > >
> >     > > >
> >     > > > You have yet to elaborate on why. What resources is it
> >     consuming that's
> >     > > > going to be a problem? Are you anticipating CPU affinity
> >     problems? Or
> >     > > > does it just seem wasteful?
> >     > >
> >     > > Well I don't know, commit message says the approach does not
> >     scale. :)
> >     > >
> >     > > > I think I largely agree that it's probably
> >     unnecessary/wasteful but
> >     > > > reducing the number of kthreads seems like a tractable
> >     problem to solve
> >     > > > regardless of where we put the gpu_scheduler object.  Is
> >     this the right
> >     > > > solution?  Maybe not.  It was also proposed at one point
> >     that we could
> >     > > > split the scheduler into two pieces: A scheduler which owns
> >     the kthread,
> >     > > > and a back-end which targets some HW ring thing where you
> >     can have
> >     > > > multiple back-ends per scheduler.  That's certainly more
> >     invasive from a
> >     > > > DRM scheduler internal API PoV but would solve the kthread
> >     problem in a
> >     > > > way that's more similar to what we have now.
> >     > > >
> >     > > >      >     In any case that's a low level question caused by
> >     the high
> >     > > >     level design
> >     > > >      >     decision. So I'd think first focus on the high
> >     level - which
> >     > > >     is the 1:1
> >     > > >      >     mapping of entity to scheduler instance proposal.
> >     > > >      >
> >     > > >      >     Fundamentally it will be up to the DRM
> >     maintainers and the
> >     > > >     community to
> >     > > >      >     bless your approach. And it is important to
> >     stress 1:1 is
> >     > > about
> >     > > >      >     userspace contexts, so I believe unlike any other
> >     current
> >     > > >     scheduler
> >     > > >      >     user. And also important to stress this
> >     effectively does not
> >     > > >     make Xe
> >     > > >      >     _really_ use the scheduler that much.
> >     > > >      >
> >     > > >      >
> >     > > >      > I don't think this makes Xe nearly as much of a
> >     one-off as you
> >     > > >     think it
> >     > > >      > does.  I've already told the Asahi team working on
> >     Apple M1/2
> >     > > >     hardware
> >     > > >      > to do it this way and it seems to be a pretty good
> >     mapping for
> >     > > >     them. I
> >     > > >      > believe this is roughly the plan for nouveau as
> >     well.  It's not
> >     > > >     the way
> >     > > >      > it currently works for anyone because most other
> >     groups aren't
> >     > > >     doing FW
> >     > > >      > scheduling yet.  In the world of FW scheduling and
> >     hardware
> >     > > >     designed to
> >     > > >      > support userspace direct-to-FW submit, I think the
> >     design makes
> >     > > >     perfect
> >     > > >      > sense (see below) and I expect we'll see more drivers
> >     move in this
> >     > > >      > direction as those drivers evolve. (AMD is doing some
> >     customish
> >     > > >     thing
> >     > > >      > for how with gpu_scheduler on the front-end somehow.
> >     I've not dug
> >     > > >     into
> >     > > >      > those details.)
> >     > > >      >
> >     > > >      >     I can only offer my opinion, which is that the
> >     two options
> >     > > >     mentioned in
> >     > > >      >     this thread (either improve drm scheduler to cope
> >     with what is
> >     > > >      >     required,
> >     > > >      >     or split up the code so you can use just the parts of
> >     > > >     drm_sched which
> >     > > >      >     you want - which is frontend dependency tracking)
> >     shouldn't
> >     > > be so
> >     > > >      >     readily dismissed, given how I think the idea was
> >     for the new
> >     > > >     driver to
> >     > > >      >     work less in a silo and more in the community
> >     (not do kludges
> >     > > to
> >     > > >      >     workaround stuff because it is thought to be too
> >     hard to
> >     > > >     improve common
> >     > > >      >     code), but fundamentally, "goto previous
> >     paragraph" for what
> >     > > I am
> >     > > >      >     concerned.
> >     > > >      >
> >     > > >      >
> >     > > >      > Meta comment:  It appears as if you're falling into
> >     the standard
> >     > > >     i915
> >     > > >      > team trap of having an internal discussion about what the
> >     > > community
> >     > > >      > discussion might look like instead of actually having the
> >     > > community
> >     > > >      > discussion.  If you are seriously concerned about
> >     interactions
> >     > > with
> >     > > >      > other drivers or whether or setting common direction,
> >     the right
> >     > > >     way to
> >     > > >      > do that is to break a patch or two out into a
> >     separate RFC series
> >     > > >     and
> >     > > >      > tag a handful of driver maintainers.  Trying to
> >     predict the
> >     > > >     questions
> >     > > >      > other people might ask is pointless. Cc them and
> >     asking for their
> >     > > >     input
> >     > > >      > instead.
> >     > > >
> >     > > >     I don't follow you here. It's not an internal discussion
> >     - I am
> >     > > raising
> >     > > >     my concerns on the design publicly. I am supposed to
> >     write a patch to
> >     > > >     show something, but am allowed to comment on a RFC series?
> >     > > >
> >     > > >
> >     > > > I may have misread your tone a bit.  It felt a bit like too many
> >     > > > discussions I've had in the past where people are trying to
> >     predict what
> >     > > > others will say instead of just asking them. Reading it
> >     again, I was
> >     > > > probably jumping to conclusions a bit.  Sorry about that.
> >     > >
> >     > > Okay no problem, thanks. In any case we don't have to keep
> >     discussing
> >     > > it, since I wrote one or two emails ago it is fundamentally on the
> >     > > maintainers and community to ack the approach. I only felt
> >     like RFC did
> >     > > not explain the potential downsides sufficiently so I wanted
> >     to probe
> >     > > that area a bit.
> >     > >
> >     > > >     It is "drm/sched: Convert drm scheduler to use a work
> >     queue rather
> >     > > than
> >     > > >     kthread" which should have Cc-ed _everyone_ who use drm
> >     scheduler.
> >     > > >
> >     > > >
> >     > > > Yeah, it probably should have.  I think that's mostly what
> >     I've been
> >     > > > trying to say.
> >     > > >
> >     > > >      >
> >     > > >      >     Regards,
> >     > > >      >
> >     > > >      >     Tvrtko
> >     > > >      >
> >     > > >      >     P.S. And as a related side note, there are more
> >     areas where
> >     > > >     drm_sched
> >     > > >      >     could be improved, like for instance priority
> >     handling.
> >     > > >      >     Take a look at msm_submitqueue_create /
> >     > > >     msm_gpu_convert_priority /
> >     > > >      >     get_sched_entity to see how msm works around the
> >     drm_sched
> >     > > >     hardcoded
> >     > > >      >     limit of available priority levels, in order to
> >     avoid having
> >     > > >     to leave a
> >     > > >      >     hw capability unused. I suspect msm would be
> >     happier if they
> >     > > >     could have
> >     > > >      >     all priority levels equal in terms of whether
> >     they apply only
> >     > > >     at the
> >     > > >      >     frontend level or completely throughout the pipeline.
> >     > > >      >
> >     > > >      >      > [1]
> >     > > >      >
> >     > > >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
> >     > > >   
> >      <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
> >     > > >
> >     > > >      >
> >     > > >       <
> >     > >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>
> >     <
> >     > >
> >     https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1
> >     <https://patchwork.freedesktop.org/patch/515857/?series=112189&rev=1>>>
> >     > > >      >      >
> >     > > >      >      >>> What would be interesting to learn is
> >     whether the option
> >     > > of
> >     > > >      >     refactoring
> >     > > >      >      >>> drm_sched to deal with out of order
> >     completion was
> >     > > >     considered
> >     > > >      >     and what were
> >     > > >      >      >>> the conclusions.
> >     > > >      >      >>>
> >     > > >      >      >>
> >     > > >      >      >> I coded this up a while back when trying to
> >     convert the
> >     > > >     i915 to
> >     > > >      >     the DRM
> >     > > >      >      >> scheduler it isn't all that hard either. The
> >     free flow
> >     > > >     control
> >     > > >      >     on the
> >     > > >      >      >> ring (e.g. set job limit == SIZE OF RING /
> >     MAX JOB SIZE)
> >     > > is
> >     > > >      >     really what
> >     > > >      >      >> sold me on the this design.
> >     > > >      >
> >     > > >      >
> >     > > >      > You're not the only one to suggest supporting
> >     out-of-order
> >     > > >     completion.
> >     > > >      > However, it's tricky and breaks a lot of internal
> >     assumptions of
> >     > > the
> >     > > >      > scheduler. It also reduces functionality a bit
> >     because it can no
> >     > > >     longer
> >     > > >      > automatically rate-limit HW/FW queues which are often
> >     > > >     fixed-size.  (Ok,
> >     > > >      > yes, it probably could but it becomes a substantially
> >     harder
> >     > > >     problem.)
> >     > > >      >
> >     > > >      > It also seems like a worse mapping to me.  The goal
> >     here is to
> >     > > turn
> >     > > >      > submissions on a userspace-facing engine/queue into
> >     submissions
> >     > > >     to a FW
> >     > > >      > queue submissions, sorting out any dma_fence
> >     dependencies.  Matt's
> >     > > >      > description of saying this is a 1:1 mapping between
> >     sched/entity
> >     > > >     doesn't
> >     > > >      > tell the whole story. It's a 1:1:1 mapping between
> >     xe_engine,
> >     > > >      > gpu_scheduler, and GuC FW engine. Why make it a
> >     1:something:1
> >     > > >     mapping?
> >     > > >      > Why is that better?
> >     > > >
> >     > > >     As I have stated before, what I think what would fit
> >     well for Xe is
> >     > > one
> >     > > >     drm_scheduler per engine class. In specific terms on our
> >     current
> >     > > >     hardware, one drm scheduler instance for render,
> >     compute, blitter,
> >     > > >     video
> >     > > >     and video enhance. Userspace contexts remain scheduler
> >     entities.
> >     > > >
> >     > > >
> >     > > > And this is where we fairly strongly disagree.  More in a bit.
> >     > > >
> >     > > >     That way you avoid the whole kthread/kworker story and
> >     you have it
> >     > > >     actually use the entity picking code in the scheduler,
> >     which may be
> >     > > >     useful when the backend is congested.
> >     > > >
> >     > > >
> >     > > > What back-end congestion are you referring to here?  Running
> >     out of FW
> >     > > > queue IDs?  Something else?
> >     > >
> >     > > CT channel, number of context ids.
> >     > >
> >     > > >
> >     > > >     Yes you have to solve the out of order problem so in my
> >     mind that is
> >     > > >     something to discuss. What the problem actually is (just
> >     TDR?), how
> >     > > >     tricky and why etc.
> >     > > >
> >     > > >     And yes you lose the handy LRCA ring buffer size
> >     management so you'd
> >     > > >     have to make those entities not runnable in some other way.
> >     > > >
> >     > > >     Regarding the argument you raise below - would any of
> >     that make the
> >     > > >     frontend / backend separation worse and why? Do you
> >     think it is less
> >     > > >     natural? If neither is true then all remains is that it
> >     appears extra
> >     > > >     work to support out of order completion of entities has been
> >     > > discounted
> >     > > >     in favour of an easy but IMO inelegant option.
> >     > > >
> >     > > >
> >     > > > Broadly speaking, the kernel needs to stop thinking about
> >     GPU scheduling
> >     > > > in terms of scheduling jobs and start thinking in terms of
> >     scheduling
> >     > > > contexts/engines.  There is still some need for scheduling
> >     individual
> >     > > > jobs but that is only for the purpose of delaying them as
> >     needed to
> >     > > > resolve dma_fence dependencies.  Once dependencies are
> >     resolved, they
> >     > > > get shoved onto the context/engine queue and from there the
> >     kernel only
> >     > > > really manages whole contexts/engines.  This is a major
> >     architectural
> >     > > > shift, entirely different from the way i915 scheduling
> >     works.  It's also
> >     > > > different from the historical usage of DRM scheduler which I
> >     think is
> >     > > > why this all looks a bit funny.
> >     > > >
> >     > > > To justify this architectural shift, let's look at where
> >     we're headed.
> >     > > > In the glorious future...
> >     > > >
> >     > > >   1. Userspace submits directly to firmware queues.  The
> >     kernel has no
> >     > > > visibility whatsoever into individual jobs. At most it can
> >     pause/resume
> >     > > > FW contexts as needed to handle eviction and memory management.
> >     > > >
> >     > > >   2. Because of 1, apart from handing out the FW queue IDs
> >     at the
> >     > > > beginning, the kernel can't really juggle them that much. 
> >     Depending on
> >     > > > FW design, it may be able to pause a client, give its IDs to
> >     another,
> >     > > > and then resume it later when IDs free up. What it's not
> >     doing is
> >     > > > juggling IDs on a job-by-job basis like i915 currently is.
> >     > > >
> >     > > >   3. Long-running compute jobs may not complete for days. 
> >     This means
> >     > > > that memory management needs to happen in terms of
> >     pause/resume of
> >     > > > entire contexts/engines using the memory rather than based
> >     on waiting
> >     > > > for individual jobs to complete or pausing individual jobs
> >     until the
> >     > > > memory is available.
> >     > > >
> >     > > >   4. Synchronization happens via userspace memory fences
> >     (UMF) and the
> >     > > > kernel is mostly unaware of most dependencies and when a
> >     context/engine
> >     > > > is or is not runnable.  Instead, it keeps as many of them
> >     minimally
> >     > > > active (memory is available, even if it's in system RAM) as
> >     possible and
> >     > > > lets the FW sort out dependencies.  (There may need to be
> >     some facility
> >     > > > for sleeping a context until a memory change similar to
> >     futex() or
> >     > > > poll() for userspace threads.  There are some details TBD.)
> >     > > >
> >     > > > Are there potential problems that will need to be solved
> >     here?  Yes.  Is
> >     > > > it a good design?  Well, Microsoft has been living in this
> >     future for
> >     > > > half a decade or better and it's working quite well for
> >     them.  It's also
> >     > > > the way all modern game consoles work.  It really is just
> >     Linux that's
> >     > > > stuck with the same old job model we've had since the
> >     monumental shift
> >     > > > to DRI2.
> >     > > >
> >     > > > To that end, one of the core goals of the Xe project was to
> >     make the
> >     > > > driver internally behave as close to the above model as
> >     possible while
> >     > > > keeping the old-school job model as a very thin layer on
> >     top.  As the
> >     > > > broader ecosystem problems (window-system support for UMF,
> >     for instance)
> >     > > > are solved, that layer can be peeled back. The core driver
> >     will already
> >     > > > be ready for it.
> >     > > >
> >     > > > To that end, the point of the DRM scheduler in Xe isn't to
> >     schedule
> >     > > > jobs.  It's to resolve syncobj and dma-buf implicit sync
> >     dependencies
> >     > > > and stuff jobs into their respective context/engine queue
> >     once they're
> >     > > > ready.  All the actual scheduling happens in firmware and
> >     any scheduling
> >     > > > the kernel does to deal with contention, oversubscriptions,
> >     too many
> >     > > > contexts, etc. is between contexts/engines, not individual
> >     jobs.  Sure,
> >     > > > the individual job visibility is nice, but if we design
> >     around it, we'll
> >     > > > never get to the glorious future.
> >     > > >
> >     > > > I really need to turn the above (with a bit more detail)
> >     into a blog
> >     > > > post.... Maybe I'll do that this week.
> >     > > >
> >     > > > In any case, I hope that provides more insight into why Xe
> >     is designed
> >     > > > the way it is and why I'm pushing back so hard on trying to
> >     make it more
> >     > > > of a "classic" driver as far as scheduling is concerned. 
> >     Are there
> >     > > > potential problems here?  Yes, that's why Xe has been labeled a
> >     > > > prototype.  Are such radical changes necessary to get to
> >     said glorious
> >     > > > future?  Yes, I think they are.  Will it be worth it?  I
> >     believe so.
> >     > >
> >     > > Right, that's all solid I think. My takeaway is that frontend
> >     priority
> >     > > sorting and that stuff isn't needed and that is okay. And that
> >     there are
> >     > > multiple options to maybe improve drm scheduler, like the fore
> >     mentioned
> >     > > making it deal with out of order, or split into functional
> >     components,
> >     > > or split frontend/backend what you suggested. For most of them
> >     cost vs
> >     > > benefit is more or less not completely clear, neither how much
> >     effort
> >     > > was invested to look into them.
> >     > >
> >     > > One thing I missed from this explanation is how drm_scheduler
> >     per engine
> >     > > class interferes with the high level concepts. And I did not
> >     manage to
> >     > > pick up on what exactly is the TDR problem in that case. Maybe
> >     the two
> >     > > are one and the same.
> >     > >
> >     > > Bottom line is I still have the concern that conversion to
> >     kworkers has
> >     > > an opportunity to regress. Possibly more opportunity for some
> >     Xe use
> >     > > cases than to affect other vendors, since they would still be
> >     using per
> >     > > physical engine / queue scheduler instances.
> >     > >
> >     > > And to put my money where my mouth is I will try to put testing Xe
> >     > > inside the full blown ChromeOS environment in my team plans.
> >     It would
> >     > > probably also be beneficial if Xe team could take a look at
> >     real world
> >     > > behaviour of the extreme transcode use cases too. If the stack
> >     is ready
> >     > > for that and all. It would be better to know earlier rather
> >     than later
> >     > > if there is a fundamental issue.
> >     > >
> >     > > For the patch at hand, and the cover letter, it certainly
> >     feels it would
> >     > > benefit to record the past design discussion had with AMD
> >     folks, to
> >     > > explicitly copy other drivers, and to record the theoretical
> >     pros and
> >     > > cons of threads vs unbound workers as I have tried to
> >     highlight them.
> >     > >
> >     > > Regards,
> >     > >
> >     > > Tvrtko
> >     > >
> > 

Reply via email to