Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Thu, 6 Apr 2023 at 18:58, Matthew Brost wrote:
> On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> > On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> > > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > > > Hi, Christian,
> > > > > > > > > >
> > > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > > > From: Thomas Hellström
> > > > > > > > > > > >
> > > > > > > > > > > > For long-running workloads, drivers either need to open-code completion
> > > > > > > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > > > > > > >
> > > > > > > > > > > > So since for example the drm scheduler uses dma-fences it is desirable for
> > > > > > > > > > > > a driver to be able to use it for throttling and error handling also with
> > > > > > > > > > > > internal dma-fences that do not obey the cross-driver dma-fence protocol.
> > > > > > > > > > > >
> > > > > > > > > > > > Introduce long-running completion fences in form of dma-fences, and add
> > > > > > > > > > > > lockdep annotation for them. In particular:
> > > > > > > > > > > >
> > > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > > > > > > * Introduce a new interface for adding callbacks making the helper adding
> > > > > > > > > > > >   a callback sign off on that it is aware that the dma-fence may not
> > > > > > > > > > > >   complete anytime soon. Typically this will be the scheduler chaining
> > > > > > > > > > > >   a new long-running fence on another one.
> > > > > > > > > > >
> > > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > >
> > > > > > > I don't think this is quite the same: this explicitly enforces that we don't
> > > > > > > break the dma-fence rules (in the path of memory allocations, exported in
> > > > > > > any way); essentially this is just a SW sync point reusing the dma-fence
> > > > > > > infrastructure for signaling / callbacks. I believe your series tried to
> > > > > > > export these fences to user space (admittedly I haven't fully read your
> > > > > > > series).
> > > > > > >
> > > > > > > In this use case we essentially just want to flow control the ring via
> > > > > > > the DRM scheduler + maintain a list of pending jobs so the TDR can be
> > > > > > > used for cleanup if an LR entity encounters an error. To me this seems
> > > > > > > perfectly reasonable, but I know dma-fence rules are akin to a holy war.
> > > > > > >
> > > > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > > > in the backend regardless of ring space, maintain a list of jobs pending
> > > > > > > for cleanup after errors, and write a different cleanup path as now the
> > > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > > > route, now all drivers are going to invent ways to handle LR jobs w/ the
> > > > > > > DRM scheduler.
> > > > > > >
> > > > > > > This solution is pretty clear: mark the scheduler as LR, and don't
> > > > > > > export any fences from the scheduler. If you try to export these
> > > > > > > fences, a blow-up happens.
> > > > > >
> > > > > > The problem is if you mix things up. Like for resets you need all the
> > > > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > > > potentially hilarious. If you now have a scheduler in forever limbo,
> > > > > > the dma_fence guarantees are right out the window.
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > > [snip]
> > > > > > This solution is pretty clear: mark the scheduler as LR, and don't
> > > > > > export any fences from the scheduler. If you try to export these
> > > > > > fences, a blow-up happens.
> > > > >
> > > > > The problem is if you mix things up. Like for resets you need all
> > > > > the schedulers on an engine/set-of-engines to quiesce or things get
> > > > > potentially hilarious. If you now have a scheduler in forever limbo,
> > > > > the dma_fence guarantees are right out the window.
> > > >
> > > > Right, a GT reset on Xe is:
> > > >
> > > > Stop all schedulers
> > > > Do a reset
> > > > Ban any schedulers which we think caused the GT reset
> > > > Resubmit all schedulers which we think were good
> > > > Restart all schedulers
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Wed, Apr 05, 2023 at 11:58:44PM +0000, Matthew Brost wrote:
> On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > [snip]
> > > > > This solution is pretty clear: mark the scheduler as LR, and don't
> > > > > export any fences from the scheduler. If you try to export these
> > > > > fences, a blow-up happens.
> > > >
> > > > The problem is if you mix things up. Like for resets you need all the
> > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > potentially hilarious. If you now have a scheduler in forever limbo,
> > > > the dma_fence guarantees are right out the window.
> > >
> > > Right, a GT reset on Xe is:
> > >
> > > Stop all schedulers
> > > Do a reset
> > > Ban any schedulers which we think caused the GT reset
> > > Resubmit all schedulers which we think were good
> > > Restart all schedulers
> > >
> > > None of this flow depends on LR dma-fences; all of this uses the DRM
> > > sched infrastructure and works very well compared to the i915.
> > > Rewriting all this with a driver-specific implementation is what we
> > > are trying to avoid.
> > >
> > > Similarly if an LR entity hangs on its own (not a GT reset, rather the
> > > firmware does the reset for us) we use all the DRM scheduler
> > > infrastructure to handle this. Again this works rather well...
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > [snip]
> >
> > Right, a GT reset on Xe is:
> >
> > Stop all schedulers
> > Do a reset
> > Ban any schedulers which we think caused the GT reset
> > Resubmit all schedulers which we think were good
> > Restart all schedulers
> >
> > None of this flow depends on LR dma-fences; all of this uses the DRM
> > sched infrastructure and works very well compared to the i915. Rewriting
> > all this with a driver-specific implementation is what we are trying to
> > avoid.
> >
> > Similarly if an LR entity hangs on its own (not a GT reset, rather the
> > firmware does the reset for us) we use all the DRM scheduler
> > infrastructure to handle this. Again this works rather well...
>
> Yeah this is why I don't think duplicating everything that long-running
> jobs need makes any sense. iow I agree with you.

Glad we agree.

> > > But the issue you're having is fairly specific if it's just about
> > > ringspace.
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
Am 05.04.23 um 14:45 schrieb Daniel Vetter:
> On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote:
> > Am 05.04.23 um 14:35 schrieb Thomas Hellström:
> > > Hi,
> > >
> > > On 4/4/23 21:25, Daniel Vetter wrote:
> > > > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > > > [snip]
> > > >
> > > > The problem is if you mix things up. Like for resets you need all the
> > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > potentially hilarious. If you now have a scheduler in forever limbo,
> > > > the dma_fence guarantees are right out the window.
> > > >
> > > > But the issue you're having is fairly specific if it's just about
> > > > ringspace. I think the dumbest fix is to just block in submit if you
> > > > run out of per-ctx ringspace, and call it a day. This notion that
> > > > somehow the kernel is supposed to provide a bottomless queue of
> > > > anything userspace submits simply doesn't hold up in reality (as much
> > > > as userspace standards committees would like it to), and as long as it
> > > > doesn't have a real-world perf impact it doesn't really matter why we
> > > > end up blocking in the submit ioctl. It might also be a simple memory
> > > > allocation that hits a snag in page reclaim.
> > >
> > > So it seems the discussion around the long-running synchronization
> > > diverged a bit between threads and this thread was hijacked for
> > > preempt-fences and userptr.
> > > Do I understand it correctly that the recommendation from both Daniel
> > > and Christian is to *not* use the drm scheduler for long-running
> > > compute jobs, but track any internal dma-fence dependencies (pipelined
> > > clearing or whatever) in a separate mechanism and handle unresolved
> > > dependencies on other long-running jobs using -EWOULDBLOCK?
> >
> > Yeah, I think that's a good summary. If needed we could extract some
> > scheduler functionality into separate components, but the fundamental
> > problem is that the GPU scheduler provides a dma_fence interface to the
> > outside to signal job completion, and Daniel and I seem to agree that
> > you really don't want that.
>
> I think I'm on something slightly different:
>
> - For anything which semantically is not a dma_fence I agree it probably
>   should be handled with EWOULDBLOCK and passed to userspace. Either with
>   a submit thread or userspace memory fences. Note that in practice you
>   will have a bunch of blocking left in the ioctl, stuff like mutexes or
>   memory allocations when things get really tight and you end up in
>   synchronous reclaim. Not any different
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Tue, Apr 04, 2023 at 07:48:27PM +0000, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 04, 2023 at 07:02:23PM +0000, Matthew Brost wrote:
> > > [snip]
> > > This solution is pretty clear: mark the scheduler as LR, and don't
> > > export any fences from the scheduler. If you try to export these
> > > fences, a blow-up happens.
> >
> > The problem is if you mix things up. Like for resets you need all the
> > schedulers on an engine/set-of-engines to quiesce or things get
> > potentially hilarious. If you now have a scheduler in forever limbo, the
> > dma_fence guarantees are right out the window.
>
> Right, a GT reset on Xe is:
>
> Stop all schedulers
> Do a reset
> Ban any schedulers which we think caused the GT reset
> Resubmit all schedulers which we think were good
> Restart all schedulers
>
> None of this flow depends on LR dma-fences; all of this uses the DRM
> sched infrastructure and works very well compared to the i915. Rewriting
> all this with a driver-specific implementation is what we are trying to
> avoid.
>
> Similarly if an LR entity hangs on its own (not a GT reset, rather the
> firmware does the reset for us) we use all the DRM scheduler
> infrastructure to handle this. Again this works rather well...

Yeah this is why I don't think duplicating everything that long-running
jobs need makes any sense. iow I agree with you.

> > But the issue you're having is fairly specific if it's just about
> > ringspace. I think the dumbest fix is to just block in submit if you run
> > out of per-ctx ringspace, and call it a day. [snip]
>
> How does that not break the dma-fence rules? A job can publish its
> finished fence after ARM; if the finished fence waits on ring
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote: > Am 05.04.23 um 14:35 schrieb Thomas Hellström: > > Hi, > > > > On 4/4/23 21:25, Daniel Vetter wrote: > > > On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote: > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström > > > > (Intel) wrote: > > > > > On 4/4/23 15:10, Christian König wrote: > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström: > > > > > > > Hi, Christian, > > > > > > > > > > > > > > On 4/4/23 11:09, Christian König wrote: > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost: > > > > > > > > > From: Thomas Hellström > > > > > > > > > > > > > > > > > > For long-running workloads, drivers either need to open-code > > > > > > > > > completion > > > > > > > > > waits, invent their own synchronization > > > > > > > > > primitives or internally use > > > > > > > > > dma-fences that do not obey the cross-driver > > > > > > > > > dma-fence protocol, but > > > > > > > > > without any lockdep annotation all these > > > > > > > > > approaches are error prone. > > > > > > > > > > > > > > > > > > So since for example the drm scheduler uses dma-fences it is > > > > > > > > > desirable for > > > > > > > > > a driver to be able to use it for throttling and error > > > > > > > > > handling also with > > > > > > > > > internal dma-fences tha do not obey the cros-driver > > > > > > > > > dma-fence protocol. > > > > > > > > > > > > > > > > > > Introduce long-running completion fences in form of > > > > > > > > > dma-fences, and add > > > > > > > > > lockdep annotation for them. In particular: > > > > > > > > > > > > > > > > > > * Do not allow waiting under any memory management locks. > > > > > > > > > * Do not allow to attach them to a dma-resv object. 
> > > > > > > > > * Introduce a new interface for adding callbacks making the > > > > > > > > > helper adding > > > > > > > > > a callback sign off on that it is aware > > > > > > > > > that the dma-fence may not > > > > > > > > > complete anytime soon. Typically this will be the > > > > > > > > > scheduler chaining > > > > > > > > > a new long-running fence on another one. > > > > > > > > Well that's pretty much what I tried before: > > > > > > > > https://lwn.net/Articles/893704/ > > > > > > > > > > > > I don't think this is quite the same; this explicitly enforces that > > > > we don't > > > > break the dma-fence rules (in path of memory allocations, exported in > > > > any way), essentially this is just a SW sync point reusing the dma-fence > > > > infrastructure for signaling / callbacks. I believe your series > > > > tried to > > > > export these fences to user space (admittedly I haven't fully read your > > > > series). > > > > > > > > In this use case we essentially just want to flow control the ring via > > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be > > > > used for cleanup if the LR entity encounters an error. To me this seems > > > > perfectly reasonable but I know dma-fence rules are akin to a holy war. > > > > > > > > If we return NULL in run_job, now we have to be able to sink all jobs > > > > in the backend regardless of ring space, maintain a list of jobs > > > > pending > > > > for cleanup after errors, and write a different cleanup path as now the > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code > > > > when the DRM scheduler provides all of this for us. Also if we go this > > > > route, now all drivers are going to invent ways to handle LR > > > > jobs w/ the > > > > DRM scheduler. > > > > > > > > This solution is pretty clear, mark the scheduler as LR, and don't > > > > export any fences from the scheduler. If you try to export these fences > > > > a blow up happens.
> > > The problem is if you mix things up. Like for resets you need all the > > > schedulers on an engine/set-of-engines to quiesce or things get > > > potentially hilarious. If you now have a scheduler in forever limbo, the > > > dma_fence guarantees are right out the window. > > > > > > But the issue you're having is fairly specific if it's just about > > > ringspace. I think the dumbest fix is to just block in submit if you run > > > out of per-ctx ringspace, and call it a day. This notion that > > > somehow the > > > kernel is supposed to provide a bottomless queue of anything userspace > > > submits simply doesn't hold up in reality (as much as userspace > > > standards > > > committees would like it to), and as long as it doesn't have a > > > real-world > > > perf impact it doesn't really matter why we end up blocking in the > > > submit > > > ioctl. It might also be a simple memory allocation that hits a snag in > > > page reclaim. > > > > So it seems the discussion around the long-running synchronization > > diverged a bit between threads and this thread was hijacked for > > preempt-fences and userptr. > > > > Do I understand it correctly that the recommendation from both Daniel > > and Christian is
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
Am 05.04.23 um 14:35 schrieb Thomas Hellström:

Hi,

On 4/4/23 21:25, Daniel Vetter wrote:

On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:

On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:

On 4/4/23 15:10, Christian König wrote:

Am 04.04.23 um 14:54 schrieb Thomas Hellström:

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström

For long-running workloads, drivers either need to open-code completion waits, invent their own synchronization primitives or internally use dma-fences that do not obey the cross-driver dma-fence protocol, but without any lockdep annotation all these approaches are error prone.

So since for example the drm scheduler uses dma-fences it is desirable for a driver to be able to use it for throttling and error handling also with internal dma-fences that do not obey the cross-driver dma-fence protocol.

Introduce long-running completion fences in the form of dma-fences, and add lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks making the helper adding a callback sign off on that it is aware that the dma-fence may not complete anytime soon. Typically this will be the scheduler chaining a new long-running fence on another one.

Well that's pretty much what I tried before: https://lwn.net/Articles/893704/

I don't think this is quite the same; this explicitly enforces that we don't break the dma-fence rules (in path of memory allocations, exported in any way), essentially this is just a SW sync point reusing the dma-fence infrastructure for signaling / callbacks. I believe your series tried to export these fences to user space (admittedly I haven't fully read your series).
In this use case we essentially just want to flow control the ring via the dma-scheduler + maintain a list of pending jobs so the TDR can be used for cleanup if the LR entity encounters an error. To me this seems perfectly reasonable but I know dma-fence rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs in the backend regardless of ring space, maintain a list of jobs pending for cleanup after errors, and write a different cleanup path as now the TDR doesn't work. Seems very, very silly to duplicate all of this code when the DRM scheduler provides all of this for us. Also if we go this route, now all drivers are going to invent ways to handle LR jobs w/ the DRM scheduler.

This solution is pretty clear, mark the scheduler as LR, and don't export any fences from the scheduler. If you try to export these fences a blow up happens.

The problem is if you mix things up. Like for resets you need all the schedulers on an engine/set-of-engines to quiesce or things get potentially hilarious. If you now have a scheduler in forever limbo, the dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about ringspace. I think the dumbest fix is to just block in submit if you run out of per-ctx ringspace, and call it a day. This notion that somehow the kernel is supposed to provide a bottomless queue of anything userspace submits simply doesn't hold up in reality (as much as userspace standards committees would like it to), and as long as it doesn't have a real-world perf impact it doesn't really matter why we end up blocking in the submit ioctl. It might also be a simple memory allocation that hits a snag in page reclaim.

So it seems the discussion around the long-running synchronization diverged a bit between threads and this thread was hijacked for preempt-fences and userptr.
Do I understand it correctly that the recommendation from both Daniel and Christian is to *not* use the drm scheduler for long-running compute jobs, but track any internal dma-fence dependencies (pipelined clearing or whatever) in a separate mechanism and handle unresolved dependencies on other long-running jobs using -EWOULDBLOCK?

Yeah, I think that's a good summary. If needed we could extract some scheduler functionality into separate components, but the fundamental problem is that the GPU scheduler provides a dma_fence interface to the outside to signal job completion and Daniel and I seem to agree that you really don't want that.

Regards, Christian.

Thanks, Thomas

And the reasons why it was rejected haven't changed.

Regards, Christian.

Yes, TBH this was mostly to get discussion going on how we'd best tackle this problem while being able to reuse the scheduler for long-running workloads. I couldn't see any clear decision on your series, though, but one main difference I see is that this is intended for driver-internal use only. (I'm counting using the drm_scheduler as a helper for driver-private use). This is by no means a way to try to tackle the
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
Hi,

On 4/4/23 21:25, Daniel Vetter wrote:

On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:

On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:

On 4/4/23 15:10, Christian König wrote:

Am 04.04.23 um 14:54 schrieb Thomas Hellström:

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström

For long-running workloads, drivers either need to open-code completion waits, invent their own synchronization primitives or internally use dma-fences that do not obey the cross-driver dma-fence protocol, but without any lockdep annotation all these approaches are error prone.

So since for example the drm scheduler uses dma-fences it is desirable for a driver to be able to use it for throttling and error handling also with internal dma-fences that do not obey the cross-driver dma-fence protocol.

Introduce long-running completion fences in the form of dma-fences, and add lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks making the helper adding a callback sign off on that it is aware that the dma-fence may not complete anytime soon. Typically this will be the scheduler chaining a new long-running fence on another one.

Well that's pretty much what I tried before: https://lwn.net/Articles/893704/

I don't think this is quite the same; this explicitly enforces that we don't break the dma-fence rules (in path of memory allocations, exported in any way), essentially this is just a SW sync point reusing the dma-fence infrastructure for signaling / callbacks. I believe your series tried to export these fences to user space (admittedly I haven't fully read your series).

In this use case we essentially just want to flow control the ring via the dma-scheduler + maintain a list of pending jobs so the TDR can be used for cleanup if the LR entity encounters an error.
To me this seems perfectly reasonable but I know dma-fence rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs in the backend regardless of ring space, maintain a list of jobs pending for cleanup after errors, and write a different cleanup path as now the TDR doesn't work. Seems very, very silly to duplicate all of this code when the DRM scheduler provides all of this for us. Also if we go this route, now all drivers are going to invent ways to handle LR jobs w/ the DRM scheduler.

This solution is pretty clear, mark the scheduler as LR, and don't export any fences from the scheduler. If you try to export these fences a blow up happens.

The problem is if you mix things up. Like for resets you need all the schedulers on an engine/set-of-engines to quiesce or things get potentially hilarious. If you now have a scheduler in forever limbo, the dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about ringspace. I think the dumbest fix is to just block in submit if you run out of per-ctx ringspace, and call it a day. This notion that somehow the kernel is supposed to provide a bottomless queue of anything userspace submits simply doesn't hold up in reality (as much as userspace standards committees would like it to), and as long as it doesn't have a real-world perf impact it doesn't really matter why we end up blocking in the submit ioctl. It might also be a simple memory allocation that hits a snag in page reclaim.

So it seems the discussion around the long-running synchronization diverged a bit between threads and this thread was hijacked for preempt-fences and userptr.
Do I understand it correctly that the recommendation from both Daniel and Christian is to *not* use the drm scheduler for long-running compute jobs, but track any internal dma-fence dependencies (pipelined clearing or whatever) in a separate mechanism and handle unresolved dependencies on other long-running jobs using -EWOULDBLOCK?

Thanks, Thomas

And the reasons why it was rejected haven't changed.

Regards, Christian.

Yes, TBH this was mostly to get discussion going on how we'd best tackle this problem while being able to reuse the scheduler for long-running workloads. I couldn't see any clear decision on your series, though, but one main difference I see is that this is intended for driver-internal use only. (I'm counting using the drm_scheduler as a helper for driver-private use). This is by no means a way to try to tackle the indefinite fence problem.

Well this was just my latest try to tackle this, but essentially the problems are the same as with your approach: When we express such operations as dma_fence there is always the chance that we leak that somewhere.

My approach of adding a flag noting that this operation is dangerous and can't be synced with something memory management depends on tried to contain this as much as
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Tue, Apr 04, 2023 at 10:31:58PM +0200, Daniel Vetter wrote: > On Tue, Apr 04, 2023 at 08:19:37PM +, Matthew Brost wrote: > > On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote: > > > On Tue, 4 Apr 2023 at 22:04, Matthew Brost > > > wrote: > > > > > > > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote: > > > > > On Tue, 4 Apr 2023 at 15:10, Christian König > > > > > wrote: > > > > > > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström: > > > > > > > Hi, Christian, > > > > > > > > > > > > > > On 4/4/23 11:09, Christian König wrote: > > > > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost: > > > > > > >>> From: Thomas Hellström > > > > > > >>> > > > > > > >>> For long-running workloads, drivers either need to open-code > > > > > > >>> completion > > > > > > >>> waits, invent their own synchronization primitives or > > > > > > >>> internally use > > > > > > >>> dma-fences that do not obey the cross-driver dma-fence > > > > > > >>> protocol, but > > > > > > >>> without any lockdep annotation all these approaches are error > > > > > > >>> prone. > > > > > > >>> > > > > > > >>> So since for example the drm scheduler uses dma-fences it is > > > > > > >>> desirable for > > > > > > >>> a driver to be able to use it for throttling and error handling > > > > > > >>> also > > > > > > >>> with > > > > > > >>> internal dma-fences that do not obey the cross-driver dma-fence > > > > > > >>> protocol. > > > > > > >>> > > > > > > >>> Introduce long-running completion fences in the form of dma-fences, > > > > > > >>> and add > > > > > > >>> lockdep annotation for them. In particular: > > > > > > >>> > > > > > > >>> * Do not allow waiting under any memory management locks. > > > > > > >>> * Do not allow attaching them to a dma-resv object.
> > > > > > >>> * Introduce a new interface for adding callbacks making the > > > > > > >>> helper > > > > > > >>> adding > > > > > > >>>a callback sign off on that it is aware that the dma-fence > > > > > > >>> may not > > > > > > >>>complete anytime soon. Typically this will be the scheduler > > > > > > >>> chaining > > > > > > >>>a new long-running fence on another one. > > > > > > >> > > > > > > >> Well that's pretty much what I tried before: > > > > > > >> https://lwn.net/Articles/893704/ > > > > > > >> > > > > > > >> And the reasons why it was rejected haven't changed. > > > > > > >> > > > > > > >> Regards, > > > > > > >> Christian. > > > > > > >> > > > > > > > Yes, TBH this was mostly to get discussion going on how we'd best > > > > > > > tackle > > > > > > > this problem while being able to reuse the scheduler for > > > > > > > long-running > > > > > > > workloads. > > > > > > > > > > > > > > I couldn't see any clear decision on your series, though, but one > > > > > > > main > > > > > > > difference I see is that this is intended for driver-internal use > > > > > > > only. (I'm counting using the drm_scheduler as a helper for > > > > > > > driver-private use). This is by no means a way to try to tackle the > > > > > > > indefinite fence problem. > > > > > > > > > > > > Well this was just my latest try to tackle this, but essentially the > > > > > > problems are the same as with your approach: When we express such > > > > > > operations as dma_fence there is always the chance that we leak that > > > > > > somewhere. > > > > > > > > > > > > My approach of adding a flag noting that this operation is > > > > > > dangerous and > > > > > > can't be synced with something memory management depends on tried to > > > > > > contain this as much as possible, but Daniel still pretty clearly > > > > > > rejected it (for good reasons I think).
> > > > > > > > > > Yeah I still don't like dma_fence that somehow have totally different > > > > > semantics in that critical piece of "will it complete or will it > > > > > deadlock?" :-) > > > > > > > > Not going to touch LR dma-fences in this reply, I think we can continue > > > > the LR fence discussion on the fork of this thread I just responded to. > > > > Have a response to the preempt fence discussion below. > > > > > > > > > > > > > > > > > > > > > > > > We could ofc invent a completely different data-type that > > > > > > > abstracts > > > > > > > the synchronization the scheduler needs in the long-running case, > > > > > > > or > > > > > > > each driver could hack something up, like sleeping in the > > > > > > > prepare_job() or run_job() callback for throttling, but those > > > > > > > waits > > > > > > > should still be annotated one way or > > > > > > > another > > > > > > > (and probably in a similar way across drivers) to make sure we > > > > > > > don't > > > > > > > do anything bad. > > > > > > > > > > > > > > So any suggestions as to what would be the better solution here > > > > > > > would > > > > > > > be appreciated. > > > > > > > > > > > > Mhm, do we really need the GPU scheduler for that? > > > > > > > > > > > > I mean in the 1 to 1 case you
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Tue, Apr 04, 2023 at 08:19:37PM +, Matthew Brost wrote: > On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote: > > On Tue, 4 Apr 2023 at 22:04, Matthew Brost wrote: > > > > > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote: > > > > On Tue, 4 Apr 2023 at 15:10, Christian König > > > > wrote: > > > > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström: > > > > > > Hi, Christian, > > > > > > > > > > > > On 4/4/23 11:09, Christian König wrote: > > > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost: > > > > > >>> From: Thomas Hellström > > > > > >>> > > > > > >>> For long-running workloads, drivers either need to open-code > > > > > >>> completion > > > > > >>> waits, invent their own synchronization primitives or internally > > > > > >>> use > > > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, > > > > > >>> but > > > > > >>> without any lockdep annotation all these approaches are error > > > > > >>> prone. > > > > > >>> > > > > > >>> So since for example the drm scheduler uses dma-fences it is > > > > > >>> desirable for > > > > > >>> a driver to be able to use it for throttling and error handling > > > > > >>> also > > > > > >>> with > > > > > >>> internal dma-fences that do not obey the cross-driver dma-fence > > > > > >>> protocol. > > > > > >>> > > > > > >>> Introduce long-running completion fences in the form of dma-fences, > > > > > >>> and add > > > > > >>> lockdep annotation for them. In particular: > > > > > >>> > > > > > >>> * Do not allow waiting under any memory management locks. > > > > > >>> * Do not allow attaching them to a dma-resv object. > > > > > >>> * Introduce a new interface for adding callbacks making the helper > > > > > >>> adding > > > > > >>>a callback sign off on that it is aware that the dma-fence may > > > > > >>> not > > > > > >>>complete anytime soon. Typically this will be the scheduler > > > > > >>> chaining > > > > > >>>a new long-running fence on another one.
> > > > > >> > > > > > >> Well that's pretty much what I tried before: > > > > > >> https://lwn.net/Articles/893704/ > > > > > >> > > > > > >> And the reasons why it was rejected haven't changed. > > > > > >> > > > > > >> Regards, > > > > > >> Christian. > > > > > >> > > > > > > Yes, TBH this was mostly to get discussion going on how we'd best > > > > > > tackle > > > > > > this problem while being able to reuse the scheduler for > > > > > > long-running > > > > > > workloads. > > > > > > > > > > > > I couldn't see any clear decision on your series, though, but one > > > > > > main > > > > > > difference I see is that this is intended for driver-internal use > > > > > > only. (I'm counting using the drm_scheduler as a helper for > > > > > > driver-private use). This is by no means a way to try to tackle the > > > > > > indefinite fence problem. > > > > > > > > > > Well this was just my latest try to tackle this, but essentially the > > > > > problems are the same as with your approach: When we express such > > > > > operations as dma_fence there is always the chance that we leak that > > > > > somewhere. > > > > > > > > > > My approach of adding a flag noting that this operation is dangerous > > > > > and > > > > > can't be synced with something memory management depends on tried to > > > > > contain this as much as possible, but Daniel still pretty clearly > > > > > rejected it (for good reasons I think). > > > > > > > > Yeah I still don't like dma_fence that somehow have totally different > > > > semantics in that critical piece of "will it complete or will it > > > > deadlock?" :-) > > > > > > Not going to touch LR dma-fences in this reply, I think we can continue > > > the LR fence discussion on the fork of this thread I just responded to. > > > Have a response to the preempt fence discussion below.
> > > > > > > > > > > > > > > > > > > > We could ofc invent a completely different data-type that abstracts > > > > > > the synchronization the scheduler needs in the long-running case, or > > > > > > each driver could hack something up, like sleeping in the > > > > > > prepare_job() or run_job() callback for throttling, but those waits > > > > > > should still be annotated one way or another > > > > > > (and probably in a similar way across drivers) to make sure we don't > > > > > > do anything bad. > > > > > > > > > > > > So any suggestions as to what would be the better solution here > > > > > > would > > > > > > be appreciated. > > > > > > > > > > Mhm, do we really need the GPU scheduler for that? > > > > > > > > > > I mean in the 1 to 1 case you basically just need a component which > > > > > collects the dependencies as dma_fence and if all of them are > > > > > fulfilled > > > > > schedules a work item. > > > > > > > > > > As long as the work item itself doesn't produce a dma_fence it can > > > > > then > > > > > still just wait for other non-dma_fence dependencies. > > > > > > > > Yeah that's the
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote: > On Tue, 4 Apr 2023 at 22:04, Matthew Brost wrote: > > > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote: > > > On Tue, 4 Apr 2023 at 15:10, Christian König > > > wrote: > > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström: > > > > > Hi, Christian, > > > > > > > > > > On 4/4/23 11:09, Christian König wrote: > > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost: > > > > >>> From: Thomas Hellström > > > > >>> > > > > >>> For long-running workloads, drivers either need to open-code > > > > >>> completion > > > > >>> waits, invent their own synchronization primitives or internally use > > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but > > > > >>> without any lockdep annotation all these approaches are error prone. > > > > >>> > > > > >>> So since for example the drm scheduler uses dma-fences it is > > > > >>> desirable for > > > > >>> a driver to be able to use it for throttling and error handling also > > > > >>> with > > > > >>> internal dma-fences that do not obey the cross-driver dma-fence > > > > >>> protocol. > > > > >>> > > > > >>> Introduce long-running completion fences in the form of dma-fences, and > > > > >>> add > > > > >>> lockdep annotation for them. In particular: > > > > >>> > > > > >>> * Do not allow waiting under any memory management locks. > > > > >>> * Do not allow attaching them to a dma-resv object. > > > > >>> * Introduce a new interface for adding callbacks making the helper > > > > >>> adding > > > > >>>a callback sign off on that it is aware that the dma-fence may > > > > >>> not > > > > >>>complete anytime soon. Typically this will be the scheduler > > > > >>> chaining > > > > >>>a new long-running fence on another one. > > > > >> > > > > >> Well that's pretty much what I tried before: > > > > >> https://lwn.net/Articles/893704/ > > > > >> > > > > >> And the reasons why it was rejected haven't changed.
> > > > >> > > > > >> Regards, > > > > >> Christian. > > > > >> > > > > > Yes, TBH this was mostly to get discussion going on how we'd best tackle > > > > > this problem while being able to reuse the scheduler for long-running > > > > > workloads. > > > > > > > > > > I couldn't see any clear decision on your series, though, but one main > > > > > difference I see is that this is intended for driver-internal use > > > > > only. (I'm counting using the drm_scheduler as a helper for > > > > > driver-private use). This is by no means a way to try to tackle the > > > > > indefinite fence problem. > > > > > > > > Well this was just my latest try to tackle this, but essentially the > > > > problems are the same as with your approach: When we express such > > > > operations as dma_fence there is always the chance that we leak that > > > > somewhere. > > > > > > > > My approach of adding a flag noting that this operation is dangerous and > > > > can't be synced with something memory management depends on tried to > > > > contain this as much as possible, but Daniel still pretty clearly > > > > rejected it (for good reasons I think). > > > > > > Yeah I still don't like dma_fence that somehow have totally different > > > semantics in that critical piece of "will it complete or will it > > > deadlock?" :-) > > > > Not going to touch LR dma-fences in this reply, I think we can continue > > the LR fence discussion on the fork of this thread I just responded to. > > Have a response to the preempt fence discussion below.
> > > > > > > > > > > > > > > > We could ofc invent a completely different data-type that abstracts > > > > > the synchronization the scheduler needs in the long-running case, or > > > > > each driver could hack something up, like sleeping in the > > > > > prepare_job() or run_job() callback for throttling, but those waits > > > > > should still be annotated one way or another > > > > > (and probably in a similar way across drivers) to make sure we don't > > > > > do anything bad. > > > > > > > > > > So any suggestions as to what would be the better solution here would > > > > > be appreciated. > > > > > > > > Mhm, do we really need the GPU scheduler for that? > > > > > > > > I mean in the 1 to 1 case you basically just need a component which > > > > collects the dependencies as dma_fence and if all of them are fulfilled > > > > schedules a work item. > > > > > > > > As long as the work item itself doesn't produce a dma_fence it can then > > > > still just wait for other non-dma_fence dependencies. > > > > > > Yeah that's the important thing, for long-running jobs dependencies as > > > dma_fence should be totally fine. You're just not allowed to have any > > > outgoing dma_fences at all (except the magic preemption fence). > > > > > > > Then the work function could submit the work and wait for the result. > > > > > > > > The work item would then pretty much represent what you want, you can > > > > wait for it to finish and pass it
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On Tue, 4 Apr 2023 at 22:04, Matthew Brost wrote: > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote: > > On Tue, 4 Apr 2023 at 15:10, Christian König > > wrote: > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström: > > > > Hi, Christian, > > > > > > > > On 4/4/23 11:09, Christian König wrote: > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost: > > > >>> From: Thomas Hellström > > > >>> > > > >>> For long-running workloads, drivers either need to open-code > > > >>> completion > > > >>> waits, invent their own synchronization primitives or internally use > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but > > > >>> without any lockdep annotation all these approaches are error prone. > > > >>> > > > >>> So since for example the drm scheduler uses dma-fences it is > > > >>> desirable for > > > >>> a driver to be able to use it for throttling and error handling also > > > >>> with > > > >>> internal dma-fences that do not obey the cross-driver dma-fence > > > >>> protocol. > > > >>> > > > >>> Introduce long-running completion fences in the form of dma-fences, and > > > >>> add > > > >>> lockdep annotation for them. In particular: > > > >>> > > > >>> * Do not allow waiting under any memory management locks. > > > >>> * Do not allow attaching them to a dma-resv object. > > > >>> * Introduce a new interface for adding callbacks making the helper > > > >>> adding > > > >>>a callback sign off on that it is aware that the dma-fence may not > > > >>>complete anytime soon. Typically this will be the scheduler > > > >>> chaining > > > >>>a new long-running fence on another one. > > > >> > > > >> Well that's pretty much what I tried before: > > > >> https://lwn.net/Articles/893704/ > > > >> > > > >> And the reasons why it was rejected haven't changed. > > > >> > > > >> Regards, > > > >> Christian.
> > > >> > > > > Yes, TBH this was mostly to get discussion going on how we'd best tackle > > > > this problem while being able to reuse the scheduler for long-running > > > > workloads. > > > > > > > > I couldn't see any clear decision on your series, though, but one main > > > > difference I see is that this is intended for driver-internal use > > > > only. (I'm counting using the drm_scheduler as a helper for > > > > driver-private use). This is by no means a way to try to tackle the > > > > indefinite fence problem. > > > > > > Well this was just my latest try to tackle this, but essentially the > > > problems are the same as with your approach: When we express such > > > operations as dma_fence there is always the chance that we leak that > > > somewhere. > > > > > > My approach of adding a flag noting that this operation is dangerous and > > > can't be synced with something memory management depends on tried to > > > contain this as much as possible, but Daniel still pretty clearly > > > rejected it (for good reasons I think). > > > > Yeah I still don't like dma_fence that somehow have totally different > > semantics in that critical piece of "will it complete or will it > > deadlock?" :-) > > Not going to touch LR dma-fences in this reply, I think we can continue > the LR fence discussion on the fork of this thread I just responded to. > Have a response to the preempt fence discussion below. > > > > > > > > > > > > We could ofc invent a completely different data-type that abstracts > > > > the synchronization the scheduler needs in the long-running case, or > > > > each driver could hack something up, like sleeping in the > > > > prepare_job() or run_job() callback for throttling, but those waits > > > > should still be annotated one way or another > > > > (and probably in a similar way across drivers) to make sure we don't > > > > do anything bad.
> > > > > > > > So any suggestions as to what would be the better solution here would > > > > be appreciated. > > > > > > Mhm, do we really the the GPU scheduler for that? > > > > > > I mean in the 1 to 1 case you basically just need a component which > > > collects the dependencies as dma_fence and if all of them are fulfilled > > > schedules a work item. > > > > > > As long as the work item itself doesn't produce a dma_fence it can then > > > still just wait for other none dma_fence dependencies. > > > > Yeah that's the important thing, for long-running jobs dependencies as > > dma_fence should be totally fine. You're just not allowed to have any > > outgoing dma_fences at all (except the magic preemption fence). > > > > > Then the work function could submit the work and wait for the result. > > > > > > The work item would then pretty much represent what you want, you can > > > wait for it to finish and pass it along as long running dependency. > > > > > > Maybe give it a funky name and wrap it up in a structure, but that's > > > basically it. > > > > Like do we need this? If the kernel ever waits for a long-running > > compute job to finnish I'd call that a bug. Any functional > > dependencies
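Christian's "1 to 1" suggestion quoted above (collect the dependencies as dma_fence and, once all of them are fulfilled, schedule a work item whose own completion is never exported as a dma-fence) can be sketched as a small userspace model. This is purely illustrative Python, not kernel code; the class and method names are invented, and `threading.Event` stands in for a fence:

```python
import threading

class DependencyGatedWork:
    """Illustrative model of the suggestion: collect dma_fence-like
    dependencies and, once all have signalled, schedule a work item.
    The work item's completion is NOT exposed as a dma-fence; it may
    only be waited on internally as a long-running dependency."""

    def __init__(self, work_fn):
        self.work_fn = work_fn
        self.deps = []                 # dma_fence-like events we depend on
        self.done = threading.Event()  # internal completion, not a dma-fence

    def add_dependency(self, event):
        self.deps.append(event)

    def schedule(self):
        def runner():
            for dep in self.deps:      # wait for all dma_fence dependencies
                dep.wait()
            self.work_fn()             # submit the work and wait for the result
            self.done.set()            # signal internal completion only
        threading.Thread(target=runner, daemon=True).start()
```

A later long-running job could list `job.done` as a dependency, but `done` must never be attached to a dma-resv or otherwise exported, matching the "no outgoing dma_fences" rule discussed above.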
On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> On Tue, 4 Apr 2023 at 15:10, Christian König wrote:
> >
> > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > Hi, Christian,
> > >
> > > On 4/4/23 11:09, Christian König wrote:
> > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > >>> From: Thomas Hellström
> > >>>
> > >>> For long-running workloads, drivers either need to open-code completion
> > >>> waits, invent their own synchronization primitives or internally use
> > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > >>> without any lockdep annotation all these approaches are error prone.
> > >>>
> > >>> So since for example the drm scheduler uses dma-fences it is desirable for
> > >>> a driver to be able to use it for throttling and error handling also with
> > >>> internal dma-fences that do not obey the cross-driver dma-fence protocol.
> > >>>
> > >>> Introduce long-running completion fences in the form of dma-fences, and add
> > >>> lockdep annotation for them. In particular:
> > >>>
> > >>> * Do not allow waiting under any memory management locks.
> > >>> * Do not allow attaching them to a dma-resv object.
> > >>> * Introduce a new interface for adding callbacks, making the helper adding
> > >>>   a callback sign off on the fact that it is aware that the dma-fence may
> > >>>   not complete anytime soon. Typically this will be the scheduler chaining
> > >>>   a new long-running fence on another one.
> > >>
> > >> Well that's pretty much what I tried before:
> > >> https://lwn.net/Articles/893704/
> > >>
> > >> And the reasons why it was rejected haven't changed.
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > > Yes, TBH this was mostly to get discussion going on how we'd best tackle
> > > this problem while being able to reuse the scheduler for long-running
> > > workloads.
> > >
> > > I couldn't see any clear decision on your series, though, but one main
> > > difference I see is that this is intended for driver-internal use
> > > only. (I'm counting using the drm_scheduler as a helper for
> > > driver-private use.) This is by no means a way to try to tackle the
> > > indefinite fence problem.
> >
> > Well this was just my latest try to tackle this, but essentially the
> > problems are the same as with your approach: when we express such
> > operations as dma_fence there is always the chance that we leak that
> > somewhere.
> >
> > My approach of adding a flag noting that this operation is dangerous and
> > can't be synced with something memory management depends on tried to
> > contain this as much as possible, but Daniel still pretty clearly
> > rejected it (for good reasons I think).
>
> Yeah, I still don't like dma_fences that somehow have totally different
> semantics in that critical piece of "will it complete or will it
> deadlock?" :-)

Not going to touch LR dma-fences in this reply, I think we can continue
the LR fence discussion in the fork of this thread I just responded to.
Have a response to the preempt fence discussion below.

> > > We could ofc invent a completely different data-type that abstracts
> > > the synchronization the scheduler needs in the long-running case, or
> > > each driver could hack something up, like sleeping in the
> > > prepare_job() or run_job() callback for throttling, but those waits
> > > should still be annotated one way or another (and probably in a
> > > similar way across drivers) to make sure we don't do anything bad.
> > >
> > > So any suggestions as to what would be the better solution here would
> > > be appreciated.
> >
> > Mhm, do we really need the GPU scheduler for that?
> >
> > I mean in the 1 to 1 case you basically just need a component which
> > collects the dependencies as dma_fence and if all of them are fulfilled
> > schedules a work item.
> >
> > As long as the work item itself doesn't produce a dma_fence it can then
> > still just wait for other non-dma_fence dependencies.
>
> Yeah that's the important thing, for long-running jobs dependencies as
> dma_fence should be totally fine. You're just not allowed to have any
> outgoing dma_fences at all (except the magic preemption fence).
>
> > Then the work function could submit the work and wait for the result.
> >
> > The work item would then pretty much represent what you want, you can
> > wait for it to finish and pass it along as a long-running dependency.
> >
> > Maybe give it a funky name and wrap it up in a structure, but that's
> > basically it.
>
> Like do we need this? If the kernel ever waits for a long-running
> compute job to finish I'd call that a bug. Any functional
> dependencies between engines or whatever are userspace's problem only,
> which it needs to sort out using userspace memory fences.
>
> The only things the kernel needs are some way to track dependencies as
> dma_fence (because memory management may move the memory away and we
> need to move it back in, ideally
On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > >
> > > On 4/4/23 15:10, Christian König wrote:
> > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > Hi, Christian,
> > > > >
> > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > From: Thomas Hellström
> > > > > > >
> > > > > > > For long-running workloads, drivers either need to open-code completion
> > > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > >
> > > > > > > So since for example the drm scheduler uses dma-fences it is desirable for
> > > > > > > a driver to be able to use it for throttling and error handling also with
> > > > > > > internal dma-fences that do not obey the cross-driver dma-fence protocol.
> > > > > > >
> > > > > > > Introduce long-running completion fences in the form of dma-fences, and add
> > > > > > > lockdep annotation for them. In particular:
> > > > > > >
> > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > * Introduce a new interface for adding callbacks, making the helper adding
> > > > > > >   a callback sign off on the fact that it is aware that the dma-fence may
> > > > > > >   not complete anytime soon. Typically this will be the scheduler chaining
> > > > > > >   a new long-running fence on another one.
> > > > > >
> > > > > > Well that's pretty much what I tried before:
> > > > > > https://lwn.net/Articles/893704/
> >
> > I don't think this is quite the same; this explicitly enforces that we don't
> > break the dma-fence rules (in the path of memory allocations, exported in
> > any way); essentially this is just a SW sync point reusing the dma-fence
> > infrastructure for signaling / callbacks. I believe your series tried to
> > export these fences to user space (admittedly I haven't fully read your
> > series).
> >
> > In this use case we essentially just want to flow control the ring via
> > the drm scheduler + maintain a list of pending jobs so the TDR can be
> > used for cleanup if an LR entity encounters an error. To me this seems
> > perfectly reasonable but I know dma-fence rules are akin to a holy war.
> >
> > If we return NULL in run_job, now we have to be able to sink all jobs
> > in the backend regardless of ring space, maintain a list of jobs pending
> > for cleanup after errors, and write a different cleanup path as now the
> > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > when the DRM scheduler provides all of this for us. Also if we go this
> > route, now all drivers are going to invent ways to handle LR jobs with
> > the DRM scheduler.
> >
> > This solution is pretty clear: mark the scheduler as LR, and don't
> > export any fences from the scheduler. If you try to export these fences
> > a blow up happens.
>
> The problem is if you mix things up. Like for resets you need all the
> schedulers on an engine/set-of-engines to be quiescent or things get
> potentially hilarious. If you now have a scheduler in forever limbo, the
> dma_fence guarantees are right out the window.

Right, a GT reset on Xe is:

- Stop all schedulers
- Do a reset
- Ban any schedulers which we think caused the GT reset
- Resubmit all schedulers which we think were good
- Restart all schedulers

None of this flow depends on LR dma-fences; all of this uses the DRM sched
infrastructure and works very well compared to the i915. Rewriting all this
with a driver-specific implementation is what we are trying to avoid.

Similarly, if an LR entity hangs on its own (not a GT reset, rather the
firmware does the reset for us) we use all the DRM scheduler infrastructure
to handle this. Again this works rather well...

> But the issue you're having is fairly specific if it's just about
> ringspace. I think the dumbest fix is to just block in submit if you run
> out of per-ctx ringspace, and call it a day. This notion that somehow the

How does that not break the dma-fence rules? A job can publish its
finished fence after ARM; if the finished fence waits on ring space that
may not free up in a reasonable amount of time, we have now broken the
dma-fence rules. My understanding is that any dma-fence must only depend
on other dma-fences; Christian seems to agree and NAK'd just blocking if
no space is available [1]. IMO this series ensures we don't break
dma-fence rules by restricting how the finished fence can be used.

> kernel is supposed to provide a
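The GT reset steps listed above can be sketched as pseudocode to make the ordering explicit. All names here (`FakeScheduler`, `gt_reset`, the callbacks) are hypothetical stand-ins for driver and DRM-sched entry points, not real kernel API:

```python
class FakeScheduler:
    """Minimal stand-in for a per-entity scheduler, recording calls."""
    def __init__(self, name):
        self.name = name
        self.log = []
    def stop(self): self.log.append("stop")
    def ban(self): self.log.append("ban")
    def resubmit_pending(self): self.log.append("resubmit")
    def start(self): self.log.append("start")

def gt_reset(schedulers, caused_reset, do_hw_reset):
    """Illustrative sketch of the GT reset flow described above."""
    for s in schedulers:            # 1. Stop all schedulers
        s.stop()
    do_hw_reset()                   # 2. Do a reset
    good = []
    for s in schedulers:            # 3. Ban schedulers blamed for the reset
        if caused_reset(s):
            s.ban()
        else:
            good.append(s)
    for s in good:                  # 4. Resubmit schedulers we think were good
        s.resubmit_pending()
    for s in good:                  # 5. Restart the surviving schedulers
        s.start()
    return good
```

Note that nothing in this flow waits on a long-running fence; it only stops, resets, and restarts scheduler instances, which is the point being made above.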
On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> >
> > On 4/4/23 15:10, Christian König wrote:
> > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > Hi, Christian,
> > > >
> > > > On 4/4/23 11:09, Christian König wrote:
> > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > From: Thomas Hellström
> > > > > >
> > > > > > For long-running workloads, drivers either need to open-code completion
> > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > >
> > > > > > So since for example the drm scheduler uses dma-fences it is desirable for
> > > > > > a driver to be able to use it for throttling and error handling also with
> > > > > > internal dma-fences that do not obey the cross-driver dma-fence protocol.
> > > > > >
> > > > > > Introduce long-running completion fences in the form of dma-fences, and add
> > > > > > lockdep annotation for them. In particular:
> > > > > >
> > > > > > * Do not allow waiting under any memory management locks.
> > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > * Introduce a new interface for adding callbacks, making the helper adding
> > > > > >   a callback sign off on the fact that it is aware that the dma-fence may
> > > > > >   not complete anytime soon. Typically this will be the scheduler chaining
> > > > > >   a new long-running fence on another one.
> > > > >
> > > > > Well that's pretty much what I tried before:
> > > > > https://lwn.net/Articles/893704/
>
> I don't think this is quite the same; this explicitly enforces that we don't
> break the dma-fence rules (in the path of memory allocations, exported in
> any way); essentially this is just a SW sync point reusing the dma-fence
> infrastructure for signaling / callbacks. I believe your series tried to
> export these fences to user space (admittedly I haven't fully read your
> series).
>
> In this use case we essentially just want to flow control the ring via
> the drm scheduler + maintain a list of pending jobs so the TDR can be
> used for cleanup if an LR entity encounters an error. To me this seems
> perfectly reasonable but I know dma-fence rules are akin to a holy war.
>
> If we return NULL in run_job, now we have to be able to sink all jobs
> in the backend regardless of ring space, maintain a list of jobs pending
> for cleanup after errors, and write a different cleanup path as now the
> TDR doesn't work. Seems very, very silly to duplicate all of this code
> when the DRM scheduler provides all of this for us. Also if we go this
> route, now all drivers are going to invent ways to handle LR jobs with
> the DRM scheduler.
>
> This solution is pretty clear: mark the scheduler as LR, and don't
> export any fences from the scheduler. If you try to export these fences
> a blow up happens.

The problem is if you mix things up. Like for resets you need all the
schedulers on an engine/set-of-engines to be quiescent or things get
potentially hilarious. If you now have a scheduler in forever limbo, the
dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about
ringspace. I think the dumbest fix is to just block in submit if you run
out of per-ctx ringspace, and call it a day. This notion that somehow the
kernel is supposed to provide a bottomless queue of anything userspace
submits simply doesn't hold up in reality (as much as userspace standards
committees would like it to), and as long as it doesn't have a real-world
perf impact it doesn't really matter why we end up blocking in the submit
ioctl. It might also be a simple memory allocation that hits a snag in
page reclaim.

> > > > >
> > > > > And the reasons why it was rejected haven't changed.
> > > > >
> > > > > Regards,
> > > > > Christian.
> > > > >
> > > > Yes, TBH this was mostly to get discussion going on how we'd best
> > > > tackle this problem while being able to reuse the scheduler for
> > > > long-running workloads.
> > > >
> > > > I couldn't see any clear decision on your series, though, but one
> > > > main difference I see is that this is intended for driver-internal
> > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > driver-private use.) This is by no means a way to try to tackle the
> > > > indefinite fence problem.
> > >
> > > Well this was just my latest try to tackle this, but essentially the
> > > problems are the same as with your approach: when we express such
> > > operations as dma_fence there is always the chance that we leak that
> > > somewhere.
> > >
> > > My approach of adding a flag noting that this operation is dangerous and
> > > can't be synced with
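Daniel's "dumbest fix" above (block in the submit ioctl until per-context ring space frees up, instead of publishing a fence that waits on ring space) can be modeled with a condition variable. A toy Python sketch with invented names, not driver code:

```python
import threading

class CtxRing:
    """Toy model of blocking in submit on per-context ring space.
    The throttling happens before any fence is published, so no
    dma-fence ever depends on ring space freeing up."""

    def __init__(self, slots):
        self.free = slots
        self.cond = threading.Condition()

    def submit(self, job):
        with self.cond:
            while self.free == 0:     # block in the submit path itself
                self.cond.wait()
            self.free -= 1
        job()                         # hand the job to the hardware/firmware

    def on_job_retired(self):
        with self.cond:               # ring space freed; wake a blocked submit
            self.free += 1
            self.cond.notify()
```

The design point being argued either way in the thread is visible here: the wait lives in the submission context (like a memory allocation stalling in reclaim), not in a fence's dependency chain.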
On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
>
> On 4/4/23 15:10, Christian König wrote:
> > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > Hi, Christian,
> > >
> > > On 4/4/23 11:09, Christian König wrote:
> > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > From: Thomas Hellström
> > > > >
> > > > > For long-running workloads, drivers either need to open-code completion
> > > > > waits, invent their own synchronization primitives or internally use
> > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > without any lockdep annotation all these approaches are error prone.
> > > > >
> > > > > So since for example the drm scheduler uses dma-fences it is desirable for
> > > > > a driver to be able to use it for throttling and error handling also with
> > > > > internal dma-fences that do not obey the cross-driver dma-fence protocol.
> > > > >
> > > > > Introduce long-running completion fences in the form of dma-fences, and add
> > > > > lockdep annotation for them. In particular:
> > > > >
> > > > > * Do not allow waiting under any memory management locks.
> > > > > * Do not allow attaching them to a dma-resv object.
> > > > > * Introduce a new interface for adding callbacks, making the helper adding
> > > > >   a callback sign off on the fact that it is aware that the dma-fence may
> > > > >   not complete anytime soon. Typically this will be the scheduler chaining
> > > > >   a new long-running fence on another one.
> > > >
> > > > Well that's pretty much what I tried before:
> > > > https://lwn.net/Articles/893704/

I don't think this is quite the same; this explicitly enforces that we don't
break the dma-fence rules (in the path of memory allocations, exported in
any way); essentially this is just a SW sync point reusing the dma-fence
infrastructure for signaling / callbacks. I believe your series tried to
export these fences to user space (admittedly I haven't fully read your
series).

In this use case we essentially just want to flow control the ring via the
drm scheduler + maintain a list of pending jobs so the TDR can be used for
cleanup if an LR entity encounters an error. To me this seems perfectly
reasonable but I know dma-fence rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs in
the backend regardless of ring space, maintain a list of jobs pending for
cleanup after errors, and write a different cleanup path as now the TDR
doesn't work. Seems very, very silly to duplicate all of this code when the
DRM scheduler provides all of this for us. Also if we go this route, now
all drivers are going to invent ways to handle LR jobs with the DRM
scheduler.

This solution is pretty clear: mark the scheduler as LR, and don't export
any fences from the scheduler. If you try to export these fences a blow up
happens.

> > > >
> > > > And the reasons why it was rejected haven't changed.
> > > >
> > > > Regards,
> > > > Christian.
> > > >
> > > Yes, TBH this was mostly to get discussion going on how we'd best
> > > tackle this problem while being able to reuse the scheduler for
> > > long-running workloads.
> > >
> > > I couldn't see any clear decision on your series, though, but one
> > > main difference I see is that this is intended for driver-internal
> > > use only. (I'm counting using the drm_scheduler as a helper for
> > > driver-private use.) This is by no means a way to try to tackle the
> > > indefinite fence problem.
> >
> > Well this was just my latest try to tackle this, but essentially the
> > problems are the same as with your approach: when we express such
> > operations as dma_fence there is always the chance that we leak that
> > somewhere.
> >
> > My approach of adding a flag noting that this operation is dangerous and
> > can't be synced with something memory management depends on tried to
> > contain this as much as possible, but Daniel still pretty clearly
> > rejected it (for good reasons I think).
> >
> > > We could ofc invent a completely different data-type that abstracts
> > > the synchronization the scheduler needs in the long-running case, or
> > > each driver could hack something up, like sleeping in the
> > > prepare_job() or run_job() callback for throttling, but those waits
> > > should still be annotated one way or another (and probably in a
> > > similar way across drivers) to make sure we don't do anything bad.
> > >
> > > So any suggestions as to what would be the better solution here
> > > would be appreciated.
> >
> > Mhm, do we really need the GPU scheduler for that?

I think we need to solve this within the DRM scheduler one way or another.

> > I mean in the 1 to 1 case you basically just need a component which
> > collects the dependencies as dma_fence and if all of them are fulfilled
> > schedules a work item.
> >
> > As long as the work item itself
On Tue, 4 Apr 2023 at 15:10, Christian König wrote:
>
> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > Hi, Christian,
> >
> > On 4/4/23 11:09, Christian König wrote:
> >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> >>> From: Thomas Hellström
> >>>
> >>> For long-running workloads, drivers either need to open-code completion
> >>> waits, invent their own synchronization primitives or internally use
> >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> >>> without any lockdep annotation all these approaches are error prone.
> >>>
> >>> So since for example the drm scheduler uses dma-fences it is desirable for
> >>> a driver to be able to use it for throttling and error handling also with
> >>> internal dma-fences that do not obey the cross-driver dma-fence protocol.
> >>>
> >>> Introduce long-running completion fences in the form of dma-fences, and add
> >>> lockdep annotation for them. In particular:
> >>>
> >>> * Do not allow waiting under any memory management locks.
> >>> * Do not allow attaching them to a dma-resv object.
> >>> * Introduce a new interface for adding callbacks, making the helper adding
> >>>   a callback sign off on the fact that it is aware that the dma-fence may
> >>>   not complete anytime soon. Typically this will be the scheduler chaining
> >>>   a new long-running fence on another one.
> >>
> >> Well that's pretty much what I tried before:
> >> https://lwn.net/Articles/893704/
> >>
> >> And the reasons why it was rejected haven't changed.
> >>
> >> Regards,
> >> Christian.
> >>
> > Yes, TBH this was mostly to get discussion going on how we'd best tackle
> > this problem while being able to reuse the scheduler for long-running
> > workloads.
> >
> > I couldn't see any clear decision on your series, though, but one main
> > difference I see is that this is intended for driver-internal use
> > only. (I'm counting using the drm_scheduler as a helper for
> > driver-private use.) This is by no means a way to try to tackle the
> > indefinite fence problem.
>
> Well this was just my latest try to tackle this, but essentially the
> problems are the same as with your approach: when we express such
> operations as dma_fence there is always the chance that we leak that
> somewhere.
>
> My approach of adding a flag noting that this operation is dangerous and
> can't be synced with something memory management depends on tried to
> contain this as much as possible, but Daniel still pretty clearly
> rejected it (for good reasons I think).

Yeah, I still don't like dma_fences that somehow have totally different
semantics in that critical piece of "will it complete or will it
deadlock?" :-)

> > We could ofc invent a completely different data-type that abstracts
> > the synchronization the scheduler needs in the long-running case, or
> > each driver could hack something up, like sleeping in the
> > prepare_job() or run_job() callback for throttling, but those waits
> > should still be annotated one way or another (and probably in a
> > similar way across drivers) to make sure we don't do anything bad.
> >
> > So any suggestions as to what would be the better solution here would
> > be appreciated.
>
> Mhm, do we really need the GPU scheduler for that?
>
> I mean in the 1 to 1 case you basically just need a component which
> collects the dependencies as dma_fence and if all of them are fulfilled
> schedules a work item.
>
> As long as the work item itself doesn't produce a dma_fence it can then
> still just wait for other non-dma_fence dependencies.

Yeah that's the important thing, for long-running jobs dependencies as
dma_fence should be totally fine. You're just not allowed to have any
outgoing dma_fences at all (except the magic preemption fence).

> Then the work function could submit the work and wait for the result.
>
> The work item would then pretty much represent what you want, you can
> wait for it to finish and pass it along as a long-running dependency.
>
> Maybe give it a funky name and wrap it up in a structure, but that's
> basically it.

Like do we need this? If the kernel ever waits for a long-running
compute job to finish I'd call that a bug. Any functional dependencies
between engines or whatever are userspace's problem only, which it needs
to sort out using userspace memory fences.

The only things the kernel needs are some way to track dependencies as
dma_fence (because memory management may move the memory away and we need
to move it back in, ideally pipelined). And it needs the special preempt
fence (if we don't have pagefaults) so that you have a fence to attach to
all the dma_resv for memory management purposes.

Now the scheduler already has almost all the pieces (at least if we
assume there's some magic fw which time-slices these contexts on its
own), and we just need a few minimal changes:

- allowing the scheduler to ignore the completion fence and just
  immediately push the next "job" in if its dependencies are
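The first minimal change listed above (the scheduler ignores the job's own completion fence and pushes the next job as soon as its dma_fence dependencies have signalled) can be sketched as follows. These are hypothetical names for illustration; a real version would be a mode of drm_sched, not new code:

```python
import threading
from collections import deque

class LongRunningEntity:
    """Sketch of a long-running entity: jobs are released once their
    dma_fence-like dependencies have signalled, WITHOUT waiting for
    the previous job to complete (the assumed firmware time-slices
    the contexts on its own)."""

    def __init__(self):
        self.queue = deque()

    def push_job(self, deps, run_job):
        self.queue.append((deps, run_job))

    def drain(self):
        while self.queue:
            deps, run_job = self.queue.popleft()
            for dep in deps:   # dependencies as dma_fence are fine...
                dep.wait()
            run_job()          # ...but we never wait on the job's own
                               # completion before pushing the next one
```

The contrast with the normal scheduler flow is the point: there is no completion fence between consecutive jobs, only incoming dependencies.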
On 4/4/23 15:10, Christian König wrote:
> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > Hi, Christian,
> >
> > On 4/4/23 11:09, Christian König wrote:
> > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > From: Thomas Hellström
> > > >
> > > > For long-running workloads, drivers either need to open-code completion
> > > > waits, invent their own synchronization primitives or internally use
> > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > without any lockdep annotation all these approaches are error prone.
> > > >
> > > > So since for example the drm scheduler uses dma-fences it is desirable for
> > > > a driver to be able to use it for throttling and error handling also with
> > > > internal dma-fences that do not obey the cross-driver dma-fence protocol.
> > > >
> > > > Introduce long-running completion fences in the form of dma-fences, and add
> > > > lockdep annotation for them. In particular:
> > > >
> > > > * Do not allow waiting under any memory management locks.
> > > > * Do not allow attaching them to a dma-resv object.
> > > > * Introduce a new interface for adding callbacks, making the helper adding
> > > >   a callback sign off on the fact that it is aware that the dma-fence may
> > > >   not complete anytime soon. Typically this will be the scheduler chaining
> > > >   a new long-running fence on another one.
> > >
> > > Well that's pretty much what I tried before:
> > > https://lwn.net/Articles/893704/
> > >
> > > And the reasons why it was rejected haven't changed.
> > >
> > > Regards,
> > > Christian.
> > >
> > Yes, TBH this was mostly to get discussion going on how we'd best tackle
> > this problem while being able to reuse the scheduler for long-running
> > workloads.
> >
> > I couldn't see any clear decision on your series, though, but one main
> > difference I see is that this is intended for driver-internal use
> > only. (I'm counting using the drm_scheduler as a helper for
> > driver-private use.) This is by no means a way to try to tackle the
> > indefinite fence problem.
>
> Well this was just my latest try to tackle this, but essentially the
> problems are the same as with your approach: when we express such
> operations as dma_fence there is always the chance that we leak that
> somewhere.
>
> My approach of adding a flag noting that this operation is dangerous and
> can't be synced with something memory management depends on tried to
> contain this as much as possible, but Daniel still pretty clearly
> rejected it (for good reasons I think).
>
> > We could ofc invent a completely different data-type that abstracts
> > the synchronization the scheduler needs in the long-running case, or
> > each driver could hack something up, like sleeping in the
> > prepare_job() or run_job() callback for throttling, but those waits
> > should still be annotated one way or another (and probably in a
> > similar way across drivers) to make sure we don't do anything bad.
> >
> > So any suggestions as to what would be the better solution here would
> > be appreciated.
>
> Mhm, do we really need the GPU scheduler for that?
>
> I mean in the 1 to 1 case you basically just need a component which
> collects the dependencies as dma_fence and if all of them are fulfilled
> schedules a work item.
>
> As long as the work item itself doesn't produce a dma_fence it can then
> still just wait for other non-dma_fence dependencies.
>
> Then the work function could submit the work and wait for the result.
>
> The work item would then pretty much represent what you want, you can
> wait for it to finish and pass it along as a long-running dependency.
>
> Maybe give it a funky name and wrap it up in a structure, but that's
> basically it.

This very much sounds like an i915_sw_fence for the dependency tracking and
a dma_fence_work for the actual work, although its completion fence is a
dma_fence. That goes against the whole idea, though, that a condition for
merging the xe driver would be that we implement some sort of minimal
scaffolding for long-running workloads in the drm scheduler, and the
thinking behind that is to avoid implementing intel-specific solutions like
those...

Thanks,
Thomas

> Regards,
> Christian.
>
> > Thanks,
> > Thomas
Am 04.04.23 um 14:54 schrieb Thomas Hellström: Hi, Christian, On 4/4/23 11:09, Christian König wrote: Am 04.04.23 um 02:22 schrieb Matthew Brost: From: Thomas Hellström For long-running workloads, drivers either need to open-code completion waits, invent their own synchronization primitives or internally use dma-fences that do not obey the cross-driver dma-fence protocol, but without any lockdep annotation all these approaches are error prone. So since for example the drm scheduler uses dma-fences it is desirable for a driver to be able to use it for throttling and error handling also with internal dma-fences tha do not obey the cros-driver dma-fence protocol. Introduce long-running completion fences in form of dma-fences, and add lockdep annotation for them. In particular: * Do not allow waiting under any memory management locks. * Do not allow to attach them to a dma-resv object. * Introduce a new interface for adding callbacks making the helper adding a callback sign off on that it is aware that the dma-fence may not complete anytime soon. Typically this will be the scheduler chaining a new long-running fence on another one. Well that's pretty much what I tried before: https://lwn.net/Articles/893704/ And the reasons why it was rejected haven't changed. Regards, Christian. Yes, TBH this was mostly to get discussion going how we'd best tackle this problem while being able to reuse the scheduler for long-running workloads. I couldn't see any clear decision on your series, though, but one main difference I see is that this is intended for driver-internal use only. (I'm counting using the drm_scheduler as a helper for driver-private use). This is by no means a way to try tackle the indefinite fence problem. Well this was just my latest try to tackle this, but essentially the problems are the same as with your approach: When we express such operations as dma_fence there is always the change that we leak that somewhere. 
My approach of adding a flag noting that this operation is dangerous and can't be synced with something memory management depends on tried to contain this as much as possible, but Daniel still pretty clearly rejected it (for good reasons, I think).

We could ofc invent a completely different data type that abstracts the synchronization the scheduler needs in the long-running case, or each driver could hack something up, like sleeping in the prepare_job() or run_job() callback for throttling, but those waits should still be annotated one way or another (and probably in a similar way across drivers) to make sure we don't do anything bad. So any suggestions as to what would be the better solution here would be appreciated.

Mhm, do we really need the GPU scheduler for that? I mean, in the 1 to 1 case you basically just need a component which collects the dependencies as dma_fence and, if all of them are fulfilled, schedules a work item. As long as the work item itself doesn't produce a dma_fence it can then still just wait for other non-dma_fence dependencies.

Then the work function could submit the work and wait for the result. The work item would then pretty much represent what you want: you can wait for it to finish and pass it along as a long-running dependency.

Maybe give it a funky name and wrap it up in a structure, but that's basically it.

Regards,
Christian.

Thanks,
Thomas
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

On 04.04.23 02:22, Matthew Brost wrote:

From: Thomas Hellström

For long-running workloads, drivers either need to open-code completion waits, invent their own synchronization primitives or internally use dma-fences that do not obey the cross-driver dma-fence protocol, but without any lockdep annotation all these approaches are error prone.

So since, for example, the drm scheduler uses dma-fences, it is desirable for a driver to be able to use it for throttling and error handling also with internal dma-fences that do not obey the cross-driver dma-fence protocol.

Introduce long-running completion fences in the form of dma-fences, and add lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks, making the helper adding a callback sign off on being aware that the dma-fence may not complete anytime soon. Typically this will be the scheduler chaining a new long-running fence on another one.

Well, that's pretty much what I tried before: https://lwn.net/Articles/893704/

And the reasons why it was rejected haven't changed.

Regards,
Christian.

Yes, TBH this was mostly to get discussion going on how we'd best tackle this problem while being able to reuse the scheduler for long-running workloads. I couldn't see any clear decision on your series, though, but one main difference I see is that this is intended for driver-internal use only. (I'm counting using the drm_scheduler as a helper for driver-private use.) This is by no means a way to try to tackle the indefinite fence problem.
We could ofc invent a completely different data type that abstracts the synchronization the scheduler needs in the long-running case, or each driver could hack something up, like sleeping in the prepare_job() or run_job() callback for throttling, but those waits should still be annotated one way or another (and probably in a similar way across drivers) to make sure we don't do anything bad. So any suggestions as to what would be the better solution here would be appreciated.

Thanks,
Thomas
Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences
On 04.04.23 02:22, Matthew Brost wrote:

From: Thomas Hellström

For long-running workloads, drivers either need to open-code completion waits, invent their own synchronization primitives or internally use dma-fences that do not obey the cross-driver dma-fence protocol, but without any lockdep annotation all these approaches are error prone.

So since, for example, the drm scheduler uses dma-fences, it is desirable for a driver to be able to use it for throttling and error handling also with internal dma-fences that do not obey the cross-driver dma-fence protocol.

Introduce long-running completion fences in the form of dma-fences, and add lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks, making the helper adding a callback sign off on being aware that the dma-fence may not complete anytime soon. Typically this will be the scheduler chaining a new long-running fence on another one.

Well, that's pretty much what I tried before: https://lwn.net/Articles/893704/

And the reasons why it was rejected haven't changed.

Regards,
Christian.

Signed-off-by: Matthew Brost
Signed-off-by: Thomas Hellström
---
 drivers/dma-buf/dma-fence.c | 142 ++--
 drivers/dma-buf/dma-resv.c  |   5 ++
 include/linux/dma-fence.h   |  55 +-
 3 files changed, 160 insertions(+), 42 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index f177c56269bb..9726b2a3c67d 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -111,6 +111,20 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1);
  * drivers/gpu should ever call dma_fence_wait() in such contexts.
  */
 
+/**
+ * DOC: Long-Running (lr) dma-fences.
+ *
+ * * Long-running dma-fences are NOT required to complete in reasonable time.
+ * Typically they signal completion of user-space controlled workloads and
+ * as such, need to never be part of a cross-driver contract, never waited
+ * for inside a kernel lock, nor attached to a dma-resv. There are helpers
+ * and warnings in place to help facilitate that that never happens.
+ *
+ * * The motivation for their existence is that helpers that are intended to
+ * be used by drivers may use dma-fences that, given the workloads mentioned
+ * above, become long-running.
+ */
+
 static const char *dma_fence_stub_get_name(struct dma_fence *fence)
 {
 	return "stub";
@@ -284,6 +298,34 @@ static struct lockdep_map dma_fence_lockdep_map = {
 	.name = "dma_fence_map"
 };
 
+static struct lockdep_map dma_fence_lr_lockdep_map = {
+	.name = "dma_fence_lr_map"
+};
+
+static bool __dma_fence_begin_signalling(struct lockdep_map *map)
+{
+	/* explicitly nesting ... */
+	if (lock_is_held_type(map, 1))
+		return true;
+
+	/* rely on might_sleep check for soft/hardirq locks */
+	if (in_atomic())
+		return true;
+
+	/* ... and non-recursive readlock */
+	lock_acquire(map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+	return false;
+}
+
+static void __dma_fence_end_signalling(bool cookie, struct lockdep_map *map)
+{
+	if (cookie)
+		return;
+
+	lock_release(map, _RET_IP_);
+}
+
 /**
  * dma_fence_begin_signalling - begin a critical DMA fence signalling section
  *
@@ -300,18 +342,7 @@ static struct lockdep_map dma_fence_lockdep_map = {
  */
 bool dma_fence_begin_signalling(void)
 {
-	/* explicitly nesting ... */
-	if (lock_is_held_type(&dma_fence_lockdep_map, 1))
-		return true;
-
-	/* rely on might_sleep check for soft/hardirq locks */
-	if (in_atomic())
-		return true;
-
-	/* ... and non-recursive readlock */
-	lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
-
-	return false;
+	return __dma_fence_begin_signalling(&dma_fence_lockdep_map);
 }
 EXPORT_SYMBOL(dma_fence_begin_signalling);
 
@@ -323,25 +354,61 @@ EXPORT_SYMBOL(dma_fence_begin_signalling);
  */
 void dma_fence_end_signalling(bool cookie)
 {
-	if (cookie)
-		return;
-
-	lock_release(&dma_fence_lockdep_map, _RET_IP_);
+	__dma_fence_end_signalling(cookie, &dma_fence_lockdep_map);
 }
 EXPORT_SYMBOL(dma_fence_end_signalling);
 
-void __dma_fence_might_wait(void)
+/**
+ * dma_fence_lr_begin_signalling - begin a critical long-running DMA fence
+ * signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_lr_end_signalling(). Ideally the section should encompass all
+ * locks that are ever required to signal a long-running dma-fence.
+ *
+ * Return: An opaque cookie needed by the implementation,