Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-06 Thread Daniel Vetter
On Thu, 6 Apr 2023 at 18:58, Matthew Brost  wrote:
>
> On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> > On Wed, Apr 05, 2023 at 11:58:44PM +, Matthew Brost wrote:
> > > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > > On Tue, Apr 04, 2023 at 07:48:27PM +, Matthew Brost wrote:
> > > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > > On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström 
> > > > > > > (Intel) wrote:
> > > > > > > >
> > > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > > > Hi, Christian,
> > > > > > > > > >
> > > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > > > From: Thomas Hellström 
> > > > > > > > > > > > 
> > > > > > > > > > > >
> > > > > > > > > > > > For long-running workloads, drivers either need to 
> > > > > > > > > > > > open-code
> > > > > > > > > > > > completion
> > > > > > > > > > > > waits, invent their own synchronization primitives or 
> > > > > > > > > > > > internally use
> > > > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence 
> > > > > > > > > > > > protocol, but
> > > > > > > > > > > > without any lockdep annotation all these approaches are 
> > > > > > > > > > > > error prone.
> > > > > > > > > > > >
> > > > > > > > > > > > So since for example the drm scheduler uses dma-fences 
> > > > > > > > > > > > it is
> > > > > > > > > > > > desirable for
> > > > > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > > > > handling also with
> > > > > > > > > > > > internal dma-fences that do not obey the cross-driver
> > > > > > > > > > > > dma-fence protocol.
> > > > > > > > > > > >
> > > > > > > > > > > > Introduce long-running completion fences in the form of
> > > > > > > > > > > > dma-fences, and add lockdep annotation for them. In particular:
> > > > > > > > > > > >
> > > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > > > > > * Introduce a new interface for adding callbacks, making the
> > > > > > > > > > > >   helper that adds a callback sign off on being aware that the
> > > > > > > > > > > >   dma-fence may not complete anytime soon. Typically this will
> > > > > > > > > > > >   be the scheduler chaining a new long-running fence on another one.
> > > > > > > > > > >
> > > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > > https://lwn.net/Articles/893704/
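(A minimal sketch of the opt-in interface described in the quoted commit message, for illustration only: the flag bit and the my_* names are assumptions, not the posted patch. The point is that LR fences reuse the dma-fence machinery for signalling, must never land in a dma-resv object, and callback users explicitly sign off on the fence possibly never completing in bounded time.)

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* Assumed driver-private marker bit; the real patch may do this differently. */
#define MY_DMA_FENCE_FLAG_LR_BIT	DMA_FENCE_FLAG_USER_BITS

static inline bool my_dma_fence_is_lr(struct dma_fence *fence)
{
	return test_bit(MY_DMA_FENCE_FLAG_LR_BIT, &fence->flags);
}

/* LR fences must never end up in a dma-resv object. */
static inline int my_dma_resv_add_fence(struct dma_resv *resv,
					struct dma_fence *fence,
					enum dma_resv_usage usage)
{
	if (WARN_ON_ONCE(my_dma_fence_is_lr(fence)))
		return -EINVAL;

	dma_resv_add_fence(resv, fence, usage);
	return 0;
}

/*
 * Separate entry point for adding a callback, so the caller explicitly
 * signs off on the fence possibly never signalling in bounded time.
 */
static inline int my_dma_fence_lr_add_callback(struct dma_fence *fence,
					       struct dma_fence_cb *cb,
					       dma_fence_func_t func)
{
	WARN_ON_ONCE(!my_dma_fence_is_lr(fence));
	return dma_fence_add_callback(fence, cb, func);
}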
> > > > > > > > > > >
> > > > > > >
> > > > > > > I don't think this is quite the same; this explicitly enforces that 
> > > > > > > we don't
> > > > > > > break the dma-fence rules (in the path of memory allocations, 
> > > > > > > exported in
> > > > > > > any way), essentially this is just a SW sync point reusing the 
> > > > > > > dma-fence
> > > > > > > infrastructure for signaling / callbacks. I believe your series 
> > > > > > > tried to
> > > > > > > export these fences to user space (admittedly I haven't fully 
> > > > > > > read your
> > > > > > > series).
> > > > > > >
> > > > > > > In this use case we essentially just want to flow control the 
> > > > > > > ring via
> > > > > > > the dma-scheduler + maintain a list of pending jobs so the TDR 
> > > > > > > can be
> > > > > > > used for cleanup if an LR entity encounters an error. To me this 
> > > > > > > seems
> > > > > > > perfectly reasonable but I know dma-fence rules are akin to a 
> > > > > > > holy war.
> > > > > > >
> > > > > > > If we return NULL in run_job, now we have to be able to sink all 
> > > > > > > jobs
> > > > > > > in the backend regardless of ring space, maintain a list of jobs 
> > > > > > > pending
> > > > > > > for cleanup after errors, and write a different cleanup path as 
> > > > > > > now the
> > > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this 
> > > > > > > code
> > > > > > > when the DRM scheduler provides all of this for us. Also if we go 
> > > > > > > this
> > > > > > > route, now all drivers are going to invent ways to handle LR jobs 
> > > > > > > with the
> > > > > > > DRM scheduler.
> > > > > > >
> > > > > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > > > > export any fences from the scheduler. If you try to export these 
> > > > > > > fences
> > > > > > > a blow up happens.
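(For illustration, the "mark the scheduler as LR and never export its fences" approach could look roughly like the driver glue below; the my_* names and structures are made up and the real Xe code is organized differently. run_job hands back a driver-internal LR fence, so the scheduler keeps flow-controlling the ring and the TDR keeps working, while nothing outside the driver ever sees that fence.)

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>

struct my_exec_queue;			/* opaque driver object (assumed) */

struct my_job {
	struct drm_sched_job base;
	struct my_exec_queue *queue;
	struct dma_fence *lr_fence;	/* driver-internal, never exported */
};

/* Assumed driver hooks, implemented elsewhere. */
void my_submit_to_ring(struct my_job *job);
void my_ban_queue(struct my_exec_queue *queue);
void my_job_free(struct my_job *job);

static struct dma_fence *my_lr_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	/* The scheduler has already flow-controlled the ring at this point. */
	my_submit_to_ring(job);

	/* Internal LR fence, signalled by the backend on completion/error. */
	return dma_fence_get(job->lr_fence);
}

static enum drm_gpu_sched_stat my_lr_timedout_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	/* TDR still works: clean up / ban the misbehaving LR entity. */
	my_ban_queue(job->queue);
	return DRM_GPU_SCHED_STAT_NOMINAL;
}

static void my_lr_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	dma_fence_put(job->lr_fence);
	drm_sched_job_cleanup(sched_job);
	my_job_free(job);
}

static const struct drm_sched_backend_ops my_lr_sched_ops = {
	.run_job	= my_lr_run_job,
	.timedout_job	= my_lr_timedout_job,
	.free_job	= my_lr_free_job,
};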
> > > > > >
> > > > > > The problem is if you mix things up. Like for resets you need all 
> > > > > > the
> > > > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > > > potentially hilarious. If you now 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-06 Thread Matthew Brost
On Thu, Apr 06, 2023 at 08:32:59AM +0200, Daniel Vetter wrote:
> On Wed, Apr 05, 2023 at 11:58:44PM +, Matthew Brost wrote:
> > On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:48:27PM +, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > > On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) 
> > > > > > wrote:
> > > > > > > 
> > > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > > Hi, Christian,
> > > > > > > > > 
> > > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > > From: Thomas Hellström 
> > > > > > > > > > > 
> > > > > > > > > > > For long-running workloads, drivers either need to 
> > > > > > > > > > > open-code
> > > > > > > > > > > completion
> > > > > > > > > > > waits, invent their own synchronization primitives or 
> > > > > > > > > > > internally use
> > > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence 
> > > > > > > > > > > protocol, but
> > > > > > > > > > > without any lockdep annotation all these approaches are 
> > > > > > > > > > > error prone.
> > > > > > > > > > > 
> > > > > > > > > > > So since for example the drm scheduler uses dma-fences it 
> > > > > > > > > > > is
> > > > > > > > > > > desirable for
> > > > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > > > handling also with
> > > > > > > > > > > internal dma-fences that do not obey the cross-driver
> > > > > > > > > > > dma-fence protocol.
> > > > > > > > > > >
> > > > > > > > > > > Introduce long-running completion fences in the form of
> > > > > > > > > > > dma-fences, and add lockdep annotation for them. In particular:
> > > > > > > > > > >
> > > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > > > > * Introduce a new interface for adding callbacks, making the
> > > > > > > > > > >   helper that adds a callback sign off on being aware that the
> > > > > > > > > > >   dma-fence may not complete anytime soon. Typically this will
> > > > > > > > > > >   be the scheduler chaining a new long-running fence on another one.
> > > > > > > > > > 
> > > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > > 
> > > > > > 
> > > > > > I don't think this is quite the same; this explicitly enforces that we 
> > > > > > don't
> > > > > > break the dma-fence rules (in the path of memory allocations, exported 
> > > > > > in
> > > > > > any way), essentially this is just a SW sync point reusing the dma-fence
> > > > > > infrastructure for signaling / callbacks. I believe your series 
> > > > > > tried to
> > > > > > export these fences to user space (admittedly I haven't fully read 
> > > > > > your
> > > > > > series).
> > > > > > 
> > > > > > In this use case we essentially just want to flow control the ring 
> > > > > > via
> > > > > > the dma-scheduler + maintain a list of pending jobs so the TDR can 
> > > > > > be
> > > > > > used for cleanup if an LR entity encounters an error. To me this seems
> > > > > > perfectly reasonable but I know dma-fence rules are akin to a holy 
> > > > > > war.
> > > > > > 
> > > > > > If we return NULL in run_job, now we have to be able to sink all 
> > > > > > jobs
> > > > > > in the backend regardless of ring space, maintain a list of jobs 
> > > > > > pending
> > > > > > for cleanup after errors, and write a different cleanup path as now 
> > > > > > the
> > > > > > TDR doesn't work. Seems very, very silly to duplicate all of this 
> > > > > > code
> > > > > > when the DRM scheduler provides all of this for us. Also if we go 
> > > > > > this
> > > > > > route, now all drivers are going to invent ways to handle LR jobs 
> > > > > > with the
> > > > > > DRM scheduler.
> > > > > > 
> > > > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > > > export any fences from the scheduler. If you try to export these 
> > > > > > fences
> > > > > > a blow up happens.
> > > > > 
> > > > > The problem is if you mix things up. Like for resets you need all the
> > > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > > potentially hilarious. If you now have a scheduler in forever limbo, 
> > > > > the
> > > > > dma_fence guarantees are right out the window.
> > > > > 
> > > > 
> > > > Right, a GT reset on Xe is:
> > > > 
> > > > Stop all schedulers
> > > > Do a reset
> > > > Ban any schedulers which we think caused the GT reset
> > > > Resubmit all schedulers which we 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-06 Thread Daniel Vetter
On Wed, Apr 05, 2023 at 11:58:44PM +, Matthew Brost wrote:
> On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 04, 2023 at 07:48:27PM +, Matthew Brost wrote:
> > > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > > On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) 
> > > > > wrote:
> > > > > > 
> > > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > > Hi, Christian,
> > > > > > > > 
> > > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > > From: Thomas Hellström 
> > > > > > > > > > 
> > > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > > completion
> > > > > > > > > > waits, invent their own synchronization primitives or 
> > > > > > > > > > internally use
> > > > > > > > > > dma-fences that do not obey the cross-driver dma-fence 
> > > > > > > > > > protocol, but
> > > > > > > > > > without any lockdep annotation all these approaches are 
> > > > > > > > > > error prone.
> > > > > > > > > > 
> > > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > > desirable for
> > > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > > handling also with
> > > > > > > > > > internal dma-fences that do not obey the cross-driver
> > > > > > > > > > dma-fence protocol.
> > > > > > > > > >
> > > > > > > > > > Introduce long-running completion fences in the form of
> > > > > > > > > > dma-fences, and add lockdep annotation for them. In particular:
> > > > > > > > > >
> > > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > > > * Introduce a new interface for adding callbacks, making the
> > > > > > > > > >   helper that adds a callback sign off on being aware that the
> > > > > > > > > >   dma-fence may not complete anytime soon. Typically this will
> > > > > > > > > >   be the scheduler chaining a new long-running fence on another one.
> > > > > > > > > 
> > > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > > 
> > > > > 
> > > > > I don't think this is quite the same; this explicitly enforces that we 
> > > > > don't
> > > > > break the dma-fence rules (in the path of memory allocations, exported in
> > > > > any way), essentially this is just a SW sync point reusing the dma-fence
> > > > > infrastructure for signaling / callbacks. I believe your series tried 
> > > > > to
> > > > > export these fences to user space (admittedly I haven't fully read 
> > > > > your
> > > > > series).
> > > > > 
> > > > > In this use case we essentially just want to flow control the ring via
> > > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > > used for cleanup if an LR entity encounters an error. To me this seems
> > > > > perfectly reasonable but I know dma-fence rules are akin to a holy 
> > > > > war.
> > > > > 
> > > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > > in the backend regardless of ring space, maintain a list of jobs 
> > > > > pending
> > > > > for cleanup after errors, and write a different cleanup path as now 
> > > > > the
> > > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > > route, now all drivers are going to invent ways to handle LR jobs with 
> > > > > the
> > > > > DRM scheduler.
> > > > > 
> > > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > > export any fences from the scheduler. If you try to export these 
> > > > > fences
> > > > > a blow up happens.
> > > > 
> > > > The problem is if you mix things up. Like for resets you need all the
> > > > schedulers on an engine/set-of-engines to quiesce or things get
> > > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > > dma_fence guarantees are right out the window.
> > > > 
> > > 
> > > Right, a GT reset on Xe is:
> > > 
> > > Stop all schedulers
> > > Do a reset
> > > Ban any schedulers which we think caused the GT reset
> > > Resubmit all schedulers which we think were good
> > > Restart all schedulers
> > > 
> > > None of this flow depends on LR dma-fences, all of this uses the DRM
> > > sched infrastructure and works very well compared to the i915. Rewriting
> > > all this with a driver specific implementation is what we are trying to
> > > avoid.
> > > 
> > > Similarly if an LR entity hangs on its own (not a GT reset, rather the
> > > firmware does the reset 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-05 Thread Matthew Brost
On Wed, Apr 05, 2023 at 03:09:08PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:48:27PM +, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) 
> > > > wrote:
> > > > > 
> > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > Hi, Christian,
> > > > > > > 
> > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > From: Thomas Hellström 
> > > > > > > > > 
> > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > completion
> > > > > > > > > waits, invent their own synchronization primitives or 
> > > > > > > > > internally use
> > > > > > > > > dma-fences that do not obey the cross-driver dma-fence 
> > > > > > > > > protocol, but
> > > > > > > > > without any lockdep annotation all these approaches are error 
> > > > > > > > > prone.
> > > > > > > > > 
> > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > desirable for
> > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > handling also with
> > > > > > > > > internal dma-fences that do not obey the cross-driver
> > > > > > > > > dma-fence protocol.
> > > > > > > > >
> > > > > > > > > Introduce long-running completion fences in the form of
> > > > > > > > > dma-fences, and add lockdep annotation for them. In particular:
> > > > > > > > >
> > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > > * Introduce a new interface for adding callbacks, making the
> > > > > > > > >   helper that adds a callback sign off on being aware that the
> > > > > > > > >   dma-fence may not complete anytime soon. Typically this will
> > > > > > > > >   be the scheduler chaining a new long-running fence on another one.
> > > > > > > > 
> > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > 
> > > > 
> > > > I don't think this is quite the same; this explicitly enforces that we don't
> > > > break the dma-fence rules (in the path of memory allocations, exported in
> > > > any way), essentially this is just a SW sync point reusing the dma-fence
> > > > infrastructure for signaling / callbacks. I believe your series tried to
> > > > export these fences to user space (admittedly I haven't fully read your
> > > > series).
> > > > 
> > > > In this use case we essentially just want to flow control the ring via
> > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > used for cleanup if an LR entity encounters an error. To me this seems
> > > > perfectly reasonable but I know dma-fence rules are akin to a holy war.
> > > > 
> > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > in the backend regardless of ring space, maintain a list of jobs pending
> > > > for cleanup after errors, and write a different cleanup path as now the
> > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > route, now all drivers are going to invent ways to handle LR jobs with the
> > > > DRM scheduler.
> > > > 
> > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > export any fences from the scheduler. If you try to export these fences
> > > > a blow up happens.
> > > 
> > > The problem is if you mix things up. Like for resets you need all the
> > > schedulers on an engine/set-of-engines to quiesce or things get
> > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > dma_fence guarantees are right out the window.
> > > 
> > 
> > Right, a GT reset on Xe is:
> > 
> > Stop all schedulers
> > Do a reset
> > Ban any schedulers which we think caused the GT reset
> > Resubmit all schedulers which we think were good
> > Restart all schedulers
> > 
> > None of this flow depends on LR dma-fences, all of this uses the DRM
> > sched infrastructure and works very well compared to the i915. Rewriting
> > all this with a driver specific implementation is what we are trying to
> > avoid.
> > 
> > Similarly if an LR entity hangs on its own (not a GT reset, rather the
> > firmware does the reset for us) we use all the DRM scheduler
> > infrastructure to handle this. Again this works rather well...
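(A rough sketch of how the quoted reset flow maps onto the existing drm_sched reset helpers; the my_* hooks and the per-array bookkeeping are assumptions, and the real driver tracks guilt and banning per exec queue.)

#include <drm/gpu_scheduler.h>

/* Assumed driver hooks. */
void my_hw_gt_reset(void);
bool my_sched_caused_hang(struct drm_gpu_scheduler *sched);
void my_ban_sched(struct drm_gpu_scheduler *sched);

static void my_gt_reset(struct drm_gpu_scheduler **sched, unsigned int count)
{
	unsigned int i;

	/* Stop all schedulers so nothing else touches the hardware. */
	for (i = 0; i < count; i++)
		drm_sched_stop(sched[i], NULL);

	/* Do the actual reset. */
	my_hw_gt_reset();

	for (i = 0; i < count; i++) {
		/* Ban the schedulers we think caused the GT reset... */
		if (my_sched_caused_hang(sched[i])) {
			my_ban_sched(sched[i]);
			continue;
		}
		/* ...resubmit the ones we think were good... */
		drm_sched_resubmit_jobs(sched[i]);
		/* ...and restart them. */
		drm_sched_start(sched[i], true);
	}
}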
> 
> Yeah this is why I don't think duplicating everything that long-running
> jobs need makes any sense. iow I agree with you.
> 

Glad we agree.

> > > But the issue you're having is fairly specific if it's just about
> > > ringspace. 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-05 Thread Christian König

Am 05.04.23 um 14:45 schrieb Daniel Vetter:

On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote:

Am 05.04.23 um 14:35 schrieb Thomas Hellström:

Hi,

On 4/4/23 21:25, Daniel Vetter wrote:

On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:

On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström
(Intel) wrote:

On 4/4/23 15:10, Christian König wrote:

Am 04.04.23 um 14:54 schrieb Thomas Hellström:

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström 

For long-running workloads, drivers either need to open-code
completion
waits, invent their own synchronization
primitives or internally use
dma-fences that do not obey the cross-driver
dma-fence protocol, but
without any lockdep annotation all these
approaches are error prone.

So since for example the drm scheduler uses dma-fences it is
desirable for
a driver to be able to use it for throttling and error
handling also with
internal dma-fences that do not obey the cross-driver
dma-fence protocol.

Introduce long-running completion fences in the form of
dma-fences, and add lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks, making the
  helper that adds a callback sign off on being aware that the
  dma-fence may not complete anytime soon. Typically this will
  be the scheduler chaining a new long-running fence on another one.

Well that's pretty much what I tried before:
https://lwn.net/Articles/893704/


I don't think this is quite the same; this explicitly enforces that
we don't
break the dma-fence rules (in the path of memory allocations, exported in
any way), essentially this is just a SW sync point reusing the dma-fence
infrastructure for signaling / callbacks. I believe your series
tried to
export these fences to user space (admittedly I haven't fully read your
series).

In this use case we essentially just want to flow control the ring via
the dma-scheduler + maintain a list of pending jobs so the TDR can be
used for cleanup if an LR entity encounters an error. To me this seems
perfectly reasonable but I know dma-fence rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs
in the backend regardless of ring space, maintain a list of jobs
pending
for cleanup after errors, and write a different cleanup path as now the
TDR doesn't work. Seems very, very silly to duplicate all of this code
when the DRM scheduler provides all of this for us. Also if we go this
route, now all drivers are going to invent ways to handle LR
jobs with the
DRM scheduler.

This solution is pretty clear, mark the scheduler as LR, and don't
export any fences from the scheduler. If you try to export these fences
a blow up happens.

The problem is if you mix things up. Like for resets you need all the
schedulers on an engine/set-of-engines to quiesce or things get
potentially hilarious. If you now have a scheduler in forever limbo, the
dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about
ringspace. I think the dumbest fix is to just block in submit if you run
out of per-ctx ringspace, and call it a day. This notion that
somehow the
kernel is supposed to provide a bottomless queue of anything userspace
submits simply doesn't hold up in reality (as much as userspace
standards
committees would like it to), and as long as it doesn't have a
real-world
perf impact it doesn't really matter why we end up blocking in the
submit
ioctl. It might also be a simple memory allocation that hits a snag in
page reclaim.

So it seems the discussion around the long-running synchronization
diverged a bit between threads and this thread was hijacked for
preempt-fences and userptr.

Do I understand it correctly that the recommendation from both Daniel
and Christian is to *not* use the drm scheduler for long-running compute
jobs, but track any internal dma-fence dependencies (pipelined clearing
or whatever) in a separate mechanism and handle unresolved dependencies
on other long-running jobs using -EWOULDBLOCK?

Yeah, I think that's a good summary.

If needed we could extract some scheduler functionality into separate
components, but the fundamental problem is that the GPU scheduler
provides a dma_fence interface to the outside to signal job completion and
Daniel and I seem to agree that you really don't want that.

I think I'm on something slightly different:

- For anything which semantically is not a dma_fence I agree it probably
   should be handled with EWOULDBLOCK and passed to userspace. Either with
   a submit thread or userspace memory fences. Note that in practice you
   will have a bunch of blocking left in the ioctl, stuff like mutexes or
   memory allocations when things get really tight and you end up in
   synchronous reclaim. Not any different 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-05 Thread Daniel Vetter
On Tue, Apr 04, 2023 at 07:48:27PM +, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> > On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > > 
> > > > On 4/4/23 15:10, Christian König wrote:
> > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > Hi, Christian,
> > > > > > 
> > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > From: Thomas Hellström 
> > > > > > > > 
> > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > completion
> > > > > > > > waits, invent their own synchronization primitives or 
> > > > > > > > internally use
> > > > > > > > dma-fences that do not obey the cross-driver dma-fence 
> > > > > > > > protocol, but
> > > > > > > > without any lockdep annotation all these approaches are error 
> > > > > > > > prone.
> > > > > > > > 
> > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > desirable for
> > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > handling also with
> > > > > > > > internal dma-fences that do not obey the cross-driver
> > > > > > > > dma-fence protocol.
> > > > > > > >
> > > > > > > > Introduce long-running completion fences in the form of
> > > > > > > > dma-fences, and add lockdep annotation for them. In particular:
> > > > > > > >
> > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > * Introduce a new interface for adding callbacks, making the
> > > > > > > >   helper that adds a callback sign off on being aware that the
> > > > > > > >   dma-fence may not complete anytime soon. Typically this will
> > > > > > > >   be the scheduler chaining a new long-running fence on another one.
> > > > > > > 
> > > > > > > Well that's pretty much what I tried before:
> > > > > > > https://lwn.net/Articles/893704/
> > > > > > > 
> > > 
> > > I don't think this is quite the same; this explicitly enforces that we don't
> > > break the dma-fence rules (in the path of memory allocations, exported in
> > > any way), essentially this is just a SW sync point reusing the dma-fence
> > > infrastructure for signaling / callbacks. I believe your series tried to
> > > export these fences to user space (admittedly I haven't fully read your
> > > series).
> > > 
> > > In this use case we essentially just want to flow control the ring via
> > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > used for cleanup if an LR entity encounters an error. To me this seems
> > > perfectly reasonable but I know dma-fence rules are akin to a holy war.
> > > 
> > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > in the backend regardless of ring space, maintain a list of jobs pending
> > > for cleanup after errors, and write a different cleanup path as now the
> > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > when the DRM scheduler provides all of this for us. Also if we go this
> > > route, now all drivers are going to invent ways to handle LR jobs with the
> > > DRM scheduler.
> > > 
> > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > export any fences from the scheduler. If you try to export these fences
> > > a blow up happens.
> > 
> > The problem is if you mix things up. Like for resets you need all the
> > schedulers on an engine/set-of-engines to quiesce or things get
> > potentially hilarious. If you now have a scheduler in forever limbo, the
> > dma_fence guarantees are right out the window.
> > 
> 
> Right, a GT reset on Xe is:
> 
> Stop all schedulers
> Do a reset
> Ban any schedulers which we think caused the GT reset
> Resubmit all schedulers which we think were good
> Restart all schedulers
> 
> None of this flow depends on LR dma-fences, all of this uses the DRM
> sched infrastructure and works very well compared to the i915. Rewriting
> all this with a driver specific implementation is what we are trying to
> avoid.
> 
> Similarly if an LR entity hangs on its own (not a GT reset, rather the
> firmware does the reset for us) we use all the DRM scheduler
> infrastructure to handle this. Again this works rather well...

Yeah this is why I don't think duplicating everything that long-running
jobs need makes any sense. iow I agree with you.

> > But the issue you're having is fairly specific if it's just about
> > ringspace. I think the dumbest fix is to just block in submit if you run
> > out of per-ctx ringspace, and call it a day. This notion that somehow the
> 
> How does that not break the dma-fence rules? A job can publish its
> finished fence after ARM, if the finished fence waits on ring
> 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-05 Thread Daniel Vetter
On Wed, Apr 05, 2023 at 02:39:35PM +0200, Christian König wrote:
> Am 05.04.23 um 14:35 schrieb Thomas Hellström:
> > Hi,
> > 
> > On 4/4/23 21:25, Daniel Vetter wrote:
> > > On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström
> > > > (Intel) wrote:
> > > > > On 4/4/23 15:10, Christian König wrote:
> > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > Hi, Christian,
> > > > > > > 
> > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > > > From: Thomas Hellström 
> > > > > > > > > 
> > > > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > > > completion
> > > > > > > > > waits, invent their own synchronization
> > > > > > > > > primitives or internally use
> > > > > > > > > dma-fences that do not obey the cross-driver
> > > > > > > > > dma-fence protocol, but
> > > > > > > > > without any lockdep annotation all these
> > > > > > > > > approaches are error prone.
> > > > > > > > > 
> > > > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > > > desirable for
> > > > > > > > > a driver to be able to use it for throttling and error
> > > > > > > > > handling also with
> > > > > > > > > internal dma-fences that do not obey the cross-driver
> > > > > > > > > dma-fence protocol.
> > > > > > > > >
> > > > > > > > > Introduce long-running completion fences in the form of
> > > > > > > > > dma-fences, and add lockdep annotation for them. In particular:
> > > > > > > > >
> > > > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > > > * Do not allow attaching them to a dma-resv object.
> > > > > > > > > * Introduce a new interface for adding callbacks, making the
> > > > > > > > >   helper that adds a callback sign off on being aware that the
> > > > > > > > >   dma-fence may not complete anytime soon. Typically this will
> > > > > > > > >   be the scheduler chaining a new long-running fence on another one.
> > > > > > > > Well that's pretty much what I tried before:
> > > > > > > > https://lwn.net/Articles/893704/
> > > > > > > > 
> > > > I don't think this is quite the same; this explicitly enforces that
> > > > we don't
> > > > break the dma-fence rules (in the path of memory allocations, exported in
> > > > any way), essentially this is just a SW sync point reusing the dma-fence
> > > > infrastructure for signaling / callbacks. I believe your series
> > > > tried to
> > > > export these fences to user space (admittedly I haven't fully read your
> > > > series).
> > > > 
> > > > In this use case we essentially just want to flow control the ring via
> > > > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > > > used for cleanup if an LR entity encounters an error. To me this seems
> > > > perfectly reasonable but I know dma-fence rules are akin to a holy war.
> > > > 
> > > > If we return NULL in run_job, now we have to be able to sink all jobs
> > > > in the backend regardless of ring space, maintain a list of jobs
> > > > pending
> > > > for cleanup after errors, and write a different cleanup path as now the
> > > > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > > > when the DRM scheduler provides all of this for us. Also if we go this
> > > > route, now all drivers are going to invent ways to handle LR
> > > > jobs with the
> > > > DRM scheduler.
> > > > 
> > > > This solution is pretty clear, mark the scheduler as LR, and don't
> > > > export any fences from the scheduler. If you try to export these fences
> > > > a blow up happens.
> > > The problem is if you mix things up. Like for resets you need all the
> > > schedulers on an engine/set-of-engines to quiesce or things get
> > > potentially hilarious. If you now have a scheduler in forever limbo, the
> > > dma_fence guarantees are right out the window.
> > > 
> > > But the issue you're having is fairly specific if it's just about
> > > ringspace. I think the dumbest fix is to just block in submit if you run
> > > out of per-ctx ringspace, and call it a day. This notion that
> > > somehow the
> > > kernel is supposed to provide a bottomless queue of anything userspace
> > > submits simply doesn't hold up in reality (as much as userspace
> > > standards
> > > committees would like it to), and as long as it doesn't have a
> > > real-world
> > > perf impact it doesn't really matter why we end up blocking in the
> > > submit
> > > ioctl. It might also be a simple memory allocation that hits a snag in
> > > page reclaim.
> > 
> > So it seems the discussion around the long-running synchronization
> > diverged a bit between threads and this thread was hijacked for
> > preempt-fences and userptr.
> > 
> > Do I understand it correctly that the recommendation from both Daniel
> > and Christian is 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-05 Thread Christian König

Am 05.04.23 um 14:35 schrieb Thomas Hellström:

Hi,

On 4/4/23 21:25, Daniel Vetter wrote:

On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) 
wrote:

On 4/4/23 15:10, Christian König wrote:

Am 04.04.23 um 14:54 schrieb Thomas Hellström:

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström 

For long-running workloads, drivers either need to open-code
completion
waits, invent their own synchronization primitives or 
internally use
dma-fences that do not obey the cross-driver dma-fence 
protocol, but
without any lockdep annotation all these approaches are error 
prone.


So since for example the drm scheduler uses dma-fences it is
desirable for
a driver to be able to use it for throttling and error
handling also with
internal dma-fences that do not obey the cross-driver
dma-fence protocol.

Introduce long-running completion fences in the form of
dma-fences, and add lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks, making the
  helper that adds a callback sign off on being aware that the
  dma-fence may not complete anytime soon. Typically this will
  be the scheduler chaining a new long-running fence on another one.

Well that's pretty much what I tried before:
https://lwn.net/Articles/893704/

I don't think this is quite the same; this explicitly enforces that we
don't
break the dma-fence rules (in the path of memory allocations, exported in
any way), essentially this is just a SW sync point reusing the dma-fence
infrastructure for signaling / callbacks. I believe your series 
tried to

export these fences to user space (admittedly I haven't fully read your
series).

In this use case we essentially just want to flow control the ring via
the dma-scheduler + maintain a list of pending jobs so the TDR can be
used for cleanup if an LR entity encounters an error. To me this seems
perfectly reasonable but I know dma-fence rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs
in the backend regardless of ring space, maintain a list of jobs 
pending

for cleanup after errors, and write a different cleanup path as now the
TDR doesn't work. Seems very, very silly to duplicate all of this code
when the DRM scheduler provides all of this for us. Also if we go this
route, now all drivers are going to invent ways to handle LR jobs with 
the

DRM scheduler.

This solution is pretty clear, mark the scheduler as LR, and don't
export any fences from the scheduler. If you try to export these fences
a blow up happens.

The problem is if you mix things up. Like for resets you need all the
schedulers on an engine/set-of-engines to quiesce or things get
potentially hilarious. If you now have a scheduler in forever limbo, the
dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about
ringspace. I think the dumbest fix is to just block in submit if you run
out of per-ctx ringspace, and call it a day. This notion that somehow 
the

kernel is supposed to provide a bottomless queue of anything userspace
submits simply doesn't hold up in reality (as much as userspace 
standards
committees would like it to), and as long as it doesn't have a 
real-world
perf impact it doesn't really matter why we end up blocking in the 
submit

ioctl. It might also be a simple memory allocation that hits a snag in
page reclaim.


So it seems the discussion around the long-running synchronization 
diverged a bit between threads and this thread was hijacked for 
preempt-fences and userptr.


Do I understand it correctly that the recommendation from both Daniel 
and Christian is to *not* use the drm scheduler for long-running 
compute jobs, but track any internal dma-fence dependencies (pipelined 
clearing or whatever) in a separate mechanism and handle unresolved 
dependencies on other long-running jobs using -EWOULDBLOCK?


Yeah, I think that's a good summary.

If needed we could extract some scheduler functionality into separate 
components, but the fundamental problem is that the GPU scheduler 
provides a dma_fence interface to the outside to signal job completion 
and Daniel and I seem to agree that you really don't want that.


Regards,
Christian.



Thanks,
Thomas






And the reasons why it was rejected haven't changed.

Regards,
Christian.


Yes, TBH this was mostly to get discussion going on how we'd best
tackle this problem while being able to reuse the scheduler for
long-running workloads.

I couldn't see any clear decision on your series, though. One
main difference I see is that this is intended for driver-internal
use only. (I'm counting using the drm_scheduler as a helper for
driver-private use). This is by no means a way to try tackle the

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-05 Thread Thomas Hellström

Hi,

On 4/4/23 21:25, Daniel Vetter wrote:

On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:

On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:

On 4/4/23 15:10, Christian König wrote:

Am 04.04.23 um 14:54 schrieb Thomas Hellström:

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström 

For long-running workloads, drivers either need to open-code
completion
waits, invent their own synchronization primitives or internally use
dma-fences that do not obey the cross-driver dma-fence protocol, but
without any lockdep annotation all these approaches are error prone.

So since for example the drm scheduler uses dma-fences it is
desirable for
a driver to be able to use it for throttling and error
handling also with
internal dma-fences that do not obey the cross-driver
dma-fence protocol.

Introduce long-running completion fences in the form of
dma-fences, and add lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow attaching them to a dma-resv object.
* Introduce a new interface for adding callbacks, making the
  helper that adds a callback sign off on being aware that the
  dma-fence may not complete anytime soon. Typically this will
  be the scheduler chaining a new long-running fence on another one.

Well that's pretty much what I tried before:
https://lwn.net/Articles/893704/


I don't think this is quite the same; this explicitly enforces that we don't
break the dma-fence rules (in the path of memory allocations, exported in
any way), essentially this is just a SW sync point reusing the dma-fence
infrastructure for signaling / callbacks. I believe your series tried to
export these fences to user space (admittedly I haven't fully read your
series).

In this use case we essentially just want to flow control the ring via
the dma-scheduler + maintain a list of pending jobs so the TDR can be
used for cleanup if an LR entity encounters an error. To me this seems
perfectly reasonable but I know dma-fence rules are akin to a holy war.

If we return NULL in run_job, now we have to be able to sink all jobs
in the backend regardless of ring space, maintain a list of jobs pending
for cleanup after errors, and write a different cleanup path as now the
TDR doesn't work. Seems very, very silly to duplicate all of this code
when the DRM scheduler provides all of this for us. Also if we go this
route, now all drivers are going to invent ways to handle LR jobs with the
DRM scheduler.

This solution is pretty clear, mark the scheduler as LR, and don't
export any fences from the scheduler. If you try to export these fences
a blow up happens.

The problem is if you mix things up. Like for resets you need all the
schedulers on an engine/set-of-engines to quiesce or things get
potentially hilarious. If you now have a scheduler in forever limbo, the
dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about
ringspace. I think the dumbest fix is to just block in submit if you run
out of per-ctx ringspace, and call it a day. This notion that somehow the
kernel is supposed to provide a bottomless queue of anything userspace
submits simply doesn't hold up in reality (as much as userspace standards
committees would like it to), and as long as it doesn't have a real-world
perf impact it doesn't really matter why we end up blocking in the submit
ioctl. It might also be a simple memory allocation that hits a snag in
page reclaim.


So it seems the discussion around the long-running synchronization 
diverged a bit between threads and this thread was hijacked for 
preempt-fences and userptr.


Do I understand it correctly that the recommendation from both Daniel 
and Christian is to *not* use the drm scheduler for long-running compute 
jobs, but track any internal dma-fence dependencies (pipelined clearing 
or whatever) in a separate mechanism and handle unresolved dependencies 
on other long-running jobs using -EWOULDBLOCK?


Thanks,
Thomas
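(A sketch of what the recommendation above could look like in a submit ioctl for a long-running queue, assuming made-up my_* helpers: block on ring space, since no dma-fence signalling depends on the ioctl returning, and push unresolved long-running dependencies back to userspace with -EWOULDBLOCK.)

#include <linux/errno.h>
#include <linux/wait.h>

struct my_lr_job;			/* opaque (assumed) */

struct my_lr_queue {
	wait_queue_head_t space_wq;	/* woken when ring space frees up */
};

/* Assumed driver helpers. */
bool my_ring_has_space(struct my_lr_queue *q);
bool my_lr_deps_resolved(struct my_lr_job *job);
int my_queue_job(struct my_lr_queue *q, struct my_lr_job *job);

/* Exec-ioctl fragment for a long-running (no exported dma-fence) queue. */
static int my_lr_exec(struct my_lr_queue *q, struct my_lr_job *job)
{
	int ret;

	/*
	 * Flow control: the kernel queue is not bottomless.  Blocking here
	 * is fine because no dma-fence signalling depends on this ioctl
	 * returning.
	 */
	ret = wait_event_interruptible(q->space_wq, my_ring_has_space(q));
	if (ret)
		return ret;		/* -ERESTARTSYS */

	/*
	 * Dependencies on other long-running jobs are not dma-fences, so
	 * they cannot be waited on here; hand them back to userspace.
	 */
	if (!my_lr_deps_resolved(job))
		return -EWOULDBLOCK;

	return my_queue_job(q, job);
}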






And the reasons why it was rejected haven't changed.

Regards,
Christian.


Yes, TBH this was mostly to get discussion going on how we'd best
tackle this problem while being able to reuse the scheduler for
long-running workloads.

I couldn't see any clear decision on your series, though. One
main difference I see is that this is intended for driver-internal
use only. (I'm counting using the drm_scheduler as a helper for
driver-private use). This is by no means a way to try to tackle the
indefinite fence problem.

Well this was just my latest try to tackle this, but essentially the
problems are the same as with your approach: When we express such
operations as dma_fence there is always the chance that we leak that
somewhere.

My approach of adding a flag noting that this operation is dangerous and
can't be synced with something memory management depends on tried to
contain this as much as 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 10:31:58PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 08:19:37PM +, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote:
> > > On Tue, 4 Apr 2023 at 22:04, Matthew Brost  
> > > wrote:
> > > >
> > > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > > > > On Tue, 4 Apr 2023 at 15:10, Christian König 
> > > > >  wrote:
> > > > > >
> > > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > > Hi, Christian,
> > > > > > >
> > > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > >>> From: Thomas Hellström 
> > > > > > >>>
> > > > > > >>> For long-running workloads, drivers either need to open-code 
> > > > > > >>> completion
> > > > > > >>> waits, invent their own synchronization primitives or 
> > > > > > >>> internally use
> > > > > > >>> dma-fences that do not obey the cross-driver dma-fence 
> > > > > > >>> protocol, but
> > > > > > >>> without any lockdep annotation all these approaches are error 
> > > > > > >>> prone.
> > > > > > >>>
> > > > > > >>> So since for example the drm scheduler uses dma-fences it is
> > > > > > >>> desirable for
> > > > > > >>> a driver to be able to use it for throttling and error handling 
> > > > > > >>> also
> > > > > > >>> with
> > > > > > >>> internal dma-fences that do not obey the cross-driver
> > > > > > >>> dma-fence protocol.
> > > > > > >>>
> > > > > > >>> Introduce long-running completion fences in the form of
> > > > > > >>> dma-fences, and add lockdep annotation for them. In particular:
> > > > > > >>>
> > > > > > >>> * Do not allow waiting under any memory management locks.
> > > > > > >>> * Do not allow attaching them to a dma-resv object.
> > > > > > >>> * Introduce a new interface for adding callbacks, making the
> > > > > > >>>   helper that adds a callback sign off on being aware that the
> > > > > > >>>   dma-fence may not complete anytime soon. Typically this will
> > > > > > >>>   be the scheduler chaining a new long-running fence on another one.
> > > > > > >>
> > > > > > >> Well that's pretty much what I tried before:
> > > > > > >> https://lwn.net/Articles/893704/
> > > > > > >>
> > > > > > >> And the reasons why it was rejected haven't changed.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Christian.
> > > > > > >>
> > > > > > > Yes, TBH this was mostly to get discussion going on how we'd best 
> > > > > > > tackle
> > > > > > > this problem while being able to reuse the scheduler for 
> > > > > > > long-running
> > > > > > > workloads.
> > > > > > >
> > > > > > > I couldn't see any clear decision on your series, though. One 
> > > > > > > main
> > > > > > > difference I see is that this is intended for driver-internal use
> > > > > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > > > > driver-private use). This is by no means a way to try to tackle the
> > > > > > > indefinite fence problem.
> > > > > >
> > > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > > problems are the same as with your approach: When we express such
> > > > > > operations as dma_fence there is always the chance that we leak that
> > > > > > somewhere.
> > > > > >
> > > > > > My approach of adding a flag noting that this operation is 
> > > > > > dangerous and
> > > > > > can't be synced with something memory management depends on tried to
> > > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > > rejected it (for good reasons I think).
> > > > >
> > > > > Yeah I still don't like dma_fence that somehow have totally different
> > > > > semantics in that critical piece of "will it complete or will it
> > > > > deadlock?" :-)
> > > >
> > > > Not going to touch LR dma-fences in this reply, I think we can continue
> > > > the LR fence discussion in the fork of this thread I just responded to.
> > > > Have a response to the preempt fence discussion below.
> > > >
> > > > > >
> > > > > > >
> > > > > > > We could ofc invent a completely different data-type that 
> > > > > > > abstracts
> > > > > > > the synchronization the scheduler needs in the long-running case, 
> > > > > > > or
> > > > > > > each driver could hack something up, like sleeping in the
> > > > > > > prepare_job() or run_job() callback for throttling, but those 
> > > > > > > waits
> > > > > > > should still be annotated one way or 
> > > > > > > another
> > > > > > > (and probably in a similar way across drivers) to make sure we 
> > > > > > > don't
> > > > > > > do anything bad.
> > > > > > >
> > > > > > >  So any suggestions as to what would be the better solution here 
> > > > > > > would
> > > > > > > be appreciated.
> > > > > >
> > > > > > Mhm, do we really need the GPU scheduler for that?
> > > > > >
> > > > > > I mean in the 1 to 1 case  you 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Daniel Vetter
On Tue, Apr 04, 2023 at 08:19:37PM +, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote:
> > On Tue, 4 Apr 2023 at 22:04, Matthew Brost  wrote:
> > >
> > > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > > > On Tue, 4 Apr 2023 at 15:10, Christian König  
> > > > wrote:
> > > > >
> > > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > > Hi, Christian,
> > > > > >
> > > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > >>> From: Thomas Hellström 
> > > > > >>>
> > > > > >>> For long-running workloads, drivers either need to open-code 
> > > > > >>> completion
> > > > > >>> waits, invent their own synchronization primitives or internally 
> > > > > >>> use
> > > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, 
> > > > > >>> but
> > > > > >>> without any lockdep annotation all these approaches are error 
> > > > > >>> prone.
> > > > > >>>
> > > > > >>> So since for example the drm scheduler uses dma-fences it is
> > > > > >>> desirable for
> > > > > >>> a driver to be able to use it for throttling and error handling 
> > > > > >>> also
> > > > > >>> with
> > > > > >>> internal dma-fences that do not obey the cross-driver
> > > > > >>> dma-fence protocol.
> > > > > >>>
> > > > > >>> Introduce long-running completion fences in the form of
> > > > > >>> dma-fences, and add lockdep annotation for them. In particular:
> > > > > >>>
> > > > > >>> * Do not allow waiting under any memory management locks.
> > > > > >>> * Do not allow attaching them to a dma-resv object.
> > > > > >>> * Introduce a new interface for adding callbacks, making the
> > > > > >>>   helper that adds a callback sign off on being aware that the
> > > > > >>>   dma-fence may not complete anytime soon. Typically this will
> > > > > >>>   be the scheduler chaining a new long-running fence on another one.
> > > > > >>
> > > > > >> Well that's pretty much what I tried before:
> > > > > >> https://lwn.net/Articles/893704/
> > > > > >>
> > > > > >> And the reasons why it was rejected haven't changed.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Christian.
> > > > > >>
> > > > > > Yes, TBH this was mostly to get discussion going on how we'd best 
> > > > > > tackle
> > > > > > this problem while being able to reuse the scheduler for 
> > > > > > long-running
> > > > > > workloads.
> > > > > >
> > > > > > I couldn't see any clear decision on your series, though. One 
> > > > > > main
> > > > > > difference I see is that this is intended for driver-internal use
> > > > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > > > driver-private use). This is by no means a way to try to tackle the
> > > > > > indefinite fence problem.
> > > > >
> > > > > Well this was just my latest try to tackle this, but essentially the
> > > > > problems are the same as with your approach: When we express such
> > > > > operations as dma_fence there is always the chance that we leak that
> > > > > somewhere.
> > > > >
> > > > > My approach of adding a flag noting that this operation is dangerous 
> > > > > and
> > > > > can't be synced with something memory management depends on tried to
> > > > > contain this as much as possible, but Daniel still pretty clearly
> > > > > rejected it (for good reasons I think).
> > > >
> > > > Yeah I still don't like dma_fence that somehow have totally different
> > > > semantics in that critical piece of "will it complete or will it
> > > > deadlock?" :-)
> > >
> > > Not going to touch LR dma-fences in this reply, I think we can continue
> > > the LR fence discussion in the fork of this thread I just responded to.
> > > Have a response to the preempt fence discussion below.
> > >
> > > > >
> > > > > >
> > > > > > We could ofc invent a completely different data-type that abstracts
> > > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > > each driver could hack something up, like sleeping in the
> > > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > > should still be annotated one way or another
> > > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > > do anything bad.
> > > > > >
> > > > > >  So any suggestions as to what would be the better solution here 
> > > > > > would
> > > > > > be appreciated.
> > > > >
> > > > > Mhm, do we really need the GPU scheduler for that?
> > > > >
> > > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > > collects the dependencies as dma_fence and if all of them are 
> > > > > fulfilled
> > > > > schedules a work item.
> > > > >
> > > > > As long as the work item itself doesn't produce a dma_fence it can 
> > > > > then
> > > > > still just wait for other non-dma_fence dependencies.
> > > >
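(A bare-bones sketch of such a component, assuming nothing beyond dma-fence callbacks and a workqueue; all names are illustrative. It collects dma_fence dependencies and queues a work item once they have all signalled, and the work item itself never produces a dma_fence.)

#include <linux/atomic.h>
#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

#define MY_MAX_DEPS 8			/* enough for this sketch */

struct my_deps;

struct my_dep_cb {
	struct dma_fence_cb base;
	struct my_deps *deps;
};

struct my_deps {
	atomic_t pending;		/* unsignalled fences + 1 arming ref */
	struct work_struct work;	/* runs once everything has signalled */
	struct my_dep_cb cbs[MY_MAX_DEPS];
};

static void my_deps_put(struct my_deps *deps)
{
	if (atomic_dec_and_test(&deps->pending))
		queue_work(system_unbound_wq, &deps->work);
}

static void my_dep_signalled(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct my_dep_cb *dcb = container_of(cb, struct my_dep_cb, base);

	my_deps_put(dcb->deps);
}

/* Arm @deps: run @func once all @fences have signalled. */
static int my_deps_arm(struct my_deps *deps, struct dma_fence **fences,
		       unsigned int count, work_func_t func)
{
	unsigned int i;

	if (count > MY_MAX_DEPS)
		return -EINVAL;

	INIT_WORK(&deps->work, func);
	/* +1 so the work cannot fire while we are still adding callbacks. */
	atomic_set(&deps->pending, count + 1);

	for (i = 0; i < count; i++) {
		deps->cbs[i].deps = deps;
		if (dma_fence_add_callback(fences[i], &deps->cbs[i].base,
					   my_dep_signalled))
			my_deps_put(deps);	/* fence already signalled */
	}

	my_deps_put(deps);			/* drop the arming reference */
	return 0;
}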
> > > > Yeah that's the 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 10:11:59PM +0200, Daniel Vetter wrote:
> On Tue, 4 Apr 2023 at 22:04, Matthew Brost  wrote:
> >
> > On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > > On Tue, 4 Apr 2023 at 15:10, Christian König  
> > > wrote:
> > > >
> > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > Hi, Christian,
> > > > >
> > > > > On 4/4/23 11:09, Christian König wrote:
> > > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > >>> From: Thomas Hellström 
> > > > >>>
> > > > >>> For long-running workloads, drivers either need to open-code 
> > > > >>> completion
> > > > >>> waits, invent their own synchronization primitives or internally use
> > > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > >>> without any lockdep annotation all these approaches are error prone.
> > > > >>>
> > > > >>> So since for example the drm scheduler uses dma-fences it is
> > > > >>> desirable for
> > > > >>> a driver to be able to use it for throttling and error handling also
> > > > >>> with
> > > > >>> internal dma-fences tha do not obey the cros-driver dma-fence 
> > > > >>> protocol.
> > > > >>>
> > > > >>> Introduce long-running completion fences in form of dma-fences, and 
> > > > >>> add
> > > > >>> lockdep annotation for them. In particular:
> > > > >>>
> > > > >>> * Do not allow waiting under any memory management locks.
> > > > >>> * Do not allow to attach them to a dma-resv object.
> > > > >>> * Introduce a new interface for adding callbacks making the helper
> > > > >>> adding
> > > > >>>a callback sign off on that it is aware that the dma-fence may 
> > > > >>> not
> > > > >>>complete anytime soon. Typically this will be the scheduler 
> > > > >>> chaining
> > > > >>>a new long-running fence on another one.
> > > > >>
> > > > >> Well that's pretty much what I tried before:
> > > > >> https://lwn.net/Articles/893704/
> > > > >>
> > > > >> And the reasons why it was rejected haven't changed.
> > > > >>
> > > > >> Regards,
> > > > >> Christian.
> > > > >>
> > > > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > > > this problem while being able to reuse the scheduler for long-running
> > > > > workloads.
> > > > >
> > > > > I couldn't see any clear decision on your series, though, but one main
> > > > > difference I see is that this is intended for driver-internal use
> > > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > > driver-private use). This is by no means a way to try tackle the
> > > > > indefinite fence problem.
> > > >
> > > > Well this was just my latest try to tackle this, but essentially the
> > > > problems are the same as with your approach: When we express such
> > > > operations as dma_fence there is always the change that we leak that
> > > > somewhere.
> > > >
> > > > My approach of adding a flag noting that this operation is dangerous and
> > > > can't be synced with something memory management depends on tried to
> > > > contain this as much as possible, but Daniel still pretty clearly
> > > > rejected it (for good reasons I think).
> > >
> > > Yeah I still don't like dma_fence that somehow have totally different
> > > semantics in that critical piece of "will it complete or will it
> > > deadlock?" :-)
> >
> > Not going to touch LR dma-fences in this reply, I think we can continue
> > the LR fence discussion of the fork of this thread I just responded to.
> > Have a response the preempt fence discussion below.
> >
> > > >
> > > > >
> > > > > We could ofc invent a completely different data-type that abstracts
> > > > > the synchronization the scheduler needs in the long-running case, or
> > > > > each driver could hack something up, like sleeping in the
> > > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > > should still be annotated in one way or annotated one way or another
> > > > > (and probably in a similar way across drivers) to make sure we don't
> > > > > do anything bad.
> > > > >
> > > > >  So any suggestions as to what would be the better solution here would
> > > > > be appreciated.
> > > >
> > > > Mhm, do we really the the GPU scheduler for that?
> > > >
> > > > I mean in the 1 to 1 case  you basically just need a component which
> > > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > > schedules a work item.
> > > >
> > > > As long as the work item itself doesn't produce a dma_fence it can then
> > > > still just wait for other none dma_fence dependencies.
> > >
> > > Yeah that's the important thing, for long-running jobs dependencies as
> > > dma_fence should be totally fine. You're just not allowed to have any
> > > outgoing dma_fences at all (except the magic preemption fence).
> > >
> > > > Then the work function could submit the work and wait for the result.
> > > >
> > > > The work item would then pretty much represent what you want, you can
> > > > wait for it to finish and pass it 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Daniel Vetter
On Tue, 4 Apr 2023 at 22:04, Matthew Brost  wrote:
>
> On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> > On Tue, 4 Apr 2023 at 15:10, Christian König  
> > wrote:
> > >
> > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > Hi, Christian,
> > > >
> > > > On 4/4/23 11:09, Christian König wrote:
> > > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > >>> From: Thomas Hellström 
> > > >>>
> > > >>> For long-running workloads, drivers either need to open-code 
> > > >>> completion
> > > >>> waits, invent their own synchronization primitives or internally use
> > > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > >>> without any lockdep annotation all these approaches are error prone.
> > > >>>
> > > >>> So since for example the drm scheduler uses dma-fences it is
> > > >>> desirable for
> > > >>> a driver to be able to use it for throttling and error handling also
> > > >>> with
> > > >>> internal dma-fences tha do not obey the cros-driver dma-fence 
> > > >>> protocol.
> > > >>>
> > > >>> Introduce long-running completion fences in form of dma-fences, and 
> > > >>> add
> > > >>> lockdep annotation for them. In particular:
> > > >>>
> > > >>> * Do not allow waiting under any memory management locks.
> > > >>> * Do not allow to attach them to a dma-resv object.
> > > >>> * Introduce a new interface for adding callbacks making the helper
> > > >>> adding
> > > >>>a callback sign off on that it is aware that the dma-fence may not
> > > >>>complete anytime soon. Typically this will be the scheduler 
> > > >>> chaining
> > > >>>a new long-running fence on another one.
> > > >>
> > > >> Well that's pretty much what I tried before:
> > > >> https://lwn.net/Articles/893704/
> > > >>
> > > >> And the reasons why it was rejected haven't changed.
> > > >>
> > > >> Regards,
> > > >> Christian.
> > > >>
> > > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > > this problem while being able to reuse the scheduler for long-running
> > > > workloads.
> > > >
> > > > I couldn't see any clear decision on your series, though, but one main
> > > > difference I see is that this is intended for driver-internal use
> > > > only. (I'm counting using the drm_scheduler as a helper for
> > > > driver-private use). This is by no means a way to try tackle the
> > > > indefinite fence problem.
> > >
> > > Well this was just my latest try to tackle this, but essentially the
> > > problems are the same as with your approach: When we express such
> > > operations as dma_fence there is always the change that we leak that
> > > somewhere.
> > >
> > > My approach of adding a flag noting that this operation is dangerous and
> > > can't be synced with something memory management depends on tried to
> > > contain this as much as possible, but Daniel still pretty clearly
> > > rejected it (for good reasons I think).
> >
> > Yeah I still don't like dma_fence that somehow have totally different
> > semantics in that critical piece of "will it complete or will it
> > deadlock?" :-)
>
> Not going to touch LR dma-fences in this reply, I think we can continue
> the LR fence discussion of the fork of this thread I just responded to.
> Have a response the preempt fence discussion below.
>
> > >
> > > >
> > > > We could ofc invent a completely different data-type that abstracts
> > > > the synchronization the scheduler needs in the long-running case, or
> > > > each driver could hack something up, like sleeping in the
> > > > prepare_job() or run_job() callback for throttling, but those waits
> > > > should still be annotated in one way or annotated one way or another
> > > > (and probably in a similar way across drivers) to make sure we don't
> > > > do anything bad.
> > > >
> > > >  So any suggestions as to what would be the better solution here would
> > > > be appreciated.
> > >
> > > Mhm, do we really the the GPU scheduler for that?
> > >
> > > I mean in the 1 to 1 case  you basically just need a component which
> > > collects the dependencies as dma_fence and if all of them are fulfilled
> > > schedules a work item.
> > >
> > > As long as the work item itself doesn't produce a dma_fence it can then
> > > still just wait for other none dma_fence dependencies.
> >
> > Yeah that's the important thing, for long-running jobs dependencies as
> > dma_fence should be totally fine. You're just not allowed to have any
> > outgoing dma_fences at all (except the magic preemption fence).
> >
> > > Then the work function could submit the work and wait for the result.
> > >
> > > The work item would then pretty much represent what you want, you can
> > > wait for it to finish and pass it along as long running dependency.
> > >
> > > Maybe give it a funky name and wrap it up in a structure, but that's
> > > basically it.
> >
> > Like do we need this? If the kernel ever waits for a long-running
> > compute job to finnish I'd call that a bug. Any functional
> > dependencies 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 09:00:59PM +0200, Daniel Vetter wrote:
> On Tue, 4 Apr 2023 at 15:10, Christian König  wrote:
> >
> > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > Hi, Christian,
> > >
> > > On 4/4/23 11:09, Christian König wrote:
> > >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > >>> From: Thomas Hellström 
> > >>>
> > >>> For long-running workloads, drivers either need to open-code completion
> > >>> waits, invent their own synchronization primitives or internally use
> > >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> > >>> without any lockdep annotation all these approaches are error prone.
> > >>>
> > >>> So since for example the drm scheduler uses dma-fences it is
> > >>> desirable for
> > >>> a driver to be able to use it for throttling and error handling also
> > >>> with
> > >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> > >>>
> > >>> Introduce long-running completion fences in form of dma-fences, and add
> > >>> lockdep annotation for them. In particular:
> > >>>
> > >>> * Do not allow waiting under any memory management locks.
> > >>> * Do not allow to attach them to a dma-resv object.
> > >>> * Introduce a new interface for adding callbacks making the helper
> > >>> adding
> > >>>a callback sign off on that it is aware that the dma-fence may not
> > >>>complete anytime soon. Typically this will be the scheduler chaining
> > >>>a new long-running fence on another one.
> > >>
> > >> Well that's pretty much what I tried before:
> > >> https://lwn.net/Articles/893704/
> > >>
> > >> And the reasons why it was rejected haven't changed.
> > >>
> > >> Regards,
> > >> Christian.
> > >>
> > > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > > this problem while being able to reuse the scheduler for long-running
> > > workloads.
> > >
> > > I couldn't see any clear decision on your series, though, but one main
> > > difference I see is that this is intended for driver-internal use
> > > only. (I'm counting using the drm_scheduler as a helper for
> > > driver-private use). This is by no means a way to try tackle the
> > > indefinite fence problem.
> >
> > Well this was just my latest try to tackle this, but essentially the
> > problems are the same as with your approach: When we express such
> > operations as dma_fence there is always the change that we leak that
> > somewhere.
> >
> > My approach of adding a flag noting that this operation is dangerous and
> > can't be synced with something memory management depends on tried to
> > contain this as much as possible, but Daniel still pretty clearly
> > rejected it (for good reasons I think).
> 
> Yeah I still don't like dma_fence that somehow have totally different
> semantics in that critical piece of "will it complete or will it
> deadlock?" :-)

Not going to touch LR dma-fences in this reply; I think we can continue
the LR fence discussion in the fork of this thread I just responded to.
I have a response to the preempt fence discussion below.

> >
> > >
> > > We could ofc invent a completely different data-type that abstracts
> > > the synchronization the scheduler needs in the long-running case, or
> > > each driver could hack something up, like sleeping in the
> > > prepare_job() or run_job() callback for throttling, but those waits
> > > should still be annotated in one way or annotated one way or another
> > > (and probably in a similar way across drivers) to make sure we don't
> > > do anything bad.
> > >
> > >  So any suggestions as to what would be the better solution here would
> > > be appreciated.
> >
> > Mhm, do we really the the GPU scheduler for that?
> >
> > I mean in the 1 to 1 case  you basically just need a component which
> > collects the dependencies as dma_fence and if all of them are fulfilled
> > schedules a work item.
> >
> > As long as the work item itself doesn't produce a dma_fence it can then
> > still just wait for other none dma_fence dependencies.
> 
> Yeah that's the important thing, for long-running jobs dependencies as
> dma_fence should be totally fine. You're just not allowed to have any
> outgoing dma_fences at all (except the magic preemption fence).
> 
> > Then the work function could submit the work and wait for the result.
> >
> > The work item would then pretty much represent what you want, you can
> > wait for it to finish and pass it along as long running dependency.
> >
> > Maybe give it a funky name and wrap it up in a structure, but that's
> > basically it.
> 
> Like do we need this? If the kernel ever waits for a long-running
> compute job to finnish I'd call that a bug. Any functional
> dependencies between engines or whatever are userspace's problem only,
> which it needs to sort out using userspace memory fences.
> 
> The only things the kernel needs are some way to track dependencies as
> dma_fence (because memory management move the memory away and we need
> to move it back in, ideally 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 09:25:52PM +0200, Daniel Vetter wrote:
> On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > > 
> > > On 4/4/23 15:10, Christian König wrote:
> > > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > > Hi, Christian,
> > > > > 
> > > > > On 4/4/23 11:09, Christian König wrote:
> > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > > From: Thomas Hellström 
> > > > > > > 
> > > > > > > For long-running workloads, drivers either need to open-code
> > > > > > > completion
> > > > > > > waits, invent their own synchronization primitives or internally 
> > > > > > > use
> > > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, 
> > > > > > > but
> > > > > > > without any lockdep annotation all these approaches are error 
> > > > > > > prone.
> > > > > > > 
> > > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > > desirable for
> > > > > > > a driver to be able to use it for throttling and error
> > > > > > > handling also with
> > > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > > dma-fence protocol.
> > > > > > > 
> > > > > > > Introduce long-running completion fences in form of
> > > > > > > dma-fences, and add
> > > > > > > lockdep annotation for them. In particular:
> > > > > > > 
> > > > > > > * Do not allow waiting under any memory management locks.
> > > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > > helper adding
> > > > > > >    a callback sign off on that it is aware that the dma-fence may 
> > > > > > > not
> > > > > > >    complete anytime soon. Typically this will be the
> > > > > > > scheduler chaining
> > > > > > >    a new long-running fence on another one.
> > > > > > 
> > > > > > Well that's pretty much what I tried before:
> > > > > > https://lwn.net/Articles/893704/
> > > > > > 
> > 
> > I don't think this quite the same, this explictly enforces that we don't
> > break the dma-fence rules (in path of memory allocations, exported in
> > any way), essentially this just SW sync point reusing dma-fence the
> > infrastructure for signaling / callbacks. I believe your series tried to
> > export these fences to user space (admittedly I haven't fully read your
> > series).
> > 
> > In this use case we essentially just want to flow control the ring via
> > the dma-scheduler + maintain a list of pending jobs so the TDR can be
> > used for cleanup if LR entity encounters an error. To me this seems
> > perfectly reasonable but I know dma-femce rules are akin to a holy war.
> > 
> > If we return NULL in run_job, now we have to be able to sink all jobs
> > in the backend regardless on ring space, maintain a list of jobs pending
> > for cleanup after errors, and write a different cleanup path as now the
> > TDR doesn't work. Seems very, very silly to duplicate all of this code
> > when the DRM scheduler provides all of this for us. Also if we go this
> > route, now all drivers are going to invent ways to handle LR jobs /w the
> > DRM scheduler.
> > 
> > This solution is pretty clear, mark the scheduler as LR, and don't
> > export any fences from the scheduler. If you try to export these fences
> > a blow up happens.
> 
> The problem is if you mix things up. Like for resets you need all the
> schedulers on an engine/set-of-engines to quiescent or things get
> potentially hilarious. If you now have a scheduler in forever limbo, the
> dma_fence guarantees are right out the window.
> 

Right, a GT reset on Xe is:

Stop all schedulers
Do a reset
Ban any schedulers which we think caused the GT reset
Resubmit all schedulers which we think were good
Restart all schedulers

None of this flow depends on LR dma-fences; all of it uses the DRM
sched infrastructure and works very well compared to the i915. Rewriting
all of this with a driver-specific implementation is what we are trying
to avoid.

Similarly, if an LR entity hangs on its own (not a GT reset; rather, the
firmware does the reset for us), we use all the DRM scheduler
infrastructure to handle this. Again, this works rather well...
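
(For illustration, a rough sketch of that flow in terms of the existing DRM
scheduler helpers. The example_* names are made up; only drm_sched_stop(),
drm_sched_resubmit_jobs() and drm_sched_start() are the real API, and a
real driver obviously has more locking and bookkeeping.)

#include <drm/gpu_scheduler.h>

/* Hypothetical driver-side container, for illustration only. */
struct example_gt {
        struct drm_gpu_scheduler **scheds;
        unsigned int num_scheds;
};

/* Driver specifics assumed to exist elsewhere. */
void example_hw_reset(struct example_gt *gt);
bool example_sched_is_guilty(struct drm_gpu_scheduler *sched);

static void example_gt_reset(struct example_gt *gt)
{
        unsigned int i;

        /* 1) Stop all schedulers so nothing new reaches the hardware. */
        for (i = 0; i < gt->num_scheds; i++)
                drm_sched_stop(gt->scheds[i], NULL);

        /* 2) Do the actual reset. */
        example_hw_reset(gt);

        /* 3) Skip (ban) guilty schedulers, 4) re-queue the innocent jobs. */
        for (i = 0; i < gt->num_scheds; i++) {
                if (example_sched_is_guilty(gt->scheds[i]))
                        continue;
                drm_sched_resubmit_jobs(gt->scheds[i]);
        }

        /* 5) Restart all schedulers so the resubmitted jobs run again. */
        for (i = 0; i < gt->num_scheds; i++)
                drm_sched_start(gt->scheds[i], true);
}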

> But the issue you're having is fairly specific if it's just about
> ringspace. I think the dumbest fix is to just block in submit if you run
> out of per-ctx ringspace, and call it a day. This notion that somehow the

How does that not break the dma-fence rules? A job can publish its
finished fence after ARM; if the finished fence waits on ring space that
may not free up in a reasonable amount of time, we have now broken the
dma-fence rules. My understanding is that any dma-fence must only wait
on other dma-fences; Christian seems to agree and NAK'd just blocking if
no space is available [1]. IMO this series ensures we don't break the
dma-fence rules by restricting how the finished fence can be used.
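
(For reference, a minimal sketch of how the existing signalling annotations
are meant to catch exactly this kind of wait. The example_* helpers are made
up; dma_fence_begin_signalling()/dma_fence_end_signalling() are the real
annotation API.)

#include <linux/dma-fence.h>

/* Hypothetical driver helpers, for illustration only. */
bool example_ring_has_space(void *ring);
void example_wait_for_ring_space(void *ring);   /* e.g. a dma_fence_wait() on older jobs */
void example_write_ring(void *ring, void *job);

static void example_submit_job(void *ring, void *job)
{
        bool cookie = dma_fence_begin_signalling();

        /*
         * Everything between begin/end is on the path that eventually
         * signals the job's finished fence.  If the wait below is itself
         * a fence wait (older ring jobs retiring to free up space),
         * lockdep will flag it, because the finished fence now depends on
         * other fences completing.
         */
        if (!example_ring_has_space(ring))
                example_wait_for_ring_space(ring);

        example_write_ring(ring, job);

        dma_fence_end_signalling(cookie);
}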

> kernel is supposed to provide a 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Daniel Vetter
On Tue, Apr 04, 2023 at 07:02:23PM +, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> > 
> > On 4/4/23 15:10, Christian König wrote:
> > > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > > Hi, Christian,
> > > > 
> > > > On 4/4/23 11:09, Christian König wrote:
> > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > From: Thomas Hellström 
> > > > > > 
> > > > > > For long-running workloads, drivers either need to open-code
> > > > > > completion
> > > > > > waits, invent their own synchronization primitives or internally use
> > > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > > without any lockdep annotation all these approaches are error prone.
> > > > > > 
> > > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > > desirable for
> > > > > > a driver to be able to use it for throttling and error
> > > > > > handling also with
> > > > > > internal dma-fences tha do not obey the cros-driver
> > > > > > dma-fence protocol.
> > > > > > 
> > > > > > Introduce long-running completion fences in form of
> > > > > > dma-fences, and add
> > > > > > lockdep annotation for them. In particular:
> > > > > > 
> > > > > > * Do not allow waiting under any memory management locks.
> > > > > > * Do not allow to attach them to a dma-resv object.
> > > > > > * Introduce a new interface for adding callbacks making the
> > > > > > helper adding
> > > > > >    a callback sign off on that it is aware that the dma-fence may 
> > > > > > not
> > > > > >    complete anytime soon. Typically this will be the
> > > > > > scheduler chaining
> > > > > >    a new long-running fence on another one.
> > > > > 
> > > > > Well that's pretty much what I tried before:
> > > > > https://lwn.net/Articles/893704/
> > > > > 
> 
> I don't think this quite the same, this explictly enforces that we don't
> break the dma-fence rules (in path of memory allocations, exported in
> any way), essentially this just SW sync point reusing dma-fence the
> infrastructure for signaling / callbacks. I believe your series tried to
> export these fences to user space (admittedly I haven't fully read your
> series).
> 
> In this use case we essentially just want to flow control the ring via
> the dma-scheduler + maintain a list of pending jobs so the TDR can be
> used for cleanup if LR entity encounters an error. To me this seems
> perfectly reasonable but I know dma-femce rules are akin to a holy war.
> 
> If we return NULL in run_job, now we have to be able to sink all jobs
> in the backend regardless on ring space, maintain a list of jobs pending
> for cleanup after errors, and write a different cleanup path as now the
> TDR doesn't work. Seems very, very silly to duplicate all of this code
> when the DRM scheduler provides all of this for us. Also if we go this
> route, now all drivers are going to invent ways to handle LR jobs /w the
> DRM scheduler.
> 
> This solution is pretty clear, mark the scheduler as LR, and don't
> export any fences from the scheduler. If you try to export these fences
> a blow up happens.

The problem is if you mix things up. Like for resets you need all the
schedulers on an engine/set-of-engines to quiesce or things get
potentially hilarious. If you now have a scheduler in forever limbo, the
dma_fence guarantees are right out the window.

But the issue you're having is fairly specific if it's just about
ringspace. I think the dumbest fix is to just block in submit if you run
out of per-ctx ringspace, and call it a day. This notion that somehow the
kernel is supposed to provide a bottomless queue of anything userspace
submits simply doesn't hold up in reality (as much as userspace standards
committees would like it to), and as long as it doesn't have a real-world
perf impact it doesn't really matter why we end up blocking in the submit
ioctl. It might also be a simple memory allocation that hits a snag in
page reclaim.
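
(A minimal sketch of that "dumbest fix", assuming made-up driver state: the
throttle happens in the ioctl itself, before any fence for the submission is
published, so no dma-fence ever depends on the wait.)

#include <linux/types.h>
#include <linux/wait.h>

/* Hypothetical driver state, for illustration only. */
struct example_ring {
        wait_queue_head_t space_wq;     /* woken when ring space is reclaimed */
};

u32 example_ring_space(struct example_ring *ring);

static int example_exec_throttle(struct example_ring *ring, u32 dwords_needed)
{
        /* Interruptible so the ioctl can be restarted on signals. */
        return wait_event_interruptible(ring->space_wq,
                                        example_ring_space(ring) >= dwords_needed);
}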

> > > > > And the reasons why it was rejected haven't changed.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > Yes, TBH this was mostly to get discussion going how we'd best
> > > > tackle this problem while being able to reuse the scheduler for
> > > > long-running workloads.
> > > > 
> > > > I couldn't see any clear decision on your series, though, but one
> > > > main difference I see is that this is intended for driver-internal
> > > > use only. (I'm counting using the drm_scheduler as a helper for
> > > > driver-private use). This is by no means a way to try tackle the
> > > > indefinite fence problem.
> > > 
> > > Well this was just my latest try to tackle this, but essentially the
> > > problems are the same as with your approach: When we express such
> > > operations as dma_fence there is always the change that we leak that
> > > somewhere.
> > > 
> > > My approach of adding a flag noting that this operation is dangerous and
> > > can't be synced with 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Matthew Brost
On Tue, Apr 04, 2023 at 08:14:01PM +0200, Thomas Hellström (Intel) wrote:
> 
> On 4/4/23 15:10, Christian König wrote:
> > Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > > Hi, Christian,
> > > 
> > > On 4/4/23 11:09, Christian König wrote:
> > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > From: Thomas Hellström 
> > > > > 
> > > > > For long-running workloads, drivers either need to open-code
> > > > > completion
> > > > > waits, invent their own synchronization primitives or internally use
> > > > > dma-fences that do not obey the cross-driver dma-fence protocol, but
> > > > > without any lockdep annotation all these approaches are error prone.
> > > > > 
> > > > > So since for example the drm scheduler uses dma-fences it is
> > > > > desirable for
> > > > > a driver to be able to use it for throttling and error
> > > > > handling also with
> > > > > internal dma-fences tha do not obey the cros-driver
> > > > > dma-fence protocol.
> > > > > 
> > > > > Introduce long-running completion fences in form of
> > > > > dma-fences, and add
> > > > > lockdep annotation for them. In particular:
> > > > > 
> > > > > * Do not allow waiting under any memory management locks.
> > > > > * Do not allow to attach them to a dma-resv object.
> > > > > * Introduce a new interface for adding callbacks making the
> > > > > helper adding
> > > > >    a callback sign off on that it is aware that the dma-fence may not
> > > > >    complete anytime soon. Typically this will be the
> > > > > scheduler chaining
> > > > >    a new long-running fence on another one.
> > > > 
> > > > Well that's pretty much what I tried before:
> > > > https://lwn.net/Articles/893704/
> > > > 

I don't think this is quite the same; this explicitly enforces that we
don't break the dma-fence rules (in the path of memory allocations,
exported in any way). Essentially this is just a SW sync point reusing
the dma-fence infrastructure for signaling / callbacks. I believe your
series tried to export these fences to user space (admittedly I haven't
fully read your series).

In this use case we essentially just want to flow control the ring via
the drm scheduler + maintain a list of pending jobs so the TDR can be
used for cleanup if an LR entity encounters an error. To me this seems
perfectly reasonable, but I know the dma-fence rules are akin to a holy
war.

If we return NULL in run_job, now we have to be able to sink all jobs
in the backend regardless of ring space, maintain a list of jobs pending
for cleanup after errors, and write a different cleanup path as now the
TDR doesn't work. It seems very, very silly to duplicate all of this
code when the DRM scheduler provides all of it for us. Also, if we go
this route, now all drivers are going to invent their own ways to handle
LR jobs with the DRM scheduler.

This solution is pretty clear: mark the scheduler as LR and don't
export any fences from the scheduler. If you try to export these fences,
a blow up happens.
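
(A sketch of that "blow up", assuming hypothetical driver-side state; the
dma_resv call is the real API, everything example_* is made up. The caller
is assumed to hold the reservation lock and to have reserved a fence slot.)

#include <linux/bits.h>
#include <linux/bug.h>
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

#define EXAMPLE_QUEUE_LONG_RUNNING      BIT(0)  /* hypothetical flag */

struct example_queue {
        unsigned long flags;
};

static int example_install_out_fence(struct example_queue *q,
                                     struct dma_fence *fence,
                                     struct dma_resv *resv)
{
        /* Fences of a long-running queue must never leave the driver. */
        if (WARN_ON(q->flags & EXAMPLE_QUEUE_LONG_RUNNING))
                return -EINVAL;

        dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_WRITE);
        return 0;
}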

> > > > And the reasons why it was rejected haven't changed.
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > Yes, TBH this was mostly to get discussion going how we'd best
> > > tackle this problem while being able to reuse the scheduler for
> > > long-running workloads.
> > > 
> > > I couldn't see any clear decision on your series, though, but one
> > > main difference I see is that this is intended for driver-internal
> > > use only. (I'm counting using the drm_scheduler as a helper for
> > > driver-private use). This is by no means a way to try tackle the
> > > indefinite fence problem.
> > 
> > Well this was just my latest try to tackle this, but essentially the
> > problems are the same as with your approach: When we express such
> > operations as dma_fence there is always the change that we leak that
> > somewhere.
> > 
> > My approach of adding a flag noting that this operation is dangerous and
> > can't be synced with something memory management depends on tried to
> > contain this as much as possible, but Daniel still pretty clearly
> > rejected it (for good reasons I think).
> > 
> > > 
> > > We could ofc invent a completely different data-type that abstracts
> > > the synchronization the scheduler needs in the long-running case, or
> > > each driver could hack something up, like sleeping in the
> > > prepare_job() or run_job() callback for throttling, but those waits
> > > should still be annotated in one way or annotated one way or another
> > > (and probably in a similar way across drivers) to make sure we don't
> > > do anything bad.
> > > 
> > >  So any suggestions as to what would be the better solution here
> > > would be appreciated.
> > 
> > Mhm, do we really the the GPU scheduler for that?
> > 

I think we need to solve this within the DRM scheduler one way or
another.

> > I mean in the 1 to 1 case  you basically just need a component which
> > collects the dependencies as dma_fence and if all of them are fulfilled
> > schedules a work item.
> > 
> > As long as the work item itself 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Daniel Vetter
On Tue, 4 Apr 2023 at 15:10, Christian König  wrote:
>
> Am 04.04.23 um 14:54 schrieb Thomas Hellström:
> > Hi, Christian,
> >
> > On 4/4/23 11:09, Christian König wrote:
> >> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> >>> From: Thomas Hellström 
> >>>
> >>> For long-running workloads, drivers either need to open-code completion
> >>> waits, invent their own synchronization primitives or internally use
> >>> dma-fences that do not obey the cross-driver dma-fence protocol, but
> >>> without any lockdep annotation all these approaches are error prone.
> >>>
> >>> So since for example the drm scheduler uses dma-fences it is
> >>> desirable for
> >>> a driver to be able to use it for throttling and error handling also
> >>> with
> >>> internal dma-fences tha do not obey the cros-driver dma-fence protocol.
> >>>
> >>> Introduce long-running completion fences in form of dma-fences, and add
> >>> lockdep annotation for them. In particular:
> >>>
> >>> * Do not allow waiting under any memory management locks.
> >>> * Do not allow to attach them to a dma-resv object.
> >>> * Introduce a new interface for adding callbacks making the helper
> >>> adding
> >>>a callback sign off on that it is aware that the dma-fence may not
> >>>complete anytime soon. Typically this will be the scheduler chaining
> >>>a new long-running fence on another one.
> >>
> >> Well that's pretty much what I tried before:
> >> https://lwn.net/Articles/893704/
> >>
> >> And the reasons why it was rejected haven't changed.
> >>
> >> Regards,
> >> Christian.
> >>
> > Yes, TBH this was mostly to get discussion going how we'd best tackle
> > this problem while being able to reuse the scheduler for long-running
> > workloads.
> >
> > I couldn't see any clear decision on your series, though, but one main
> > difference I see is that this is intended for driver-internal use
> > only. (I'm counting using the drm_scheduler as a helper for
> > driver-private use). This is by no means a way to try tackle the
> > indefinite fence problem.
>
> Well this was just my latest try to tackle this, but essentially the
> problems are the same as with your approach: When we express such
> operations as dma_fence there is always the change that we leak that
> somewhere.
>
> My approach of adding a flag noting that this operation is dangerous and
> can't be synced with something memory management depends on tried to
> contain this as much as possible, but Daniel still pretty clearly
> rejected it (for good reasons I think).

Yeah I still don't like dma_fence that somehow have totally different
semantics in that critical piece of "will it complete or will it
deadlock?" :-)
>
> >
> > We could ofc invent a completely different data-type that abstracts
> > the synchronization the scheduler needs in the long-running case, or
> > each driver could hack something up, like sleeping in the
> > prepare_job() or run_job() callback for throttling, but those waits
> > should still be annotated in one way or annotated one way or another
> > (and probably in a similar way across drivers) to make sure we don't
> > do anything bad.
> >
> >  So any suggestions as to what would be the better solution here would
> > be appreciated.
>
> Mhm, do we really the the GPU scheduler for that?
>
> I mean in the 1 to 1 case  you basically just need a component which
> collects the dependencies as dma_fence and if all of them are fulfilled
> schedules a work item.
>
> As long as the work item itself doesn't produce a dma_fence it can then
> still just wait for other none dma_fence dependencies.

Yeah that's the important thing, for long-running jobs dependencies as
dma_fence should be totally fine. You're just not allowed to have any
outgoing dma_fences at all (except the magic preemption fence).
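
(For context, a minimal sketch of what "the magic preemption fence" means in
practice: it is the one fence a long-running context still installs into the
dma_resv of its BOs, so memory management can wait on it to kick the workload
off the hardware before moving memory. The dma_resv calls are the real API;
example_alloc_preempt_fence() is made up.)

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>

/* Hypothetical: however the driver creates its preemption fence. */
struct dma_fence *example_alloc_preempt_fence(void *ctx);

static int example_attach_preempt_fence(void *ctx, struct dma_resv *resv)
{
        struct dma_fence *fence = example_alloc_preempt_fence(ctx);
        int ret;

        if (!fence)
                return -ENOMEM;

        dma_resv_lock(resv, NULL);

        /* Make room for one more fence slot before adding. */
        ret = dma_resv_reserve_fences(resv, 1);
        if (!ret) {
                /*
                 * BOOKKEEP: kernel memory management waits on it before
                 * moving the buffer, but implicit sync never sees it.
                 */
                dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_BOOKKEEP);
        }

        dma_resv_unlock(resv);
        dma_fence_put(fence);
        return ret;
}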

> Then the work function could submit the work and wait for the result.
>
> The work item would then pretty much represent what you want, you can
> wait for it to finish and pass it along as long running dependency.
>
> Maybe give it a funky name and wrap it up in a structure, but that's
> basically it.

Like do we need this? If the kernel ever waits for a long-running
compute job to finish I'd call that a bug. Any functional
dependencies between engines or whatever are userspace's problem only,
which it needs to sort out using userspace memory fences.

The only things the kernel needs are some way to track dependencies as
dma_fence (because memory management moves the memory away and we need
to move it back in, ideally pipelined). And it needs the special
preempt fence (if we don't have pagefaults) so that you have a fence
to attach to all the dma_resv for memory management purposes. Now the
scheduler already has almost all the pieces (at least if we assume
there's some magic fw which time-slices these contexts on its own),
and we just need a few minimal changes:
- allowing the scheduler to ignore the completion fence and just
immediately push the next "job" in if its dependencies are 

Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Intel



On 4/4/23 15:10, Christian König wrote:

Am 04.04.23 um 14:54 schrieb Thomas Hellström:

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström 

For long-running workloads, drivers either need to open-code 
completion

waits, invent their own synchronization primitives or internally use
dma-fences that do not obey the cross-driver dma-fence protocol, but
without any lockdep annotation all these approaches are error prone.

So since for example the drm scheduler uses dma-fences it is 
desirable for
a driver to be able to use it for throttling and error handling 
also with
internal dma-fences tha do not obey the cros-driver dma-fence 
protocol.


Introduce long-running completion fences in form of dma-fences, and 
add

lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow to attach them to a dma-resv object.
* Introduce a new interface for adding callbacks making the helper 
adding

   a callback sign off on that it is aware that the dma-fence may not
   complete anytime soon. Typically this will be the scheduler 
chaining

   a new long-running fence on another one.


Well that's pretty much what I tried before: 
https://lwn.net/Articles/893704/


And the reasons why it was rejected haven't changed.

Regards,
Christian.

Yes, TBH this was mostly to get discussion going how we'd best tackle 
this problem while being able to reuse the scheduler for long-running 
workloads.


I couldn't see any clear decision on your series, though, but one 
main difference I see is that this is intended for driver-internal 
use only. (I'm counting using the drm_scheduler as a helper for 
driver-private use). This is by no means a way to try tackle the 
indefinite fence problem.


Well this was just my latest try to tackle this, but essentially the 
problems are the same as with your approach: When we express such 
operations as dma_fence there is always the change that we leak that 
somewhere.


My approach of adding a flag noting that this operation is dangerous 
and can't be synced with something memory management depends on tried 
to contain this as much as possible, but Daniel still pretty clearly 
rejected it (for good reasons I think).




We could ofc invent a completely different data-type that abstracts 
the synchronization the scheduler needs in the long-running case, or 
each driver could hack something up, like sleeping in the 
prepare_job() or run_job() callback for throttling, but those waits 
should still be annotated in one way or annotated one way or another 
(and probably in a similar way across drivers) to make sure we don't 
do anything bad.


 So any suggestions as to what would be the better solution here 
would be appreciated.


Mhm, do we really the the GPU scheduler for that?

I mean in the 1 to 1 case  you basically just need a component which 
collects the dependencies as dma_fence and if all of them are 
fulfilled schedules a work item.


As long as the work item itself doesn't produce a dma_fence it can 
then still just wait for other none dma_fence dependencies.


Then the work function could submit the work and wait for the result.

The work item would then pretty much represent what you want, you can 
wait for it to finish and pass it along as long running dependency.


Maybe give it a funky name and wrap it up in a structure, but that's 
basically it.


This very much sounds like an i915_sw_fence for the dependency tracking
and a dma_fence_work for the actual work, although its completion fence
is a dma_fence.


Although that goes against the whole idea that a condition for merging the
xe driver would be that we implement some sort of minimal scaffolding
for long-running workloads in the drm scheduler; the thinking behind
that is to avoid implementing intel-specific solutions like those...
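
(For the record, a rough sketch of what such a "collect dma_fence
dependencies, then run a work item" object could look like, since that is
essentially the i915_sw_fence/dma_fence_work pattern mentioned above. All
example_* names are made up; the caller is assumed to initialise 'pending'
to 1 and the work item with INIT_WORK() before adding dependencies.)

#include <linux/atomic.h>
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct example_lr_work {
        struct work_struct work;        /* runs once all dependencies signalled */
        atomic_t pending;               /* dependency count + 1 arming reference */
};

struct example_lr_dep {
        struct dma_fence_cb cb;
        struct example_lr_work *lw;
};

static void example_lr_work_put(struct example_lr_work *lw)
{
        if (atomic_dec_and_test(&lw->pending))
                queue_work(system_unbound_wq, &lw->work);
}

static void example_lr_dep_signalled(struct dma_fence *fence,
                                     struct dma_fence_cb *cb)
{
        struct example_lr_dep *dep = container_of(cb, struct example_lr_dep, cb);
        struct example_lr_work *lw = dep->lw;

        kfree(dep);
        example_lr_work_put(lw);
}

static int example_lr_work_add_dependency(struct example_lr_work *lw,
                                          struct dma_fence *fence)
{
        struct example_lr_dep *dep = kmalloc(sizeof(*dep), GFP_KERNEL);

        if (!dep)
                return -ENOMEM;

        dep->lw = lw;
        atomic_inc(&lw->pending);
        if (dma_fence_add_callback(fence, &dep->cb, example_lr_dep_signalled)) {
                /* Fence was already signalled, drop our reference ourselves. */
                kfree(dep);
                example_lr_work_put(lw);
        }
        return 0;
}

/* Drop the arming reference once all dependencies have been added. */
static void example_lr_work_arm(struct example_lr_work *lw)
{
        example_lr_work_put(lw);
}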


Thanks,

Thomas




Regards,
Christian.



Thanks,

Thomas







Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Christian König

Am 04.04.23 um 14:54 schrieb Thomas Hellström:

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström 

For long-running workloads, drivers either need to open-code completion
waits, invent their own synchronization primitives or internally use
dma-fences that do not obey the cross-driver dma-fence protocol, but
without any lockdep annotation all these approaches are error prone.

So since for example the drm scheduler uses dma-fences it is 
desirable for
a driver to be able to use it for throttling and error handling also 
with

internal dma-fences tha do not obey the cros-driver dma-fence protocol.

Introduce long-running completion fences in form of dma-fences, and add
lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow to attach them to a dma-resv object.
* Introduce a new interface for adding callbacks making the helper 
adding

   a callback sign off on that it is aware that the dma-fence may not
   complete anytime soon. Typically this will be the scheduler chaining
   a new long-running fence on another one.


Well that's pretty much what I tried before: 
https://lwn.net/Articles/893704/


And the reasons why it was rejected haven't changed.

Regards,
Christian.

Yes, TBH this was mostly to get discussion going how we'd best tackle 
this problem while being able to reuse the scheduler for long-running 
workloads.


I couldn't see any clear decision on your series, though, but one main 
difference I see is that this is intended for driver-internal use 
only. (I'm counting using the drm_scheduler as a helper for 
driver-private use). This is by no means a way to try tackle the 
indefinite fence problem.


Well this was just my latest try to tackle this, but essentially the 
problems are the same as with your approach: When we express such 
operations as dma_fence there is always the chance that we leak that 
somewhere.


My approach of adding a flag noting that this operation is dangerous and 
can't be synced with something memory management depends on tried to 
contain this as much as possible, but Daniel still pretty clearly 
rejected it (for good reasons I think).




We could ofc invent a completely different data-type that abstracts 
the synchronization the scheduler needs in the long-running case, or 
each driver could hack something up, like sleeping in the 
prepare_job() or run_job() callback for throttling, but those waits 
should still be annotated in one way or annotated one way or another 
(and probably in a similar way across drivers) to make sure we don't 
do anything bad.


 So any suggestions as to what would be the better solution here would 
be appreciated.


Mhm, do we really need the GPU scheduler for that?

I mean in the 1 to 1 case  you basically just need a component which 
collects the dependencies as dma_fence and if all of them are fulfilled 
schedules a work item.


As long as the work item itself doesn't produce a dma_fence it can then 
still just wait for other none dma_fence dependencies.


Then the work function could submit the work and wait for the result.

The work item would then pretty much represent what you want, you can 
wait for it to finish and pass it along as long running dependency.


Maybe give it a funky name and wrap it up in a structure, but that's 
basically it.


Regards,
Christian.



Thanks,

Thomas









Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Thomas Hellström

Hi, Christian,

On 4/4/23 11:09, Christian König wrote:

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström 

For long-running workloads, drivers either need to open-code completion
waits, invent their own synchronization primitives or internally use
dma-fences that do not obey the cross-driver dma-fence protocol, but
without any lockdep annotation all these approaches are error prone.

So since for example the drm scheduler uses dma-fences it is 
desirable for
a driver to be able to use it for throttling and error handling also 
with

internal dma-fences tha do not obey the cros-driver dma-fence protocol.

Introduce long-running completion fences in form of dma-fences, and add
lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow to attach them to a dma-resv object.
* Introduce a new interface for adding callbacks making the helper 
adding

   a callback sign off on that it is aware that the dma-fence may not
   complete anytime soon. Typically this will be the scheduler chaining
   a new long-running fence on another one.


Well that's pretty much what I tried before: 
https://lwn.net/Articles/893704/


And the reasons why it was rejected haven't changed.

Regards,
Christian.

Yes, TBH this was mostly to get discussion going on how we'd best tackle 
this problem while being able to reuse the scheduler for long-running 
workloads.


I couldn't see any clear decision on your series, though, but one main 
difference I see is that this is intended for driver-internal use only. 
(I'm counting using the drm_scheduler as a helper for driver-private 
use). This is by no means a way to try to tackle the indefinite fence 
problem.


We could ofc invent a completely different data-type that abstracts the 
synchronization the scheduler needs in the long-running case, or each 
driver could hack something up, like sleeping in the prepare_job() or 
run_job() callback for throttling, but those waits should still be 
annotated in one way or another (and probably in a similar way across 
drivers) to make sure we don't do anything bad.


 So any suggestions as to what would be the better solution here would 
be appreciated.


Thanks,

Thomas







Re: [RFC PATCH 08/10] dma-buf/dma-fence: Introduce long-running completion fences

2023-04-04 Thread Christian König

Am 04.04.23 um 02:22 schrieb Matthew Brost:

From: Thomas Hellström 

For long-running workloads, drivers either need to open-code completion
waits, invent their own synchronization primitives or internally use
dma-fences that do not obey the cross-driver dma-fence protocol, but
without any lockdep annotation all these approaches are error prone.

So since for example the drm scheduler uses dma-fences it is desirable for
a driver to be able to use it for throttling and error handling also with
internal dma-fences tha do not obey the cros-driver dma-fence protocol.

Introduce long-running completion fences in form of dma-fences, and add
lockdep annotation for them. In particular:

* Do not allow waiting under any memory management locks.
* Do not allow to attach them to a dma-resv object.
* Introduce a new interface for adding callbacks making the helper adding
   a callback sign off on that it is aware that the dma-fence may not
   complete anytime soon. Typically this will be the scheduler chaining
   a new long-running fence on another one.


Well that's pretty much what I tried before: 
https://lwn.net/Articles/893704/


And the reasons why it was rejected haven't changed.

Regards,
Christian.



Signed-off-by: Matthew Brost 
Signed-off-by: Thomas Hellström 
---
  drivers/dma-buf/dma-fence.c | 142 ++--
  drivers/dma-buf/dma-resv.c  |   5 ++
  include/linux/dma-fence.h   |  55 +-
  3 files changed, 160 insertions(+), 42 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index f177c56269bb..9726b2a3c67d 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -111,6 +111,20 @@ static atomic64_t dma_fence_context_counter = ATOMIC64_INIT(1);
   * drivers/gpu should ever call dma_fence_wait() in such contexts.
   */
  
+/**
+ * DOC: Long-Running (lr) dma-fences.
+ *
+ * * Long-running dma-fences are NOT required to complete in reasonable time.
+ *   Typically they signal completion of user-space controlled workloads and
+ *   as such, need to never be part of a cross-driver contract, never waited
+ *   for inside a kernel lock, nor attached to a dma-resv. There are helpers
+ *   and warnings in place to help facilitate that that never happens.
+ *
+ * The motivation for their existence is that helpers that are intended to
+ *   be used by drivers may use dma-fences that, given the workloads mentioned
+ *   above, become long-running.
+ */
+
  static const char *dma_fence_stub_get_name(struct dma_fence *fence)
  {
  return "stub";
@@ -284,6 +298,34 @@ static struct lockdep_map dma_fence_lockdep_map = {
.name = "dma_fence_map"
  };
  
+static struct lockdep_map dma_fence_lr_lockdep_map = {
+   .name = "dma_fence_lr_map"
+};
+
+static bool __dma_fence_begin_signalling(struct lockdep_map *map)
+{
+   /* explicitly nesting ... */
+   if (lock_is_held_type(map, 1))
+   return true;
+
+   /* rely on might_sleep check for soft/hardirq locks */
+   if (in_atomic())
+   return true;
+
+   /* ... and non-recursive readlock */
+   lock_acquire(map, 0, 0, 1, 1, NULL, _RET_IP_);
+
+   return false;
+}
+
+static void __dma_fence_end_signalling(bool cookie, struct lockdep_map *map)
+{
+   if (cookie)
+   return;
+
+   lock_release(map, _RET_IP_);
+}
+
  /**
   * dma_fence_begin_signalling - begin a critical DMA fence signalling section
   *
@@ -300,18 +342,7 @@ static struct lockdep_map dma_fence_lockdep_map = {
   */
  bool dma_fence_begin_signalling(void)
  {
-   /* explicitly nesting ... */
-   if (lock_is_held_type(&dma_fence_lockdep_map, 1))
-   return true;
-
-   /* rely on might_sleep check for soft/hardirq locks */
-   if (in_atomic())
-   return true;
-
-   /* ... and non-recursive readlock */
-   lock_acquire(&dma_fence_lockdep_map, 0, 0, 1, 1, NULL, _RET_IP_);
-
-   return false;
+   return __dma_fence_begin_signalling(&dma_fence_lockdep_map);
  }
  EXPORT_SYMBOL(dma_fence_begin_signalling);
  
@@ -323,25 +354,61 @@ EXPORT_SYMBOL(dma_fence_begin_signalling);

   */
  void dma_fence_end_signalling(bool cookie)
  {
-   if (cookie)
-   return;
-
-   lock_release(&dma_fence_lockdep_map, _RET_IP_);
+   __dma_fence_end_signalling(cookie, &dma_fence_lockdep_map);
  }
  EXPORT_SYMBOL(dma_fence_end_signalling);
  
-void __dma_fence_might_wait(void)

+/**
+ * dma_fence_lr_begin_signalling - begin a critical long-running DMA fence
+ * signalling section
+ *
+ * Drivers should use this to annotate the beginning of any code section
+ * required to eventually complete &dma_fence by calling dma_fence_signal().
+ *
+ * The end of these critical sections are annotated with
+ * dma_fence_lr_end_signalling(). Ideally the section should encompass all
+ * locks that are ever required to signal a long-running dma-fence.
+ *
+ * Return: An opaque cookie needed by the implementation,