Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Jason Ekstrand
On Fri, Feb 15, 2019 at 12:33 PM Koenig, Christian 
wrote:

> On 15.02.19 at 19:16, Jason Ekstrand wrote:
>
> On Fri, Feb 15, 2019 at 11:51 AM Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> On 15.02.19 at 17:49, Jason Ekstrand wrote:
>>
>> On Fri, Feb 15, 2019 at 9:52 AM Lionel Landwerlin via dri-devel <
>> dri-de...@lists.freedesktop.org> wrote:
>>
>>> On 15/02/2019 14:32, Koenig, Christian wrote:
>>> > On 15.02.19 at 15:23, Lionel Landwerlin wrote:
>>> >> Hi Christian, David,
>>> >>
>>> >> For timeline semaphores we need points to signal in order.
>>> >> I'm struggling to understand how this fence-chain implementation
>>> >> preserves ordering of the seqnos.
>>> >>
>>> >> One of the scenarios where I can see an issue happening is when you have a
>>> >> timeline with points 1 & 2 and userspace submits to 2 different
>>> >> engines:
>>> >>  - first with let's say a blitter style engine on point 2
>>> >>  - then a 3d style engine on point 1
>>> > Yeah, and where exactly is the problem?
>>> >
>>> > Seqno 1 will signal when the 3d style engine finishes work.
>>> >
>>> > And seqno 2 will signal when both seqno 1 is signaled and the blitter
>>> > style engine has finished its work.
>>>
>>
>> That's an interesting interpretation of the spec.  I think it's legal and
>> I could see that behavior may be desirable in some ways.
>>
>>
>> Well we actually had this discussion multiple times now, both internally
>> as well as on the mailing list. Please also see the previous mails with
>> Daniel on this topic.
>>
>
> I dug through dri-devel and read everything I could find with a search for
> "timeline semaphore". I didn't find all that much, but this did come up once.
>
>
> Need to dig through my mails as well, that was back in November/December
> last year.
>
>
>
>> My initial suggestion was actually exactly what Lionel suggested as
>> well.
>>
>> And following this I used a rather simple container for the
>> implementation, e.g. just a ring buffer indexed by the sequence number. In
>> this scenario userspace can specify at syncobj creation time how big the
>> window for sequence numbers should be, e.g. in this implementation how big
>> the ring buffer would be.
>>
>> This was rejected by our guys who actually wrote a good part of the
>> Vulkan specification. Daniel then went in the same direction during
>> the public discussion.
>>
>
> I agree with whoever said that specifying a ringbuffer size is
> unacceptable.  I'm not really sure how that's relevant though.  Is a
> ringbuffer required to implement the behavior that is being suggested
> here?  Genuine question; I'm trying to get back up to speed.
>
>
> Using a ring buffer was just an example of how we could do it if we follow my
> and Lionel's suggestion.
>
> The key point is that we could simplify the implementation massively if
> sequence numbers don't need to depend on each other.
>
> In other words we just see the syncobj as a container where fences are added
> and retrieved from instead of something actively involved in the signaling.
>

In principle, I think this is a reasonable argument.  Having it involved in
signalling doesn't seem terrible to me but it would mean that a driver
wouldn't be able to detect that the fence it's waiting on actually belongs
to itself and optimize things.


> The main reason we didn't do it this way is that the AMD Vulkan team
> rejected this approach.
>

Clearly, there's not quite as much agreement as I'd thought there was.  Oh,
well, that's why we have these discussions.


> Additionally, chaining sequence numbers together is really the more
> defensive approach, i.e. it is less likely that applications can shoot
> themselves in the foot.
>

Yeah, I can see how the "everything prior to n must be signalled" could be
safer.  I think both wait-any and wait-all have their ups and downs.  It
just took me by surprise.


>
>  4. If you do get into a sticky situation, you can unblock an entire
>> timeline by using the CPU signal ioctl to set it to a high value.
>>
>>
>> Well I think that this could be problematic as well. Keep in mind that
>> the main use case for this is sharing timelines between processes.
>>
>> In other words you don't want applications to be able to mess with it too
>> much.
>>
>
> Cross-process is exactly why you want it.  Suppose you're a compositor and
> you have a timeline shared with another application and you've submitted
> work which waits on it.  Then you get a notification somehow (SIGHUP?) that
> the client has died leaving you hanging.  What do you do?  You take the
> semaphore that's shared with you and the client and whack it to UINT64_MAX
> to unblock yourself.  Of course, this can be abused and that's always the
> risk you take with timelines.
>
>
> My last status is that basically everybody agrees now that wait before
> signal in the kernel is forbidden.
>

Agreed.  I'm not saying that wait before signal in the kernel should be a
thing.  I think we're all agreed that 

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Koenig, Christian
On 15.02.19 at 19:16, Jason Ekstrand wrote:
On Fri, Feb 15, 2019 at 11:51 AM Christian König 
<ckoenig.leichtzumer...@gmail.com> wrote:
On 15.02.19 at 17:49, Jason Ekstrand wrote:
On Fri, Feb 15, 2019 at 9:52 AM Lionel Landwerlin via dri-devel 
<dri-de...@lists.freedesktop.org> wrote:
On 15/02/2019 14:32, Koenig, Christian wrote:
> On 15.02.19 at 15:23, Lionel Landwerlin wrote:
>> Hi Christian, David,
>>
>> For timeline semaphores we need points to signal in order.
>> I'm struggling to understand how this fence-chain implementation
>> preserves ordering of the seqnos.
>>
>> One of the scenarios where I can see an issue happening is when you have a
>> timeline with points 1 & 2 and userspace submits to 2 different
>> engines:
>>  - first with let's say a blitter style engine on point 2
>>  - then a 3d style engine on point 1
> Yeah, and where exactly is the problem?
>
> Seqno 1 will signal when the 3d style engine finishes work.
>
> And seqno 2 will signal when both seqno 1 is signaled and the blitter
> style engine has finished its work.

That's an interesting interpretation of the spec.  I think it's legal and I 
could see that behavior may be desirable in some ways.

Well we actually had this discussion multiple times now, both internally as 
well as on the mailing list. Please also see the previous mails with Daniel on 
this topic.

I dug through dri-devel and read everything I could find with a search for 
"timeline semaphore". I didn't find all that much, but this did come up once.

Need to dig through my mails as well, that was back in November/December last 
year.


My initial suggestion was actually exactly what Lionel suggested as well.

And following this I used a rather simple container for the implementation, 
e.g. just a ring buffer indexed by the sequence number. In this scenario 
userspace can specify at syncobj creation time how big the window for sequence 
numbers should be, e.g. in this implementation how big the ring buffer would be.

This was rejected by our guys who actually wrote a good part of the Vulkan 
specification. Daniel then went in the same direction during the public 
discussion.

I agree with whoever said that specifying a ringbuffer size is unacceptable.  
I'm not really sure how that's relevant though.  Is a ringbuffer required to 
implement the behavior that is being suggested here?  Genuine question; I'm 
trying to get back up to speed.

Using a ring buffer was just an example of how we could do it if we follow my and 
Lionel's suggestion.

The key point is that we could simplify the implementation massively if sequence 
numbers don't need to depend on each other.

In other words we just see the syncobj as a container where fences are added and 
retrieved from instead of something actively involved in the signaling.

The main reason we didn't do it this way is that the AMD Vulkan team 
rejected this approach.

Additionally, chaining sequence numbers together is really the more 
defensive approach, i.e. it is less likely that applications can shoot 
themselves in the foot.


 4. If you do get into a sticky situation, you can unblock an entire timeline 
by using the CPU signal ioctl to set it to a high value.

Well I think that this could be problematic as well. Keep in mind that the main use 
case for this is sharing timelines between processes.

In other words you don't want applications to be able to mess with it too much.

Cross-process is exactly why you want it.  Suppose you're a compositor and you 
have a timeline shared with another application and you've submitted work which 
waits on it.  Then you get a notification somehow (SIGHUP?) that the client has 
died leaving you hanging.  What do you do?  You take the semaphore that's 
shared with you and the client and whack it to UINT64_MAX to unblock yourself.  
Of course, this can be abused and that's always the risk you take with 
timelines.

My last status is that basically everybody agrees now that wait before signal 
in the kernel is forbidden.

So when you get a SIGHUP because your client is dead you just kill your thread 
waiting on it.

Regards,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Jason Ekstrand
On Fri, Feb 15, 2019 at 11:51 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> On 15.02.19 at 17:49, Jason Ekstrand wrote:
>
> On Fri, Feb 15, 2019 at 9:52 AM Lionel Landwerlin via dri-devel <
> dri-de...@lists.freedesktop.org> wrote:
>
>> On 15/02/2019 14:32, Koenig, Christian wrote:
>> > On 15.02.19 at 15:23, Lionel Landwerlin wrote:
>> >> Hi Christian, David,
>> >>
>> >> For timeline semaphores we need points to signal in order.
>> >> I'm struggling to understand how this fence-chain implementation
>> >> preserves ordering of the seqnos.
>> >>
>> >> One of the scenarios where I can see an issue happening is when you have a
>> >> timeline with points 1 & 2 and userspace submits to 2 different
>> >> engines:
>> >>  - first with let's say a blitter style engine on point 2
>> >>  - then a 3d style engine on point 1
>> > Yeah, and where exactly is the problem?
>> >
>> > Seqno 1 will signal when the 3d style engine finishes work.
>> >
>> > And seqno 2 will signal when both seqno 1 is signaled and the blitter
>> > style engine has finished its work.
>>
>
> That's an interesting interpretation of the spec.  I think it's legal and
> I could see that behavior may be desirable in some ways.
>
>
> Well we actually had this discussion multiple times now, both internally
> as well as on the mailing list. Please also see the previous mails with
> Daniel on this topic.
>

I dug through dri-devel and read everything I could find with a search for
"timeline semaphore". I didn't find all that much, but this did come up once.


> My initial suggestion was actually exactly what Lionel suggested as
> well.
>
> And following this I used a rather simple container for the
> implementation, e.g. just a ring buffer indexed by the sequence number. In
> this scenario userspace can specify at syncobj creation time how big the
> window for sequence numbers should be, e.g. in this implementation how big
> the ring buffer would be.
>
> This was rejected by our guys who actually wrote a good part of the Vulkan
> specification. Daniel then went in the same direction during the
> public discussion.
>

I agree with whoever said that specifying a ringbuffer size is
unacceptable.  I'm not really sure how that's relevant though.  Is a
ringbuffer required to implement the behavior that is being suggested
here?  Genuine question; I'm trying to get back up to speed.


> [SNIP]
>
> I think what Christian is suggesting is a valid interpretation of the spec
> though it is rather unconventional.  The Vulkan spec, as it stands today,
> requires that the application ensure that at the time of signaling, the
> timeline semaphore value increases.  This means that all of the above
> possible cases are technically illegal in Vulkan and so it doesn't really
> matter what we do as long as we don't do anything especially stupid.
>
>
> And exactly that's the point. When an application does something stupid
> with its own submissions then this is not much of a problem.
>
> But this interface is meant to be made for communication between
> processes, and here we want to be sure that nobody can do anything stupid.
>
> My understanding of how this works on Windows is that a wait operation on
> 3 is a wait until x >= 3 where x is a 64-bit value and a signal operation
> is simply a write to x.  This means that, in the above cases, waits on 1
> will be triggered immediately when 2 is written but waits on 2 may or may
> not happen at all depending on whether the GPU write which overwrites x to
> 1 or the CPU (or potentially GPU in a different context) read gets there
> first such that the reader observes 2.  If you mess this up and something
> isn't signaled, that's your fault.
>
>
> Yeah and I think that this is actually not a good idea at all.
> Implementing it like this ultimately means that you can only use polling on
> the number.
>

Yeah, there are problems with it.  I'm just putting it out there for
reference and because it's what developers expect regardless of whether
that's a good thing or not.

 4. If you do get into a sticky situation, you can unblock an entire
> timeline by using the CPU signal ioctl to set it to a high value.
>
>
> Well I think that this could be problematic as well. Keep in mind that
> the main use case for this is sharing timelines between processes.
>
> In other words you don't want applications to be able to mess with it too
> much.
>

Cross-process is exactly why you want it.  Suppose you're a compositor and
you have a timeline shared with another application and you've submitted
work which waits on it.  Then you get a notification somehow (SIGHUP?) that
the client has died leaving you hanging.  What do you do?  You take the
semaphore that's shared with you and the client and whack it to UINT64_MAX
to unblock yourself.  Of course, this can be abused and that's always the
risk you take with timelines.


>
> Of all these reasons, I think 1 and 2 carry the most weight.  2, in
> particular, is interesting if we 

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Christian König via amd-gfx

On 15.02.19 at 17:49, Jason Ekstrand wrote:
On Fri, Feb 15, 2019 at 9:52 AM Lionel Landwerlin via dri-devel 
<dri-de...@lists.freedesktop.org> wrote:


On 15/02/2019 14:32, Koenig, Christian wrote:
> On 15.02.19 at 15:23, Lionel Landwerlin wrote:
>> Hi Christian, David,
>>
>> For timeline semaphores we need points to signal in order.
>> I'm struggling to understand how this fence-chain implementation
>> preserves ordering of the seqnos.
>>
>> One of the scenarios where I can see an issue happening is when you have a
>> timeline with points 1 & 2 and userspace submits to 2 different
>> engines:
>>      - first with let's say a blitter style engine on point 2
>>      - then a 3d style engine on point 1
> Yeah, and where exactly is the problem?
>
> Seqno 1 will signal when the 3d style engine finishes work.
>
> And seqno 2 will signal when both seqno 1 is signaled and the
blitter
> style engine has finished its work.


That's an interesting interpretation of the spec.  I think it's legal 
and I could see that behavior may be desirable in some ways.


Well we actually had this discussion multiple times now, both internally 
as well as on the mailing list. Please also see the previous mails with 
Daniel on this topic.


My initial suggestion was actually exactly what Lionel suggested as well.

And following this I used a rather simple container for the 
implementation, e.g. just a ring buffer indexed by the sequence number. 
In this scenario userspace can specify at syncobj creation time how big 
the window for sequence numbers should be, e.g. in this implementation 
how big the ring buffer would be.


This was rejected by our guys who actually wrote a good part of the 
Vulkan specification. Daniel then went in the same direction 
during the public discussion.


[SNIP]
I think what Christian is suggesting is a valid interpretation of the 
spec though it is rather unconventional.  The Vulkan spec, as it 
stands today, requires that the application ensure that at the time of 
signaling, the timeline semaphore value increases.  This means that 
all of the above possible cases are technically illegal in Vulkan and 
so it doesn't really matter what we do as long as we don't do anything 
especially stupid.


And exactly that's the point. When an application does something stupid 
with its own submissions then this is not much of a problem.


But this interface is meant to be made for communication between 
processes, and here we want to be sure that nobody can do anything stupid.


My understanding of how this works on Windows is that a wait operation 
on 3 is a wait until x >= 3 where x is a 64-bit value and a signal 
operation is simply a write to x. This means that, in the above cases, 
waits on 1 will be triggered immediately when 2 is written but waits 
on 2 may or may not happen at all depending on whether the GPU write 
which overwrites x to 1 or the CPU (or potentially GPU in a different 
context) read gets there first such that the reader observes 2.  If 
you mess this up and something isn't signaled, that's your fault.


Yeah and I think that this is actually not a good idea at all. 
Implementing it like this ultimately means that you can only use polling 
on the number.


 4. If you do get into a sticky situation, you can unblock an entire 
timeline by using the CPU signal ioctl to set it to a high value.


Well I think that this could be problematic as well. Keep in mind that 
the main use case for this is sharing timelines between processes.


In other words you don't want applications to be able to mess with it too 
much.




Of all these reasons, I think 1 and 2 carry the most weight.  2, in 
particular, is interesting if we one day want to implement the same 
behavior with a simple 64-bit value like Windows does.  Imagine, for 
instance, a scenario where the GPU is doing its own scheduling or 
command buffers are submitted ahead of the signal operation being 
available and told to just sit on the GPU until they see x >= 3.  
(Yes, there are issues here with residency, contention, etc.  I'm 
asking you to use your imagination.)  Assuming you can do 64-bit 
atomics (there are apparently issues here with PCIe that make things 
sticky), the behavior I'm suggesting is completely implementable in 
that way whereas the behavior Christian is suggesting is only 
implementable if you're maintaining a CPU-side list of fences.  I 
don't think we want to paint ourselves into that corner.


Actually we already had such an implementation with radeon. And I can 
only say that it was a total PAIN IN THE A* to maintain.


This is one of the reasons why we are not using hardware semaphores any 
more with amdgpu.


Regards,
Christian.



--Jason

>
> Regards,
> Christian.
>
>> -Lionel
>>
>> On 07/12/2018 09:55, Chunming Zhou wrote:
>>> From: Christian König 

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Jason Ekstrand
On Fri, Feb 15, 2019 at 9:52 AM Lionel Landwerlin via dri-devel <
dri-de...@lists.freedesktop.org> wrote:

> On 15/02/2019 14:32, Koenig, Christian wrote:
> > Am 15.02.19 um 15:23 schrieb Lionel Landwerlin:
> >> Hi Christian, David,
> >>
> >> For timeline semaphores we need points to signal in order.
> >> I'm struggling to understand how this fence-chain implementation
> >> preserves ordering of the seqnos.
> >>
> >> One of the scenarios where I can see an issue happening is when you have a
> >> timeline with points 1 & 2 and userspace submits to 2 different
> >> engines:
> >>  - first with let's say a blitter style engine on point 2
> >>  - then a 3d style engine on point 1
> > Yeah, and where exactly is the problem?
> >
> > Seqno 1 will signal when the 3d style engine finishes work.
> >
> > And seqno 2 will signal when both seqno 1 is signaled and the blitter
> > style engine has finished its work.
>

That's an interesting interpretation of the spec.  I think it's legal and I
could see that behavior may be desirable in some ways.


> That's not really how I understood the spec, but I might be wrong.
>
> What makes me think 1 should be signaled as soon as 2 is signaled
> (regardless of whether the fence attached on point 1 has been signaled),
> is that the spec defines wait & signal operations in terms of the value
> of the timeline.
>
>
> -Lionel
>
> >
> >> Another scenario would be signaling a timeline with points 1 & 2 with
> >> those points in reverse order in the submission array.
> > That is actually illegal in the spec, but handled gracefully as
> > well.
> >
> > E.g. when you add seqno 1 to the syncobj container it will only signal
> > when 2 is signaled as well.
>

I think what Christian is suggesting is a valid interpretation of the spec
though it is rather unconventional.  The Vulkan spec, as it stands today,
requires that the application ensure that at the time of signaling, the
timeline semaphore value increases.  This means that all of the above
possible cases are technically illegal in Vulkan and so it doesn't really
matter what we do as long as we don't do anything especially stupid.

My understanding of how this works on Windows is that a wait operation on 3
is a wait until x >= 3 where x is a 64-bit value and a signal operation is
simply a write to x.  This means that, in the above cases, waits on 1 will
be triggered immediately when 2 is written but waits on 2 may or may not
happen at all depending on whether the GPU write which overwrites x to 1 or
the CPU (or potentially GPU in a different context) read gets there first
such that the reader observes 2.  If you mess this up and something isn't
signaled, that's your fault.

Instead of specifying things to be exactly the Windows behavior, Vulkan
says that you must only ever increase the value and anything else is
illegal and therefore leads to undefined behavior.  The usual consequences
of undefined behavior apply: anything can happen up to and including
process termination.  In other words, how we handle those cases is
completely up to us as long as we do something sane that doesn't result in
kernel crashes or anything like that.  We do have to handle it in some way
because we can't outright prevent those cases from happening.  The question
then becomes what's the best way for the behavior to degrade.

In my opinion, the smoothest degradation is if you take the Windows model
and replace the 64-bit write to x with a 64-bit atomic MAX operation.  In
other words, signaling 2 automatically unblocks 1 and any attempt to signal
a value lower than the current value is a no-op.  It has a few nice
advantages:

 1. Signaling N is guaranteed to unblock everything waiting on n <= N
regardless of what else may be pending.
 2. It matches what I think is the next natural evolution of the Windows
model where the write is replaced with an atomic.
 3. It gracefully handles the case where the operation to signal 1 is added
after the one to signal 2.  We can also make this case illegal but this
model extends to one in which it could be legal and well-defined.
 4. If you do get into a sticky situation, you can unblock an entire
timeline by using the CPU signal ioctl to set it to a high value.

Of all these reasons, I think 1 and 2 carry the most weight.  2, in
particular, is interesting if we one day want to implement the same
behavior with a simple 64-bit value like Windows does.  Imagine, for
instance, a scenario where the GPU is doing its own scheduling or command
buffers are submitted ahead of the signal operation being available and
told to just sit on the GPU until they see x >= 3.  (Yes, there are issues
here with residency, contention, etc.  I'm asking you to use your
imagination.)  Assuming you can do 64-bit atomics (there are apparently
issues here with PCIe that make things sticky), the behavior I'm suggesting
is completely implementable in that way whereas the behavior Christian is
suggesting is only implementable if you're 

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Christian König via amd-gfx

On 15.02.19 at 16:52, Lionel Landwerlin wrote:

On 15/02/2019 14:32, Koenig, Christian wrote:

On 15.02.19 at 15:23, Lionel Landwerlin wrote:

Hi Christian, David,

For timeline semaphores we need points to signal in order.
I'm struggling to understand how this fence-chain implementation
preserves ordering of the seqnos.

One of the scenarios where I can see an issue happening is when you have a
timeline with points 1 & 2 and userspace submits to 2 different
engines:
 - first with let's say a blitter style engine on point 2
 - then a 3d style engine on point 1

Yeah, and where exactly is the problem?

Seqno 1 will signal when the 3d style engine finishes work.

And seqno 2 will signal when both seqno 1 is signaled and the blitter
style engine has finished its work.


That's not really how I understood the spec, but I might be wrong.

What makes me think 1 should be signaled as soon as 2 is signaled
(regardless of whether the fence attached on point 1 has been signaled),
is that the spec defines wait & signal operations in terms of the value
of the timeline.


That's what we had initially as well and it was rejected.

When 2 signals before 1 is signaled you can't call this a timeline any more.

Christian.




-Lionel




Another scenario would be signaling a timeline with points 1 & 2 with
those points in reverse order in the submission array.

That is actually illegal in the spec, but handled gracefully as
well.

E.g. when you add seqno 1 to the syncobj container it will only signal
when 2 is signaled as well.







Regards,
Christian.


-Lionel

On 07/12/2018 09:55, Chunming Zhou wrote:

From: Christian König 

Lockless container implementation similar to a dma_fence_array, but with
only two elements per node and automatic garbage collection.

v2: properly document dma_fence_chain_for_each, add
dma_fence_chain_find_seqno,
  drop prev reference during garbage collection if it's not a
chain fence.
v3: use head and iterator for dma_fence_chain_for_each
v4: fix reference count in dma_fence_chain_enable_signaling

Signed-off-by: Christian König 
---
   drivers/dma-buf/Makefile  |   3 +-
   drivers/dma-buf/dma-fence-chain.c | 241 ++
   include/linux/dma-fence-chain.h   |  81 ++
   3 files changed, 324 insertions(+), 1 deletion(-)
   create mode 100644 drivers/dma-buf/dma-fence-chain.c
   create mode 100644 include/linux/dma-fence-chain.h

diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
index 0913a6ccab5a..1f006e083eb9 100644
--- a/drivers/dma-buf/Makefile
+++ b/drivers/dma-buf/Makefile
@@ -1,4 +1,5 @@
-obj-y := dma-buf.o dma-fence.o dma-fence-array.o reservation.o
seqno-fence.o
+obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
+ reservation.o seqno-fence.o
   obj-$(CONFIG_SYNC_FILE)    += sync_file.o
   obj-$(CONFIG_SW_SYNC)    += sw_sync.o sync_debug.o
   obj-$(CONFIG_UDMABUF)    += udmabuf.o
diff --git a/drivers/dma-buf/dma-fence-chain.c
b/drivers/dma-buf/dma-fence-chain.c
new file mode 100644
index ..0c5e3c902fa0
--- /dev/null
+++ b/drivers/dma-buf/dma-fence-chain.c
@@ -0,0 +1,241 @@
+/*
+ * fence-chain: chain fences together in a timeline
+ *
+ * Copyright (C) 2018 Advanced Micro Devices, Inc.
+ * Authors:
+ *    Christian König 
+ *
+ * This program is free software; you can redistribute it and/or
modify it
+ * under the terms of the GNU General Public License version 2 as
published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
License for
+ * more details.
+ */
+
+#include <linux/dma-fence-chain.h>
+
+static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
+
+/**
+ * dma_fence_chain_get_prev - use RCU to get a reference to the
previous fence
+ * @chain: chain node to get the previous node from
+ *
+ * Use dma_fence_get_rcu_safe to get a reference to the previous
fence of the
+ * chain node.
+ */
+static struct dma_fence *dma_fence_chain_get_prev(struct
dma_fence_chain *chain)
+{
+    struct dma_fence *prev;
+
+    rcu_read_lock();
+    prev = dma_fence_get_rcu_safe(&chain->prev);
+    rcu_read_unlock();
+    return prev;
+}
+
+/**
+ * dma_fence_chain_walk - chain walking function
+ * @fence: current chain node
+ *
+ * Walk the chain to the next node. Returns the next fence or NULL
if we are at
+ * the end of the chain. Garbage collects chain nodes which are already
+ * signaled.
+ */
+struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence)
+{
+    struct dma_fence_chain *chain, *prev_chain;
+    struct dma_fence *prev, *replacement, *tmp;
+
+    chain = to_dma_fence_chain(fence);
+    if (!chain) {
+    dma_fence_put(fence);
+    return NULL;
+    }
+
+    while ((prev = dma_fence_chain_get_prev(chain))) {
+
+    prev_chain = to_dma_fence_chain(prev);
+ 

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Lionel Landwerlin via amd-gfx

On 15/02/2019 14:32, Koenig, Christian wrote:

On 15.02.19 at 15:23, Lionel Landwerlin wrote:

Hi Christian, David,

For timeline semaphores we need points to signal in order.
I'm struggling to understand how this fence-chain implementation
preserves ordering of the seqnos.

One of the scenarios where I can see an issue happening is when you have a
timeline with points 1 & 2 and userspace submits to 2 different
engines:
     - first with let's say a blitter style engine on point 2
     - then a 3d style engine on point 1

Yeah, and where exactly is the problem?

Seqno 1 will signal when the 3d style engine finishes work.

And seqno 2 will signal when both seqno 1 is signaled and the blitter
style engine has finished its work.


That's not really how I understood the spec, but I might be wrong.

What makes me think 1 should be signaled as soon as 2 is signaled
(regardless of whether the fence attached on point 1 has been signaled),
is that the spec defines wait & signal operations in terms of the value
of the timeline.


-Lionel




Another scenario would be signaling a timeline with points 1 & 2 with
those points in reverse order in the submission array.

That is actually illegal in the spec, but handled gracefully as
well.

E.g. when you add seqno 1 to the syncobj container it will only signal
when 2 is signaled as well.







Regards,
Christian.


-Lionel

On 07/12/2018 09:55, Chunming Zhou wrote:

From: Christian König 

Lockless container implementation similar to a dma_fence_array, but with
only two elements per node and automatic garbage collection.

v2: properly document dma_fence_chain_for_each, add
dma_fence_chain_find_seqno,
  drop prev reference during garbage collection if it's not a
chain fence.
v3: use head and iterator for dma_fence_chain_for_each
v4: fix reference count in dma_fence_chain_enable_signaling

Signed-off-by: Christian König 
---
   drivers/dma-buf/Makefile  |   3 +-
   drivers/dma-buf/dma-fence-chain.c | 241 ++
   include/linux/dma-fence-chain.h   |  81 ++
   3 files changed, 324 insertions(+), 1 deletion(-)
   create mode 100644 drivers/dma-buf/dma-fence-chain.c
   create mode 100644 include/linux/dma-fence-chain.h

diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
index 0913a6ccab5a..1f006e083eb9 100644
--- a/drivers/dma-buf/Makefile
+++ b/drivers/dma-buf/Makefile
@@ -1,4 +1,5 @@
-obj-y := dma-buf.o dma-fence.o dma-fence-array.o reservation.o seqno-fence.o
+obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
+ reservation.o seqno-fence.o
   obj-$(CONFIG_SYNC_FILE)    += sync_file.o
   obj-$(CONFIG_SW_SYNC)    += sw_sync.o sync_debug.o
   obj-$(CONFIG_UDMABUF)    += udmabuf.o
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
new file mode 100644
index ..0c5e3c902fa0
--- /dev/null
+++ b/drivers/dma-buf/dma-fence-chain.c
@@ -0,0 +1,241 @@
+/*
+ * fence-chain: chain fences together in a timeline
+ *
+ * Copyright (C) 2018 Advanced Micro Devices, Inc.
+ * Authors:
+ *    Christian König 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/dma-fence-chain.h>
+
+static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
+
+/**
+ * dma_fence_chain_get_prev - use RCU to get a reference to the previous fence
+ * @chain: chain node to get the previous node from
+ *
+ * Use dma_fence_get_rcu_safe to get a reference to the previous fence of the
+ * chain node.
+ */
+static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain)
+{
+    struct dma_fence *prev;
+
+    rcu_read_lock();
+    prev = dma_fence_get_rcu_safe(&chain->prev);
+    rcu_read_unlock();
+    return prev;
+}
+
+/**
+ * dma_fence_chain_walk - chain walking function
+ * @fence: current chain node
+ *
+ * Walk the chain to the next node. Returns the next fence or NULL if we are at
+ * the end of the chain. Garbage collects chain nodes which are already
+ * signaled.
+ */
+struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence)
+{
+    struct dma_fence_chain *chain, *prev_chain;
+    struct dma_fence *prev, *replacement, *tmp;
+
+    chain = to_dma_fence_chain(fence);
+    if (!chain) {
+    dma_fence_put(fence);
+    return NULL;
+    }
+
+    while ((prev = dma_fence_chain_get_prev(chain))) {
+
+    prev_chain = to_dma_fence_chain(prev);
+    if (prev_chain) {
+    if (!dma_fence_is_signaled(prev_chain->fence))
+    break;
+
+    replacement = dma_fence_chain_get_prev(prev_chain);
+    } else {
+    if 

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Lionel Landwerlin via amd-gfx

Hi Christian, David,

For timeline semaphores we need points to be signaled in order.
I'm struggling to understand how this fence-chain implementation 
preserves the ordering of the seqnos.


One scenario where I can see an issue happening is when you have a 
timeline with points 1 & 2 and userspace submits for 2 different engines:

    - first with let's say a blitter style engine on point 2
    - then a 3d style engine on point 1

Another scenario would be signaling a timeline with points 1 & 2 with 
those points in reverse order in the submission array.


-Lionel

On 07/12/2018 09:55, Chunming Zhou wrote:

From: Christian König 

Lockless container implementation similar to a dma_fence_array, but with
only two elements per node and automatic garbage collection.

v2: properly document dma_fence_chain_for_each, add dma_fence_chain_find_seqno,
 drop prev reference during garbage collection if it's not a chain fence.
v3: use head and iterator for dma_fence_chain_for_each
v4: fix reference count in dma_fence_chain_enable_signaling

Signed-off-by: Christian König 
---
  drivers/dma-buf/Makefile  |   3 +-
  drivers/dma-buf/dma-fence-chain.c | 241 ++
  include/linux/dma-fence-chain.h   |  81 ++
  3 files changed, 324 insertions(+), 1 deletion(-)
  create mode 100644 drivers/dma-buf/dma-fence-chain.c
  create mode 100644 include/linux/dma-fence-chain.h

diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
index 0913a6ccab5a..1f006e083eb9 100644
--- a/drivers/dma-buf/Makefile
+++ b/drivers/dma-buf/Makefile
@@ -1,4 +1,5 @@
-obj-y := dma-buf.o dma-fence.o dma-fence-array.o reservation.o seqno-fence.o
+obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
+reservation.o seqno-fence.o
  obj-$(CONFIG_SYNC_FILE)   += sync_file.o
  obj-$(CONFIG_SW_SYNC) += sw_sync.o sync_debug.o
  obj-$(CONFIG_UDMABUF) += udmabuf.o
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
new file mode 100644
index ..0c5e3c902fa0
--- /dev/null
+++ b/drivers/dma-buf/dma-fence-chain.c
@@ -0,0 +1,241 @@
+/*
+ * fence-chain: chain fences together in a timeline
+ *
+ * Copyright (C) 2018 Advanced Micro Devices, Inc.
+ * Authors:
+ * Christian König 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/dma-fence-chain.h>
+
+static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
+
+/**
+ * dma_fence_chain_get_prev - use RCU to get a reference to the previous fence
+ * @chain: chain node to get the previous node from
+ *
+ * Use dma_fence_get_rcu_safe to get a reference to the previous fence of the
+ * chain node.
+ */
+static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain)
+{
+   struct dma_fence *prev;
+
+   rcu_read_lock();
+   prev = dma_fence_get_rcu_safe(&chain->prev);
+   rcu_read_unlock();
+   return prev;
+}
+
+/**
+ * dma_fence_chain_walk - chain walking function
+ * @fence: current chain node
+ *
+ * Walk the chain to the next node. Returns the next fence or NULL if we are at
+ * the end of the chain. Garbage collects chain nodes which are already
+ * signaled.
+ */
+struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence)
+{
+   struct dma_fence_chain *chain, *prev_chain;
+   struct dma_fence *prev, *replacement, *tmp;
+
+   chain = to_dma_fence_chain(fence);
+   if (!chain) {
+   dma_fence_put(fence);
+   return NULL;
+   }
+
+   while ((prev = dma_fence_chain_get_prev(chain))) {
+
+   prev_chain = to_dma_fence_chain(prev);
+   if (prev_chain) {
+   if (!dma_fence_is_signaled(prev_chain->fence))
+   break;
+
+   replacement = dma_fence_chain_get_prev(prev_chain);
+   } else {
+   if (!dma_fence_is_signaled(prev))
+   break;
+
+   replacement = NULL;
+   }
+
+   tmp = cmpxchg(&chain->prev, prev, replacement);
+   if (tmp == prev)
+   dma_fence_put(tmp);
+   else
+   dma_fence_put(replacement);
+   dma_fence_put(prev);
+   }
+
+   dma_fence_put(fence);
+   return prev;
+}
+EXPORT_SYMBOL(dma_fence_chain_walk);
+
+/**
+ * dma_fence_chain_find_seqno - find fence chain node by seqno
+ * @pfence: pointer to the chain node where to start
+ * @seqno: the sequence number to search for
+ *
+ * Advance the fence pointer to the 

Re: [PATCH 02/11] dma-buf: add new dma_fence_chain container v4

2019-02-15 Thread Koenig, Christian
On 15.02.19 at 15:23, Lionel Landwerlin wrote:
> Hi Christian, David,
>
> For timeline semaphores we need points to be signaled in order.
> I'm struggling to understand how this fence-chain implementation 
> preserves the ordering of the seqnos.
>
> One scenario where I can see an issue happening is when you have a 
> timeline with points 1 & 2 and userspace submits for 2 different 
> engines:
>     - first with let's say a blitter style engine on point 2
>     - then a 3d style engine on point 1

Yeah, and where exactly is the problem?

Seqno 1 will signal when the 3d style engine finishes work.

And seqno 2 will signal when both seqno 1 is signaled and the blitter 
style engine has finished its work.

> Another scenario would be signaling a timeline with points 1 & 2 with 
> those points in reverse order in the submission array.

That is actually illegal per the spec, but handled gracefully as 
well.

E.g. when you add seqno 1 to the syncobj container it will only signal 
when 2 is signaled as well.

Regards,
Christian.

>
> -Lionel
>
> On 07/12/2018 09:55, Chunming Zhou wrote:
>> From: Christian König 
>>
>> Lockless container implementation similar to a dma_fence_array, but with
>> only two elements per node and automatic garbage collection.
>>
>> v2: properly document dma_fence_chain_for_each, add dma_fence_chain_find_seqno,
>>  drop prev reference during garbage collection if it's not a chain fence.
>> v3: use head and iterator for dma_fence_chain_for_each
>> v4: fix reference count in dma_fence_chain_enable_signaling
>>
>> Signed-off-by: Christian König 
>> ---
>>   drivers/dma-buf/Makefile  |   3 +-
>>   drivers/dma-buf/dma-fence-chain.c | 241 ++
>>   include/linux/dma-fence-chain.h   |  81 ++
>>   3 files changed, 324 insertions(+), 1 deletion(-)
>>   create mode 100644 drivers/dma-buf/dma-fence-chain.c
>>   create mode 100644 include/linux/dma-fence-chain.h
>>
>> diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
>> index 0913a6ccab5a..1f006e083eb9 100644
>> --- a/drivers/dma-buf/Makefile
>> +++ b/drivers/dma-buf/Makefile
>> @@ -1,4 +1,5 @@
>> -obj-y := dma-buf.o dma-fence.o dma-fence-array.o reservation.o seqno-fence.o
>> +obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
>> + reservation.o seqno-fence.o
>>   obj-$(CONFIG_SYNC_FILE)    += sync_file.o
>>   obj-$(CONFIG_SW_SYNC)    += sw_sync.o sync_debug.o
>>   obj-$(CONFIG_UDMABUF)    += udmabuf.o
>> diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
>> new file mode 100644
>> index ..0c5e3c902fa0
>> --- /dev/null
>> +++ b/drivers/dma-buf/dma-fence-chain.c
>> @@ -0,0 +1,241 @@
>> +/*
>> + * fence-chain: chain fences together in a timeline
>> + *
>> + * Copyright (C) 2018 Advanced Micro Devices, Inc.
>> + * Authors:
>> + *    Christian König 
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/dma-fence-chain.h>
>> +
>> +static bool dma_fence_chain_enable_signaling(struct dma_fence *fence);
>> +
>> +/**
>> + * dma_fence_chain_get_prev - use RCU to get a reference to the previous fence
>> + * @chain: chain node to get the previous node from
>> + *
>> + * Use dma_fence_get_rcu_safe to get a reference to the previous fence of the
>> + * chain node.
>> + */
>> +static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain)
>> +{
>> +    struct dma_fence *prev;
>> +
>> +    rcu_read_lock();
>> +    prev = dma_fence_get_rcu_safe(&chain->prev);
>> +    rcu_read_unlock();
>> +    return prev;
>> +}
>> +
>> +/**
>> + * dma_fence_chain_walk - chain walking function
>> + * @fence: current chain node
>> + *
>> + * Walk the chain to the next node. Returns the next fence or NULL if we are at
>> + * the end of the chain. Garbage collects chain nodes which are already
>> + * signaled.
>> + */
>> +struct dma_fence *dma_fence_chain_walk(struct dma_fence *fence)
>> +{
>> +    struct dma_fence_chain *chain, *prev_chain;
>> +    struct dma_fence *prev, *replacement, *tmp;
>> +
>> +    chain = to_dma_fence_chain(fence);
>> +    if (!chain) {
>> +    dma_fence_put(fence);
>> +    return NULL;
>> +    }
>> +
>> +    while ((prev = dma_fence_chain_get_prev(chain))) {
>> +
>> +    prev_chain = to_dma_fence_chain(prev);
>> +    if (prev_chain) {
>> +    if (!dma_fence_is_signaled(prev_chain->fence))
>> +    break;
>> +
>> +    replacement = dma_fence_chain_get_prev(prev_chain);
>> +    } else {
>> +