Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Marek Olšák
Timeline semaphore waits (polling on memory) will be unmonitored and as
fast as the roundtrip to memory. Semaphore writes will be slower because
the copy of those write requests will also be forwarded to the kernel.
The hw does not protect against arbitrary writes, but the kernel will take
action against such behavior because it will receive them too.

I don't know if that would work with dma_fence.

Marek


On Thu, Jun 17, 2021 at 3:04 PM Daniel Vetter  wrote:

> On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> > The kernel will know who should touch the implicit-sync semaphore next,
> and
> > at the same time, the copy of all write requests to the implicit-sync
> > semaphore will be forwarded to the kernel for monitoring and bo_wait.
> >
> > Syncobjs could either use the same monitored access as implicit sync or
> be
> > completely unmonitored. We haven't decided yet.
> >
> > Syncfiles could either use one of the above or wait for a syncobj to go
> > idle before converting to a syncfile.
>
> Hm this sounds all like you're planning to completely rewrap everything
> ... I'm assuming the plan is still that this is going to be largely
> wrapped in dma_fence? Maybe with timeline objects being a bit more
> optimized, but I'm not sure how much you can optimize without breaking the
> interfaces.
> -Daniel
>
> >
> > Marek
> >
> >
> >
> > On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:
> >
> > > On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > > > As long as we can figure out who touched a certain sync object
> last
> > > that
> > > > would indeed work, yes.
> > >
> > > Don't you need to know who will touch it next, i.e. who is holding up
> your
> > > fence? Or maybe I'm just again totally confused.
> > > -Daniel
> > >
> > > >
> > > > Christian.
> > > >
> > > > On 14.06.21 at 19:10, Marek Olšák wrote:
> > > > > The call to the hw scheduler has a limitation on the size of all
> > > > > parameters combined. I think we can only pass a 32-bit sequence
> number
> > > > > and a ~16-bit global (per-GPU) syncobj handle in one call and not
> much
> > > > > else.
> > > > >
> > > > > The syncobj handle can be an element index in a global (per-GPU)
> > > syncobj
> > > > > table and it's read only for all processes with the exception of
> the
> > > > > signal command. Syncobjs can either have per VMID write access
> flags
> > > for
> > > > > the signal command (slow), or any process can write to any
> syncobjs and
> > > > > only rely on the kernel checking the write log (fast).
> > > > >
> > > > > In any case, we can execute the memory write in the queue engine
> and
> > > > > only use the hw scheduler for logging, which would be perfect.
> > > > >
> > > > > Marek
> > > > >
> > > > > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > > > > wrote:
> > > > >
> > > > > Hi guys,
> > > > >
> > > > > maybe soften that a bit. Reading from the shared memory of the
> > > > > user fence is ok for everybody. What we need to take more care
> of
> > > > > is the writing side.
> > > > >
> > > > > So my current thinking is that we allow read only access, but
> > > > > writing a new sequence value needs to go through the
> > > scheduler/kernel.
> > > > >
> > > > > So when the CPU wants to signal a timeline fence it needs to
> call
> > > > > an IOCTL. When the GPU wants to signal the timeline fence it
> needs
> > > > > to hand that off to the hardware scheduler.
> > > > >
> > > > > If we lock up, the kernel can check with the hardware who did the
> > > > > last write and what value was written.
> > > > >
> > > > > That together with an IOCTL to give out sequence numbers for
> > > > > implicit sync to applications should be sufficient for the
> kernel
> > > > > to track who is responsible if something bad happens.
> > > > >
> > > > > In other words when the hardware says that the shader wrote
> stuff
> > > > > like 0xdeadbeef 0x0 or 0x into memory we kill the
> process
> > > > > who did that.
> > > > >
> > > > > If the hardware says that seq - 1 was written fine, but seq is
> > > > > missing then the kernel blames whoever was supposed to write
> seq.
> > > > >
> > > > > Just piping the write through a privileged instance should be
> > > > > fine to make sure that we don't run into issues.
> > > > >
> > > > > Christian.
> > > > >
> > > > > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > > > > Hi Daniel,
> > > > > >
> > > > > > We just talked about this whole topic internally and we came
> > > > > > to the conclusion that the hardware needs to understand sync
> > > > > > object handles and have high-level wait and signal
> operations in
> > > > > > the command stream. Sync objects will be backed by memory,
> but
> > > > > > they won't be readable or writable by processes directly. The
> > > > > > hardware will log all accesses to sync objects and will send the
> > > > > > log to the kernel periodically. The kernel will identify
> > > > > > malicious behavior.

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Daniel Vetter
On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
> The kernel will know who should touch the implicit-sync semaphore next, and
> at the same time, the copy of all write requests to the implicit-sync
> semaphore will be forwarded to the kernel for monitoring and bo_wait.
> 
> Syncobjs could either use the same monitored access as implicit sync or be
> completely unmonitored. We haven't decided yet.
> 
> Syncfiles could either use one of the above or wait for a syncobj to go
> idle before converting to a syncfile.

Hm this sounds all like you're planning to completely rewrap everything
... I'm assuming the plan is still that this is going to be largely
wrapped in dma_fence? Maybe with timeline objects being a bit more
optimized, but I'm not sure how much you can optimize without breaking the
interfaces.
-Daniel

> 
> Marek
> 
> 
> 
> On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:
> 
> > On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > > As long as we can figure out who touched a certain sync object last
> > that
> > > would indeed work, yes.
> >
> > Don't you need to know who will touch it next, i.e. who is holding up your
> > fence? Or maybe I'm just again totally confused.
> > -Daniel
> >
> > >
> > > Christian.
> > >
> > > On 14.06.21 at 19:10, Marek Olšák wrote:
> > > > The call to the hw scheduler has a limitation on the size of all
> > > > parameters combined. I think we can only pass a 32-bit sequence number
> > > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > > else.
> > > >
> > > > The syncobj handle can be an element index in a global (per-GPU)
> > syncobj
> > > > table and it's read only for all processes with the exception of the
> > > > signal command. Syncobjs can either have per VMID write access flags
> > for
> > > > the signal command (slow), or any process can write to any syncobjs and
> > > > only rely on the kernel checking the write log (fast).
> > > >
> > > > In any case, we can execute the memory write in the queue engine and
> > > > only use the hw scheduler for logging, which would be perfect.
> > > >
> > > > Marek
> > > >
> > > > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > > > wrote:
> > > >
> > > > Hi guys,
> > > >
> > > > maybe soften that a bit. Reading from the shared memory of the
> > > > user fence is ok for everybody. What we need to take more care of
> > > > is the writing side.
> > > >
> > > > So my current thinking is that we allow read only access, but
> > > > writing a new sequence value needs to go through the
> > scheduler/kernel.
> > > >
> > > > So when the CPU wants to signal a timeline fence it needs to call
> > > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > > to hand that off to the hardware scheduler.
> > > >
> > > > If we lock up, the kernel can check with the hardware who did the
> > > > last write and what value was written.
> > > >
> > > > That together with an IOCTL to give out sequence numbers for
> > > > implicit sync to applications should be sufficient for the kernel
> > > > to track who is responsible if something bad happens.
> > > >
> > > > In other words when the hardware says that the shader wrote stuff
> > > > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > > > who did that.
> > > >
> > > > If the hardware says that seq - 1 was written fine, but seq is
> > > > missing then the kernel blames whoever was supposed to write seq.
> > > >
> > > > Just piping the write through a privileged instance should be
> > > > fine to make sure that we don't run into issues.
> > > >
> > > > Christian.
> > > >
> > > > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > > > Hi Daniel,
> > > > >
> > > > > We just talked about this whole topic internally and we came
> > > > > to the conclusion that the hardware needs to understand sync
> > > > > object handles and have high-level wait and signal operations in
> > > > > the command stream. Sync objects will be backed by memory, but
> > > > > they won't be readable or writable by processes directly. The
> > > > > hardware will log all accesses to sync objects and will send the
> > > > > log to the kernel periodically. The kernel will identify
> > > > > malicious behavior.
> > > > >
> > > > > Example of a hardware command stream:
> > > > > ...
> > > > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > > > number is assigned by the kernel
> > > > > Draw();
> > > > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > > > ...
> > > > >
> > > > > I'm afraid we have no other choice because of the TLB
> > > > > invalidation overhead.
> > > > >
> > > > > Marek
> > > > >
> > > > >
> > > > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Marek Olšák
The kernel will know who should touch the implicit-sync semaphore next, and
at the same time, the copy of all write requests to the implicit-sync
semaphore will be forwarded to the kernel for monitoring and bo_wait.

Syncobjs could either use the same monitored access as implicit sync or be
completely unmonitored. We haven't decided yet.

Syncfiles could either use one of the above or wait for a syncobj to go
idle before converting to a syncfile.

Marek



On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter  wrote:

> On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> > As long as we can figure out who touched a certain sync object last
> that
> > would indeed work, yes.
>
> Don't you need to know who will touch it next, i.e. who is holding up your
> fence? Or maybe I'm just again totally confused.
> -Daniel
>
> >
> > Christian.
> >
> > On 14.06.21 at 19:10, Marek Olšák wrote:
> > > The call to the hw scheduler has a limitation on the size of all
> > > parameters combined. I think we can only pass a 32-bit sequence number
> > > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > > else.
> > >
> > > The syncobj handle can be an element index in a global (per-GPU)
> syncobj
> > > table and it's read only for all processes with the exception of the
> > > signal command. Syncobjs can either have per VMID write access flags
> for
> > > the signal command (slow), or any process can write to any syncobjs and
> > > only rely on the kernel checking the write log (fast).
> > >
> > > In any case, we can execute the memory write in the queue engine and
> > > only use the hw scheduler for logging, which would be perfect.
> > >
> > > Marek
> > >
> > > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > > wrote:
> > >
> > > Hi guys,
> > >
> > > maybe soften that a bit. Reading from the shared memory of the
> > > user fence is ok for everybody. What we need to take more care of
> > > is the writing side.
> > >
> > > So my current thinking is that we allow read only access, but
> > > writing a new sequence value needs to go through the
> scheduler/kernel.
> > >
> > > So when the CPU wants to signal a timeline fence it needs to call
> > > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > > to hand that off to the hardware scheduler.
> > >
> > > If we lock up, the kernel can check with the hardware who did the
> > > last write and what value was written.
> > >
> > > That together with an IOCTL to give out sequence numbers for
> > > implicit sync to applications should be sufficient for the kernel
> > > to track who is responsible if something bad happens.
> > >
> > > In other words when the hardware says that the shader wrote stuff
> > > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > > who did that.
> > >
> > > If the hardware says that seq - 1 was written fine, but seq is
> > > missing then the kernel blames whoever was supposed to write seq.
> > >
> > > Just piping the write through a privileged instance should be
> > > fine to make sure that we don't run into issues.
> > >
> > > Christian.
> > >
> > > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > > Hi Daniel,
> > > >
> > > > We just talked about this whole topic internally and we came
> > > > to the conclusion that the hardware needs to understand sync
> > > > object handles and have high-level wait and signal operations in
> > > > the command stream. Sync objects will be backed by memory, but
> > > > they won't be readable or writable by processes directly. The
> > > > hardware will log all accesses to sync objects and will send the
> > > > log to the kernel periodically. The kernel will identify
> > > > malicious behavior.
> > > >
> > > > Example of a hardware command stream:
> > > > ...
> > > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > > number is assigned by the kernel
> > > > Draw();
> > > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > > ...
> > > >
> > > > I'm afraid we have no other choice because of the TLB
> > > > invalidation overhead.
> > > >
> > > > Marek
> > > >
> > > >
> > > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:
> > > >
> > > > On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König
> wrote:
> > > > > On 09.06.21 at 15:19, Daniel Vetter wrote:
> > > > > > [SNIP]
> > > > > > > Yeah, we call this the lightweight and the heavyweight
> > > > tlb flush.
> > > > > > >
> > > > > > > The lightweight can be used when you are sure that you
> > > > don't have any of the
> > > > > > > PTEs currently in flight in the 3D/DMA engine and you
> > > > just need to
> > > > > > > invalidate the TLB.

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-17 Thread Daniel Vetter
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
> As long as we can figure out who touched a certain sync object last that
> would indeed work, yes.

Don't you need to know who will touch it next, i.e. who is holding up your
fence? Or maybe I'm just again totally confused.
-Daniel

> 
> Christian.
> 
> On 14.06.21 at 19:10, Marek Olšák wrote:
> > The call to the hw scheduler has a limitation on the size of all
> > parameters combined. I think we can only pass a 32-bit sequence number
> > and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> > else.
> > 
> > The syncobj handle can be an element index in a global (per-GPU) syncobj
> > table and it's read only for all processes with the exception of the
> > signal command. Syncobjs can either have per VMID write access flags for
> > the signal command (slow), or any process can write to any syncobjs and
> > only rely on the kernel checking the write log (fast).
> > 
> > In any case, we can execute the memory write in the queue engine and
> > only use the hw scheduler for logging, which would be perfect.
> > 
> > Marek
> > 
> > On Thu, Jun 10, 2021 at 12:33 PM Christian König
> > wrote:
> > 
> > Hi guys,
> > 
> > maybe soften that a bit. Reading from the shared memory of the
> > user fence is ok for everybody. What we need to take more care of
> > is the writing side.
> > 
> > So my current thinking is that we allow read only access, but
> > writing a new sequence value needs to go through the scheduler/kernel.
> > 
> > So when the CPU wants to signal a timeline fence it needs to call
> > an IOCTL. When the GPU wants to signal the timeline fence it needs
> > to hand that off to the hardware scheduler.
> > 
> > If we lock up, the kernel can check with the hardware who did the
> > last write and what value was written.
> > 
> > That together with an IOCTL to give out sequence numbers for
> > implicit sync to applications should be sufficient for the kernel
> > to track who is responsible if something bad happens.
> > 
> > In other words when the hardware says that the shader wrote stuff
> > like 0xdeadbeef 0x0 or 0x into memory we kill the process
> > who did that.
> > 
> > If the hardware says that seq - 1 was written fine, but seq is
> > missing then the kernel blames whoever was supposed to write seq.
> > 
> > Just piping the write through a privileged instance should be
> > fine to make sure that we don't run into issues.
> > 
> > Christian.
> > 
> > On 10.06.21 at 17:59, Marek Olšák wrote:
> > > Hi Daniel,
> > > 
> > > We just talked about this whole topic internally and we came
> > > to the conclusion that the hardware needs to understand sync
> > > object handles and have high-level wait and signal operations in
> > > the command stream. Sync objects will be backed by memory, but
> > > they won't be readable or writable by processes directly. The
> > > hardware will log all accesses to sync objects and will send the
> > > log to the kernel periodically. The kernel will identify
> > > malicious behavior.
> > > 
> > > Example of a hardware command stream:
> > > ...
> > > ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
> > > number is assigned by the kernel
> > > Draw();
> > > ImplicitSyncSignalWhenDone(syncObjHandle);
> > > ...
> > > 
> > > I'm afraid we have no other choice because of the TLB
> > > invalidation overhead.
> > > 
> > > Marek
> > > 
> > > 
> > > On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:
> > > 
> > > On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > > > On 09.06.21 at 15:19, Daniel Vetter wrote:
> > > > > [SNIP]
> > > > > > Yeah, we call this the lightweight and the heavyweight
> > > tlb flush.
> > > > > >
> > > > > > The lightweight can be used when you are sure that you
> > > don't have any of the
> > > > > > PTEs currently in flight in the 3D/DMA engine and you
> > > just need to
> > > > > > invalidate the TLB.
> > > > > >
> > > > > > The heavyweight must be used when you need to
> > > invalidate the TLB *AND* make
> > > > > > sure that no concurrent operation moves new stuff
> > > into the TLB.
> > > > > >
> > > > > > The problem is for this use case we have to use the
> > > heavyweight one.
> > > > > Just for my own curiosity: So the lightweight flush is
> > > only for in-between
> > > > > CS when you know access is idle? Or does that also not
> > > work if userspace
> > > > > has a CS on a dma engine going at the same time because
> > > the tlb aren't
> > > isolated enough between engines?

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-14 Thread Christian König
As long as we can figure out who touched a certain sync object last 
that would indeed work, yes.


Christian.

On 14.06.21 at 19:10, Marek Olšák wrote:
The call to the hw scheduler has a limitation on the size of all 
parameters combined. I think we can only pass a 32-bit sequence number 
and a ~16-bit global (per-GPU) syncobj handle in one call and not much 
else.


The syncobj handle can be an element index in a global (per-GPU) 
syncobj table and it's read only for all processes with the exception 
of the signal command. Syncobjs can either have per VMID write access 
flags for the signal command (slow), or any process can write to any 
syncobjs and only rely on the kernel checking the write log (fast).


In any case, we can execute the memory write in the queue engine and 
only use the hw scheduler for logging, which would be perfect.


Marek

On Thu, Jun 10, 2021 at 12:33 PM Christian König 
wrote:


Hi guys,

maybe soften that a bit. Reading from the shared memory of the
user fence is ok for everybody. What we need to take more care of
is the writing side.

So my current thinking is that we allow read only access, but
writing a new sequence value needs to go through the scheduler/kernel.

So when the CPU wants to signal a timeline fence it needs to call
an IOCTL. When the GPU wants to signal the timeline fence it needs
to hand that off to the hardware scheduler.

If we lock up, the kernel can check with the hardware who did the
last write and what value was written.

That together with an IOCTL to give out sequence numbers for
implicit sync to applications should be sufficient for the kernel
to track who is responsible if something bad happens.

In other words when the hardware says that the shader wrote stuff
like 0xdeadbeef 0x0 or 0x into memory we kill the process
who did that.

If the hardware says that seq - 1 was written fine, but seq is
missing then the kernel blames whoever was supposed to write seq.

Just piping the write through a privileged instance should be
fine to make sure that we don't run into issues.

Christian.

On 10.06.21 at 17:59, Marek Olšák wrote:

Hi Daniel,

We just talked about this whole topic internally and we came
to the conclusion that the hardware needs to understand sync
object handles and have high-level wait and signal operations in
the command stream. Sync objects will be backed by memory, but
they won't be readable or writable by processes directly. The
hardware will log all accesses to sync objects and will send the
log to the kernel periodically. The kernel will identify
malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
number is assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB
invalidation overhead.

Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <dan...@ffwll.ch> wrote:

On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> On 09.06.21 at 15:19, Daniel Vetter wrote:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight
tlb flush.
> > >
> > > The lightweight can be used when you are sure that you
don't have any of the
> > > PTEs currently in flight in the 3D/DMA engine and you
just need to
> > > invalidate the TLB.
> > >
> > > The heavyweight must be used when you need to
invalidate the TLB *AND* make
> > > sure that no concurrent operation moves new stuff
into the TLB.
> > >
> > > The problem is for this use case we have to use the
heavyweight one.
> > Just for my own curiosity: So the lightweight flush is
only for in-between
> > CS when you know access is idle? Or does that also not
work if userspace
> > has a CS on a dma engine going at the same time because
the tlb aren't
> > isolated enough between engines?
>
> More or less correct, yes.
>
> The problem is a lightweight flush only invalidates the
TLB, but doesn't
> take care of entries which have been handed out to the
different engines.
>
> In other words what can happen is the following:
>
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it
asks the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts
that into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for
address X.
> 5. You send lightweight TLB invalidate and TLB throws away cached values
> for address X.

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-14 Thread Marek Olšák
The call to the hw scheduler has a limitation on the size of all parameters
combined. I think we can only pass a 32-bit sequence number and a ~16-bit
global (per-GPU) syncobj handle in one call and not much else.

The syncobj handle can be an element index in a global (per-GPU) syncobj
table and it's read only for all processes with the exception of the signal
command. Syncobjs can either have per VMID write access flags for the
signal command (slow), or any process can write to any syncobjs and only
rely on the kernel checking the write log (fast).
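
As a sketch of what such a table entry might contain (purely illustrative;
the struct and field names are assumptions, not a real interface):

#include <stdint.h>

/* Hypothetical entry of the global per-GPU syncobj table; the table is
 * mapped read-only into every process, and only the hw signal command
 * (or the kernel) writes to it. */
struct gpu_syncobj {
    uint64_t seq;         /* last signaled sequence number */
    uint32_t signal_vmid; /* VMID allowed to signal, for the slow variant */
    uint32_t flags;
};

/* A ~16-bit handle is just an index into the table, which is why a wait
 * packet only needs the handle plus a 32-bit sequence number. */
static inline uint64_t syncobj_read_seq(const struct gpu_syncobj *table,
                                        uint16_t handle)
{
    return table[handle].seq;
}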

In any case, we can execute the memory write in the queue engine and only
use the hw scheduler for logging, which would be perfect.

Marek

On Thu, Jun 10, 2021 at 12:33 PM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Hi guys,
>
> maybe soften that a bit. Reading from the shared memory of the user fence
> is ok for everybody. What we need to take more care of is the writing side.
>
> So my current thinking is that we allow read only access, but writing a
> new sequence value needs to go through the scheduler/kernel.
>
> So when the CPU wants to signal a timeline fence it needs to call an
> IOCTL. When the GPU wants to signal the timeline fence it needs to hand
> that off to the hardware scheduler.
>
> If we lock up, the kernel can check with the hardware who did the last write
> and what value was written.
>
> That together with an IOCTL to give out sequence numbers for implicit sync
> to applications should be sufficient for the kernel to track who is
> responsible if something bad happens.
>
> In other words when the hardware says that the shader wrote stuff like
> 0xdeadbeef 0x0 or 0x into memory we kill the process who did that.
>
> If the hardware says that seq - 1 was written fine, but seq is missing
> then the kernel blames whoever was supposed to write seq.
>
> Just piping the write through a privileged instance should be fine to
> make sure that we don't run into issues.
>
> Christian.
>
> On 10.06.21 at 17:59, Marek Olšák wrote:
>
> Hi Daniel,
>
> We just talked about this whole topic internally and we came to the
> conclusion that the hardware needs to understand sync object handles and
> have high-level wait and signal operations in the command stream. Sync
> objects will be backed by memory, but they won't be readable or writable by
> processes directly. The hardware will log all accesses to sync objects and
> will send the log to the kernel periodically. The kernel will identify
> malicious behavior.
>
> Example of a hardware command stream:
> ...
> ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
> assigned by the kernel
> Draw();
> ImplicitSyncSignalWhenDone(syncObjHandle);
> ...
>
> I'm afraid we have no other choice because of the TLB invalidation
> overhead.
>
> Marek
>
>
> On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  wrote:
>
>> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>> > On 09.06.21 at 15:19, Daniel Vetter wrote:
>> > > [SNIP]
>> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
>> > > >
>> > > > The lightweight can be used when you are sure that you don't have
>> any of the
>> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
>> > > > invalidate the TLB.
>> > > >
>> > > > The heavyweight must be used when you need to invalidate the TLB
>> *AND* make
>> > > > sure that no concurrent operation moves new stuff into the TLB.
>> > > >
>> > > > The problem is for this use case we have to use the heavyweight one.
>> > > Just for my own curiosity: So the lightweight flush is only for
>> in-between
>> > > CS when you know access is idle? Or does that also not work if
>> userspace
>> > > has a CS on a dma engine going at the same time because the tlb aren't
>> > > isolated enough between engines?
>> >
>> > More or less correct, yes.
>> >
>> > The problem is a lightweight flush only invalidates the TLB, but doesn't
>> > take care of entries which have been handed out to the different
>> engines.
>> >
>> > In other words what can happen is the following:
>> >
>> > 1. Shader asks TLB to resolve address X.
>> > 2. TLB looks into its cache and can't find address X so it asks the
>> walker
>> > to resolve.
>> > 3. Walker comes back with result for address X and TLB puts that into
>> its
>> > cache and gives it to Shader.
>> > 4. Shader starts doing some operation using result for address X.
>> > 5. You send lightweight TLB invalidate and TLB throws away cached
>> values for
>> > address X.
>> > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
>> > accesses address X
>> >
>> > See it like the shader has their own 1 entry L0 TLB cache which is not
>> > affected by the lightweight flush.
>> >
>> > The heavyweight flush on the other hand sends out a broadcast signal to
>> > everybody and only comes back when we are sure that an address is not
>> in use
>> > any more.
>>
>> Ah makes sense. On intel the shaders only operate in VA, everything goes
>> around as explicit async messages to IO blocks.

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-10 Thread Christian König

Hi guys,

maybe soften that a bit. Reading from the shared memory of the user 
fence is ok for everybody. What we need to take more care of is the 
writing side.


So my current thinking is that we allow read only access, but writing a 
new sequence value needs to go through the scheduler/kernel.


So when the CPU wants to signal a timeline fence it needs to call an 
IOCTL. When the GPU wants to signal the timeline fence it needs to hand 
that off to the hardware scheduler.
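
A rough sketch of that CPU-side signal path as an ioctl; the names, fields,
and ioctl number below are invented for illustration only:

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical ioctl payload: userspace asks the kernel to perform the
 * signal write on its behalf, so the fence memory is never written
 * directly by the process. */
struct timeline_signal_args {
    uint32_t syncobj_handle;  /* which timeline fence */
    uint32_t pad;
    uint64_t seq;             /* new sequence value */
};

/* 'd' is the DRM ioctl magic; the number 0x80 is made up here. */
#define IOCTL_TIMELINE_SIGNAL _IOW('d', 0x80, struct timeline_signal_args)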


If we lock up, the kernel can check with the hardware who did the last 
write and what value was written.


That together with an IOCTL to give out sequence numbers for implicit 
sync to applications should be sufficient for the kernel to track who is 
responsible if something bad happens.


In other words when the hardware says that the shader wrote stuff like 
0xdeadbeef 0x0 or 0x into memory we kill the process who did that.


If the hardware says that seq - 1 was written fine, but seq is missing 
then the kernel blames whoever was supposed to write seq.


Just piping the write through a privileged instance should be fine to 
make sure that we don't run into issues.
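
In outline, the kernel-side check over the hardware write log could look like
this (a sketch only; the record format and the expected-writer bookkeeping are
assumptions, not anything specified in this thread):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record from the write log the hardware hands to the
 * kernel periodically. */
struct sync_write_record {
    uint32_t vmid;      /* who performed the write */
    uint32_t syncobj;   /* which sync object was written */
    uint64_t value;     /* what was written */
};

/* Sketch of the blame rule described above: a write is only well-formed
 * if it comes from the expected writer and carries exactly the next
 * sequence value; anything else (0xdeadbeef, 0x0, a skipped seq) gets
 * the writing process killed. */
static bool sync_write_ok(const struct sync_write_record *rec,
                          uint32_t expected_vmid, uint64_t expected_seq)
{
    if (rec->vmid != expected_vmid)
        return false;   /* arbitrary write from the wrong process */
    if (rec->value != expected_seq)
        return false;   /* garbage value or missing sequence number */
    return true;
}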


Christian.

On 10.06.21 at 17:59, Marek Olšák wrote:

Hi Daniel,

We just talked about this whole topic internally and we came to the 
conclusion that the hardware needs to understand sync object handles 
and have high-level wait and signal operations in the command stream. 
Sync objects will be backed by memory, but they won't be readable or 
writable by processes directly. The hardware will log all accesses to 
sync objects and will send the log to the kernel periodically. The 
kernel will identify malicious behavior.


Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence 
number is assigned by the kernel

Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB invalidation 
overhead.


Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:


On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> On 09.06.21 at 15:19, Daniel Vetter wrote:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb
flush.
> > >
> > > The lightweight can be used when you are sure that you don't
have any of the
> > > PTEs currently in flight in the 3D/DMA engine and you just
need to
> > > invalidate the TLB.
> > >
> > > The heavyweight must be used when you need to invalidate the
TLB *AND* make
> > > sure that no concurrent operation moves new stuff into the
TLB.
> > >
> > > The problem is for this use case we have to use the
heavyweight one.
> > Just for my own curiosity: So the lightweight flush is only
for in-between
> > CS when you know access is idle? Or does that also not work if
userspace
> > has a CS on a dma engine going at the same time because the
tlb aren't
> > isolated enough between engines?
>
> More or less correct, yes.
>
> The problem is a lightweight flush only invalidates the TLB, but
doesn't
> take care of entries which have been handed out to the different
engines.
>
> In other words what can happen is the following:
>
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it asks
the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts that
into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for address X.
> 5. You send lightweight TLB invalidate and TLB throws away
cached values for
> address X.
> 6. Shader happily still uses whatever the TLB gave to it in step
3 to
> accesses address X
>
> See it like the shader has their own 1 entry L0 TLB cache which
is not
> affected by the lightweight flush.
>
> The heavyweight flush on the other hand sends out a broadcast
signal to
> everybody and only comes back when we are sure that an address
is not in use
> any more.

Ah makes sense. On intel the shaders only operate in VA,
everything goes
around as explicit async messages to IO blocks. So we don't have
this, the
only difference in tlb flushes is between tlb flush in the IB and
an mmio
one which is independent for anything currently being executed on an
engine.
-Daniel
-- 
Daniel Vetter

Software Engineer, Intel Corporation
http://blog.ffwll.ch 





Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-10 Thread Marek Olšák
Hi Daniel,

We just talked about this whole topic internally and we came to the
conclusion that the hardware needs to understand sync object handles and
have high-level wait and signal operations in the command stream. Sync
objects will be backed by memory, but they won't be readable or writable by
processes directly. The hardware will log all accesses to sync objects and
will send the log to the kernel periodically. The kernel will identify
malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is
assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...
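
Rendered as a userspace-driver sketch, emitting those two packets might look
like this; the opcode values and helper names are invented, only the
wait/signal split mirrors the pseudocode above:

#include <stdint.h>

/* Hypothetical packet emission into a command buffer; opcodes are
 * placeholders, not real hardware encodings. */
static void emit_implicit_sync_wait(uint32_t *cs, unsigned *idx,
                                    uint16_t syncobj_handle, uint32_t seq)
{
    cs[(*idx)++] = 0x10000000u | syncobj_handle; /* WAIT op + handle */
    cs[(*idx)++] = seq;                          /* kernel-assigned seq */
}

static void emit_implicit_sync_signal(uint32_t *cs, unsigned *idx,
                                      uint16_t syncobj_handle)
{
    cs[(*idx)++] = 0x20000000u | syncobj_handle; /* SIGNAL-when-done op */
}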

I'm afraid we have no other choice because of the TLB invalidation overhead.

Marek


On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter  wrote:

> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> > On 09.06.21 at 15:19, Daniel Vetter wrote:
> > > [SNIP]
> > > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > >
> > > > The lightweight can be used when you are sure that you don't have any
> of the
> > > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > > invalidate the TLB.
> > > >
> > > > The heavyweight must be used when you need to invalidate the TLB
> *AND* make
> > > > sure that no concurrent operation moves new stuff into the TLB.
> > > >
> > > > The problem is for this use case we have to use the heavyweight one.
> > > Just for my own curiosity: So the lightweight flush is only for
> in-between
> > > CS when you know access is idle? Or does that also not work if
> userspace
> > > has a CS on a dma engine going at the same time because the tlb aren't
> > > isolated enough between engines?
> >
> > More or less correct, yes.
> >
> > The problem is a lightweight flush only invalidates the TLB, but doesn't
> > take care of entries which have been handed out to the different engines.
> >
> > In other words what can happen is the following:
> >
> > 1. Shader asks TLB to resolve address X.
> > 2. TLB looks into its cache and can't find address X so it asks the
> walker
> > to resolve.
> > 3. Walker comes back with result for address X and TLB puts that into its
> > cache and gives it to Shader.
> > 4. Shader starts doing some operation using result for address X.
> > 5. You send lightweight TLB invalidate and TLB throws away cached values
> for
> > address X.
> > 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> > accesses address X
> >
> > See it like the shader has their own 1 entry L0 TLB cache which is not
> > affected by the lightweight flush.
> >
> > The heavyweight flush on the other hand sends out a broadcast signal to
> > everybody and only comes back when we are sure that an address is not in
> use
> > any more.
>
> Ah makes sense. On intel the shaders only operate in VA, everything goes
> around as explicit async messages to IO blocks. So we don't have this, the
> only difference in tlb flushes is between tlb flush in the IB and an mmio
> one which is independent for anything currently being executed on an
> engine.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Daniel Vetter
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> On 09.06.21 at 15:19, Daniel Vetter wrote:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > > 
> > > The lightweight can be used when you are sure that you don't have any of 
> > > the
> > > PTEs currently in flight in the 3D/DMA engine and you just need to
> > > invalidate the TLB.
> > > 
> > > The heavyweight must be used when you need to invalidate the TLB *AND* 
> > > make
> > > sure that no concurrent operation moves new stuff into the TLB.
> > > 
> > > The problem is for this use case we have to use the heavyweight one.
> > Just for my own curiosity: So the lightweight flush is only for in-between
> > CS when you know access is idle? Or does that also not work if userspace
> > has a CS on a dma engine going at the same time because the tlb aren't
> > isolated enough between engines?
> 
> More or less correct, yes.
> 
> The problem is a lightweight flush only invalidates the TLB, but doesn't
> take care of entries which have been handed out to the different engines.
> 
> In other words what can happen is the following:
> 
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache and can't find address X so it asks the walker
> to resolve.
> 3. Walker comes back with result for address X and TLB puts that into its
> cache and gives it to Shader.
> 4. Shader starts doing some operation using result for address X.
> 5. You send lightweight TLB invalidate and TLB throws away cached values for
> address X.
> 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> accesses address X
> 
> See it like the shader has their own 1 entry L0 TLB cache which is not
> affected by the lightweight flush.
> 
> The heavyweight flush on the other hand sends out a broadcast signal to
> everybody and only comes back when we are sure that an address is not in use
> any more.

Ah makes sense. On intel the shaders only operate in VA, everything goes
around as explicit async messages to IO blocks. So we don't have this, the
only difference in tlb flushes is between tlb flush in the IB and an mmio
one which is independent for anything currently being executed on an
engine.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Christian König

On 09.06.21 at 15:19, Daniel Vetter wrote:

[SNIP]

Yeah, we call this the lightweight and the heavyweight tlb flush.

The lightweight can be used when you are sure that you don't have any of the
PTEs currently in flight in the 3D/DMA engine and you just need to
invalidate the TLB.

The heavyweight must be used when you need to invalidate the TLB *AND* make
sure that no concurrent operation moves new stuff into the TLB.

The problem is for this use case we have to use the heavyweight one.

Just for my own curiosity: So the lightweight flush is only for in-between
CS when you know access is idle? Or does that also not work if userspace
has a CS on a dma engine going at the same time because the tlb aren't
isolated enough between engines?


More or less correct, yes.

The problem is a lightweight flush only invalidates the TLB, but doesn't 
take care of entries which have been handed out to the different engines.


In other words what can happen is the following:

1. Shader asks TLB to resolve address X.
2. TLB looks into its cache and can't find address X so it asks the 
walker to resolve.
3. Walker comes back with result for address X and TLB puts that into 
its cache and gives it to Shader.

4. Shader starts doing some operation using result for address X.
5. You send lightweight TLB invalidate and TLB throws away cached values 
for address X.
6. Shader happily still uses whatever the TLB gave to it in step 3 to 
accesses address X


See it like the shader has their own 1 entry L0 TLB cache which is not 
affected by the lightweight flush.


The heavyweight flush on the other hand sends out a broadcast signal to 
everybody and only comes back when we are sure that an address is not in 
use any more.


Christian.


-Daniel





Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-09 Thread Daniel Vetter
On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
> On 04.06.21 at 10:57, Daniel Vetter wrote:
> > On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> > > On 02.06.21 at 21:19, Daniel Vetter wrote:
> > > > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > > > On 02.06.21 at 20:48, Daniel Vetter wrote:
> > > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  
> > > > > > > wrote:
> > > > > > > 
> > > > > > > > Yes, we can't break anything because we don't want to 
> > > > > > > > complicate things
> > > > > > > > for us. It's pretty much all NAK'd already. We are trying to 
> > > > > > > > gather more
> > > > > > > > knowledge and then make better decisions.
> > > > > > > > 
> > > > > > > > The idea we are considering is that we'll expose memory-based 
> > > > > > > > sync objects
> > > > > > > > to userspace for read only, and the kernel or hw will strictly 
> > > > > > > > control the
> > > > > > > > memory writes to those sync objects. The hole in that idea is 
> > > > > > > > that
> > > > > > > > userspace can decide not to signal a job, so even if userspace 
> > > > > > > > can't
> > > > > > > > overwrite memory-based sync object states arbitrarily, it can 
> > > > > > > > still decide
> > > > > > > > not to signal them, and then a future fence is born.
> > > > > > > > 
> > > > > > > This would actually be treated as a GPU hang caused by that 
> > > > > > > context, so it
> > > > > > > should be fine.
> > > > > > This is practically what I proposed already, except you're not doing 
> > > > > > it with
> > > > > > dma_fence. And on the memory fence side this also doesn't actually 
> > > > > > give
> > > > > > what you want for that compute model.
> > > > > > 
> > > > > > This seems like a bit of a worst of both worlds approach to me? Tons 
> > > > > > of work
> > > > > > in the kernel to hide these not-dma_fence-but-almost, and still 
> > > > > > pain to
> > > > > > actually drive the hardware like it should be for compute or direct
> > > > > > display.
> > > > > > 
> > > > > > Also maybe I've missed it, but I didn't see any replies to my 
> > > > > > suggestion
> > > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > > interesting to know what doesn't work there instead of amd folks 
> > > > > > going off
> > > > > > into internal again and then coming back with another rfc from out 
> > > > > > of
> > > > > > nowhere :-)
> > > > > Well to be honest I would just push back on our hardware/firmware 
> > > > > guys that
> > > > > we need to keep kernel queues forever before going down that route.
> > > > I looked again, and you said the model won't work because preemption is 
> > > > way
> > > > too slow, even when the context is idle.
> > > > 
> > > > I guess at that point I got maybe too fed up and just figured "not my
> > > > problem", but if preempt is too slow as the unload fence, you can do it
> > > > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > > > otherwise your hw is just garbage and won't even be fast for direct 
> > > > submit
> > > > compute workloads).
> > > Have you seen that one here:
> > > https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> > > 
> > > I've rejected it because I think polling for 6 seconds on a TLB flush 
> > > which
> > > can block interrupts as well is just madness.
> > Hm but I thought you had like 2 tlb flush modes, the shitty one (with
> > retrying page faults) and the not so shitty one?
> 
> Yeah, we call this the lightweight and the heavyweight tlb flush.
> 
> The lightweight can be used when you are sure that you don't have any of the
> PTEs currently in flight in the 3D/DMA engine and you just need to
> invalidate the TLB.
> 
> The heavyweight must be used when you need to invalidate the TLB *AND* make
> sure that no concurrent operation moves new stuff into the TLB.
> 
> The problem is for this use case we have to use the heavyweight one.

Just for my own curiosity: So the lightweight flush is only for in-between
CS when you know access is idle? Or does that also not work if userspace
has a CS on a dma engine going at the same time because the tlb aren't
isolated enough between engines?
-Daniel

> > But yeah at that point I think you just have to bite one of the bullets.
> 
> Yeah, completely agree. We can choose which way we want to die, but it's
> certainly not going to be nice whatever we do.
> 
> > 
> > The thing is with hmm/userspace memory fence model this will be even
> > worse, because you will _have_ to do this tlb flush deep down in core mm
> > functions, so this is going to be userptr, but worse.
> > 
> > With the dma_resv/dma_fence bo memory management model you can at least
> > wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
> > separate thread or something like that. If the hw really is that slow.
> > 
> > Somewhat aside: Personally I think that sriov needs to move over to the
> > compute model, i.e. indefinite timeouts, no tdr, because everything takes
> > too long. At least looking around sriov timeouts tend to be 10x bare
> > metal, across the board.

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Christian König

On 04.06.21 at 10:57, Daniel Vetter wrote:

On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:

On 02.06.21 at 21:19, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:

On 02.06.21 at 20:48, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit of a worst of both worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)

Well to be honest I would just push back on our hardware/firmware guys that
we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).

Have you seen that one here:
https://www.spinics.net/lists/amd-gfx/msg63101.html :)

I've rejected it because I think polling for 6 seconds on a TLB flush which
can block interrupts as well is just madness.

Hm but I thought you had like 2 tlb flush modes, the shitty one (with
retrying page faults) and the not so shitty one?


Yeah, we call this the lightweight and the heavyweight tlb flush.

The lightweight can be used when you are sure that you don't have any of 
the PTEs currently in flight in the 3D/DMA engine and you just need to 
invalidate the TLB.


The heavyweight must be used when you need to invalidate the TLB *AND* 
make sure that no concurrent operation moves new stuff into the TLB.


The problem is for this use case we have to use the heavyweight one.


But yeah at that point I think you just have to bite one of the bullets.


Yeah, completely agree. We can choose which way we want to die, but it's 
certainly not going to be nice whatever we do.




The thing is with hmm/userspace memory fence model this will be even
worse, because you will _have_ to do this tlb flush deep down in core mm
functions, so this is going to be userptr, but worse.

With the dma_resv/dma_fence bo memory management model you can at least
wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
separate thread or something like that. If the hw really is that slow.

Somewhat aside: Personally I think that sriov needs to move over to the
compute model, i.e. indefinite timeouts, no tdr, because everything takes
too long. At least looking around sriov timeouts tend to be 10x bare
metal, across the board.

But for stuff like cloud gaming that's serious amounts of heavy lifting
since it brings us right back "the entire linux/android 3d stack is built
on top of dma_fence right now".


The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
blindly trusts userspace to set up everything else, and even just wraps
dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete, a tdr fires, a) the
kernel shoots the entire context and b) userspace recovers by setting up a
new ringbuffer

- memory management is done using ttm only, you still need to supply the
buffer list (ofc that list includes the always present ones, so CS will
only get the list of special buffers like today).

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Daniel Vetter
On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> On 02.06.21 at 21:19, Daniel Vetter wrote:
> > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > 
> > > On 02.06.21 at 20:48, Daniel Vetter wrote:
> > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > > > 
> > > > > > Yes, we can't break anything because we don't want to complicate 
> > > > > > things
> > > > > > for us. It's pretty much all NAK'd already. We are trying to gather 
> > > > > > more
> > > > > > knowledge and then make better decisions.
> > > > > > 
> > > > > > The idea we are considering is that we'll expose memory-based sync 
> > > > > > objects
> > > > > > to userspace for read only, and the kernel or hw will strictly 
> > > > > > control the
> > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > userspace can decide not to signal a job, so even if userspace can't
> > > > > > overwrite memory-based sync object states arbitrarily, it can still 
> > > > > > decide
> > > > > > not to signal them, and then a future fence is born.
> > > > > > 
> > > > > This would actually be treated as a GPU hang caused by that context, 
> > > > > so it
> > > > > should be fine.
> > > > This is practically what I proposed already, except you're not doing it 
> > > > with
> > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > what you want for that compute model.
> > > > 
> > > > This seems like a bit of a worst of both worlds approach to me? Tons of 
> > > > work
> > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > actually drive the hardware like it should be for compute or direct
> > > > display.
> > > > 
> > > > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > interesting to know what doesn't work there instead of amd folks going 
> > > > off
> > > > into internal again and then coming back with another rfc from out of
> > > > nowhere :-)
> > > Well to be honest I would just push back on our hardware/firmware guys 
> > > that
> > > we need to keep kernel queues forever before going down that route.
> > I looked again, and you said the model won't work because preemption is way
> > too slow, even when the context is idle.
> > 
> > I guess at that point I got maybe too fed up and just figured "not my
> > problem", but if preempt is too slow as the unload fence, you can do it
> > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > otherwise your hw is just garbage and won't even be fast for direct submit
> > compute workloads).
> 
> Have you seen that one here:
> https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> 
> I've rejected it because I think polling for 6 seconds on a TLB flush which
> can block interrupts as well is just madness.

Hm but I thought you had like 2 tlb flush modes, the shitty one (with
retrying page faults) and the not so shitty one?

But yeah at that point I think you just have to bite one of the bullets.

The thing is with hmm/userspace memory fence model this will be even
worse, because you will _have_ to do this tlb flush deep down in core mm
functions, so this is going to be userptr, but worse.

With the dma_resv/dma_fence bo memory management model you can at least
wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
separate thread or something like that. If the hw really is that slow.

Somewhat aside: Personally I think that sriov needs to move over to the
compute model, i.e. indefinite timeouts, no tdr, because everything takes
too long. At least looking around sriov timeouts tend to be 10x bare
metal, across the board.

But for stuff like cloud gaming that's serious amounts of heavy lifting
since it brings us right back "the entire linux/android 3d stack is built
on top of dma_fence right now".

> > The only thing that you need to do when you use pte clearing + tlb
> > shootdown instead of preemption as the unload fence for buffers that get
> > moved is that if you get any gpu page fault, you don't serve that, but
> > instead treat it as a tdr and shoot the context permanently.
> > 
> > So summarizing the model I proposed:
> > 
> > - you allow userspace to directly write into the ringbuffer, and also
> >write the fences directly
> > 
> > - actual submit is done by the kernel, using drm/scheduler. The kernel
> >blindly trusts userspace to set up everything else, and even just wraps
> >dma_fences around the userspace memory fences.
> > 
> > - the only check is tdr. If a fence doesn't complete, a tdr fires, a) the
> >kernel shoots the entire context and b) userspace recovers by setting up a
> >new ringbuffer
> > 
> > - memory management is done using ttm only, you still need to supply the
> >buffer list (ofc that list includes the always present ones, so CS will
> >only get the list of special buffers like today).

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-04 Thread Christian König

On 02.06.21 at 21:19, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:


On 02.06.21 at 20:48, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)

Well to be honest I would just push back on our hardware/firmware guys that
we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).


Have you seen that one here: 
https://www.spinics.net/lists/amd-gfx/msg63101.html :)


I've rejected it because I think polling for 6 seconds on a TLB flush 
which can block interrupts as well is just madness.






The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
   write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
   blindly trusts userspace to set up everything else, and even just wraps
   dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete, a tdr fires: a) the
   kernel shoots the entire context and b) userspace recovers by setting up a
   new ringbuffer

- memory management is done using ttm only, you still need to supply the
   buffer list (ofc that list includes the always present ones, so CS will
   only get the list of special buffers like today). If your hw can't turn
   off gpu page faults and you ever get one we pull up the same old solution:
   Kernel shoots the entire context.

   The important thing is that from the gpu pov memory management works
   exactly like compute workload with direct submit, except that you just
   terminate the context on _any_ page fault, instead of only those that go
   somewhere where there's really no mapping and repair the others.

   Also I guess from reading the old thread this means you'd disable page
   fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fence waits for all
   userspace memory fences (represented as dma_fence) to complete, with
   maybe some fudge to busy-spin until we've reached the actual end of the
   ringbuffer (maybe you have an IB tail there after the memory fence write,
   we have that on intel hw), and it waits for the memory to get
   "unloaded". This is either preemption, or pte clearing + tlb shootdown,
   or whatever else your hw provides which is a) used for dynamic memory
   management b) fast enough for actual memory management.

- any time a context dies we force-complete all its pending fences,
   in-order ofc

So from hw pov this looks 99% like direct userspace submit, with the exact
same mappings, command sequences and everything else. The only difference
is that the ringbuffer head/tail updates happen from drm/scheduler, instead
of directly from userspace.

None of this stuff needs funny tricks where the kernel controls the
writes to 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
On Thu., Jun. 3, 2021, 15:18 Daniel Vetter,  wrote:

> On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák  wrote:
> >
> > Daniel, I think what you are suggesting is that we need to enable user
> queues with the drm scheduler and dma_fence first, and once that works, we
> can investigate how much of that kernel logic can be moved to the hw. Would
> that work? In theory it shouldn't matter whether the kernel does it or the
> hw does it. It's the same code, just in a different place.
>
> Yeah I guess that's another way to look at it. Maybe in practice
> you'll just move it from the kernel to userspace, which then programs
> the hw waits directly into its IB. That's at least how I'd do it on
> i915, assuming I'd have such hw. So these fences that userspace
> programs directly (to sync within itself) won't even show up as
> dependencies in the kernel.
>
> And then yes on the other side you can lift work from the
> drm/scheduler wrt dependencies you get in the kernel (whether explicit
> sync with sync_file, or implicit sync fished out of dma_resv) and
> program the hw directly that way. That would mean that userspace won't
> fill the ringbuffer directly, but the kernel would do that, so that
> you have space to stuff in the additional waits. Again assuming i915
> hw model, maybe works differently on amd. Iirc we have some of that
> already in the i915 scheduler, but I'd need to recheck how much it
> really uses the hw semaphores.
>

I was thinking we would pass per process syncobj handles and buffer handles
into commands in the user queue, or something equivalent. We do have a
large degree of programmability in the hw, so we can do something like
that. The question is whether this high level user->hw interface would have
any advantage over trivial polling on memory, etc. My impression is no
because the kernel would be robust enough that it wouldn't matter what
userspace does, but I don't know. Anyway, all we need is user queues and
what you proposed seems totally sufficient.

Marek

-Daniel
>
> > Thanks,
> > Marek
> >
> > On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter  wrote:
> >>
> >> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
> >> >
> >> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
> >> >>
> >> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter 
> wrote:
> >> >> >
> >> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter 
> wrote:
> >> >> > > >
> >> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <
> mar...@gmail.com> wrote:
> >> >> > > > > >
> >> >> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> >> >> > > things
> >> >> > > > > > > for us. It's pretty much all NAK'd already. We are
> trying to gather
> >> >> > > > > more
> >> >> > > > > > > knowledge and then make better decisions.
> >> >> > > > > > >
> >> >> > > > > > > The idea we are considering is that we'll expose
> memory-based sync
> >> >> > > > > objects
> >> >> > > > > > > to userspace for read only, and the kernel or hw will
> strictly
> >> >> > > control
> >> >> > > > > the
> >> >> > > > > > > memory writes to those sync objects. The hole in that
> idea is that
> >> >> > > > > > > userspace can decide not to signal a job, so even if
> userspace
> >> >> > > can't
> >> >> > > > > > > overwrite memory-based sync object states arbitrarily,
> it can still
> >> >> > > > > decide
> >> >> > > > > > > not to signal them, and then a future fence is born.
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > > > This would actually be treated as a GPU hang caused by
> that context,
> >> >> > > so
> >> >> > > > > it
> >> >> > > > > > should be fine.
> >> >> > > > >
> >> >> > > > > This is practically what I proposed already, except you're not
> doing it
> >> >> > > with
> >> >> > > > > dma_fence. And on the memory fence side this also doesn't
> actually give
> >> >> > > > > what you want for that compute model.
> >> >> > > > >
> >> >> > > > > This seems like a bit of a worst-of-both-worlds approach to me?
> Tons of
> >> >> > > work
> >> >> > > > > in the kernel to hide these not-dma_fence-but-almost, and
> still pain to
> >> >> > > > > actually drive the hardware like it should be for compute or
> direct
> >> >> > > > > display.
> >> >> > > > >
> >> >> > > > > Also maybe I've missed it, but I didn't see any replies to my
> >> >> > > suggestion
> >> >> > > > > how to fake the entire dma_fence stuff on top of new hw.
> Would be
> >> >> > > > > interesting to know what doesn't work there instead of amd
> folks going
> >> >> > > off
> >> >> > > > > into internal again and then coming back with another rfc
> from out of
> >> >> > > > > nowhere :-)
> >> >> > > > >
> >> >> > > >
> >> >> > > > Going internal again is probably a good idea to spare you the
> long
> >> >> > > > discussions and not waste your 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák  wrote:
>
> Daniel, I think what you are suggesting is that we need to enable user queues 
> with the drm scheduler and dma_fence first, and once that works, we can 
> investigate how much of that kernel logic can be moved to the hw. Would that 
> work? In theory it shouldn't matter whether the kernel does it or the hw does 
> it. It's the same code, just in a different place.

Yeah I guess that's another way to look at it. Maybe in practice
you'll just move it from the kernel to userspace, which then programs
the hw waits directly into its IB. That's at least how I'd do it on
i915, assuming I'd have such hw. So these fences that userspace
programs directly (to sync within itself) won't even show up as
dependencies in the kernel.

And then yes on the other side you can lift work from the
drm/scheduler wrt dependencies you get in the kernel (whether explicit
sync with sync_file, or implicit sync fished out of dma_resv) and
program the hw directly that way. That would mean that userspace won't
fill the ringbuffer directly, but the kernel would do that, so that
you have space to stuff in the additional waits. Again assuming i915
hw model, maybe works differently on amd. Iirc we have some of that
already in the i915 scheduler, but I'd need to recheck how much it
really uses the hw semaphores.
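
To make that concrete, a sketch of what "program the hw waits directly
into the IB" could look like from the userspace driver. The packet
encoding is completely made up (real packets like MI_SEMAPHORE_WAIT look
different); the point is only that no kernel call is involved:

#include <stdint.h>

#define OP_WAIT_MEM_GE64 0x42u  /* invented opcode, purely illustrative */

struct ib {
    uint32_t *cur;  /* write pointer into the user-mapped IB */
};

static void ib_emit(struct ib *ib, uint32_t dw)
{
    *ib->cur++ = dw;
}

/* stall the engine until the 64-bit value at addr is >= value */
static void ib_emit_wait_fence(struct ib *ib, uint64_t addr, uint64_t value)
{
    ib_emit(ib, OP_WAIT_MEM_GE64);
    ib_emit(ib, (uint32_t)addr);
    ib_emit(ib, (uint32_t)(addr >> 32));
    ib_emit(ib, (uint32_t)value);
    ib_emit(ib, (uint32_t)(value >> 32));
    /* draw commands depending on the fence follow here */
}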
-Daniel

> Thanks,
> Marek
>
> On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter  wrote:
>>
>> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
>> >
>> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
>> >>
>> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
>> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
>> >> >
>> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
>> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  
>> >> > > > wrote:
>> >> > > >
>> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  
>> >> > > > > > wrote:
>> >> > > > > >
>> >> > > > > > > Yes, we can't break anything because we don't want to 
>> >> > > > > > > complicate
>> >> > > things
>> >> > > > > > > for us. It's pretty much all NAK'd already. We are trying to 
>> >> > > > > > > gather
>> >> > > > > more
>> >> > > > > > > knowledge and then make better decisions.
>> >> > > > > > >
>> >> > > > > > > The idea we are considering is that we'll expose memory-based 
>> >> > > > > > > sync
>> >> > > > > objects
>> >> > > > > > > to userspace for read only, and the kernel or hw will strictly
>> >> > > control
>> >> > > > > the
>> >> > > > > > > memory writes to those sync objects. The hole in that idea is 
>> >> > > > > > > that
>> >> > > > > > > userspace can decide not to signal a job, so even if userspace
>> >> > > can't
>> >> > > > > > > overwrite memory-based sync object states arbitrarily, it can 
>> >> > > > > > > still
>> >> > > > > decide
>> >> > > > > > > not to signal them, and then a future fence is born.
>> >> > > > > > >
>> >> > > > > >
>> >> > > > > > This would actually be treated as a GPU hang caused by that 
>> >> > > > > > context,
>> >> > > so
>> >> > > > > it
>> >> > > > > > should be fine.
>> >> > > > >
>> >> > > > > This is practically what I proposed already, except you're not
>> >> > > > > doing it
>> >> > > with
>> >> > > > > dma_fence. And on the memory fence side this also doesn't 
>> >> > > > > actually give
>> >> > > > > what you want for that compute model.
>> >> > > > >
>> >> > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons
>> >> > > > > of
>> >> > > work
>> >> > > > > in the kernel to hide these not-dma_fence-but-almost, and still 
>> >> > > > > pain to
>> >> > > > > actually drive the hardware like it should be for compute or 
>> >> > > > > direct
>> >> > > > > display.
>> >> > > > >
>> >> > > > > Also maybe I've missed it, but I didn't see any replies to my
>> >> > > suggestion
>> >> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
>> >> > > > > interesting to know what doesn't work there instead of amd folks 
>> >> > > > > going
>> >> > > off
>> >> > > > > into internal again and then coming back with another rfc from 
>> >> > > > > out of
>> >> > > > > nowhere :-)
>> >> > > > >
>> >> > > >
>> >> > > > Going internal again is probably a good idea to spare you the long
>> >> > > > discussions and not waste your time, but we haven't talked about the
>> >> > > > dma_fence stuff internally other than acknowledging that it can be
>> >> > > solved.
>> >> > > >
>> >> > > > The compute use case already uses the hw as-is with no inter-process
>> >> > > > sharing, which mostly keeps the kernel out of the picture. It uses
>> >> > > glFinish
>> >> > > > to sync with GL.
>> >> > > >
>> >> > > > The gfx use case needs new hardware logic to support implicit and
>> >> > > explicit
>> >> > > > sync. When we propose a solution, it's usually torn apart the next 
>> >> > > > day by
>> >> > > > ourselves.
>> >> > > >

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
Daniel, I think what you are suggesting is that we need to enable user
queues with the drm scheduler and dma_fence first, and once that works, we
can investigate how much of that kernel logic can be moved to the hw. Would
that work? In theory it shouldn't matter whether the kernel does it or the
hw does it. It's the same code, just in a different place.

Thanks,
Marek

On Thu, Jun 3, 2021 at 7:22 AM Daniel Vetter  wrote:

> On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
> >
> > On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
> >>
> >> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> >> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
> >> >
> >> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> >> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter 
> wrote:
> >> > > >
> >> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> >> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák 
> wrote:
> >> > > > > >
> >> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> >> > > things
> >> > > > > > > for us. It's pretty much all NAK'd already. We are trying
> to gather
> >> > > > > more
> >> > > > > > > knowledge and then make better decisions.
> >> > > > > > >
> >> > > > > > > The idea we are considering is that we'll expose
> memory-based sync
> >> > > > > objects
> >> > > > > > > to userspace for read only, and the kernel or hw will
> strictly
> >> > > control
> >> > > > > the
> >> > > > > > > memory writes to those sync objects. The hole in that idea
> is that
> >> > > > > > > userspace can decide not to signal a job, so even if
> userspace
> >> > > can't
> >> > > > > > > overwrite memory-based sync object states arbitrarily, it
> can still
> >> > > > > decide
> >> > > > > > > not to signal them, and then a future fence is born.
> >> > > > > > >
> >> > > > > >
> >> > > > > > This would actually be treated as a GPU hang caused by that
> context,
> >> > > so
> >> > > > > it
> >> > > > > > should be fine.
> >> > > > >
> >> > > > > This is practically what I proposed already, except you're not
> doing it
> >> > > with
> >> > > > > dma_fence. And on the memory fence side this also doesn't
> actually give
> >> > > > > what you want for that compute model.
> >> > > > >
> >> > > > > This seems like a bit of a worst-of-both-worlds approach to me?
> Tons of
> >> > > work
> >> > > > > in the kernel to hide these not-dma_fence-but-almost, and still
> pain to
> >> > > > > actually drive the hardware like it should be for compute or
> direct
> >> > > > > display.
> >> > > > >
> >> > > > > Also maybe I've missed it, but I didn't see any replies to my
> >> > > suggestion
> >> > > > > how to fake the entire dma_fence stuff on top of new hw. Would
> be
> >> > > > > interesting to know what doesn't work there instead of amd
> folks going
> >> > > off
> >> > > > > into internal again and then coming back with another rfc from
> out of
> >> > > > > nowhere :-)
> >> > > > >
> >> > > >
> >> > > > Going internal again is probably a good idea to spare you the long
> >> > > > discussions and not waste your time, but we haven't talked about
> the
> >> > > > dma_fence stuff internally other than acknowledging that it can be
> >> > > solved.
> >> > > >
> >> > > > The compute use case already uses the hw as-is with no
> inter-process
> >> > > > sharing, which mostly keeps the kernel out of the picture. It uses
> >> > > glFinish
> >> > > > to sync with GL.
> >> > > >
> >> > > > The gfx use case needs new hardware logic to support implicit and
> >> > > explicit
> >> > > > sync. When we propose a solution, it's usually torn apart the
> next day by
> >> > > > ourselves.
> >> > > >
> >> > > > Since we are talking about next hw or next next hw, preemption
> should be
> >> > > > better.
> >> > > >
> >> > > > user queue = user-mapped ring buffer
> >> > > >
> >> > > > For implicit sync, we will only let userspace lock access to a
> buffer
> >> > > via a
> >> > > > user queue, which waits for the per-buffer sequence counter in
> memory to
> >> > > be
> >> > > > >= the number assigned by the kernel, and later unlock the access
> with
> >> > > > another command, which increments the per-buffer sequence counter
> in
> >> > > memory
> >> > > > with atomic_inc regardless of the number assigned by the kernel.
> The
> >> > > kernel
> >> > > > counter and the counter in memory can be out-of-sync, and I'll
> explain
> >> > > why
> >> > > > it's OK. If a process increments the kernel counter but not the
> memory
> >> > > > counter, that's its problem and it's the same as a GPU hang
> caused by
> >> > > that
> >> > > > process. If a process increments the memory counter but not the
> kernel
> >> > > > counter, the ">=" condition alongside atomic_inc guarantees that
> >> > > signaling n
> >> > > > will signal n+1, so it will never deadlock but also it will
> effectively
> >> > > > disable synchronization. This method of disabling synchronization
> is
> >> > > > similar to a 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák  wrote:
>
> On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:
>>
>> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
>> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
>> >
>> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
>> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
>> > > >
>> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
>> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Yes, we can't break anything because we don't want to complicate
>> > > things
>> > > > > > > for us. It's pretty much all NAK'd already. We are trying to 
>> > > > > > > gather
>> > > > > more
>> > > > > > > knowledge and then make better decisions.
>> > > > > > >
>> > > > > > > The idea we are considering is that we'll expose memory-based 
>> > > > > > > sync
>> > > > > objects
>> > > > > > > to userspace for read only, and the kernel or hw will strictly
>> > > control
>> > > > > the
>> > > > > > > memory writes to those sync objects. The hole in that idea is 
>> > > > > > > that
>> > > > > > > userspace can decide not to signal a job, so even if userspace
>> > > can't
>> > > > > > > overwrite memory-based sync object states arbitrarily, it can 
>> > > > > > > still
>> > > > > decide
>> > > > > > > not to signal them, and then a future fence is born.
>> > > > > > >
>> > > > > >
>> > > > > > This would actually be treated as a GPU hang caused by that 
>> > > > > > context,
>> > > so
>> > > > > it
>> > > > > > should be fine.
>> > > > >
>> > > > > This is practically what I proposed already, except you're not doing it
>> > > with
>> > > > > dma_fence. And on the memory fence side this also doesn't actually 
>> > > > > give
>> > > > > what you want for that compute model.
>> > > > >
>> > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of
>> > > work
>> > > > > in the kernel to hide these not-dma_fence-but-almost, and still pain 
>> > > > > to
>> > > > > actually drive the hardware like it should be for compute or direct
>> > > > > display.
>> > > > >
>> > > > > Also maybe I've missed it, but I didn't see any replies to my
>> > > suggestion
>> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
>> > > > > interesting to know what doesn't work there instead of amd folks 
>> > > > > going
>> > > off
>> > > > > into internal again and then coming back with another rfc from out of
>> > > > > nowhere :-)
>> > > > >
>> > > >
>> > > > Going internal again is probably a good idea to spare you the long
>> > > > discussions and not waste your time, but we haven't talked about the
>> > > > dma_fence stuff internally other than acknowledging that it can be
>> > > solved.
>> > > >
>> > > > The compute use case already uses the hw as-is with no inter-process
>> > > > sharing, which mostly keeps the kernel out of the picture. It uses
>> > > glFinish
>> > > > to sync with GL.
>> > > >
>> > > > The gfx use case needs new hardware logic to support implicit and
>> > > explicit
>> > > > sync. When we propose a solution, it's usually torn apart the next day 
>> > > > by
>> > > > ourselves.
>> > > >
>> > > > Since we are talking about next hw or next next hw, preemption should 
>> > > > be
>> > > > better.
>> > > >
>> > > > user queue = user-mapped ring buffer
>> > > >
>> > > > For implicit sync, we will only let userspace lock access to a buffer
>> > > via a
>> > > > user queue, which waits for the per-buffer sequence counter in memory 
>> > > > to
>> > > be
>> > > > >= the number assigned by the kernel, and later unlock the access with
>> > > > another command, which increments the per-buffer sequence counter in
>> > > memory
>> > > > with atomic_inc regardless of the number assigned by the kernel. The
>> > > kernel
>> > > > counter and the counter in memory can be out-of-sync, and I'll explain
>> > > why
>> > > > it's OK. If a process increments the kernel counter but not the memory
>> > > > counter, that's its problem and it's the same as a GPU hang caused by
>> > > that
>> > > > process. If a process increments the memory counter but not the kernel
>> > > > counter, the ">=" condition alongside atomic_inc guarantees that
>> > > signaling n
>> > > > will signal n+1, so it will never deadlock but also it will effectively
>> > > > disable synchronization. This method of disabling synchronization is
>> > > > similar to a process corrupting the buffer, which should be fine. Can 
>> > > > you
>> > > > find any flaw in it? I can't find any.
>> > >
>> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
>> > > of "we let userspace free-wheel whatever it wants, kernel ensures
>> > > correctness of the resulting chain of dma_fence by resetting the entire
>> > > context" is what I proposed too.
>> > >
>> > > Like you say, userspace is allowed to render garbage already.
>> > >
>> > > > The explicit submit can be done by 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
On Thu., Jun. 3, 2021, 06:03 Daniel Vetter,  wrote:

> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
> >
> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter 
> wrote:
> > > >
> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák 
> wrote:
> > > > > >
> > > > > > > Yes, we can't break anything because we don't want to
> complicate
> > > things
> > > > > > > for us. It's pretty much all NAK'd already. We are trying to
> gather
> > > > > more
> > > > > > > knowledge and then make better decisions.
> > > > > > >
> > > > > > > The idea we are considering is that we'll expose memory-based
> sync
> > > > > objects
> > > > > > > to userspace for read only, and the kernel or hw will strictly
> > > control
> > > > > the
> > > > > > > memory writes to those sync objects. The hole in that idea is
> that
> > > > > > > userspace can decide not to signal a job, so even if userspace
> > > can't
> > > > > > > overwrite memory-based sync object states arbitrarily, it can
> still
> > > > > decide
> > > > > > > not to signal them, and then a future fence is born.
> > > > > > >
> > > > > >
> > > > > > This would actually be treated as a GPU hang caused by that
> context,
> > > so
> > > > > it
> > > > > > should be fine.
> > > > >
> > > > > This is practically what I proposed already, except you're not doing
> it
> > > with
> > > > > dma_fence. And on the memory fence side this also doesn't actually
> give
> > > > > what you want for that compute model.
> > > > >
> > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons
> of
> > > work
> > > > > in the kernel to hide these not-dma_fence-but-almost, and still
> pain to
> > > > > actually drive the hardware like it should be for compute or direct
> > > > > display.
> > > > >
> > > > > Also maybe I've missed it, but I didn't see any replies to my
> > > suggestion
> > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > interesting to know what doesn't work there instead of amd folks
> going
> > > off
> > > > > into internal again and then coming back with another rfc from out
> of
> > > > > nowhere :-)
> > > > >
> > > >
> > > > Going internal again is probably a good idea to spare you the long
> > > > discussions and not waste your time, but we haven't talked about the
> > > > dma_fence stuff internally other than acknowledging that it can be
> > > solved.
> > > >
> > > > The compute use case already uses the hw as-is with no inter-process
> > > > sharing, which mostly keeps the kernel out of the picture. It uses
> > > glFinish
> > > > to sync with GL.
> > > >
> > > > The gfx use case needs new hardware logic to support implicit and
> > > explicit
> > > > sync. When we propose a solution, it's usually torn apart the next
> day by
> > > > ourselves.
> > > >
> > > > Since we are talking about next hw or next next hw, preemption
> should be
> > > > better.
> > > >
> > > > user queue = user-mapped ring buffer
> > > >
> > > > For implicit sync, we will only let userspace lock access to a buffer
> > > via a
> > > > user queue, which waits for the per-buffer sequence counter in
> memory to
> > > be
> > > > >= the number assigned by the kernel, and later unlock the access
> with
> > > > another command, which increments the per-buffer sequence counter in
> > > memory
> > > > with atomic_inc regardless of the number assigned by the kernel. The
> > > kernel
> > > > counter and the counter in memory can be out-of-sync, and I'll
> explain
> > > why
> > > > it's OK. If a process increments the kernel counter but not the
> memory
> > > > counter, that's its problem and it's the same as a GPU hang caused by
> > > that
> > > > process. If a process increments the memory counter but not the
> kernel
> > > > counter, the ">=" condition alongside atomic_inc guarantees that
> > > signaling n
> > > > will signal n+1, so it will never deadlock but also it will
> effectively
> > > > disable synchronization. This method of disabling synchronization is
> > > > similar to a process corrupting the buffer, which should be fine.
> Can you
> > > > find any flaw in it? I can't find any.
> > >
> > > Hm maybe I misunderstood what exactly you wanted to do earlier. That
> kind
> > > of "we let userspace free-wheel whatever it wants, kernel ensures
> > > correctness of the resulting chain of dma_fence by resetting the entire
> > > context" is what I proposed too.
> > >
> > > Like you say, userspace is allowed to render garbage already.
> > >
> > > > The explicit submit can be done by userspace (if there is no
> > > > synchronization), but we plan to use the kernel to do it for implicit
> > > sync.
> > > > Essentially, the kernel will receive a buffer list and addresses of
> wait
> > > > commands in the user queue. It will assign new sequence numbers to
> all
> > > > 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:
> 
> > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > > >
> > > > > > Yes, we can't break anything because we don't want to complicate
> > things
> > > > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > > > more
> > > > > > knowledge and then make better decisions.
> > > > > >
> > > > > > The idea we are considering is that we'll expose memory-based sync
> > > > objects
> > > > > > to userspace for read only, and the kernel or hw will strictly
> > control
> > > > the
> > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > userspace can decide not to signal a job, so even if userspace
> > can't
> > > > > > overwrite memory-based sync object states arbitrarily, it can still
> > > > decide
> > > > > > not to signal them, and then a future fence is born.
> > > > > >
> > > > >
> > > > > This would actually be treated as a GPU hang caused by that context,
> > so
> > > > it
> > > > > should be fine.
> > > >
> > > > This is practically what I proposed already, except you're not doing it
> > with
> > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > what you want for that compute model.
> > > >
> > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of
> > work
> > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > actually drive the hardware like it should be for compute or direct
> > > > display.
> > > >
> > > > Also maybe I've missed it, but I didn't see any replies to my
> > suggestion
> > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > interesting to know what doesn't work there instead of amd folks going
> > off
> > > > into internal again and then coming back with another rfc from out of
> > > > nowhere :-)
> > > >
> > >
> > > Going internal again is probably a good idea to spare you the long
> > > discussions and not waste your time, but we haven't talked about the
> > > dma_fence stuff internally other than acknowledging that it can be
> > solved.
> > >
> > > The compute use case already uses the hw as-is with no inter-process
> > > sharing, which mostly keeps the kernel out of the picture. It uses
> > glFinish
> > > to sync with GL.
> > >
> > > The gfx use case needs new hardware logic to support implicit and
> > explicit
> > > sync. When we propose a solution, it's usually torn apart the next day by
> > > ourselves.
> > >
> > > Since we are talking about next hw or next next hw, preemption should be
> > > better.
> > >
> > > user queue = user-mapped ring buffer
> > >
> > > For implicit sync, we will only let userspace lock access to a buffer
> > via a
> > > user queue, which waits for the per-buffer sequence counter in memory to
> > be
> > > >= the number assigned by the kernel, and later unlock the access with
> > > another command, which increments the per-buffer sequence counter in
> > memory
> > > with atomic_inc regardless of the number assigned by the kernel. The
> > kernel
> > > counter and the counter in memory can be out-of-sync, and I'll explain
> > why
> > > it's OK. If a process increments the kernel counter but not the memory
> > > counter, that's its problem and it's the same as a GPU hang caused by
> > that
> > > process. If a process increments the memory counter but not the kernel
> > > counter, the ">=" condition alongside atomic_inc guarantees that
> > signaling n
> > > will signal n+1, so it will never deadlock but also it will effectively
> > > disable synchronization. This method of disabling synchronization is
> > > similar to a process corrupting the buffer, which should be fine. Can you
> > > find any flaw in it? I can't find any.
> >
> > Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
> > of "we let userspace free-wheel whatever it wants, kernel ensures
> > correctness of the resulting chain of dma_fence by resetting the entire
> > context" is what I proposed too.
> >
> > Like you say, userspace is allowed to render garbage already.
> >
> > > The explicit submit can be done by userspace (if there is no
> > > synchronization), but we plan to use the kernel to do it for implicit
> > sync.
> > > Essentially, the kernel will receive a buffer list and addresses of wait
> > > commands in the user queue. It will assign new sequence numbers to all
> > > buffers and write those numbers into the wait commands, and ring the hw
> > > doorbell to start execution of that queue.
> >
> > Yeah for implicit sync I think kernel and using drm/scheduler to sort out
> > the dma_fence dependencies is probably best. Since you can filter out
> > which dma_fence you hand to the 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Marek Olšák
On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter  wrote:

> On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
> >
> > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > >
> > > > > Yes, we can't break anything because we don't want to complicate
> things
> > > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > > more
> > > > > knowledge and then make better decisions.
> > > > >
> > > > > The idea we are considering is that we'll expose memory-based sync
> > > objects
> > > > > to userspace for read only, and the kernel or hw will strictly
> control
> > > the
> > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > userspace can decide not to signal a job, so even if userspace
> can't
> > > > > overwrite memory-based sync object states arbitrarily, it can still
> > > decide
> > > > > not to signal them, and then a future fence is born.
> > > > >
> > > >
> > > > This would actually be treated as a GPU hang caused by that context,
> so
> > > it
> > > > should be fine.
> > >
> > > This is practically what I proposed already, except you're not doing it
> with
> > > dma_fence. And on the memory fence side this also doesn't actually give
> > > what you want for that compute model.
> > >
> > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of
> work
> > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > actually drive the hardware like it should be for compute or direct
> > > display.
> > >
> > > Also maybe I've missed it, but I didn't see any replies to my
> suggestion
> > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > interesting to know what doesn't work there instead of amd folks going
> off
> > > into internal again and then coming back with another rfc from out of
> > > nowhere :-)
> > >
> >
> > Going internal again is probably a good idea to spare you the long
> > discussions and not waste your time, but we haven't talked about the
> > dma_fence stuff internally other than acknowledging that it can be
> solved.
> >
> > The compute use case already uses the hw as-is with no inter-process
> > sharing, which mostly keeps the kernel out of the picture. It uses
> glFinish
> > to sync with GL.
> >
> > The gfx use case needs new hardware logic to support implicit and
> explicit
> > sync. When we propose a solution, it's usually torn apart the next day by
> > ourselves.
> >
> > Since we are talking about next hw or next next hw, preemption should be
> > better.
> >
> > user queue = user-mapped ring buffer
> >
> > For implicit sync, we will only let userspace lock access to a buffer
> via a
> > user queue, which waits for the per-buffer sequence counter in memory to
> be
> > >= the number assigned by the kernel, and later unlock the access with
> > another command, which increments the per-buffer sequence counter in
> memory
> > with atomic_inc regardless of the number assigned by the kernel. The
> kernel
> > counter and the counter in memory can be out-of-sync, and I'll explain
> why
> > it's OK. If a process increments the kernel counter but not the memory
> > counter, that's its problem and it's the same as a GPU hang caused by
> that
> > process. If a process increments the memory counter but not the kernel
> > counter, the ">=" condition alongside atomic_inc guarantees that
> signaling n
> > will signal n+1, so it will never deadlock but also it will effectively
> > disable synchronization. This method of disabling synchronization is
> > similar to a process corrupting the buffer, which should be fine. Can you
> > find any flaw in it? I can't find any.
>
> Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
> of "we let userspace free-wheel whatever it wants, kernel ensures
> correctness of the resulting chain of dma_fence by resetting the entire
> context" is what I proposed too.
>
> Like you say, userspace is allowed to render garbage already.
>
> > The explicit submit can be done by userspace (if there is no
> > synchronization), but we plan to use the kernel to do it for implicit
> sync.
> > Essentially, the kernel will receive a buffer list and addresses of wait
> > commands in the user queue. It will assign new sequence numbers to all
> > buffers and write those numbers into the wait commands, and ring the hw
> > doorbell to start execution of that queue.
>
> Yeah for implicit sync I think kernel and using drm/scheduler to sort out
> the dma_fence dependencies is probably best. Since you can filter out
> which dma_fence you hand to the scheduler for dependency tracking you can
> filter out your own ones and let the hw handle those directly (depending
> how much your hw can do on all that). On i915 we might do that to be able
> to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
>
> For buffer tracking 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-03 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:
> 
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > >
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > more
> > > > knowledge and then make better decisions.
> > > >
> > > > The idea we are considering is that we'll expose memory-based sync
> > objects
> > > > to userspace for read only, and the kernel or hw will strictly control
> > the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still
> > decide
> > > > not to signal them, and then a future fence is born.
> > > >
> > >
> > > This would actually be treated as a GPU hang caused by that context, so
> > it
> > > should be fine.
> >
> > This is practically what I proposed already, except you're not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> >
> > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> >
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going off
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
> >
> 
> Going internal again is probably a good idea to spare you the long
> discussions and not waste your time, but we haven't talked about the
> dma_fence stuff internally other than acknowledging that it can be solved.
> 
> The compute use case already uses the hw as-is with no inter-process
> sharing, which mostly keeps the kernel out of the picture. It uses glFinish
> to sync with GL.
> 
> The gfx use case needs new hardware logic to support implicit and explicit
> sync. When we propose a solution, it's usually torn apart the next day by
> ourselves.
> 
> Since we are talking about next hw or next next hw, preemption should be
> better.
> 
> user queue = user-mapped ring buffer
> 
> For implicit sync, we will only let userspace lock access to a buffer via a
> user queue, which waits for the per-buffer sequence counter in memory to be
> >= the number assigned by the kernel, and later unlock the access with
> another command, which increments the per-buffer sequence counter in memory
> with atomic_inc regardless of the number assigned by the kernel. The kernel
> counter and the counter in memory can be out-of-sync, and I'll explain why
> it's OK. If a process increments the kernel counter but not the memory
> counter, that's its problem and it's the same as a GPU hang caused by that
> process. If a process increments the memory counter but not the kernel
> counter, the ">=" condition alongside atomic_inc guarantees that signaling n
> will signal n+1, so it will never deadlock but also it will effectively
> disable synchronization. This method of disabling synchronization is
> similar to a process corrupting the buffer, which should be fine. Can you
> find any flaw in it? I can't find any.

Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
of "we let userspace free-wheel whatever it wants, kernel ensures
correctness of the resulting chain of dma_fence by resetting the entire
context" is what I proposed too.

Like you say, userspace is allowed to render garbage already.

> The explicit submit can be done by userspace (if there is no
> synchronization), but we plan to use the kernel to do it for implicit sync.
> Essentially, the kernel will receive a buffer list and addresses of wait
> commands in the user queue. It will assign new sequence numbers to all
> buffers and write those numbers into the wait commands, and ring the hw
> doorbell to start execution of that queue.

Yeah for implicit sync I think kernel and using drm/scheduler to sort out
the dma_fence dependencies is probably best. Since you can filter out
which dma_fence you hand to the scheduler for dependency tracking you can
filter out your own ones and let the hw handle those directly (depending
how much your hw can do on all that). On i915 we might do that to be able
to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
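
A sketch of that filtering, with invented names (my_fence_ops and both
helpers); comparing the ops pointer is the usual way a driver recognizes
its own dma_fences:

static bool is_our_fence(struct dma_fence *f)
{
    /* fences created by our driver carry our dma_fence_ops */
    return f->ops == &my_fence_ops;
}

static int add_dependency(struct my_job *job, struct dma_fence *f)
{
    if (is_our_fence(f))
        /* native fence: emit a hw semaphore wait into the ring and
         * let the hw/fw scheduler resolve it */
        return emit_hw_semaphore_wait(job, f);

    /* foreign fence: keep it as a kernel-side dependency that the
     * scheduler waits on before handing the job to the hw */
    return add_kernel_dependency(job, f);
}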

For buffer tracking with implicit sync I think cleanest is probably to
still keep them wrapped as dma_fence and stuffed into dma_resv, but
conceptually it's the same. If we let every driver reinvent their own
buffer tracking just because the hw works a bit different it'll be a mess.

Wrt 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter  wrote:

> On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> >
> > > Yes, we can't break anything because we don't want to complicate things
> > > for us. It's pretty much all NAK'd already. We are trying to gather
> more
> > > knowledge and then make better decisions.
> > >
> > > The idea we are considering is that we'll expose memory-based sync
> objects
> > > to userspace for read only, and the kernel or hw will strictly control
> the
> > > memory writes to those sync objects. The hole in that idea is that
> > > userspace can decide not to signal a job, so even if userspace can't
> > > overwrite memory-based sync object states arbitrarily, it can still
> decide
> > > not to signal them, and then a future fence is born.
> > >
> >
> > This would actually be treated as a GPU hang caused by that context, so
> it
> > should be fine.
>
> This is practically what I proposed already, except you're not doing it with
> dma_fence. And on the memory fence side this also doesn't actually give
> what you want for that compute model.
>
> This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
> in the kernel to hide these not-dma_fence-but-almost, and still pain to
> actually drive the hardware like it should be for compute or direct
> display.
>
> Also maybe I've missed it, but I didn't see any replies to my suggestion
> how to fake the entire dma_fence stuff on top of new hw. Would be
> interesting to know what doesn't work there instead of amd folks going off
> into internal again and then coming back with another rfc from out of
> nowhere :-)
>

Going internal again is probably a good idea to spare you the long
discussions and not waste your time, but we haven't talked about the
dma_fence stuff internally other than acknowledging that it can be solved.

The compute use case already uses the hw as-is with no inter-process
sharing, which mostly keeps the kernel out of the picture. It uses glFinish
to sync with GL.

The gfx use case needs new hardware logic to support implicit and explicit
sync. When we propose a solution, it's usually torn apart the next day by
ourselves.

Since we are talking about next hw or next next hw, preemption should be
better.

user queue = user-mapped ring buffer

For implicit sync, we will only let userspace lock access to a buffer via a
user queue, which waits for the per-buffer sequence counter in memory to be
>= the number assigned by the kernel, and later unlock the access with
another command, which increments the per-buffer sequence counter in memory
with atomic_inc regardless of the number assigned by the kernel. The kernel
counter and the counter in memory can be out-of-sync, and I'll explain why
it's OK. If a process increments the kernel counter but not the memory
counter, that's its problem and it's the same as a GPU hang caused by that
process. If a process increments the memory counter but not the kernel
counter, the ">=" condition alongside atomic_inc guarantees that signaling n
will signal n+1, so it will never deadlock but also it will effectively
disable synchronization. This method of disabling synchronization is
similar to a process corrupting the buffer, which should be fine. Can you
find any flaw in it? I can't find any.
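
A CPU-side model of the two commands may help; on real hw they would be
queue packets rather than C functions, and all names are illustrative:

#include <stdatomic.h>
#include <stdint.h>

struct buffer_sync {
    _Atomic uint64_t seq;   /* per-buffer sequence counter in memory */
};

/* "lock access": wait until the kernel-assigned number is reached */
static void cmd_acquire(struct buffer_sync *s, uint64_t kernel_assigned)
{
    while (atomic_load(&s->seq) < kernel_assigned)
        ;   /* the queue engine polls here, not the CPU */
}

/* "unlock access": an unconditional increment rather than a store of a
 * specific value. Because waits use ">=" and the counter only grows,
 * reaching n also satisfies every wait for values <= n: a misbehaving
 * process can disable synchronization on a buffer, but cannot turn a
 * correctly assigned wait into a deadlock this way. */
static void cmd_release(struct buffer_sync *s)
{
    atomic_fetch_add(&s->seq, 1);
}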

The explicit submit can be done by userspace (if there is no
synchronization), but we plan to use the kernel to do it for implicit sync.
Essentially, the kernel will receive a buffer list and addresses of wait
commands in the user queue. It will assign new sequence numbers to all
buffers and write those numbers into the wait commands, and ring the hw
doorbell to start execution of that queue.
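
And a sketch of that kernel-side submit path, with all structure and
helper names invented:

struct wait_patch {
    uint64_t ring_offset;   /* operand of a wait command in the user queue */
    struct kbuffer *buf;    /* kernel-side per-buffer state */
};

static void implicit_sync_submit(struct user_queue *q,
                                 struct wait_patch *p, int n)
{
    for (int i = 0; i < n; i++) {
        /* assign the next sequence number for this buffer ... */
        uint64_t seq = ++p[i].buf->kernel_seq;

        /* ... and patch it into the pre-placed wait command */
        write_user_ring(q, p[i].ring_offset, seq);
    }

    ring_hw_doorbell(q);    /* hw starts executing the queue */
}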

Marek


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 10:09:01AM +0200, Michel Dänzer wrote:
> On 2021-06-01 12:49 p.m., Michel Dänzer wrote:
> > On 2021-06-01 12:21 p.m., Christian König wrote:
> > 
> >> Another question is if that is sufficient as security for the display 
> >> server or if we need further handling down the road? I mean essentially we 
> >> are moving the reliability problem into the display server.
> > 
> > Good question. This should generally protect the display server from 
> > freezing due to client fences never signalling, but there might still be 
> > corner cases.
> 
> E.g. a client might be able to sneak in a fence between when the
> compositor checks fences and when it submits its drawing to the kernel.

This is why implicit sync should be handled with explicit IPC. You pick
the fence up once, and then you need to tell your GL stack to _not_ do
implicit sync. Would need a new extension. vk afaiui does this
automatically already.
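
As a sketch of "pick the fence up once": export the buffer's implicit
fences into a sync_file a single time and wait on that, ignoring anything
that sneaks in afterwards. The export ioctl was still only a proposal when
this thread happened; the uapi shown is the one that eventually landed and
is used here as an assumption:

#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* returns a sync_file fd snapshotting the buffer's current fences */
static int pick_up_implicit_fence(int dmabuf_fd)
{
    struct dma_buf_export_sync_file args = {
        .flags = DMA_BUF_SYNC_READ, /* the fences a compositor waits on */
        .fd = -1,
    };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
        return -1;

    return args.fd; /* wait on this once; later client fences are ignored */
}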
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> 
> 
> On 02.06.21 at 20:48, Daniel Vetter wrote:
> > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> > > 
> > > > Yes, we can't break anything because we don't want to complicate things
> > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > knowledge and then make better decisions.
> > > > 
> > > > The idea we are considering is that we'll expose memory-based sync 
> > > > objects
> > > > to userspace for read only, and the kernel or hw will strictly control 
> > > > the
> > > > memory writes to those sync objects. The hole in that idea is that
> > > > userspace can decide not to signal a job, so even if userspace can't
> > > > overwrite memory-based sync object states arbitrarily, it can still 
> > > > decide
> > > > not to signal them, and then a future fence is born.
> > > > 
> > > This would actually be treated as a GPU hang caused by that context, so it
> > > should be fine.
> > This is practically what I proposed already, except you're not doing it with
> > dma_fence. And on the memory fence side this also doesn't actually give
> > what you want for that compute model.
> > 
> > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
> > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > actually drive the hardware like it should be for compute or direct
> > display.
> > 
> > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > how to fake the entire dma_fence stuff on top of new hw. Would be
> > interesting to know what doesn't work there instead of amd folks going off
> > into internal again and then coming back with another rfc from out of
> > nowhere :-)
> 
> Well to be honest I would just push back on our hardware/firmware guys that
> we need to keep kernel queues forever before going down that route.

I looked again, and you said the model won't work because preemption is way
too slow, even when the context is idle.

I guess at that point I got maybe too fed up and just figured "not my
problem", but if preempt is too slow as the unload fence, you can do it
with pte removal and tlb shootdown too (that is hopefully not too slow,
otherwise your hw is just garbage and won't even be fast for direct submit
compute workloads).

The only thing that you need to do when you use pte clearing + tlb
shootdown instead of preemption as the unload fence for buffers that get
moved is that if you get any gpu page fault, you don't serve that, but
instead treat it as a tdr and shoot the context permanently.
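
A sketch of that policy with invented helpers; the fault handler never
repairs anything, any fault simply takes the tdr path:

/* gpu page fault handler under the pte-clear + tlb-shootdown model */
static void gpu_fault_handler(struct gpu_ctx *ctx, uint64_t fault_addr)
{
    /* never re-validate or retry fault_addr: faults are not served */
    mark_context_guilty(ctx);           /* same bookkeeping as a tdr */
    disable_doorbell(ctx);              /* no further submissions */
    force_complete_context_fences(ctx); /* keep the dma_fence rules intact */
}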

So summarizing the model I proposed:

- you allow userspace to directly write into the ringbuffer, and also
  write the fences directly

- actual submit is done by the kernel, using drm/scheduler. The kernel
  blindly trusts userspace to set up everything else, and even just wraps
  dma_fences around the userspace memory fences.

- the only check is tdr. If a fence doesn't complete, a tdr fires: a) the
  kernel shoots the entire context and b) userspace recovers by setting up a
  new ringbuffer (a sketch of this wrapping plus tdr follows right after
  this list)

- memory management is done using ttm only, you still need to supply the
  buffer list (ofc that list includes the always present ones, so CS will
  only get the list of special buffers like today). If your hw can't turn
  off gpu page faults and you ever get one we pull up the same old solution:
  Kernel shoots the entire context.

  The important thing is that from the gpu pov memory management works
  exactly like compute workload with direct submit, except that you just
  terminate the context on _any_ page fault, instead of only those that go
  somewhere where there's really no mapping and repair the others.

  Also I guess from reading the old thread this means you'd disable page
  fault retry because that is apparently also way too slow for anything.

- memory management uses an unload fence. That unload fence waits for all
  userspace memory fences (represented as dma_fence) to complete, with
  maybe some fudge to busy-spin until we've reached the actual end of the
  ringbuffer (maybe you have an IB tail there after the memory fence write,
  we have that on intel hw), and it waits for the memory to get
  "unloaded". This is either preemption, or pte clearing + tlb shootdown,
  or whatever else your hw provides which is a) used for dynamic memory
  management b) fast enough for actual memory management (a sketch of one
  way to compose such an unload fence follows at the end of this mail)

- any time a context dies we force-complete all its pending fences,
  in-order ofc
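
As referenced from the tdr point above, a sketch of that wrapping with
invented names: the dma_fence snoops a 64-bit userspace fence value, and a
plain timer stands in for the tdr.

#include <linux/dma-fence.h>
#include <linux/timer.h>

struct user_mem_fence {
    struct dma_fence base;
    spinlock_t lock;
    u64 *vaddr;     /* kernel mapping of the userspace fence value */
    u64 seqno;
    struct timer_list tdr;
};

/* called from whatever polling/interrupt path the driver already has */
static void user_mem_fence_check(struct user_mem_fence *f)
{
    if (READ_ONCE(*f->vaddr) >= f->seqno)
        dma_fence_signal(&f->base);
}

static void user_mem_fence_tdr(struct timer_list *t)
{
    struct user_mem_fence *f = from_timer(f, t, tdr);

    if (dma_fence_is_signaled(&f->base))
        return;

    /* the only check there is: shoot the context and force-complete all
     * of its fences, in order (both helpers invented) */
    shoot_context(f);
    force_complete_context_fences(f);
}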

So from hw pov this looks 99% like direct userspace submit, with the exact
same mappings, command sequences and everything else. The only difference
is that the ringbuffer head/tail updates happen from drm/scheduler, instead
of directly from userspace.

None of this stuff needs funny tricks where the kernel controls the
writes to memory fences, or where you need kernel ringbuffers, or anything
like this.
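
And as referenced from the unload-fence point, one way to compose it,
assuming dma_fence_array fits the bill (the array takes over the passed
references):

#include <linux/dma-fence-array.h>
#include <linux/slab.h>
#include <linux/string.h>

/* combine all wrapped userspace fences with the fence that signals once
 * the ptes are cleared and the tlb shootdown has finished */
static struct dma_fence *build_unload_fence(struct dma_fence **user_fences,
                                            int n,
                                            struct dma_fence *unload_done)
{
    struct dma_fence **all;
    struct dma_fence_array *arr;

    all = kmalloc_array(n + 1, sizeof(*all), GFP_KERNEL);
    if (!all)
        return NULL;

    memcpy(all, user_fences, n * sizeof(*all));
    all[n] = unload_done;

    /* signal_on_any = false: everything must have completed */
    arr = dma_fence_array_create(n + 1, all, dma_fence_context_alloc(1),
                                 1, false);
    if (!arr) {
        kfree(all);
        return NULL;
    }
    return &arr->base;
}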

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Christian König



On 02.06.21 at 20:48, Daniel Vetter wrote:

On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:

On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:


Yes, we can't break anything because we don't want to complicate things
for us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.


This would actually be treated as a GPU hang caused by that context, so it
should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)


Well to be honest I would just push back on our hardware/firmware guys 
that we need to keep kernel queues forever before going down that route.


That syncfile and all that Android stuff isn't working out of the box 
with the new shiny user queue submission model (which in turn is mostly 
because of Windows) has already raised some eyebrows here.


Christian.


-Daniel




Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Daniel Vetter
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:
> 
> > Yes, we can't break anything because we don't want to complicate things
> > for us. It's pretty much all NAK'd already. We are trying to gather more
> > knowledge and then make better decisions.
> >
> > The idea we are considering is that we'll expose memory-based sync objects
> > to userspace for read only, and the kernel or hw will strictly control the
> > memory writes to those sync objects. The hole in that idea is that
> > userspace can decide not to signal a job, so even if userspace can't
> > overwrite memory-based sync object states arbitrarily, it can still decide
> > not to signal them, and then a future fence is born.
> >
> 
> This would actually be treated as a GPU hang caused by that context, so it
> should be fine.

This is practically what I proposed already, except you're not doing it with
dma_fence. And on the memory fence side this also doesn't actually give
what you want for that compute model.

This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
in the kernel to hide these not-dma_fence-but-almost, and still pain to
actually drive the hardware like it should be for compute or direct
display.

Also maybe I've missed it, but I didn't see any replies to my suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be
interesting to know what doesn't work there instead of amd folks going off
into internal again and then coming back with another rfc from out of
nowhere :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Christian König

Am 02.06.21 um 11:58 schrieb Marek Olšák:
On Wed, Jun 2, 2021 at 5:44 AM Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:


Am 02.06.21 um 10:57 schrieb Daniel Stone:
> Hi Christian,
>
> On Tue, 1 Jun 2021 at 13:51, Christian König
> <ckoenig.leichtzumer...@gmail.com> wrote:
>> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
>>> If you want to enable this use-case with driver magic and without the
>>> compositor being aware of what's going on, the solution is EGLStreams.
>>> Not sure we want to go there, but it's definitely a lot more feasible
>>> than trying to stuff eglstreams semantics into dma-buf implicit
>>> fencing support in a desperate attempt to not change compositors.
>> Well not changing compositors is certainly not something I would try
>> with this use case.
>>
>> Not changing compositors is more like ok we have Ubuntu 20.04 and need
>> to support that with the newest hardware generation.
> Serious question, have you talked to Canonical?
>
> I mean there's a hell of a lot of effort being expended here, but it
> seems to all be predicated on the assumption that Ubuntu's LTS
> HWE/backport policy is totally immutable, and that we might need to
> make the kernel do backflips to work around that. But ... is it? Has
> anyone actually asked them how they feel about this?

This was merely an example. What I wanted to say is that we need to
support systems already deployed.

In other words our customers won't accept that they need to replace the
compositor just because they switch to a new hardware generation.

> I mean, my answer to the first email is 'no, absolutely not' from the
> technical perspective (the initial proposal totally breaks current and
> future userspace), from a design perspective (it breaks a lot of
> usecases which aren't single-vendor GPU+display+codec, or aren't just
> a simple desktop), and from a sustainability perspective (cutting
> Android adrift again isn't acceptable collateral damage to make it
> easier to backport things to last year's Ubuntu release).
>
> But then again, I don't even know what I'm NAKing here ... ? The
> original email just lists a proposal to break a ton of things, with
> proposed replacements which aren't technically viable, and it's not
> clear why? Can we please have some more details and some reasoning
> behind them?
>
> I don't mind that userspace (compositor, protocols, clients like Mesa
> as well as codec APIs) need to do a lot of work to support the new
> model. I do really care though that the hard-binary-switch model works
> fine enough for AMD but totally breaks heterogeneous systems and makes
> it impossible for userspace to do the right thing.

Well how the handling for new Android, distributions etc. is going to
look is a completely different story.

And I completely agree with both Daniel Vetter and you that we need to
keep this in mind when designing the compatibility with older software.

For Android I'm really not sure what to do. In general Android is
already trying to do the right thing by using explicit sync, the problem
is that this is built around the idea that this explicit sync is
syncfile kernel based.

Either we need to change Android and come up with something that works
with user fences as well or we somehow invent a compatibility layer for
syncfile as well.


What's the issue with syncfiles that syncobjs don't suffer from?


Syncobjs were designed with future fences in mind. In other words we 
already have the ability to wait for a future submission to appear with 
all the nasty locking implications.


Syncfiles on the other hand are just containers for up to N kernel 
fences, and since we don't have kernel fences any more that is rather 
tricky to keep working.


Going to look into the uAPI around syncfiles once more and see if we can 
somehow use that for user fences as well.
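
(To spell out what that future-fence wait looks like from userspace today --
a sketch via the existing syncobj uAPI in libdrm; the helper name is made
up, the wait flag is the real one:)

#include <stdint.h>
#include <xf86drm.h>

/* Sketch: wait on a timeline point that may not even have a fence attached
 * yet. DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT is what turns this into a
 * "future fence" wait, with all the nasty blocking implications. */
static int wait_future_point(int drm_fd, uint32_t syncobj, uint64_t point,
                             int64_t abs_timeout_nsec)
{
        uint32_t first_signaled;

        return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1,
                                      abs_timeout_nsec,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      &first_signaled);
}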


Christian.



Marek




Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
On Wed, Jun 2, 2021 at 5:44 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 02.06.21 um 10:57 schrieb Daniel Stone:
> > Hi Christian,
> >
> > On Tue, 1 Jun 2021 at 13:51, Christian König
> >  wrote:
> >> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> >>> If you want to enable this use-case with driver magic and without the
> >>> compositor being aware of what's going on, the solution is EGLStreams.
> >>> Not sure we want to go there, but it's definitely a lot more feasible
> >>> than trying to stuff eglstreams semantics into dma-buf implicit
> >>> fencing support in a desperate attempt to not change compositors.
> >> Well not changing compositors is certainly not something I would try
> >> with this use case.
> >>
> >> Not changing compositors is more like ok we have Ubuntu 20.04 and need
> >> to support that with the newest hardware generation.
> > Serious question, have you talked to Canonical?
> >
> > I mean there's a hell of a lot of effort being expended here, but it
> > seems to all be predicated on the assumption that Ubuntu's LTS
> > HWE/backport policy is totally immutable, and that we might need to
> > make the kernel do backflips to work around that. But ... is it? Has
> > anyone actually asked them how they feel about this?
>
> This was merely an example. What I wanted to say is that we need to
> support systems already deployed.
>
> In other words our customers won't accept that they need to replace the
> compositor just because they switch to a new hardware generation.
>
> > I mean, my answer to the first email is 'no, absolutely not' from the
> > technical perspective (the initial proposal totally breaks current and
> > future userspace), from a design perspective (it breaks a lot of
> > usecases which aren't single-vendor GPU+display+codec, or aren't just
> > a simple desktop), and from a sustainability perspective (cutting
> > Android adrift again isn't acceptable collateral damage to make it
> > easier to backport things to last year's Ubuntu release).
> >
> > But then again, I don't even know what I'm NAKing here ... ? The
> > original email just lists a proposal to break a ton of things, with
> > proposed replacements which aren't technically viable, and it's not
> > clear why? Can we please have some more details and some reasoning
> > behind them?
> >
> > I don't mind that userspace (compositor, protocols, clients like Mesa
> > as well as codec APIs) need to do a lot of work to support the new
> > model. I do really care though that the hard-binary-switch model works
> > fine enough for AMD but totally breaks heterogeneous systems and makes
> > it impossible for userspace to do the right thing.
>
> Well how the handling for new Android, distributions etc... is going to
> look is a completely different story.
>
> And I completely agree with both Daniel Vetter and you that we need to
> keep this in mind when designing the compatibility with older software.
>
> For Android I'm really not sure what to do. In general Android is
> already trying to do the right thing by using explicit sync, the problem
> is that this is built around the idea that this explicit sync is
> syncfile kernel based.
>
> Either we need to change Android and come up with something that works
> with user fences as well or we somehow invent a compatibility layer for
> syncfile as well.
>

What's the issue with syncfiles that syncobjs don't suffer from?

Marek


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Christian König

Am 02.06.21 um 10:57 schrieb Daniel Stone:

Hi Christian,

On Tue, 1 Jun 2021 at 13:51, Christian König
 wrote:

Am 01.06.21 um 14:30 schrieb Daniel Vetter:

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.

Well not changing compositors is certainly not something I would try
with this use case.

Not changing compositors is more like ok we have Ubuntu 20.04 and need
to support that with the newest hardware generation.

Serious question, have you talked to Canonical?

I mean there's a hell of a lot of effort being expended here, but it
seems to all be predicated on the assumption that Ubuntu's LTS
HWE/backport policy is totally immutable, and that we might need to
make the kernel do backflips to work around that. But ... is it? Has
anyone actually asked them how they feel about this?


This was merely an example. What I wanted to say is that we need to 
support systems already deployed.


In other words our customers won't accept that they need to replace the 
compositor just because they switch to a new hardware generation.



I mean, my answer to the first email is 'no, absolutely not' from the
technical perspective (the initial proposal totally breaks current and
future userspace), from a design perspective (it breaks a lot of
usecases which aren't single-vendor GPU+display+codec, or aren't just
a simple desktop), and from a sustainability perspective (cutting
Android adrift again isn't acceptable collateral damage to make it
easier to backport things to last year's Ubuntu release).

But then again, I don't even know what I'm NAKing here ... ? The
original email just lists a proposal to break a ton of things, with
proposed replacements which aren't technically viable, and it's not
clear why? Can we please have some more details and some reasoning
behind them?

I don't mind that userspace (compositor, protocols, clients like Mesa
as well as codec APIs) need to do a lot of work to support the new
model. I do really care though that the hard-binary-switch model works
fine enough for AMD but totally breaks heterogeneous systems and makes
it impossible for userspace to do the right thing.


Well how the handling for new Android, distributions etc... is going to 
look is a completely different story.


And I completely agree with both Daniel Vetter and you that we need to 
keep this in mind when designing the compatibility with older software.


For Android I'm really not sure what to do. In general Android is 
already trying to do the right thing by using explicit sync, the problem 
is that this is built around the idea that this explicit sync is 
syncfile kernel based.


Either we need to change Android and come up with something that works 
with user fences as well or we somehow invent a compatibility layer for 
syncfile as well.


Regards,
Christian.



Cheers,
Daniel




Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák  wrote:

> Yes, we can't break anything because we don't want to complicate things
> for us. It's pretty much all NAK'd already. We are trying to gather more
> knowledge and then make better decisions.
>
> The idea we are considering is that we'll expose memory-based sync objects
> to userspace for read only, and the kernel or hw will strictly control the
> memory writes to those sync objects. The hole in that idea is that
> userspace can decide not to signal a job, so even if userspace can't
> overwrite memory-based sync object states arbitrarily, it can still decide
> not to signal them, and then a future fence is born.
>

This would actually be treated as a GPU hang caused by that context, so it
should be fine.

Marek


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Marek Olšák
Yes, we can't break anything because we don't want to complicate things for
us. It's pretty much all NAK'd already. We are trying to gather more
knowledge and then make better decisions.

The idea we are considering is that we'll expose memory-based sync objects
to userspace for read only, and the kernel or hw will strictly control the
memory writes to those sync objects. The hole in that idea is that
userspace can decide not to signal a job, so even if userspace can't
overwrite memory-based sync object states arbitrarily, it can still decide
not to signal them, and then a future fence is born.

Marek

On Wed, Jun 2, 2021 at 4:57 AM Daniel Stone  wrote:

> Hi Christian,
>
> On Tue, 1 Jun 2021 at 13:51, Christian König
>  wrote:
> > Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > > If you want to enable this use-case with driver magic and without the
> > > compositor being aware of what's going on, the solution is EGLStreams.
> > > Not sure we want to go there, but it's definitely a lot more feasible
> > > than trying to stuff eglstreams semantics into dma-buf implicit
> > > fencing support in a desperate attempt to not change compositors.
> >
> > Well not changing compositors is certainly not something I would try
> > with this use case.
> >
> > Not changing compositors is more like ok we have Ubuntu 20.04 and need
> > to support that with the newest hardware generation.
>
> Serious question, have you talked to Canonical?
>
> I mean there's a hell of a lot of effort being expended here, but it
> seems to all be predicated on the assumption that Ubuntu's LTS
> HWE/backport policy is totally immutable, and that we might need to
> make the kernel do backflips to work around that. But ... is it? Has
> anyone actually asked them how they feel about this?
>
> I mean, my answer to the first email is 'no, absolutely not' from the
> technical perspective (the initial proposal totally breaks current and
> future userspace), from a design perspective (it breaks a lot of
> usecases which aren't single-vendor GPU+display+codec, or aren't just
> a simple desktop), and from a sustainability perspective (cutting
> Android adrift again isn't acceptable collateral damage to make it
> easier to backport things to last year's Ubuntu release).
>
> But then again, I don't even know what I'm NAKing here ... ? The
> original email just lists a proposal to break a ton of things, with
> proposed replacements which aren't technically viable, and it's not
> clear why? Can we please have some more details and some reasoning
> behind them?
>
> I don't mind that userspace (compositor, protocols, clients like Mesa
> as well as codec APIs) need to do a lot of work to support the new
> model. I do really care though that the hard-binary-switch model works
> fine enough for AMD but totally breaks heterogeneous systems and makes
> it impossible for userspace to do the right thing.
>
> Cheers,
> Daniel
>


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Daniel Stone
Hi Christian,

On Tue, 1 Jun 2021 at 13:51, Christian König
 wrote:
> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.
>
> Well not changing compositors is certainly not something I would try
> with this use case.
>
> Not changing compositors is more like ok we have Ubuntu 20.04 and need
> to support that with the newest hardware generation.

Serious question, have you talked to Canonical?

I mean there's a hell of a lot of effort being expended here, but it
seems to all be predicated on the assumption that Ubuntu's LTS
HWE/backport policy is totally immutable, and that we might need to
make the kernel do backflips to work around that. But ... is it? Has
anyone actually asked them how they feel about this?

I mean, my answer to the first email is 'no, absolutely not' from the
technical perspective (the initial proposal totally breaks current and
future userspace), from a design perspective (it breaks a lot of
usecases which aren't single-vendor GPU+display+codec, or aren't just
a simple desktop), and from a sustainability perspective (cutting
Android adrift again isn't acceptable collateral damage to make it
easier to backport things to last year's Ubuntu release).

But then again, I don't even know what I'm NAKing here ... ? The
original email just lists a proposal to break a ton of things, with
proposed replacements which aren't technically viable, and it's not
clear why? Can we please have some more details and some reasoning
behind them?

I don't mind that userspace (compositor, protocols, clients like Mesa
as well as codec APIs) need to do a lot of work to support the new
model. I do really care though that the hard-binary-switch model works
fine enough for AMD but totally breaks heterogeneous systems and makes
it impossible for userspace to do the right thing.

Cheers,
Daniel


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-02 Thread Michel Dänzer
On 2021-06-01 12:49 p.m., Michel Dänzer wrote:
> On 2021-06-01 12:21 p.m., Christian König wrote:
> 
>> Another question is if that is sufficient as security for the display server 
>> or if we need further handling down the road? I mean essentially we are 
>> moving the reliability problem into the display server.
> 
> Good question. This should generally protect the display server from freezing 
> due to client fences never signalling, but there might still be corner cases.

E.g. a client might be able to sneak in a fence between when the compositor 
checks fences and when it submits its drawing to the kernel.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Daniel Stone
Hi,

On Tue, 1 Jun 2021 at 14:18, Michel Dänzer  wrote:
> On 2021-06-01 2:10 p.m., Christian König wrote:
> > Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >> There isn't a choice for Wayland compositors in general, since there can 
> >> be arbitrary other state which needs to be applied atomically together 
> >> with the new buffer. (Though in theory, a compositor might get fancy and 
> >> special-case surface commits which can be handled by waiting on the GPU)

Yeah, this is pretty crucial.

> >> Latency is largely a matter of scheduling in the compositor. The latency 
> >> incurred by the compositor shouldn't have to be more than single-digit 
> >> milliseconds. (I've seen total latency from when the client starts 
> >> processing a (static) frame to when it starts being scanned out as low as 
> >> ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, 
> >> lower than typical with Xorg)
> >
> > Well let me describe it like this:
> >
> > We have a use case for 144 Hz guaranteed refresh rate. That essentially 
> > means that the client application needs to be able to spit out one 
> > frame/window content every ~6.9ms. That's tough, but doable.
> >
> > When you now add 6ms latency in the compositor that means the client 
> > application has only .9ms left for its frame which is basically impossible 
> > to do.
>
> You misunderstood me. 6 ms is the lowest possible end-to-end latency from 
> client to scanout, but the client can start as early as it wants/needs to. 
> It's a trade-off between latency and the risk of missing a scanout cycle.

Not quite.

What weston-presentation-shm is reporting is a 6 ms delta between when
it started its rendering loop and when the frame was presented to the
screen. How w-p-s was run matters a lot, because you can insert an
arbitrary delay in there to simulate client rendering. It also matters
a lot that the client is SHM, because that will result in Mutter doing
glTexImage2D on whatever size the window is, then doing a full GL
compositing pass, so even if it's run with zero delay, 6ms isn't 'the
amount of time it takes Mutter to get a frame to screen', it's
measuring the overhead of a texture upload and full-screen composition
as well.

I'm assuming the 'guaranteed 144Hz' target is a fullscreen GL client,
for which you definitely avoid TexImage2D, and could hopefully
(assuming the client picks a modifier which can be displayed) also
avoid the composition pass in favour of direct scanout from the client
buffer; that would give you a much lower number.

Each compositor has its own heuristics around timing. They all make
their own tradeoff between low latency and fewest dropped frames.
Obviously, the higher your latency, the lower the chance of missing a
deadline. There's a lot of detail in the MR that Michel linked.

Cheers,
Daniel


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 3:18 p.m., Michel Dänzer wrote:
> On 2021-06-01 2:10 p.m., Christian König wrote:
>> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>>> On 2021-06-01 12:21 p.m., Christian König wrote:
 Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>> 3) Compositors (and other privileged processes, and display flipping) 
>> can't trust imported/exported fences. They need a timeout recovery 
>> mechanism from the beginning, and the following are some possible 
>> solutions to timeouts:
>>
>> a) use a CPU wait with a small absolute timeout, and display the 
>> previous content on timeout
>> b) use a GPU wait with a small absolute timeout, and conditional 
>> rendering will choose between the latest content (if signalled) and 
>> previous content (if timed out)
>>
>> The result would be that the desktop can run close to 60 fps even if an 
>> app runs at 1 fps.
> FWIW, this is working with
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
> provide the same dma-buf poll semantics as other drivers and high 
> priority GFX contexts via EGL_IMG_context_priority which can preempt 
> lower priority ones).
 Yeah, that is really nice to have.

 One question is if you wait on the CPU or the GPU for the new surface to 
 become available?
>>> It's based on polling dma-buf fds, i.e. CPU.
>>>
 The former is a bit bad for latency and power management.
>>> There isn't a choice for Wayland compositors in general, since there can be 
>>> arbitrary other state which needs to be applied atomically together with 
>>> the new buffer. (Though in theory, a compositor might get fancy and 
>>> special-case surface commits which can be handled by waiting on the GPU)
>>>
>>> Latency is largely a matter of scheduling in the compositor. The latency 
>>> incurred by the compositor shouldn't have to be more than single-digit 
>>> milliseconds. (I've seen total latency from when the client starts 
>>> processing a (static) frame to when it starts being scanned out as low as 
>>> ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, 
>>> lower than typical with Xorg)
>>
>> Well let me describe it like this:
>>
>> We have a use case for 144 Hz guaranteed refresh rate. That essentially 
>> means that the client application needs to be able to spit out one 
>> frame/window content every ~6.9ms. That's tough, but doable.
>>
>> When you now add 6ms latency in the compositor that means the client 
>> application has only .9ms left for its frame which is basically impossible 
>> to do.
> 
> You misunderstood me. 6 ms is the lowest possible end-to-end latency from 
> client to scanout, but the client can start as early as it wants/needs to. 
> It's a trade-off between latency and the risk of missing a scanout cycle.

Note that what I wrote above is about the case where the compositor needs to 
draw its own frame sampling from the client buffer. If your concern is about a 
fullscreen application for which the compositor can directly use the 
application buffers for scanout, it should be possible in theory to get the 
latency incurred by the compositor down to ~1 ms.

If that's too much[0], it could be improved further by adding atomic KMS API to 
replace a pending page flip with another one. Then the compositor could just 
directly submit a flip as soon as a new buffer becomes ready (or even as soon 
as the client submits it to the compositor, depending on how exactly the new 
KMS API works). Then the minimum latency should be mostly up to the kernel 
driver / HW.
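
A sketch of how such a commit could look from the compositor side; the
DRM_MODE_ATOMIC_REPLACE_SKETCH flag below is invented for illustration (it
is exactly the API that doesn't exist yet), everything else is the existing
libdrm atomic API:

#include <xf86drm.h>
#include <xf86drmMode.h>

/* Hypothetical flag: allow this commit to replace a still-pending flip on
 * the same plane instead of failing with -EBUSY. */
#define DRM_MODE_ATOMIC_REPLACE_SKETCH (1u << 31)

static int replace_pending_flip(int fd, uint32_t plane_id,
                                uint32_t fb_prop_id, uint32_t new_fb_id)
{
        drmModeAtomicReq *req = drmModeAtomicAlloc();
        int ret;

        drmModeAtomicAddProperty(req, plane_id, fb_prop_id, new_fb_id);
        ret = drmModeAtomicCommit(fd, req,
                                  DRM_MODE_ATOMIC_NONBLOCK |
                                  DRM_MODE_PAGE_FLIP_EVENT |
                                  DRM_MODE_ATOMIC_REPLACE_SKETCH, NULL);
        drmModeAtomicFree(req);
        return ret;
}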

Another possibility would be for the application to use KMS directly, e.g. via 
a DRM lease. That might still require the same new API to get the flip 
submission latency significantly below 1 ms though.


[0] Though I'm not sure how to reconcile that with "spitting out one frame 
every ~6.9ms is tough", as that means the theoretical minimum total 
client→scanout latency is ~7 ms (and missing a scanout cycle ~doubles the 
latency).

-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 3:01 p.m., Marek Olšák wrote:
> 
> 
> On Tue., Jun. 1, 2021, 08:51 Christian König,
> <ckoenig.leichtzumer...@gmail.com> wrote:
> 
> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > On Tue, Jun 1, 2021 at 2:10 PM Christian König
> > wrote:
> >> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >>> On 2021-06-01 12:21 p.m., Christian König wrote:
>  Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> > On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >> 3) Compositors (and other privileged processes, and display flipping)
> >> can't trust imported/exported fences. They need a timeout recovery
> >> mechanism from the beginning, and the following are some possible
> >> solutions to timeouts:
> >>
> >> a) use a CPU wait with a small absolute timeout, and display the
> >> previous content on timeout
> >> b) use a GPU wait with a small absolute timeout, and conditional
> >> rendering will choose between the latest content (if signalled) and
> >> previous content (if timed out)
> >>
> >> The result would be that the desktop can run close to 60 fps even
> >> if an app runs at 1 fps.
> > FWIW, this is working with
> > https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with
> > implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to
> > provide the same dma-buf poll semantics as other drivers and high
> > priority GFX contexts via EGL_IMG_context_priority which can preempt
> > lower priority ones).
>  Yeah, that is really nice to have.
> 
>  One question is if you wait on the CPU or the GPU for the new
>  surface to become available?
> >>> It's based on polling dma-buf fds, i.e. CPU.
> >>>
>  The former is a bit bad for latency and power management.
> >>> There isn't a choice for Wayland compositors in general, since there
> >>> can be arbitrary other state which needs to be applied atomically
> >>> together with the new buffer. (Though in theory, a compositor might get
> >>> fancy and special-case surface commits which can be handled by waiting
> >>> on the GPU)
> >>>
> >>> Latency is largely a matter of scheduling in the compositor. The
> >>> latency incurred by the compositor shouldn't have to be more than
> >>> single-digit milliseconds. (I've seen total latency from when the client
> >>> starts processing a (static) frame to when it starts being scanned out
> >>> as low as ~6 ms with
> >>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than
> >>> typical with Xorg)
> >> Well let me describe it like this:
> >>
> >> We have a use case for 144 Hz guaranteed refresh rate. That
> >> essentially means that the client application needs to be able to spit
> >> out one frame/window content every ~6.9ms. That's tough, but doable.
> >>
> >> When you now add 6ms latency in the compositor that means the client
> >> application has only .9ms left for its frame which is basically
> >> impossible to do.
> >>
> >> See for the user fences handling the display engine will learn to read
> >> sequence numbers from memory and decide on its own if the old frame or
> >> the new one is scanned out. To get the latency there as low as possible.
> > This won't work with implicit sync at all.
> >
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.

EGLStreams are a red herring here. Wayland has atomic state transactions, 
similar to the atomic KMS API. These semantics could be achieved even with 
EGLStreams, at least via additional EGL extensions.

Any fancy technique we're discussing here would have to be completely between 
the Wayland compositor and the kernel, transparent to anything else.


> > I still think the most reasonable approach here is that we wrap a
> > dma_fence compat layer/mode over new hw for existing
> > userspace/compositors. And then enable userspace memory fences and the
> > fancy new features those allow with a new model that's built for them.
> 
> Yeah, that's basically the same direction I'm heading. Question is how
> to fix all those details.
> 
> > Also even with dma_fence we could implement your model of staying with
> > the previous buffer (or an older buffer that's already rendered),
> > but it needs explicit involvement of the compositor. At least without
> > adding eglstreams fd to the kernel and wiring up all the relevant
> > extensions.
> 
> 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 2:10 p.m., Christian König wrote:
> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
>> On 2021-06-01 12:21 p.m., Christian König wrote:
>>> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
 On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> 3) Compositors (and other privileged processes, and display flipping) 
> can't trust imported/exported fences. They need a timeout recovery 
> mechanism from the beginning, and the following are some possible 
> solutions to timeouts:
>
> a) use a CPU wait with a small absolute timeout, and display the previous 
> content on timeout
> b) use a GPU wait with a small absolute timeout, and conditional 
> rendering will choose between the latest content (if signalled) and 
> previous content (if timed out)
>
> The result would be that the desktop can run close to 60 fps even if an 
> app runs at 1 fps.
 FWIW, this is working with
 https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
 implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
 provide the same dma-buf poll semantics as other drivers and high priority 
 GFX contexts via EGL_IMG_context_priority which can preempt lower priority 
 ones).
>>> Yeah, that is really nice to have.
>>>
>>> One question is if you wait on the CPU or the GPU for the new surface to 
>>> become available?
>> It's based on polling dma-buf fds, i.e. CPU.
>>
>>> The former is a bit bad for latency and power management.
>> There isn't a choice for Wayland compositors in general, since there can be 
>> arbitrary other state which needs to be applied atomically together with the 
>> new buffer. (Though in theory, a compositor might get fancy and special-case 
>> surface commits which can be handled by waiting on the GPU)
>>
>> Latency is largely a matter of scheduling in the compositor. The latency 
>> incurred by the compositor shouldn't have to be more than single-digit 
>> milliseconds. (I've seen total latency from when the client starts 
>> processing a (static) frame to when it starts being scanned out as low as ~6 
>> ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower 
>> than typical with Xorg)
> 
> Well let me describe it like this:
> 
> We have a use case for 144 Hz guaranteed refresh rate. That essentially 
> means that the client application needs to be able to spit out one 
> frame/window content every ~6.9ms. That's tough, but doable.
> 
> When you now add 6ms latency in the compositor that means the client 
> application has only .9ms left for its frame which is basically impossible 
> to do.

You misunderstood me. 6 ms is the lowest possible end-to-end latency from 
client to scanout, but the client can start as early as it wants/needs to. It's 
a trade-off between latency and the risk of missing a scanout cycle.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Marek Olšák
On Tue., Jun. 1, 2021, 08:51 Christian König, <
ckoenig.leichtzumer...@gmail.com> wrote:

> Am 01.06.21 um 14:30 schrieb Daniel Vetter:
> > On Tue, Jun 1, 2021 at 2:10 PM Christian König
> >  wrote:
> >> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> >>> On 2021-06-01 12:21 p.m., Christian König wrote:
>  Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> > On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> >> 3) Compositors (and other privileged processes, and display
> flipping) can't trust imported/exported fences. They need a timeout
> recovery mechanism from the beginning, and the following are some possible
> solutions to timeouts:
> >>
> >> a) use a CPU wait with a small absolute timeout, and display the
> previous content on timeout
> >> b) use a GPU wait with a small absolute timeout, and conditional
> rendering will choose between the latest content (if signalled) and
> previous content (if timed out)
> >>
> >> The result would be that the desktop can run close to 60 fps even
> if an app runs at 1 fps.
> > FWIW, this is working with
> > https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even
> with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to
> provide the same dma-buf poll semantics as other drivers and high priority
> GFX contexts via EGL_IMG_context_priority which can preempt lower priority
> ones).
>  Yeah, that is really nice to have.
> 
>  One question is if you wait on the CPU or the GPU for the new surface
> to become available?
> >>> It's based on polling dma-buf fds, i.e. CPU.
> >>>
>  The former is a bit bad for latency and power management.
> >>> There isn't a choice for Wayland compositors in general, since there
> can be arbitrary other state which needs to be applied atomically together
> with the new buffer. (Though in theory, a compositor might get fancy and
> special-case surface commits which can be handled by waiting on the GPU)
> >>>
> >>> Latency is largely a matter of scheduling in the compositor. The
> latency incurred by the compositor shouldn't have to be more than
> single-digit milliseconds. (I've seen total latency from when the client
> starts processing a (static) frame to when it starts being scanned out as
> low as ~6 ms with
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than
> typical with Xorg)
> >> Well let me describe it like this:
> >>
> >> We have a use case for 144 Hz guaranteed refresh rate. That
> >> essentially means that the client application needs to be able to spit
> >> out one frame/window content every ~6.9ms. That's tough, but doable.
> >>
> >> When you now add 6ms latency in the compositor that means the client
> >> application has only .9ms left for its frame which is basically
> >> impossible to do.
> >>
> >> See for the user fences handling the display engine will learn to read
> >> sequence numbers from memory and decide on its own if the old frame or
> >> the new one is scanned out. To get the latency there as low as possible.
> > This won't work with implicit sync at all.
> >
> > If you want to enable this use-case with driver magic and without the
> > compositor being aware of what's going on, the solution is EGLStreams.
> > Not sure we want to go there, but it's definitely a lot more feasible
> > than trying to stuff eglstreams semantics into dma-buf implicit
> > fencing support in a desperate attempt to not change compositors.
>
> Well not changing compositors is certainly not something I would try
> with this use case.
>
> Not changing compositors is more like ok we have Ubuntu 20.04 and need
> to support that with the newest hardware generation.
>
> > I still think the most reasonable approach here is that we wrap a
> > dma_fence compat layer/mode over new hw for existing
> > userspace/compositors. And then enable userspace memory fences and the
> > fancy new features those allow with a new model that's built for them.
>
> Yeah, that's basically the same direction I'm heading. Question is how
> to fix all those details.
>
> > Also even with dma_fence we could implement your model of staying with
> > the previous buffer (or an older buffer at that's already rendered),
> > but it needs explicit involvement of the compositor. At least without
> > adding eglstreams fd to the kernel and wiring up all the relevant
> > extensions.
>
> Question is do we already have some extension which allows different
> textures to be selected on the fly depending on some state?
>

There is no such extension for sync objects, but it can be done with
queries, like occlusion queries. There is also no timeout option and it can
only do "if" and "if not", but not "if .. else"

Marek



> E.g. something like use new frame if it's available and old frame
> otherwise.
>
> If you then apply this to the standard dma_fence based hardware or the
> new user fence based one is then pretty much irrelevant.
>
> Regards,
> Christian.
>
> > -Daniel
> >
>  Another 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 14:30 schrieb Daniel Vetter:

On Tue, Jun 1, 2021 at 2:10 PM Christian König
 wrote:

Am 01.06.21 um 12:49 schrieb Michel Dänzer:

On 2021-06-01 12:21 p.m., Christian König wrote:

Am 01.06.21 um 11:02 schrieb Michel Dänzer:

On 2021-05-27 11:51 p.m., Marek Olšák wrote:

3) Compositors (and other privileged processes, and display flipping) can't 
trust imported/exported fences. They need a timeout recovery mechanism from the 
beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous 
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will 
choose between the latest content (if signalled) and previous content (if timed 
out)

The result would be that the desktop can run close to 60 fps even if an app 
runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).

Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to become 
available?

It's based on polling dma-buf fds, i.e. CPU.


The former is a bit bad for latency and power management.

There isn't a choice for Wayland compositors in general, since there can be 
arbitrary other state which needs to be applied atomically together with the 
new buffer. (Though in theory, a compositor might get fancy and special-case 
surface commits which can be handled by waiting on the GPU)

Latency is largely a matter of scheduling in the compositor. The latency 
incurred by the compositor shouldn't have to be more than single-digit 
milliseconds. (I've seen total latency from when the client starts processing a 
(static) frame to when it starts being scanned out as low as ~6 ms with 
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical 
with Xorg)

Well let me describe it like this:

We have a use case for 144 Hz guaranteed refresh rate. That
essentially means that the client application needs to be able to spit
out one frame/window content every ~6.9ms. That's tough, but doable.

When you now add 6ms latency in the compositor that means the client
application has only .9ms left for its frame which is basically
impossible to do.

See for the user fences handling the display engine will learn to read
sequence numbers from memory and decide on its own if the old frame or
the new one is scanned out. To get the latency there as low as possible.

This won't work with implicit sync at all.

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.


Well not changing compositors is certainly not something I would try 
with this use case.


Not changing compositors is more like ok we have Ubuntu 20.04 and need 
to support that with the newest hardware generation.



I still think the most reasonable approach here is that we wrap a
dma_fence compat layer/mode over new hw for existing
userspace/compositors. And then enable userspace memory fences and the
fancy new features those allow with a new model that's built for them.


Yeah, that's basically the same direction I'm heading. Question is how 
to fix all those details.



Also even with dma_fence we could implement your model of staying with
the previous buffer (or an older buffer that's already rendered),
but it needs explicit involvement of the compositor. At least without
adding eglstreams fd to the kernel and wiring up all the relevant
extensions.


Question is do we already have some extension which allows different 
textures to be selected on the fly depending on some state?


E.g. something like use new frame if it's available and old frame otherwise.

If you then apply this to the standard dma_fence based hardware or the 
new user fence based one is then pretty much irrelevant.


Regards,
Christian.


-Daniel


Another question is if that is sufficient as security for the display server or 
if we need further handling down the road? I mean essentially we are moving the 
reliability problem into the display server.

Good question. This should generally protect the display server from freezing 
due to client fences never signalling, but there might still be corner cases.








Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Daniel Vetter
On Tue, Jun 1, 2021 at 2:10 PM Christian König
 wrote:
>
> Am 01.06.21 um 12:49 schrieb Michel Dänzer:
> > On 2021-06-01 12:21 p.m., Christian König wrote:
> >> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
> >>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>  3) Compositors (and other privileged processes, and display flipping) 
>  can't trust imported/exported fences. They need a timeout recovery 
>  mechanism from the beginning, and the following are some possible 
>  solutions to timeouts:
> 
>  a) use a CPU wait with a small absolute timeout, and display the 
>  previous content on timeout
>  b) use a GPU wait with a small absolute timeout, and conditional 
>  rendering will choose between the latest content (if signalled) and 
>  previous content (if timed out)
> 
>  The result would be that the desktop can run close to 60 fps even if an 
>  app runs at 1 fps.
> >>> FWIW, this is working with
> >>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
> >>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to 
> >>> provide the same dma-buf poll semantics as other drivers and high 
> >>> priority GFX contexts via EGL_IMG_context_priority which can preempt 
> >>> lower priority ones).
> >> Yeah, that is really nice to have.
> >>
> >> One question is if you wait on the CPU or the GPU for the new surface to 
> >> become available?
> > It's based on polling dma-buf fds, i.e. CPU.
> >
> >> The former is a bit bad for latency and power management.
> > There isn't a choice for Wayland compositors in general, since there can be 
> > arbitrary other state which needs to be applied atomically together with 
> > the new buffer. (Though in theory, a compositor might get fancy and 
> > special-case surface commits which can be handled by waiting on the GPU)
> >
> > Latency is largely a matter of scheduling in the compositor. The latency 
> > incurred by the compositor shouldn't have to be more than single-digit 
> > milliseconds. (I've seen total latency from when the client starts 
> > processing a (static) frame to when it starts being scanned out as low as 
> > ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, 
> > lower than typical with Xorg)
>
> Well let me describe it like this:
>
> We have a use case for 144 Hz guaranteed refresh rate. That
> essentially means that the client application needs to be able to spit
> out one frame/window content every ~6.9ms. That's tough, but doable.
>
> When you now add 6ms latency in the compositor that means the client
> application has only .9ms left for its frame which is basically
> impossible to do.
>
> See for the user fences handling the display engine will learn to read
> sequence numbers from memory and decide on its own if the old frame or
> the new one is scanned out. To get the latency there as low as possible.

This won't work with implicit sync at all.

If you want to enable this use-case with driver magic and without the
compositor being aware of what's going on, the solution is EGLStreams.
Not sure we want to go there, but it's definitely a lot more feasible
than trying to stuff eglstreams semantics into dma-buf implicit
fencing support in a desperate attempt to not change compositors.

I still think the most reasonable approach here is that we wrap a
dma_fence compat layer/mode over new hw for existing
userspace/compositors. And then enable userspace memory fences and the
fancy new features those allow with a new model that's built for them.
Also even with dma_fence we could implement your model of staying with
> the previous buffer (or an older buffer that's already rendered),
but it needs explicit involvement of the compositor. At least without
adding eglstreams fd to the kernel and wiring up all the relevant
extensions.
-Daniel

> >> Another question is if that is sufficient as security for the display 
> >> server or if we need further handling down the road? I mean essentially we 
> >> are moving the reliability problem into the display server.
> > Good question. This should generally protect the display server from 
> > freezing due to client fences never signalling, but there might still be 
> > corner cases.
> >
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 12:49 schrieb Michel Dänzer:

On 2021-06-01 12:21 p.m., Christian König wrote:

Am 01.06.21 um 11:02 schrieb Michel Dänzer:

On 2021-05-27 11:51 p.m., Marek Olšák wrote:

3) Compositors (and other privileged processes, and display flipping) can't 
trust imported/exported fences. They need a timeout recovery mechanism from the 
beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous 
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will 
choose between the latest content (if signalled) and previous content (if timed 
out)

The result would be that the desktop can run close to 60 fps even if an app 
runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).

Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to become 
available?

It's based on polling dma-buf fds, i.e. CPU.


The former is a bit bad for latency and power management.

There isn't a choice for Wayland compositors in general, since there can be 
arbitrary other state which needs to be applied atomically together with the 
new buffer. (Though in theory, a compositor might get fancy and special-case 
surface commits which can be handled by waiting on the GPU)

Latency is largely a matter of scheduling in the compositor. The latency 
incurred by the compositor shouldn't have to be more than single-digit 
milliseconds. (I've seen total latency from when the client starts processing a 
(static) frame to when it starts being scanned out as low as ~6 ms with 
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical 
with Xorg)


Well let me describe it like this:

We have a use case for 144 Hz guaranteed refresh rate. That 
essentially means that the client application needs to be able to spit 
out one frame/window content every ~6.9ms. That's tough, but doable.


When you now add 6ms latency in the compositor that means the client 
application has only .9ms left for its frame which is basically 
impossible to do.


See for the user fences handling the display engine will learn to read 
sequence numbers from memory and decide on its own if the old frame or 
the new one is scanned out. To get the latency there as low as possible.



Another question is if that is sufficient as security for the display server or 
if we need further handling down the road? I mean essentially we are moving the 
reliability problem into the display server.

Good question. This should generally protect the display server from freezing 
due to client fences never signalling, but there might still be corner cases.






Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-06-01 12:21 p.m., Christian König wrote:
> Am 01.06.21 um 11:02 schrieb Michel Dänzer:
>> On 2021-05-27 11:51 p.m., Marek Olšák wrote:
>>> 3) Compositors (and other privileged processes, and display flipping) can't 
>>> trust imported/exported fences. They need a timeout recovery mechanism from 
>>> the beginning, and the following are some possible solutions to timeouts:
>>>
>>> a) use a CPU wait with a small absolute timeout, and display the previous 
>>> content on timeout
>>> b) use a GPU wait with a small absolute timeout, and conditional rendering 
>>> will choose between the latest content (if signalled) and previous content 
>>> (if timed out)
>>>
>>> The result would be that the desktop can run close to 60 fps even if an app 
>>> runs at 1 fps.
>> FWIW, this is working with
>> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
>> implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide 
>> the same dma-buf poll semantics as other drivers and high priority GFX 
>> contexts via EGL_IMG_context_priority which can preempt lower priority ones).
> 
> Yeah, that is really nice to have.
> 
> One question is if you wait on the CPU or the GPU for the new surface to 
> become available?

It's based on polling dma-buf fds, i.e. CPU.

> The former is a bit bad for latency and power management.

There isn't a choice for Wayland compositors in general, since there can be 
arbitrary other state which needs to be applied atomically together with the 
new buffer. (Though in theory, a compositor might get fancy and special-case 
surface commits which can be handled by waiting on the GPU)

Latency is largely a matter of scheduling in the compositor. The latency 
incurred by the compositor shouldn't have to be more than single-digit 
milliseconds. (I've seen total latency from when the client starts processing a 
(static) frame to when it starts being scanned out as low as ~6 ms with 
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical 
with Xorg)


> Another question is if that is sufficient as security for the display server 
> or if we need further handling down the road? I mean essentially we are 
> moving the reliability problem into the display server.

Good question. This should generally protect the display server from freezing 
due to client fences never signalling, but there might still be corner cases.


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Christian König

Am 01.06.21 um 11:02 schrieb Michel Dänzer:

On 2021-05-27 11:51 p.m., Marek Olšák wrote:

3) Compositors (and other privileged processes, and display flipping) can't 
trust imported/exported fences. They need a timeout recovery mechanism from the 
beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous 
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will 
choose between the latest content (if signalled) and previous content (if timed 
out)

The result would be that the desktop can run close to 60 fps even if an app 
runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).


Yeah, that is really nice to have.

One question is if you wait on the CPU or the GPU for the new surface to 
become available? The former is a bit bad for latency and power management.


Another question is if that is sufficient as security for the display 
server or if we need further handling down the road? I mean essentially 
we are moving the reliability problem into the display server.


Regards,
Christian.


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-06-01 Thread Michel Dänzer
On 2021-05-27 11:51 p.m., Marek Olšák wrote:
> 
> 3) Compositors (and other privileged processes, and display flipping) can't 
> trust imported/exported fences. They need a timeout recovery mechanism from 
> the beginning, and the following are some possible solutions to timeouts:
> 
> a) use a CPU wait with a small absolute timeout, and display the previous 
> content on timeout
> b) use a GPU wait with a small absolute timeout, and conditional rendering 
> will choose between the latest content (if signalled) and previous content 
> (if timed out)
> 
> The result would be that the desktop can run close to 60 fps even if an app 
> runs at 1 fps.

FWIW, this is working with
https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with 
implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the 
same dma-buf poll semantics as other drivers and high priority GFX contexts via 
EGL_IMG_context_priority which can preempt lower priority ones).


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer


Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-05-31 Thread Christian König
Yes, exactly that's my thinking and also the reason why I'm pondering so 
hard on the requirement that the memory for shared user fences should 
not be modifiable by userspace directly.


Christian.

Am 29.05.21 um 05:33 schrieb Marek Olšák:

My first email can be ignored except for the sync files. Oh well.

I think I see what you mean, Christian. If we assume that an imported 
fence is always read only (the buffer with the sequence number is read 
only), only the process that created and exported the fence can signal 
it. If the fence is not signaled, the exporting process is guilty. The 
only thing the importing process must do when it's about to use the 
fence as a dependency is to notify the kernel about it. Thus, the 
kernel will always know the dependency graph. Then if the importing 
process times out, the kernel will blame any of the processes that 
passed it a fence that is still unsignaled. The kernel will blame the 
process that timed out only if all imported fences have been signaled. 
It seems pretty robust.


It's the same with implicit sync except that the buffer with the 
sequence number is writable. Any process that has an implicitly-sync'd 
buffer can set the sequence number to 0 or UINT64_MAX. 0 will cause a 
timeout for the next job, while UINT64_MAX might cause a timeout a 
little later. The timeout can be mitigated by the kernel because the 
kernel knows the greatest number that should be there, but it's not 
possible to know which process is guilty (all processes holding the 
buffer handle would be suspects).


Marek

On Fri, May 28, 2021 at 6:25 PM Marek Olšák wrote:


If both implicit and explicit synchronization are handled the
same, then the kernel won't be able to identify the process that
caused an implicit sync deadlock. The process that is stuck
waiting for a fence can be innocent, and the kernel can't punish
it. Likewise, the GPU reset query that reports which process is
guilty and innocent will only be able to report unknown. Is that OK?

Marek

On Fri, May 28, 2021 at 10:41 AM Christian König
<ckoenig.leichtzumer...@gmail.com> wrote:

Hi Marek,

well I don't think that implicit and explicit synchronization
need to be mutually exclusive.

What we should do is to have the ability to transport a
synchronization object with each BO.

Implicit and explicit synchronization then basically become
the same, they just transport the synchronization object
differently.

The biggest problem is the sync_files for Android, since they
are really not easy to support at all. If Android wants to
support user queues we would probably have to do some changes
there.

Regards,
Christian.

Am 27.05.21 um 23:51 schrieb Marek Olšák:

Hi,

Since Christian believes that we can't deadlock the kernel
with some changes there, we just need to make everything nice
for userspace too. Instead of explaining how it will work, I
will explain the cases where future hardware (and its kernel
driver) will break existing userspace in order to protect
everybody from deadlocks. Anything that uses implicit sync
will be spared, so X and Wayland will be fine, assuming they
don't import/export fences. Those use cases that do
import/export fences might or might not work, depending on
how the fences are used.

One of the necessities is that all fences will become future
fences. The semantics of imported/exported fences will change
completely and will have new restrictions on the usage. The
restrictions are:


1) Android sync files will be impossible to support, so won't
be supported. (they don't allow future fences)


2) Implicit sync and explicit sync will be mutually exclusive
between processes. A process can either use one or the other,
but not both. This is meant to prevent a deadlock condition
with future fences where any process can malevolently
deadlock execution of any other process, even execution of a
higher-privileged process. The kernel will impose the
following restrictions to protect against the deadlock:

a) a process with an implicitly-sync'd imported/exported
buffer can't import/export a fence from/to another process
b) a process with an imported/exported fence can't
import/export an implicitly-sync'd buffer from/to another process

Alternative: A higher-privileged process could enforce both
restrictions instead of the kernel to protect itself from the
deadlock, but this would be a can of worms for existing
userspace. It would be better if the kernel just broke unsafe
userspace on future hw, just like sync files.

If both implicit and explicit sync are 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-05-28 Thread Marek Olšák
My first email can be ignored except for the sync files. Oh well.

I think I see what you mean, Christian. If we assume that an imported fence
is always read only (the buffer with the sequence number is read only),
only the process that created and exported the fence can signal it. If the
fence is not signaled, the exporting process is guilty. The only thing the
importing process must do when it's about to use the fence as a dependency
is to notify the kernel about it. Thus, the kernel will always know the
dependency graph. Then if the importing process times out, the kernel will
blame any of the processes that passed it a fence that is still unsignaled.
The kernel will blame the process that timed out only if all imported
fences have been signaled. It seems pretty robust.
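
A rough sketch of that blame rule (the structures are invented for
illustration, not actual kernel code):

    #include <stdbool.h>

    struct process;                      /* opaque */

    struct export_fence {
        struct process *exporter;        /* only the exporter may signal */
        bool signaled;
    };

    /* Called when 'victim' times out: blame the exporter of any still-
     * unsignaled imported fence; blame the victim itself only when all
     * of its imported fences have signaled. */
    static struct process *
    pick_guilty(struct process *victim, struct export_fence **imports,
                unsigned count)
    {
        for (unsigned i = 0; i < count; i++)
            if (!imports[i]->signaled)
                return imports[i]->exporter;
        return victim;
    }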

It's the same with implicit sync except that the buffer with the sequence
number is writable. Any process that has an implicitly-sync'd buffer can
set the sequence number to 0 or UINT64_MAX. 0 will cause a timeout for the
next job, while UINT64_MAX might cause a timeout a little later. The
timeout can be mitigated by the kernel because the kernel knows the
greatest number that should be there, but it's not possible to know which
process is guilty (all processes holding the buffer handle would be
suspects).
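
The detection side of that mitigation could look like this (a sketch under
the stated assumption that the kernel tracks the greatest sequence number
it ever issued for the buffer; the writer of a forged value still stays
anonymous):

    #include <stdbool.h>
    #include <stdint.h>

    /* Any observed value above everything the kernel ever issued for
     * this buffer can only come from a rogue write through the shared
     * mapping. */
    static bool seqno_is_forged(uint64_t observed, uint64_t max_issued)
    {
        return observed > max_issued;
    }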

Marek

On Fri, May 28, 2021 at 6:25 PM Marek Olšák wrote:

> If both implicit and explicit synchronization are handled the same, then
> the kernel won't be able to identify the process that caused an implicit
> sync deadlock. The process that is stuck waiting for a fence can be
> innocent, and the kernel can't punish it. Likewise, the GPU reset query
> that reports which process is guilty and innocent will only be able to
> report unknown. Is that OK?
>
> Marek
>
> On Fri, May 28, 2021 at 10:41 AM Christian König <
> ckoenig.leichtzumer...@gmail.com> wrote:
>
>> Hi Marek,
>>
>> well I don't think that implicit and explicit synchronization need to be
>> mutually exclusive.
>>
>> What we should do is to have the ability to transport a synchronization
>> object with each BO.
>>
>> Implicit and explicit synchronization then basically become the same,
>> they just transport the synchronization object differently.
>>
>> The biggest problem is the sync_files for Android, since they are really
>> not easy to support at all. If Android wants to support user queues we
>> would probably have to do some changes there.
>>
>> Regards,
>> Christian.
>>
>> Am 27.05.21 um 23:51 schrieb Marek Olšák:
>>
>> Hi,
>>
>> Since Christian believes that we can't deadlock the kernel with some
>> changes there, we just need to make everything nice for userspace too.
>> Instead of explaining how it will work, I will explain the cases where
>> future hardware (and its kernel driver) will break existing userspace in
>> order to protect everybody from deadlocks. Anything that uses implicit sync
>> will be spared, so X and Wayland will be fine, assuming they don't
>> import/export fences. Those use cases that do import/export fences might or
>> might not work, depending on how the fences are used.
>>
>> One of the necessities is that all fences will become future fences. The
>> semantics of imported/exported fences will change completely and will have
>> new restrictions on the usage. The restrictions are:
>>
>>
>> 1) Android sync files will be impossible to support, so won't be
>> supported. (they don't allow future fences)
>>
>>
>> 2) Implicit sync and explicit sync will be mutually exclusive between
>> processes. A process can either use one or the other, but not both. This is
>> meant to prevent a deadlock condition with future fences where any process
>> can malevolently deadlock execution of any other process, even execution of
>> a higher-privileged process. The kernel will impose the following
>> restrictions to protect against the deadlock:
>>
>> a) a process with an implicitly-sync'd imported/exported buffer can't
>> import/export a fence from/to another process
>> b) a process with an imported/exported fence can't import/export an
>> implicitly-sync'd buffer from/to another process
>>
>> Alternative: A higher-privileged process could enforce both restrictions
>> instead of the kernel to protect itself from the deadlock, but this would
>> be a can of worms for existing userspace. It would be better if the kernel
>> just broke unsafe userspace on future hw, just like sync files.
>>
>> If both implicit and explicit sync are allowed to occur simultaneously,
>> sending a future fence that will never signal to any process will deadlock
>> that process after it acquires the implicit sync lock, which is a sequence
>> number that the process is required to write to memory, followed by a GPU
>> interrupt, in a finite time. This is how the deadlock can
>> happen:
>>
>> * The process gets sequence number N from the kernel for an
>> implicitly-sync'd buffer.
>> * The process inserts (into the GPU user-mapped queue) a wait for
>> sequence number N-1.
>> * The 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-05-28 Thread Marek Olšák
If both implicit and explicit synchronization are handled the same, then
the kernel won't be able to identify the process that caused an implicit
sync deadlock. The process that is stuck waiting for a fence can be
innocent, and the kernel can't punish it. Likewise, the GPU reset query
that reports which process is guilty and innocent will only be able to
report unknown. Is that OK?

Marek

On Fri, May 28, 2021 at 10:41 AM Christian König <
ckoenig.leichtzumer...@gmail.com> wrote:

> Hi Marek,
>
> well I don't think that implicit and explicit synchronization need to be
> mutually exclusive.
>
> What we should do is to have the ability to transport a synchronization
> object with each BO.
>
> Implicit and explicit synchronization then basically become the same, they
> just transport the synchronization object differently.
>
> The biggest problem is the sync_files for Android, since they are really
> not easy to support at all. If Android wants to support user queues we
> would probably have to do some changes there.
>
> Regards,
> Christian.
>
> Am 27.05.21 um 23:51 schrieb Marek Olšák:
>
> Hi,
>
> Since Christian believes that we can't deadlock the kernel with some
> changes there, we just need to make everything nice for userspace too.
> Instead of explaining how it will work, I will explain the cases where
> future hardware (and its kernel driver) will break existing userspace in
> order to protect everybody from deadlocks. Anything that uses implicit sync
> will be spared, so X and Wayland will be fine, assuming they don't
> import/export fences. Those use cases that do import/export fences might or
> might not work, depending on how the fences are used.
>
> One of the necessities is that all fences will become future fences. The
> semantics of imported/exported fences will change completely and will have
> new restrictions on the usage. The restrictions are:
>
>
> 1) Android sync files will be impossible to support, so won't be
> supported. (they don't allow future fences)
>
>
> 2) Implicit sync and explicit sync will be mutually exclusive between
> processes. A process can either use one or the other, but not both. This is
> meant to prevent a deadlock condition with future fences where any process
> can malevolently deadlock execution of any other process, even execution of
> a higher-privileged process. The kernel will impose the following
> restrictions to protect against the deadlock:
>
> a) a process with an implicitly-sync'd imported/exported buffer can't
> import/export a fence from/to another process
> b) a process with an imported/exported fence can't import/export an
> implicitly-sync'd buffer from/to another process
>
> Alternative: A higher-privileged process could enforce both restrictions
> instead of the kernel to protect itself from the deadlock, but this would
> be a can of worms for existing userspace. It would be better if the kernel
> just broke unsafe userspace on future hw, just like sync files.
>
> If both implicit and explicit sync are allowed to occur simultaneously,
> sending a future fence that will never signal to any process will deadlock
> that process after it acquires the implicit sync lock, which is a sequence
> number that the process is required to write to memory, followed by a GPU
> interrupt, in a finite time. This is how the deadlock can
> happen:
>
> * The process gets sequence number N from the kernel for an
> implicitly-sync'd buffer.
> * The process inserts (into the GPU user-mapped queue) a wait for sequence
> number N-1.
> * The process inserts a wait for a fence, but it doesn't know that it will
> never signal ==> deadlock.
> ...
> * The process inserts a command to write sequence number N to a
> predetermined memory location. (which will make the buffer idle and send an
> interrupt to the kernel)
> ...
> * The kernel will terminate the process because it has never received the
> interrupt. (i.e. a less-privileged process just killed a more-privileged
> process)
>
> It's the interrupt for implicit sync that never arrived that caused the
> termination, and the only way another process can cause it is by sending a
> fence that will never signal. Thus, importing/exporting fences from/to
> other processes can't be allowed simultaneously with implicit sync.
>
>
> 3) Compositors (and other privileged processes, and display flipping)
> can't trust imported/exported fences. They need a timeout recovery
> mechanism from the beginning, and the following are some possible solutions
> to timeouts:
>
> a) use a CPU wait with a small absolute timeout, and display the previous
> content on timeout
> b) use a GPU wait with a small absolute timeout, and conditional rendering
> will choose between the latest content (if signalled) and previous content
> (if timed out)
>
> The result would be that the desktop can run close to 60 fps even if an
> app runs at 1 fps.
>
> *Redefining imported/exported fences and breaking some users/OSs is the
> only way to have userspace GPU 

Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

2021-05-28 Thread Christian König

Hi Marek,

well I don't think that implicit and explicit synchronization need to 
be mutually exclusive.


What we should do is to have the ability to transport a synchronization 
object with each BO.


Implicit and explicit synchronization then basically become the same, 
they just transport the synchronization object differently.
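
As a sketch with invented structures: the same object is attached to the
BO (implicit sync finds it there) or passed around by handle (explicit
sync); only the transport differs.

    #include <stdint.h>

    struct syncobj {
        uint64_t signaled_value;   /* timeline point reached so far */
    };

    struct bo {
        /* ... backing storage, size, placement ... */
        struct syncobj *sync;      /* travels with the BO on export/import */
    };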


The biggest problem is the sync_files for Android, since they are 
really not easy to support at all. If Android wants to support user 
queues we would probably have to do some changes there.


Regards,
Christian.

Am 27.05.21 um 23:51 schrieb Marek Olšák:

Hi,

Since Christian believes that we can't deadlock the kernel with some 
changes there, we just need to make everything nice for userspace too. 
Instead of explaining how it will work, I will explain the cases where 
future hardware (and its kernel driver) will break existing userspace 
in order to protect everybody from deadlocks. Anything that uses 
implicit sync will be spared, so X and Wayland will be fine, assuming 
they don't import/export fences. Those use cases that do import/export 
fences might or might not work, depending on how the fences are used.


One of the necessities is that all fences will become future fences. 
The semantics of imported/exported fences will change completely and 
will have new restrictions on the usage. The restrictions are:



1) Android sync files will be impossible to support, so won't be 
supported. (they don't allow future fences)



2) Implicit sync and explicit sync will be mutually exclusive between 
processes. A process can either use one or the other, but not both. This 
is meant to prevent a deadlock condition with future fences where any 
process can malevolently deadlock execution of any other process, even 
execution of a higher-privileged process. The kernel will impose the 
following restrictions to protect against the deadlock:


a) a process with an implicitly-sync'd imported/exported buffer can't 
import/export a fence from/to another process
b) a process with an imported/exported fence can't import/export an 
implicitly-sync'd buffer from/to another process


Alternative: A higher-privileged process could enforce both 
restrictions instead of the kernel to protect itself from the 
deadlock, but this would be a can of worms for existing userspace. It 
would be better if the kernel just broke unsafe userspace on future 
hw, just like sync files.


If both implicit and explicit sync are allowed to occur 
simultaneously, sending a future fence that will never signal to any 
process will deadlock that process after it acquires the implicit sync 
lock, which is a sequence number that the process is required to write 
to memory, followed by a GPU interrupt, in a finite time. This is 
how the deadlock can happen:


* The process gets sequence number N from the kernel for an 
implicitly-sync'd buffer.
* The process inserts (into the GPU user-mapped queue) a wait for 
sequence number N-1.
* The process inserts a wait for a fence, but it doesn't know that it 
will never signal ==> deadlock.

...
* The process inserts a command to write sequence number N to a 
predetermined memory location. (which will make the buffer idle and 
send an interrupt to the kernel)

...
* The kernel will terminate the process because it has never received 
the interrupt. (i.e. a less-privileged process just killed a 
more-privileged process)


It's the interrupt for implicit sync that never arrived that caused 
the termination, and the only way another process can cause it is by 
sending a fence that will never signal. Thus, importing/exporting 
fences from/to other processes can't be allowed simultaneously with 
implicit sync.



3) Compositors (and other privileged processes, and display flipping) 
can't trust imported/exported fences. They need a timeout recovery 
mechanism from the beginning, and the following are some possible 
solutions to timeouts:


a) use a CPU wait with a small absolute timeout, and display the 
previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional 
rendering will choose between the latest content (if signalled) and 
previous content (if timed out)


The result would be that the desktop can run close to 60 fps even if 
an app runs at 1 fps.


*Redefining imported/exported fences and breaking some users/OSs is 
the only way to have userspace GPU command submission, and the 
deadlock example here is the counterexample proving that there is no 
other way.*


So, what are the chances this is going to fly with the ecosystem?

Thanks,
Marek




[Mesa-dev] Linux Graphics Next: Userspace submission update

2021-05-27 Thread Marek Olšák
Hi,

Since Christian believes that we can't deadlock the kernel with some
changes there, we just need to make everything nice for userspace too.
Instead of explaining how it will work, I will explain the cases where
future hardware (and its kernel driver) will break existing userspace in
order to protect everybody from deadlocks. Anything that uses implicit sync
will be spared, so X and Wayland will be fine, assuming they don't
import/export fences. Those use cases that do import/export fences might or
might not work, depending on how the fences are used.

One of the necessities is that all fences will become future fences. The
semantics of imported/exported fences will change completely and will have
new restrictions on the usage. The restrictions are:


1) Android sync files will be impossible to support, so won't be supported.
(they don't allow future fences)


2) Implicit sync and explicit sync will be mutually exclusive between
processes. A process can either use one or the other, but not both. This is
meant to prevent a deadlock condition with future fences where any process
can malevolently deadlock execution of any other process, even execution of
a higher-privileged process. The kernel will impose the following
restrictions to protect against the deadlock:

a) a process with an implicitly-sync'd imported/exported buffer can't
import/export a fence from/to another process
b) a process with an imported/exported fence can't import/export an
implicitly-sync'd buffer from/to another process

Alternative: A higher-privileged process could enforce both restrictions
instead of the kernel to protect itself from the deadlock, but this would
be a can of worms for existing userspace. It would be better if the kernel
just broke unsafe userspace on future hw, just like sync files.
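
A sketch of how the kernel could impose restrictions a) and b) (invented
names, not an existing UAPI): a process is tainted by the first sync style
it uses, and the other import/export path is then refused.

    #include <errno.h>

    enum sync_mode { SYNC_NONE, SYNC_IMPLICIT, SYNC_EXPLICIT };

    struct process_ctx {
        enum sync_mode sync_mode;
    };

    static int claim_sync_mode(struct process_ctx *p, enum sync_mode mode)
    {
        if (p->sync_mode == SYNC_NONE)
            p->sync_mode = mode;
        return p->sync_mode == mode ? 0 : -EPERM;
    }

    /* Importing/exporting an implicitly-sync'd buffer would call
     * claim_sync_mode(p, SYNC_IMPLICIT); importing/exporting a fence
     * would call claim_sync_mode(p, SYNC_EXPLICIT). */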

If both implicit and explicit sync are allowed to occur simultaneously,
sending a future fence that will never signal to any process will deadlock
that process after it acquires the implicit sync lock, which is a sequence
number that the process is required to write to memory, followed by a GPU
interrupt, in a finite time. This is how the deadlock can
happen:

* The process gets sequence number N from the kernel for an
implicitly-sync'd buffer.
* The process inserts (into the GPU user-mapped queue) a wait for sequence
number N-1.
* The process inserts a wait for a fence, but it doesn't know that it will
never signal ==> deadlock.
...
* The process inserts a command to write sequence number N to a
predetermined memory location. (which will make the buffer idle and send an
interrupt to the kernel)
...
* The kernel will terminate the process because it has never received the
interrupt. (i.e. a less-privileged process just killed a more-privileged
process)

It's the interrupt for implicit sync that never arrived that caused the
termination, and the only way another process can cause it is by sending a
fence that will never signal. Thus, importing/exporting fences from/to
other processes can't be allowed simultaneously with implicit sync.
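
An illustrative sketch of the user-queue program behind that deadlock
(every identifier is invented; real user-mode queues use vendor-specific
packets, so this only mirrors the steps above):

    #include <stdint.h>

    struct user_queue;
    struct fence_handle;

    /* hypothetical packet emitters, stubbed out here */
    static void queue_wait_seqno(struct user_queue *q, uint64_t *a,
                                 uint64_t v) { (void)q; (void)a; (void)v; }
    static void queue_wait_fence(struct user_queue *q,
                                 struct fence_handle *f) { (void)q; (void)f; }
    static void queue_write_seqno(struct user_queue *q, uint64_t *a,
                                  uint64_t v) { (void)q; (void)a; (void)v; }

    static void submit_frame(struct user_queue *q, uint64_t *buf_seqno,
                             uint64_t n, struct fence_handle *imported)
    {
        queue_wait_seqno(q, buf_seqno, n - 1); /* implicit-sync ordering  */
        queue_wait_fence(q, imported);         /* never signals => stuck  */
        /* ... rendering commands ... */
        queue_write_seqno(q, buf_seqno, n);    /* never executes: the     */
                                               /* kernel never gets the   */
                                               /* interrupt and blames    */
                                               /* this innocent process   */
    }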


3) Compositors (and other privileged processes, and display flipping) can't
trust imported/exported fences. They need a timeout recovery mechanism from
the beginning, and the following are some possible solutions to timeouts:

a) use a CPU wait with a small absolute timeout, and display the previous
content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering
will choose between the latest content (if signalled) and previous content
(if timed out)

The result would be that the desktop can run close to 60 fps even if an app
runs at 1 fps.
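
For option b), a sketch using VK_EXT_conditional_rendering (assuming a
32-bit predicate at offset 0 of predicate_buf that was set to non-zero iff
the client's fence signaled within the timeout; how that predicate gets
written - a timed GPU wait, a compute shader, a kernel helper - is the
hw-specific part and is not shown):

    #include <vulkan/vulkan.h>

    /* Extension entry points would normally be fetched via
     * vkGetDeviceProcAddr; they are called directly here for brevity. */
    static void composite_surface(VkCommandBuffer cmd, VkBuffer predicate_buf)
    {
        VkConditionalRenderingBeginInfoEXT cond = {
            .sType = VK_STRUCTURE_TYPE_CONDITIONAL_RENDERING_BEGIN_INFO_EXT,
            .buffer = predicate_buf,
            .offset = 0,
        };

        vkCmdBeginConditionalRenderingEXT(cmd, &cond);
        /* draw the latest content: executes only if the predicate != 0 */
        vkCmdEndConditionalRenderingEXT(cmd);

        cond.flags = VK_CONDITIONAL_RENDERING_INVERTED_BIT_EXT;
        vkCmdBeginConditionalRenderingEXT(cmd, &cond);
        /* draw the previous content: executes only if the predicate == 0 */
        vkCmdEndConditionalRenderingEXT(cmd);
    }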

*Redefining imported/exported fences and breaking some users/OSs is the
only way to have userspace GPU command submission, and the deadlock example
here is the counterexample proving that there is no other way.*

So, what are the chances this is going to fly with the ecosystem?

Thanks,
Marek