Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-19 Thread Marek Olšák
We already don't have accurate BO fences in some cases. Instead, BOs can
have fences which are equal to the last seen command buffer for each queue.
It's practically the same as if the kernel had no visibility into command
submissions and just added a fence into all queues when it needed to wait
for idle. That's already one alternative to BO fences that would work
today. The only BOs that need accurate BO fences are shared buffers, and
those use cases can be converted to explicit fences.
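
As a rough sketch of that per-queue approximation (all names here are made up
for illustration; this is not an existing interface):

#include <stdbool.h>
#include <stdint.h>

#define MAX_QUEUES 8   /* arbitrary for this sketch */

/* Instead of an exact fence per BO, the BO's busy state is approximated by
 * the last submission seen on each queue when the BO was last used.
 */
struct bo_busy_state {
   uint64_t last_seen_submission[MAX_QUEUES];
};

/* The BO counts as idle once every queue has retired past the submission
 * that was current when the BO was last used.
 */
static bool bo_is_idle(const struct bo_busy_state *bo,
                       const uint64_t retired[MAX_QUEUES])
{
   for (int i = 0; i < MAX_QUEUES; i++) {
      if (retired[i] < bo->last_seen_submission[i])
         return false;
   }
   return true;
}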

Removing memory management from all command buffer submission logic would
be one of the more appealing benefits.

You don't need to depend on apps for budgeting and placement determination.
You can sort buffers according to driver usage, e.g. scratch/spill buffers,
shader IO rings, MSAA images, other images, and buffers. Alternatively, you
can have just internal buffers vs app buffers. Then you assign VRAM from
left to right until you reach the quota. This is optional, so this part can
be ignored.
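
A minimal sketch of that left-to-right assignment, with made-up category
names and fields just to illustrate the idea (not an existing API):

#include <stdbool.h>
#include <stdint.h>

enum buf_category {
   BUF_SCRATCH_SPILL,    /* most important for VRAM */
   BUF_SHADER_IO_RING,
   BUF_MSAA_IMAGE,
   BUF_OTHER_IMAGE,
   BUF_OTHER_BUFFER,     /* least important */
   BUF_NUM_CATEGORIES,
};

struct buf {
   enum buf_category category;
   uint64_t size;
   bool in_vram;          /* result of the assignment */
};

static void assign_vram(struct buf *bufs, unsigned num_bufs, uint64_t quota)
{
   uint64_t used = 0;

   /* Walk categories from left (most important) to right and keep placing
    * buffers in VRAM until the quota is reached.
    */
   for (int cat = 0; cat < BUF_NUM_CATEGORIES; cat++) {
      for (unsigned i = 0; i < num_bufs; i++) {
         if (bufs[i].category != cat)
            continue;
         bufs[i].in_vram = used + bufs[i].size <= quota;
         if (bufs[i].in_vram)
            used += bufs[i].size;
      }
   }
}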

>> - A GPU hang signals all fences. Other deadlocks will be handled like
GPU hangs.
>
>What do you mean by "all"?  All fences that were supposed to be
>signaled by the hung context?

Yes, that's one of the possibilities. Any GPU hang followed by a GPU reset
can clear VRAM, so all processes should recreate their contexts and
reinitialize resources. A deadlock caused by userspace could be handled
similarly.

I don't know how timeline fences would work across processes and how
resilient they would be to segfaults.

Marek

On Mon, Apr 19, 2021 at 11:48 AM Jason Ekstrand wrote:

> Not going to comment on everything on the first pass...
>
> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
> >
> > Hi,
> >
> > This is our initial proposal for explicit fences everywhere and new
> memory management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> >
> >
> > 1. Introduction
> > (skip this if you are already sold on explicit fences)
> >
> > The current Linux graphics architecture was initially designed for GPUs
> with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> >
> > The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.
> >
> >
> > 2. Explicit synchronization for window systems and modesetting
> >
> > The producer is an application and the consumer is a compositor or a
> modesetting driver.
> >
> > 2.1. The Present request
> >
> > As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> > - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
> > - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.
>
> I'm not sure syncobj is what we want.  In the Intel world we're trying
> to go even further to something we're calling "userspace fences" which
> are a timeline implemented as a single 64-bit value in some
> CPU-mappable BO.  The client writes a higher value into the BO to
> signal the timeline.  The kernel then provides some helpers for
> waiting on them reliably and without spinning.  I don't expect
> everyone to support these right away but, if we're going to re-plumb
> userspace for explicit synchronization, I'd like to make sure we take
> this into account so we only have to do it once.
>
>
> > Deadlock mitigation to recover from segfaults:
> > - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
>
> This isn't clear to me.  Yes, if we're using anything dma-fence based
> like syncobj, this is true.  But it doesn't seem totally true as a
> general statement.
>
>
> > - If the producer crashes, the kernel signals the submit fence, so that
> the consumer can make forward progress.
> > - If the consumer crashes, the kernel signals the return fence, so that
> the producer can reclaim the buffer.
> > - A GPU hang signals all 

[Mesa-dev] [PATCH v4] drm/doc/rfc: i915 DG1 uAPI

2021-04-19 Thread Matthew Auld
Add an entry for the new uAPI needed for DG1. Also add the overall
upstream plan, including some notes for the TTM conversion.

v2(Daniel):
  - include the overall upstreaming plan
  - add a note for mmap, there are differences here for TTM vs i915
  - bunch of other suggestions from Daniel
v3:
 (Daniel)
  - add a note for set/get caching stuff
  - add some more docs for existing query and extensions stuff
  - add an actual code example for regions query
  - bunch of other stuff
 (Jason)
  - uAPI change(!):
- try a simpler design with the placements extension
- rather than have a generic setparam which can cover multiple
  use cases, have each extension be responsible for one thing
  only
v4:
 (Daniel)
  - add some more notes for ttm conversion
  - bunch of other stuff
 (Jason)
  - uAPI change(!):
- drop all the extra rsvd members for the region_query and
  region_info, just keep the bare minimum needed for padding

Signed-off-by: Matthew Auld 
Cc: Joonas Lahtinen 
Cc: Thomas Hellström 
Cc: Daniele Ceraolo Spurio 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Jordan Justen 
Cc: Daniel Vetter 
Cc: Kenneth Graunke 
Cc: Jason Ekstrand 
Cc: Dave Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: mesa-dev@lists.freedesktop.org
Acked-by: Daniel Vetter 
---
 Documentation/gpu/rfc/i915_gem_lmem.h   | 212 
 Documentation/gpu/rfc/i915_gem_lmem.rst | 130 +++
 Documentation/gpu/rfc/index.rst |   4 +
 3 files changed, 346 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
 create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst

diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
b/Documentation/gpu/rfc/i915_gem_lmem.h
new file mode 100644
index ..7ed59b6202d5
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_gem_lmem.h
@@ -0,0 +1,212 @@
+/**
+ * enum drm_i915_gem_memory_class - Supported memory classes
+ */
+enum drm_i915_gem_memory_class {
+   /** @I915_MEMORY_CLASS_SYSTEM: System memory */
+   I915_MEMORY_CLASS_SYSTEM = 0,
+   /** @I915_MEMORY_CLASS_DEVICE: Device local-memory */
+   I915_MEMORY_CLASS_DEVICE,
+};
+
+/**
+ * struct drm_i915_gem_memory_class_instance - Identify particular memory 
region
+ */
+struct drm_i915_gem_memory_class_instance {
+   /** @memory_class: See enum drm_i915_gem_memory_class */
+   __u16 memory_class;
+
+   /** @memory_instance: Which instance */
+   __u16 memory_instance;
+};
+
+/**
+ * struct drm_i915_memory_region_info - Describes one region as known to the
+ * driver.
+ *
+ * Note that we reserve some stuff here for potential future work. As an example
+ * we might want to expose the capabilities (see @caps) for a given region, which
+ * could include things like whether the region is CPU mappable/accessible, what
+ * the supported mapping types are, etc.
+ *
+ * Note this is using both struct drm_i915_query_item and struct drm_i915_query.
+ * For this new query we are adding the new query id DRM_I915_QUERY_MEMORY_REGIONS
+ * at &drm_i915_query_item.query_id.
+ */
+struct drm_i915_memory_region_info {
+   /** @region: The class:instance pair encoding */
+   struct drm_i915_gem_memory_class_instance region;
+
+   /** @pad: MBZ */
+   __u32 pad;
+
+   /** @caps: MBZ */
+   __u64 caps;
+
+   /** @probed_size: Memory probed by the driver (-1 = unknown) */
+   __u64 probed_size;
+
+   /** @unallocated_size: Estimate of memory remaining (-1 = unknown) */
+   __u64 unallocated_size;
+};
+
+/**
+ * struct drm_i915_query_memory_regions
+ *
+ * The region info query enumerates all regions known to the driver by filling
+ * in an array of struct drm_i915_memory_region_info structures.
+ *
+ * Example for getting the list of supported regions:
+ *
+ * .. code-block:: C
+ *
+ * struct drm_i915_query_memory_regions *info;
+ * struct drm_i915_query_item item = {
+ * .query_id = DRM_I915_QUERY_MEMORY_REGIONS,
+ * };
+ * struct drm_i915_query query = {
+ * .num_items = 1,
+ * .items_ptr = (uintptr_t)&item,
+ * };
+ * int err, i;
+ *
+ * // First query the size of the blob we need, this needs to be large
+ * // enough to hold our array of regions. The kernel will fill out the
+ * // item.length for us, which is the number of bytes we need.
+ * err = ioctl(fd, DRM_IOCTL_I915_QUERY, &query);
+ * if (err) ...
+ *
+ * info = calloc(1, item.length);
+ * // Now that we allocated the required number of bytes, we call the ioctl
+ * // again, this time with the data_ptr pointing to our newly allocated
+ * // blob, which the kernel can then populate with all the region info.
+ * item.data_ptr = (uintptr_t)info;
+ *
+ * err = ioctl(fd, DRM_IOCTL_I915_QUERY, &query);
+ * if (err) ...
+ *
+ * // We can now access each region in the array
+ * for (i = 0; i < info->num_regions; i++) {
+ * struct 

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-19 Thread Jason Ekstrand
Not going to comment on everything on the first pass...

On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák  wrote:
>
> Hi,
>
> This is our initial proposal for explicit fences everywhere and new memory 
> management that doesn't use BO fences. It's a redesign of how Linux graphics 
> drivers work, and it can coexist with what we have now.
>
>
> 1. Introduction
> (skip this if you are already sold on explicit fences)
>
> The current Linux graphics architecture was initially designed for GPUs with 
> only one graphics queue where everything was executed in the submission order 
> and per-BO fences were used for memory management and CPU-GPU 
> synchronization, not GPU-GPU synchronization. Later, multiple queues were 
> added on top, which required the introduction of implicit GPU-GPU 
> synchronization between queues of different processes using per-BO fences. 
> Recently, even parallel execution within one queue was enabled where a 
> command buffer starts draws and compute shaders, but doesn't wait for them, 
> enabling parallelism between back-to-back command buffers. Modesetting also 
> uses per-BO fences for scheduling flips. Our GPU scheduler was created to 
> enable all those use cases, and it's the only reason why the scheduler exists.
>
> The GPU scheduler, implicit synchronization, BO-fence-based memory 
> management, and the tracking of per-BO fences increase CPU overhead and 
> latency, and reduce parallelism. There is a desire to replace all of them 
> with something much simpler. Below is how we could do it.
>
>
> 2. Explicit synchronization for window systems and modesetting
>
> The producer is an application and the consumer is a compositor or a 
> modesetting driver.
>
> 2.1. The Present request
>
> As part of the Present request, the producer will pass 2 fences (sync 
> objects) to the consumer alongside the presented DMABUF BO:
> - The submit fence: Initially unsignalled, it will be signalled when the 
> producer has finished drawing into the presented buffer.
> - The return fence: Initially unsignalled, it will be signalled when the 
> consumer has finished using the presented buffer.

I'm not sure syncobj is what we want.  In the Intel world we're trying
to go even further to something we're calling "userspace fences" which
are a timeline implemented as a single 64-bit value in some
CPU-mappable BO.  The client writes a higher value into the BO to
signal the timeline.  The kernel then provides some helpers for
waiting on them reliably and without spinning.  I don't expect
everyone to support these right away but, if we're going to re-plumb
userspace for explicit synchronization, I'd like to make sure we take
this into account so we only have to do it once.
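
For reference, a minimal userspace-side sketch of such a timeline (names are
hypothetical, and the busy-poll check below is only a stand-in for the kernel
wait helper mentioned above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace fence: a monotonically increasing 64-bit value
 * living in a CPU-mappable BO.
 */
struct userspace_fence {
   _Atomic uint64_t *value;   /* points into the CPU-mapped BO */
};

/* Signal the timeline up to 'point' by writing a higher value. */
static void ufence_signal(struct userspace_fence *f, uint64_t point)
{
   atomic_store_explicit(f->value, point, memory_order_release);
}

/* Check whether the timeline has reached 'point'. A real implementation
 * would call a kernel wait helper instead of spinning on this check.
 */
static bool ufence_has_signaled(struct userspace_fence *f, uint64_t point)
{
   return atomic_load_explicit(f->value, memory_order_acquire) >= point;
}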


> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This 
> information is part of the Present request and supplied by userspace.

This isn't clear to me.  Yes, if we're using anything dma-fence based
like syncobj, this is true.  But it doesn't seem totally true as a
general statement.


> - If the producer crashes, the kernel signals the submit fence, so that the 
> consumer can make forward progress.
> - If the consumer crashes, the kernel signals the return fence, so that the 
> producer can reclaim the buffer.
> - A GPU hang signals all fences. Other deadlocks will be handled like GPU 
> hangs.

What do you mean by "all"?  All fences that were supposed to be
signaled by the hung context?


>
> Other window system requests can follow the same idea.
>
> Merged fences where one fence object contains multiple fences will be 
> supported. A merged fence is signalled only when its fences are signalled. 
> The consumer will have the option to redefine the unsignalled return fence to 
> a merged fence.
>
> 2.2. Modesetting
>
> Since a modesetting driver can also be the consumer, the present ioctl will 
> contain a submit fence and a return fence too. One small problem with this is 
> that userspace can hang the modesetting driver, but in theory, any later 
> present ioctl can override the previous one, so the unsignalled presentation 
> is never used.
>
>
> 3. New memory management
>
> The per-BO fences will be removed and the kernel will not know which buffers 
> are busy. This will reduce CPU overhead and latency. The kernel will not need 
> per-BO fences with explicit synchronization, so we just need to remove their 
> last user: buffer evictions. It also resolves the current OOM deadlock.

Is this even really possible?  I'm no kernel MM expert (trying to
learn some) but my understanding is that the use of per-BO dma-fence
runs deep.  I would like to stop using it for implicit synchronization
to be sure, but I'm not sure I believe the claim that we can get rid
of it entirely.  Happy to see someone try, though.


> 3.1. Evictions
>
> If the kernel wants to move a buffer, it will have to wait for everything to 
> go idle, halt all userspace command submissions, move the 

Re: [Mesa-dev] [PATCH v3 4/4] drm/doc/rfc: i915 DG1 uAPI

2021-04-19 Thread Jason Ekstrand
On Mon, Apr 19, 2021 at 7:02 AM Matthew Auld  wrote:
>
> On 16/04/2021 17:38, Jason Ekstrand wrote:
> > On Thu, Apr 15, 2021 at 11:04 AM Matthew Auld  
> > wrote:
> >>
> >> Add an entry for the new uAPI needed for DG1.
> >>
> >> v2(Daniel):
> >>- include the overall upstreaming plan
> >>- add a note for mmap, there are differences here for TTM vs i915
> >>- bunch of other suggestions from Daniel
> >> v3:
> >>   (Daniel)
> >>- add a note for set/get caching stuff
> >>- add some more docs for existing query and extensions stuff
> >>- add an actual code example for regions query
> >>- bunch of other stuff
> >>   (Jason)
> >>- uAPI change(!):
> >>  - try a simpler design with the placements extension
> >>  - rather than have a generic setparam which can cover multiple
> >>use cases, have each extension be responsible for one thing
> >>only
> >>
> >> Signed-off-by: Matthew Auld 
> >> Cc: Joonas Lahtinen 
> >> Cc: Jordan Justen 
> >> Cc: Daniel Vetter 
> >> Cc: Kenneth Graunke 
> >> Cc: Jason Ekstrand 
> >> Cc: Dave Airlie 
> >> Cc: dri-de...@lists.freedesktop.org
> >> Cc: mesa-dev@lists.freedesktop.org
> >> ---
> >>   Documentation/gpu/rfc/i915_gem_lmem.h   | 255 
> >>   Documentation/gpu/rfc/i915_gem_lmem.rst | 139 +
> >>   Documentation/gpu/rfc/index.rst |   4 +
> >>   3 files changed, 398 insertions(+)
> >>   create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
> >>   create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
> >> b/Documentation/gpu/rfc/i915_gem_lmem.h
> >> new file mode 100644
> >> index ..2a82a452e9f2
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_gem_lmem.h
> >> @@ -0,0 +1,255 @@
> >> +/*
> >> + * Note that drm_i915_query_item and drm_i915_query are existing bits of 
> >> uAPI.
> >> + * For the regions query we are just adding a new query id, so no actual 
> >> new
> >> + * ioctl or anything, but including it here for reference.
> >> + */
> >> +struct drm_i915_query_item {
> >> +#define DRM_I915_QUERY_MEMORY_REGIONS   0xdeadbeaf
> >> +   
> >> +__u64 query_id;
> >> +
> >> +/*
> >> + * When set to zero by userspace, this is filled with the size of 
> >> the
> >> + * data to be written at the data_ptr pointer. The kernel sets 
> >> this
> >> + * value to a negative value to signal an error on a particular 
> >> query
> >> + * item.
> >> + */
> >> +__s32 length;
> >> +
> >> +__u32 flags;
> >> +/*
> >> + * Data will be written at the location pointed by data_ptr when 
> >> the
> >> + * value of length matches the length of the data to be written 
> >> by the
> >> + * kernel.
> >> + */
> >> +__u64 data_ptr;
> >> +};
> >> +
> >> +struct drm_i915_query {
> >> +__u32 num_items;
> >> +/*
> >> + * Unused for now. Must be cleared to zero.
> >> + */
> >> +__u32 flags;
> >> +/*
> >> + * This points to an array of num_items drm_i915_query_item 
> >> structures.
> >> + */
> >> +__u64 items_ptr;
> >> +};
> >> +
> >> +#define DRM_IOCTL_I915_QUERY   DRM_IOWR(DRM_COMMAND_BASE + 
> >> DRM_I915_QUERY, struct drm_i915_query)
> >> +
> >> +/**
> >> + * enum drm_i915_gem_memory_class
> >> + */
> >> +enum drm_i915_gem_memory_class {
> >> +   /** @I915_MEMORY_CLASS_SYSTEM: system memory */
> >> +   I915_MEMORY_CLASS_SYSTEM = 0,
> >> +   /** @I915_MEMORY_CLASS_DEVICE: device local-memory */
> >> +   I915_MEMORY_CLASS_DEVICE,
> >> +};
> >> +
> >> +/**
> >> + * struct drm_i915_gem_memory_class_instance
> >> + */
> >> +struct drm_i915_gem_memory_class_instance {
> >> +   /** @memory_class: see enum drm_i915_gem_memory_class */
> >> +   __u16 memory_class;
> >> +
> >> +   /** @memory_instance: which instance */
> >> +   __u16 memory_instance;
> >> +};
> >> +
> >> +/**
> >> + * struct drm_i915_memory_region_info
> >> + *
> >> + * Describes one region as known to the driver.
> >> + *
> >> + * Note that we reserve quite a lot of stuff here for potential future 
> >> work. As
> >> + * an example we might want expose the capabilities(see caps) for a given
> >> + * region, which could include things like if the region is CPU
> >> + * mappable/accessible etc.
> >
> > I get caps but I'm seriously at a loss as to what the rest of this
> > would be used for.  Why are caps and flags both there and separate?
> > Flags are typically something you set, not query.  Also, what's with
> > rsvd1 at the end?  This smells of substantial over-building to me.
> >
> > I thought to myself, "maybe I'm missing a future use-case" so I looked
> > at the internal tree and none of this is being used there either.
> > This indicates to me that either I'm missing something and there's
> > code somewhere I don't know about 

Re: [Mesa-dev] [PATCH v3 4/4] drm/doc/rfc: i915 DG1 uAPI

2021-04-19 Thread Matthew Auld

On 16/04/2021 17:38, Jason Ekstrand wrote:

On Thu, Apr 15, 2021 at 11:04 AM Matthew Auld  wrote:


Add an entry for the new uAPI needed for DG1.

v2(Daniel):
   - include the overall upstreaming plan
   - add a note for mmap, there are differences here for TTM vs i915
   - bunch of other suggestions from Daniel
v3:
  (Daniel)
   - add a note for set/get caching stuff
   - add some more docs for existing query and extensions stuff
   - add an actual code example for regions query
   - bunch of other stuff
  (Jason)
   - uAPI change(!):
 - try a simpler design with the placements extension
 - rather than have a generic setparam which can cover multiple
   use cases, have each extension be responsible for one thing
   only

Signed-off-by: Matthew Auld 
Cc: Joonas Lahtinen 
Cc: Jordan Justen 
Cc: Daniel Vetter 
Cc: Kenneth Graunke 
Cc: Jason Ekstrand 
Cc: Dave Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: mesa-dev@lists.freedesktop.org
---
  Documentation/gpu/rfc/i915_gem_lmem.h   | 255 
  Documentation/gpu/rfc/i915_gem_lmem.rst | 139 +
  Documentation/gpu/rfc/index.rst |   4 +
  3 files changed, 398 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst

diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
b/Documentation/gpu/rfc/i915_gem_lmem.h
new file mode 100644
index ..2a82a452e9f2
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_gem_lmem.h
@@ -0,0 +1,255 @@
+/*
+ * Note that drm_i915_query_item and drm_i915_query are existing bits of uAPI.
+ * For the regions query we are just adding a new query id, so no actual new
+ * ioctl or anything, but including it here for reference.
+ */
+struct drm_i915_query_item {
+#define DRM_I915_QUERY_MEMORY_REGIONS   0xdeadbeaf
+   
+__u64 query_id;
+
+/*
+ * When set to zero by userspace, this is filled with the size of the
+ * data to be written at the data_ptr pointer. The kernel sets this
+ * value to a negative value to signal an error on a particular query
+ * item.
+ */
+__s32 length;
+
+__u32 flags;
+/*
+ * Data will be written at the location pointed by data_ptr when the
+ * value of length matches the length of the data to be written by the
+ * kernel.
+ */
+__u64 data_ptr;
+};
+
+struct drm_i915_query {
+__u32 num_items;
+/*
+ * Unused for now. Must be cleared to zero.
+ */
+__u32 flags;
+/*
+ * This points to an array of num_items drm_i915_query_item structures.
+ */
+__u64 items_ptr;
+};
+
+#define DRM_IOCTL_I915_QUERY   DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_QUERY, 
struct drm_i915_query)
+
+/**
+ * enum drm_i915_gem_memory_class
+ */
+enum drm_i915_gem_memory_class {
+   /** @I915_MEMORY_CLASS_SYSTEM: system memory */
+   I915_MEMORY_CLASS_SYSTEM = 0,
+   /** @I915_MEMORY_CLASS_DEVICE: device local-memory */
+   I915_MEMORY_CLASS_DEVICE,
+};
+
+/**
+ * struct drm_i915_gem_memory_class_instance
+ */
+struct drm_i915_gem_memory_class_instance {
+   /** @memory_class: see enum drm_i915_gem_memory_class */
+   __u16 memory_class;
+
+   /** @memory_instance: which instance */
+   __u16 memory_instance;
+};
+
+/**
+ * struct drm_i915_memory_region_info
+ *
+ * Describes one region as known to the driver.
+ *
+ * Note that we reserve quite a lot of stuff here for potential future work. As
+ * an example we might want expose the capabilities(see caps) for a given
+ * region, which could include things like if the region is CPU
+ * mappable/accessible etc.


I get caps but I'm seriously at a loss as to what the rest of this
would be used for.  Why are caps and flags both there and separate?
Flags are typically something you set, not query.  Also, what's with
rsvd1 at the end?  This smells of substantial over-building to me.

I thought to myself, "maybe I'm missing a future use-case" so I looked
at the internal tree and none of this is being used there either.
This indicates to me that either I'm missing something and there's
code somewhere I don't know about or, with three years of building on
internal branches, we still haven't proven that any of this is needed.
If it's the latter, which I strongly suspect, maybe we should drop the
unnecessary bits and only add them back in if and when we have proof
that they're useful.


Do you mean just drop caps/flags here, but keep/inflate rsvd0/rsvd1, 
which is less opinionated about future unknowns? If so, makes sense to me.




To be clear, I don't mind the query API as such and the class/instance
stuff seems fine and I really like being able to get the sizes
directly.  What concerns me is all this extra future-proofing that we
have zero proof is actually useful.  In my experience, when you build
out like this without so much as a 

[Mesa-dev] [PATCH v2 4/4] drm/i915/uapi: convert i915_query and friend to kernel doc

2021-04-19 Thread Matthew Auld
Add a note about the two-step process.

v2(Tvrtko):
  - Also document the other method of just passing in a buffer which is
large enough, which avoids two ioctl calls. Can make sense for
smaller query items.
v3: prefer kernel-doc references for structs and members

Suggested-by: Daniel Vetter 
Signed-off-by: Matthew Auld 
Cc: Joonas Lahtinen 
Cc: Tvrtko Ursulin 
Cc: Jordan Justen 
Cc: Daniel Vetter 
Cc: Kenneth Graunke 
Cc: Jason Ekstrand 
Cc: Dave Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: mesa-dev@lists.freedesktop.org
Reviewed-by: Daniel Vetter 
Reviewed-by: Jason Ekstrand 
---
 include/uapi/drm/i915_drm.h | 69 +
 1 file changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index e2867d8cd5e3..6a34243a7646 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -2218,53 +2218,94 @@ struct drm_i915_perf_oa_config {
__u64 flex_regs_ptr;
 };
 
+/**
+ * struct drm_i915_query_item - An individual query for the kernel to process.
+ *
+ * The behaviour is determined by the @query_id. Note that exactly what
+ * @data_ptr is also depends on the specific @query_id.
+ */
 struct drm_i915_query_item {
+   /** @query_id: The id for this query */
__u64 query_id;
 #define DRM_I915_QUERY_TOPOLOGY_INFO1
 #define DRM_I915_QUERY_ENGINE_INFO 2
 #define DRM_I915_QUERY_PERF_CONFIG  3
 /* Must be kept compact -- no holes and well documented */
 
-   /*
+   /**
+* @length:
+*
 * When set to zero by userspace, this is filled with the size of the
-* data to be written at the data_ptr pointer. The kernel sets this
+* data to be written at the @data_ptr pointer. The kernel sets this
 * value to a negative value to signal an error on a particular query
 * item.
 */
__s32 length;
 
-   /*
+   /**
+* @flags:
+*
 * When query_id == DRM_I915_QUERY_TOPOLOGY_INFO, must be 0.
 *
 * When query_id == DRM_I915_QUERY_PERF_CONFIG, must be one of the
-* following :
-* - DRM_I915_QUERY_PERF_CONFIG_LIST
-* - DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID
-* - DRM_I915_QUERY_PERF_CONFIG_FOR_UUID
+* following:
+*
+*  - DRM_I915_QUERY_PERF_CONFIG_LIST
+*  - DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID
+*  - DRM_I915_QUERY_PERF_CONFIG_FOR_UUID
 */
__u32 flags;
 #define DRM_I915_QUERY_PERF_CONFIG_LIST  1
 #define DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_UUID 2
 #define DRM_I915_QUERY_PERF_CONFIG_DATA_FOR_ID   3
 
-   /*
-* Data will be written at the location pointed by data_ptr when the
-* value of length matches the length of the data to be written by the
+   /**
+* @data_ptr:
+*
+* Data will be written at the location pointed by @data_ptr when the
+* value of @length matches the length of the data to be written by the
 * kernel.
 */
__u64 data_ptr;
 };
 
+/**
+ * struct drm_i915_query - Supply an array of struct drm_i915_query_item for 
the
+ * kernel to fill out.
+ *
+ * Note that this is generally a two step process for each struct
+ * drm_i915_query_item in the array:
+ *
+ * 1. Call the DRM_IOCTL_I915_QUERY, giving it our array of struct
+ *    drm_i915_query_item, with &drm_i915_query_item.length set to zero. The
+ *    kernel will then fill in the size, in bytes, which tells userspace how
+ *    much memory it needs to allocate for the blob (say, for an array of
+ *    properties).
+ *
+ * 2. Next we call DRM_IOCTL_I915_QUERY again, this time with the
+ *    &drm_i915_query_item.data_ptr equal to our newly allocated blob. Note that
+ *    the &drm_i915_query_item.length should still be the same as what the
+ *    kernel previously set. At this point the kernel can fill in the blob.
+ *
+ * Note that for some query items it can make sense for userspace to just pass
+ * in a buffer/blob equal to or larger than the required size. In this case only
+ * a single ioctl call is needed. For some smaller query items this can work
+ * quite well.
+ *
+ */
 struct drm_i915_query {
+   /** @num_items: The number of elements in the @items_ptr array */
__u32 num_items;
 
-   /*
-* Unused for now. Must be cleared to zero.
+   /**
+* @flags: Unused for now. Must be cleared to zero.
 */
__u32 flags;
 
-   /*
-* This points to an array of num_items drm_i915_query_item structures.
+   /**
+* @items_ptr:
+*
+* Pointer to an array of struct drm_i915_query_item. The number of
+* array elements is @num_items.
 */
__u64 items_ptr;
 };
-- 
2.26.3

[Mesa-dev] [PATCH v2 3/4] drm/i915/uapi: convert i915_user_extension to kernel doc

2021-04-19 Thread Matthew Auld
Add some example usage for the extension chaining also, which is quite
nifty.

v2: (Daniel)
  - clarify that the name is just some integer, also document that the
name space is not global
v3: prefer kernel-doc references for structs

Suggested-by: Daniel Vetter 
Signed-off-by: Matthew Auld 
Cc: Joonas Lahtinen 
Cc: Jordan Justen 
Cc: Daniel Vetter 
Cc: Kenneth Graunke 
Cc: Jason Ekstrand 
Cc: Dave Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: mesa-dev@lists.freedesktop.org
Reviewed-by: Daniel Vetter 
Reviewed-by: Jason Ekstrand 
---
 include/uapi/drm/i915_drm.h | 54 ++---
 1 file changed, 50 insertions(+), 4 deletions(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 92da48e935d1..e2867d8cd5e3 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -62,8 +62,8 @@ extern "C" {
 #define I915_ERROR_UEVENT  "ERROR"
 #define I915_RESET_UEVENT  "RESET"
 
-/*
- * i915_user_extension: Base class for defining a chain of extensions
+/**
+ * struct i915_user_extension - Base class for defining a chain of extensions
  *
  * Many interfaces need to grow over time. In most cases we can simply
  * extend the struct and have userspace pass in more data. Another option,
@@ -76,12 +76,58 @@ extern "C" {
  * increasing complexity, and for large parts of that interface to be
  * entirely optional. The downside is more pointer chasing; chasing across
  * the __user boundary with pointers encapsulated inside u64.
+ *
+ * Example chaining:
+ *
+ * .. code-block:: C
+ *
+ * struct i915_user_extension ext3 = {
+ * .next_extension = 0, // end
+ * .name = ...,
+ * };
+ * struct i915_user_extension ext2 = {
+ * .next_extension = (uintptr_t)&ext3,
+ * .name = ...,
+ * };
+ * struct i915_user_extension ext1 = {
+ * .next_extension = (uintptr_t)&ext2,
+ * .name = ...,
+ * };
+ *
+ * Typically the struct i915_user_extension would be embedded in some uAPI
+ * struct, and in this case we would feed it the head of the chain (i.e. ext1),
+ * which would then apply all of the above extensions.
+ *
  */
 struct i915_user_extension {
+   /**
+* @next_extension:
+*
+* Pointer to the next struct i915_user_extension, or zero if the end.
+*/
__u64 next_extension;
+   /**
+* @name: Name of the extension.
+*
+* Note that the name here is just some integer.
+*
+* Also note that the name space for this is not global for the whole
+* driver, but rather its scope/meaning is limited to the specific piece
+* of uAPI which has embedded the struct i915_user_extension.
+*/
__u32 name;
-   __u32 flags; /* All undefined bits must be zero. */
-   __u32 rsvd[4]; /* Reserved for future use; must be zero. */
+   /**
+* @flags: MBZ
+*
+* All undefined bits must be zero.
+*/
+   __u32 flags;
+   /**
+* @rsvd: MBZ
+*
+* Reserved for future use; must be zero.
+*/
+   __u32 rsvd[4];
 };
 
 /*
-- 
2.26.3



[Mesa-dev] [PATCH v2 2/4] drm/doc: add section for driver uAPI

2021-04-19 Thread Matthew Auld
Add section for drm/i915 uAPI and pull in i915_drm.h.

Suggested-by: Daniel Vetter 
Signed-off-by: Matthew Auld 
Cc: Joonas Lahtinen 
Cc: Jordan Justen 
Cc: Daniel Vetter 
Cc: Kenneth Graunke 
Cc: Jason Ekstrand 
Cc: Dave Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: mesa-dev@lists.freedesktop.org
Reviewed-by: Daniel Vetter 
---
 Documentation/gpu/driver-uapi.rst | 8 
 Documentation/gpu/index.rst   | 1 +
 2 files changed, 9 insertions(+)
 create mode 100644 Documentation/gpu/driver-uapi.rst

diff --git a/Documentation/gpu/driver-uapi.rst 
b/Documentation/gpu/driver-uapi.rst
new file mode 100644
index ..4411e6919a3d
--- /dev/null
+++ b/Documentation/gpu/driver-uapi.rst
@@ -0,0 +1,8 @@
+===
+DRM Driver uAPI
+===
+
+drm/i915 uAPI
+=
+
+.. kernel-doc:: include/uapi/drm/i915_drm.h
diff --git a/Documentation/gpu/index.rst b/Documentation/gpu/index.rst
index ec4bc72438e4..b9c1214d8f23 100644
--- a/Documentation/gpu/index.rst
+++ b/Documentation/gpu/index.rst
@@ -10,6 +10,7 @@ Linux GPU Driver Developer's Guide
drm-kms
drm-kms-helpers
drm-uapi
+   driver-uapi
drm-client
drivers
backlight
-- 
2.26.3



[Mesa-dev] [PATCH v2 1/4] drm/i915/uapi: fix kernel doc warnings

2021-04-19 Thread Matthew Auld
Fix the cases where it is almost already valid kernel-doc; for the
others, just nerf the warnings for now.

Signed-off-by: Matthew Auld 
Cc: Joonas Lahtinen 
Cc: Jordan Justen 
Cc: Daniel Vetter 
Cc: Kenneth Graunke 
Cc: Jason Ekstrand 
Cc: Dave Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: mesa-dev@lists.freedesktop.org
Reviewed-by: Daniel Vetter 
---
 include/uapi/drm/i915_drm.h | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index ddc47bbf48b6..92da48e935d1 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1054,12 +1054,12 @@ struct drm_i915_gem_exec_fence {
__u32 flags;
 };
 
-/**
+/*
  * See drm_i915_gem_execbuffer_ext_timeline_fences.
  */
 #define DRM_I915_GEM_EXECBUFFER_EXT_TIMELINE_FENCES 0
 
-/**
+/*
  * This structure describes an array of drm_syncobj and associated points for
  * timeline variants of drm_syncobj. It is invalid to append this structure to
  * the execbuf if I915_EXEC_FENCE_ARRAY is set.
@@ -1700,7 +1700,7 @@ struct drm_i915_gem_context_param {
__u64 value;
 };
 
-/**
+/*
  * Context SSEU programming
  *
  * It may be necessary for either functional or performance reason to configure
@@ -2067,7 +2067,7 @@ struct drm_i915_perf_open_param {
__u64 properties_ptr;
 };
 
-/**
+/*
  * Enable data capture for a stream that was either opened in a disabled state
  * via I915_PERF_FLAG_DISABLED or was later disabled via
  * I915_PERF_IOCTL_DISABLE.
@@ -2081,7 +2081,7 @@ struct drm_i915_perf_open_param {
  */
 #define I915_PERF_IOCTL_ENABLE _IO('i', 0x0)
 
-/**
+/*
  * Disable data capture for a stream.
  *
  * It is an error to try and read a stream that is disabled.
@@ -2090,7 +2090,7 @@ struct drm_i915_perf_open_param {
  */
 #define I915_PERF_IOCTL_DISABLE_IO('i', 0x1)
 
-/**
+/*
  * Change metrics_set captured by a stream.
  *
  * If the stream is bound to a specific context, the configuration change
@@ -2103,7 +2103,7 @@ struct drm_i915_perf_open_param {
  */
 #define I915_PERF_IOCTL_CONFIG _IO('i', 0x2)
 
-/**
+/*
  * Common to all i915 perf records
  */
 struct drm_i915_perf_record_header {
@@ -2151,7 +2151,7 @@ enum drm_i915_perf_record_type {
DRM_I915_PERF_RECORD_MAX /* non-ABI */
 };
 
-/**
+/*
  * Structure to upload perf dynamic configuration into the kernel.
  */
 struct drm_i915_perf_oa_config {
@@ -2292,21 +2292,21 @@ struct drm_i915_query_topology_info {
  * Describes one engine and it's capabilities as known to the driver.
  */
 struct drm_i915_engine_info {
-   /** Engine class and instance. */
+   /** @engine: Engine class and instance. */
struct i915_engine_class_instance engine;
 
-   /** Reserved field. */
+   /** @rsvd0: Reserved field. */
__u32 rsvd0;
 
-   /** Engine flags. */
+   /** @flags: Engine flags. */
__u64 flags;
 
-   /** Capabilities of this engine. */
+   /** @capabilities: Capabilities of this engine. */
__u64 capabilities;
 #define I915_VIDEO_CLASS_CAPABILITY_HEVC   (1 << 0)
 #define I915_VIDEO_AND_ENHANCE_CLASS_CAPABILITY_SFC(1 << 1)
 
-   /** Reserved fields. */
+   /** @rsvd1: Reserved fields. */
__u64 rsvd1[4];
 };
 
@@ -2317,13 +2317,13 @@ struct drm_i915_engine_info {
  * an array of struct drm_i915_engine_info structures.
  */
 struct drm_i915_query_engine_info {
-   /** Number of struct drm_i915_engine_info structs following. */
+   /** @num_engines: Number of struct drm_i915_engine_info structs 
following. */
__u32 num_engines;
 
-   /** MBZ */
+   /** @rsvd: MBZ */
__u32 rsvd[3];
 
-   /** Marker for drm_i915_engine_info structures. */
+   /** @engines: Marker for drm_i915_engine_info structures. */
struct drm_i915_engine_info engines[];
 };
 
-- 
2.26.3



[Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-19 Thread Marek Olšák
Hi,

This is our initial proposal for explicit fences everywhere and new memory
management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


*1. Introduction*
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs
with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory
management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.


*2. Explicit synchronization for window systems and modesetting*

The producer is an application and the consumer is a compositor or a
modesetting driver.

*2.1. The Present request*

As part of the Present request, the producer will pass 2 fences (sync
objects) to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the
producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when the
consumer has finished using the presented buffer.
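
A hypothetical sketch of what such a request could carry (the struct and
field names are invented for illustration; no actual protocol layout is being
proposed here):

#include <stdint.h>

/* Hypothetical Present request payload, purely illustrative. */
struct present_request {
   int dmabuf_fd;           /* the presented DMABUF BO */
   uint32_t submit_fence;   /* sync object handle: signalled by the producer
                             * when it has finished drawing into the buffer */
   uint32_t return_fence;   /* sync object handle: signalled by the consumer
                             * when it has finished using the buffer */
};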

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This
information is part of the Present request and supplied by userspace.
- If the producer crashes, the kernel signals the submit fence, so that the
consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so that the
producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled like GPU
hangs.

Other window system requests can follow the same idea.

Merged fences where one fence object contains multiple fences will be
supported. A merged fence is signalled only when its fences are signalled.
The consumer will have the option to redefine the unsignalled return fence
to a merged fence.
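
For comparison, the existing sync_file uAPI already has a merge operation
with exactly these semantics; a minimal sketch of merging two sync_file fds:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

/* Merge two sync_file fds into one fence that only signals once both
 * component fences have signalled. Returns the merged fd or -1 on error.
 */
static int merge_fences(int fd1, int fd2)
{
   struct sync_merge_data data;

   memset(&data, 0, sizeof(data));
   strcpy(data.name, "merged fence");
   data.fd2 = fd2;

   if (ioctl(fd1, SYNC_IOC_MERGE, &data) < 0)
      return -1;

   return data.fence;
}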

*2.2. Modesetting*

Since a modesetting driver can also be the consumer, the present ioctl will
contain a submit fence and a return fence too. One small problem with this
is that userspace can hang the modesetting driver, but in theory, any later
present ioctl can override the previous one, so the unsignalled
presentation is never used.


*3. New memory management*

The per-BO fences will be removed and the kernel will not know which
buffers are busy. This will reduce CPU overhead and latency. The kernel
will not need per-BO fences with explicit synchronization, so we just need
to remove their last user: buffer evictions. It also resolves the current
OOM deadlock.

*3.1. Evictions*

If the kernel wants to move a buffer, it will have to wait for everything
to go idle, halt all userspace command submissions, move the buffer, and
resume everything. This is not expected to happen when memory is not
exhausted. Other more efficient ways of synchronization are also possible
(e.g. sync only one process), but are not discussed here.

*3.2. Per-process VRAM usage quota*

Each process can optionally and periodically query its VRAM usage quota and
change domains of its buffers to obey that quota. For example, a process
allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
GB. The process can change the domains of the least important buffers to
GTT to get the best outcome for itself. If the process doesn't do it, the
kernel will choose which buffers to evict at random. (thanks to Christian
Koenig for this idea)
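
A hypothetical userspace-side sketch of reacting to a reduced quota (nothing
below is an existing API; names are invented for illustration):

#include <stdint.h>

#define DOMAIN_VRAM 0
#define DOMAIN_GTT  1

struct buffer {
   uint64_t size;
   int domain;            /* DOMAIN_VRAM or DOMAIN_GTT */
};

/* Demote the least important buffers to GTT until the process fits within
 * the new quota again. Assumes 'bufs' is sorted by increasing importance;
 * in practice each domain change would be communicated via an ioctl.
 */
static void obey_vram_quota(struct buffer *bufs, unsigned count,
                            uint64_t vram_used, uint64_t vram_quota)
{
   for (unsigned i = 0; i < count && vram_used > vram_quota; i++) {
      if (bufs[i].domain != DOMAIN_VRAM)
         continue;
      bufs[i].domain = DOMAIN_GTT;
      vram_used -= bufs[i].size;
   }
}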

*3.3. Buffer destruction without per-BO fences*

When the buffer destroy ioctl is called, an optional fence list can be
passed to the kernel to indicate when it's safe to deallocate the buffer.
If the fence list is empty, the buffer will be deallocated immediately.
Shared buffers will be handled by merging fence lists from all processes
that destroy them. Mitigation of malicious behavior:
- If userspace destroys a busy buffer, it will get a GPU page fault.
- If userspace sends fences that never signal, the kernel will have a
timeout period and then will proceed to deallocate the buffer anyway.
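
A hypothetical sketch of what the destroy ioctl argument could look like
(struct and field names are invented for illustration):

#include <stdint.h>

/* Hypothetical payload for "destroy this buffer once these fences signal".
 * Nothing like this exists today.
 */
struct destroy_buffer_args {
   uint32_t bo_handle;     /* buffer to destroy */
   uint32_t num_fences;    /* 0 = deallocate immediately */
   uint64_t fences_ptr;    /* userspace pointer to an array of fence handles;
                            * the kernel deallocates the buffer once all of
                            * them have signalled, or after a timeout if they
                            * never do */
};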

*3.4. Other notes on MM*

Overcommitment of GPU-accessible memory will cause an allocation failure or
invoke the OOM