Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Alex Deucher
On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter  wrote:
>
> On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser 
> > > > > > > > >  wrote:
> > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > > > Ok. So that would only make the following use cases 
> > > > > > > > > > > > broken for now:
> > > > > > > > > > > >
> > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will 
> > > > > > > > > > > make us pretty
> > > > > > > > > > > unhappy
> > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup 
> > > > > > > > > > involving
> > > > > > > > > > AMD hardware.
> > > > > > > > > >
> > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer to
> > > > > > > > > > get a clear
> > > > > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > > > > synchronized
> > > > > > > > > > anymore.
> > > > > > > > > It's an upcoming requirement for windows[1], so you are 
> > > > > > > > > likely to
> > > > > > > > > start seeing this across all GPU vendors that support 
> > > > > > > > > windows.  I
> > > > > > > > > think the timing depends on how quickly the legacy hardware 
> > > > > > > > > support
> > > > > > > > > sticks around for each vendor.
> > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be 
> > > > > > > > constructed to not
> > > > > > > > support isolating the ringbuffer at all.
> > > > > > > >
> > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside 
> > > > > > > > of the
> > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping 
> > > > > > > > you have r/o
> > > > > > > > pte flags. Otherwise the entire "share address space with cpu 
> > > > > > > > side,
> > > > > > > > seamlessly" thing is out of the window.
> > > > > > > >
> > > > > > > > And with that r/o bit on the ringbuffer you can once more force 
> > > > > > > > submit
> > > > > > > > through kernel space, and all the legacy dma_fence based stuff 
> > > > > > > > keeps
> > > > > > > > working. And we don't have to invent some horrendous userspace 
> > > > > > > > fence based
> > > > > > > > implicit sync mechanism in the kernel, but can instead do this 
> > > > > > > > transition
> > > > > > > > properly with drm_syncobj timeline explicit sync and protocol 
> > > > > > > > reving.
> > > > > > > >
> > > > > > > > At least I think you'd have to work extra hard to create a gpu 
> > > > > > > > which
> > > > > > > > cannot possibly be intercepted by the kernel, even when it's 
> > > > > > > > designed to
> > > > > > > > support userspace direct submit only.
> > > > > > > >
> > > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > > The upcoming hardware generation will have this hardware 
> > > > > > > scheduler as a
> > > > > > > must have, but there are certain ways we can still stick to the 
> > > > > > > old
> > > > > > > approach:
> > > > > > >
> > > > > > > 1. The new hardware scheduler currently still supports kernel 
> > > > > > > queues which
> > > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > >
> > > > > > > 2. Mapping the top level ring buffer into the VM at least 
> > > > > > > partially solves
> > > > > > > the problem. This way you can't manipulate the ring buffer 
> > > > > > > content, but the
> > > > > > > location for the fence must still be writeable.
> > > > > > Yeah allowing userspace to lie about completion fences in this 
> > > > > > model is
> > > > > > ok. Though I haven't thought through full consequences of that, but 
> > > > > > I
> > > > > > think it's not any worse than userspace lying about which 
> > > > > > buffers/address
> > > > > > it uses in the current model - we rely on hw vm ptes to catch that 
> > > > > > stuff.
> > > > > >
> > > > > > Also it might be good to switch to a non-recoverable ctx model for 
> > > > > > these.
> > > > > > That's already what we do in i915 (opt-in, but all current umd use 
> > > > > > that
> > > > > > mode). So any hang/watchdog just kills the entire ctx and you don't 
> > > > > > have
> > > > > > to worry about userspace doing something funny with its ringbuffer.
> > > > > > Simplifies everything.
> > > > > >
> > > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > > queue up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > > handle dependencies through that still. Not great, but workable.
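
A minimal sketch of the drm_syncobj timeline pattern referred to above, using
libdrm's drmSyncobj* helpers. The render-node path and the timeline point are
assumptions for illustration, and in practice the kernel signals the point on
behalf of a submission rather than userspace calling the signal ioctl directly:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <xf86drm.h>

    int main(void)
    {
        int fd = open("/dev/dri/renderD128", O_RDWR); /* assumed render node */
        uint32_t syncobj;
        uint64_t point = 1;

        if (fd < 0 || drmSyncobjCreate(fd, 0, &syncobj))
            return 1;

        /* Producer: signal timeline point 1 once the work is done. */
        drmSyncobjTimelineSignal(fd, &syncobj, &point, 1);

        /* Consumer: wait for point 1 explicitly, even if the producer has
         * not submitted yet, instead of relying on implicit BO fences. */
        drmSyncobjTimelineWait(fd, &syncobj, &point, 1, INT64_MAX,
                               DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);

        printf("timeline point %llu signaled\n", (unsigned long long)point);
        drmSyncobjDestroy(fd, syncobj);
        return 0;
    }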

Re: [Mesa-dev] Trying to build a opencl dev env

2021-04-28 Thread Luke A. Guest

How do you run the opencl-cts tests on this?



Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-28 Thread Kenneth Graunke
On Monday, April 26, 2021 2:38:53 AM PDT Matthew Auld wrote:
> Add an entry for the new uAPI needed for DG1. Also add the overall
> upstream plan, including some notes for the TTM conversion.
> 
> v2(Daniel):
>   - include the overall upstreaming plan
>   - add a note for mmap, there are differences here for TTM vs i915
>   - bunch of other suggestions from Daniel
> v3:
>  (Daniel)
>   - add a note for set/get caching stuff
>   - add some more docs for existing query and extensions stuff
>   - add an actual code example for regions query
>   - bunch of other stuff
>  (Jason)
>   - uAPI change(!):
>   - try a simpler design with the placements extension
>   - rather than have a generic setparam which can cover multiple
> use cases, have each extension be responsible for one thing
> only
> v4:
>  (Daniel)
>   - add some more notes for ttm conversion
>   - bunch of other stuff
>  (Jason)
>   - uAPI change(!):
>   - drop all the extra rsvd members for the region_query and
> region_info, just keep the bare minimum needed for padding
> 
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Thomas Hellström 
> Cc: Daniele Ceraolo Spurio 
> Cc: Lionel Landwerlin 
> Cc: Jon Bloomfield 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org
> Acked-by: Daniel Vetter 
> Acked-by: Dave Airlie 
> ---
>  Documentation/gpu/rfc/i915_gem_lmem.h   | 212 
>  Documentation/gpu/rfc/i915_gem_lmem.rst | 130 +++
>  Documentation/gpu/rfc/index.rst |   4 +
>  3 files changed, 346 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
>  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst

With or without any of my suggestions,

Patch 7 is:

Acked-by: Kenneth Graunke 

The rest of the series (1-6, 8-9) are:

Reviewed-by: Kenneth Graunke 




Re: [Mesa-dev] [PATCH 6/9] drm/i915/uapi: implement object placement extension

2021-04-28 Thread Kenneth Graunke
On Monday, April 26, 2021 2:38:58 AM PDT Matthew Auld wrote:
> Add new extension to support setting an immutable-priority-list of
> potential placements, at creation time.
> 
> If we use the normal gem_create or gem_create_ext without the
> extensions/placements then we still get the old behaviour with only
> placing the object in system memory.
> 
> v2(Daniel & Jason):
> - Add a bunch of kernel-doc
> - Simplify design for placements extension
> 
> Testcase: igt/gem_create/create-ext-placement-sanity-check
> Testcase: igt/gem_create/create-ext-placement-each
> Testcase: igt/gem_create/create-ext-placement-all
> Signed-off-by: Matthew Auld 
> Signed-off-by: CQ Tang 
> Cc: Joonas Lahtinen 
> Cc: Daniele Ceraolo Spurio 
> Cc: Lionel Landwerlin 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_create.c| 215 --
>  drivers/gpu/drm/i915/gem/i915_gem_object.c|   3 +
>  .../gpu/drm/i915/gem/i915_gem_object_types.h  |   6 +
>  .../drm/i915/gem/selftests/i915_gem_mman.c|  26 +++
>  drivers/gpu/drm/i915/intel_memory_region.c|  16 ++
>  drivers/gpu/drm/i915/intel_memory_region.h|   4 +
>  include/uapi/drm/i915_drm.h   |  62 +
>  7 files changed, 315 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_create.c 
> b/drivers/gpu/drm/i915/gem/i915_gem_create.c
> index 90e9eb6601b5..895f1666a8d3 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_create.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_create.c
> @@ -4,12 +4,47 @@
>   */
>  
>  #include "gem/i915_gem_ioctls.h"
> +#include "gem/i915_gem_lmem.h"
>  #include "gem/i915_gem_region.h"
>  
>  #include "i915_drv.h"
>  #include "i915_trace.h"
>  #include "i915_user_extensions.h"
>  
> +static u32 object_max_page_size(struct drm_i915_gem_object *obj)
> +{
> + u32 max_page_size = 0;
> + int i;
> +
> + for (i = 0; i < obj->mm.n_placements; i++) {
> + struct intel_memory_region *mr = obj->mm.placements[i];
> +
> + GEM_BUG_ON(!is_power_of_2(mr->min_page_size));
> + max_page_size = max_t(u32, max_page_size, mr->min_page_size);
> + }
> +
> + GEM_BUG_ON(!max_page_size);
> + return max_page_size;
> +}
> +
> +static void object_set_placements(struct drm_i915_gem_object *obj,
> +   struct intel_memory_region **placements,
> +   unsigned int n_placements)
> +{
> + GEM_BUG_ON(!n_placements);
> +
> + if (n_placements == 1) {
> + struct intel_memory_region *mr = placements[0];
> + struct drm_i915_private *i915 = mr->i915;
> +
> + obj->mm.placements = &i915->mm.regions[mr->id];
> + obj->mm.n_placements = 1;
> + } else {
> + obj->mm.placements = placements;
> + obj->mm.n_placements = n_placements;
> + }
> +}
> +

I found this helper function rather odd looking at first.  In the
general case, it simply sets fields based on the parameters...but in
the n == 1 case, it goes and uses something else as the array.

On further inspection, this makes sense: normally, we have an array
of multiple placements in priority order.  That array is (essentially)
malloc'd.  But if there's only 1 item, having a malloc'd array of 1
thing is pretty silly.  We can just point at it directly.  Which means
the callers can kfree the array, and the object destructor should not.

Maybe a comment saying

   /* 
* For the common case of one memory region, skip storing an
* allocated array and just point at the region directly.
*/

would be helpful?
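
For reference, a self-contained sketch of the ownership rule described above;
the type and function names are illustrative stand-ins, not the actual i915
structures:

    #include <stdlib.h>

    struct region { int id; };

    struct device {
        struct region *regions[4];  /* canonical, device-lifetime table */
    };

    struct object {
        struct region **placements; /* priority-ordered placement list */
        unsigned int n_placements;
    };

    static void object_set_placements(struct object *obj, struct device *dev,
                                      struct region **placements, unsigned int n)
    {
        if (n == 1) {
            /* Point into the device's canonical table; the caller keeps
             * ownership of (and may free) its one-element array. */
            obj->placements = &dev->regions[placements[0]->id];
            obj->n_placements = 1;
        } else {
            /* Multi-placement case: the object takes ownership of the array. */
            obj->placements = placements;
            obj->n_placements = n;
        }
    }

    static void object_destroy(struct object *obj)
    {
        /* Only the multi-placement case owns heap memory. */
        if (obj->n_placements > 1)
            free(obj->placements);
    }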




Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-28 Thread Kenneth Graunke
On Wednesday, April 28, 2021 9:56:25 AM PDT Jason Ekstrand wrote:
> On Wed, Apr 28, 2021 at 11:41 AM Matthew Auld  wrote:
[snip]
> > Slightly orthogonal: what does Mesa do here for snooped vs LLC
> > platforms? Does it make such a distinction? Just curious.
> 
> In Vulkan on non-LLC platforms, we only enable snooping for things
> that are going to be mapped: staging buffers, state buffers, batches,
> etc.  For anything that's not mapped (tiled images, etc.) we leave
> snooping off on non-LLC platforms so we don't take a hit from it.  In
> GL, I think it works out to be effectively the same but it's a less
> obvious decision there.
> 
> --Jason

iris currently enables snooping on non-LLC platforms when Gallium marks
a resource as PIPE_USAGE_STAGING, which generally means it's going to be
mapped and "fast CPU access" is desired.  Most buffers are not snooped.

I don't believe i965 uses snooping at all, surprisingly.

--Ken




Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-28 Thread Jason Ekstrand
On Wed, Apr 28, 2021 at 11:41 AM Matthew Auld  wrote:
>
> On 28/04/2021 16:51, Jason Ekstrand wrote:
> > On Mon, Apr 26, 2021 at 4:42 AM Matthew Auld  wrote:
> >>
> >> Add an entry for the new uAPI needed for DG1. Also add the overall
> >> upstream plan, including some notes for the TTM conversion.
> >>
> >> v2(Daniel):
> >>- include the overall upstreaming plan
> >>- add a note for mmap, there are differences here for TTM vs i915
> >>- bunch of other suggestions from Daniel
> >> v3:
> >>   (Daniel)
> >>- add a note for set/get caching stuff
> >>- add some more docs for existing query and extensions stuff
> >>- add an actual code example for regions query
> >>- bunch of other stuff
> >>   (Jason)
> >>- uAPI change(!):
> >>  - try a simpler design with the placements extension
> >>  - rather than have a generic setparam which can cover multiple
> >>use cases, have each extension be responsible for one thing
> >>only
> >> v4:
> >>   (Daniel)
> >>- add some more notes for ttm conversion
> >>- bunch of other stuff
> >>   (Jason)
> >>- uAPI change(!):
> >>  - drop all the extra rsvd members for the region_query and
> >>region_info, just keep the bare minimum needed for padding
> >>
> >> Signed-off-by: Matthew Auld 
> >> Cc: Joonas Lahtinen 
> >> Cc: Thomas Hellström 
> >> Cc: Daniele Ceraolo Spurio 
> >> Cc: Lionel Landwerlin 
> >> Cc: Jon Bloomfield 
> >> Cc: Jordan Justen 
> >> Cc: Daniel Vetter 
> >> Cc: Kenneth Graunke 
> >> Cc: Jason Ekstrand 
> >> Cc: Dave Airlie 
> >> Cc: dri-de...@lists.freedesktop.org
> >> Cc: mesa-dev@lists.freedesktop.org
> >> Acked-by: Daniel Vetter 
> >> Acked-by: Dave Airlie 
> >> ---
> >>   Documentation/gpu/rfc/i915_gem_lmem.h   | 212 
> >>   Documentation/gpu/rfc/i915_gem_lmem.rst | 130 +++
> >>   Documentation/gpu/rfc/index.rst |   4 +
> >>   3 files changed, 346 insertions(+)
> >>   create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
> >>   create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst
> >>
> >> diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
> >> b/Documentation/gpu/rfc/i915_gem_lmem.h
> >> new file mode 100644
> >> index ..7ed59b6202d5
> >> --- /dev/null
> >> +++ b/Documentation/gpu/rfc/i915_gem_lmem.h
> >> @@ -0,0 +1,212 @@
> >> +/**
> >> + * enum drm_i915_gem_memory_class - Supported memory classes
> >> + */
> >> +enum drm_i915_gem_memory_class {
> >> +   /** @I915_MEMORY_CLASS_SYSTEM: System memory */
> >> +   I915_MEMORY_CLASS_SYSTEM = 0,
> >> +   /** @I915_MEMORY_CLASS_DEVICE: Device local-memory */
> >> +   I915_MEMORY_CLASS_DEVICE,
> >> +};
> >> +
> >> +/**
> >> + * struct drm_i915_gem_memory_class_instance - Identify particular memory 
> >> region
> >> + */
> >> +struct drm_i915_gem_memory_class_instance {
> >> +   /** @memory_class: See enum drm_i915_gem_memory_class */
> >> +   __u16 memory_class;
> >> +
> >> +   /** @memory_instance: Which instance */
> >> +   __u16 memory_instance;
> >> +};
> >> +
> >> +/**
> >> + * struct drm_i915_memory_region_info - Describes one region as known to 
> >> the
> >> + * driver.
> >> + *
> >> + * Note that we reserve some stuff here for potential future work. As an 
> >> example
> >> + * we might want expose the capabilities(see @caps) for a given region, 
> >> which
> >> + * could include things like if the region is CPU mappable/accessible, 
> >> what are
> >> + * the supported mapping types etc.
> >> + *
> >> + * Note this is using both struct drm_i915_query_item and struct 
> >> drm_i915_query.
> >> + * For this new query we are adding the new query id 
> >> DRM_I915_QUERY_MEMORY_REGIONS
> >> + * at &drm_i915_query_item.query_id.
> >> + */
> >> +struct drm_i915_memory_region_info {
> >> +   /** @region: The class:instance pair encoding */
> >> +   struct drm_i915_gem_memory_class_instance region;
> >> +
> >> +   /** @pad: MBZ */
> >> +   __u32 pad;
> >> +
> >> +   /** @caps: MBZ */
> >> +   __u64 caps;
> >
> > As was commented on another thread somewhere, if we're going to have
> > caps, we should have another __u64 supported_caps which tells
> > userspace what caps the kernel is capable of advertising.  That way
> > userspace can tell the difference between a kernel which doesn't
> > advertise a cap and a kernel which can advertise the cap but where the
> > cap isn't supported.
>
> Yeah, my plan was to just go with rsvd[], so drop the flags/caps for
> now, and add a comment/example for how we plan to extend this in the
> future (using your union + array trick). Hopefully that's reasonable.

That's fine with me too.  Just as long as we have an established plan
that works.

> >> +
> >> +   /** @probed_size: Memory probed by the driver (-1 = unknown) */
> >> +   __u64 probed_size;
> >> +
> >> +   /** @unallocated_size: Estimate of memory remaining (-1 = unknown) 
> >> */
> >> +  

Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-28 Thread Matthew Auld

On 28/04/2021 16:51, Jason Ekstrand wrote:

On Mon, Apr 26, 2021 at 4:42 AM Matthew Auld  wrote:


Add an entry for the new uAPI needed for DG1. Also add the overall
upstream plan, including some notes for the TTM conversion.

v2(Daniel):
   - include the overall upstreaming plan
   - add a note for mmap, there are differences here for TTM vs i915
   - bunch of other suggestions from Daniel
v3:
  (Daniel)
   - add a note for set/get caching stuff
   - add some more docs for existing query and extensions stuff
   - add an actual code example for regions query
   - bunch of other stuff
  (Jason)
   - uAPI change(!):
 - try a simpler design with the placements extension
 - rather than have a generic setparam which can cover multiple
   use cases, have each extension be responsible for one thing
   only
v4:
  (Daniel)
   - add some more notes for ttm conversion
   - bunch of other stuff
  (Jason)
   - uAPI change(!):
 - drop all the extra rsvd members for the region_query and
   region_info, just keep the bare minimum needed for padding

Signed-off-by: Matthew Auld 
Cc: Joonas Lahtinen 
Cc: Thomas Hellström 
Cc: Daniele Ceraolo Spurio 
Cc: Lionel Landwerlin 
Cc: Jon Bloomfield 
Cc: Jordan Justen 
Cc: Daniel Vetter 
Cc: Kenneth Graunke 
Cc: Jason Ekstrand 
Cc: Dave Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: mesa-dev@lists.freedesktop.org
Acked-by: Daniel Vetter 
Acked-by: Dave Airlie 
---
  Documentation/gpu/rfc/i915_gem_lmem.h   | 212 
  Documentation/gpu/rfc/i915_gem_lmem.rst | 130 +++
  Documentation/gpu/rfc/index.rst |   4 +
  3 files changed, 346 insertions(+)
  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst

diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
b/Documentation/gpu/rfc/i915_gem_lmem.h
new file mode 100644
index ..7ed59b6202d5
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_gem_lmem.h
@@ -0,0 +1,212 @@
+/**
+ * enum drm_i915_gem_memory_class - Supported memory classes
+ */
+enum drm_i915_gem_memory_class {
+   /** @I915_MEMORY_CLASS_SYSTEM: System memory */
+   I915_MEMORY_CLASS_SYSTEM = 0,
+   /** @I915_MEMORY_CLASS_DEVICE: Device local-memory */
+   I915_MEMORY_CLASS_DEVICE,
+};
+
+/**
+ * struct drm_i915_gem_memory_class_instance - Identify particular memory 
region
+ */
+struct drm_i915_gem_memory_class_instance {
+   /** @memory_class: See enum drm_i915_gem_memory_class */
+   __u16 memory_class;
+
+   /** @memory_instance: Which instance */
+   __u16 memory_instance;
+};
+
+/**
+ * struct drm_i915_memory_region_info - Describes one region as known to the
+ * driver.
+ *
+ * Note that we reserve some stuff here for potential future work. As an 
example
+ * we might want expose the capabilities(see @caps) for a given region, which
+ * could include things like if the region is CPU mappable/accessible, what are
+ * the supported mapping types etc.
+ *
+ * Note this is using both struct drm_i915_query_item and struct 
drm_i915_query.
+ * For this new query we are adding the new query id 
DRM_I915_QUERY_MEMORY_REGIONS
+ * at &drm_i915_query_item.query_id.
+ */
+struct drm_i915_memory_region_info {
+   /** @region: The class:instance pair encoding */
+   struct drm_i915_gem_memory_class_instance region;
+
+   /** @pad: MBZ */
+   __u32 pad;
+
+   /** @caps: MBZ */
+   __u64 caps;


As was commented on another thread somewhere, if we're going to have
caps, we should have another __u64 supported_caps which tells
userspace what caps the kernel is capable of advertising.  That way
userspace can tell the difference between a kernel which doesn't
advertise a cap and a kernel which can advertise the cap but where the
cap isn't supported.


Yeah, my plan was to just go with rsvd[], so drop the flags/caps for 
now, and add a comment/example for how we plan to extend this in the 
future (using your union + array trick). Hopefully that's reasonable.





+
+   /** @probed_size: Memory probed by the driver (-1 = unknown) */
+   __u64 probed_size;
+
+   /** @unallocated_size: Estimate of memory remaining (-1 = unknown) */
+   __u64 unallocated_size;
+};
+
+/**
+ * struct drm_i915_query_memory_regions
+ *
+ * The region info query enumerates all regions known to the driver by filling
+ * in an array of struct drm_i915_memory_region_info structures.
+ *
+ * Example for getting the list of supported regions:
+ *
+ * .. code-block:: C
+ *
+ * struct drm_i915_query_memory_regions *info;
+ * struct drm_i915_query_item item = {
+ * .query_id = DRM_I915_QUERY_MEMORY_REGIONS;
+ * };
+ * struct drm_i915_query query = {
+ * .num_items = 1,
+ * .items_ptr = (uintptr_t)&item,
+ * };
+ * int err, i;
+ *
+ * // First query the size of the blob we need, this needs to be large
+ * // enough to hold our array 
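
For completeness, a hedged sketch of how that two-pass query looks from
userspace, assuming the DRM_I915_QUERY_MEMORY_REGIONS id and the structures
from the RFC header quoted above are available, and that drm_fd is an open
i915 device node (the num_regions/regions field names follow the RFC):

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    static struct drm_i915_query_memory_regions *query_regions(int drm_fd)
    {
        struct drm_i915_query_item item = {
            .query_id = DRM_I915_QUERY_MEMORY_REGIONS,
        };
        struct drm_i915_query query = {
            .num_items = 1,
            .items_ptr = (uintptr_t)&item,
        };
        struct drm_i915_query_memory_regions *info;

        /* First pass: item.length == 0 asks the kernel for the blob size. */
        if (ioctl(drm_fd, DRM_IOCTL_I915_QUERY, &query) || item.length <= 0)
            return NULL;

        info = calloc(1, item.length);
        if (!info)
            return NULL;

        /* Second pass: hand back a buffer of the reported size to fill in. */
        item.data_ptr = (uintptr_t)info;
        if (ioctl(drm_fd, DRM_IOCTL_I915_QUERY, &query)) {
            free(info);
            return NULL;
        }

        /* Walk the regions, e.g. to find device local-memory. */
        for (uint32_t i = 0; i < info->num_regions; i++) {
            struct drm_i915_memory_region_info *r = &info->regions[i];
            if (r->region.memory_class == I915_MEMORY_CLASS_DEVICE) {
                /* candidate placement for GPU-only buffers */
            }
        }
        return info;
    }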

[Mesa-dev] [ANNOUNCE] mesa 21.1.0-rc3

2021-04-28 Thread Dylan Baker
Hi all,

I'm filling in for Eric this week, but he'll be back next week.  I'd
like to announce mesa 21.1.0-rc3 is now available for general
consumption. As always, this is a Release Candidate, and bug reports for
anything that has regressed are very much needed.

There's a bunch of work here, lots of zink and softpipe, but bits and
pieces of other things: tgsi, freedreno, nir, panfrost, intel, spirv,
core gallium, radv, aco, r600, and core mesa.

Cheers,
Dylan


Shortlog


Alyssa Rosenzweig (1):
  panfrost: Fix formats converting uninit from AFBC

Connor Abbott (2):
  ir3: Prevent oob writes to inputs/outputs array
  nir/lower_clip_disable: Fix store writemask

Dave Airlie (1):
  lavapipe: fix mipmapped resolves.

Dylan Baker (2):
  .pick_status.json: Update to ee9b744cb5d1466960e78b1de44ad345590e348c
  VERSION: bump for 21.1.0-rc3

Eric Engestrom (4):
  .pick_status.json: Mark 8acf361db4190aa5f7c788019d1e42d1df031b81 as 
denominated
  .pick_status.json: Update to 35a28e038107410bb6a733c51cbd267aa79a4b20
  .pick_status.json: Update to 7e905bd00f32b4fa48689a8e6266b145662cfc48
  .pick_status.json: Update to 72eca47c660b6c6051be5a5a80660ae765ecbaa5

Erik Faye-Lund (4):
  zink: do not read outside of array
  zink: do not require vulkan memory model for shader-images
  zink: correct image cap checks
  zink: fix shader-image requirements

Gert Wollny (2):
  Revert "r600: Don't advertise support for scaled int16 vertex formats"
  r600: don't set an index_bias for indirect draw calls

Gustavo Padovan (1):
  traces-iris: fix expectation for Intel GLK

Ian Romanick (2):
  tgsi_exec: Fix NaN behavior of saturate
  tgsi_exec: Fix NaN behavior of min and max

Icecream95 (3):
  pan/bi: Skip nir_opt_move/sink for blend shaders
  panfrost: Fix shader texture count
  pan/decode: Allow frame shader DCDs to be in another BO than the FBD

Jason Ekstrand (2):
  intel/compiler: Don't insert barriers for NULL sources
  anv: Use the same re-order mode for streamout as for GS

Lionel Landwerlin (1):
  spirv: fixup pointer_to/from_ssa with acceleration structures

Marcin Ślusarz (2):
  gallium/u_threaded: implement INTEL_performance_query hooks
  gallium/u_threaded: offload begin/end_intel_perf_query

Marek Olšák (1):
  radeonsi: make the gfx9 DCC MSAA clear shader depend on the number of 
samples

Mauro Rossi (2):
  android: gallium/radeonsi: add nir include path
  android: amd/common: add nir include path

Mike Blumenkrantz (9):
  Revert "zink: force scanout sync when mapping scanout resource"
  softpipe: fix render condition checking
  softpipe: fix streamout queries
  softpipe: ci updates
  zink: track persistent resource objects, not resources
  zink: restore previous semaphore (prev_sem) handling
  zink: use cached memory for staging resources
  zink: only reset query on suspend if the query has previously been stopped
  zink: when performing an implicit reset, sync qbos

Rhys Perry (1):
  radv: disable VK_FORMAT_R64_SFLOAT

Samuel Pitoiset (5):
  radv: fix emitting default depth bounds state on GFX6
  radv/winsys: fix allocating the number of CS in the sysmem path
  radv/winsys: fix resetting the number of padded IB words
  radv: make sure CP DMA is idle before executing secondary command buffers
  radv: fix various CMASK regressions on GFX9

Timothy Arceri (1):
  mesa: fix incomplete GL_NV_half_float implementation

Timur Kristóf (1):
  aco: Mark VCC clobbered for iadd8 and iadd16 reductions on GFX6-7.


git tag: mesa-21.1.0-rc3

https://mesa.freedesktop.org/archive/mesa-21.1.0-rc3.tar.xz
SHA256: 0d12e4ac6067b4f9ec4689561a5700dd792c765f79a0ba2093579f5350882f18  
mesa-21.1.0-rc3.tar.xz
SHA512: 
1668fa8ef1ad61ccf2da243f0c773b1e6f1e54f1cd3637de0567fd1c91e7e7a37d53c6e4cde6c9e487a012317323e2eb81046aacbad1b623b9dbc68abe8b22a1
  mesa-21.1.0-rc3.tar.xz
PGP:  https://mesa.freedesktop.org/archive/mesa-21.1.0-rc3.tar.xz.sig



Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-28 Thread Matthew Auld

On 28/04/2021 16:16, Kenneth Graunke wrote:

On Monday, April 26, 2021 2:38:53 AM PDT Matthew Auld wrote:

+Existing uAPI issues
+
+Some potential issues we still need to resolve.
+
+I915 MMAP
+-
+In i915 there are multiple ways to MMAP GEM object, including mapping the same
+object using different mapping types(WC vs WB), i.e multiple active mmaps per
+object. TTM expects one MMAP at most for the lifetime of the object. If it
+turns out that we have to backpedal here, there might be some potential
+userspace fallout.
+
+I915 SET/GET CACHING
+
+In i915 we have set/get_caching ioctl. TTM doesn't let us to change this, but
+DG1 doesn't support non-snooped pcie transactions, so we can just always
+allocate as WB for smem-only buffers.  If/when our hw gains support for
+non-snooped pcie transactions then we must fix this mode at allocation time as
+a new GEM extension.
+
+This is related to the mmap problem, because in general (meaning, when we're
+not running on intel cpus) the cpu mmap must not, ever, be inconsistent with
+allocation mode.
+
+Possible idea is to let the kernel picks the mmap mode for userspace from the
+following table:
+
+smem-only: WB. Userspace does not need to call clflush.
+
+smem+lmem: We allocate uncached memory, and give userspace a WC mapping
+for when the buffer is in smem, and WC when it's in lmem. GPU does snooped
+access, which is a bit inefficient.


I think you meant to write something different here.  What I read was:

- If it's in SMEM, give them WC
- If it's in LMEM, give them WC

Presumably one of those should have been something else, since otherwise
you would have written "always WC" :)


It should have been "always WC", sorry for the confusion.

"smem+lmem: We only ever allow a single mode, so simply allocate this as 
uncached memory, and always give userspace a WC mapping. GPU still does 
snooped access here (assuming we can't turn it off like on DG1), which is 
a bit inefficient."





+
+lmem only: always WC
+
+This means on discrete you only get a single mmap mode, all others must be
+rejected. That's probably going to be a new default mode or something like
+that.
+
+Links
+=
+[1] https://patchwork.freedesktop.org/series/86798/
+
+[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5599#note_553791
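
With the corrected table, the policy collapses to something very small; a
hypothetical sketch with illustrative names only:

    enum mmap_mode { MMAP_WB, MMAP_WC };

    static enum mmap_mode pick_mmap_mode(int has_smem, int has_lmem)
    {
        /* smem-only objects stay WB (snooped, no clflush needed from
         * userspace); anything that can ever live in lmem gets the single
         * WC mode so the CPU mapping is never inconsistent with the
         * allocation. */
        if (has_smem && !has_lmem)
            return MMAP_WB;
        return MMAP_WC;
    }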



Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-28 Thread Jason Ekstrand
On Mon, Apr 26, 2021 at 4:42 AM Matthew Auld  wrote:
>
> Add an entry for the new uAPI needed for DG1. Also add the overall
> upstream plan, including some notes for the TTM conversion.
>
> v2(Daniel):
>   - include the overall upstreaming plan
>   - add a note for mmap, there are differences here for TTM vs i915
>   - bunch of other suggestions from Daniel
> v3:
>  (Daniel)
>   - add a note for set/get caching stuff
>   - add some more docs for existing query and extensions stuff
>   - add an actual code example for regions query
>   - bunch of other stuff
>  (Jason)
>   - uAPI change(!):
> - try a simpler design with the placements extension
> - rather than have a generic setparam which can cover multiple
>   use cases, have each extension be responsible for one thing
>   only
> v4:
>  (Daniel)
>   - add some more notes for ttm conversion
>   - bunch of other stuff
>  (Jason)
>   - uAPI change(!):
> - drop all the extra rsvd members for the region_query and
>   region_info, just keep the bare minimum needed for padding
>
> Signed-off-by: Matthew Auld 
> Cc: Joonas Lahtinen 
> Cc: Thomas Hellström 
> Cc: Daniele Ceraolo Spurio 
> Cc: Lionel Landwerlin 
> Cc: Jon Bloomfield 
> Cc: Jordan Justen 
> Cc: Daniel Vetter 
> Cc: Kenneth Graunke 
> Cc: Jason Ekstrand 
> Cc: Dave Airlie 
> Cc: dri-de...@lists.freedesktop.org
> Cc: mesa-dev@lists.freedesktop.org
> Acked-by: Daniel Vetter 
> Acked-by: Dave Airlie 
> ---
>  Documentation/gpu/rfc/i915_gem_lmem.h   | 212 
>  Documentation/gpu/rfc/i915_gem_lmem.rst | 130 +++
>  Documentation/gpu/rfc/index.rst |   4 +
>  3 files changed, 346 insertions(+)
>  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.h
>  create mode 100644 Documentation/gpu/rfc/i915_gem_lmem.rst
>
> diff --git a/Documentation/gpu/rfc/i915_gem_lmem.h 
> b/Documentation/gpu/rfc/i915_gem_lmem.h
> new file mode 100644
> index ..7ed59b6202d5
> --- /dev/null
> +++ b/Documentation/gpu/rfc/i915_gem_lmem.h
> @@ -0,0 +1,212 @@
> +/**
> + * enum drm_i915_gem_memory_class - Supported memory classes
> + */
> +enum drm_i915_gem_memory_class {
> +   /** @I915_MEMORY_CLASS_SYSTEM: System memory */
> +   I915_MEMORY_CLASS_SYSTEM = 0,
> +   /** @I915_MEMORY_CLASS_DEVICE: Device local-memory */
> +   I915_MEMORY_CLASS_DEVICE,
> +};
> +
> +/**
> + * struct drm_i915_gem_memory_class_instance - Identify particular memory 
> region
> + */
> +struct drm_i915_gem_memory_class_instance {
> +   /** @memory_class: See enum drm_i915_gem_memory_class */
> +   __u16 memory_class;
> +
> +   /** @memory_instance: Which instance */
> +   __u16 memory_instance;
> +};
> +
> +/**
> + * struct drm_i915_memory_region_info - Describes one region as known to the
> + * driver.
> + *
> + * Note that we reserve some stuff here for potential future work. As an 
> example
> + * we might want expose the capabilities(see @caps) for a given region, which
> + * could include things like if the region is CPU mappable/accessible, what 
> are
> + * the supported mapping types etc.
> + *
> + * Note this is using both struct drm_i915_query_item and struct 
> drm_i915_query.
> + * For this new query we are adding the new query id 
> DRM_I915_QUERY_MEMORY_REGIONS
> + * at &drm_i915_query_item.query_id.
> + */
> +struct drm_i915_memory_region_info {
> +   /** @region: The class:instance pair encoding */
> +   struct drm_i915_gem_memory_class_instance region;
> +
> +   /** @pad: MBZ */
> +   __u32 pad;
> +
> +   /** @caps: MBZ */
> +   __u64 caps;

As was commented on another thread somewhere, if we're going to have
caps, we should have another __u64 supported_caps which tells
userspace what caps the kernel is capable of advertising.  That way
userspace can tell the difference between a kernel which doesn't
advertise a cap and a kernel which can advertise the cap but where the
cap isn't supported.
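
Something like this on the userspace side is what that buys; supported_caps
and the cap bit below are hypothetical names from this proposal, not existing
uAPI:

    #include <stdint.h>

    #define REGION_CAP_CPU_MAPPABLE (1ull << 0)   /* hypothetical cap bit */

    /* Returns 1/0 when the kernel can report the cap, -1 when it is too old
     * to know about it, so callers can fall back to conservative behaviour. */
    static int region_cpu_mappable(uint64_t supported_caps, uint64_t caps)
    {
        if (!(supported_caps & REGION_CAP_CPU_MAPPABLE))
            return -1;
        return !!(caps & REGION_CAP_CPU_MAPPABLE);
    }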

> +
> +   /** @probed_size: Memory probed by the driver (-1 = unknown) */
> +   __u64 probed_size;
> +
> +   /** @unallocated_size: Estimate of memory remaining (-1 = unknown) */
> +   __u64 unallocated_size;
> +};
> +
> +/**
> + * struct drm_i915_query_memory_regions
> + *
> + * The region info query enumerates all regions known to the driver by 
> filling
> + * in an array of struct drm_i915_memory_region_info structures.
> + *
> + * Example for getting the list of supported regions:
> + *
> + * .. code-block:: C
> + *
> + * struct drm_i915_query_memory_regions *info;
> + * struct drm_i915_query_item item = {
> + * .query_id = DRM_I915_QUERY_MEMORY_REGIONS;
> + * };
> + * struct drm_i915_query query = {
> + * .num_items = 1,
> + * .items_ptr = (uintptr_t)&item,
> + * };
> + * int err, i;
> + *
> + * // First query the size of the blob we need, this needs to be large
> + * // enough to hold our array of regions. The

Re: [Mesa-dev] [PATCH 1/9] drm/doc/rfc: i915 DG1 uAPI

2021-04-28 Thread Kenneth Graunke
On Monday, April 26, 2021 2:38:53 AM PDT Matthew Auld wrote:
> +Existing uAPI issues
> +
> +Some potential issues we still need to resolve.
> +
> +I915 MMAP
> +-
> +In i915 there are multiple ways to MMAP GEM object, including mapping the 
> same
> +object using different mapping types(WC vs WB), i.e multiple active mmaps per
> +object. TTM expects one MMAP at most for the lifetime of the object. If it
> +turns out that we have to backpedal here, there might be some potential
> +userspace fallout.
> +
> +I915 SET/GET CACHING
> +
> +In i915 we have set/get_caching ioctl. TTM doesn't let us to change this, but
> +DG1 doesn't support non-snooped pcie transactions, so we can just always
> +allocate as WB for smem-only buffers.  If/when our hw gains support for
> +non-snooped pcie transactions then we must fix this mode at allocation time 
> as
> +a new GEM extension.
> +
> +This is related to the mmap problem, because in general (meaning, when we're
> +not running on intel cpus) the cpu mmap must not, ever, be inconsistent with
> +allocation mode.
> +
> +Possible idea is to let the kernel picks the mmap mode for userspace from the
> +following table:
> +
> +smem-only: WB. Userspace does not need to call clflush.
> +
> +smem+lmem: We allocate uncached memory, and give userspace a WC mapping
> +for when the buffer is in smem, and WC when it's in lmem. GPU does snooped
> +access, which is a bit inefficient.

I think you meant to write something different here.  What I read was:

- If it's in SMEM, give them WC
- If it's in LMEM, give them WC

Presumably one of those should have been something else, since otherwise
you would have written "always WC" :)

> +
> +lmem only: always WC
> +
> +This means on discrete you only get a single mmap mode, all others must be
> +rejected. That's probably going to be a new default mode or something like
> +that.
> +
> +Links
> +=
> +[1] https://patchwork.freedesktop.org/series/86798/
> +
> +[2] 
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/5599#note_553791




Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 16:34 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:

Am 28.04.21 um 15:34 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:

Am 28.04.21 um 14:26 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how quickly the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a
must have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues which
essentially is the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.

Yeah allowing userspace to lie about completion fences in this model is
ok. Though I haven't thought through full consequences of that, but I
think it's not any worse than userspace lying about which buffers/address
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

Thinking more about that approach I don't think that it will work correctly.

See we not only need to write the fence as signal that an IB is submitted,
but also adjust a bunch of privileged hardware registers.

When userspace could do that from its IBs as well then there is nothing
blocking it from reprogramming the page table base address for example.

We could do those writes with the CPU as well, but that would be a huge
performance drop because of the additional latency.

That's not what I'm suggesting. I'm suggesting you have the queue and
everything in userspace, like in windows. Fences are exactly handled like
on windows too. The difference is:

- All new additions to the ringbuffer are done through a kernel ioctl
call, using the drm/scheduler to resolve dependencies.

- Memory management is also done like today in that ioctl.

- TDR makes sure that if userspace abuses the contract (which it can, but
it can do that already today because there's also no command parser to
e.g. stop gpu semaphores) the entire context is shot and terminally
killed. Userspace has to then set up a new one. This isn't how amdgpu
  

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  
> > > > > > > > wrote:
> > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > > > >  wrote:
> > > > > > > > > 
> > > > > > > > > > > Ok. So that would only make the following use cases 
> > > > > > > > > > > broken for now:
> > > > > > > > > > > 
> > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make 
> > > > > > > > > > us pretty
> > > > > > > > > > unhappy
> > > > > > > > > I concur. I have quite a few users with a multi-GPU setup 
> > > > > > > > > involving
> > > > > > > > > AMD hardware.
> > > > > > > > > 
> > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer to 
> > > > > > > > > get a clear
> > > > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > > > synchronized
> > > > > > > > > anymore.
> > > > > > > > It's an upcoming requirement for windows[1], so you are likely 
> > > > > > > > to
> > > > > > > > start seeing this across all GPU vendors that support windows.  
> > > > > > > > I
> > > > > > > > think the timing depends on how quickly the legacy hardware 
> > > > > > > > support
> > > > > > > > sticks around for each vendor.
> > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed 
> > > > > > > to not
> > > > > > > support isolating the ringbuffer at all.
> > > > > > > 
> > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside 
> > > > > > > of the
> > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you 
> > > > > > > have r/o
> > > > > > > pte flags. Otherwise the entire "share address space with cpu 
> > > > > > > side,
> > > > > > > seamlessly" thing is out of the window.
> > > > > > > 
> > > > > > > And with that r/o bit on the ringbuffer you can once more force 
> > > > > > > submit
> > > > > > > through kernel space, and all the legacy dma_fence based stuff 
> > > > > > > keeps
> > > > > > > working. And we don't have to invent some horrendous userspace 
> > > > > > > fence based
> > > > > > > implicit sync mechanism in the kernel, but can instead do this 
> > > > > > > transition
> > > > > > > properly with drm_syncobj timeline explicit sync and protocol 
> > > > > > > reving.
> > > > > > > 
> > > > > > > At least I think you'd have to work extra hard to create a gpu 
> > > > > > > which
> > > > > > > cannot possibly be intercepted by the kernel, even when it's 
> > > > > > > designed to
> > > > > > > support userspace direct submit only.
> > > > > > > 
> > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > The upcoming hardware generation will have this hardware scheduler 
> > > > > > as a
> > > > > > must have, but there are certain ways we can still stick to the old
> > > > > > approach:
> > > > > > 
> > > > > > 1. The new hardware scheduler currently still supports kernel 
> > > > > > queues which
> > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > 
> > > > > > 2. Mapping the top level ring buffer into the VM at least partially 
> > > > > > solves
> > > > > > the problem. This way you can't manipulate the ring buffer content, 
> > > > > > but the
> > > > > > location for the fence must still be writeable.
> > > > > Yeah allowing userspace to lie about completion fences in this model 
> > > > > is
> > > > > ok. Though I haven't thought through full consequences of that, but I
> > > > > think it's not any worse than userspace lying about which 
> > > > > buffers/address
> > > > > it uses in the current model - we rely on hw vm ptes to catch that 
> > > > > stuff.
> > > > > 
> > > > > Also it might be good to switch to a non-recoverable ctx model for 
> > > > > these.
> > > > > That's already what we do in i915 (opt-in, but all current umd use 
> > > > > that
> > > > > mode). So any hang/watchdog just kills the entire ctx and you don't 
> > > > > have
> > > > > to worry about userspace doing something funny with its ringbuffer.
> > > > > Simplifies everything.
> > > > > 
> > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > queue up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > handle dependencies through that still. Not great, but workable.
> > > > > 
> > > > > Thinking about this, not even mapping the ringbuffer r/o is required,
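
The non-recoverable context mode mentioned above is, on i915, just a context
parameter that userspace clears at setup time; a minimal sketch, assuming
ctx_id comes from a prior CONTEXT_CREATE ioctl:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    static int make_context_non_recoverable(int drm_fd, uint32_t ctx_id)
    {
        struct drm_i915_gem_context_param p;

        memset(&p, 0, sizeof(p));
        p.ctx_id = ctx_id;
        p.param = I915_CONTEXT_PARAM_RECOVERABLE;
        p.value = 0;  /* on hang, ban the context instead of recovering it */

        return ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
    }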

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 15:34 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:

Am 28.04.21 um 14:26 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how quickly the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a
must have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues which
essentially is the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.

Yeah allowing userspace to lie about completion fences in this model is
ok. Though I haven't thought through full consequences of that, but I
think it's not any worse than userspace lying about which buffers/address
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

Thinking more about that approach I don't think that it will work correctly.

See we not only need to write the fence as signal that an IB is submitted,
but also adjust a bunch of privileged hardware registers.

When userspace could do that from its IBs as well then there is nothing
blocking it from reprogramming the page table base address for example.

We could do those writes with the CPU as well, but that would be a huge
performance drop because of the additional latency.

That's not what I'm suggesting. I'm suggesting you have the queue and
everything in userspace, like in windows. Fences are exactly handled like
on windows too. The difference is:

- All new additions to the ringbuffer are done through a kernel ioctl
   call, using the drm/scheduler to resolve dependencies.

- Memory management is also done like today in that ioctl.

- TDR makes sure that if userspace abuses the contract (which it can, but
   it can do that already today because there's also no command parser to
   e.g. stop gpu semaphores) the entire context is shot and terminally
   killed. Userspace has to then set up a new one. This isn't how amdgpu
   recovery works right now, but i915 supports it and I think it's also the
   better model for userspace error recov

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  
> > > > > > wrote:
> > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > > > >  wrote:
> > > > > > > 
> > > > > > > > > Ok. So that would only make the following use cases broken 
> > > > > > > > > for now:
> > > > > > > > > 
> > > > > > > > > - amd render -> external gpu
> > > > > > > > > - amd video encode -> network device
> > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us 
> > > > > > > > pretty
> > > > > > > > unhappy
> > > > > > > I concur. I have quite a few users with a multi-GPU setup 
> > > > > > > involving
> > > > > > > AMD hardware.
> > > > > > > 
> > > > > > > Note, if this brokenness can't be avoided, I'd prefer to get a 
> > > > > > > clear
> > > > > > > error, and not bad results on screen because nothing is 
> > > > > > > synchronized
> > > > > > > anymore.
> > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > sticks around for each vendor.
> > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to 
> > > > > not
> > > > > support isolating the ringbuffer at all.
> > > > > 
> > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you 
> > > > > have r/o
> > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > seamlessly" thing is out of the window.
> > > > > 
> > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > working. And we don't have to invent some horrendous userspace fence 
> > > > > based
> > > > > implicit sync mechanism in the kernel, but can instead do this 
> > > > > transition
> > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > 
> > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > cannot possibly be intercepted by the kernel, even when it's designed 
> > > > > to
> > > > > support userspace direct submit only.
> > > > > 
> > > > > Or are your hw engineers more creative here and we're screwed?
> > > > The upcoming hardware generation will have this hardware scheduler as a
> > > > must have, but there are certain ways we can still stick to the old
> > > > approach:
> > > > 
> > > > 1. The new hardware scheduler currently still supports kernel queues 
> > > > which
> > > > essentially is the same as the old hardware ring buffer.
> > > > 
> > > > 2. Mapping the top level ring buffer into the VM at least partially 
> > > > solves
> > > > the problem. This way you can't manipulate the ring buffer content, but 
> > > > the
> > > > location for the fence must still be writeable.
> > > Yeah allowing userspace to lie about completion fences in this model is
> > > ok. Though I haven't thought through full consequences of that, but I
> > > think it's not any worse than userspace lying about which buffers/address
> > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > 
> > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > That's already what we do in i915 (opt-in, but all current umd use that
> > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > to worry about userspace doing something funny with its ringbuffer.
> > > Simplifies everything.
> > > 
> > > Also ofc userspace fencing still disallowed, but since userspace would
> > > queue up all writes to its ringbuffer through the drm/scheduler, we'd
> > > handle dependencies through that still. Not great, but workable.
> > > 
> > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > just that we must queue things through the kernel to resolve dependencies
> > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > shoot it and the kernel stops running that context entirely.
> 
> Thinking more about that approach I don't think that it will work correctly.
> 
> See we not only need to write the fence as signal that an IB is submitted,
> but also adjust a bunch of privileged hardware registers.
> 
> When userspace could do that from its IBs as well then there is nothing
> blocking it from reprogramming the page table base address for example.
> 
> We could do those writes with the CPU as well, but that would be

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 14:26 schrieb Daniel Vetter:

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how long the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol revving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a
must-have, but there are certain ways we can still stick to the old
approach:

1. The new hardware scheduler currently still supports kernel queues,
which are essentially the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially solves
the problem. This way you can't manipulate the ring buffer content, but the
location for the fence must still be writeable.

Yeah, allowing userspace to lie about completion fences in this model is
ok. I haven't thought through the full consequences of that, but I think
it's not any worse than userspace lying about which buffers/addresses it
uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umds use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing is still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.


Thinking more about that approach I don't think that it will work correctly.

See, we not only need to write the fence as a signal that an IB is
submitted, but also adjust a bunch of privileged hardware registers.


If userspace could do that from its IBs as well, then there would be nothing
blocking it from reprogramming the page table base address, for example.


We could do those writes with the CPU as well, but that would be a huge 
performance drop because of the additional latency.


Christian.



So I think even if we have hw with 100% userspace submit model only we
should be still fine. It's ofc silly, because instead of using userspace
fences and gpu semaphores the hw scheduler understands we still take the
detour through drm/scheduler, but at least it's not a break-the-world
event.

Also no page fault support, userptr invalidates still stall until
end-of-batch instead of just preempting it, and all that too. But I mean
there needs to be some motivation to fix this and roll out explicit sync
:-)
-Daniel


Or do I miss something here?


For now and the next hardware we are safe to support the old submission
model, but the functionality of kernel queues will sooner or later go away
if it is only for Linux.

So we need to work on something which works in the long term and get us away
from this implicit sync.

Yeah I think we have pretty clear consensus on that goal, just no one yet
volunteered to get going with the winsys/wayland work to plumb drm_syncobj
through, and the kernel/mesa work to make that optionally a userspace
fence underneath. And it's for sure a lot of work.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Alex Deucher
On Wed, Apr 28, 2021 at 6:31 AM Christian König
 wrote:
>
> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> >> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> >>> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> >>>  wrote:
> >>>
> > Ok. So that would only make the following use cases broken for now:
> >
> > - amd render -> external gpu
> > - amd video encode -> network device
>  FWIW, "only" breaking amd render -> external gpu will make us pretty
>  unhappy
> >>> I concur. I have quite a few users with a multi-GPU setup involving
> >>> AMD hardware.
> >>>
> >>> Note, if this brokenness can't be avoided, I'd prefer to get a clear
> >>> error, and not bad results on screen because nothing is synchronized
> >>> anymore.
> >> It's an upcoming requirement for windows[1], so you are likely to
> >> start seeing this across all GPU vendors that support windows.  I
> >> think the timing depends on how long the legacy hardware support
> >> sticks around for each vendor.
> > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > support isolating the ringbuffer at all.
> >
> > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > pte flags. Otherwise the entire "share address space with cpu side,
> > seamlessly" thing is out of the window.
> >
> > And with that r/o bit on the ringbuffer you can once more force submit
> > through kernel space, and all the legacy dma_fence based stuff keeps
> > working. And we don't have to invent some horrendous userspace fence based
> > implicit sync mechanism in the kernel, but can instead do this transition
> > properly with drm_syncobj timeline explicit sync and protocol revving.
> >
> > At least I think you'd have to work extra hard to create a gpu which
> > cannot possibly be intercepted by the kernel, even when it's designed to
> > support userspace direct submit only.
> >
> > Or are your hw engineers more creative here and we're screwed?
>
> The upcoming hardware generation will have this hardware scheduler as a
> must-have, but there are certain ways we can still stick to the old
> approach:
>
> 1. The new hardware scheduler currently still supports kernel queues,
> which are essentially the same as the old hardware ring buffer.
>
> 2. Mapping the top level ring buffer into the VM at least partially
> solves the problem. This way you can't manipulate the ring buffer
> content, but the location for the fence must still be writeable.
>
> For now and the next hardware we are safe to support the old submission
> model, but the functionality of kernel queues will sooner or later go
> away if it is only for Linux.

Even if it doesn't go away completely, no one else will be using it.
This leaves a lot of under-validated execution paths that lead to
subtle bugs.  When everyone else moved to KIQ for queue management, we
stuck with MMIO for a while in Linux and we ran into tons of subtle
bugs that disappeared when we moved to KIQ.  There were lots of
assumptions about whether software would use the different firmware
interfaces or not, which impacted lots of interactions with clock and
powergating, to name a few.  On top of that, you need to use the scheduler to
utilize stuff like preemption properly.  Also, if you want to do stuff
like gang scheduling (UMD scheduling multiple queues together), it's
really hard to do with kernel software schedulers.

Alex

>
> So we need to work on something which works in the long term and get us
> away from this implicit sync.
>
> Christian.
>
> > -Daniel
>


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Simon Ser
On Wednesday, April 28th, 2021 at 2:21 PM, Daniel Vetter  
wrote:

> Yeah I think we have pretty clear consensus on that goal, just no one yet
> volunteered to get going with the winsys/wayland work to plumb drm_syncobj
> through, and the kernel/mesa work to make that optionally a userspace
> fence underneath. And it's for sure a lot of work.

I'm interested in helping with the winsys/wayland bits, assuming the
following:

- We are pretty confident that drm_syncobj won't be superseded by
  something else in the near future. It seems to me like a lot of
  effort has gone into plumbing sync_file stuff all over, and it
  already needs replacing (I mean, it'll keep working, but we have a
  better replacement now. So compositors which have decided to ignore
  explicit sync for all this time won't have to do the work twice.)
- Plumbing drm_syncobj solves the synchronization issues with upcoming
  AMD hardware, and all of this works fine in cross-vendor multi-GPU
  setups.
- Someone is willing to spend a bit of time bearing with me and
  explaining how this all works. (I only know about sync_file for now,
  I'll start reading the Vulkan bits.)

Are these points something we can agree on?
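
(A minimal sketch of the userspace-visible drm_syncobj timeline API via
libdrm, for anyone else coming at this from the sync_file side. It assumes
an already-open DRM fd, omits all error handling, and is only meant to
illustrate the API shape, not to be code from any existing compositor:)

    #include <stdint.h>
    #include <xf86drm.h>

    static void syncobj_timeline_demo(int drm_fd)
    {
        uint32_t timeline, binary;
        uint64_t point = 1;
        int sync_file_fd;

        /* A syncobj is a container for fences; with timelines every
         * 64-bit point on it can carry its own fence. */
        drmSyncobjCreate(drm_fd, 0, &timeline);

        /* Normally the driver signals a point when the corresponding
         * submission completes (e.g. via a Vulkan timeline semaphore);
         * here we signal it from the CPU purely for illustration. */
        drmSyncobjTimelineSignal(drm_fd, &timeline, &point, 1);

        /* Wait for point 1, also waiting for the fence to materialize
         * first if nothing has been submitted for that point yet. */
        drmSyncobjTimelineWait(drm_fd, &timeline, &point, 1, INT64_MAX,
                               DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);

        /* Interop with today's sync_file world: copy the fence at point 1
         * into a binary syncobj and export it as a sync_file fd. */
        drmSyncobjCreate(drm_fd, 0, &binary);
        drmSyncobjTransfer(drm_fd, binary, 0, timeline, point, 0);
        drmSyncobjExportSyncFile(drm_fd, binary, &sync_file_fd);

        drmSyncobjDestroy(drm_fd, binary);
        drmSyncobjDestroy(drm_fd, timeline);
    }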

Thanks,

Simon


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > > >  wrote:
> > > > > 
> > > > > > > Ok. So that would only make the following use cases broken for 
> > > > > > > now:
> > > > > > > 
> > > > > > > - amd render -> external gpu
> > > > > > > - amd video encode -> network device
> > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > unhappy
> > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > AMD hardware.
> > > > > 
> > > > > Note, if this brokenness can't be avoided, I'd prefer to get a clear
> > > > > error, and not bad results on screen because nothing is synchronized
> > > > > anymore.
> > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > start seeing this across all GPU vendors that support windows.  I
> > > > think the timing depends on how long the legacy hardware support
> > > > sticks around for each vendor.
> > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > support isolating the ringbuffer at all.
> > > 
> > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > pte flags. Otherwise the entire "share address space with cpu side,
> > > seamlessly" thing is out of the window.
> > > 
> > > And with that r/o bit on the ringbuffer you can once more force submit
> > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > working. And we don't have to invent some horrendous userspace fence based
> > > implicit sync mechanism in the kernel, but can instead do this transition
> > > properly with drm_syncobj timeline explicit sync and protocol revving.
> > > 
> > > At least I think you'd have to work extra hard to create a gpu which
> > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > support userspace direct submit only.
> > > 
> > > Or are your hw engineers more creative here and we're screwed?
> > 
> > The upcoming hardware generation will have this hardware scheduler as a
> > must-have, but there are certain ways we can still stick to the old
> > approach:
> > 
> > 1. The new hardware scheduler currently still supports kernel queues,
> > which are essentially the same as the old hardware ring buffer.
> > 
> > 2. Mapping the top level ring buffer into the VM at least partially solves
> > the problem. This way you can't manipulate the ring buffer content, but the
> > location for the fence must still be writeable.
> 
> Yeah, allowing userspace to lie about completion fences in this model is
> ok. I haven't thought through the full consequences of that, but I think
> it's not any worse than userspace lying about which buffers/addresses it
> uses in the current model - we rely on hw vm ptes to catch that stuff.
> 
> Also it might be good to switch to a non-recoverable ctx model for these.
> That's already what we do in i915 (opt-in, but all current umds use that
> mode). So any hang/watchdog just kills the entire ctx and you don't have
> to worry about userspace doing something funny with its ringbuffer.
> Simplifies everything.
> 
> Also ofc userspace fencing is still disallowed, but since userspace would
> queue up all writes to its ringbuffer through the drm/scheduler, we'd
> handle dependencies through that still. Not great, but workable.
> 
> Thinking about this, not even mapping the ringbuffer r/o is required, it's
> just that we must queue things through the kernel to resolve dependencies
> and everything without breaking dma_fence. If userspace lies, tdr will
> shoot it and the kernel stops running that context entirely.
> 
> So I think even if we have hw with 100% userspace submit model only we
> should be still fine. It's ofc silly, because instead of using userspace
> fences and gpu semaphores the hw scheduler understands we still take the
> detour through drm/scheduler, but at least it's not a break-the-world
> event.

Also no page fault support, userptr invalidates still stall until
end-of-batch instead of just preempting it, and all that too. But I mean
there needs to be some motivation to fix this and roll out explicit sync
:-)
-Daniel

> 
> Or do I miss something here?
> 
> > For now and the next hardware we are safe to support the old submission
> > model, but the functionality of kernel queues will sooner or later go away
> > if it is only for Linux.
> > 
> > So we need to work on something which works in the long term and get us away
> > from this implicit sync.
> 
> Yeah I think we have pretty clear consensus on that goal, just no one yet
volunteered to get going with the winsys/wayland work to plumb drm_syncobj
through, and the kernel/mesa work to make that optionally a userspace
fence underneath. And it's for sure a lot of work.

Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> > > >  wrote:
> > > > 
> > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > 
> > > > > > - amd render -> external gpu
> > > > > > - amd video encode -> network device
> > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > unhappy
> > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > AMD hardware.
> > > > 
> > > > Note, if this brokenness can't be avoided, I'd prefer to get a clear
> > > > error, and not bad results on screen because nothing is synchronized
> > > > anymore.
> > > It's an upcoming requirement for windows[1], so you are likely to
> > > start seeing this across all GPU vendors that support windows.  I
> > > think the timing depends on how long the legacy hardware support
> > > sticks around for each vendor.
> > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > support isolating the ringbuffer at all.
> > 
> > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > pte flags. Otherwise the entire "share address space with cpu side,
> > seamlessly" thing is out of the window.
> > 
> > And with that r/o bit on the ringbuffer you can once more force submit
> > through kernel space, and all the legacy dma_fence based stuff keeps
> > working. And we don't have to invent some horrendous userspace fence based
> > implicit sync mechanism in the kernel, but can instead do this transition
> > properly with drm_syncobj timeline explicit sync and protocol revving.
> > 
> > At least I think you'd have to work extra hard to create a gpu which
> > cannot possibly be intercepted by the kernel, even when it's designed to
> > support userspace direct submit only.
> > 
> > Or are your hw engineers more creative here and we're screwed?
> 
> The upcoming hardware generation will have this hardware scheduler as a
> must-have, but there are certain ways we can still stick to the old
> approach:
> 
> 1. The new hardware scheduler currently still supports kernel queues,
> which are essentially the same as the old hardware ring buffer.
> 
> 2. Mapping the top level ring buffer into the VM at least partially solves
> the problem. This way you can't manipulate the ring buffer content, but the
> location for the fence must still be writeable.

Yeah, allowing userspace to lie about completion fences in this model is
ok. I haven't thought through the full consequences of that, but I think
it's not any worse than userspace lying about which buffers/addresses it
uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umds use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.
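
(For reference, that i915 opt-in is a single context param; a minimal
sketch, assuming an already-created GEM context, illustration only:)

    #include <string.h>
    #include <sys/ioctl.h>
    #include <libdrm/i915_drm.h>

    /* Mark a context as non-recoverable: once it is involved in a GPU
     * hang/reset it gets banned instead of being resubmitted. */
    static int i915_ctx_set_non_recoverable(int drm_fd, uint32_t ctx_id)
    {
        struct drm_i915_gem_context_param p;

        memset(&p, 0, sizeof(p));
        p.ctx_id = ctx_id;
        p.param  = I915_CONTEXT_PARAM_RECOVERABLE;
        p.value  = 0;

        return ioctl(drm_fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
    }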

Also ofc userspace fencing is still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

So I think even if we have hw with 100% userspace submit model only we
should be still fine. It's ofc silly, because instead of using userspace
fences and gpu semaphores the hw scheduler understands we still take the
detour through drm/scheduler, but at least it's not a break-the-world
event.

Or do I miss something here?

> For now and the next hardware we are safe to support the old submission
> model, but the functionality of kernel queues will sooner or later go away
> if it is only for Linux.
> 
> So we need to work on something which works in the long term and get us away
> from this implicit sync.

Yeah I think we have pretty clear consensus on that goal, just no one yet
volunteered to get going with the winsys/wayland work to plumb drm_syncobj
through, and the kernel/mesa work to make that optionally a userspace
fence underneath. And it's for sure a lot of work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Christian König

Am 28.04.21 um 12:05 schrieb Daniel Vetter:

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach  
wrote:


Ok. So that would only make the following use cases broken for now:

- amd render -> external gpu
- amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.

It's an upcoming requirement for windows[1], so you are likely to
start seeing this across all GPU vendors that support windows.  I
think the timing depends on how long the legacy hardware support
sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol revving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?


The upcoming hardware generation will have this hardware scheduler as a
must-have, but there are certain ways we can still stick to the old
approach:


1. The new hardware scheduler currently still supports kernel queues,
which are essentially the same as the old hardware ring buffer.


2. Mapping the top level ring buffer into the VM at least partially 
solves the problem. This way you can't manipulate the ring buffer 
content, but the location for the fence must still be writeable.


For now and the next hardware we are safe to support the old submission
model, but the functionality of kernel queues will sooner or later go 
away if it is only for Linux.


So we need to work on something which works in the long term and get us 
away from this implicit sync.


Christian.


-Daniel




Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser  wrote:
> >
> > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach 
> >  wrote:
> >
> > > > Ok. So that would only make the following use cases broken for now:
> > > >
> > > > - amd render -> external gpu
> > > > - amd video encode -> network device
> > >
> > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > unhappy
> >
> > I concur. I have quite a few users with a multi-GPU setup involving
> > AMD hardware.
> >
> > Note, if this brokenness can't be avoided, I'd prefer to get a clear
> > error, and not bad results on screen because nothing is synchronized
> > anymore.
> 
> It's an upcoming requirement for windows[1], so you are likely to
> start seeing this across all GPU vendors that support windows.  I
> think the timing depends on how long the legacy hardware support
> sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol revving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Tue, Apr 27, 2021 at 06:27:27PM +, Simon Ser wrote:
> On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher  
> wrote:
> 
> > It's an upcoming requirement for windows[1], so you are likely to
> > start seeing this across all GPU vendors that support windows. I
> > think the timing depends on how long the legacy hardware support
> > sticks around for each vendor.
> 
> Hm, okay.
> 
> Will using the existing explicit synchronization APIs make it work
> properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync +
> EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)

If you have hw which really _only_ supports userspace direct submission
(i.e. the ringbuffer has to be in the same gpu vm as everything else by
design, and can't be protected at all with e.g. read-only pte entries)
then all that stuff would be broken.
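
(For completeness, the KMS half of those existing APIs is just two atomic
properties. A rough sketch, assuming the plane/crtc ids and the IN_FENCE_FD /
OUT_FENCE_PTR property ids have been looked up beforehand, and leaving out
the rest of the plane state (FB_ID etc.); illustration only:)

    #include <stdint.h>
    #include <xf86drmMode.h>

    static int commit_with_fences(int drm_fd, uint32_t plane_id, uint32_t crtc_id,
                                  uint32_t in_fence_fd_prop, uint32_t out_fence_ptr_prop,
                                  int acquire_fence_fd, int32_t *out_fence_fd)
    {
        drmModeAtomicReq *req = drmModeAtomicAlloc();
        int ret;

        /* sync_file fd KMS must wait on before using the new buffer. */
        drmModeAtomicAddProperty(req, plane_id, in_fence_fd_prop,
                                 (uint64_t)acquire_fence_fd);
        /* KMS writes a sync_file fd here; it signals once this commit has
         * actually been applied (i.e. the old buffers are free again). */
        drmModeAtomicAddProperty(req, crtc_id, out_fence_ptr_prop,
                                 (uint64_t)(uintptr_t)out_fence_fd);

        ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
        drmModeAtomicFree(req);
        return ret;
    }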
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 11:07:09AM +0200, Michel Dänzer wrote:
> On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> > 
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on which 
> >> drivers we need to interoperate with and update them. We've already found 
> >> the path forward for amdgpu. We just need to find out how many other 
> >> drivers need to be updated and evaluate the cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:
> >>
> >> On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
> >> >
> >> > Correct, we wouldn't have synchronization between devices with and
> >> > without user queues any more.
> >> >
> >> > That could only be a problem for A+I Laptops.
> >>
> >> Since I think you mentioned you'd only be enabling this on newer
> >> chipsets, won't it be a problem for A+A where one A is a generation
> >> behind the other?
> >>
> > 
> > Crap, that is a good point as well.
> > 
> >>
> >> I'm not really liking where this is going btw, it seems like an
> >> ill-thought-out concept. If AMD is really going down the road of designing
> >> hw that is currently Linux incompatible, you are going to have to
> >> accept a big part of the burden in bringing this support into more
> >> than just amd drivers for upcoming generations of gpu.
> >>
> > 
> > Well we don't really like that either, but we have no other option as far 
> > as I can see.
> 
> I don't really understand what "future hw may remove support for kernel
> queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the
> queues, or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
> 
> Surely there are resource limits for the per-context queues, so the
> kernel driver needs to do some kind of virtualization / multi-plexing
> anyway, or we'll get sad user faces when there's no queue available for
> .
> 
> I'm probably missing something though, awaiting enlightenment. :)

Yeah, in all this discussion what's unclear to me is: is this a hard amdgpu
requirement going forward? In which case you need a time machine and lots
of people to retroactively fix this, because this ain't fast to get fixed.

Or is this just musings for an ecosystem that better fits current&future
hw, for which I think we all agree where the rough direction is?

The former is quite a glorious situation, and I'm with Dave here that if
your hw engineers really removed the bit to not map the ringbuffers to
userspace, then amd gets to eat a big chunk of the cost here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Daniel Vetter
On Wed, Apr 28, 2021 at 08:59:47AM +0200, Christian König wrote:
> Hi Dave,
> 
> Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > Supporting interop with any device is always possible. It depends on
> > which drivers we need to interoperate with and update them. We've
> > already found the path forward for amdgpu. We just need to find out how
> > many other drivers need to be updated and evaluate the cost/benefit
> > aspect.
> > 
> > Marek
> > 
> > On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:
> > 
> > On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
> > >
> > > Correct, we wouldn't have synchronization between devices with
> > > and without user queues any more.
> > >
> > > That could only be a problem for A+I Laptops.
> > 
> > Since I think you mentioned you'd only be enabling this on newer
> > chipsets, won't it be a problem for A+A where one A is a generation
> > behind the other?
> > 
> 
> Crap, that is a good point as well.
> 
> > 
> > I'm not really liking where this is going btw, it seems like an
> > ill-thought-out concept. If AMD is really going down the road of designing
> > hw that is currently Linux incompatible, you are going to have to
> > accept a big part of the burden in bringing this support into more
> > than just amd drivers for upcoming generations of gpu.
> > 
> 
> Well we don't really like that either, but we have no other option as far as
> I can see.
> 
> I have a couple of ideas for how to handle this in the kernel without
> dma_fences, but it always requires more or less changes to all existing
> drivers.

Yeah one horrible idea is to essentially do the plan we hashed out for
adding userspace fences to drm_syncobj timelines. And then add drm_syncobj
as another implicit fencing thing to dma-buf.

But:
- This is horrible. We're all agreeing that implicit sync is not a great
  idea, building an entire new world on this flawed thing doesn't sound
  like a good path forward.

- It's kernel uapi, so it's going to be forever.

- It's only fixing the correctness issue, since you have to stall for
  future/indefinite fences at the beginning of the CS ioctl. Or at the
  beginning of the atomic modeset ioctl, which kinda defeats the point of
  nonblocking.

- You still have to touch all kmd drivers.

- For performance, you still have to glue a submit thread onto all gl
  drivers.

It is horrendous.
-Daniel

> 
> Christian.
> 
> > 
> > Dave.
> > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal

2021-04-28 Thread Michel Dänzer
On 2021-04-28 8:59 a.m., Christian König wrote:
> Hi Dave,
> 
> Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> Supporting interop with any device is always possible. It depends on which 
>> drivers we need to interoperate with and update them. We've already found 
>> the path forward for amdgpu. We just need to find out how many other drivers 
>> need to be updated and evaluate the cost/benefit aspect.
>>
>> Marek
>>
>> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie wrote:
>>
>> On Tue, 27 Apr 2021 at 22:06, Christian König wrote:
>> >
>> > Correct, we wouldn't have synchronization between devices with and
>> > without user queues any more.
>> >
>> > That could only be a problem for A+I Laptops.
>>
>> Since I think you mentioned you'd only be enabling this on newer
>> chipsets, won't it be a problem for A+A where one A is a generation
>> behind the other?
>>
> 
> Crap, that is a good point as well.
> 
>>
>> I'm not really liking where this is going btw, it seems like an
>> ill-thought-out concept. If AMD is really going down the road of designing
>> hw that is currently Linux incompatible, you are going to have to
>> accept a big part of the burden in bringing this support into more
>> than just amd drivers for upcoming generations of gpu.
>>
> 
> Well we don't really like that either, but we have no other option as far as 
> I can see.

I don't really understand what "future hw may remove support for kernel queues" 
means exactly. While the per-context queues can be mapped to userspace 
directly, they don't *have* to be, do they? I.e. the kernel driver should be 
able to either intercept userspace access to the queues, or in the worst case 
do it all itself, and provide the existing synchronization semantics as needed?

Surely there are resource limits for the per-context queues, so the kernel 
driver needs to do some kind of virtualization / multi-plexing anyway, or we'll 
get sad user faces when there's no queue available for .

I'm probably missing something though, awaiting enlightenment. :)


-- 
Earthling Michel Dänzer   |   https://redhat.com
Libre software enthusiast | Mesa and X developer