Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional

2019-10-29 Thread Chris Wilson
Quoting Jason Ekstrand (2019-10-29 16:19:09)
> 
> 
> On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson  wrote:
> 
> Quoting Jason Ekstrand (2019-10-25 19:22:04)
> > On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson wrote:
> >
> >     Our existing behaviour is to allow contexts and their GPU requests
> >     to persist past the point of closure until the requests are
> >     complete. This allows clients to operate in a 'fire-and-forget'
> >     manner where they can set up a rendering pipeline, hand it over to
> >     the display server and immediately exit. As the rendering pipeline
> >     is kept alive until completion, the display server (or other
> >     consumer) can use the results in the future and present them to
> >     the user.
> >
> >     However, not all clients want this persistent behaviour and would
> >     prefer that the contexts are cleaned up immediately upon closure.
> >     This ensures that when clients are run without hangchecking, any
> >     GPU hang is terminated with the process and does not continue to
> >     hog resources.
> >
> >     By defining a context property to allow clients to control
> >     persistence explicitly, we can remove the blanket advice to
> >     disable hangchecking that seems to be far too prevalent.
> >
> >
> > Just to be clear, when you say "disable hangchecking" do you mean
> > disabling it for all processes via a kernel parameter at boot time or
> > a sysfs entry or similar?  Or is there some mechanism whereby a
> > context can request no hang checking?
> 
> They are being told to use the module parameter i915.enable_hangcheck=0
> to globally disable hangchecking. That is what we are trying to wean
> them off, while still allowing indefinitely long compute kernels. The
> softer hangcheck instead checks whether you block scheduling or
> preemption of higher-priority work; if you do, you are forcibly removed
> from the GPU. However, even that is too much for some workloads, which
> really do expect to permanently hog the GPU. (All I can say is that
> those had better be dedicated systems, because if you demand
> interactivity on top of disabling preemption...)
> 
> 
> Ok, thinking out loud here (no need for you to respond):  Why should we
> take this approach?  It seems like there are several other ways we
> could solve this:
> 
>  1. Have a per-context flag (that's what we did here)
>  2. Have a per-execbuf flag for "don't allow this execbuf to outlive
>     the process".
>  3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually
>     kill the context
> 
> Option 2 seems like a lot more work in i915 and it doesn't seem that
> advantageous.  Most drivers are going to either want their batches to outlive
> them or not; they aren't going to be making that decision on a per-batch 
> basis.

And processing each batch on context close, deciding how to carry
forward the different options doesn't sound like fun.
 
> Option 3 would work for some cases but it doesn't let the kernel
> terminate work if the client is killed unexpectedly by, for instance, a
> segfault.  The client could try to install a crash handler but calling
> a DRM ioctl from a crash handler sounds like a bad plan.  On the other
> hand, the client can just as easily implement 3 by setting the new
> context flag and then calling GEM_CONTEXT_DESTROY.

Exactly. Abnormal process termination is the name of the game.
 
> With that, I think I'm convinced that a context param is the best way
> to do this.  We may even consider using it in Vulkan when running
> headless to let us kill stuff quicker.  We aren't seeing any
> long-running Vulkan compute workloads yet but they may be coming.

Yeah, might it find a use for GL robustness? If the app wants a
guarantee that its resources are garbage collected along with it?
 
> Acked-by: Jason Ekstrand 
> 
> 
> One more question: Does this bit fully support being turned on and off,
> or is it set-once?  I ask because the way I'd likely go about doing
> this in Vulkan would be to set it on context create and then unset it
> the moment we see a buffer shared with the outside world.

You can set it as many times as you like; it only takes effect on
context termination. That comes back to whether you want to evaluate it
as a batch/execbuf attribute or on the context. The starting point for
the patches was to close the process-termination hole, and this is the
natural extension: letting a context opt into being cancelled on
closure. I think all-or-nothing makes sense, or rather I don't see
enough advantage in the per-batch attribute to justify the processing.
-Chris
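
For illustration, a minimal userspace sketch of the toggle discussed
above -- set or clear the param at any time, with the value only
evaluated when the context is closed. This assumes the
I915_CONTEXT_PARAM_PERSISTENCE definition from this series is present in
the installed i915_drm.h; it is not part of the patch:

/* Sketch, not part of the patch: toggle persistence on an existing
 * context. Assumes I915_CONTEXT_PARAM_PERSISTENCE from this series is
 * available in <drm/i915_drm.h>; error handling left to the caller.
 */
#include <stdbool.h>
#include <stdint.h>
#include <drm/i915_drm.h>
#include <xf86drm.h>

static int context_set_persistence(int fd, uint32_t ctx_id, bool enable)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_PERSISTENCE,
		.value = enable,
		/* .size stays 0; a non-zero size is rejected with -EINVAL */
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}

A client could, for example, clear persistence on a headless compute
context and re-enable it once results are shared with another process.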

Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional

2019-10-29 Thread Jason Ekstrand
On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson wrote:

> Quoting Jason Ekstrand (2019-10-25 19:22:04)
> > On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson wrote:
> >
> > Our existing behaviour is to allow contexts and their GPU requests to
> > persist past the point of closure until the requests are complete.
> > This allows clients to operate in a 'fire-and-forget' manner where
> > they can set up a rendering pipeline, hand it over to the display
> > server and immediately exit. As the rendering pipeline is kept alive
> > until completion, the display server (or other consumer) can use the
> > results in the future and present them to the user.
> >
> > However, not all clients want this persistent behaviour and would
> > prefer that the contexts are cleaned up immediately upon closure.
> > This ensures that when clients are run without hangchecking, any GPU
> > hang is terminated with the process and does not continue to hog
> > resources.
> >
> > By defining a context property to allow clients to control
> > persistence explicitly, we can remove the blanket advice to disable
> > hangchecking that seems to be far too prevalent.
> >
> >
> > Just to be clear, when you say "disable hangchecking" do you mean
> > disabling it for all processes via a kernel parameter at boot time or
> > a sysfs entry or similar?  Or is there some mechanism whereby a
> > context can request no hang checking?
>
> They are being told to use the module parameter i915.enable_hangcheck=0
> to globally disable hangchecking. That is what we are trying to wean
> them off, while still allowing indefinitely long compute kernels. The
> softer hangcheck instead checks whether you block scheduling or
> preemption of higher-priority work; if you do, you are forcibly removed
> from the GPU. However, even that is too much for some workloads, which
> really do expect to permanently hog the GPU. (All I can say is that
> those had better be dedicated systems, because if you demand
> interactivity on top of disabling preemption...)
>

Ok, thinking out loud here (no need for you to respond):  Why should we
take this approach?  It seems like there are several other ways we could
solve this:

 1. Have a per-context flag (that's what we did here)
 2. Have a per-execbuf flag for "don't allow this execbuf to outlive the
process".
 3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually kill
the context

Option 2 seems like a lot more work in i915 and it doesn't seem that
advantageous.  Most drivers are going to either want their batches to
outlive them or not; they aren't going to be making that decision on a
per-batch basis.

Option 3 would work for some cases but it doesn't let the kernel terminate
work if the client is killed unexpectedly by, for instance, a segfault.  The
client could try to install a crash handler but calling a DRM ioctl from a
crash handler sounds like a bad plan.  On the other hand, the client can
just as easily implement 3 by setting the new context flag and then calling
GEM_CONTEXT_DESTROY, as sketched below.
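
A minimal sketch of that "kill it now" pattern, assuming the
I915_CONTEXT_PARAM_PERSISTENCE definition from this series (the patch
does not add a dedicated kill ioctl):

/* Sketch, not from the patch: emulate an explicit kill with the new
 * param -- mark the context non-persistent, then destroy it. Error
 * handling omitted; clearing persistence may fail with -ENODEV if the
 * scheduler cannot preempt.
 */
#include <stdint.h>
#include <drm/i915_drm.h>
#include <xf86drm.h>

static void kill_context_now(int fd, uint32_t ctx_id)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_PERSISTENCE,
		.value = 0,	/* opt out of persistence */
	};
	struct drm_i915_gem_context_destroy d = {
		.ctx_id = ctx_id,
	};

	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
	/* Closing the context now cancels its outstanding requests. */
	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &d);
}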

With that, I think I'm convinced that a context param is the best way to do
this.  We may even consider using it in Vulkan when running headless to let
us kill stuff quicker.  We aren't seeing any long-running Vulkan compute
workloads yet but they may be coming.

Acked-by: Jason Ekstrand 


One more question: Does this bit fully support being turned on and off, or
is it set-once?  I ask because the way I'd likely go about doing this in
Vulkan would be to set it on context create and then unset it the moment we
see a buffer shared with the outside world.

--Jason

Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional

2019-10-25 Thread Chris Wilson
Quoting Jason Ekstrand (2019-10-25 19:22:04)
> On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson  wrote:
> 
> Our existing behaviour is to allow contexts and their GPU requests to
> persist past the point of closure until the requests are complete. This
> allows clients to operate in a 'fire-and-forget' manner where they can
> set up a rendering pipeline, hand it over to the display server and
> immediately exit. As the rendering pipeline is kept alive until
> completion, the display server (or other consumer) can use the results
> in the future and present them to the user.
> 
> However, not all clients want this persistent behaviour and would prefer
> that the contexts are cleaned up immediately upon closure. This ensures
> that when clients are run without hangchecking, any GPU hang is
> terminated with the process and does not continue to hog resources.
> 
> By defining a context property to allow clients to control persistence
> explicitly, we can remove the blanket advice to disable hangchecking
> that seems to be far too prevalent.
> 
> 
> Just to be clear, when you say "disable hangchecking" do you mean disabling it
> for all processes via a kernel parameter at boot time or a sysfs entry or
> similar?  Or is there some mechanism whereby a context can request no hang
> checking?

They are being told to use the module parameter i915.enable_hangcheck=0
to globally disable hangchecking. That is what we are trying to wean
them off, while still allowing indefinitely long compute kernels. The
softer hangcheck instead checks whether you block scheduling or
preemption of higher-priority work; if you do, you are forcibly removed
from the GPU. However, even that is too much for some workloads, which
really do expect to permanently hog the GPU. (All I can say is that
those had better be dedicated systems, because if you demand
interactivity on top of disabling preemption...)
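
Since the series ties persistence to hangcheck (persistence is forced
off when hangcheck is disabled), a client that cares could probe the
module parameter first. A hedged sketch, assuming the usual sysfs
location for i915 module parameters:

/* Sketch, not part of the patch: check whether hangcheck is globally
 * disabled, in which case this series also force-disables context
 * persistence. The sysfs path for the modparam is an assumption here.
 */
#include <stdbool.h>
#include <stdio.h>

static bool i915_hangcheck_enabled(void)
{
	FILE *f = fopen("/sys/module/i915/parameters/enable_hangcheck", "r");
	int c = 'Y';	/* assume enabled if the param cannot be read */

	if (f) {
		c = fgetc(f);	/* bool modparams read back as 'Y' or 'N' */
		fclose(f);
	}
	return c != 'N';
}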

> The default behaviour for new controls is the legacy persistence mode.
> New clients will have to opt out for immediate cleanup on context
> closure. If the hangchecking modparam is disabled, so is persistent
> context support -- all contexts will be terminated on closure.
> 
> 
> What happens to fences when the context is cancelled?  Is it the same behavior
> as we have today for when a GPU hang is detected and a context is banned?

Yes. The incomplete fence statuses are set to -EIO -- it is the very same
mechanism used to remove this context's future work from the GPU as is
used for banning.
-Chris
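
So, from userspace, cancellation is visible on any fence exported for
that work. A sketch, assuming the fence was exported as a sync_file
(e.g. via I915_EXEC_FENCE_OUT) and using the generic sync_file query
rather than anything i915-specific:

/* Sketch, not from the patch: query the status of an exported sync_file
 * fence. A cancelled request reports a negative status (e.g. -EIO).
 */
#include <sys/ioctl.h>
#include <linux/sync_file.h>

static int fence_status(int fence_fd)
{
	struct sync_file_info info = { 0 };

	if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info))
		return -1;

	/* 1 = signalled, 0 = still active, < 0 = error status */
	return info.status;
}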

Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional

2019-10-25 Thread Jason Ekstrand
On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson wrote:

> Our existing behaviour is to allow contexts and their GPU requests to
> persist past the point of closure until the requests are complete. This
> allows clients to operate in a 'fire-and-forget' manner where they can
> set up a rendering pipeline, hand it over to the display server and
> immediately exit. As the rendering pipeline is kept alive until
> completion, the display server (or other consumer) can use the results
> in the future and present them to the user.
>
> However, not all clients want this persistent behaviour and would prefer
> that the contexts are cleaned up immediately upon closure. This ensures
> that when clients are run without hangchecking, any GPU hang is
> terminated with the process and does not continue to hog resources.
>
> By defining a context property to allow clients to control persistence
> explicitly, we can remove the blanket advice to disable hangchecking
> that seems to be far too prevalent.
>

Just to be clear, when you say "disable hangchecking" do you mean disabling
it for all processes via a kernel parameter at boot time or a sysfs entry
or similar?  Or is there some mechanism whereby a context can request no
hang checking?


> The default behaviour for new controls is the legacy persistence mode.
> New clients will have to opt out for immediate cleanup on context
> closure. If the hangchecking modparam is disabled, so is persistent
> context support -- all contexts will be terminated on closure.
>

What happens to fences when the context is cancelled?  Is it the same
behavior as we have today for when a GPU hang is detected and a context is
banned?

--Jason



> Testcase: igt/gem_ctx_persistence
> Signed-off-by: Chris Wilson 
> Cc: Joonas Lahtinen 
> Cc: Michał Winiarski 
> Cc: Jon Bloomfield 
> Reviewed-by: Jon Bloomfield 
> Reviewed-by: Tvrtko Ursulin 
> ---
>  drivers/gpu/drm/i915/gem/i915_gem_context.c   | 50 ++-
>  drivers/gpu/drm/i915/gem/i915_gem_context.h   | 15 ++
>  .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
>  .../gpu/drm/i915/gem/selftests/mock_context.c |  2 +
>  include/uapi/drm/i915_drm.h   | 15 ++
>  5 files changed, 82 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> index 55f1f93c0925..0f1bbf5d1a11 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
> @@ -437,12 +437,39 @@ static void context_close(struct i915_gem_context *ctx)
>  * case we opt to forcibly kill off all remaining requests on
>  * context close.
>  */
> -   if (!i915_modparams.enable_hangcheck)
> +   if (!i915_gem_context_is_persistent(ctx) ||
> +   !i915_modparams.enable_hangcheck)
> kill_context(ctx);
>
> i915_gem_context_put(ctx);
>  }
>
> +static int __context_set_persistence(struct i915_gem_context *ctx, bool state)
> +{
> +   if (i915_gem_context_is_persistent(ctx) == state)
> +   return 0;
> +
> +   if (state) {
> +   /*
> +* Only contexts that are short-lived [that will expire or be
> +* reset] are allowed to survive past termination. We require
> +* hangcheck to ensure that the persistent requests are healthy.
> +*/
> +   if (!i915_modparams.enable_hangcheck)
> +   return -EINVAL;
> +
> +   i915_gem_context_set_persistence(ctx);
> +   } else {
> +   /* To cancel a context we use "preempt-to-idle" */
> +   if (!(ctx->i915->caps.scheduler & I915_SCHEDULER_CAP_PREEMPTION))
> +   return -ENODEV;
> +
> +   i915_gem_context_clear_persistence(ctx);
> +   }
> +
> +   return 0;
> +}
> +
>  static struct i915_gem_context *
>  __create_context(struct drm_i915_private *i915)
>  {
> @@ -477,6 +504,7 @@ __create_context(struct drm_i915_private *i915)
>
> i915_gem_context_set_bannable(ctx);
> i915_gem_context_set_recoverable(ctx);
> +   __context_set_persistence(ctx, true /* cgroup hook? */);
>
> for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++)
> ctx->hang_timestamp[i] = jiffies - CONTEXT_FAST_HANG_JIFFIES;
> @@ -633,6 +661,7 @@ i915_gem_context_create_kernel(struct drm_i915_private *i915, int prio)
> return ctx;
>
> i915_gem_context_clear_bannable(ctx);
> +   i915_gem_context_set_persistence(ctx);
> ctx->sched.priority = I915_USER_PRIORITY(prio);
>
> GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
> @@ -1743,6 +1772,16 @@ get_engines(struct i915_gem_context *ctx,
> return err;
>  }
>
> +static int
> +set_persistence(struct i915_gem_context *ctx,
> +   const struct drm_i915_gem_context_param *args)
> +{
> +   if (args->size)
> +

Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional

2019-10-25 Thread Joonas Lahtinen
Quoting Chris Wilson (2019-10-24 14:40:18)
> Our existing behaviour is to allow contexts and their GPU requests to
> persist past the point of closure until the requests are complete. This
> allows clients to operate in a 'fire-and-forget' manner where they can
> set up a rendering pipeline, hand it over to the display server and
> immediately exit. As the rendering pipeline is kept alive until
> completion, the display server (or other consumer) can use the results
> in the future and present them to the user.
> 
> However, not all clients want this persistent behaviour and would prefer
> that the contexts are cleaned up immediately upon closure. This ensures
> that when clients are run without hangchecking, any GPU hang is
> terminated with the process and does not continue to hog resources.

It's worth mentioning GPU compute workloads of indeterminate length
as an example for increased clarity.

> By defining a context property to allow clients to control persistence
> explicitly, we can remove the blanket advice to disable hangchecking
> that seems to be far too prevalent.

I would drop this paragraph from this patch, as it doesn't (yet) make
sense.

> The default behaviour for new controls is the legacy persistence mode.
> New clients will have to opt out for immediate cleanup on context
> closure.

"... have to opt in to immediate ..." reads more clear.

> If the hangchecking modparam is disabled, so is persistent
> context support -- all contexts will be terminated on closure.

Let's add here that it has actually been a source of bug reports
in the past that we don't terminate the workloads. So we expect
this behaviour change to be welcomed by the compute users who
have been instructed to disable hangcheck in the past.

A couple of comments below. But anyway, with the uAPI comment
and commit message clarified this is:

Reviewed-by: Joonas Lahtinen 

Still needs the Link:, though.

> Testcase: igt/gem_ctx_persistence
> Signed-off-by: Chris Wilson 
> Cc: Joonas Lahtinen 
> Cc: Michał Winiarski 
> Cc: Jon Bloomfield 
> Reviewed-by: Jon Bloomfield 
> Reviewed-by: Tvrtko Ursulin 



> +static int __context_set_persistence(struct i915_gem_context *ctx, bool state)
> +{
> +   if (i915_gem_context_is_persistent(ctx) == state)
> +   return 0;
> +
> +   if (state) {
> +   /*
> +* Only contexts that are short-lived [that will expire or be
> +* reset] are allowed to survive past termination. We require
> +* hangcheck to ensure that the persistent requests are healthy.
> +*/
> +   if (!i915_modparams.enable_hangcheck)
> +   return -EINVAL;

This is slightly confusing as the default is to enable persistence.

Disabling and re-enabling would result in -EINVAL. But I guess it's no
problem to lift such a restriction later.
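
The other branch of __context_set_persistence carries its own
restriction: clearing persistence requires preempt-to-idle, otherwise
the kernel returns -ENODEV. A sketch of how userspace might probe for
that ahead of time, using the existing scheduler-capabilities query (an
illustration, not part of the patch):

/* Sketch: check for scheduler preemption support before attempting to
 * clear persistence, which the patch rejects with -ENODEV otherwise.
 */
#include <stdbool.h>
#include <drm/i915_drm.h>
#include <xf86drm.h>

static bool can_clear_persistence(int fd)
{
	int caps = 0;
	struct drm_i915_getparam gp = {
		.param = I915_PARAM_HAS_SCHEDULER,
		.value = &caps,
	};

	if (drmIoctl(fd, DRM_IOCTL_I915_GETPARAM, &gp))
		return false;

	return caps & I915_SCHEDULER_CAP_PREEMPTION;
}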

> +++ b/include/uapi/drm/i915_drm.h
> @@ -1572,6 +1572,21 @@ struct drm_i915_gem_context_param {
>   *   i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
>   */
>  #define I915_CONTEXT_PARAM_ENGINES 0xa
> +
> +/*
> + * I915_CONTEXT_PARAM_PERSISTENCE:
> + *
> + * Allow the context and active rendering to survive the process until
> + * completion. Persistence allows fire-and-forget clients to queue up a
> + * bunch of work, hand the output over to a display server and then quit.
> + * If the context is not marked as persistent, upon closing (either via

".. is marked as not persistent," is more clear.

Regards, Joonas

> + * an explicit DRM_I915_GEM_CONTEXT_DESTROY or implicitly from file closure
> + * or process termination), the context and any outstanding requests will be
> + * cancelled (and exported fences for cancelled requests marked as -EIO).
> + *
> + * By default, new contexts allow persistence.
> + */

[Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional

2019-10-24 Thread Chris Wilson
Our existing behaviour is to allow contexts and their GPU requests to
persist past the point of closure until the requests are complete. This
allows clients to operate in a 'fire-and-forget' manner where they can
set up a rendering pipeline, hand it over to the display server and
immediately exit. As the rendering pipeline is kept alive until
completion, the display server (or other consumer) can use the results
in the future and present them to the user.

However, not all clients want this persistent behaviour and would prefer
that the contexts are cleaned up immediately upon closure. This ensures
that when clients are run without hangchecking, any GPU hang is
terminated with the process and does not continue to hog resources.

By defining a context property to allow clients to control persistence
explicitly, we can remove the blanket advice to disable hangchecking
that seems to be far too prevalent.

The default behaviour for new controls is the legacy persistence mode.
New clients will have to opt out for immediate cleanup on context
closure. If the hangchecking modparam is disabled, so is persistent
context support -- all contexts will be terminated on closure.

Testcase: igt/gem_ctx_persistence
Signed-off-by: Chris Wilson 
Cc: Joonas Lahtinen 
Cc: Michał Winiarski 
Cc: Jon Bloomfield 
Reviewed-by: Jon Bloomfield 
Reviewed-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 50 ++-
 drivers/gpu/drm/i915/gem/i915_gem_context.h   | 15 ++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
 .../gpu/drm/i915/gem/selftests/mock_context.c |  2 +
 include/uapi/drm/i915_drm.h   | 15 ++
 5 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 55f1f93c0925..0f1bbf5d1a11 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -437,12 +437,39 @@ static void context_close(struct i915_gem_context *ctx)
 * case we opt to forcibly kill off all remaining requests on
 * context close.
 */
-   if (!i915_modparams.enable_hangcheck)
+   if (!i915_gem_context_is_persistent(ctx) ||
+   !i915_modparams.enable_hangcheck)
kill_context(ctx);
 
i915_gem_context_put(ctx);
 }
 
+static int __context_set_persistence(struct i915_gem_context *ctx, bool state)
+{
+   if (i915_gem_context_is_persistent(ctx) == state)
+   return 0;
+
+   if (state) {
+   /*
+* Only contexts that are short-lived [that will expire or be
+* reset] are allowed to survive past termination. We require
+* hangcheck to ensure that the persistent requests are healthy.
+*/
+   if (!i915_modparams.enable_hangcheck)
+   return -EINVAL;
+
+   i915_gem_context_set_persistence(ctx);
+   } else {
+   /* To cancel a context we use "preempt-to-idle" */
+   if (!(ctx->i915->caps.scheduler & I915_SCHEDULER_CAP_PREEMPTION))
+   return -ENODEV;
+
+   i915_gem_context_clear_persistence(ctx);
+   }
+
+   return 0;
+}
+
 static struct i915_gem_context *
 __create_context(struct drm_i915_private *i915)
 {
@@ -477,6 +504,7 @@ __create_context(struct drm_i915_private *i915)
 
i915_gem_context_set_bannable(ctx);
i915_gem_context_set_recoverable(ctx);
+   __context_set_persistence(ctx, true /* cgroup hook? */);
 
for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++)
ctx->hang_timestamp[i] = jiffies - CONTEXT_FAST_HANG_JIFFIES;
@@ -633,6 +661,7 @@ i915_gem_context_create_kernel(struct drm_i915_private *i915, int prio)
return ctx;
 
i915_gem_context_clear_bannable(ctx);
+   i915_gem_context_set_persistence(ctx);
ctx->sched.priority = I915_USER_PRIORITY(prio);
 
GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
@@ -1743,6 +1772,16 @@ get_engines(struct i915_gem_context *ctx,
return err;
 }
 
+static int
+set_persistence(struct i915_gem_context *ctx,
+   const struct drm_i915_gem_context_param *args)
+{
+   if (args->size)
+   return -EINVAL;
+
+   return __context_set_persistence(ctx, args->value);
+}
+
 static int ctx_setparam(struct drm_i915_file_private *fpriv,
struct i915_gem_context *ctx,
struct drm_i915_gem_context_param *args)
@@ -1820,6 +1859,10 @@ static int ctx_setparam(struct drm_i915_file_private *fpriv,
ret = set_engines(ctx, args);
break;
 
+   case I915_CONTEXT_PARAM_PERSISTENCE:
+   ret = set_persistence(ctx, args);
+   break;
+
case I915_CONTEXT_PARAM_BAN_PERIOD:
default:
ret = -EINVAL;
@@ -2272,6 +2315,11 @@ int i915_ge