Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional
Quoting Jason Ekstrand (2019-10-29 16:19:09)
> On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson wrote:
> > [...]
>
> Ok, thinking out loud here (no need for you to respond): Why should we take
> this approach? It seems like there are several other ways we could solve
> this:
>
> 1. Have a per-context flag (that's what we did here)
> 2. Have a per-execbuf flag for "don't allow this execbuf to outlive the
>    process".
> 3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually kill
>    the context
>
> Option 2 seems like a lot more work in i915 and it doesn't seem that
> advantageous. Most drivers are going to either want their batches to
> outlive them or not; they aren't going to be making that decision on a
> per-batch basis.

And processing each batch on context close, deciding how to carry forward
the different options doesn't sound like fun.

> Option 3 would work for some cases but it doesn't let the kernel terminate
> work if the client is killed unexpectedly by, for instance a segfault. The
> client could try to insert a crash handler but calling a DRM ioctl from a
> crash handler sounds like a bad plan. On the other hand, the client can
> just as easily implement 3 by setting the new context flag and then calling
> GEM_CONTEXT_DESTROY.

Exactly. Abnormal process termination is the name of the game.

> With that, I think I'm convinced that a context param is the best way to do
> this. We may even consider using it in Vulkan when running headless to let
> us kill stuff quicker. We aren't seeing any long-running Vulkan compute
> workloads yet but they may be coming.

Yeah, might it find a use for GL robustness? If the app wants a guarantee
that its resources are garbage collected along with it?

> Acked-by: Jason Ekstrand
>
> One more question: Does this bit fully support being turned on and off or
> is it a set-once? I ask because how I'd likely go about doing this in
> Vulkan would be to set it on context create and then unset it the moment we
> see a buffer shared with the outside world.

You can set it as many times as you like, it only takes effect on context
termination. That comes back to whether you want to evaluate it as a
batch/execbuf attribute or on the context. The starting point for the
patches was to close the process termination hole, of which this is the
natural extension for a context to opt into being cancelled on closure.

I think an all-or-nothing makes sense, or rather I don't see enough
advantage in the per-batch attribute to justify the processing.
-Chris
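For reference, the explicit-kill behaviour of option 3 can be approximated
on top of the new param from userspace. A minimal sketch, assuming libdrm
for drmIoctl() and an i915_drm.h with this series applied; the helper name
is made up and error handling is reduced to errno:

#include <errno.h>
#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* Hypothetical helper: opt the context out of persistence, then destroy it,
 * so outstanding requests are cancelled rather than left to run. */
static int i915_context_kill(int i915_fd, uint32_t ctx_id)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_PERSISTENCE,
		.value = 0, /* non-persistent: cancel work on closure */
	};
	struct drm_i915_gem_context_destroy d = {
		.ctx_id = ctx_id,
	};

	if (drmIoctl(i915_fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p))
		return -errno; /* e.g. ENODEV if the engines cannot be preempted */

	/* Closing a non-persistent context cancels its remaining requests. */
	if (drmIoctl(i915_fd, DRM_IOCTL_I915_GEM_CONTEXT_DESTROY, &d))
		return -errno;

	return 0;
}

Clearing persistence is refused with ENODEV where the scheduler cannot
preempt, matching the check the patch adds in __context_set_persistence().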
Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional
On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson wrote:
> Quoting Jason Ekstrand (2019-10-25 19:22:04)
> > [...]
>
> They are being told to use the module parameter i915.enable_hangcheck=0
> to globally disable hangchecking. This is what we are trying to wean
> them off, and yet still allow indefinitely long kernels. The softer
> hangcheck is focused on if you block scheduling or preemption of higher
> priority work, then you are forcibly removed from the GPU. However, even
> that is too much for some workloads, where they really do expect to
> permanently hog the GPU. (All I can say is that they better be dedicated
> systems because if you demand interactivity on top of disabling
> preemption...)

Ok, thinking out loud here (no need for you to respond): Why should we take
this approach? It seems like there are several other ways we could solve
this:

1. Have a per-context flag (that's what we did here)
2. Have a per-execbuf flag for "don't allow this execbuf to outlive the
   process".
3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually kill
   the context

Option 2 seems like a lot more work in i915 and it doesn't seem that
advantageous. Most drivers are going to either want their batches to outlive
them or not; they aren't going to be making that decision on a per-batch
basis.

Option 3 would work for some cases but it doesn't let the kernel terminate
work if the client is killed unexpectedly by, for instance a segfault. The
client could try to insert a crash handler but calling a DRM ioctl from a
crash handler sounds like a bad plan. On the other hand, the client can just
as easily implement 3 by setting the new context flag and then calling
GEM_CONTEXT_DESTROY.

With that, I think I'm convinced that a context param is the best way to do
this. We may even consider using it in Vulkan when running headless to let
us kill stuff quicker. We aren't seeing any long-running Vulkan compute
workloads yet but they may be coming.

Acked-by: Jason Ekstrand

One more question: Does this bit fully support being turned on and off or is
it a set-once? I ask because how I'd likely go about doing this in Vulkan
would be to set it on context create and then unset it the moment we see a
buffer shared with the outside world.

--Jason
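Concretely, that toggling pattern boils down to flipping the context param
at two points in the client. A rough sketch, assuming libdrm and the
i915_drm.h from this series; where "first external export" is detected is
left to the driver:

#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

static int set_persistence(int fd, uint32_t ctx_id, bool persistent)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_PERSISTENCE,
		.value = persistent,
	};

	return drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_SETPARAM, &p);
}

/*
 * At context creation: set_persistence(fd, ctx, false), so a dying process
 * takes its unfinished work with it.
 *
 * On first export of a buffer to another process (e.g. a display server):
 * set_persistence(fd, ctx, true), so the consumer can keep using results
 * that outlive us.
 */

The creation-time half could also be folded into
DRM_IOCTL_I915_GEM_CONTEXT_CREATE_EXT via an I915_CONTEXT_CREATE_EXT_SETPARAM
extension, saving the extra ioctl.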
Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional
Quoting Jason Ekstrand (2019-10-25 19:22:04)
> On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson wrote:
> > Our existing behaviour is to allow contexts and their GPU requests to
> > persist past the point of closure until the requests are complete. This
> > allows clients to operate in a 'fire-and-forget' manner where they can
> > setup a rendering pipeline and hand it over to the display server and
> > immediately exiting. As the rendering pipeline is kept alive until
> > completion, the display server (or other consumer) can use the results
> > in the future and present them to the user.
> >
> > However, not all clients want this persistent behaviour and would prefer
> > that the contexts are cleaned up immediately upon closure. This ensures
> > that when clients are run without hangchecking, any GPU hang is
> > terminated with the process and does not continue to hog resources.
> >
> > By defining a context property to allow clients to control persistence
> > explicitly, we can remove the blanket advice to disable hangchecking
> > that seems to be far too prevalent.
>
> Just to be clear, when you say "disable hangchecking" do you mean disabling
> it for all processes via a kernel parameter at boot time or a sysfs entry
> or similar? Or is there some mechanism whereby a context can request no
> hang checking?

They are being told to use the module parameter i915.enable_hangcheck=0
to globally disable hangchecking. This is what we are trying to wean
them off, and yet still allow indefinitely long kernels. The softer
hangcheck is focused on if you block scheduling or preemption of higher
priority work, then you are forcibly removed from the GPU. However, even
that is too much for some workloads, where they really do expect to
permanently hog the GPU. (All I can say is that they better be dedicated
systems because if you demand interactivity on top of disabling
preemption...)

> > The default behaviour for new controls is the legacy persistence mode.
> > New clients will have to opt out for immediate cleanup on context
> > closure. If the hangchecking modparam is disabled, so is persistent
> > context support -- all contexts will be terminated on closure.
>
> What happens to fences when the context is cancelled? Is it the same
> behavior as we have today for when a GPU hang is detected and a context is
> banned?

Yes. The incomplete fence statuses are set to -EIO -- it is the very same
mechanism used to remove this context's future work from the GPU as is used
for banning.
-Chris
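For completeness, the cancellation is visible to userspace through any fence
exported for the cancelled work. A small sketch, assuming the batch was
submitted with I915_EXEC_FENCE_OUT so that fence_fd refers to a sync_file:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

static void report_fence(int fence_fd)
{
	struct sync_file_info info = { 0 };

	if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info))
		return;

	/* status: 0 = still active, 1 = signalled, negative = signalled with
	 * an error. A request cancelled because its non-persistent context
	 * was closed reports -EIO here, just as if the context had been
	 * banned after a hang. */
	printf("fence status: %d\n", info.status);
}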
Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional
On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson wrote:
> Our existing behaviour is to allow contexts and their GPU requests to
> persist past the point of closure until the requests are complete. This
> allows clients to operate in a 'fire-and-forget' manner where they can
> setup a rendering pipeline and hand it over to the display server and
> immediately exiting. As the rendering pipeline is kept alive until
> completion, the display server (or other consumer) can use the results
> in the future and present them to the user.
>
> However, not all clients want this persistent behaviour and would prefer
> that the contexts are cleaned up immediately upon closure. This ensures
> that when clients are run without hangchecking, any GPU hang is
> terminated with the process and does not continue to hog resources.
>
> By defining a context property to allow clients to control persistence
> explicitly, we can remove the blanket advice to disable hangchecking
> that seems to be far too prevalent.

Just to be clear, when you say "disable hangchecking" do you mean disabling
it for all processes via a kernel parameter at boot time or a sysfs entry or
similar? Or is there some mechanism whereby a context can request no hang
checking?

> The default behaviour for new controls is the legacy persistence mode.
> New clients will have to opt out for immediate cleanup on context
> closure. If the hangchecking modparam is disabled, so is persistent
> context support -- all contexts will be terminated on closure.

What happens to fences when the context is cancelled? Is it the same
behavior as we have today for when a GPU hang is detected and a context is
banned?

--Jason

> Testcase: igt/gem_ctx_persistence
> Signed-off-by: Chris Wilson
> Cc: Joonas Lahtinen
> Cc: Michał Winiarski
> Cc: Jon Bloomfield
> Reviewed-by: Jon Bloomfield
> Reviewed-by: Tvrtko Ursulin
> [...]
Re: [Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional
Quoting Chris Wilson (2019-10-24 14:40:18)
> Our existing behaviour is to allow contexts and their GPU requests to
> persist past the point of closure until the requests are complete. This
> allows clients to operate in a 'fire-and-forget' manner where they can
> setup a rendering pipeline and hand it over to the display server and
> immediately exiting. As the rendering pipeline is kept alive until
> completion, the display server (or other consumer) can use the results
> in the future and present them to the user.
>
> However, not all clients want this persistent behaviour and would prefer
> that the contexts are cleaned up immediately upon closure. This ensures
> that when clients are run without hangchecking, any GPU hang is
> terminated with the process and does not continue to hog resources.

It's worth mentioning GPU compute workloads of indeterminate length as an
example for increased clarity.

> By defining a context property to allow clients to control persistence
> explicitly, we can remove the blanket advice to disable hangchecking
> that seems to be far too prevalent.

I would drop this paragraph from this patch, as it doesn't (yet) make sense.

> The default behaviour for new controls is the legacy persistence mode.
> New clients will have to opt out for immediate cleanup on context
> closure.

"... have to opt in to immediate ..." reads more clear.

> If the hangchecking modparam is disabled, so is persistent
> context support -- all contexts will be terminated on closure.

Let's add here that it has actually been a source of bug reports in the past
that we don't terminate the workloads. So we expect this behaviour change to
be welcomed by the compute users who have been instructed to disable the
hangcheck in the past.

A couple of comments below. But anyway, with the uAPI comment and commit
message clarified this is:

Reviewed-by: Joonas Lahtinen

Still needs the Link:, though.

> Testcase: igt/gem_ctx_persistence
> Signed-off-by: Chris Wilson
> Cc: Joonas Lahtinen
> Cc: Michał Winiarski
> Cc: Jon Bloomfield
> Reviewed-by: Jon Bloomfield
> Reviewed-by: Tvrtko Ursulin

> +static int __context_set_persistence(struct i915_gem_context *ctx, bool state)
> +{
> +	if (i915_gem_context_is_persistent(ctx) == state)
> +		return 0;
> +
> +	if (state) {
> +		/*
> +		 * Only contexts that are short-lived [that will expire or be
> +		 * reset] are allowed to survive past termination. We require
> +		 * hangcheck to ensure that the persistent requests are healthy.
> +		 */
> +		if (!i915_modparams.enable_hangcheck)
> +			return -EINVAL;

This is slightly confusing as the default is to enable persistence.
Disabling and re-enabling would result in -EINVAL. But I guess it's no
problem to lift such a restriction later.

> +++ b/include/uapi/drm/i915_drm.h
> @@ -1572,6 +1572,21 @@ struct drm_i915_gem_context_param {
>  	 * i915_context_engines_bond (I915_CONTEXT_ENGINES_EXT_BOND)
>  	 */
>  #define I915_CONTEXT_PARAM_ENGINES	0xa
> +
> +/*
> + * I915_CONTEXT_PARAM_PERSISTENCE:
> + *
> + * Allow the context and active rendering to survive the process until
> + * completion. Persistence allows fire-and-forget clients to queue up a
> + * bunch of work, hand the output over to a display server and the quit.
> + * If the context is not marked as persistent, upon closing (either via

".. is marked as not persistent," is more clear.

Regards, Joonas

> + * an explicit DRM_I915_GEM_CONTEXT_DESTROY or implicitly from file closure
> + * or process termination), the context and any outstanding requests will be
> + * cancelled (and exported fences for cancelled requests marked as -EIO).
> + *
> + * By default, new contexts allow persistence.
> + */
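On the uAPI side, userspace that wants to rely on the documented default can
probe for the parameter first. A rough sketch, assuming the parameter reads
back through GETPARAM like the other context params (kernels without this
series reject the unknown param), with libdrm and the patched i915_drm.h:

#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

static bool has_persistence_param(int i915_fd, uint32_t ctx_id)
{
	struct drm_i915_gem_context_param p = {
		.ctx_id = ctx_id,
		.param = I915_CONTEXT_PARAM_PERSISTENCE,
	};

	/* On success, p.value holds the context's current setting, which
	 * defaults to persistent as documented above. */
	return drmIoctl(i915_fd, DRM_IOCTL_I915_GEM_CONTEXT_GETPARAM, &p) == 0;
}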
[Intel-gfx] [PATCH 01/11] drm/i915/gem: Make context persistence optional
Our existing behaviour is to allow contexts and their GPU requests to
persist past the point of closure until the requests are complete. This
allows clients to operate in a 'fire-and-forget' manner where they can
setup a rendering pipeline and hand it over to the display server and
immediately exiting. As the rendering pipeline is kept alive until
completion, the display server (or other consumer) can use the results
in the future and present them to the user.

However, not all clients want this persistent behaviour and would prefer
that the contexts are cleaned up immediately upon closure. This ensures
that when clients are run without hangchecking, any GPU hang is
terminated with the process and does not continue to hog resources.

By defining a context property to allow clients to control persistence
explicitly, we can remove the blanket advice to disable hangchecking
that seems to be far too prevalent.

The default behaviour for new controls is the legacy persistence mode.
New clients will have to opt out for immediate cleanup on context
closure. If the hangchecking modparam is disabled, so is persistent
context support -- all contexts will be terminated on closure.

Testcase: igt/gem_ctx_persistence
Signed-off-by: Chris Wilson
Cc: Joonas Lahtinen
Cc: Michał Winiarski
Cc: Jon Bloomfield
Reviewed-by: Jon Bloomfield
Reviewed-by: Tvrtko Ursulin
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 50 ++-
 drivers/gpu/drm/i915/gem/i915_gem_context.h   | 15 ++
 .../gpu/drm/i915/gem/i915_gem_context_types.h |  1 +
 .../gpu/drm/i915/gem/selftests/mock_context.c |  2 +
 include/uapi/drm/i915_drm.h                   | 15 ++
 5 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 55f1f93c0925..0f1bbf5d1a11 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -437,12 +437,39 @@ static void context_close(struct i915_gem_context *ctx)
 	 * case we opt to forcibly kill off all remaining requests on
 	 * context close.
 	 */
-	if (!i915_modparams.enable_hangcheck)
+	if (!i915_gem_context_is_persistent(ctx) ||
+	    !i915_modparams.enable_hangcheck)
 		kill_context(ctx);
 
 	i915_gem_context_put(ctx);
 }
 
+static int __context_set_persistence(struct i915_gem_context *ctx, bool state)
+{
+	if (i915_gem_context_is_persistent(ctx) == state)
+		return 0;
+
+	if (state) {
+		/*
+		 * Only contexts that are short-lived [that will expire or be
+		 * reset] are allowed to survive past termination. We require
+		 * hangcheck to ensure that the persistent requests are healthy.
+		 */
+		if (!i915_modparams.enable_hangcheck)
+			return -EINVAL;
+
+		i915_gem_context_set_persistence(ctx);
+	} else {
+		/* To cancel a context we use "preempt-to-idle" */
+		if (!(ctx->i915->caps.scheduler & I915_SCHEDULER_CAP_PREEMPTION))
+			return -ENODEV;
+
+		i915_gem_context_clear_persistence(ctx);
+	}
+
+	return 0;
+}
+
 static struct i915_gem_context *
 __create_context(struct drm_i915_private *i915)
 {
@@ -477,6 +504,7 @@ __create_context(struct drm_i915_private *i915)
 
 	i915_gem_context_set_bannable(ctx);
 	i915_gem_context_set_recoverable(ctx);
+	__context_set_persistence(ctx, true /* cgroup hook? */);
 
 	for (i = 0; i < ARRAY_SIZE(ctx->hang_timestamp); i++)
 		ctx->hang_timestamp[i] = jiffies - CONTEXT_FAST_HANG_JIFFIES;
@@ -633,6 +661,7 @@ i915_gem_context_create_kernel(struct drm_i915_private *i915, int prio)
 		return ctx;
 
 	i915_gem_context_clear_bannable(ctx);
+	i915_gem_context_set_persistence(ctx);
 	ctx->sched.priority = I915_USER_PRIORITY(prio);
 
 	GEM_BUG_ON(!i915_gem_context_is_kernel(ctx));
@@ -1743,6 +1772,16 @@ get_engines(struct i915_gem_context *ctx,
 	return err;
 }
 
+static int
+set_persistence(struct i915_gem_context *ctx,
+		const struct drm_i915_gem_context_param *args)
+{
+	if (args->size)
+		return -EINVAL;
+
+	return __context_set_persistence(ctx, args->value);
+}
+
 static int ctx_setparam(struct drm_i915_file_private *fpriv,
 			struct i915_gem_context *ctx,
 			struct drm_i915_gem_context_param *args)
@@ -1820,6 +1859,10 @@ static int ctx_setparam(struct drm_i915_file_private *fpriv,
 		ret = set_engines(ctx, args);
 		break;
 
+	case I915_CONTEXT_PARAM_PERSISTENCE:
+		ret = set_persistence(ctx, args);
+		break;
+
 	case I915_CONTEXT_PARAM_BAN_PERIOD:
 	default:
 		ret = -EINVAL;
@@ -2272,6 +2315,11 @@ int i915_ge