Re: [Intel-gfx] [PATCH v2] drm/i915: Sleep and retry a GPU reset if at first we don't succeed

2017-12-01 Thread Chris Wilson
Quoting Mika Kuoppala (2017-12-01 13:20:29)
> Chris Wilson  writes:
> 
> > As we declare the GPU wedged if the reset fails, such a failure is quite
> > terminal. Before taking that drastic action, let's sleep first and try
> > active, in the hope that the hardware has quietened down and is then
> > able to reset. After a few such attempts, it is fair to say that the HW
> > is truly wedged.
> >
> > v2: Always print the failure message now, we precheck whether resets are
> > disabled.
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=104007
> > Signed-off-by: Chris Wilson 
> > Cc: Mika Kuoppala 
> > Cc: Joonas Lahtinen 
> > ---
> >  drivers/gpu/drm/i915/i915_drv.c | 20 +++-
> >  1 file changed, 15 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c 
> > b/drivers/gpu/drm/i915/i915_drv.c
> > index e0f053f9c186..7faf20aff25a 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -1877,7 +1877,9 @@ void i915_reset(struct drm_i915_private *i915, 
> > unsigned int flags)
> >  {
> >   struct i915_gpu_error *error = >gpu_error;
> >   int ret;
> > + int i;
> >  
> > + might_sleep();
> >   lockdep_assert_held(>drm.struct_mutex);
> >   GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, >flags));
> >  
> > @@ -1900,12 +1902,20 @@ void i915_reset(struct drm_i915_private *i915, 
> > unsigned int flags)
> >   goto error;
> >   }
> >  
> > - ret = intel_gpu_reset(i915, ALL_ENGINES);
> > + if (!intel_has_gpu_reset(i915)) {
> > + DRM_DEBUG_DRIVER("GPU reset disabled\n");
> > + goto error;
> > + }
> > +
> > + for (i = 0; i < 3; i++) {
> > + ret = intel_gpu_reset(i915, ALL_ENGINES);
> > + if (ret == 0)
> > + break;
> > +
> > + msleep(100);
> 
> Seems reasonable to try few times and pause between defibrillate
> attempts instead of throwing dirt on top of coffin right
> off the bat.
> 
> Also I have been pondering that should we add a minicheck
> to intel_gpu_reset to poke that the gpu is really there.
> Like doing few nops in (render)ringbuffer and see if head
> moves before declaring it as a reset success?

That's kind of what the recovery tests. It's tricky to do so because we
would have to unwind hw state after the trial, plus we may not see a
problem until it tries to do something like write to HWSP.

> Not that we would not see it in init right after but just
> to have more precise location of failure instead of
> initing a dead gpu.

It's dead, Jim. Either way, the location should be in the post-mortem
error state capture before the first attempt to reset. As always, if
that isn't enough to diagnose the problem, we need to work harder to get
the right information into that dump. Tricky.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v2] drm/i915: Sleep and retry a GPU reset if at first we don't succeed

2017-12-01 Thread Mika Kuoppala
Chris Wilson  writes:

> As we declare the GPU wedged if the reset fails, such a failure is quite
> terminal. Before taking that drastic action, let's sleep first and try
> active, in the hope that the hardware has quietened down and is then
> able to reset. After a few such attempts, it is fair to say that the HW
> is truly wedged.
>
> v2: Always print the failure message now, we precheck whether resets are
> disabled.
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=104007
> Signed-off-by: Chris Wilson 
> Cc: Mika Kuoppala 
> Cc: Joonas Lahtinen 
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e0f053f9c186..7faf20aff25a 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1877,7 +1877,9 @@ void i915_reset(struct drm_i915_private *i915, unsigned 
> int flags)
>  {
>   struct i915_gpu_error *error = >gpu_error;
>   int ret;
> + int i;
>  
> + might_sleep();
>   lockdep_assert_held(>drm.struct_mutex);
>   GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, >flags));
>  
> @@ -1900,12 +1902,20 @@ void i915_reset(struct drm_i915_private *i915, 
> unsigned int flags)
>   goto error;
>   }
>  
> - ret = intel_gpu_reset(i915, ALL_ENGINES);
> + if (!intel_has_gpu_reset(i915)) {
> + DRM_DEBUG_DRIVER("GPU reset disabled\n");
> + goto error;
> + }
> +
> + for (i = 0; i < 3; i++) {
> + ret = intel_gpu_reset(i915, ALL_ENGINES);
> + if (ret == 0)
> + break;
> +
> + msleep(100);

Seems reasonable to try few times and pause between defibrillate
attempts instead of throwing dirt on top of coffin right
off the bat.

Also I have been pondering that should we add a minicheck
to intel_gpu_reset to poke that the gpu is really there.
Like doing few nops in (render)ringbuffer and see if head
moves before declaring it as a reset success?

Not that we would not see it in init right after but just
to have more precise location of failure instead of
initing a dead gpu.

Reviewed-by: Mika Kuoppala 

-Mika


> + }
>   if (ret) {
> - if (ret != -ENODEV)
> - DRM_ERROR("Failed to reset chip: %i\n", ret);
> - else
> - DRM_DEBUG_DRIVER("GPU reset disabled\n");
> + dev_err(i915->drm.dev, "Failed to reset chip\n");
>   goto error;
>   }
>  
> -- 
> 2.15.1
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH v2] drm/i915: Sleep and retry a GPU reset if at first we don't succeed

2017-12-01 Thread Chris Wilson
As we declare the GPU wedged if the reset fails, such a failure is quite
terminal. Before taking that drastic action, let's sleep first and try
active, in the hope that the hardware has quietened down and is then
able to reset. After a few such attempts, it is fair to say that the HW
is truly wedged.

v2: Always print the failure message now, we precheck whether resets are
disabled.

References: https://bugs.freedesktop.org/show_bug.cgi?id=104007
Signed-off-by: Chris Wilson 
Cc: Mika Kuoppala 
Cc: Joonas Lahtinen 
---
 drivers/gpu/drm/i915/i915_drv.c | 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index e0f053f9c186..7faf20aff25a 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1877,7 +1877,9 @@ void i915_reset(struct drm_i915_private *i915, unsigned 
int flags)
 {
struct i915_gpu_error *error = >gpu_error;
int ret;
+   int i;
 
+   might_sleep();
lockdep_assert_held(>drm.struct_mutex);
GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, >flags));
 
@@ -1900,12 +1902,20 @@ void i915_reset(struct drm_i915_private *i915, unsigned 
int flags)
goto error;
}
 
-   ret = intel_gpu_reset(i915, ALL_ENGINES);
+   if (!intel_has_gpu_reset(i915)) {
+   DRM_DEBUG_DRIVER("GPU reset disabled\n");
+   goto error;
+   }
+
+   for (i = 0; i < 3; i++) {
+   ret = intel_gpu_reset(i915, ALL_ENGINES);
+   if (ret == 0)
+   break;
+
+   msleep(100);
+   }
if (ret) {
-   if (ret != -ENODEV)
-   DRM_ERROR("Failed to reset chip: %i\n", ret);
-   else
-   DRM_DEBUG_DRIVER("GPU reset disabled\n");
+   dev_err(i915->drm.dev, "Failed to reset chip\n");
goto error;
}
 
-- 
2.15.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx