Re: [Intel-gfx] [PATCH 3/3] drm/i915: Prevent leaking of -EIO from i915_wait_request()

2015-12-11 Thread Daniel Vetter
On Fri, Dec 11, 2015 at 09:02:18AM +, Chris Wilson wrote:
> On Thu, Dec 03, 2015 at 10:14:54AM +0100, Daniel Vetter wrote:
> > On Tue, Dec 01, 2015 at 11:05:35AM +, Chris Wilson wrote:
> > > diff --git a/drivers/gpu/drm/i915/intel_display.c 
> > > b/drivers/gpu/drm/i915/intel_display.c
> > > index 4447e73b54db..73c61b94f7fd 100644
> > > --- a/drivers/gpu/drm/i915/intel_display.c
> > > +++ b/drivers/gpu/drm/i915/intel_display.c
> > > @@ -13315,23 +13309,15 @@ static int intel_atomic_prepare_commit(struct 
> > > drm_device *dev,
> > >  
> > >   ret = __i915_wait_request(intel_plane_state->wait_req,
> > > true, NULL, NULL);
> > > -
> > > - /* Swallow -EIO errors to allow updates during hw 
> > > lockup. */
> > > - if (ret == -EIO)
> > > - ret = 0;
> > > -
> > > - if (ret)
> > > + if (ret) {
> > > + mutex_lock(>struct_mutex);
> > > + drm_atomic_helper_cleanup_planes(dev, state);
> > > + mutex_unlock(>struct_mutex);
> > >   break;
> > > + }
> > >   }
> > > -
> > > - if (!ret)
> > > - return 0;
> > > -
> > > - mutex_lock(>struct_mutex);
> > > - drm_atomic_helper_cleanup_planes(dev, state);
> > >   }
> > >  
> > > - mutex_unlock(>struct_mutex);
> > 
> > Sneaking in lockless waits! Separate patch please.
> 
> No, it is just badly written code. The wait is already lockless but the
> lock is dropped and retaken around the error paths in such a manner that
> you cannot see this from a glimpse.

Indeed lack of diff context made me all confused, I stand corrected. Looks
good.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 3/3] drm/i915: Prevent leaking of -EIO from i915_wait_request()

2015-12-11 Thread Chris Wilson
On Thu, Dec 03, 2015 at 10:14:54AM +0100, Daniel Vetter wrote:
> On Tue, Dec 01, 2015 at 11:05:35AM +, Chris Wilson wrote:
> > diff --git a/drivers/gpu/drm/i915/intel_display.c 
> > b/drivers/gpu/drm/i915/intel_display.c
> > index 4447e73b54db..73c61b94f7fd 100644
> > --- a/drivers/gpu/drm/i915/intel_display.c
> > +++ b/drivers/gpu/drm/i915/intel_display.c
> > @@ -13315,23 +13309,15 @@ static int intel_atomic_prepare_commit(struct 
> > drm_device *dev,
> >  
> > ret = __i915_wait_request(intel_plane_state->wait_req,
> >   true, NULL, NULL);
> > -
> > -   /* Swallow -EIO errors to allow updates during hw 
> > lockup. */
> > -   if (ret == -EIO)
> > -   ret = 0;
> > -
> > -   if (ret)
> > +   if (ret) {
> > +   mutex_lock(>struct_mutex);
> > +   drm_atomic_helper_cleanup_planes(dev, state);
> > +   mutex_unlock(>struct_mutex);
> > break;
> > +   }
> > }
> > -
> > -   if (!ret)
> > -   return 0;
> > -
> > -   mutex_lock(>struct_mutex);
> > -   drm_atomic_helper_cleanup_planes(dev, state);
> > }
> >  
> > -   mutex_unlock(>struct_mutex);
> 
> Sneaking in lockless waits! Separate patch please.

No, it is just badly written code. The wait is already lockless but the
lock is dropped and retaken around the error paths in such a manner that
you cannot see this from a glimpse.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 3/3] drm/i915: Prevent leaking of -EIO from i915_wait_request()

2015-12-03 Thread Chris Wilson
On Thu, Dec 03, 2015 at 10:14:54AM +0100, Daniel Vetter wrote:
> > @@ -4078,9 +4067,9 @@ i915_gem_ring_throttle(struct drm_device *dev, struct 
> > drm_file *file)
> > if (ret)
> > return ret;
> >  
> > -   ret = i915_gem_check_wedge(_priv->gpu_error, false);
> > -   if (ret)
> > -   return ret;
> > +   /* ABI: return -EIO if already wedged */
> > +   if (i915_terminally_wedged(_priv->gpu_error))
> 
> Commit message says there's only one case, but I count two:
> - throttle here
> - waiting for ring space in case the gpu died meanwhile

That wasn't intended for ABI preservation. (If you look at the execbuffer
reporting -EIO that is from request_alloc). I hit a bug after
wait_for_ring space if I didn't abort when the hang was detected - we
walked off the end of the ring. That shouldn't be possible, I think I
papered over a bug rather than diagnose it correctly. :|
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 3/3] drm/i915: Prevent leaking of -EIO from i915_wait_request()

2015-12-03 Thread Daniel Vetter
On Tue, Dec 01, 2015 at 11:05:35AM +, Chris Wilson wrote:
> Reporting -EIO from i915_wait_request() has proven very troublematic
> over the years, with numerous hard-to-reproduce bugs cropping up in the
> corner case of where a reset occurs and the code wasn't expecting such
> an error.
> 
> If the we reset the GPU or have detected a hang and wish to reset the
> GPU, the request is forcibly complete and the wait broken. Currently, we
> report either -EAGAIN or -EIO in order for the caller to retreat and
> restart the wait (if appropriate) after dropping and then reacquiring
> the struct_mutex (essential to allow the GPU reset to proceed). However,
> if we take the view that the request is complete (no further work will
> be done on it by the GPU because it is dead and soon to be reset), then
> we can proceed with the task at hand and then drop the struct_mutex
> allowing the reset to occur. This transfers the burden of checking
> whether it is safe to proceed to the caller, which in all but one
> instance it is safe - completely eliminating the source of all spurious
> -EIO.
> 
> Signed-off-by: Chris Wilson 
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c |  2 +-
>  drivers/gpu/drm/i915/i915_drv.c | 32 +--
>  drivers/gpu/drm/i915/i915_drv.h |  5 +---
>  drivers/gpu/drm/i915/i915_gem.c | 45 
> +
>  drivers/gpu/drm/i915/i915_irq.c | 21 ++-
>  drivers/gpu/drm/i915/intel_display.c| 32 +++
>  drivers/gpu/drm/i915/intel_lrc.c|  8 ++
>  drivers/gpu/drm/i915/intel_ringbuffer.c |  8 ++
>  8 files changed, 65 insertions(+), 88 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
> b/drivers/gpu/drm/i915/i915_debugfs.c
> index 864b3ae9419c..0583eda642d7 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -4656,7 +4656,7 @@ i915_wedged_get(void *data, u64 *val)
>   struct drm_device *dev = data;
>   struct drm_i915_private *dev_priv = dev->dev_private;
>  
> - *val = i915_reset_counter(_priv->gpu_error);
> + *val = i915_terminally_wedged(_priv->gpu_error);
>  
>   return 0;
>  }
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 90faa8e03fca..eb893a5e00b1 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -923,23 +923,31 @@ int i915_resume_switcheroo(struct drm_device *dev)
>  int i915_reset(struct drm_device *dev)
>  {
>   struct drm_i915_private *dev_priv = dev->dev_private;
> + struct i915_gpu_error *error = _priv->gpu_error;
> + unsigned reset_counter;
>   bool simulated;
>   int ret;
>  
>   intel_reset_gt_powersave(dev);
>  
>   mutex_lock(>struct_mutex);
> + atomic_andnot(I915_WEDGED, >reset_counter);
> + reset_counter = atomic_inc_return(>reset_counter);
> + if (WARN_ON(__i915_reset_in_progress(reset_counter))) {
> + ret = -EIO;
> + goto error;
> + }
>  
>   i915_gem_reset(dev);
>  
> - simulated = dev_priv->gpu_error.stop_rings != 0;
> + simulated = error->stop_rings != 0;
>  
>   ret = intel_gpu_reset(dev);
>  
>   /* Also reset the gpu hangman. */
>   if (simulated) {
>   DRM_INFO("Simulated gpu hang, resetting stop_rings\n");
> - dev_priv->gpu_error.stop_rings = 0;
> + error->stop_rings = 0;
>   if (ret == -ENODEV) {
>   DRM_INFO("Reset not implemented, but ignoring "
>"error for simulated gpu hangs\n");
> @@ -952,8 +960,7 @@ int i915_reset(struct drm_device *dev)
>  
>   if (ret) {
>   DRM_ERROR("Failed to reset chip: %i\n", ret);
> - mutex_unlock(>struct_mutex);
> - return ret;
> + goto error;
>   }
>  
>   intel_overlay_reset(dev_priv);
> @@ -972,20 +979,14 @@ int i915_reset(struct drm_device *dev)
>* was running at the time of the reset (i.e. we weren't VT
>* switched away).
>*/
> -
> - /* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset 
> */
> - dev_priv->gpu_error.reload_in_reset = true;
> -
>   ret = i915_gem_init_hw(dev);
> -
> - dev_priv->gpu_error.reload_in_reset = false;
> -
> - mutex_unlock(>struct_mutex);
>   if (ret) {
>   DRM_ERROR("Failed hw init on reset %d\n", ret);
> - return ret;
> + goto error;
>   }
>  
> + mutex_unlock(>struct_mutex);
> +
>   /*
>* rps/rc6 re-init is necessary to restore state lost after the
>* reset and the re-install of gt irqs. Skip for ironlake per
> @@ -996,6 +997,11 @@ int i915_reset(struct drm_device *dev)
>   intel_enable_gt_powersave(dev);
>  
>   return 0;
> +
> +error:
> + atomic_or(I915_WEDGED, >reset_counter);
> + 

[Intel-gfx] [PATCH 3/3] drm/i915: Prevent leaking of -EIO from i915_wait_request()

2015-12-01 Thread Chris Wilson
Reporting -EIO from i915_wait_request() has proven very troublematic
over the years, with numerous hard-to-reproduce bugs cropping up in the
corner case of where a reset occurs and the code wasn't expecting such
an error.

If the we reset the GPU or have detected a hang and wish to reset the
GPU, the request is forcibly complete and the wait broken. Currently, we
report either -EAGAIN or -EIO in order for the caller to retreat and
restart the wait (if appropriate) after dropping and then reacquiring
the struct_mutex (essential to allow the GPU reset to proceed). However,
if we take the view that the request is complete (no further work will
be done on it by the GPU because it is dead and soon to be reset), then
we can proceed with the task at hand and then drop the struct_mutex
allowing the reset to occur. This transfers the burden of checking
whether it is safe to proceed to the caller, which in all but one
instance it is safe - completely eliminating the source of all spurious
-EIO.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_debugfs.c |  2 +-
 drivers/gpu/drm/i915/i915_drv.c | 32 +--
 drivers/gpu/drm/i915/i915_drv.h |  5 +---
 drivers/gpu/drm/i915/i915_gem.c | 45 +
 drivers/gpu/drm/i915/i915_irq.c | 21 ++-
 drivers/gpu/drm/i915/intel_display.c| 32 +++
 drivers/gpu/drm/i915/intel_lrc.c|  8 ++
 drivers/gpu/drm/i915/intel_ringbuffer.c |  8 ++
 8 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
b/drivers/gpu/drm/i915/i915_debugfs.c
index 864b3ae9419c..0583eda642d7 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4656,7 +4656,7 @@ i915_wedged_get(void *data, u64 *val)
struct drm_device *dev = data;
struct drm_i915_private *dev_priv = dev->dev_private;
 
-   *val = i915_reset_counter(_priv->gpu_error);
+   *val = i915_terminally_wedged(_priv->gpu_error);
 
return 0;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 90faa8e03fca..eb893a5e00b1 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -923,23 +923,31 @@ int i915_resume_switcheroo(struct drm_device *dev)
 int i915_reset(struct drm_device *dev)
 {
struct drm_i915_private *dev_priv = dev->dev_private;
+   struct i915_gpu_error *error = _priv->gpu_error;
+   unsigned reset_counter;
bool simulated;
int ret;
 
intel_reset_gt_powersave(dev);
 
mutex_lock(>struct_mutex);
+   atomic_andnot(I915_WEDGED, >reset_counter);
+   reset_counter = atomic_inc_return(>reset_counter);
+   if (WARN_ON(__i915_reset_in_progress(reset_counter))) {
+   ret = -EIO;
+   goto error;
+   }
 
i915_gem_reset(dev);
 
-   simulated = dev_priv->gpu_error.stop_rings != 0;
+   simulated = error->stop_rings != 0;
 
ret = intel_gpu_reset(dev);
 
/* Also reset the gpu hangman. */
if (simulated) {
DRM_INFO("Simulated gpu hang, resetting stop_rings\n");
-   dev_priv->gpu_error.stop_rings = 0;
+   error->stop_rings = 0;
if (ret == -ENODEV) {
DRM_INFO("Reset not implemented, but ignoring "
 "error for simulated gpu hangs\n");
@@ -952,8 +960,7 @@ int i915_reset(struct drm_device *dev)
 
if (ret) {
DRM_ERROR("Failed to reset chip: %i\n", ret);
-   mutex_unlock(>struct_mutex);
-   return ret;
+   goto error;
}
 
intel_overlay_reset(dev_priv);
@@ -972,20 +979,14 @@ int i915_reset(struct drm_device *dev)
 * was running at the time of the reset (i.e. we weren't VT
 * switched away).
 */
-
-   /* Used to prevent gem_check_wedged returning -EAGAIN during gpu reset 
*/
-   dev_priv->gpu_error.reload_in_reset = true;
-
ret = i915_gem_init_hw(dev);
-
-   dev_priv->gpu_error.reload_in_reset = false;
-
-   mutex_unlock(>struct_mutex);
if (ret) {
DRM_ERROR("Failed hw init on reset %d\n", ret);
-   return ret;
+   goto error;
}
 
+   mutex_unlock(>struct_mutex);
+
/*
 * rps/rc6 re-init is necessary to restore state lost after the
 * reset and the re-install of gt irqs. Skip for ironlake per
@@ -996,6 +997,11 @@ int i915_reset(struct drm_device *dev)
intel_enable_gt_powersave(dev);
 
return 0;
+
+error:
+   atomic_or(I915_WEDGED, >reset_counter);
+   mutex_unlock(>struct_mutex);
+   return ret;
 }
 
 static int i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id 
*ent)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h