Re: [Nouveau] [PATCH v2 4/4] drm/nouveau: gpu lockup recovery

2012-04-28 Thread Marcin Slusarz
On Thu, Apr 26, 2012 at 05:32:29PM +1000, Ben Skeggs wrote:
> On Wed, 2012-04-25 at 23:20 +0200, Marcin Slusarz wrote:
> > Overall idea:
> > Detect lockups by watching for timeouts (vm flush / fence), return -EIOs,
> > handle them at the ioctl level, reset the GPU, and repeat the last ioctl.
> > 
> > GPU reset is done by doing a suspend / resume cycle with a few tweaks:
> > - CPU-only bo eviction
> > - ignoring vm flush / fence timeouts
> > - shortening waits
> Okay.  I've thought about this a bit for a couple of days and think I'll
> be able to coherently share my thoughts on this issue now :)
> 
> Firstly, while I agree that we need to become more resilient to errors,
> I don't think that following in the radeon/intel footsteps with
> something (imo, hackish) like this is necessarily the right choice
> for us.

This is not only the radeon/intel way. Windows has done the same since
Vista SP1 - see http://msdn.microsoft.com/en-us/windows/hardware/gg487368.
It's funny how similar it is to this patch (I hadn't seen that page before).

If you fear people will stop reporting bugs - don't. GPU reset is painfully
slow and can take up to 50 seconds (BO eviction is the most time-consuming
part), so people will be annoyed enough to report them.
Currently, GPU lockups make users so angry that they frequently switch to
the blob without even thinking about reporting anything.

> The *vast* majority of "lockups" we have are a result of us badly
> mishandling exceptions reported to us by the GPU.  There are a couple of
> exceptions, but they're very rare.

> A very common example is where people hit DMA_PUSHER errors for whatever
> reason, and things eventually go haywire.

Nope, I hit tens of lockups during testing, and only once did I see a
DMA_PUSHER before detecting a GPU lockup.

> To handle a DMA_PUSHER
> sanely, generally you have to drop all pending commands for the channel
> (set GET=PUT, etc) and continue on.  However, this leaves us with fences
> and semaphores unsignalled etc, causing issues further up the stack with
> perfectly good channels hanging on attempting to sync with the crashed
> channel etc.
> 
> The next most common example I can think of is nv4x hardware, getting a
> LIMIT_COLOR/ZETA exception from PGRAPH, and then a hang.  The solution
> is simple: learn how to handle the exception, log it, and PGRAPH
> survives.
> 
> I strongly believe that if we focus our efforts on dealing with what
> the GPU reports to us a lot better, we'll find we really don't need such
> "lockup recovery".

While I agree we need to improve error handling to make "lockup recovery"
unnecessary, the reality is we can't predict everything, and the driver
needs to cope with its own bugs.

> I am, however, considering pulling the vm flush timeout error
> propagation and the break-out-of-waits-on-signals that builds on it, as
> we really do need to become better at having killable processes if
> things go wrong :)

Good :)

Marcin


Re: [Nouveau] [PATCH v2 4/4] drm/nouveau: gpu lockup recovery

2012-04-26 Thread Ben Skeggs
On Wed, 2012-04-25 at 23:20 +0200, Marcin Slusarz wrote:
> Overall idea:
> Detect lockups by watching for timeouts (vm flush / fence), return -EIOs,
> handle them at the ioctl level, reset the GPU, and repeat the last ioctl.
> 
> GPU reset is done by doing a suspend / resume cycle with a few tweaks:
> - CPU-only bo eviction
> - ignoring vm flush / fence timeouts
> - shortening waits
Okay.  I've thought about this a bit for a couple of days and think I'll
be able to coherently share my thoughts on this issue now :)

Firstly, while I agree that we need to become more resilient to errors,
I don't think that following in the radeon/intel footsteps with
something (imo, hackish) like this is necessarily the right choice
for us.

The *vast* majority of "lockups" we have are a result of us badly
mishandling exceptions reported to us by the GPU.  There are a couple of
exceptions, but they're very rare.

A very common example is where people hit DMA_PUSHER errors for whatever
reason, and things eventually go haywire.  To handle a DMA_PUSHER
sanely, generally you have to drop all pending commands for the channel
(set GET=PUT, etc) and continue on.  However, this leaves us with fences
and semaphores unsignalled etc, causing issues further up the stack with
perfectly good channels hanging on attempting to sync with the crashed
channel etc.
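
To make the GET=PUT recovery above concrete, here is a minimal sketch
under stated assumptions: the register offsets are illustrative, the
fence clean-up is described rather than implemented, and none of this
is code from the patch.

#include "nouveau_drv.h"

/* Sketch only: drop a crashed channel's pending commands and deal with
 * the fences they would have signalled.  The offsets are examples, not
 * the real per-channel PFIFO layout. */
static void example_dma_pusher_recover(struct drm_device *dev,
				       struct nouveau_channel *chan)
{
	/* Skip everything still queued for this channel: GET = PUT. */
	u32 put = nv_rd32(dev, 0x003240);	/* assumed DMA_PUT register */
	nv_wr32(dev, 0x003244, put);		/* assumed DMA_GET register */

	/* The skipped commands included the fence/semaphore writes, so
	 * chan's outstanding fences will never signal on their own.
	 * Without completing them by hand (e.g. storing the expected
	 * sequence numbers), perfectly good channels that sync on this
	 * one hang too - exactly the problem described above. */
}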

The next most common example I can think of is nv4x hardware, getting a
LIMIT_COLOR/ZETA exception from PGRAPH, and then a hang.  The solution
is simple: learn how to handle the exception, log it, and PGRAPH
survives.
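
As a sketch of what "learn how to handle the exception, log it" can
look like in the PGRAPH interrupt handler - the 0x00100000 status bit
is an assumption here, not the documented nv4x layout:

#include "nouveau_drv.h"

/* Sketch only: ack a LIMIT_COLOR/ZETA data error instead of letting
 * PGRAPH wedge. */
static void example_nv40_graph_isr(struct drm_device *dev)
{
	u32 stat = nv_rd32(dev, NV03_PGRAPH_INTR);

	if (stat & 0x00100000) {	/* assumed LIMIT_COLOR/ZETA bit */
		NV_ERROR(dev, "PGRAPH: DATA_ERROR LIMIT_COLOR/ZETA\n");
		/* Log it and acknowledge it; PGRAPH carries on. */
		nv_wr32(dev, NV03_PGRAPH_INTR, 0x00100000);
		stat &= ~0x00100000;
	}

	/* Remaining status bits would be handled as before. */
	if (stat)
		nv_wr32(dev, NV03_PGRAPH_INTR, stat);
}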

I strongly believe that if we focus our efforts on dealing with what
the GPU reports to us a lot better, we'll find we really don't need such
"lockup recovery".

I am, however, considering pulling the vm flush timeout error
propagation and the break-out-of-waits-on-signals that builds on it, as
we really do need to become better at having killable processes if
things go wrong :)
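
A minimal sketch of such a signal-aware wait, assuming a hypothetical
example_fence_signalled() predicate rather than the driver's real wait
loop:

#include "nouveau_drv.h"	/* pulls in drmP.h etc. */

/* Sketch only: give up the wait as soon as a signal is pending, so a
 * task stuck on a dead GPU stays killable instead of spinning until
 * the timeout. */
static int example_fence_wait(struct nouveau_fence *fence, bool intr)
{
	unsigned long timeout = jiffies + 3 * DRM_HZ;

	while (!example_fence_signalled(fence)) {	/* assumed helper */
		if (time_after(jiffies, timeout))
			return -EBUSY;		/* lockup: propagate an error */
		if (intr && signal_pending(current))
			return -ERESTARTSYS;	/* break out on signals */
		schedule_timeout_interruptible(1);
	}
	return 0;
}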

Ben.

> 
> Signed-off-by: Marcin Slusarz 
> ---
>  drivers/gpu/drm/nouveau/Makefile   |2 +-
>  drivers/gpu/drm/nouveau/nouveau_bo.c   |2 +-
>  drivers/gpu/drm/nouveau/nouveau_channel.c  |5 +-
>  drivers/gpu/drm/nouveau/nouveau_drv.c  |   56 ++-
>  drivers/gpu/drm/nouveau/nouveau_drv.h  |   45 -
>  drivers/gpu/drm/nouveau/nouveau_fence.c|7 +-
>  drivers/gpu/drm/nouveau/nouveau_gem.c  |   14 +++-
>  drivers/gpu/drm/nouveau/nouveau_notifier.c |3 +
>  drivers/gpu/drm/nouveau/nouveau_object.c   |6 +
>  drivers/gpu/drm/nouveau/nouveau_reset.c|  148 
>  drivers/gpu/drm/nouveau/nouveau_state.c|6 +
>  drivers/gpu/drm/nouveau/nv50_graph.c   |   11 +-
>  12 files changed, 290 insertions(+), 15 deletions(-)
>  create mode 100644 drivers/gpu/drm/nouveau/nouveau_reset.c
> 
> diff --git a/drivers/gpu/drm/nouveau/Makefile b/drivers/gpu/drm/nouveau/Makefile
> index 03860f5..77d0c33 100644
> --- a/drivers/gpu/drm/nouveau/Makefile
> +++ b/drivers/gpu/drm/nouveau/Makefile
> @@ -9,7 +9,7 @@ nouveau-y := nouveau_drv.o nouveau_state.o nouveau_channel.o nouveau_mem.o \
>   nouveau_bo.o nouveau_fence.o nouveau_gem.o nouveau_ttm.o \
>   nouveau_hw.o nouveau_calc.o nouveau_bios.o nouveau_i2c.o \
>   nouveau_display.o nouveau_connector.o nouveau_fbcon.o \
> - nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o \
> + nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o nouveau_reset.o \
>nouveau_pm.o nouveau_volt.o nouveau_perf.o nouveau_temp.o \
>nouveau_mm.o nouveau_vm.o nouveau_mxm.o nouveau_gpio.o \
>   nv04_timer.o \
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 5b0dc50..7de6cad 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -936,7 +936,7 @@ nouveau_bo_move(struct ttm_buffer_object *bo, bool evict, bool intr,
>   }
>  
>   /* Software copy if the card isn't up and running yet. */
> - if (!dev_priv->channel) {
> + if (!dev_priv->channel || nouveau_gpu_reset_in_progress(dev_priv->dev)) {
>   ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, 
> no_wait_gpu, new_mem);
>   goto out;
>   }
> diff --git a/drivers/gpu/drm/nouveau/nouveau_channel.c b/drivers/gpu/drm/nouveau/nouveau_channel.c
> index 846afb0..c0fa5a7 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_channel.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_channel.c
> @@ -420,7 +420,7 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
>   init->fb_ctxdma_handle,
>   init->tt_ctxdma_handle);
>   if (ret)
> - return ret;
> + goto out;
>   init->channel  = chan->id;
>  
>   if (nouveau_vram_pushbuf == 0) {
> @@ -450,6 +450,9 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
>   if (ret ==


Re: [PATCH v2 4/4] drm/nouveau: gpu lockup recovery

2012-04-25 Thread Marcin Slusarz
On Wed, Apr 25, 2012 at 11:20:36PM +0200, Marcin Slusarz wrote:
> Overall idea:
> Detect lockups by watching for timeouts (vm flush / fence), return -EIOs,
> handle them at the ioctl level, reset the GPU, and repeat the last ioctl.
> 
> GPU reset is done by doing a suspend / resume cycle with a few tweaks:
> - CPU-only bo eviction
> - ignoring vm flush / fence timeouts
> - shortening waits
> 
> Signed-off-by: Marcin Slusarz 
> ---

What changed from v1:
- moved ioctl locking from drm core to nouveau
- made down_reads interruptible
- fixed build bug on 32-bit systems
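
To make the "return -EIOs, handle them at the ioctl level" flow quoted
above concrete, here is a condensed sketch of the top-level dispatcher.
The retry policy is an illustration built on the series'
nouveau_reset_device(), not necessarily its exact logic:

#include "drmP.h"

/* Sketch only: if any ioctl bubbles up -EIO (lockup detected), reset
 * the device and repeat the call.  drm_ioctl() is the normal DRM
 * dispatcher. */
static long example_nouveau_ioctl(struct file *filp, unsigned int cmd,
				  unsigned long arg)
{
	struct drm_file *file_priv = filp->private_data;
	struct drm_device *dev = file_priv->minor->dev;
	long ret;

	do {
		ret = drm_ioctl(filp, cmd, arg);
		if (ret != -EIO)
			break;
		ret = nouveau_reset_device(dev);
	} while (ret == 0);	/* reset succeeded: repeat the ioctl */

	return ret;
}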


[PATCH v2 4/4] drm/nouveau: gpu lockup recovery

2012-04-25 Thread Marcin Slusarz
Overall idea:
Detect lockups by watching for timeouts (vm flush / fence), return -EIOs,
handle them at the ioctl level, reset the GPU, and repeat the last ioctl.

GPU reset is done by doing a suspend / resume cycle with a few tweaks:
- CPU-only bo eviction
- ignoring vm flush / fence timeouts
- shortening waits

Signed-off-by: Marcin Slusarz 
---
 drivers/gpu/drm/nouveau/Makefile   |2 +-
 drivers/gpu/drm/nouveau/nouveau_bo.c   |2 +-
 drivers/gpu/drm/nouveau/nouveau_channel.c  |5 +-
 drivers/gpu/drm/nouveau/nouveau_drv.c  |   56 ++-
 drivers/gpu/drm/nouveau/nouveau_drv.h  |   45 -
 drivers/gpu/drm/nouveau/nouveau_fence.c|7 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c  |   14 +++-
 drivers/gpu/drm/nouveau/nouveau_notifier.c |3 +
 drivers/gpu/drm/nouveau/nouveau_object.c   |6 +
 drivers/gpu/drm/nouveau/nouveau_reset.c|  148 
 drivers/gpu/drm/nouveau/nouveau_state.c|6 +
 drivers/gpu/drm/nouveau/nv50_graph.c   |   11 +-
 12 files changed, 290 insertions(+), 15 deletions(-)
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_reset.c

diff --git a/drivers/gpu/drm/nouveau/Makefile b/drivers/gpu/drm/nouveau/Makefile
index 03860f5..77d0c33 100644
--- a/drivers/gpu/drm/nouveau/Makefile
+++ b/drivers/gpu/drm/nouveau/Makefile
@@ -9,7 +9,7 @@ nouveau-y := nouveau_drv.o nouveau_state.o nouveau_channel.o nouveau_mem.o \
  nouveau_bo.o nouveau_fence.o nouveau_gem.o nouveau_ttm.o \
  nouveau_hw.o nouveau_calc.o nouveau_bios.o nouveau_i2c.o \
  nouveau_display.o nouveau_connector.o nouveau_fbcon.o \
- nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o \
+ nouveau_hdmi.o nouveau_dp.o nouveau_ramht.o nouveau_reset.o \
 nouveau_pm.o nouveau_volt.o nouveau_perf.o nouveau_temp.o \
 nouveau_mm.o nouveau_vm.o nouveau_mxm.o nouveau_gpio.o \
  nv04_timer.o \
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 5b0dc50..7de6cad 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -936,7 +936,7 @@ nouveau_bo_move(struct ttm_buffer_object *bo, bool evict, bool intr,
}

/* Software copy if the card isn't up and running yet. */
-   if (!dev_priv->channel) {
+   if (!dev_priv->channel || nouveau_gpu_reset_in_progress(dev_priv->dev)) {
ret = ttm_bo_move_memcpy(bo, evict, no_wait_reserve, 
no_wait_gpu, new_mem);
goto out;
}
diff --git a/drivers/gpu/drm/nouveau/nouveau_channel.c b/drivers/gpu/drm/nouveau/nouveau_channel.c
index 846afb0..c0fa5a7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_channel.c
+++ b/drivers/gpu/drm/nouveau/nouveau_channel.c
@@ -420,7 +420,7 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
init->fb_ctxdma_handle,
init->tt_ctxdma_handle);
if (ret)
-   return ret;
+   goto out;
init->channel  = chan->id;

if (nouveau_vram_pushbuf == 0) {
@@ -450,6 +450,9 @@ nouveau_ioctl_fifo_alloc(struct drm_device *dev, void *data,
if (ret == 0)
atomic_inc(&chan->users); /* userspace reference */
nouveau_channel_put(&chan);
+out:
+   if (ret == -EIO)
+   ret = nouveau_reset_device(dev);
return ret;
 }

diff --git a/drivers/gpu/drm/nouveau/nouveau_drv.c b/drivers/gpu/drm/nouveau/nouveau_drv.c
index 090fff6..261e1f5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drv.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drv.c
@@ -237,7 +237,7 @@ nouveau_pci_suspend(struct pci_dev *pdev, pm_message_t pm_state)
if (!dev_priv->eng[e])
continue;

-   ret = dev_priv->eng[e]->fini(dev, e, true);
+   ret = dev_priv->eng[e]->fini(dev, e, 
!nouveau_gpu_reset_in_progress(dev));
if (ret) {
NV_ERROR(dev, "... engine %d failed: %d\n", e, ret);
goto out_abort;
@@ -443,11 +443,63 @@ nouveau_pci_resume(struct pci_dev *pdev)
return 0;
 }

+void intr_rwsem_init(struct intr_rwsem *r)
+{
+   init_rwsem(&r->rwsem);
+   mutex_init(&r->mutex);
+}
+
+int intr_rwsem_down_read_interruptible(struct intr_rwsem *r)
+{
+   while (down_read_trylock(&r->rwsem) == 0) {
+   int ret = mutex_lock_interruptible(&r->mutex);
+   if (ret)
+   return ret;
+   mutex_unlock(&r->mutex);
+   }
+   return 0;
+}
+
+void intr_rwsem_up_read(struct intr_rwsem *r)
+{
+   up_read(&r->rwsem);
+}
+
+void intr_rwsem_down_write(struct intr_rwsem *r)
+{
+   mutex_lock(&r->mutex);
+   down_write(&r->rwsem);
+}
+
+void intr_rwsem_up_write(struct intr_rwsem *r)
+{
+   up_write(&r->rwsem);
+   mutex_unlock(&r->mutex);
+}
+
+static long nouveau_ioctl(struct file *filp,
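
The intr_rwsem helpers above are the interesting part of this hunk:
readers take the rwsem via a trylock loop gated by an interruptible
mutex, so the reset path can exclude all ioctls while blocked tasks
remain killable.  A hedged sketch of how the two sides pair up - the
ioctl_rwsem field name and both call sites are assumptions, not code
from the patch:

/* Sketch only: readers are ordinary ioctls, the single writer is the
 * reset path. */
int example_ioctl_body(struct drm_device *dev)
{
	struct drm_nouveau_private *dev_priv = dev->dev_private;
	int ret;

	/* Interruptible: a task waiting out a GPU reset stays killable. */
	ret = intr_rwsem_down_read_interruptible(&dev_priv->ioctl_rwsem);
	if (ret)
		return ret;	/* signal arrived while a reset was running */

	/* ... normal ioctl work, which may return -EIO on lockup ... */
	ret = 0;

	intr_rwsem_up_read(&dev_priv->ioctl_rwsem);
	return ret;
}

int example_reset_device(struct drm_device *dev)
{
	struct drm_nouveau_private *dev_priv = dev->dev_private;

	/* Taking the mutex first parks new readers; taking the rwsem for
	 * write then waits for readers already inside to drain. */
	intr_rwsem_down_write(&dev_priv->ioctl_rwsem);
	/* ... suspend / resume cycle with the tweaks listed above ... */
	intr_rwsem_up_write(&dev_priv->ioctl_rwsem);
	return 0;
}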
