Re: [Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

2013-05-24 Thread Chris Wilson
On Fri, May 24, 2013 at 11:03:06AM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter  wrote:
> > The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> > with our accounting. But in general I really prefer an angry user with
> > a stuck-process backtrace than trying to paper over our own
> > shortcomings with hacks. And at least for the gpu hang vs.
> > pageflip/set_base we should now be fairly well-covered with tests. And
> > I haven't yet seen a report indicating that there's another issue
> > looming ...
> >
> > I guess an obnoxious WARN with a return 0; would also work for the
> > time out case. But I prefer to just ditch the timeout.
> 
> Poke ... I still think setting wedged here is a bit risky.

The alternative is that every ioctl then takes 10s. But if we do set
wedged and it recovers, it is reset to 0. If the reset fails, it is also
set to wedged. It's a fugly thing to do, but at that level of paranoia,
I think it is the right thing to do.
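
(For illustration, a simplified sketch of that recovery path; the real
reset_counter encodes an in-progress flag as well, so the bare 0 below
is shorthand for the reset work rearming the counter, not actual code:)

	/* In the reset work: a successful reset clears the wedged state
	 * again, a failed reset leaves it set. */
	if (i915_reset(dev) == 0)
		atomic_set(&error->reset_counter, 0);	/* recovered */
	else
		atomic_set(&error->reset_counter, I915_WEDGED);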

> And maybe
> add a comment why wait_for_error eats the -EIO (and why
> intel_ring_begin does not eat the -EIO)?

Sure. The patch today is slightly different anyway...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

2013-05-24 Thread Daniel Vetter
On Wed, May 1, 2013 at 6:23 PM, Daniel Vetter  wrote:
> On Wed, May 1, 2013 at 3:01 PM, Chris Wilson  wrote:
>> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson  wrote:
>>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson  wrote:
>>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>>> >> > must allow userspace to continue to use its mapping of bo in order to
>>> >> > recover its data (and in some cases in order for memory management to
>>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>>> >> > currently accessed by the GPU and so we can complete any waits or domain
>>> >> > transitions away from the GPU. Currently, we fail this essential task
>>> >> > and instead report EIO and send a SIGBUS to the affected process -
>>> >> > causing major loss of data (by killing X or compiz).
>>> >> >
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>>> >> > Signed-off-by: Chris Wilson 
>>> >>
>>> >> So I've read again through the reset code and I still don't see how
>>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>>> >> -EIO eating after wait_rendering looks really suspicious to me.
>>> >
>>> > That's always been a layer of defense against the driver. Alternatively,
>>> > we can throw a warn into the wait as we shouldn't enter there with
>>> > a seqno and a wedged GPU.
>>>
>>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>>> outstanding request while the gpu is wedged (ignoring dangerous changes
>>> to the wedged state through debugfs).
>>
>> Except it is only true for a locked wait. :(
>
> Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
> least nonblocking) waits. We'd get that by simply dropping all the
> check_wedge calls from the wait functions. That would leave us with a
> check_wedge in throttle and intel_ring_begin, which I think are the
> two places we really want that.
>
>>> >> Now the other thing is i915_gem_object_wait_rendering, that thing
>>> >> loves to throw an -EIO at us. And on a quick
>>> >> check your patch misses the one in set_domain_ioctl. We probably need
>>> >> to do the same with sw_finish_ioctl. So what about a
>>> >> i915_mutex_lock_interruptible_no_EIO or similar to explicitly annotate
>>> >> the few places we don't want to hear about a dead gpu?
>>> >
>>> > In fact, it turns out that we hardly ever want
>>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>>> > that EIO was only ever returned when we tried to issue a command on the
>>> > GPU whilst it was terminally wedged - in order to preserve data
>>> > integrity.
>>>
>>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>>> userspace has no idea ever that something is amiss ...
>>
>> It will catch the EIO as soon as it tries to emit the command, but
>> adding the explicit check will save a ton of work in reserving buffers.
>
> Right, I've forgotten about the check in intel_ring_begin.
>
>>> Otherwise I think this approach would also make sense, and feels more
>>> future-proof. Applications that want real robustness simply need to
>>> query the kernel through the new interface whether anything ugly has
>>> happened to them. And if we want more synchronous feedback we can
>>> always improve wait/set_domain_ioctl to check whether the given bo
>>> suffered through a little calamity.
>>
>> The only explicit point we have to remember is to preserve the
>> throttle() semantics of reporting EIO when wedged.
>
> I think plugging the leak in execbuf is still good to avoid queueing
> up tons of crap (since we really don't want to do that). But I think
> with the -EIO check in intel_ring_begin we are covered.
>
>>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> > index 2bd8d7a..f243e32 100644
>>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >
>>> > /* GPU is already declared terminally dead, give up. */
>>> > if (i915_terminally_wedged(error))
>>> > -   return -EIO;
>>> > +   return 0;
>>> >
>>> > /*
>>> >  * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>>> >10*HZ);
>>> > if (ret == 0) {
>>> > DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>>> > -   return -EIO;
>>> > -   } else if (ret < 0) {
>>> > -   return ret;
>>> > -   }
>>> > +   atomic_set(&error->reset_counter, I915_WEDGED);
>>> > +   } else if (ret > 0)
>>> > +   ret = 0;
>>>
>>> Less convinced about this hunk here. I think the right approach would
>>> be to simply kill the timeout. Unlucky users can always get out of
>>> here with a signal, and it's imo more robust if we have all relevant
>>> timeouts for the reset process in the reset work. Separate patch
>>> though. Also we now have quite good coverage of all the reset handling
>>> in igt, so I don't think we need to go overboard in defending against
>>> our own logic screw-ups ... Reset really should work nowadays.
>>
>> Apart from #63291 saying otherwise (iirc)... I think several layers of
>> paranoia over handling GPU and driver hangs are justified.
>
> The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
> with our accounting. But in general I really prefer an angry user with
> a stuck-process backtrace than trying to paper over our own
> shortcomings with hacks. And at least for the gpu hang vs.
> pageflip/set_base we should now be fairly well-covered with tests. And
> I haven't yet seen a report indicating that there's another issue
> looming ...
>
> I guess an obnoxious WARN with a return 0; would also work for the
> time out case. But I prefer to just ditch the timeout.

Poke ... I still think setting wedged here is a bit risky. And maybe
add a comment why wait_for_error eats the -EIO (and why
intel_ring_begin does not eat the -EIO)?

Re: [Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

2013-05-01 Thread Daniel Vetter
On Wed, May 1, 2013 at 3:01 PM, Chris Wilson  wrote:
> On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson  wrote:
>> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson  wrote:
>> >> > If reset fails, the GPU is declared wedged. This ideally should never
>> >> > happen, but very rarely it does. After the GPU is declared wedged, we
>> >> > must allow userspace to continue to use its mapping of bo in order to
>> >> > recover its data (and in some cases in order for memory management to
>> >> > continue unabated). Obviously after the GPU is wedged, no bo are
>> >> > currently accessed by the GPU and so we can complete any waits or domain
>> >> > transitions away from the GPU. Currently, we fail this essential task
>> >> > and instead report EIO and send a SIGBUS to the affected process -
>> >> > causing major loss of data (by killing X or compiz).
>> >> >
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> >> > Signed-off-by: Chris Wilson 
>> >>
>> >> So I've read again through the reset code and I still don't see how
>> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> >> -EIO eating after wait_rendering looks really suspicious to me.
>> >
>> > That's always been a layer of defense against the driver. Alternatively,
>> > we can throw a warn into the wait as we shouldn't enter there with
>> > a seqno and a wedged GPU.
>>
>> Yeah, that sounds like a good idea. We really shouldn't ever have an
>> outstanding request while the gpu is wedged (ignoring dangerous changes
>> to the wedged state through debugfs).
>
> Except it is only true for a locked wait. :(

Hm, indeed ... we need to convert any -EIO into a -ERESTARTSYS in (at
least nonblocking) waits. We'd get that by simply dropping all the
check_wedge calls from the wait functions. That would leave us with a
check_wedge in throttle and intel_ring_begin, which I think are the
two places we really want that.
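
A sketch of what I mean (helper name assumed, not an actual patch):

	/* Unlocked waits stop translating a wedged GPU into -EIO; the
	 * restart sends userspace back through the ioctl, and the
	 * authoritative check_wedge stays in throttle/intel_ring_begin.
	 */
	static int check_wedge_nonblocking(struct i915_gpu_error *error)
	{
		if (i915_reset_in_progress(error))
			return -ERESTARTSYS;
		return 0;
	}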

>> >> Now the other thing is i915_gem_object_wait_rendering, that thing
>> >> loves to throw an -EIO at us. And on a quick
>> >> check your patch misses the one in set_domain_ioctl. We probably need
>> >> to do the same with sw_finish_ioctl. So what about a
>> >> i915_mutex_lock_interruptible_no_EIO or similar to explicitly annotate
>> >> the few places we don't want to hear about a dead gpu?
>> >
>> > In fact, it turns out that we hardly ever want
>> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
>> > that EIO was only ever returned when we tried to issue a command on the
>> > GPU whilst it was terminally wedged - in order to preserve data
>> > integrity.
>>
>> Don't we need to re-add a check in execbuf then? Otherwise I guess
>> userspace has no idea ever that something is amiss ...
>
> It will catch the EIO as soon as it tries to emit the command, but
> adding the explicit check will save a ton of work in reserving buffers.

Right, I've forgotten about the check in intel_ring_begin.

>> Otherwise I think this approach would also make sense, and feels more
>> future-proof. Applications that want real robustness simply need to
>> query the kernel through the new interface whether anything ugly has
>> happened to them. And if we want more synchronous feedback we can
>> always improve wait/set_domain_ioctl to check whether the given bo
>> suffered through a little calamity.
>
> The only explicit point we have to remember is to preserve the
> throttle() semantics of reporting EIO when wedged.

I think plugging the leak in execbuf is still good to avoid queueing
up tons of crap (since we really don't want to do that). But I think
with the -EIO check in intel_ring_begin we are covered.

>> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> > index 2bd8d7a..f243e32 100644
>> > --- a/drivers/gpu/drm/i915/i915_gem.c
>> > +++ b/drivers/gpu/drm/i915/i915_gem.c
>> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >
>> > /* GPU is already declared terminally dead, give up. */
>> > if (i915_terminally_wedged(error))
>> > -   return -EIO;
>> > +   return 0;
>> >
>> > /*
>> >  * Only wait 10 seconds for the gpu reset to complete to avoid hanging
>> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>> >10*HZ);
>> > if (ret == 0) {
>> > DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
>> > -   return -EIO;
>> > -   } else if (ret < 0) {
>> > -   return ret;
>> > -   }
>> > +   atomic_set(&error->reset_counter, I915_WEDGED);
>> > +   } else if (ret > 0)
>> > +   ret = 0;
>>
>> Less convinced about this hunk here. I think the right approach would
>> be to simply kill the timeout. Unlucky users can always get out of
>> here with a signal, and it's imo more robust if we have all relevant
>> timeouts for the reset process in the reset work. Separate patch
>> though. Also we now have quite good coverage of all the reset handling
>> in igt, so I don't think we need to go overboard in defending against
>> our own logic screw-ups ... Reset really should work nowadays.
>
> Apart from #63291 saying otherwise (iirc)... I think several layers of
> paranoia over handling GPU and driver hangs are justified.

The atomic_set(WEDGED) is imo very dangerous, since it'll wreak havoc
with our accounting. But in general I really prefer an angry user with
a stuck-process backtrace than trying to paper over our own
shortcomings with hacks. And at least for the gpu hang vs.
pageflip/set_base we should now be fairly well-covered with tests. And
I haven't yet seen a report indicating that there's another issue
looming ...

I guess an obnoxious WARN with a return 0; would also work for the
time out case. But I prefer to just ditch the timeout.

Re: [Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

2013-05-01 Thread Chris Wilson
On Wed, May 01, 2013 at 02:40:43PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 1:06 PM, Chris Wilson  wrote:
> > On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> >> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson  wrote:
> >> > If reset fails, the GPU is declared wedged. This ideally should never
> >> > happen, but very rarely it does. After the GPU is declared wedged, we
> >> > must allow userspace to continue to use its mapping of bo in order to
> >> > recover its data (and in some cases in order for memory management to
> >> > continue unabated). Obviously after the GPU is wedged, no bo are
> >> > currently accessed by the GPU and so we can complete any waits or domain
> >> > transitions away from the GPU. Currently, we fail this essential task
> >> > and instead report EIO and send a SIGBUS to the affected process -
> >> > causing major loss of data (by killing X or compiz).
> >> >
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> >> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> >> > Signed-off-by: Chris Wilson 
> >>
> >> So I've read again through the reset code and I still don't see how
> >> wait_rendering can ever give us -EIO once the gpu is dead. So all the
> >> -EIO eating after wait_rendering looks really suspicious to me.
> >
> > That's always been a layer of defense against the driver. Alternatively,
> > we can throw a warn into the wait as we shouldn't enter there with
> > a seqno and a wedged GPU.
> 
> Yeah, that sounds like a good idea. We really shouldn't ever have an
> outstanding request while the gpu is wedged (ignoring dangerous changes
> to the wedged state through debugfs).

Except it is only true for a locked wait. :(

> >> Now the other thing is i915_gem_object_wait_rendering, that thing
> >> loves to throw an -EIO at us. And on a quick
> >> check your patch misses the one in set_domain_ioctl. We probably need
> >> to do the same with sw_finish_ioctl. So what about a
> >> i915_mutex_lock_interruptible_no_EIO or similar to explicitly annotate
> >> the few places we don't want to hear about a dead gpu?
> >
> > In fact, it turns out that we hardly ever want
> > i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> > that EIO was only ever returned when we tried to issue a command on the
> > GPU whilst it was terminally wedged - in order to preserve data
> > integrity.
> 
> Don't we need to re-add a check in execbuf then? Otherwise I guess
> userspace has no idea ever that something is amiss ...

It will catch the EIO as soon as it tries to emit the command, but
adding the explicit check will save a ton of work in reserving buffers.
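
(Roughly the shape of that check, sketched from memory; the space
reservation helper at the end is a stand-in, not the real function:)

	int intel_ring_begin(struct intel_ring_buffer *ring, int num_dwords)
	{
		struct drm_i915_private *dev_priv = ring->dev->dev_private;
		int ret;

		/* This is where -EIO is meant to surface: refuse to emit
		 * commands on a terminally wedged GPU. */
		ret = i915_gem_check_wedge(&dev_priv->gpu_error,
					   dev_priv->mm.interruptible);
		if (ret)
			return ret;

		return intel_ring_reserve_space(ring, num_dwords);	/* stand-in */
	}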
 
> Otherwise I think this approach would also make sense, and feels more
> future-proof. Applications that want real robustness simply need to
> query the kernel through the new interface whether anything ugly has
> happened to them. And if we want more synchronous feedback we can
> always improve wait/set_domain_ioctl to check whether the given bo
> suffered through a little calamity.

The only explicit point we have to remember is to preserve the
throttle() semantics of reporting EIO when wedged.
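
(i.e. throttle keeps something like the following, shape assumed, so a
wedged GPU is still reported through the one ioctl userspace polls for
GPU health:)

	ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
	if (ret)
		return ret;	/* still -EIO when terminally wedged */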
 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 2bd8d7a..f243e32 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >
> > /* GPU is already declared terminally dead, give up. */
> > if (i915_terminally_wedged(error))
> > -   return -EIO;
> > +   return 0;
> >
> > /*
> >  * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> > @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
> >10*HZ);
> > if (ret == 0) {
> > DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> > -   return -EIO;
> > -   } else if (ret < 0) {
> > -   return ret;
> > -   }
> > +   atomic_set(&error->reset_counter, I915_WEDGED);
> > +   } else if (ret > 0)
> > +   ret = 0;
> 
> Less convinced about this hunk here. I think the right approach would
> be to simply kill the timeout. Unlucky users can always get out of
> here with a signal, and it's imo more robust if we have all relevant
> timeouts for the reset process in the reset work. Separate patch
> though. Also we now have quite good coverage of all the reset handling
> in igt, so I don't think we need to go overboard in defending against
> our own logic screw-ups ... Reset really should work nowadays.

Apart from #63291 saying otherwise (iirc)... I think several layers of
paranoia over handling GPU and driver hangs are justified.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

Re: [Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

2013-05-01 Thread Daniel Vetter
On Wed, May 1, 2013 at 1:06 PM, Chris Wilson  wrote:
> On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
>> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson  wrote:
>> > If reset fails, the GPU is declared wedged. This ideally should never
>> > happen, but very rarely it does. After the GPU is declared wedged, we
>> > must allow userspace to continue to use its mapping of bo in order to
>> > recover its data (and in some cases in order for memory management to
>> > continue unabated). Obviously after the GPU is wedged, no bo are
>> > currently accessed by the GPU and so we can complete any waits or domain
>> > transitions away from the GPU. Currently, we fail this essential task
>> > and instead report EIO and send a SIGBUS to the affected process -
>> > causing major loss of data (by killing X or compiz).
>> >
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
>> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
>> > Signed-off-by: Chris Wilson 
>>
>> So I've read again through the reset code and I still don't see how
>> wait_rendering can ever give us -EIO once the gpu is dead. So all the
>> -EIO eating after wait_rendering looks really suspicious to me.
>
> That's always been a layer of defense against the driver. Alternatively,
> we can throw a warn into the wait as we shouldn't enter there with
> a seqno and a wedged GPU.

Yeah, that sounds like a good idea. We really shouldn't ever have an
outstanding request while the gpu is wedged (ignoring dangerous changes
to the wedged state through debugfs).

>> Now the other thing is i915_gem_object_wait_rendering, that thing
>> loves to throw an -EIO at us. And on a quick
>> check your patch misses the one in set_domain_ioctl. We probably need
>> to do the same with sw_finish_ioctl. So what about a
>> i915_mutex_lock_interruptible_no_EIO or similar to explicitly annotate
>> the few places we don't want to hear about a dead gpu?
>
> In fact, it turns out that we hardly ever want
> i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
> that EIO was only ever returned when we tried to issue a command on the
> GPU whilst it was terminally wedged - in order to preserve data
> integrity.

Don't we need to re-add a check in execbuf then? Otherwise I guess
userspace has no idea ever that something is amiss ...

Otherwise I think this approach would also make sense, and feels more
future-proof. Applications that want real robustness simply need to
query the kernel through the new interface whether anything ugly has
happened to them. And if we want more synchronous feedback we can
always improve wait/set_domain_ioctl to check whether the given bo
suffered through a little calamity.

> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 2bd8d7a..f243e32 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>
> /* GPU is already declared terminally dead, give up. */
> if (i915_terminally_wedged(error))
> -   return -EIO;
> +   return 0;
>
> /*
>  * Only wait 10 seconds for the gpu reset to complete to avoid hanging
> @@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
>10*HZ);
> if (ret == 0) {
> DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
> -   return -EIO;
> -   } else if (ret < 0) {
> -   return ret;
> -   }
> +   atomic_set(&error->reset_counter, I915_WEDGED);
> +   } else if (ret > 0)
> +   ret = 0;

Less convinced about this hunk here. I think the right approach would
be to simply kill the timeout. Unlucky users can always get out of
here with a signal, and it's imo more robust if we have all relevant
timeouts for the reset process in the reset work. Separate patch
though. Also we now have quite good coverage of all the reset handling
in igt, so I don't think we need to go overboard in defending against
our own logic screw-ups ... Reset really should work nowadays.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

2013-05-01 Thread Chris Wilson
On Wed, May 01, 2013 at 12:38:27PM +0200, Daniel Vetter wrote:
> On Wed, May 1, 2013 at 12:25 PM, Chris Wilson  wrote:
> > If reset fails, the GPU is declared wedged. This ideally should never
> > happen, but very rarely it does. After the GPU is declared wedged, we
> > must allow userspace to continue to use its mapping of bo in order to
> > recover its data (and in some cases in order for memory management to
> > continue unabated). Obviously after the GPU is wedged, no bo are
> > currently accessed by the GPU and so we can complete any waits or domain
> > transitions away from the GPU. Currently, we fail this essential task
> > and instead report EIO and send a SIGBUS to the affected process -
> > causing major loss of data (by killing X or compiz).
> >
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> > References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> > Signed-off-by: Chris Wilson 
> 
> So I've read again through the reset code and I still don't see how
> wait_rendering can ever give us -EIO once the gpu is dead. So all the
> -EIO eating after wait_rendering looks really suspicious to me.

That's always been a layer of defense against the driver. Alternatively,
we can throw a warn into the wait as we shouldn't enter there with
a seqno and a wedged GPU.
 
> Now the other thing is i915_gem_object_wait_rendering, that thing
> loves to throw an -EIO at us. And on a quick
> check your patch misses the one in set_domain_ioctl. We probably need
> to do the same with sw_finish_ioctl. So what about a
> i915_mutex_lock_interruptible_no_EIO or similar to explicitly annotate
> the few places we don't want to hear about a dead gpu?

In fact, it turns out that we hardly ever want
i915_mutex_lock_interruptible to return -EIO. Long ago, the idea was
that EIO was only ever returned when we tried to issue a command on the
GPU whilst it was terminally wedged - in order to preserve data
integrity.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 2bd8d7a..f243e32 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -97,7 +97,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 
/* GPU is already declared terminally dead, give up. */
if (i915_terminally_wedged(error))
-   return -EIO;
+   return 0;
 
/*
 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
@@ -109,13 +109,12 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
   10*HZ);
if (ret == 0) {
DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
-   return -EIO;
-   } else if (ret < 0) {
-   return ret;
-   }
+   atomic_set(&error->reset_counter, I915_WEDGED);
+   } else if (ret > 0)
+   ret = 0;
 #undef EXIT_COND
 
-   return 0;
+   return ret;
 }
 
 int i915_mutex_lock_interruptible(struct drm_device *dev)

 
> And if the chances of us breaking bo waiting are too high we can
> always add a few crazy igts which manually wedge the gpu to test them
> and ensure they all work.

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH] drm/i915: Suppress spurious EIO when moving away from the GPU domain

2013-05-01 Thread Daniel Vetter
On Wed, May 1, 2013 at 12:25 PM, Chris Wilson  wrote:
> If reset fails, the GPU is declared wedged. This ideally should never
> happen, but very rarely it does. After the GPU is declared wedged, we
> must allow userspace to continue to use its mapping of bo in order to
> recover its data (and in some cases in order for memory management to
> continue unabated). Obviously after the GPU is wedged, no bo are
> currently accessed by the GPU and so we can complete any waits or domain
> transitions away from the GPU. Currently, we fail this essential task
> and instead report EIO and send a SIGBUS to the affected process -
> causing major loss of data (by killing X or compiz).
>
> References: https://bugs.freedesktop.org/show_bug.cgi?id=63921
> References: https://bugs.freedesktop.org/show_bug.cgi?id=64073
> Signed-off-by: Chris Wilson 

So I've read again through the reset code and I still don't see how
wait_rendering can ever give us -EIO once the gpu is dead. So all the
-EIO eating after wait_rendering looks really suspicious to me.

Now the other thing is i915_gem_object_wait_rendering, that thing
loves to throw an -EIO at us. And on a quick
check your patch misses the one in set_domain_ioctl. We probably need
to do the same with sw_finish_ioctl. So what about a
i915_mutex_lock_interruptible_no_EIO or similar to explicitly annotate
the few places we don't want to hear about a dead gpu?
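
Hypothetically something like this (name and shape assumed):

	static int i915_mutex_lock_interruptible_no_EIO(struct drm_device *dev)
	{
		struct drm_i915_private *dev_priv = dev->dev_private;
		int ret;

		/* Wait for a pending reset, but don't turn a terminally
		 * wedged GPU into -EIO for ioctls that only move a bo
		 * away from the GPU domain. */
		ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
		if (ret && ret != -EIO)
			return ret;

		return mutex_lock_interruptible(&dev->struct_mutex);
	}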

And if the chances of us breaking bo waiting are too high we can
always add a few crazy igts which manually wedge the gpu to test them
and ensure they all work.
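
(Sketch of how such a test could wedge the GPU by hand; debugfs path
and exact semantics assumed:)

	#include <fcntl.h>
	#include <unistd.h>

	static void wedge_gpu_via_debugfs(void)
	{
		/* i915_wedged lets root poke the wedged state for testing */
		int fd = open("/sys/kernel/debug/dri/0/i915_wedged", O_WRONLY);
		if (fd >= 0) {
			write(fd, "1", 1);
			close(fd);
		}
	}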

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch