Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

2012-07-03 Thread Chris Wilson
My experience with these patches is that they make it less likely that
the hang is reported to the userspace in a timely fashion (filling the
ring full leads to lots of lost rendering) and worse make it much more
likely that i915_gem_fault() hits an EIO and goes bang. That is
unacceptable and trivial to hit with these patches. I have not yet
reproduced that issue using the same broken renderer without these
patches.

I do think the patches are a step in the right direction, but with the
change in userspace behaviour it has to be a NAK for the time being.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

2012-07-03 Thread Daniel Vetter
On Tue, Jul 3, 2012 at 5:59 PM, Chris Wilson ch...@chris-wilson.co.uk wrote:
 My experience with these patches is that they make it less likely that
 the hang is reported to the userspace in a timely fashion (filling the
 ring full leads to lots of lost rendering) and worse make it much more
 likely that i915_gem_fault() hits an EIO and goes bang. That is
 unacceptable and trivial to hit with these patches. I have not yet
 reproduced that issue using the same broken renderer without these
 patches.

Hm, I don't see how these patches allow much more rendering to be
queued up until we stop everything - for single-threaded userspace
that should be at most one additional batch (which might have been the
one that could catch the spurious -EIO). All subsequent execbuf ioctl
calls should stall when trying to grab the mutex.

Same for the case where the gpu reset failed: userspace should be able
to submit one more batch afaict before it gets an -EIO. So can you
please dig into what exactly you're seeing a bit more and unconfuse me?

 I do think the patches are a step in the right direction, but with the
 change in userspace behaviour it has to be a NAK for the time being.

Ok, I guess I'll have to throw the -EIO sigbus eater into the mix,
too. After all userspace is supposed to call set_domain(GTT) before
accessing the gtt mmap, so it should still get notice when the gpu has
died and rendering might be incomplete.

Imo we should still return -EIO for all ioctls, userspace should be
able to cope with these (In the future we might even add optional
behaviour to signal potentially dropped rendering due to a gpu reset
at wait_rendering for userspace that really cares). So would the
sigbus eater be good enough or do we need more?
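[Editorial note: the errno contract Daniel describes above — restart the syscall on -EAGAIN while a reset is pending, treat -EIO as final and fall back to software rendering — can be sketched in a hypothetical userspace submission loop. The mock_execbuf() helper and its canned result sequence are illustrative assumptions, not real libdrm API.]

```c
#include <errno.h>

/* Hypothetical stand-in for the execbuf ioctl; the canned result
 * sequence (two reset-pending restarts, then success) is an
 * assumption for illustration only. */
static int mock_results[] = { -EAGAIN, -EAGAIN, 0 };
static int mock_calls;

static int mock_execbuf(void)
{
    return mock_results[mock_calls++];
}

/* Submission policy sketched from the thread: -EAGAIN means a GPU
 * reset is in flight, so restart the call; a hard -EIO means the
 * GPU is wedged for good and the client should fall back to
 * software rendering. */
static int submit_batch(void)
{
    int ret;

    do {
        ret = mock_execbuf();
    } while (ret == -EAGAIN);

    return ret; /* 0 on success, -EIO if the GPU is truly dead */
}
```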

Thanks, Daniel
-- 
Daniel Vetter
daniel.vet...@ffwll.ch - +41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

2012-07-02 Thread Ben Widawsky
On Sun, 1 Jul 2012 12:41:19 +0200
Daniel Vetter dan...@ffwll.ch wrote:

 On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky b...@bwidawsk.net wrote:
  On Tue, 26 Jun 2012 23:08:50 +0200
  Daniel Vetter daniel.vet...@ffwll.ch wrote:
 
  Having this early check in intel_ring_begin doesn't buy us anything,
  since we'll be calling into wait_request in the usual case already
  anyway. In the corner case of not waiting for free space using the
  last_retired_head we simply need to do the same check, too.
 
  With these changes we'll only ever get an -EIO from intel_ring_begin
  if the gpu has truly been declared dead.
 
  v2: Don't conflate the change to prevent intel_ring_begin from returning
  a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
  just prep work to avoid returning -EAGAIN in callsites that can't handle
  syscall restarting.
 
  I'm really scared by this change. It's difficult to review because there
  are SO many callers of intel_ring_begin, and figuring out if they all
  work in the wedged case is simply too difficult. I've yet to review the
  rest of the series, but it may make more sense to put this change last
  perhaps?
 
 Well, this patch doesn't really affect much what error codes the
 callers get - we'll still throw both -EAGAIN and -EIO at them (later on
 patches will fix this).
 
 What this patch does though is prevent us from returning too many
 -EIO. Imagine the gpu died and we've noticed already (hence
 dev_priv->mm.wedged is set), but some process is stuck churning
 through the execbuf ioctl, holding dev->struct_mutex. While processing
 the execbuf we call intel_ring_begin to emit a few commands. Now
 usually, even when the gpu is dead, there is enough space in the ring
 to do so, which allows us to complete the execbuf ioctl and then later
 on we can block properly when trying to grab the mutex in the next
 ioctl for the gpu reset work handler to complete.

That in itself is a pretty big change, don't you think? It seems rather
strange and dangerous to modify HW (which we will if we allow execbuf to
continue when we write the tail pointer). I think we want some way to
not write the tail ptr in such a case. Now you might respond, well, who
cares about writes? But this imposes a pretty large restriction on any
code that can't work well after the GPU is hung.

I see the irony. I'm complaining that you can make GPU hangs unstable,
and the patch series is fixing GPU reset. Call it paranoia, it still
seems unsafe to me, and makes us have to think a bit more whenever
adding any code. Am I over-thinking this?
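[Editorial note: Ben's suggestion — never touch the hardware tail pointer once the GPU is known to be hung — could look roughly like the following. The ring structure, wedged flag, and the split between a software tail and a "hardware" tail here are simplified assumptions for illustration, not the actual i915 code.]

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified ring model for illustration; the real driver keeps the
 * wedged state in dev_priv->mm.wedged and writes the tail via MMIO. */
struct fake_ring {
    unsigned int tail;    /* software copy of the tail pointer */
    unsigned int hw_tail; /* what was last written to "hardware" */
    bool wedged;
};

/* Advance the ring only if the GPU is healthy; refuse to poke a
 * hung GPU's registers, per the concern raised above. */
static int ring_advance(struct fake_ring *ring)
{
    if (ring->wedged)
        return -EIO;

    ring->hw_tail = ring->tail;
    return 0;
}
```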

 
 But thanks to that wedged check in intel_ring_begin we'll instead
 return -EIO, despite the fact that later on we could successfully
 reset the gpu. Returning -EIO forces the X server to fall back to s/w
 rendering and disabling dri2, and in the case of a 3d compositor
 usually results in an abort. After this patch we can still return -EIO
 if the gpu is dead but the reset work hasn't completed yet, but only
 so if the ring is full (which in many cases is unlikely).
 
 Cheers, Daniel
 
 
 
  Signed-off-by: Daniel Vetter daniel.vet...@ffwll.ch
  ---
   drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ----
   1 file changed, 4 deletions(-)
 
  diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c 
  b/drivers/gpu/drm/i915/intel_ringbuffer.c
  index f30a53a..501546e 100644
  --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
  +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
  @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer 
  *ring, int n)
   int intel_ring_begin(struct intel_ring_buffer *ring,
 int num_dwords)
   {
  - struct drm_i915_private *dev_priv = ring->dev->dev_private;
int n = 4*num_dwords;
int ret;
 
  - if (unlikely(atomic_read(&dev_priv->mm.wedged)))
  - return -EIO;
  -
  if (unlikely(ring->tail + n > ring->effective_size)) {
ret = intel_wrap_ring_buffer(ring);
if (unlikely(ret))
 
 
 
  --
  Ben Widawsky, Intel Open Source Technology Center
 
 
 



-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

2012-07-01 Thread Daniel Vetter
On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky b...@bwidawsk.net wrote:
 On Tue, 26 Jun 2012 23:08:50 +0200
 Daniel Vetter daniel.vet...@ffwll.ch wrote:

 Having this early check in intel_ring_begin doesn't buy us anything,
 since we'll be calling into wait_request in the usual case already
 anyway. In the corner case of not waiting for free space using the
 last_retired_head we simply need to do the same check, too.

 With these changes we'll only ever get an -EIO from intel_ring_begin
 if the gpu has truly been declared dead.

 v2: Don't conflate the change to prevent intel_ring_begin from returning
 a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
 just prep work to avoid returning -EAGAIN in callsites that can't handle
 syscall restarting.

 I'm really scared by this change. It's difficult to review because there
 are SO many callers of intel_ring_begin, and figuring out if they all
 work in the wedged case is simply too difficult. I've yet to review the
 rest of the series, but it may make more sense to put this change last
 perhaps?

Well, this patch doesn't really affect much what error codes the
callers get - we'll still throw both -EAGAIN and -EIO at them (later on
patches will fix this).

What this patch does though is prevent us from returning too many
-EIO. Imagine the gpu died and we've noticed already (hence
dev_priv->mm.wedged is set), but some process is stuck churning
through the execbuf ioctl, holding dev->struct_mutex. While processing
the execbuf we call intel_ring_begin to emit a few commands. Now
usually, even when the gpu is dead, there is enough space in the ring
to do so, which allows us to complete the execbuf ioctl and then later
on we can block properly when trying to grab the mutex in the next
ioctl for the gpu reset work handler to complete.

But thanks to that wedged check in intel_ring_begin we'll instead
return -EIO, despite the fact that later on we could successfully
reset the gpu. Returning -EIO forces the X server to fall back to s/w
rendering and disabling dri2, and in the case of a 3d compositor
usually results in an abort. After this patch we can still return -EIO
if the gpu is dead but the reset work hasn't completed yet, but only
so if the ring is full (which in many cases is unlikely).
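[Editorial note: the distinction Daniel draws — -EAGAIN while a reset can still succeed, -EIO only once the GPU is declared terminally dead — is what the i915_gem_check_wedge refactor mentioned in the changelog centralizes. A hypothetical sketch of that errno mapping (not the actual kernel helper; the state enum is an assumption for illustration):]

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical GPU states for illustration; the real driver derives
 * these from dev_priv->mm.wedged and the reset worker's progress. */
enum gpu_state {
    GPU_OK,                   /* business as usual */
    GPU_WEDGED_RESET_PENDING, /* hang noticed, reset worker queued */
    GPU_WEDGED_TERMINAL,      /* reset failed, GPU is dead for good */
};

/* Map GPU state to an errno: a pending reset is a transient
 * condition (-EAGAIN, restart the syscall), a terminal wedge is a
 * hard failure (-EIO). Non-interruptible callers cannot restart,
 * so they get -EIO either way. */
static int check_wedge(enum gpu_state state, bool interruptible)
{
    switch (state) {
    case GPU_OK:
        return 0;
    case GPU_WEDGED_RESET_PENDING:
        return interruptible ? -EAGAIN : -EIO;
    case GPU_WEDGED_TERMINAL:
    default:
        return -EIO;
    }
}
```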

Cheers, Daniel



 Signed-off-by: Daniel Vetter daniel.vet...@ffwll.ch
 ---
  drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ----
  1 file changed, 4 deletions(-)

 diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c 
 b/drivers/gpu/drm/i915/intel_ringbuffer.c
 index f30a53a..501546e 100644
 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
 +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
 @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer 
 *ring, int n)
  int intel_ring_begin(struct intel_ring_buffer *ring,
int num_dwords)
  {
 - struct drm_i915_private *dev_priv = ring->dev->dev_private;
   int n = 4*num_dwords;
   int ret;

 - if (unlikely(atomic_read(&dev_priv->mm.wedged)))
 - return -EIO;
 -
   if (unlikely(ring->tail + n > ring->effective_size)) {
   ret = intel_wrap_ring_buffer(ring);
   if (unlikely(ret))



 --
 Ben Widawsky, Intel Open Source Technology Center



-- 
Daniel Vetter
daniel.vet...@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch


Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin

2012-06-30 Thread Ben Widawsky
On Tue, 26 Jun 2012 23:08:50 +0200
Daniel Vetter daniel.vet...@ffwll.ch wrote:

 Having this early check in intel_ring_begin doesn't buy us anything,
 since we'll be calling into wait_request in the usual case already
 anyway. In the corner case of not waiting for free space using the
 last_retired_head we simply need to do the same check, too.
 
 With these changes we'll only ever get an -EIO from intel_ring_begin
 if the gpu has truly been declared dead.
 
 v2: Don't conflate the change to prevent intel_ring_begin from returning
 a spurious -EIO with the refactor to use i915_gem_check_wedge, which is
 just prep work to avoid returning -EAGAIN in callsites that can't handle
 syscall restarting.

I'm really scared by this change. It's difficult to review because there
are SO many callers of intel_ring_begin, and figuring out if they all
work in the wedged case is simply too difficult. I've yet to review the
rest of the series, but it may make more sense to put this change last
perhaps?

 
 Signed-off-by: Daniel Vetter daniel.vet...@ffwll.ch
 ---
  drivers/gpu/drm/i915/intel_ringbuffer.c |    4 ----
  1 file changed, 4 deletions(-)
 
 diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c 
 b/drivers/gpu/drm/i915/intel_ringbuffer.c
 index f30a53a..501546e 100644
 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
 +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
 @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer 
 *ring, int n)
  int intel_ring_begin(struct intel_ring_buffer *ring,
int num_dwords)
  {
 - struct drm_i915_private *dev_priv = ring->dev->dev_private;
   int n = 4*num_dwords;
   int ret;
  
 - if (unlikely(atomic_read(&dev_priv->mm.wedged)))
 - return -EIO;
 -
   if (unlikely(ring->tail + n > ring->effective_size)) {
   ret = intel_wrap_ring_buffer(ring);
   if (unlikely(ret))



-- 
Ben Widawsky, Intel Open Source Technology Center