Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
My experience with these patches is that they make it less likely that the hang is reported to the userspace in a timely fashion (filling the ring full leads to lots of lost rendering) and worse make it much more likely that i915_gem_fault() hits an EIO and goes bang. That is unacceptable and trivial to hit with these patches. I have not yet reproduced that issue using the same broken renderer without these patches. I do think the patches are a step in the right direction, but with the change in userspace behaviour it has to be a NAK for the time being. -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
On Tue, Jul 3, 2012 at 5:59 PM, Chris Wilson ch...@chris-wilson.co.uk wrote: My experience with these patches is that they make it less likely that the hang is reported to the userspace in a timely fashion (filling the ring full leads to lots of lost rendering) and worse make it much more likely that i915_gem_fault() hits an EIO and goes bang. That is unacceptable and trivial to hit with these patches. I have not yet reproduced that issue using the same broken renderer without these patches. Hm, I don't see how these patches allow much more rendering to be queued up until we stop everything - for single-threaded userspace that should be at most one additional batch (which might have been the one that could catch the spurious -EIO). All subsequent execbuf ioctl calls should stall when trying to grab the mutex. Same for the case that the gpu reset failed, userspace should be able to submit one more batch afaict until it gets an -EIO. So can you please dig into what exactly your seeing a bit more and unconfuse me? I do think the patches are a step in the right direction, but with the change in userspace behaviour it has to be a NAK for the time being. Ok, I guess I'll have to throw the -EIO sigbus eater into the mix, too. After all userspace is supposed to call set_domain(GTT) before accessing the gtt mmap, so it should still get notice when the gpu has died and rendering might be incomplete. Imo we should still return -EIO for all ioctls, userspace should be able to cope with these (In the future we might even add optional behaviour to signal potentially dropped rendering due to a gpu reset at wait_rendering for userspace that really cares). So would the sigbus eater be good enough or do we need more? Thanks, Daniel -- Daniel Vetter daniel.vet...@ffwll.ch - +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
On Sun, 1 Jul 2012 12:41:19 +0200 Daniel Vetter dan...@ffwll.ch wrote: On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky b...@bwidawsk.net wrote: On Tue, 26 Jun 2012 23:08:50 +0200 Daniel Vetter daniel.vet...@ffwll.ch wrote: Having this early check in intel_ring_begin doesn't buy us anything, since we'll be calling into wait_request in the usual case already anyway. In the corner case of not waiting for free space using the last_retired_head we simply need to do the same check, too. With these changes we'll only ever get an -EIO from intel_ring_begin if the gpu has truely been declared dead. v2: Don't conflate the change to prevent intel_ring_begin from returning a spurious -EIO with the refactor to use i915_gem_check_wedge, which is just prep work to avoid returning -EAGAIN in callsites that can't handle syscall restarting. I'm really scared by this change. It's diffuclt to review because there are SO many callers of intel_ring_begin, and figuring out if they all work in the wedged case is simply too difficult. I've yet to review the rest of the series, but it may make more sense to put this change last perhaps? Well, this patch doesn't really affect much what error codes the callers get - we'll still throw both -EGAIN and -EIO at them (later on patches will fix this). What this patch does though is prevent us from returning too many -EIO. Imagine the gpu died and we've noticed already (hence dev_priv-mm.wedged is set), but some process is stuck churning through the execbuf ioctl, holding dev-struct_mutex. While processing the execbuf we call intel_ring_begin to emit a few commands. Now usually, even when the gpu is dead, there is enough space in the ring to do so, which allows us to complete the execbuf ioctl and then later on we can block properly when trying to grab the mutex in the next ioctl for the gpu reset work handler to complete. That in itself is a pretty big change, don't you think? It seems rather strange and dangerous to modify HW (which we will if we allow execbuf to continue when we write the tail pointer). I think we want some way to not write the tail ptr in such a case. Now you might respond, well, who cares about writes? But this imposes a pretty large restriction on any code that can't work well after the GPU is hung. I see the irony. I'm complaining that you can make GPU hangs unstable, and the patch series is fixing GPU reset. Call it paranoia, it still seems unsafe to me, and makes us have to think a bit more whenever adding any code. Am I over-thinking this? But thanks to that wedged check in intel_ring_begin we'll instead return -EIO, despite the fact that later on we could successfully reset the gpu. Returning -EIO forces the X server to fall back to s/w rendering and disabling dri2, and in the case of a 3d compositor usually results in a abort. After this patch we can still return -EIO if the gpu is dead but the reset work hasn't completed yet, but only so if the ring is full (which in many cases is unlikely). Cheers, Daniel Signed-Off-by: Daniel Vetter daniel.vet...@ffwll.ch --- drivers/gpu/drm/i915/intel_ringbuffer.c |4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c index f30a53a..501546e 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n) int intel_ring_begin(struct intel_ring_buffer *ring, int num_dwords) { - struct drm_i915_private *dev_priv = ring-dev-dev_private; int n = 4*num_dwords; int ret; - if (unlikely(atomic_read(dev_priv-mm.wedged))) - return -EIO; - if (unlikely(ring-tail + n ring-effective_size)) { ret = intel_wrap_ring_buffer(ring); if (unlikely(ret)) -- Ben Widawsky, Intel Open Source Technology Center ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Ben Widawsky, Intel Open Source Technology Center ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
On Sun, Jul 1, 2012 at 5:09 AM, Ben Widawsky b...@bwidawsk.net wrote: On Tue, 26 Jun 2012 23:08:50 +0200 Daniel Vetter daniel.vet...@ffwll.ch wrote: Having this early check in intel_ring_begin doesn't buy us anything, since we'll be calling into wait_request in the usual case already anyway. In the corner case of not waiting for free space using the last_retired_head we simply need to do the same check, too. With these changes we'll only ever get an -EIO from intel_ring_begin if the gpu has truely been declared dead. v2: Don't conflate the change to prevent intel_ring_begin from returning a spurious -EIO with the refactor to use i915_gem_check_wedge, which is just prep work to avoid returning -EAGAIN in callsites that can't handle syscall restarting. I'm really scared by this change. It's diffuclt to review because there are SO many callers of intel_ring_begin, and figuring out if they all work in the wedged case is simply too difficult. I've yet to review the rest of the series, but it may make more sense to put this change last perhaps? Well, this patch doesn't really affect much what error codes the callers get - we'll still throw both -EGAIN and -EIO at them (later on patches will fix this). What this patch does though is prevent us from returning too many -EIO. Imagine the gpu died and we've noticed already (hence dev_priv-mm.wedged is set), but some process is stuck churning through the execbuf ioctl, holding dev-struct_mutex. While processing the execbuf we call intel_ring_begin to emit a few commands. Now usually, even when the gpu is dead, there is enough space in the ring to do so, which allows us to complete the execbuf ioctl and then later on we can block properly when trying to grab the mutex in the next ioctl for the gpu reset work handler to complete. But thanks to that wedged check in intel_ring_begin we'll instead return -EIO, despite the fact that later on we could successfully reset the gpu. Returning -EIO forces the X server to fall back to s/w rendering and disabling dri2, and in the case of a 3d compositor usually results in a abort. After this patch we can still return -EIO if the gpu is dead but the reset work hasn't completed yet, but only so if the ring is full (which in many cases is unlikely). Cheers, Daniel Signed-Off-by: Daniel Vetter daniel.vet...@ffwll.ch --- drivers/gpu/drm/i915/intel_ringbuffer.c |4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c index f30a53a..501546e 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n) int intel_ring_begin(struct intel_ring_buffer *ring, int num_dwords) { - struct drm_i915_private *dev_priv = ring-dev-dev_private; int n = 4*num_dwords; int ret; - if (unlikely(atomic_read(dev_priv-mm.wedged))) - return -EIO; - if (unlikely(ring-tail + n ring-effective_size)) { ret = intel_wrap_ring_buffer(ring); if (unlikely(ret)) -- Ben Widawsky, Intel Open Source Technology Center ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx -- Daniel Vetter daniel.vet...@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH 1/4] drm/i915: don't return a spurious -EIO from intel_ring_begin
On Tue, 26 Jun 2012 23:08:50 +0200 Daniel Vetter daniel.vet...@ffwll.ch wrote: Having this early check in intel_ring_begin doesn't buy us anything, since we'll be calling into wait_request in the usual case already anyway. In the corner case of not waiting for free space using the last_retired_head we simply need to do the same check, too. With these changes we'll only ever get an -EIO from intel_ring_begin if the gpu has truely been declared dead. v2: Don't conflate the change to prevent intel_ring_begin from returning a spurious -EIO with the refactor to use i915_gem_check_wedge, which is just prep work to avoid returning -EAGAIN in callsites that can't handle syscall restarting. I'm really scared by this change. It's diffuclt to review because there are SO many callers of intel_ring_begin, and figuring out if they all work in the wedged case is simply too difficult. I've yet to review the rest of the series, but it may make more sense to put this change last perhaps? Signed-Off-by: Daniel Vetter daniel.vet...@ffwll.ch --- drivers/gpu/drm/i915/intel_ringbuffer.c |4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c index f30a53a..501546e 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c @@ -1230,13 +1230,9 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n) int intel_ring_begin(struct intel_ring_buffer *ring, int num_dwords) { - struct drm_i915_private *dev_priv = ring-dev-dev_private; int n = 4*num_dwords; int ret; - if (unlikely(atomic_read(dev_priv-mm.wedged))) - return -EIO; - if (unlikely(ring-tail + n ring-effective_size)) { ret = intel_wrap_ring_buffer(ring); if (unlikely(ret)) -- Ben Widawsky, Intel Open Source Technology Center ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx