Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-27 Thread Chris Wilson
On Sun, Jan 26, 2014 at 01:47:29PM -0800, Ben Widawsky wrote:
> On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> > On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > > The previous check during error capture of whether or not the current VM
> > > > > should be scanned used, gen < 7. That was more or less trying to
> > > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > > I meant to do because I was more interested in working backwards from
> > > > > hardware state. However, this is incorrect because it will not include
> > > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > > greater than gen7 with the PPGTT module parameter invoked.
> > > > >
> > > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > > >
> > > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > > the PPGTT info is in the error state).
> > > > >
> > > > > I think Mika/Chris may have been looking at this too.
> > > >
> > > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > > the older, simpler mechanism of finding the first incomplete request. I
> > > > think that search is now definite since we preallocate the request and no
> > > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > > between seqno/batch/request).
> > > >
> > > > That should also apply here and be much simpler.
> > >
> > > How does that solve hangs which aren't caused by requests?
> >
> > Was that an intentional rhetorical question?
> >
> > The code you touch here only deals with requests - finding the current
> > batchbuffer if any.
> > -Chris
> >
>
> It wasn't rhetorical. I temporarily ignored that all batches are tied to
> a request.
>
> So what's the plan now? Just looking at the callers, we seem to have a
> couple of callers that can't easily identify the bad request.

I was thinking along the lines of:

@@ -737,31 +709,16 @@ i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
 	}
 
 	seqno = ring->get_seqno(ring, false);
-	list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
-		if (!is_active_vm(vm, ring))
+	list_for_each_entry(request, &ring->request_list, list) {
+		if (i915_seqno_passed(seqno, request->seqno))
 			continue;
 
-		found_active = true;
-
-		list_for_each_entry(vma, &vm->active_list, mm_list) {
-			obj = vma->obj;
-			if (obj->ring != ring)
-				continue;
-
-			if (i915_seqno_passed(seqno, obj->last_read_seqno))
-				continue;
-
-			if ((obj->base.read_domains & I915_GEM_DOMAIN_COMMAND) == 0)
-				continue;
-
-			/* We need to copy these to an anonymous buffer as the simplest
-			 * method to avoid being overwritten by userspace.
-			 */
-			return i915_error_object_create(dev_priv, obj, vm);
-		}
+		/* We need to copy these to an anonymous buffer as the simplest
+		 * method to avoid being overwritten by userspace.
+		 */
+		return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
 	}
 
-	WARN_ON(!found_active);
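
The loop in the diff skips every request whose seqno the hardware has already passed. As a rough standalone illustration of that wrap-safe comparison (not the driver code itself), the sketch below assumes the usual signed-difference form; seqno_passed() and seqno_outstanding() are names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if seq1 is at or after seq2, even when the 32-bit counter has
 * wrapped. Assumption: this mirrors the signed-difference idiom; the
 * function name is illustrative only.
 */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/*
 * Hypothetical helper: how many seqnos lie between what the hardware
 * last reported and the newest one submitted.
 */
static uint32_t seqno_outstanding(uint32_t hw_seqno, uint32_t last_submitted)
{
	/* The newest submission can never be behind the hardware. */
	assert(seqno_passed(last_submitted, hw_seqno));
	return last_submitted - hw_seqno;
}
```

A plain `seq1 >= seq2` would give the wrong answer once the counter wraps; the signed difference stays correct as long as the two values are less than 2^31 apart.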

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-27 Thread Ben Widawsky
On Mon, Jan 27, 2014 at 01:45:22PM +0000, Chris Wilson wrote:
> On Sun, Jan 26, 2014 at 01:47:29PM -0800, Ben Widawsky wrote:
> > On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> > > On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > > > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > > > The previous check during error capture of whether or not the current VM
> > > > > > should be scanned used, gen < 7. That was more or less trying to
> > > > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > > > I meant to do because I was more interested in working backwards from
> > > > > > hardware state. However, this is incorrect because it will not include
> > > > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > > > greater than gen7 with the PPGTT module parameter invoked.
> > > > > >
> > > > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > > > >
> > > > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > > > the PPGTT info is in the error state).
> > > > > >
> > > > > > I think Mika/Chris may have been looking at this too.
> > > > >
> > > > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > > > the older, simpler mechanism of finding the first incomplete request. I
> > > > > think that search is now definite since we preallocate the request and no
> > > > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > > > between seqno/batch/request).
> > > > >
> > > > > That should also apply here and be much simpler.
> > > >
> > > > How does that solve hangs which aren't caused by requests?
> > >
> > > Was that an intentional rhetorical question?
> > >
> > > The code you touch here only deals with requests - finding the current
> > > batchbuffer if any.
> > > -Chris
> > >
> >
> > It wasn't rhetorical. I temporarily ignored that all batches are tied to
> > a request.
> >
> > So what's the plan now? Just looking at the callers, we seem to have a
> > couple of callers that can't easily identify the bad request.
>
> I was thinking along the lines of:
>
> @@ -737,31 +709,16 @@ i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
>  	}
>
>  	seqno = ring->get_seqno(ring, false);
> -	list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
> -		if (!is_active_vm(vm, ring))
> +	list_for_each_entry(request, &ring->request_list, list) {
> +		if (i915_seqno_passed(seqno, request->seqno))
>  			continue;
>
> -		found_active = true;
> -
> -		list_for_each_entry(vma, &vm->active_list, mm_list) {
> -			obj = vma->obj;
> -			if (obj->ring != ring)
> -				continue;
> -
> -			if (i915_seqno_passed(seqno, obj->last_read_seqno))
> -				continue;
> -
> -			if ((obj->base.read_domains & I915_GEM_DOMAIN_COMMAND) == 0)
> -				continue;
> -
> -			/* We need to copy these to an anonymous buffer as the simplest
> -			 * method to avoid being overwritten by userspace.
> -			 */
> -			return i915_error_object_create(dev_priv, obj, vm);
> -		}
> +		/* We need to copy these to an anonymous buffer as the simplest
> +		 * method to avoid being overwritten by userspace.
> +		 */
> +		return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
>  	}
>
> -	WARN_ON(!found_active);
>

So per-ring batchbuffers are okay with you (it's fine by me)?

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-27 Thread Ben Widawsky
On Mon, Jan 27, 2014 at 01:45:22PM +0000, Chris Wilson wrote:
> On Sun, Jan 26, 2014 at 01:47:29PM -0800, Ben Widawsky wrote:
> > On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> > > On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > > > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > > > The previous check during error capture of whether or not the current VM
> > > > > > should be scanned used, gen < 7. That was more or less trying to
> > > > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > > > I meant to do because I was more interested in working backwards from
> > > > > > hardware state. However, this is incorrect because it will not include
> > > > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > > > greater than gen7 with the PPGTT module parameter invoked.
> > > > > >
> > > > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > > > >
> > > > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > > > the PPGTT info is in the error state).
> > > > > >
> > > > > > I think Mika/Chris may have been looking at this too.
> > > > >
> > > > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > > > the older, simpler mechanism of finding the first incomplete request. I
> > > > > think that search is now definite since we preallocate the request and no
> > > > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > > > between seqno/batch/request).
> > > > >
> > > > > That should also apply here and be much simpler.
> > > >
> > > > How does that solve hangs which aren't caused by requests?
> > >
> > > Was that an intentional rhetorical question?
> > >
> > > The code you touch here only deals with requests - finding the current
> > > batchbuffer if any.
> > > -Chris
> > >
> >
> > It wasn't rhetorical. I temporarily ignored that all batches are tied to
> > a request.
> >
> > So what's the plan now? Just looking at the callers, we seem to have a
> > couple of callers that can't easily identify the bad request.
>
> I was thinking along the lines of:
>
> @@ -737,31 +709,16 @@ i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
>  	}
>
>  	seqno = ring->get_seqno(ring, false);
> -	list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
> -		if (!is_active_vm(vm, ring))
> +	list_for_each_entry(request, &ring->request_list, list) {
> +		if (i915_seqno_passed(seqno, request->seqno))
>  			continue;
>
> -		found_active = true;
> -
> -		list_for_each_entry(vma, &vm->active_list, mm_list) {
> -			obj = vma->obj;
> -			if (obj->ring != ring)
> -				continue;
> -
> -			if (i915_seqno_passed(seqno, obj->last_read_seqno))
> -				continue;
> -
> -			if ((obj->base.read_domains & I915_GEM_DOMAIN_COMMAND) == 0)
> -				continue;
> -
> -			/* We need to copy these to an anonymous buffer as the simplest
> -			 * method to avoid being overwritten by userspace.
> -			 */
> -			return i915_error_object_create(dev_priv, obj, vm);
> -		}
> +		/* We need to copy these to an anonymous buffer as the simplest
> +		 * method to avoid being overwritten by userspace.
> +		 */
> +		return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
>  	}
>
> -	WARN_ON(!found_active);
>

The other issue is the existing method doesn't rely as much on proper
request handling, ie. this could be more resilient to driver bugs. I
kind of want to keep both...

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-27 Thread Chris Wilson
On Mon, Jan 27, 2014 at 12:31:08PM -0800, Ben Widawsky wrote:
> The other issue is the existing method doesn't rely as much on proper
> request handling, ie. this could be more resilient to driver bugs. I
> kind of want to keep both...

Actually I think it is. Part of the process of reading an error dump is
tying together the registers with what is captured. If they are
inconsistent, we know that the driver/capture is buggy. What happens in
the real world is that the GPU executes something completely different
than the batch buffer anyway...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-27 Thread Ben Widawsky
On Mon, Jan 27, 2014 at 09:31:04PM +0000, Chris Wilson wrote:
> On Mon, Jan 27, 2014 at 12:31:08PM -0800, Ben Widawsky wrote:
> > The other issue is the existing method doesn't rely as much on proper
> > request handling, ie. this could be more resilient to driver bugs. I
> > kind of want to keep both...
>
> Actually I think it is. Part of the process of reading an error dump is
> tying together the registers with what is captured. If they are
> inconsistent, we know that the driver/capture is buggy. What happens in
> the real world is that the GPU executes something completely different
> than the batch buffer anyway...
> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

Recapping IRC conversation - Chris is sending a patch to fix this
problem with his solution.

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-26 Thread Chris Wilson
On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> The previous check during error capture of whether or not the current VM
> should be scanned used, gen < 7. That was more or less trying to
> determine if there was a full PPGTT. At the time, this was sort of what
> I meant to do because I was more interested in working backwards from
> hardware state. However, this is incorrect because it will not include
> platforms that are greater than gen7, and not having PPGTT.  Example
> would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> greater than gen7 with the PPGTT module parameter invoked.
>
> I am /assuming/ BYT was broken, I have not actually checked.
>
> While here, clean up the file a bit to avoid duplicate reads (now that
> the PPGTT info is in the error state).
>
> I think Mika/Chris may have been looking at this too.

Sure, we are looking (for identifying the guilty request/batch) by using
the older, simpler mechanism of finding the first incomplete request. I
think that search is now definite since we preallocate the request and no
longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
between seqno/batch/request).

That should also apply here and be much simpler.
-Chris
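
As a rough model of the "first incomplete request" search described above, the sketch below walks requests in submission order and returns the first one the hardware has not yet passed. It is a simplified stand-in, not the driver's code: struct request, its fields and first_incomplete() are hypothetical, and an array replaces the ring's request list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a request: one seqno, one batch (1:1). */
struct request {
	uint32_t seqno;
	const char *batch;	/* hypothetical label standing in for the batch object */
};

/* Wrap-safe "seq1 is at or after seq2". */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/*
 * Return the oldest request the hardware has not completed, or NULL if
 * everything submitted has been passed. reqs must be in submission order.
 */
static const struct request *
first_incomplete(const struct request *reqs, size_t n, uint32_t hw_seqno)
{
	assert(reqs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (seqno_passed(hw_seqno, reqs[i].seqno))
			continue;	/* already completed by the GPU */
		return &reqs[i];	/* first one still outstanding */
	}
	return NULL;
}
```

With requests carrying seqnos 10, 11 and 12 and a hardware seqno of 10, the search returns the request with seqno 11. The result is only unambiguous because each request maps to exactly one batch, which is the 1:1 relationship mentioned above.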

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-26 Thread Ben Widawsky
On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > The previous check during error capture of whether or not the current VM
> > should be scanned used, gen < 7. That was more or less trying to
> > determine if there was a full PPGTT. At the time, this was sort of what
> > I meant to do because I was more interested in working backwards from
> > hardware state. However, this is incorrect because it will not include
> > platforms that are greater than gen7, and not having PPGTT.  Example
> > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > greater than gen7 with the PPGTT module parameter invoked.
> >
> > I am /assuming/ BYT was broken, I have not actually checked.
> >
> > While here, clean up the file a bit to avoid duplicate reads (now that
> > the PPGTT info is in the error state).
> >
> > I think Mika/Chris may have been looking at this too.
>
> Sure, we are looking (for identifying the guilty request/batch) by using
> the older, simpler mechanism of finding the first incomplete request. I
> think that search is now definite since we preallocate the request and no
> longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> between seqno/batch/request).
>
> That should also apply here and be much simpler.
> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

How does that solve hangs which aren't caused by requests?

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-26 Thread Chris Wilson
On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > The previous check during error capture of whether or not the current VM
> > > should be scanned used, gen < 7. That was more or less trying to
> > > determine if there was a full PPGTT. At the time, this was sort of what
> > > I meant to do because I was more interested in working backwards from
> > > hardware state. However, this is incorrect because it will not include
> > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > greater than gen7 with the PPGTT module parameter invoked.
> > >
> > > I am /assuming/ BYT was broken, I have not actually checked.
> > >
> > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > the PPGTT info is in the error state).
> > >
> > > I think Mika/Chris may have been looking at this too.
> >
> > Sure, we are looking (for identifying the guilty request/batch) by using
> > the older, simpler mechanism of finding the first incomplete request. I
> > think that search is now definite since we preallocate the request and no
> > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > between seqno/batch/request).
> >
> > That should also apply here and be much simpler.
>
> How does that solve hangs which aren't caused by requests?

Was that an intentional rhetorical question?

The code you touch here only deals with requests - finding the current
batchbuffer if any.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH 5/5] drm/i915: Fix error capture on BYT/BDW

2014-01-26 Thread Ben Widawsky
On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > The previous check during error capture of whether or not the current VM
> > > > should be scanned used, gen < 7. That was more or less trying to
> > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > I meant to do because I was more interested in working backwards from
> > > > hardware state. However, this is incorrect because it will not include
> > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > greater than gen7 with the PPGTT module parameter invoked.
> > > >
> > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > >
> > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > the PPGTT info is in the error state).
> > > >
> > > > I think Mika/Chris may have been looking at this too.
> > >
> > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > the older, simpler mechanism of finding the first incomplete request. I
> > > think that search is now definite since we preallocate the request and no
> > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > between seqno/batch/request).
> > >
> > > That should also apply here and be much simpler.
> >
> > How does that solve hangs which aren't caused by requests?
>
> Was that an intentional rhetorical question?
>
> The code you touch here only deals with requests - finding the current
> batchbuffer if any.
> -Chris
>

It wasn't rhetorical. I temporarily ignored that all batches are tied to
a request.

So what's the plan now? Just looking at the callers, we seem to have a
couple of callers that can't easily identify the bad request.

-- 
Ben Widawsky, Intel Open Source Technology Center