Re: [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-25 Thread John Harrison

On 1/23/2023 09:51, Tvrtko Ursulin wrote:

On 20/01/2023 23:28, john.c.harri...@intel.com wrote:

From: John Harrison 



-struct i915_request *intel_context_find_active_request(struct intel_context *ce)
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce)


TBH I don't "dig" this name, it's a bit on the long side and feels out 
of character. I won't insist it be changed, but if get really has to 
be included in the name I would be happy with 
intel_context_get_active_request().

Daniele sided with you on this one. Will use your naming.

John.



Re: [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-23 Thread John Harrison

On 1/23/2023 09:51, Tvrtko Ursulin wrote:

On 20/01/2023 23:28, john.c.harri...@intel.com wrote:

From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count but within the
spinlock not outside it.

The only other caller of the context based search is the code for
dumping engine state to debugfs. That code wasn't previously getting
an explicit reference at all as it does everything while holding the
execlist specific spinlock. So, that needs updating as well as that
spinlock doesn't help when using GuC submission. Rather than trying to
conditionally get/put depending on submission model, just change it to
always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up
too. While at it, add some extra whitespace padding for readability.


Is this part splittable into a separate patch?
I guess it could, but it seems closely related to all the other locking 
fix-ups in this patch.
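
The rule the commit message describes for the context based search (take the reference inside the same lock that protects the search) can be illustrated with a small userspace sketch. Everything in it is a hypothetical stand-in: `struct request`, `struct context`, `request_get()`, `request_put()` and the `atomic_flag` spinlock are assumptions made for illustration, not the i915 types or helpers, and the search loop is simplified compared with the driver's.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the i915 types. */
struct request {
	atomic_int refcount;
	bool completed;
};

struct context {
	atomic_flag lock;		/* plays the role of guc_state.lock */
	struct request *requests[4];
	size_t count;
};

static void context_lock(struct context *ce)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&ce->lock, memory_order_acquire))
		;
}

static void context_unlock(struct context *ce)
{
	atomic_flag_clear_explicit(&ce->lock, memory_order_release);
}

static struct request *request_get(struct request *rq)
{
	atomic_fetch_add(&rq->refcount, 1);
	return rq;
}

static void request_put(struct request *rq)
{
	atomic_fetch_sub(&rq->refcount, 1);
}

/*
 * Search and reference under one lock: the pointer never escapes the
 * critical section without a reference, so it cannot be freed between
 * the unlock and the caller's first use.
 */
static struct request *context_get_active_request(struct context *ce)
{
	struct request *active = NULL;

	context_lock(ce);
	for (size_t i = 0; i < ce->count; i++) {
		if (!ce->requests[i]->completed) {
			active = ce->requests[i];
			break;
		}
	}
	if (active)
		active = request_get(active);
	context_unlock(ce);

	return active;
}
```

A caller that gets a non-NULL result owns one reference and is expected to drop it with `request_put()` when done. Taking the reference after `context_unlock()` instead would leave a window in which the request could be released.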






v2: Explicitly document adding an extra blank line in some dense code
(Andy Shevchenko). Fix multiple potential null pointer derefs in case
of no request found (some spotted by Tvrtko, but there was more!).
Also fix a leaked request in case of !started and another in
__guc_reset_context now that intel_context_find_active_request is
actually reference counting the returned request.
v3: Add a _get suffix to intel_context_find_active_request now that it
grabs a reference (Daniele).

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")

Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-...@lists.freedesktop.org
Signed-off-by: John Harrison 
Reviewed-by: Daniele Ceraolo Spurio 
---
  drivers/gpu/drm/i915/gt/intel_context.c   |  4 +++-
  drivers/gpu/drm/i915/gt/intel_context.h   |  3 +--
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  6 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +-
  drivers/gpu/drm/i915/i915_gpu_error.c | 13 ++---
  5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c

index e94365b08f1ef..4285c1c71fa12 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,7 +528,7 @@ struct i915_request 
*intel_context_create_request(struct intel_context *ce)

  return rq;
  }
-struct i915_request *intel_context_find_active_request(struct intel_context *ce)
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce)


TBH I don't "dig" this name, it's a bit on the long side and feels out 
of character. I won't insist it be changed, but if get really has to 
be included in the name I would be happy with 
intel_context_get_active_request().


Personally, I read the 'find' component as signalling that this is a 
search, not just a dereference of an existing pointer, so it is a useful 
part of the name. I don't think there is a simple name that encapsulates 
everything that is going on here. But I don't feel too strongly about it 
if you really think the shorter version is better.


One could add some kerneldoc... but it would be almost the only function 
in the whole of intel_context.h with any. I am not sure whether that is 
intentional (on the theory that it should be obvious what a function does 
from reading the code, that documentation is a waste of space which goes 
out of date and becomes inaccurate, and that we aren't meant to kerneldoc 
internal behaviour) or whether it is just the general lack of 
documentation for any driver code.
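
For illustration only, a kerneldoc comment for this function might look roughly like the sketch below. The wording is an assumption pieced together from the commit message and the diff (search under the parent's `guc_state.lock`, reference taken with `i915_request_get_rcu()` before unlocking); it is not text from the patch, and the forward declarations are there only so the fragment stands alone.

```c
/* Forward declarations so this fragment compiles on its own. */
struct intel_context;
struct i915_request;

/**
 * intel_context_find_active_request_get() - find and reference the active request
 * @ce: context to search
 *
 * Searches the context's request list for the active request while holding
 * the parent context's guc_state.lock, and takes a reference on the result
 * before dropping the lock.
 *
 * Return: the active request with a reference held, or NULL if no request
 * was found. The caller must drop the reference with i915_request_put().
 */
struct i915_request *intel_context_find_active_request_get(struct intel_context *ce);
```

Whatever the final name, the point worth documenting is the ownership rule: a non-NULL return value carries a reference that the caller has to release.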






  {
  struct intel_context *parent = intel_context_to_parent(ce);
  struct i915_request *rq, *active = NULL;
@@ -552,6 +552,8 @@ struct i915_request 
*intel_context_find_active_request(struct intel_context *ce)

    active = rq;
  }
+    if (active)
+    active = i915_request_get_rcu(active);
  spin_unlock_irqrestore(&parent->guc_state.lock, flags);
    return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h 
b/drivers/gpu/drm/i915/gt/intel_context.h

index fb62b7b8cbcda..ccc80c6607ca8 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -268,8 +268,7 @@

Re: [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-23 Thread Tvrtko Ursulin



On 20/01/2023 23:28, john.c.harri...@intel.com wrote:

From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count but within the
spinlock not outside it.

The only other caller of the context based search is the code for
dumping engine state to debugfs. That code wasn't previously getting
an explicit reference at all as it does everything while holding the
execlist specific spinlock. So, that needs updating as well as that
spinlock doesn't help when using GuC submission. Rather than trying to
conditionally get/put depending on submission model, just change it to
always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up
too. While at it, add some extra whitespace padding for readability.


Is this part splittable into a separate patch?



v2: Explicitly document adding an extra blank line in some dense code
(Andy Shevchenko). Fix multiple potential null pointer derefs in case
of no request found (some spotted by Tvrtko, but there was more!).
Also fix a leaked request in case of !started and another in
__guc_reset_context now that intel_context_find_active_request is
actually reference counting the returned request.
v3: Add a _get suffix to intel_context_find_active_request now that it
grabs a reference (Daniele).

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-...@lists.freedesktop.org
Signed-off-by: John Harrison 
Reviewed-by: Daniele Ceraolo Spurio 
---
  drivers/gpu/drm/i915/gt/intel_context.c   |  4 +++-
  drivers/gpu/drm/i915/gt/intel_context.h   |  3 +--
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |  6 +-
  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +-
  drivers/gpu/drm/i915/i915_gpu_error.c | 13 ++---
  5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..4285c1c71fa12 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,7 +528,7 @@ struct i915_request *intel_context_create_request(struct 
intel_context *ce)
return rq;
  }
  
-struct i915_request *intel_context_find_active_request(struct intel_context *ce)
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce)


TBH I don't "dig" this name, it's a bit on the long side and feels out of 
character. I won't insist it be changed, but if get really has to be included in the name 
I would be happy with intel_context_get_active_request().


  {
struct intel_context *parent = intel_context_to_parent(ce);
struct i915_request *rq, *active = NULL;
@@ -552,6 +552,8 @@ struct i915_request 
*intel_context_find_active_request(struct intel_context *ce)
  
  		active = rq;

}
+   if (active)
+   active = i915_request_get_rcu(active);
spin_unlock_irqrestore(&parent->guc_state.lock, flags);
  
  	return active;

diff --git a/drivers/gpu/drm/i915/gt/intel_context.h 
b/drivers/gpu/drm/i915/gt/intel_context.h
index fb62b7b8cbcda..ccc80c6607ca8 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct 
intel_context *ce,
  
  struct i915_request *intel_context_create_request(struct intel_context *ce);
  
-struct i915_request *
-intel_context_find_active_request(struct intel_context *ce);
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce);
  
  static inline bool intel_context_is_barrier(const struct intel_context *ce)

  {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..fbc0a81617e89 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct 
intel_engine_cs *engine, struct d
if (guc) {
ce = intel_engine_get_hung_context(engine);
if (ce)
-   hung_rq = intel_context_find_acti

[PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump

2023-01-20 Thread John . C . Harrison
From: John Harrison 

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count but within the
spinlock not outside it.

The only other caller of the context based search is the code for
dumping engine state to debugfs. That code wasn't previously getting
an explicit reference at all as it does everything while holding the
execlist specific spinlock. So, that needs updating as well as that
spinlock doesn't help when using GuC submission. Rather than trying to
conditionally get/put depending on submission model, just change it to
always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up
too. While at it, add some extra whitespace padding for readability.
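
The "always do the get/put" approach described above for the debugfs dump can be sketched in plain C as follows. This is a simplified userspace illustration under stated assumptions: `struct request`, `request_get()`, `request_put()`, `find_active_request_get()`, `execlist_find_hung_request()` and `dump_active_request()` are hypothetical stand-ins, not the driver's functions, and the external spinlock that the execlist path relies on is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted request; not the i915 type. */
struct request {
	atomic_int refcount;
};

static struct request *request_get(struct request *rq)
{
	atomic_fetch_add(&rq->refcount, 1);
	return rq;
}

static void request_put(struct request *rq)
{
	atomic_fetch_sub(&rq->refcount, 1);
}

/* GuC-like lookup: returns the request with a reference already held. */
static struct request *find_active_request_get(struct request *candidate)
{
	return candidate ? request_get(candidate) : NULL;
}

/* Execlist-like lookup: returns a borrowed pointer, no reference taken. */
static struct request *execlist_find_hung_request(struct request *candidate)
{
	return candidate;
}

/* Returns true if a hung request was found and "dumped". */
static bool dump_active_request(struct request *candidate, bool guc)
{
	struct request *hung_rq;

	if (guc) {
		hung_rq = find_active_request_get(candidate);
	} else {
		hung_rq = execlist_find_hung_request(candidate);
		/* Take the reference here so both paths own one. */
		if (hung_rq)
			hung_rq = request_get(hung_rq);
	}

	if (!hung_rq)
		return false;

	/* ... print the request state here ... */

	/* Unconditional put: no need to remember which path was taken. */
	request_put(hung_rq);
	return true;
}
```

Because both branches leave `hung_rq` holding exactly one reference, the cleanup is a single unconditional put, which is the simplification the commit message argues for over a conditional get/put per submission model.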

v2: Explicitly document adding an extra blank line in some dense code
(Andy Shevchenko). Fix multiple potential null pointer derefs in case
of no request found (some spotted by Tvrtko, but there was more!).
Also fix a leaked request in case of !started and another in
__guc_reset_context now that intel_context_find_active_request is
actually reference counting the returned request.
v3: Add a _get suffix to intel_context_find_active_request now that it
grabs a reference (Daniele).

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost 
Cc: John Harrison 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Tvrtko Ursulin 
Cc: Daniele Ceraolo Spurio 
Cc: Andrzej Hajda 
Cc: Matthew Auld 
Cc: Matt Roper 
Cc: Umesh Nerlige Ramappa 
Cc: Michael Cheng 
Cc: Lucas De Marchi 
Cc: Tejas Upadhyay 
Cc: Andy Shevchenko 
Cc: Aravind Iddamsetty 
Cc: Alan Previn 
Cc: Bruce Chang 
Cc: intel-...@lists.freedesktop.org
Signed-off-by: John Harrison 
Reviewed-by: Daniele Ceraolo Spurio 
---
 drivers/gpu/drm/i915/gt/intel_context.c   |  4 +++-
 drivers/gpu/drm/i915/gt/intel_context.h   |  3 +--
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  6 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +-
 drivers/gpu/drm/i915/i915_gpu_error.c | 13 ++---
 5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..4285c1c71fa12 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,7 +528,7 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
return rq;
 }
 
-struct i915_request *intel_context_find_active_request(struct intel_context *ce)
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce)
 {
struct intel_context *parent = intel_context_to_parent(ce);
struct i915_request *rq, *active = NULL;
@@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
active = rq;
}
+   if (active)
+   active = i915_request_get_rcu(active);
spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index fb62b7b8cbcda..ccc80c6607ca8 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct intel_context *ce,
 
 struct i915_request *intel_context_create_request(struct intel_context *ce);
 
-struct i915_request *
-intel_context_find_active_request(struct intel_context *ce);
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce);
 
 static inline bool intel_context_is_barrier(const struct intel_context *ce)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..fbc0a81617e89 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct 
intel_engine_cs *engine, struct d
if (guc) {
ce = intel_engine_get_hung_context(engine);
if (ce)
-   hung_rq = intel_context_find_active_request(ce);
+   hung_rq = intel_context_find_active_request_get(ce);
} else {
hung_rq = intel_engine_execlist_find_hung_request(engine);
+   if (hung_rq)
+   hung_rq = i915_request_get_rcu(hung_rq);
}
 
if (hung_rq)
@@ -2250,6 +2252,8 @@ static