Re: [Intel-gfx] [PATCH] i915/guc: Run busyness worker only if gt is awake

2023-09-14 Thread Daniele Ceraolo Spurio



On 9/11/2023 5:52 PM, Umesh Nerlige Ramappa wrote:

The worker is canceled in the __gt_park path, but we still see it
running sometimes during suspend.

Only update stats if gt is awake. If not, intel_guc_busyness_park would
have already updated the stats. Note that we do not requeue the worker
if gt is not awake since intel_guc_busyness_unpark would do that at some
point.

If the gt was parked longer than time taken for GT timestamp to roll
over, we ignore those rollovers since we don't care about tracking the
exact GT time. We only care about roll overs when the gt is active and
running workloads.

v2 (Daniele)
- Edit commit message and code comment
- Use runtime pm in the worker
- Put runtime pm after enabling the worker
- Use Link tag and add Fixes tag

Link: https://gitlab.freedesktop.org/drm/intel/-/issues/7077
Fixes: 77cdd054dd2c ("drm/i915/pmu: Connect engine busyness stats from GuC to pmu")
Signed-off-by: Umesh Nerlige Ramappa 
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 26 +++++++++++++++++++++++---
  1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e250bedf90fb..d37b29a0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1461,6 +1461,24 @@ static void guc_timestamp_ping(struct work_struct *wrk)
unsigned long index;
int srcu, ret;
  
+	/*
+	 * The worker is canceled in the __gt_park path, but we still see it
+	 * running sometimes during suspend.


This sounds very vague, and more like there is a bug in __gt_park. We 
actually do know what the issue on the park side is: the worker re-queues 
itself if it's running while we cancel it. That's not really a problem as 
long as we don't wake the device after it's gone to sleep, which is exactly 
what your change below ensures, so this sentence should be reworded or 
dropped.
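
For illustration, the re-queue window with a non-waiting cancel looks
roughly like this (a hypothetical worker, not the actual driver code):

#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* Hypothetical stand-in for guc->timestamp.work and its handler. */
static void ping_fn(struct work_struct *wrk);
static DECLARE_DELAYED_WORK(ping_work, ping_fn);

static void ping_fn(struct work_struct *wrk)
{
	/* ... sample the busyness stats ... */

	/*
	 * Re-arm. If this runs concurrently with the cancel below, the
	 * work is queued again and can execute after the cancel returned.
	 */
	queue_delayed_work(system_wq, &ping_work, HZ);
}

static void park_side(void)
{
	/*
	 * A non-waiting cancel neither waits for a running instance nor
	 * stops it from re-queueing itself on the way out.
	 */
	cancel_delayed_work(&ping_work);
}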



+	 *
+	 * Only update stats if gt is awake. If not, intel_guc_busyness_park
+	 * would have already updated the stats. Note that we do not requeue the
+	 * worker in this case since intel_guc_busyness_unpark would do that at
+	 * some point.
+	 *
+	 * If the gt was parked longer than time taken for GT timestamp to roll
+	 * over, we ignore those rollovers since we don't care about tracking
+	 * the exact GT time. We only care about roll overs when the gt is
+	 * active and running workloads.
+	 */
+	wakeref = intel_runtime_pm_get_if_active(&gt->i915->runtime_pm);


The patch title and the comment refer to GT being awake, but here you're 
taking a global rpm ref and not a gt_pm ref. I understand that taking a 
gt_pm ref in here can be complicated because of the canceling of the 
worker in the gt_park flow and that taking an rpm ref does work for what 
we need, but the title/comments need to reflect and explain that.
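
A minimal sketch of the distinction, using the existing i915 helpers (the
usage below is illustrative only):

	/*
	 * Device-level reference: succeeds only while the device is
	 * runtime-active. It never wakes the device, but it also does
	 * not keep the GT itself from parking.
	 */
	wakeref = intel_runtime_pm_get_if_active(&gt->i915->runtime_pm);

	/*
	 * GT-level reference: succeeds only while the GT is un-parked,
	 * and dropping the last one kicks off the park flow, which is
	 * what makes taking it from inside this worker awkward.
	 */
	if (intel_gt_pm_get_if_awake(gt))
		intel_gt_pm_put(gt);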


Daniele


+	if (!wakeref)
+		return;
+
/*
 * Synchronize with gt reset to make sure the worker does not
 * corrupt the engine/guc stats. NB: can't actually block waiting
@@ -1469,10 +1487,9 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 */
	ret = intel_gt_reset_trylock(gt, &srcu);
	if (ret)
-		return;
+		goto err_trylock;
  
-	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
-		__update_guc_busyness_stats(guc);
+	__update_guc_busyness_stats(guc);
  
	/* adjust context stats for overflow */
	xa_for_each(&guc->context_lookup, index, ce)
@@ -1481,6 +1498,9 @@ static void guc_timestamp_ping(struct work_struct *wrk)
intel_gt_reset_unlock(gt, srcu);
  
  	guc_enable_busyness_worker(guc);

+
+err_trylock:
+	intel_runtime_pm_put(&gt->i915->runtime_pm, wakeref);
  }
  
  static int guc_action_enable_usage_stats(struct intel_guc *guc)




Re: [Intel-gfx] [PATCH] i915/guc: Run busyness worker only if gt is awake

2023-09-11 Thread Umesh Nerlige Ramappa

On Mon, Sep 11, 2023 at 08:44:39AM -0700, Daniele Ceraolo Spurio wrote:

  On 9/8/2023 10:16 PM, Umesh Nerlige Ramappa wrote:

The worker is canceled in the __gt_park path, but we still see it
running sometimes during suspend. This is likely because some code is
getting a gt wakeref in the __gt_park path.

  This possible root-cause doesn't seem plausible to me, because a gt
  wakeref would cause an unpark, so taking it within the park would probably
  cause a deadlock. Is it not more likely that the worker re-queued itself?


Will drop the likely part. The worker running during suspend is the 
issue, so keeping that part.




Only update stats if gt is awake. If not, intel_guc_busyness_park would
have already updated the stats. Note that we do not requeue the worker
if gt is not awake since intel_guc_busyness_unpark would do that at some
point.

If the gt was parked longer than time taken for GT timestamp to roll
over, we ignore those rollovers since we don't care about tracking the
exact GT time. We only care about roll overs when the gt is active and
running workloads.

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/7077

  This needs a fixes tag. Also, I'm not 100% sure but I believe we prefer
  "Link" to "Closes".


I thought Link was mostly for the patchworks link. I can change this to 
Link.


Any idea if there is a document/link that explains which tag to use for 
what? I have been confused by this before.
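
For reference, this is the trailer combination v2 of this patch ends up
carrying (see the patch earlier in this thread):

	Link: https://gitlab.freedesktop.org/drm/intel/-/issues/7077
	Fixes: 77cdd054dd2c ("drm/i915/pmu: Connect engine busyness stats from GuC to pmu")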




Signed-off-by: Umesh Nerlige Ramappa 
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e250bedf90fb..df31d6047ce9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1457,10 +1457,27 @@ static void guc_timestamp_ping(struct work_struct *wrk)
struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
struct intel_gt *gt = guc_to_gt(guc);
struct intel_context *ce;
-   intel_wakeref_t wakeref;
unsigned long index;
int srcu, ret;

+	/*
+	 * The worker is canceled in the __gt_park path, but we still see it
+	 * running sometimes during suspend. This is likely because some code
+	 * is getting a gt wakeref in the __gt_park path.

  Same comment from before about this explanation. I would just remove this
  part from the comment.

+	 *
+	 * Only update stats if gt is awake. If not, intel_guc_busyness_park
+	 * would have already updated the stats. Note that we do not requeue the
+	 * worker in this case since intel_guc_busyness_unpark would do that at
+	 * some point.
+	 *
+	 * If the gt was parked longer than time taken for GT timestamp to roll
+	 * over, we ignore those rollovers since we don't care about tracking
+	 * the exact GT time. We only care about roll overs when the gt is
+	 * active and running workloads.
+	 */
+	if (!intel_gt_pm_get_if_awake(gt))
+		return;
+

  Do we need to drop the _sync from the busyness stats worker parking if we
  take the gt_pm wakeref in here, instead of an rpm one? Because if the
  gt_pm_put below causes a park, and the park waits on this worker to
  complete, then we'll deadlock.


Hmm, my bad, that's not what I intended. It should be 
intel_runtime_pm_get_if_active(). I will change that.
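
The deadlock being described would look roughly like this (an assumed call
chain, with the _sync cancel still on the park side):

	guc_timestamp_ping()                  <- this worker
	  intel_gt_pm_put(gt)                 <- drops the last GT wakeref
	    __gt_park()
	      intel_guc_busyness_park()
	        cancel_delayed_work_sync()    <- waits for guc_timestamp_ping()
	                                         to finish, i.e. on itself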




/*
 * Synchronize with gt reset to make sure the worker does not
 * corrupt the engine/guc stats. NB: can't actually block waiting
@@ -1468,17 +1485,19 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 * this worker thread if started. So waiting would deadlock.
 */
	ret = intel_gt_reset_trylock(gt, &srcu);
-	if (ret)
+	if (ret) {
+		intel_gt_pm_put(gt);
 		return;
+	}

-	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
-		__update_guc_busyness_stats(guc);
+	__update_guc_busyness_stats(guc);

	/* adjust context stats for overflow */
	xa_for_each(&guc->context_lookup, index, ce)
		guc_context_update_stats(ce);

intel_gt_reset_unlock(gt, srcu);
+	intel_gt_pm_put(gt);

  I think this needs to go after the queuing, because it could cause a park,
  and if it does we don't want to re-queue the worker immediately after;
  whereas if we queue it before, then the park will cancel it.
  Non-blocking style comment: with gt_pm_put as the last thing in the
  function, you can also transform that early return into a "goto put;" and
  have a single place for the gt_put.
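
  i.e. roughly (a sketch only, with a hypothetical "put" label):

	ret = intel_gt_reset_trylock(gt, &srcu);
	if (ret)
		goto put;

	/* ... */

	guc_enable_busyness_worker(guc);
put:
	intel_gt_pm_put(gt);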


Will change, although I am not sure whether the runtime pm put may also 
cause a gt park. Assuming it can, I will make these changes.


Thanks
Umesh



  Daniele


guc_enable_busyness_worker(guc);
 }


Re: [Intel-gfx] [PATCH] i915/guc: Run busyness worker only if gt is awake

2023-09-11 Thread Daniele Ceraolo Spurio



On 9/8/2023 10:16 PM, Umesh Nerlige Ramappa wrote:

The worker is canceled in the __gt_park path, but we still see it
running sometimes during suspend. This is likely because some code is
getting a gt wakeref in the __gt_park path.

This possible root-cause doesn't seem plausible to me, because a gt 
wakeref would cause an unpark, so taking it within the park would 
probably cause a deadlock. Is it not more likely that the worker 
re-queued itself?



Only update stats if gt is awake. If not, intel_guc_busyness_park would
have already updated the stats. Note that we do not requeue the worker
if gt is not awake since intel_guc_busyness_unpark would do that at some
point.

If the gt was parked longer than time taken for GT timestamp to roll
over, we ignore those rollovers since we don't care about tracking the
exact GT time. We only care about roll overs when the gt is active and
running workloads.

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/7077

This needs a fixes tag. Also, I'm not 100% sure but I believe we prefer 
"Link" to "Closes".



Signed-off-by: Umesh Nerlige Ramappa
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 27 +++++++++++++++++++++++----
  1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index e250bedf90fb..df31d6047ce9 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1457,10 +1457,27 @@ static void guc_timestamp_ping(struct work_struct *wrk)
struct intel_uc *uc = container_of(guc, typeof(*uc), guc);
struct intel_gt *gt = guc_to_gt(guc);
struct intel_context *ce;
-   intel_wakeref_t wakeref;
unsigned long index;
int srcu, ret;
  
+	/*
+	 * The worker is canceled in the __gt_park path, but we still see it
+	 * running sometimes during suspend. This is likely because some code
+	 * is getting a gt wakeref in the __gt_park path.


Same comment from before about this explanation. I would just remove 
this part from the comment.



+	 *
+	 * Only update stats if gt is awake. If not, intel_guc_busyness_park
+	 * would have already updated the stats. Note that we do not requeue the
+	 * worker in this case since intel_guc_busyness_unpark would do that at
+	 * some point.
+	 *
+	 * If the gt was parked longer than time taken for GT timestamp to roll
+	 * over, we ignore those rollovers since we don't care about tracking
+	 * the exact GT time. We only care about roll overs when the gt is
+	 * active and running workloads.
+	 */
+	if (!intel_gt_pm_get_if_awake(gt))
+		return;
+


Do we need to drop the _sync from the busyness stats worker parking if 
we take the gt_pm wakeref in here, instead of an rpm one? Because if the 
gt_pm_put below causes a park, and the park waits on this worker to 
complete, then we'll deadlock.



/*
 * Synchronize with gt reset to make sure the worker does not
 * corrupt the engine/guc stats. NB: can't actually block waiting
@@ -1468,17 +1485,19 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 * this worker thread if started. So waiting would deadlock.
 */
	ret = intel_gt_reset_trylock(gt, &srcu);
-	if (ret)
+	if (ret) {
+		intel_gt_pm_put(gt);
 		return;
+	}
  
-	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
-		__update_guc_busyness_stats(guc);
+	__update_guc_busyness_stats(guc);
  
	/* adjust context stats for overflow */
	xa_for_each(&guc->context_lookup, index, ce)
		guc_context_update_stats(ce);
  
  	intel_gt_reset_unlock(gt, srcu);

+	intel_gt_pm_put(gt);


I think this needs to go after the queuing, because it could cause a 
park, and if it does we don't want to re-queue the worker immediately 
after; whereas if we queue it before, then the park will cancel it.
Non-blocking style comment: with gt_pm_put as the last thing in the 
function, you can also transform that early return into a "goto put;" 
and have a single place for the gt_put.


Daniele

  
  	guc_enable_busyness_worker(guc);

  }