guc: Don't hog IRQs when destroying contexts

Tvrtko Ursulin Fri, 17 Dec 2021 03:14:17 -0800


On 17/12/2021 11:06, Tvrtko Ursulin wrote:

On 14/12/2021 17:04, Matthew Brost wrote:

From: John Harrison <john.c.harri...@intel.com>

While attempting to debug a CT deadlock issue in various CI failures
(most easily reproduced with gem_ctx_create/basic-files), I was seeing
CPU deadlock errors being reported. This was because the context
destroy loop was blocking waiting on H2G space from inside an IRQ
spinlock. There was no deadlock as such, it's just that the H2G queue
was full of context destroy commands and GuC was taking a long time to
process them. However, the kernel was seeing the large amount of time
spent inside the IRQ lock as a dead CPU. Various Bad Things(tm) would
then happen (heartbeat failures, CT deadlock errors, outstanding H2G
WARNs, etc.).

Re-working the loop to only acquire the spinlock around the list
management (which is all it is meant to protect) rather than the
entire destroy operation seems to fix all the above issues.

v2:
  (John Harrison)
   - Fix typo in comment message

Signed-off-by: John Harrison <john.c.harri...@intel.com>
Signed-off-by: Matthew Brost <matthew.br...@intel.com>
Reviewed-by: Matthew Brost <matthew.br...@intel.com>
---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 45 ++++++++++++-------
  1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

index 36c2965db49b..96fcf869e3ff 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c

@@ -2644,7 +2644,6 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)

      unsigned long flags;
      bool disabled;
-    lockdep_assert_held(&guc->submission_state.lock);
      GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
      GEM_BUG_ON(!lrc_desc_registered(guc, ce->guc_id.id));
      GEM_BUG_ON(ce != __get_context(guc, ce->guc_id.id));

@@ -2660,7 +2659,7 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)

      }
      spin_unlock_irqrestore(&ce->guc_state.lock, flags);
      if (unlikely(disabled)) {
-        __release_guc_id(guc, ce);
+        release_guc_id(guc, ce);
          __guc_context_destroy(ce);
          return;
      }

@@ -2694,36 +2693,48 @@ static void __guc_context_destroy(struct intel_context *ce)

  static void guc_flush_destroyed_contexts(struct intel_guc *guc)
  {
-    struct intel_context *ce, *cn;
+    struct intel_context *ce;
      unsigned long flags;
      GEM_BUG_ON(!submission_disabled(guc) &&
             guc_submission_initialized(guc));
-    spin_lock_irqsave(&guc->submission_state.lock, flags);
-    list_for_each_entry_safe(ce, cn,
-                 &guc->submission_state.destroyed_contexts,
-                 destroyed_link) {
-        list_del_init(&ce->destroyed_link);
-        __release_guc_id(guc, ce);
+    while (!list_empty(&guc->submission_state.destroyed_contexts)) {

Are lockless false negatives a concern here? I mean this thread not seeing that something just got added to the list.

+        spin_lock_irqsave(&guc->submission_state.lock, flags);

+        ce = list_first_entry_or_null(&guc->submission_state.destroyed_contexts,

+                          struct intel_context,
+                          destroyed_link);
+        if (ce)
+            list_del_init(&ce->destroyed_link);
+        spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+        if (!ce)
+            break;
+
+        release_guc_id(guc, ce);


This looks suboptimal and in conflict with this part of the commit message:

"""
  Re-working the loop to only acquire the spinlock around the list
  management (which is all it is meant to protect) rather than the
  entire destroy operation seems to fix all the above issues.
"""

Because you end up doing:

... loop ...
   spin_lock_irqsave(&guc->submission_state.lock, flags);
   list_del_init(&ce->destroyed_link);
   spin_unlock_irqrestore(&guc->submission_state.lock, flags);

   release_guc_id, which calls:
     spin_lock_irqsave(&guc->submission_state.lock, flags);
     __release_guc_id(guc, ce);
     spin_unlock_irqrestore(&guc->submission_state.lock, flags);

So a) the lock seems to be protecting more than just list management, or release_guc_id is wrong, and b) the loop ends up with highly questionable hammering on the lock.
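To make the lock traffic concrete, here is a minimal single-threaded user-space model of the loop shape quoted above. It is only a sketch: every `model_*` name is hypothetical, the "lock" is a plain counter rather than a real spinlock, and `model_release_id()` merely stands in for release_guc_id(). Under those assumptions the loop acquires the lock twice per destroyed context.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for guc->submission_state.lock: it only counts
 * acquisitions and checks for nesting, it does not synchronise anything. */
struct model_lock {
	int held;
	unsigned long acquisitions;
};

static void model_lock_acquire(struct model_lock *l)
{
	assert(!l->held);	/* nesting would self-deadlock on a real spinlock */
	l->held = 1;
	l->acquisitions++;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical, heavily reduced model of a destroyed context. */
struct model_ctx {
	struct model_ctx *next;
	int id_released;
};

struct model_state {
	struct model_lock lock;
	struct model_ctx *destroyed;	/* head of the destroyed list */
};

/* Models release_guc_id(): takes the lock around the real work itself. */
static void model_release_id(struct model_state *s, struct model_ctx *ce)
{
	model_lock_acquire(&s->lock);
	ce->id_released = 1;
	model_lock_release(&s->lock);
}

/* Models the patched loop; returns how many times the lock was taken. */
static unsigned long model_flush_per_entry(struct model_state *s)
{
	unsigned long before = s->lock.acquisitions;

	while (s->destroyed) {		/* lockless emptiness check */
		struct model_ctx *ce;

		model_lock_acquire(&s->lock);	/* first acquisition: unlink */
		ce = s->destroyed;
		if (ce)
			s->destroyed = ce->next;
		model_lock_release(&s->lock);

		if (!ce)
			break;

		ce->next = NULL;
		model_release_id(s, ce);	/* second acquisition: release id */
	}

	return s->lock.acquisitions - before;
}
```

Calling `model_flush_per_entry()` on a list of N entries returns 2 * N in this model, which is the hammering referred to in point b).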

Is there any point to this part of the patch? Or is the only business end of the patch below:

          __guc_context_destroy(ce);
      }
-    spin_unlock_irqrestore(&guc->submission_state.lock, flags);
  }
  static void deregister_destroyed_contexts(struct intel_guc *guc)
  {
-    struct intel_context *ce, *cn;
+    struct intel_context *ce;
      unsigned long flags;
-    spin_lock_irqsave(&guc->submission_state.lock, flags);
-    list_for_each_entry_safe(ce, cn,
-                 &guc->submission_state.destroyed_contexts,
-                 destroyed_link) {
-        list_del_init(&ce->destroyed_link);
+    while (!list_empty(&guc->submission_state.destroyed_contexts)) {
+        spin_lock_irqsave(&guc->submission_state.lock, flags);

+        ce = list_first_entry_or_null(&guc->submission_state.destroyed_contexts,

+                          struct intel_context,
+                          destroyed_link);
+        if (ce)
+            list_del_init(&ce->destroyed_link);
+        spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+
+        if (!ce)
+            break;
+
          guc_lrc_desc_unpin(ce);


Here?

Not wanting/needing to nest ce->guc_state.lock under guc->submission_state.lock, and call the CPU cycle expensive deregister_context?

1)

Could you unlink en masse, under the assumption destroyed contexts are not reachable from anywhere else at this point, so under a single lock hold?

2)

But then you also end up with guc_lrc_desc_unpin calling __release_guc_id, which when called by release_guc_id does take guc->submission_state.lock and here it does not. Is it then clear which operations inside __release_guc_id need the lock? Bitmap or IDA?

Ah no, with the 2nd point I missed that you changed guc_lrc_desc_unpin to call release_guc_id.

Question on the merit of the change in guc_flush_destroyed_contexts remains, and also whether at both places you could do a group unlink (one lock hold), put on a private list, and then unpin/deregister.
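A rough user-space sketch of the group-unlink idea follows. All `sketch_*` names are hypothetical, the lock is modelled as a bare acquisition counter, and the per-context work is reduced to setting a flag, so any locking that the real unpin/deregister path does internally is not modelled. In the kernel the splice step would presumably be done with list_splice_init(); the small list implementation here only mimics its shape. The sketch also relies on the assumption stated above that destroyed contexts are not reachable from anywhere else once they are on the list.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, shaped like the kernel's list_head. */
struct sketch_list {
	struct sketch_list *next, *prev;
};

static void sketch_list_init(struct sketch_list *h)
{
	h->next = h;
	h->prev = h;
}

static int sketch_list_empty(const struct sketch_list *h)
{
	return h->next == h;
}

static void sketch_list_add_tail(struct sketch_list *n, struct sketch_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Unlink @n from its list and leave it as an empty, self-linked node. */
static void sketch_list_del_init(struct sketch_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	sketch_list_init(n);
}

/* Move every entry of @from to the front of @to and leave @from empty. */
static void sketch_list_splice_init(struct sketch_list *from,
				    struct sketch_list *to)
{
	if (sketch_list_empty(from))
		return;

	from->next->prev = to;
	from->prev->next = to->next;
	to->next->prev = from->prev;
	to->next = from->next;
	sketch_list_init(from);
}

/* Hypothetical, heavily reduced model of a destroyed context. */
struct sketch_ctx {
	struct sketch_list destroyed_link;	/* kept first so the cast below is valid */
	int deregistered;
};

struct sketch_state {
	unsigned long lock_acquisitions;	/* stand-in for the spinlock */
	struct sketch_list destroyed_contexts;
};

/* Unlink everything in one lock hold, then process without the lock.
 * Returns how many times the modelled lock was taken. */
static unsigned long sketch_deregister_batched(struct sketch_state *s)
{
	unsigned long before = s->lock_acquisitions;
	struct sketch_list private_list, *pos, *nxt;

	sketch_list_init(&private_list);

	s->lock_acquisitions++;		/* "lock" */
	sketch_list_splice_init(&s->destroyed_contexts, &private_list);
	/* "unlock" */

	for (pos = private_list.next; pos != &private_list; pos = nxt) {
		struct sketch_ctx *ce = (struct sketch_ctx *)pos;

		nxt = pos->next;
		sketch_list_del_init(pos);
		ce->deregistered = 1;	/* stands in for unpin/deregister */
	}

	return s->lock_acquisitions - before;
}
```

In this model the shared list is touched under a single acquisition regardless of how many contexts are queued, and the expensive per-context work runs with the lock dropped.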


Regards,

Tvrtko

Re: [Intel-gfx] [PATCH 4/7] drm/i915/guc: Don't hog IRQs when destroying contexts
