On 18/11/2019 23:02, Chris Wilson wrote:
In commit a79ca656b648 ("drm/i915: Push the wakeref->count deferral to
the backend"), I erroneously concluded that we last modify the engine
inside __i915_request_commit() meaning that we could enable concurrent
submission for userspace as we enqueued this request. However, this
falls into a trap with other users of the engine->kernel_context waking
up and submitting their request before the idle-switch is queued, with
the result that the kernel_context is executed out-of-sequence most
likely upsetting the GPU and certainly ourselves when we try to retire
the out-of-sequence requests.

As such we need to hold onto the effective engine->kernel_context mutex
lock (via the engine pm mutex proxy) until we have finish queuing the
request to the engine.

v2: Serialise against concurrent intel_gt_retire_requests()

Fixes: a79ca656b648 ("drm/i915: Push the wakeref->count deferral to the 
backend")
Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuopp...@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
---
  drivers/gpu/drm/i915/gt/intel_engine_pm.c | 15 +++++++++------
  1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index 3c0f490ff2c7..722d3eec226e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -75,6 +75,7 @@ static inline void __timeline_mark_unlock(struct 
intel_context *ce,
static bool switch_to_kernel_context(struct intel_engine_cs *engine)
  {
+       struct intel_context *ce = engine->kernel_context;
        struct i915_request *rq;
        unsigned long flags;
        bool result = true;
@@ -99,15 +100,13 @@ static bool switch_to_kernel_context(struct 
intel_engine_cs *engine)
         * retiring the last request, thus all rings should be empty and
         * all timelines idle.
         */
-       flags = __timeline_mark_lock(engine->kernel_context);
+       flags = __timeline_mark_lock(ce);
- rq = __i915_request_create(engine->kernel_context, GFP_NOWAIT);
+       rq = __i915_request_create(ce, GFP_NOWAIT);
        if (IS_ERR(rq))
                /* Context switch failed, hope for the best! Maybe reset? */
                goto out_unlock;
- intel_timeline_enter(i915_request_timeline(rq));
-
        /* Check again on the next retirement. */
        engine->wakeref_serial = engine->serial + 1;
        i915_request_add_active_barriers(rq);
@@ -116,13 +115,17 @@ static bool switch_to_kernel_context(struct 
intel_engine_cs *engine)
        rq->sched.attr.priority = I915_PRIORITY_BARRIER;
        __i915_request_commit(rq);
+ __i915_request_queue(rq, NULL);
+
        /* Release our exclusive hold on the engine */
        __intel_wakeref_defer_park(&engine->wakeref);
-       __i915_request_queue(rq, NULL);
+
+ /* And finally expose our selfselves to intel_gt_retire_requests()

ourselves

*/
+       intel_timeline_enter(ce->timeline);

I haven't really managed to follow this.

What are these other clients which can queue requests and what in this block prevents them doing so?

The change seems to be moving the queuing to before __intel_wakeref_defer_park and intel_timeline_enter to last. Wakeref defer extends the engine lifetime until the submitted rq is retired. But how is that consider "unlocking"?

Regards,

Tvrtko

result = false;
  out_unlock:
-       __timeline_mark_unlock(engine->kernel_context, flags);
+       __timeline_mark_unlock(ce, flags);
        return result;
  }
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Reply via email to