Chris Wilson <ch...@chris-wilson.co.uk> writes:

> On Mon, Nov 28, 2016 at 02:02:33PM +0000, Tvrtko Ursulin wrote:
>> 
>> On 25/11/2016 09:30, Chris Wilson wrote:
>> >Something I missed before sending off the partial series was that the
>> >non-scheduler GuC reset path was broken (in the full series, this is
>> >pushed to the execlists reset handler). The issue is that after a reset,
>> >we have to refill the GuC workqueues, which we do by resubmitting the
>> >requests. However, if we have already submitted them, the fences within
>> >them have already been used and triggering them again is an error.
>> >Instead, just repopulate the GuC workqueue.
>> >
>> >[  115.858560] [IGT] gem_busy: starting subtest hang-render
>> >[  135.839867] [drm] GPU HANG: ecode 9:0:0xe757fefe, in gem_busy [1716], reason: Hang on render ring, action: reset
>> >[  135.839902] drm/i915: Resetting chip after gpu hang
>> >[  135.839957] [drm] RC6 on
>> >[  135.858351] ------------[ cut here ]------------
>> >[  135.858357] WARNING: CPU: 2 PID: 45 at drivers/gpu/drm/i915/i915_sw_fence.c:108 i915_sw_fence_complete+0x25/0x30
>> >[  135.858357] Modules linked in: rfcomm bnep binfmt_misc nls_iso8859_1 
>> >input_leds snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic 
>> >snd_hda_intel snd_hda_codec snd_hda_core btusb btrtl snd_hwdep snd_pcm 
>> >8250_dw snd_seq_midi hid_lenovo snd_seq_midi_event snd_rawmidi iwlwifi 
>> >x86_pkg_temp_thermal coretemp snd_seq crct10dif_pclmul snd_seq_device 
>> >hci_uart snd_timer crc32_pclmul ghash_clmulni_intel idma64 aesni_intel 
>> >virt_dma btbcm snd btqca aes_x86_64 btintel lrw cfg80211 bluetooth gf128mul 
>> >glue_helper ablk_helper cryptd soundcore intel_lpss_pci intel_pch_thermal 
>> >intel_lpss_acpi intel_lpss acpi_als mfd_core kfifo_buf acpi_pad 
>> >industrialio autofs4 hid_plantronics usbhid dm_mirror dm_region_hash dm_log 
>> >sdhci_pci ahci sdhci libahci i2c_hid hid
>> >[  135.858389] CPU: 2 PID: 45 Comm: kworker/2:1 Tainted: G        W       4.9.0-rc4+ #238
>> >[  135.858389] Hardware name:                  /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
>> >[  135.858392] Workqueue: events_long i915_hangcheck_elapsed
>> >[  135.858394]  ffffc900001bf9b8 ffffffff812bb238 0000000000000000 0000000000000000
>> >[  135.858396]  ffffc900001bf9f8 ffffffff8104f621 0000006c00000000 ffff8808296137f8
>> >[  135.858398]  0000000000000a00 ffff8808457a0000 ffff880845764e60 ffff880845760000
>> >[  135.858399] Call Trace:
>> >[  135.858403]  [<ffffffff812bb238>] dump_stack+0x4d/0x65
>> >[  135.858405]  [<ffffffff8104f621>] __warn+0xc1/0xe0
>> >[  135.858406]  [<ffffffff8104f748>] warn_slowpath_null+0x18/0x20
>> >[  135.858408]  [<ffffffff813f8c15>] i915_sw_fence_complete+0x25/0x30
>> >[  135.858410]  [<ffffffff813f8fad>] i915_sw_fence_commit+0xd/0x30
>> >[  135.858412]  [<ffffffff8142e591>] __i915_gem_request_submit+0xe1/0xf0
>> >[  135.858413]  [<ffffffff8142e5c8>] i915_gem_request_submit+0x28/0x40
>> >[  135.858415]  [<ffffffff814433e7>] i915_guc_submit+0x47/0x210
>> >[  135.858417]  [<ffffffff81443e98>] i915_guc_submission_enable+0x468/0x540
>> >[  135.858419]  [<ffffffff81442495>] intel_guc_setup+0x715/0x810
>> >[  135.858421]  [<ffffffff8142b6b4>] i915_gem_init_hw+0x114/0x2a0
>> >[  135.858423]  [<ffffffff813eeaa8>] i915_reset+0xe8/0x120
>> >[  135.858424]  [<ffffffff813f3937>] i915_reset_and_wakeup+0x157/0x180
>> >[  135.858426]  [<ffffffff813f79db>] i915_handle_error+0x1ab/0x230
>> >[  135.858428]  [<ffffffff812c760d>] ? scnprintf+0x4d/0x90
>> >[  135.858430]  [<ffffffff81435985>] i915_hangcheck_elapsed+0x275/0x3d0
>> >[  135.858432]  [<ffffffff810668cf>] process_one_work+0x12f/0x410
>> >[  135.858433]  [<ffffffff81066bf3>] worker_thread+0x43/0x4d0
>> >[  135.858435]  [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
>> >[  135.858436]  [<ffffffff81066bb0>] ? process_one_work+0x410/0x410
>> >[  135.858438]  [<ffffffff8106bbb4>] kthread+0xd4/0xf0
>> >[  135.858440]  [<ffffffff8106bae0>] ? kthread_park+0x60/0x60
>> >
>> >v2: Only resubmit submitted requests
>> >v3: Don't forget the pending requests have reserved space.
>> >
>> >Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
>> >Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
>> >Cc: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
>> >---
>> > drivers/gpu/drm/i915/i915_guc_submission.c | 37 ++++++++++++++++++++-----------------
>> > 1 file changed, 20 insertions(+), 17 deletions(-)
>> >
>> >diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
>> >index 00b5fa871644..e12ff881d38d 100644
>> >--- a/drivers/gpu/drm/i915/i915_guc_submission.c
>> >+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
>> >@@ -602,12 +602,9 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
>> > }
>> >
>> > /**
>> >- * i915_guc_submit() - Submit commands through GuC
>> >+ * __i915_guc_submit() - Submit commands through GuC
>> >  * @rq:            request associated with the commands
>> >  *
>> >- * Return: 0 on success, otherwise an errno.
>> >- *                 (Note: nonzero really shouldn't happen!)
>> >- *
>> >  * The caller must have already called i915_guc_wq_reserve() above with
>> >  * a result of 0 (success), guaranteeing that there is space in the work
>> >  * queue for the new request, so enqueuing the item cannot fail.
>> >@@ -619,7 +616,7 @@ static int guc_ring_doorbell(struct i915_guc_client *gc)
>> >  * The only error here arises if the doorbell hardware isn't functioning
>> >  * as expected, which really shouldn't happen.
>> >  */
>> >-static void i915_guc_submit(struct drm_i915_gem_request *rq)
>> >+static void __i915_guc_submit(struct drm_i915_gem_request *rq)
>> > {
>> >    struct drm_i915_private *dev_priv = rq->i915;
>> >    struct intel_engine_cs *engine = rq->engine;
>> >@@ -628,17 +625,6 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
>> >    struct i915_guc_client *client = guc->execbuf_client;
>> >    int b_ret;
>> >
>> >-   /* We keep the previous context alive until we retire the following
>> >-    * request. This ensures that the context object is still pinned
>> >-    * for any residual writes the HW makes into it on the context switch
>> >-    * into the next object following the breadcrumb. Otherwise, we may
>> >-    * retire the context too early.
>> >-    */
>> >-   rq->previous_context = engine->last_context;
>> >-   engine->last_context = rq->ctx;
>> >-
>> >-   i915_gem_request_submit(rq);
>> >-
>> >    spin_lock(&client->wq_lock);
>> >    guc_wq_item_append(client, rq);
>> >
>> >@@ -658,6 +644,23 @@ static void i915_guc_submit(struct drm_i915_gem_request *rq)
>> >    spin_unlock(&client->wq_lock);
>> > }
>> >
>> >+static void i915_guc_submit(struct drm_i915_gem_request *rq)
>> >+{
>> >+   struct intel_engine_cs *engine = rq->engine;
>> >+
>> >+   /* We keep the previous context alive until we retire the following
>> >+    * request. This ensures that the context object is still pinned
>> >+    * for any residual writes the HW makes into it on the context switch
>> >+    * into the next object following the breadcrumb. Otherwise, we may
>> >+    * retire the context too early.
>> >+    */
>> >+   rq->previous_context = engine->last_context;
>> >+   engine->last_context = rq->ctx;
>> >+
>> >+   i915_gem_request_submit(rq);
>> >+   __i915_guc_submit(rq);
>> >+}
>> >+
>> > /*
>> >  * Everything below here is concerned with setup & teardown, and is
>> >  * therefore not part of the somewhat time-critical batch-submission
>> >@@ -1556,7 +1559,7 @@ int i915_guc_submission_enable(struct drm_i915_private *dev_priv)
>> >            /* Replay the current set of previously submitted requests */
>> >            list_for_each_entry(rq, &engine->timeline->requests, link) {
>> >                    client->wq_rsvd += sizeof(struct guc_wq_item);
>> >-                   i915_guc_submit(rq);
>> >+                   __i915_guc_submit(rq);
>> >            }
>> >    }
>> >
>> >
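To make the shape of the fix explicit, here is the split condensed from
the diff above (the previous_context bookkeeping, client locking and
doorbell write are elided): i915_gem_request_submit() transfers the
request onto the execution timeline and fires its one-shot submit
fence, so it must run exactly once per request, while the GuC workqueue
append is the part the reset path can safely redo.

	static void i915_guc_submit(struct drm_i915_gem_request *rq)
	{
		i915_gem_request_submit(rq);	/* one-shot: fires the submit fence */
		__i915_guc_submit(rq);		/* repeatable: fills the GuC wq */
	}

	/* Post-reset replay: the requests are already on the timeline,
	 * so only the GuC workqueue needs repopulating (topping up
	 * wq_rsvd first -- see the v3 note about reserved space). */
	list_for_each_entry(rq, &engine->timeline->requests, link) {
		client->wq_rsvd += sizeof(struct guc_wq_item);
		__i915_guc_submit(rq);
	}
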
>> 
>> The only thing bothering me here is how this interacts with the
>> common, or "execlist", part of the reset.
>
> Also applies to ringbuffer (some parts at least).
>  
>> Copying Mika in hope he can help here. :)
>> 
>> I see that i915_gem_reset_engine runs before this GuC wq refill and
>> does some magic with the requests on the engine->timeline->requests
>> lists.
>> 
>> Would it be possible to somehow better integrate the two things?
>
> I thought they were nicely separated. The common work to disable the
> guilty context is handled by the core reset function. The specific work
> required to restart and replay the requests is handled by the different
> backends.
>
> i915_gem_reset_engine: disables the guilty context
> engine->reset: updates the context / ring for restarting past a hung
> batch (if required)
> engine->init: starts the engine, processing the pending requests if any
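
As a rough sketch of that ordering (illustrative only -- the
reset_engine() wrapper and the exact vfunc signatures here are made up,
not the actual code):

	static void reset_engine(struct intel_engine_cs *engine)
	{
		/* common core: identify and disable the guilty context */
		i915_gem_reset_engine(engine);

		/* backend-specific: update the context / ring so the
		 * engine restarts past the hung batch (if required) */
		engine->reset(engine);

		/* backend-specific: start the engine, processing any
		 * pending requests */
		engine->init(engine);
	}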
>
> engine->reset is a bad name, since we have per-engine resets as well,
> but I haven't a better verb yet for handling the request bumping after
> a

How about
engine->setup(request *) to replace ->reset 

and engine->start() to replace ->init()?

-Mika

> reset.
> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre