Re: [PATCH] drm/i915/gt: Reset queue_priority_hint on parking

2024-03-26 Thread Andi Shyti
Hi Janusz and Chris,

> Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
> Closes: https://gitlab.freedesktop.org/drm/intel/issues/10154
> Signed-off-by: Chris Wilson 
> Cc: Mika Kuoppala 
> Signed-off-by: Janusz Krzysztofik 
> Cc: Chris Wilson 
> Cc:  # v5.4+

With the tags rearranged a bit, pushed to drm-intel-gt-next.

Andi


Re: [PATCH] drm/i915/gt: Reset queue_priority_hint on parking

2024-03-20 Thread Janusz Krzysztofik
Hi Andi,

On Wednesday, 20 March 2024 15:29:58 CET Andi Shyti wrote:
> Hi Janusz,
> 
> ...
> 
> > Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
> > Closes: https://gitlab.freedesktop.org/drm/intel/issues/10154
> > Signed-off-by: Chris Wilson 
> > Cc: Mika Kuoppala 
> > Signed-off-by: Janusz Krzysztofik 
> > Cc: Chris Wilson 
> > Cc:  # v5.4+
> 
> This tag list is a bit confusing. Let's keep all Cc's together
> and, besides, Cc'ing the author looks a bit redundant.

You're right, please feel free to fix that while applying.

Thanks,
Janusz

> 
> No need to resend, also because I retriggered another round of
> tests.
> 
> Reviewed-by: Andi Shyti 
> 
> Thanks,
> Andi
> 

Re: [PATCH] drm/i915/gt: Reset queue_priority_hint on parking

2024-03-20 Thread Andi Shyti
Hi Janusz,

...

> Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
> Closes: https://gitlab.freedesktop.org/drm/intel/issues/10154
> Signed-off-by: Chris Wilson 
> Cc: Mika Kuoppala 
> Signed-off-by: Janusz Krzysztofik 
> Cc: Chris Wilson 
> Cc:  # v5.4+

This tag list is a bit confusing. Let's keep all Cc's together
and, besides, Cc'ing the author looks a bit redundant.

No need to resend, also because I retriggered another round of
tests.

Reviewed-by: Andi Shyti 

Thanks,
Andi


Re: [PATCH] drm/i915/gt: Reset queue_priority_hint on parking

2024-03-18 Thread Rodrigo Vivi
On Mon, Mar 18, 2024 at 02:58:47PM +0100, Janusz Krzysztofik wrote:
> From: Chris Wilson 
> 
> Originally, with strict in-order execution, we could complete execution
> only when the queue was empty. Preempt-to-busy allows replacement of an
> active request that may complete before the preemption is processed by
> HW. If that happens, the request is retired from the queue, but the
> queue_priority_hint remains set, preventing direct submission until
> after the next CS interrupt is processed.

Perhaps we are missing an intel_engine_flush_submission() call at preemption?

I wonder if there could be anything else we might be missing
with the lack of the flush.

> 
> This preempt-to-busy race can be triggered by the heartbeat, which will
> also act as the power-management barrier and upon completion allow us to
> idle the HW. We may process the completion of the heartbeat and begin
> parking the engine before the CS event that restores the
> queue_priority_hint, causing us to fail the assertion that it is INT_MIN.
> 
> <3>[  166.210729] __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1))
> <0>[  166.210781] Dumping ftrace buffer:
> <0>[  166.210795] -
> ...
> <0>[  167.302811] drm_fdin-1097  2..s1. 165741070us : trace_ports: :00:02.0 rcs0: promote { ccid:20 1217:2 prio 0 }
> <0>[  167.302861] drm_fdin-1097  2d.s2. 165741072us : execlists_submission_tasklet: :00:02.0 rcs0: preempting last=1217:2, prio=0, hint=2147483646
> <0>[  167.302928] drm_fdin-1097  2d.s2. 165741072us : __i915_request_unsubmit: :00:02.0 rcs0: fence 1217:2, current 0
> <0>[  167.302992] drm_fdin-1097  2d.s2. 165741073us : __i915_request_submit: :00:02.0 rcs0: fence 3:4660, current 4659
> <0>[  167.303044] drm_fdin-1097  2d.s1. 165741076us : execlists_submission_tasklet: :00:02.0 rcs0: context:3 schedule-in, ccid:40
> <0>[  167.303095] drm_fdin-1097  2d.s1. 165741077us : trace_ports: :00:02.0 rcs0: submit { ccid:40 3:4660* prio 2147483646 }
> <0>[  167.303159] kworker/-89   11. 165741139us : i915_request_retire.part.0: :00:02.0 rcs0: fence c90:2, current 2
> <0>[  167.303208] kworker/-89   11. 165741148us : __intel_context_do_unpin: :00:02.0 rcs0: context:c90 unpin
> <0>[  167.303272] kworker/-89   11. 165741159us : i915_request_retire.part.0: :00:02.0 rcs0: fence 1217:2, current 2
> <0>[  167.303321] kworker/-89   11. 165741166us : __intel_context_do_unpin: :00:02.0 rcs0: context:1217 unpin
> <0>[  167.303384] kworker/-89   11. 165741170us : i915_request_retire.part.0: :00:02.0 rcs0: fence 3:4660, current 4660
> <0>[  167.303434] kworker/-89   11d..1. 165741172us : __intel_context_retire: :00:02.0 rcs0: context:1216 retire runtime: { total:56028ns, avg:56028ns }
> <0>[  167.303484] kworker/-89   11. 165741198us : __engine_park: :00:02.0 rcs0: parked
> <0>[  167.303534]   -0 5d.H3. 165741207us : execlists_irq_handler: :00:02.0 rcs0: semaphore yield: 0040
> <0>[  167.303583] kworker/-89   11. 165741397us : __intel_context_retire: :00:02.0 rcs0: context:1217 retire runtime: { total:325575ns, avg:0ns }
> <0>[  167.303756] kworker/-89   11. 165741777us : __intel_context_retire: :00:02.0 rcs0: context:c90 retire runtime: { total:0ns, avg:0ns }
> <0>[  167.303806] kworker/-89   11. 165742017us : __engine_park: __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1))
> <0>[  167.303811] -
> <4>[  167.304722] [ cut here ]
> <2>[  167.304725] kernel BUG at drivers/gpu/drm/i915/gt/intel_engine_pm.c:283!
> <4>[  167.304731] invalid opcode:  [#1] PREEMPT SMP NOPTI
> <4>[  167.304734] CPU: 11 PID: 89 Comm: kworker/11:1 Tainted: GW  6.8.0-rc2-CI_DRM_14193-gc655e0fd2804+ #1
> <4>[  167.304736] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022
> <4>[  167.304738] Workqueue: i915-unordered retire_work_handler [i915]
> <4>[  167.304839] RIP: 0010:__engine_park+0x3fd/0x680 [i915]
> <4>[  167.304937] Code: 00 48 c7 c2 b0 e5 86 a0 48 8d 3d 00 00 00 00 e8 79 48 d4 e0 bf 01 00 00 00 e8 ef 0a d4 e0 31 f6 bf 09 00 00 00 e8 03 49 c0 e0 <0f> 0b 0f 0b be 01 00 00 00 e8 f5 61 fd ff 31 c0 e9 34 fd ff ff 48
> <4>[  167.304940] RSP: 0018:c959fce0 EFLAGS: 00010246
> <4>[  167.304942] RAX: 0200 RBX:  RCX: 0006
> <4>[  167.304944] RDX:  RSI:  RDI: 0009
> <4>[  167.304946] RBP: 8881330ca1b0 R08: 0001 R09: 0001
> <4>[  167.304947] R10: 0001 R11: 0001 R12: 8881330ca000

[PATCH] drm/i915/gt: Reset queue_priority_hint on parking

2024-03-18 Thread Janusz Krzysztofik
From: Chris Wilson 

Originally, with strict in-order execution, we could complete execution
only when the queue was empty. Preempt-to-busy allows replacement of an
active request that may complete before the preemption is processed by
HW. If that happens, the request is retired from the queue, but the
queue_priority_hint remains set, preventing direct submission until
after the next CS interrupt is processed.

This preempt-to-busy race can be triggered by the heartbeat, which will
also act as the power-management barrier and upon completion allow us to
idle the HW. We may process the completion of the heartbeat and begin
parking the engine before the CS event that restores the
queue_priority_hint, causing us to fail the assertion that it is INT_MIN.

<3>[  166.210729] __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1))
<0>[  166.210781] Dumping ftrace buffer:
<0>[  166.210795] -
...
<0>[  167.302811] drm_fdin-1097  2..s1. 165741070us : trace_ports: :00:02.0 rcs0: promote { ccid:20 1217:2 prio 0 }
<0>[  167.302861] drm_fdin-1097  2d.s2. 165741072us : execlists_submission_tasklet: :00:02.0 rcs0: preempting last=1217:2, prio=0, hint=2147483646
<0>[  167.302928] drm_fdin-1097  2d.s2. 165741072us : __i915_request_unsubmit: :00:02.0 rcs0: fence 1217:2, current 0
<0>[  167.302992] drm_fdin-1097  2d.s2. 165741073us : __i915_request_submit: :00:02.0 rcs0: fence 3:4660, current 4659
<0>[  167.303044] drm_fdin-1097  2d.s1. 165741076us : execlists_submission_tasklet: :00:02.0 rcs0: context:3 schedule-in, ccid:40
<0>[  167.303095] drm_fdin-1097  2d.s1. 165741077us : trace_ports: :00:02.0 rcs0: submit { ccid:40 3:4660* prio 2147483646 }
<0>[  167.303159] kworker/-89   11. 165741139us : i915_request_retire.part.0: :00:02.0 rcs0: fence c90:2, current 2
<0>[  167.303208] kworker/-89   11. 165741148us : __intel_context_do_unpin: :00:02.0 rcs0: context:c90 unpin
<0>[  167.303272] kworker/-89   11. 165741159us : i915_request_retire.part.0: :00:02.0 rcs0: fence 1217:2, current 2
<0>[  167.303321] kworker/-89   11. 165741166us : __intel_context_do_unpin: :00:02.0 rcs0: context:1217 unpin
<0>[  167.303384] kworker/-89   11. 165741170us : i915_request_retire.part.0: :00:02.0 rcs0: fence 3:4660, current 4660
<0>[  167.303434] kworker/-89   11d..1. 165741172us : __intel_context_retire: :00:02.0 rcs0: context:1216 retire runtime: { total:56028ns, avg:56028ns }
<0>[  167.303484] kworker/-89   11. 165741198us : __engine_park: :00:02.0 rcs0: parked
<0>[  167.303534]   -0 5d.H3. 165741207us : execlists_irq_handler: :00:02.0 rcs0: semaphore yield: 0040
<0>[  167.303583] kworker/-89   11. 165741397us : __intel_context_retire: :00:02.0 rcs0: context:1217 retire runtime: { total:325575ns, avg:0ns }
<0>[  167.303756] kworker/-89   11. 165741777us : __intel_context_retire: :00:02.0 rcs0: context:c90 retire runtime: { total:0ns, avg:0ns }
<0>[  167.303806] kworker/-89   11. 165742017us : __engine_park: __engine_park:283 GEM_BUG_ON(engine->sched_engine->queue_priority_hint != (-((int)(~0U >> 1)) - 1))
<0>[  167.303811] -
<4>[  167.304722] [ cut here ]
<2>[  167.304725] kernel BUG at drivers/gpu/drm/i915/gt/intel_engine_pm.c:283!
<4>[  167.304731] invalid opcode:  [#1] PREEMPT SMP NOPTI
<4>[  167.304734] CPU: 11 PID: 89 Comm: kworker/11:1 Tainted: GW  6.8.0-rc2-CI_DRM_14193-gc655e0fd2804+ #1
<4>[  167.304736] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022
<4>[  167.304738] Workqueue: i915-unordered retire_work_handler [i915]
<4>[  167.304839] RIP: 0010:__engine_park+0x3fd/0x680 [i915]
<4>[  167.304937] Code: 00 48 c7 c2 b0 e5 86 a0 48 8d 3d 00 00 00 00 e8 79 48 d4 e0 bf 01 00 00 00 e8 ef 0a d4 e0 31 f6 bf 09 00 00 00 e8 03 49 c0 e0 <0f> 0b 0f 0b be 01 00 00 00 e8 f5 61 fd ff 31 c0 e9 34 fd ff ff 48
<4>[  167.304940] RSP: 0018:c959fce0 EFLAGS: 00010246
<4>[  167.304942] RAX: 0200 RBX:  RCX: 0006
<4>[  167.304944] RDX:  RSI:  RDI: 0009
<4>[  167.304946] RBP: 8881330ca1b0 R08: 0001 R09: 0001
<4>[  167.304947] R10: 0001 R11: 0001 R12: 8881330ca000
<4>[  167.304948] R13: 888110f02aa0 R14: 88812d1d0205 R15: 88811277d4f0
<4>[  167.304950] FS:  () GS:88844f78() knlGS:
<4>[  167.304952] CS:  0010 DS:  ES:  CR0: 80050033
<4>[  167.304953] CR2: 7fc362200c40 CR3: 00013306e003 CR4: 00770ef0
<4>[  167.304955] PKRU: 5554
<4>[  167.304957] Call Trace:
<4>[  167.304958]