Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v6] tests/perf_pmu: Verify engine busyness accuracy

2018-02-19 Thread Chris Wilson
Quoting Tvrtko Ursulin (2018-02-19 10:58:25)
> 
> On 19/02/2018 10:26, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-02-19 09:57:20)
> >>
> >> On 19/02/2018 09:27, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-02-19 09:19:47)
> >>>>
> >>>> Do you have a link to BSW hang? Is that obviously related to PMU?
> >>>
> >>> It's only occurring in this test, just looks like an issue with the
> >>> spinner:
> >>>
> >>> [bsw] 
> >>> https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-bsw-n3050/igt@perf_...@busy-accuracy-2-bcs0.html
> >>
> >> ...
> >> <0>[  681.022677] perf_pmu-15161..s1 282520414us : 
> >> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> >> <0>[  681.022838] perf_pmu-15161..s1 282520580us : 
> >> execlists_submission_tasklet: bcs0 cs-irq head=5 [5?], tail=0 [0?]
> >> <0>[  681.023001] perf_pmu-15161..s1 282520594us : 
> >> execlists_submission_tasklet: bcs0 csb[0]: status=0x0001:0x, 
> >> active=0x1
> >> <0>[  681.023168] kworker/-338 1 298087910us : reset_common_ring: 
> >> bcs0 seqno=a
> >> <0>[  681.023321] ksoftirq-17  1..s. 298088483us : 
> >> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> >> <0>[  681.023482] ksoftirq-17  1..s. 298088575us : 
> >> execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
> >> <0>[  681.023644] ksoftirq-17  1..s. 298088579us : 
> >> execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, 
> >> active=0x1
> >> <0>[  681.023811] ksoftirq-17  1..s. 298088581us : 
> >> execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a
> >>
> >> Everything stops.
> >>
> >>> [kbl] 
> >>> https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-kbl-7560u/igt@perf_...@busy-accuracy-2-bcs0.html
> >>
> >> ...
> >> <0>[  506.745332] perf_pmu-15443..s1 107905835us : 
> >> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> >> <0>[  506.745397]   -0   2..s1 107905980us : 
> >> execlists_submission_tasklet: bcs0 cs-irq head=2 [1?], tail=3 [3?]
> >> <0>[  506.745440]   -0   2..s1 107905983us : 
> >> execlists_submission_tasklet: bcs0 csb[3]: status=0x0001:0x, 
> >> active=0x1
> >> <0>[  506.745498] kworker/-30  3 120840583us : reset_common_ring: 
> >> bcs0 seqno=a
> >> <0>[  506.745547] ksoftirq-29  3..s. 120840688us : 
> >> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> >> <0>[  506.745598] in:imklo-499 2..s1 120840710us : 
> >> execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
> >> <0>[  506.745637] in:imklo-499 2..s1 120840712us : 
> >> execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, 
> >> active=0x1
> >> <0>[  506.745676] in:imklo-499 2..s1 120840713us : 
> >> execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a
> >>
> >> Everything stops here.
> >>
> >> I have no idea what's happening here. In both cases I would expect the test
> >> to have exited after the GPU hang (or at least attempt to exit!), since it
> >> would detect it overran the timeout.
> >>
> >> Could it be stuck in gem_sync after the reset? Or somewhere else?
> > 
> > I think it's that we will be throwing the calibration off if it hangs.
> > If busy_ns = 10s, won't that generate a target idle time of 500s?
> 
> Indeed, well spotted. I'll need to add a hang detector of some sort.

Oh, I think I know why it's hanging. As the buffer will be idle, the
kernel is allowed to move it, and __submit_spin_batch() doesn't tell the
kernel to preserve the original address (so the kernel assumes that the
relocations are relative to the passed-in address and so moves the buffer
to match). I should have noticed that before given the discussion around
EXEC_OBJECT_PINNED for the spinner.

I think there's an easy enough patch...
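
A minimal sketch of the idea, not the actual patch: when resubmitting the spinner, pass back the address the object was first bound at and set EXEC_OBJECT_PINNED so the kernel does not relocate it. The struct below is a stand-in for the relevant fields of struct drm_i915_gem_exec_object2, and pin_at_original_address() is a hypothetical helper name; the flag value is quoted from memory of i915_drm.h and should be checked against the real header.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for the fields of struct drm_i915_gem_exec_object2 that matter
 * here. Real code would include the i915 uAPI header instead.
 */
struct exec_object {
	uint32_t handle;
	uint64_t offset;	/* in: presumed address, out: address used */
	uint64_t flags;
};

/* Assumed to match the value in i915_drm.h; verify before relying on it. */
#define EXEC_OBJECT_PINNED (1ull << 4)

/*
 * Hypothetical helper: ask the kernel to keep the batch at the address it
 * was originally bound to, so already-resolved relocations stay valid.
 */
static void pin_at_original_address(struct exec_object *obj,
				    uint64_t first_offset)
{
	assert(obj);

	obj->offset = first_offset;
	obj->flags |= EXEC_OBJECT_PINNED;
}
```

The first_offset value would be whatever the kernel reported back in the offset field after the initial submission.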
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v6] tests/perf_pmu: Verify engine busyness accuracy

2018-02-19 Thread Tvrtko Ursulin


On 19/02/2018 10:26, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-02-19 09:57:20)
>>
>> On 19/02/2018 09:27, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-02-19 09:19:47)
>>>>
>>>> Do you have a link to BSW hang? Is that obviously related to PMU?
>>>
>>> It's only occurring in this test, just looks like an issue with the
>>> spinner:
>>>
>>> [bsw] 
>>> https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-bsw-n3050/igt@perf_...@busy-accuracy-2-bcs0.html
>>
>> ...
>> <0>[  681.022677] perf_pmu-15161..s1 282520414us : 
>> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
>> <0>[  681.022838] perf_pmu-15161..s1 282520580us : 
>> execlists_submission_tasklet: bcs0 cs-irq head=5 [5?], tail=0 [0?]
>> <0>[  681.023001] perf_pmu-15161..s1 282520594us : 
>> execlists_submission_tasklet: bcs0 csb[0]: status=0x0001:0x, active=0x1
>> <0>[  681.023168] kworker/-338 1 298087910us : reset_common_ring: bcs0 
>> seqno=a
>> <0>[  681.023321] ksoftirq-17  1..s. 298088483us : 
>> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
>> <0>[  681.023482] ksoftirq-17  1..s. 298088575us : 
>> execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
>> <0>[  681.023644] ksoftirq-17  1..s. 298088579us : 
>> execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, active=0x1
>> <0>[  681.023811] ksoftirq-17  1..s. 298088581us : 
>> execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a
>>
>> Everything stops.
>>
>>> [kbl] 
>>> https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-kbl-7560u/igt@perf_...@busy-accuracy-2-bcs0.html
>>
>> ...
>> <0>[  506.745332] perf_pmu-15443..s1 107905835us : 
>> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
>> <0>[  506.745397]   -0   2..s1 107905980us : 
>> execlists_submission_tasklet: bcs0 cs-irq head=2 [1?], tail=3 [3?]
>> <0>[  506.745440]   -0   2..s1 107905983us : 
>> execlists_submission_tasklet: bcs0 csb[3]: status=0x0001:0x, active=0x1
>> <0>[  506.745498] kworker/-30  3 120840583us : reset_common_ring: bcs0 
>> seqno=a
>> <0>[  506.745547] ksoftirq-29  3..s. 120840688us : 
>> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
>> <0>[  506.745598] in:imklo-499 2..s1 120840710us : 
>> execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
>> <0>[  506.745637] in:imklo-499 2..s1 120840712us : 
>> execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, active=0x1
>> <0>[  506.745676] in:imklo-499 2..s1 120840713us : 
>> execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a
>>
>> Everything stops here.
>>
>> I have no idea what's happening here. In both cases I would expect the test
>> to have exited after the GPU hang (or at least attempt to exit!), since it
>> would detect it overran the timeout.
>>
>> Could it be stuck in gem_sync after the reset? Or somewhere else?
> 
> I think it's that we will be throwing the calibration off if it hangs.
> If busy_ns = 10s, won't that generate a target idle time of 500s?


Indeed, well spotted. I'll need to add a hang detector of some sort.
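
Such a hang detector could be as small as a sanity check on the calibrated busy time. The sketch below is only an illustration: busy_looks_hung() and its threshold (ten times the request plus 100ms of slack) are hypothetical and do not come from perf_pmu.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/*
 * Hypothetical check: a measured busy period far above what was requested
 * suggests the spinner hung and was only ended by a GPU reset.
 */
static bool busy_looks_hung(uint64_t requested_ns, uint64_t measured_ns)
{
	assert(requested_ns > 0);

	/* Assumed threshold: 10x the request plus 100ms of scheduling slack. */
	return measured_ns > 10 * requested_ns + 100 * NSEC_PER_MSEC;
}
```

With the figure from this thread, a 10s measurement against a request of a few milliseconds would be flagged, and the subtest could bail out before computing an idle target from it.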

In the meantime I am trying to figure out how to wire up GuC to engine stats. 
The fix to get the correct state on stats enable by looking at ports is a 
problem, given the different tracking I had in GuC mode.


Regards,

Tvrtko




Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v6] tests/perf_pmu: Verify engine busyness accuracy

2018-02-19 Thread Chris Wilson
Quoting Tvrtko Ursulin (2018-02-19 09:57:20)
> 
> On 19/02/2018 09:27, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-02-19 09:19:47)
> >>
> >> Do you have a link to BSW hang? Is that obviously related to PMU?
> > 
> > It's only occurring in this test, just looks like an issue with the
> > spinner:
> > 
> > [bsw] 
> > https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-bsw-n3050/igt@perf_...@busy-accuracy-2-bcs0.html
> 
> ...
> <0>[  681.022677] perf_pmu-15161..s1 282520414us : 
> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> <0>[  681.022838] perf_pmu-15161..s1 282520580us : 
> execlists_submission_tasklet: bcs0 cs-irq head=5 [5?], tail=0 [0?]
> <0>[  681.023001] perf_pmu-15161..s1 282520594us : 
> execlists_submission_tasklet: bcs0 csb[0]: status=0x0001:0x, 
> active=0x1
> <0>[  681.023168] kworker/-338 1 298087910us : reset_common_ring: 
> bcs0 seqno=a
> <0>[  681.023321] ksoftirq-17  1..s. 298088483us : 
> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> <0>[  681.023482] ksoftirq-17  1..s. 298088575us : 
> execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
> <0>[  681.023644] ksoftirq-17  1..s. 298088579us : 
> execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, 
> active=0x1
> <0>[  681.023811] ksoftirq-17  1..s. 298088581us : 
> execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a
> 
> Everything stops.
> 
> > [kbl] 
> > https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-kbl-7560u/igt@perf_...@busy-accuracy-2-bcs0.html
> 
> ...
> <0>[  506.745332] perf_pmu-15443..s1 107905835us : 
> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> <0>[  506.745397]   -0   2..s1 107905980us : 
> execlists_submission_tasklet: bcs0 cs-irq head=2 [1?], tail=3 [3?]
> <0>[  506.745440]   -0   2..s1 107905983us : 
> execlists_submission_tasklet: bcs0 csb[3]: status=0x0001:0x, 
> active=0x1
> <0>[  506.745498] kworker/-30  3 120840583us : reset_common_ring: 
> bcs0 seqno=a
> <0>[  506.745547] ksoftirq-29  3..s. 120840688us : 
> execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
> <0>[  506.745598] in:imklo-499 2..s1 120840710us : 
> execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
> <0>[  506.745637] in:imklo-499 2..s1 120840712us : 
> execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, 
> active=0x1
> <0>[  506.745676] in:imklo-499 2..s1 120840713us : 
> execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a
> 
> Everything stops here.
> 
> I have no idea what's happening here. In both cases I would expect the test
> to have exited after the GPU hang (or at least attempt to exit!), since it
> would detect it overran the timeout.
> 
> Could it be stuck in gem_sync after the reset? Or somewhere else?

I think it's that we will be throwing the calibration off if it hangs.
If busy_ns = 10s, won't that generate a target idle time of 500s?
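
As a back-of-the-envelope sketch of that arithmetic: assuming the idle time is chosen so that busy_ns makes up target_pct percent of each PWM period, a 10s busy time at 2% load needs 490s of idle, which is roughly the 500s above. pwm_idle_ns() is a hypothetical helper; the test's own calculation is not quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Idle time needed so that busy_ns of work amounts to target_pct percent
 * of one PWM period. Hypothetical helper, not taken from perf_pmu.c.
 */
static uint64_t pwm_idle_ns(uint64_t busy_ns, unsigned int target_pct)
{
	assert(target_pct > 0 && target_pct <= 100);

	return busy_ns * (100 - target_pct) / target_pct;
}
```

For the three patterns in the patch (2%, 50% and 98%), the same busy time maps to idle times of 49x, 1x and about 0.02x the busy time, so an inflated busy_ns hurts the 2% case by far the most.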
-Chris


Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v6] tests/perf_pmu: Verify engine busyness accuracy

2018-02-19 Thread Tvrtko Ursulin

On 19/02/2018 09:27, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-02-19 09:19:47)
>>
>> Do you have a link to BSW hang? Is that obviously related to PMU?
> 
> It's only occurring in this test, just looks like an issue with the
> spinner:
> 
> [bsw] 
> https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-bsw-n3050/igt@perf_...@busy-accuracy-2-bcs0.html

...
<0>[  681.022677] perf_pmu-15161..s1 282520414us : 
execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
<0>[  681.022838] perf_pmu-15161..s1 282520580us : 
execlists_submission_tasklet: bcs0 cs-irq head=5 [5?], tail=0 [0?]
<0>[  681.023001] perf_pmu-15161..s1 282520594us : 
execlists_submission_tasklet: bcs0 csb[0]: status=0x0001:0x, 
active=0x1
<0>[  681.023168] kworker/-338 1 298087910us : reset_common_ring: bcs0 
seqno=a
<0>[  681.023321] ksoftirq-17  1..s. 298088483us : 
execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
<0>[  681.023482] ksoftirq-17  1..s. 298088575us : 
execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
<0>[  681.023644] ksoftirq-17  1..s. 298088579us : 
execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, 
active=0x1
<0>[  681.023811] ksoftirq-17  1..s. 298088581us : 
execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a

Everything stops.

> [kbl] 
> https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-kbl-7560u/igt@perf_...@busy-accuracy-2-bcs0.html

...
<0>[  506.745332] perf_pmu-15443..s1 107905835us : 
execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
<0>[  506.745397]   -0   2..s1 107905980us : 
execlists_submission_tasklet: bcs0 cs-irq head=2 [1?], tail=3 [3?]
<0>[  506.745440]   -0   2..s1 107905983us : 
execlists_submission_tasklet: bcs0 csb[3]: status=0x0001:0x, 
active=0x1
<0>[  506.745498] kworker/-30  3 120840583us : reset_common_ring: bcs0 
seqno=a
<0>[  506.745547] ksoftirq-29  3..s. 120840688us : 
execlists_submission_tasklet: bcs0 in[0]:  ctx=3.1, seqno=a
<0>[  506.745598] in:imklo-499 2..s1 120840710us : 
execlists_submission_tasklet: bcs0 cs-irq head=0 [0], tail=1 [1]
<0>[  506.745637] in:imklo-499 2..s1 120840712us : 
execlists_submission_tasklet: bcs0 csb[1]: status=0x0018:0x0003, 
active=0x1
<0>[  506.745676] in:imklo-499 2..s1 120840713us : 
execlists_submission_tasklet: bcs0 out[0]: ctx=3.1, seqno=a

Everything stops here.

I have no idea what's happening here. In both cases I would expect the test
to have exited after the GPU hang (or at least attempt to exit!), since it
would detect it overran the timeout.

Could it be stuck in gem_sync after the reset? Or somewhere else?

Could we add an equivalent of "echo t > /proc/sysrq-trigger" when owatch triggers?

Or would it overflow some buffer? It should work in cases like this one, when
it is not a machine hang.
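
For illustration, the C equivalent of that echo is a one-byte write to the trigger file. This is a sketch only: sysrq_write() is a hypothetical helper, and the path is a parameter purely so it can be tried without root. The real target would be /proc/sysrq-trigger, where 't' dumps the task list to the kernel log.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write a single sysrq command character to the given trigger file.
 * Returns 0 on success, -1 on failure.
 */
static int sysrq_write(const char *path, char cmd)
{
	int fd, ok;

	assert(path);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	/* The kernel acts on the first character written. */
	ok = write(fd, &cmd, 1) == 1;
	close(fd);

	return ok ? 0 : -1;
}
```

A watchdog hook would call sysrq_write("/proc/sysrq-trigger", 't') just before giving up on the test.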

Regards,

Tvrtko


Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v6] tests/perf_pmu: Verify engine busyness accuracy

2018-02-19 Thread Chris Wilson
Quoting Tvrtko Ursulin (2018-02-19 09:19:47)
> 
> Do you have a link to BSW hang? Is that obviously related to PMU?

It's only occurring in this test, just looks like an issue with the
spinner:

[bsw] 
https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-bsw-n3050/igt@perf_...@busy-accuracy-2-bcs0.html
[kbl] 
https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-kbl-7560u/igt@perf_...@busy-accuracy-2-bcs0.html

-Chris


Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v6] tests/perf_pmu: Verify engine busyness accuracy

2018-02-19 Thread Tvrtko Ursulin


On 17/02/2018 11:36, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-02-15 15:34:53)
>> From: Tvrtko Ursulin 
>>
>> A subtest to verify that the engine busyness is reported with expected
>> accuracy on platforms where the feature is available.
>>
>> We test three patterns: 2%, 50% and 98% load per engine.
>>
>> v2:
>>  * Use spin batch instead of nop calibration.
>>  * Various tweaks.
>>
>> v3:
>>  * Change loops to be time based.
>>  * Use __igt_spin_batch_new inside timing sensitive loops.
>>  * Fixed PWM sleep handling.
>>
>> v4:
>>  * Use restarting spin batch.
>>  * Calibrate more carefully by looking at the real PWM loop.
>>
>> v5:
>>  * Made standalone.
>>  * Better info messages.
>>  * Tweak sleep compensation.
>>
>> v6:
>>  * Some final tweaks. (Chris Wilson)
>>
>> Signed-off-by: Tvrtko Ursulin 
>> Reviewed-by: Chris Wilson 
>> ---
>> +
>> +   /* Sampling platforms cannot reach the high accuracy criteria. */
>> +   igt_require(gem_has_execlists(gem_fd));
> 
> But we don't handle guc, right?

Correct.

> igt_skip_on(gem_has_guc_submission(gem_fd)) ?

I'll dig up and rebase my old patch which implements busy stats in GuC 
mode.

> https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-skl-guc/igt@perf_...@busy-accuracy-2-vecs0.html
> 
> Or at least it doesn't work to sufficient accuracy. And bsw hung.

There are some occasional excursions over 15% tolerance even with 
execlists on small core. Bummer. Don't want to be playing up the 
tolerance game. I'll analyse in more detail and think what to do.

Do you have a link to BSW hang? Is that obviously related to PMU?

Regards,

Tvrtko




Re: [Intel-gfx] [igt-dev] [PATCH i-g-t v6] tests/perf_pmu: Verify engine busyness accuracy

2018-02-17 Thread Chris Wilson
Quoting Tvrtko Ursulin (2018-02-15 15:34:53)
> From: Tvrtko Ursulin 
> 
> A subtest to verify that the engine busyness is reported with expected
> accuracy on platforms where the feature is available.
> 
> We test three patterns: 2%, 50% and 98% load per engine.
> 
> v2:
>  * Use spin batch instead of nop calibration.
>  * Various tweaks.
> 
> v3:
>  * Change loops to be time based.
>  * Use __igt_spin_batch_new inside timing sensitive loops.
>  * Fixed PWM sleep handling.
> 
> v4:
>  * Use restarting spin batch.
>  * Calibrate more carefully by looking at the real PWM loop.
> 
> v5:
>  * Made standalone.
>  * Better info messages.
>  * Tweak sleep compensation.
> 
> v6:
>  * Some final tweaks. (Chris Wilson)
> 
> Signed-off-by: Tvrtko Ursulin 
> Reviewed-by: Chris Wilson 
> ---
> +
> +   /* Sampling platforms cannot reach the high accuracy criteria. */
> +   igt_require(gem_has_execlists(gem_fd));

But we don't handle guc, right?
igt_skip_on(gem_has_guc_submission(gem_fd)) ?

https://intel-gfx-ci.01.org/tree/drm-tip/kasan_2/fi-skl-guc/igt@perf_...@busy-accuracy-2-vecs0.html

Or at least it doesn't work to sufficient accuracy. And bsw hung.
-Chris