On Friday, 10 October 2025 11:48:46 CEST Janusz Krzysztofik wrote:
> Hi Kamil,
>
> On Thursday, 9 October 2025 19:05:21 CEST Kamil Konieczny wrote:
> > Hi Janusz,
> > On 2025-10-09 at 12:22:12 +0200, Janusz Krzysztofik wrote:
> > > Hi Kamil,
> > >
> > > On Wednesday, 8 October 2025 18:42:42 CEST Kamil Konieczny wrote:
> > > > Hi Janusz,
> > > > On 2025-10-08 at 14:52:44 +0200, Janusz Krzysztofik wrote:
> > > > > Hi Kamil,
> > > > >
> > > > > On Wednesday, 8 October 2025 14:14:47 CEST Kamil Konieczny wrote:
> > > > > > Hi Janusz,
> > > > > > On 2025-10-07 at 13:38:25 +0200, Janusz Krzysztofik wrote:
> > > > > > > Subtests that measure time of resume after engine reset require
> > > > > > > results from at least 9 reset-resume cycles for a reasonable
> > > > > > > calculation of a median value to be compared against a presumed
> > > > > > > limit.  On most Gen12+ platforms, as well as on some older ones
> > > > > > > like JSL, CHV, ILK or ELK, the current limit of 5 seconds for
> > > > > > > collecting those results turns out to be too short.
> > > > > > >
> > > > > > > Raise the limit to an empirically determined value of 20 seconds
> > > > > > > and break the loop as soon as 9 results are collected.
> > > > > > >
> > > > > > > v2: Split out a change in handling of not enough measurements to
> > > > > > >     a separate patch (Kamil),
> > > > > > >   - reword commit message to be more distinct from other patches
> > > > > > >     in the series (Kamil),
> > > > > > >   - reword commit message and description so they no longer state
> > > > > > >     the scope of the issue is limited to Gen12+, and list other
> > > > > > >     (non-Gen12+) platforms found also affected.
> > > > > > >
> > > > > > > Signed-off-by: Janusz Krzysztofik <[email protected]>
> > > > > > > ---
> > > > > > >  tests/intel/gem_eio.c | 5 ++++-
> > > > > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/tests/intel/gem_eio.c b/tests/intel/gem_eio.c
> > > > > > > index 0a00ef026e..79dcef8fa6 100644
> > > > > > > --- a/tests/intel/gem_eio.c
> > > > > > > +++ b/tests/intel/gem_eio.c
> > > > > > > @@ -929,7 +929,7 @@ static void reset_stress(int fd, uint64_t ahnd, const intel_ctx_t *ctx0,
> > > > > > >  	gem_write(fd, obj.handle, 0, &bbe, sizeof(bbe));
> > > > > > >
> > > > > > >  	igt_stats_init(&stats);
> > > > > > > -	igt_until_timeout(5) {
> > > > > > > +	igt_until_timeout(20) {
> > > > > >
> > > > > > What I wanted here was actually (in pseudocode):
> > > > > >
> > > > > > mtime = gen < 5 || gen >= 12 ? 20 : 5;
> > > > >
> > > > > That's incorrect.  JSL, now mentioned in the commit description (see
> > > > > also the changelog), is Gen11, and it's the only platform of that gen
> > > > > that exhibits the problem.  Moreover, some affected CI machines need
> > > > > more time in unwedge-stress and not necessarily in reset-stress, some
> > > > > vice versa, and some need more time in both.  That may sound strange,
> > > > > but that's what the results of my many trybot attempts look like.
> > > > > Also, not all pre-Gen5 machines require a higher limit on resume
> > > > > time, as handled now and extended over Gen12+ in the next patch.  So
> > > > > before I try to fulfil your expectation and use a formula here
> > > > > instead of a constant, we have to agree on how precise that formula
> > > > > should be.  If you don't accept a simplified approach, then I have to
> > > > > spend more time finding out what exactly takes time on the kernel
> > > > > side in each of those distinct cases, and maybe then I will be able
> > > > > to formulate exact conditions for when we should wait longer and when
> > > > > not.
> > > > >
> > > >
> > > > One more note - maybe it is related to having two GTs: GT0 and GT1?
> > >
> > > Maybe, but that's only one of the potential factors; it doesn't cover
> > > cases like DG2 or ILK, which I believe are two completely different
> > > cases.
> > >
> > > >
> > > > We could go with a simplified formula here and just use some value,
> > > > 20 or 10?
> > >
> > > I still don't understand what your goal here is.  What issue do you
> > > expect to be avoided or resolved by replacing the new, higher constant
> > > value with a formula?  If I understood your point, then I should be
> > > able to propose a solution.
> > >
> > > >
> > > > Btw did you see the results for v1?  The gem_eio@kms subtest is
> > > > failing due to the disk limit in CI, and in the logs there are
> > > > 21 'Forcing GPU reset' messages.
> > > >
> > > > https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_13866/shard-dg2-5/igt@[email protected]
> > > > https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_13866/shard-tglu-8/igt@[email protected]
> > > >
> > > > Is it not related to your series?
> > >
> > > Yes, that's related, in the sense that before, there was a shorter, 5 s
> > > limit for performing resets and measuring resume time, so less noise
> > > accumulated in dmesg than now, when we wait up to 20 s for 9
> > > measurements to be collected in order to avoid falsely reporting
> > > success when we don't evaluate the results because fewer than 9 have
> > > been collected.
> > >
> > > > Maybe the number of resets should also be lowered?
> > >
> > > The kms subtest consists of 3 exercises.  The first one -- inflight --
> > > triggers 7 resets (1 reset per ring); the remaining two are equivalents
> > > of reset-stress and unwedge-stress, with 9 resets per each of the 2
> > > scenarios required as a minimum for a stable median calculation, so 25
> > > resets in total.
> > >
> > > Are you sure we are free to lower that limit of 9 measurements required
> > > for a stable median?
> >
> > Not sure.
> >
> > >
> > > I think we should rather convince CI and display developers to limit
> > > the amount of noise generated in dmesg by display debugging during CI
> > > runs.
> > >
> > > > Also test took over 20 seconds after it was killed.
> > >
> > > I can see a "Disk usage limit exceeded" event reported at timestamp
> > > 340.998570, and the next subtest scheduled at 342.958470.  Where do you
> > > see those 20 seconds?
> > >
> > > Thanks,
> > > Janusz
> > >
> >
> > I did see it in dmesg4.txt, excerpt:
> >
> > 33669:<7>[ 494.568318] [IGT] gem_eio: executing
> > 33835:<7>[ 496.717429] [IGT] gem_eio: starting subtest kms
> > 39187:<7>[ 498.714559] [IGT] Forcing GPU reset
> > 39767:<7>[ 498.903888] [IGT] Checking that the GPU recovered
> >
> > and later:
> >
> > 87173:<7>[ 516.070903] [IGT] Checking that the GPU recovered
> > 88860:<7>[ 516.628325] [IGT] Forcing GPU reset
> > 89701:<7>[ 518.764372] [IGT] kms_frontbuffer_tracking: executing
> >
> > so 516 - 496 is 20.
> >
> > Re-checked it:
> > grep IGT 25-10-08-gem-eio-dmesg4.txt |grep ': executing'|grep -A5 gem_eio
> > <7>[ 494.568318] [IGT] gem_eio: executing
> > <7>[ 518.764372] [IGT] kms_frontbuffer_tracking: executing
>
> OK, but that doesn't show when gem_eio was killed, so it doesn't mean
> gem_eio "took over 20 seconds after it was killed".  Timestamps found in
> the igt_runner log show it took less than 2 seconds.
>
> [494.326197] [085/130] (533s left) gem_eio (kms)
> [496.519345] Starting subtest: kms
> [516.673500] Disk usage limit exceeded.
> [518.473021] Closing watchdogs
> [518.475503] Initializing watchdogs
> [518.475569] /dev/watchdog0
> [518.503334] [FACT before any test] new: hardware.pci.gpu_at_addr.0000:03:00.0: 8086:56a0 Intel Dg2 (Gen12) DG2 [Arc A770]
> [518.516396] [FACT before any test] new: hardware.pci.drm_card_at_addr.0000:03:00.0: card0
> [518.518650] [FACT before any test] new: kernel.kmod_is_loaded.i915: true
> [518.519215] [FACT before any test] new: kernel.kmod_is_loaded.vgem: true
> [518.521639] [086/130] (508s left) kms_frontbuffer_tracking (fbcpsr-2p-primscrn-cur-indfb-draw-mmap-wc)
>
> >
> > Btw there were some KMS changes which lowered DRM logging so as not to
> > trigger the disk limit; maybe that should also be used in this subtest,
> > for example after the first reset.
>
> Did you mean there were changes in the kernel, or in IGT? It would be very
> helpful if we could reuse an already accepted method for decreasing verbosity
> of display code from IGT, if there is one.
OK, I found something potentially useful in IGT.
Thanks,
Janusz
>
> Thanks,
> Janusz
>
>
> >
> > Regards,
> > Kamil
> >
> > >
> > > >
> > > > Regards,
> > > > Kamil
> > > >
> > > > > >
> > > > > > igt_until_timeout(mtime) {
> > > > > >
> > > > > > > const intel_ctx_t *ctx = context_create_safe(fd);
> > > > > > > igt_spin_t *hang;
> > > > > > > unsigned int i;
> > > > > > > @@ -978,6 +978,9 @@ static void reset_stress(int fd, uint64_t ahnd, const intel_ctx_t *ctx0,
> > > > > > >  	gem_sync(fd, obj.handle);
> > > > > > >  	igt_spin_free(fd, hang);
> > > > > > >  	intel_ctx_destroy(fd, ctx);
> > > > > > > +
> > > > > > > +	if (stats.n_values > 8)
> > > > > >
> > > > > > Can it be a define, as it is used in other places?  For example:
> > > > > > #define NUMBER_OF_MEASURED_CYCLES_NEEDED 9
> > > > > >
> > > > > > so you can use it elsewhere, and here it will be:
> > > > > >
> > > > > > 	if (stats.n_values >= NUMBER_OF_MEASURED_CYCLES_NEEDED)
> > > > > > 		break;
> > > > >
> > > > > OK.
> > > > >
> > > > > >
> > > > > > >  	}
> > > > > > >  	check_wait_elapsed(name, fd, &stats);
> > > > > >
> > > > > > I did give you my r-b for patch 1/5, but I am not sure how
> > > > > > reliable the measurements are -- should it be an assert instead
> > > > > > of a skip?  Maybe check_wait_elapsed() should return bool to tell
> > > > > > whether the median is ready, and in each place the subtest itself
> > > > > > decides if it should skip or assert?  Up to you.
> > > > >
> > > > > check_wait_elapsed() is called only from reset_stress(), which in
> > > > > turn is called only by 3 subtests, all in scope of this series.  Can
> > > > > you suggest some criteria for when you think a subtest should skip
> > > > > and when it should fail if not enough results have been collected?
> > > > > I've chosen skip because we couldn't do much with a fail other than
> > > > > blocklisting the failing subtest, while CI can handle skips as
> > > > > expected skips on selected platforms if we really can't find a
> > > > > balance between a loop long enough for collecting enough
> > > > > measurements and short enough not to exceed the per-test timeout on
> > > > > platforms with many engines.
> > > > >
> > > > > Thanks,
> > > > > Janusz
> > > > >
> > > > >
> > > > > >
> > > > > > Regards,
> > > > > > Kamil
> > > > > >
> > > > > > > igt_stats_fini(&stats);
> > > > > >
> > > > >
> > > >
> > >
> > >
> >
>