gem_eio: Try harder to measure median resume time

Kamil Konieczny Sat, 18 Oct 2025 11:04:49 -0700

Hi Janusz,
On 2025-10-08 at 14:52:44 +0200, Janusz Krzysztofik wrote:
> Hi Kamil,
> 
> On Wednesday, 8 October 2025 14:14:47 CEST Kamil Konieczny wrote:
> > Hi Janusz,
> > On 2025-10-07 at 13:38:25 +0200, Janusz Krzysztofik wrote:
> > > Subtests that measure time of resume after engine reset require results
> > > from at least 9 reset-resume cycles for reasonable calculation of a median
> > > value to be compared against a presumed limit.  On most Gen12+ platforms,
> > > as well as on some older ones like JSL, CHV, ILK or ELK, the current limit
> > > of 5 seconds for collecting those results occurs too short.
> > > 
> > > Raise the limit to an empirically determined value of 20 seconds and break
> > > the loop as soon as 9 results are collected.
> > > 
> > > v2: Split out a change in handling of not enough measurements to a
> > >     separate patch (Kamil),
> > >   - reword commit message to be more distinct from other patches in
> > >     series (Kamil),
> > >   - reword commit message and description so they no longer state the
> > >     scope of the issue is limited to Gen12+, and list other (non-Gen12+)
> > >     platforms found also affected.
> > > 
> > > Signed-off-by: Janusz Krzysztofik <[email protected]>
> > > ---
> > >  tests/intel/gem_eio.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/tests/intel/gem_eio.c b/tests/intel/gem_eio.c
> > > index 0a00ef026e..79dcef8fa6 100644
> > > --- a/tests/intel/gem_eio.c
> > > +++ b/tests/intel/gem_eio.c
> > > @@ -929,7 +929,7 @@ static void reset_stress(int fd, uint64_t ahnd, const 
> intel_ctx_t *ctx0,
> > >   gem_write(fd, obj.handle, 0, &bbe, sizeof(bbe));
> > >  
> > >   igt_stats_init(&stats);
> > > - igt_until_timeout(5) {
> > > + igt_until_timeout(20) {
> > 
> > What I wanted here was actually (in pseudocode):
> > 
> > mtime = gen < 5 || gen >= 12 ? 20 : 5;
> 
> That's incorrect.  JSL, now mentioned in commit description (see also 
> changelog), is gen11, and it's the only one of that gen that exhibits the 
> problem.  Moreover, some affected CI machines need more time in 
> unwedge-stress 
> and not necessarily in reset-stress, some vice versa, and still some need 
> more 
> time in both.  That may sound strange, but that's how results from my many 
> trybot attempts look like.  Also, not all pre-gen5 machines require a higher 
> limit on resume time, as it is handled now and extended over gen12+ in next 
> patch.  So before I try to fulfil your expectation and use a formula here, 
> not 
> a constant, we have to agree on how much precise that formula should be.  If 
> you don't accept a simplified approach then I have to spend more time on 
> finding out what exactly takes time on kernel side in each of those distinct 
> cases and maybe then I will be able to formulate exact conditions when we 
> should wait longer and when not.
>


One more note - maybe it is related with two GTs: GT0 and GT1?

It could go with simplified formula here and just use some value,
20 or 10?

Btw did you see results for v1? The gem_eio@kms subtests
is failing due to disk limit in CI, and in logs there are
21 'Forcing GPU reset' messages.

https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_13866/shard-dg2-5/igt@[email protected]
https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_13866/shard-tglu-8/igt@[email protected]

Is it not related to your series?
Maybe number of resets should also be lowered?
Also test took over 20 seconds after it was killed.

Regards,
Kamil

> > 
> >     igt_until_timeout(mtime) {
> > 
> > >           const intel_ctx_t *ctx = context_create_safe(fd);
> > >           igt_spin_t *hang;
> > >           unsigned int i;
> > > @@ -978,6 +978,9 @@ static void reset_stress(int fd, uint64_t ahnd, const 
> intel_ctx_t *ctx0,
> > >           gem_sync(fd, obj.handle);
> > >           igt_spin_free(fd, hang);
> > >           intel_ctx_destroy(fd, ctx);
> > > +
> > > +         if (stats.n_values > 8)
> > 
> > Can it be a define as it is used in other places, for example:
> > #define NUMER_OF_MEASURED_CYCLES_NEEDED 9
> > 
> > so you will use it elsewhere, and here it will be:
> > 
> >             if (stats.n_values >= NUMER_OF_MEASURED_CYCLES_NEEDED)
> >                     break;
> 
> OK.
> 
> > 
> > >   }
> > >   check_wait_elapsed(name, fd, &stats);
> > 
> > I did give you r-b for patch 1/5 but I am not sure how
> > reliable are measurements, should it be an assert instead of skip?
> > Maybe function check_wait_elapsed() should return bool to tell if
> > median is ready, and in each place subtests itself decide if it
> > should skip or assert? Up to you.
> 
> check_wait_elapsed() is called only from reset_stress(), which in turn is 
> called only by 3 subtests, all in scope of this series.  Can you suggest some 
> criteria when you think a subtest should skip and when it should fail if not 
> enough results have been collected?  I've chosen skip because we couldn't do 
> much with fail other than blocklisting the failing subtest, while CI can 
> handle skips as expected skips on selected platforms if we really can't find 
> a balance among the loop long enough for collecting enough measurements and 
> short enough for not exceeding per test timeout on platforms with many 
> engines.
> 
> Thanks,
> Janusz
> 
> 
> > 
> > Regards,
> > Kamil
> > 
> > >   igt_stats_fini(&stats);
> > 
> 
> 
> 
>

Re: [PATCH i-g-t v2 2/5] tests/gem_eio: Try harder to measure median resume time

Reply via email to