gem_eio: Try harder to measure median resume time

Janusz Krzysztofik Fri, 17 Oct 2025 18:30:36 -0700

Hi Kamil,

On Wednesday, 8 October 2025 14:14:47 CEST Kamil Konieczny wrote:
> Hi Janusz,
> On 2025-10-07 at 13:38:25 +0200, Janusz Krzysztofik wrote:
> > Subtests that measure time of resume after engine reset require results
> > from at least 9 reset-resume cycles for reasonable calculation of a median
> > value to be compared against a presumed limit.  On most Gen12+ platforms,
> > as well as on some older ones like JSL, CHV, ILK or ELK, the current limit
> > of 5 seconds for collecting those results occurs too short.
> > 
> > Raise the limit to an empirically determined value of 20 seconds and break
> > the loop as soon as 9 results are collected.
> > 
> > v2: Split out a change in handling of not enough measurements to a
> >     separate patch (Kamil),
> >   - reword commit message to be more distinct from other patches in
> >     series (Kamil),
> >   - reword commit message and description so they no longer state the
> >     scope of the issue is limited to Gen12+, and list other (non-Gen12+)
> >     platforms found also affected.
> > 
> > Signed-off-by: Janusz Krzysztofik <[email protected]>
> > ---
> >  tests/intel/gem_eio.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tests/intel/gem_eio.c b/tests/intel/gem_eio.c
> > index 0a00ef026e..79dcef8fa6 100644
> > --- a/tests/intel/gem_eio.c
> > +++ b/tests/intel/gem_eio.c
> > @@ -929,7 +929,7 @@ static void reset_stress(int fd, uint64_t ahnd, const 
intel_ctx_t *ctx0,
> >     gem_write(fd, obj.handle, 0, &bbe, sizeof(bbe));
> >  
> >     igt_stats_init(&stats);
> > -   igt_until_timeout(5) {
> > +   igt_until_timeout(20) {
> 
> What I wanted here was actually (in pseudocode):
> 
> mtime = gen < 5 || gen >= 12 ? 20 : 5;


That's incorrect.  JSL, now mentioned in commit description (see also 
changelog), is gen11, and it's the only one of that gen that exhibits the 
problem.  Moreover, some affected CI machines need more time in unwedge-stress 
and not necessarily in reset-stress, some vice versa, and still some need more 
time in both.  That may sound strange, but that's how results from my many 
trybot attempts look like.  Also, not all pre-gen5 machines require a higher 
limit on resume time, as it is handled now and extended over gen12+ in next 
patch.  So before I try to fulfil your expectation and use a formula here, not 
a constant, we have to agree on how much precise that formula should be.  If 
you don't accept a simplified approach then I have to spend more time on 
finding out what exactly takes time on kernel side in each of those distinct 
cases and maybe then I will be able to formulate exact conditions when we 
should wait longer and when not.

> 
>       igt_until_timeout(mtime) {
> 
> >             const intel_ctx_t *ctx = context_create_safe(fd);
> >             igt_spin_t *hang;
> >             unsigned int i;
> > @@ -978,6 +978,9 @@ static void reset_stress(int fd, uint64_t ahnd, const 
intel_ctx_t *ctx0,
> >             gem_sync(fd, obj.handle);
> >             igt_spin_free(fd, hang);
> >             intel_ctx_destroy(fd, ctx);
> > +
> > +           if (stats.n_values > 8)
> 
> Can it be a define as it is used in other places, for example:
> #define NUMER_OF_MEASURED_CYCLES_NEEDED 9
> 
> so you will use it elsewhere, and here it will be:
> 
>               if (stats.n_values >= NUMER_OF_MEASURED_CYCLES_NEEDED)
>                       break;

OK.

> 
> >     }
> >     check_wait_elapsed(name, fd, &stats);
> 
> I did give you r-b for patch 1/5 but I am not sure how
> reliable are measurements, should it be an assert instead of skip?
> Maybe function check_wait_elapsed() should return bool to tell if
> median is ready, and in each place subtests itself decide if it
> should skip or assert? Up to you.

check_wait_elapsed() is called only from reset_stress(), which in turn is 
called only by 3 subtests, all in scope of this series.  Can you suggest some 
criteria when you think a subtest should skip and when it should fail if not 
enough results have been collected?  I've chosen skip because we couldn't do 
much with fail other than blocklisting the failing subtest, while CI can 
handle skips as expected skips on selected platforms if we really can't find 
a balance among the loop long enough for collecting enough measurements and 
short enough for not exceeding per test timeout on platforms with many 
engines.

Thanks,
Janusz


> 
> Regards,
> Kamil
> 
> >     igt_stats_fini(&stats);
>

Re: [PATCH i-g-t v2 2/5] tests/gem_eio: Try harder to measure median resume time

Reply via email to