RE: [PATCH v3] test/service: fix spurious failures by extending timeout

Van Haaren, Harry Fri, 03 Feb 2023 08:10:10 -0800

> -----Original Message-----
> From: Thomas Monjalon <[email protected]>
> Sent: Friday, February 3, 2023 3:16 PM
> To: David Marchand <[email protected]>; Van Haaren, Harry
> <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; mattias.ronnblom
> <[email protected]>; Morten Brørup
> <[email protected]>; Tyler Retzlaff <[email protected]>;
> Aaron Conole <[email protected]>
> Subject: Re: [PATCH v3] test/service: fix spurious failures by extending 
> timeout
> 
> 03/02/2023 16:03, Van Haaren, Harry:
> > From: Van Haaren, Harry
> > > > The timeout approach just does not have its place in a functional test.
> > > > Either this test is rewritten, or it must go to the performance tests
> > > > list so that we stop getting false positives.
> > > > Can you work on this?
> > >
> > > I'll investigate various approaches on Thursday and reply here with
> > > suggested next steps.
> >
> > I've identified 3 checks that fail in CI (from the above log outputs). All
> > 3 cases have different delays: 100 ms, 200 ms and 1000 ms.
> > In the CI, the service-core just hasn't been scheduled (yet), and that
> > causes the "failure".
> >
> > Option 1)
> > One option is to while(1) loop, waiting for the service-thread to be
> > scheduled. This can be seen as "increasing the timeout", however in this
> > case the test-case would be errored not in the test-code, but in the
> > meson-test runner as a timeout (with a 10sec default?).
> > The benefit here is that massively increasing the timeout (from ~1sec or
> > less to 10 sec) will cover all/many of the CI timeouts.
> >
> > Option 2)
> > Move to perf-tests, and not run these in a noisy-CI environment where the
> > results are not consistent enough to have value. This would mean that the
> > 3 checks in question are not run in CI. They are listed below, and they
> > all *require* the service core to be scheduled:
> > service_attr_get() -> requires service core to run for service stats to
> >   increment
> > service_lcore_attr_get() -> requires service core to run for lcore stats
> >   to increment
> > service_lcore_start_stop() -> requires service to run to ensure the
> >   service-func itself executes.
> >
> > I don't see how we can "improve" option 2 to not require the
> > service-thread to be scheduled by the OS.
> > And the only way to make the OS schedule it in the CI more consistently is
> > to give it more time?
> 
> We are talking about seconds.
> There are setups where scheduling a thread is taking seconds?


Apparently so - otherwise these tests would always pass.

They *only* fail at random runs in CI, and reliably pass everywhere else. I've
not had them fail locally, and that includes running in a loop for hours on a
busy system, but not on a low-priority CI VM in a busy datacenter.


[Bruce wrote in separate mail]
>>> For me, the question is - why hasn't the service-core been scheduled? Can
>>> we use sched-yield or some other mechanism to force a wakeup of it?

I'm not aware of a way to make *a specific other pthread* wake up. We could
sacrifice the current lcore that's waiting for the service-lcore, with a
sched_yield() as you suggest.
It would potentially "churn" the scheduler enough to give the service core some
CPU?
It's a guess/gamble in the end, kind of like the timeouts we have today.
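
A minimal sketch of the sched_yield() idea, using plain pthreads rather than
the DPDK service API. Every name here (yield_demo_service_ran,
yield_demo_service_thread, yield_until_service_ran, run_yield_check) is a
hypothetical stand-in, and the flag only imitates "the service function has
executed at least once":

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in: set once the "service function" has executed. */
static atomic_int yield_demo_service_ran;

/* Hypothetical stand-in for the service lcore thread. */
static void *yield_demo_service_thread(void *arg)
{
	(void)arg;
	atomic_store(&yield_demo_service_ran, 1);
	return NULL;
}

/*
 * Give up the CPU until the flag is set. sched_yield() cannot name the
 * thread that should run next; it only offers the CPU back to the
 * scheduler. Returns how many times the waiter yielded.
 */
static unsigned long yield_until_service_ran(void)
{
	unsigned long yields = 0;

	while (!atomic_load(&yield_demo_service_ran)) {
		sched_yield();
		yields++;
	}
	return yields;
}

/* Hypothetical check: returns 0 once the service thread has run. */
static int run_yield_check(void)
{
	pthread_t tid;

	atomic_store(&yield_demo_service_ran, 0);
	if (pthread_create(&tid, NULL, yield_demo_service_thread, NULL) != 0)
		return -1;
	(void)yield_until_service_ran();
	if (pthread_join(tid, NULL) != 0)
		return -1;
	assert(atomic_load(&yield_demo_service_ran) == 1);
	return 0;
}
```

Calling run_yield_check() from a test returns 0 as soon as the other thread
has been given CPU time. Whether yielding actually gets the other thread
scheduled sooner depends on the OS scheduler, so this remains a hint rather
than a guarantee.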

> > Thoughts and input welcomed, I'm happy to make the code changes
> > themselves, it's a small effort for both option 1 & 2.
> 
> For time-sensitive tests, yes they should be in perf tests category.
> As David said earlier, no timeout approach in functional tests.

Ok, as before, option 1) is to while(1) and wait for "success". Then there's
no timeout in the test code, but our meson test runner will time-out/fail after 
~10sec IIRC.
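
Roughly, option 1 could look like the sketch below. It uses plain pthreads so
it stands alone; wait_demo_service_calls, late_service_thread,
wait_for_service_calls and run_wait_forever_check are hypothetical names, and
the counter only imitates the service stats that the real checks read:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the call counter a service core increments. */
static atomic_uint_fast64_t wait_demo_service_calls;

/* Stand-in service thread that starts late, as a starved CI thread might. */
static void *late_service_thread(void *arg)
{
	const struct timespec startup_delay = {
		.tv_sec = 0, .tv_nsec = 50 * 1000 * 1000 /* 50 ms */
	};

	(void)arg;
	nanosleep(&startup_delay, NULL);
	atomic_fetch_add(&wait_demo_service_calls, 1);
	return NULL;
}

/*
 * No deadline in the test code: poll until the counter moves. If it never
 * does, only the external test runner's timeout ends the test.
 */
static uint64_t wait_for_service_calls(void)
{
	const struct timespec poll_interval = {
		.tv_sec = 0, .tv_nsec = 1000 * 1000 /* 1 ms */
	};
	uint64_t calls;

	while ((calls = atomic_load(&wait_demo_service_calls)) == 0)
		nanosleep(&poll_interval, NULL);
	return calls;
}

/* Hypothetical test body: returns 0 once the service has run at least once. */
static int run_wait_forever_check(void)
{
	pthread_t tid;
	uint64_t calls;

	atomic_store(&wait_demo_service_calls, 0);
	if (pthread_create(&tid, NULL, late_service_thread, NULL) != 0)
		return -1;
	calls = wait_for_service_calls();
	if (pthread_join(tid, NULL) != 0)
		return -1;
	assert(calls >= 1);
	return 0;
}
```

The trade-off is visible in wait_for_service_calls(): a service thread that
never runs shows up as a runner-level timeout instead of a failed assertion
in the test itself.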

Or we move the tests to perf-tests, as per Option 2), and these simply won't
run in CI.

I'm OK with all 3 (including testing with sched_yield() for a month or two to
see if that helps?)
