Re: Intermittent t_timerfd failures

2023-07-09 Thread Andreas Gustafsson
Taylor R Campbell wrote:
> sys/lib/libc/sys/t_timerfd.c timerfd_block is intermittently failing
> on the i386 test bed.

Here are some random observations from grepping the logs on various
testbeds.

1. The failures have been happening for as long as the timerfd test
case has existed.  The test case was first run with source date
2021.09.19.15.52.55, and the first failure was soon thereafter,
using source date 2021.09.21.09.24.15.

2. The execution time recorded by ATF on i386 indicates that the
failed test cases completed in less than a second, while the
successful ones took longer.  For example:

2023/2023.07.08.15.32.58/test.log.gz:timerfd_block: [1.211653s] Passed.
2023/2023.07.08.16.13.00/test.log.gz:timerfd_block: [1.540414s] Passed.
2023/2023.07.08.17.43.13/test.log.gz:timerfd_block: [1.649699s] Passed.
2023/2023.07.08.19.10.00/test.log.gz:timerfd_block: [0.864084s] Failed: 
/tmp/build/2023.07.08.19.10.00-i386/src/tests/lib/libc/sys/t_timerfd.c:198: 
then=1368.605876566 now=1369.462708087 delta=0.856831521
2023/2023.07.08.20.02.10/test.log.gz:timerfd_block: [1.203076s] Passed.
2023/2023.07.08.23.42.48/test.log.gz:timerfd_block: [1.699977s] Passed.
2023/2023.07.09.00.01.55/test.log.gz:timerfd_block: [1.231773s] Passed.

3. The TNF amd64 testbed shows failures similar to the i386 one.
Both are using "qemu -accel kvm" hosted on NetBSD 9.

4. The TNF sparc testbed shows no failures in 3456 runs (using
plain qemu, no acceleration).

5. The TNF macppc testbed (also plain qemu) shows frequent failures,
but unlike i386 and amd64, the execution time indicated by ATF is more
than a second:

./2023/2023.07.06.07.59.00/test.log.gz:timerfd_block: [1.132080s] Passed.
./2023/2023.07.07.00.20.39/test.log.gz:timerfd_block: [1.062924s] Passed.
./2023/2023.07.07.04.43.15/test.log.gz:timerfd_block: [1.029156s] Failed: 
/tmp/build/2023.07.07.04.43.15-macppc/src/tests/lib/libc/sys/t_timerfd.c:198: 
check_value_against_bounds(delta.tv_sec, 1, 1) not met
./2023/2023.07.07.08.39.23/test.log.gz:timerfd_block: [1.056571s] Passed.
./2023/2023.07.07.20.19.08/test.log.gz:timerfd_block: [1.081552s] Passed.

6. My own amd64 testbed running on real hardware shows no failures in
205 runs.

7. My own i386 testbed using "qemu -accel kvm" on Linux shows no
failures in 359 runs.
-- 
Andreas Gustafsson, g...@netbsd.org


Intermittent t_timerfd failures

2023-07-09 Thread Taylor R Campbell
sys/lib/libc/sys/t_timerfd.c timerfd_block is intermittently failing
on the i386 test bed.  I added some diagnostic prints to see exactly
what the timing failure is, which yielded:

t_timerfd.c:198: then=1368.605876566 now=1369.462708087 delta=0.856831521

https://releng.netbsd.org/b5reports/i386/2023/2023.07.08.19.10.00/test.html#lib_libc_sys_t_timerfd_timerfd_block

delta is supposed to be at least 1sec, because the test does:

const struct itimerspec its = {
.it_value = { .tv_sec = 1, .tv_nsec = 0 },
.it_interval = { .tv_sec = 0, .tv_nsec = 0 },
};

ATF_REQUIRE(clock_gettime(CLOCK_MONOTONIC, ) == 0);
ATF_REQUIRE(timerfd_settime(fd, 0, , NULL) == 0);
ATF_REQUIRE(timerfd_read(fd, ) == 0);
ATF_REQUIRE(clock_gettime(CLOCK_MONOTONIC, ) == 0);

That is, it checks its watch, asks to be woken in >=1sec, and then
checks its watch again; the test then verifies that the watch has
indeed advanced by >=1sec.

This is very puzzling to me -- I would expect delta to be >2x the
sleep time because of the usual hz=100-guest-on-hz=100-host problem
(https://gnats.netbsd.org/43997), and certainly not less than 1sec!

One possible discrepancy is that clock_gettime(CLOCK_MONOTONIC) uses
nanouptime, which returns results with the resolution of the
timecounter; whereas timerfd_settime uses getnanouptime, which returns
reuslts with the (most likely much coarser) resolution of the
hardclock interrupt, hz.

But I don't see how that could explain a <1sec delta on an emulated
system where I would expect to see ~2sec deltas!

And this can't even be explained by sampling error within a tick,
because the test is failing with a delta of 0.85 which is _five_ ticks
(at 100 Hz) away.