Re: Intermittent t_timerfd failures
Taylor R Campbell wrote: > sys/lib/libc/sys/t_timerfd.c timerfd_block is intermittently failing > on the i386 test bed. Here are some random observations from grepping the logs on various testbeds. 1. The failures have been happening for as long as the timerfd test case has existed. The test case was first run with source date 2021.09.19.15.52.55, and the first failure was soon thereafter, using source date 2021.09.21.09.24.15. 2. The execution time recorded by ATF on i386 indicates that the failed test cases completed in less than a second, while the successful ones took longer. For example: 2023/2023.07.08.15.32.58/test.log.gz:timerfd_block: [1.211653s] Passed. 2023/2023.07.08.16.13.00/test.log.gz:timerfd_block: [1.540414s] Passed. 2023/2023.07.08.17.43.13/test.log.gz:timerfd_block: [1.649699s] Passed. 2023/2023.07.08.19.10.00/test.log.gz:timerfd_block: [0.864084s] Failed: /tmp/build/2023.07.08.19.10.00-i386/src/tests/lib/libc/sys/t_timerfd.c:198: then=1368.605876566 now=1369.462708087 delta=0.856831521 2023/2023.07.08.20.02.10/test.log.gz:timerfd_block: [1.203076s] Passed. 2023/2023.07.08.23.42.48/test.log.gz:timerfd_block: [1.699977s] Passed. 2023/2023.07.09.00.01.55/test.log.gz:timerfd_block: [1.231773s] Passed. 3. The TNF amd64 testbed shows failures similar to the i386 one. Both are using "qemu -accel kvm" hosted on NetBSD 9. 4. The TNF sparc testbed shows no failures in 3456 runs (using plain qemu, no acceleration). 5. The TNF macppc testbed (also plain qemu) shows frequent failures, but unlike i386 and amd64, the execution time indicated by ATF is more than a second: ./2023/2023.07.06.07.59.00/test.log.gz:timerfd_block: [1.132080s] Passed. ./2023/2023.07.07.00.20.39/test.log.gz:timerfd_block: [1.062924s] Passed. ./2023/2023.07.07.04.43.15/test.log.gz:timerfd_block: [1.029156s] Failed: /tmp/build/2023.07.07.04.43.15-macppc/src/tests/lib/libc/sys/t_timerfd.c:198: check_value_against_bounds(delta.tv_sec, 1, 1) not met ./2023/2023.07.07.08.39.23/test.log.gz:timerfd_block: [1.056571s] Passed. ./2023/2023.07.07.20.19.08/test.log.gz:timerfd_block: [1.081552s] Passed. 6. My own amd64 testbed running on real hardware shows no failures in 205 runs. 7. My own i386 testbed using "qemu -accel kvm" on Linux shows no failures in 359 runs. -- Andreas Gustafsson, g...@netbsd.org
Intermittent t_timerfd failures
sys/lib/libc/sys/t_timerfd.c timerfd_block is intermittently failing on the i386 test bed. I added some diagnostic prints to see exactly what the timing failure is, which yielded: t_timerfd.c:198: then=1368.605876566 now=1369.462708087 delta=0.856831521 https://releng.netbsd.org/b5reports/i386/2023/2023.07.08.19.10.00/test.html#lib_libc_sys_t_timerfd_timerfd_block delta is supposed to be at least 1sec, because the test does: const struct itimerspec its = { .it_value = { .tv_sec = 1, .tv_nsec = 0 }, .it_interval = { .tv_sec = 0, .tv_nsec = 0 }, }; ATF_REQUIRE(clock_gettime(CLOCK_MONOTONIC, ) == 0); ATF_REQUIRE(timerfd_settime(fd, 0, , NULL) == 0); ATF_REQUIRE(timerfd_read(fd, ) == 0); ATF_REQUIRE(clock_gettime(CLOCK_MONOTONIC, ) == 0); That is, it checks its watch, asks to be woken in >=1sec, and then checks its watch again; the test then verifies that the watch has indeed advanced by >=1sec. This is very puzzling to me -- I would expect delta to be >2x the sleep time because of the usual hz=100-guest-on-hz=100-host problem (https://gnats.netbsd.org/43997), and certainly not less than 1sec! One possible discrepancy is that clock_gettime(CLOCK_MONOTONIC) uses nanouptime, which returns results with the resolution of the timecounter; whereas timerfd_settime uses getnanouptime, which returns reuslts with the (most likely much coarser) resolution of the hardclock interrupt, hz. But I don't see how that could explain a <1sec delta on an emulated system where I would expect to see ~2sec deltas! And this can't even be explained by sampling error within a tick, because the test is failing with a delta of 0.85 which is _five_ ticks (at 100 Hz) away.