Hi,

When we compute high resolution time, both in the kernel and in libc, we
get a 32-bit (or smaller) value from the active timecounter and scale it
up into a 128-bit bintime.

The scaling math currently looks like this in the kernel:

	*bt = th->th_offset;
	bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);

It looks similar in libc.

The problem with this code is that if the product
th->th_scale * tc_delta(th) exceeds UINT64_MAX, the result overflows
and we lose time.

This is unlikely to happen under normal circumstances.  Normally the
clock interrupt fires about HZ times per second, so the product doesn't
get very close to overflow.  You'd need to run with interrupts disabled
on CPU0 for about 1 second to see an overflow.

However, I have seen this happen in the wild.  If you try to build LLVM
on an EdgeRouter Lite (512MB RAM), it will swap *very slowly* to the SD
card and eventually lose time.  I think it is inevitable that we will
encounter bugs of this nature at some point.  In general, I would like
the timecounting layer to be more robust to nasty situations like
these.  Losing time is a catastrophic failure: it screws everything up.

The solution to this problem is to use the full 96-bit product when we
scale the count up into a bintime.  We're multiplying a u_int (32-bit),
the count, by a uint64_t, the scale, but we're not capturing the upper
32 bits of that product.  If we did, we would have a longer grace
period between clock interrupts before we lost time.

The attached patch adds a TIMECOUNT_TO_BINTIME() function to sys/time.h
and puts it to use in sys/kern/kern_tc.c and lib/libc/sys/microtime.c.
The math is a bit boring; see the patch if you are curious.

As for the cost, there is a small but significant increase in overhead
when reading the clock with the TSC.  Slower timecounters (HPET, ACPI
timer) are so slow that the extra overhead is noise.

I ran my usual synthetic benchmark against a patched and unpatched libc
for a number of different sample sizes.  I ran the benchmark in
single-user mode to avoid ntpd(8) interference (adjtime(2), adjfreq(2))
and set hw.setperf=100 to ensure uniform results.
It looks to me like on amd64, userspace clock_gettime(2) is up to ~10%
slower with the patch.  But there is a lot of variation between the
comparisons, so I don't think it's a consistent 10%.  I'd say 10% is an
upper bound.

To keep the discussion well-framed: we're talking about a 10% overhead
on a 40-80 nanosecond function call.

Obviously the overhead will vary by platform.  I imagine 32-bit
platforms will exceed 10% because we're now doing even more 64-bit
math.  I'd be interested in results from e.g. macppc or armv7.

ministat summaries are below.  My benchmark program and script are
below that.  The patch is attached below that.

Thoughts?

--

All values (min, max, med, avg, stddev) are in seconds.

Call clock_gettime(2) 10,000 times.  Do that 1000 times.
~5.88% increase.

x clock-gettime-tsc-user-head-1K-10K.dat1
+ clock-gettime-tsc-user-patch-1K-10K.dat1
    N           Min           Max        Median           Avg        Stddev
x 1000   0.000329069   0.000362506   0.000335691 0.00033582683 3.6951597e-06
+ 1000   0.000346276   0.000381319   0.000353605 0.00035557716  3.962225e-06
Difference at 99.5% confidence
	1.97503e-05 +/- 5.29405e-07
	5.88111% +/- 0.157642%
	(Student's t, pooled s = 3.83102e-06)

Call clock_gettime(2) 100,000 times.  Do that 1000 times.
~1.31% increase.

x clock-gettime-tsc-user-head-1K-100K.dat1
+ clock-gettime-tsc-user-patch-1K-100K.dat1
    N           Min           Max        Median           Avg        Stddev
x 1000   0.003432839   0.003679475   0.003472749  0.0034827782 3.0275757e-05
+ 1000   0.003426934   0.003776351   0.003524453  0.0035284519 3.8631925e-05
Difference at 99.5% confidence
	4.56738e-05 +/- 4.79602e-06
	1.31142% +/- 0.137707%
	(Student's t, pooled s = 3.47062e-05)

Call clock_gettime(2) 5,000 times.  Do that 10,000 times.
~8.53% increase.
x clock-gettime-tsc-user-head-10K-5K.dat1
+ clock-gettime-tsc-user-patch-10K-5K.dat1
    N           Min           Max        Median           Avg        Stddev
x 10000   0.000159681   0.000220758   0.000186826 0.00017805393  1.177249e-05
+ 10000   0.000171105   0.000217009   0.000200789 0.00019325052 1.1280074e-05
Difference at 99.5% confidence
	1.51966e-05 +/- 5.03804e-07
	8.53482% +/- 0.28295%
	(Student's t, pooled s = 1.15289e-05)

Call clock_gettime(2) 10,000 times.  Do that 10,000 times.
~8.74% increase.

x clock-gettime-tsc-user-head-10K-10K.dat1
+ clock-gettime-tsc-user-patch-10K-10K.dat1
    N           Min           Max        Median           Avg        Stddev
x 10000   0.000320472   0.000410915   0.000371971 0.00035789663 2.2141645e-05
+ 10000   0.000340087   0.000436915   0.000393461 0.00038918836 1.8884201e-05
Difference at 99.5% confidence
	3.12917e-05 +/- 8.99219e-07
	8.74323% +/- 0.251251%
	(Student's t, pooled s = 2.05775e-05)

Call clock_gettime(2) 15,000 times.  Do that 10,000 times.
~2.47% increase.

x clock-gettime-tsc-user-head-10K-15K.dat1
+ clock-gettime-tsc-user-patch-10K-15K.dat1
    N           Min           Max        Median           Avg        Stddev
x 10000    0.00049023   0.000660955   0.000573344 0.00055863893 2.9447818e-05
+ 10000   0.000510269   0.000651218   0.000584569 0.00057240886 3.4920043e-05
Difference at 99.5% confidence
	1.37699e-05 +/- 1.41149e-06
	2.46491% +/- 0.252665%
	(Student's t, pooled s = 3.23e-05)

--

/*
 * clock-gettime-bench.c
 *
 * Public domain.
 */
#include <sys/time.h>

#include <err.h>
#include <limits.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	struct timespec begin, elapsed, end, now;
	long long count, i;
	const char *errstr;

	if (argc != 2) {
		fprintf(stderr, "usage: %s count\n", getprogname());
		return 1;
	}
	count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
	if (errstr != NULL)
		errx(1, "count is %s: %s", errstr, argv[1]);

	clock_gettime(CLOCK_MONOTONIC, &begin);
	for (i = 0; i < count; i++)
		clock_gettime(CLOCK_MONOTONIC, &now);
	clock_gettime(CLOCK_MONOTONIC, &end);

	timespecsub(&end, &begin, &elapsed);
	printf("%lld\t%lld.%09ld\n",
	    count, (long long)elapsed.tv_sec, elapsed.tv_nsec);

	return 0;
}

--

#! /bin/sh
#
# clock-gettime-bench.sh
#
# Public domain.
#
# Driver program for clock-gettime-bench.c that discards runs that are
# context switched.  We want to measure *our* computation time, not
# the time some other program spends running.
#
# Run as root to nice the program down to -20.
#
# In general, run from single-user mode at hw.setperf=100 to limit
# interference from other variables and ensure a fair comparison.
warn() {
	printf "$0: $@\n" >&2
}

set -e

if [ $# -ne 2 ]; then
	printf "usage: $0 nrounds nsamples\n" >&2
	exit 1
fi

nrounds=$1
nsamples=$2

i=0
while [ $i -lt $nrounds ]; do
	result=$(/usr/bin/time -l nice -n -20 ./clock-gettime-bench $nsamples 2>&1)
	if echo "$result" | tail -n 1 | \
	    egrep -q '^( ){9}0 involuntary context switches$'; then
		echo "$result" | head -n 1
		i=$((i + 1))
	else
		warn "context switched: retrying round $i"
	fi
done

exit 0

--

Index: sys/sys/time.h
===================================================================
RCS file: /cvs/src/sys/sys/time.h,v
retrieving revision 1.61
diff -u -p -r1.61 time.h
--- sys/sys/time.h	19 Jun 2021 13:49:39 -0000	1.61
+++ sys/sys/time.h	14 Oct 2021 20:58:38 -0000
@@ -222,6 +222,17 @@ bintimesub(const struct bintime *bt, con
  * time_second ticks after N.999999999 not after N.4999999999
  */
 
+static inline void
+TIMECOUNT_TO_BINTIME(u_int count, uint64_t scale, struct bintime *bt)
+{
+	uint64_t hi64;
+
+	hi64 = count * (scale >> 32);
+	bt->sec = hi64 >> 32;
+	bt->frac = hi64 << 32;
+	bintimeaddfrac(bt, count * (scale & 0xffffffff), bt);
+}
+
 static inline uint32_t
 FRAC_TO_NSEC(uint64_t frac)
 {
Index: sys/kern/kern_tc.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_tc.c,v
retrieving revision 1.74
diff -u -p -r1.74 kern_tc.c
--- sys/kern/kern_tc.c	19 Jun 2021 13:49:39 -0000	1.74
+++ sys/kern/kern_tc.c	14 Oct 2021 20:58:38 -0000
@@ -189,8 +189,8 @@ binuptime(struct bintime *bt)
 		th = timehands;
 		gen = th->th_generation;
 		membar_consumer();
-		*bt = th->th_offset;
-		bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
+		TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+		bintimeadd(bt, &th->th_offset, bt);
 		membar_consumer();
 	} while (gen == 0 || gen != th->th_generation);
 }
@@ -278,7 +278,8 @@ binruntime(struct bintime *bt)
 		th = timehands;
 		gen = th->th_generation;
 		membar_consumer();
-		bintimeaddfrac(&th->th_offset, th->th_scale * tc_delta(th), bt);
+		TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+		bintimeadd(bt, &th->th_offset, bt);
 		bintimesub(bt, &th->th_naptime, bt);
 		membar_consumer();
 	} while (gen == 0 || gen != th->th_generation);
@@ -303,8 +304,8 @@ bintime(struct bintime *bt)
 		th = timehands;
 		gen = th->th_generation;
 		membar_consumer();
-		*bt = th->th_offset;
-		bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
+		TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+		bintimeadd(bt, &th->th_offset, bt);
 		bintimeadd(bt, &th->th_boottime, bt);
 		membar_consumer();
 	} while (gen == 0 || gen != th->th_generation);
@@ -641,7 +642,8 @@ tc_windup(struct bintime *new_boottime,
 		ncount = 0;
 	th->th_offset_count += delta;
 	th->th_offset_count &= th->th_counter->tc_counter_mask;
-	bintimeaddfrac(&th->th_offset, th->th_scale * delta, &th->th_offset);
+	TIMECOUNT_TO_BINTIME(delta, th->th_scale, &bt);
+	bintimeadd(&th->th_offset, &bt, &th->th_offset);
 
 	/*
 	 * Ignore new offsets that predate the current offset.
Index: lib/libc/sys/microtime.c
===================================================================
RCS file: /cvs/src/lib/libc/sys/microtime.c,v
retrieving revision 1.1
diff -u -p -r1.1 microtime.c
--- lib/libc/sys/microtime.c	6 Jul 2020 13:33:06 -0000	1.1
+++ lib/libc/sys/microtime.c	14 Oct 2021 20:58:38 -0000
@@ -45,10 +45,10 @@ binuptime(struct bintime *bt, struct tim
 	do {
 		gen = tk->tk_generation;
 		membar_consumer();
-		*bt = tk->tk_offset;
 		if (tc_delta(tk, &delta))
 			return -1;
-		bintimeaddfrac(bt, tk->tk_scale * delta, bt);
+		TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+		bintimeadd(bt, &tk->tk_offset, bt);
 		membar_consumer();
 	} while (gen == 0 || gen != tk->tk_generation);
@@ -65,7 +65,8 @@ binruntime(struct bintime *bt, struct ti
 		membar_consumer();
 		if (tc_delta(tk, &delta))
 			return -1;
-		bintimeaddfrac(&tk->tk_offset, tk->tk_scale * delta, bt);
+		TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+		bintimeadd(bt, &tk->tk_offset, bt);
 		bintimesub(bt, &tk->tk_naptime, bt);
 		membar_consumer();
 	} while (gen == 0 || gen != tk->tk_generation);
@@ -81,10 +82,10 @@ bintime(struct bintime *bt, struct timek
 	do {
 		gen = tk->tk_generation;
 		membar_consumer();
-		*bt = tk->tk_offset;
 		if (tc_delta(tk, &delta))
 			return -1;
-		bintimeaddfrac(bt, tk->tk_scale * delta, bt);
+		TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+		bintimeadd(bt, &tk->tk_offset, bt);
 		bintimeadd(bt, &tk->tk_boottime, bt);
 		membar_consumer();
 	} while (gen == 0 || gen != tk->tk_generation);