Hi,

When we compute high resolution time, both in the kernel and in libc,
we get a 32-bit (or smaller) value from the active timecounter and
scale it up into a 128-bit bintime.

The scaling math currently looks like this in the kernel:

        *bt = th->th_offset;
        bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);

It looks similar in libc.

The problem with this code is that if the product

        th->th_scale * tc_delta(th)

exceeds UINT64_MAX, the result overflows and we lose time.

This is unlikely to happen under normal circumstances.  Normally, the
clock interrupt fires about HZ times per second, so the product
doesn't get very close to overflow.  You'd need to run with interrupts
disabled on CPU0 for about 1 second to see an overflow.
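
To put a rough number on that window, assume (just for illustration) a
2 GHz TSC.  th_scale is roughly 2^64 divided by the counter frequency,
so:

        th_scale ~= 2^64 / 2,000,000,000 ~= 9,223,372,036
        th_scale * delta > 2^64  once  delta > 2^64 / th_scale
                                             ~= 2,000,000,000 ticks

which at 2 GHz is about one second's worth of counter ticks before the
product wraps.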

However, I have seen this happen before in the wild.  If you try to
build LLVM on an EdgeRouter Lite (512MB RAM), it will swap *very
slowly* to the SD card and eventually lose time.

I think it is inevitable that we will encounter bugs of this nature at
some point.  In general, I would like the timecounting layer to be
more robust to nasty situations like these.  Losing time is a
catastrophic failure.  It screws everything up.

The solution to this problem is to use the full 96-bit product when we
scale the count up into a bintime.  We're multiplying a u_int
(32-bit), the count, by a uint64_t, the scale, but we're not capturing
the upper 32 bits of that product.  If we did, we would have a longer
grace period between clock interrupts before we lost time.

The attached patch adds a TIMECOUNT_TO_BINTIME() function to sys/time.h
and puts it to use in sys/kern/kern_tc.c and lib/libc/sys/microtime.c.
The math is a bit boring; see the patch if you are curious.
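
If you don't feel like digging through the patch, the gist is the
schoolbook split of the product into two 64-bit partial products
(paraphrasing TIMECOUNT_TO_BINTIME(), where "count" is tc_delta() and
"scale" is th_scale):

        count * scale = (count * (scale >> 32)) * 2^32
                      + (count * (scale & 0xffffffff))

The upper 32 bits of the first partial product are whole seconds and
its lower 32 bits are the top half of the fraction; the second partial
product is folded in as a 64-bit fraction with bintimeaddfrac(), which
carries into the seconds.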

As for the cost, there is a small but significant increase in overhead
when reading the clock with the TSC.  Slower timecounters (HPET, ACPI
timer) are so slow that the extra overhead is lost in the noise.

I ran my usual synthetic benchmark against a patched and unpatched
libc for a number of different sample sizes.  I ran the benchmark in
single-user mode to avoid ntpd(8) interference (adjtime(2), adjfreq(2))
and set hw.setperf=100 to ensure uniform results.

It looks to me like on amd64, userspace clock_gettime(2) is up to ~10%
slower with the patch.  But there is a lot of variation between the
comparisons, so I don't think it's a consistent 10%.  I'd say 10% is
an upper bound.

To ensure the discussion is well-framed: we're talking about a 10%
overhead on a 40-80 nanosecond function call.

Obviously the overhead will vary by platform.  I imagine 32-bit
platforms will exceed 10% because we're now doing even more 64-bit
math.  I'd be interested in results from e.g. macppc or armv7.

ministat summaries are below.  My benchmark program and script are
below that.  The patch is attached below that.

Thoughts?

--

All values (min, max, med, avg, stddev) are in seconds.

Call clock_gettime(2) 10,000 times.  Do that 1000 times.  ~5.88% increase.

x clock-gettime-tsc-user-head-1K-10K.dat1
+ clock-gettime-tsc-user-patch-1K-10K.dat1
    N           Min           Max        Median           Avg        Stddev
x 1000   0.000329069   0.000362506   0.000335691 0.00033582683 3.6951597e-06
+ 1000   0.000346276   0.000381319   0.000353605 0.00035557716  3.962225e-06
Difference at 99.5% confidence
        1.97503e-05 +/- 5.29405e-07
        5.88111% +/- 0.157642%
        (Student's t, pooled s = 3.83102e-06)

Call clock_gettime(2) 100,000 times.  Do that 1000 times.  ~1.31% increase.

x clock-gettime-tsc-user-head-1K-100K.dat1
+ clock-gettime-tsc-user-patch-1K-100K.dat1
    N           Min           Max        Median           Avg        Stddev
x 1000   0.003432839   0.003679475   0.003472749  0.0034827782 3.0275757e-05
+ 1000   0.003426934   0.003776351   0.003524453  0.0035284519 3.8631925e-05
Difference at 99.5% confidence
        4.56738e-05 +/- 4.79602e-06
        1.31142% +/- 0.137707%
        (Student's t, pooled s = 3.47062e-05)

Call clock_gettime(2) 5,000 times.  Do that 10,000 times.  ~8.53% increase.

x clock-gettime-tsc-user-head-10K-5K.dat1
+ clock-gettime-tsc-user-patch-10K-5K.dat1
    N           Min           Max        Median           Avg        Stddev
x 10000   0.000159681   0.000220758   0.000186826 0.00017805393  1.177249e-05
+ 10000   0.000171105   0.000217009   0.000200789 0.00019325052 1.1280074e-05
Difference at 99.5% confidence
        1.51966e-05 +/- 5.03804e-07
        8.53482% +/- 0.28295%
        (Student's t, pooled s = 1.15289e-05)

Call clock_gettime(2) 10,000 times.  Do that 10,000 times.  ~8.74% increase.

x clock-gettime-tsc-user-head-10K-10K.dat1
+ clock-gettime-tsc-user-patch-10K-10K.dat1
    N           Min           Max        Median           Avg        Stddev
x 10000   0.000320472   0.000410915   0.000371971 0.00035789663 2.2141645e-05
+ 10000   0.000340087   0.000436915   0.000393461 0.00038918836 1.8884201e-05
Difference at 99.5% confidence
        3.12917e-05 +/- 8.99219e-07
        8.74323% +/- 0.251251%
        (Student's t, pooled s = 2.05775e-05)

Call clock_gettime(2) 15,000 times.  Do that 10,000 times.  ~2.47% increase.

x clock-gettime-tsc-user-head-10K-15K.dat1
+ clock-gettime-tsc-user-patch-10K-15K.dat1
    N           Min           Max        Median           Avg        Stddev
x 10000    0.00049023   0.000660955   0.000573344 0.00055863893 2.9447818e-05
+ 10000   0.000510269   0.000651218   0.000584569 0.00057240886 3.4920043e-05
Difference at 99.5% confidence
        1.37699e-05 +/- 1.41149e-06
        2.46491% +/- 0.252665%
        (Student's t, pooled s = 3.23e-05)

--

/*
 * clock-gettime-bench.c
 *
 * Public domain.
 */

#include <sys/time.h>

#include <err.h>
#include <limits.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
        struct timespec begin, elapsed, end, now;
        long long count, i;
        const char *errstr;

        if (argc != 2) {
                fprintf(stderr, "usage: %s count\n", getprogname());
                return 1;
        }
        count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
        if (errstr != NULL)
                errx(1, "count is %s: %s", errstr, argv[1]);

        clock_gettime(CLOCK_MONOTONIC, &begin);
        for (i = 0; i < count; i++)
                clock_gettime(CLOCK_MONOTONIC, &now);
        clock_gettime(CLOCK_MONOTONIC, &end);

        timespecsub(&end, &begin, &elapsed);
        printf("%lld\t%lld.%09ld\n",
            count, (long long)elapsed.tv_sec, elapsed.tv_nsec);

        return 0;
}

--

#! /bin/sh
#
# clock-gettime-bench.sh
#
# Public domain.
#
# Driver program for clock-gettime-bench.c that discards runs that are
# context switched.  We want to measure *our* computation time, not
# the time some other program spends running.
#
# Run as root to nice the program down to -20.
#
# In general, run from single-user mode at hw.setperf=100 to limit
# interference from other variables and ensure a fair comparison.

warn()
{
        printf "$0: $@\n" >&2
}

set -e

if [ $# -ne 2 ]; then
        printf "usage: $0 nrounds nsamples\n" >&2
        exit 1
fi

nrounds=$1
nsamples=$2

i=0
while [ $i -lt $nrounds ]; do
        result=$(/usr/bin/time -l nice -n -20 ./clock-gettime-bench $nsamples 2>&1)
        if echo "$result" | tail -n 1 | egrep -q '^( ){9}0  involuntary context switches$'; then
                echo "$result" | head -n 1
                i=$((i + 1))
        else
                warn "context switched: retrying round $i"
        fi
done

exit 0

--

Index: sys/sys/time.h
===================================================================
RCS file: /cvs/src/sys/sys/time.h,v
retrieving revision 1.61
diff -u -p -r1.61 time.h
--- sys/sys/time.h      19 Jun 2021 13:49:39 -0000      1.61
+++ sys/sys/time.h      14 Oct 2021 20:58:38 -0000
@@ -222,6 +222,17 @@ bintimesub(const struct bintime *bt, con
  *   time_second ticks after N.999999999 not after N.4999999999
  */
 
+static inline void
+TIMECOUNT_TO_BINTIME(u_int count, uint64_t scale, struct bintime *bt)
+{
+       uint64_t hi64;
+
+       hi64 = count * (scale >> 32);
+       bt->sec = hi64 >> 32;
+       bt->frac = hi64 << 32;
+       bintimeaddfrac(bt, count * (scale & 0xffffffff), bt);
+}
+
 static inline uint32_t
 FRAC_TO_NSEC(uint64_t frac)
 {
Index: sys/kern/kern_tc.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_tc.c,v
retrieving revision 1.74
diff -u -p -r1.74 kern_tc.c
--- sys/kern/kern_tc.c  19 Jun 2021 13:49:39 -0000      1.74
+++ sys/kern/kern_tc.c  14 Oct 2021 20:58:38 -0000
@@ -189,8 +189,8 @@ binuptime(struct bintime *bt)
                th = timehands;
                gen = th->th_generation;
                membar_consumer();
-               *bt = th->th_offset;
-               bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
+               TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+               bintimeadd(bt, &th->th_offset, bt);
                membar_consumer();
        } while (gen == 0 || gen != th->th_generation);
 }
@@ -278,7 +278,8 @@ binruntime(struct bintime *bt)
                th = timehands;
                gen = th->th_generation;
                membar_consumer();
-               bintimeaddfrac(&th->th_offset, th->th_scale * tc_delta(th), bt);
+               TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+               bintimeadd(bt, &th->th_offset, bt);
                bintimesub(bt, &th->th_naptime, bt);
                membar_consumer();
        } while (gen == 0 || gen != th->th_generation);
@@ -303,8 +304,8 @@ bintime(struct bintime *bt)
                th = timehands;
                gen = th->th_generation;
                membar_consumer();
-               *bt = th->th_offset;
-               bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
+               TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+               bintimeadd(bt, &th->th_offset, bt);
                bintimeadd(bt, &th->th_boottime, bt);
                membar_consumer();
        } while (gen == 0 || gen != th->th_generation);
@@ -641,7 +642,8 @@ tc_windup(struct bintime *new_boottime, 
                ncount = 0;
        th->th_offset_count += delta;
        th->th_offset_count &= th->th_counter->tc_counter_mask;
-       bintimeaddfrac(&th->th_offset, th->th_scale * delta, &th->th_offset);
+       TIMECOUNT_TO_BINTIME(delta, th->th_scale, &bt);
+       bintimeadd(&th->th_offset, &bt, &th->th_offset);
 
        /*
         * Ignore new offsets that predate the current offset.
Index: lib/libc/sys/microtime.c
===================================================================
RCS file: /cvs/src/lib/libc/sys/microtime.c,v
retrieving revision 1.1
diff -u -p -r1.1 microtime.c
--- lib/libc/sys/microtime.c    6 Jul 2020 13:33:06 -0000       1.1
+++ lib/libc/sys/microtime.c    14 Oct 2021 20:58:38 -0000
@@ -45,10 +45,10 @@ binuptime(struct bintime *bt, struct tim
        do {
                gen = tk->tk_generation;
                membar_consumer();
-               *bt = tk->tk_offset;
                if (tc_delta(tk, &delta))
                        return -1;
-               bintimeaddfrac(bt, tk->tk_scale * delta, bt);
+               TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+               bintimeadd(bt, &tk->tk_offset, bt);
                membar_consumer();
        } while (gen == 0 || gen != tk->tk_generation);
 
@@ -65,7 +65,8 @@ binruntime(struct bintime *bt, struct ti
                membar_consumer();
                if (tc_delta(tk, &delta))
                        return -1;
-               bintimeaddfrac(&tk->tk_offset, tk->tk_scale * delta, bt);
+               TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+               bintimeadd(bt, &tk->tk_offset, bt);
                bintimesub(bt, &tk->tk_naptime, bt);
                membar_consumer();
        } while (gen == 0 || gen != tk->tk_generation);
@@ -81,10 +82,10 @@ bintime(struct bintime *bt, struct timek
        do {
                gen = tk->tk_generation;
                membar_consumer();
-               *bt = tk->tk_offset;
                if (tc_delta(tk, &delta))
                        return -1;
-               bintimeaddfrac(bt, tk->tk_scale * delta, bt);
+               TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+               bintimeadd(bt, &tk->tk_offset, bt);
                bintimeadd(bt, &tk->tk_boottime, bt);
                membar_consumer();
        } while (gen == 0 || gen != tk->tk_generation);
