Hi, I'm looking for testers for the attached patch. You need an amd64 machine with a lapic.
This includes:

- All "real" amd64 machines ever made
- amd64 VMs running on hypervisors that provide a virtual lapic

Note that this does *not* include:

- amd64 VMs running on OpenBSD's vmm(4).  (I will ask for a separate
  round of testing for vmm(4) VMs, don't worry.)

The patch adds a new machine-independent clock interrupt scheduling
layer (hereafter, "clockintr") to the kernel in kern/kern_clockintr.c,
configures GENERIC amd64 kernels to use clockintr, and changes
amd64/lapic.c to use clockintr instead of calling hardclock(9)
directly.

Please apply the patch and make sure to reconfigure your kernel
before recompiling/installing it to test.

I am especially interested in whether this breaks suspend/resume or
hibernate/unhibernate.  Suspend/resume is unaffected on my Lenovo
X1C7; is the same true for your machine?  Please include a dmesg with
your results.

Stats for the clockintr subsystem are exposed via sysctl(2).  If you
are interested in providing them, you can compile and run the program
attached inline in my next mail.  A snippet of the output from across
a suspend/resume is especially useful.

This is the end of the mail if you just want to test this.  If you
are interested in the possible behavior changes or a description of
how clockintr works, keep reading.

Thanks,

-Scott

--

There are some behavior changes, but I have found them to be small,
harmless, and/or useful.  The first one is the most significant:

- Clockintr schedules events against the system clock, so hardclock(9)
  ticks are pegged to the system clock and the length of a tick is now
  subject to NTP adjustment via adjtime(2) and adjfreq(2).

  In practice, NTP adjustment is very conservative.  In my testing the
  delta between the raw frequency and the NTP frequency is small when
  ntpd(8) is doing coarse correction with adjtime(2) and invisible when
  ntpd(8) is doing fine correction with adjfreq(2).

  The upshot of the frequency difference is that you will sometimes get
  spurious ("early") interrupts while ntpd(8) is correcting the clock.
  They go away when ntpd(8) finishes synchronizing.

  FWIW: Linux, FreeBSD, and DragonflyBSD have all made this jump.

- hardclock(9) will run simultaneously on every CPU in the system.
  This seems to be fine, but there might be some subtle contention that
  gets worse as you add more CPUs.  Worth investigating.

- amd64 gets a pseudorandom statclock().  This is desirable, right?

- "Lost" or delayed ticks are handled by the clockintr layer
  transparently.  This means that if the clock interrupt is delayed
  due to e.g. hypervisor delays, we don't "lose" ticks and the timeout
  schedule does not decay.  This is super relevant for vmm(4), but it
  may also be relevant for other hypervisors.  (The catch-up arithmetic
  is sketched just after this list.)
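To make the catch-up behavior concrete, here is a small standalone
illustration of the arithmetic the dispatch loop uses.  It mirrors the
nsec_advance() helper in the patch at the end of this mail; the example
numbers in the trailing comment are mine and are only for illustration.

#include <sys/stdint.h>

/*
 * Advance *next past "now" in steps of "period" nanoseconds and
 * return how many periods elapsed.  The dispatch loop runs
 * hardclock(9) once per elapsed period, so a delayed interrupt
 * still produces the right number of ticks.
 */
uint64_t
nsec_advance(uint64_t *next, uint64_t period, uint64_t now)
{
	uint64_t elapsed;

	if (now < *next)
		return 0;

	if (now < *next + period) {
		*next += period;
		return 1;
	}

	elapsed = (now - *next) / period + 1;
	*next += period * elapsed;
	return elapsed;
}

/*
 * Example: with hz=100 the hardclock period is 10000000ns.  If *next
 * is 100000000ns and the interrupt only arrives at now = 137000000ns,
 * nsec_advance() returns 4 and advances *next to 140000000ns, so four
 * hardclock(9) calls are run and no ticks are lost.
 */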
--

Last, here are notes for people interested in the design or the actual
code.  Ask questions if something about my approach seems off; I have
never added a subsystem to the kernel before.  The code has not changed
much in the last six months, so I think I am nearing a stable design.
I will document these interfaces in a real manpage soon.

- Clockintr prototypes are declared in <dev/clockintr.h>.

- Clockintr depends on the timecounter and the system clock to do its
  scheduling.  If there is no working timecounter the machine will
  hang, as multitasking preemption will cease.

- Global clockintr initialization is done via clockintr_init().  You
  call this from cpu_initclocks() on the primary CPU *after* you
  install a timecounter.

  The function sets a global frequency for the hardclock(9) (required),
  a global frequency for the statclock() (or zero if you don't want a
  statclock), and sets global behavior flags.  There is only one flag
  right now, CI_RNDSTAT, which toggles whether the statclock() has a
  pseudorandom period.

  If the platform has a one-shot clock (e.g. amd64, arm64, etc.) it
  makes sense to set CI_RNDSTAT.  If the platform does not have a
  one-shot clock (e.g. alpha) there is no point in setting CI_RNDSTAT,
  as the hardware cannot provide the feature.

- Per-CPU clockintr initialization is done via clockintr_cpu_init().
  On the primary CPU, call this immediately *after* you call
  clockintr_init().  Secondary CPUs should call this late in
  cpu_hatch(), probably right before cpu_switchto(9).  The function
  allocates memory for the local CPU's schedule and "installs" the
  local interrupt clock, if any.

  If the platform cannot provide a local interrupt clock with a
  one-shot mode, you just pass a NULL pointer to clockintr_cpu_init().
  The clockintr layer on that CPU will then run in "dummy" mode: the
  platform is left responsible for scheduling clock interrupt delivery
  without input from the clockintr code, and it should deliver a clock
  interrupt hz(9) times per second to keep things running smoothly.
  This mode works, but it is not very accurate.

  If the platform can provide a local one-shot clock, you pass a
  pointer to a "struct intrclock" that describes said clock.  Currently
  intrclocks are immutable and stateless, so every CPU on a system can
  use the same struct (on many platforms each CPU has an identical copy
  of a particular clock).  If we have to add locking or state to the
  intrclock struct we will need to allocate per-CPU intrclock structs,
  but as of yet I haven't needed it.

  The intrclock struct currently has one member, "ic_rearm", a function
  pointer taking a count of nanoseconds.  ic_rearm() should rearm the
  calling CPU's clock to deliver a local clock interrupt after the
  given number of nanoseconds have elapsed.  All platform-specific
  details are hidden from the clockintr layer: the clockintr layer just
  passes a count of nanoseconds to the MD code, that's it.

- Clock interrupt events are run from clockintr_dispatch().  The
  platform needs to call this function at IPL_CLOCK from the ISR
  whenever the interrupt clock fires.  The dispatch function runs all
  expired events (hardclock + statclock) and, if the local CPU has an
  intrclock, schedules the next clock interrupt before returning.
  Events are run without any locks or mutexes.

--

Here's the patch.  A rough end-to-end sketch of the MD glue comes
first; it is illustration only and is not part of the diff.
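To make the interfaces above concrete, here is a rough, untested sketch
of what the MD glue might look like for an imaginary platform "foo"
with a per-CPU one-shot timer.  Everything prefixed "foo_" and
FOO_TIMER_FREQ is invented for illustration; only the clockintr_*()
calls, struct intrclock, and CI_RNDSTAT come from the patch.  The shape
follows the amd64 lapic changes in the diff.

#include <sys/param.h>
#include <sys/kernel.h>			/* hz, stathz, profhz */
#include <sys/stdint.h>

#include <dev/clockintr.h>

#define FOO_TIMER_FREQ	25000000	/* hypothetical: 25MHz timer */

void foo_timer_start_oneshot(uint32_t);	/* hypothetical MD helper */

uint64_t foo_nsec_cycle_ratio;	/* timer cycles per nsec, 32.32 fixed point */
uint64_t foo_nsec_max;		/* largest nsec count we can convert */

/* Rearm the local CPU's timer to fire after "nsecs" nanoseconds. */
void
foo_timer_rearm(uint64_t nsecs)
{
	uint32_t cycles;

	nsecs = MIN(foo_nsec_max, nsecs);
	cycles = (nsecs * foo_nsec_cycle_ratio) >> 32;
	foo_timer_start_oneshot(MAX(1, cycles));
}

const struct intrclock foo_intrclock = {
	.ic_rearm = foo_timer_rearm,
};

/* Run on every CPU: primary after clockintr_init(), APs late in cpu_hatch(). */
void
foo_cpu_startclock(void)
{
	clockintr_cpu_init(&foo_intrclock, 0);
	foo_timer_rearm(0);		/* request the first interrupt ASAP */
}

/* Run once, from the primary CPU's cpu_initclocks(), after the
 * timecounter is installed. */
void
foo_initclocks(void)
{
	stathz = hz;
	profhz = stathz;

	foo_nsec_cycle_ratio = FOO_TIMER_FREQ * (1ULL << 32) / 1000000000;
	foo_nsec_max = UINT64_MAX / foo_nsec_cycle_ratio;

	clockintr_init(hz, stathz, CI_RNDSTAT);
	foo_cpu_startclock();
}

/* The timer ISR, running at IPL_CLOCK. */
int
foo_timer_intr(void *arg, struct clockframe *frame)
{
	/* Runs expired events, then rearms via foo_intrclock.ic_rearm. */
	return clockintr_dispatch(frame);
}

On a platform with no usable one-shot timer you would instead pass NULL
to clockintr_cpu_init() and keep delivering a periodic interrupt hz(9)
times per second yourself, as described above.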
Index: sys/arch/amd64/amd64/lapic.c =================================================================== RCS file: /cvs/src/sys/arch/amd64/amd64/lapic.c,v retrieving revision 1.58 diff -u -p -r1.58 lapic.c --- sys/arch/amd64/amd64/lapic.c 11 Jun 2021 05:33:16 -0000 1.58 +++ sys/arch/amd64/amd64/lapic.c 25 Jun 2021 02:48:45 -0000 @@ -49,6 +49,7 @@ #include <machine/i82489reg.h> #include <machine/i82489var.h> +#include <dev/clockintr.h> #include <dev/ic/i8253reg.h> #include "ioapic.h" @@ -72,7 +73,6 @@ struct evcount clk_count; struct evcount ipi_count; #endif -void lapic_delay(int); static u_int32_t lapic_gettick(void); void lapic_clockintr(void *, struct intrframe); void lapic_initclocks(void); @@ -402,18 +402,23 @@ lapic_gettick(void) #include <sys/kernel.h> /* for hz */ -u_int32_t lapic_tval; +void lapic_timer_oneshot(uint32_t, uint32_t); +void lapic_timer_periodic(uint32_t, uint32_t); + +void lapic_timer_rearm(uint64_t); + +const struct intrclock lapic_intrclock = { + .ic_rearm = lapic_timer_rearm, +}; /* * this gets us up to a 4GHz busclock.... */ u_int32_t lapic_per_second = 0; -u_int32_t lapic_frac_usec_per_cycle; -u_int64_t lapic_frac_cycle_per_usec; -u_int32_t lapic_delaytab[26]; -void lapic_timer_oneshot(uint32_t, uint32_t); -void lapic_timer_periodic(uint32_t, uint32_t); +uint64_t lapic_nsec_cycle_ratio; +uint64_t lapic_nsec_max; +uint32_t lapic_cycle_min = 1; /* * Start the local apic countdown timer. @@ -443,6 +448,17 @@ lapic_timer_periodic(uint32_t mask, uint } void +lapic_timer_rearm(uint64_t nsecs) +{ + uint32_t cycles; + + nsecs = MIN(lapic_nsec_max, nsecs); + cycles = (nsecs * lapic_nsec_cycle_ratio) >> 32; + cycles = MAX(lapic_cycle_min, cycles); + lapic_timer_oneshot(0, cycles); +} + +void lapic_clockintr(void *arg, struct intrframe frame) { struct cpu_info *ci = curcpu(); @@ -450,7 +466,7 @@ lapic_clockintr(void *arg, struct intrfr floor = ci->ci_handled_intr_level; ci->ci_handled_intr_level = ci->ci_ilevel; - hardclock((struct clockframe *)&frame); + clockintr_dispatch((struct clockframe *)&frame); ci->ci_handled_intr_level = floor; clk_count.ec_count++; @@ -459,17 +475,23 @@ lapic_clockintr(void *arg, struct intrfr void lapic_startclock(void) { - lapic_timer_periodic(0, lapic_tval); + clockintr_cpu_init(&lapic_intrclock, 0); + lapic_timer_rearm(0); } void lapic_initclocks(void) { - lapic_startclock(); + KASSERT(lapic_per_second > 0); i8254_inittimecounter_simple(); -} + stathz = hz; + profhz = stathz; + clockintr_init(hz, stathz, CI_RNDSTAT); + + lapic_startclock(); +} extern int gettick(void); /* XXX put in header file */ extern u_long rtclock_tval; /* XXX put in header file */ @@ -488,8 +510,6 @@ wait_next_cycle(void) } } -extern void tsc_delay(int); - /* * Calibrate the local apic count-down timer (which is running at * bus-clock speed) vs. the i8254 counter/timer (which is running at @@ -551,75 +571,16 @@ skip_calibration: printf("%s: apic clock running at %dMHz\n", ci->ci_dev->dv_xname, lapic_per_second / (1000 * 1000)); - if (lapic_per_second != 0) { - /* - * reprogram the apic timer to run in periodic mode. - * XXX need to program timer on other cpu's, too. - */ - lapic_tval = (lapic_per_second * 2) / hz; - lapic_tval = (lapic_tval / 2) + (lapic_tval & 0x1); - - lapic_timer_periodic(LAPIC_LVTT_M, lapic_tval); - - /* - * Compute fixed-point ratios between cycles and - * microseconds to avoid having to do any division - * in lapic_delay. 
- */ - - tmp = (1000000 * (u_int64_t)1 << 32) / lapic_per_second; - lapic_frac_usec_per_cycle = tmp; - - tmp = (lapic_per_second * (u_int64_t)1 << 32) / 1000000; - - lapic_frac_cycle_per_usec = tmp; - - /* - * Compute delay in cycles for likely short delays in usec. - */ - for (i = 0; i < 26; i++) - lapic_delaytab[i] = (lapic_frac_cycle_per_usec * i) >> - 32; - - /* - * Now that the timer's calibrated, use the apic timer routines - * for all our timing needs.. - */ - if (delay_func != tsc_delay) - delay_func = lapic_delay; - initclock_func = lapic_initclocks; - } -} - -/* - * delay for N usec. - */ - -void -lapic_delay(int usec) -{ - int32_t tick, otick; - int64_t deltat; /* XXX may want to be 64bit */ - - otick = lapic_gettick(); - - if (usec <= 0) + /* + * XXX What happens if the lapic timer frequency is zero at this + * point? Should we panic? + */ + if (lapic_per_second == 0) return; - if (usec <= 25) - deltat = lapic_delaytab[usec]; - else - deltat = (lapic_frac_cycle_per_usec * usec) >> 32; - - while (deltat > 0) { - tick = lapic_gettick(); - if (tick > otick) - deltat -= lapic_tval - (tick - otick); - else - deltat -= otick - tick; - otick = tick; - CPU_BUSY_CYCLE(); - } + lapic_nsec_cycle_ratio = lapic_per_second * (1ULL << 32) / 1000000000; + lapic_nsec_max = UINT64_MAX / lapic_nsec_cycle_ratio; + initclock_func = lapic_initclocks; } /* Index: sys/arch/amd64/amd64/cpu.c =================================================================== RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v retrieving revision 1.153 diff -u -p -r1.153 cpu.c --- sys/arch/amd64/amd64/cpu.c 11 Mar 2021 11:16:55 -0000 1.153 +++ sys/arch/amd64/amd64/cpu.c 25 Jun 2021 02:48:46 -0000 @@ -939,7 +939,7 @@ cpu_hatch(void *v) tsc_sync_ap(ci); lapic_enable(); - lapic_startclock(); + cpu_ucode_apply(ci); cpu_tsx_disable(ci); @@ -995,6 +995,8 @@ cpu_hatch(void *v) nanouptime(&ci->ci_schedstate.spc_runtime); splx(s); + + lapic_startclock(); SCHED_LOCK(s); cpu_switchto(NULL, sched_chooseproc()); Index: sys/kern/kern_clockintr.c =================================================================== RCS file: sys/kern/kern_clockintr.c diff -N sys/kern/kern_clockintr.c --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ sys/kern/kern_clockintr.c 25 Jun 2021 02:48:46 -0000 @@ -0,0 +1,434 @@ +/* $OpenBSD$ */ + +/* + * Copyright (c) 2003 Dale Rahn <dr...@openbsd.org> + * Copyright (c) 2020 Mark Kettenis <kette...@openbsd.org> + * Copyright (c) 2020-2021 Scott Cheloha <chel...@openbsd.org> + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 
+ */ + +#include <sys/param.h> +#include <sys/atomic.h> +#include <sys/kernel.h> +#include <sys/malloc.h> +#include <sys/mutex.h> +#include <sys/stdint.h> +#include <sys/sysctl.h> +#include <sys/systm.h> +#include <sys/time.h> + +#include <dev/clockintr.h> + +#include <machine/intr.h> + +/* + * Locks used in this file: + * + * C global clockintr configuration mutex (clockintr_mtx) + * I Immutable after initialization + * p Only modified by local CPU + */ +struct mutex clockintr_mtx = MUTEX_INITIALIZER(IPL_CLOCK); + +/* + * Per-CPU clockintr state. + */ +struct clockintr_queue { + uint64_t cq_next; /* [p] next event expiration */ + uint64_t cq_next_hardclock; /* [p] next hardclock expiration */ + uint64_t cq_next_statclock; /* [p] next statclock expiration */ + struct intrclock cq_intrclock; /* [I] local interrupt clock */ + struct clockintr_stat cq_stat; /* [p] dispatch statistics */ + volatile u_int cq_stat_gen; /* [p] cq_stat update generation */ + u_int cq_flags; /* [I] local state + behavior flags */ +} *clockintr_cpu_queue[MAXCPUS]; + +u_int clockintr_flags; /* [I] global state + behavior flags */ +uint32_t hardclock_period; /* [I] hardclock period (ns) */ +volatile u_int statgen = 1; /* [C] stat update generation */ +uint32_t statavg; /* [C] average statclock period (ns) */ +uint32_t statmin; /* [C] minimum statclock period (ns) */ +uint32_t statvar; /* [C] max statmin offset (ns) */ + +uint64_t nsec_advance(uint64_t *, uint64_t, uint64_t); +uint64_t nsecruntime(void); + +/* + * Initialize global clockintr state. Must be called only once. + */ +void +clockintr_init(int hardfreq, int statfreq, u_int flags) +{ + KASSERT(clockintr_flags == 0); + KASSERT(hardfreq > 0 && hardfreq <= 1000000000); + KASSERT(statfreq >= 0 && statfreq <= 1000000000); + KASSERT((flags & ~CI_FLAG_MASK) == 0); + + hardclock_period = 1000000000 / hardfreq; + if (statfreq != 0) { + SET(clockintr_flags, CI_WANTSTAT); + clockintr_reset_statclock_frequency(statfreq); + } else + KASSERT(!ISSET(flags, CI_RNDSTAT)); + SET(clockintr_flags, flags | CI_INIT); +} + +/* + * Allocate and initialize the local CPU's state for use in + * clockintr_dispatch(). + */ +void +clockintr_cpu_init(const struct intrclock *ic, u_int flags) +{ + struct clockintr_queue *cq; + int cpu; + + cpu = cpu_number(); + + KASSERT((flags & ~CICPU_FLAG_MASK) == 0); + + if (!ISSET(clockintr_flags, CI_INIT)) { + panic("%s: cpu%d: called before clockintr_init()", + __func__, cpu); + } + + /* + * It is not an error if we're called multiple times for a + * given CPU. Just make sure the intrclock didn't change. + * + * XXX Is M_DEVBUF appropriate? This isn't really a "driver". + */ + cq = clockintr_cpu_queue[cpu]; + if (cq == NULL) { + cq = malloc(sizeof(*cq), M_DEVBUF, M_NOWAIT | M_ZERO); + if (ic != NULL) { + cq->cq_intrclock = *ic; + SET(cq->cq_flags, CICPU_HAVE_INTRCLOCK); + } + cq->cq_stat_gen = 1; + SET(cq->cq_flags, flags | CICPU_INIT); + clockintr_cpu_queue[cpu] = cq; + } else { + KASSERT(ISSET(cq->cq_flags, CICPU_INIT)); + if (ISSET(cq->cq_flags, CICPU_HAVE_INTRCLOCK)) + KASSERT(cq->cq_intrclock.ic_rearm == ic->ic_rearm); + else + KASSERT(ic == NULL); + } +} + +/* + * Run all expired events scheduled on the local CPU. + * + * At the moment there two kinds of events: hardclock and statclock. + * + * The hardclock has a fixed period of hardclock_period nanoseconds. + * + * If CI_WANTSTAT is unset then the statclock is not run. 
Otherwise, the + * statclock period is determined by the CI_RNDSTAT flag: + * + * - If CI_RNDSTAT is unset then the statclock has a fixed period + * of statavg nanoseconds. + * + * - If CI_RNDSTAT is set then the statclock has a pseudorandom period + * of [statavg - (statvar / 2), statavg + (statvar / 2)] nanoseconds. + * We use random(9) to determine the period instead of arc4random(9) + * because it is faster. + * + * Returns 1 if any events are run, otherwise 0. + * + * TODO It would be great if hardclock() and statclock() took a count + * of ticks so we don't need to call them in a loop if the clock + * interrupt is delayed. This would also allow us to organically + * advance the value of the global variable "ticks" when we resume + * from suspend. + * + * TODO All platforms should run a separate statclock. We should not + * call statclock() from hardclock(). + */ +int +clockintr_dispatch(struct clockframe *frame) +{ + uint64_t count, i, lateness, now, run; + struct clockintr_queue *cq; + uint32_t avg, min, off, var; + u_int gen, ogen; + + splassert(IPL_CLOCK); + cq = clockintr_cpu_queue[cpu_number()]; + + /* + * If we arrived too early we have nothing to do. + */ + now = nsecruntime(); + if (now < cq->cq_next) + goto done; + + lateness = now - cq->cq_next; + run = 0; + + /* + * Run the dispatch. + */ +again: + /* Run all expired hardclock events. */ + count = nsec_advance(&cq->cq_next_hardclock, hardclock_period, now); + for (i = 0; i < count; i++) + hardclock(frame); + run += count; + + /* Run all expired statclock events. */ + if (ISSET(clockintr_flags, CI_WANTSTAT)) { + do { + gen = statgen; + membar_consumer(); + avg = statavg; + min = statmin; + var = statvar; + membar_consumer(); + } while (gen == 0 || gen != statgen); + if (ISSET(clockintr_flags, CI_RNDSTAT)) { + count = 0; + while (cq->cq_next_statclock <= now) { + count++; + while ((off = (random() & (var - 1))) == 0) + continue; + cq->cq_next_statclock += min + off; + } + } else + count = nsec_advance(&cq->cq_next_statclock, avg, now); + for (i = 0; i < count; i++) + statclock(frame); + run += count; + } + + /* + * Rerun the dispatch if the next event has already expired. + */ + if (ISSET(clockintr_flags, CI_WANTSTAT)) + cq->cq_next = MIN(cq->cq_next_hardclock, cq->cq_next_statclock); + else + cq->cq_next = cq->cq_next_hardclock; + now = nsecruntime(); + if (cq->cq_next <= now) + goto again; + + /* + * Dispatch complete. + */ +done: + if (ISSET(cq->cq_flags, CICPU_HAVE_INTRCLOCK)) + intrclock_rearm(&cq->cq_intrclock, cq->cq_next - now); + + ogen = cq->cq_stat_gen; + cq->cq_stat_gen = 0; + membar_producer(); + if (run > 0) { + cq->cq_stat.cs_dispatch_prompt++; + cq->cq_stat.cs_dispatch_lateness += lateness; + cq->cq_stat.cs_events_run += run; + } else + cq->cq_stat.cs_dispatch_early++; + membar_producer(); + cq->cq_stat_gen = MAX(1, ogen + 1); + + return run > 0; +} + +/* + * Initialize and/or update the statclock variables. Computes + * statavg, statmin, and statvar according to the given frequency. + * + * This is first called during clockintr_init() to enable a statclock + * separate from the hardclock. + * + * Subsequent calls are made from setstatclockrate() to update the + * frequency when enabling or disabling profiling. + * + * TODO Isolate the profiling code from statclock() into a separate + * profclock() routine so we don't need to change the effective + * rate at runtime anymore. Ideally we would set the statclock + * variables once and never reset them. 
Then we can remove the + * atomic synchronization code from clockintr_dispatch(). + */ +void +clockintr_reset_statclock_frequency(int freq) +{ + uint32_t avg, half_avg, min, var; + unsigned int ogen; + + KASSERT(ISSET(clockintr_flags, CI_WANTSTAT)); + KASSERT(freq > 0 && freq <= 1000000000); + + avg = 1000000000 / freq; + + /* Find the largest power of two such that 2^n <= avg / 2. */ + half_avg = avg / 2; + for (var = 1 << 31; var > half_avg; var /= 2) + continue; + + /* Use the value we found to set a lower bound for our range. */ + min = avg - (var / 2); + + mtx_enter(&clockintr_mtx); + + ogen = statgen; + statgen = 0; + membar_producer(); + + statavg = avg; + statmin = min; + statvar = var; + + membar_producer(); + statgen = MAX(1, ogen + 1); + + mtx_leave(&clockintr_mtx); +} + +int +clockintr_sysctl(void *oldp, size_t *oldlenp, void *newp, size_t newlen) +{ + struct clockintr_stat stat, total = { 0 }; + struct clockintr_queue *cq; + struct cpu_info *ci; + CPU_INFO_ITERATOR cii; + unsigned int gen; + + CPU_INFO_FOREACH(cii, ci) { + cq = clockintr_cpu_queue[CPU_INFO_UNIT(ci)]; + if (cq == NULL || !ISSET(cq->cq_flags, CICPU_INIT)) + continue; + do { + gen = cq->cq_stat_gen; + membar_consumer(); + stat = cq->cq_stat; + membar_consumer(); + } while (gen == 0 || gen != cq->cq_stat_gen); + total.cs_dispatch_early += stat.cs_dispatch_early; + total.cs_dispatch_prompt += stat.cs_dispatch_prompt; + total.cs_dispatch_lateness += stat.cs_dispatch_lateness; + total.cs_events_run += stat.cs_events_run; + } + + return sysctl_rdstruct(oldp, oldlenp, newp, &total, sizeof(total)); +} + +/* + * Given an interval timer with a period of period nanoseconds whose + * next expiration point is the absolute time *next, find the timer's + * most imminent expiration point *after* the absolute time now and + * write it to *next. + * + * Returns the number of elapsed periods. + * + * There are three cases here. Each is more computationally expensive + * than the last. + * + * 1. No periods have elapsed because *next has not yet elapsed. We + * don't need to update *next. Just return 0. + * + * 2. One period has elapsed. *next has elapsed but (*next + period) + * has not elapsed. Update *next and return 1. + * + * 3. More than one period has elapsed. Compute the number of elapsed + * periods using integer division and update *next. + * + * This routine performs no overflow checks. We assume period is less than + * or equal to one billion, so overflow should never happen if the system + * clock is even remotely sane. + */ +uint64_t +nsec_advance(uint64_t *next, uint64_t period, uint64_t now) +{ + uint64_t elapsed; + + if (now < *next) + return 0; + + if (now < *next + period) { + *next += period; + return 1; + } + + elapsed = (now - *next) / period + 1; + *next += period * elapsed; + return elapsed; +} + +/* + * TODO Move to kern_tc.c when other callers exist. 
+ */ +uint64_t +nsecruntime(void) +{ + struct timespec now; + + nanoruntime(&now); + return TIMESPEC_TO_NSEC(&now); +} + +#ifdef DDB +#include <machine/db_machdep.h> + +#include <ddb/db_interface.h> +#include <ddb/db_output.h> +#include <ddb/db_sym.h> + +void db_show_clockintr_cpu(struct cpu_info *); + +/* + * ddb> show clockintr + */ +void +db_show_clockintr(db_expr_t addr, int haddr, db_expr_t count, char *modif) +{ + struct timespec now; + struct cpu_info *info; + CPU_INFO_ITERATOR iterator; + + nanoruntime(&now); + + db_printf("%20s\n", "RUNTIME"); + db_printf("%10lld.%09ld\n", now.tv_sec, now.tv_nsec); + db_printf("\n"); + db_printf("%20s %3s %s\n", "EXPIRATION", "CPU", "FUNC"); + CPU_INFO_FOREACH(iterator, info) + db_show_clockintr_cpu(info); +} + +void +db_show_clockintr_cpu(struct cpu_info *ci) +{ + struct timespec next; + struct clockintr_queue *cq; + unsigned int cpu; + + cpu = CPU_INFO_UNIT(ci); + cq = clockintr_cpu_queue[cpu]; + + if (cq == NULL || !ISSET(cq->cq_flags, CICPU_INIT)) + return; + + NSEC_TO_TIMESPEC(cq->cq_next_hardclock, &next); + db_printf("%10lld.%09ld %3u %s\n", + next.tv_sec, next.tv_nsec, cpu, "hardclock"); + + if (ISSET(clockintr_flags, CI_WANTSTAT)) { + NSEC_TO_TIMESPEC(cq->cq_next_statclock, &next); + db_printf("%10lld.%09ld %3u %s\n", + next.tv_sec, next.tv_nsec, cpu, "statclock"); + } +} +#endif Index: sys/kern/kern_sysctl.c =================================================================== RCS file: /cvs/src/sys/kern/kern_sysctl.c,v retrieving revision 1.394 diff -u -p -r1.394 kern_sysctl.c --- sys/kern/kern_sysctl.c 4 May 2021 21:57:15 -0000 1.394 +++ sys/kern/kern_sysctl.c 25 Jun 2021 02:48:48 -0000 @@ -84,6 +84,9 @@ #include <uvm/uvm_extern.h> +#ifdef CLOCKINTR +#include <dev/clockintr.h> +#endif #include <dev/cons.h> #include <net/route.h> @@ -642,6 +645,10 @@ kern_sysctl(int *name, u_int namelen, vo return (timeout_sysctl(oldp, oldlenp, newp, newlen)); case KERN_UTC_OFFSET: return (sysctl_utc_offset(oldp, oldlenp, newp, newlen)); +#ifdef CLOCKINTR + case KERN_CLOCKINTR_STATS: + return (clockintr_sysctl(oldp, oldlenp, newp, newlen)); +#endif default: return (sysctl_bounded_arr(kern_vars, nitems(kern_vars), name, namelen, oldp, oldlenp, newp, newlen)); Index: sys/dev/clockintr.h =================================================================== RCS file: sys/dev/clockintr.h diff -N sys/dev/clockintr.h --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ sys/dev/clockintr.h 25 Jun 2021 02:48:48 -0000 @@ -0,0 +1,84 @@ +/* $OpenBSD$ */ + +/* + * Copyright (c) 2020-2021 Scott Cheloha <chel...@openbsd.org> + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 
+ */ + +#ifndef _DEV_CLOCKINTR_H_ +#define _DEV_CLOCKINTR_H_ + +#ifdef _KERNEL + +#include <machine/intr.h> + +/* + * Platform API + */ + +struct intrclock; + +typedef void (*intrclock_rearm_t)(uint64_t); + +struct intrclock { + intrclock_rearm_t ic_rearm; + void *ic_cookie; +}; + +static inline void +intrclock_rearm(struct intrclock *ic, uint64_t nsecs) +{ + ic->ic_rearm(nsecs); +} + +struct clockframe; + +void clockintr_cpu_init(const struct intrclock *, u_int); +int clockintr_dispatch(struct clockframe *); +void clockintr_init(int, int, u_int); +void clockintr_reset_statclock_frequency(int); + +/* Global state flags. */ +#define CI_INIT 0x00000001 /* clockintr_init() called */ +#define CI_WANTSTAT 0x00000002 /* run a separate statclock */ +#define CI_STATE_MASK 0x00000003 + +/* Global behavior flags. */ +#define CI_RNDSTAT 0x80000000 /* randomized statclock */ +#define CI_FLAG_MASK 0x80000000 + +/* Per-CPU state flags. */ +#define CICPU_INIT 0x00000001 /* ready for dispatch */ +#define CICPU_HAVE_INTRCLOCK 0x00000002 /* have local intr. clock */ +#define CICPU_STATE_MASK 0x00000003 + +/* Per-CPU behavior flags. */ +#define CICPU_FLAG_MASK 0x00000000 + +/* + * Kernel API + */ + +int clockintr_sysctl(void *, size_t *, void *, size_t); + +#endif /* _KERNEL */ + +struct clockintr_stat { + uint64_t cs_dispatch_early; + uint64_t cs_dispatch_prompt; + uint64_t cs_dispatch_lateness; + uint64_t cs_events_run; +}; + +#endif /* !_DEV_CLOCKINTR_H_ */ Index: sys/sys/sysctl.h =================================================================== RCS file: /cvs/src/sys/sys/sysctl.h,v retrieving revision 1.218 diff -u -p -r1.218 sysctl.h --- sys/sys/sysctl.h 17 May 2021 17:54:31 -0000 1.218 +++ sys/sys/sysctl.h 25 Jun 2021 02:48:49 -0000 @@ -190,7 +190,8 @@ struct ctlname { #define KERN_TIMEOUT_STATS 87 /* struct: timeout status and stats */ #define KERN_UTC_OFFSET 88 /* int: adjust RTC time to UTC */ #define KERN_VIDEO 89 /* struct: video properties */ -#define KERN_MAXID 90 /* number of valid kern ids */ +#define KERN_CLOCKINTR_STATS 90 /* struct: clockintr stats */ +#define KERN_MAXID 91 /* number of valid kern ids */ #define CTL_KERN_NAMES { \ { 0, 0 }, \ Index: sys/conf/files =================================================================== RCS file: /cvs/src/sys/conf/files,v retrieving revision 1.702 diff -u -p -r1.702 files --- sys/conf/files 16 Apr 2021 08:17:35 -0000 1.702 +++ sys/conf/files 25 Jun 2021 02:48:49 -0000 @@ -687,6 +687,7 @@ file kern/init_sysent.c file kern/kern_acct.c accounting file kern/kern_bufq.c file kern/kern_clock.c +file kern/kern_clockintr.c clockintr file kern/kern_descrip.c file kern/kern_event.c file kern/kern_exec.c Index: sys/arch/amd64/conf/GENERIC =================================================================== RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v retrieving revision 1.498 diff -u -p -r1.498 GENERIC --- sys/arch/amd64/conf/GENERIC 28 Apr 2021 11:32:59 -0000 1.498 +++ sys/arch/amd64/conf/GENERIC 25 Jun 2021 02:48:49 -0000 @@ -20,6 +20,7 @@ option MTRR # CPU memory range attribu option NTFS # NTFS support option HIBERNATE # Hibernate support +option CLOCKINTR config bsd swap generic