Hi, I'm looking for testers for the attached patch. You need an amd64 machine with a lapic.
This includes:

- All "real" amd64 machines ever made
- amd64 VMs running on hypervisors that provide a virtual lapic

Note that this does *not* include:

- amd64 VMs running on OpenBSD's vmm(4).  (I will ask for a separate
  round of testing for vmm(4) VMs, don't worry.)

The patch adds a new machine-independent clock interrupt scheduling
layer (hereafter, "clockintr") to the kernel in kern/kern_clockintr.c,
configures GENERIC amd64 kernels to use clockintr, and changes
amd64/lapic.c to use clockintr instead of calling hardclock(9)
directly.

Please apply the patch and make sure to reconfigure your kernel
before recompiling/installing it to test.

I am especially interested in whether this breaks suspend/resume or
hibernate/unhibernate.  Suspend/resume is unaffected on my Lenovo
X1C7; is the same true for your machine?  Please include a dmesg with
your results.

Stats for the clockintr subsystem are exposed via sysctl(2).  If you
are interested in providing them, you can compile and run the program
attached inline in my next mail.  A snippet of the output from across
a suspend/resume is especially useful.

This is the end of the mail if you just want to test this.  If you
are interested in the possible behavior changes or a description of
how clockintr works, keep reading.

Thanks,

-Scott

--

There are some behavior changes, but I have found them to be small,
harmless, and/or useful.  The first one is the most significant:

- Clockintr schedules events against the system clock, so hardclock(9)
  ticks are pegged to the system clock and the length of a tick is now
  subject to NTP adjustment via adjtime(2) and adjfreq(2).

  In practice, NTP adjustment is very conservative.  In my testing the
  delta between the raw frequency and the NTP frequency is small when
  ntpd(8) is doing coarse correction with adjtime(2) and invisible when
  ntpd(8) is doing fine correction with adjfreq(2).

  The upshot of the frequency difference is that you will sometimes get
  spurious ("early") interrupts while ntpd(8) is correcting the clock.
  They go away when ntpd(8) finishes synchronizing.

  FWIW: Linux, FreeBSD, and DragonflyBSD have all made this jump.

- hardclock(9) will run simultaneously on every CPU in the system.
  This seems to be fine, but there might be some subtle contention that
  gets worse as you add more CPUs.  Worth investigating.

- amd64 gets a pseudorandom statclock().  This is desirable, right?

- "Lost" or delayed ticks are handled by the clockintr layer
  transparently.  This means that if the clock interrupt is delayed
  due to e.g. hypervisor delays, we don't "lose" ticks and the timeout
  schedule does not decay.  This is super relevant for vmm(4), but it
  may also be relevant for other hypervisors.  (The catch-up arithmetic
  is sketched just after this list.)
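To make the catch-up behavior concrete, here is a small standalone
illustration of the arithmetic the dispatch loop uses.  It mirrors the
nsec_advance() helper in the patch at the end of this mail; the example
numbers in the trailing comment are mine and are only for illustration.

#include <sys/stdint.h>

/*
 * Advance *next past "now" in steps of "period" nanoseconds and
 * return how many periods elapsed.  The dispatch loop runs
 * hardclock(9) once per elapsed period, so a delayed interrupt
 * still produces the right number of ticks.
 */
uint64_t
nsec_advance(uint64_t *next, uint64_t period, uint64_t now)
{
	uint64_t elapsed;

	if (now < *next)
		return 0;

	if (now < *next + period) {
		*next += period;
		return 1;
	}

	elapsed = (now - *next) / period + 1;
	*next += period * elapsed;
	return elapsed;
}

/*
 * Example: with hz=100 the hardclock period is 10000000ns.  If *next
 * is 100000000ns and the interrupt only arrives at now = 137000000ns,
 * nsec_advance() returns 4 and advances *next to 140000000ns, so four
 * hardclock(9) calls are run and no ticks are lost.
 */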
--

Last, here are notes for people interested in the design or the actual
code.  Ask questions if something about my approach seems off; I have
never added a subsystem to the kernel before.  The code has not changed
much in the last six months, so I think I am nearing a stable design.
I will document these interfaces in a real manpage soon.

- Clockintr prototypes are declared in <dev/clockintr.h>.

- Clockintr depends on the timecounter and the system clock to do its
  scheduling.  If there is no working timecounter the machine will
  hang, as multitasking preemption will cease.

- Global clockintr initialization is done via clockintr_init().  You
  call this from cpu_initclocks() on the primary CPU *after* you
  install a timecounter.

  The function sets a global frequency for the hardclock(9) (required),
  a global frequency for the statclock() (or zero if you don't want a
  statclock), and sets global behavior flags.  There is only one flag
  right now, CI_RNDSTAT, which toggles whether the statclock() has a
  pseudorandom period.

  If the platform has a one-shot clock (e.g. amd64, arm64, etc.) it
  makes sense to set CI_RNDSTAT.  If the platform does not have a
  one-shot clock (e.g. alpha) there is no point in setting CI_RNDSTAT,
  as the hardware cannot provide the feature.

- Per-CPU clockintr initialization is done via clockintr_cpu_init().
  On the primary CPU, call this immediately *after* you call
  clockintr_init().  Secondary CPUs should call this late in
  cpu_hatch(), probably right before cpu_switchto(9).  The function
  allocates memory for the local CPU's schedule and "installs" the
  local interrupt clock, if any.

  If the platform cannot provide a local interrupt clock with a
  one-shot mode, you just pass a NULL pointer to clockintr_cpu_init().
  The clockintr layer on that CPU will then run in "dummy" mode: the
  platform is left responsible for scheduling clock interrupt delivery
  without input from the clockintr code, and it should deliver a clock
  interrupt hz(9) times per second to keep things running smoothly.
  This mode works, but it is not very accurate.

  If the platform can provide a local one-shot clock, you pass a
  pointer to a "struct intrclock" that describes said clock.  Currently
  intrclocks are immutable and stateless, so every CPU on a system can
  use the same struct (on many platforms each CPU has an identical copy
  of a particular clock).  If we have to add locking or state to the
  intrclock struct we will need to allocate per-CPU intrclock structs,
  but as of yet I haven't needed it.

  The intrclock struct currently has one member, "ic_rearm", a function
  pointer taking a count of nanoseconds.  ic_rearm() should rearm the
  calling CPU's clock to deliver a local clock interrupt after the
  given number of nanoseconds have elapsed.  All platform-specific
  details are hidden from the clockintr layer: the clockintr layer just
  passes a count of nanoseconds to the MD code, that's it.

- Clock interrupt events are run from clockintr_dispatch().  The
  platform needs to call this function at IPL_CLOCK from the ISR
  whenever the interrupt clock fires.  The dispatch function runs all
  expired events (hardclock + statclock) and, if the local CPU has an
  intrclock, schedules the next clock interrupt before returning.
  Events are run without any locks or mutexes.

--

Here's the patch.  A rough end-to-end sketch of the MD glue comes
first; it is illustration only and is not part of the diff.
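To make the interfaces above concrete, here is a rough, untested sketch
of what the MD glue might look like for an imaginary platform "foo"
with a per-CPU one-shot timer.  Everything prefixed "foo_" and
FOO_TIMER_FREQ is invented for illustration; only the clockintr_*()
calls, struct intrclock, and CI_RNDSTAT come from the patch.  The shape
follows the amd64 lapic changes in the diff.

#include <sys/param.h>
#include <sys/kernel.h>			/* hz, stathz, profhz */
#include <sys/stdint.h>

#include <dev/clockintr.h>

#define FOO_TIMER_FREQ	25000000	/* hypothetical: 25MHz timer */

void foo_timer_start_oneshot(uint32_t);	/* hypothetical MD helper */

uint64_t foo_nsec_cycle_ratio;	/* timer cycles per nsec, 32.32 fixed point */
uint64_t foo_nsec_max;		/* largest nsec count we can convert */

/* Rearm the local CPU's timer to fire after "nsecs" nanoseconds. */
void
foo_timer_rearm(uint64_t nsecs)
{
	uint32_t cycles;

	nsecs = MIN(foo_nsec_max, nsecs);
	cycles = (nsecs * foo_nsec_cycle_ratio) >> 32;
	foo_timer_start_oneshot(MAX(1, cycles));
}

const struct intrclock foo_intrclock = {
	.ic_rearm = foo_timer_rearm,
};

/* Run on every CPU: primary after clockintr_init(), APs late in cpu_hatch(). */
void
foo_cpu_startclock(void)
{
	clockintr_cpu_init(&foo_intrclock, 0);
	foo_timer_rearm(0);		/* request the first interrupt ASAP */
}

/* Run once, from the primary CPU's cpu_initclocks(), after the
 * timecounter is installed. */
void
foo_initclocks(void)
{
	stathz = hz;
	profhz = stathz;

	foo_nsec_cycle_ratio = FOO_TIMER_FREQ * (1ULL << 32) / 1000000000;
	foo_nsec_max = UINT64_MAX / foo_nsec_cycle_ratio;

	clockintr_init(hz, stathz, CI_RNDSTAT);
	foo_cpu_startclock();
}

/* The timer ISR, running at IPL_CLOCK. */
int
foo_timer_intr(void *arg, struct clockframe *frame)
{
	/* Runs expired events, then rearms via foo_intrclock.ic_rearm. */
	return clockintr_dispatch(frame);
}

On a platform with no usable one-shot timer you would instead pass NULL
to clockintr_cpu_init() and keep delivering a periodic interrupt hz(9)
times per second yourself, as described above.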
Index: sys/arch/amd64/amd64/lapic.c =================================================================== RCS file: /cvs/src/sys/arch/amd64/amd64/lapic.c,v retrieving revision 1.58 diff -u -p -r1.58 lapic.c --- sys/arch/amd64/amd64/lapic.c 11 Jun 2021 05:33:16 -0000 1.58 +++ sys/arch/amd64/amd64/lapic.c 25 Jun 2021 02:48:45 -0000 @@ -49,6 +49,7 @@ #include <machine/i82489reg.h> #include <machine/i82489var.h> +#include <dev/clockintr.h> #include <dev/ic/i8253reg.h> #include "ioapic.h" @@ -72,7 +73,6 @@ struct evcount clk_count; struct evcount ipi_count; #endif -void lapic_delay(int); static u_int32_t lapic_gettick(void); void lapic_clockintr(void *, struct intrframe); void lapic_initclocks(void); @@ -402,18 +402,23 @@ lapic_gettick(void) #include <sys/kernel.h> /* for hz */ -u_int32_t lapic_tval; +void lapic_timer_oneshot(uint32_t, uint32_t); +void lapic_timer_periodic(uint32_t, uint32_t); + +void lapic_timer_rearm(uint64_t); + +const struct intrclock lapic_intrclock = { + .ic_rearm = lapic_timer_rearm, +}; /* * this gets us up to a 4GHz busclock.... */ u_int32_t lapic_per_second = 0; -u_int32_t lapic_frac_usec_per_cycle; -u_int64_t lapic_frac_cycle_per_usec; -u_int32_t lapic_delaytab[26]; -void lapic_timer_oneshot(uint32_t, uint32_t); -void lapic_timer_periodic(uint32_t, uint32_t); +uint64_t lapic_nsec_cycle_ratio; +uint64_t lapic_nsec_max; +uint32_t lapic_cycle_min = 1; /* * Start the local apic countdown timer. @@ -443,6 +448,17 @@ lapic_timer_periodic(uint32_t mask, uint } void +lapic_timer_rearm(uint64_t nsecs) +{ + uint32_t cycles; + + nsecs = MIN(lapic_nsec_max, nsecs); + cycles = (nsecs * lapic_nsec_cycle_ratio) >> 32; + cycles = MAX(lapic_cycle_min, cycles); + lapic_timer_oneshot(0, cycles); +} + +void lapic_clockintr(void *arg, struct intrframe frame) { struct cpu_info *ci = curcpu(); @@ -450,7 +466,7 @@ lapic_clockintr(void *arg, struct intrfr floor = ci->ci_handled_intr_level; ci->ci_handled_intr_level = ci->ci_ilevel; - hardclock((struct clockframe *)&frame); + clockintr_dispatch((struct clockframe *)&frame); ci->ci_handled_intr_level = floor; clk_count.ec_count++; @@ -459,17 +475,23 @@ lapic_clockintr(void *arg, struct intrfr void lapic_startclock(void) { - lapic_timer_periodic(0, lapic_tval); + clockintr_cpu_init(&lapic_intrclock, 0); + lapic_timer_rearm(0); } void lapic_initclocks(void) { - lapic_startclock(); + KASSERT(lapic_per_second > 0); i8254_inittimecounter_simple(); -} + stathz = hz; + profhz = stathz; + clockintr_init(hz, stathz, CI_RNDSTAT); + + lapic_startclock(); +} extern int gettick(void); /* XXX put in header file */ extern u_long rtclock_tval; /* XXX put in header file */ @@ -488,8 +510,6 @@ wait_next_cycle(void) } } -extern void tsc_delay(int); - /* * Calibrate the local apic count-down timer (which is running at * bus-clock speed) vs. the i8254 counter/timer (which is running at @@ -551,75 +571,16 @@ skip_calibration: printf("%s: apic clock running at %dMHz\n", ci->ci_dev->dv_xname, lapic_per_second / (1000 * 1000)); - if (lapic_per_second != 0) { - /* - * reprogram the apic timer to run in periodic mode. - * XXX need to program timer on other cpu's, too. - */ - lapic_tval = (lapic_per_second * 2) / hz; - lapic_tval = (lapic_tval / 2) + (lapic_tval & 0x1); - - lapic_timer_periodic(LAPIC_LVTT_M, lapic_tval); - - /* - * Compute fixed-point ratios between cycles and - * microseconds to avoid having to do any division - * in lapic_delay. 
- */ - - tmp = (1000000 * (u_int64_t)1 << 32) / lapic_per_second; - lapic_frac_usec_per_cycle = tmp; - - tmp = (lapic_per_second * (u_int64_t)1 << 32) / 1000000; - - lapic_frac_cycle_per_usec = tmp; - - /* - * Compute delay in cycles for likely short delays in usec. - */ - for (i = 0; i < 26; i++) - lapic_delaytab[i] = (lapic_frac_cycle_per_usec * i) >> - 32; - - /* - * Now that the timer's calibrated, use the apic timer routines - * for all our timing needs.. - */ - if (delay_func != tsc_delay) - delay_func = lapic_delay; - initclock_func = lapic_initclocks; - } -} - -/* - * delay for N usec. - */ - -void -lapic_delay(int usec) -{ - int32_t tick, otick; - int64_t deltat; /* XXX may want to be 64bit */ - - otick = lapic_gettick(); - - if (usec <= 0) + /* + * XXX What happens if the lapic timer frequency is zero at this + * point? Should we panic? + */ + if (lapic_per_second == 0) return; - if (usec <= 25) - deltat = lapic_delaytab[usec]; - else - deltat = (lapic_frac_cycle_per_usec * usec) >> 32; - - while (deltat > 0) { - tick = lapic_gettick(); - if (tick > otick) - deltat -= lapic_tval - (tick - otick); - else - deltat -= otick - tick; - otick = tick; - CPU_BUSY_CYCLE(); - } + lapic_nsec_cycle_ratio = lapic_per_second * (1ULL << 32) / 1000000000; + lapic_nsec_max = UINT64_MAX / lapic_nsec_cycle_ratio; + initclock_func = lapic_initclocks; } /* Index: sys/arch/amd64/amd64/cpu.c =================================================================== RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v retrieving revision 1.153 diff -u -p -r1.153 cpu.c --- sys/arch/amd64/amd64/cpu.c 11 Mar 2021 11:16:55 -0000 1.153 +++ sys/arch/amd64/amd64/cpu.c 25 Jun 2021 02:48:46 -0000 @@ -939,7 +939,7 @@ cpu_hatch(void *v) tsc_sync_ap(ci); lapic_enable(); - lapic_startclock(); + cpu_ucode_apply(ci); cpu_tsx_disable(ci); @@ -995,6 +995,8 @@ cpu_hatch(void *v) nanouptime(&ci->ci_schedstate.spc_runtime); splx(s); + + lapic_startclock(); SCHED_LOCK(s); cpu_switchto(NULL, sched_chooseproc()); Index: sys/kern/kern_clockintr.c =================================================================== RCS file: sys/kern/kern_clockintr.c diff -N sys/kern/kern_clockintr.c --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ sys/kern/kern_clockintr.c 25 Jun 2021 02:48:46 -0000 @@ -0,0 +1,434 @@ +/* $OpenBSD$ */ + +/* + * Copyright (c) 2003 Dale Rahn <dr...@openbsd.org> + * Copyright (c) 2020 Mark Kettenis <kette...@openbsd.org> + * Copyright (c) 2020-2021 Scott Cheloha <chel...@openbsd.org> + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 
+ */ + +#include <sys/param.h> +#include <sys/atomic.h> +#include <sys/kernel.h> +#include <sys/malloc.h> +#include <sys/mutex.h> +#include <sys/stdint.h> +#include <sys/sysctl.h> +#include <sys/systm.h> +#include <sys/time.h> + +#include <dev/clockintr.h> + +#include <machine/intr.h> + +/* + * Locks used in this file: + * + * C global clockintr configuration mutex (clockintr_mtx) + * I Immutable after initialization + * p Only modified by local CPU + */ +struct mutex clockintr_mtx = MUTEX_INITIALIZER(IPL_CLOCK); + +/* + * Per-CPU clockintr state. + */ +struct clockintr_queue { + uint64_t cq_next; /* [p] next event expiration */ + uint64_t cq_next_hardclock; /* [p] next hardclock expiration */ + uint64_t cq_next_statclock; /* [p] next statclock expiration */ + struct intrclock cq_intrclock; /* [I] local interrupt clock */ + struct clockintr_stat cq_stat; /* [p] dispatch statistics */ + volatile u_int cq_stat_gen; /* [p] cq_stat update generation */ + u_int cq_flags; /* [I] local state + behavior flags */ +} *clockintr_cpu_queue[MAXCPUS]; + +u_int clockintr_flags; /* [I] global state + behavior flags */ +uint32_t hardclock_period; /* [I] hardclock period (ns) */ +volatile u_int statgen = 1; /* [C] stat update generation */ +uint32_t statavg; /* [C] average statclock period (ns) */ +uint32_t statmin; /* [C] minimum statclock period (ns) */ +uint32_t statvar; /* [C] max statmin offset (ns) */ + +uint64_t nsec_advance(uint64_t *, uint64_t, uint64_t); +uint64_t nsecruntime(void); + +/* + * Initialize global clockintr state. Must be called only once. + */ +void +clockintr_init(int hardfreq, int statfreq, u_int flags) +{ + KASSERT(clockintr_flags == 0); + KASSERT(hardfreq > 0 && hardfreq <= 1000000000); + KASSERT(statfreq >= 0 && statfreq <= 1000000000); + KASSERT((flags & ~CI_FLAG_MASK) == 0); + + hardclock_period = 1000000000 / hardfreq; + if (statfreq != 0) { + SET(clockintr_flags, CI_WANTSTAT); + clockintr_reset_statclock_frequency(statfreq); + } else + KASSERT(!ISSET(flags, CI_RNDSTAT)); + SET(clockintr_flags, flags | CI_INIT); +} + +/* + * Allocate and initialize the local CPU's state for use in + * clockintr_dispatch(). + */ +void +clockintr_cpu_init(const struct intrclock *ic, u_int flags) +{ + struct clockintr_queue *cq; + int cpu; + + cpu = cpu_number(); + + KASSERT((flags & ~CICPU_FLAG_MASK) == 0); + + if (!ISSET(clockintr_flags, CI_INIT)) { + panic("%s: cpu%d: called before clockintr_init()", + __func__, cpu); + } + + /* + * It is not an error if we're called multiple times for a + * given CPU. Just make sure the intrclock didn't change. + * + * XXX Is M_DEVBUF appropriate? This isn't really a "driver". + */ + cq = clockintr_cpu_queue[cpu]; + if (cq == NULL) { + cq = malloc(sizeof(*cq), M_DEVBUF, M_NOWAIT | M_ZERO); + if (ic != NULL) { + cq->cq_intrclock = *ic; + SET(cq->cq_flags, CICPU_HAVE_INTRCLOCK); + } + cq->cq_stat_gen = 1; + SET(cq->cq_flags, flags | CICPU_INIT); + clockintr_cpu_queue[cpu] = cq; + } else { + KASSERT(ISSET(cq->cq_flags, CICPU_INIT)); + if (ISSET(cq->cq_flags, CICPU_HAVE_INTRCLOCK)) + KASSERT(cq->cq_intrclock.ic_rearm == ic->ic_rearm); + else + KASSERT(ic == NULL); + } +} + +/* + * Run all expired events scheduled on the local CPU. + * + * At the moment there two kinds of events: hardclock and statclock. + * + * The hardclock has a fixed period of hardclock_period nanoseconds. + * + * If CI_WANTSTAT is unset then the statclock is not run. 
Otherwise, the + * statclock period is determined by the CI_RNDSTAT flag: + * + * - If CI_RNDSTAT is unset then the statclock has a fixed period + * of statavg nanoseconds. + * + * - If CI_RNDSTAT is set then the statclock has a pseudorandom period + * of [statavg - (statvar / 2), statavg + (statvar / 2)] nanoseconds. + * We use random(9) to determine the period instead of arc4random(9) + * because it is faster. + * + * Returns 1 if any events are run, otherwise 0. + * + * TODO It would be great if hardclock() and statclock() took a count + * of ticks so we don't need to call them in a loop if the clock + * interrupt is delayed. This would also allow us to organically + * advance the value of the global variable "ticks" when we resume + * from suspend. + * + * TODO All platforms should run a separate statclock. We should not + * call statclock() from hardclock(). + */ +int +clockintr_dispatch(struct clockframe *frame) +{ + uint64_t count, i, lateness, now, run; + struct clockintr_queue *cq; + uint32_t avg, min, off, var; + u_int gen, ogen; + + splassert(IPL_CLOCK); + cq = clockintr_cpu_queue[cpu_number()]; + + /* + * If we arrived too early we have nothing to do. + */ + now = nsecruntime(); + if (now < cq->cq_next) + goto done; + + lateness = now - cq->cq_next; + run = 0; + + /* + * Run the dispatch. + */ +again: + /* Run all expired hardclock events. */ + count = nsec_advance(&cq->cq_next_hardclock, hardclock_period, now); + for (i = 0; i < count; i++) + hardclock(frame); + run += count; + + /* Run all expired statclock events. */ + if (ISSET(clockintr_flags, CI_WANTSTAT)) { + do { + gen = statgen; + membar_consumer(); + avg = statavg; + min = statmin; + var = statvar; + membar_consumer(); + } while (gen == 0 || gen != statgen); + if (ISSET(clockintr_flags, CI_RNDSTAT)) { + count = 0; + while (cq->cq_next_statclock <= now) { + count++; + while ((off = (random() & (var - 1))) == 0) + continue; + cq->cq_next_statclock += min + off; + } + } else + count = nsec_advance(&cq->cq_next_statclock, avg, now); + for (i = 0; i < count; i++) + statclock(frame); + run += count; + } + + /* + * Rerun the dispatch if the next event has already expired. + */ + if (ISSET(clockintr_flags, CI_WANTSTAT)) + cq->cq_next = MIN(cq->cq_next_hardclock, cq->cq_next_statclock); + else + cq->cq_next = cq->cq_next_hardclock; + now = nsecruntime(); + if (cq->cq_next <= now) + goto again; + + /* + * Dispatch complete. + */ +done: + if (ISSET(cq->cq_flags, CICPU_HAVE_INTRCLOCK)) + intrclock_rearm(&cq->cq_intrclock, cq->cq_next - now); + + ogen = cq->cq_stat_gen; + cq->cq_stat_gen = 0; + membar_producer(); + if (run > 0) { + cq->cq_stat.cs_dispatch_prompt++; + cq->cq_stat.cs_dispatch_lateness += lateness; + cq->cq_stat.cs_events_run += run; + } else + cq->cq_stat.cs_dispatch_early++; + membar_producer(); + cq->cq_stat_gen = MAX(1, ogen + 1); + + return run > 0; +} + +/* + * Initialize and/or update the statclock variables. Computes + * statavg, statmin, and statvar according to the given frequency. + * + * This is first called during clockintr_init() to enable a statclock + * separate from the hardclock. + * + * Subsequent calls are made from setstatclockrate() to update the + * frequency when enabling or disabling profiling. + * + * TODO Isolate the profiling code from statclock() into a separate + * profclock() routine so we don't need to change the effective + * rate at runtime anymore. Ideally we would set the statclock + * variables once and never reset them. 
Then we can remove the + * atomic synchronization code from clockintr_dispatch(). + */ +void +clockintr_reset_statclock_frequency(int freq) +{ + uint32_t avg, half_avg, min, var; + unsigned int ogen; + + KASSERT(ISSET(clockintr_flags, CI_WANTSTAT)); + KASSERT(freq > 0 && freq <= 1000000000); + + avg = 1000000000 / freq; + + /* Find the largest power of two such that 2^n <= avg / 2. */ + half_avg = avg / 2; + for (var = 1 << 31; var > half_avg; var /= 2) + continue; + + /* Use the value we found to set a lower bound for our range. */ + min = avg - (var / 2); + + mtx_enter(&clockintr_mtx); + + ogen = statgen; + statgen = 0; + membar_producer(); + + statavg = avg; + statmin = min; + statvar = var; + + membar_producer(); + statgen = MAX(1, ogen + 1); + + mtx_leave(&clockintr_mtx); +} + +int +clockintr_sysctl(void *oldp, size_t *oldlenp, void *newp, size_t newlen) +{ + struct clockintr_stat stat, total = { 0 }; + struct clockintr_queue *cq; + struct cpu_info *ci; + CPU_INFO_ITERATOR cii; + unsigned int gen; + + CPU_INFO_FOREACH(cii, ci) { + cq = clockintr_cpu_queue[CPU_INFO_UNIT(ci)]; + if (cq == NULL || !ISSET(cq->cq_flags, CICPU_INIT)) + continue; + do { + gen = cq->cq_stat_gen; + membar_consumer(); + stat = cq->cq_stat; + membar_consumer(); + } while (gen == 0 || gen != cq->cq_stat_gen); + total.cs_dispatch_early += stat.cs_dispatch_early; + total.cs_dispatch_prompt += stat.cs_dispatch_prompt; + total.cs_dispatch_lateness += stat.cs_dispatch_lateness; + total.cs_events_run += stat.cs_events_run; + } + + return sysctl_rdstruct(oldp, oldlenp, newp, &total, sizeof(total)); +} + +/* + * Given an interval timer with a period of period nanoseconds whose + * next expiration point is the absolute time *next, find the timer's + * most imminent expiration point *after* the absolute time now and + * write it to *next. + * + * Returns the number of elapsed periods. + * + * There are three cases here. Each is more computationally expensive + * than the last. + * + * 1. No periods have elapsed because *next has not yet elapsed. We + * don't need to update *next. Just return 0. + * + * 2. One period has elapsed. *next has elapsed but (*next + period) + * has not elapsed. Update *next and return 1. + * + * 3. More than one period has elapsed. Compute the number of elapsed + * periods using integer division and update *next. + * + * This routine performs no overflow checks. We assume period is less than + * or equal to one billion, so overflow should never happen if the system + * clock is even remotely sane. + */ +uint64_t +nsec_advance(uint64_t *next, uint64_t period, uint64_t now) +{ + uint64_t elapsed; + + if (now < *next) + return 0; + + if (now < *next + period) { + *next += period; + return 1; + } + + elapsed = (now - *next) / period + 1; + *next += period * elapsed; + return elapsed; +} + +/* + * TODO Move to kern_tc.c when other callers exist. 
+ */ +uint64_t +nsecruntime(void) +{ + struct timespec now; + + nanoruntime(&now); + return TIMESPEC_TO_NSEC(&now); +} + +#ifdef DDB +#include <machine/db_machdep.h> + +#include <ddb/db_interface.h> +#include <ddb/db_output.h> +#include <ddb/db_sym.h> + +void db_show_clockintr_cpu(struct cpu_info *); + +/* + * ddb> show clockintr + */ +void +db_show_clockintr(db_expr_t addr, int haddr, db_expr_t count, char *modif) +{ + struct timespec now; + struct cpu_info *info; + CPU_INFO_ITERATOR iterator; + + nanoruntime(&now); + + db_printf("%20s\n", "RUNTIME"); + db_printf("%10lld.%09ld\n", now.tv_sec, now.tv_nsec); + db_printf("\n"); + db_printf("%20s %3s %s\n", "EXPIRATION", "CPU", "FUNC"); + CPU_INFO_FOREACH(iterator, info) + db_show_clockintr_cpu(info); +} + +void +db_show_clockintr_cpu(struct cpu_info *ci) +{ + struct timespec next; + struct clockintr_queue *cq; + unsigned int cpu; + + cpu = CPU_INFO_UNIT(ci); + cq = clockintr_cpu_queue[cpu]; + + if (cq == NULL || !ISSET(cq->cq_flags, CICPU_INIT)) + return; + + NSEC_TO_TIMESPEC(cq->cq_next_hardclock, &next); + db_printf("%10lld.%09ld %3u %s\n", + next.tv_sec, next.tv_nsec, cpu, "hardclock"); + + if (ISSET(clockintr_flags, CI_WANTSTAT)) { + NSEC_TO_TIMESPEC(cq->cq_next_statclock, &next); + db_printf("%10lld.%09ld %3u %s\n", + next.tv_sec, next.tv_nsec, cpu, "statclock"); + } +} +#endif Index: sys/kern/kern_sysctl.c =================================================================== RCS file: /cvs/src/sys/kern/kern_sysctl.c,v retrieving revision 1.394 diff -u -p -r1.394 kern_sysctl.c --- sys/kern/kern_sysctl.c 4 May 2021 21:57:15 -0000 1.394 +++ sys/kern/kern_sysctl.c 25 Jun 2021 02:48:48 -0000 @@ -84,6 +84,9 @@ #include <uvm/uvm_extern.h> +#ifdef CLOCKINTR +#include <dev/clockintr.h> +#endif #include <dev/cons.h> #include <net/route.h> @@ -642,6 +645,10 @@ kern_sysctl(int *name, u_int namelen, vo return (timeout_sysctl(oldp, oldlenp, newp, newlen)); case KERN_UTC_OFFSET: return (sysctl_utc_offset(oldp, oldlenp, newp, newlen)); +#ifdef CLOCKINTR + case KERN_CLOCKINTR_STATS: + return (clockintr_sysctl(oldp, oldlenp, newp, newlen)); +#endif default: return (sysctl_bounded_arr(kern_vars, nitems(kern_vars), name, namelen, oldp, oldlenp, newp, newlen)); Index: sys/dev/clockintr.h =================================================================== RCS file: sys/dev/clockintr.h diff -N sys/dev/clockintr.h --- /dev/null 1 Jan 1970 00:00:00 -0000 +++ sys/dev/clockintr.h 25 Jun 2021 02:48:48 -0000 @@ -0,0 +1,84 @@ +/* $OpenBSD$ */ + +/* + * Copyright (c) 2020-2021 Scott Cheloha <chel...@openbsd.org> + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 
+ */ + +#ifndef _DEV_CLOCKINTR_H_ +#define _DEV_CLOCKINTR_H_ + +#ifdef _KERNEL + +#include <machine/intr.h> + +/* + * Platform API + */ + +struct intrclock; + +typedef void (*intrclock_rearm_t)(uint64_t); + +struct intrclock { + intrclock_rearm_t ic_rearm; + void *ic_cookie; +}; + +static inline void +intrclock_rearm(struct intrclock *ic, uint64_t nsecs) +{ + ic->ic_rearm(nsecs); +} + +struct clockframe; + +void clockintr_cpu_init(const struct intrclock *, u_int); +int clockintr_dispatch(struct clockframe *); +void clockintr_init(int, int, u_int); +void clockintr_reset_statclock_frequency(int); + +/* Global state flags. */ +#define CI_INIT 0x00000001 /* clockintr_init() called */ +#define CI_WANTSTAT 0x00000002 /* run a separate statclock */ +#define CI_STATE_MASK 0x00000003 + +/* Global behavior flags. */ +#define CI_RNDSTAT 0x80000000 /* randomized statclock */ +#define CI_FLAG_MASK 0x80000000 + +/* Per-CPU state flags. */ +#define CICPU_INIT 0x00000001 /* ready for dispatch */ +#define CICPU_HAVE_INTRCLOCK 0x00000002 /* have local intr. clock */ +#define CICPU_STATE_MASK 0x00000003 + +/* Per-CPU behavior flags. */ +#define CICPU_FLAG_MASK 0x00000000 + +/* + * Kernel API + */ + +int clockintr_sysctl(void *, size_t *, void *, size_t); + +#endif /* _KERNEL */ + +struct clockintr_stat { + uint64_t cs_dispatch_early; + uint64_t cs_dispatch_prompt; + uint64_t cs_dispatch_lateness; + uint64_t cs_events_run; +}; + +#endif /* !_DEV_CLOCKINTR_H_ */ Index: sys/sys/sysctl.h =================================================================== RCS file: /cvs/src/sys/sys/sysctl.h,v retrieving revision 1.218 diff -u -p -r1.218 sysctl.h --- sys/sys/sysctl.h 17 May 2021 17:54:31 -0000 1.218 +++ sys/sys/sysctl.h 25 Jun 2021 02:48:49 -0000 @@ -190,7 +190,8 @@ struct ctlname { #define KERN_TIMEOUT_STATS 87 /* struct: timeout status and stats */ #define KERN_UTC_OFFSET 88 /* int: adjust RTC time to UTC */ #define KERN_VIDEO 89 /* struct: video properties */ -#define KERN_MAXID 90 /* number of valid kern ids */ +#define KERN_CLOCKINTR_STATS 90 /* struct: clockintr stats */ +#define KERN_MAXID 91 /* number of valid kern ids */ #define CTL_KERN_NAMES { \ { 0, 0 }, \ Index: sys/conf/files =================================================================== RCS file: /cvs/src/sys/conf/files,v retrieving revision 1.702 diff -u -p -r1.702 files --- sys/conf/files 16 Apr 2021 08:17:35 -0000 1.702 +++ sys/conf/files 25 Jun 2021 02:48:49 -0000 @@ -687,6 +687,7 @@ file kern/init_sysent.c file kern/kern_acct.c accounting file kern/kern_bufq.c file kern/kern_clock.c +file kern/kern_clockintr.c clockintr file kern/kern_descrip.c file kern/kern_event.c file kern/kern_exec.c Index: sys/arch/amd64/conf/GENERIC =================================================================== RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v retrieving revision 1.498 diff -u -p -r1.498 GENERIC --- sys/arch/amd64/conf/GENERIC 28 Apr 2021 11:32:59 -0000 1.498 +++ sys/arch/amd64/conf/GENERIC 25 Jun 2021 02:48:49 -0000 @@ -20,6 +20,7 @@ option MTRR # CPU memory range attribu option NTFS # NTFS support option HIBERNATE # Hibernate support +option CLOCKINTR config bsd swap generic