On Wed, Apr 07, 2021 at 06:18:40PM +0200, Mark Kettenis wrote:
> > From: Scott Cheloha <scottchel...@gmail.com>
> > Date: Wed, 7 Apr 2021 10:25:04 -0500
> > 
> > > On Apr 6, 2021, at 07:49, Paul Irofti <p...@irofti.net> wrote:
> > > 
> > >>>> The diff is obviously fine. But it is still a heuristic with no real
> > >>>> motivation except for this particular ESXi VM case. So my question
> > >>>> about why we choose the minimum instead of the median or the mean has
> > >>>> not been answered.
> > >>> 
> > >>> Because the median or mean is affected by outliers.  We actually see
> > >>> some outliers in samples from VMware.
> > >>> 
> > >>> I suppose there is a better measure, but I currently have no idea and
> > >>> have not used that kind of measure in the kernel.  On the other hand,
> > >>> finding the minimum is very simple.
> > >> Using the median should take care of the outliers though.
> > >> I'm not at all convinced that taking the absolute value of the
> > >> difference makes sense.  It probably works in this case since the
> > >> actual skew on your VM is zero.  So measurements close to zero are
> > >> "good".  But what if the skew isn't zero?  Take for example an AP that
> > >> is running ahead of the BP by 5000 ticks.  In that case, the right
> > >> value for the skew is -5000.  But now imagine that the BP gets
> > >> "interrupted" while doing a measurement, resulting in a delay of 10000
> > >> ticks between the two rdtsc_lfence() calls.  That would result in a
> > >> measured skew of around zero.  And by taking the minimum of the
> > >> absolute value, you end up using that value.
> > > 
> > > Exactly!
> > 
> > I agree that the median is a better choice
> > of skew than the absolute minimum or
> > average.
> > 
> > I think this means adding qsort to the kernel,
> > though.  Unless we want to do median of
> > medians...
> 
> Or maybe the code that does the actual measurements isn't fit for
> purpose.  The current code does two reads of the TSC register on both
> the BP and the AP.  This causes deviations in both directions,
> depending on whether the BP or the AP gets to experience an SMM event
> or VM exit.
> 
> The idea behind doing the two reads is that by taking the average you
> compensate for the time spent signalling the AP and getting a report
> back.  But this may actually be making the measurements less accurate
> on some systems.
> 
> I believe the current code was inspired by what NetBSD does.  But
> maybe someone should take a close look at the Linux code.  The Linux
> code will have seen waaay more testing than the NetBSD code...

Before we get into Linux, let's first consider FreeBSD.

Here's a port of their sync code.

The latest FreeBSD tsc.c, for reference:

https://github.com/freebsd/freebsd-src/blob/main/sys/x86/x86/tsc.c

(Maybe this should start a new thread?  We're way off topic from what
Yasuoka was originally proposing.)

In a nutshell:

- Sync test happens at the end of cpu_boot_secondary_processors()
  when tsc_check_sync() is called by the BSP.

  So, it happens late in boot and during resume instead of during
  AP startup.

  Unsure what we would do if the TSCs became desync'd during
  suspend.  Maybe we could add a function, tc_detach(), to remove
  the TSC as a valid timecounter.

- BSP broadcasts IPIs to do the sync test.

  We start in x86_64_ipi_tsc_compare().  All the CPUs write
  TSC values into a big array repeatedly.  Each iteration is
  demarcated by an atomic fence.

  Then each CPU checks whether its values are sync'd with
  everyone else's.  The constraint is:

  Value N for a given CPU must be strictly greater than
  value N-1 for all CPUs.

  This is a very NUMA- and VM-friendly sync constraint.  It ignores
  intra-iteration latency from e.g. hypervisor preemption, cache misses,
  etc.  The fence also gives us a hard cutoff instead of relying on a
  fixed heuristic limit that might not be appropriate for every
  machine.  (See the standalone sketch of this check just after this
  list.)

- If any CPU is not sync'd, the BSP issues another IPI.

  In x86_64_ipi_tsc_adjust(), each AP computes its median
  skew relative to the BSP from the values we wrote during
  the earlier IPI.

  This median value is then used to adjust the TSC.  We
  prefer using TSC_ADJUST to do this if it's available,
  but if it isn't we fall back to writing the TSC directly.
  We use wrmsr_safe() to do this to prevent #GPs from
  panicking the machine if writing the TSC isn't allowed.

- Then we do the sync test again.

We give up after five rounds of compare/adjust.
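
Here's a minimal standalone sketch of the check each CPU performs,
written in plain C for illustration (tsc_data and TSC_NREADS are the
names from the patch below; the helper itself is hypothetical):

/*
 * Reading i on this CPU must be strictly newer than reading i-1 on
 * every participating CPU; otherwise the TSCs are not synchronized.
 */
static int
tsc_readings_synced(unsigned int cpu, unsigned int ncpus)
{
        unsigned int i, j;

        for (i = 1; i < TSC_NREADS; i++) {
                for (j = 0; j < ncpus; j++) {
                        if (tsc_data[cpu][i] <= tsc_data[j][i - 1])
                                return 0;
                }
        }
        return 1;
}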

Misc notes:

- We can probably do this without IPIs.  It would require less
  space, for sure.  I just figured I'd demonstrate what FreeBSD
  was doing.

- Selection sort is horrible; obviously we'd have to use a real
  sort (or a selection algorithm) if we settled on using the median
  skew as the adjustment value.  See the quickselect sketch after
  these notes.

- I have added a boatload of debug prints.  Sorry for the noise, I
  just want to see what it's actually doing.

- FreeBSD has no concept of "drift".  Maybe this could be added?

- FreeBSD does not use any per-cpu skew values, so ci_tsc_skew
  is unset.  Note that this means there is no kernel-only
  fallback: if the TSCs aren't sync'd you don't get to use the
  TSC as a timecounter at all.

  ... we can probably find a way to fall back to per-cpu skews
  if this is an unpopular change.  :)

- I think this sync constraint is better than the current "skew
  limit" constraint.  At least, my guess is that it will ignore
  things like NUMA latency and other measurement noise.  With
  the current test we cannot filter these artifacts out.

- FreeBSD doesn't bother using the TSC if APM is enabled.  I don't
  know what this means.

- FreeBSD has special logic when the machine is a VM.  Unsure
  what the upshot of this is exactly, but at minimum the TSC
  has different quality values depending on the hypervisor
  detected.

- FreeBSD doesn't use TSC_ADJUST at all, so using it here is a
  slight improvement on their approach.

- FreeBSD doesn't use the median skew as an adjustment value.  They
  use the absolute minimum skew, but only if all skews have the same
  sign; otherwise the given CPU doesn't bother adjusting its TSC.

  I'm unsure about this adjustment choice.  We've been talking
  about the limitations of the absolute minimum so I figured we
  could try the median instead.

- Are interrupts implicitly disabled during IPI handling?  Or do
  I need to explicitly intr_disable/intr_restore?
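
If we do settle on the median skew, one kernel-friendly alternative to
pulling qsort into the kernel is quickselect: expected linear time, no
extra memory, partitions in place.  A rough sketch (hypothetical
helper, not part of the diff below):

/*
 * Return the k-th smallest element of a[0..n-1] (0-indexed).
 * Iterative quickselect with a Hoare-style partition; reorders
 * the array in place.
 */
static int64_t
select_kth(int64_t *a, long n, long k)
{
        long lo = 0, hi = n - 1;

        while (lo < hi) {
                int64_t pivot = a[(lo + hi) / 2];
                long i = lo, j = hi;

                do {
                        while (a[i] < pivot)
                                i++;
                        while (a[j] > pivot)
                                j--;
                        if (i <= j) {
                                int64_t tmp = a[i];
                                a[i] = a[j];
                                a[j] = tmp;
                                i++;
                                j--;
                        }
                } while (i <= j);

                if (k <= j)
                        hi = j;
                else if (k >= i)
                        lo = i;
                else
                        break;  /* a[k] already equals the pivot */
        }
        return a[k];
}

For an even sample count, find_median_skew() would then average the
n/2-1 and n/2 order statistics (halving each first to dodge overflow,
as the patch already does).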

Obviously this needs cleanup.  It seems to work, though.  If I
forcibly desync my normally sync'd TSCs this patch resyncs them after
one round of compare/adjust, with or without using the TSC_ADJUST MSR.

Very curious if this improves the situation for VMs, NUMA machines, or
those laptops with the lagging CPU0 TSC.

-Scott

Index: sys/arch/amd64/amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.153
diff -u -p -r1.153 cpu.c
--- sys/arch/amd64/amd64/cpu.c  11 Mar 2021 11:16:55 -0000      1.153
+++ sys/arch/amd64/amd64/cpu.c  8 Apr 2021 04:37:54 -0000
@@ -761,10 +761,6 @@ cpu_init(struct cpu_info *ci)
        cr4 = rcr4();
        lcr4(cr4 & ~CR4_PGE);
        lcr4(cr4);
-
-       /* Synchronize TSC */
-       if (cold && !CPU_IS_PRIMARY(ci))
-             tsc_sync_ap(ci);
 #endif
 }
 
@@ -813,13 +809,15 @@ cpu_boot_secondary_processors(void)
                ci->ci_randseed = (arc4random() & 0x7fffffff) + 1;
                cpu_boot_secondary(ci);
        }
+
+       extern void tsc_check_sync(void);
+       tsc_check_sync();
 }
 
 void
 cpu_start_secondary(struct cpu_info *ci)
 {
        int i;
-       u_long s;
 
        ci->ci_flags |= CPUF_AP;
 
@@ -840,20 +838,6 @@ cpu_start_secondary(struct cpu_info *ci)
                printf("dropping into debugger; continue from here to resume 
boot\n");
                db_enter();
 #endif
-       } else {
-               /*
-                * Synchronize time stamp counters. Invalidate cache and
-                * synchronize twice (in tsc_sync_bp) to minimize possible
-                * cache effects. Disable interrupts to try and rule out any
-                * external interference.
-                */
-               s = intr_disable();
-               wbinvd();
-               tsc_sync_bp(ci);
-               intr_restore(s);
-#ifdef TSC_DEBUG
-               printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
-#endif
        }
 
        if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
@@ -878,8 +862,6 @@ void
 cpu_boot_secondary(struct cpu_info *ci)
 {
        int i;
-       int64_t drift;
-       u_long s;
 
        atomic_setbits_int(&ci->ci_flags, CPUF_GO);
 
@@ -892,19 +874,6 @@ cpu_boot_secondary(struct cpu_info *ci)
                printf("dropping into debugger; continue from here to resume 
boot\n");
                db_enter();
 #endif
-       } else if (cold) {
-               /* Synchronize TSC again, check for drift. */
-               drift = ci->ci_tsc_skew;
-               s = intr_disable();
-               wbinvd();
-               tsc_sync_bp(ci);
-               intr_restore(s);
-               drift -= ci->ci_tsc_skew;
-#ifdef TSC_DEBUG
-               printf("TSC skew=%lld drift=%lld\n",
-                   (long long)ci->ci_tsc_skew, (long long)drift);
-#endif
-               tsc_sync_drift(drift);
        }
 }
 
@@ -928,15 +897,7 @@ cpu_hatch(void *v)
        if (ci->ci_flags & CPUF_PRESENT)
                panic("%s: already running!?", ci->ci_dev->dv_xname);
 #endif
-
-       /*
-        * Synchronize the TSC for the first time. Note that interrupts are
-        * off at this point.
-        */
-       wbinvd();
        ci->ci_flags |= CPUF_PRESENT;
-       ci->ci_tsc_skew = 0;    /* reset on resume */
-       tsc_sync_ap(ci);
 
        lapic_enable();
        lapic_startclock();
Index: sys/arch/amd64/amd64/identcpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/identcpu.c,v
retrieving revision 1.118
diff -u -p -r1.118 identcpu.c
--- sys/arch/amd64/amd64/identcpu.c     31 Dec 2020 06:22:33 -0000      1.118
+++ sys/arch/amd64/amd64/identcpu.c     8 Apr 2021 04:37:54 -0000
@@ -757,7 +757,8 @@ identifycpu(struct cpu_info *ci)
 #endif
        }
 
-       tsc_timecounter_init(ci, freq);
+       if (CPU_IS_PRIMARY(ci))
+               tsc_timecounter_init(ci, freq);
 
        cpu_topology(ci);
 #if NVMM > 0
Index: sys/arch/amd64/amd64/ipifuncs.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/ipifuncs.c,v
retrieving revision 1.35
diff -u -p -r1.35 ipifuncs.c
--- sys/arch/amd64/amd64/ipifuncs.c     13 Sep 2020 11:53:16 -0000      1.35
+++ sys/arch/amd64/amd64/ipifuncs.c     8 Apr 2021 04:37:54 -0000
@@ -61,6 +61,8 @@
 void x86_64_ipi_nop(struct cpu_info *);
 void x86_64_ipi_halt(struct cpu_info *);
 void x86_64_ipi_wbinvd(struct cpu_info *);
+void x86_64_ipi_tsc_compare(struct cpu_info *);
+void x86_64_ipi_tsc_adjust(struct cpu_info *);
 
 #if NVMM > 0
 void x86_64_ipi_start_vmm(struct cpu_info *);
@@ -103,6 +105,8 @@ void (*ipifunc[X86_NIPI])(struct cpu_inf
        NULL,
 #endif
        x86_64_ipi_wbinvd,
+       x86_64_ipi_tsc_compare,
+       x86_64_ipi_tsc_adjust,
 };
 
 void
Index: sys/arch/amd64/amd64/tsc.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.23
diff -u -p -r1.23 tsc.c
--- sys/arch/amd64/amd64/tsc.c  23 Feb 2021 04:44:30 -0000      1.23
+++ sys/arch/amd64/amd64/tsc.c  8 Apr 2021 04:37:54 -0000
@@ -27,6 +27,14 @@
 #include <machine/cpu.h>
 #include <machine/cpufunc.h>
 
+#define TSC_DEBUG      1
+
+#ifdef TSC_DEBUG
+#define DPRINTF(_x...) do { printf("tsc: " _x); } while (0)
+#else
+#define DPRINTF(_x...)
+#endif
+
 #define RECALIBRATE_MAX_RETRIES                5
 #define RECALIBRATE_SMI_THRESHOLD      50000
 #define RECALIBRATE_DELAY_THRESHOLD    50
@@ -36,17 +44,9 @@ int          tsc_recalibrate;
 uint64_t       tsc_frequency;
 int            tsc_is_invariant;
 
-#define        TSC_DRIFT_MAX                   250
-#define TSC_SKEW_MAX                   100
-int64_t        tsc_drift_observed;
-
-volatile int64_t       tsc_sync_val;
-volatile struct cpu_info       *tsc_sync_cpu;
-
 u_int          tsc_get_timecount(struct timecounter *tc);
 void           tsc_delay(int usecs);
 
-#include "lapic.h"
 #if NLAPIC > 0
 extern u_int32_t lapic_per_second;
 #endif
@@ -221,29 +221,25 @@ cpu_recalibrate_tsc(struct timecounter *
 u_int
 tsc_get_timecount(struct timecounter *tc)
 {
-       return rdtsc_lfence() + curcpu()->ci_tsc_skew;
+       return rdtsc_lfence();
 }
 
 void
 tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
 {
-#ifdef TSC_DEBUG
-       printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
-           (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
-#endif
-       if (ci->ci_tsc_skew < -TSC_SKEW_MAX || ci->ci_tsc_skew > TSC_SKEW_MAX) {
-               printf("%s: disabling user TSC (skew=%lld)\n",
-                   ci->ci_dev->dv_xname, (long long)ci->ci_tsc_skew);
-               tsc_timecounter.tc_user = 0;
-       }
+       KASSERT(CPU_IS_PRIMARY(ci));
 
-       if (!(ci->ci_flags & CPUF_PRIMARY) ||
-           !(ci->ci_flags & CPUF_CONST_TSC) ||
-           !(ci->ci_flags & CPUF_INVAR_TSC))
+       if (!ISSET(ci->ci_flags, CPUF_CONST_TSC)) {
+               DPRINTF("cannot use timecounter: not constant\n");
+               return;
+       }
+       if (!ISSET(ci->ci_flags, CPUF_INVAR_TSC)) {
+               DPRINTF("cannot use timecounter: not invariant\n");
                return;
+       }
 
-       tsc_frequency = tsc_freq_cpuid(ci);
        tsc_is_invariant = 1;
+       tsc_frequency = tsc_freq_cpuid(ci);
 
        /* Newer CPUs don't require recalibration */
        if (tsc_frequency > 0) {
@@ -256,112 +252,299 @@ tsc_timecounter_init(struct cpu_info *ci
                calibrate_tsc_freq();
        }
 
-       if (tsc_drift_observed > TSC_DRIFT_MAX) {
-               printf("ERROR: %lld cycle TSC drift observed\n",
-                   (long long)tsc_drift_observed);
-               tsc_timecounter.tc_quality = -1000;
-               tsc_timecounter.tc_user = 0;
-               tsc_is_invariant = 0;
-       } else
-               delay_func = tsc_delay;
-
+       /*
+        * We can't use the TSC as a timecounter on an MP kernel until
+        * we've checked the synchronization in tsc_check_sync() after
+        * booting the secondary CPUs.
+        *
+        * We can't use the TSC for delay(9), either, because we might
+        * jump the TSC during tsc_check_sync(), which would violate the
+        * assumptions in tsc_delay().
+        */
+#ifndef MULTIPROCESSOR
+       delay_func = tsc_delay;
        tc_init(&tsc_timecounter);
+#endif
 }
 
-/*
- * Record drift (in clock cycles).  Called during AP startup.
- */
 void
-tsc_sync_drift(int64_t drift)
+tsc_delay(int usecs)
 {
-       if (drift < 0)
-               drift = -drift;
-       if (drift > tsc_drift_observed)
-               tsc_drift_observed = drift;
+       uint64_t interval, start;
+
+       interval = (uint64_t)usecs * tsc_frequency / 1000000;
+       start = rdtsc_lfence();
+       while (rdtsc_lfence() - start < interval)
+               CPU_BUSY_CYCLE();
 }
 
-/*
- * Called during startup of APs, by the boot processor.  Interrupts
- * are disabled on entry.
- */
-void
-tsc_read_bp(struct cpu_info *ci, uint64_t *bptscp, uint64_t *aptscp)
-{
-       uint64_t bptsc;
+#ifdef MULTIPROCESSOR
 
-       if (atomic_swap_ptr(&tsc_sync_cpu, ci) != NULL)
-               panic("tsc_sync_bp: 1");
+/* TODO how many iterations is "good enough"? */
+#define TSC_NREADS     512
 
-       /* Flag it and read our TSC. */
-       atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-       bptsc = (rdtsc_lfence() >> 1);
+/* For sake of testing you can forcibly desync your TSCs. */
+/* #define TSC_DESYNC  1 */
 
-       /* Wait for remote to complete, and read ours again. */
-       while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
-               membar_consumer();
-       bptsc += (rdtsc_lfence() >> 1);
+#define TSC_ADJUST_ROUNDS      5
+
+/* TODO this should be malloc'd */
+uint64_t tsc_data[MAXCPUS][TSC_NREADS];
+volatile uint32_t tsc_read_barrier[TSC_NREADS];
+volatile uint32_t tsc_compare_barrier;
+volatile uint32_t tsc_leave_barrier;
+uint32_t tsc_barrier_threshold;
+uint32_t tsc_rounds;
+int tsc_have_adjust_msr;
+volatile int tsc_sync;
 
-       /* Wait for the results to come in. */
-       while (tsc_sync_cpu == ci)
-               CPU_BUSY_CYCLE();
-       if (tsc_sync_cpu != NULL)
-               panic("tsc_sync_bp: 2");
+/*
+ * XXX Sloooow.
+ */
+static void
+selection_sort(int64_t *array, size_t n)
+{
+       int64_t tmp;
+       size_t i, j, mindex;
 
-       *bptscp = bptsc;
-       *aptscp = tsc_sync_val;
+       for (i = 0; i < n - 1; i++) {
+               mindex = i;
+               for (j = i + 1; j < n; j++) {
+                       if (array[j] < array[mindex])
+                               mindex = j;
+               }
+               if (mindex != i) {
+                       tmp = array[i];
+                       array[i] = array[mindex];
+                       array[mindex] = tmp;
+               }
+       }
 }
 
-void
-tsc_sync_bp(struct cpu_info *ci)
+/*
+ * Compute skew between each element of ref and data, then write
+ * the skews to data.
+ *
+ * Then sort data, compute the median, and return it to the caller.
+ *
+ * The values in ref are not modified by this routine.
+ */
+static int64_t
+find_median_skew(const uint64_t *ref, uint64_t *data, size_t size)
 {
-       uint64_t bptsc, aptsc;
+       int64_t a, b;
+       size_t i, mid;
 
-       tsc_read_bp(ci, &bptsc, &aptsc); /* discarded - cache effects */
-       tsc_read_bp(ci, &bptsc, &aptsc);
+       /* The unsigned arithmetic isn't an issue here. */
+       for (i = 0; i < size; i++)
+               data[i] = ref[i] - data[i];
+
+       /*
+        * Find the median skew.  Note that selection_sort() treats
+        * these as signed elements for purposes of comparison.
+        */
+       selection_sort((int64_t *)data, size);
+       mid = size / 2;
+       a = (int64_t)data[mid - 1];
+       b = (int64_t)data[mid];
 
-       /* Compute final value to adjust for skew. */
-       ci->ci_tsc_skew = bptsc - aptsc;
+       return a / 2 + b / 2;
 }
 
-/*
- * Called during startup of AP, by the AP itself.  Interrupts are
- * disabled on entry.
- */
 void
-tsc_post_ap(struct cpu_info *ci)
+x86_64_ipi_tsc_compare(struct cpu_info *ci)
 {
-       uint64_t tsc;
+       uint32_t barrier_threshold;
+       unsigned int cpu, i, j;
+       int sync;
+
+       barrier_threshold = tsc_barrier_threshold;
+       cpu = CPU_INFO_UNIT(ci);
+
+       if (tsc_rounds == 0 && tsc_have_adjust_msr) {
+               if (rdmsr(MSR_TSC_ADJUST) != 0) {
+                       DPRINTF("cpu%u: zeroing TSC_ADJUST; was %lld\n",
+                           cpu, rdmsr(MSR_TSC_ADJUST));
+               }
+               wrmsr(MSR_TSC_ADJUST, 0);
+       }
 
-       /* Wait for go-ahead from primary. */
-       while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
-               membar_consumer();
-       tsc = (rdtsc_lfence() >> 1);
+#ifdef TSC_DESYNC
+       if (tsc_rounds == 0) {
+               if (tsc_have_adjust_msr)
+                       wrmsr(MSR_TSC_ADJUST, arc4random());
+               else
+                       wrmsr_safe(MSR_TSC, rdtsc() + arc4random());
+       }
+#endif
+
+       /*
+        * Mitigate cache effects on measurement?
+        *
+        * XXX I have no idea what this does.
+        */
+       wbinvd();
+
+       for (i = 0; i < nitems(tsc_data[cpu]); i++) {
+               atomic_inc_int(&tsc_read_barrier[i]);
+               while (tsc_read_barrier[i] != barrier_threshold)
+                       membar_consumer();
+               tsc_data[cpu][i] = rdtsc_lfence();
+       }
 
-       /* Instruct primary to read its counter. */
-       atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-       tsc += (rdtsc_lfence() >> 1);
+       atomic_inc_int(&tsc_compare_barrier);
+       while (tsc_compare_barrier != barrier_threshold)
+               membar_consumer();
 
-       /* Post result.  Ensure the whole value goes out atomically. */
-       (void)atomic_swap_64(&tsc_sync_val, tsc);
+       sync = 1;
+       for (i = 1; i < nitems(tsc_data[cpu]); i++) {
+               for (j = 0; j < nitems(tsc_data); j++) {
+                       if (tsc_data[cpu][i] <= tsc_data[j][i - 1]) {
+                               sync = 0;
+                               break;
+                       }
+               }
+               if (!sync)
+                       break;
+       }
+       if (!sync) {
+               tsc_sync = 0;   /* global, everyone can see this. */
+               DPRINTF("cpu%u[%u] lags cpu%u[%u]: %llu <= %llu\n",
+                   cpu, i, j, i - 1, tsc_data[cpu][i], tsc_data[j][i - 1]);
+       }
 
-       if (atomic_swap_ptr(&tsc_sync_cpu, NULL) != ci)
-               panic("tsc_sync_ap");
+       atomic_inc_int(&tsc_leave_barrier);
+       while (tsc_leave_barrier != barrier_threshold)
+               membar_consumer();
 }
 
 void
-tsc_sync_ap(struct cpu_info *ci)
+x86_64_ipi_tsc_adjust(struct cpu_info *ci)
 {
-       tsc_post_ap(ci);
-       tsc_post_ap(ci);
+       int64_t median;
+       uint32_t barrier_threshold;
+       unsigned int cpu, ref;
+       int failed;
+
+       KASSERT(!CPU_IS_PRIMARY(ci));
+
+       barrier_threshold = tsc_barrier_threshold;
+       cpu = CPU_INFO_UNIT(ci);
+
+       /*
+        * Compute median skew from BSP.  Note that this clobbers the
+        * counts we recorded in tsc_data[cpu] earlier.
+        *
+        * XXX Assume BSP is id 0.
+        *
+        * XXX BSP is not necessarily the best reference.  On NUMA
+        * systems we should sync CPUs within a package to another
+        * CPU in the same package to avoid latency error.
+        */
+       ref = 0;
+       median = find_median_skew(tsc_data[ref], tsc_data[cpu],
+           nitems(tsc_data[ref]));
+       DPRINTF("cpu%u: min=%lld max=%lld med=%lld\n",
+           cpu, tsc_data[cpu][0], tsc_data[cpu][nitems(tsc_data[cpu]) - 1],
+           median);
+
+       /*
+        * Try to adjust the local TSC accordingly.
+        * 
+        * Writing the TSC MSR is not necessarily allowed, hence
+        * wrmsr_safe().
+        *
+        * Prefer MSR_TSC_ADJUST if we have it.  The write is atomic,
+        * so we avoid the race between the RDTSC and the WRMSR wherein
+        * we could be preempted by a hypervisor, SMM code, etc.
+        *
+        * XXX Maybe we shouldn't try to adjust if the median skew is
+        * within a certain margin?
+        */
+       if (tsc_have_adjust_msr) {
+               failed = 0;
+               wrmsr(MSR_TSC_ADJUST, rdmsr(MSR_TSC_ADJUST) + median);
+       } else {
+               failed = wrmsr_safe(MSR_TSC, rdtsc_lfence() + median);
+               DPRINTF("cpu%u: wrmsr %s\n", cpu, failed ? "failed" : "ok");
+       }
+
+       atomic_inc_int(&tsc_leave_barrier);
+       while (tsc_leave_barrier != barrier_threshold)
+               membar_consumer();
 }
 
 void
-tsc_delay(int usecs)
+tsc_check_sync(void)
 {
-       uint64_t interval, start;
+       struct cpu_info *ci = curcpu();
+       unsigned int i, j;
 
-       interval = (uint64_t)usecs * tsc_frequency / 1000000;
-       start = rdtsc_lfence();
-       while (rdtsc_lfence() - start < interval)
-               CPU_BUSY_CYCLE();
+       KASSERT(CPU_IS_PRIMARY(ci));
+
+       if (!tsc_is_invariant)
+               return;
+
+       tsc_barrier_threshold = ncpus;  /* XXX can we assume this? */
+       if (ISSET(ci->ci_feature_sefflags_ebx, SEFF0EBX_TSC_ADJUST))
+               tsc_have_adjust_msr = 1;
+
+       /*
+        * Take measurements.  If we aren't synchronized, try to
+        * adjust everyone.  Then check again.
+        */
+       for (i = 0; i < TSC_ADJUST_ROUNDS + 1; i++) {
+               /* Reset all barriers. */
+               for (j = 0; j < nitems(tsc_read_barrier); j++)
+                       tsc_read_barrier[j] = 0;
+               tsc_compare_barrier = tsc_leave_barrier = 0;
+
+               /* Measure and compare. */
+               DPRINTF("testing synchronization...\n");
+               tsc_rounds = i;
+               tsc_sync = 1;
+               x86_broadcast_ipi(X86_IPI_TSC_CMP);
+               x86_send_ipi(ci, X86_IPI_TSC_CMP);
+               if (tsc_sync)
+                       break;
+
+               /* Tell everyone to sync up. */
+               if (i < TSC_ADJUST_ROUNDS) {
+                       DPRINTF("not synchronized; adjusting...\n");
+                       tsc_leave_barrier = 0;
+                       x86_broadcast_ipi(X86_IPI_TSC_ADJ);
+                       atomic_inc_int(&tsc_leave_barrier);
+                       while (tsc_leave_barrier != tsc_barrier_threshold)
+                               membar_consumer();
+               }
+       }
+
+#ifdef TSC_DEBUG
+       /* Manually print the stats from the final comparison. */
+       for (i = 1; i < ncpus; i++) {
+               int64_t median = find_median_skew(tsc_data[0], tsc_data[i],
+                   nitems(tsc_data[0]));
+               DPRINTF("cpu%u: min=%lld max=%lld med=%lld\n",
+                   i, tsc_data[i][0], tsc_data[i][nitems(tsc_data[i]) - 1],
+                   median);
+       }
+#endif
+
+       if (tsc_sync) {
+               if (tsc_rounds > 0) {
+                       DPRINTF("synchronized after %u adjustment rounds\n",
+                           tsc_rounds);
+               } else
+                       DPRINTF("already synchronized\n");
+               tc_init(&tsc_timecounter);
+       } else
+               DPRINTF("cannot use timecounter: not synchronized\n");
+
+       /*
+        * We can use the TSC for delay(9) even if they aren't sync'd.
+        * The only thing that matters is a constant frequency.
+        */
+       delay_func = tsc_delay;
 }
+
+#endif /* MULTIPROCESSOR */
Index: sys/arch/amd64/amd64/locore.S
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/locore.S,v
retrieving revision 1.122
diff -u -p -r1.122 locore.S
--- sys/arch/amd64/amd64/locore.S       3 Nov 2020 18:19:31 -0000       1.122
+++ sys/arch/amd64/amd64/locore.S       8 Apr 2021 04:37:54 -0000
@@ -1154,6 +1154,30 @@ NENTRY(rdmsr_resume)
        ret
 END(rdmsr_safe)
 
+/* int wrmsr_safe(u_int32_t msr, uint64_t val) */
+ENTRY(wrmsr_safe)
+       RETGUARD_SETUP(wrmsr_safe, r10)
+
+       movl    %edi,   %ecx    /* uint32_t msr */
+
+       movl    %esi,   %eax    /* uint64_t val */
+       sarq    $32,    %rsi
+       movl    %esi,   %edx
+
+       .globl  wrmsr_safe_fault
+wrmsr_safe_fault:
+       wrmsr
+
+       xorq    %rax,   %rax
+       RETGUARD_CHECK(wrmsr_safe, r10)
+       ret
+
+NENTRY(wrmsr_resume)
+       movq    $1,     %rax
+       RETGUARD_CHECK(wrmsr_safe, r10)
+       ret
+END(wrmsr_safe)
+
 #if NXEN > 0
        /* Hypercall page needs to be page aligned */
        .text
Index: sys/arch/amd64/amd64/vector.S
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/vector.S,v
retrieving revision 1.84
diff -u -p -r1.84 vector.S
--- sys/arch/amd64/amd64/vector.S       13 Nov 2020 05:32:08 -0000      1.84
+++ sys/arch/amd64/amd64/vector.S       8 Apr 2021 04:37:54 -0000
@@ -267,7 +267,7 @@ IDTVEC(trap0c)
  * to handle:
  *  - trapping in iretq to userspace and
  *  - trapping in xrstor in the kernel.
- *  - trapping when invalid MSRs are read in rdmsr_safe
+ *  - trapping when invalid MSRs are read in rdmsr_safe, etc.
  * We detect these by examining the %rip in the iretq_frame.
  * Handling them is done by updating %rip in the iretq_frame to point
  * to a stub handler of some sort and then iretq'ing to it.  For the
@@ -291,6 +291,9 @@ IDTVEC(trap0d)
        leaq    rdmsr_safe_fault(%rip),%rcx
        cmpq    %rcx,%rdx
        je      .Lhandle_rdmsr_safe
+       leaq    wrmsr_safe_fault(%rip),%rcx
+       cmpq    %rcx,%rdx
+       je      .Lhandle_wrmsr_safe
        popq    %rcx
        popq    %rdx
        TRAP(T_PROTFLT)
@@ -298,6 +301,11 @@ IDTVEC(trap0d)
 .Lhandle_rdmsr_safe:
        /* rdmsr faulted; just resume in rdmsr_resume */
        leaq    rdmsr_resume(%rip),%rcx
+       jmp     1f
+
+.Lhandle_wrmsr_safe:
+       /* wrmsr faulted; just resume in wrmsr_resume */
+       leaq    wrmsr_resume(%rip),%rcx
        jmp     1f
 
 .Lhandle_xrstor:
Index: sys/arch/amd64/include/cpufunc.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
retrieving revision 1.36
diff -u -p -r1.36 cpufunc.h
--- sys/arch/amd64/include/cpufunc.h    13 Sep 2020 11:53:16 -0000      1.36
+++ sys/arch/amd64/include/cpufunc.h    8 Apr 2021 04:37:54 -0000
@@ -398,6 +398,7 @@ struct cpu_info_full;
 void cpu_enter_pages(struct cpu_info_full *);
 
 int rdmsr_safe(u_int msr, uint64_t *);
+int wrmsr_safe(uint32_t, uint64_t);
 
 #endif /* _KERNEL */
 
Index: sys/arch/amd64/include/intrdefs.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/intrdefs.h,v
retrieving revision 1.21
diff -u -p -r1.21 intrdefs.h
--- sys/arch/amd64/include/intrdefs.h   13 Sep 2020 11:53:16 -0000      1.21
+++ sys/arch/amd64/include/intrdefs.h   8 Apr 2021 04:37:54 -0000
@@ -83,8 +83,10 @@
 #define X86_IPI_START_VMM              0x00000100
 #define X86_IPI_STOP_VMM               0x00000200
 #define X86_IPI_WBINVD                 0x00000400
+#define X86_IPI_TSC_CMP                        0x00000800
+#define X86_IPI_TSC_ADJ                        0x00001000
 
-#define X86_NIPI                       11
+#define X86_NIPI                       13
 
 #define IREENT_MAGIC   0x18041969
 
Index: sys/arch/amd64/include/specialreg.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/specialreg.h,v
retrieving revision 1.89
diff -u -p -r1.89 specialreg.h
--- sys/arch/amd64/include/specialreg.h 29 Mar 2021 12:39:02 -0000      1.89
+++ sys/arch/amd64/include/specialreg.h 8 Apr 2021 04:37:54 -0000
@@ -352,6 +352,7 @@
 #define MSR_EBC_FREQUENCY_ID    0x02c   /* Pentium 4 only */
 #define        MSR_TEST_CTL            0x033
 #define MSR_IA32_FEATURE_CONTROL 0x03a
+#define MSR_TSC_ADJUST         0x03b
 #define MSR_SPEC_CTRL          0x048   /* Speculation Control IBRS / STIBP */
 #define SPEC_CTRL_IBRS         (1ULL << 0)
 #define SPEC_CTRL_STIBP                (1ULL << 1)
