Hi,

sthen@ complained recently about a multisocket system not being able
to use the TSC in userspace because the sync test measured too much
skew and disabled it.

I don't think there is any real skew on that system.  I think the sync
test is mistaking NUMA overhead for skew and issuing a false positive
result.

Now, we _could_ change the test to account for NUMA overhead.  I don't
know exactly how you would do it, but I imagine you could devise a
handshake to compute an estimate and then factor that into your skew
measurement.

Another approach is to drop the current skew measurement handshake in
favor of a dumber approach without that particular false positive case.

This patch switches our sync test to that dumber approach.  Instead of
trying to measure and correct for skew we only test for lag, and if we
detect any we do not attempt to correct it.

Two CPUs enter a loop and continuously check whether the other CPU's
TSC lags their own.  With this approach the false positive we are
seeing on sthen@'s machine is impossible because we are only checking
whether one value lags the other, not whether their difference exceeds
an arbitrary value.  Synchronization is tested to within a margin of
error because both CPUs are checking for lag at the same time.
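
In rough outline each CPU runs something like this (simplified; the
real thing is tsc_test_sync() in the patch below):

	end = rdtsc_lfence() + cycles_per_round;
	while (my_tsc < end) {
		their_tsc = *their_latest;	/* published by the other CPU */
		my_tsc = rdtsc_lfence();
		*my_latest = my_tsc;		/* publish our reading */
		if (their_tsc >= my_tsc) {
			/* Our TSC lags theirs; the test fails. */
			lag = their_tsc - my_tsc;
			...
		}
	}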

To keep the margin of error as small as possible for a given clock
rate we spin for a relatively long time.  Right now we spin for 1
millisecond per test round.  This is arbitrary but it seems to work.
There is a point of diminishing returns for round duration.  This
sync test approach takes much more time than the current handshake
approach and I'd like to minimize our impact on boot time.
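
For a concrete (hypothetical) example, on a 2GHz part the patch's

	tsc_sync_cycles = TSC_SYNC_MS * tsc_frequency / 1000
	                = 1 * 2,000,000,000 / 1000
	                = 2,000,000 cycles per round

so the default two rounds amount to roughly 2ms of spinning each time
the test runs.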

To actually shrink the margin of error you need to run the CPUs at the
highest possible clock rate.  If they are underclocked due to e.g.
SpeedStep your margin of error grows and the test may fail to detect
lag.

We do two rounds of testing for each CPU.  This is arbitrary.  You
could do more.  I think at least one additional test round is a good
idea, just in case we "get lucky" in the first round.  I think this
could help mitigate measurement problems introduced by factors beyond
our control.  For example, if one of the CPUs blacks out for the
duration of the test because it is preempted by the hypervisor the
test will pass but the result is bogus.  If we do more test rounds we
have a better chance of getting a meaningful result even if we get
unlucky with hypervisor preemption or something similar.

Misc. notes:

- We no longer have a per-CPU skew value.  If any TSC lags another we
  just set the timecounter quality to a negative number.  We don't
  disable the TSC for userspace or attempt to correct the skew in the
  kernel.

  I think that bears repeating: we are no longer correcting the skew.
  If we detect any lag the TSC is effectively broken.

- There is no longer any concept of TSC drift.  Drift was computed
  from changes in skew, and we no longer have a skew value.

- Am I allowed to printf(9) from a secondary CPU during boot?  It
  seems to hang the kernel but I don't know why.

- I have no idea how to use the volatile keyword correctly.  I am
  trying to keep the compiler from optimizing away my stores.  I don't
  think it implicitly understands that two threads are looking at these
  variables at the same time.

  If I am abusing it please tell me.  I'm trying to avoid a situation
  where some later compiler change subtly breaks the test.  If I
  sprinkle volatile everywhere my impression is that it forces the
  compiler to actually do the store.
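
  Here is a contrived example (not from the patch) of the kind of
  optimization I'm trying to rule out:

	extern uint64_t other_tsc;	/* written by the other CPU */

	while (other_tsc <= my_tsc)
		;			/* load may be hoisted out of the loop */

  Nothing inside that loop writes other_tsc, so as far as the compiler
  can tell the value never changes and it may read it only once.
  Declaring it "volatile uint64_t other_tsc" forces a fresh load on
  every iteration.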

- I have aligned the global TSC values to 64 bytes to try to minimize
  "cache bounce".  Each value has one reader and one writer so if the
  two values are on different cache lines a write to one value shouldn't
  cause a cache miss for the writer of the other value.

  ... right?

  I'm not an expert on cache stuff.  Can someone shed light on whether
  I am doing the right thing here?
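
  Concretely, the idea is:

	uint64_t tsc_ap_val __aligned(64);	/* one writer: the AP */
	uint64_t tsc_bp_val __aligned(64);	/* one writer: the BP */

  Assuming 64-byte cache lines, each variable starts on its own line,
  so a store to one of them shouldn't invalidate the line holding the
  other.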

- I rolled my own thread barriers for the test.  Would a generic
  thread barrier be useful elsewhere in the kernel?  It feels wrong to
  roll my own synchronization primitive, but maybe we don't actually
  need one anywhere else?
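
  If a generic version were wanted it might look something like this
  (sketch only, names made up):

	struct spin_barrier {
		volatile u_int	sb_count;	/* arrivals so far */
		u_int		sb_parties;	/* threads per cycle */
	};

	void
	spin_barrier_wait(struct spin_barrier *sb)
	{
		atomic_inc_int(&sb->sb_count);
		while (sb->sb_count < sb->sb_parties)
			CPU_BUSY_CYCLE();
	}

  Reuse would need an explicit reset after everyone has passed through,
  which is what the patch does by hand with the ingress and egress
  barriers.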

- I would like to forcibly reset IA32_TSC_ADJUST before running the
  test but I don't think the CPU feature flags are set early enough
  for tsc_reset_adjust() to see the relevant flag even if the CPU
  has that register.

  Could we initialize the flags earlier in boot, before the sync test?

Testing notes:

- Tests on multisocket systems, multiprocessor VMs on various hypervisors,
  and on systems where the TSC is currently disabled by the kernel are
  appreciated!  Just send your dmesg.

- TSC_DEBUG is set for now to force the test to run on all CPUs even if
  a prior test has already failed.  This is useful for testing but in
  practice if we fail one test there is no point in running further
  tests.

  I'm actually unsure about this.  If we take the time to run the test
  on every processor it might give us a better "picture" of what is
  wrong in a bug report.

- I left the TSC_FORCE_DESYNC preprocessor block in there in case you
  would like to try forcibly desynchronizing the TSC on your AP to see
  what the results look like in the console buffer.  It relies on the
  IA32_TSC_ADJUST register, so if you don't have one you'll get a #GP
  fault.  Proceed at your own risk.

Thoughts?  Test results?

Index: amd64/tsc.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.24
diff -u -p -r1.24 tsc.c
--- amd64/tsc.c 31 Aug 2021 15:11:54 -0000      1.24
+++ amd64/tsc.c 27 Jan 2022 16:27:49 -0000
@@ -36,13 +36,6 @@ int          tsc_recalibrate;
 uint64_t       tsc_frequency;
 int            tsc_is_invariant;
 
-#define        TSC_DRIFT_MAX                   250
-#define TSC_SKEW_MAX                   100
-int64_t        tsc_drift_observed;
-
-volatile int64_t       tsc_sync_val;
-volatile struct cpu_info       *tsc_sync_cpu;
-
 u_int          tsc_get_timecount(struct timecounter *tc);
 void           tsc_delay(int usecs);
 
@@ -236,22 +229,12 @@ cpu_recalibrate_tsc(struct timecounter *
 u_int
 tsc_get_timecount(struct timecounter *tc)
 {
-       return rdtsc_lfence() + curcpu()->ci_tsc_skew;
+       return rdtsc_lfence();
 }
 
 void
 tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
 {
-#ifdef TSC_DEBUG
-       printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
-           (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
-#endif
-       if (ci->ci_tsc_skew < -TSC_SKEW_MAX || ci->ci_tsc_skew > TSC_SKEW_MAX) {
-               printf("%s: disabling user TSC (skew=%lld)\n",
-                   ci->ci_dev->dv_xname, (long long)ci->ci_tsc_skew);
-               tsc_timecounter.tc_user = 0;
-       }
-
        if (!(ci->ci_flags & CPUF_PRIMARY) ||
            !(ci->ci_flags & CPUF_CONST_TSC) ||
            !(ci->ci_flags & CPUF_INVAR_TSC))
@@ -268,111 +251,237 @@ tsc_timecounter_init(struct cpu_info *ci
                calibrate_tsc_freq();
        }
 
-       if (tsc_drift_observed > TSC_DRIFT_MAX) {
-               printf("ERROR: %lld cycle TSC drift observed\n",
-                   (long long)tsc_drift_observed);
-               tsc_timecounter.tc_quality = -1000;
-               tsc_timecounter.tc_user = 0;
-               tsc_is_invariant = 0;
-       }
-
        tc_init(&tsc_timecounter);
 }
 
-/*
- * Record drift (in clock cycles).  Called during AP startup.
- */
 void
-tsc_sync_drift(int64_t drift)
+tsc_delay(int usecs)
 {
-       if (drift < 0)
-               drift = -drift;
-       if (drift > tsc_drift_observed)
-               tsc_drift_observed = drift;
+       uint64_t interval, start;
+
+       interval = (uint64_t)usecs * tsc_frequency / 1000000;
+       start = rdtsc_lfence();
+       while (rdtsc_lfence() - start < interval)
+               CPU_BUSY_CYCLE();
 }
 
-/*
- * Called during startup of APs, by the boot processor.  Interrupts
- * are disabled on entry.
- */
-void
-tsc_read_bp(struct cpu_info *ci, uint64_t *bptscp, uint64_t *aptscp)
-{
-       uint64_t bptsc;
+#ifdef MULTIPROCESSOR
 
-       if (atomic_swap_ptr(&tsc_sync_cpu, ci) != NULL)
-               panic("tsc_sync_bp: 1");
+#define TSC_DEBUG 1
 
-       /* Flag it and read our TSC. */
-       atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-       bptsc = (rdtsc_lfence() >> 1);
+/*
+ * Protections for global variables in this code:
+ *
+ *     1       Only set from zero to one
+ *     a       Modified atomically
+ *     b       Protected by a barrier
+ *     p       Only modified by primary CPU
+ *     s       Only modified by secondary CPU
+ */
 
-       /* Wait for remote to complete, and read ours again. */
-       while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
-               membar_consumer();
-       bptsc += (rdtsc_lfence() >> 1);
+#define TSC_SYNC_MS            1       /* Test round duration */
+#define TSC_SYNC_ROUNDS                2       /* Number of test rounds */
 
-       /* Wait for the results to come in. */
-       while (tsc_sync_cpu == ci)
-               CPU_BUSY_CYCLE();
-       if (tsc_sync_cpu != NULL)
-               panic("tsc_sync_bp: 2");
+uint64_t tsc_ap_val __aligned(64);     /* [s] Latest TSC value from AP */
+uint64_t tsc_bp_val __aligned(64);     /* [p] Latest TSC value from BP */
+uint64_t tsc_sync_cycles;              /* [p] TSC cycles per test round */
+const char *tsc_ap_name;               /* [b] Name of AP */
+volatile uint64_t tsc_ap_lag;          /* [s] Biggest regression seen by AP */
+volatile u_int tsc_egress_barrier;     /* [a] Sync test end barrier */
+volatile u_int tsc_ingress_barrier;    /* [a] Sync test start barrier */
+volatile u_int tsc_sync_rounds;                /* [b] Remaining test rounds */
+volatile u_int tsc_lag_count;          /* [a] No. of regressions this round */
+int tsc_simulag;                       /* [1] Regressions seen on both CPUs? */
+int tsc_is_synchronized = 1;           /* [p] TSC sync'd across all CPUs? */
 
-       *bptscp = bptsc;
-       *aptscp = tsc_sync_val;
-}
+void tsc_reset_adjust(const struct cpu_info *);
+uint64_t tsc_test_sync(volatile const uint64_t *, uint64_t *);
 
 void
-tsc_sync_bp(struct cpu_info *ci)
+tsc_test_sync_bp(struct cpu_info *ci)
 {
-       uint64_t bptsc, aptsc;
+       uint64_t bp_lag;
+       unsigned int round;
+
+       /* TSC must be constant and invariant to use it as a timecounter. */
+       if (!tsc_is_invariant)
+               return;
+#ifndef TSC_DEBUG
+       /* No point in testing again if a prior test failed. */
+       if (!tsc_is_synchronized)
+               return;
+#endif
 
-       tsc_read_bp(ci, &bptsc, &aptsc); /* discarded - cache effects */
-       tsc_read_bp(ci, &bptsc, &aptsc);
+       /* Reset IA32_TSC_ADJUST if it exists. */
+       tsc_reset_adjust(ci);
 
-       /* Compute final value to adjust for skew. */
-       ci->ci_tsc_skew = bptsc - aptsc;
+       /* Reset the test cycle limit and round count. */
+       tsc_sync_cycles = TSC_SYNC_MS * tsc_frequency / 1000;
+       tsc_sync_rounds = TSC_SYNC_ROUNDS;
+
+       do {
+               /*
+                * Pass through the ingress barrier, run the test,
+                * then wait for the AP to reach the egress barrier.
+                */
+               atomic_inc_int(&tsc_ingress_barrier);
+               while (tsc_ingress_barrier != 2)
+                       CPU_BUSY_CYCLE();
+               bp_lag = tsc_test_sync(&tsc_ap_val, &tsc_bp_val);
+               while (tsc_egress_barrier != 1)
+                       CPU_BUSY_CYCLE();
+       
+               /*
+                * Report what went wrong, if anything.
+                */
+               if (tsc_lag_count != 0) {
+                       round = TSC_SYNC_ROUNDS - tsc_sync_rounds + 1;
+                       printf("tsc: cpu0/%s sync round %u: %u regressions\n",
+                           tsc_ap_name, round, tsc_lag_count);
+                       if (tsc_simulag) {
+                               printf("tsc: cpu0/%s sync round %u: "
+                                   "regressions seen by both CPUs\n",
+                                   tsc_ap_name, round);
+                       }
+                       printf("tsc: cpu0/%s sync round %u: "
+                           "cpu0 lags %s by %llu cycles\n",
+                           tsc_ap_name, round, tsc_ap_name, bp_lag);
+                       printf("tsc: cpu0/%s sync round %u: "
+                           "%s lags cpu0 by %llu cycles\n",
+                           tsc_ap_name, round, tsc_ap_name, tsc_ap_lag);
+                       /*
+                        * XXX We need a tc_detach() function to actually
+                        * disable a given timecounter.  Lowering the quality
+                        * like this is a nasty hack.
+                        */
+                       tsc_timecounter.tc_quality = -1000;
+                       tsc_is_synchronized = 0;
+                       tsc_sync_rounds = 0;
+               } else
+                       tsc_sync_rounds--;
+
+               /*
+                * Clean up for the next round.
+                *
+                * It is safe to reset the ingress barrier because
+                * at this point we know the AP has reached the egress
+                * barrier.
+                */
+               tsc_bp_val = 0;
+               tsc_ap_val = 0;
+               tsc_ap_lag = 0;
+               tsc_ingress_barrier = 0;
+               tsc_simulag = 0;
+               tsc_lag_count = 0;
+               if (tsc_sync_rounds == 0)
+                       tsc_ap_name = NULL;
+
+               /*
+                * Pass through the egress barrier and release the AP.
+                * The AP is responsible for resetting the barrier.
+                */
+               if (atomic_inc_int_nv(&tsc_egress_barrier) != 2)
+                       panic("%s: unexpected egress count", __func__);
+       } while (tsc_sync_rounds > 0);
 }
 
-/*
- * Called during startup of AP, by the AP itself.  Interrupts are
- * disabled on entry.
- */
 void
-tsc_post_ap(struct cpu_info *ci)
+tsc_test_sync_ap(struct cpu_info *ci)
 {
-       uint64_t tsc;
+       if (!tsc_is_invariant)
+               return;
+#ifndef TSC_DEBUG
+       if (!tsc_is_synchronized)
+               return;
+#endif
 
-       /* Wait for go-ahead from primary. */
-       while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
-               membar_consumer();
-       tsc = (rdtsc_lfence() >> 1);
+       tsc_reset_adjust(ci);
 
-       /* Instruct primary to read its counter. */
-       atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-       tsc += (rdtsc_lfence() >> 1);
+/* #define TSC_FORCE_DESYNC (150) */
+#ifdef TSC_FORCE_DESYNC
+       wrmsr(MSR_TSC_ADJUST, TSC_FORCE_DESYNC);
+#endif
 
-       /* Post result.  Ensure the whole value goes out atomically. */
-       (void)atomic_swap_64(&tsc_sync_val, tsc);
+       /* The BP needs our name in order to log any problems. */
+       tsc_ap_name = ci->ci_dev->dv_xname;
 
-       if (atomic_swap_ptr(&tsc_sync_cpu, NULL) != ci)
-               panic("tsc_sync_ap");
+       /*
+        * As the AP we are only responsible for running the test,
+        * reporting our lag, and resetting the egress barrier.
+        * The BP sets up and tears down everything else.
+        */
+       do {
+               atomic_inc_int(&tsc_ingress_barrier);
+               while (tsc_ingress_barrier != 2)
+                       CPU_BUSY_CYCLE();
+               tsc_ap_lag = tsc_test_sync(&tsc_bp_val, &tsc_ap_val);
+               atomic_inc_int(&tsc_egress_barrier);
+               while (tsc_egress_barrier != 2)
+                       CPU_BUSY_CYCLE();
+               tsc_egress_barrier = 0;
+       } while (tsc_sync_rounds > 0);
 }
 
 void
-tsc_sync_ap(struct cpu_info *ci)
+tsc_reset_adjust(const struct cpu_info *ci)
 {
-       tsc_post_ap(ci);
-       tsc_post_ap(ci);
+       int64_t adj;
+
+       if (ISSET(ci->ci_feature_sefflags_ebx, SEFF0EBX_TSC_ADJUST)) {
+               adj = rdmsr(MSR_TSC_ADJUST);
+               if (adj != 0) {
+                       wrmsr(MSR_TSC_ADJUST, 0);
+#if 0
+                       /* XXX APs can't printf during boot or resume? */
+                       printf("tsc: %s: IA32_TSC_ADJUST: %+lld -> 0",
+                           ci->ci_dev->dv_xname, (long long)adj);
+#endif
+               }
+       }
 }
 
-void
-tsc_delay(int usecs)
+uint64_t
+tsc_test_sync(volatile const uint64_t *reference, uint64_t *local)
 {
-       uint64_t interval, start;
+       uint64_t end, lag, max_lag, ref_latest;
+       u_int i, lag_count;
 
-       interval = (uint64_t)usecs * tsc_frequency / 1000000;
-       start = rdtsc_lfence();
-       while (rdtsc_lfence() - start < interval)
-               CPU_BUSY_CYCLE();
+       max_lag = 0;
+       i = lag_count = 0;
+
+       end = rdtsc_lfence() + tsc_sync_cycles;
+       while (*local < end) {
+               ref_latest = *reference;                /* atomic */
+               *local = rdtsc_lfence();
+
+               /*
+                * Ensure the other CPU has a chance to run.  This is
+                * crucial if the other CPU is our SMT sibling.  SMT
+                * starvation can prevent this test from detecting
+                * relatively large lags.  Eight is an arbitrary value
+                * but it works in practice without impacting the
+                * accuracy of the test for non-sibling threads.
+                */
+               if (++i % 8 == 0)
+                       CPU_BUSY_CYCLE();
+
+               /* Everything is fine if the count is strictly monotonic. */
+               if (ref_latest < *local)
+                       continue;
+
+               /*
+                * Otherwise, our TSC lags the other TSC.  Record the
+                * magnitude of the problem.
+                */
+               lag = ref_latest - *local;
+               if (max_lag < lag)
+                       max_lag = lag;
+               if (atomic_inc_int_nv(&tsc_lag_count) != lag_count + 1)
+                       tsc_simulag = 1;
+               lag_count = tsc_lag_count;              /* atomic */
+       }
+
+       return max_lag;
 }
+
+#endif /* MULTIPROCESSOR */
Index: amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.154
diff -u -p -r1.154 cpu.c
--- amd64/cpu.c 31 Aug 2021 17:40:59 -0000      1.154
+++ amd64/cpu.c 27 Jan 2022 16:27:49 -0000
@@ -762,9 +762,9 @@ cpu_init(struct cpu_info *ci)
        lcr4(cr4 & ~CR4_PGE);
        lcr4(cr4);
 
-       /* Synchronize TSC */
+       /* Check if our TSC is synchronized with the BP's TSC. */
        if (cold && !CPU_IS_PRIMARY(ci))
-             tsc_sync_ap(ci);
+             tsc_test_sync_ap(ci);
 #endif
 }
 
@@ -844,18 +844,14 @@ cpu_start_secondary(struct cpu_info *ci)
 #endif
        } else {
                /*
-                * Synchronize time stamp counters. Invalidate cache and
-                * synchronize twice (in tsc_sync_bp) to minimize possible
-                * cache effects. Disable interrupts to try and rule out any
-                * external interference.
+                * Check if our TSC is synchronized with the AP's TSC.
+                * Invalidate cache to minimize possible cache effects
+                * and disable interrupts to minimize test interference.
                 */
                s = intr_disable();
                wbinvd();
-               tsc_sync_bp(ci);
+               tsc_test_sync_bp(ci);
                intr_restore(s);
-#ifdef TSC_DEBUG
-               printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
-#endif
        }
 
        if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
@@ -880,7 +876,6 @@ void
 cpu_boot_secondary(struct cpu_info *ci)
 {
        int i;
-       int64_t drift;
        u_long s;
 
        atomic_setbits_int(&ci->ci_flags, CPUF_GO);
@@ -895,18 +890,14 @@ cpu_boot_secondary(struct cpu_info *ci)
                db_enter();
 #endif
        } else if (cold) {
-               /* Synchronize TSC again, check for drift. */
-               drift = ci->ci_tsc_skew;
+               /*
+                * Test TSC synchronization again.  Something may
+                * have screwed it up while we were down.
+                */
                s = intr_disable();
                wbinvd();
-               tsc_sync_bp(ci);
+               tsc_test_sync_bp(ci);
                intr_restore(s);
-               drift -= ci->ci_tsc_skew;
-#ifdef TSC_DEBUG
-               printf("TSC skew=%lld drift=%lld\n",
-                   (long long)ci->ci_tsc_skew, (long long)drift);
-#endif
-               tsc_sync_drift(drift);
        }
 }
 
@@ -932,16 +923,17 @@ cpu_hatch(void *v)
 #endif
 
        /*
-        * Synchronize the TSC for the first time. Note that interrupts are
-        * off at this point.
+        * Test if our TSC is synchronized with the BP's TSC.
+        * Note that interrupts are off at this point.  Invalidate
+        * cache to minimize cache effects on the test.
         */
        wbinvd();
        ci->ci_flags |= CPUF_PRESENT;
-       ci->ci_tsc_skew = 0;    /* reset on resume */
-       tsc_sync_ap(ci);
+       tsc_test_sync_ap(ci);
 
        lapic_enable();
        lapic_startclock();
+
        cpu_ucode_apply(ci);
        cpu_tsx_disable(ci);
 
Index: include/cpu.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.141
diff -u -p -r1.141 cpu.h
--- include/cpu.h       31 Aug 2021 17:40:59 -0000      1.141
+++ include/cpu.h       27 Jan 2022 16:27:49 -0000
@@ -209,8 +209,6 @@ struct cpu_info {
        paddr_t         ci_vmxon_region_pa;
        struct vmxon_region *ci_vmxon_region;
 
-       int64_t         ci_tsc_skew;            /* counter skew vs cpu0 */
-
        char            ci_panicbuf[512];
 
        paddr_t         ci_vmcs_pa;
@@ -230,7 +228,6 @@ struct cpu_info {
 #define CPUF_INVAR_TSC 0x0100          /* CPU has invariant TSC */
 #define CPUF_USERXSTATE        0x0200          /* CPU has curproc's xsave state */
 
-#define CPUF_SYNCTSC   0x0800          /* Synchronize TSC */
 #define CPUF_PRESENT   0x1000          /* CPU is present */
 #define CPUF_RUNNING   0x2000          /* CPU is running */
 #define CPUF_PAUSE     0x4000          /* CPU is paused in DDB */
Index: include/cpuvar.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
retrieving revision 1.11
diff -u -p -r1.11 cpuvar.h
--- include/cpuvar.h    16 May 2021 04:33:05 -0000      1.11
+++ include/cpuvar.h    27 Jan 2022 16:27:49 -0000
@@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
 void cpu_init(struct cpu_info *);
 void cpu_init_first(void);
 
-void tsc_sync_drift(int64_t);
-void tsc_sync_bp(struct cpu_info *);
-void tsc_sync_ap(struct cpu_info *);
+void tsc_test_sync_bp(struct cpu_info *);
+void tsc_test_sync_ap(struct cpu_info *);
 
 #endif
