This definitely wants testing on Ryzen ThinkPads (e.g. E485/E585/X395/T495s)
or the Inspiron 5505; I see user TSC disabled on a lot of those in dmesg logs.
On 2022/01/27 10:28, Scott Cheloha wrote:
> Hi,
>
> sthen@ complained recently about a multisocket system not being able
> to use the TSC in userspace because the sync test measured too much
> skew and disabled it.
>
> I don't think there is any real skew on that system. I think the sync
> test is mistaking NUMA overhead for skew and issuing a false positive
> result.
>
> Now, we _could_ change the test to account for NUMA overhead. I don't
> know exactly how you would do it, but I imagine you could devise a
> handshake to compute an estimate and then factor that into your skew
> measurement.
>
> Another approach is to drop the current skew measurement handshake in
> favor of a dumber approach without that particular false positive case.
>
> This patch changes our sync test to a dumber approach. Instead of
> trying to measure and correct for skew we only test for lag and do
> not attempt to correct for it if we detect it.
>
> Two CPUs enter a loop and continuously check whether the other CPU's
> TSC lags their own. With this approach the false positive we are
> seeing on sthen@'s machine is impossible because we are only checking
> whether one value lags the other, not whether their difference exceeds
> an arbitrary value. Synchronization is tested to within a margin of
> error because both CPUs are checking for lag at the same time.
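>
> In rough pseudo-C the per-CPU loop looks like this (a simplified
> sketch of the idea; record_lag() is just a stand-in for the
> bookkeeping done by tsc_test_sync() in the diff below):
>
>         end = rdtsc_lfence() + tsc_sync_cycles;
>         while (*local < end) {
>                 ref_latest = *reference;        /* other CPU's latest TSC */
>                 *local = rdtsc_lfence();        /* publish our latest TSC */
>                 if (ref_latest >= *local)       /* our TSC lags theirs */
>                         record_lag(ref_latest - *local);
>         }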
>
> To keep the margin of error as small as possible for a given clock
> rate we spin for a relatively long time. Right now we spin for 1
> millisecond per test round. This is arbitrary but it seems to work.
> There is a point of diminishing returns for round duration. This
> sync test approach takes much more time than the current handshake
> approach and I'd like to minimize our impact on boot time.
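>
> (For scale: the round length in cycles is just TSC_SYNC_MS converted
> at the TSC frequency, i.e. tsc_sync_cycles = TSC_SYNC_MS *
> tsc_frequency / 1000, so 1 millisecond on, say, a 2.0 GHz TSC is
> about 2,000,000 cycles per round.)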
>
> To actually shrink the margin of error you need to run the CPUs at the
> highest possible clock rate. If they are underclocked due to e.g.
> SpeedStep your margin of error grows and the test may fail to detect
> lag.
>
> We do two rounds of testing for each CPU. This is arbitrary. You
> could do more. I think at least one additional test round is a good
> idea, just in case we "get lucky" in the first round. I think this
> could help mitigate measurement problems introduced by factors beyond
> our control. For example, if one of the CPUs blacks out for the
> duration of the test because it is preempted by the hypervisor, the
> test will pass but the result is bogus. If we do more test rounds we
> have a better chance of getting a meaningful result even if we get
> unlucky with hypervisor preemption or something similar.
>
> Misc. notes:
>
> - We no longer have a per-CPU skew value. If any TSC lags another we
> just set the timecounter quality to a negative number. We don't
> disable the TSC in userspace or attempt to correct the skew in the kernel.
>
> I think that bears repeating: we are no longer correcting the skew.
> If we detect any lag the TSC is effectively broken.
>
> - There is no longer any concept of TSC drift. Drift is computed
> from skew change and we have no skew value anymore.
>
> - Am I allowed to printf(9) from a secondary CPU during boot? It
> seems to hang the kernel but I don't know why.
>
> - I have no idea how to use the volatile keyword correctly. I am
> trying to keep the compiler from optimizing away my stores. I don't
> think it implicitly understands that two threads are looking at these
> variables at the same time.
>
> If I am abusing it please tell me. I'm trying to avoid a situation
> where some later compiler change subtly breaks the test. If I
> sprinkle volatile everywhere my impression is that it forces the
> compiler to actually do the store.
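>
> For example, the barrier spins only terminate if the compiler reloads
> the variable on every iteration; without volatile it could legally
> read it once and spin on a cached value. A sketch of the concern,
> using the declarations from the patch:
>
>         volatile u_int tsc_ingress_barrier;
>
>         /* Must re-read tsc_ingress_barrier on every iteration. */
>         while (tsc_ingress_barrier != 2)
>                 CPU_BUSY_CYCLE();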
>
> - I have aligned the global TSC values to 64 bytes to try to minimize
> "cache bounce". Each value has one reader and one writer so if the
> two values are on different cache lines a write to one value shouldn't
> cause a cache miss for the writer of the other value.
>
> ... right?
>
> I'm not an expert on cache stuff. Can someone shed light on whether
> I am doing the right thing here?
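>
> For reference, the declarations in question; the intent is that each
> value starts its own 64-byte cache line, so stores to one should not
> invalidate the line holding the other:
>
>         uint64_t tsc_ap_val __aligned(64);      /* written by AP, read by BP */
>         uint64_t tsc_bp_val __aligned(64);      /* written by BP, read by AP */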
>
> - I rolled my own thread barriers for the test. Would a generic
> thread barrier be useful elsewhere in the kernel? Feels wrong to roll
> my own synchronization primitive, but maybe we actually don't need
> one anywhere else?
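>
> (Each barrier is just a shared counter: both CPUs increment it and
> then spin until both have arrived. Simplified from the patch:
>
>         atomic_inc_int(&tsc_ingress_barrier);
>         while (tsc_ingress_barrier != 2)
>                 CPU_BUSY_CYCLE();
>
> with the counter reset to zero by one side before the next round.)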
>
> - I would like to forcibly reset IA32_TSC_ADJUST before running the
> test but I don't think the CPU feature flags are set early enough
> for tsc_reset_adjust() to see the relevant flag even if the CPU
> has that register.
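>
> (The reset itself is simple once the flag is visible; this is what
> tsc_reset_adjust() in the diff below does, slightly simplified:
>
>         if (ISSET(ci->ci_feature_sefflags_ebx, SEFF0EBX_TSC_ADJUST)) {
>                 if (rdmsr(MSR_TSC_ADJUST) != 0)
>                         wrmsr(MSR_TSC_ADJUST, 0);
>         }
>
> The problem is only that ci_feature_sefflags_ebx may not be populated
> yet when the sync test runs.)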
>
> Could we initialize the flags earlier in boot, before the sync test?
>
> Testing notes:
>
> - Tests on multisocket systems, multiprocessor VMs on various hypervisors,
> and on systems where the TSC is currently disabled by the kernel are
> appreciated! Just send your dmesg.
>
> - TSC_DEBUG is set for now to force the test to run on all CPUs even if
> a prior test has already failed. This is useful for testing but in
> practice if we fail one test there is no point in running further
> tests.
>
> I'm actually unsure about this. If we take the time to run the test
> on every processor it might give us a better "picture" of what is
> wrong in a bug report.
>
> - I left the TSC_FORCE_DESYNC preprocessor clause in there if you
> would like to try forcibly desynchronizing the TSC on your AP to see
> what the results look like in the console buffer. It relies on the
> IA32_TSC_ADJUST register, so if you don't have one you'll get a #GP
> fault. Proceed at your own risk.
>
> Thoughts? Test results?
Index: amd64/tsc.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.24
diff -u -p -r1.24 tsc.c
--- amd64/tsc.c 31 Aug 2021 15:11:54 -0000 1.24
+++ amd64/tsc.c 27 Jan 2022 16:27:49 -0000
@@ -36,13 +36,6 @@ int tsc_recalibrate;
uint64_t tsc_frequency;
int tsc_is_invariant;
-#define TSC_DRIFT_MAX 250
-#define TSC_SKEW_MAX 100
-int64_t tsc_drift_observed;
-
-volatile int64_t tsc_sync_val;
-volatile struct cpu_info *tsc_sync_cpu;
-
u_int tsc_get_timecount(struct timecounter *tc);
void tsc_delay(int usecs);
@@ -236,22 +229,12 @@ cpu_recalibrate_tsc(struct timecounter *
u_int
tsc_get_timecount(struct timecounter *tc)
{
- return rdtsc_lfence() + curcpu()->ci_tsc_skew;
+ return rdtsc_lfence();
}
void
tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
{
-#ifdef TSC_DEBUG
- printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
- (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
-#endif
- if (ci->ci_tsc_skew < -TSC_SKEW_MAX || ci->ci_tsc_skew > TSC_SKEW_MAX) {
- printf("%s: disabling user TSC (skew=%lld)\n",
- ci->ci_dev->dv_xname, (long long)ci->ci_tsc_skew);
- tsc_timecounter.tc_user = 0;
- }
-
if (!(ci->ci_flags & CPUF_PRIMARY) ||
!(ci->ci_flags & CPUF_CONST_TSC) ||
!(ci->ci_flags & CPUF_INVAR_TSC))
@@ -268,111 +251,237 @@ tsc_timecounter_init(struct cpu_info *ci
calibrate_tsc_freq();
}
- if (tsc_drift_observed > TSC_DRIFT_MAX) {
- printf("ERROR: %lld cycle TSC drift observed\n",
- (long long)tsc_drift_observed);
- tsc_timecounter.tc_quality = -1000;
- tsc_timecounter.tc_user = 0;
- tsc_is_invariant = 0;
- }
-
tc_init(&tsc_timecounter);
}
-/*
- * Record drift (in clock cycles). Called during AP startup.
- */
void
-tsc_sync_drift(int64_t drift)
+tsc_delay(int usecs)
{
- if (drift < 0)
- drift = -drift;
- if (drift > tsc_drift_observed)
- tsc_drift_observed = drift;
+ uint64_t interval, start;
+
+ interval = (uint64_t)usecs * tsc_frequency / 1000000;
+ start = rdtsc_lfence();
+ while (rdtsc_lfence() - start < interval)
+ CPU_BUSY_CYCLE();
}
-/*
- * Called during startup of APs, by the boot processor. Interrupts
- * are disabled on entry.
- */
-void
-tsc_read_bp(struct cpu_info *ci, uint64_t *bptscp, uint64_t *aptscp)
-{
- uint64_t bptsc;
+#ifdef MULTIPROCESSOR
- if (atomic_swap_ptr(&tsc_sync_cpu, ci) != NULL)
- panic("tsc_sync_bp: 1");
+#define TSC_DEBUG 1
- /* Flag it and read our TSC. */
- atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
- bptsc = (rdtsc_lfence() >> 1);
+/*
+ * Protections for global variables in this code:
+ *
+ * 1 Only set from zero to one
+ * a Modified atomically
+ * b Protected by a barrier
+ * p Only modified by primary CPU
+ * s Only modified by secondary CPU
+ */
- /* Wait for remote to complete, and read ours again. */
- while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
- membar_consumer();
- bptsc += (rdtsc_lfence() >> 1);
+#define TSC_SYNC_MS 1 /* Test round duration */
+#define TSC_SYNC_ROUNDS 2 /* Number of test rounds */
- /* Wait for the results to come in. */
- while (tsc_sync_cpu == ci)
- CPU_BUSY_CYCLE();
- if (tsc_sync_cpu != NULL)
- panic("tsc_sync_bp: 2");
+uint64_t tsc_ap_val __aligned(64); /* [s] Latest TSC value from AP */
+uint64_t tsc_bp_val __aligned(64); /* [p] Latest TSC value from BP */
+uint64_t tsc_sync_cycles; /* [p] TSC cycles per test round */
+const char *tsc_ap_name; /* [b] Name of AP */
+volatile uint64_t tsc_ap_lag; /* [s] Biggest regression seen by AP */
+volatile u_int tsc_egress_barrier; /* [a] Sync test end barrier */
+volatile u_int tsc_ingress_barrier; /* [a] Sync test start barrier */
+volatile u_int tsc_sync_rounds; /* [b] Remaining test rounds */
+volatile u_int tsc_lag_count; /* [a] No. of regressions this round */
+int tsc_simulag; /* [1] Regressions seen on both CPUs? */
+int tsc_is_synchronized = 1; /* [p] TSC sync'd across all CPUs? */
- *bptscp = bptsc;
- *aptscp = tsc_sync_val;
-}
+void tsc_reset_adjust(const struct cpu_info *);
+uint64_t tsc_test_sync(volatile const uint64_t *, uint64_t *);
void
-tsc_sync_bp(struct cpu_info *ci)
+tsc_test_sync_bp(struct cpu_info *ci)
{
- uint64_t bptsc, aptsc;
+ uint64_t bp_lag;
+ unsigned int round;
+
+ /* TSC must be constant and invariant to use it as a timecounter. */
+ if (!tsc_is_invariant)
+ return;
+#ifndef TSC_DEBUG
+ /* No point in testing again if a prior test failed. */
+ if (!tsc_is_synchronized)
+ return;
+#endif
- tsc_read_bp(ci, &bptsc, &aptsc); /* discarded - cache effects */
- tsc_read_bp(ci, &bptsc, &aptsc);
+ /* Reset IA32_TSC_ADJUST if it exists. */
+ tsc_reset_adjust(ci);
- /* Compute final value to adjust for skew. */
- ci->ci_tsc_skew = bptsc - aptsc;
+ /* Reset the test cycle limit and round count. */
+ tsc_sync_cycles = TSC_SYNC_MS * tsc_frequency / 1000;
+ tsc_sync_rounds = TSC_SYNC_ROUNDS;
+
+ do {
+ /*
+ * Pass through the ingress barrier, run the test,
+ * then wait for the AP to reach the egress barrier.
+ */
+ atomic_inc_int(&tsc_ingress_barrier);
+ while (tsc_ingress_barrier != 2)
+ CPU_BUSY_CYCLE();
+ bp_lag = tsc_test_sync(&tsc_ap_val, &tsc_bp_val);
+ while (tsc_egress_barrier != 1)
+ CPU_BUSY_CYCLE();
+
+ /*
+ * Report what went wrong, if anything.
+ */
+ if (tsc_lag_count != 0) {
+ round = TSC_SYNC_ROUNDS - tsc_sync_rounds + 1;
+ printf("tsc: cpu0/%s sync round %u: %u regressions\n",
+ tsc_ap_name, round, tsc_lag_count);
+ if (tsc_simulag) {
+ printf("tsc: cpu0/%s sync round %u: "
+ "regressions seen by both CPUs\n",
+ tsc_ap_name, round);
+ }
+ printf("tsc: cpu0/%s sync round %u: "
+ "cpu0 lags %s by %llu cycles\n",
+ tsc_ap_name, round, tsc_ap_name, bp_lag);
+ printf("tsc: cpu0/%s sync round %u: "
+ "%s lags cpu0 by %llu cycles\n",
+ tsc_ap_name, round, tsc_ap_name, tsc_ap_lag);
+ /*
+ * XXX We need a tc_detach() function to actually
+ * disable a given timecounter. Lowering the quality
+ * like this is a nasty hack.
+ */
+ tsc_timecounter.tc_quality = -1000;
+ tsc_is_synchronized = 0;
+ tsc_sync_rounds = 0;
+ } else
+ tsc_sync_rounds--;
+
+ /*
+ * Clean up for the next round.
+ *
+ * It is safe to reset the ingress barrier because
+ * at this point we know the AP has reached the egress
+ * barrier.
+ */
+ tsc_bp_val = 0;
+ tsc_ap_val = 0;
+ tsc_ap_lag = 0;
+ tsc_ingress_barrier = 0;
+ tsc_simulag = 0;
+ tsc_lag_count = 0;
+ if (tsc_sync_rounds == 0)
+ tsc_ap_name = NULL;
+
+ /*
+ * Pass through the egress barrier and release the AP.
+ * The AP is responsible for resetting the barrier.
+ */
+ if (atomic_inc_int_nv(&tsc_egress_barrier) != 2)
+ panic("%s: unexpected egress count", __func__);
+ } while (tsc_sync_rounds > 0);
}
-/*
- * Called during startup of AP, by the AP itself. Interrupts are
- * disabled on entry.
- */
void
-tsc_post_ap(struct cpu_info *ci)
+tsc_test_sync_ap(struct cpu_info *ci)
{
- uint64_t tsc;
+ if (!tsc_is_invariant)
+ return;
+#ifndef TSC_DEBUG
+ if (!tsc_is_synchronized)
+ return;
+#endif
- /* Wait for go-ahead from primary. */
- while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
- membar_consumer();
- tsc = (rdtsc_lfence() >> 1);
+ tsc_reset_adjust(ci);
- /* Instruct primary to read its counter. */
- atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
- tsc += (rdtsc_lfence() >> 1);
+/* #define TSC_FORCE_DESYNC (150) */
+#ifdef TSC_FORCE_DESYNC
+ wrmsr(MSR_TSC_ADJUST, TSC_FORCE_DESYNC);
+#endif
- /* Post result. Ensure the whole value goes out atomically. */
- (void)atomic_swap_64(&tsc_sync_val, tsc);
+ /* The BP needs our name in order to log any problems. */
+ tsc_ap_name = ci->ci_dev->dv_xname;
- if (atomic_swap_ptr(&tsc_sync_cpu, NULL) != ci)
- panic("tsc_sync_ap");
+ /*
+ * As the AP we are only responsible for running the test,
+ * reporting our lag, and resetting the egress barrier.
+ * The BP sets up and tears down everything else.
+ */
+ do {
+ atomic_inc_int(&tsc_ingress_barrier);
+ while (tsc_ingress_barrier != 2)
+ CPU_BUSY_CYCLE();
+ tsc_ap_lag = tsc_test_sync(&tsc_bp_val, &tsc_ap_val);
+ atomic_inc_int(&tsc_egress_barrier);
+ while (tsc_egress_barrier != 2)
+ CPU_BUSY_CYCLE();
+ tsc_egress_barrier = 0;
+ } while (tsc_sync_rounds > 0);
}
void
-tsc_sync_ap(struct cpu_info *ci)
+tsc_reset_adjust(const struct cpu_info *ci)
{
- tsc_post_ap(ci);
- tsc_post_ap(ci);
+ int64_t adj;
+
+ if (ISSET(ci->ci_feature_sefflags_ebx, SEFF0EBX_TSC_ADJUST)) {
+ adj = rdmsr(MSR_TSC_ADJUST);
+ if (adj != 0) {
+ wrmsr(MSR_TSC_ADJUST, 0);
+#if 0
+ /* XXX APs can't printf during boot or resume? */
+ printf("tsc: %s: IA32_TSC_ADJUST: %+lld -> 0",
+ ci->ci_dev->dv_xname, (long long)adj);
+#endif
+ }
+ }
}
-void
-tsc_delay(int usecs)
+uint64_t
+tsc_test_sync(volatile const uint64_t *reference, uint64_t *local)
{
- uint64_t interval, start;
+ uint64_t end, lag, max_lag, ref_latest;
+ u_int i, lag_count;
- interval = (uint64_t)usecs * tsc_frequency / 1000000;
- start = rdtsc_lfence();
- while (rdtsc_lfence() - start < interval)
- CPU_BUSY_CYCLE();
+ max_lag = 0;
+ i = lag_count = 0;
+
+ end = rdtsc_lfence() + tsc_sync_cycles;
+ while (*local < end) {
+ ref_latest = *reference; /* atomic */
+ *local = rdtsc_lfence();
+
+ /*
+ * Ensure the other CPU has a chance to run. This is
+ * crucial if the other CPU is our SMT sibling. SMT
+ * starvation can prevent this test from detecting
+ * relatively large lags. Eight is an arbitrary value
+ * but it works in practice without impacting the
+ * accuracy of the test for non-sibling threads.
+ */
+ if (++i % 8 == 0)
+ CPU_BUSY_CYCLE();
+
+ /* Everything is fine if the count is strictly monotonic. */
+ if (ref_latest < *local)
+ continue;
+
+ /*
+ * Otherwise, our TSC lags the other TSC. Record the
+ * magnitude of the problem.
+ */
+ lag = ref_latest - *local;
+ if (max_lag < lag)
+ max_lag = lag;
+ if (atomic_inc_int_nv(&tsc_lag_count) != lag_count + 1)
+ tsc_simulag = 1;
+ lag_count = tsc_lag_count; /* atomic */
+ }
+
+ return max_lag;
}
+
+#endif /* MULTIPROCESSOR */
Index: amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.154
diff -u -p -r1.154 cpu.c
--- amd64/cpu.c 31 Aug 2021 17:40:59 -0000 1.154
+++ amd64/cpu.c 27 Jan 2022 16:27:49 -0000
@@ -762,9 +762,9 @@ cpu_init(struct cpu_info *ci)
lcr4(cr4 & ~CR4_PGE);
lcr4(cr4);
- /* Synchronize TSC */
+ /* Check if our TSC is synchronized with the BP's TSC. */
if (cold && !CPU_IS_PRIMARY(ci))
- tsc_sync_ap(ci);
+ tsc_test_sync_ap(ci);
#endif
}
@@ -844,18 +844,14 @@ cpu_start_secondary(struct cpu_info *ci)
#endif
} else {
/*
- * Synchronize time stamp counters. Invalidate cache and
- * synchronize twice (in tsc_sync_bp) to minimize possible
- * cache effects. Disable interrupts to try and rule out any
- * external interference.
+ * Check if our TSC is synchronized with the AP's TSC.
+ * Invalidate cache to minimize possible cache effects
+ * and disable interrupts to minimize test interference.
*/
s = intr_disable();
wbinvd();
- tsc_sync_bp(ci);
+ tsc_test_sync_bp(ci);
intr_restore(s);
-#ifdef TSC_DEBUG
- printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
-#endif
}
if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
@@ -880,7 +876,6 @@ void
cpu_boot_secondary(struct cpu_info *ci)
{
int i;
- int64_t drift;
u_long s;
atomic_setbits_int(&ci->ci_flags, CPUF_GO);
@@ -895,18 +890,14 @@ cpu_boot_secondary(struct cpu_info *ci)
db_enter();
#endif
} else if (cold) {
- /* Synchronize TSC again, check for drift. */
- drift = ci->ci_tsc_skew;
+ /*
+ * Test TSC synchronization again. Something may
+ * have screwed it up while we were down.
+ */
s = intr_disable();
wbinvd();
- tsc_sync_bp(ci);
+ tsc_test_sync_bp(ci);
intr_restore(s);
- drift -= ci->ci_tsc_skew;
-#ifdef TSC_DEBUG
- printf("TSC skew=%lld drift=%lld\n",
- (long long)ci->ci_tsc_skew, (long long)drift);
-#endif
- tsc_sync_drift(drift);
}
}
@@ -932,16 +923,17 @@ cpu_hatch(void *v)
#endif
/*
- * Synchronize the TSC for the first time. Note that interrupts are
- * off at this point.
+ * Test if our TSC is synchronized with the BP's TSC.
+ * Note that interrupts are off at this point. Invalidate
+ * cache to minimize cache effects on the test.
*/
wbinvd();
ci->ci_flags |= CPUF_PRESENT;
- ci->ci_tsc_skew = 0; /* reset on resume */
- tsc_sync_ap(ci);
+ tsc_test_sync_ap(ci);
lapic_enable();
lapic_startclock();
+
cpu_ucode_apply(ci);
cpu_tsx_disable(ci);
Index: include/cpu.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.141
diff -u -p -r1.141 cpu.h
--- include/cpu.h 31 Aug 2021 17:40:59 -0000 1.141
+++ include/cpu.h 27 Jan 2022 16:27:49 -0000
@@ -209,8 +209,6 @@ struct cpu_info {
paddr_t ci_vmxon_region_pa;
struct vmxon_region *ci_vmxon_region;
- int64_t ci_tsc_skew; /* counter skew vs cpu0 */
-
char ci_panicbuf[512];
paddr_t ci_vmcs_pa;
@@ -230,7 +228,6 @@ struct cpu_info {
#define CPUF_INVAR_TSC 0x0100 /* CPU has invariant TSC */
#define CPUF_USERXSTATE 0x0200 /* CPU has curproc's xsave
state */
-#define CPUF_SYNCTSC 0x0800 /* Synchronize TSC */
#define CPUF_PRESENT 0x1000 /* CPU is present */
#define CPUF_RUNNING 0x2000 /* CPU is running */
#define CPUF_PAUSE 0x4000 /* CPU is paused in DDB */
Index: include/cpuvar.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
retrieving revision 1.11
diff -u -p -r1.11 cpuvar.h
--- include/cpuvar.h 16 May 2021 04:33:05 -0000 1.11
+++ include/cpuvar.h 27 Jan 2022 16:27:49 -0000
@@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
void cpu_init(struct cpu_info *);
void cpu_init_first(void);
-void tsc_sync_drift(int64_t);
-void tsc_sync_bp(struct cpu_info *);
-void tsc_sync_ap(struct cpu_info *);
+void tsc_test_sync_bp(struct cpu_info *);
+void tsc_test_sync_ap(struct cpu_info *);
#endif