On Thu, Sep 08, 2022 at 08:32:27AM -0500, Scott Cheloha wrote:
> On Tue, Sep 06, 2022 at 03:30:44AM -0700, Mike Larkin wrote:
> > On Sun, Sep 04, 2022 at 02:50:10PM +1000, Jonathan Gray wrote:
> > > On Sat, Sep 03, 2022 at 05:33:01PM -0500, Scott Cheloha wrote:
> > > > On Sat, Sep 03, 2022 at 10:37:31PM +1000, Jonathan Gray wrote:
> > > > > On Sat, Sep 03, 2022 at 06:52:20AM -0500, Scott Cheloha wrote:
> > > > > > > On Sep 3, 2022, at 02:22, Jonathan Gray <j...@jsg.id.au> wrote:
> > > > > > >
> > > > > > > On Fri, Sep 02, 2022 at 06:00:25PM -0500, Scott Cheloha wrote:
> > > > > > >> dv@ suggested coming to the list to request testing for the
> > > > > > >> pvclock(4) driver.  Attached is a patch that corrects several
> > > > > > >> bugs.  Most of these changes will only matter in the
> > > > > > >> non-TSC_STABLE case on a multiprocessor VM.
> > > > > > >>
> > > > > > >> Ideally, nothing should break.
> > > > > > >>
> > > > > > >> - pvclock yields a 64-bit value.  The BSD timecounter layer can
> > > > > > >>   only use the lower 32 bits, but internally we need to track
> > > > > > >>   the full 64-bit value to allow comparisons with the full
> > > > > > >>   value in the non-TSC_STABLE case.  So make pvclock_lastcount
> > > > > > >>   a 64-bit quantity.
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), move rdtsc() up into the lockless
> > > > > > >>   read loop to get a more accurate timestamp.
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), use rdtsc_lfence(), not rdtsc().
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), check that our TSC value doesn't
> > > > > > >>   predate ti->ti_tsc_timestamp, otherwise we will produce an
> > > > > > >>   enormous value.
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), update pvclock_lastcount in the
> > > > > > >>   non-TSC_STABLE case with more care.  On amd64 we can do this
> > > > > > >>   with an atomic_cas_ulong(9) loop because u_long is 64 bits.
> > > > > > >>   On i386 we need to introduce a mutex to protect our
> > > > > > >>   comparison and read/write.
> > > > > > >
> > > > > > > i386 has cmpxchg8b, no need to disable interrupts
> > > > > > > the ifdefs seem excessive
> > > > > >
> > > > > > How do I make use of CMPXCHG8B on i386
> > > > > > in this context?
> > > > > >
> > > > > > atomic_cas_ulong(9) is a 32-bit CAS on
> > > > > > i386.
> > > > >
> > > > > static inline uint64_t
> > > > > atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
> > > > > {
> > > > > 	return __sync_val_compare_and_swap(p, o, n);
> > > > > }
> > > > >
> > > > > Or md atomic.h files could have an equivalent.
> > > > > Not possible on all 32-bit archs.
> > > > >
> > > > > > We can't use FP registers in the kernel, no?
> > > > >
> > > > > What do FP registers have to do with it?
> > > > >
> > > > > > Am I missing some other avenue?
> > > > >
> > > > > There is no rdtsc_lfence() on i386.  Initial diff doesn't build.
> > > >
> > > > LFENCE is an SSE2 extension.  As is MFENCE.  I don't think I can just
> > > > drop rdtsc_lfence() into cpufunc.h and proceed without causing some
> > > > kind of fault on an older CPU.
> > > >
> > > > What are my options on a 586-class CPU for forcing RDTSC to complete
> > > > before later instructions?
> > >
> > > "3.3.2. Serializing Operations
> > > After executing certain instructions the Pentium processor serializes
> > > instruction execution.  This means that any modifications to flags,
> > > registers, and memory for previous instructions are completed before
> > > the next instruction is fetched and executed.  The prefetch queue
> > > is flushed as a result of serializing operations.
> > > The Pentium processor serializes instruction execution after executing
> > > one of the following instructions: Move to Special Register (except
> > > CR0), INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD,
> > > CPUID, RSM and WRMSR."
> > >
> > > from:
> > > Pentium Processor User's Manual
> > > Volume 1: Pentium Processor Data Book
> > > Order Number 241428
> > >
> > > http://bitsavers.org/components/intel/pentium/1993_Intel_Pentium_Processor_Users_Manual_Volume_1.pdf
> > >
> > > So it could be rdtsc ; cpuid.
> > > lfence; rdtsc should still be preferred.
> > >
> > > It could be tested during boot and set a function pointer.
> > > Or the codepatch bits could be used.
> > >
> > > In the specific case of pvclock, can it be assumed that the host
> > > has hardware virt and would then have lfence?
> >
> > I think this is a fair assumption.
>
> Okay, so how about we use rdtsc_lfence() in pvclock_get_timecount()
> and, just to be safe, check for SSE2 during pvclock_match().
>
> If in the future someone wants pvclock(4) support on guests without
> SSE2 support (if any exist) we can deal with it at that time.
>
> Updated patch:
>
> - Add atomic_cas_64() to amd64/include/atomic.h.
>
>   (NetBSD calls it atomic_cas_64(), so this is consistent with their
>   atomic function naming scheme.)
>
>   I am unsure whether it would be better to just do this:
>
> #define atomic_cas_64(_p, _e, _n) _atomic_cas_ulong((unsigned long *)(_p), (_e), (_n))
>
>   If I don't cast the pointer, the compiler complains, so I added
>   _atomic_cas_ullong().
>
>   What is preferred here?
isn't the point to use uint64_t for the types not longs?

> - Add atomic_cas_64() to i386/include/atomic.h.  Implemented with the
>   __sync_val_compare_and_swap() builtin.

I think you should split the atomic changes out as a different diff.

the __sync_val_compare_and_swap part for sys/sys/atomic.h
i386/amd64 could then have the _fn() asm and #define in the style
of the other atomic methods

Or just have a pvclock local atomic_cas_64() for now.  I'm not sure
on the mi concerns of adding it when it can't be used everywhere.
And it would introduce a stdint.h requirement.

> - Add rdtsc_lfence() to i386/include/cpufunc.h.
>
> - In pvclock.c, make pvclock_lastcount a volatile 64-bit value to
>   fix the non-PVCLOCK_FLAG_TSC_STABLE case.
>
> - In pvclock_match(), check for SSE2 support in ci_feature_flags.
>   If we don't have it, we don't have LFENCE and we can't match.
>
> - In pvclock_get_timecount(), do RDTSC as early as possible in the
>   lockless read loop to get a better timestamp.
>
> - In pvclock_get_timecount(), use rdtsc_lfence() instead of rdtsc().
>
> - In pvclock_get_timecount(), check whether our TSC lags
>   ti_tsc_timestamp to make sure we don't produce an enormous,
>   invalid delta.  If we're lagging, set delta to zero.
>
> - In pvclock_get_timecount(), fix the non-PVCLOCK_FLAG_TSC_STABLE
>   case.
>
>   Because we could be in pvclock_get_timecount() with other vCPUs, we
>   need to read, compare, and (maybe) replace pvclock_lastcount in an
>   atomic CAS loop.
>
> The only thing that I'm still unsure about is whether this:
>
> 	volatile uint64_t pvclock_lastcount;
>
> 	void
> 	foo(void)
> 	{
> 		uint64_t last;
>
> 		last = pvclock_lastcount;	/* atomic on i386? */
> 	}
>
> is a safe, atomic 8-byte load on i386 on all the CPUs we support,
> i.e. 586 and up.
>
> Can someone confirm this?  I didn't know you could do this with
> 32-bit registers.
>
> --
>
> Should I put this out in a second request-for-test email?  The patch
> has drifted a bit.
>
> Index: arch/amd64/include/atomic.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/atomic.h,v
> retrieving revision 1.22
> diff -u -p -r1.22 atomic.h
> --- arch/amd64/include/atomic.h	29 Aug 2022 02:01:18 -0000	1.22
> +++ arch/amd64/include/atomic.h	8 Sep 2022 13:28:48 -0000
> @@ -77,6 +77,18 @@ _atomic_cas_ulong(volatile unsigned long
>  }
>  #define atomic_cas_ulong(_p, _e, _n) _atomic_cas_ulong((_p), (_e), (_n))
>  
> +static inline unsigned long long
> +_atomic_cas_ullong(volatile unsigned long long *p, unsigned long long e,
> +    unsigned long long n)
> +{
> +	__asm volatile(_LOCK " cmpxchgq %2, %1"
> +	    : "=a" (n), "=m" (*p)
> +	    : "r" (n), "a" (e), "m" (*p));
> +
> +	return (n);
> +}
> +#define atomic_cas_64(_p, _e, _n) _atomic_cas_ullong((_p), (_e), (_n))
> +
>  static inline void *
>  _atomic_cas_ptr(volatile void *p, void *e, void *n)
>  {
> Index: arch/i386/include/atomic.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/i386/include/atomic.h,v
> retrieving revision 1.20
> diff -u -p -r1.20 atomic.h
> --- arch/i386/include/atomic.h	29 Aug 2022 02:01:18 -0000	1.20
> +++ arch/i386/include/atomic.h	8 Sep 2022 13:28:49 -0000
> @@ -83,6 +83,12 @@ _atomic_cas_ptr(volatile void *p, void *
>  }
>  #define atomic_cas_ptr(_p, _e, _n) _atomic_cas_ptr((_p), (_e), (_n))
>  
> +static inline uint64_t
> +atomic_cas_64(volatile uint64_t *p, uint64_t e, uint64_t n)
> +{
> +	return __sync_val_compare_and_swap(p, e, n);
> +}
> +
>  static inline unsigned int
>  _atomic_swap_uint(volatile unsigned int *p, unsigned int n)
>  {
> Index: arch/i386/include/cpufunc.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/i386/include/cpufunc.h,v
> retrieving revision 1.33
> diff -u -p -r1.33 cpufunc.h
> --- arch/i386/include/cpufunc.h	13 Sep 2020 11:53:16 -0000	1.33
> +++ arch/i386/include/cpufunc.h	8 Sep 2022 13:28:49 -0000
> @@ -229,6 +229,15 @@ rdtsc(void)
>  	return (tsc);
>  }
>  
> +static inline uint64_t
> +rdtsc_lfence(void)
> +{
> +	uint64_t tsc;
> +
> +	__asm volatile("lfence; rdtsc" : "=A" (tsc));
> +	return tsc;
> +}
> +
>  static __inline void
>  wrmsr(u_int msr, u_int64_t newval)
>  {
> Index: dev/pv/pvclock.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pv/pvclock.c,v
> retrieving revision 1.8
> diff -u -p -r1.8 pvclock.c
> --- dev/pv/pvclock.c	5 Nov 2021 11:38:29 -0000	1.8
> +++ dev/pv/pvclock.c	8 Sep 2022 13:28:49 -0000
> @@ -21,21 +21,23 @@
>  #endif
>  
>  #include <sys/param.h>
> +#include <sys/stdint.h>
>  #include <sys/systm.h>
>  #include <sys/kernel.h>
>  #include <sys/timetc.h>
>  #include <sys/timeout.h>
>  #include <sys/malloc.h>
> -#include <sys/atomic.h>
>  
>  #include <machine/cpu.h>
>  #include <machine/atomic.h>
> +#include <machine/specialreg.h>
> +
>  #include <uvm/uvm_extern.h>
>  
>  #include <dev/pv/pvvar.h>
>  #include <dev/pv/pvreg.h>
>  
> -uint		 pvclock_lastcount;
> +volatile uint64_t pvclock_lastcount;
>  
>  struct pvclock_softc {
>  	struct device		 sc_dev;
> @@ -116,6 +118,12 @@ pvclock_match(struct device *parent, voi
>  	    (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) == 0)
>  		return (0);
>  
> +	/*
> +	 * LFENCE is also required, so check for SSE2 support.
> +	 */
> +	if (!ISSET(curcpu()->ci_feature_flags, CPUID_SSE2))
> +		return (0);
> +
>  	return (1);
>  }
>  
> @@ -213,6 +221,7 @@ pvclock_get_timecount(struct timecounter
>  	struct pvclock_softc *sc = tc->tc_priv;
>  	struct pvclock_time_info *ti;
>  	uint64_t tsc_timestamp, system_time, delta, ctr;
> +	uint64_t last, tsc;
>  	uint32_t version, mul_frac;
>  	int8_t shift;
>  	uint8_t flags;
> @@ -220,6 +229,7 @@ pvclock_get_timecount(struct timecounter
>  	ti = sc->sc_time;
>  	do {
>  		version = pvclock_read_begin(ti);
> +		tsc = rdtsc_lfence();
>  		system_time = ti->ti_system_time;
>  		tsc_timestamp = ti->ti_tsc_timestamp;
>  		mul_frac = ti->ti_tsc_to_system_mul;
> @@ -231,7 +241,10 @@ pvclock_get_timecount(struct timecounter
>  	 * The algorithm is described in
>  	 * linux/Documentation/virtual/kvm/msr.txt
>  	 */
> -	delta = rdtsc() - tsc_timestamp;
> +	if (tsc > tsc_timestamp)
> +		delta = tsc - tsc_timestamp;
> +	else
> +		delta = 0;
>  	if (shift < 0)
>  		delta >>= -shift;
>  	else
> @@ -241,10 +254,11 @@ pvclock_get_timecount(struct timecounter
>  	if ((flags & PVCLOCK_FLAG_TSC_STABLE) != 0)
>  		return (ctr);
>  
> -	if (ctr < pvclock_lastcount)
> -		return (pvclock_lastcount);
> -
> -	atomic_swap_uint(&pvclock_lastcount, ctr);
> +	do {
> +		last = pvclock_lastcount;
> +		if (ctr < last)
> +			return (last);
> +	} while (atomic_cas_64(&pvclock_lastcount, last, ctr) != last);
>  
>  	return (ctr);
>  }
>