> On Sep 8, 2022, at 9:05 AM, Mike Larkin <mlar...@nested.page> wrote:
>
> On Thu, Sep 08, 2022 at 08:32:27AM -0500, Scott Cheloha wrote:
>> On Tue, Sep 06, 2022 at 03:30:44AM -0700, Mike Larkin wrote:
>>> On Sun, Sep 04, 2022 at 02:50:10PM +1000, Jonathan Gray wrote:
>>>> On Sat, Sep 03, 2022 at 05:33:01PM -0500, Scott Cheloha wrote:
>>>>> On Sat, Sep 03, 2022 at 10:37:31PM +1000, Jonathan Gray wrote:
>>>>>> On Sat, Sep 03, 2022 at 06:52:20AM -0500, Scott Cheloha wrote:
>>>>>>>> On Sep 3, 2022, at 02:22, Jonathan Gray <j...@jsg.id.au> wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 02, 2022 at 06:00:25PM -0500, Scott Cheloha wrote:
>>>>>>>>> dv@ suggested coming to the list to request testing for the pvclock(4)
>>>>>>>>> driver.  Attached is a patch that corrects several bugs.  Most of
>>>>>>>>> these changes will only matter in the non-TSC_STABLE case on a
>>>>>>>>> multiprocessor VM.
>>>>>>>>>
>>>>>>>>> Ideally, nothing should break.
>>>>>>>>>
>>>>>>>>> - pvclock yields a 64-bit value.  The BSD timecounter layer can only
>>>>>>>>>   use the lower 32 bits, but internally we need to track the full
>>>>>>>>>   64-bit value to allow comparisons with the full value in the
>>>>>>>>>   non-TSC_STABLE case.  So make pvclock_lastcount a 64-bit quantity.
>>>>>>>>>
>>>>>>>>> - In pvclock_get_timecount(), move rdtsc() up into the lockless read
>>>>>>>>>   loop to get a more accurate timestamp.
>>>>>>>>>
>>>>>>>>> - In pvclock_get_timecount(), use rdtsc_lfence(), not rdtsc().
>>>>>>>>>
>>>>>>>>> - In pvclock_get_timecount(), check that our TSC value doesn't predate
>>>>>>>>>   ti->ti_tsc_timestamp, otherwise we will produce an enormous value.
>>>>>>>>>
>>>>>>>>> - In pvclock_get_timecount(), update pvclock_lastcount in the
>>>>>>>>>   non-TSC_STABLE case with more care.  On amd64 we can do this with an
>>>>>>>>>   atomic_cas_ulong(9) loop because u_long is 64 bits.  On i386 we need
>>>>>>>>>   to introduce a mutex to protect our comparison and read/write.
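The monotonic update described in that last item can be sketched as follows. This is a userland sketch with illustrative names, not the driver code itself; the kernel would use atomic_cas_ulong(9) on amd64, where the __sync builtin stands in here:

```c
#include <stdint.h>

static volatile uint64_t pvclock_lastcount;

/*
 * Sketch: keep pvclock_lastcount monotonically nondecreasing across
 * CPUs with a 64-bit CAS loop.  Returns the clamped counter value.
 */
static uint64_t
pvclock_advance(uint64_t ctr)
{
	uint64_t last;

	do {
		last = pvclock_lastcount;
		if (ctr <= last)
			return last;	/* another CPU is already ahead */
	} while (__sync_val_compare_and_swap(&pvclock_lastcount,
	    last, ctr) != last);
	return ctr;
}
```

If a second CPU races ahead, the CAS fails, the loop rereads the newer value, and the stale counter is discarded instead of rewinding the clock.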
>>>>>>>>
>>>>>>>> i386 has cmpxchg8b, no need to disable interrupts
>>>>>>>> the ifdefs seem excessive
>>>>>>>
>>>>>>> How do I make use of CMPXCHG8B on i386 in this context?
>>>>>>>
>>>>>>> atomic_cas_ulong(9) is a 32-bit CAS on i386.
>>>>>>
>>>>>> static inline uint64_t
>>>>>> atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
>>>>>> {
>>>>>> 	return __sync_val_compare_and_swap(p, o, n);
>>>>>> }
>>>>>>
>>>>>> Or md atomic.h files could have an equivalent.
>>>>>> Not possible on all 32-bit archs.
>>>>>>
>>>>>>>
>>>>>>> We can't use FP registers in the kernel, no?
>>>>>>
>>>>>> What do FP registers have to do with it?
>>>>>>
>>>>>>>
>>>>>>> Am I missing some other avenue?
>>>>>>
>>>>>> There is no rdtsc_lfence() on i386.  Initial diff doesn't build.
>>>>>
>>>>> LFENCE is an SSE2 extension.  As is MFENCE.  I don't think I can just
>>>>> drop rdtsc_lfence() into cpufunc.h and proceed without causing some
>>>>> kind of fault on an older CPU.
>>>>>
>>>>> What are my options on a 586-class CPU for forcing RDTSC to complete
>>>>> before later instructions?
>>>>
>>>> "3.3.2. Serializing Operations
>>>> After executing certain instructions the Pentium processor serializes
>>>> instruction execution.  This means that any modifications to flags,
>>>> registers, and memory for previous instructions are completed before
>>>> the next instruction is fetched and executed.  The prefetch queue
>>>> is flushed as a result of serializing operations.
>>>>
>>>> The Pentium processor serializes instruction execution after executing
>>>> one of the following instructions: Move to Special Register (except
>>>> CR0), INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD,
>>>> CPUID, RSM and WRMSR."
>>>>
>>>> from:
>>>> Pentium Processor User's Manual
>>>> Volume 1: Pentium Processor Data Book
>>>> Order Number 241428
>>>>
>>>> http://bitsavers.org/components/intel/pentium/1993_Intel_Pentium_Processor_Users_Manual_Volume_1.pdf
>>>>
>>>> So it could be rdtsc ; cpuid.
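For illustration, the two serialized read sequences under discussion might look like this. These are hypothetical helpers modeled on amd64's rdtsc_lfence(), not the committed cpufunc.h:

```c
#include <stdint.h>

/* LFENCE completes prior instructions before the TSC read (SSE2+). */
static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t hi, lo;

	__asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

/*
 * 586-class fallback: CPUID is serializing, so placing it after RDTSC
 * forces the TSC read to complete before later instructions.  The
 * results must be saved first because CPUID clobbers eax-edx.
 */
static inline uint64_t
rdtsc_cpuid(void)
{
	uint32_t hi, lo;

	__asm volatile(
	    "rdtsc\n\t"
	    "movl %%eax, %0\n\t"
	    "movl %%edx, %1\n\t"
	    "xorl %%eax, %%eax\n\t"
	    "cpuid"
	    : "=&r" (lo), "=&r" (hi)
	    :
	    : "eax", "ebx", "ecx", "edx");
	return ((uint64_t)hi << 32) | lo;
}
```

Note the asymmetry: lfence;rdtsc fences before the read, while the CPUID variant serializes after it, matching what each sequence is able to guarantee on its respective CPUs.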
>>>> lfence; rdtsc should still be preferred.
>>>>
>>>> It could be tested during boot and set a function pointer.
>>>> Or the codepatch bits could be used.
>>>>
>>>> In the specific case of pvclock, can it be assumed that the host
>>>> has hardware virt and would then have lfence?
>>>>
>>>
>>> I think this is a fair assumption.
>>
>> Okay, so how about we use rdtsc_lfence() in pvclock_get_timecount()
>> and, just to be safe, check for SSE2 during pvclock_match()?
>>
>> If in the future someone wants pvclock(4) support on guests without
>> SSE2 support (if any exist) we can deal with it at that time.
>>
>> Updated patch:
>>
>> - Add atomic_cas_64() to amd64/include/atomic.h.
>>
>>   (NetBSD calls it atomic_cas_64(), so this is consistent with their
>>   atomic function naming scheme.)
>>
>>   I am unsure whether it would be better to just do this:
>>
>>   #define atomic_cas_64(_p, _e, _n) _atomic_cas_ulong((unsigned long *)(_p), (_e), (_n))
>>
>>   If I don't cast the pointer, the compiler complains, so I added
>>   _atomic_cas_ullong().
>>
>>   What is preferred here?
>>
>> - Add atomic_cas_64() to i386/include/atomic.h.  Implemented with the
>>   __sync_val_compare_and_swap() builtin.
>>
>> - Add rdtsc_lfence() to i386/include/cpufunc.h.
>>
>> - In pvclock.c, make pvclock_lastcount a volatile 64-bit value to
>>   fix the non-PVCLOCK_FLAG_TSC_STABLE case.
>>
>> - In pvclock_match(), check for SSE2 support in ci_feature_flags.
>>   If we don't have it, we don't have LFENCE and we can't match.
>>
>> - In pvclock_get_timecount(), do RDTSC as early as possible in the
>>   lockless read loop to get a better timestamp.
>>
>> - In pvclock_get_timecount(), use rdtsc_lfence() instead of rdtsc().
>>
>> - In pvclock_get_timecount(), check whether our TSC lags
>>   ti_tsc_timestamp to make sure we don't produce an enormous,
>>   invalid delta.  If we're lagging, set delta to zero.
>>
>> - In pvclock_get_timecount(), fix the non-PVCLOCK_FLAG_TSC_STABLE
>>   case.
>>
>>   Because we could be in pvclock_get_timecount() with other vCPUs, we
>>   need to read, compare, and (maybe) replace pvclock_lastcount in an
>>   atomic CAS loop.
>>
>> The only thing that I'm still unsure about is whether this:
>>
>> 	volatile uint64_t pvclock_lastcount;
>>
>> 	void
>> 	foo(void)
>> 	{
>> 		uint64_t last;
>>
>> 		last = pvclock_lastcount;	/* atomic on i386? */
>> 	}
>>
>> is a safe, atomic 8-byte load on i386 on all the CPUs we support,
>> i.e. 586 and up.
>>
>> Can someone confirm this?  I didn't know you could do this with
>> 32-bit registers.
>>
>
> You could compile this and then objdump -D it and see for yourself...
I can't make heads or tails of it. Please explain what I am looking at and why it is, or is not, atomic.
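In the meantime, the one alternative I can see that sidesteps the question is to spell the load out with a compiler builtin, which is required to be atomic however the target implements it (a sketch, not part of the patch; on i586-class targets the compiler can lower this with CMPXCHG8B):

```c
#include <stdint.h>

volatile uint64_t pvclock_lastcount;

/*
 * Sketch: __atomic_load_n() obliges the compiler to emit a genuinely
 * atomic 8-byte load, whatever a plain assignment would have compiled
 * to on a 32-bit target.
 */
uint64_t
pvclock_read_lastcount(void)
{
	return __atomic_load_n(&pvclock_lastcount, __ATOMIC_RELAXED);
}
```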