> On Sep 8, 2022, at 9:05 AM, Mike Larkin <mlar...@nested.page> wrote:
> 
> On Thu, Sep 08, 2022 at 08:32:27AM -0500, Scott Cheloha wrote:
>> On Tue, Sep 06, 2022 at 03:30:44AM -0700, Mike Larkin wrote:
>>> On Sun, Sep 04, 2022 at 02:50:10PM +1000, Jonathan Gray wrote:
>>>> On Sat, Sep 03, 2022 at 05:33:01PM -0500, Scott Cheloha wrote:
>>>>> On Sat, Sep 03, 2022 at 10:37:31PM +1000, Jonathan Gray wrote:
>>>>>> On Sat, Sep 03, 2022 at 06:52:20AM -0500, Scott Cheloha wrote:
>>>>>>>> On Sep 3, 2022, at 02:22, Jonathan Gray <j...@jsg.id.au> wrote:
>>>>>>>> 
>>>>>>>> ???On Fri, Sep 02, 2022 at 06:00:25PM -0500, Scott Cheloha wrote:
>>>>>>>>> dv@ suggested coming to the list to request testing for the pvclock(4)
>>>>>>>>> driver.  Attached is a patch that corrects several bugs.  Most of
>>>>>>>>> these changes will only matter in the non-TSC_STABLE case on a
>>>>>>>>> multiprocessor VM.
>>>>>>>>> 
>>>>>>>>> Ideally, nothing should break.
>>>>>>>>> 
>>>>>>>>> - pvclock yields a 64-bit value.  The BSD timecounter layer can only
>>>>>>>>> use the lower 32 bits, but internally we need to track the full
>>>>>>>>> 64-bit value to allow comparisons with the full value in the
>>>>>>>>> non-TSC_STABLE case.  So make pvclock_lastcount a 64-bit quantity.
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), move rdtsc() up into the lockless read
>>>>>>>>> loop to get a more accurate timestamp.
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), use rdtsc_lfence(), not rdtsc().
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), check that our TSC value doesn't predate
>>>>>>>>> ti->ti_tsc_timestamp, otherwise we will produce an enormous value.
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), update pvclock_lastcount in the
>>>>>>>>> non-TSC_STABLE case with more care.  On amd64 we can do this with an
>>>>>>>>> atomic_cas_ulong(9) loop because u_long is 64 bits.  On i386 we need
>>>>>>>>> to introduce a mutex to protect our comparison and read/write.
>>>>>>>> 
>>>>>>>> i386 has cmpxchg8b, no need to disable interrupts
>>>>>>>> the ifdefs seem excessive
>>>>>>> 
>>>>>>> How do I make use of CMPXCHG8B on i386
>>>>>>> in this context?
>>>>>>> 
>>>>>>> atomic_cas_ulong(9) is a 32-bit CAS on
>>>>>>> i386.
>>>>>> 
>>>>>> static inline uint64_t
>>>>>> atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
>>>>>> {
>>>>>>  return __sync_val_compare_and_swap(p, o, n);
>>>>>> }
>>>>>> 
>>>>>> Or md atomic.h files could have an equivalent.
>>>>>> Not possible on all 32-bit archs.
>>>>>> 
>>>>>>> 
>>>>>>> We can't use FP registers in the kernel, no?
>>>>>> 
>>>>>> What do FP registers have to do with it?
>>>>>> 
>>>>>>> 
>>>>>>> Am I missing some other avenue?
>>>>>> 
>>>>>> There is no rdtsc_lfence() on i386.  Initial diff doesn't build.
>>>>> 
>>>>> LFENCE is an SSE2 extension.  As is MFENCE.  I don't think I can just
>>>>> drop rdtsc_lfence() into cpufunc.h and proceed without causing some
>>>>> kind of fault on an older CPU.
>>>>> 
>>>>> What are my options on a 586-class CPU for forcing RDTSC to complete
>>>>> before later instructions?
>>>> 
>>>> "3.3.2. Serializing Operations
>>>> After executing certain instructions the Pentium processor serializes
>>>> instruction execution. This means that any modifications to flags,
>>>> registers, and memory for previous instructions are completed before
>>>> the next instruction is fetched and executed. The prefetch queue
>>>> is flushed as a result of serializing operations.
>>>> 
>>>> The Pentium processor serializes instruction execution after executing
>>>> one of the following instructions: Move to Special Register (except
>>>> CRO), INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD,
>>>> CPUID, RSM and WRMSR."
>>>> 
>>>> from:
>>>> Pentium Processor User's Manual
>>>> Volume 1: Pentium Processor Data Book
>>>> Order Number 241428
>>>> 
>>>> http://bitsavers.org/components/intel/pentium/1993_Intel_Pentium_Processor_Users_Manual_Volume_1.pdf
>>>> 
>>>> So it could be rdtsc ; cpuid.
>>>> lfence; rdtsc should still be preferred.
>>>> 
>>>> It could be tested during boot and set a function pointer.
>>>> Or the codepatch bits could be used.
>>>> 
>>>> In the specific case of pvclock, can it be assumed that the host
>>>> has hardware virt and would then have lfence?
>>>> 
>>> 
>>> I think this is a fair assumption.
>> 
>> Okay, so how about we use rdtsc_lfence() in pvclock_get_timecount()
>> and, just to be safe, check for SSE2 during pvclock_match().
>> 
>> If in the future someone wants pvclock(4) support on guests without
>> SSE2 support (if any exist) we can deal with it at that time.
>> 
>> Updated patch:
>> 
>> - Add atomic_cas_64() to amd64/include/atomic.h.
>> 
>>  (NetBSD calls it atomic_cas_64(), so this is consistent with their
>>  atomic function naming scheme.)
>> 
>>  I am unsure whether it would be better to just do this:
>> 
>> #define atomic_cas_64(_p, _e, _n) _atomic_cas_ulong((unsigned long *)(_p), 
>> (_e), (_n))
>> 
>>  If I don't cast the pointer, the compiler complains, so I added
>>  _atomic_cas_ullong().
>> 
>>  What is preferred here?
>> 
>> - Add atomic_cas_64() to i386/include/atomic.h.  Implemented with the
>>  __sync_val_compare_and_swap() builtin.
>> 
>> - Add rdtsc_lfence() to i386/include/cpufunc.h.
>> 
>> - In pvclock.c, make pvclock_lastcount a volatile 64-bit value to
>>  fix the non-PVCLOCK_FLAG_TSC_STABLE case.
>> 
>> - In pvclock_match(), check for SSE2 support in ci_feature_flags.
>>  If we don't have it, we don't have LFENCE and we can't match.
>> 
>> - In pvclock_get_timecount(), do RDTSC as early as possible in the
>>  lockless read loop to get a better timestamp.
>> 
>> - In pvclock_get_timecount(), use rdtsc_lfence() instead of rdtsc().
>> 
>> - In pvclock_get_timecount(), check whether our TSC lags
>>  ti_tsc_timestamp to make sure so we don't produce an enormous,
>>  invalid delta.  If we're lagging, set delta to zero.
>> 
>> - In pvclock_get_timecount(), fix the non-PVCLOCK_FLAG_TSC_STABLE
>>  case.
>> 
>>  Because we could be in pvclock_get_timecount() with other vCPUs, we
>>  need to read, compare, and (maybe) replace pvclock_lastcount in an
>>  atomic CAS loop.
>> 
>>  The only thing that I'm still unsure about is whether this:
>> 
>> volatile uint64_t pvclock_lastcount;
>> 
>> void
>> foo(void)
>> {
>>      uint64_t last;
>> 
>>      last = pvclock_lastcount;       /* atomic on i386? */
>> }
>> 
>>  is a safe, atomic 8-byte load on i386 on all the CPUs we support,
>>  i.e. 586 and up.
>> 
>>  Can someone confirm this?  I didn't know you could do this with
>>  32-bit registers.
>> 
> 
> You could compile this and then objdump -D it and see for yourself...

I can't make heads or tails of it.  Please explain what I am looking
at and why it is, or is not, atomic.

Reply via email to