On Tue, Jul 2, 2019 at 6:35 AM Waldek Kozaczuk <jwkozac...@gmail.com> wrote:

> No, I have not been able to reproduce this issue for a year or so. It
> would typically happen in a nested virtualization scenario.
>
> My understanding was that for some reason the hypervisor (in this case
> qemu without KVM) would violate the promise of monotonic reads of the
> counter, and based on how I understood this part of the spec:
> "Note: On a 64-bit platform, if software attempts a 64 bit read of the
> 64-bit counter, software must be aware that some platforms may break the 64
> bit read into two 32 bit reads, and therefore the read may be inaccurate if
> the low 32 bits roll over between the high and low part reads."
> it would be enough to re-read the counter until the value is no longer less
> than the previous one (and typically this condition would be satisfied on
> the next read).
>

Whether the wrong counter value is less or more than the right one depends
on the order of the two reads:

1. If you read the low 32 bits and then the high 32, you can read 2^32-1
for the low part and then, by the time you read the high part, it has
already grown by 1. So overall the value you read is 2^32 too high.
2. If you read the high 32 bits first and then the low, you can see an
unincremented high part, and then when you read the low part it will have
already rolled over to 0. So overall the value you read is 2^32 too low.

But none of these theories explain the specific number 9*2^32 nanoseconds
we actually saw, and I never could explain this :-(
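
By the way, the usual way to tolerate such a split read is the classic
high-low-high re-read pattern. Just as a sketch (untested, and assuming the
32-bit mmio_getl()-style accessor), something like:

    static uint64_t read_hpet_counter_64(mmioaddr_t addr)
    {
        uint32_t hi, lo, hi2;
        do {
            hi  = mmio_getl(addr + HPET_COUNTER + 4); // high 32 bits
            lo  = mmio_getl(addr + HPET_COUNTER);     // low 32 bits
            hi2 = mmio_getl(addr + HPET_COUNTER + 4); // re-read the high part
            // If the high half changed, the low half rolled over between
            // the two reads and the (hi, lo) pair may be torn - retry.
        } while (hi != hi2);
        return (uint64_t(hi) << 32) | lo;
    }

This only protects a single read against tearing, though - it doesn't help
if the counter itself really jumps.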


> So the whole idea of my solution was to guarantee monotonic reads no
> matter what. But it looks like my understanding was wrong. Why do you
> think the main counter would return a wrong value (9*2^32 nanoseconds off)
> for a long period of time?
>

I wrote this before I understood your actual implementation. Most solutions
that guarantee monotonic reads return the same counter value for a long
time, until the real counter reaches the number we got stuck on. But your
solution actually seems to hang the CPU until the clock goes back to normal
- in our case it can hang for 38 seconds. Maybe that's better than
crashing, though? So maybe we can take such a fix as some sort of emergency
measure, with a big comment saying it's just an emergency fix. It would
have been better, though, to understand where this 38-second jump comes
from and avoid it in the first place, if we can.


>
> Anyway, based on my other observations, I think we have even deeper
> problems with hpet, possibly only in the nested virtualization scenario.
> Every time my host machine with VirtualBox goes to sleep, after wakeup the
> OSv clock is off by exactly the time the host was hibernated. It's as if
> QEMU running in VirtualBox never gets the updated clock and never reports
> it properly to the OSv guest. I wonder if we need to react to some ACPI
> wakeup event and reset the hpet clock in this case.
>

I think this is a separate issue - as I said, the other issue was about a
consistent 38-second jump, and I don't think that number is a coincidence.
I think when the host hibernates, it is fine for uptime() to not move,
but the wall clock needs to move. Currently we only calculate _wall using
the RTC when the hpet clock starts, so I think we need to recalculate _wall
on events such as the host hibernating or changing its clock. I don't know
how to detect that - I think ACPI has something called an "SCI" (system
control interrupt) which may or may not help after hibernation, but what
about ntp in the host (if you remember, we had the same issue in the KVM
clock)? Perhaps we should just periodically re-sync _wall, e.g., once every
second of uptime? Again, maybe the Linux driver can give you ideas of what
can or should be done.
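
To sketch what I mean by re-syncing (hypothetical and untested, and it
assumes we keep the rtc object around in a member, say _rtc, which the
current constructor doesn't do):

    void hpetclock::resync_wall()
    {
        // Recompute the wall-clock base so that _wall + uptime() matches
        // the RTC again after the host slept or stepped its clock.
        uint64_t up = mmio_getq(_addr + HPET_COUNTER) * _period;
        _wall = _rtc->wallclock_ns() - up;
    }

This could be called from a once-a-second periodic timer, or from whatever
ACPI wakeup notification we manage to hook.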


> On Fri, Jun 28, 2019 at 7:04 AM Waldemar Kozaczuk <jwkoz...@gmail.com>
>> wrote:
>>
>>> This patch hardens the implementation of hpet clock with 64-bit counter
>>> by enforcing that every read of the main counter is NOT less than
>>> the previous value.
>>>
>>> Fixes #382
>>>
>>
>> Hi,
>>
>> One of the ideas raised in #382 was indeed that such a solution may
>> help, but I'm not convinced at all that it does.
>>
>> First, we still have no good explanation (at least not one that I heard)
>> of why we should be off by exactly 9*2^32 nanoseconds (38.65 seconds) and
>> not something else.
>>
>> Second, my suspicion (see
>> https://github.com/cloudius-systems/osv/issues/382#issuecomment-364036208)
>> was that the clock jumps *forward* 38.65 seconds in one call, and then
>> goes back 38.65 seconds. The monotonization you're doing here will allow
>> the clock to jump forward, but then remain stuck for 38.65 seconds. I'm
>> not sure what bad side-effects this could have (would context switching
>> get stuck for almost 40 seconds? What else will break?), but it sounds bad.
>> (edit: on second review I think your code will actually hang in a busy
>> loop for 38.65 seconds, which is even worse? ;-) )
>>
>> Another issue is that the problem in #382 appears to be a *local* clock
>> problem (it happens on a single CPU), while what this patch is supposedly
>> fixing is *global* clock monotonization (if one CPU returned some value,
>> we don't want other CPUs to return earlier values), which I'm not sure we
>> need in the context of #382.
>>
>> Were you ever able to reproduce #382 and see what your patch does to it?
>>
>>
>>> Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
>>> ---
>>>  drivers/hpet.cc | 18 ++++++++++++++++--
>>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/hpet.cc b/drivers/hpet.cc
>>> index 22afae7d..ba9ed3bc 100644
>>> --- a/drivers/hpet.cc
>>> +++ b/drivers/hpet.cc
>>> @@ -29,6 +29,7 @@ protected:
>>>      mmioaddr_t _addr;
>>>      uint64_t _wall;
>>>      uint64_t _period;
>>> +    std::atomic<s64> _last_read_counter;
>>>  };
>>>
>>>  #define HPET_COUNTER    0x0f0
>>> @@ -55,11 +56,23 @@ public:
>>>      hpet_64bit_clock(mmioaddr_t hpet_mmio_address) : hpetclock(hpet_mmio_address) {}
>>>  protected:
>>>      virtual s64 time() override __attribute__((no_instrument_function)) {
>>> -        return _wall + mmio_getq(_addr + HPET_COUNTER) * _period;;
>>> +        return _wall + fetch_counter() * _period;
>>>      }
>>>
>>>      virtual s64 uptime() override __attribute__((no_instrument_function)) {
>>> -        return mmio_getq(_addr + HPET_COUNTER) * _period;;
>>> +        return fetch_counter() * _period;
>>> +    }
>>> +private:
>>> +    s64 fetch_counter() {
>>> +        auto last_valid_counter = _last_read_counter.load();
>>>
>>
>> The default CST memory ordering (against other memory variables) isn't
>> needed here, you just need the atomicity of this counter itself.
>> So you can do load(std::memory_order_relaxed) here, and below store()
>> with relaxed as well.
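>>
>> I.e., the two lines would become something like (just a sketch of what I
>> mean):
>>
>>     auto last_valid_counter = _last_read_counter.load(std::memory_order_relaxed);
>>     ...
>>     _last_read_counter.store(counter, std::memory_order_relaxed);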
>>
>>> +
>>> +        s64 counter;
>>> +        do {
>>> +             counter = mmio_getq(_addr + HPET_COUNTER);
>>> +        } while (counter < last_valid_counter);
>>>
>>
>> Just now, on second review, I realized you aren't doing monotonization
>> here (returning last_valid_counter if we're suddenly below it) but hanging
>> the CPU until the time gets back to what's expected - even if that takes
>> 38 seconds. Is this a good idea?
>>
>>> +
>>> +        _last_read_counter.store(counter);
>>>
>>
>> I think a store() may not be the right thing to do here. The problem is
>> that several CPUs may be doing this code in parallel, and you can have the
>> following time sequence:
>>
>>          CPU 1                          CPU 2
>>
>>          counter = 100
>>                                         counter = 101
>>                                         _last_read_counter = 101
>>                                         return 101
>>          _last_read_counter = 100
>>          return 100
>>
>> So the counter will not be globally monotonic, and if there's a local bug
>> (CPU2 suddenly goes back in time) it can also fail to be locally monotonic
>> (CPU2 can return time 100 after it already returned 101).
>>
>> You can see in arch/x64/pvclock-abi.cc how the KVM clock does something
>> similar with a compare_exchange loop instead of a simple store().
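>>
>> Roughly, something along these lines (just a sketch, untested, and not the
>> pvclock code verbatim):
>>
>>     s64 fetch_counter() {
>>         s64 counter = mmio_getq(_addr + HPET_COUNTER);
>>         s64 last = _last_read_counter.load(std::memory_order_relaxed);
>>         do {
>>             if (counter <= last) {
>>                 // Someone already returned a later time - never go back.
>>                 return last;
>>             }
>>             // Try to publish the newer reading; on failure 'last' is
>>             // reloaded with the currently stored value and we re-check.
>>         } while (!_last_read_counter.compare_exchange_weak(
>>                      last, counter, std::memory_order_relaxed));
>>         return counter;
>>     }
>>
>> This also avoids hanging the CPU: instead of spinning until the counter
>> catches up, it just returns the newest value any CPU has already seen.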
>>
>>
>>
>>> +        return counter;
>>>      }
>>>  };
>>>
>>> @@ -99,6 +112,7 @@ hpetclock::hpetclock(mmioaddr_t hpet_mmio_address)
>>>          mmio_setl(_addr + HPET_COUNTER, 0);
>>>          mmio_setl(_addr + HPET_COUNTER + 4, 0);
>>>
>>> +        _last_read_counter.store(0);
>>>          _wall = r->wallclock_ns();
>>>
>>>          // We got them all, now we restart the HPET.
>>> --
>>> 2.20.1
>>>
