On Thu, May 2, 2019 at 7:14 AM Gil Tene <g...@azul.com> wrote:

>
>
> Sent from my iPad
>
> On May 1, 2019, at 1:38 PM, dor laor <dor.l...@gmail.com> wrote:
>
> On Wed, May 1, 2019 at 9:58 AM Gil Tene <g...@azul.com> wrote:
>
>> There are many ways for RDTSC to be made "wrong" (as in non-monotonic
>> within a software thread, process, system, etc.) on systems, but AFAIK
>> "most" modern x86-64 bare metal systems can be set up for good clean,
>> monotonic system-wide TSC-ness. The hardware certainly has the ability to
>> keep those TSCs in sync (enough to not have detectable non-sync effects)
>> both within a socket and across multi-socket systems (when the hardware is
>> "built right"). The TSCs all get reset together and move together unless
>> interfered with...
>>
>> Two ways I've seen this go wrong even on modern hardware include:
>>
>> A) Some BIOSes resetting TSC on a single core or hyperthread on each
>> socket (usually thread 0 of core 0) for some strange reason during the boot
>> sequence. [I've conclusively shown this on some 4 socket Sandy Bridge
>> systems.] This leads different cores/hyperthreads to have vastly differing
>> TSC values, a gap that grows with every non-power-cycling reboot, with obvious
>> negative effects and screams from anyone relying on TSC consistency for
>> virtually any purpose.
>>
>> B) Hypervisors virtualizing TSC. Some hypervisors (notably at least some
>> versions of VMWare) will virtualize the TSC and "slew" the virtualized
>> value to avoid presenting guest OSs with huge jumps in TSC values when a
>> core was taken away for a "long" (i.e. many-msec) period of time. Instead,
>> the virtualized TSC will incrementally move forward in small jumps until it
>> catches up. The purpose of this appears to be to avoid triggering guest OS
>> panics in code that watches TSC for panic-timeouts and other sanity checks
>> (e.g. code in OS spinlocks). The effect of this "slewing" is obvious: TSC
>> values can easily jump backward, even within a single software thread.
>>
>
> A hypervisor wouldn't take the TSC backwards; it can slow the TSC, but not
> take it backward, unless it virtualizes the CPU bits for stable TSC
> differently. That happens, but I doubt VMware (and better hypervisors) take
> the TSC back.
>
>
> A hypervisor wouldn't take the TSC backwards within one vcore.
>
> But vcores are scheduled individually, which means that any slewing done
> to hide a long jump forward in the physical TSC in situations where a vcore
> was not actually running on a physical core for a “long enough” period of
> time is done individually within each vcore and its virtualized TSC.
> (synchronizing the virtualized TSC slewing across vcores would require
> either synchronizing their scheduling such that the entire VM would be
> either “on” or “off” cores at the same time, or making the virtualized
> TSC only tick forward in large quanta, or only when all vcores are
> actively running on physical cores, all of which would cause some other
> dramatic strangeness).
>
> Multiple vcores belonging to the same guest OS can (and usually will) end
> up running simultaneously on multiple real cores, which obviously means
> that during slewing periods they will be showing vastly differing
> virtualized TSC values (with gaps of 10s of msec) until the “slewing” is
> done. All it takes is a “lucky timing” context switch within the Guest OS,
> moving a thread from one vcore to another (for whichever of the many
> reasons the guest OS might decide to do that) for *your* program to observe
> the TSC “jumping backwards” by 10s of msec between one RDTSC execution and
> another.
>

It's the same issue as a physical machine with multiple sockets, where the
TSC isn't synced across those different sockets. The hypervisor keeps an
offset per unscheduled vcore and makes sure it is monotonic. Although we at
KVM considered slewing/speeding up the TSC on vcores, primarily for live
migration, we didn't do it in practice. One of my old team members wrote this
pretty good write-up (in 2011, but still relevant):
https://www.kernel.org/doc/Documentation/virtual/kvm/timekeeping.txt


>
>
>
>> The bottom line is that the TSC can be relied upon on bare metal (where
>> there is no hypervisor scheduling of guest OS cores) if the system is set
>> up right, but can do very wrong things otherwise. People who really care
>> about low-cost time measurement (like System.nanoTime()) can control their
>> systems to make this work and elect to rely on it (that's exactly what
>> Zing's -XX:+UseRdtsc flag is for), but it can be dangerous to rely on it by
>> default.
>>
>> On Tuesday, April 30, 2019 at 3:07:11 AM UTC-7, Ben Evans wrote:
>>>
>>> I'd assumed that the monotonicity of System.nanoTime() on modern
>>> systems was due to the OS compensating, rather than any changes at the
>>> hardware level. Is that not the case?
>>>
>>> In particular, Rust definitely still seems to think that their
>>> SystemTime (which appears to be backed directly by RDTSC) can be
>>> non-monotonic: https://doc.rust-lang.org/std/time/struct.SystemTime.html
>>>
>>> On Tue, 30 Apr 2019 at 07:50, dor laor <dor...@gmail.com> wrote:
>>> >
>>> > It might be because, in the past, many systems did not have a stable
>>> > rdtsc, so if the instruction is executed on different sockets it can
>>> > result in wrong answers and negative time. Today most systems do have a
>>> > stable TSC, and you can verify it from userspace/Java too.
>>> > I bet it's easy to google the reason
>>> >
>>> > On Mon, Apr 29, 2019 at 2:36 PM 'Carl Mastrangelo' via
>>> mechanical-sympathy <mechanica...@googlegroups.com> wrote:
>>> >>
>>> >> This may be a dumb question, but why (on Linux) is System.nanoTime()
>>> a call out to clock_gettime? It seems like it could be inlined by the
>>> JVM and stripped down to the rdtsc instruction. From my reading of the
>>> vDSO source for x86, the implementation is not that complex, and could be
>>> copied into Java.
>>> >>
>>> >> --
>>> >> You received this message because you are subscribed to the Google
>>> Groups "mechanical-sympathy" group.
>>> >> To unsubscribe from this group and stop receiving emails from it,
>>> send an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>> >> For more options, visit https://groups.google.com/d/optout.
>>> >
>>>
>>
