On 08/29/2016 11:06 AM, Jan Beulich wrote:
>>>> On 26.08.16 at 17:44, <joao.m.mart...@oracle.com> wrote:
>> On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>>>> On 24.08.16 at 14:43, <joao.m.mart...@oracle.com> wrote:
>>>> This patch proposes relying on host TSC synchronization and
>>>> passthrough to the guest, when running on a TSC-safe platform. On
>>>> time_calibration we retrieve the platform time in ns and the counter
>>>> read by the clocksource that was used to compute system time. We
>>>> introduce a new rendezvous function which doesn't require
>>>> synchronization between master and slave CPUs and just reads the
>>>> calibration_rendezvous struct and writes the stime and stamp down
>>>> to the cpu_calibration struct to be used later on. We can guarantee
>>>> that, on a platform with a constant and reliable TSC, the time read on
>>>> vCPU B right after A is bigger, independently of the vCPU calibration
>>>> error. Since pvclock time infos are monotonic as seen by any vCPU, we set
>>>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of the vDSO on Linux.
>>>> IIUC, this is similar to how it's implemented on KVM.
>>>
>>> Without any tools side change, how is it guaranteed that a guest
>>> which observed the stable bit won't get migrated to a host not
>>> providing that guarantee?
>> Do you want to prevent migration in such cases? The worst that can happen is
>> that the guest might need to fall back to a system call if this bit is 0, and
>> would keep doing so for as long as the bit is 0.
> 
> Whether migration needs preventing I'm not sure; all I was trying
> to indicate is that there seem to be pieces missing wrt migration.
> As to the guest falling back to a system call - are guest kernels and
> (as far as affected) applications required to cope with the flag
> changing from 1 to 0 behind their back?
It's expected they cope with this bit changing AFAIK. The vdso code (i.e.
applications) always checks this bit on every read to decide whether to fall
back to a system call. The same goes for the pvclock code in the guest kernel,
in both Linux and FreeBSD, which checks it on every read to decide whether to
skip the monotonicity checks.
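
To illustrate the reader side, here is a minimal sketch of that check (not the
actual Linux or FreeBSD code). The struct is a simplified stand-in for the
pvclock time info, pvclock_fast_read() is a made-up name, and the version/retry
loop and memory barriers that a real reader needs are left out:

```c
#include <stdbool.h>
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/* Simplified stand-in for the per-vCPU pvclock time info (assumption). */
struct pvclock_info {
    uint64_t tsc_timestamp;      /* TSC value at the last update */
    uint64_t system_time;        /* system time in ns at the last update */
    uint32_t tsc_to_system_mul;  /* scale factor: multiply, then >> 32 */
    int8_t   tsc_shift;          /* pre-scale shift, may be negative */
    uint8_t  flags;
};

/*
 * Hypothetical fast path: returns false when the stable bit is clear, in
 * which case the caller is expected to fall back to the slow path.
 */
bool pvclock_fast_read(const struct pvclock_info *info, uint64_t tsc,
                       uint64_t *ns)
{
    uint64_t delta;

    /* The flag is looked at on every single read. */
    if ( !(info->flags & PVCLOCK_TSC_STABLE_BIT) )
        return false;

    delta = tsc - info->tsc_timestamp;
    if ( info->tsc_shift >= 0 )
        delta <<= info->tsc_shift;
    else
        delta >>= -info->tsc_shift;

    /* (delta * mul) >> 32, split so the low-part product fits in 64 bits. */
    *ns = info->system_time +
          (delta >> 32) * info->tsc_to_system_mul +
          (((delta & 0xffffffffu) * info->tsc_to_system_mul) >> 32);

    return true;
}
```

A caller that gets false back would take the system call path, and since the
flag is re-checked on every read, a 1 -> 0 transition is simply picked up on
the next read.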

>>>>  {
>>>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>>>  
>>>> -    c->local_tsc    = rdtsc_ordered();
>>>> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
>>>> +    if ( master_tsc )
>>>> +    {
>>>> +        c->local_tsc    = r->master_tsc_stamp;
>>>
>>> Doesn't this require the TSCs to run in perfect sync (not even off
>>> wrt each other by a single cycle)? Is such even possible on multi
>>> socket systems? I.e. how would multiple chips get into such a
>>> mode in the first place, considering their TSCs can't possibly start
>>> ticking at exactly the same (perhaps even sub-)nanosecond?
>> They are required to be in sync across sockets, otherwise this wouldn't work.
> 
> "In sync" may mean two things: Ticking at exactly the same rate, or
> (more strict) holding the exact same values at all times.
I meant the more strict one.

> 
>> Invariant TSC only refers to cores in a package, but multi-socket is up to
>> board vendors (no manuals mention this guarantee across sockets). That's one
>> of the reasons TSC is such a burden :(
>>
>> Looking at datasheets (on the oldest processor I was testing this on) it
>> mentions this note:
>>
>> "In order to ensure Timestamp Counter (TSC) synchronization across sockets
>> in multi-socket systems, the RESET# deassertion edge should arrive at the
>> same BCLK rising edge at both sockets and should meet the Tsu and Th
>> requirement of 600ps relative to BCLK, as outlined in Table 2-26.".
> 
> Hmm, a dual socket system is certainly still one of the easier ones to
> deal with. 600ps means 18cm difference in signaling paths, which on
> larger systems (and namely ones composed of mostly independent
> nodes) I could easily see getting exceeded. That can certainly be
> compensated (e.g. by deasserting RESET# at different times for
> different sockets), but I'd then still question the accuracy.
Interesting, good point. FWIW the Linux code doesn't deem multi-node systems as
TSC invariant/reliable.
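
For reference, the 18cm figure corresponds to the distance light covers in
600ps in vacuum. A quick sketch of the arithmetic (the function name is made
up; real board traces propagate slower than c, so the physical length budget
would be shorter still):

```c
/* Speed of light in vacuum, expressed in cm per picosecond. */
#define C_CM_PER_PS 0.0299792458

/* Distance (cm) a signal moving at c covers in the given time (ps). */
double skew_budget_cm(double ps)
{
    return ps * C_CM_PER_PS;
}
```

skew_budget_cm(600.0) comes out just under 18.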

> 
>> [0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
>> http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf
>>
>> The BCLK looks to be the global reference clock shared across sockets IIUC,
>> used in the PLLs in the individual packages (to generate the signal where the
>> TSC is extrapolated from). (Please read it with a grain of salt, as I may be
>> misreading these datasheets.) But if it was a box with TSC skewed among
>> sockets, wouldn't we see that at boot time in the tsc warp test? Or maybe the
>> TSC sync check isn't potentially fast enough to catch any oddities?
> 
> That's my main fear: The check can't possibly determine whether TSCs
> are in perfect sync, it can only check an approximation. 
Indeed, and as we add more CPUs, the tsc reliability check will significantly
slow down, therefore minimizing this approximation, unless there's a significant
skew.
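
For context, a rough sketch of what such a warp check boils down to (loosely
modelled on the usual lock-and-compare style of warp test, not the actual Xen
or Linux code). In the real thing each CPU takes a lock, reads its TSC and
compares it against the last value any CPU recorded; here the counter value is
passed in so the logic can be followed without real hardware, and the struct
and function names are made up:

```c
#include <stdint.h>

/* Hypothetical shared state; the real check protects this with a lock. */
struct warp_state {
    uint64_t last_tsc;      /* most recent value recorded by any CPU */
    uint64_t max_warp;      /* largest backwards jump seen so far */
    unsigned int nr_warps;  /* how many backwards jumps were seen */
};

/*
 * Record one sample. "now" stands in for a TSC read taken by the calling
 * CPU while holding the lock.
 */
void warp_check_sample(struct warp_state *s, uint64_t now)
{
    uint64_t prev = s->last_tsc;

    s->last_tsc = now;

    /* A value older than the previous CPU's read means time went back. */
    if ( prev > now )
    {
        uint64_t warp = prev - now;

        if ( warp > s->max_warp )
            s->max_warp = warp;
        s->nr_warps++;
    }
}
```

A skew smaller than the time it takes to hand the lock from one CPU to the next
never shows up as prev > now, which is why the check can only approximate.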

> Perhaps even
> worse than the multi-node consideration here is hyper-threading, as
> that makes it fundamentally impossible that all threads within one core
> execute the same operation at exactly the same time. Not to speak of
> the various odd cache effects which I did observe while doing the
> measurements for my series (e.g. the second thread speculating the
> TSC reads much farther than the primary ones, presumably because
> the primary ones first needed to get the I-cache populated).
Hmmm, not sure how we could cope with TSC HT issues. In this patch, we
propagate TSC reads from the platform timer on CPU 0 into the other CPUs, so it
would probably be non-visible as there aren't TSC reads being done on multiple
threads at approximately the same time?

>> Our docs
>> (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to mention
>> something along these lines on multi-socket systems. And Linux tsc code seems
>> to assume that Intel boxes have synchronized TSCs across sockets [1] and that
>> the exception cases should mark tsc=skewed (we also have such a parameter).
>>
>> [1] arch/x86/kernel/tsc.c#L1094
> 
> Referring back to what I've said above: Does this mean exact same
> tick rate, or exact same values?
Here I also meant the invariant condition, i.e. exact same values.

> 
>> As reassurance I've been running tests for days (currently for almost a week
>> on a 2-socket system) and I'll keep running them to see if they catch any
>> issues or time going backwards. I could also run on the biggest boxes we have,
>> with 8 sockets. But still it would represent only a tiny fraction of what x86
>> has available these days.
> 
> A truly interesting case would be, as mentioned, a box composed of
> individual nodes. Not sure whether that 8-socket one you mention
> would meet that.
It's not a multi-node machine - but within single-node machines it's
potentially the worst case scenario.

>> Other than the things above I am not sure how to go about this :( Should we
>> start adjusting the TSCs if we find disparities or skew is observed in the
>> long run? Or allow only TSCs on vCPUs of the same package to expose this
>> flag? Hmm, what's your take on this? Appreciate your feedback.
> 
> At least as an initial approach requiring affinities to be limited to a
> single socket would seem like a good compromise, provided HT
> aspects don't have a bad effect (in which case also excluding HT
> may be required). I'd also be fine with command line options
> allowing to further relax that, but a simple "clocksource=tsc"
> should imo result in a setup which from all we can tell will work as
> intended.
Sounds reasonable, so unless command line options are specified we disallow TSC
to be the clocksource on multi-socket systems. WRT command line options, how
about extending the "tsc" parameter to accept another possible value such as
"global" or "socketsafe"? Current values are "unstable" and "skewed".
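
Something along these lines is what I have in mind, as a sketch only: the enum,
the function names and the "global" value are all hypothetical (nothing here
exists in the tree), and I am assuming "unstable" and "skewed" both rule out
TSC as clocksource:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of the "tsc" command line parameter. */
enum tsc_opt {
    TSC_OPT_DEFAULT,   /* parameter not given */
    TSC_OPT_UNSTABLE,  /* existing value */
    TSC_OPT_SKEWED,    /* existing value */
    TSC_OPT_GLOBAL,    /* proposed value: trust TSC sync across sockets */
    TSC_OPT_INVALID,
};

/* Map the option string to the enum; NULL or "" means "not given". */
enum tsc_opt parse_tsc_opt(const char *s)
{
    if ( s == NULL || *s == '\0' )
        return TSC_OPT_DEFAULT;
    if ( !strcmp(s, "unstable") )
        return TSC_OPT_UNSTABLE;
    if ( !strcmp(s, "skewed") )
        return TSC_OPT_SKEWED;
    if ( !strcmp(s, "global") )
        return TSC_OPT_GLOBAL;
    return TSC_OPT_INVALID;
}

/* Assumed policy: single socket by default, any socket count if "global". */
bool tsc_clocksource_allowed(enum tsc_opt opt, unsigned int nr_sockets)
{
    switch ( opt )
    {
    case TSC_OPT_GLOBAL:
        return true;
    case TSC_OPT_DEFAULT:
        return nr_sockets <= 1;
    default:
        return false;
    }
}
```

With that, a plain "clocksource=tsc" on a 2-socket box would be refused unless
the admin explicitly opts in.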

Thanks for all the comments so far!
Joao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel