On 08/25/2016 11:37 AM, Jan Beulich wrote: >>>> On 24.08.16 at 14:43, <joao.m.mart...@oracle.com> wrote: >> This patch proposes relying on host TSC synchronization and >> passthrough to the guest, when running on a TSC-safe platform. On >> time_calibration we retrieve the platform time in ns and the counter >> read by the clocksource that was used to compute system time. We >> introduce a new rendezous function which doesn't require >> synchronization between master and slave CPUS and just reads >> calibration_rendezvous struct and writes it down the stime and stamp >> to the cpu_calibration struct to be used later on. We can guarantee that >> on a platform with a constant and reliable TSC, that the time read on >> vcpu B right after A is bigger independently of the VCPU calibration >> error. Since pvclock time infos are monotonic as seen by any vCPU set >> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux. >> IIUC, this is similar to how it's implemented on KVM. > > Without any tools side change, how is it guaranteed that a guest > which observed the stable bit won't get migrated to a host not > providing that guarantee? Do you want to prevent migration in such cases? The worst that can happen is that the guest might need to fallback to a system call if this bit is 0 and would keep doing so if the bit is 0.
>> Changes since v2: >> - Add XEN_ prefix to pvclock flags. >> - Adapter time_calibration_rendezvous_tail to have the case of setting >> master >> tsc/stime and use it for the nop_rendezvous. >> - Removed hotplug CPU option that was added in v1 >> - Prevent online of CPUs when clocksource is tsc. > > Some of the above listed changes don't seem to belong here, but > rather in one of the earlier patches. OK - as mentioned in patch two this was intended. Will merge these changes into patch 2. > >> --- a/xen/arch/x86/platform_hypercall.c >> +++ b/xen/arch/x86/platform_hypercall.c >> @@ -631,7 +631,8 @@ ret_t >> do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op) >> if ( ret ) >> break; >> >> - if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ) >> + if ( cpu >= nr_cpu_ids || !cpu_present(cpu) || >> + host_tsc_is_clocksource() ) > > I have to admit that I consider the "host" here confusing: What other > TSCs could one think of on x86? Maybe clocksource_is_tsc()? Hmm, didn't thought of any other TSC, I was just merely follow the existent naming style in the header ("host_tsc_is_safe()"). I am also fine with clocksource_is_tsc(). > >> --- a/xen/arch/x86/time.c >> +++ b/xen/arch/x86/time.c >> @@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns) >> >> static s64 __init init_tsctimer(struct platform_timesource *pts) >> { >> + if ( nr_cpu_ids != num_present_cpus() ) >> + { >> + printk(XENLOG_INFO "TSC: CPU Hotplug intended," >> + "not using TSC as clocksource\n"); > > Please don't split log messages across lines. Keep the XENLOG_INFO > on one line, and then the whole literal on the next. > Fixed. > Also you/we will need to take some measures to avoid this triggering > on systems where extra MADT entries exist just as dummies (perhaps > to ease firmware implementation, as was the case prior to commit > 0875433389 ["hvmloader: limit CPUs exposed to guests"] with our > own ACPI tables we present to HVM guests). OK. I think my laptop might be one of those but I need to check (at least do need to adjust maxcpus= to use clocksource=tsc, but that's the only case. Other boxes I don't need to) > >> @@ -1328,12 +1337,22 @@ struct calibration_rendezvous { >> }; >> >> static void >> -time_calibration_rendezvous_tail(const struct calibration_rendezvous *r) >> +time_calibration_rendezvous_tail(const struct calibration_rendezvous *r, >> + bool_t master_tsc) > > Please use plain"bool" in new code. OK. > And then I'm not convinced this > refactoring is better than simply having the new rendezvous function > not call time_calibration_rendezvous_tail(). I will move it to the nop_rendezvous. > >> { >> struct cpu_time_stamp *c = &this_cpu(cpu_calibration); >> >> - c->local_tsc = rdtsc_ordered(); >> - c->local_stime = get_s_time_fixed(c->local_tsc); >> + if ( master_tsc ) >> + { >> + c->local_tsc = r->master_tsc_stamp; > > Doesn't this require the TSCs to run in perfect sync (not even off > wrt each other by a single cycle)? Is such even possible on multi > socket systems? I.e. how would multiple chips get into such a > mode in the first place, considering their TSCs can't possibly start > ticking at exactly the same (perhaps even sub-)nanosecond? They do require to be in sync with multi-sockets, otherwise this wouldn't work. Invariant TSC only refers to cores in a package, but multi-socket is up to board vendors (no manuals mention this guarantee across sockets). That one of the reasons TSC is such a burden :( Looking at datasheets (on the oldest processor I was testing this) it mentions this note: "In order In order to ensure Timestamp Counter (TSC) synchronization across sockets in multi-socket systems, the RESET# deassertion edge should arrive at the same BCLK rising edge at both sockets and should meet the Tsu and Th requirement of 600ps relative to BCLK, as outlined in Table 2-26.". [0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63, http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf The BCLK looks to be the global reference clock shared across sockets IIUC used in the PLLs in the individual packages (to generate the signal where the TSC is extrapolated from). ( Please read it with a grain of salt, as I may be doing wrong readings of these datasheets ). But If it was a box with TSC skewed among sockets, wouldn't we see that at boot time in the tsc warp test? Or maybe TSC sync check isn't potentially fast enough to catch any oddities? Our docs (https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to mention something along these lines on multi-socket systems. And Linux tsc code seems to assume that Intel boxes have synchronized TSCs across sockets [1] and that the exceptions cases should mark tsc=skewed (we also have such parameter). [1] arch/x86/kernel/tsc.c#L1094 As reassurance I've been running tests for days long (currently in almost a week on 2-socket system) and I'll keep running to see if it catches any issues or time going backwards. Could also run in the biggest boxes we have with 8 sockets. But still it would represent only a tiny fraction of what x86 has available these days. Other than the things above I am not sure how to go about this :( Should we start adjusting the TSCs if we find disparities or skew is observed on the long run? Or allow only TSCs on vCPUS of the same package to expose this flag? Hmm, what's your take on this? Appreciate your feedback. >> +static void time_calibration_nop_rendezvous(void *rv) >> +{ >> + const struct calibration_rendezvous *r = rv; >> + >> + time_calibration_rendezvous_tail(r, true); >> } > > I don't see what you need the local variable for, unless you > stopped calling time_calibration_rendezvous_tail() as suggested > above. I like the one above as you suggested. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel