On 08/25/2016 11:37 AM, Jan Beulich wrote:
>>>> On 24.08.16 at 14:43, <joao.m.mart...@oracle.com> wrote:
>> This patch proposes relying on host TSC synchronization and
>> passthrough to the guest, when running on a TSC-safe platform. On
>> time_calibration we retrieve the platform time in ns and the counter
>> read by the clocksource that was used to compute system time. We
>> introduce a new rendezous function which doesn't require
>> synchronization between master and slave CPUS and just reads
>> calibration_rendezvous struct and writes it down the stime and stamp
>> to the cpu_calibration struct to be used later on. We can guarantee that
>> on a platform with a constant and reliable TSC, that the time read on
>> vcpu B right after A is bigger independently of the VCPU calibration
>> error. Since pvclock time infos are monotonic as seen by any vCPU set
>> PVCLOCK_TSC_STABLE_BIT, which then enables usage of VDSO on Linux.
>> IIUC, this is similar to how it's implemented on KVM.
> 
> Without any tools side change, how is it guaranteed that a guest
> which observed the stable bit won't get migrated to a host not
> providing that guarantee?
Do you want to prevent migration in such cases? The worst that can happen is 
that the
guest might need to fallback to a system call if this bit is 0 and would keep 
doing
so if the bit is 0.

>> Changes since v2:
>>  - Add XEN_ prefix to pvclock flags.
>>  - Adapter time_calibration_rendezvous_tail to have the case of setting 
>> master
>>  tsc/stime and use it for the nop_rendezvous.
>>  - Removed hotplug CPU option that was added in v1
>>  - Prevent online of CPUs when clocksource is tsc.
> 
> Some of the above listed changes don't seem to belong here, but
> rather in one of the earlier patches.
OK - as mentioned in patch two this was intended. Will merge these changes into 
patch 2.

> 
>> --- a/xen/arch/x86/platform_hypercall.c
>> +++ b/xen/arch/x86/platform_hypercall.c
>> @@ -631,7 +631,8 @@ ret_t 
>> do_platform_op(XEN_GUEST_HANDLE_PARAM(xen_platform_op_t) u_xenpf_op)
>>          if ( ret )
>>              break;
>>  
>> -        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) )
>> +        if ( cpu >= nr_cpu_ids || !cpu_present(cpu) ||
>> +             host_tsc_is_clocksource() )
> 
> I have to admit that I consider the "host" here confusing: What other
> TSCs could one think of on x86? Maybe clocksource_is_tsc()?
Hmm, didn't thought of any other TSC, I was just merely follow the existent 
naming
style in the header ("host_tsc_is_safe()"). I am also fine with 
clocksource_is_tsc().

> 
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -480,6 +480,13 @@ uint64_t ns_to_acpi_pm_tick(uint64_t ns)
>>  
>>  static s64 __init init_tsctimer(struct platform_timesource *pts)
>>  {
>> +    if ( nr_cpu_ids != num_present_cpus() )
>> +    {
>> +        printk(XENLOG_INFO "TSC: CPU Hotplug intended,"
>> +                           "not using TSC as clocksource\n");
> 
> Please don't split log messages across lines. Keep the XENLOG_INFO
> on one line, and then the whole literal on the next.
> 
Fixed.

> Also you/we will need to take some measures to avoid this triggering
> on systems where extra MADT entries exist just as dummies (perhaps
> to ease firmware implementation, as was the case prior to commit
> 0875433389 ["hvmloader: limit CPUs exposed to guests"] with our
> own ACPI tables we present to HVM guests).
OK. I think my laptop might be one of those but I need to
check (at least do need to adjust maxcpus= to use clocksource=tsc, but that's 
the
only case. Other boxes I don't need to)

> 
>> @@ -1328,12 +1337,22 @@ struct calibration_rendezvous {
>>  };
>>  
>>  static void
>> -time_calibration_rendezvous_tail(const struct calibration_rendezvous *r)
>> +time_calibration_rendezvous_tail(const struct calibration_rendezvous *r,
>> +                                 bool_t master_tsc)
> 
> Please use plain"bool" in new code.
OK.

> And then I'm not convinced this
> refactoring is better than simply having the new rendezvous function
> not call time_calibration_rendezvous_tail().
I will move it to the nop_rendezvous.

> 
>>  {
>>      struct cpu_time_stamp *c = &this_cpu(cpu_calibration);
>>  
>> -    c->local_tsc    = rdtsc_ordered();
>> -    c->local_stime  = get_s_time_fixed(c->local_tsc);
>> +    if ( master_tsc )
>> +    {
>> +        c->local_tsc    = r->master_tsc_stamp;
> 
> Doesn't this require the TSCs to run in perfect sync (not even off
> wrt each other by a single cycle)? Is such even possible on multi
> socket systems? I.e. how would multiple chips get into such a
> mode in the first place, considering their TSCs can't possibly start
> ticking at exactly the same (perhaps even sub-)nanosecond?
They do require to be in sync with multi-sockets, otherwise this wouldn't work.
Invariant TSC only refers to cores in a package, but multi-socket is up to board
vendors (no manuals mention this guarantee across sockets). That one of the 
reasons
TSC is such a burden :(

Looking at datasheets (on the oldest processor I was testing this) it mentions 
this note:

"In order In order to ensure Timestamp Counter (TSC) synchronization across 
sockets
in multi-socket systems, the RESET# deassertion edge should arrive at the same 
BCLK
rising edge at both sockets and should meet the Tsu and Th requirement of 600ps
relative to BCLK, as outlined in Table 2-26.".

[0] Intel Xeon Processor 5600 Series Datasheet Vol 1, Page 63,
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-5600-vol-1-datasheet.pdf

The BCLK looks to be the global reference clock shared across sockets IIUC used 
in
the PLLs in the individual packages (to generate the signal where the TSC is
extrapolated from). ( Please read it with a grain of salt, as I may be doing 
wrong
readings of these datasheets ). But If it was a box with TSC skewed among 
sockets,
wouldn't we see that at boot time in the tsc warp test? Or maybe TSC sync check 
isn't
potentially fast enough to catch any oddities? Our docs
(https://xenbits.xen.org/docs/unstable/misc/tscmode.txt) also seem to mention
something along these lines on multi-socket systems. And Linux tsc code seems to
assume that Intel boxes have synchronized TSCs across sockets [1] and that the
exceptions cases should mark tsc=skewed (we also have such parameter).

[1] arch/x86/kernel/tsc.c#L1094

As reassurance I've been running tests for days long (currently in almost a 
week on
2-socket system) and I'll keep running to see if it catches any issues or time 
going
backwards. Could also run in the biggest boxes we have with 8 sockets. But 
still it
would represent only a tiny fraction of what x86 has available these days.

Other than the things above I am not sure how to go about this :( Should we 
start
adjusting the TSCs if we find disparities or skew is observed on the long run? 
Or
allow only TSCs on vCPUS of the same package to expose this flag? Hmm, what's 
your
take on this? Appreciate your feedback.

>> +static void time_calibration_nop_rendezvous(void *rv)
>> +{
>> +    const struct calibration_rendezvous *r = rv;
>> +
>> +    time_calibration_rendezvous_tail(r, true);
>>  }
> 
> I don't see what you need the local variable for, unless you
> stopped calling time_calibration_rendezvous_tail() as suggested
> above.
I like the one above as you suggested.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

Reply via email to