2017-02-28 0:11 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>: > > > On 27/02/2017 16:59, Peter Zijlstra wrote: >> OK, so if !KVM_FEATURE_CLOCKSOURCE_STABLE_BIT nothing is stable, but if >> it is set, TSC might still not be stable, but kvm_clock_read() is. >> >>> However, if kvmclock is stable, we know that the sched clock is stable. >> Right, so the problem is that we only ever want to allow marking >> unstable -- once its found unstable, for whatever reason, we should >> never allow going stable. The corollary of this proposition is that we >> must start out assuming it will become stable. And to avoid actually >> using unstable TSC we do a 3 state bringup: >> >> 1) sched_clock_running = 0, __stable_early = 1, __stable = 0 >> 2) sched_clock_running = 1 (__stable is effective, iow, we run unstable) >> 3) sched_clock_running = 2 (__stable <- __stable_early) >> >> 2) happens 'early' but is 'safe'. >> 3) happens 'late', after we've brought up SMP and probed TSC >> >> Between there, we should have detected the most common TSC wreckage and >> made sure to not then switch to 'stable' at 3. >> >> Now the problem appears to be that we assume sched_clock will use RDTSC >> (native_sched_clock) while sched_clock is a paravirt op. >> >> Now, I've not yet figured out the ordering between when we set >> pv_time_ops.sched_clock and when we do the 'normal' TSC init stuff. > > I think the ordering is fine: > > - pv_time_ops.sched_clock is set here: > > start_kernel (init/main.c line 509) > setup_arch > kvmclock_init > kvm_sched_clock_init > > - TSC can be declared unstable only after this: > > start_kernel (init/main.c line 628) > late_time_init > tsc_init > > So by the time the tsc_cs_mark_unstable or mark_tsc_unstable can call > clear_sched_clock_stable, pv_time_ops.sched_clock has been set. > >> But it appears to me, we should not be calling >> clear_sched_clock_stable() on TSC bits when we don't end up using >> native_sched_clock(). > > Yes, this makes sense.
How about something like below, we delay sched clock stable check to tsc_init() if we run in VM, it calls clear_sched_clock_stable() if there is no invariant tsc bit when we end up using native_sched_clock(), otherwise, kvmclock will determine if it is stable depends on KVM_FEATURE_CLOCKSOURCE_STABLE_BIT / PVCLOCK_TSC_STABLE_BIT : diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 4e95b2e..ed8eda4 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -557,7 +557,7 @@ static void early_init_amd(struct cpuinfo_x86 *c) set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC); if (check_tsc_unstable()) clear_sched_clock_stable(); - } else { + } else if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) { clear_sched_clock_stable(); } diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 017ecd3..1927f5f 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -163,7 +163,7 @@ static void early_init_intel(struct cpuinfo_x86 *c) set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC); if (check_tsc_unstable()) clear_sched_clock_stable(); - } else { + } else if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) { clear_sched_clock_stable(); } diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 2724dc8..68149f5 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -326,9 +326,16 @@ unsigned long long sched_clock(void) { return paravirt_sched_clock(); } + +static inline bool using_native_sched_clock(void) +{ + return pv_time_ops.sched_clock == native_sched_clock; +} #else unsigned long long sched_clock(void) __attribute__((alias("native_sched_clock"))); + +static inline bool using_native_sched_clock(void) { return true; } #endif int check_tsc_unstable(void) @@ -1398,6 +1405,11 @@ void __init tsc_init(void) use_tsc_delay(); + if (using_native_sched_clock() && + boot_cpu_has(X86_FEATURE_HYPERVISOR) && + !boot_cpu_has(X86_FEATURE_NONSTOP_TSC)) + clear_sched_clock_stable(); + if (unsynchronized_tsc()) mark_tsc_unstable("TSCs unsynchronized"); Regards, Wanpeng Li