Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 11, 2018 at 04:00:29PM -0700, Andy Lutomirski wrote: > On Thu, Oct 11, 2018 at 3:28 PM Marcelo Tosatti wrote: > > > > On Tue, Oct 09, 2018 at 01:09:42PM -0700, Andy Lutomirski wrote: > > > On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti > > > wrote: > > > > > > > > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote: > > > > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti > > > > > wrote: > > > > > > > > I read the comment three more times and even dug through the git > > > > > history. It seems like what you're saying is that, under certain > > > > > conditions (which arguably would be bugs in the core Linux timing > > > > > code), > > > > > > > > I don't see that as a bug. Its just a side effect of reading two > > > > different clocks (one is CLOCK_MONOTONIC and the other is TSC), > > > > and using those two clocks to as a "base + offset". > > > > > > > > As the comment explains, if you do that, can't guarantee monotonicity. > > > > > > > > > actually calling ktime_get_boot_ns() could be non-monotonic > > > > > with respect to the kvmclock timing. But get_kvmclock_ns() isn't used > > > > > for VM timing as such -- it's used for the IOCTL interfaces for > > > > > updating the time offset. So can you explain how my patch is > > > > > incorrect? > > > > > > > > ktime_get_boot_ns() has frequency correction applied, while > > > > reading masterclock + TSC offset does not. > > > > > > > > So the clock reads differ. > > > > > > > > > > Ah, okay, I finally think I see what's going on. In the kvmclock data > > > exposed to the guest, tsc_shift and tsc_to_system_mul come from > > > tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from > > > CLOCK_BOOTTIME. So the kvmclock and kernel clock drift apart at a > > > rate given by the frequency shift and then suddenly agree again every > > > time the pvclock data is updated. > > > > Yes. > > > > > Is there a reason to do it this way? > > > > Since pvclock updates which update system_timestamp are expensive (must > > stop all vcpus), > > they should be avoided. > > > > Fair enough. > > > So only HW TSC counts > > makes sense. > > >, and used as offset against vcpu's tsc_timestamp. > > > > Why don't you just expose CLOCK_MONTONIC_RAW or CLOCK_MONOTONIC_RAW > plus suspend time, though? Then you would actually be tracking a real > kernel timekeeping mode, and you wouldn't need all this complicated > offsetting work to avoid accidentally going backwards. Can you outline how that would work ? ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 11, 2018 at 3:28 PM Marcelo Tosatti wrote: > > On Tue, Oct 09, 2018 at 01:09:42PM -0700, Andy Lutomirski wrote: > > On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti wrote: > > > > > > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote: > > > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti > > > > wrote: > > > > > > I read the comment three more times and even dug through the git > > > > history. It seems like what you're saying is that, under certain > > > > conditions (which arguably would be bugs in the core Linux timing > > > > code), > > > > > > I don't see that as a bug. Its just a side effect of reading two > > > different clocks (one is CLOCK_MONOTONIC and the other is TSC), > > > and using those two clocks to as a "base + offset". > > > > > > As the comment explains, if you do that, can't guarantee monotonicity. > > > > > > > actually calling ktime_get_boot_ns() could be non-monotonic > > > > with respect to the kvmclock timing. But get_kvmclock_ns() isn't used > > > > for VM timing as such -- it's used for the IOCTL interfaces for > > > > updating the time offset. So can you explain how my patch is > > > > incorrect? > > > > > > ktime_get_boot_ns() has frequency correction applied, while > > > reading masterclock + TSC offset does not. > > > > > > So the clock reads differ. > > > > > > > Ah, okay, I finally think I see what's going on. In the kvmclock data > > exposed to the guest, tsc_shift and tsc_to_system_mul come from > > tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from > > CLOCK_BOOTTIME. So the kvmclock and kernel clock drift apart at a > > rate given by the frequency shift and then suddenly agree again every > > time the pvclock data is updated. > > Yes. > > > Is there a reason to do it this way? > > Since pvclock updates which update system_timestamp are expensive (must stop > all vcpus), > they should be avoided. > Fair enough. > So only HW TSC counts makes sense. >, and used as offset against vcpu's tsc_timestamp. > Why don't you just expose CLOCK_MONTONIC_RAW or CLOCK_MONOTONIC_RAW plus suspend time, though? Then you would actually be tracking a real kernel timekeeping mode, and you wouldn't need all this complicated offsetting work to avoid accidentally going backwards. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Tue, Oct 09, 2018 at 01:09:42PM -0700, Andy Lutomirski wrote: > On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti wrote: > > > > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote: > > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti > > > wrote: > > > > I read the comment three more times and even dug through the git > > > history. It seems like what you're saying is that, under certain > > > conditions (which arguably would be bugs in the core Linux timing > > > code), > > > > I don't see that as a bug. Its just a side effect of reading two > > different clocks (one is CLOCK_MONOTONIC and the other is TSC), > > and using those two clocks to as a "base + offset". > > > > As the comment explains, if you do that, can't guarantee monotonicity. > > > > > actually calling ktime_get_boot_ns() could be non-monotonic > > > with respect to the kvmclock timing. But get_kvmclock_ns() isn't used > > > for VM timing as such -- it's used for the IOCTL interfaces for > > > updating the time offset. So can you explain how my patch is > > > incorrect? > > > > ktime_get_boot_ns() has frequency correction applied, while > > reading masterclock + TSC offset does not. > > > > So the clock reads differ. > > > > Ah, okay, I finally think I see what's going on. In the kvmclock data > exposed to the guest, tsc_shift and tsc_to_system_mul come from > tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from > CLOCK_BOOTTIME. So the kvmclock and kernel clock drift apart at a > rate given by the frequency shift and then suddenly agree again every > time the pvclock data is updated. Yes. > Is there a reason to do it this way? Since pvclock updates which update system_timestamp are expensive (must stop all vcpus), they should be avoided. So only HW TSC counts, and used as offset against vcpu's tsc_timestamp. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti wrote: > > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote: > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti wrote: > > I read the comment three more times and even dug through the git > > history. It seems like what you're saying is that, under certain > > conditions (which arguably would be bugs in the core Linux timing > > code), > > I don't see that as a bug. Its just a side effect of reading two > different clocks (one is CLOCK_MONOTONIC and the other is TSC), > and using those two clocks to as a "base + offset". > > As the comment explains, if you do that, can't guarantee monotonicity. > > > actually calling ktime_get_boot_ns() could be non-monotonic > > with respect to the kvmclock timing. But get_kvmclock_ns() isn't used > > for VM timing as such -- it's used for the IOCTL interfaces for > > updating the time offset. So can you explain how my patch is > > incorrect? > > ktime_get_boot_ns() has frequency correction applied, while > reading masterclock + TSC offset does not. > > So the clock reads differ. > Ah, okay, I finally think I see what's going on. In the kvmclock data exposed to the guest, tsc_shift and tsc_to_system_mul come from tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from CLOCK_BOOTTIME. So the kvmclock and kernel clock drift apart at a rate given by the frequency shift and then suddenly agree again every time the pvclock data is updated. Is there a reason to do it this way? ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote: > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti wrote: > > > > On Sat, Oct 06, 2018 at 03:28:05PM -0700, Andy Lutomirski wrote: > > > On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti > > > wrote: > > > > > > > > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote: > > > > > For better or for worse, I'm trying to understand this code. So far, > > > > > I've come up with this patch: > > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf > > > > > > > > > > Is it correct, or am I missing some subtlety? > > > > > > > > The master clock, when initialized, has a pair > > > > > > > > masterclockvalues=(TSC value, time-of-day data). > > > > > > > > When updating the guest clock, we only update relative to (TSC value) > > > > that was read on masterclock initialization. > > > > > > I don't see the problem. The masterclock data is updated here: > > > > > > host_tsc_clocksource = kvm_get_time_and_clockread( > > > &ka->master_kernel_ns, > > > &ka->master_cycle_now); > > > > > > kvm_get_time_and_clockread() gets those values from > > > do_monotonic_boot(), which, barring bugs, should cause > > > get_kvmclock_ns() to return exactly the same thing as > > > ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather > > > roundabout manner. > > > > > > So what am I missing? Is there actually something wrong with my patch? > > > > For the bug mentioned in the comment not to happen, you must only read > > TSC and add it as offset to (TSC value, time-of-day data). > > > > Its more than "a roundabout manner". > > > > Read the comment again. > > > > I read the comment three more times and even dug through the git > history. It seems like what you're saying is that, under certain > conditions (which arguably would be bugs in the core Linux timing > code), I don't see that as a bug. Its just a side effect of reading two different clocks (one is CLOCK_MONOTONIC and the other is TSC), and using those two clocks to as a "base + offset". As the comment explains, if you do that, can't guarantee monotonicity. > actually calling ktime_get_boot_ns() could be non-monotonic > with respect to the kvmclock timing. But get_kvmclock_ns() isn't used > for VM timing as such -- it's used for the IOCTL interfaces for > updating the time offset. So can you explain how my patch is > incorrect? ktime_get_boot_ns() has frequency correction applied, while reading masterclock + TSC offset does not. So the clock reads differ. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti wrote: > > On Sat, Oct 06, 2018 at 03:28:05PM -0700, Andy Lutomirski wrote: > > On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti wrote: > > > > > > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote: > > > > For better or for worse, I'm trying to understand this code. So far, > > > > I've come up with this patch: > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf > > > > > > > > Is it correct, or am I missing some subtlety? > > > > > > The master clock, when initialized, has a pair > > > > > > masterclockvalues=(TSC value, time-of-day data). > > > > > > When updating the guest clock, we only update relative to (TSC value) > > > that was read on masterclock initialization. > > > > I don't see the problem. The masterclock data is updated here: > > > > host_tsc_clocksource = kvm_get_time_and_clockread( > > &ka->master_kernel_ns, > > &ka->master_cycle_now); > > > > kvm_get_time_and_clockread() gets those values from > > do_monotonic_boot(), which, barring bugs, should cause > > get_kvmclock_ns() to return exactly the same thing as > > ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather > > roundabout manner. > > > > So what am I missing? Is there actually something wrong with my patch? > > For the bug mentioned in the comment not to happen, you must only read > TSC and add it as offset to (TSC value, time-of-day data). > > Its more than "a roundabout manner". > > Read the comment again. > I read the comment three more times and even dug through the git history. It seems like what you're saying is that, under certain conditions (which arguably would be bugs in the core Linux timing code), actually calling ktime_get_boot_ns() could be non-monotonic with respect to the kvmclock timing. But get_kvmclock_ns() isn't used for VM timing as such -- it's used for the IOCTL interfaces for updating the time offset. So can you explain how my patch is incorrect? --Andy ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Sat, Oct 06, 2018 at 03:28:05PM -0700, Andy Lutomirski wrote: > On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti wrote: > > > > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote: > > > For better or for worse, I'm trying to understand this code. So far, > > > I've come up with this patch: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf > > > > > > Is it correct, or am I missing some subtlety? > > > > The master clock, when initialized, has a pair > > > > masterclockvalues=(TSC value, time-of-day data). > > > > When updating the guest clock, we only update relative to (TSC value) > > that was read on masterclock initialization. > > I don't see the problem. The masterclock data is updated here: > > host_tsc_clocksource = kvm_get_time_and_clockread( > &ka->master_kernel_ns, > &ka->master_cycle_now); > > kvm_get_time_and_clockread() gets those values from > do_monotonic_boot(), which, barring bugs, should cause > get_kvmclock_ns() to return exactly the same thing as > ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather > roundabout manner. > > So what am I missing? Is there actually something wrong with my patch? For the bug mentioned in the comment not to happen, you must only read TSC and add it as offset to (TSC value, time-of-day data). Its more than "a roundabout manner". Read the comment again. > > > > > > See the following comment on x86.c: > > I read that comment, and it's not obvious to me how it's related. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote: > For better or for worse, I'm trying to understand this code. So far, > I've come up with this patch: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf > > Is it correct, or am I missing some subtlety? "In the non-fallback case, a bunch of gnarly arithmetic is done: a hopefully matched pair of (TSC, boot time) is read, then all locks are dropped, then the TSC frequency is read, a branch new multiplier and shift is read, and the result is turned into a time. This seems quite racy to me. Because locks are not held, I don't see what keeps TSC frequency used in sync with the master clock data." If tsc_khz changes, the host TSC clocksource will not be used, which disables masterclock: if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) || (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) { *lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new); tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new); if (!(freq->flags & CPUFREQ_CONST_LOOPS)) mark_tsc_unstable("cpufreq changes"); In which case it ends up in: - spin_lock(&ka->pvclock_gtod_sync_lock); - if (!ka->use_master_clock) { - spin_unlock(&ka->pvclock_gtod_sync_lock); - return ktime_get_boot_ns() + ka->kvmclock_offset; - } masterclock -> non masterclock transition sets a REQUEST bit on each vCPU, so as to invalidate any previous clock reads. static void kvm_gen_update_masterclock(struct kvm *kvm) { #ifdef CONFIG_X86_64 int i; struct kvm_vcpu *vcpu; struct kvm_arch *ka = &kvm->arch; spin_lock(&ka->pvclock_gtod_sync_lock); kvm_make_mclock_inprogress_request(kvm); /* no guest entries from this point */ pvclock_update_vm_gtod_copy(kvm); kvm_for_each_vcpu(i, vcpu, kvm) kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); /* guest entries allowed */ kvm_for_each_vcpu(i, vcpu, kvm) kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu); spin_unlock(&ka->pvclock_gtod_sync_lock); #endif /* * If the host uses TSC clock, then passthrough TSC as stable * to the guest. */ host_tsc_clocksource = kvm_get_time_and_clockread( &ka->master_kernel_ns, &ka->master_cycle_now); ka->use_master_clock = host_tsc_clocksource && vcpus_matched && !ka->backwards_tsc_observed && !ka->boot_vcpu_runs_old_kvmclock; ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti wrote: > > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote: > > For better or for worse, I'm trying to understand this code. So far, > > I've come up with this patch: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf > > > > Is it correct, or am I missing some subtlety? > > The master clock, when initialized, has a pair > > masterclockvalues=(TSC value, time-of-day data). > > When updating the guest clock, we only update relative to (TSC value) > that was read on masterclock initialization. I don't see the problem. The masterclock data is updated here: host_tsc_clocksource = kvm_get_time_and_clockread( &ka->master_kernel_ns, &ka->master_cycle_now); kvm_get_time_and_clockread() gets those values from do_monotonic_boot(), which, barring bugs, should cause get_kvmclock_ns() to return exactly the same thing as ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather roundabout manner. So what am I missing? Is there actually something wrong with my patch? > > See the following comment on x86.c: I read that comment, and it's not obvious to me how it's related. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote: > For better or for worse, I'm trying to understand this code. So far, > I've come up with this patch: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf > > Is it correct, or am I missing some subtlety? The master clock, when initialized, has a pair masterclockvalues=(TSC value, time-of-day data). When updating the guest clock, we only update relative to (TSC value) that was read on masterclock initialization. See the following comment on x86.c: /* * * Assuming a stable TSC across physical CPUS, and a stable TSC * across virtual CPUs, the following condition is possible. * Each numbered line represents an event visible to both * CPUs at the next numbered event. ... When updating the "masterclockvalues" pair, all vcpus are stopped. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote: > Marcelo Tosatti writes: > > > On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote: > >> > >> There is a very long history of different (hardware) issues Marcelo was > >> fighting with and the current code is the survived Frankenstein. > > > > Right, the code has to handle different TSC modes. > > > >> E.g. it > >> is very, very unclear what "catchup", "always catchup" and > >> masterclock-less mode in general are and if we still need them. > > > > Catchup means handling exposed (to guest) TSC frequency smaller than > > HW TSC frequency. > > > > That is "frankenstein" code, could be removed. > > > >> That said I'm all for simplification. I'm not sure if we still need to > >> care about buggy hardware though. > > > > What simplification is that again? > > > > I was hoping to hear this from you :-) If I am to suggest how we can > move forward I'd propose: > - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling > is supported). In that case just use TSC clocksource on the guest directly: i am writing code for that now (its faster than pvclock syscall). > - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page > clocksource is a single page for the whole VM, not a per-cpu thing. Can > we think that all the buggy hardware is already gone? As Peter mentioned, non sync TSC hardware might exist in the future. And older hardware must be supported. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
For better or for worse, I'm trying to understand this code. So far, I've come up with this patch: https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf Is it correct, or am I missing some subtlety? ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > implementation, which extended the clockid switch case and added yet > another slightly different copy of the same code. > > Especially the extended switch case is problematic as the compiler tends to > generate a jump table which then requires to use retpolines. If jump tables > are disabled it adds yet another conditional to the existing maze. > > This series takes a different approach by consolidating the almost > identical functions into one implementation for high resolution clocks and > one for the coarse grained clock ids by storing the base data for each > clock id in an array which is indexed by the clock id. > > This completely eliminates the switch case and allows further > simplifications of the code base, which at the end all together gain a few > cycles performance or at least stay on par with todays code. The resulting > performance depends heavily on the micro architecture and the compiler. tglx, please consider this whole series Acked-by: Andy Lutomirski Please feel free to push it top tip:x86/vdso, as long as you pull in tip/x86/urgent first. Once it lands, I'll email out a couple of follow-up patches. --Andy ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
> On Oct 4, 2018, at 12:31 PM, Peter Zijlstra wrote: > > On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote: >>> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra wrote: >>> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote: I was hoping to hear this from you :-) If I am to suggest how we can move forward I'd propose: - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling is supported). - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page clocksource is a single page for the whole VM, not a per-cpu thing. Can we think that all the buggy hardware is already gone? >>> >>> No, and it is not the hardware you have to worry about (mostly), it is >>> the frigging PoS firmware people put on it. >>> >>> Ever since Nehalem TSC is stable (unless you get to >4 socket systems, >>> after which it still can be, but bets are off). But even relatively >>> recent systems fail the TSC sync test because firmware messes it up by >>> writing to either MSR_TSC or MSR_TSC_ADJUST. >>> >>> But the thing is, if the TSC is not synced, you cannot use it for >>> timekeeping, full stop. So having a single page is fine, it either >>> contains a mult/shift that is valid, or it indicates TSC is messed up >>> and you fall back to something else. >>> >>> There is no inbetween there. >>> >>> For sched_clock we can still use the global page, because the rate will >>> still be the same for each cpu, it's just offset between CPUs and the >>> code compensates for that. >> >> But if we’re in a KVM guest, then the clock will jump around on the >> same *vCPU* when the vCPU migrates. > > Urgh yes.. > >> But I don’t see how kvmclock helps here, since I don’t think it’s used >> for sched_clock. > > I get hits on kvm_sched_clock, but haven't looked further. Ok, so KVM is using the per-vCPU pvclock data for sched_clock. Which hopefully does something intelligent. Regardless of any TSC syncing issues, a paravirt clock should presumably be used for sched_clock to account for time that the vCPU was stopped. It would be fantastic if this stuff were documented much better, both in terms of the data structures and the Linux code. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote: > > On Oct 4, 2018, at 1:11 AM, Peter Zijlstra wrote: > > > >> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote: > >> I was hoping to hear this from you :-) If I am to suggest how we can > >> move forward I'd propose: > >> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling > >> is supported). > >> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page > >> clocksource is a single page for the whole VM, not a per-cpu thing. Can > >> we think that all the buggy hardware is already gone? > > > > No, and it is not the hardware you have to worry about (mostly), it is > > the frigging PoS firmware people put on it. > > > > Ever since Nehalem TSC is stable (unless you get to >4 socket systems, > > after which it still can be, but bets are off). But even relatively > > recent systems fail the TSC sync test because firmware messes it up by > > writing to either MSR_TSC or MSR_TSC_ADJUST. > > > > But the thing is, if the TSC is not synced, you cannot use it for > > timekeeping, full stop. So having a single page is fine, it either > > contains a mult/shift that is valid, or it indicates TSC is messed up > > and you fall back to something else. > > > > There is no inbetween there. > > > > For sched_clock we can still use the global page, because the rate will > > still be the same for each cpu, it's just offset between CPUs and the > > code compensates for that. > > But if we’re in a KVM guest, then the clock will jump around on the > same *vCPU* when the vCPU migrates. Urgh yes.. > But I don’t see how kvmclock helps here, since I don’t think it’s used > for sched_clock. I get hits on kvm_sched_clock, but haven't looked further. Anyway, Most of the argument still holds, either TSC is synced or it is not and it _really_ should not be used. Not much middle ground there. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Andy Lutomirski writes: > On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti wrote: >> >> On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote: >> > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti >> > wrote: >> > > >> > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote: >> > > > Hi Vitaly, Paolo, Radim, etc., >> > > > >> > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner >> > > > wrote: >> > > > > >> > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() >> > > > > implementation, which extended the clockid switch case and added yet >> > > > > another slightly different copy of the same code. >> > > > > >> > > > > Especially the extended switch case is problematic as the compiler >> > > > > tends to >> > > > > generate a jump table which then requires to use retpolines. If jump >> > > > > tables >> > > > > are disabled it adds yet another conditional to the existing maze. >> > > > > >> > > > > This series takes a different approach by consolidating the almost >> > > > > identical functions into one implementation for high resolution >> > > > > clocks and >> > > > > one for the coarse grained clock ids by storing the base data for >> > > > > each >> > > > > clock id in an array which is indexed by the clock id. >> > > > > >> > > > >> > > > I was trying to understand more of the implications of this patch >> > > > series, and I was again reminded that there is an entire extra copy of >> > > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of >> > > > that code is very, very opaque. >> > > > >> > > > Can one of you explain what the code is even doing? From a couple of >> > > > attempts to read through it, it's a whole bunch of >> > > > probably-extremely-buggy code that, >> > > >> > > Yes, probably. >> > > >> > > > drumroll please, tries to atomically read the TSC value and the time. >> > > > And decide whether the >> > > > result is "based on the TSC". >> > > >> > > I think "based on the TSC" refers to whether TSC clocksource is being >> > > used. >> > > >> > > > And then synthesizes a TSC-to-ns >> > > > multiplier and shift, based on *something other than the actual >> > > > multiply and shift used*. >> > > > >> > > > IOW, unless I'm totally misunderstanding it, the code digs into the >> > > > private arch clocksource data intended for the vDSO, uses a poorly >> > > > maintained copy of the vDSO code to read the time (instead of doing >> > > > the sane thing and using the kernel interfaces for this), and >> > > > propagates a totally made up copy to the guest. >> > > >> > > I posted kernel interfaces for this, and it was suggested to >> > > instead write a "in-kernel user of pvclock data". >> > > >> > > If you can get kernel interfaces to replace that, go for it. I prefer >> > > kernel interfaces as well. >> > > >> > > > And gets it entirely >> > > > wrong when doing nested virt, since, unless there's some secret in >> > > > this maze, it doesn't acutlaly use the scaling factor from the host >> > > > when it tells the guest what to do. >> > > > >> > > > I am really, seriously tempted to send a patch to simply delete all >> > > > this code. >> > > >> > > If your patch which deletes the code gets the necessary features right, >> > > sure, go for it. >> > > >> > > > The correct way to do it is to hook >> > > >> > > Can you expand on the correct way to do it? >> > > >> > > > And I don't see how it's even possible to pass kvmclock correctly to >> > > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but >> > > > L1 isn't notified when the data structure changes, so how the heck is >> > > > it supposed to update the kvmclock structure? >> > > >> > > I don't parse your question. >> > >> > Let me ask it more intelligently: when the "reenlightenment" IRQ >> > happens, what tells KVM to do its own update for its guests? >> >> Update of what, and why it needs to update anything from IRQ? >> >> The update i can think of is from host kernel clocksource, >> which there is a notifier for. >> >> > > Unless I've missed some serious magic, L2 guests see kvmclock, not hv. > So we have the following sequence of events: > > - L0 migrates the whole VM. Starting now, RDTSC is emulated to match > the old host, which applies in L1 and L2. > > - An IRQ is queued to L1. > > - L1 acknowledges that it noticed the TSC change. Before the acknowledgement we actually pause all guests so they don't notice the change > RDTSC stops being emulated for L1 and L2. and right after that we update all kvmclocks for all L2s and unpause them so all their readings are still correct (see kvm_hyperv_tsc_notifier()). > > - L2 reads the TSC. It has no idea that anything changed, and it > gets the wrong answer. I have to admit I forgot what happens if L2 uses raw TSC. I *think* that we actually adjust TSC offset along with adjusting kvmclocks so the reading is still correct. I'll have to check this. All bets are off in case L2 was using
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti wrote: > > On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote: > > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti wrote: > > > > > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote: > > > > Hi Vitaly, Paolo, Radim, etc., > > > > > > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner > > > > wrote: > > > > > > > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > > > > implementation, which extended the clockid switch case and added yet > > > > > another slightly different copy of the same code. > > > > > > > > > > Especially the extended switch case is problematic as the compiler > > > > > tends to > > > > > generate a jump table which then requires to use retpolines. If jump > > > > > tables > > > > > are disabled it adds yet another conditional to the existing maze. > > > > > > > > > > This series takes a different approach by consolidating the almost > > > > > identical functions into one implementation for high resolution > > > > > clocks and > > > > > one for the coarse grained clock ids by storing the base data for each > > > > > clock id in an array which is indexed by the clock id. > > > > > > > > > > > > > I was trying to understand more of the implications of this patch > > > > series, and I was again reminded that there is an entire extra copy of > > > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > > > > that code is very, very opaque. > > > > > > > > Can one of you explain what the code is even doing? From a couple of > > > > attempts to read through it, it's a whole bunch of > > > > probably-extremely-buggy code that, > > > > > > Yes, probably. > > > > > > > drumroll please, tries to atomically read the TSC value and the time. > > > > And decide whether the > > > > result is "based on the TSC". > > > > > > I think "based on the TSC" refers to whether TSC clocksource is being > > > used. > > > > > > > And then synthesizes a TSC-to-ns > > > > multiplier and shift, based on *something other than the actual > > > > multiply and shift used*. > > > > > > > > IOW, unless I'm totally misunderstanding it, the code digs into the > > > > private arch clocksource data intended for the vDSO, uses a poorly > > > > maintained copy of the vDSO code to read the time (instead of doing > > > > the sane thing and using the kernel interfaces for this), and > > > > propagates a totally made up copy to the guest. > > > > > > I posted kernel interfaces for this, and it was suggested to > > > instead write a "in-kernel user of pvclock data". > > > > > > If you can get kernel interfaces to replace that, go for it. I prefer > > > kernel interfaces as well. > > > > > > > And gets it entirely > > > > wrong when doing nested virt, since, unless there's some secret in > > > > this maze, it doesn't acutlaly use the scaling factor from the host > > > > when it tells the guest what to do. > > > > > > > > I am really, seriously tempted to send a patch to simply delete all > > > > this code. > > > > > > If your patch which deletes the code gets the necessary features right, > > > sure, go for it. > > > > > > > The correct way to do it is to hook > > > > > > Can you expand on the correct way to do it? > > > > > > > And I don't see how it's even possible to pass kvmclock correctly to > > > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > > > > L1 isn't notified when the data structure changes, so how the heck is > > > > it supposed to update the kvmclock structure? > > > > > > I don't parse your question. > > > > Let me ask it more intelligently: when the "reenlightenment" IRQ > > happens, what tells KVM to do its own update for its guests? > > Update of what, and why it needs to update anything from IRQ? > > The update i can think of is from host kernel clocksource, > which there is a notifier for. > > Unless I've missed some serious magic, L2 guests see kvmclock, not hv. So we have the following sequence of events: - L0 migrates the whole VM. Starting now, RDTSC is emulated to match the old host, which applies in L1 and L2. - An IRQ is queued to L1. - L1 acknowledges that it noticed the TSC change. RDTSC stops being emulated for L1 and L2. - L2 reads the TSC. It has no idea that anything changed, and it gets the wrong answer. - At some point, kvm clock updates. What prevents this? Vitaly, am I missing some subtlety of what actually happens? ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote: > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti wrote: > > > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote: > > > Hi Vitaly, Paolo, Radim, etc., > > > > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner > > > wrote: > > > > > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > > > implementation, which extended the clockid switch case and added yet > > > > another slightly different copy of the same code. > > > > > > > > Especially the extended switch case is problematic as the compiler > > > > tends to > > > > generate a jump table which then requires to use retpolines. If jump > > > > tables > > > > are disabled it adds yet another conditional to the existing maze. > > > > > > > > This series takes a different approach by consolidating the almost > > > > identical functions into one implementation for high resolution clocks > > > > and > > > > one for the coarse grained clock ids by storing the base data for each > > > > clock id in an array which is indexed by the clock id. > > > > > > > > > > I was trying to understand more of the implications of this patch > > > series, and I was again reminded that there is an entire extra copy of > > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > > > that code is very, very opaque. > > > > > > Can one of you explain what the code is even doing? From a couple of > > > attempts to read through it, it's a whole bunch of > > > probably-extremely-buggy code that, > > > > Yes, probably. > > > > > drumroll please, tries to atomically read the TSC value and the time. > > > And decide whether the > > > result is "based on the TSC". > > > > I think "based on the TSC" refers to whether TSC clocksource is being > > used. > > > > > And then synthesizes a TSC-to-ns > > > multiplier and shift, based on *something other than the actual > > > multiply and shift used*. > > > > > > IOW, unless I'm totally misunderstanding it, the code digs into the > > > private arch clocksource data intended for the vDSO, uses a poorly > > > maintained copy of the vDSO code to read the time (instead of doing > > > the sane thing and using the kernel interfaces for this), and > > > propagates a totally made up copy to the guest. > > > > I posted kernel interfaces for this, and it was suggested to > > instead write a "in-kernel user of pvclock data". > > > > If you can get kernel interfaces to replace that, go for it. I prefer > > kernel interfaces as well. > > > > > And gets it entirely > > > wrong when doing nested virt, since, unless there's some secret in > > > this maze, it doesn't acutlaly use the scaling factor from the host > > > when it tells the guest what to do. > > > > > > I am really, seriously tempted to send a patch to simply delete all > > > this code. > > > > If your patch which deletes the code gets the necessary features right, > > sure, go for it. > > > > > The correct way to do it is to hook > > > > Can you expand on the correct way to do it? > > > > > And I don't see how it's even possible to pass kvmclock correctly to > > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > > > L1 isn't notified when the data structure changes, so how the heck is > > > it supposed to update the kvmclock structure? > > > > I don't parse your question. > > Let me ask it more intelligently: when the "reenlightenment" IRQ > happens, what tells KVM to do its own update for its guests? Update of what, and why it needs to update anything from IRQ? The update i can think of is from host kernel clocksource, which there is a notifier for. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
> On Oct 4, 2018, at 5:00 AM, Paolo Bonzini wrote: > >> On 04/10/2018 09:54, Vitaly Kuznetsov wrote: >> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling >> is supported). > > Not if you want to migrate to pre-Skylake systems. > >> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page >> clocksource is a single page for the whole VM, not a per-cpu thing. Can >> we think that all the buggy hardware is already gone? > > No. :( We still get reports whenever we break 2007-2008 hardware. > > Does the KVM non-masterclock mode actually help? It’s not clear to me exactly how it’s supposed to work, but it seems like it’s trying to expose per-vCPU adjustments to the guest. Which is dubious at best, since the guest can’t validly use them for anything other than sched_clock, since they aren’t fully corrected by anything KVM can do. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra wrote: > >> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote: >> I was hoping to hear this from you :-) If I am to suggest how we can >> move forward I'd propose: >> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling >> is supported). >> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page >> clocksource is a single page for the whole VM, not a per-cpu thing. Can >> we think that all the buggy hardware is already gone? > > No, and it is not the hardware you have to worry about (mostly), it is > the frigging PoS firmware people put on it. > > Ever since Nehalem TSC is stable (unless you get to >4 socket systems, > after which it still can be, but bets are off). But even relatively > recent systems fail the TSC sync test because firmware messes it up by > writing to either MSR_TSC or MSR_TSC_ADJUST. > > But the thing is, if the TSC is not synced, you cannot use it for > timekeeping, full stop. So having a single page is fine, it either > contains a mult/shift that is valid, or it indicates TSC is messed up > and you fall back to something else. > > There is no inbetween there. > > For sched_clock we can still use the global page, because the rate will > still be the same for each cpu, it's just offset between CPUs and the > code compensates for that. But if we’re in a KVM guest, then the clock will jump around on the same *vCPU* when the vCPU migrates. But I don’t see how kvmclock helps here, since I don’t think it’s used for sched_clock. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On 04/10/2018 09:54, Vitaly Kuznetsov wrote: > - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling > is supported). Not if you want to migrate to pre-Skylake systems. > - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page > clocksource is a single page for the whole VM, not a per-cpu thing. Can > we think that all the buggy hardware is already gone? No. :( We still get reports whenever we break 2007-2008 hardware. Paolo ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote: > I was hoping to hear this from you :-) If I am to suggest how we can > move forward I'd propose: > - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling > is supported). > - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page > clocksource is a single page for the whole VM, not a per-cpu thing. Can > we think that all the buggy hardware is already gone? No, and it is not the hardware you have to worry about (mostly), it is the frigging PoS firmware people put on it. Ever since Nehalem TSC is stable (unless you get to >4 socket systems, after which it still can be, but bets are off). But even relatively recent systems fail the TSC sync test because firmware messes it up by writing to either MSR_TSC or MSR_TSC_ADJUST. But the thing is, if the TSC is not synced, you cannot use it for timekeeping, full stop. So having a single page is fine, it either contains a mult/shift that is valid, or it indicates TSC is messed up and you fall back to something else. There is no inbetween there. For sched_clock we can still use the global page, because the rate will still be the same for each cpu, it's just offset between CPUs and the code compensates for that. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Marcelo Tosatti writes: > On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote: >> >> There is a very long history of different (hardware) issues Marcelo was >> fighting with and the current code is the survived Frankenstein. > > Right, the code has to handle different TSC modes. > >> E.g. it >> is very, very unclear what "catchup", "always catchup" and >> masterclock-less mode in general are and if we still need them. > > Catchup means handling exposed (to guest) TSC frequency smaller than > HW TSC frequency. > > That is "frankenstein" code, could be removed. > >> That said I'm all for simplification. I'm not sure if we still need to >> care about buggy hardware though. > > What simplification is that again? > I was hoping to hear this from you :-) If I am to suggest how we can move forward I'd propose: - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling is supported). - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page clocksource is a single page for the whole VM, not a per-cpu thing. Can we think that all the buggy hardware is already gone? -- Vitaly ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti wrote: > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote: > > Hi Vitaly, Paolo, Radim, etc., > > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: > > > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > > implementation, which extended the clockid switch case and added yet > > > another slightly different copy of the same code. > > > > > > Especially the extended switch case is problematic as the compiler tends > > > to > > > generate a jump table which then requires to use retpolines. If jump > > > tables > > > are disabled it adds yet another conditional to the existing maze. > > > > > > This series takes a different approach by consolidating the almost > > > identical functions into one implementation for high resolution clocks and > > > one for the coarse grained clock ids by storing the base data for each > > > clock id in an array which is indexed by the clock id. > > > > > > > I was trying to understand more of the implications of this patch > > series, and I was again reminded that there is an entire extra copy of > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > > that code is very, very opaque. > > > > Can one of you explain what the code is even doing? From a couple of > > attempts to read through it, it's a whole bunch of > > probably-extremely-buggy code that, > > Yes, probably. > > > drumroll please, tries to atomically read the TSC value and the time. And > > decide whether the > > result is "based on the TSC". > > I think "based on the TSC" refers to whether TSC clocksource is being > used. > > > And then synthesizes a TSC-to-ns > > multiplier and shift, based on *something other than the actual > > multiply and shift used*. > > > > IOW, unless I'm totally misunderstanding it, the code digs into the > > private arch clocksource data intended for the vDSO, uses a poorly > > maintained copy of the vDSO code to read the time (instead of doing > > the sane thing and using the kernel interfaces for this), and > > propagates a totally made up copy to the guest. > > I posted kernel interfaces for this, and it was suggested to > instead write a "in-kernel user of pvclock data". > > If you can get kernel interfaces to replace that, go for it. I prefer > kernel interfaces as well. > > > And gets it entirely > > wrong when doing nested virt, since, unless there's some secret in > > this maze, it doesn't acutlaly use the scaling factor from the host > > when it tells the guest what to do. > > > > I am really, seriously tempted to send a patch to simply delete all > > this code. > > If your patch which deletes the code gets the necessary features right, > sure, go for it. > > > The correct way to do it is to hook > > Can you expand on the correct way to do it? > > > And I don't see how it's even possible to pass kvmclock correctly to > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > > L1 isn't notified when the data structure changes, so how the heck is > > it supposed to update the kvmclock structure? > > I don't parse your question. Let me ask it more intelligently: when the "reenlightenment" IRQ happens, what tells KVM to do its own update for its guests? ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support\
On Wed, Oct 03, 2018 at 04:00:29PM -0300, Marcelo Tosatti wrote: > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote: > > Hi Vitaly, Paolo, Radim, etc., > > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: > > > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > > implementation, which extended the clockid switch case and added yet > > > another slightly different copy of the same code. > > > > > > Especially the extended switch case is problematic as the compiler tends > > > to > > > generate a jump table which then requires to use retpolines. If jump > > > tables > > > are disabled it adds yet another conditional to the existing maze. > > > > > > This series takes a different approach by consolidating the almost > > > identical functions into one implementation for high resolution clocks and > > > one for the coarse grained clock ids by storing the base data for each > > > clock id in an array which is indexed by the clock id. > > > > > > > I was trying to understand more of the implications of this patch > > series, and I was again reminded that there is an entire extra copy of > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > > that code is very, very opaque. > > > > Can one of you explain what the code is even doing? From a couple of > > attempts to read through it, it's a whole bunch of > > probably-extremely-buggy code that, > > Yes, probably. > > > drumroll please, tries to atomically read the TSC value and the time. And > > decide whether the > > result is "based on the TSC". > > I think "based on the TSC" refers to whether TSC clocksource is being > used. > > > And then synthesizes a TSC-to-ns > > multiplier and shift, based on *something other than the actual > > multiply and shift used*. > > > > IOW, unless I'm totally misunderstanding it, the code digs into the > > private arch clocksource data intended for the vDSO, uses a poorly > > maintained copy of the vDSO code to read the time (instead of doing > > the sane thing and using the kernel interfaces for this), and > > propagates a totally made up copy to the guest. > > I posted kernel interfaces for this, and it was suggested to > instead write a "in-kernel user of pvclock data". > > If you can get kernel interfaces to replace that, go for it. I prefer > kernel interfaces as well. And cleanup patches, to make that code look nicer, are also very welcome! ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote: > Andy Lutomirski writes: > > > Hi Vitaly, Paolo, Radim, etc., > > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: > >> > >> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > >> implementation, which extended the clockid switch case and added yet > >> another slightly different copy of the same code. > >> > >> Especially the extended switch case is problematic as the compiler tends to > >> generate a jump table which then requires to use retpolines. If jump tables > >> are disabled it adds yet another conditional to the existing maze. > >> > >> This series takes a different approach by consolidating the almost > >> identical functions into one implementation for high resolution clocks and > >> one for the coarse grained clock ids by storing the base data for each > >> clock id in an array which is indexed by the clock id. > >> > > > > I was trying to understand more of the implications of this patch > > series, and I was again reminded that there is an entire extra copy of > > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > > that code is very, very opaque. > > > > Can one of you explain what the code is even doing? From a couple of > > attempts to read through it, it's a whole bunch of > > probably-extremely-buggy code that, drumroll please, tries to > > atomically read the TSC value and the time. And decide whether the > > result is "based on the TSC". And then synthesizes a TSC-to-ns > > multiplier and shift, based on *something other than the actual > > multiply and shift used*. > > > > IOW, unless I'm totally misunderstanding it, the code digs into the > > private arch clocksource data intended for the vDSO, uses a poorly > > maintained copy of the vDSO code to read the time (instead of doing > > the sane thing and using the kernel interfaces for this), and > > propagates a totally made up copy to the guest. And gets it entirely > > wrong when doing nested virt, since, unless there's some secret in > > this maze, it doesn't acutlaly use the scaling factor from the host > > when it tells the guest what to do. > > > > I am really, seriously tempted to send a patch to simply delete all > > this code. The correct way to do it is to hook > > "I have discovered a truly marvelous proof of this, which this margin is > too narrow to contain" :-) > > There is a very long history of different (hardware) issues Marcelo was > fighting with and the current code is the survived Frankenstein. Right, the code has to handle different TSC modes. > E.g. it > is very, very unclear what "catchup", "always catchup" and > masterclock-less mode in general are and if we still need them. Catchup means handling exposed (to guest) TSC frequency smaller than HW TSC frequency. That is "frankenstein" code, could be removed. > That said I'm all for simplification. I'm not sure if we still need to > care about buggy hardware though. What simplification is that again? > > > > And I don't see how it's even possible to pass kvmclock correctly to > > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > > L1 isn't notified when the data structure changes, so how the heck is > > it supposed to update the kvmclock structure? > > Well, this kind of works in the the followin way: > L1's clocksource is 'tsc_page' which is, basically, a compliment to TSC: > two numbers provided by L0: offset and scale and KVM was tought to treat > this clocksource as a good one (see b0c39dc68e3b "x86/kvm: Pass stable > clocksource to guests when running nested on Hyper-V"). > > The notification you're talking about exists, it is called > Reenligntenment, see 0092e4346f49 "x86/kvm: Support Hyper-V > reenlightenment"). When TSC page changes (and this only happens when L1 > is migrated to a different host with a different TSC frequency and TSC > scaling is not supported by the CPU) we receive an interrupt in L1 (at > this moment all TSC accesses are emulated which guarantees the > correctness of the readings), pause all L2 guests, update their kvmclock > structures with new data (we already know the new TSC frequency) and > then tell L0 that we're done and it can stop emulating TSC accesses. > > (Nothing like this exists for KVM-on-KVM, by the way, when L1's > clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.) > > -- > Vitaly ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote: > Hi Vitaly, Paolo, Radim, etc., > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > implementation, which extended the clockid switch case and added yet > > another slightly different copy of the same code. > > > > Especially the extended switch case is problematic as the compiler tends to > > generate a jump table which then requires to use retpolines. If jump tables > > are disabled it adds yet another conditional to the existing maze. > > > > This series takes a different approach by consolidating the almost > > identical functions into one implementation for high resolution clocks and > > one for the coarse grained clock ids by storing the base data for each > > clock id in an array which is indexed by the clock id. > > > > I was trying to understand more of the implications of this patch > series, and I was again reminded that there is an entire extra copy of > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > that code is very, very opaque. > > Can one of you explain what the code is even doing? From a couple of > attempts to read through it, it's a whole bunch of > probably-extremely-buggy code that, Yes, probably. > drumroll please, tries to atomically read the TSC value and the time. And > decide whether the > result is "based on the TSC". I think "based on the TSC" refers to whether TSC clocksource is being used. > And then synthesizes a TSC-to-ns > multiplier and shift, based on *something other than the actual > multiply and shift used*. > > IOW, unless I'm totally misunderstanding it, the code digs into the > private arch clocksource data intended for the vDSO, uses a poorly > maintained copy of the vDSO code to read the time (instead of doing > the sane thing and using the kernel interfaces for this), and > propagates a totally made up copy to the guest. I posted kernel interfaces for this, and it was suggested to instead write a "in-kernel user of pvclock data". If you can get kernel interfaces to replace that, go for it. I prefer kernel interfaces as well. > And gets it entirely > wrong when doing nested virt, since, unless there's some secret in > this maze, it doesn't acutlaly use the scaling factor from the host > when it tells the guest what to do. > > I am really, seriously tempted to send a patch to simply delete all > this code. If your patch which deletes the code gets the necessary features right, sure, go for it. > The correct way to do it is to hook Can you expand on the correct way to do it? > And I don't see how it's even possible to pass kvmclock correctly to > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > L1 isn't notified when the data structure changes, so how the heck is > it supposed to update the kvmclock structure? I don't parse your question. > > --Andy ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Wed, Oct 3, 2018 at 8:10 AM Thomas Gleixner wrote: > > On Wed, 3 Oct 2018, Andy Lutomirski wrote: > > > On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov wrote: > > > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm > > > not mistaken, you need to enable nesting for the VM to get the feature - > > > and most VMs don't have this) so I think we'll have to keep Hyper-V > > > vclock for the time being. > > > > > But this does suggest that the correct way to pass a clock through to an > > L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use > > kvmclock (or something newer and better). This would require adding > > support for atomic frequency changes all the way through the timekeeping > > and arch code. > > > > John, tglx, would that be okay or crazy? > > Not sure what you mean. I think I lost you somewhere on the way. > What I mean is: currently we have a clocksource called ""hyperv_clocksource_tsc_page". Reading it does: static u64 read_hv_clock_tsc(struct clocksource *arg) { u64 current_tick = hv_read_tsc_page(tsc_pg); if (current_tick == U64_MAX) rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick); return current_tick; } From Vitaly's email, it sounds like, on most (all?) hyperv systems with nesting enabled, this clock is better behaved than it appears. It sounds like the read behavior is that current_tick will never be U64_MAX -- instead, the clock always works and, more importantly, the actual scaling factor and offset only change observably on *guest* request. So why don't we we improve the actual "tsc" clocksource to understand this? ISTM the best model would be where the __clocksource_update_freq_xyz() mechanism gets called so we can use it like this: clocksource_begin_update(); clocksource_update_mult_shift(); tell_hv_that_we_reenlightened(); clocksource_end_update(); Where clocksource_begin_update() bumps the seqcount for the vDSO and takes all the locks, clocksource_update_mult_shift() updates everything, and clocksource_end_update() makes the updated parameters usable. (AFAICT there are currently no clocksources at all in the entire kernel that update their frequency on the fly using __clocksource_update_xyz(). Unless I'm missing something, the x86 tsc cpufreq hooks don't call into the core timekeeping at all, so I'm assuming that the tsc clocksource is just unusable as a clocksource on systems that change its frequency.) Or we could keep the hyperv_clocksource_tsc_page clocksource but make it use VCLOCK_TSC and a similar update mechanism. I don't personally want to do this, because the timekeeping code is subtle and I'm unfamiliar with it. And I don't have *that* many spare cycles :) ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Wed, 3 Oct 2018, Andy Lutomirski wrote: > > On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov wrote: > > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm > > not mistaken, you need to enable nesting for the VM to get the feature - > > and most VMs don't have this) so I think we'll have to keep Hyper-V > > vclock for the time being. > > > But this does suggest that the correct way to pass a clock through to an > L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use > kvmclock (or something newer and better). This would require adding > support for atomic frequency changes all the way through the timekeeping > and arch code. > > John, tglx, would that be okay or crazy? Not sure what you mean. I think I lost you somewhere on the way. Thanks, tglx___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
> On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov wrote: > > Andy Lutomirski writes: > >>> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov wrote: >>> >>> Andy Lutomirski writes: >>> Hi Vitaly, Paolo, Radim, etc., >>> The notification you're talking about exists, it is called >>> Reenligntenment, see 0092e4346f49 "x86/kvm: Support Hyper-V >>> reenlightenment"). When TSC page changes (and this only happens when L1 >>> is migrated to a different host with a different TSC frequency and TSC >>> scaling is not supported by the CPU) we receive an interrupt in L1 (at >>> this moment all TSC accesses are emulated which guarantees the >>> correctness of the readings), pause all L2 guests, update their kvmclock >>> structures with new data (we already know the new TSC frequency) and >>> then tell L0 that we're done and it can stop emulating TSC accesses. >> >> That’s delightful! Does the emulation magic also work for L1 user >> mode? > > As far as I understand - yes, all rdtsc* calls will trap into L0. > >> If so, couldn’t we drop the HyperV vclock entirely and just >> fold the adjustment into the core timekeeping data? (Preferably the >> actual core data, which would require core changes, but it could >> plausibly be done in arch code, too.) > > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm > not mistaken, you need to enable nesting for the VM to get the feature - > and most VMs don't have this) so I think we'll have to keep Hyper-V > vclock for the time being. > > But this does suggest that the correct way to pass a clock through to an L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use kvmclock (or something newer and better). This would require adding support for atomic frequency changes all the way through the timekeeping and arch code. John, tglx, would that be okay or crazy? ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Andy Lutomirski writes: >> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov wrote: >> >> Andy Lutomirski writes: >> >>> Hi Vitaly, Paolo, Radim, etc., >>> >> The notification you're talking about exists, it is called >> Reenligntenment, see 0092e4346f49 "x86/kvm: Support Hyper-V >> reenlightenment"). When TSC page changes (and this only happens when L1 >> is migrated to a different host with a different TSC frequency and TSC >> scaling is not supported by the CPU) we receive an interrupt in L1 (at >> this moment all TSC accesses are emulated which guarantees the >> correctness of the readings), pause all L2 guests, update their kvmclock >> structures with new data (we already know the new TSC frequency) and >> then tell L0 that we're done and it can stop emulating TSC accesses. > > That’s delightful! Does the emulation magic also work for L1 user > mode? As far as I understand - yes, all rdtsc* calls will trap into L0. > If so, couldn’t we drop the HyperV vclock entirely and just > fold the adjustment into the core timekeeping data? (Preferably the > actual core data, which would require core changes, but it could > plausibly be done in arch code, too.) Not all Hyper-V hosts support reenlightenment notifications (and, if I'm not mistaken, you need to enable nesting for the VM to get the feature - and most VMs don't have this) so I think we'll have to keep Hyper-V vclock for the time being. -- Vitaly ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov wrote: > > Andy Lutomirski writes: > >> Hi Vitaly, Paolo, Radim, etc., >> >>> On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: >>> >>> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() >>> implementation, which extended the clockid switch case and added yet >>> another slightly different copy of the same code. >>> >>> Especially the extended switch case is problematic as the compiler tends to >>> generate a jump table which then requires to use retpolines. If jump tables >>> are disabled it adds yet another conditional to the existing maze. >>> >>> This series takes a different approach by consolidating the almost >>> identical functions into one implementation for high resolution clocks and >>> one for the coarse grained clock ids by storing the base data for each >>> clock id in an array which is indexed by the clock id. >>> >> >> I was trying to understand more of the implications of this patch >> series, and I was again reminded that there is an entire extra copy of >> the vclock reading code in arch/x86/kvm/x86.c. And the purpose of >> that code is very, very opaque. >> >> Can one of you explain what the code is even doing? From a couple of >> attempts to read through it, it's a whole bunch of >> probably-extremely-buggy code that, drumroll please, tries to >> atomically read the TSC value and the time. And decide whether the >> result is "based on the TSC". And then synthesizes a TSC-to-ns >> multiplier and shift, based on *something other than the actual >> multiply and shift used*. >> >> IOW, unless I'm totally misunderstanding it, the code digs into the >> private arch clocksource data intended for the vDSO, uses a poorly >> maintained copy of the vDSO code to read the time (instead of doing >> the sane thing and using the kernel interfaces for this), and >> propagates a totally made up copy to the guest. And gets it entirely >> wrong when doing nested virt, since, unless there's some secret in >> this maze, it doesn't acutlaly use the scaling factor from the host >> when it tells the guest what to do. >> >> I am really, seriously tempted to send a patch to simply delete all >> this code. The correct way to do it is to hook > > "I have discovered a truly marvelous proof of this, which this margin is > too narrow to contain" :-) > > There is a very long history of different (hardware) issues Marcelo was > fighting with and the current code is the survived Frankenstein. E.g. it > is very, very unclear what "catchup", "always catchup" and > masterclock-less mode in general are and if we still need them. > > That said I'm all for simplification. I'm not sure if we still need to > care about buggy hardware though. > >> >> And I don't see how it's even possible to pass kvmclock correctly to >> the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but >> L1 isn't notified when the data structure changes, so how the heck is >> it supposed to update the kvmclock structure? > > Well, this kind of works in the the followin way: > L1's clocksource is 'tsc_page' which is, basically, a compliment to TSC: > two numbers provided by L0: offset and scale and KVM was tought to treat > this clocksource as a good one (see b0c39dc68e3b "x86/kvm: Pass stable > clocksource to guests when running nested on Hyper-V"). > > The notification you're talking about exists, it is called > Reenligntenment, see 0092e4346f49 "x86/kvm: Support Hyper-V > reenlightenment"). When TSC page changes (and this only happens when L1 > is migrated to a different host with a different TSC frequency and TSC > scaling is not supported by the CPU) we receive an interrupt in L1 (at > this moment all TSC accesses are emulated which guarantees the > correctness of the readings), pause all L2 guests, update their kvmclock > structures with new data (we already know the new TSC frequency) and > then tell L0 that we're done and it can stop emulating TSC accesses. That’s delightful! Does the emulation magic also work for L1 user mode? If so, couldn’t we drop the HyperV vclock entirely and just fold the adjustment into the core timekeeping data? (Preferably the actual core data, which would require core changes, but it could plausibly be done in arch code, too.) > > (Nothing like this exists for KVM-on-KVM, by the way, when L1's > clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.) > > ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Andy Lutomirski writes: > Hi Vitaly, Paolo, Radim, etc., > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: >> >> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() >> implementation, which extended the clockid switch case and added yet >> another slightly different copy of the same code. >> >> Especially the extended switch case is problematic as the compiler tends to >> generate a jump table which then requires to use retpolines. If jump tables >> are disabled it adds yet another conditional to the existing maze. >> >> This series takes a different approach by consolidating the almost >> identical functions into one implementation for high resolution clocks and >> one for the coarse grained clock ids by storing the base data for each >> clock id in an array which is indexed by the clock id. >> > > I was trying to understand more of the implications of this patch > series, and I was again reminded that there is an entire extra copy of > the vclock reading code in arch/x86/kvm/x86.c. And the purpose of > that code is very, very opaque. > > Can one of you explain what the code is even doing? From a couple of > attempts to read through it, it's a whole bunch of > probably-extremely-buggy code that, drumroll please, tries to > atomically read the TSC value and the time. And decide whether the > result is "based on the TSC". And then synthesizes a TSC-to-ns > multiplier and shift, based on *something other than the actual > multiply and shift used*. > > IOW, unless I'm totally misunderstanding it, the code digs into the > private arch clocksource data intended for the vDSO, uses a poorly > maintained copy of the vDSO code to read the time (instead of doing > the sane thing and using the kernel interfaces for this), and > propagates a totally made up copy to the guest. And gets it entirely > wrong when doing nested virt, since, unless there's some secret in > this maze, it doesn't acutlaly use the scaling factor from the host > when it tells the guest what to do. > > I am really, seriously tempted to send a patch to simply delete all > this code. The correct way to do it is to hook "I have discovered a truly marvelous proof of this, which this margin is too narrow to contain" :-) There is a very long history of different (hardware) issues Marcelo was fighting with and the current code is the survived Frankenstein. E.g. it is very, very unclear what "catchup", "always catchup" and masterclock-less mode in general are and if we still need them. That said I'm all for simplification. I'm not sure if we still need to care about buggy hardware though. > > And I don't see how it's even possible to pass kvmclock correctly to > the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but > L1 isn't notified when the data structure changes, so how the heck is > it supposed to update the kvmclock structure? Well, this kind of works in the the followin way: L1's clocksource is 'tsc_page' which is, basically, a compliment to TSC: two numbers provided by L0: offset and scale and KVM was tought to treat this clocksource as a good one (see b0c39dc68e3b "x86/kvm: Pass stable clocksource to guests when running nested on Hyper-V"). The notification you're talking about exists, it is called Reenligntenment, see 0092e4346f49 "x86/kvm: Support Hyper-V reenlightenment"). When TSC page changes (and this only happens when L1 is migrated to a different host with a different TSC frequency and TSC scaling is not supported by the CPU) we receive an interrupt in L1 (at this moment all TSC accesses are emulated which guarantees the correctness of the readings), pause all L2 guests, update their kvmclock structures with new data (we already know the new TSC frequency) and then tell L0 that we're done and it can stop emulating TSC accesses. (Nothing like this exists for KVM-on-KVM, by the way, when L1's clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.) -- Vitaly ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Hi Vitaly, Paolo, Radim, etc., On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner wrote: > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > implementation, which extended the clockid switch case and added yet > another slightly different copy of the same code. > > Especially the extended switch case is problematic as the compiler tends to > generate a jump table which then requires to use retpolines. If jump tables > are disabled it adds yet another conditional to the existing maze. > > This series takes a different approach by consolidating the almost > identical functions into one implementation for high resolution clocks and > one for the coarse grained clock ids by storing the base data for each > clock id in an array which is indexed by the clock id. > I was trying to understand more of the implications of this patch series, and I was again reminded that there is an entire extra copy of the vclock reading code in arch/x86/kvm/x86.c. And the purpose of that code is very, very opaque. Can one of you explain what the code is even doing? From a couple of attempts to read through it, it's a whole bunch of probably-extremely-buggy code that, drumroll please, tries to atomically read the TSC value and the time. And decide whether the result is "based on the TSC". And then synthesizes a TSC-to-ns multiplier and shift, based on *something other than the actual multiply and shift used*. IOW, unless I'm totally misunderstanding it, the code digs into the private arch clocksource data intended for the vDSO, uses a poorly maintained copy of the vDSO code to read the time (instead of doing the sane thing and using the kernel interfaces for this), and propagates a totally made up copy to the guest. And gets it entirely wrong when doing nested virt, since, unless there's some secret in this maze, it doesn't acutlaly use the scaling factor from the host when it tells the guest what to do. I am really, seriously tempted to send a patch to simply delete all this code. The correct way to do it is to hook And I don't see how it's even possible to pass kvmclock correctly to the L2 guest when L0 is hyperv. KVM could pass *hyperv's* clock, but L1 isn't notified when the data structure changes, so how the heck is it supposed to update the kvmclock structure? --Andy ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Mon, Sep 17, 2018 at 3:00 PM Thomas Gleixner wrote: > > On Fri, 14 Sep 2018, Arnd Bergmann wrote: > > On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner wrote: > > A couple of architectures (s390, ia64, riscv, powerpc, arm64) > > implement the vdso as assembler code at the moment, so they > > won't be as easy to consolidate (other than outright replacing all > > the code). > > > > The other five: > > arch/x86/entry/vdso/vclock_gettime.c > > arch/sparc/vdso/vclock_gettime.c > > arch/nds32/kernel/vdso/gettimeofday.c > > arch/mips/vdso/gettimeofday.c > > arch/arm/vdso/vgettimeofday.c > > > > are basically all minor variations of the same code base and could be > > consolidated to some degree. > > Any suggestions here? Should we plan to do that consolitdation based on > > your new version, or just add clock_gettime64 in arm32 and x86-32, and then > > be done with it? The other ones will obviously still be fast for 32-bit > > time_t > > and will have a working non-vdso sys_clock_getttime64(). > > In principle consolidating all those implementations should be possible to > some extent and probably worthwhile. What's arch specific are the actual > accessors to the hardware clocks. Ok. > > I also wonder about clock_getres(): half the architectures seem to implement > > it in vdso, but notably arm32 and x86 don't, and I had not expected it to be > > performance critical given that the result is easily cached in user space. > > getres() is not really performance critical, but adding it does not create > a huge problem either. Right, I'd just not add a getres_time64() to the vdso then. Arnd ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, 14 Sep 2018, Arnd Bergmann wrote: > On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner wrote: > A couple of architectures (s390, ia64, riscv, powerpc, arm64) > implement the vdso as assembler code at the moment, so they > won't be as easy to consolidate (other than outright replacing all > the code). > > The other five: > arch/x86/entry/vdso/vclock_gettime.c > arch/sparc/vdso/vclock_gettime.c > arch/nds32/kernel/vdso/gettimeofday.c > arch/mips/vdso/gettimeofday.c > arch/arm/vdso/vgettimeofday.c > > are basically all minor variations of the same code base and could be > consolidated to some degree. > Any suggestions here? Should we plan to do that consolitdation based on > your new version, or just add clock_gettime64 in arm32 and x86-32, and then > be done with it? The other ones will obviously still be fast for 32-bit time_t > and will have a working non-vdso sys_clock_getttime64(). In principle consolidating all those implementations should be possible to some extent and probably worthwhile. What's arch specific are the actual accessors to the hardware clocks. > I also wonder about clock_getres(): half the architectures seem to implement > it in vdso, but notably arm32 and x86 don't, and I had not expected it to be > performance critical given that the result is easily cached in user space. getres() is not really performance critical, but adding it does not create a huge problem either. Thanks, tglx ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner wrote: > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > implementation, which extended the clockid switch case and added yet > another slightly different copy of the same code. > > Especially the extended switch case is problematic as the compiler tends to > generate a jump table which then requires to use retpolines. If jump tables > are disabled it adds yet another conditional to the existing maze. > > This series takes a different approach by consolidating the almost > identical functions into one implementation for high resolution clocks and > one for the coarse grained clock ids by storing the base data for each > clock id in an array which is indexed by the clock id. > > This completely eliminates the switch case and allows further > simplifications of the code base, which at the end all together gain a few > cycles performance or at least stay on par with todays code. The resulting > performance depends heavily on the micro architecture and the compiler. No objections from my side, just a few general remarks: Deepa and I have discussed the vdso in the past for 2038. I have started a patch that I'll have to redo on top of yours, but that is fine, your cleanup is only going to help here. More generally speaking, Deepa said that she would like to see some generalization on the vdso before adding the time64_t based variants. Your series probably makes x86 diverge more from the others, which makes it harder to consolidate them again, but we might just as well use your new implementation to base the generic one on, and just move the other ones over to that. A couple of architectures (s390, ia64, riscv, powerpc, arm64) implement the vdso as assembler code at the moment, so they won't be as easy to consolidate (other than outright replacing all the code). The other five: arch/x86/entry/vdso/vclock_gettime.c arch/sparc/vdso/vclock_gettime.c arch/nds32/kernel/vdso/gettimeofday.c arch/mips/vdso/gettimeofday.c arch/arm/vdso/vgettimeofday.c are basically all minor variations of the same code base and could be consolidated to some degree. Any suggestions here? Should we plan to do that consolitdation based on your new version, or just add clock_gettime64 in arm32 and x86-32, and then be done with it? The other ones will obviously still be fast for 32-bit time_t and will have a working non-vdso sys_clock_getttime64(). I also wonder about clock_getres(): half the architectures seem to implement it in vdso, but notably arm32 and x86 don't, and I had not expected it to be performance critical given that the result is easily cached in user space. Arnd ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, 14 Sep 2018, Florian Weimer wrote: > On 09/14/2018 03:05 PM, Thomas Gleixner wrote: > > On Fri, 14 Sep 2018, Florian Weimer wrote: > > > > > On 09/14/2018 02:50 PM, Thomas Gleixner wrote: > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > > > implementation, which extended the clockid switch case and added yet > > > > another slightly different copy of the same code. > > > > > > > > Especially the extended switch case is problematic as the compiler tends > > > > to > > > > generate a jump table which then requires to use retpolines. > > > > > > Does vDSO code really have to use retpolines? It's in userspace, after > > > all. > > > > Unless you have IBRS/STIPB enabled, you need user space ratpoutine as well. > > I don't think this is a consensus position, and it obviously depends on the > (sub)architecture. It does, but we are not building kernels for specific micro architectures nor do distros AFAIK. But that aside, even with jump tables the generated code (ratpoutine disabled) is not any better than with the current approach. It's even worse and needs the same optimizations at the end and then the main difference is that you have 3 copies of one function and 2 of the other. Not a win at all. Thanks, tglx ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, Sep 14, 2018 at 02:56:46PM +0200, Florian Weimer wrote: > On 09/14/2018 02:50 PM, Thomas Gleixner wrote: > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > implementation, which extended the clockid switch case and added yet > > another slightly different copy of the same code. > > > > Especially the extended switch case is problematic as the compiler tends to > > generate a jump table which then requires to use retpolines. > > Does vDSO code really have to use retpolines? It's in userspace, after all. Userspace is equally susceptible to spectre-v2. Ideally we'd recompile world with retpoline, but given the amount of inline asm in say things like openssl and similar projects, validating that there are indeed no indirect calls/jumps left is nontrivial. There are currently pending patches to otherwise address user-user spectre-v2 attacks. ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On 09/14/2018 03:05 PM, Thomas Gleixner wrote: On Fri, 14 Sep 2018, Florian Weimer wrote: On 09/14/2018 02:50 PM, Thomas Gleixner wrote: Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() implementation, which extended the clockid switch case and added yet another slightly different copy of the same code. Especially the extended switch case is problematic as the compiler tends to generate a jump table which then requires to use retpolines. Does vDSO code really have to use retpolines? It's in userspace, after all. Unless you have IBRS/STIPB enabled, you need user space ratpoutine as well. I don't think this is a consensus position, and it obviously depends on the (sub)architecture. Thanks, Florian ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On Fri, 14 Sep 2018, Florian Weimer wrote: > On 09/14/2018 02:50 PM, Thomas Gleixner wrote: > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() > > implementation, which extended the clockid switch case and added yet > > another slightly different copy of the same code. > > > > Especially the extended switch case is problematic as the compiler tends to > > generate a jump table which then requires to use retpolines. > > Does vDSO code really have to use retpolines? It's in userspace, after all. Unless you have IBRS/STIPB enabled, you need user space ratpoutine as well. Thanks, tglx ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
On 09/14/2018 02:50 PM, Thomas Gleixner wrote: Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() implementation, which extended the clockid switch case and added yet another slightly different copy of the same code. Especially the extended switch case is problematic as the compiler tends to generate a jump table which then requires to use retpolines. Does vDSO code really have to use retpolines? It's in userspace, after all. Thanks, Florian ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
[patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime() implementation, which extended the clockid switch case and added yet another slightly different copy of the same code. Especially the extended switch case is problematic as the compiler tends to generate a jump table which then requires to use retpolines. If jump tables are disabled it adds yet another conditional to the existing maze. This series takes a different approach by consolidating the almost identical functions into one implementation for high resolution clocks and one for the coarse grained clock ids by storing the base data for each clock id in an array which is indexed by the clock id. This completely eliminates the switch case and allows further simplifications of the code base, which at the end all together gain a few cycles performance or at least stay on par with todays code. The resulting performance depends heavily on the micro architecture and the compiler. Thanks, tglx 8<--- arch/x86/Kconfig|1 arch/x86/entry/vdso/vclock_gettime.c| 199 arch/x86/entry/vsyscall/vsyscall_gtod.c | 55 arch/x86/include/asm/vgtod.h| 46 --- arch/x86/kernel/time.c | 22 +++ include/linux/clocksource.h |5 kernel/time/Kconfig |4 kernel/time/clocksource.c |2 8 files changed, 144 insertions(+), 190 deletions(-) ___ devel mailing list de...@linuxdriverproject.org http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel