Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

2024-05-06 Thread Marcelo Tosatti
On Fri, May 03, 2024 at 05:44:22PM -0300, Leonardo Bras wrote:
> On Wed, Apr 17, 2024 at 10:22:18AM -0700, Sean Christopherson wrote:
> > On Wed, Apr 17, 2024, Marcelo Tosatti wrote:
> > > On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> > > > On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > > > > > Why not have
> > > > > > KVM provide a "this task is in KVM_RUN" flag, and then let the 
> > > > > > existing timeout
> > > > > > handle the (hopefully rare) case where KVM doesn't "immediately" 
> > > > > > re-enter the guest?
> > > > > 
> > > > > Do you mean something like:
> > > > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index d9642dd06c25..0ca5a6a45025 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > > > > return 1;
> > > > >  
> > > > > /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > > > > -   if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > > > +   if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > > > > return 0;
> > > > 
> > > > Yes.  This, https://lore.kernel.org/all/zhan28bcmsfl4...@google.com, 
> > > > plus logic
> > > > in kvm_sched_{in,out}().
> > > 
> > > Question: where is vcpu->wants_to_run set? (or, where is the full series
> > > again?).
> > 
> > Precisely around the call to kvm_arch_vcpu_ioctl_run().  I am planning on 
> > applying
> > the patch that introduces the code for 6.10[*], I just haven't yet for a 
> > variety
> > of reasons.
> > 
> > [*] https://lore.kernel.org/all/20240307163541.92138-1-dmatl...@google.com
> > 
> > > So for guest HLT emulation, there is a window between
> > > 
> > > kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put 
> > > and the idle task's call to ct_cpuidle_enter, where 
> > > 
> > > ct_dynticks_nesting() != 0 and vcpu_put has already executed.
> > > 
> > > Even for idle=poll, the race exists.
> > 
> > Is waking rcuc actually problematic?
> 
> Yeah, it may introduce a lot (30us) of latency in some cases, causing a 
> missed deadline.
> 
> When dealing with RT tasks, missing a deadline can be really bad, so we 
> need to make sure it will happen as rarely as possible.
> 
> >  I agree it's not ideal, but it's a smallish
> > window, i.e. is unlikely to happen frequently, and if rcuc is awakened, it 
> > will
> > effectively steal cycles from the idle thread, not the vCPU thread.
> 
> It would be fine, but sometimes the idle thread will run very briefly, and 
> stealing microseconds from it will still steal enough time from the vcpu 
> thread to become a problem.
> 
> >  If the vCPU
> > gets a wake event before rcuc completes, then the vCPU could experience 
> > jitter,
> > but that could also happen if the CPU ends up in a deep C-state.
> 
> IIUC, if the scenario calls for a very short HLT, which is kind of usual, 
> then the CPU will not get into deep C-state. 
> For the scenarios where a longer HLT happens, it would be fine.

And it might be that the chosen idle state has low latency.

There is interest from customers in using realtime and saving energy as
well.

For example:

https://doc.dpdk.org/guides/sample_app_ug/l3_forward_power_man.html

> > And that race exists in general, i.e. any IRQ that arrives just as the idle 
> > task
> > is being scheduled in will unnecessarily wakeup rcuc.
> 
> That's a race that could be solved with the timeout (snapshot) solution, if we 
> don't zero last_guest_exit on kvm_sched_out(), right?

Yes.
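
A minimal sketch of the "don't zero it on kvm_sched_out()" idea, assuming the
per-CPU kvm_last_guest_exit jiffies snapshot discussed in this thread (the hook
is reduced to the relevant part and is an illustration, not the actual KVM code):

static void kvm_sched_out(struct preempt_notifier *pn, struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	/*
	 * Deliberately leave this CPU's kvm_last_guest_exit snapshot in
	 * place: an IRQ that arrives while the next (e.g. idle) task is
	 * being scheduled in still sees a recent guest exit, and the
	 * 1 second time_before() check bounds how long rcu_pending()
	 * keeps treating this CPU as "about to re-enter the guest".
	 */
	kvm_arch_vcpu_put(vcpu);
}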

> > > > > /* Is the RCU core waiting for a quiescent state from this CPU? */
> > > > > 
> > > > > The problem is:
> > > > > 
> > > > > 1) You should only set that flag, in the VM-entry path, after the 
> > > > > point
> > > > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> > > > 
> > > > Why?  As established above, KVM essentially has 1 second to enter the 
>

Re: [PATCH v2 01/15] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()

2024-05-04 Thread Marcelo Tosatti
On Sat, Apr 27, 2024 at 12:04:58PM +0100, David Woodhouse wrote:
> From: David Woodhouse 
> 
> The KVM clock is an interesting thing. It is defined as "nanoseconds
> since the guest was created", but in practice it runs at two *different*
> rates — or three different rates, if you count implementation bugs.
> 
> Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
> of the host, with a delta of kvm->arch.kvmclock_offset.
> 
> But that version doesn't actually get used in the common case, where the
> host has a reliable TSC and the guest TSCs are all running at the same
> rate and in sync with each other, and kvm->arch.use_master_clock is set.
> 
> In that common case, definition B is used: There is a reference point in
> time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
> and a corresponding host TSC value kvm->arch.master_cycle_now. This
> fixed point in time is converted to guest units (the time offset by
> kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
> value) and advertised to the guest in the pvclock structure. While in
> this 'use_master_clock' mode, the fixed point in time never needs to be
> changed, and the clock runs precisely in time with the guest TSC, at the
> rate advertised in the pvclock structure.
> 
> The third definition C is implemented in kvm_get_wall_clock_epoch() and
> __get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
> but converting the *host* TSC cycles directly to a value in nanoseconds
> instead of scaling via the guest TSC.
> 
> One might naïvely think that all three definitions are identical, since
> CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
> three are just the result of counting the host TSC at a known frequency,
> or the scaled guest TSC at a known precise fraction of the host's
> frequency. The problem is with arithmetic precision, and the way that
> frequency scaling is done in a division-free way by multiplying by a
> scale factor, then shifting right. In practice, all three ways of
> calculating the KVM clock will suffer a systemic drift from each other.
> 
> Eventually, definition C should just be eliminated. Commit 451a707813ae
> ("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
> the specific case of Xen timers, which are defined in terms of the KVM
> clock and suffered from a continually increasing error in timer expiry
> times. That commit notes that get_kvmclock_ns() is non-trivial to fix
> and says "I'll come back to that", which remains true.
> 
> Definitions A and B do need to coexist, the former to handle the case
> where the host or guest TSC is suboptimally configured. But KVM should
> be more careful about switching between them, and the discontinuity in
> guest time which could result.
> 
> In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
> time as the reference in master_kernel_ns and master_cycle_now, yanking
> the guest's clock back to match definition A at that moment.

KVM_REQ_MASTERCLOCK_UPDATE stops the vcpus because:

 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.

> When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
> should probably *adjust* kvm->arch.kvmclock_offset to account for the
> drift, instead of yanking the clock back to definition A.

You are likely correct...
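
A rough sketch of the kind of adjustment being suggested (illustration only, not
the actual kvm_update_masterclock(); compute_masterclock_ns() is a hypothetical
helper standing in for "read the clock under definition B with the current
snapshot"):

static void kvm_update_masterclock_continuous(struct kvm *kvm)
{
	u64 before, after;
	s64 drift;

	before = compute_masterclock_ns(kvm);	/* definition B, old snapshot */
	pvclock_update_vm_gtod_copy(kvm);	/* take a new master snapshot */
	after = compute_masterclock_ns(kvm);	/* definition B, new snapshot */

	/* Fold the accumulated drift into the offset so guest time does not jump. */
	drift = (s64)(before - after);
	kvm->arch.kvmclock_offset += drift;
}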

> But in the meantime there are a bunch of places where it just doesn't need to 
> be
> invoked at all.
> 
> To start with: there is no need to do such an update when a Xen guest
> populates the shared_info page. This seems to have been a hangover from
> the very first implementation of shared_info which automatically
> populated the vcpu_info structures at their default locations, but even
> then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
> instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
> expected to explicitly set the vcpu_info even in its default locations,
> there's not even any need for that either.
> 
> Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
> Signed-off-by: David Woodhouse 
> Reviewed-by: Paul Durrant 
> ---
>  arch/x86/kvm/xen.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index f65b35a05d91..5a83a8154b79 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
>   wc->version = wc_version + 1;
>   read_unlock_irq(&gpc->lock);
>  
> - kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
> -
>  out:
>   srcu_read_unlock(&kvm->srcu, idx);
>   return ret;
> -- 
> 2.44.0

So KVM_REQ_MASTERCLOCK_UPDATE is to avoid the race above.

In what contexts is kvm_xen_shared_info_init() called again?

That is not clear to me.

Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

2024-04-17 Thread Marcelo Tosatti
On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> > > And snapshotting the VM-Exit time will get false negatives when the vCPU 
> > > is about
> > > to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU 
> > > was
> > > preempted and/or migrated to a different pCPU.
> > 
> > Right, for the use-case where waking up rcuc is a problem, the pCPU is
> > isolated (there are no userspace processes and hopefully no kernel threads
> > executing there), vCPU pinned to that pCPU.
> > 
> > So there should be no preemptions or migrations.
> 
> I understand that preemption/migration will not be problematic if the system 
> is
> configured "correctly", but we still need to play nice with other scenarios 
> and/or
> suboptimal setups.  While false positives aren't fatal, KVM still should do 
> its
> best to avoid them, especially when it's relatively easy to do so.

Sure.

> > > My understanding is that RCU already has a timeout to avoid stalling RCU. 
> > >  I don't
> > > see what is gained by effectively duplicating that timeout for KVM.
> > 
> > The point is not to avoid stalling RCU. The point is to not perform RCU
> > core processing through rcuc thread (because that interrupts execution
> > of the vCPU thread), if it is known that an extended quiescent state 
> > will occur "soon" anyway (via VM-entry).
> 
> I know.  My point is that, as you note below, RCU will wake-up rcuc after 1 
> second
> even if KVM is still reporting a VM-Enter is imminent, i.e. there's a 1 second
> timeout to avoid an RCU stall to due to KVM never completing entry to the 
> guest.

Right.

So, replying to the sentence:

"My understanding is that RCU already has a timeout to avoid stalling RCU.  I don't
 see what is gained by effectively duplicating that timeout for KVM."

the answer is that the current RCU timeout is not functional for KVM VM entries,
and therefore needs modification.

> > If the extended quiescent state does not occur in 1 second, then rcuc
> > will be woken up (the time_before call in rcu_nohz_full_cpu function 
> > above).
> > 
> > > Why not have
> > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing 
> > > timeout
> > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter 
> > > the guest?
> > 
> > Do you mean something like:
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index d9642dd06c25..0ca5a6a45025 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > return 1;
> >  
> > /* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> > -   if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > +   if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > return 0;
> 
> Yes.  This, https://lore.kernel.org/all/zhan28bcmsfl4...@google.com, plus 
> logic
> in kvm_sched_{in,out}().

Question: where is vcpu->wants_to_run set? (or, where is the full series
again?).

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index bfb2b52a1416..5a7efc669a0f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 {
int cpu = get_cpu();
 
+   if (vcpu->wants_to_run)
+   context_tracking_guest_start_run_loop();
+
__this_cpu_write(kvm_running_vcpu, vcpu);
preempt_notifier_register(&vcpu->preempt_notifier);
kvm_arch_vcpu_load(vcpu, cpu);
@@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
kvm_arch_vcpu_put(vcpu);
preempt_notifier_unregister(&vcpu->preempt_notifier);
__this_cpu_write(kvm_running_vcpu, NULL);
+
+   if (vcpu->wants_to_run)
+   context_tracking_guest_stop_run_loop();
+
preempt_enable();
 }
 EXPORT_SYMBOL_GPL(vcpu_put);
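
For reference, a sketch of one possible placement, consistent with Sean's reply
elsewhere in this thread ("precisely around the call to
kvm_arch_vcpu_ioctl_run()"); the wrapper function and the exact condition below
are assumptions for illustration, not the actual series:

static long kvm_vcpu_run_wrapper(struct kvm_vcpu *vcpu)
{
	long r;

	/* Mark the task as intending to (re)enter the guest run loop. */
	vcpu->wants_to_run = !READ_ONCE(vcpu->run->immediate_exit);
	r = kvm_arch_vcpu_ioctl_run(vcpu);
	vcpu->wants_to_run = false;

	return r;
}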

A little worried about guest HLT:

/**
 * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle
 *
 * If the current CPU is idle and running at a first-level (not nested)
 * interrupt, or directly, from idle, return true.
 *
 * The caller must have at least disabled IRQs.
 */
static int rcu_is_cpu_rrupt_from_idle(void)
{
long nesting;

/*
 * Usually called from the tick; but also used from smp_function_call()
 * for expedited grace periods. This latter can result in 

Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

2024-04-16 Thread Marcelo Tosatti
On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > > > Beyond a certain point, we have no choice.  How long should RCU let
> > > > a CPU run with preemption disabled before complaining?  We choose 21
> > > > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > > > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> > > 
> > > Issuing a warning based on an arbitrary time limit is wildly different 
> > > than using
> > > an arbitrary time window to make functional decisions.  My objection to 
> > > the "assume
> > > the CPU will enter a quiescent state if it exited a KVM guest in the last 
> > > second"
> > > is that there are plenty of scenarios where that assumption falls apart, 
> > > i.e. where
> > > _that_ physical CPU will not re-enter the guest.
> > > 
> > > Off the top of my head:
> > > 
> > >  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* 
> > > pCPU
> > >will get false positives, and the *new* pCPU will get false negatives 
> > > (though
> > >the false negatives aren't all that problematic since the pCPU will 
> > > enter a
> > >quiescent state on the next VM-Enter.
> > > 
> > >  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, 
> > > i.e.
> > >won't re-enter the guest.  And so the pCPU will get false positives 
> > > until the
> > >vCPU gets a wake event or the 1 second window expires.
> > > 
> > >  - If the VM terminates, the pCPU will get false positives until the 1 
> > > second
> > >window expires.
> > > 
> > > The false positives are solvable problems, by hooking vcpu_put() to reset
> > > kvm_last_guest_exit.  And to help with the false negatives when a vCPU 
> > > task is
> > > scheduled in on a different pCPU, KVM would hook vcpu_load().
> > 
> > Hi Sean,
> > 
> > So this should deal with it? (untested, don't apply...).
> 
> Not entirely.  As I belatedly noted, hooking vcpu_put() doesn't handle the 
> case
> where the vCPU is preempted, i.e. kvm_sched_out() would also need to zero out
> kvm_last_guest_exit to avoid a false positive. 

True. Can fix that.

> Going through the scheduler will
> note the CPU is quiescent for the current grace period, but after that RCU 
> will
> still see a non-zero kvm_last_guest_exit even though the vCPU task isn't 
> actively
> running.

Right, can fix kvm_sched_out().

> And snapshotting the VM-Exit time will get false negatives when the vCPU is 
> about
> to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was
> preempted and/or migrated to a different pCPU.

Right, for the use-case where waking up rcuc is a problem, the pCPU is
isolated (there are no userspace processes and hopefully no kernel threads
executing there), vCPU pinned to that pCPU.

So there should be no preemptions or migrations.

> I don't understand the motivation for keeping the kvm_last_guest_exit logic.

The motivation is to _avoid_ waking up rcuc to perform RCU core
processing, in case the vCPU runs on a nohz full CPU, since
entering the VM is an extended quiescent state.

The logic for the userspace/idle extended quiescent states is the following,
called from the scheduling-clock interrupt:

/*
 * This function is invoked from each scheduling-clock interrupt,
 * and checks to see if this CPU is in a non-context-switch quiescent
 * state, for example, user mode or idle loop.  It also schedules RCU
 * core processing.  If the current grace period has gone on too long,
 * it will ask the scheduler to manufacture a context switch for the sole
 * purpose of providing the needed quiescent state.
 */
void rcu_sched_clock_irq(int user)
{
	...
	if (rcu_pending(user))
		invoke_rcu_core();
	...
}

And, from rcu_pending:

	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
		return 0;

/*
 * Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
 * grace-period kthread will do force_quiescent_state() processing?
 * The idea is to avoid waking up RCU core processing on such a
 * CPU unless the grace period has extended for too long.
 *
 * This code relies on the fact that all NO_HZ_FULL CPUs are also
 * RCU_NOCB_CPU CPUs.
 */
static bool rcu_nohz_full_cpu(void)
{
#ifdef CONFIG_NO_

Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

2024-04-15 Thread Marcelo Tosatti
On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > > 
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to 
> > > (re)enter
> > > a guest)  I was trying to ask about the case where RCU thinks a CPU is 
> > > about to
> > > enter a guest, but the CPU never does (at least, not in the immediate 
> > > future).
> > > 
> > > Or am I just not understanding how RCU's kthreads work?
> > 
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption.  I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
> 
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states.  More 
> below.
> 
> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run.  Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened over than 1 second ago. 
> > > > > > This 1
> > > > > > second value was copied from rcu_nohz_full_cpu() which checks 
> > > > > > if the
> > > > > > grace period started over than a second ago. If this value is 
> > > > > > bad,
> > > > > > I have no issue changing it.
> > > > > 
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal 
> > > > > heuristic regardless
> > > > > of what magic time threshold is used.  
> > > > 
> > > > Why? It works for this particular purpose.
> > > 
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic 
> > > doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the 
> > > same amount
> > > of effort/churn.
> > 
> > Beyond a certain point, we have no choice.  How long should RCU let
> > a CPU run with preemption disabled before complaining?  We choose 21
> > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> 
> Issuing a warning based on an arbitrary time limit is wildly different than 
> using
> an arbitrary time window to make functional decisions.  My objection to the 
> "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last 
> second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. 
> where
> _that_ physical CPU will not re-enter the guest.
> 
> Off the top of my head:
> 
>  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
>will get false positives, and the *new* pCPU will get false negatives 
> (though
>the false negatives aren't all that problematic since the pCPU will enter a
>quiescent state on the next VM-Enter.
> 
>  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
>won't re-enter the guest.  And so the pCPU will get false positives until 
> the
>vCPU gets a wake event or the 1 second window expires.
> 
>  - If the VM terminates, the pCPU will get false positives until the 1 second
>window expires.
> 
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Hi Sean,

So this should deal with it? (untested, don't apply...).

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48f31dcd318a..be90d83d631a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -477,6 +477,16 @@ static __always_inline void guest_state_enter_irqoff(void)
lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+DECLARE_PER_CPU(unsigned long, kvm_last_guest_exit);
+
+/*
+ * Returns time (jiffies) for the last guest exit in current cpu
+ */
+static inline unsigned long guest_exit_last_time(void)
+{
+   return this_cpu_read(kvm_last_guest_exit);
+}
+
 /*
  * Exit guest context and exit an RCU extended quiescent state.
  *
@@ -488,6 +498,9 @@ static 

Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

2024-04-11 Thread Marcelo Tosatti
On Mon, Apr 08, 2024 at 10:16:24AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Paul E. McKenney wrote:
> > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > > > rcuc wakes up (which might exceed the allowed latency threshold
> > > > for certain realtime apps).
> > > 
> > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to 
> > > (re)enter
> > > a guest)  I was trying to ask about the case where RCU thinks a CPU is 
> > > about to
> > > enter a guest, but the CPU never does (at least, not in the immediate 
> > > future).
> > > 
> > > Or am I just not understanding how RCU's kthreads work?
> > 
> > It is quite possible that the current rcu_pending() code needs help,
> > given the possibility of vCPU preemption.  I have heard of people doing
> > nested KVM virtualization -- or is that no longer a thing?
> 
> Nested virtualization is still very much a thing, but I don't see how it is at
> all unique with respect to RCU grace periods and quiescent states.  More 
> below.
> 
> > But the help might well involve RCU telling the hypervisor that a given
> > vCPU needs to run.  Not sure how that would go over, though it has been
> > prototyped a couple times in the context of RCU priority boosting.
> >
> > > > > > 3 - It checks if the guest exit happened over than 1 second ago. 
> > > > > > This 1
> > > > > > second value was copied from rcu_nohz_full_cpu() which checks 
> > > > > > if the
> > > > > > grace period started over than a second ago. If this value is 
> > > > > > bad,
> > > > > > I have no issue changing it.
> > > > > 
> > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal 
> > > > > heuristic regardless
> > > > > of what magic time threshold is used.  
> > > > 
> > > > Why? It works for this particular purpose.
> > > 
> > > Because maintaining magic numbers is no fun, AFAICT the heuristic 
> > > doesn't guard
> > > against edge cases, and I'm pretty sure we can do better with about the 
> > > same amount
> > > of effort/churn.
> > 
> > Beyond a certain point, we have no choice.  How long should RCU let
> > a CPU run with preemption disabled before complaining?  We choose 21
> > seconds in mainline and some distros choose 60 seconds.  Android chooses
> > 20 milliseconds for synchronize_rcu_expedited() grace periods.
> 
> Issuing a warning based on an arbitrary time limit is wildly different than 
> using
> an arbitrary time window to make functional decisions.  My objection to the 
> "assume
> the CPU will enter a quiescent state if it exited a KVM guest in the last 
> second"
> is that there are plenty of scenarios where that assumption falls apart, i.e. 
> where
> _that_ physical CPU will not re-enter the guest.
> 
> Off the top of my head:
> 
>  - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU
>will get false positives, and the *new* pCPU will get false negatives 
> (though
>the false negatives aren't all that problematic since the pCPU will enter a
>quiescent state on the next VM-Enter.
> 
>  - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e.
>won't re-enter the guest.  And so the pCPU will get false positives until 
> the
>vCPU gets a wake event or the 1 second window expires.
> 
>  - If the VM terminates, the pCPU will get false positives until the 1 second
>window expires.
> 
> The false positives are solvable problems, by hooking vcpu_put() to reset
> kvm_last_guest_exit.  And to help with the false negatives when a vCPU task is
> scheduled in on a different pCPU, KVM would hook vcpu_load().

Sean,

It seems that fixing the problems you pointed out above is a way to go.




Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

2024-04-05 Thread Marcelo Tosatti
On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > I am dealing with a latency issue inside a KVM guest, which is caused by
> > a sched_switch to rcuc[1].
> > 
> > During guest entry, kernel code will signal to RCU that current CPU was on
> > a quiescent state, making sure no other CPU is waiting for this one.
> > 
> > If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> > issued somewhere since guest entry, there is a chance a timer interrupt
> > will happen in that CPU, which will cause rcu_sched_clock_irq() to run.
> > 
> > rcu_sched_clock_irq() will check rcu_pending() which will return true,
> > and cause invoke_rcu_core() to be called, which will (in current config)
> > cause rcuc/N to be scheduled into the current cpu.
> > 
> > On rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > idle or userspace, since both are considered quiescent states.
> > 
> > Since this is also true to guest context, my idea to solve this latency
> > issue by avoiding rcu_core() invocation if it was running a guest vcpu.
> > 
> > On the other hand, I could not find a way of reliably saying the current
> > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > for keeping the time (jiffies) of the last guest exit.
> > 
> > In patch #2 I compare current time to that time, and if less than a second
> > has past, we just skip rcu_core() invocation, since there is a high chance
> > it will just go back to the guest in a moment.
> 
> What's the downside if there's a false positive?

rcuc wakes up (which might exceed the allowed latency threshold
for certain realtime apps).

> > What I know it's weird with this patch:
> > 1 - Not sure if this is the best way of finding out if the cpu was
> > running a guest recently.
> > 
> > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > overhead, even though it's supposed to be in local cache. If that's
> > an issue, I would suggest having this part compiled out on 
> > !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > enabled seems more expensive than just setting this out.
> 
> A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> imprecise, e.g. it'll be a full tick "behind" on many exits.
> 
> > 3 - It checks if the guest exit happened over than 1 second ago. This 1
> > second value was copied from rcu_nohz_full_cpu() which checks if the
> > grace period started over than a second ago. If this value is bad,
> > I have no issue changing it.
> 
> IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic 
> regardless
> of what magic time threshold is used.  

Why? It works for this particular purpose.

> IIUC, what you want is a way to detect if
> a CPU is likely to _run_ a KVM vCPU in the near future.  KVM can provide that
> information with much better precision, e.g. KVM knows when when it's in the 
> core
> vCPU run loop.

ktime_t ktime_get(void)
{
	struct timekeeper *tk = &tk_core.timekeeper;
	unsigned int seq;
	ktime_t base;
	u64 nsecs;

	WARN_ON(timekeeping_suspended);

	do {
		seq = read_seqcount_begin(&tk_core.seq);
		base = tk->tkr_mono.base;
		nsecs = timekeeping_get_ns(&tk->tkr_mono);

	} while (read_seqcount_retry(&tk_core.seq, seq));

	return ktime_add_ns(base, nsecs);
}
EXPORT_SYMBOL_GPL(ktime_get);

ktime_get() is more expensive than unsigned long assignment.
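
By contrast, what is needed at guest exit is just a per-CPU word store (a sketch
following the RFC's description of kvm_last_guest_exit; the exact hook point is
an assumption):

DEFINE_PER_CPU(unsigned long, kvm_last_guest_exit);

/* Sketch: called on the guest-exit path, e.g. near guest_state_exit_irqoff(). */
static __always_inline void kvm_record_guest_exit(void)
{
	/* A single per-CPU store of jiffies; no seqcount loop, no scaling math. */
	this_cpu_write(kvm_last_guest_exit, jiffies);
}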

What is done is: if the vcpu has entered guest mode recently, then an RCU
extended quiescent state has already been passed through on this CPU (and
will be again on the next VM-entry), therefore it is not necessary to wake
up the rcu core.

The logic is copied from:

/*
 * Is this CPU a NO_HZ_FULL CPU that should ignore RCU so that the
 * grace-period kthread will do force_quiescent_state() processing?
 * The idea is to avoid waking up RCU core processing on such a
 * CPU unless the grace period has extended for too long.
 *
 * This code relies on the fact that all NO_HZ_FULL CPUs are also
 * RCU_NOCB_CPU CPUs.
 */
static bool rcu_nohz_full_cpu(void)
{
#ifdef CONFIG_NO_HZ_FULL
	if (tick_nohz_full_cpu(smp_processor_id()) &&
	    (!rcu_gp_in_progress() ||
	     time_before(jiffies, READ_ONCE(rcu_state.gp_start) + HZ)))
		return true;
#endif /* #ifdef CONFIG_NO_HZ_FULL */
	return false;
}

Note:

avoid waking up RCU core processing on such a
CPU unless the grace period has extended for too long.
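
Put together, the check added to rcu_pending() looks roughly like the following
(a sketch based on the cover letter's description of patch #2, assuming a
guest_exit_last_time() helper that returns the per-CPU snapshot, as in the
kvm_host.h diff shown above in this digest, and reusing the same 1 second window
as rcu_nohz_full_cpu(); not a verbatim copy of the patch):

	/* Is this a nohz_full CPU in userspace, idle, or fresh out of a guest? */
	if ((user || rcu_is_cpu_rrupt_from_idle() ||
	     time_before(jiffies, guest_exit_last_time() + HZ)) &&
	    rcu_nohz_full_cpu())
		return 0;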

> > 4 - Even though I could detect no issue, I included linux/kvm_host.h into 
> > rcu/tree_plugin.h, which is the first time it's getting included
> > outside of kvm or arch code, and can be weird.
> 
> Heh, kvm_host.h isn't included outside of KVM because several architectures 
> can
> build KVM as a module, which 

Re: [RFC PATCH 0/8] cgroup/cpuset: Support RCU_NOCB on isolated partitions

2024-02-09 Thread Marcelo Tosatti
On Wed, Feb 07, 2024 at 03:47:46PM +0100, Frederic Weisbecker wrote:
> On Tue, Feb 06, 2024 at 04:15:18PM -0300, Marcelo Tosatti wrote:
> > On Tue, Feb 06, 2024 at 01:56:23PM +0100, Frederic Weisbecker wrote:
> > > On Wed, Jan 17, 2024 at 12:15:07PM -0500, Waiman Long wrote:
> > > > 
> > > > On 1/17/24 12:07, Tejun Heo wrote:
> > > > > Hello,
> > > > > 
> > > > > On Wed, Jan 17, 2024 at 11:35:03AM -0500, Waiman Long wrote:
> > > > > > The first 2 patches are adopted from Frederic with minor twists to 
> > > > > > fix
> > > > > > merge conflicts and compilation issue. The rests are for 
> > > > > > implementing
> > > > > > the new cpuset.cpus.isolation_full interface which is essentially a 
> > > > > > flag
> > > > > > to globally enable or disable full CPU isolation on isolated 
> > > > > > partitions.
> > > > > I think the interface is a bit premature. The cpuset partition 
> > > > > feature is
> > > > > already pretty restrictive and makes it really clear that it's to 
> > > > > isolate
> > > > > the CPUs. I think it'd be better to just enable all the isolation 
> > > > > features
> > > > > by default. If there are valid use cases which can't be served without
> > > > > disabling some isolation features, we can worry about adding the 
> > > > > interface
> > > > > at that point.
> > > > 
> > > > My current thought is to make isolated partitions act like 
> > > > isolcpus=domain,
> > > > additional CPU isolation capabilities are optional and can be turned on
> > > > using isolation_full. However, I am fine with making all these turned 
> > > > on by
> > > > default if it is the consensus.
> > > 
> > > Right it was the consensus last time I tried. Along with the fact that 
> > > mutating
> > > this isolation_full set has to be done on offline CPUs to simplify the 
> > > whole
> > > picture.
> > > 
> > > So lemme try to summarize what needs to be done:
> > > 
> > > 1) An all-isolation feature file (that is, all the HK_TYPE_* things) 
> > > on/off for
> > >   now. And if it ever proves needed, provide a way later for more 
> > > finegrained
> > >   tuning.
> > > 
> > > 2) This file must only apply to offline CPUs because it avoids migrations 
> > > and
> > >   stuff.
> > > 
> > > 3) I need to make RCU NOCB tunable only on offline CPUs, which isn't that 
> > > much
> > >changes.
> > > 
> > > 4) HK_TYPE_TIMER:
> > >* Wrt. timers in general, not much needs to be done, the CPUs are
> > >  offline. But:
> > >* arch/x86/kvm/x86.c does something weird
> > >* drivers/char/random.c might need some care
> > >* watchdog needs to be (de-)activated
> > >
> > > 5) HK_TYPE_DOMAIN:
> > >* This one I fear is not mutable, this is isolcpus...
> > 
> > Except for HK_TYPE_DOMAIN, i have never seen anyone use any of this
> > flags.
> 
> HK_TYPE_DOMAIN is used by isolcpus=domain,

> HK_TYPE_MANAGED_IRQ is used by isolcpus=managed_irq,...
> 
> All the others (except HK_TYPE_SCHED) are used by nohz_full=

I mean I've never seen the individual flags being used separately.

You either want full isolation (nohz_full and all the flags together,
except for HK_TYPE_DOMAIN, which is sometimes enabled/disabled), or not.

So why not group them all together?

Do you know of any separate uses of these flags (except for
HK_TYPE_DOMAIN)?
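
To make the grouping concrete, a sketch of a single "everything nohz_full=
implies" mask (the HK_TYPE_* values are the existing ones listed above; the
combined define itself is an illustration, not an existing kernel symbol):

/* Everything nohz_full= implies; HK_TYPE_DOMAIN and HK_TYPE_MANAGED_IRQ stay
 * separate, since those are the isolcpus= flags that get toggled on their own. */
#define HK_FLAGS_FULL_ISOLATION					\
	(BIT(HK_TYPE_TIMER) | BIT(HK_TYPE_RCU)  | BIT(HK_TYPE_MISC) |	\
	 BIT(HK_TYPE_TICK)  | BIT(HK_TYPE_WQ)   | BIT(HK_TYPE_KTHREAD))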

> Thanks.
> 
> > 
> > > 
> > > 6) HK_TYPE_MANAGED_IRQ:
> > >* I prefer not to think about it :-)
> > > 
> > > 7) HK_TYPE_TICK:
> > >* Maybe some tiny ticks internals to revisit, I'll check that.
> > >* There is a remote tick to take into consideration, but again the
> > >  CPUs are offline so it shouldn't be too complicated.
> > > 
> > > 8) HK_TYPE_WQ:
> > >* Fortunately we already have all the mutable interface in place.
> > >  But we must make it live nicely with the sysfs workqueue affinity
> > >  files.
> > > 
> > > 9) HK_FLAG_SCHED:
> > >* Oops, this one is ignored by nohz_full/isolcpus, isn't it?
> > >Should be removed?
> > > 
> > > 10) HK_TYPE_RCU:
> > > * That's point 3) and also some kthreads to affine, which leads us
> > >  to the following in HK_TYPE_KTHREAD:
> > > 
> > > 11) HK_FLAG_KTHREAD:
> > > * I'm guessing it's fine as long as isolation_full is also an
> > >   isolated partition. Then unbound kthreads shouldn't run there.
> > > 
> > > 12) HK_TYPE_MISC:
> > > * Should be fine as ILB isn't running on offline CPUs.
> > > 
> > > Thanks.
> > > 
> > > 
> > 
> 
> 




Re: [RFC PATCH 0/8] cgroup/cpuset: Support RCU_NOCB on isolated partitions

2024-02-06 Thread Marcelo Tosatti
On Tue, Feb 06, 2024 at 01:56:23PM +0100, Frederic Weisbecker wrote:
> On Wed, Jan 17, 2024 at 12:15:07PM -0500, Waiman Long wrote:
> > 
> > On 1/17/24 12:07, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Wed, Jan 17, 2024 at 11:35:03AM -0500, Waiman Long wrote:
> > > > The first 2 patches are adopted from Frederic with minor twists to fix
> > > > merge conflicts and compilation issue. The rests are for implementing
> > > > the new cpuset.cpus.isolation_full interface which is essentially a flag
> > > > to globally enable or disable full CPU isolation on isolated partitions.
> > > I think the interface is a bit premature. The cpuset partition feature is
> > > already pretty restrictive and makes it really clear that it's to isolate
> > > the CPUs. I think it'd be better to just enable all the isolation features
> > > by default. If there are valid use cases which can't be served without
> > > disabling some isolation features, we can worry about adding the interface
> > > at that point.
> > 
> > My current thought is to make isolated partitions act like isolcpus=domain,
> > additional CPU isolation capabilities are optional and can be turned on
> > using isolation_full. However, I am fine with making all these turned on by
> > default if it is the consensus.
> 
> Right it was the consensus last time I tried. Along with the fact that 
> mutating
> this isolation_full set has to be done on offline CPUs to simplify the whole
> picture.
> 
> So lemme try to summarize what needs to be done:
> 
> 1) An all-isolation feature file (that is, all the HK_TYPE_* things) on/off 
> for
>   now. And if it ever proves needed, provide a way later for more finegrained
>   tuning.
> 
> 2) This file must only apply to offline CPUs because it avoids migrations and
>   stuff.
> 
> 3) I need to make RCU NOCB tunable only on offline CPUs, which isn't that much
>changes.
> 
> 4) HK_TYPE_TIMER:
>* Wrt. timers in general, not much needs to be done, the CPUs are
>  offline. But:
>* arch/x86/kvm/x86.c does something weird
>* drivers/char/random.c might need some care
>* watchdog needs to be (de-)activated
>
> 5) HK_TYPE_DOMAIN:
>* This one I fear is not mutable, this is isolcpus...

Except for HK_TYPE_DOMAIN, i have never seen anyone use any of this
flags.

> 
> 6) HK_TYPE_MANAGED_IRQ:
>* I prefer not to think about it :-)
> 
> 7) HK_TYPE_TICK:
>* Maybe some tiny ticks internals to revisit, I'll check that.
>* There is a remote tick to take into consideration, but again the
>  CPUs are offline so it shouldn't be too complicated.
> 
> 8) HK_TYPE_WQ:
>* Fortunately we already have all the mutable interface in place.
>  But we must make it live nicely with the sysfs workqueue affinity
>  files.
> 
> 9) HK_FLAG_SCHED:
>* Oops, this one is ignored by nohz_full/isolcpus, isn't it?
>Should be removed?
> 
> 10) HK_TYPE_RCU:
> * That's point 3) and also some kthreads to affine, which leads us
>  to the following in HK_TYPE_KTHREAD:
> 
> 11) HK_FLAG_KTHREAD:
> * I'm guessing it's fine as long as isolation_full is also an
>   isolated partition. Then unbound kthreads shouldn't run there.
> 
> 12) HK_TYPE_MISC:
> * Should be fine as ILB isn't running on offline CPUs.
> 
> Thanks.
> 
> 




Re: Why invtsc (CPUID_APM_INVTSC) is unmigratable?

2024-01-29 Thread Marcelo Tosatti
On Fri, Jan 26, 2024 at 04:16:57PM +0800, Xiaoyao Li wrote:
> On 1/25/2024 6:05 AM, Marcelo Tosatti wrote:
> > On Wed, Jan 24, 2024 at 10:52:46PM +0800, Xiaoyao Li wrote:
> > > On 1/23/2024 11:39 PM, Marcelo Tosatti wrote:
> > > > On Sat, Jan 20, 2024 at 05:44:07PM +0800, Xiaoyao Li wrote:
> > > > > On 1/20/2024 12:14 AM, Marcelo Tosatti wrote:
> > > > > > On Fri, Jan 19, 2024 at 02:46:22PM +0800, Xiaoyao Li wrote:
> > > > > > > I'm wondering why CPUID_APM_INVTSC is set as unmigratable_flags. 
> > > > > > > Could
> > > > > > > anyone explain it?
> > > > > > 
> > > > > > 
> > > > > > commit 68bfd0ad4a1dcc4c328d5db85dc746b49c1ec07e
> > > > > > Author: Marcelo Tosatti 
> > > > > > Date:   Wed May 14 16:30:09 2014 -0300
> > > > > > 
> > > > > >target-i386: block migration and savevm if invariant tsc is 
> > > > > > exposed
> > > > > >Invariant TSC documentation mentions that "invariant TSC 
> > > > > > will run at a
> > > > > >constant rate in all ACPI P-, C-. and T-states".
> > > > > >This is not the case if migration to a host with different 
> > > > > > TSC frequency
> > > > > >is allowed, or if savevm is performed. So block 
> > > > > > migration/savevm.
> > > > > > 
> > > > > > So the rationale here was that without ensuring the destination host
> > > > > > has the same TSC clock frequency, we can't migrate.
> > > > > 
> > > > > It seems to me the concept of invtsc was extended to "tsc freq will 
> > > > > not
> > > > > change even after the machine is live migrated". I'm not sure it is 
> > > > > correct
> > > > > to extend the concept of invtsc.
> > > > > 
> > > > > The main reason of introducing invtsc is to tell the tsc hardware 
> > > > > keeps
> > > > > counting (at the same rate) even at deep C state, so long as other 
> > > > > states.
> > > > > 
> > > > > For example, a guest is created on machine A with X GHz tsc, and 
> > > > > invtsc
> > > > > exposed (machine A can ensure the guest's tsc counts at X GHz at any 
> > > > > state).
> > > > > If the guest is migrated to machine B with Y GHz tsc, and machine B 
> > > > > can also
> > > > > ensure the invtsc of its guest, i.e., the guest's tsc counts at Y GHz 
> > > > > at any
> > > > > state. IMHO, in this case, the invtsc is supported at both src and 
> > > > > dest,
> > > > > which means it is a migratable feature. However, the migration itself 
> > > > > fails,
> > > > > due to mismatched/different configuration of tsc freq, not due to 
> > > > > invtsc.
> > > > > 
> > > > > > However, this was later extended to allow invtsc migration when 
> > > > > > setting
> > > > > > tsc-khz explicitly:
> > > > > > 
> > > > > > commit d99569d9d8562c480e0befab601756b0b7b5d0e0
> > > > > > Author: Eduardo Habkost 
> > > > > > Date:   Sun Jan 8 15:32:34 2017 -0200
> > > > > > 
> > > > > >kvm: Allow invtsc migration if tsc-khz is set explicitly
> > > > > >We can safely allow a VM to be migrated with invtsc enabled 
> > > > > > if
> > > > > >tsc-khz is set explicitly, because:
> > > > > >* QEMU already refuses to start if it can't set the TSC 
> > > > > > frequency
> > > > > >  to the configured value.
> > > > > >* Management software is already required to keep device
> > > > > >  configuration (including CPU configuration) the same on
> > > > > >  migration source and destination.
> > > > > >Signed-off-by: Eduardo Habkost 
> > > > > >Message-Id: <20170108173234.25721-3-ehabk...@redhat.com>
> > > > > >Signed-off-by: Eduardo Habkost 
> > > > > 
> > > > > But in the case that user doesn't set tsc freq explicitly, the live
> > > > > migration is l

Re: Why invtsc (CPUID_APM_INVTSC) is unmigratable?

2024-01-24 Thread Marcelo Tosatti
On Wed, Jan 24, 2024 at 10:52:46PM +0800, Xiaoyao Li wrote:
> On 1/23/2024 11:39 PM, Marcelo Tosatti wrote:
> > On Sat, Jan 20, 2024 at 05:44:07PM +0800, Xiaoyao Li wrote:
> > > On 1/20/2024 12:14 AM, Marcelo Tosatti wrote:
> > > > On Fri, Jan 19, 2024 at 02:46:22PM +0800, Xiaoyao Li wrote:
> > > > > I'm wondering why CPUID_APM_INVTSC is set as unmigratable_flags. Could
> > > > > anyone explain it?
> > > > 
> > > > 
> > > > commit 68bfd0ad4a1dcc4c328d5db85dc746b49c1ec07e
> > > > Author: Marcelo Tosatti 
> > > > Date:   Wed May 14 16:30:09 2014 -0300
> > > > 
> > > >   target-i386: block migration and savevm if invariant tsc is 
> > > > exposed
> > > >   Invariant TSC documentation mentions that "invariant TSC will run 
> > > > at a
> > > >   constant rate in all ACPI P-, C-. and T-states".
> > > >   This is not the case if migration to a host with different TSC 
> > > > frequency
> > > >   is allowed, or if savevm is performed. So block migration/savevm.
> > > > 
> > > > So the rationale here was that without ensuring the destination host
> > > > has the same TSC clock frequency, we can't migrate.
> > > 
> > > It seems to me the concept of invtsc was extended to "tsc freq will not
> > > change even after the machine is live migrated". I'm not sure it is 
> > > correct
> > > to extend the concept of invtsc.
> > > 
> > > The main reason of introducing invtsc is to tell the tsc hardware keeps
> > > counting (at the same rate) even at deep C state, so long as other states.
> > > 
> > > For example, a guest is created on machine A with X GHz tsc, and invtsc
> > > exposed (machine A can ensure the guest's tsc counts at X GHz at any 
> > > state).
> > > If the guest is migrated to machine B with Y GHz tsc, and machine B can 
> > > also
> > > ensure the invtsc of its guest, i.e., the guest's tsc counts at Y GHz at 
> > > any
> > > state. IMHO, in this case, the invtsc is supported at both src and dest,
> > > which means it is a migratable feature. However, the migration itself 
> > > fails,
> > > due to mismatched/different configuration of tsc freq, not due to invtsc.
> > > 
> > > > However, this was later extended to allow invtsc migration when setting
> > > > tsc-khz explicitly:
> > > > 
> > > > commit d99569d9d8562c480e0befab601756b0b7b5d0e0
> > > > Author: Eduardo Habkost 
> > > > Date:   Sun Jan 8 15:32:34 2017 -0200
> > > > 
> > > >   kvm: Allow invtsc migration if tsc-khz is set explicitly
> > > >   We can safely allow a VM to be migrated with invtsc enabled if
> > > >   tsc-khz is set explicitly, because:
> > > >   * QEMU already refuses to start if it can't set the TSC frequency
> > > > to the configured value.
> > > >   * Management software is already required to keep device
> > > > configuration (including CPU configuration) the same on
> > > > migration source and destination.
> > > >   Signed-off-by: Eduardo Habkost 
> > > >   Message-Id: <20170108173234.25721-3-ehabk...@redhat.com>
> > > >   Signed-off-by: Eduardo Habkost 
> > > 
> > > But in the case that user doesn't set tsc freq explicitly, the live
> > > migration is likely to fail or have issues even without invtsc exposed to
> > > guest,
> > 
> > Depends on how the guest is using the TSC, but yes.
> > 
> > > if the destination host has a different tsc frequency than src host.
> > > 
> > > So why bother checking invtsc only?
> > 
> > Well, if invtsc is exposed to the guest, then it might use the TSC for
> > timekeeping purposes.
> > 
> > Therefore you don't want to fail (on the invtsc clock characteristics)
> > otherwise timekeeping in the guest might be problematic.
> > 
> > But these are all just heuristics.
> > 
> > Do you have a suggestion for different behaviour?
> 
> I think we need to block live migration when user doesn't specify a certain
> tsc frequency explicitly, regardless of invtsc.

The problem is: if that guest is using kvmclock for timekeeping, then live
migration is safe (the kvmclock logic will update the guest's clock parameters
for the TSC frequency of the destination host upon migration).
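
The reason kvmclock keeps working across a TSC frequency change is visible in
the guest-side read path; a simplified sketch of the pvclock arithmetic (based
on the pvclock ABI fields, not the exact kernel implementation):

static u64 pvclock_ns(const struct pvclock_vcpu_time_info *src, u64 tsc)
{
	/*
	 * The host republishes system_time/tsc_timestamp and the scaling pair
	 * tsc_to_system_mul/tsc_shift after migration, so the nanosecond result
	 * stays correct even though the raw TSC frequency changed.
	 */
	u64 delta = tsc - src->tsc_timestamp;

	if (src->tsc_shift >= 0)
		delta <<= src->tsc_shift;
	else
		delta >>= -src->tsc_shift;

	return src->system_time + mul_u64_u32_shr(delta, src->tsc_to_system_mul, 32);
}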




Re: Why invtsc (CPUID_APM_INVTSC) is unmigratable?

2024-01-23 Thread Marcelo Tosatti
On Sat, Jan 20, 2024 at 05:44:07PM +0800, Xiaoyao Li wrote:
> On 1/20/2024 12:14 AM, Marcelo Tosatti wrote:
> > On Fri, Jan 19, 2024 at 02:46:22PM +0800, Xiaoyao Li wrote:
> > > I'm wondering why CPUID_APM_INVTSC is set as unmigratable_flags. Could
> > > anyone explain it?
> > 
> > 
> > commit 68bfd0ad4a1dcc4c328d5db85dc746b49c1ec07e
> > Author: Marcelo Tosatti 
> > Date:   Wed May 14 16:30:09 2014 -0300
> > 
> >  target-i386: block migration and savevm if invariant tsc is exposed
> >  Invariant TSC documentation mentions that "invariant TSC will run at a
> >  constant rate in all ACPI P-, C-. and T-states".
> >  This is not the case if migration to a host with different TSC 
> > frequency
> >  is allowed, or if savevm is performed. So block migration/savevm.
> > 
> > So the rationale here was that without ensuring the destination host
> > has the same TSC clock frequency, we can't migrate.
> 
> It seems to me the concept of invtsc was extended to "tsc freq will not
> change even after the machine is live migrated". I'm not sure it is correct
> to extend the concept of invtsc.
> 
> The main reason of introducing invtsc is to tell the tsc hardware keeps
> counting (at the same rate) even at deep C state, so long as other states.
> 
> For example, a guest is created on machine A with X GHz tsc, and invtsc
> exposed (machine A can ensure the guest's tsc counts at X GHz at any state).
> If the guest is migrated to machine B with Y GHz tsc, and machine B can also
> ensure the invtsc of its guest, i.e., the guest's tsc counts at Y GHz at any
> state. IMHO, in this case, the invtsc is supported at both src and dest,
> which means it is a migratable feature. However, the migration itself fails,
> due to mismatched/different configuration of tsc freq, not due to invtsc.
> 
> > However, this was later extended to allow invtsc migration when setting
> > tsc-khz explicitly:
> > 
> > commit d99569d9d8562c480e0befab601756b0b7b5d0e0
> > Author: Eduardo Habkost 
> > Date:   Sun Jan 8 15:32:34 2017 -0200
> > 
> >  kvm: Allow invtsc migration if tsc-khz is set explicitly
> >  We can safely allow a VM to be migrated with invtsc enabled if
> >  tsc-khz is set explicitly, because:
> >  * QEMU already refuses to start if it can't set the TSC frequency
> >to the configured value.
> >  * Management software is already required to keep device
> >configuration (including CPU configuration) the same on
> >migration source and destination.
> >  Signed-off-by: Eduardo Habkost 
> >  Message-Id: <20170108173234.25721-3-ehabk...@redhat.com>
> >  Signed-off-by: Eduardo Habkost 
> 
> But in the case that user doesn't set tsc freq explicitly, the live
> migration is likely to fail or have issues even without invtsc exposed to
> guest, 

Depends on how the guest is using the TSC, but yes.

> if the destination host has a different tsc frequency than src host.
> 
> So why bother checking invtsc only?

Well, if invtsc is exposed to the guest, then it might use the TSC for
timekeeping purposes. 

Therefore you don't want to break the invtsc clock characteristics
(by letting the TSC frequency change across migration); otherwise
timekeeping in the guest might be problematic.

But these are all just heuristics.

Do you have a suggestion for different behaviour?

> 
> > And support for libvirt was added:
> > 
> > https://listman.redhat.com/archives/libvir-list/2017-January/141757.html
> > 
> > > 
> > > When the host supports invtsc, it can be exposed to guest.
> > > When the src VM has invtsc exposed, what will forbid it to be migrated to 
> > > a
> > > dest that also supports VMs with invtsc exposed?
> > > 
> > > 
> > 
> 
> 




Re: Why invtsc (CPUID_APM_INVTSC) is unmigratable?

2024-01-19 Thread Marcelo Tosatti
On Fri, Jan 19, 2024 at 02:46:22PM +0800, Xiaoyao Li wrote:
> I'm wondering why CPUID_APM_INVTSC is set as unmigratable_flags. Could
> anyone explain it?


commit 68bfd0ad4a1dcc4c328d5db85dc746b49c1ec07e
Author: Marcelo Tosatti 
Date:   Wed May 14 16:30:09 2014 -0300

target-i386: block migration and savevm if invariant tsc is exposed

Invariant TSC documentation mentions that "invariant TSC will run at a
constant rate in all ACPI P-, C-. and T-states".

This is not the case if migration to a host with different TSC frequency
is allowed, or if savevm is performed. So block migration/savevm.

So the rationale here was that without ensuring the destination host 
has the same TSC clock frequency, we can't migrate.

However, this was later extended to allow invtsc migration when setting
tsc-khz explicitly:

commit d99569d9d8562c480e0befab601756b0b7b5d0e0
Author: Eduardo Habkost 
Date:   Sun Jan 8 15:32:34 2017 -0200

kvm: Allow invtsc migration if tsc-khz is set explicitly

We can safely allow a VM to be migrated with invtsc enabled if
tsc-khz is set explicitly, because:
* QEMU already refuses to start if it can't set the TSC frequency
  to the configured value.
* Management software is already required to keep device
  configuration (including CPU configuration) the same on
  migration source and destination.

Signed-off-by: Eduardo Habkost 
Message-Id: <20170108173234.25721-3-ehabk...@redhat.com>
Signed-off-by: Eduardo Habkost 

And support for libvirt was added:

https://listman.redhat.com/archives/libvir-list/2017-January/141757.html

> 
> When the host supports invtsc, it can be exposed to guest.
> When the src VM has invtsc exposed, what will forbid it to be migrated to a
> dest that also supports VMs with invtsc exposed?
> 
> 




Re: [PATCH] Add support for RAPL MSRs in KVM/Qemu

2023-06-28 Thread Marcelo Tosatti
On Fri, Jun 16, 2023 at 04:08:30PM +0200, Anthony Harivel wrote:
> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> interface (Running Average Power Limit) for advertising the accumulated
> energy consumption of various power domains (e.g. CPU packages, DRAM,
> etc.).
> 
> The consumption is reported via MSRs (model specific registers) like
> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> 64 bits registers that represent the accumulated energy consumption in
> micro Joules. They are updated by microcode every ~1ms.
> 
> For now, KVM always returns 0 when the guest requests the value of
> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
> these MSRs dynamically in userspace.
> 
> To limit the amount of system calls for every MSR call, create a new
> thread in QEMU that updates the "virtual" MSR values asynchronously.
> 
> Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> thread updates the vMSR values with the ratio of energy consumed of
> the whole physical CPU package the vCPU thread runs on and the
> thread's utime and stime values.
> 
> All other non-vCPU threads are also taken into account. Their energy
> consumption is evenly distributed among all vCPUs threads running on
> the same physical CPU package.
> 
> This feature is activated with -accel kvm,rapl=true.

I suppose this should be a CPU flag instead? -cpu xxx,rapl=on.




Re: [PATCH v2 0/2] send tlb_remove_table_smp_sync IPI only to necessary CPUs

2023-06-22 Thread Marcelo Tosatti
On Wed, Jun 21, 2023 at 09:43:37AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 20, 2023 at 05:46:16PM +0300, Yair Podemsky wrote:
> > Currently the tlb_remove_table_smp_sync IPI is sent to all CPUs
> > indiscriminately, this causes unnecessary work and delays notable in
> > real-time use-cases and isolated cpus.
> > By limiting the IPI to only be sent to cpus referencing the affected
> > mm, and adding a config to differentiate architectures that support
> > mm_cpumask from those that don't, safe usage of this feature is allowed.
> > 
> > changes from -v1:
> > - Previous version included a patch to only send the IPI to CPU's with
> > context_tracking in the kernel space, this was removed due to race 
> > condition concerns.
> > - for archs that do not maintain mm_cpumask the mask used should be
> >  cpu_online_mask (Peter Zijlstra).
> >  
> 
> Would it not be much better to fix the root cause? As per the last time,
> there's patches that cure the thp abuse of this.

The other case where the IPI can happen is:

CPU-0                           CPU-1

tlb_remove_table                local_irq_disable
tlb_remove_table_sync_one       gup_fast
        IPI                     local_irq_enable


So it's not only the THP case.



[PATCH v2] kvm: reuse per-vcpu stats fd to avoid vcpu interruption

2023-06-17 Thread Marcelo Tosatti


A regression has been detected in latency testing of KVM guests.  
More specifically, it was observed that the cyclictest  
numbers inside an isolated vcpu (running on an isolated pcpu) are:

# Max Latencies: 00090 00096 00141  
  
Where a maximum of 50us is acceptable.   
   
The implementation of KVM_GET_STATS_FD uses run_on_cpu to query  
per vcpu statistics, which interrupts the vcpu (and is unnecessary).  

To fix this, open the per vcpu stats fd on vcpu initialization,  
and read from that fd from QEMU's main thread.  
 
Signed-off-by: Marcelo Tosatti   

---

v2: use convention for Error parameter order (Markus Armbruster)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 7679f397ae..9aa898db14 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -450,6 +450,8 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
  "kvm_init_vcpu: kvm_arch_init_vcpu failed (%lu)",
  kvm_arch_vcpu_id(cpu));
 }
+cpu->kvm_vcpu_stats_fd = kvm_vcpu_ioctl(cpu, KVM_GET_STATS_FD, NULL);
+
 err:
 return ret;
 }
@@ -4007,7 +4009,7 @@ static StatsDescriptors 
*find_stats_descriptors(StatsTarget target, int stats_fd
 
 /* Read stats header */
 kvm_stats_header = >kvm_stats_header;
-ret = read(stats_fd, kvm_stats_header, sizeof(*kvm_stats_header));
+ret = pread(stats_fd, kvm_stats_header, sizeof(*kvm_stats_header), 0);
 if (ret != sizeof(*kvm_stats_header)) {
 error_setg(errp, "KVM stats: failed to read stats header: "
"expected %zu actual %zu",
@@ -4038,7 +4040,8 @@ static StatsDescriptors 
*find_stats_descriptors(StatsTarget target, int stats_fd
 }
 
 static void query_stats(StatsResultList **result, StatsTarget target,
-strList *names, int stats_fd, Error **errp)
+strList *names, int stats_fd, CPUState *cpu,
+Error **errp)
 {
 struct kvm_stats_desc *kvm_stats_desc;
 struct kvm_stats_header *kvm_stats_header;
@@ -4096,7 +4099,7 @@ static void query_stats(StatsResultList **result, 
StatsTarget target,
 break;
 case STATS_TARGET_VCPU:
 add_stats_entry(result, STATS_PROVIDER_KVM,
-current_cpu->parent_obj.canonical_path,
+cpu->parent_obj.canonical_path,
 stats_list);
 break;
 default:
@@ -4133,10 +4136,9 @@ static void query_stats_schema(StatsSchemaList **result, 
StatsTarget target,
 add_stats_schema(result, STATS_PROVIDER_KVM, target, stats_list);
 }
 
-static void query_stats_vcpu(CPUState *cpu, run_on_cpu_data data)
+static void query_stats_vcpu(CPUState *cpu, StatsArgs *kvm_stats_args)
 {
-StatsArgs *kvm_stats_args = (StatsArgs *) data.host_ptr;
-int stats_fd = kvm_vcpu_ioctl(cpu, KVM_GET_STATS_FD, NULL);
+int stats_fd = cpu->kvm_vcpu_stats_fd;
 Error *local_err = NULL;
 
 if (stats_fd == -1) {
@@ -4145,14 +4147,13 @@ static void query_stats_vcpu(CPUState *cpu, 
run_on_cpu_data data)
 return;
 }
 query_stats(kvm_stats_args->result.stats, STATS_TARGET_VCPU,
-kvm_stats_args->names, stats_fd, kvm_stats_args->errp);
-close(stats_fd);
+kvm_stats_args->names, stats_fd, cpu,
+kvm_stats_args->errp);
 }
 
-static void query_stats_schema_vcpu(CPUState *cpu, run_on_cpu_data data)
+static void query_stats_schema_vcpu(CPUState *cpu, StatsArgs *kvm_stats_args)
 {
-StatsArgs *kvm_stats_args = (StatsArgs *) data.host_ptr;
-int stats_fd = kvm_vcpu_ioctl(cpu, KVM_GET_STATS_FD, NULL);
+int stats_fd = cpu->kvm_vcpu_stats_fd;
 Error *local_err = NULL;
 
 if (stats_fd == -1) {
@@ -4162,7 +4163,6 @@ static void query_stats_schema_vcpu(CPUState *cpu, 
run_on_cpu_data data)
 }
 query_stats_schema(kvm_stats_args->result.schema, STATS_TARGET_VCPU, 
stats_fd,
kvm_stats_args->errp);
-close(stats_fd);
 }
 
 static void query_stats_cb(StatsResultList **result, StatsTarget target,
@@ -4180,7 +4180,7 @@ static void query_stats_cb(StatsResultList **result, 
StatsTarget target,
 error_setg_errno(errp, errno, "KVM stats: ioctl failed");
 return;
 }
-query_stats(result, target, names, stats_fd, errp);
+query_stats(result, target, names, stats_fd, NULL, errp);
 close(stats_fd);
 break;
 }
@@ -4194,7 +4194,7 @@ static void query_stats_cb(StatsResultList **result, 
StatsTarget target,
 if (!apply_str_list_filter(cpu->parent_obj.canonical_path, 
targets)) {
 continue;
 }
-run_on_cpu(cpu, query_stats_vcpu, 
RUN_ON_CPU_HOST_PTR(_args));
+query_stats_vcpu(cpu, _args);
 }
 break;
 }
@@ -422

[PATCH] kvm: reuse per-vcpu stats fd to avoid vcpu interruption

2023-06-13 Thread Marcelo Tosatti


A regression has been detected in latency testing of KVM guests.
More specifically, it was observed that the cyclictest
numbers inside an isolated vcpu (running on an isolated pcpu) are:

# Max Latencies: 00090 00096 00141

Where a maximum of 50us is acceptable.

The implementation of KVM_GET_STATS_FD uses run_on_cpu to query
per vcpu statistics, which interrupts the vcpu (and is unnecessary).

To fix this, open the per vcpu stats fd on vcpu initialization,
and read from that fd from QEMU's main thread.

Signed-off-by: Marcelo Tosatti 

---

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 7679f397ae..5da2901eca 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -450,6 +450,8 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
  "kvm_init_vcpu: kvm_arch_init_vcpu failed (%lu)",
  kvm_arch_vcpu_id(cpu));
 }
+cpu->kvm_vcpu_stats_fd = kvm_vcpu_ioctl(cpu, KVM_GET_STATS_FD, NULL);
+
 err:
 return ret;
 }
@@ -4007,7 +4009,7 @@ static StatsDescriptors 
*find_stats_descriptors(StatsTarget target, int stats_fd
 
 /* Read stats header */
 kvm_stats_header = >kvm_stats_header;
-ret = read(stats_fd, kvm_stats_header, sizeof(*kvm_stats_header));
+ret = pread(stats_fd, kvm_stats_header, sizeof(*kvm_stats_header), 0);
 if (ret != sizeof(*kvm_stats_header)) {
 error_setg(errp, "KVM stats: failed to read stats header: "
"expected %zu actual %zu",
@@ -4038,7 +4040,8 @@ static StatsDescriptors 
*find_stats_descriptors(StatsTarget target, int stats_fd
 }
 
 static void query_stats(StatsResultList **result, StatsTarget target,
-strList *names, int stats_fd, Error **errp)
+strList *names, int stats_fd, Error **errp,
+CPUState *cpu)
 {
 struct kvm_stats_desc *kvm_stats_desc;
 struct kvm_stats_header *kvm_stats_header;
@@ -4096,7 +4099,7 @@ static void query_stats(StatsResultList **result, 
StatsTarget target,
 break;
 case STATS_TARGET_VCPU:
 add_stats_entry(result, STATS_PROVIDER_KVM,
-current_cpu->parent_obj.canonical_path,
+cpu->parent_obj.canonical_path,
 stats_list);
 break;
 default:
@@ -4133,10 +4136,9 @@ static void query_stats_schema(StatsSchemaList **result, 
StatsTarget target,
 add_stats_schema(result, STATS_PROVIDER_KVM, target, stats_list);
 }
 
-static void query_stats_vcpu(CPUState *cpu, run_on_cpu_data data)
+static void query_stats_vcpu(CPUState *cpu, StatsArgs *kvm_stats_args)
 {
-StatsArgs *kvm_stats_args = (StatsArgs *) data.host_ptr;
-int stats_fd = kvm_vcpu_ioctl(cpu, KVM_GET_STATS_FD, NULL);
+int stats_fd = cpu->kvm_vcpu_stats_fd;
 Error *local_err = NULL;
 
 if (stats_fd == -1) {
@@ -4145,14 +4147,13 @@ static void query_stats_vcpu(CPUState *cpu, 
run_on_cpu_data data)
 return;
 }
 query_stats(kvm_stats_args->result.stats, STATS_TARGET_VCPU,
-kvm_stats_args->names, stats_fd, kvm_stats_args->errp);
-close(stats_fd);
+kvm_stats_args->names, stats_fd, kvm_stats_args->errp,
+cpu);
 }
 
-static void query_stats_schema_vcpu(CPUState *cpu, run_on_cpu_data data)
+static void query_stats_schema_vcpu(CPUState *cpu, StatsArgs *kvm_stats_args)
 {
-StatsArgs *kvm_stats_args = (StatsArgs *) data.host_ptr;
-int stats_fd = kvm_vcpu_ioctl(cpu, KVM_GET_STATS_FD, NULL);
+int stats_fd = cpu->kvm_vcpu_stats_fd;
 Error *local_err = NULL;
 
 if (stats_fd == -1) {
@@ -4162,7 +4163,6 @@ static void query_stats_schema_vcpu(CPUState *cpu, 
run_on_cpu_data data)
 }
 query_stats_schema(kvm_stats_args->result.schema, STATS_TARGET_VCPU, 
stats_fd,
kvm_stats_args->errp);
-close(stats_fd);
 }
 
 static void query_stats_cb(StatsResultList **result, StatsTarget target,
@@ -4180,7 +4180,7 @@ static void query_stats_cb(StatsResultList **result, 
StatsTarget target,
 error_setg_errno(errp, errno, "KVM stats: ioctl failed");
 return;
 }
-query_stats(result, target, names, stats_fd, errp);
+query_stats(result, target, names, stats_fd, errp, NULL);
 close(stats_fd);
 break;
 }
@@ -4194,7 +4194,7 @@ static void query_stats_cb(StatsResultList **result, 
StatsTarget target,
 if (!apply_str_list_filter(cpu->parent_obj.canonical_path, 
targets)) {
 continue;
 }
-run_on_cpu(cpu, query_stats_vcpu, 
RUN_ON_CPU_HOST_PTR(_args));
+query_stats_vcpu(cpu, _args);
 }
 break;
 }
@@ -4220,6 +4220,6 @@ void query_stats_schemas_cb(StatsSchemaList **result, 
Error **errp)
 if (first_cpu) {
 stats_args.result.schema = result;
 stats_args.errp = errp;

Re: [RFC PATCH] Add support for RAPL MSRs in KVM/Qemu

2023-05-26 Thread Marcelo Tosatti
On Wed, May 24, 2023 at 04:53:49PM +0200, Anthony Harivel wrote:
> 
> Marcelo Tosatti, May 19, 2023 at 20:28:
> 
> Hi Marcelo,
> 
> > > > > +/* Assuming those values are the same accross physical 
> > > > > system/packages */
> > > > > +maxcpus = get_maxcpus(0); /* Number of CPUS per packages */
> > > > > +maxpkgs = numa_max_node(); /* Number of Packages on the system */
> >
> > numa_max_node() returns the highest node number available on the current 
> > system. 
> > (See the node numbers in /sys/devices/system/node/ ). Also see 
> > numa_num_configured_nodes().
> >
> > One can find package topology information from
> > /sys/devices/system/cpu/cpuX/topology/
> >
> 
> Good point. 
> I will find a better solution to identify the topology using your hint. 
> 
>  > > > +/* Allocate memory for each thread stats */
> > > > > +thd_stat = (thread_stat *) calloc(num_threads, 
> > > > > sizeof(thread_stat));
> >
> > Can you keep this pre-allocated ? And all other data as well.
> 
> Ok no problem.
> 
>  > > > +/* Retrieve all packages power plane energy counter */
> > > > > +for (int i = 0; i <= maxpkgs; i++) {
> > > > > +for (int j = 0; j < num_threads; j++) {
> > > > > +/*
> > > > > + * Use the first thread we found that ran on the CPU
> > > > > + * of the package to read the packages energy counter
> > > > > + */
> > > > > +if (thd_stat[j].numa_node_id == i) {
> > > > > +pkg_stat[i].e_start = 
> > > > > read_msr(MSR_PKG_ENERGY_STATUS, i);
> > > > > +break;
> > > > > +}
> > > > > +}
> > > > > +}
> >
> > NUMA node does not map necessarily to one package.
> 
> True. I will update this part at the same time with the topology info
> discussed above. 
> 
> >
> > > > > +/* Sleep a short period while the other threads are working 
> > > > > */
> > > > > +usleep(MSR_ENERGY_THREAD_SLEEP_US);
> > > > > +
> > > > > +/*
> > > > > + * Retrieve all packages power plane energy counter
> > > > > + * Calculate the delta of all packages
> > > > > + */
> > > > > +for (int i = 0; i <= maxpkgs; i++) {
> > > > > +for (int j = 0; j < num_threads; j++) {
> > > > > +/*
> > > > > + * Use the first thread we found that ran on the CPU
> > > > > + * of the package to read the packages energy counter
> > > > > + */
> > > > > +if (thd_stat[j].numa_node_id == i) {
> > > > > +pkg_stat[i].e_end =
> > > > > +read_msr(MSR_PKG_ENERGY_STATUS, 
> > > > > thd_stat[j].cpu_id);
> >
> > This is excessive (to read the MSRs of each package in the system).
> >
> > Consider 100 Linux guests all of them with this enabled, on a system with
> > 4 packages. How many times you'll be reading MSR of each package?
> 
> The problem here is that you can have vCPUs that are running on different
> packages. However the energy counter of the different packages are
> increasing independently. 
> Either we "force" somehow users to run only on the same package, either I'm
> afraid we are obliged to read all the packages energy counter (when they
> are involved in the VM).
> 
> Imagine this:
> 
> |pkg-0|pkg-1|
> |0|1|2|3|4|5|6|0|1|2|3|4|5|6|
> |   |   |   |
> | vm-0  |  vm-1 |  vm-2 |
> 
> Only vm-1 that has cores from pkg-0 and pkg-1 would have to read both
> pkg energy. vm-0 would only read pkg-0 and vm-2 only pkg-1.
> 
> 
> >
> > Moreover, don't want to readmsr on an isolated CPU.
> >
> 
> Could you explain me why ?

Nevermind, its a separate topic.

> > > No problem, let me try to explain: 
> > > a QEMU process is composed of vCPU thread(s) and non-vCPU thread(s) (IO,
> > > emulated device,...). Each of those threads can run on different cores
> > > that can belongs to the same Package or not.
> > > The 

Re: [RFC PATCH] Add support for RAPL MSRs in KVM/Qemu

2023-05-19 Thread Marcelo Tosatti
Hi Anthony,

On Thu, May 18, 2023 at 04:26:51PM +0200, Anthony Harivel wrote:
> Marcelo Tosatti, May 17, 2023 at 17:43:
> 
> Hi Marcelo,
> 
> > On Wed, May 17, 2023 at 03:07:30PM +0200, Anthony Harivel wrote:
>  > diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> > > index 8504aaac6807..14f9c2901680 100644
> > > --- a/target/i386/cpu.h
> > > +++ b/target/i386/cpu.h
> > > @@ -396,6 +396,10 @@ typedef enum X86Seg {
> > >  #define MSR_IA32_TSX_CTRL0x122
> > >  #define MSR_IA32_TSCDEADLINE0x6e0
> > >  #define MSR_IA32_PKRS   0x6e1
> > > +#define MSR_RAPL_POWER_UNIT 0x0606
> > > +#define MSR_PKG_POWER_LIMIT 0x0610
> > > +#define MSR_PKG_ENERGY_STATUS   0x0611
> > > +#define MSR_PKG_POWER_INFO  0x0614
> >
> > Why only PKG and not all domains?
> >
> 
> Package domains are the only ones you can find accross different CPU
> segments (client and server platforms).
> Processor cores domains are only available on client platform while
> DRAM domains only on server platform.
> 
> I figured out that Package domains would be a good start to validate the
> implementation and the rest could technically be added later on. 

Understood.

> > > +/* Assuming those values are the same accross physical 
> > > system/packages */
> > > +maxcpus = get_maxcpus(0); /* Number of CPUS per packages */
> > > +maxpkgs = numa_max_node(); /* Number of Packages on the system */

numa_max_node() returns the highest node number available on the current 
system. 
(See the node numbers in /sys/devices/system/node/ ). Also see 
numa_num_configured_nodes().

One can find package topology information from
/sys/devices/system/cpu/cpuX/topology/
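
For example (an illustrative helper, not part of the patch), the package id
can be read directly from sysfs instead of going through libnuma:

#include <stdio.h>

static int cpu_package_id(int cpu)
{
    char path[128];
    int pkg = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
             cpu);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%d", &pkg) != 1) {
            pkg = -1;
        }
        fclose(f);
    }
    return pkg; /* -1 if the CPU is offline or the file is unavailable */
}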

> > > +/* Those MSR values should not change as well */
> > > +vmsr->msr_unit = read_msr(MSR_RAPL_POWER_UNIT, 0);
> > > +vmsr->msr_limit = read_msr(MSR_PKG_POWER_LIMIT, 0);
> >
> > Power limit - MSR interfaces to specify power limit, time window; lock bit, 
> > clamp bit etc
> >
> > This one can change, right? And why expose the power limit to the guest?
> >
> 
> Right.
> Because it belongs to the non-optional RAPL interfaces MSRs, I added it
> with the thought that it was mandatory for the RAPL driver to mount
> inside the guest.
> Either it is not and can be removed, or we can set the "lock bit" to
> inform the guest that power limit settings are static and un-modifiable.
> I will correct that. 

OK.

> > > +vmsr->msr_info = read_msr(MSR_PKG_POWER_INFO, 0);
> > > +
> > > +/* Allocate memory for each package energy status */
> > > +pkg_stat = (package_energy_stat *) calloc(maxpkgs + 1,
> > > +  
> > > sizeof(package_energy_stat));
> > > +
> > > +/*
> > > + * Max numbers of ticks per package
> > > + * time in second * number of ticks/second * Number of cores / 
> > > package
> > > + * ex: for 100 ticks/second/CPU, 12 CPUs per Package gives 1200 
> > > ticks max
> > > + */
> > > +maxticks = (MSR_ENERGY_THREAD_SLEEP_US / 100)
> > > +* sysconf(_SC_CLK_TCK) * maxcpus;
> > > +
> > > +while (true) {
> > > +
> > > +/* Get all qemu threads id */
> > > +pid_t *thread_ids = get_thread_ids(pid, _threads);
> > > +
> > > +if (thread_ids == NULL) {
> > > +return NULL;
> > > +}
> > > +
> > > +/* Allocate memory for each thread stats */
> > > +thd_stat = (thread_stat *) calloc(num_threads, 
> > > sizeof(thread_stat));

Can you keep this pre-allocated ? And all other data as well.

> > > +/* Populate all the thread stats */
> > > +for (int i = 0; i < num_threads; i++) {
> > > +thd_stat[i].thread_id = thread_ids[i];
> > > +thd_stat[i].utime = calloc(2, sizeof(unsigned long long));
> > > +thd_stat[i].stime = calloc(2, sizeof(unsigned long long));
> > > +read_thread_stat(_stat[i], pid, 0);
> > > +thd_stat[i].numa_node_id = 
> > > numa_node_of_cpu(thd_stat[i].cpu_id);
> > > +}
> > > +
> > > +/* Retrieve all packages power plane energy counter */
> > > +for (int i = 0; i <= maxpkgs; i++) {
> > > +for (int j = 0; j < num_threads; j++) {
> >

Re: [RFC PATCH] Add support for RAPL MSRs in KVM/Qemu

2023-05-17 Thread Marcelo Tosatti
On Wed, May 17, 2023 at 03:07:30PM +0200, Anthony Harivel wrote:
> Starting with the "Sandy Bridge" generation, Intel CPUs provide a RAPL
> interface (Running Average Power Limit) for advertising the accumulated
> energy consumption of various power domains (e.g. CPU packages, DRAM,
> etc.).
> 
> The consumption is reported via MSRs (model specific registers) like
> MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are
> 64 bits registers that represent the accumulated energy consumption in
> micro Joules. They are updated by microcode every ~1ms.
> 
> For now, KVM always returns 0 when the guest requests the value of
> these MSRs. Use the KVM MSR filtering mechanism to allow QEMU handle
> these MSRs dynamically in userspace.
> 
> To limit the amount of system calls for every MSR call, create a new
> thread in QEMU that updates the "virtual" MSR values asynchronously.
> 
> Each vCPU has its own vMSR to reflect the independence of vCPUs. The
> thread updates the vMSR values with the ratio of energy consumed of
> the whole physical CPU package the vCPU thread runs on and the
> thread's utime and stime values.
> 
> All other non-vCPU threads are also taken into account. Their energy
> consumption is evenly distributed among all vCPUs threads running on
> the same physical CPU package.
> 
> This feature is activated with -accel kvm,rapl=true.
> 
> Actual limitation:
> - Works only on Intel host CPU because AMD CPUs are using different MSR
>   adresses.
> 
> - Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at
>   the moment.
> 
> - Since each vCPU has an independent vMSR value, the vCPU topology must
>   be changed to match that reality. There must be a single vCPU per
>   virtual socket (e.g.: -smp 4,sockets=4). Accessing pkg-0 energy will
>   give vCPU 0 energy, pkg-1 will give vCPU 1 energy, etc.
> 
> Signed-off-by: Anthony Harivel 
> ---
> 
> Notes:
> Earlier this year, I've proposed a patch in linux KVM [1] in order to
> bring energy awareness in VM.
> 
> Thanks to the feedback, I've worked on another solution that requires
> only a QEMU patch that make us of MSR filtering mecanism.
> 
> This patch is proposed as an RFC at the moment in order to validate the
> paradigm and see if the actual limitation could be adressed in a second
> phase.
> 
> Regards,
> Anthony
> 
> [1]: 
> https://lore.kernel.org/kvm/20230118142123.461247-1-ahari...@redhat.com/

Hi Anthony,
> 
>  accel/kvm/kvm-all.c   |  13 ++
>  include/sysemu/kvm_int.h  |  11 ++
>  target/i386/cpu.h |   8 +
>  target/i386/kvm/kvm.c | 273 ++
>  target/i386/kvm/meson.build   |   1 +
>  target/i386/kvm/vmsr_energy.c | 132 
>  target/i386/kvm/vmsr_energy.h |  80 ++
>  7 files changed, 518 insertions(+)
>  create mode 100644 target/i386/kvm/vmsr_energy.c
>  create mode 100644 target/i386/kvm/vmsr_energy.h
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index cf3a88d90e92..13bb2a523c5d 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -3699,6 +3699,12 @@ static void kvm_set_dirty_ring_size(Object *obj, 
> Visitor *v,
>  s->kvm_dirty_ring_size = value;
>  }
>  
> +static void kvm_set_kvm_rapl(Object *obj, bool value, Error **errp)
> +{
> +KVMState *s = KVM_STATE(obj);
> +s->msr_energy.enable = value;
> +}
> +
>  static void kvm_accel_instance_init(Object *obj)
>  {
>  KVMState *s = KVM_STATE(obj);
> @@ -3715,6 +3721,7 @@ static void kvm_accel_instance_init(Object *obj)
>  s->xen_version = 0;
>  s->xen_gnttab_max_frames = 64;
>  s->xen_evtchn_max_pirq = 256;
> +s->msr_energy.enable = false;
>  }
>  
>  /**
> @@ -3755,6 +3762,12 @@ static void kvm_accel_class_init(ObjectClass *oc, void 
> *data)
>  object_class_property_set_description(oc, "dirty-ring-size",
>  "Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)");
>  
> +object_class_property_add_bool(oc, "rapl",
> +   NULL,
> +   kvm_set_kvm_rapl);
> +object_class_property_set_description(oc, "rapl",
> +"Allow energy related MSRs for RAPL interface in Guest");
> +
>  kvm_arch_accel_class_init(oc);
>  }
>  
> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
> index a641c974ea54..cf3a01f498d7 100644
> --- a/include/sysemu/kvm_int.h
> +++ b/include/sysemu/kvm_int.h
> @@ -47,6 +47,16 @@ typedef struct KVMMemoryListener {
>  
>  #define KVM_MSI_HASHTAB_SIZE256
>  
> +struct KVMMsrEnergy {
> +bool enable;
> +QemuThread msr_thr;
> +int cpus;
> +uint64_t *msr_value;
> +uint64_t msr_unit;
> +uint64_t msr_limit;
> +uint64_t msr_info;
> +};
> +
>  enum KVMDirtyRingReaperState {
>  KVM_DIRTY_RING_REAPER_NONE = 0,
>  /* The reaper is sleeping */
> @@ -116,6 +126,7 @@ struct KVMState
>  uint64_t 

Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-19 Thread Marcelo Tosatti
On Wed, Apr 19, 2023 at 01:30:57PM +0200, David Hildenbrand wrote:
> On 06.04.23 20:27, Peter Zijlstra wrote:
> > On Thu, Apr 06, 2023 at 05:51:52PM +0200, David Hildenbrand wrote:
> > > On 06.04.23 17:02, Peter Zijlstra wrote:
> > 
> > > > DavidH, what do you thikn about reviving Jann's patches here:
> > > > 
> > > > https://bugs.chromium.org/p/project-zero/issues/detail?id=2365#c1
> > > > 
> > > > Those are far more invasive, but afaict they seem to do the right thing.
> > > > 
> > > 
> > > I recall seeing those while discussed on secur...@kernel.org. What we
> > > currently have was (IMHO for good reasons) deemed better to fix the issue,
> > > especially when caring about backports and getting it right.
> > 
> > Yes, and I think that was the right call. However, we can now revisit
> > without having the pressure of a known defect and backport
> > considerations.
> > 
> > > The alternative that was discussed in that context IIRC was to simply
> > > allocate a fresh page table, place the fresh page table into the list
> > > instead, and simply free the old page table (then using common machinery).
> > > 
> > > TBH, I'd wish (and recently raised) that we could just stop wasting memory
> > > on page tables for THPs that are maybe never going to get PTE-mapped ... 
> > > and
> > > eventually just allocate on demand (with some caching?) and handle the
> > > places where we're OOM and cannot PTE-map a THP in some descend way.
> > > 
> > > ... instead of trying to figure out how to deal with these page tables we
> > > cannot free but have to special-case simply because of GUP-fast.
> > 
> > Not keeping them around sounds good to me, but I'm not *that* familiar
> > with the THP code, most of that happened after I stopped tracking mm. So
> > I'm not sure how feasible is it.
> > 
> > But it does look entirely feasible to rework this page-table freeing
> > along the lines Jann did.
> 
> It's most probably more feasible, although the easiest would be to just
> allocate a fresh page table to deposit and free the old one using the mmu
> gatherer.
> 
> This way we can avoid the khugepaged of tlb_remove_table_smp_sync(), but not
> the tlb_remove_table_one() usage. I suspect khugepaged isn't really relevant
> in RT kernels (IIRC, most of RT setups disable THP completely).

People will disable khugepaged because it causes IPIs (and the fact that one
has to disable khugepaged is a configuration overhead, and a source of
headaches when configuring the realtime system, since one can forget to
do that, etc).

But people do want to run non-RT applications along with RT applications
(for example when you have a single box in a privileged location).

> 
> tlb_remove_table_one() only triggers if __get_free_page(GFP_NOWAIT |
> __GFP_NOWARN); fails. IIUC, that can happen easily under memory pressure
> because it doesn't wait for direct reclaim.
> 
> I don't know much about RT workloads (so I'd appreciate some feedback), but
> I guess we can run int memory pressure as well due to some !rt housekeeping
> task on the system?

Yes, exactly (memory for -RT app will be mlocked).



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-19 Thread Marcelo Tosatti
On Thu, Apr 06, 2023 at 03:32:06PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 09:49:22AM -0300, Marcelo Tosatti wrote:
> 
> > > > 2) Depends on the application and the definition of "occasional".
> > > > 
> > > > For certain types of applications (for example PLC software or
> > > > RAN processing), upon occurrence of an event, it is necessary to
> > > > complete a certain task in a maximum amount of time (deadline).
> > > 
> > > If the application is properly NOHZ_FULL and never does a kernel entry,
> > > it will never get that IPI. If it is a pile of shit and does kernel
> > > entries while it pretends to be NOHZ_FULL it gets to keep the pieces and
> > > no amount of crying will get me to care.
> > 
> > I suppose its common practice to use certain system calls in latency
> > sensitive applications, for example nanosleep. Some examples:
> > 
> > 1) cyclictest   (nanosleep)
> 
> cyclictest is not a NOHZ_FULL application, if you tihnk it is, you're
> deluded.

In the field (what end-users do in production):

cyclictest runs on NOHZ_FULL cores.
PLC type programs run on NOHZ_FULL cores.

So, according to the physical reality I observe, I am not deluded.

> > 2) PLC programs (nanosleep)
> 
> What's a PLC? Programmable Logic Circuit?

Programmable logic controller.

> > A system call does not necessarily have to take locks, does it ?
> 
> This all is unrelated to locks

OK.

> > Or even if application does system calls, but runs under a VM,
> > then you are requiring it to never VM-exit.
> 
> That seems to be a goal for performance anyway.

Not sure what you mean.

> > This reduces the flexibility of developing such applications.
> 
> Yeah, that's the cards you're dealt, deal with it.

This is not what happens in the field.



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-06 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 09:54:57PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 04:43:14PM -0300, Marcelo Tosatti wrote:
> 
> > Two points:
> > 
> > 1) For a virtualized system, the overhead is not only of executing the
> > IPI but:
> > 
> > VM-exit
> > run VM-exit code in host
> > handle IPI
> > run VM-entry code in host
> > VM-entry
> 
> I thought we could do IPIs without VMexit these days? 

Yes, IPIs to a vCPU (in guest context) can be delivered without a VM-exit.
In this case, however, we are considering an IPI to the host pCPU (which
requires a VM-exit from guest context).

> Also virt... /me walks away.
> 
> > 2) Depends on the application and the definition of "occasional".
> > 
> > For certain types of applications (for example PLC software or
> > RAN processing), upon occurrence of an event, it is necessary to
> > complete a certain task in a maximum amount of time (deadline).
> 
> If the application is properly NOHZ_FULL and never does a kernel entry,
> it will never get that IPI. If it is a pile of shit and does kernel
> entries while it pretends to be NOHZ_FULL it gets to keep the pieces and
> no amount of crying will get me to care.

I suppose it's common practice to use certain system calls in latency
sensitive applications, for example nanosleep. Some examples:

1) cyclictest   (nanosleep)
2) PLC programs (nanosleep)

A system call does not necessarily have to take locks, does it?

Or even if application does system calls, but runs under a VM,
then you are requiring it to never VM-exit.

This reduces the flexibility of developing such applications.





Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-06 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 09:52:26PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 04:45:32PM -0300, Marcelo Tosatti wrote:
> > On Wed, Apr 05, 2023 at 01:10:07PM +0200, Frederic Weisbecker wrote:
> > > On Wed, Apr 05, 2023 at 12:44:04PM +0200, Frederic Weisbecker wrote:
> > > > On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > > > > + int state = atomic_read(>state);
> > > > > + /* will return true only for cpus in kernel space */
> > > > > + return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > > > > +}
> > > > 
> > > > Also note that this doesn't stricly prevent userspace from being 
> > > > interrupted.
> > > > You may well observe the CPU in kernel but it may receive the IPI later 
> > > > after
> > > > switching to userspace.
> > > > 
> > > > We could arrange for avoiding that with marking ct->state with a 
> > > > pending work bit
> > > > to flush upon user entry/exit but that's a bit more overhead so I first 
> > > > need to
> > > > know about your expectations here, ie: can you tolerate such an 
> > > > occasional
> > > > interruption or not?
> > > 
> > > Bah, actually what can we do to prevent from that racy IPI? Not much I 
> > > fear...
> > 
> > Use a mechanism other than an IPI to ensure an in-progress
> > __get_free_pages_fast() has finished execution.
> > 
> > Isn't this codepath enough of a slow path that it can use
> > synchronize_rcu_expedited?
> 
> To actually hit this path you're doing something really dodgy.

Apparently khugepaged is using the same infrastructure:

$ grep tlb_remove_table khugepaged.c 
tlb_remove_table_sync_one();
tlb_remove_table_sync_one();

So just enabling khugepaged will hit that path.



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 01:10:07PM +0200, Frederic Weisbecker wrote:
> On Wed, Apr 05, 2023 at 12:44:04PM +0200, Frederic Weisbecker wrote:
> > On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > > + int state = atomic_read(>state);
> > > + /* will return true only for cpus in kernel space */
> > > + return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > > +}
> > 
> > Also note that this doesn't stricly prevent userspace from being 
> > interrupted.
> > You may well observe the CPU in kernel but it may receive the IPI later 
> > after
> > switching to userspace.
> > 
> > We could arrange for avoiding that with marking ct->state with a pending 
> > work bit
> > to flush upon user entry/exit but that's a bit more overhead so I first 
> > need to
> > know about your expectations here, ie: can you tolerate such an occasional
> > interruption or not?
> 
> Bah, actually what can we do to prevent from that racy IPI? Not much I fear...

Use a mechanism other than an IPI to ensure an in-progress
__get_free_pages_fast() has finished execution.

Isn't this codepath enough of a slow path that it can use
synchronize_rcu_expedited?
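
To make that concrete, a sketch of the idea (illustration only, not the
mm/mmu_gather.c code; it assumes a kernel where an IRQs-disabled section,
such as the one lockless GUP runs under, blocks an RCU grace period):

/*
 * Wait for any in-flight lockless GUP walker instead of broadcasting the
 * tlb_remove_table_smp_sync() IPI: the walker runs with interrupts
 * disabled, so it cannot span an expedited grace period.
 */
static void tlb_remove_table_sync_one_via_rcu(void)
{
	synchronize_rcu_expedited();
}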



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 12:43:58PM +0200, Frederic Weisbecker wrote:
> On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > @@ -191,6 +192,20 @@ static void tlb_remove_table_smp_sync(void *arg)
> > /* Simply deliver the interrupt */
> >  }
> >  
> > +
> > +#ifdef CONFIG_CONTEXT_TRACKING
> > +static bool cpu_in_kernel(int cpu, void *info)
> > +{
> > +   struct context_tracking *ct = per_cpu_ptr(_tracking, cpu);
> 
> Like Peter said, an smp_mb() is required here before the read (unless there is
> already one between the page table modification and that ct->state read?).
> 
> So that you have this pairing:
> 
> 
>WRITE page_table  WRITE ct->state
>  smp_mb()  smp_mb() // implied by 
> atomic_fetch_or
>READ ct->stateREAD page_table
> 
> > +   int state = atomic_read(>state);
> > +   /* will return true only for cpus in kernel space */
> > +   return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > +}
> 
> Also note that this doesn't stricly prevent userspace from being interrupted.
> You may well observe the CPU in kernel but it may receive the IPI later after
> switching to userspace.
> 
> We could arrange for avoiding that with marking ct->state with a pending work 
> bit
> to flush upon user entry/exit but that's a bit more overhead so I first need 
> to
> know about your expectations here, ie: can you tolerate such an occasional
> interruption or not?

Two points:

1) For a virtualized system, the overhead is not only of executing the
IPI but:

VM-exit
run VM-exit code in host
handle IPI
run VM-entry code in host
VM-entry

2) Depends on the application and the definition of "occasional".

For certain types of applications (for example PLC software or
RAN processing), upon occurrence of an event, it is necessary to
complete a certain task in a maximum amount of time (deadline).

One way to express this requirement is with a pair of numbers,
deadline time and execution time (see the small example below), where:

* deadline time: length of time between event and deadline.
* execution time: length of time it takes for processing of event
  to occur on a particular hardware platform
  (uninterrupted).
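
As a small, self-contained example (the numbers are hypothetical), the slack
left for interruptions such as IPIs or rcuc wakeups is simply the difference
between the two:

#include <stdio.h>

int main(void)
{
    const unsigned int deadline_us  = 100; /* deadline time (hypothetical) */
    const unsigned int execution_us = 60;  /* execution time (hypothetical) */

    /* Any interference must fit into the remaining slack. */
    printf("tolerable interference: %u us\n", deadline_us - execution_us);
    return 0;
}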





Re: [PATCH linux-next 2/2] x86/xen/time: cleanup xen_tsc_safe_clocksource

2023-02-23 Thread Marcelo Tosatti
On Mon, Feb 20, 2023 at 08:14:40PM -0800, Krister Johansen wrote:
> On Mon, Feb 20, 2023 at 11:01:18PM +0100, Thomas Gleixner wrote:
> > On Mon, Feb 20 2023 at 09:17, Krister Johansen wrote:
> > > @@ -495,8 +496,7 @@ static int __init xen_tsc_safe_clocksource(void)
> > >   /* Leaf 4, sub-leaf 0 (0x4x03) */
> > >   cpuid_count(xen_cpuid_base() + 3, 0, , , , );
> > >  
> > > - /* tsc_mode = no_emulate (2) */
> > > - if (ebx != 2)
> > > + if (ebx != XEN_CPUID_TSC_MODE_NEVER_EMULATE)
> > >   return 0;
> > >  
> > >   return 1;
> > 
> > What about removing more stupidity from that function?
> > 
> > static bool __init xen_tsc_safe_clocksource(void)
> > {
> > u32 eax, ebx. ecx, edx;
> >  
> > /* Leaf 4, sub-leaf 0 (0x4x03) */
> > cpuid_count(xen_cpuid_base() + 3, 0, , , , );
> > 
> > return ebx == XEN_CPUID_TSC_MODE_NEVER_EMULATE;
> > }
> 
> I'm all for simplifying.  I'm happy to clean up that return to be more
> idiomatic.  I was under the impression, perhaps mistaken, though, that
> the X86_FEATURE_CONSTANT_TSC, X86_FEATURE_NONSTOP_TSC, and
> check_tsc_unstable() checks were actually serving a purpose: to ensure
> that we don't rely on the tsc in environments where it's being emulated
> and the OS would be better served by using a PV clock.  Specifically,
> kvmclock_init() makes a very similar set of checks that I also thought
> were load-bearing.

kvmclock_init will lower the rating of kvmclock so that TSC clocksource
can be used instead:

/*
 * X86_FEATURE_NONSTOP_TSC is TSC runs at constant rate
 * with P/T states and does not stop in deep C-states.
 *
 * Invariant TSC exposed by host means kvmclock is not necessary:
 * can use TSC as clocksource.
 *
 */
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
!check_tsc_unstable())
kvm_clock.rating = 299;






[PATCH 2/2] hw/i386/kvm/clock.c: read kvmclock from guest memory if !correct_tsc_shift

2023-01-19 Thread Marcelo Tosatti
Before kernel commit 78db6a5037965429c04d708281f35a6e5562d31b,
kvm_guest_time_update() would use vcpu->virtual_tsc_khz to calculate
tsc_shift value in the vcpus pvclock structure written to guest memory.

For those kernels, if vcpu->virtual_tsc_khz != tsc_khz (which can be the
case when guest state is restored via migration, or if tsc-khz option is
passed to QEMU), and TSC scaling is not enabled (which happens if the
difference between the frequency requested via KVM_SET_TSC_KHZ and the
host TSC KHZ is smaller than 250ppm), then there can be a difference
between what KVM_GET_CLOCK would return and what the guest reads as
kvmclock value.

The effect is that the guest sees a jump in the kvmclock value
(either forwards or backwards) in such a case.

To fix incoming migration from pre-78db6a5037965 hosts,
read the kvmclock value from guest memory, unless the
KVM_CLOCK_CORRECT_TSC_SHIFT bit indicates that the value
retrieved by KVM_GET_CLOCK on the source is safe to use.

Signed-off-by: Marcelo Tosatti 

Index: qemu/hw/i386/kvm/clock.c
===
--- qemu.orig/hw/i386/kvm/clock.c
+++ qemu/hw/i386/kvm/clock.c
@@ -50,6 +50,16 @@ struct KVMClockState {
 /* whether the 'clock' value was obtained in a host with
  * reliable KVM_GET_CLOCK */
 bool clock_is_reliable;
+
+/* whether machine type supports correct_tsc_shift */
+bool mach_use_correct_tsc_shift;
+
+/*
+ * whether the 'clock' value was obtained in a host
+ * that computes correct tsc_shift field (the one
+ * written to guest memory)
+ */
+bool clock_correct_tsc_shift;
 };
 
 struct pvclock_vcpu_time_info {
@@ -150,6 +160,8 @@ static void kvm_update_clock(KVMClockSta
  *   read from memory
  */
 s->clock_is_reliable = kvm_has_adjust_clock_stable();
+
+s->clock_correct_tsc_shift = kvm_has_correct_tsc_shift();
 }
 
 static void do_kvmclock_ctrl(CPUState *cpu, run_on_cpu_data data)
@@ -176,7 +188,7 @@ static void kvmclock_vm_state_change(voi
  * If the host where s->clock was read did not support reliable
  * KVM_GET_CLOCK, read kvmclock value from memory.
  */
-if (!s->clock_is_reliable) {
+if (!s->clock_is_reliable || !s->clock_correct_tsc_shift) {
 uint64_t pvclock_via_mem = kvmclock_current_nsec(s);
 /* We can't rely on the saved clock value, just discard it */
 if (pvclock_via_mem) {
@@ -252,14 +264,40 @@ static const VMStateDescription kvmclock
 };
 
 /*
+ * Sending clock_correct_tsc_shift=true means that the destination
+ * can use VMSTATE_UINT64(clock, KVMClockState) value,
+ * instead of reading from guest memory.
+ */
+static bool kvmclock_clock_correct_tsc_shift_needed(void *opaque)
+{
+KVMClockState *s = opaque;
+
+return s->mach_use_correct_tsc_shift;
+}
+
+static const VMStateDescription kvmclock_correct_tsc_shift = {
+.name = "kvmclock/clock_correct_tsc_shift",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = kvmclock_clock_correct_tsc_shift_needed,
+.fields = (VMStateField[]) {
+VMSTATE_BOOL(clock_correct_tsc_shift, KVMClockState),
+VMSTATE_END_OF_LIST()
+}
+};
+
+/*
  * When migrating, assume the source has an unreliable
- * KVM_GET_CLOCK unless told otherwise.
+ * KVM_GET_CLOCK (and computes tsc shift
+ * in guest memory using vcpu->virtual_tsc_khz),
+ * unless told otherwise.
  */
 static int kvmclock_pre_load(void *opaque)
 {
 KVMClockState *s = opaque;
 
 s->clock_is_reliable = false;
+s->clock_correct_tsc_shift = false;
 
 return 0;
 }
@@ -301,6 +339,7 @@ static const VMStateDescription kvmclock
 },
 .subsections = (const VMStateDescription * []) {
 _reliable_get_clock,
+_correct_tsc_shift,
 NULL
 }
 };
@@ -308,6 +347,8 @@ static const VMStateDescription kvmclock
 static Property kvmclock_properties[] = {
 DEFINE_PROP_BOOL("x-mach-use-reliable-get-clock", KVMClockState,
   mach_use_reliable_get_clock, true),
+DEFINE_PROP_BOOL("x-mach-use-correct-tsc-shift", KVMClockState,
+  mach_use_correct_tsc_shift, true),
 DEFINE_PROP_END_OF_LIST(),
 };
 
Index: qemu/target/i386/kvm/kvm.c
===
--- qemu.orig/target/i386/kvm/kvm.c
+++ qemu/target/i386/kvm/kvm.c
@@ -164,6 +164,13 @@ bool kvm_has_adjust_clock_stable(void)
 return (ret & KVM_CLOCK_TSC_STABLE);
 }
 
+bool kvm_has_correct_tsc_shift(void)
+{
+int ret = kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK);
+
+return ret & KVM_CLOCK_CORRECT_TSC_SHIFT;
+}
+
 bool kvm_has_adjust_clock(void)
 {
 return kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK);
Index: qemu/target/i386/kvm/kvm_i386.h
===
--- qemu.orig/target/i386/

[PATCH 1/2] linux-headers: sync KVM_CLOCK_CORRECT_TSC_SHIFT flag

2023-01-19 Thread Marcelo Tosatti
Sync new KVM_CLOCK_CORRECT_TSC_SHIFT from upstream Linux headers.

Signed-off-by: Marcelo Tosatti 

Index: qemu/linux-headers/linux/kvm.h
===
--- qemu.orig/linux-headers/linux/kvm.h
+++ qemu/linux-headers/linux/kvm.h
@@ -1300,6 +1300,9 @@ struct kvm_irqfd {
 #define KVM_CLOCK_TSC_STABLE   2
 #define KVM_CLOCK_REALTIME (1 << 2)
 #define KVM_CLOCK_HOST_TSC (1 << 3)
+/* whether tsc_shift as seen by the guest matches guest visible TSC */
+/* This is true since commit 78db6a5037965429c04d708281f35a6e5562d31b */
+#define KVM_CLOCK_CORRECT_TSC_SHIFT(1 << 4)
 
 struct kvm_clock_data {
__u64 clock;





[PATCH 0/2] read kvmclock from guest memory if !correct_tsc_shift

2023-01-19 Thread Marcelo Tosatti
Before kernel commit 78db6a5037965429c04d708281f35a6e5562d31b,
kvm_guest_time_update() would use vcpu->virtual_tsc_khz to calculate
tsc_shift value in the vcpus pvclock structure written to guest memory.

For those kernels, if vcpu->virtual_tsc_khz != tsc_khz (which can be the
case when guest state is restored via migration, or if tsc-khz option is
passed to QEMU), and TSC scaling is not enabled (which happens if the
difference between the frequency requested via KVM_SET_TSC_KHZ and the
host TSC KHZ is smaller than 250ppm), then there can be a difference
between what KVM_GET_CLOCK would return and what the guest reads as
kvmclock value.

The effect is that the guest sees a jump in the kvmclock value
(either forwards or backwards) in such a case.

To fix incoming migration from pre-78db6a5037965 hosts,
read the kvmclock value from guest memory, unless the
KVM_CLOCK_CORRECT_TSC_SHIFT bit indicates that the value
retrieved by KVM_GET_CLOCK on the source is safe to use.
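
To see where tsc_shift enters the picture, here is a sketch of the guest-side
pvclock conversion (it mirrors the pvclock ABI math; the real implementations
use a 128-bit multiply, which this illustration omits, so treat it as a sketch
rather than the kernel code):

#include <stdint.h>

/*
 * The guest converts TSC deltas to nanoseconds with the parameters KVM
 * writes into the pvclock page.  If tsc_to_system_mul/tsc_shift were
 * derived from vcpu->virtual_tsc_khz while the TSC actually ticks at the
 * host's tsc_khz (no hardware scaling), this value diverges from what
 * KVM_GET_CLOCK reports.
 */
static uint64_t pvclock_delta_to_ns(uint64_t tsc_delta,
                                    uint32_t tsc_to_system_mul,
                                    int8_t tsc_shift)
{
    if (tsc_shift < 0) {
        tsc_delta >>= -tsc_shift;
    } else {
        tsc_delta <<= tsc_shift;
    }
    /* The kernel uses a 128-bit multiply here to avoid overflow. */
    return (tsc_delta * tsc_to_system_mul) >> 32;
}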





Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

2022-10-07 Thread Marcelo Tosatti
Hi Valentin,

On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote:
> Background
> ==
> 
> Detecting IPI *reception* is relatively easy, e.g. using
> trace_irq_handler_{entry,exit} or even just function-trace
> flush_smp_call_function_queue() for SMP calls.  
> 
> Figuring out their *origin*, is trickier as there is no generic tracepoint 
> tied
> to e.g. smp_call_function():
> 
> o AFAIA x86 has no tracepoint tied to sending IPIs, only receiving them
>   (cf. trace_call_function{_single}_entry()).
> o arm/arm64 do have trace_ipi_raise(), which gives us the target cpus but 
> also a
>   mostly useless string (smp_calls will all be "Function call interrupts").
> o Other architectures don't seem to have any IPI-sending related tracepoint.  
> 
> I believe one reason those tracepoints used by arm/arm64 ended up as they were
> is because these archs used to handle IPIs differently from regular interrupts
> (the IRQ driver would directly invoke an IPI-handling routine), which meant 
> they 
> never showed up in trace_irq_handler_{entry, exit}. The trace_ipi_{entry,exit}
> tracepoints gave a way to trace IPI reception but those have become redundant 
> as
> of: 
> 
>   56afcd3dbd19 ("ARM: Allow IPIs to be handled as normal interrupts")
>   d3afc7f12987 ("arm64: Allow IPIs to be handled as normal interrupts")
> 
> which gave IPIs a "proper" handler function used through
> generic_handle_domain_irq(), which makes them show up via
> trace_irq_handler_{entry, exit}.
> 
> Changing stuff up
> =
> 
> Per the above, it would make sense to reshuffle trace_ipi_raise() and move it
> into generic code. This also came up during Daniel's talk on Osnoise at the 
> CPU
> isolation MC of LPC 2022 [1]. 
> 
> Now, to be useful, such a tracepoint needs to export:
> o targeted CPU(s)
> o calling context
> 
> The only way to get the calling context with trace_ipi_raise() is to trigger a
> stack dump, e.g. $(trace-cmd -e ipi* -T echo 42).
> 
> As for the targeted CPUs, the existing tracepoint does export them, albeit in
> cpumask form, which is quite inconvenient from a tooling perspective. For
> instance, as far as I'm aware, it's not possible to do event filtering on a
> cpumask via trace-cmd.

https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html

   -f filter
   Specify a filter for the previous event. This must come after
   a -e. This will filter what events get recorded based on the
   content of the event. Filtering is passed to the kernel
   directly so what filtering is allowed may depend on what
   version of the kernel you have. Basically, it will let you
   use C notation to check if an event should be processed or
   not.

   ==, >=, <=, >, <, &, |, && and ||

   The above are usually safe to use to compare fields.

This looks like overkill to me (consider a large number of bits set in the mask).

+#define trace_ipi_send_cpumask(callsite, mask) do {\
+   if (static_key_false(&__tracepoint_ipi_send_cpu.key)) { \
+   int cpu;\
+   for_each_cpu(cpu, mask) \
+   trace_ipi_send_cpu(callsite, cpu);  \
+   }   \
+} while (0)


> 
> Because of the above points, this is introducing a new tracepoint.
> 
> Patches
> ===
> 
> This results in having trace events for:
> 
> o smp_call_function*()
> o smp_send_reschedule()
> o irq_work_queue*()
> 
> This is incomplete, just looking at arm64 there's more IPI types that aren't 
> covered:
> 
>   IPI_CPU_STOP,
>   IPI_CPU_CRASH_STOP,
>   IPI_TIMER,
>   IPI_WAKEUP,
> 
> ... But it feels like a good starting point.

Can't you have a single tracepoint (or variant with cpumask) that would
cover such cases as well?

Maybe (as parameters for the tracepoint; see the sketch below):

* type (reschedule, smp_call_function, timer, wakeup, ...).

* function address: valid for smp_call_function, irq_work_queue
  types.
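
A rough sketch of what such a single tracepoint could look like (illustrative
only; the exact fields are an assumption on my part, not necessarily what
would be merged):

/* Would live in a trace header (include/trace/events/ style), surrounded by
 * the usual TRACE_SYSTEM boilerplate. */
TRACE_EVENT(ipi_send_cpu,

	TP_PROTO(unsigned int cpu, unsigned long callsite, void *callback),

	TP_ARGS(cpu, callsite, callback),

	TP_STRUCT__entry(
		__field(unsigned int, cpu)
		__field(void *, callsite)
		__field(void *, callback)
	),

	TP_fast_assign(
		__entry->cpu = cpu;
		__entry->callsite = (void *)callsite;
		__entry->callback = callback;
	),

	TP_printk("cpu=%u callsite=%pS callback=%pS",
		  __entry->cpu, __entry->callsite, __entry->callback)
);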

> Another thing worth mentioning is that depending on the callsite, the _RET_IP_
> fed to the tracepoint is not always useful - generic_exec_single() doesn't 
> tell
> you much about the actual callback being sent via IPI, so there might be value
> in exploding the single tracepoint into at least one variant for smp_calls.

Not sure I grasp what you mean by "exploding the single tracepoint...",
but yes, knowing the function or irq_work function is very useful.

> 
> Links
> =
> 
> [1]: https://youtu.be/5gT57y4OzBM?t=14234
> 
> Valentin Schneider (5):
>   trace: Add trace_ipi_send_{cpu, cpumask}
>   sched, smp: Trace send_call_function_single_ipi()
>   smp: Add a multi-CPU variant to send_call_function_single_ipi()
>   irq_work: Trace calls to arch_irq_work_raise()
>   treewide: Rename and trace arch-definitions of smp_send_reschedule()
> 
>  arch/alpha/kernel/smp.c  |  2 +-

Re: [PATCH] i386: Fix KVM_CAP_ADJUST_CLOCK capability check

2022-09-20 Thread Marcelo Tosatti
On Tue, Sep 20, 2022 at 04:40:24PM +0200, Vitaly Kuznetsov wrote:
> KVM commit c68dc1b577ea ("KVM: x86: Report host tsc and realtime values in
> KVM_GET_CLOCK") broke migration of certain workloads, e.g. Win11 + WSL2
> guest reboots immediately after migration. KVM, however, is not to
> blame this time. When KVM_CAP_ADJUST_CLOCK capability is checked, the
> result is all supported flags (which the above mentioned KVM commit
> enhanced) but kvm_has_adjust_clock_stable() wants it to be
> KVM_CLOCK_TSC_STABLE precisely. The result is that 'clock_is_reliable'
> is not set in vmstate and the saved clock reading is discarded in
> kvmclock_vm_state_change().
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  target/i386/kvm/kvm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index a1fd1f53791d..c33192a87dcb 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -157,7 +157,7 @@ bool kvm_has_adjust_clock_stable(void)
>  {
>  int ret = kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK);
>  
> -return (ret == KVM_CLOCK_TSC_STABLE);
> +return ret & KVM_CLOCK_TSC_STABLE;
>  }
>  
>  bool kvm_has_adjust_clock(void)
> -- 
> 2.37.3
> 
> 

ACK.




Re: [PATCH] target/i386: properly reset TSC on reset

2022-05-09 Thread Marcelo Tosatti
On Thu, Mar 24, 2022 at 06:31:36PM +0100, Paolo Bonzini wrote:
> Some versions of Windows hang on reboot if their TSC value is greater
> than 2^54.  The calibration of the Hyper-V reference time overflows
> and fails; as a result the processors' clock sources are out of sync.
> 
> The issue is that the TSC _should_ be reset to 0 on CPU reset and
> QEMU tries to do that.  However, KVM special cases writing 0 to the
> TSC and thinks that QEMU is trying to hot-plug a CPU, which is
> correct the first time through but not later.  Thwart this valiant
> effort and reset the TSC to 1 instead, but only if the CPU has been
> run once.
> 
> For this to work, env->tsc has to be moved to the part of CPUArchState
> that is not zeroed at the beginning of x86_cpu_reset.
> 
> Reported-by: Vadim Rozenfeld 
> Supersedes: <20220324082346.72180-1-pbonz...@redhat.com>
> Signed-off-by: Paolo Bonzini 

Paolo,

Won't this disable the logic to sync TSCs, making it possible
for the TSCs of SMP guests to go out of sync? (And remember, the logic
to sync TSCs from within a guest is fragile, e.g. in case of vCPU
overload.)

> ---
>  target/i386/cpu.c | 13 +
>  target/i386/cpu.h |  2 +-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index ec3b50bf6e..cb6b5467d0 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -5931,6 +5931,19 @@ static void x86_cpu_reset(DeviceState *dev)
>  env->xstate_bv = 0;
>  
>  env->pat = 0x0007040600070406ULL;
> +
> +if (kvm_enabled()) {
> +/*
> + * KVM handles TSC = 0 specially and thinks we are hot-plugging
> + * a new CPU, use 1 instead to force a reset.
> + */
> +if (env->tsc != 0) {
> +env->tsc = 1;
> +}
> +} else {
> +env->tsc = 0;
> +}
> +
>  env->msr_ia32_misc_enable = MSR_IA32_MISC_ENABLE_DEFAULT;
>  if (env->features[FEAT_1_ECX] & CPUID_EXT_MONITOR) {
>  env->msr_ia32_misc_enable |= MSR_IA32_MISC_ENABLE_MWAIT;
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index e31e6bd8b8..982c532353 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1554,7 +1554,6 @@ typedef struct CPUArchState {
>  target_ulong kernelgsbase;
>  #endif
>  
> -uint64_t tsc;
>  uint64_t tsc_adjust;
>  uint64_t tsc_deadline;
>  uint64_t tsc_aux;
> @@ -1708,6 +1707,7 @@ typedef struct CPUArchState {
>  int64_t tsc_khz;
>  int64_t user_tsc_khz; /* for sanity check only */
>  uint64_t apic_bus_freq;
> +uint64_t tsc;
>  #if defined(CONFIG_KVM) || defined(CONFIG_HVF)
>  void *xsave_buf;
>  uint32_t xsave_buf_len;
> -- 
> 2.35.1
> 
> 
> 




Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

2021-11-29 Thread Marcelo Tosatti
On Mon, Nov 22, 2021 at 09:40:52PM +0100, Nicolas Saenz Julienne wrote:
> Hi Marc, thanks for the review.
> 
> On Fri, 2021-11-19 at 12:17 +, Marc Zyngier wrote:
> > On Fri, 19 Nov 2021 10:21:18 +,
> > Nicolas Saenz Julienne  wrote:
> > > 
> > > While using cntvct as the raw clock for tracing, it's possible to
> > > synchronize host/guest traces just by knowing the virtual offset applied
> > > to the guest's virtual counter.
> > > 
> > > This is also the case on x86 when TSC is available. The offset is
> > > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > > implement the same for arm64.
> > 
> > How does this work with NV, where the guest hypervisor is in control
> > of the virtual offset? 
> 
> TBH I handn't thought about NV. Looking at it from that angle, I now see my
> approach doesn't work on hosts that use CNTVCT (regardless of NV). Upon
> entering into a guest, we change CNTVOFF before the host is done with tracing,
> so traces like 'kvm_entry' will have weird timestamps. I was just lucky that
> the hosts I was testing with use CNTPCT.
> 
> I believe the solution would be to be able to force a 0 offset between
> guest/host. With that in mind, is there a reason why kvm_timer_vcpu_init()
> imposes a non-zero one by default? I checked out the commits that introduced
> that code, but couldn't find a compelling reason. VMMs can always change it
> through KVM_REG_ARM_TIMER_CNT afterwards.

One reason is that you leak information from host to guest (the host's
TSC value).

Another reason would be that you introduce a configuration which is 
different from what the hardware has, which can in theory trigger
guest bugs.

> > I also wonder why we need this when userspace already has direct access to
> > that information without any extra kernel support (read the CNTVCT view of
> > the vcpu using the ONEREG API, subtract it from the host view of the 
> > counter,
> > job done).
> 
> Well IIUC, you're at the mercy of how long it takes to return from the ONEREG
> ioctl. The results will be skewed. For some workloads, where low latency is
> key, we really need high precision traces in the order of single digit us or
> even 100s of ns. I'm not sure you'll be able to get there with that approach.

If the guest can already read the host-to-guest HW clock offset, it
could do the conversion directly.

> [...]
> 
> > > diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> > > new file mode 100644
> > > index ..f0f5083ea8d4
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/debugfs.c
> > > @@ -0,0 +1,25 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Copyright (C) 2021 Red Hat Inc.
> > > + */
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +#include 
> > > +
> > > +static int vcpu_get_cntv_offset(void *data, u64 *val)
> > > +{
> > > + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> > > +
> > > + *val = timer_get_offset(vcpu_vtimer(vcpu));
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, 
> > > "%lld\n");
> > > +
> > > +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry 
> > > *debugfs_dentry)
> > > +{
> > > + debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, 
> > > &vcpu_cntvoff_fops);
> > > +}
> > 
> > This should be left in arch_timer.c until we actually need it for
> > multiple subsystems. When (and if) that happens, we will expose
> > per-subsystem debugfs initialisers instead of exposing the guts of the
> > timer code.
> 
> Noted.
> 
> -- 
> Nicolás Sáenz
> 
> 



Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

2021-11-19 Thread Marcelo Tosatti
On Fri, Nov 19, 2021 at 12:17:00PM +, Marc Zyngier wrote:
> On Fri, 19 Nov 2021 10:21:18 +,
> Nicolas Saenz Julienne  wrote:
> > 
> > While using cntvct as the raw clock for tracing, it's possible to
> > synchronize host/guest traces just by knowing the virtual offset applied
> > to the guest's virtual counter.
> > 
> > This is also the case on x86 when TSC is available. The offset is
> > exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> > implement the same for arm64.
> 
> How does this work with NV, where the guest hypervisor is in control
> of the virtual offset? How does userspace knows which vcpu to pick so
> that it gets the right offset?

On x86, the offsets for different vcpus are the same due to the logic
in the kvm_synchronize_tsc function:

During guest vCPU creation, when the TSC values are written within a
short window of time (or the value written is zero), the code uses the
same TSC offset for all vCPUs.

This logic is problematic ("short window of time" is a heuristic which
can fail), and is being replaced by having userspace write the same offset
for each vCPU:

commit 828ca89628bfcb1b8f27535025f69dd00eb55207
Author: Oliver Upton 
Date:   Thu Sep 16 18:15:38 2021 +

KVM: x86: Expose TSC offset controls to userspace

To date, VMM-directed TSC synchronization and migration has been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.

So with that in place, the answer to

"How does userspace know which vcpu to pick so
that it gets the right offset?"

is: any vcpu, since the offsets are the same.
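For illustration, with the TSC offset attribute from the commit above in
place, userspace could read the offset from any vCPU fd along these lines
(a hedged sketch; error handling omitted):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	static int get_tsc_offset(int vcpu_fd, __u64 *offset)
	{
		struct kvm_device_attr attr = {
			.group = KVM_VCPU_TSC_CTRL,
			.attr  = KVM_VCPU_TSC_OFFSET,
			.addr  = (__u64)(unsigned long)offset,
		};

		/* Any vCPU fd works, since all offsets are written to be equal. */
		return ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
	}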

> I also wonder why we need this when userspace already has direct
> access to that information without any extra kernel support (read the
> CNTVCT view of the vcpu using the ONEREG API, subtract it from the
> host view of the counter, job done).

If the guest has access to the clock offset (between guest and host), then
in the guest:

clockval = hostclockval - clockoffset

Adding "clockoffset" to that will retrieve the host clock.

Is that what you mean?



Re: [RFC PATCH 1/2] arm64/tracing: add cntvct based trace clock

2021-11-19 Thread Marcelo Tosatti
On Fri, Nov 19, 2021 at 11:21:17AM +0100, Nicolas Saenz Julienne wrote:
> Add a new arm64-specific trace clock using the cntvct register, similar
> to x64-tsc. This gives us:
>  - A clock that is relatively fast (1GHz on armv8.6, 1-50MHz otherwise),
>monotonic, and resilient to low power modes.
>  - It can be used to correlate events across cpus as well as across
>hypervisor and guests.
> 
> By using arch_timer_read_counter() we make sure that armv8.6 cpus use
> the less expensive CNTVCTSS_EL0, which cannot be accessed speculatively.

Can this register be read by userspace? (Otherwise it won't be possible
to correlate userspace events.)
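For reference, on arm64 the virtual counter is normally readable from EL0
(Linux enables EL0 access via CNTKCTL_EL1), so a userspace correlation tool
could read the same time base with a minimal helper like this (a sketch,
assuming such EL0 access is enabled):

	#include <stdint.h>

	static inline uint64_t read_cntvct(void)
	{
		uint64_t val;

		/* CNTVCT_EL0: virtual count, same time base as the proposed trace clock */
		asm volatile("mrs %0, cntvct_el0" : "=r" (val));
		return val;
	}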



Re: [RFC PATCH 2/2] KVM: arm64: export cntvoff in debugfs

2021-11-19 Thread Marcelo Tosatti
On Fri, Nov 19, 2021 at 11:21:18AM +0100, Nicolas Saenz Julienne wrote:
> While using cntvct as the raw clock for tracing, it's possible to
> synchronize host/guest traces just by knowing the virtual offset applied
> to the guest's virtual counter.
> 
> This is also the case on x86 when TSC is available. The offset is
> exposed in debugfs as 'tsc-offset' on a per vcpu basis. So let's
> implement the same for arm64.
> 
> Signed-off-by: Nicolas Saenz Julienne 

Hi Nicolas,

ARM:

CNTVCTSS_EL0, Counter-timer Self-Synchronized Virtual Count register
The CNTVCTSS_EL0 characteristics are:

Purpose
Holds the 64-bit virtual count value. The virtual count value is equal to the 
physical count value visible in CNTPCT_EL0 minus the virtual offset visible in 
CNTVOFF_EL2.
   ^

x86:

24.6.5 Time-Stamp Counter Offset and Multiplier
The VM-execution control fields include a 64-bit TSC-offset field. If the 
“RDTSC exiting” control is 0 and the “use
TSC offsetting” control is 1, this field controls executions of the RDTSC and 
RDTSCP instructions. It also controls
executions of the RDMSR instruction that read from the IA32_TIME_STAMP_COUNTER 
MSR. For all of these, the
value of the TSC offset is added to the value of the time-stamp counter, and 
the sum is returned to guest software
   ^
in EDX:EAX.

So it would be nice to keep the formula consistent for userspace:

GUEST_CLOCK_VAL = HOST_CLOCK_VAL + CLOCK_OFFSET

So the value exported to userspace would need a negative sign (the arm64
offset is subtracted, while the x86 offset is added).
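In other words, the debugfs read handler from the patch might export the
negated value, something along these lines (a hedged sketch, not a tested
change):

	static int vcpu_get_cntv_offset(void *data, u64 *val)
	{
		struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;

		/* Negate so that guest = host + offset, matching the x86 convention. */
		*val = -timer_get_offset(vcpu_vtimer(vcpu));

		return 0;
	}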

Other than that, both the counter value (CNTPCT_EL0) and the offset
(CNTVOFF_EL2) are not modified during guest execution, correct? That is,
CNTVOFF_EL2 is written only once, during guest initialization?


> ---
>  arch/arm64/include/asm/kvm_host.h |  1 +
>  arch/arm64/kvm/Makefile   |  2 +-
>  arch/arm64/kvm/arch_timer.c   |  2 +-
>  arch/arm64/kvm/debugfs.c  | 25 +
>  include/kvm/arm_arch_timer.h  |  3 +++
>  5 files changed, 31 insertions(+), 2 deletions(-)
>  create mode 100644 arch/arm64/kvm/debugfs.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 2a5f7f38006f..130534c9079e 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -29,6 +29,7 @@
>  #include 
>  
>  #define __KVM_HAVE_ARCH_INTC_INITIALIZED
> +#define __KVM_HAVE_ARCH_VCPU_DEBUGFS
>  
>  #define KVM_HALT_POLL_NS_DEFAULT 50
>  
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 989bb5dad2c8..17be7cf770f2 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -14,7 +14,7 @@ kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o 
> $(KVM)/eventfd.o \
>$(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
>arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
>inject_fault.o va_layout.o handle_exit.o \
> -  guest.o debug.o reset.o sys_regs.o \
> +  guest.o debug.o debugfs.o reset.o sys_regs.o \
>vgic-sys-reg-v3.o fpsimd.o pmu.o \
>arch_timer.o trng.o\
>vgic/vgic.o vgic/vgic-init.o \
> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
> index 3df67c127489..ee69387f7fb6 100644
> --- a/arch/arm64/kvm/arch_timer.c
> +++ b/arch/arm64/kvm/arch_timer.c
> @@ -82,7 +82,7 @@ u64 timer_get_cval(struct arch_timer_context *ctxt)
>   }
>  }
>  
> -static u64 timer_get_offset(struct arch_timer_context *ctxt)
> +u64 timer_get_offset(struct arch_timer_context *ctxt)
>  {
>   struct kvm_vcpu *vcpu = ctxt->vcpu;
>  
> diff --git a/arch/arm64/kvm/debugfs.c b/arch/arm64/kvm/debugfs.c
> new file mode 100644
> index ..f0f5083ea8d4
> --- /dev/null
> +++ b/arch/arm64/kvm/debugfs.c
> @@ -0,0 +1,25 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2021 Red Hat Inc.
> + */
> +
> +#include 
> +#include 
> +
> +#include 
> +
> +static int vcpu_get_cntv_offset(void *data, u64 *val)
> +{
> + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
> +
> + *val = timer_get_offset(vcpu_vtimer(vcpu));
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(vcpu_cntvoff_fops, vcpu_get_cntv_offset, NULL, 
> "%lld\n");
> +
> +void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry 
> *debugfs_dentry)
> +{
> + debugfs_create_file("cntvoff", 0444, debugfs_dentry, vcpu, 
> + &vcpu_cntvoff_fops);
> +}
> diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
> index 51c19381108c..de0cd9be825c 100644
> --- a/include/kvm/arm_arch_timer.h
> +++ b/include/kvm/arm_arch_timer.h
> @@ -106,4 +106,7 @@ void kvm_arm_timer_write_sysreg(struct kvm_vcpu *vcpu,
>  u32 timer_get_ctl(struct arch_timer_context *ctxt);
>  u64 timer_get_cval(struct arch_timer_context *ctxt);
>  
> +/* Needed for debugfs */
> +u64 timer_get_offset(struct arch_timer_context *ctxt);
> +
>  #endif
> -- 
> 2.33.1
> 
> 


Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-04 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 12:33:28PM -0700, Oliver Upton wrote:
> Marcelo,
> 
> On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti  wrote:
> >
> > On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0), +
> > > > > kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0). + [...]
> > > > >  +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock
> > > > > nanoseconds +   (k_0) and realtime nanoseconds (r_0) in their
> > > > > respective fields. +   Ensure that the KVM_CLOCK_REALTIME flag is
> > > > > set in the provided +   structure. KVM will advance the VM's
> > > > > kvmclock to account for elapsed +   time since recording the clock
> > > > > values.
> > > >
> > > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > > TSCs, which would be double counting.
> > > >
> > > > So you have to either add the elapsed realtime (1) between
> > > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > > TSCs. If you do both, there is double counting. Am i missing
> > > > something?
> > >
> > > Probably one of these two (but it's worth pointing out both of them):
> > >
> > > 1) the attribute that's introduced here *replaces*
> > > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> > >
> > > 2) the adjustment formula later in the algorithm does not care about how
> > > much time passed between step 1 and step 4.  It just takes two well
> > > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > > the same on the destination as if the guest was still running on the
> > > source.  It is irrelevant that one of them is before migration and one
> > > is after, all it matters is that one is on the source and one is on the
> > > destination.
> >
> > OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay
> > which is introduced during migration (which is what i would guess is
> > the lower hanging fruit) (for guests using TSC).
> 
> The series gives userspace the ability to modify the guest's
> perception of the TSC in whatever way it sees fit. The algorithm in
> the documentation provides a suggestion to userspace on how to do
> exactly that. I kept that advancement logic out of the kernel because
> IMO it is an implementation detail: users have differing opinions on
> how clocks should behave across a migration and KVM shouldn't have any
> baked-in rules around it.

OK, I was just trying to visualize how this would work with QEMU Linux guests.

> 
> At the same time, userspace can choose to _not_ jump the TSC and use
> the available interfaces to just migrate the existing state of the
> TSCs.
> 
> When I had initially proposed this series upstream, Paolo astutely
> pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
> pairing, which is critical for the TSC advancement algorithm in the
> documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
> in userspace [1], hence the missing kvm clock changes. So, in all, the
> spirit of the KVM clock changes is to provide missing UAPI around the
> clock/TSC, with the side effect of changing the guest-visible value.
> 
> [1] https://cloud.google.com/spanner/docs/true-time-external-consistency
> 
> > My point was that, by advancing the _TSC value_ by:
> >
> > T0. stop guest vcpus(source)
> > T1. KVM_GET_CLOCK   (source)
> > T2. KVM_SET_CLOCK   (destination)
> > T3. Write guest TSCs(destination)
> > T4. resume guest(destination)
> >
> > new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> >
> > t_0:host TSC at KVM_GET_CLOCK time.
> > off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> > TSC offset is fixed).
> > ...
> >
> > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > +   time since recording the clock values.
> >
> > Only kvmclock is advanced (by passing r_0). But a guest might not use 
> > kvmclock
> > (hopefully modern guests on modern hosts will use TSC clocksource,
> > whose clock_gettime is faster... some people are using that already).
> &

Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0), +
> > > kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0). + [...]
> > >  +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock
> > > nanoseconds +   (k_0) and realtime nanoseconds (r_0) in their
> > > respective fields. +   Ensure that the KVM_CLOCK_REALTIME flag is
> > > set in the provided +   structure. KVM will advance the VM's
> > > kvmclock to account for elapsed +   time since recording the clock
> > > values.
> > 
> > You can't advance both kvmclock (kvmclock_offset variable) and the
> > TSCs, which would be double counting.
> > 
> > So you have to either add the elapsed realtime (1) between
> > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > TSCs. If you do both, there is double counting. Am i missing
> > something?
> 
> Probably one of these two (but it's worth pointing out both of them):
> 
> 1) the attribute that's introduced here *replaces*
> KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> 
> 2) the adjustment formula later in the algorithm does not care about how
> much time passed between step 1 and step 4.  It just takes two well
> known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> the same on the destination as if the guest was still running on the
> source.  It is irrelevant that one of them is before migration and one
> is after, all it matters is that one is on the source and one is on the
> destination.

OK, so it still relies on the NTP daemon to fix the CLOCK_REALTIME delay
which is introduced during migration (which I would guess is the
lower-hanging fruit), for guests using TSC.

My point was that, by advancing the _TSC value_ by:

T0. stop guest vcpus(source)
T1. KVM_GET_CLOCK   (source)
T2. KVM_SET_CLOCK   (destination)
T3. Write guest TSCs(destination)
T4. resume guest(destination)

new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

t_0:host TSC at KVM_GET_CLOCK time.
off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
TSC offset is fixed).
...

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
(hopefully modern guests on modern hosts will use the TSC clocksource,
whose clock_gettime() is faster; some people are using that already).

At some point, should QEMU enable the invariant TSC flag by default?

That said, the point is: why not advance the _TSC_ values
(instead of kvmclock nanoseconds)? Doing so would reduce
the "CLOCK_REALTIME delay which is introduced during migration"
for both kvmclock users and modern TSC clocksource users.
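For comparison, keeping the notation of the algorithm above, advancing the
guest TSCs by the elapsed realtime would look roughly like this (assuming
r_1 denotes the destination's realtime clock reading, by analogy with k_1;
this is a sketch, not the series' formula):

   new_off_n = t_0 + off_n + (r_1 - r_0) * freq - t_1

i.e. the same adjustment, but driven by the realtime delta instead of the
kvmclock delta, so guests using the TSC clocksource also see the advanced
time.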

So yes, I also like this patchset, but would like it even more
if it fixed the case above as well (and I am not sure whether adding
the migration delta to kvmclock makes it harder to fix the TSC case
later).

> Perhaps we can add to step 6 something like:
> 
> > +6. Adjust the guest TSC offsets for every vCPU to account for (1)
> > time +   elapsed since recording state and (2) difference in TSCs
> > between the +   source and destination machine: + +   new_off_n = t_0
> > + off_n + (k_1 - k_0) * freq - t_1 +
> 
> "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> in kvmclock.  The above formula ensures that it is the same on the
> destination as it was on the source.
> 
> Also, the names are a bit hard to follow.  Perhaps
> 
>   t_0 tsc_src
>   t_1 tsc_dest
>   k_0 guest_src
>   k_1 guest_dest
>   r_0 host_src
>   off_n   ofs_src[i]
>   new_off_n   ofs_dest[i]
> 
> Paolo
> 
> 



Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> > Marcelo,
> > 
> > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> > >> On Thu, Sep 16, 2021 at 06:15:35PM +, Oliver Upton wrote:
> > >> 
> > >> Thomas, CC'ed, has deeper understanding of problems with 
> > >> forward time jumps than I do. Thomas, any comments?
> > >
> > > Based on the earlier discussion about the problems of synchronizing
> > > the guests clock via a notification to the NTP/Chrony daemon 
> > > (where there is a window where applications can read the stale
> > > value of the clock), a possible solution would be triggering
> > > an NMI on the destination (so that it runs ASAP, with higher
> > > priority than application/kernel).
> > >
> > > What would this NMI do, exactly?
> > 
> > Nothing. You cannot do anything time related in an NMI.
> > 
> > You might queue irq work which handles that, but that would still not
> > prevent user space or kernel space from observing the stale time stamp
> > depending on the execution state from where it resumes.
> 
> Yes.
> 
> > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> > >> for either vm pause / vm resume (well, if paused for long periods of 
> > >> time) 
> > >> or savevm / restorevm.
> > >
> > > Maybe with the NMI above, it would be possible to use
> > > the realtime clock as a way to know time elapsed between
> > > events and advance guest clock without the current 
> > > problematic window.
> > 
> > As much duct tape you throw at the problem, it cannot be solved ever
> > because it's fundamentally broken. All you can do is to make the
> > observation windows smaller, but that's just curing the symptom.
> 
> Yes.
> 
> > The problem is that the guest is paused/resumed without getting any
> > information about that and the execution of the guest is stopped at an
> > arbitrary instruction boundary from which it resumes after migration or
> > restore. So there is no way to guarantee that after resume all vCPUs are
> > executing in a state which can handle that.
> > 
> > But even if that would be the case, then what prevents the stale time
> > stamps to be visible? Nothing:
> > 
> > T0:t = now();
> >  -> pause
> >  -> resume
> >  -> magic "fixup"
> > T1:dostuff(t);
> 
> Yes.
> 
> BTW, you could have a userspace notification (then applications 
> could handle this if desired).
> 
> > But that's not a fundamental problem because every preemptible or
> > interruptible code has the same issue:
> > 
> > T0:t = now();
> >  -> preemption or interrupt
> > T1:dostuff(t);
> > 
> > Which is usually not a problem, but It becomes a problem when T1 - T0 is
> > greater than the usual expectations which can obviously be trivially
> > achieved by guest migration or a savevm, restorevm cycle.
> > 
> > But let's go a step back and look at the clocks and their expectations:
> > 
> > CLOCK_MONOTONIC:
> > 
> >   Monotonically increasing clock which counts unless the system
> >   is in suspend. On resume it continues counting without jumping
> >   forward.
> > 
> >   That's the reference clock for everything else and therefore it
> >   is important that it does _not_ jump around.
> > 
> >   The reasons why CLOCK_MONOTONIC stops during suspend is
> >   historical and any attempt to change that breaks the world and
> >   some more because making it jump forward will trigger all sorts
> >   of timeouts, watchdogs and whatever. The last attempt to make
> >   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
> >   weeks. It's not going to be attempted again. See a3ed0e4393d6
> >   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
> >   details.
> > 
> >   Now the proposed change is creating exactly the same problem:
> > 
> >   >> > +if (data.flags & KVM_CLOCK_REALTIME) {
> >   >> > +u64 now_real_ns = ktime_get_real_ns();
> >   >> > +
> >   >> > +/*
> >   >> > + * Avoid stepping the kvmclock backwards.
> >   >> > + */
> >   >> > +if (now_real_ns >

Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> Marcelo,
> 
> On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> >> On Thu, Sep 16, 2021 at 06:15:35PM +, Oliver Upton wrote:
> >> 
> >> Thomas, CC'ed, has deeper understanding of problems with 
> >> forward time jumps than I do. Thomas, any comments?
> >
> > Based on the earlier discussion about the problems of synchronizing
> > the guests clock via a notification to the NTP/Chrony daemon 
> > (where there is a window where applications can read the stale
> > value of the clock), a possible solution would be triggering
> > an NMI on the destination (so that it runs ASAP, with higher
> > priority than application/kernel).
> >
> > What would this NMI do, exactly?
> 
> Nothing. You cannot do anything time related in an NMI.
> 
> You might queue irq work which handles that, but that would still not
> prevent user space or kernel space from observing the stale time stamp
> depending on the execution state from where it resumes.

Yes.

> >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> >> for either vm pause / vm resume (well, if paused for long periods of time) 
> >> or savevm / restorevm.
> >
> > Maybe with the NMI above, it would be possible to use
> > the realtime clock as a way to know time elapsed between
> > events and advance guest clock without the current 
> > problematic window.
> 
> As much duct tape you throw at the problem, it cannot be solved ever
> because it's fundamentally broken. All you can do is to make the
> observation windows smaller, but that's just curing the symptom.

Yes.

> The problem is that the guest is paused/resumed without getting any
> information about that and the execution of the guest is stopped at an
> arbitrary instruction boundary from which it resumes after migration or
> restore. So there is no way to guarantee that after resume all vCPUs are
> executing in a state which can handle that.
> 
> But even if that would be the case, then what prevents the stale time
> stamps to be visible? Nothing:
> 
> T0:t = now();
>  -> pause
>  -> resume
>  -> magic "fixup"
> T1:dostuff(t);

Yes.

BTW, you could have a userspace notification (then applications 
could handle this if desired).

> But that's not a fundamental problem because every preemptible or
> interruptible code has the same issue:
> 
> T0:t = now();
>  -> preemption or interrupt
> T1:dostuff(t);
> 
> Which is usually not a problem, but It becomes a problem when T1 - T0 is
> greater than the usual expectations which can obviously be trivially
> achieved by guest migration or a savevm, restorevm cycle.
> 
> But let's go a step back and look at the clocks and their expectations:
> 
> CLOCK_MONOTONIC:
> 
>   Monotonically increasing clock which counts unless the system
>   is in suspend. On resume it continues counting without jumping
>   forward.
> 
>   That's the reference clock for everything else and therefore it
>   is important that it does _not_ jump around.
> 
>   The reasons why CLOCK_MONOTONIC stops during suspend is
>   historical and any attempt to change that breaks the world and
>   some more because making it jump forward will trigger all sorts
>   of timeouts, watchdogs and whatever. The last attempt to make
>   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
>   weeks. It's not going to be attempted again. See a3ed0e4393d6
>   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
>   details.
> 
>   Now the proposed change is creating exactly the same problem:
> 
>   >> > +  if (data.flags & KVM_CLOCK_REALTIME) {
>   >> > +  u64 now_real_ns = ktime_get_real_ns();
>   >> > +
>   >> > +  /*
>   >> > +   * Avoid stepping the kvmclock backwards.
>   >> > +   */
>   >> > +  if (now_real_ns > data.realtime)
>   >> > +  data.clock += now_real_ns - data.realtime;
>   >> > +  }
> 
>   IOW, it takes the time between pause and resume into account and
>   forwards the underlying base clock which makes CLOCK_MONOTONIC
>   jump forward by exactly that amount of time.

Well, it is assuming that in

 T0: t = now();
 T1: pause vm()
 T2: finish vm migration()
 T3: dostuff(t);

the interval between T1 and T2 is small (and that the guest
clocks are synchronized up to a given boundary).

But i suppose add

Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 11:17:33AM +0200, Paolo Bonzini wrote:
> On 30/09/21 21:14, Marcelo Tosatti wrote:
> > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > Hi Oliver,
> > 
> > This won't advance the TSC values themselves, right?
> 
> Why not?  It affects the TSC offset in the vmcs, so the TSC in the VM is
> advanced too.
> 
> Paolo

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

You can't advance both kvmclock (kvmclock_offset variable) and the TSCs,
which would be double counting.

So you have to either add the elapsed realtime (1) between KVM_GET_CLOCK
and KVM_SET_CLOCK to kvmclock (which this patch is doing), or add it to the
TSCs. If you do both, there is double counting. Am I missing something?

To make it clearer: the TSC clocksource is faster than the kvmclock
clocksource, so we'd rather use it when possible, which is achievable with
TSC scaling support in hardware.

1: As mentioned earlier, just using the realtime clock delta between
hosts can introduce problems. So we need a scheme to:

- Find the offset between the host clocks, with upper and lower
  bounds on the error.
- Take appropriate action based on that (for example,
  do not use the KVM_CLOCK_REALTIME flag on KVM_SET_CLOCK
  if the delta between hosts is large). A sketch of this follows below.

This can be done in userspace or kernel space... (hmm, but maybe
delegating it to userspace will introduce different solutions
to the same problem?).
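As a rough userspace-side sketch of that idea (illustrative only:
estimated_host_clock_error_ns() and max_skew_ns are hypothetical; the
ioctl, structure and flag are the ones from this series):

	struct kvm_clock_data data = {
		.clock    = k_0,	/* kvmclock ns recorded on the source */
		.realtime = r_0,	/* CLOCK_REALTIME ns recorded on the source */
	};

	/*
	 * Only let KVM advance the clock by the realtime delta if the
	 * estimated clock offset error between source and destination
	 * hosts is within an acceptable bound.
	 */
	if (estimated_host_clock_error_ns() < max_skew_ns)
		data.flags |= KVM_CLOCK_REALTIME;

	ioctl(vm_fd, KVM_SET_CLOCK, &data);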

> > This (advancing the TSC values by the realtime elapsed time) would be
> > awesome because TSC clock_gettime() vdso is faster, and some
> > applications prefer to just read from TSC directly.
> > See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".
> > 
> > The advancement with this patchset only applies to kvmclock.
> > 
> 
> 



Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-09-30 Thread Marcelo Tosatti
On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> Oliver,
> 
> Do you have any numbers for the improvement in guests CLOCK_REALTIME
> accuracy across migration, when this is in place?
> 
> On Thu, Sep 16, 2021 at 06:15:35PM +, Oliver Upton wrote:
> > Handling the migration of TSCs correctly is difficult, in part because
> > Linux does not provide userspace with the ability to retrieve a (TSC,
> > realtime) clock pair for a single instant in time. In lieu of a more
> > convenient facility, KVM can report similar information in the kvm_clock
> > structure.
> > 
> > Provide userspace with a host TSC & realtime pair iff the realtime clock
> > is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> > realtime value, advance the KVM clock by the amount of elapsed time. Do
> > not step the KVM clock backwards, though, as it is a monotonic
> > oscillator.
> > 
> > Suggested-by: Paolo Bonzini 
> > Signed-off-by: Oliver Upton 
> > ---
> >  Documentation/virt/kvm/api.rst  | 42 ++---
> >  arch/x86/include/asm/kvm_host.h |  3 +++
> >  arch/x86/kvm/x86.c  | 36 +---
> >  include/uapi/linux/kvm.h|  7 +-
> >  4 files changed, 70 insertions(+), 18 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index a6729c8cf063..d0b9c986cf6c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -993,20 +993,34 @@ such as migration.
> >  When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
> >  set of bits that KVM can return in struct kvm_clock_data's flag member.
> >  
> > -The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
> > -value is the exact kvmclock value seen by all VCPUs at the instant
> > -when KVM_GET_CLOCK was called.  If clear, the returned value is simply
> > -CLOCK_MONOTONIC plus a constant offset; the offset can be modified
> > -with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
> > -but the exact value read by each VCPU could differ, because the host
> > -TSC is not stable.
> > +FLAGS:
> > +
> > +KVM_CLOCK_TSC_STABLE.  If set, the returned value is the exact kvmclock
> > +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
> > +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
> > +offset; the offset can be modified with KVM_SET_CLOCK.  KVM will try
> > +to make all VCPUs follow this clock, but the exact value read by each
> > +VCPU could differ, because the host TSC is not stable.
> > +
> > +KVM_CLOCK_REALTIME.  If set, the `realtime` field in the kvm_clock_data
> > +structure is populated with the value of the host's real time
> > +clocksource at the instant when KVM_GET_CLOCK was called. If clear,
> > +the `realtime` field does not contain a value.
> > +
> > +KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
> > +structure is populated with the value of the host's timestamp counter (TSC)
> > +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` 
> > field
> > +does not contain a value.
> >  
> >  ::
> >  
> >struct kvm_clock_data {
> > __u64 clock;  /* kvmclock current value */
> > __u32 flags;
> > -   __u32 pad[9];
> > +   __u32 pad0;
> > +   __u64 realtime;
> > +   __u64 host_tsc;
> > +   __u32 pad[4];
> >};
> >  
> >  
> > @@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value 
> > specified in its parameter.
> >  In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on 
> > scenarios
> >  such as migration.
> >  
> > +FLAGS:
> > +
> > +KVM_CLOCK_REALTIME.  If set, KVM will compare the value of the `realtime` 
> > field
> > +with the value of the host's real time clocksource at the instant when
> > +KVM_SET_CLOCK was called. The difference in elapsed time is added to the 
> > final
> > +kvmclock value that will be provided to guests.
> > +
> >  ::
> >  
> >struct kvm_clock_data {
> > __u64 clock;  /* kvmclock current value */
> > __u32 flags;
> > -   __u32 pad[9];
> > +   __u32 pad0;
> > +   __u64 realtime;
> > +   __u64 host_tsc;
> > +   __u32 pad[4];
> >};
> >  
> >  
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index be6805fc0260..9c34b5b6

Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount

2021-09-30 Thread Marcelo Tosatti
On Thu, Sep 16, 2021 at 06:15:36PM +, Oliver Upton wrote:
> From: Paolo Bonzini 
> 
> Protect the reference point for kvmclock with a seqcount, so that
> kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
> updates will also run in parallel and not bounce the kvmclock cacheline.
> 
> nr_vcpus_matched_tsc is updated outside pvclock_update_vm_gtod_copy
> though, so a spinlock must be kept for that one.
> 
> Signed-off-by: Paolo Bonzini 
> [Oliver - drop unused locals, don't double acquire tsc_write_lock]
> Signed-off-by: Oliver Upton 
> ---
>  arch/x86/include/asm/kvm_host.h |  7 ++-
>  arch/x86/kvm/x86.c  | 83 +
>  2 files changed, 49 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9c34b5b63e39..5accfe7246ce 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1087,6 +1087,11 @@ struct kvm_arch {
>  
>   unsigned long irq_sources_bitmap;
>   s64 kvmclock_offset;
> +
> + /*
> +  * This also protects nr_vcpus_matched_tsc which is read from a
> +  * preemption-disabled region, so it must be a raw spinlock.
> +  */
>   raw_spinlock_t tsc_write_lock;
>   u64 last_tsc_nsec;
>   u64 last_tsc_write;
> @@ -1097,7 +1102,7 @@ struct kvm_arch {
>   u64 cur_tsc_generation;
>   int nr_vcpus_matched_tsc;
>  
> - spinlock_t pvclock_gtod_sync_lock;
> + seqcount_raw_spinlock_t pvclock_sc;
>   bool use_master_clock;
>   u64 master_kernel_ns;
>   u64 master_cycle_now;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cb5d5cad5124..29156c49cd11 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2533,9 +2533,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, 
> u64 data)
>   vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>  
>   kvm_vcpu_write_tsc_offset(vcpu, offset);
> - raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>  
> - spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags);
>   if (!matched) {
>   kvm->arch.nr_vcpus_matched_tsc = 0;
>   } else if (!already_matched) {
> @@ -2543,7 +2541,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, 
> u64 data)
>   }
>  
>   kvm_track_tsc_matching(vcpu);
> - spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags);
> + raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>  }
>  
>  static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
> @@ -2731,9 +2729,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>   int vclock_mode;
>   bool host_tsc_clocksource, vcpus_matched;
>  
> - vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> - atomic_read(&kvm->online_vcpus));
> -
>   /*
>* If the host uses TSC clock, then passthrough TSC as stable
>* to the guest.
> @@ -2742,6 +2737,10 @@ static void pvclock_update_vm_gtod_copy(struct kvm 
> *kvm)
> - &ka->master_kernel_ns,
> - &ka->master_cycle_now);
>  
> + lockdep_assert_held(&kvm->arch.tsc_write_lock);
> + vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> + atomic_read(&kvm->online_vcpus));
> +
>   ka->use_master_clock = host_tsc_clocksource && vcpus_matched
>   && !ka->backwards_tsc_observed
>   && !ka->boot_vcpu_runs_old_kvmclock;
> @@ -2760,14 +2759,18 @@ static void kvm_make_mclock_inprogress_request(struct 
> kvm *kvm)
>   kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
>  }
>  
> -static void kvm_start_pvclock_update(struct kvm *kvm)
> +static void __kvm_start_pvclock_update(struct kvm *kvm)
>  {
> - struct kvm_arch *ka = &kvm->arch;
> + raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
> + write_seqcount_begin(&kvm->arch.pvclock_sc);
> +}
>  
> +static void kvm_start_pvclock_update(struct kvm *kvm)
> +{
>   kvm_make_mclock_inprogress_request(kvm);
>  
>   /* no guest entries from this point */
> - spin_lock_irq(&ka->pvclock_gtod_sync_lock);
> + __kvm_start_pvclock_update(kvm);
>  }
>  
>  static void kvm_end_pvclock_update(struct kvm *kvm)
> @@ -2776,7 +2779,8 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
>   struct kvm_vcpu *vcpu;
>   int i;
>  
> - spin_unlock_irq(&ka->pvclock_gtod_sync_lock);
> + write_seqcount_end(&ka->pvclock_sc);
> + raw_spin_unlock_irq(&ka->tsc_write_lock);
>   kvm_for_each_vcpu(i, vcpu, kvm)
>   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  
> @@ -2797,20 +2801,12 @@ static void get_kvmclock(struct kvm *kvm, struct 
> kvm_clock_data *data)
>  {
>   struct kvm_arch *ka = &kvm->arch;
>   struct pvclock_vcpu_time_info hv_clock;
> - unsigned long flags;
>  
> - spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
>   if (!ka->use_master_clock) {
> - 

Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-09-30 Thread Marcelo Tosatti
On Thu, Sep 16, 2021 at 06:15:38PM +, Oliver Upton wrote:
> To date, VMM-directed TSC synchronization and migration has been a bit
> messy. KVM has some baked-in heuristics around TSC writes to infer if
> the VMM is attempting to synchronize. This is problematic, as it depends
> on host userspace writing to the guest's TSC within 1 second of the last
> write.
> 
> A much cleaner approach to configuring the guest's views of the TSC is to
> simply migrate the TSC offset for every vCPU. Offsets are idempotent,
> and thus not subject to change depending on when the VMM actually
> reads/writes values from/to KVM. The VMM can then read the TSC once with
> KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
> the guest is paused.
> 
> Cc: David Matlack 
> Cc: Sean Christopherson 
> Signed-off-by: Oliver Upton 


> ---
>  Documentation/virt/kvm/devices/vcpu.rst |  57 
>  arch/x86/include/asm/kvm_host.h |   1 +
>  arch/x86/include/uapi/asm/kvm.h |   4 +
>  arch/x86/kvm/x86.c  | 110 
>  4 files changed, 172 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst 
> b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..3b399d727c11 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure 
> for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_VCPU_TSC_CTRL
> +===
> +
> +:Architectures: x86
> +
> +4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
> +
> +:Parameters: 64-bit unsigned TSC offset
> +
> +Returns:
> +
> +  === ==
> +  -EFAULT Error reading/writing the provided
> +  parameter address.
> +  -ENXIO  Attribute not supported
> +  === ==
> +
> +Specifies the guest's TSC offset relative to the host's TSC. The guest's
> +TSC is then derived by the following equation:
> +
> +  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
> +
> +This attribute is useful for the precise migration of a guest's TSC. The
> +following describes a possible algorithm to use for the migration of a
> +guest's TSC:
> +
> +From the source VMM process:
> +
> +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> +
> +2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
> +   guest TSC offset (off_n).
> +
> +3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
> +   guest's TSC (freq).
> +
> +From the destination VMM process:
> +
> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> +   structure. KVM will advance the VM's kvmclock to account for elapsed
> +   time since recording the clock values.
> +
> +5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_1) and
> +   kvmclock nanoseconds (k_1).
> +
> +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> +   elapsed since recording state and (2) difference in TSCs between the
> +   source and destination machine:
> +
> +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

Hi Oliver,

This won't advance the TSC values themselves, right?
This (advancing the TSC values by the elapsed realtime) would be
awesome, because the TSC clock_gettime() vDSO path is faster, and some
applications prefer to just read the TSC directly.
See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".

The advancement with this patchset only applies to kvmclock.



Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-09-29 Thread Marcelo Tosatti
Oliver,

Do you have any numbers for the improvement in guests CLOCK_REALTIME
accuracy across migration, when this is in place?

On Thu, Sep 16, 2021 at 06:15:35PM +, Oliver Upton wrote:
> Handling the migration of TSCs correctly is difficult, in part because
> Linux does not provide userspace with the ability to retrieve a (TSC,
> realtime) clock pair for a single instant in time. In lieu of a more
> convenient facility, KVM can report similar information in the kvm_clock
> structure.
> 
> Provide userspace with a host TSC & realtime pair iff the realtime clock
> is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> realtime value, advance the KVM clock by the amount of elapsed time. Do
> not step the KVM clock backwards, though, as it is a monotonic
> oscillator.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Oliver Upton 
> ---
>  Documentation/virt/kvm/api.rst  | 42 ++---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c  | 36 +---
>  include/uapi/linux/kvm.h|  7 +-
>  4 files changed, 70 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a6729c8cf063..d0b9c986cf6c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -993,20 +993,34 @@ such as migration.
>  When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
>  set of bits that KVM can return in struct kvm_clock_data's flag member.
>  
> -The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
> -value is the exact kvmclock value seen by all VCPUs at the instant
> -when KVM_GET_CLOCK was called.  If clear, the returned value is simply
> -CLOCK_MONOTONIC plus a constant offset; the offset can be modified
> -with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
> -but the exact value read by each VCPU could differ, because the host
> -TSC is not stable.
> +FLAGS:
> +
> +KVM_CLOCK_TSC_STABLE.  If set, the returned value is the exact kvmclock
> +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
> +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
> +offset; the offset can be modified with KVM_SET_CLOCK.  KVM will try
> +to make all VCPUs follow this clock, but the exact value read by each
> +VCPU could differ, because the host TSC is not stable.
> +
> +KVM_CLOCK_REALTIME.  If set, the `realtime` field in the kvm_clock_data
> +structure is populated with the value of the host's real time
> +clocksource at the instant when KVM_GET_CLOCK was called. If clear,
> +the `realtime` field does not contain a value.
> +
> +KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
> +structure is populated with the value of the host's timestamp counter (TSC)
> +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
> +does not contain a value.
>  
>  ::
>  
>struct kvm_clock_data {
>   __u64 clock;  /* kvmclock current value */
>   __u32 flags;
> - __u32 pad[9];
> + __u32 pad0;
> + __u64 realtime;
> + __u64 host_tsc;
> + __u32 pad[4];
>};
>  
>  
> @@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value 
> specified in its parameter.
>  In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on 
> scenarios
>  such as migration.
>  
> +FLAGS:
> +
> +KVM_CLOCK_REALTIME.  If set, KVM will compare the value of the `realtime` 
> field
> +with the value of the host's real time clocksource at the instant when
> +KVM_SET_CLOCK was called. The difference in elapsed time is added to the 
> final
> +kvmclock value that will be provided to guests.
> +
>  ::
>  
>struct kvm_clock_data {
>   __u64 clock;  /* kvmclock current value */
>   __u32 flags;
> - __u32 pad[9];
> + __u32 pad0;
> + __u64 realtime;
> + __u64 host_tsc;
> + __u32 pad[4];
>};
>  
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be6805fc0260..9c34b5b63e39 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1936,4 +1936,7 @@ int kvm_cpu_dirty_log_size(void);
>  
>  int alloc_all_memslots_rmaps(struct kvm *kvm);
>  
> +#define KVM_CLOCK_VALID_FLAGS
> \
> + (KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 523c4e5c109f..cb5d5cad5124 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2815,10 +2815,20 @@ static void get_kvmclock(struct kvm *kvm, struct 
> kvm_clock_data *data)
>   get_cpu();
>  
>   if (__this_cpu_read(cpu_tsc_khz)) {
> +#ifdef CONFIG_X86_64
> + struct timespec64 ts;
> +
> + if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> + 

Re: [PATCH v8 3/7] KVM: x86: Fix potential race in KVM_GET_CLOCK

2021-09-29 Thread Marcelo Tosatti
On Thu, Sep 16, 2021 at 06:15:34PM +, Oliver Upton wrote:
> Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock
> outside of the pvclock sync lock. This is problematic, as the clock
> value written to the user may or may not actually correspond to a stable
> TSC.
> 
> Fix the race by populating the entire kvm_clock_data structure behind
> the pvclock_gtod_sync_lock.
> 
> Suggested-by: Sean Christopherson 
> Signed-off-by: Oliver Upton 

ACK patches 1-3, still reviewing the remaining ones...



Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-09-28 Thread Marcelo Tosatti
On Thu, Sep 16, 2021 at 06:15:35PM +, Oliver Upton wrote:
> Handling the migration of TSCs correctly is difficult, in part because
> Linux does not provide userspace with the ability to retrieve a (TSC,
> realtime) clock pair for a single instant in time. In lieu of a more
> convenient facility, KVM can report similar information in the kvm_clock
> structure.
> 
> Provide userspace with a host TSC & realtime pair iff the realtime clock
> is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> realtime value, advance the KVM clock by the amount of elapsed time. Do
> not step the KVM clock backwards, though, as it is a monotonic
> oscillator.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Oliver Upton 
> ---
>  Documentation/virt/kvm/api.rst  | 42 ++---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c  | 36 +---
>  include/uapi/linux/kvm.h|  7 +-
>  4 files changed, 70 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a6729c8cf063..d0b9c986cf6c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -993,20 +993,34 @@ such as migration.
>  When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
>  set of bits that KVM can return in struct kvm_clock_data's flag member.
>  
> -The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
> -value is the exact kvmclock value seen by all VCPUs at the instant
> -when KVM_GET_CLOCK was called.  If clear, the returned value is simply
> -CLOCK_MONOTONIC plus a constant offset; the offset can be modified
> -with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
> -but the exact value read by each VCPU could differ, because the host
> -TSC is not stable.
> +FLAGS:
> +
> +KVM_CLOCK_TSC_STABLE.  If set, the returned value is the exact kvmclock
> +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
> +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
> +offset; the offset can be modified with KVM_SET_CLOCK.  KVM will try
> +to make all VCPUs follow this clock, but the exact value read by each
> +VCPU could differ, because the host TSC is not stable.
> +
> +KVM_CLOCK_REALTIME.  If set, the `realtime` field in the kvm_clock_data
> +structure is populated with the value of the host's real time
> +clocksource at the instant when KVM_GET_CLOCK was called. If clear,
> +the `realtime` field does not contain a value.
> +
> +KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
> +structure is populated with the value of the host's timestamp counter (TSC)
> +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
> +does not contain a value.

If the host TSCs are not stable, then the KVM_CLOCK_HOST_TSC bit (and
the host_tsc field) is ambiguous. Shouldn't exposing them be conditional
on the host TSC being stable?
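
For reference, this is roughly how I would expect userspace to consume the
new fields: only trust realtime/host_tsc when the corresponding flag is set.
A minimal sketch (assuming the kvm_clock_data layout proposed in this series;
error handling kept to a minimum):

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch only: vm_fd is an already-created VM file descriptor, and the
 * flags/fields checked below are the ones added by this series. */
static void dump_kvmclock(int vm_fd)
{
	struct kvm_clock_data data;

	memset(&data, 0, sizeof(data));
	if (ioctl(vm_fd, KVM_GET_CLOCK, &data) < 0)
		err(1, "KVM_GET_CLOCK");

	printf("kvmclock: %llu ns\n", (unsigned long long)data.clock);
	if (data.flags & KVM_CLOCK_REALTIME)
		printf("realtime: %llu ns\n", (unsigned long long)data.realtime);
	if (data.flags & KVM_CLOCK_HOST_TSC)
		printf("host_tsc: %llu\n", (unsigned long long)data.host_tsc);
	else
		printf("no (realtime, host_tsc) pair, fall back to the old scheme\n");
}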



Re: [PATCH v7 6/6] KVM: x86: Expose TSC offset controls to userspace

2021-08-26 Thread Marcelo Tosatti
On Mon, Aug 23, 2021 at 01:56:30PM -0700, Oliver Upton wrote:
> Paolo,
> 
> On Sun, Aug 15, 2021 at 5:11 PM Oliver Upton  wrote:
> >
> > To date, VMM-directed TSC synchronization and migration has been a bit
> > messy. KVM has some baked-in heuristics around TSC writes to infer if
> > the VMM is attempting to synchronize. This is problematic, as it depends
> > on host userspace writing to the guest's TSC within 1 second of the last
> > write.
> >
> > A much cleaner approach to configuring the guest's views of the TSC is to
> > simply migrate the TSC offset for every vCPU. Offsets are idempotent,
> > and thus not subject to change depending on when the VMM actually
> > reads/writes values from/to KVM. The VMM can then read the TSC once with
> > KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
> > the guest is paused.
> >
> > Cc: David Matlack 
> > Cc: Sean Christopherson 
> > Signed-off-by: Oliver Upton 
> 
> Could you please squash the following into this patch? We need to
> advertise KVM_CAP_VCPU_ATTRIBUTES to userspace. Otherwise, happy to
> resend.
> 
> Thanks,
> Oliver

Oliver,

Is there QEMU support for this, or are you using your own
userspace with this?



Re: [PATCH v7 3/6] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-08-20 Thread Marcelo Tosatti
On Mon, Aug 16, 2021 at 12:11:27AM +, Oliver Upton wrote:
> Handling the migration of TSCs correctly is difficult, in part because
> Linux does not provide userspace with the ability to retrieve a (TSC,
> realtime) clock pair for a single instant in time. In lieu of a more
> convenient facility, KVM can report similar information in the kvm_clock
> structure.
> 
> Provide userspace with a host TSC & realtime pair iff the realtime clock
> is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> realtime value, advance the KVM clock by the amount of elapsed time. Do
> not step the KVM clock backwards, though, as it is a monotonic
> oscillator.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Oliver Upton 

This is a good idea. Userspace could check whether the source and destination
host realtime clocks are within a certain difference of each other, and not
use the feature if they are not.

Is there a QEMU patch for it?
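
Something along these lines on the destination side is what I have in mind
(an illustrative sketch only, not a QEMU patch: src_realtime_ns would come
from the migration stream, captured on the source via KVM_GET_CLOCK with
KVM_CLOCK_REALTIME set, and max_skew_ns is a policy knob invented here):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Returns the number of ns to advance the kvmclock by, 0 to not advance
 * (never step the clock backwards), or -1 if the clocks differ too much
 * and the realtime-based advance should not be used. */
static int64_t kvmclock_advance_ns(uint64_t src_realtime_ns,
				   uint64_t max_skew_ns)
{
	struct timespec ts;
	uint64_t dst_realtime_ns;

	clock_gettime(CLOCK_REALTIME, &ts);
	dst_realtime_ns = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;

	if (dst_realtime_ns < src_realtime_ns)
		return 0;
	if (dst_realtime_ns - src_realtime_ns > max_skew_ns) {
		fprintf(stderr, "realtime clocks differ too much, not advancing kvmclock\n");
		return -1;
	}
	return (int64_t)(dst_realtime_ns - src_realtime_ns);
}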



Re: [PATCH v7 1/6] KVM: x86: Fix potential race in KVM_GET_CLOCK

2021-08-19 Thread Marcelo Tosatti
On Mon, Aug 16, 2021 at 12:11:25AM +, Oliver Upton wrote:
> Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock
> outside of the pvclock sync lock. This is problematic, as the clock
> value written to the user may or may not actually correspond to a stable
> TSC.
> 
> Fix the race by populating the entire kvm_clock_data structure behind
> the pvclock_gtod_sync_lock.

Oliver, 

Can you please describe the race in more detail?

Is it about the host TSC going unstable vs. a parallel KVM_GET_CLOCK?



Re: [RFC 1/3] cpuidle: add poll_source API

2021-07-19 Thread Marcelo Tosatti
Hi Stefan,

On Tue, Jul 13, 2021 at 05:19:04PM +0100, Stefan Hajnoczi wrote:
> Introduce an API for adding cpuidle poll callbacks:
> 
>   struct poll_source_ops {
>   void (*start)(struct poll_source *src);
>   void (*stop)(struct poll_source *src);
>   void (*poll)(struct poll_source *src);
>   };
> 
>   int poll_source_register(struct poll_source *src);
>   int poll_source_unregister(struct poll_source *src);
> 
> When cpuidle enters the poll state it invokes ->start() and then invokes
> ->poll() repeatedly from the busy wait loop. Finally ->stop() is invoked
> when the busy wait loop finishes.
> 
> The ->poll() function should check for activity and cause
> TIF_NEED_RESCHED to be set in order to stop the busy wait loop.
> 
> This API is intended to be used by drivers that can cheaply poll for
> events. Participating in cpuidle polling allows them to avoid interrupt
> latencies during periods where the CPU is going to poll anyway.
> 
> Note that each poll_source is bound to a particular CPU. The API is
> mainly intended to by used by drivers that have multiple queues with irq
> affinity.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  drivers/cpuidle/Makefile  |  1 +
>  include/linux/poll_source.h   | 53 +++
>  drivers/cpuidle/poll_source.c | 99 +++
>  drivers/cpuidle/poll_state.c  |  6 +++
>  4 files changed, 159 insertions(+)
>  create mode 100644 include/linux/poll_source.h
>  create mode 100644 drivers/cpuidle/poll_source.c
> 
> diff --git a/drivers/cpuidle/Makefile b/drivers/cpuidle/Makefile
> index 26bbc5e74123..994f72d6fe95 100644
> --- a/drivers/cpuidle/Makefile
> +++ b/drivers/cpuidle/Makefile
> @@ -7,6 +7,7 @@ obj-y += cpuidle.o driver.o governor.o sysfs.o governors/
>  obj-$(CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED) += coupled.o
>  obj-$(CONFIG_DT_IDLE_STATES)   += dt_idle_states.o
>  obj-$(CONFIG_ARCH_HAS_CPU_RELAX)   += poll_state.o
> +obj-$(CONFIG_ARCH_HAS_CPU_RELAX)   += poll_source.o
>  obj-$(CONFIG_HALTPOLL_CPUIDLE) += cpuidle-haltpoll.o
>  
>  
> ##
> diff --git a/include/linux/poll_source.h b/include/linux/poll_source.h
> new file mode 100644
> index ..ccfb424e170b
> --- /dev/null
> +++ b/include/linux/poll_source.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * poll_source.h - cpuidle busy waiting API
> + */
> +#ifndef __LINUX_POLLSOURCE_H__
> +#define __LINUX_POLLSOURCE_H__
> +
> +#include 
> +
> +struct poll_source;
> +
> +struct poll_source_ops {
> + void (*start)(struct poll_source *src);
> + void (*stop)(struct poll_source *src);
> + void (*poll)(struct poll_source *src);
> +};
> +
> +struct poll_source {
> + const struct poll_source_ops *ops;
> + struct list_head node;
> + int cpu;
> +};
> +
> +/**
> + * poll_source_register - Add a poll_source for a CPU
> + */
> +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> +int poll_source_register(struct poll_source *src);
> +#else
> +static inline int poll_source_register(struct poll_source *src)
> +{
> + return 0;
> +}
> +#endif
> +
> +/**
> + * poll_source_unregister - Remove a previously registered poll_source
> + */
> +#if defined(CONFIG_CPU_IDLE) && defined(CONFIG_ARCH_HAS_CPU_RELAX)
> +int poll_source_unregister(struct poll_source *src);
> +#else
> +static inline int poll_source_unregister(struct poll_source *src)
> +{
> + return 0;
> +}
> +#endif
> +
> +/* Used by the cpuidle driver */
> +void poll_source_start(void);
> +void poll_source_run_once(void);
> +void poll_source_stop(void);
> +
> +#endif /* __LINUX_POLLSOURCE_H__ */
> diff --git a/drivers/cpuidle/poll_source.c b/drivers/cpuidle/poll_source.c
> new file mode 100644
> index ..46100e5a71e4
> --- /dev/null
> +++ b/drivers/cpuidle/poll_source.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * poll_source.c - cpuidle busy waiting API
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +/* The per-cpu list of registered poll sources */
> +DEFINE_PER_CPU(struct list_head, poll_source_list);
> +
> +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> +void poll_source_start(void)
> +{
> + struct poll_source *src;
> +
> + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> + src->ops->start(src);
> +}
> +
> +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> +void poll_source_run_once(void)
> +{
> + struct poll_source *src;
> +
> + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> + src->ops->poll(src);
> +}
> +
> +/* Called from idle task with TIF_POLLING_NRFLAG set and irqs enabled */
> +void poll_source_stop(void)
> +{
> + struct poll_source *src;
> +
> + list_for_each_entry(src, this_cpu_ptr(&poll_source_list), node)
> + src->ops->stop(src);
> +}
> 
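
As an aside, this is how I picture a driver using the API (a hypothetical
example, not part of this series: my_queue and its callbacks are made-up
names; each queue registers a poll_source on the CPU its irq is affine to):

#include <linux/kernel.h>
#include <linux/poll_source.h>

struct my_queue {
	struct poll_source poll_src;
	/* ... queue state ... */
};

static void my_queue_poll(struct poll_source *src)
{
	struct my_queue *q = container_of(src, struct my_queue, poll_src);

	/*
	 * Check for completed work here; on activity, wake the consumer
	 * so TIF_NEED_RESCHED gets set and the idle poll loop stops.
	 */
	(void)q;
}

static void my_queue_poll_start(struct poll_source *src) { }
static void my_queue_poll_stop(struct poll_source *src) { }

static const struct poll_source_ops my_queue_poll_ops = {
	.start = my_queue_poll_start,
	.stop  = my_queue_poll_stop,
	.poll  = my_queue_poll,
};

static int my_queue_enable_polling(struct my_queue *q, int cpu)
{
	q->poll_src.ops = &my_queue_poll_ops;
	q->poll_src.cpu = cpu;
	return poll_source_register(&q->poll_src);
}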

Re: constant_tsc support for SVM guest

2021-04-26 Thread Marcelo Tosatti
On Sun, Apr 25, 2021 at 12:19:11AM -0500, Wei Huang wrote:
> 
> 
> On 4/23/21 4:27 PM, Eduardo Habkost wrote:
> > On Fri, Apr 23, 2021 at 12:32:00AM -0500, Wei Huang wrote:
> > > There was a customer request for const_tsc support on AMD guests. Right 
> > > now
> > > this feature is turned off by default for QEMU x86 CPU types (in
> > > CPUID_Fn8007_EDX[8]). However we are seeing a discrepancy in guest VM
> > > behavior between Intel and AMD.
> > > 
> > > In Linux kernel, Intel x86 code enables X86_FEATURE_CONSTANT_TSC based on
> > > vCPU's family & model. So it ignores CPUID_Fn8007_EDX[8] and guest VMs
> > > have const_tsc enabled. On AMD, however, the kernel checks
> > > CPUID_Fn8007_EDX[8]. So const_tsc is disabled on AMD by default.
> > 
> > Oh.  This seems to defeat the purpose of the invtsc migration
> > blocker we have.
> > 
> > Do we know when this behavior was introduced in Linux?
> 
> This code has existed in the kernel for a long time:
> 
>   commit 2b16a2353814a513cdb5c5c739b76a19d7ea39ce
>   Author: Andi Kleen 
>   Date:   Wed Jan 30 13:32:40 2008 +0100
> 
>  x86: move X86_FEATURE_CONSTANT_TSC into early cpu feature detection
> 
> There was another related commit which might explain the reasoning of
> turning on CONSTANT_TSC based on CPU family on Intel:
> 
>   commit 40fb17152c50a69dc304dd632131c2f41281ce44
>   Author: Venki Pallipadi 
>   Date:   Mon Nov 17 16:11:37 2008 -0800
> 
>  x86: support always running TSC on Intel CPUs
> 
> According to the commit above, there are two kernel features: CONSTANT_TSC
> and NONSTOP_TSC:
> 
>   * CONSTANT_TSC: TSC runs at constant rate
>   * NONSTOP_TSC: TSC not stop in deep C-states
> 
> If CPUID_Fn8007_EDX[8] == 1, both CONSTANT_TSC and NONSTOP_TSC are
> turned on. This applies to all x86 vendors. For Intel CPU with certain CPU
> families (i.e. c->x86 == 0x6 && c->x86_model >= 0x0e), it will turn on
> CONSTANT_TSC (but no NONSTOP_TSC) with CPUID_Fn8007_EDX[8]=0.
> 
> I believe the migration blocker was created for the CONSTANT_TSC case: if
> vCPU claims to have a constant TSC rate, we have to make sure src/dest are
> matched with each other (having the same CPU frequency or have tsc_ratio).
> NONSTOP_TSC doesn't matter in this scope.
>
> > > I am thinking turning on invtsc for EPYC CPU types (see example below). 
> > > Most
> > > AMD server CPUs have supported invariant TSC for a long time. So this 
> > > change
> > > is compatible with the hardware behavior. The only problem is live 
> > > migration
> > > support, which will be blocked because of invtsc. 

It should be blocked if migrating to a host with a different TSC frequency
and without TscRateMsr, if one desires the "constant TSC rate" meaning
to be maintained.

> > > However this problem
> > > should be considered very minor because most server CPUs support 
> > > TscRateMsr
> > > (see CPUID_Fn800A_EDX[4]), allowing VMs to migrate among CPUs with
> > > different TSC rates. This live migration restriction can be lifted as long
> > > as the destination supports TscRateMsr or has the same frequency as the
> > > source (QEMU/libvirt do it).
> > > 
> > > [BTW I believe this migration limitation might be unnecessary because it 
> > > is
> > > apparently OK for Intel guests to ignore invtsc while claiming const_tsc.
> > > Have anyone reported issues?]

Not as far as I know.

The fact is that libvirt will set the TSC frequency (from the value
returned by the KVM_GET_TSC_KHZ ioctl).

That could be done inside QEMU itself, maybe by specifying -cpu
AAA,cpu-freq=auto?

https://www.spinics.net/linux/fedora/libvir/msg141570.html




Re: constant_tsc support for SVM guest

2021-04-26 Thread Marcelo Tosatti


Hi Wei, Eduardo,

On Fri, Apr 23, 2021 at 05:27:44PM -0400, Eduardo Habkost wrote:
> On Fri, Apr 23, 2021 at 12:32:00AM -0500, Wei Huang wrote:
> > There was a customer request for const_tsc support on AMD guests. Right now
> > this feature is turned off by default for QEMU x86 CPU types (in
> > CPUID_Fn8007_EDX[8]). However we are seeing a discrepancy in guest VM
> > behavior between Intel and AMD.
> > 
> > In Linux kernel, Intel x86 code enables X86_FEATURE_CONSTANT_TSC based on
> > vCPU's family & model. So it ignores CPUID_Fn8007_EDX[8] and guest VMs
> > have const_tsc enabled. On AMD, however, the kernel checks
> > CPUID_Fn8007_EDX[8]. So const_tsc is disabled on AMD by default.

EAX=8000_0007h: Advanced Power Management Information
This function provides advanced power management feature identifiers. 
EDX bit 8 indicates support for invariant TSC. 

Intel documentation states:

"The time stamp counter in newer processors may support an enhancement,
referred to as invariant TSC. Processor's support for invariant TSC
is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run
at a constant rate in all ACPI P-, C-. and T-states. This is the
architectural behavior moving forward. On processors with invariant TSC
support, the OS may use the TSC for wall clock timer services (instead
of ACPI or HPET timers). TSC reads are much more efficient and do not
incur the overhead associated with a ring transition or access to a
platform resource."

X86_FEATURE_NONSTOP_TSC is enabled (on both Intel and AMD) by checking
the CPUID_Fn8000_0007_EDX[8] bit.
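
For completeness, a small userspace sketch (x86 only, GCC/Clang __get_cpuid)
that reads that same invariant TSC CPUID bit:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Extended leaf 0x80000007, EDX bit 8: invariant TSC. */
	if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx)) {
		printf("CPUID leaf 0x80000007 not available\n");
		return 1;
	}
	printf("invariant TSC: %s\n", (edx & (1u << 8)) ? "yes" : "no");
	return 0;
}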

> Oh.  This seems to defeat the purpose of the invtsc migration
> blocker we have.
> 
> Do we know when this behavior was introduced in Linux?
> 
> > 
> > I am thinking turning on invtsc for EPYC CPU types (see example below). Most
> > AMD server CPUs have supported invariant TSC for a long time. So this change
> > is compatible with the hardware behavior. The only problem is live migration
> > support, which will be blocked because of invtsc. However this problem
> > should be considered very minor because most server CPUs support TscRateMsr
> > (see CPUID_Fn800A_EDX[4]), allowing VMs to migrate among CPUs with
> > different TSC rates. This live migration restriction can be lifted as long
> > as the destination supports TscRateMsr or has the same frequency as the
> > source (QEMU/libvirt do it).

Yes.

> > [BTW I believe this migration limitation might be unnecessary because it is
> > apparently OK for Intel guests to ignore invtsc while claiming const_tsc.
> > Have anyone reported issues?]
> 
> CCing Marcelo, who originally added the migration blocker in QEMU.

The reasoning behind the migration blocker was to ensure that
the invariant TSC guarantee, as defined:

"The invariant TSC will run at a constant rate in all ACPI P-, C- and T-states"

would be maintained across migration.

> > 
> > Do I miss anything here? Any comments about the proposal?
> > 
> > Thanks,
> > -Wei
> > 
> > diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> > index ad99cad0e7..3c48266884 100644
> > --- a/target/i386/cpu.c
> > +++ b/target/i386/cpu.c
> > @@ -4077,6 +4076,21 @@ static X86CPUDefinition builtin_x86_defs[] = {
> >  { /* end of list */ }
> >  }
> >  },
> > +{
> > +.version = 4,
> > +.alias = "EPYC-IBPB",
> > +.props = (PropValue[]) {
> > +{ "ibpb", "on" },
> > +{ "perfctr-core", "on" },
> > +{ "clzero", "on" },
> > +{ "xsaveerptr", "on" },
> > +{ "xsaves", "on" },
> 
> You don't need to copy the properties from the previous version.
> The properties of version N are applied on top of the properties
> of version (N-1).
> 
> > +{ "invtsc", "on" },
> > +{ "model-id",
> > +  "AMD EPYC Processor" },
> > +{ /* end of list */ }
> > +}
> > +},
> >  { /* end of list */ }
> >  }
> >  },
> > @@ -4189,6 +4203,15 @@ static X86CPUDefinition builtin_x86_defs[] = {
> >  { /* end of list */ }
> >  }
> >  },
> > +{
> > +.version = 3,
> > +.props = (PropValue[]) {
> > +{ "ibrs", "on" },
> > +{ "amd-ssbd", "on" },
> > +{ "invtsc", "on" },
> > +{ /* end of list */ }
> > +}
> > +},
> >  { /* end of list */ }
> >  }
> >  },
> > @@ -4246,6 +4269,17 @@ static X86CPUDefinition builtin_x86_defs[] = {
> >  .xlevel = 0x801E,
> >  .model_id = "AMD EPYC-Milan Processor",
> >  .cache_info = _milan_cache_info,
> > +.versions = (X86CPUVersionDefinition[]) {
> > +{ .version = 1 },
> > + 

[PATCH v6] hrtimer: avoid retrigger_next_event IPI

2021-04-19 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v6:
  - Do not take softirq_raised into account (Peter Xu).
  - Include BOOTTIME as base that requires IPI (Thomas).
  - Unconditional reprogram on resume path, since there is
nothing to gain in such path anyway.

v5:
  - Add missing hrtimer_update_base (Peter Xu).

v4:
   - Drop unused code (Thomas).

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).


diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index bb5e7b0a4274..14a6e449b221 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -318,7 +318,7 @@ struct clock_event_device;
 
 extern void hrtimer_interrupt(struct clock_event_device *dev);
 
-extern void clock_was_set_delayed(void);
+extern void clock_was_set_delayed(bool force_reprogram);
 
 extern unsigned int hrtimer_resolution;
 
@@ -326,7 +326,7 @@ extern unsigned int hrtimer_resolution;
 
 #define hrtimer_resolution (unsigned int)LOW_RES_NSEC
 
-static inline void clock_was_set_delayed(void) { }
+static inline void clock_was_set_delayed(bool force_reprogram) { }
 
 #endif
 
@@ -351,7 +351,7 @@ hrtimer_expires_remaining_adjusted(const struct hrtimer 
*timer)
timer->base->get_time());
 }
 
-extern void clock_was_set(void);
+extern void clock_was_set(bool);
 #ifdef CONFIG_TIMERFD
 extern void timerfd_clock_was_set(void);
 #else
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..2258782fd714 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -758,9 +758,17 @@ static void hrtimer_switch_to_hres(void)
retrigger_next_event(NULL);
 }
 
+static void clock_was_set_force_reprogram_work(struct work_struct *work)
+{
+   clock_was_set(true);
+}
+
+static DECLARE_WORK(hrtimer_force_reprogram_work, 
clock_was_set_force_reprogram_work);
+
+
 static void clock_was_set_work(struct work_struct *work)
 {
-   clock_was_set();
+   clock_was_set(false);
 }
 
 static DECLARE_WORK(hrtimer_work, clock_was_set_work);
@@ -769,9 +777,12 @@ static DECLARE_WORK(hrtimer_work, clock_was_set_work);
  * Called from timekeeping and resume code to reprogram the hrtimer
  * interrupt device on all cpus.
  */
-void clock_was_set_delayed(void)
+void clock_was_set_delayed(bool force_reprogram)
 {
-   schedule_work(&hrtimer_work);
+   if (force_reprogram)
+   schedule_work(&hrtimer_force_reprogram_work);
+   else
+   schedule_work(&hrtimer_work);
 }
 
 #else
@@ -871,6 +882,18 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT) |\
+(1U << HRTIMER_BASE_BOOTTIME) |\
+(1U << HRTIMER_BASE_BOOTTIME_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -882,11 +905,42 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
  * resolution timer interrupts. On UP we just disable interrupts and
  * call the high resolution interrupt code.
  */
-void clock_was_set(void)
+void clock_was_set(bool force_reprogram)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (force_reprogram == true) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+  

Re: [PATCH v5] hrtimer: avoid retrigger_next_event IPI

2021-04-19 Thread Marcelo Tosatti
On Sat, Apr 17, 2021 at 06:51:08PM +0200, Thomas Gleixner wrote:
> On Sat, Apr 17 2021 at 18:24, Thomas Gleixner wrote:
> > On Fri, Apr 16 2021 at 13:13, Peter Xu wrote:
> >> On Fri, Apr 16, 2021 at 01:00:23PM -0300, Marcelo Tosatti wrote:
> >>>  
> >>> +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) | \
> >>> +  (1U << HRTIMER_BASE_REALTIME_SOFT) |   \
> >>> +  (1U << HRTIMER_BASE_TAI) | \
> >>> +  (1U << HRTIMER_BASE_TAI_SOFT))
> >>> +
> >>> +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> >>> +{
> >>> + if (cpu_base->softirq_activated)
> >>> + return true;
> >>
> >> A pure question on whether this check is needed...
> >>
> >> Here even if softirq_activated==1 (as softirq is going to happen), as long 
> >> as
> >> (cpu_base->active_bases & CLOCK_SET_BASES)==0, shouldn't it already mean 
> >> that
> >> "yes indeed clock was set, but no need to kick this cpu as no relevant 
> >> timer"?
> >> As that question seems to be orthogonal to whether a softirq is going to
> >> trigger on that cpu.
> >
> > That's correct and it's not any different from firing the IPI because in
> > both cases the update happens with the base lock of the CPU in question
> > held. And if there are no active timers in any of the affected bases,
> > then there is no need to reevaluate the next expiry because the offset
> > update does not affect any armed timers. It just makes sure that the
> > next enqueue of a timer on such a base will see the correct offset.
> >
> > I'll just zap it.
> 
> But the whole thing is still wrong in two aspects:
> 
> 1) BOOTTIME can be one of the affected clocks when sleep time
>(suspended time) is injected because that uses the same mechanism.
> 
>Sorry for missing that earlier when I asked to remove it, but
>that's trivial to fix by adding the BOOTTIME base back.
> 
> 2) What's worse is that on resume this might break because that
>mechanism is also used to enforce the reprogramming of the clock
>event devices and there we cannot be selective on clock bases.
> 
>I need to dig deeper into that because suspend/resume has changed
>a lot over time, so this might be just a historical leftover. But
>without proper analysis we might end up with subtle and hard to
>debug wreckage.
> 
> Thanks,
> 
> tglx

Thomas,

There is no gain in avoiding the IPIs for the suspend/resume case 
(since suspending is a large interruption anyway). To avoid 
the potential complexity (and associated bugs), one option would 
be to NOT skip IPIs for the resume case.

Sending -v6 with that (and other suggestions/fixes).



[PATCH v5] hrtimer: avoid retrigger_next_event IPI

2021-04-16 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v5:
  - Add missing hrtimer_update_base (Peter Xu).

v4:
   - Drop unused code (Thomas).

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).


diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..06fcc272e28d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,19 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   if (cpu_base->softirq_activated)
+   return true;
+
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,8 +898,34 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   else
+   hrtimer_update_base(cpu_base);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   preempt_disable();
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   cpus_read_unlock();
+   free_cpumask_var(mask);
+set_timerfd:
 #endif
timerfd_clock_was_set();
 }



[PATCH v4] hrtimer: avoid retrigger_next_event IPI

2021-04-15 Thread Marcelo Tosatti
Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v4:
   - Drop unused code (Thomas).

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..e228c0a0c98f 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,19 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   if (cpu_base->softirq_activated)
+   return true;
+
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,8 +898,32 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   preempt_disable();
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   cpus_read_unlock();
+   free_cpumask_var(mask);
+set_timerfd:
 #endif
timerfd_clock_was_set();
 }



[PATCH v3] hrtimer: avoid retrigger_next_event IPI

2021-04-15 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
the clock event device.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the CLOCK_REALTIME and CLOCK_TAI bases. If
that's not the case, update the realtime and TAI base offsets remotely and
skip the IPI. This ensures that any subsequently armed timers on
CLOCK_REALTIME and CLOCK_TAI are evaluated with the correct offsets.

Signed-off-by: Marcelo Tosatti 

---

v3:
   - Nicer changelog  (Thomas).
   - Code style fixes (Thomas).
   - Compilation warning with CONFIG_HIGH_RES_TIMERS=n (Thomas).
   - Shrink preemption disabled section (Thomas).

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..dd9c0d2f469f 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,24 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME) |   \
+(1U << HRTIMER_BASE_REALTIME_SOFT) |   \
+(1U << HRTIMER_BASE_TAI) | \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   unsigned int active = 0;
+
+   if (cpu_base->softirq_activated)
+   return true;
+
+   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
+
+   return (cpu_base->active_bases & CLOCK_SET_BASES) != 0;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,8 +903,32 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   cpus_read_lock();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   preempt_disable();
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   cpus_read_unlock();
+   free_cpumask_var(mask);
+set_timerfd:
 #endif
timerfd_clock_was_set();
 }






[PATCH v2] hrtimer: avoid retrigger_next_event IPI

2021-04-13 Thread Marcelo Tosatti



Setting the realtime clock triggers an IPI to all CPUs to reprogram
hrtimers.

However, only realtime and TAI clocks have their offsets updated
(and therefore potentially require a reprogram).

Check if it only has monotonic active timers, and in that case 
update the realtime and TAI base offsets remotely, skipping the IPI.

This reduces interruptions to latency sensitive applications.

Signed-off-by: Marcelo Tosatti 

---

v2:
   - Only REALTIME and TAI bases are affected by offset-to-monotonic changes 
(Thomas).
   - Don't special case nohz_full CPUs (Thomas).
   

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5c9d968187ae..be21b85c679d 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -871,6 +871,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
+(1U << HRTIMER_BASE_REALTIME_SOFT)|\
+(1U << HRTIMER_BASE_TAI)|  \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   unsigned int active = 0;
+
+   if (cpu_base->softirq_activated)
+   return true;
+
+   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+
+   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
+
+   if ((active & CLOCK_SET_BASES) == 0)
+   return false;
+
+   return true;
+}
+
 /*
  * Clock realtime was set
  *
@@ -885,9 +907,31 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting CPUs if possible */
+   preempt_disable();
+   for_each_online_cpu(cpu) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   free_cpumask_var(mask);
 #endif
+set_timerfd:
timerfd_clock_was_set();
 }



Re: [PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-09 Thread Marcelo Tosatti


+CC Anna-Maria.

On Fri, Apr 09, 2021 at 04:15:13PM +0200, Thomas Gleixner wrote:
> On Wed, Apr 07 2021 at 10:53, Marcelo Tosatti wrote:
> > Setting the realtime clock triggers an IPI to all CPUs to reprogram
> > hrtimers.
> >
> > However, only base, boottime and tai clocks have their offsets updated
> 
> base clock? 

Heh...

> And why boottime? Boottime is not affected by a clock
> realtime set. It's clock REALTIME and TAI, nothing else.

OK!

> > +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
> > +(1U << HRTIMER_BASE_REALTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_BOOTTIME)| \
> > +(1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_TAI)|  \
> > +(1U << HRTIMER_BASE_TAI_SOFT))
> > +
> > +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> > +{
> > +   unsigned int active = 0;
> > +
> > +   if (!cpu_base->softirq_activated)
> > +   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;

Again, if (cpu_base->softirq_activated), need to IPI (will resend).

> > +   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
> > +
> > +   if ((active & CLOCK_SET_BASES) == 0)
> > +   return false;
> > +
> > +   return true;
> > +}
> 
> Errm. 

What?

> > +   /* Avoid interrupting nohz_full CPUs if possible */
> > +   preempt_disable();
> > +   for_each_online_cpu(cpu) {
> > +   if (tick_nohz_full_cpu(cpu)) {
> > +   unsigned long flags;
> > +   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> > +
> > +   raw_spin_lock_irqsave(&cpu_base->lock, flags);
> > +   if (need_reprogram_timer(cpu_base))
> > +   cpumask_set_cpu(cpu, mask);
> > +   else
> > +   hrtimer_update_base(cpu_base);
> > +   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> > +   }
> > +   }
> 
> How is that supposed to be correct?
> 
> CPU0  CPU1
> 
> clock_was_set() hrtimer_start(CLOCK_REALTIME)
> 
>   if (!active_mask[CPU1] & XXX)
>   continue;
> active_mask |= REALTIME;
> 
> ---> fail because that newly started timer is on the old offset.

CPU0                                    CPU1

clock_was_set()
                                        Case-1: CPU-1 grabs base->lock before CPU-0:
                                        CPU-0 sees active_mask[CPU1] and IPIs.

                                        base = lock_hrtimer_base(timer, &flags);
                                        if (__hrtimer_start_range_ns(timer, tim, ...
                                                hrtimer_reprogram(timer, true);
                                        unlock_hrtimer_base(timer, &flags);

raw_spin_lock_irqsave(&cpu_base->lock, flags);
if (need_reprogram_timer(cpu_base))
        cpumask_set_cpu(cpu, mask);
else
        hrtimer_update_base(cpu_base);
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);

                                        Case-2: CPU-1 grabs base->lock after CPU-0:
                                        CPU-0 will have updated the offsets remotely.

                                        base = lock_hrtimer_base(timer, &flags);
                                        if (__hrtimer_start_range_ns(timer, tim, ...
                                                hrtimer_reprogram(timer, true);
                                        unlock_hrtimer_base(timer, &flags);

No?



Re: [PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-08 Thread Marcelo Tosatti
On Thu, Apr 08, 2021 at 12:14:57AM +0200, Frederic Weisbecker wrote:
> On Wed, Apr 07, 2021 at 10:53:01AM -0300, Marcelo Tosatti wrote:
> > 
> > Setting the realtime clock triggers an IPI to all CPUs to reprogram
> > hrtimers.
> > 
> > However, only base, boottime and tai clocks have their offsets updated
> > (and therefore potentially require a reprogram).
> > 
> > If the CPU is a nohz_full one, check if it only has 
> > monotonic active timers, and in that case update the 
> > realtime base offsets, skipping the IPI.
> > 
> > This reduces interruptions to nohz_full CPUs.
> > 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> > index 743c852e10f2..b42b1a434b22 100644
> > --- a/kernel/time/hrtimer.c
> > +++ b/kernel/time/hrtimer.c
> > @@ -853,6 +853,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, 
> > bool reprogram)
> > tick_program_event(expires, 1);
> >  }
> >  
> > +#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
> > +(1U << HRTIMER_BASE_REALTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_BOOTTIME)| \
> > +(1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
> > +(1U << HRTIMER_BASE_TAI)|  \
> > +(1U << HRTIMER_BASE_TAI_SOFT))
> > +
> > +static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
> > +{
> > +   unsigned int active = 0;
> > +
> > +   if (!cpu_base->softirq_activated)
> > +   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;

If cpu_base->softirq_activated == 1, should IPI as well.

> > +   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
> > +
> > +   if ((active & CLOCK_SET_BASES) == 0)
> > +   return false;
> > +
> > +   return true;
> > +}
> > +
> >  /*
> >   * Clock realtime was set
> >   *
> > @@ -867,9 +889,41 @@ static void hrtimer_reprogram(struct hrtimer *timer, 
> > bool reprogram)
> >  void clock_was_set(void)
> >  {
> >  #ifdef CONFIG_HIGH_RES_TIMERS
> > -   /* Retrigger the CPU local events everywhere */
> > -   on_each_cpu(retrigger_next_event, NULL, 1);
> > +   cpumask_var_t mask;
> > +   int cpu;
> > +
> > +   if (!tick_nohz_full_enabled()) {
> > +   /* Retrigger the CPU local events everywhere */
> > +   on_each_cpu(retrigger_next_event, NULL, 1);
> > +   goto set_timerfd;
> > +   }
> > +
> > +   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
> > +   on_each_cpu(retrigger_next_event, NULL, 1);
> > +   goto set_timerfd;
> > +   }
> > +
> > +   /* Avoid interrupting nohz_full CPUs if possible */
> > +   preempt_disable();
> > +   for_each_online_cpu(cpu) {
> > +   if (tick_nohz_full_cpu(cpu)) {
> > +   unsigned long flags;
> > +   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
> > +
> > +   raw_spin_lock_irqsave(&cpu_base->lock, flags);
> > +   if (need_reprogram_timer(cpu_base))
> > +   cpumask_set_cpu(cpu, mask);
> > +   else
> > +   hrtimer_update_base(cpu_base);
> > +   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
> > +   }
> 
> You forgot to add the housekeeping CPUs to the mask.

So people are using:

console=tty0 console=ttyS0,115200n8 skew_tick=1 nohz=on rcu_nocbs=8-31 
tuned.non_isolcpus=00ff intel_pstate=disable nosoftlockup tsc=nowatchdog 
intel_iommu=on iommu=pt isolcpus=managed_irq,8-31 
systemd.cpu_affinity=0,1,2,3,4,5,6,7 default_hugepagesz=1G hugepagesz=2M 
hugepages=128 nohz_full=8-31

And they use the nohz_full= CPUs (or subsets of the nohz_full= CPUs) in two modes:

either for "generic non-isolated applications"
(with load balancing enabled on those CPUs), or for
latency-sensitive applications, switching between the two modes.

In this case, it would only be possible to check for
housekeeping CPUs of type MANAGED_IRQ, which would be strange.

> As for the need_reprogram_timer() trick, I'll rather defer to Thomas review...
> 
> Thanks.

Thanks!

> 
> > +   }
> > +
> > +   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
> > +   preempt_enable();
> > +   free_cpumask_var(mask);
> >  #endif
> > +set_timerfd:
> > timerfd_clock_was_set();
> >  }
> >  
> > 



Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections

2021-04-08 Thread Marcelo Tosatti
Hi Paolo,

On Thu, Apr 08, 2021 at 10:15:16AM +0200, Paolo Bonzini wrote:
> On 07/04/21 19:40, Marcelo Tosatti wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index fe806e894212..0a83eff40b43 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2562,10 +2562,12 @@ static void kvm_gen_update_masterclock(struct kvm 
> > > *kvm)
> > >   kvm_hv_invalidate_tsc_page(kvm);
> > > - spin_lock(&kvm->pvclock_gtod_sync_lock);
> > >   kvm_make_mclock_inprogress_request(kvm);
> > > +
> > Might be good to serialize against two kvm_gen_update_masterclock
> > callers? Otherwise one caller could clear KVM_REQ_MCLOCK_INPROGRESS,
> > while the other is still at pvclock_update_vm_gtod_copy().
> 
> Makes sense, but this stuff has always seemed unnecessarily complicated to
> me.
>
> KVM_REQ_MCLOCK_INPROGRESS is only needed to kick running vCPUs out of the
> execution loop; 

We do not want vcpus with different system_timestamp/tsc_timestamp
pair:

 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.

So KVM_REQ_MCLOCK_INPROGRESS also ensures that no vcpu enters 
guest mode (via vcpu->requests check before VM-entry) with a 
different system_timestamp/tsc_timestamp pair.

> clearing it in kvm_gen_update_masterclock is unnecessary,
> because KVM_REQ_CLOCK_UPDATE takes pvclock_gtod_sync_lock too and thus will
> already wait for pvclock_update_vm_gtod_copy to end.
> 
> I think it's possible to use a seqcount in KVM_REQ_CLOCK_UPDATE instead of
> KVM_REQ_MCLOCK_INPROGRESS.  Both cause the vCPUs to spin. I'll take a look.
> 
> Paolo



Re: [PATCH 1/2] KVM: x86: reduce pvclock_gtod_sync_lock critical sections

2021-04-07 Thread Marcelo Tosatti
On Tue, Mar 30, 2021 at 12:59:57PM -0400, Paolo Bonzini wrote:
> There is no need to include changes to vcpu->requests into
> the pvclock_gtod_sync_lock critical section.  The changes to
> the shared data structures (in pvclock_update_vm_gtod_copy)
> already occur under the lock.
> 
> Cc: David Woodhouse 
> Cc: Marcelo Tosatti 
> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/x86.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fe806e894212..0a83eff40b43 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2562,10 +2562,12 @@ static void kvm_gen_update_masterclock(struct kvm 
> *kvm)
>  
>   kvm_hv_invalidate_tsc_page(kvm);
>  
> > - spin_lock(&kvm->pvclock_gtod_sync_lock);
>   kvm_make_mclock_inprogress_request(kvm);
> +

Might be good to serialize against two kvm_gen_update_masterclock
callers? Otherwise one caller could clear KVM_REQ_MCLOCK_INPROGRESS,
while the other is still at pvclock_update_vm_gtod_copy().

Otherwise, looks good.



[PATCH] hrtimer: avoid retrigger_next_event IPI

2021-04-07 Thread Marcelo Tosatti


Setting the realtime clock triggers an IPI to all CPUs to reprogram
hrtimers.

However, only base, boottime and tai clocks have their offsets updated
(and therefore potentially require a reprogram).

If the CPU is a nohz_full one, check if it only has 
monotonic active timers, and in that case update the 
realtime base offsets, skipping the IPI.

This reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852e10f2..b42b1a434b22 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -853,6 +853,28 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
tick_program_event(expires, 1);
 }
 
+#define CLOCK_SET_BASES ((1U << HRTIMER_BASE_REALTIME)|\
+(1U << HRTIMER_BASE_REALTIME_SOFT)|\
+(1U << HRTIMER_BASE_BOOTTIME)| \
+(1U << HRTIMER_BASE_BOOTTIME_SOFT)|\
+(1U << HRTIMER_BASE_TAI)|  \
+(1U << HRTIMER_BASE_TAI_SOFT))
+
+static bool need_reprogram_timer(struct hrtimer_cpu_base *cpu_base)
+{
+   unsigned int active = 0;
+
+   if (!cpu_base->softirq_activated)
+   active = cpu_base->active_bases & HRTIMER_ACTIVE_SOFT;
+
+   active = active | (cpu_base->active_bases & HRTIMER_ACTIVE_HARD);
+
+   if ((active & CLOCK_SET_BASES) == 0)
+   return false;
+
+   return true;
+}
+
 /*
  * Clock realtime was set
  *
@@ -867,9 +889,41 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool 
reprogram)
 void clock_was_set(void)
 {
 #ifdef CONFIG_HIGH_RES_TIMERS
-   /* Retrigger the CPU local events everywhere */
-   on_each_cpu(retrigger_next_event, NULL, 1);
+   cpumask_var_t mask;
+   int cpu;
+
+   if (!tick_nohz_full_enabled()) {
+   /* Retrigger the CPU local events everywhere */
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   if (!zalloc_cpumask_var(&mask, GFP_KERNEL)) {
+   on_each_cpu(retrigger_next_event, NULL, 1);
+   goto set_timerfd;
+   }
+
+   /* Avoid interrupting nohz_full CPUs if possible */
+   preempt_disable();
+   for_each_online_cpu(cpu) {
+   if (tick_nohz_full_cpu(cpu)) {
+   unsigned long flags;
+   struct hrtimer_cpu_base *cpu_base = &per_cpu(hrtimer_bases, cpu);
+
+   raw_spin_lock_irqsave(&cpu_base->lock, flags);
+   if (need_reprogram_timer(cpu_base))
+   cpumask_set_cpu(cpu, mask);
+   else
+   hrtimer_update_base(cpu_base);
+   raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
+   }
+   }
+
+   smp_call_function_many(mask, retrigger_next_event, NULL, 1);
+   preempt_enable();
+   free_cpumask_var(mask);
 #endif
+set_timerfd:
timerfd_clock_was_set();
 }
 



Re: [RFC PATCH] i386: Add ratelimit for bus locks acquired in guest

2021-03-19 Thread Marcelo Tosatti
On Fri, Mar 19, 2021 at 10:59:20AM +0800, Chenyi Qiang wrote:
> Hi Marcelo,
> 
> Thank you for your comment.
> 
> On 3/19/2021 1:32 AM, Marcelo Tosatti wrote:
> > On Wed, Mar 17, 2021 at 04:47:09PM +0800, Chenyi Qiang wrote:
> > > Virtual Machines can exploit bus locks to degrade the performance of
> > > system. To address this kind of performance DOS attack, bus lock VM exit
> > > is introduced in KVM and it will report the bus locks detected in guest,
> > > which can help userspace to enforce throttling policies.
> > 
> > > 
> > > The availability of bus lock VM exit can be detected through the
> > > KVM_CAP_X86_BUS_LOCK_EXIT. The returned bitmap contains the potential
> > > policies supported by KVM. The field KVM_BUS_LOCK_DETECTION_EXIT in
> > > bitmap is the only supported strategy at present. It indicates that KVM
> > > will exit to userspace to handle the bus locks.
> > > 
> > > This patch adds a ratelimit on the bus locks acquired in guest as a
> > > mitigation policy.
> > > 
> > > Introduce a new field "bld" to record the limited speed of bus locks in
> > > target VM. The user can specify it through the "bus-lock-detection"
> > > as a machine property. In current implementation, the default value of
> > > the speed is 0 per second, which means no restriction on the bus locks.
> > > 
> > > Ratelimit enforced in data transmission uses a time slice of 100ms to
> > > get smooth output during regular operations in block jobs. As for
> > > ratelimit on bus lock detection, simply set the ratelimit interval to 1s
> > > and restrict the quota of bus lock occurrence to the value of "bld". A
> > > potential alternative is to introduce the time slice as a property
> > > which can help the user achieve more precise control.
> > > 
> > > The detail of Bus lock VM exit can be found in spec:
> > > https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
> > > 
> > > Signed-off-by: Chenyi Qiang 
> > > ---
> > >   hw/i386/x86.c |  6 ++
> > >   include/hw/i386/x86.h |  7 +++
> > >   target/i386/kvm/kvm.c | 44 +++
> > >   3 files changed, 57 insertions(+)
> > > 
> > > diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> > > index 7865660e2c..a70a259e97 100644
> > > --- a/hw/i386/x86.c
> > > +++ b/hw/i386/x86.c
> > > @@ -1209,6 +1209,12 @@ static void x86_machine_initfn(Object *obj)
> > >   x86ms->acpi = ON_OFF_AUTO_AUTO;
> > >   x86ms->smp_dies = 1;
> > >   x86ms->pci_irq_mask = ACPI_BUILD_PCI_IRQS;
> > > +x86ms->bld = 0;
> > > +
> > > +object_property_add_uint64_ptr(obj, "bus-lock-detection",
> > > +   &x86ms->bld, OBJ_PROP_FLAG_READWRITE);
> > > +object_property_set_description(obj, "bus-lock-detection",
> > > +"Bus lock detection ratelimit");
> > >   }
> > >   static void x86_machine_class_init(ObjectClass *oc, void *data)
> > > diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
> > > index 56080bd1fb..1f0ffbcfb9 100644
> > > --- a/include/hw/i386/x86.h
> > > +++ b/include/hw/i386/x86.h
> > > @@ -72,6 +72,13 @@ struct X86MachineState {
> > >* will be translated to MSI messages in the address space.
> > >*/
> > >   AddressSpace *ioapic_as;
> > > +
> > > +/*
> > > + * ratelimit enforced on detected bus locks, the default value
> > > + * is 0 per second
> > > + */
> > > +uint64_t bld;
> > > +RateLimit bld_limit;
> > >   };
> > >   #define X86_MACHINE_SMM  "smm"
> > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > > index c8d61daf68..724862137d 100644
> > > --- a/target/i386/kvm/kvm.c
> > > +++ b/target/i386/kvm/kvm.c
> > > @@ -130,6 +130,8 @@ static bool has_msr_mcg_ext_ctl;
> > >   static struct kvm_cpuid2 *cpuid_cache;
> > >   static struct kvm_msr_list *kvm_feature_msrs;
> > > +#define SLICE_TIME 1000000000ULL /* ns */
> > > +
> > >   int kvm_has_pit_state2(void)
> > >   {
> > >   return has_pit_state2;
> > > @@ -2267,6 +2269,27 @@ int kvm_arch_init(MachineState *ms, KVMSt

Re: [PATCH 3/3] i386: Make sure kvm_arch_set_tsc_khz() succeeds on migration when 'hv-reenlightenment' was exposed

2021-03-18 Thread Marcelo Tosatti
On Thu, Mar 18, 2021 at 05:38:00PM +0100, Vitaly Kuznetsov wrote:
> Paolo Bonzini  writes:
> 
> > On 18/03/21 17:02, Vitaly Kuznetsov wrote:
> >> KVM doesn't fully support Hyper-V reenlightenment notifications on
> >> migration. In particular, it doesn't support emulating TSC frequency
> >> of the source host by trapping all TSC accesses so unless TSC scaling
> >> is supported on the destination host and KVM_SET_TSC_KHZ succeeds, it
> >> is unsafe to proceed with migration.
> >> 
> >> Normally, we only require KVM_SET_TSC_KHZ to succeed when 'user_tsc_khz'
> >> was set and just 'try' KVM_SET_TSC_KHZ without otherwise.
> >> 
> >> Introduce a new vmstate section (which is added when the guest has
> >> reenlightenment feature enabled) and add env.tsc_khz to it. We already
> >> have env.tsc_khz packed in 'cpu/tsc_khz' but we don't want to be dependent
> >> on the section order.
> >> 
> >> Signed-off-by: Vitaly Kuznetsov 
> >
> > Could we instead fail to load the reenlightenment section if 
> > user_tsc_khz was not set?  This seems to be user (well, management) 
> > error really, since reenlightenment has to be enabled manually (or with 
> > hv-passthrough which blocks migration too).

Seems to match the strategy of the patchset...

> Yes, we certainly could do that but what's the added value of
> user_tsc_khz which upper layer will have to set explicitly (probably to
> the tsc frequency of the source host anyway)?

Yes. I think what happened was "evolution":

1) Support was added to set the guest TSC frequency (with the hardware
multiplier) in KVM, and the -tsc-khz VAL (kHz) option was added to QEMU.

2) Scaling is enabled only if -tsc-khz VAL is supplied.

3) libvirt switched to using -tsc-khz HVAL, where HVAL is the value it
retrieves via KVM_GET_TSC_KHZ from a newly created KVM_CREATE_VM instance.

It could have been done inside QEMU instead (see the sketch below).
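
Roughly, libvirt does something like the following today (sketch only;
error handling and the KVM_CAP_GET_TSC_KHZ capability check are omitted
for brevity), and QEMU could do the same internally:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	/* Create a throwaway VM/vCPU just to query the host TSC frequency. */
	int kvm = open("/dev/kvm", O_RDWR);
	int vm = ioctl(kvm, KVM_CREATE_VM, 0);
	int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
	long tsc_khz = ioctl(vcpu, KVM_GET_TSC_KHZ, 0);

	printf("host TSC frequency: %ld kHz\n", tsc_khz);
	return 0;
}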

> In case we just want to avoid calling KVM_SET_TSC_KHZ twice, we can probably 
> achieve that by
> adding a CPU flag or something.

Avoid calling KVM_SET_TSC_KHZ twice? I don't see why you would avoid
that.




Re: [RFC PATCH] i386: Add ratelimit for bus locks acquired in guest

2021-03-18 Thread Marcelo Tosatti
On Wed, Mar 17, 2021 at 04:47:09PM +0800, Chenyi Qiang wrote:
> Virtual Machines can exploit bus locks to degrade the performance of
> system. To address this kind of performance DOS attack, bus lock VM exit
> is introduced in KVM and it will report the bus locks detected in guest,
> which can help userspace to enforce throttling policies.

> 
> The availability of bus lock VM exit can be detected through the
> KVM_CAP_X86_BUS_LOCK_EXIT. The returned bitmap contains the potential
> policies supported by KVM. The field KVM_BUS_LOCK_DETECTION_EXIT in
> bitmap is the only supported strategy at present. It indicates that KVM
> will exit to userspace to handle the bus locks.
> 
> This patch adds a ratelimit on the bus locks acquired in guest as a
> mitigation policy.
> 
> Introduce a new field "bld" to record the limited speed of bus locks in
> target VM. The user can specify it through the "bus-lock-detection"
> as a machine property. In current implementation, the default value of
> the speed is 0 per second, which means no restriction on the bus locks.
> 
> Ratelimit enforced in data transmission uses a time slice of 100ms to
> get smooth output during regular operations in block jobs. As for
> ratelimit on bus lock detection, simply set the ratelimit interval to 1s
> and restrict the quota of bus lock occurrence to the value of "bld". A
> potential alternative is to introduce the time slice as a property
> which can help the user achieve more precise control.
> 
> The detail of Bus lock VM exit can be found in spec:
> https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
> 
> Signed-off-by: Chenyi Qiang 
> ---
>  hw/i386/x86.c |  6 ++
>  include/hw/i386/x86.h |  7 +++
>  target/i386/kvm/kvm.c | 44 +++
>  3 files changed, 57 insertions(+)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 7865660e2c..a70a259e97 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -1209,6 +1209,12 @@ static void x86_machine_initfn(Object *obj)
>  x86ms->acpi = ON_OFF_AUTO_AUTO;
>  x86ms->smp_dies = 1;
>  x86ms->pci_irq_mask = ACPI_BUILD_PCI_IRQS;
> +x86ms->bld = 0;
> +
> +object_property_add_uint64_ptr(obj, "bus-lock-detection",
> +   &x86ms->bld, OBJ_PROP_FLAG_READWRITE);
> +object_property_set_description(obj, "bus-lock-detection",
> +"Bus lock detection ratelimit");
>  }
>  
>  static void x86_machine_class_init(ObjectClass *oc, void *data)
> diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
> index 56080bd1fb..1f0ffbcfb9 100644
> --- a/include/hw/i386/x86.h
> +++ b/include/hw/i386/x86.h
> @@ -72,6 +72,13 @@ struct X86MachineState {
>   * will be translated to MSI messages in the address space.
>   */
>  AddressSpace *ioapic_as;
> +
> +/*
> + * ratelimit enforced on detected bus locks, the default value
> + * is 0 per second
> + */
> +uint64_t bld;
> +RateLimit bld_limit;
>  };
>  
>  #define X86_MACHINE_SMM  "smm"
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index c8d61daf68..724862137d 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -130,6 +130,8 @@ static bool has_msr_mcg_ext_ctl;
>  static struct kvm_cpuid2 *cpuid_cache;
>  static struct kvm_msr_list *kvm_feature_msrs;
>  
> +#define SLICE_TIME 10ULL /* ns */
> +
>  int kvm_has_pit_state2(void)
>  {
>  return has_pit_state2;
> @@ -2267,6 +2269,27 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>  }
>  }
>  
> +if (object_dynamic_cast(OBJECT(ms), TYPE_X86_MACHINE)) {
> +X86MachineState *x86ms = X86_MACHINE(ms);
> +
> +if (x86ms->bld > 0) {
> +ret = kvm_check_extension(s, KVM_CAP_X86_BUS_LOCK_EXIT);
> +if (!(ret & KVM_BUS_LOCK_DETECTION_EXIT)) {
> +error_report("kvm: bus lock detection unsupported");
> +return -ENOTSUP;
> +}
> +ret = kvm_vm_enable_cap(s, KVM_CAP_X86_BUS_LOCK_EXIT, 0,
> +KVM_BUS_LOCK_DETECTION_EXIT);
> +if (ret < 0) {
> +error_report("kvm: Failed to enable bus lock detection cap: %s",
> + strerror(-ret));
> +return ret;
> +}
> +
> +ratelimit_set_speed(&x86ms->bld_limit, x86ms->bld, SLICE_TIME);
> +}
> +}
> +
>  return 0;
>  }
>  
> @@ -4221,6 +4244,18 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run 
> *run)
>  }
>  }
>  
> +static void kvm_rate_limit_on_bus_lock(void)
> +{
> +MachineState *ms = MACHINE(qdev_get_machine());
> +X86MachineState *x86ms = X86_MACHINE(ms);
> +
> +uint64_t delay_ns = ratelimit_calculate_delay(&x86ms->bld_limit, 1);
> +
> +if (delay_ns) {
> +g_usleep(delay_ns / SCALE_US);
> +}
> +}

Hi,

Can't see a use-case where the throttling is 

Re: [patch 2/3] nohz: change signal tick dependency to wakeup CPUs of member tasks

2021-02-12 Thread Marcelo Tosatti
On Fri, Feb 12, 2021 at 01:25:21PM +0100, Frederic Weisbecker wrote:
> On Thu, Jan 28, 2021 at 05:21:36PM -0300, Marcelo Tosatti wrote:
> > Rather than waking up all nohz_full CPUs on the system, only wakeup 
> > the target CPUs of member threads of the signal.
> > 
> > Reduces interruptions to nohz_full CPUs.
> > 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -444,9 +444,20 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
> >   * Set a per-taskgroup tick dependency. Posix CPU timers need this in 
> > order to elapse
> >   * per process timers.
> >   */
> > -void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits
> > bit)
> 
> Why not keeping the signal struct as a parameter?
> 
> Thanks.

All callers use "struct signal_struct *sig = tsk->signal" as the
signal parameter anyway...

I can change the parameters to (task, signal, bit) if you prefer.
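Something like this (sketch only; body unchanged, just threading the
extra parameter through):

    void tick_nohz_dep_set_signal(struct task_struct *tsk,
                                  struct signal_struct *sig,
                                  enum tick_dep_bits bit)
    {
            int prev;

            prev = atomic_fetch_or(BIT(bit), &sig->tick_dep_mask);
            if (!prev) {
                    struct task_struct *t;

                    lockdep_assert_held(&tsk->sighand->siglock);
                    __for_each_thread(sig, t)
                            tick_nohz_kick_task(t);
            }
    }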



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-04 Thread Marcelo Tosatti
On Thu, Feb 04, 2021 at 01:47:38PM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/4/21 1:15 PM, Marcelo Tosatti wrote:
> > On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> >> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >>>> The whole pile wants to be reverted. It's simply broken in several ways.
> >>> I was asking for your comments on interaction with CPU hotplug :-)
> >> Which I answered in an seperate mail :)
> >>
> >>> So housekeeping_cpumask has multiple meanings. In this case:
> >> ...
> >>
> >>> So as long as the meaning of the flags are respected, seems
> >>> alright.
> >> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> >> when a affinity mask spawns housekeeping and isolated is perfectly
> >> fine. It's well thought out and has no limitations.
> >>
> >>> Nitesh, is there anything preventing this from being fixed
> >>> in userspace ? (as Thomas suggested previously).
> >> Everything with is not managed can be steered by user space.
> > Yes, but it seems to be racy (that is, there is a window where the 
> > interrupt can be delivered to an isolated CPU).
> >
> > ethtool ->
> > xgbe_set_channels ->
> > xgbe_full_restart_dev ->
> > xgbe_alloc_memory ->
> > xgbe_alloc_channels ->
> > cpumask_local_spread
> >
> > Also ifconfig eth0 down / ifconfig eth0 up leads
> > to cpumask_local_spread.
> 
> There's always that possibility.

Then there is a window where isolation can be broken.

> We have to ensure that we move the IRQs by a tuned daemon or some other
> userspace script every time there is a net-dev change (eg. device comes up,
> creates VFs, etc).

Again, while that race window is open, an interrupt can be delivered to an
isolated CPU.

> > How about adding a new flag for isolcpus instead?
> >
> 
> Do you mean a flag based on which we can switch the affinity mask to
> housekeeping for all the devices at the time of IRQ distribution?

Yes, a new flag for isolcpus: HK_FLAG_IRQ_SPREAD, or some better name.
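For concreteness, a rough sketch of what that could look like
(HK_FLAG_IRQ_SPREAD is just the name floated above, not an existing
flag, and the bit value is only illustrative):

    /* include/linux/sched/isolation.h */
    enum hk_flags {
            ...
            HK_FLAG_IRQ_SPREAD      = BIT(9),  /* CPUs eligible as IRQ spreading targets */
    };

    /* lib/cpumask.c */
    unsigned int cpumask_local_spread(unsigned int i, int node)
    {
            /* only spread over CPUs the user explicitly left available for IRQs */
            const struct cpumask *mask = housekeeping_cpumask(HK_FLAG_IRQ_SPREAD);

            /* ... then iterate over "mask" exactly as in the patch quoted above ... */
    }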




Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-04 Thread Marcelo Tosatti
On Thu, Jan 28, 2021 at 09:01:37PM +0100, Thomas Gleixner wrote:
> On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:
> >> The whole pile wants to be reverted. It's simply broken in several ways.
> >
> > I was asking for your comments on interaction with CPU hotplug :-)
> 
> Which I answered in an seperate mail :)
> 
> > So housekeeping_cpumask has multiple meanings. In this case:
> 
> ...
> 
> > So as long as the meaning of the flags are respected, seems
> > alright.
> 
> Yes. Stuff like the managed interrupts preference for housekeeping CPUs
> when a affinity mask spawns housekeeping and isolated is perfectly
> fine. It's well thought out and has no limitations.
> 
> > Nitesh, is there anything preventing this from being fixed
> > in userspace ? (as Thomas suggested previously).
> 
> Everything with is not managed can be steered by user space.

Yes, but it seems to be racy (that is, there is a window where the 
interrupt can be delivered to an isolated CPU).

ethtool ->
xgbe_set_channels ->
xgbe_full_restart_dev ->
xgbe_alloc_memory ->
xgbe_alloc_channels ->
cpumask_local_spread

Also ifconfig eth0 down / ifconfig eth0 up leads
to cpumask_local_spread.

How about adding a new flag for isolcpus instead?



Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-02-01 Thread Marcelo Tosatti
On Fri, Jan 29, 2021 at 07:41:27AM -0800, Alex Belits wrote:
> On 1/28/21 07:56, Thomas Gleixner wrote:
> > External Email
> > 
> > --
> > On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> > > On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:
> > > > > > > /**
> > > > > > >  * cpumask_next - get the next cpu in a cpumask
> > > > > > > @@ -205,22 +206,27 @@ void __init 
> > > > > > > free_bootmem_cpumask_var(cpumask_var_t mask)
> > > > > > >  */
> > > > > > > unsigned int cpumask_local_spread(unsigned int i, int node)
> > > > > > > {
> > > > > > > - int cpu;
> > > > > > > + int cpu, hk_flags;
> > > > > > > + const struct cpumask *mask;
> > > > > > > + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > > > > > > + mask = housekeeping_cpumask(hk_flags);
> > > > > > 
> > > > > > AFAICS, this generally resolves to something based on 
> > > > > > cpu_possible_mask
> > > > > > rather than cpu_online_mask as before, so could now potentially 
> > > > > > return an
> > > > > > offline CPU. Was that an intentional change?
> > > > > 
> > > > > Robin,
> > > > > 
> > > > > AFAICS online CPUs should be filtered.
> > > > 
> > > > Apologies if I'm being thick, but can you explain how? In the case of
> > > > isolation being disabled or compiled out, housekeeping_cpumask() is
> > > > literally just "return cpu_possible_mask;". If we then iterate over that
> > > > with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> > > > NUMA_NO_NODE case), what guarantees that CPU is actually online?
> > > > 
> > > > Robin.
> > > 
> > > Nothing, but that was the situation before 
> > > 1abdfe706a579a702799fce465bceb9fb01d407c
> > > as well.
> > > 
> > > cpumask_local_spread() should probably be disabling CPU hotplug.
> > 
> > It can't unless all callers are from preemtible code.
> > 
> > Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
> > over the kernel is just wrong, really.
> > 
> > As I explained several times before there are very valid reasons for
> > having queues and interrupts on isolated CPUs. Just optimizing for the
> > usecases some people care about is not making anything better.
> 
> However making it mandatory for isolated CPUs to allow interrupts is not a
> good idea, either. Providing an environment free of disturbances is a valid
> goal, so we can't do something that will make it impossible to achieve. We
> know that both there is a great amount of demand for this feature and
> implementing it is doable, so cutting off the possibility of development in
> this direction would be bad.
> 
> Before there was housekeeping mask, I had to implement another, more
> cumbersome model that ended up being more intrusive than I wanted. That was
> one of the reasons why I have spent some time working on it in, please
> forgive me the pun, isolation.
> 
> I was relieved when housekeeping mask appeared, and I was able to remove a
> large chunk of code that distinguished between CPUs that "are there" and
> CPUs "available to run work". Housekeeping is supposed to define the set of
> CPUs that are intended to run work that is not specifically triggered by
> anything running on those CPUs. "CPUs that are there" are CPUs that are
> being maintained as a part of the system, so they are usable for running
> things on them.
> 
> My idea at the time was that we can separate this into two directions of
> development:
> 
> 1. Make sure that housekeeping mask applies to all kinds of work that
> appears on CPUs, so nothing random will end up running there. Because this
> is very much in line of what it does.

It's easier to specify "all members of the set" rather than having to specify
each individual member. Think of the set as a description of the types of
activities that should not have a given CPU as a target.

> 2. Rely on housekeeping mask to exclude anything not specifically intended
> to run on isolated CPUs, and concentrate efforts on making sure that things
> that are intended to [eventually] happen on those CPUs are handled properly
> -- in case of my recent proposals, delayed until synchronization even

[patch 3/3] nohz: tick_nohz_kick_task: only IPI if remote task is running

2021-01-28 Thread Marcelo Tosatti
If the task is not running, run_posix_cpu_timers() has nothing
to elapse, so spare the IPI in that case.

Suggested-by: Peter Zijlstra 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/sched/core.c
===
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -9182,3 +9182,9 @@ void call_trace_sched_update_nr_running(
 {
 trace_sched_update_nr_running_tp(rq, count);
 }
+
+bool task_on_rq(struct task_struct *p)
+{
+   return p->on_rq == TASK_ON_RQ_QUEUED;
+}
+
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -232,6 +232,8 @@ extern void io_schedule_finish(int token
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+extern bool task_on_rq(struct task_struct *p);
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -324,8 +324,6 @@ void tick_nohz_full_kick_cpu(int cpu)
 
 static void tick_nohz_kick_task(struct task_struct *tsk)
 {
-   int cpu = task_cpu(tsk);
-
/*
 * If the task concurrently migrates to another cpu,
 * we guarantee it sees the new tick dependency upon
@@ -340,6 +338,23 @@ static void tick_nohz_kick_task(struct t
	 *   tick_nohz_task_switch()            smp_mb() (atomic_fetch_or())
 *  LOAD p->tick_dep_mask   LOAD p->cpu
 */
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task is not running, run_posix_cpu_timers()
+* has nothing to elapse, so we can spare the IPI in
+* that case.
+*
+* activate_task()                      STORE p->tick_dep_mask
+*   STORE p->task_on_rq
+* __schedule() (switch to task 'p')    smp_mb() (atomic_fetch_or())
+*   LOCK rq->lock                      LOAD p->task_on_rq
+*   smp_mb__after_spin_lock()
+*   tick_nohz_task_switch()
+*     LOAD p->tick_dep_mask
+*/
+   if (!task_on_rq(tsk))
+   return;
 
preempt_disable();
if (cpu_online(cpu))




[patch 2/3] nohz: change signal tick dependency to wakeup CPUs of member tasks

2021-01-28 Thread Marcelo Tosatti
Rather than waking up all nohz_full CPUs on the system, only wakeup 
the target CPUs of member threads of the signal.

Reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -444,9 +444,20 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
  * Set a per-taskgroup tick dependency. Posix CPU timers need this in order to 
elapse
  * per process timers.
  */
-void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
+void tick_nohz_dep_set_signal(struct task_struct *tsk,
+ enum tick_dep_bits bit)
 {
-   tick_nohz_dep_set_all(&sig->tick_dep_mask, bit);
+   int prev;
+   struct signal_struct *sig = tsk->signal;
+
+   prev = atomic_fetch_or(BIT(bit), &sig->tick_dep_mask);
+   if (!prev) {
+   struct task_struct *t;
+
+   lockdep_assert_held(&tsk->sighand->siglock);
+   __for_each_thread(sig, t)
+   tick_nohz_kick_task(t);
+   }
 }
 
 void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
Index: linux-2.6/include/linux/tick.h
===
--- linux-2.6.orig/include/linux/tick.h
+++ linux-2.6/include/linux/tick.h
@@ -207,7 +207,7 @@ extern void tick_nohz_dep_set_task(struc
   enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_task(struct task_struct *tsk,
 enum tick_dep_bits bit);
-extern void tick_nohz_dep_set_signal(struct signal_struct *signal,
+extern void tick_nohz_dep_set_signal(struct task_struct *tsk,
 enum tick_dep_bits bit);
 extern void tick_nohz_dep_clear_signal(struct signal_struct *signal,
   enum tick_dep_bits bit);
@@ -252,11 +252,11 @@ static inline void tick_dep_clear_task(s
if (tick_nohz_full_enabled())
tick_nohz_dep_clear_task(tsk, bit);
 }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit)
 {
if (tick_nohz_full_enabled())
-   tick_nohz_dep_set_signal(signal, bit);
+   tick_nohz_dep_set_signal(tsk, bit);
 }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit)
@@ -284,7 +284,7 @@ static inline void tick_dep_set_task(str
 enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_task(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
-static inline void tick_dep_set_signal(struct signal_struct *signal,
+static inline void tick_dep_set_signal(struct task_struct *tsk,
   enum tick_dep_bits bit) { }
 static inline void tick_dep_clear_signal(struct signal_struct *signal,
 enum tick_dep_bits bit) { }
Index: linux-2.6/kernel/time/posix-cpu-timers.c
===
--- linux-2.6.orig/kernel/time/posix-cpu-timers.c
+++ linux-2.6/kernel/time/posix-cpu-timers.c
@@ -523,7 +523,7 @@ static void arm_timer(struct k_itimer *t
if (CPUCLOCK_PERTHREAD(timer->it_clock))
tick_dep_set_task(p, TICK_DEP_BIT_POSIX_TIMER);
else
-   tick_dep_set_signal(p->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(p, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 /*
@@ -1358,7 +1358,7 @@ void set_process_cpu_timer(struct task_s
if (*newval < *nextevt)
*nextevt = *newval;
 
-   tick_dep_set_signal(tsk->signal, TICK_DEP_BIT_POSIX_TIMER);
+   tick_dep_set_signal(tsk, TICK_DEP_BIT_POSIX_TIMER);
 }
 
 static int do_cpu_nanosleep(const clockid_t which_clock, int flags,




[patch 1/3] nohz: only wakeup a single target cpu when kicking a task

2021-01-28 Thread Marcelo Tosatti
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to re-evaluate tick
dependencies on that CPU.

However the current code wakes up all nohz_full CPUs, which 
is unnecessary.

Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.

From: Frederic Weisbecker 
Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -322,6 +322,31 @@ void tick_nohz_full_kick_cpu(int cpu)
	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task concurrently migrates to another cpu,
+* we guarantee it sees the new tick dependency upon
+* schedule.
+*
+*
+* set_task_cpu(p, cpu);
+*   STORE p->cpu = @cpu
+* __schedule() (switch to task 'p')
+*   LOCK rq->lock
+*   smp_mb__after_spin_lock()          STORE p->tick_dep_mask
+*   tick_nohz_task_switch()            smp_mb() (atomic_fetch_or())
+*      LOAD p->tick_dep_mask           LOAD p->cpu
+*/
+
+   preempt_disable();
+   if (cpu_online(cpu))
+   tick_nohz_full_kick_cpu(cpu);
+   preempt_enable();
+}
+
 /*
  * Kick all full dynticks CPUs in order to force these to re-evaluate
  * their dependency on the tick and restart it if necessary.
@@ -404,19 +429,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cp
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
-   if (tsk == current) {
-   preempt_disable();
-   tick_nohz_full_kick();
-   preempt_enable();
-   } else {
-   /*
-* Some future tick_nohz_full_kick_task()
-* should optimize this.
-*/
-   tick_nohz_full_kick_all();
-   }
-   }
+   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask))
+   tick_nohz_kick_task(tsk);
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 




[patch 0/3] nohz_full: only wakeup target CPUs when notifying new tick dependency (v5)

2021-01-28 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs
to not use POSIX timers (e.g. DPDK). This patchset changes the notification
to IPI only the target CPUs where the task(s) whose tick dependencies
are being updated are executing.

This reduces interruptions to nohz_full= CPUs.

v5: actually replace superfluous rcu_read_lock with lockdep_assert
v4: only IPI if the remote task is on the remote runqueue (PeterZ/Frederic)
v3: replace superfluous rcu_read_lock with lockdep_assert (PeterZ)






[patch 0/3] nohz_full: only wakeup target CPUs when notifying new tick dependency (v4)

2021-01-28 Thread Marcelo Tosatti
When enabling per-CPU posix timers, an IPI to nohz_full CPUs might be
performed (to re-read the dependencies and possibly not re-enter
nohz_full on a given CPU).

A common case is for applications that run on nohz_full= CPUs
to not use POSIX timers (e.g. DPDK). This patchset changes the notification
to IPI only the target CPUs where the task(s) whose tick dependencies
are being updated are executing.

This reduces interruptions to nohz_full= CPUs.

v4: only IPI if the remote task is on the remote runqueue (PeterZ/Frederic)
v3: replace superfluous rcu_read_lock with lockdep_assert (PeterZ)




[patch 1/3] nohz: only wakeup a single target cpu when kicking a task

2021-01-28 Thread Marcelo Tosatti
When adding a tick dependency to a task, it's necessary to
wake up the CPU where the task resides to re-evaluate tick
dependencies on that CPU.

However the current code wakes up all nohz_full CPUs, which 
is unnecessary.

Switch to waking up a single CPU, by using ordering of writes
to task->cpu and task->tick_dep_mask.

From: Frederic Weisbecker 
Suggested-by: Peter Zijlstra 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -322,6 +322,31 @@ void tick_nohz_full_kick_cpu(int cpu)
	irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
 }
 
+static void tick_nohz_kick_task(struct task_struct *tsk)
+{
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task concurrently migrates to another cpu,
+* we guarantee it sees the new tick dependency upon
+* schedule.
+*
+*
+* set_task_cpu(p, cpu);
+*   STORE p->cpu = @cpu
+* __schedule() (switch to task 'p')
+*   LOCK rq->lock
+*   smp_mb__after_spin_lock()          STORE p->tick_dep_mask
+*   tick_nohz_task_switch()            smp_mb() (atomic_fetch_or())
+*      LOAD p->tick_dep_mask           LOAD p->cpu
+*/
+
+   preempt_disable();
+   if (cpu_online(cpu))
+   tick_nohz_full_kick_cpu(cpu);
+   preempt_enable();
+}
+
 /*
  * Kick all full dynticks CPUs in order to force these to re-evaluate
  * their dependency on the tick and restart it if necessary.
@@ -404,19 +429,8 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cp
  */
 void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
 {
-   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
-   if (tsk == current) {
-   preempt_disable();
-   tick_nohz_full_kick();
-   preempt_enable();
-   } else {
-   /*
-* Some future tick_nohz_full_kick_task()
-* should optimize this.
-*/
-   tick_nohz_full_kick_all();
-   }
-   }
+   if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask))
+   tick_nohz_kick_task(tsk);
 }
 EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);
 




[patch 2/3] nohz: change signal tick dependency to wakeup CPUs of member tasks

2021-01-28 Thread Marcelo Tosatti
Rather than waking up all nohz_full CPUs on the system, only wakeup 
the target CPUs of member threads of the signal.

Reduces interruptions to nohz_full CPUs.

Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -446,7 +446,17 @@ EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_ta
  */
 void tick_nohz_dep_set_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)
 {
-   tick_nohz_dep_set_all(&sig->tick_dep_mask, bit);
+   int prev;
+
+   prev = atomic_fetch_or(BIT(bit), &sig->tick_dep_mask);
+   if (!prev) {
+   struct task_struct *t;
+
+   rcu_read_lock();
+   __for_each_thread(sig, t)
+   tick_nohz_kick_task(t);
+   rcu_read_unlock();
+   }
 }
 
 void tick_nohz_dep_clear_signal(struct signal_struct *sig, enum tick_dep_bits 
bit)




[patch 3/3] nohz: tick_nohz_kick_task: only IPI if remote task is running

2021-01-28 Thread Marcelo Tosatti
If the task is not running, run_posix_cpu_timers() has nothing
to elapse, so spare the IPI in that case.

Suggested-by: Peter Zijlstra 
Signed-off-by: Marcelo Tosatti 

Index: linux-2.6/kernel/sched/core.c
===
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -9182,3 +9182,9 @@ void call_trace_sched_update_nr_running(
 {
 trace_sched_update_nr_running_tp(rq, count);
 }
+
+bool task_on_rq(struct task_struct *p)
+{
+   return p->on_rq == TASK_ON_RQ_QUEUED;
+}
+
Index: linux-2.6/include/linux/sched.h
===
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -232,6 +232,8 @@ extern void io_schedule_finish(int token
 extern long io_schedule_timeout(long timeout);
 extern void io_schedule(void);
 
+extern bool task_on_rq(struct task_struct *p);
+
 /**
  * struct prev_cputime - snapshot of system and user cputime
  * @utime: time spent in user mode
Index: linux-2.6/kernel/time/tick-sched.c
===
--- linux-2.6.orig/kernel/time/tick-sched.c
+++ linux-2.6/kernel/time/tick-sched.c
@@ -324,8 +324,6 @@ void tick_nohz_full_kick_cpu(int cpu)
 
 static void tick_nohz_kick_task(struct task_struct *tsk)
 {
-   int cpu = task_cpu(tsk);
-
/*
 * If the task concurrently migrates to another cpu,
 * we guarantee it sees the new tick dependency upon
@@ -340,6 +338,23 @@ static void tick_nohz_kick_task(struct t
 *   tick_nohz_task_switch()            smp_mb() (atomic_fetch_or())
 *  LOAD p->tick_dep_mask   LOAD p->cpu
 */
+   int cpu = task_cpu(tsk);
+
+   /*
+* If the task is not running, run_posix_cpu_timers()
+* has nothing to elapse, so we can spare the IPI in
+* that case.
+*
+* activate_task()                      STORE p->tick_dep_mask
+*   STORE p->task_on_rq
+* __schedule() (switch to task 'p')    smp_mb() (atomic_fetch_or())
+*   LOCK rq->lock                      LOAD p->task_on_rq
+*   smp_mb__after_spin_lock()
+*   tick_nohz_task_switch()
+*     LOAD p->tick_dep_mask
+*/
+   if (!task_on_rq(tsk))
+   return;
 
preempt_disable();
if (cpu_online(cpu))




Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-28 Thread Marcelo Tosatti
On Thu, Jan 28, 2021 at 04:56:07PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 10:09, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:
> >> > > >/**
> >> > > > * cpumask_next - get the next cpu in a cpumask
> >> > > > @@ -205,22 +206,27 @@ void __init 
> >> > > > free_bootmem_cpumask_var(cpumask_var_t mask)
> >> > > > */
> >> > > >unsigned int cpumask_local_spread(unsigned int i, int node)
> >> > > >{
> >> > > > -int cpu;
> >> > > > +int cpu, hk_flags;
> >> > > > +const struct cpumask *mask;
> >> > > > +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > > > +mask = housekeeping_cpumask(hk_flags);
> >> > > 
> >> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> >> > > rather than cpu_online_mask as before, so could now potentially return 
> >> > > an
> >> > > offline CPU. Was that an intentional change?
> >> > 
> >> > Robin,
> >> > 
> >> > AFAICS online CPUs should be filtered.
> >> 
> >> Apologies if I'm being thick, but can you explain how? In the case of
> >> isolation being disabled or compiled out, housekeeping_cpumask() is
> >> literally just "return cpu_possible_mask;". If we then iterate over that
> >> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> >> NUMA_NO_NODE case), what guarantees that CPU is actually online?
> >> 
> >> Robin.
> >
> > Nothing, but that was the situation before 
> > 1abdfe706a579a702799fce465bceb9fb01d407c
> > as well.
> >
> > cpumask_local_spread() should probably be disabling CPU hotplug.
> 
> It can't unless all callers are from preemtible code.
> 
> Aside of that this whole frenzy to sprinkle housekeeping_cpumask() all
> over the kernel is just wrong, really.
> 
> As I explained several times before there are very valid reasons for
> having queues and interrupts on isolated CPUs. Just optimizing for the
> usecases some people care about is not making anything better.

And that is right.



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-28 Thread Marcelo Tosatti
On Thu, Jan 28, 2021 at 05:02:41PM +0100, Thomas Gleixner wrote:
> On Wed, Jan 27 2021 at 09:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
> >> > +hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> >> > +mask = housekeeping_cpumask(hk_flags);
> >> 
> >> AFAICS, this generally resolves to something based on cpu_possible_mask
> >> rather than cpu_online_mask as before, so could now potentially return an
> >> offline CPU. Was that an intentional change?
> >
> > Robin,
> >
> > AFAICS online CPUs should be filtered.
> 
> The whole pile wants to be reverted. It's simply broken in several ways.

I was asking for your comments on interaction with CPU hotplug :-)
Anyway...

So housekeeping_cpumask has multiple meanings. In this case:

HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ

 domain
   Isolate from the general SMP balancing and scheduling
   algorithms. Note that performing domain isolation this way
   is irreversible: it's not possible to bring back a CPU to
   the domains once isolated through isolcpus. It's strongly
   advised to use cpusets instead to disable scheduler load
   balancing through the "cpuset.sched_load_balance" file.
   It offers a much more flexible interface where CPUs can
   move in and out of an isolated set anytime.

   You can move a process onto or off an "isolated" CPU via
   the CPU affinity syscalls or cpuset.
   <cpu number> begins at 0 and the maximum value is
   "number of CPUs in system - 1".

 managed_irq

   Isolate from being targeted by managed interrupts
   which have an interrupt mask containing isolated
   CPUs. The affinity of managed interrupts is
   handled by the kernel and cannot be changed via
   the /proc/irq/* interfaces.

   This isolation is best effort and only effective
   if the automatically assigned interrupt mask of a
   device queue contains isolated and housekeeping
   CPUs. If housekeeping CPUs are online then such
   interrupts are directed to the housekeeping CPU
   so that IO submitted on the housekeeping CPU
   cannot disturb the isolated CPU.

   If a queue's affinity mask contains only isolated
   CPUs then this parameter has no effect on the
   interrupt routing decision, though interrupts are
   only delivered when tasks running on those
   isolated CPUs submit IO. IO submitted on
   housekeeping CPUs has no influence on those
   queues.

So as long as the meanings of the flags are respected, this seems
alright.

Nitesh, is there anything preventing this from being fixed
in userspace? (as Thomas suggested previously).





Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-27 Thread Marcelo Tosatti
On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:
> On 2021-01-27 12:19, Marcelo Tosatti wrote:
> > On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
> > > Hi,
> > > 
> > > On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > > > From: Alex Belits 
> > > > 
> > > > The current implementation of cpumask_local_spread() does not respect 
> > > > the
> > > > isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> > > > it will return it to the caller for pinning of its IRQ threads. Having
> > > > these unwanted IRQ threads on an isolated CPU adds up to a latency
> > > > overhead.
> > > > 
> > > > Restrict the CPUs that are returned for spreading IRQs only to the
> > > > available housekeeping CPUs.
> > > > 
> > > > Signed-off-by: Alex Belits 
> > > > Signed-off-by: Nitesh Narayan Lal 
> > > > ---
> > > >lib/cpumask.c | 16 +++-
> > > >1 file changed, 11 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > > > index fb22fb266f93..85da6ab4fbb5 100644
> > > > --- a/lib/cpumask.c
> > > > +++ b/lib/cpumask.c
> > > > @@ -6,6 +6,7 @@
> > > >#include 
> > > >#include 
> > > >#include 
> > > > +#include 
> > > >/**
> > > > * cpumask_next - get the next cpu in a cpumask
> > > > @@ -205,22 +206,27 @@ void __init 
> > > > free_bootmem_cpumask_var(cpumask_var_t mask)
> > > > */
> > > >unsigned int cpumask_local_spread(unsigned int i, int node)
> > > >{
> > > > -   int cpu;
> > > > +   int cpu, hk_flags;
> > > > +   const struct cpumask *mask;
> > > > +   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > > > +   mask = housekeeping_cpumask(hk_flags);
> > > 
> > > AFAICS, this generally resolves to something based on cpu_possible_mask
> > > rather than cpu_online_mask as before, so could now potentially return an
> > > offline CPU. Was that an intentional change?
> > 
> > Robin,
> > 
> > AFAICS online CPUs should be filtered.
> 
> Apologies if I'm being thick, but can you explain how? In the case of
> isolation being disabled or compiled out, housekeeping_cpumask() is
> literally just "return cpu_possible_mask;". If we then iterate over that
> with for_each_cpu() and just return the i'th possible CPU (e.g. in the
> NUMA_NO_NODE case), what guarantees that CPU is actually online?
> 
> Robin.

Nothing, but that was the situation before 
1abdfe706a579a702799fce465bceb9fb01d407c
as well.

cpumask_local_spread() should probably be disabling CPU hotplug.
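For concreteness, roughly what I mean (untested sketch, and it would
only be possible if every caller of cpumask_local_spread() can sleep):

    unsigned int cpumask_local_spread(unsigned int i, int node)
    {
            unsigned int cpu;

            get_online_cpus();      /* keep cpu_online_mask stable */
            /*
             * ... the existing lookup, but AND the housekeeping mask
             * with cpu_online_mask before picking the i'th CPU ...
             */
            put_online_cpus();

            return cpu;
    }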

Thomas?

> 
> > > I was just looking at the current code since I had the rare presence of 
> > > mind
> > > to check if something suitable already existed before I start open-coding
> > > "any online CPU, but local node preferred" logic for handling IRQ affinity
> > > in a driver - cpumask_local_spread() appears to be almost what I want (if 
> > > a
> > > bit more heavyweight), if only it would actually guarantee an online CPU 
> > > as
> > > the kerneldoc claims :(
> > > 
> > > Robin.
> > > 
> > > > /* Wrap: we always want a cpu. */
> > > > -   i %= num_online_cpus();
> > > > +   i %= cpumask_weight(mask);
> > > > if (node == NUMA_NO_NODE) {
> > > > -   for_each_cpu(cpu, cpu_online_mask)
> > > > +   for_each_cpu(cpu, mask) {
> > > > if (i-- == 0)
> > > > return cpu;
> > > > +   }
> > > > } else {
> > > > /* NUMA first. */
> > > > -   for_each_cpu_and(cpu, cpumask_of_node(node), 
> > > > cpu_online_mask)
> > > > +   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > > > if (i-- == 0)
> > > > return cpu;
> > > > +   }
> > > > -   for_each_cpu(cpu, cpu_online_mask) {
> > > > +   for_each_cpu(cpu, mask) {
> > > > /* Skip NUMA nodes, done above. */
> > > > if (cpumask_test_cpu(cpu, 
> > > > cpumask_of_node(node)))
> > > > continue;
> > > > 
> > 



Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to houskeeping CPUs

2021-01-27 Thread Marcelo Tosatti
On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:
> Hi,
> 
> On 2020-06-25 23:34, Nitesh Narayan Lal wrote:
> > From: Alex Belits 
> > 
> > The current implementation of cpumask_local_spread() does not respect the
> > isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task,
> > it will return it to the caller for pinning of its IRQ threads. Having
> > these unwanted IRQ threads on an isolated CPU adds up to a latency
> > overhead.
> > 
> > Restrict the CPUs that are returned for spreading IRQs only to the
> > available housekeeping CPUs.
> > 
> > Signed-off-by: Alex Belits 
> > Signed-off-by: Nitesh Narayan Lal 
> > ---
> >   lib/cpumask.c | 16 +++-
> >   1 file changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/lib/cpumask.c b/lib/cpumask.c
> > index fb22fb266f93..85da6ab4fbb5 100644
> > --- a/lib/cpumask.c
> > +++ b/lib/cpumask.c
> > @@ -6,6 +6,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   /**
> >* cpumask_next - get the next cpu in a cpumask
> > @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t 
> > mask)
> >*/
> >   unsigned int cpumask_local_spread(unsigned int i, int node)
> >   {
> > -   int cpu;
> > +   int cpu, hk_flags;
> > +   const struct cpumask *mask;
> > +   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
> > +   mask = housekeeping_cpumask(hk_flags);
> 
> AFAICS, this generally resolves to something based on cpu_possible_mask
> rather than cpu_online_mask as before, so could now potentially return an
> offline CPU. Was that an intentional change?

Robin,

AFAICS online CPUs should be filtered.

> I was just looking at the current code since I had the rare presence of mind
> to check if something suitable already existed before I start open-coding
> "any online CPU, but local node preferred" logic for handling IRQ affinity
> in a driver - cpumask_local_spread() appears to be almost what I want (if a
> bit more heavyweight), if only it would actually guarantee an online CPU as
> the kerneldoc claims :(
> 
> Robin.
> 
> > /* Wrap: we always want a cpu. */
> > -   i %= num_online_cpus();
> > +   i %= cpumask_weight(mask);
> > if (node == NUMA_NO_NODE) {
> > -   for_each_cpu(cpu, cpu_online_mask)
> > +   for_each_cpu(cpu, mask) {
> > if (i-- == 0)
> > return cpu;
> > +   }
> > } else {
> > /* NUMA first. */
> > -   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
> > +   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
> > if (i-- == 0)
> > return cpu;
> > +   }
> > -   for_each_cpu(cpu, cpu_online_mask) {
> > +   for_each_cpu(cpu, mask) {
> > /* Skip NUMA nodes, done above. */
> > if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
> > continue;
> > 



Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus

2021-01-22 Thread Marcelo Tosatti
On Tue, Nov 24, 2020 at 12:21:06AM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +, Alex Belits wrote:
> > 
> > On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote:
> > > External Email
> > > 
> > > ---
> > > ---
> > > On Mon, Nov 23, 2020 at 05:58:42PM +, Alex Belits wrote:
> > > > From: Yuri Norov 
> > > > 
> > > > Make sure that kick_all_cpus_sync() does not call CPUs that are
> > > > running
> > > > isolated tasks.
> > > > 
> > > > Signed-off-by: Yuri Norov 
> > > > [abel...@marvell.com: use safe task_isolation_cpumask()
> > > > implementation]
> > > > Signed-off-by: Alex Belits 
> > > > ---
> > > >  kernel/smp.c | 14 +-
> > > >  1 file changed, 13 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/kernel/smp.c b/kernel/smp.c
> > > > index 4d17501433be..b2faecf58ed0 100644
> > > > --- a/kernel/smp.c
> > > > +++ b/kernel/smp.c
> > > > @@ -932,9 +932,21 @@ static void do_nothing(void *unused)
> > > >   */
> > > >  void kick_all_cpus_sync(void)
> > > >  {
> > > > +   struct cpumask mask;
> > > > +
> > > > /* Make sure the change is visible before we kick the cpus */
> > > > smp_mb();
> > > > -   smp_call_function(do_nothing, NULL, 1);
> > > > +
> > > > +   preempt_disable();
> > > > +#ifdef CONFIG_TASK_ISOLATION
> > > > +   cpumask_clear(&mask);
> > > > +   task_isolation_cpumask(&mask);
> > > > +   cpumask_complement(&mask, &mask);
> > > > +#else
> > > > +   cpumask_setall(&mask);
> > > > +#endif
> > > > +   smp_call_function_many(&mask, do_nothing, NULL, 1);
> > > > +   preempt_enable();
> > > 
> > > Same comment about IPIs here.
> > 
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter kernel at any time and run
> > kernel code with no additional preparation. Then the only solution is
> > to always do full broadcast and require all CPUs to process it.
> > 
> > What I am trying to introduce is the idea of CPU that is not likely to
> > run kernel code any soon, and can afford to go through an additional
> > synchronization procedure on the next entry into kernel. The
> > synchronization is not skipped, it simply happens later, early in
> > kernel entry code.

Perhaps a bitmask of pending flushes makes more sense?
static_key_enable() IPIs are one of the users, but for that case it would
be necessary to differentiate between in-kernel mode and out-of-kernel
mode atomically (since the i-cache flush must be performed if the isolated
CPU is in kernel mode).

> Ah I see, this is ordered that way:
> 
> ll_isol_flags = ISOLATED
> 
>  CPU 0CPU 1
> --   -
> // kernel entry
> data_to_sync = 1ll_isol_flags = ISOLATED_BROKEN
> smp_mb()smp_mb()
> if ll_isol_flags(CPU 1) == ISOLATED READ data_to_sync
>  smp_call(CPU 1)

Since isolated mode with syscalls is a desired feature, a separate
atomic with in_kernel_mode = 0/1 (set/cleared on kernel entry/exit
while TIF_TASK_ISOLATION is set) would be necessary, along with
race-free logic similar to the above.
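Roughly (sketch only, mirroring the ordering diagram above; all names
here are made up for illustration, and kernel exit would clear the flag
again):

    static DEFINE_PER_CPU(atomic_t, isol_in_kernel);

    /* kernel entry on an isolated CPU, before running other kernel code */
    static inline void task_isolation_kernel_enter(void)
    {
            atomic_set(this_cpu_ptr(&isol_in_kernel), 1);
            smp_mb();       /* pairs with smp_mb() on the remote side */
            /* process deferred work (e.g. the pending-flush bitmask) here */
    }

    /* remote side: either the target is in the kernel and gets the IPI,
     * or it is isolated and the work is deferred to its next kernel entry */
    static bool target_needs_ipi(int cpu)
    {
            /* the data to sync has already been published by the caller */
            smp_mb();       /* pairs with smp_mb() in task_isolation_kernel_enter() */
            return atomic_read(per_cpu_ptr(&isol_in_kernel, cpu)) == 1;
    }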

> You should document that, ie: explain why what you're doing is safe.
> 
> Also Beware though that the data to sync in question doesn't need to be 
> visible
> in the entry code before task_isolation_kernel_enter(). You need to audit all
> the callers of kick_all_cpus_sync().

Cscope tag: flush_icache_range
   #   line  filename / context / line
   1 96  arch/arc/kernel/jump_label.c <>
 flush_icache_range(entry->code, entry->code + JUMP_LABEL_NOP_SIZE);

This case would be OK for delayed processing before kernel entry, as long as
no code before task_isolation_kernel_enter() can be modified (which I am
not sure about).

But:

  36 28  arch/ia64/include/asm/cacheflush.h <>
 flush_icache_range(_addr, _addr + (len)); \

Is less certain.

Alex, do you recall whether arch_jump_label_transform was the only offender, or
were there others as well? (I suppose handling only the ones that matter
in production at the moment, and fixing the individual ones later, makes the
most sense.)





Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks

2021-01-22 Thread Marcelo Tosatti
On Thu, Oct 01, 2020 at 04:47:31PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 22, 2020 at 02:58:24PM +, Alex Belits wrote:
> > From: Yuri Norov 
> > 

> > so we don't need to flush it.
> 
> What guarantees that we have no backlog on it?

From Paolo's work to use lockless reading of
per-CPU skb lists

https://www.spinics.net/lists/netdev/msg682693.html

It also exposed skb queue length to userspace

https://www.spinics.net/lists/netdev/msg684939.html

But if I remember correctly, waiting for an RCU grace
period was also necessary to ensure there is no backlog!?

Paolo, would you please remind us what the sequence of steps was?
(And also, for the userspace isolation interface, where
the application informs the kernel that it's entering isolated
mode: is confirming that the queues have zero length
sufficient?)

TIA!

> 
> > Currently flush_all_backlogs()
> > enqueues corresponding work on all CPUs including ones that run
> > isolated tasks. It leads to breaking task isolation for nothing.
> > 
> > In this patch, backlog flushing is enqueued only on non-isolated CPUs.
> > 
> > Signed-off-by: Yuri Norov 
> > [abel...@marvell.com: use safe task_isolation_on_cpu() implementation]
> > Signed-off-by: Alex Belits 
> > ---
> >  net/core/dev.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 90b59fc50dc9..83a282f7453d 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -74,6 +74,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void)
> >  
> > get_online_cpus();
> >  
> > -   for_each_online_cpu(cpu)
> > +   smp_rmb();
> 
> What is it ordering?
> 
> > +   for_each_online_cpu(cpu) {
> > +   if (task_isolation_on_cpu(cpu))
> > +   continue;
> > queue_work_on(cpu, system_highpri_wq,
> >   per_cpu_ptr(&flush_works, cpu));
> > +   }
> >  
> > for_each_online_cpu(cpu)
> > flush_work(per_cpu_ptr(&flush_works, cpu));
> 
> Thanks.



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-15 Thread Marcelo Tosatti
On Fri, Dec 11, 2020 at 10:59:59PM +0100, Paolo Bonzini wrote:
> On 11/12/20 22:04, Thomas Gleixner wrote:
> > > Its 100ms off with migration, and can be reduced further (customers
> > > complained about 5 seconds but seem happy with 0.1ms).
> > What is 100ms? Guaranteed maximum migration time?
> 
> I suppose it's the length between the time from KVM_GET_CLOCK and
> KVM_GET_MSR(IA32_TSC) to KVM_SET_CLOCK and KVM_SET_MSR(IA32_TSC).  But the
> VM is paused for much longer, the sequence for the non-live part of the
> migration (aka brownout) is as follows:
> 
> pause
> finish sending RAMreceive RAM   ~1 sec
> send paused-VM state  finish receiving RAM \
>   receive paused-VM state   ) 0.1 sec
>   restart  /
> 
> The nanosecond and TSC times are sent as part of the paused-VM state at the
> very end of the live migration process.
> 
> So it's still true that the time advances during live migration brownout;
> 0.1 seconds is just the final part of the live migration process.  But for
> _live_ migration there is no need to design things according to "people are
> happy if their clock is off by 0.1 seconds only".  

Agree. What would be a good way to fix this? 

It seems to me that using CLOCK_REALTIME, as in the interface Maxim is
proposing, is prone to differences in CLOCK_REALTIME itself between the
source and destination hosts.

Perhaps there is another way to measure that 0.1 sec which is
independent of the clock values of the source and destination hosts
(say, by sending a packet once the clock stops counting).

Then, on the destination, measure delta = clock_restart_time - packet_arrival_time
and advance the clock by that amount.
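Sketch of the idea (all names made up; the "packet" is any marker the
source sends at the instant it stops the guest clock, over a channel
whose one-way latency is small relative to 0.1 sec):

    /* source, immediately after stopping the guest clock: */
    send_stop_marker(migration_socket);

    /* destination: */
    t_marker  = local_clock_ns();       /* stop marker arrives            */
    /* ... receive and load the paused-VM state ... */
    t_restart = local_clock_ns();       /* just before resuming the guest */
    advance_guest_clock_ns(t_restart - t_marker);

Both timestamps are taken on the destination, so the clock difference
between the source and destination hosts cancels out (modulo the one-way
latency of the marker itself).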



> Again, save-to-disk,
> reverse debugging and the like are a different story, which is why KVM
> should delegate policy to userspace (while documenting how to do it right).
> 
> Paolo
> 
> > CLOCK_REALTIME and CLOCK_TAI are off by the time the VM is paused and
> > this state persists up to the point where NTP corrects it with a time
> > jump.
> > 
> > So if migration takes 5 seconds then CLOCK_REALTIME is not off by 100ms
> > it's off by 5 seconds.
> > 
> > CLOCK_MONOTONIC/BOOTTIME might be off by 100ms between pause and resume.
> > 



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-11 Thread Marcelo Tosatti
On Fri, Dec 11, 2020 at 02:30:34PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 10 2020 at 21:27, Marcelo Tosatti wrote:
> > On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
> >> You really all live in a seperate universe creating your own rules how
> >> things which other people work hard on to get it correct can be screwed
> >> over.
> >
> > 1. T = read timestamp.
> > 2. migrate (VM stops for a certain period).
> > 3. use timestamp T.
> 
> This is exactly the problem. Time stops at pause and continues where it
> stopped on resume.
> 
> But CLOCK_REALTIME and CLOCK_TAI advanced in reality. So up to the point
> where NTP fixes this - if there is NTP at all - the guest CLOCK_REALTIME
> and CLOCK_TAI are off by tpause.
> 
> Now the application gets a packet from the outside world with a
> CLOCK_REALTIME timestamp which is suddenly ahead of the value it reads
> from clock_gettime(CLOCK_REALTIME) by tpause. So what is it supposed to
> do with that? Make stupid assumptions that the other end screwed up
> timekeeping, throw an error that the system it is running on screwed up
> timekeeping? And a second later when NTP catched up it gets the next
> surprise because the systems CLOCK_REALTIME jumped forward unexpectedly
> or if there is no NTP it's confused forever.

This can happen even with a "perfect" solution that syncs time
instantly on the migration destination. See steps 1,2,3.

Unless you notify applications to invalidate their time reads,
I can't see a way to fix this.

Therefore, if you use VM migration in the first place, a certain amount of
timestamp accuracy error must be tolerated.

> How can you even assume that this is correct?

As noted above, even without a window of unsynchronized time (due to
delay for NTP to sync time), time reads can be stale.

> It is exactly the same problem as we had many years ago with hardware
> clocks suddenly stopping to tick which caused quite some stuff to go
> belly up.

Customers complained when it was 5 seconds off; now it's 0.1ms (and
people seem happy).

> In a proper suspend/resume scenario CLOCK_REALTIME/TAI are advanced
> (with a certain degree of accuracy) to compensate for the sleep time, so
> the other end of a communication is at least in the same ballpark, but
> not 50 seconds off.

It's 100ms off with migration, and that can be reduced further (customers
complained about 5 seconds but seem happy with 0.1ms).

> >> This features first, correctness later frenzy is insane and it better
> >> stops now before you pile even more crap on the existing steaming pile
> >> of insanities.
> >
> > Sure.
> 
> I wish that would be true. OS people - you should know that - are
> fighting forever with hardware people over feature madness and the
> attitude of 'we can fix that in software' which turns often enough out
> to be wrong.
> 
> Now sadly enough people who suffered from that madness work on
> virtualization and instead of trying to avoid the same problem they go
> off and make it even worse.

So you think it's important to reduce the 100ms offset?

> It's the same problem again as with hardware people. Not talking to the
> other people _before_ making uninformed assumptions and decisions.
> 
> We did it that way because big customer asked for it is not a
> justification for inflicting this on everybody else and thereby
> violating correctness. Works for me and my big customer is not a proof
> of correctness either.
> 
> It's another proof that this industry just "works" by chance.
> 
> Thanks,
> 
> tglx

OK, makes sense; then reducing the 0.1ms window even further
is a useful thing to do. What would be an acceptable
CLOCK_REALTIME accuracy error on migration?





Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-10 Thread Marcelo Tosatti
On Thu, Dec 10, 2020 at 10:48:10PM +0100, Thomas Gleixner wrote:
> On Thu, Dec 10 2020 at 12:26, Marcelo Tosatti wrote:
> > On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
> >> Marcelo,
> >> 
> >> On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
> >> > On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> >> >> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> >> >> > max_cycles overflow. Sent a message to Maxim describing it.
> >> >> 
> >> >> Truly helpful. Why the hell did you not talk to me when you ran into
> >> >> that the first time?
> >> >
> >> > Because 
> >> >
> >> > 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
> >> > is paused (so we wanted to stop guest clock when VM is paused anyway).
> >> 
> >> How is that supposed to work w/o the guest kernels help if you have to
> >> keep clock realtime up to date? 
> >
> > Upon VM resume, we notify NTP daemon in the guest to sync realtime
> > clock.
> 
> Brilliant. What happens if there is no NTP daemon? What happens if the
> NTP daemon is not part of the virt orchestration magic and cannot be
> notified, then it will notice the time jump after the next update
> interval.
> 
> What about correctness?
> 
> ALL CLOCK_* stop and resume when the VM is resumed at the point where
> they stopped.
> 
> So up to the point where NTP catches up and corrects clock realtime and
> TAI other processes can observe that time jumped in the outside world,
> e.g. via a network packet or whatever, but there is no reason why time
> should have jumped outside vs. the local one.
> 
> You really all live in a seperate universe creating your own rules how
> things which other people work hard on to get it correct can be screwed
> over.

1. T = read timestamp.
2. migrate (VM stops for a certain period).
3. use timestamp T.

> Of course this all is nowhere documented in detail. At least a quick
> search with about 10 different keyword combinations revealed absolutely
> nothing.
> 
> This features first, correctness later frenzy is insane and it better
> stops now before you pile even more crap on the existing steaming pile
> of insanities.

Sure.



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-10 Thread Marcelo Tosatti
On Wed, Dec 09, 2020 at 09:58:23PM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Wed, Dec 09 2020 at 13:34, Marcelo Tosatti wrote:
> > On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> >> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> >> > max_cycles overflow. Sent a message to Maxim describing it.
> >> 
> >> Truly helpful. Why the hell did you not talk to me when you ran into
> >> that the first time?
> >
> > Because 
> >
> > 1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
> > is paused (so we wanted to stop guest clock when VM is paused anyway).
> 
> How is that supposed to work w/o the guest kernels help if you have to
> keep clock realtime up to date? 

Upon VM resume, we notify the NTP daemon in the guest to sync the
realtime clock.
> 
> > 2) The solution to inject NMIs to the guest seemed overly
> > complicated.
> 
> Why do you need NMIs?
> 
> All you need is a way to communicate to the guest that it should prepare
> for clock madness to happen. Whether that's an IPI or a bit in a
> hyperpage which gets checked during the update of the guest timekeeping
> does not matter at all.
> 
> But you certainly do not need an NMI because there is nothing useful you
> can do within an NMI.
> 
> Thanks,
> 
> tglx



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-09 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 10:33:15PM +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 15:11, Marcelo Tosatti wrote:
> > On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
> >> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> >> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> >> >> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> >> >> > value
> >> >> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> >> >> > vCPU.
> >> >> > +
> >> >> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> >> >> > +KVM will adjust the guest TSC value by the time that passed since 
> >> >> > the moment
> >> >> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> >> >> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> >> >> 
> >> >> This introduces the wraparound bug in Linux timekeeping, doesnt it?
> >> 
> >> Which bug?
> >
> > max_cycles overflow. Sent a message to Maxim describing it.
> 
> Truly helpful. Why the hell did you not talk to me when you ran into
> that the first time?

Because 

1) Users wanted CLOCK_BOOTTIME to stop counting while the VM 
is paused (so we wanted to stop guest clock when VM is paused anyway).

2) The solution to inject NMIs to the guest seemed overly
complicated.

> >> For one I have no idea which bug you are talking about and if the bug is
> >> caused by the VMM then why would you "fix" it in the guest kernel.
> >
> > 1) Stop guest, save TSC value of cpu-0 = V.
> > 2) Wait for some amount of time = W.
> > 3) Start guest, load TSC value with V+W.
> >
> > Can cause an overflow on Linux timekeeping.
> 
> Yes, because you violate the basic assumption which Linux timekeeping
> makes. See the other mail in this thread.
> 
> Thanks,
> 
> tglx



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 06:25:13PM +0200, Maxim Levitsky wrote:
> On Tue, 2020-12-08 at 17:02 +0100, Thomas Gleixner wrote:
> > On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > > > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> > > > > value
> > > > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> > > > > vCPU.
> > > > > +
> > > > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > > > +KVM will adjust the guest TSC value by the time that passed since 
> > > > > the moment
> > > > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > > > 
> > > > This introduces the wraparound bug in Linux timekeeping, doesnt it?
> > 
> > Which bug?
> > 
> > > It does.
> > > Could you prepare a reproducer for this bug so I get a better idea about
> > > what are you talking about?
> > > 
> > > I assume you need very long (like days worth) jump to trigger this bug
> > > and for such case we can either work around it in qemu / kernel 
> > > or fix it in the guest kernel and I strongly prefer the latter.
> > > 
> > > Thomas, what do you think about it?
> > 
> > For one I have no idea which bug you are talking about and if the bug is
> > caused by the VMM then why would you "fix" it in the guest kernel.
> 
> The "bug" is that if VMM moves a hardware time counter (tsc or anything else) 
> forward by large enough value in one go, 
> then the guest kernel will supposingly have an overflow in the time code.
> I don't consider this to be a buggy VMM behavior, but rather a kernel
> bug that should be fixed (if this bug actually exists)

It exists.

> Purely in theory this can even happen on real hardware if for example SMM 
> handler
> blocks a CPU from running for a long duration, or hardware debugging
> interface does, or some other hardware transparent sleep mechanism kicks in
> and blocks a CPU from running.
> (We do handle this gracefully for S3/S4)
> 
> > 
> > Aside of that I think I made it pretty clear what the right thing to do
> > is.
> 
> This is orthogonal to this issue of the 'bug'. 
> Here we are not talking about per-vcpu TSC offsets, something that I said 
> that I do agree with you that it would be very nice to get rid of.
>  
> We are talking about the fact that TSC can jump forward by arbitrary large
> value if the migration took arbitrary amount of time, which 
> (assuming that the bug is real) can crash the guest kernel.

QE reproduced it.

> This will happen even if we use per VM global tsc offset.
> 
> So what do you think?
> 
> Best regards,
>   Maxim Levitsky
> 
> > 
> > Thanks,
> > 
> > tglx
> > 
> 



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 05:02:07PM +0100, Thomas Gleixner wrote:
> On Tue, Dec 08 2020 at 16:50, Maxim Levitsky wrote:
> > On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> >> > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> >> > value
> >> > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> >> > vCPU.
> >> > +
> >> > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> >> > +KVM will adjust the guest TSC value by the time that passed since the 
> >> > moment
> >> > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> >> > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> >> 
> >> This introduces the wraparound bug in Linux timekeeping, doesn't it?
> 
> Which bug?

max_cycles overflow. Sent a message to Maxim describing it.

> 
> > It does.
> > Could you prepare a reproducer for this bug so I get a better idea about
> > what are you talking about?
> >
> > I assume you need very long (like days worth) jump to trigger this bug
> > and for such case we can either work around it in qemu / kernel 
> > or fix it in the guest kernel and I strongly prefer the latter.
> >
> > Thomas, what do you think about it?
> 
> For one I have no idea which bug you are talking about and if the bug is
> caused by the VMM then why would you "fix" it in the guest kernel.

1) Stop guest, save TSC value of cpu-0 = V.
2) Wait for some amount of time = W.
3) Start guest, load TSC value with V+W.

Can cause an overflow in Linux timekeeping.
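To make the arithmetic concrete, here is a minimal user-space sketch of why
such a jump overflows: the timekeeping fast path converts a cycle delta to
nanoseconds as (delta * mult) >> shift in 64-bit arithmetic, so a delta
larger than ULLONG_MAX / mult silently wraps. The mult/shift values below
are illustrative (roughly 0.4 ns/cycle for an assumed 2.5 GHz TSC), not the
ones the kernel actually computed, and __uint128_t is a GCC/Clang extension
used only as a reference:

#include <stdio.h>
#include <stdint.h>

static uint64_t cyc2ns_64(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (delta * mult) >> shift;		/* 64-bit math, can wrap */
}

static uint64_t cyc2ns_128(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((__uint128_t)delta * mult) >> shift);	/* reference */
}

int main(void)
{
	/* Illustrative values only: ~0.4 ns/cycle for a 2.5 GHz TSC. */
	uint32_t mult = 6710886, shift = 24;
	uint64_t tsc_hz = 2500000000ULL;
	uint64_t max_cycles = UINT64_MAX / mult;
	uint64_t delta = tsc_hz * 3600;		/* guest stopped for 1 hour */

	printf("max_cycles ~ %llu (~%llu s of stopped guest time)\n",
	       (unsigned long long)max_cycles,
	       (unsigned long long)(max_cycles / tsc_hz));
	printf("64-bit  conversion: %llu ns\n",
	       (unsigned long long)cyc2ns_64(delta, mult, shift));
	printf("128-bit reference:  %llu ns\n",
	       (unsigned long long)cyc2ns_128(delta, mult, shift));
	return 0;
}

The real clocksource code bounds this with ->max_cycles, which is exactly
the timekeeping_check_update() warning referenced elsewhere in this thread.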

> Aside of that I think I made it pretty clear what the right thing to do
> is.

Sure: the notion of a "unique TSC offset" already exists (it is detected
by the TSC write logic, though it is not explicit in the interface).

But AFAIK it works pretty well.

Exposing a single TSC value at the interface level seems alright to
me...
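For readers unfamiliar with that detection: roughly (and this is a
from-memory sketch, not the actual arch/x86/kvm/x86.c code; names and
details differ), KVM treats a vCPU TSC write as belonging to the existing
per-VM generation when it lands within about one second of where that
generation would be by now, and only then reuses the existing offset:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-alone helper mirroring the matching heuristic;
 * the real code works on kvm/vcpu state rather than plain arguments.
 */
static bool tsc_write_matches_generation(uint64_t last_tsc_write,
					 uint32_t tsc_khz,
					 uint64_t elapsed_ns,
					 uint64_t new_tsc)
{
	/* Where the established generation expects the TSC to be by now. */
	uint64_t tsc_exp = last_tsc_write +
			   (uint64_t)((__uint128_t)elapsed_ns * tsc_khz / 1000000u);
	uint64_t slack = (uint64_t)tsc_khz * 1000u;	/* ~1 second of cycles */
	uint64_t lo = tsc_exp > slack ? tsc_exp - slack : 0;

	return new_tsc >= lo && new_tsc <= tsc_exp + slack;
}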



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Tue, Dec 08, 2020 at 04:50:53PM +0200, Maxim Levitsky wrote:
> On Mon, 2020-12-07 at 20:29 -0300, Marcelo Tosatti wrote:
> > On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
> > > These two new ioctls allow to more precisely capture and
> > > restore guest's TSC state.
> > > 
> > > Both ioctls are meant to be used to accurately migrate guest TSC
> > > even when there is a significant downtime during the migration.
> > > 
> > > Suggested-by: Paolo Bonzini 
> > > Signed-off-by: Maxim Levitsky 
> > > ---
> > >  Documentation/virt/kvm/api.rst | 65 ++
> > >  arch/x86/kvm/x86.c | 73 ++
> > >  include/uapi/linux/kvm.h   | 15 +++
> > >  3 files changed, 153 insertions(+)
> > > 
> > > diff --git a/Documentation/virt/kvm/api.rst 
> > > b/Documentation/virt/kvm/api.rst
> > > index 70254eaa5229f..ebecfe4b414ce 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is 
> > > invoked, the vCPU may
> > >  experience inconsistent filtering behavior on MSR accesses.
> > >  
> > >  
> > > +4.127 KVM_GET_TSC_STATE
> > > +
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> > > +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> > > +  struct kvm_tsc_state {
> > > + __u32 flags;
> > > + __u64 nsec;
> > > + __u64 tsc;
> > > + __u64 tsc_adjust;
> > > +  };
> > > +
> > > +flags values for ``struct kvm_tsc_state``:
> > > +
> > > +``KVM_TSC_STATE_TIMESTAMP_VALID``
> > > +
> > > +  ``nsec`` contains nanoseconds from unix epoch.
> > > +Always set by KVM_GET_TSC_STATE, might be omitted in 
> > > KVM_SET_TSC_STATE
> > > +
> > > +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> > > +
> > > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> > > +
> > > +
> > > +This ioctl allows the user space to read the guest's 
> > > IA32_TSC,IA32_TSC_ADJUST,
> > > +and the current value of host's CLOCK_REALTIME clock in nanoseconds 
> > > since unix
> > > +epoch.
> > 
> > Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as
> > a time base, but for TSC it should not be necessary.
> 
> 
> CLOCK_REALTIME is used as an absolute time reference that should match
> on both computers. I could have used CLOCK_TAI instead for example.
> 
> The reference allows to account for time passed between saving and restoring
> the TSC as explained above.

As mentioned, we don't want this due to the overflow.

Again, I think the higher priority is to allow enabling invariant TSC
by default (to disable kvmclock).

> > > +
> > > +
> > > +4.128 KVM_SET_TSC_STATE
> > > +
> > > +
> > > +:Capability: KVM_CAP_PRECISE_TSC
> > > +:Architectures: x86
> > > +:Type: vcpu ioctl
> > > +:Parameters: struct kvm_tsc_state
> > > +:Returns: 0 on success, < 0 on error
> > > +
> > > +::
> > > +
> > > +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST 
> > > value
> > > +from the state obtained in the past by KVM_GET_TSC_STATE on the same 
> > > vCPU.
> > > +
> > > +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> > > +KVM will adjust the guest TSC value by the time that passed since the 
> > > moment
> > > +CLOCK_REALTIME timestamp was saved in the struct and current value of
> > > +CLOCK_REALTIME, and set the guest's TSC to the new value.
> > 
> > This introduces the wraparound bug in Linux timekeeping, doesn't it?
> 
> It does.
> Could you prepare a reproducer for this bug so I get a better idea about
> what are you talking about?

Enable CONFIG_DEBUG_TIMEKEEPING and check what max_cycles is for the TSC
clocksource:

#ifdef CONFIG_DEBUG_TIMEKEEPING
#define WARNING_FREQ (HZ*300) /* 5 minute rate-limiting */

static void timekeeping_check_update(struct timekeeper *tk, u64 offset)
{

u64 max_cycles = tk->tkr_mono.clock->max_cycles;
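	const char *name = tk->tkr_mono.clock->name;

	/*
	 * Paraphrased continuation (from memory, not verbatim kernel code):
	 * warn when a single update spans more cycles than the clocksource
	 * can convert to nanoseconds without 64-bit overflow.
	 */
	if (offset > max_cycles)
		printk_deferred("WARNING: timekeeping: cycle offset (%lld) exceeds the '%s' clocksource max_cycles (%lld): time overflow danger\n",
				offset, name, max_cycles);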

Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Mon, Dec 07, 2020 at 10:04:45AM -0800, Andy Lutomirski wrote:
> 
> > On Dec 7, 2020, at 9:00 AM, Maxim Levitsky  wrote:
> > 
> > On Mon, 2020-12-07 at 08:53 -0800, Andy Lutomirski wrote:
>  On Dec 7, 2020, at 8:38 AM, Thomas Gleixner  wrote:
> >>> 
> >>> On Mon, Dec 07 2020 at 14:16, Maxim Levitsky wrote:
> > On Sun, 2020-12-06 at 17:19 +0100, Thomas Gleixner wrote:
> > From a timekeeping POV and the guests expectation of TSC this is
> > fundamentally wrong:
> > 
> > tscguest = scaled(hosttsc) + offset
> > 
> > The TSC has to be viewed systemwide and not per CPU. It's systemwide
> > used for timekeeping and for that to work it has to be synchronized. 
> > 
> > Why would this be different on virt? Just because it's virt or what? 
> > 
> > Migration is a guest wide thing and you're not migrating single vCPUs.
> > 
> > This hackery just papers over he underlying design fail that KVM looks
> > at the TSC per vCPU which is the root cause and that needs to be fixed.
>  
>  I don't disagree with you.
>  As far as I know the main reasons that kvm tracks TSC per guest are
>  
>  1. cases when host tsc is not stable 
>  (hopefully rare now, and I don't mind making
>  the new API just refuse to work when this is detected, and revert to old 
>  way
>  of doing things).
> >>> 
> >>> That's a trainwreck to begin with and I really would just not support it
> >>> for anything new which aims to be more precise and correct.  TSC has
> >>> become pretty reliable over the years.
> >>> 
>  2. (theoretical) ability of the guest to introduce per core tsc offfset
>  by either using TSC_ADJUST (for which I got recently an idea to stop
>  advertising this feature to the guest), or writing TSC directly which
>  is allowed by Intel's PRM:
> >>> 
> >>> For anything halfways modern the write to TSC is reflected in TSC_ADJUST
> >>> which means you get the precise offset.
> >>> 
> >>> The general principle still applies from a system POV.
> >>> 
> >>>TSC base (systemwide view) - The sane case
> >>> 
> >>>TSC CPU  = TSC base + TSC_ADJUST
> >>> 
> >>> The guest TSC base is a per guest constant offset to the host TSC.
> >>> 
> >>>TSC guest base = TSC host base + guest base offset
> >>> 
> >>> If the guest want's this different per vCPU by writing to the MSR or to
> >>> TSC_ADJUST then you still can have a per vCPU offset in TSC_ADJUST which
> >>> is the offset to the TSC base of the guest.
> >> 
> >> How about, if the guest wants to write TSC_ADJUST, it can turn off all 
> >> paravirt features and keep both pieces?
> >> 
> > 
> > This is one of the things I had in mind recently.
> > 
> > Even better, we can stop advertising TSC_ADJUST in CPUID to the guest 
> > and forbid it from writing it at all.
> 
> Seems reasonable to me.
> 
> It also seems okay for some MSRs to stop working after the guest enabled new 
> PV timekeeping.
> 
> I do have a feature request, though: IMO it would be quite nifty if the new 
> kvmclock structure could also expose NTP corrections. In other words, if you 
> could expose enough info to calculate CLOCK_MONOTONIC_RAW, CLOCK_MONOTONIC, 
> and CLOCK_REALTIME, then we could have paravirt NTP.

Hi Andy,

Any reason why drivers/ptp/ptp_kvm.c does not work for you?
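As a minimal illustration (assuming the guest loads the ptp_kvm driver and
it shows up as /dev/ptp0, which is an assumption, not something stated in
this thread), the host clock it exposes can be read through the dynamic
POSIX clock interface; chrony or phc2sys can consume the same device:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Dynamic POSIX clock id formula for an open PTP chardev fd. */
#define FD_TO_CLOCKID(fd) ((~(clockid_t)(fd) << 3) | 3)

int main(void)
{
	struct timespec ts;
	int fd = open("/dev/ptp0", O_RDONLY);	/* assumed to be kvm_ptp */

	if (fd < 0) {
		perror("open /dev/ptp0");
		return 1;
	}
	if (clock_gettime(FD_TO_CLOCKID(fd), &ts)) {
		perror("clock_gettime");
		close(fd);
		return 1;
	}
	printf("host time via kvm_ptp: %lld.%09ld\n",
	       (long long)ts.tv_sec, ts.tv_nsec);
	close(fd);
	return 0;
}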

> Bonus points if whatever you do for CLOCK_REALTIME also exposes leap seconds 
> in a race free way :). But I suppose that just exposing TAI and letting the 
> guest deal with the TAI - UTC offset itself would get the job done just fine.



Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE

2020-12-08 Thread Marcelo Tosatti
On Thu, Dec 03, 2020 at 07:11:16PM +0200, Maxim Levitsky wrote:
> These two new ioctls allow to more precisely capture and
> restore guest's TSC state.
> 
> Both ioctls are meant to be used to accurately migrate guest TSC
> even when there is a significant downtime during the migration.
> 
> Suggested-by: Paolo Bonzini 
> Signed-off-by: Maxim Levitsky 
> ---
>  Documentation/virt/kvm/api.rst | 65 ++
>  arch/x86/kvm/x86.c | 73 ++
>  include/uapi/linux/kvm.h   | 15 +++
>  3 files changed, 153 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 70254eaa5229f..ebecfe4b414ce 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -4826,6 +4826,71 @@ If a vCPU is in running state while this ioctl is 
> invoked, the vCPU may
>  experience inconsistent filtering behavior on MSR accesses.
>  
>  
> +4.127 KVM_GET_TSC_STATE
> +
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +  #define KVM_TSC_STATE_TIMESTAMP_VALID 1
> +  #define KVM_TSC_STATE_TSC_ADJUST_VALID 2
> +  struct kvm_tsc_state {
> + __u32 flags;
> + __u64 nsec;
> + __u64 tsc;
> + __u64 tsc_adjust;
> +  };
> +
> +flags values for ``struct kvm_tsc_state``:
> +
> +``KVM_TSC_STATE_TIMESTAMP_VALID``
> +
> +  ``nsec`` contains nanoseconds from unix epoch.
> +Always set by KVM_GET_TSC_STATE, might be omitted in KVM_SET_TSC_STATE
> +
> +``KVM_TSC_STATE_TSC_ADJUST_VALID``
> +
> +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> +
> +
> +This ioctl allows the user space to read the guest's 
> IA32_TSC,IA32_TSC_ADJUST,
> +and the current value of host's CLOCK_REALTIME clock in nanoseconds since 
> unix
> +epoch.

Why is CLOCK_REALTIME necessary at all? kvmclock uses the host clock as
a time base, but for TSC it should not be necessary.

> +
> +
> +4.128 KVM_SET_TSC_STATE
> +
> +
> +:Capability: KVM_CAP_PRECISE_TSC
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_tsc_state
> +:Returns: 0 on success, < 0 on error
> +
> +::
> +
> +This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
> +from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
> +
> +If 'KVM_TSC_STATE_TIMESTAMP_VALID' is set in flags,
> +KVM will adjust the guest TSC value by the time that passed since the moment
> +CLOCK_REALTIME timestamp was saved in the struct and current value of
> +CLOCK_REALTIME, and set the guest's TSC to the new value.

This introduces the wraparound bug in Linux timekeeping, doesn't it?

> +
> +Otherwise KVM will set the guest TSC value to the exact value as given
> +in the struct.
> +
> +if KVM_TSC_STATE_TSC_ADJUST_VALID is set, and guest supports 
> IA32_MSR_TSC_ADJUST,
> +then its value will be set to the given value from the struct.
> +
> +It is assumed that either both ioctls will be run on the same machine,
> +or that source and destination machines have synchronized clocks.



>  5. The kvm_run structure
>  
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a3fdc16cfd6f3..9b8a2fe3a2398 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct 
> timespec64 *ts,
>  
>   return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
>  }
> +
> +
> +static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
> +{
> + struct timespec64 ts;
> +
> + if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
> + *walltime_ns = timespec64_to_ns(&ts);
> + return;
> + }
> +
> + *host_tsc = rdtsc();
> + *walltime_ns = ktime_get_real_ns();
> +}
> +
>  #endif
>  
>  /*
> @@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>   case KVM_CAP_X86_USER_SPACE_MSR:
>   case KVM_CAP_X86_MSR_FILTER:
>   case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
> +#ifdef CONFIG_X86_64
> + case KVM_CAP_PRECISE_TSC:
> +#endif
>   r = 1;
>   break;
>   case KVM_CAP_SYNC_REGS:
> @@ -4999,6 +5017,61 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>   case KVM_GET_SUPPORTED_HV_CPUID:
>   r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
>   break;
> +#ifdef CONFIG_X86_64
> + case KVM_GET_TSC_STATE: {
> + struct kvm_tsc_state __user *user_tsc_state = argp;
> + u64 host_tsc;
> +
> + struct kvm_tsc_state tsc_state = {
> + .flags = KVM_TSC_STATE_TIMESTAMP_VALID
> + };
> +
> + kvm_get_walltime(&tsc_state.nsec, &host_tsc);
> + tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
> +
> + if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> +  

Re: [PATCH v2 0/3] RFC: Precise TSC migration

2020-12-08 Thread Marcelo Tosatti
On Thu, Dec 03, 2020 at 07:11:15PM +0200, Maxim Levitsky wrote:
> Hi!
> 
> This is the second version of the work to make TSC migration more accurate,
> as was defined by Paulo at:
> https://www.spinics.net/lists/kvm/msg225525.html

Maxim,

Can you please describe the practical problem that is being fixed,
preferably with instructions on how to reproduce it?

> I omitted most of the semi-offtopic points I raised related to TSC
> in the previous RFC where we can continue the discussion.
> 
> I do want to raise another thing that I almost forgot.
> 
> On AMD systems, the Linux kernel will mark the guest tsc as
> unstable unless invtsc is set which is set on recent AMD
> hardware.
> 
> Take a look at 'unsynchronized_tsc()' to verify this.
> 
> This is another thing that IMHO should be fixed at least when
> running under KVM.
> 
> Note that I forgot to mention that
> X86_FEATURE_TSC_RELIABLE also short-circuits this code,
> thus giving another reason to enable it under KVM.
> 
> Changes from V1:
> 
> - added KVM_TSC_STATE_TIMESTAMP_VALID instead of testing ns == 0
> - allow diff < 0, because it is still better that capping it to 0
> - updated tsc_msr_test unit test to cover this feature
> - refactoring
> 
> Patches to enable this feature in qemu are in the process of
> being sent to qemu-devel mailing list.
> 
> Best regards,
> Maxim Levitsky
> 
> Maxim Levitsky (3):
>   KVM: x86: implement KVM_{GET|SET}_TSC_STATE
>   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
>   kvm/selftests: update tsc_msrs_test to cover
> KVM_X86_QUIRK_TSC_HOST_ACCESS
> 
>  Documentation/virt/kvm/api.rst| 65 +
>  arch/x86/include/uapi/asm/kvm.h   |  1 +
>  arch/x86/kvm/x86.c| 92 ++-
>  include/uapi/linux/kvm.h  | 15 +++
>  .../selftests/kvm/x86_64/tsc_msrs_test.c  | 79 ++--
>  5 files changed, 237 insertions(+), 15 deletions(-)
> 
> -- 
> 2.26.2
> 


