Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset

2012-11-14 Thread Yoshihiro YUNOMAE

Thank you for commenting on my patch set.

(2012/11/14 11:31), Steven Rostedt wrote:

On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:

On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt rost...@goodmis.org wrote:

On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:


To merge the data as in the previous pattern, we apply this patch set. Then we can
get the TSC offset of the guest as follows:

$ dmesg | grep kvm
[   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
                    |                   |                        |
                   PID             TSC offset                    |
                                                 HOST TSC value--+



Using printk to export something like this is IMO a nasty hack.

Can't we create a /sys or /proc file to export the same thing?


Since the value changes over the course of the trace, and seems to be
part of the context of the trace, I think I'd include it as a
tracepoint.



I'm fine with that too.


Using a tracepoint is a nice idea, but there is one problem. The point
of our discussion is that the event in which the TSC offset changes
occurs infrequently, yet the buffer must keep that event's data.

There are two ways to use a tracepoint. First, we define a new
tracepoint for TSC offset changes. This is simple and the overhead will
be low. However, because this event occurs infrequently, the trace
event stored in the buffer will be overwritten by other trace events.
Second, we add the TSC offset information to a tracepoint that occurs
frequently. For example, assume the TSC offset information is added to
the arguments of trace_kvm_exit(). By adding the information to those
arguments, we can avoid the situation where the TSC offset information
is overwritten by other events. However, since the TSC offset rarely
changes, the same information is output many times and almost all of
that data is wasted. Therefore, using a tracepoint alone is not a good
idea.

So, I suggest a hybrid method: record TSC offset change events and read
the last TSC offset from procfs when collecting the trace data.
Concretely, the method is as follows:
 1. Enable the tracepoint for TSC offset changes and record the values
    before and after each change
 2. Start tracing
 3. Stop tracing
 4. Collect the trace data and read /proc/<pid>/kvm/*
 5. Check whether any trace event recording the two TSC offsets exists
    in the trace data
    if (existing) => use the trace event (step 6)
    else          => use /proc/<pid>/kvm/* (step 7)
 6. Apply two TSC offsets of the trace event to the trace data and
sort the trace data
  (Ex.)
* = tracepoint of changing TSC offset
. = another trace event

      [START]...*...[END]
      <--------><-------->
       previous   current
      TSC offset TSC offset

 7. Apply the TSC offset from /proc/<pid>/kvm/* to the trace data and
    sort the trace data
   (Ex.)
. = another trace event(not tracepoint of changing TSC offset)

      [START].........[END]
      <--------------->
           current
         TSC offset
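
To make the offset application concrete, here is a minimal sketch of the
conversion a merge tool could perform, assuming the usual VMX relation
guest_tsc = host_tsc + tsc_offset with unsigned 64-bit wraparound (the
function name is only illustrative):

#include <stdint.h>

/*
 * Map a guest-side TSC timestamp back into the host TSC domain so guest
 * and host trace events can be sorted on a single time line.  The dmesg
 * value 18446743360465545001 above is simply a negative offset printed
 * as an unsigned 64-bit number, so plain modular subtraction works.
 */
static inline uint64_t guest_tsc_to_host(uint64_t guest_tsc,
                                         uint64_t tsc_offset)
{
        return guest_tsc - tsc_offset;  /* wraps modulo 2^64 */
}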

Thanks,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com




Re: [PATCH v3 2/2] KVM: make crash_clear_loaded_vmcss valid when loading kvm_intel module

2012-11-14 Thread zhangyanfei
On 2012-11-14 05:22, Marcelo Tosatti wrote:
 On Thu, Nov 01, 2012 at 01:55:04PM +0800, zhangyanfei wrote:
 On 2012-10-31 17:01, Hatayama, Daisuke wrote:


 -Original Message-
 From: kexec-boun...@lists.infradead.org
 [mailto:kexec-boun...@lists.infradead.org] On Behalf Of zhangyanfei
 Sent: Wednesday, October 31, 2012 12:34 PM
 To: x...@kernel.org; ke...@lists.infradead.org; Avi Kivity; Marcelo
 Tosatti
 Cc: linux-ker...@vger.kernel.org; kvm@vger.kernel.org
 Subject: [PATCH v3 2/2] KVM: make crash_clear_loaded_vmcss valid when
 loading kvm_intel module

 Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com

 [...]

 @@ -7230,6 +7231,10 @@ static int __init vmx_init(void)
if (r)
goto out3;

 +#ifdef CONFIG_KEXEC
 +  crash_clear_loaded_vmcss = vmclear_local_loaded_vmcss;
 +#endif
 +

 The assignment here cannot cover the case where an NMI is initiated after VMX
 is turned on in kvm_init but before vmclear_local_loaded_vmcss is assigned;
 it is rare, but it can happen.


 By saying "VMX is on in kvm_init", do you mean kvm_init enables the VMX feature
 on the logical processor?
 No, KVM only enables the VMX feature once there is a vcpu to be created.

 I think it makes no difference whether this assignment is before or after
 kvm_init, because the VMCS linked list must be empty before vmx_init finishes.
 
 The list is not initialized before hardware_enable(), though. Should
 move the assignment after that.
 
 Also, it is possible that the loaded_vmcss_on_cpu list is being modified
 _while_ crash executes say via NMI, correct? If that is the case, better
 flag that the list is under manipulation so the vmclear can be skipped.
 

Thanks for your comments.
In the new patchset, I didn't move the crash_clear_loaded_vmcss assignment.
Instead, I added a new percpu variable, vmclear_skipped, to cover all the cases:
1. Before the loaded_vmcss_on_cpu list is initialized, vmclear_skipped is 1, which
   means that if the machine crashes and kdump runs, crash_clear_loaded_vmcss
   will not be called.
2. While the loaded_vmcss_on_cpu list is under manipulation, vmclear_skipped is
   set to 1; after the manipulation finishes, it is set back to 0.
3. After all loaded vmcss are vmcleared, vmclear_skipped is set to 1, so we
   need not vmclear the loaded vmcss again in the kdump path.
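
Condensed into code, the intended kdump-side check looks like this (a
sketch that mirrors what the v4 patches posted below implement; the helper
name here is only illustrative):

/* Runs on each cpu in the crash/kdump path, before VMXOFF. */
static void crash_vmclear_if_safe(void)
{
        int cpu = raw_smp_processor_id();

        /*
         * vmclear_skipped == 1 covers all three states above: the list is
         * not initialized yet, it is being modified, or it has already
         * been vmcleared.
         */
        if (!per_cpu(vmclear_skipped, cpu) && crash_clear_loaded_vmcss)
                crash_clear_loaded_vmcss();
}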

Please refer to the new version of the patchset I sent. If you have any 
suggestions, that'll be helpful.

Thanks
Zhang



[PATCH v4 0/2] x86: clear vmcss on all cpus when doing kdump if necessary

2012-11-14 Thread zhangyanfei
Currently, kdump just makes all the logical processors leave VMX operation by
executing the VMXOFF instruction, so any VMCSs active on those logical processors
may be corrupted. But sometimes we need the VMCSs to debug guest images contained
in the host vmcore. To prevent the corruption, we should VMCLEAR the VMCSs before
executing the VMXOFF instruction.

This patch set provides a way to VMCLEAR the VMCSs related to guests on all cpus
before executing VMXOFF when doing kdump. This ensures that the VMCSs in the
vmcore are up to date and not corrupted.

Changelog from v3 to v4:
1. add a new percpu variable vmclear_skipped to skip
   vmclear in kdump in some conditions.

Changelog from v2 to v3:
1. remove unnecessary conditions in function
   cpu_emergency_clear_loaded_vmcss as Marcelo suggested.

Changelog from v1 to v2:
1. remove the sysctl and clear VMCSs unconditionally.

Zhang Yanfei (2):
  x86/kexec: VMCLEAR vmcss on all cpus if necessary
  KVM: set/unset crash_clear_loaded_vmcss and vmclear_skipped in
kvm_intel module

 arch/x86/include/asm/kexec.h |3 +++ 
 arch/x86/kernel/crash.c  |   32 
 arch/x86/kvm/vmx.c   |   32 
 3 files changed, 67 insertions(+), 0 deletions(-)




[PATCH v4 1/2] x86/kexec: VMCLEAR vmcss on all cpus if necessary

2012-11-14 Thread zhangyanfei
crash_clear_loaded_vmcss is added to VMCLEAR the vmcss loaded on all
cpus; the function pointer is made valid when the kvm_intel module is
loaded.
The percpu variable vmclear_skipped is added to flag the case where the
loaded_vmcss_on_cpu list is being modified while the machine crashes and
kdump runs, so that the vmclear can be skipped.

Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
---
 arch/x86/include/asm/kexec.h |3 +++
 arch/x86/kernel/crash.c  |   32 
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 317ff17..d892211 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -163,6 +163,9 @@ struct kimage_arch {
 };
 #endif
 
+extern void (*crash_clear_loaded_vmcss)(void);
+DECLARE_PER_CPU(int, vmclear_skipped);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 13ad899..b9f264e 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -16,6 +16,7 @@
 #include <linux/delay.h>
 #include <linux/elf.h>
 #include <linux/elfcore.h>
+#include <linux/module.h>
 
 #include <asm/processor.h>
 #include <asm/hardirq.h>
@@ -30,6 +31,27 @@
 
 int in_crash_kexec;
 
+/*
+ * This is used to VMCLEAR vmcss loaded on all
+ * cpus. And when loading kvm_intel module, the
+ * function pointer will be made valid.
+ */
+void (*crash_clear_loaded_vmcss)(void) = NULL;
+EXPORT_SYMBOL_GPL(crash_clear_loaded_vmcss);
+
+DEFINE_PER_CPU(int, vmclear_skipped) = 1;
+EXPORT_SYMBOL_GPL(vmclear_skipped);
+
+static void cpu_emergency_clear_loaded_vmcss(void)
+{
+   int cpu = raw_smp_processor_id();
+   int skipped;
+
+   skipped = per_cpu(vmclear_skipped, cpu);
+   if (!skipped && crash_clear_loaded_vmcss)
+   crash_clear_loaded_vmcss();
+}
+
 #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
 
 static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
@@ -46,6 +68,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 #endif
crash_save_cpu(regs, cpu);
 
+   /*
+* VMCLEAR vmcss loaded on all cpus if needed.
+*/
+   cpu_emergency_clear_loaded_vmcss();
+
/* Disable VMX or SVM if needed.
 *
 * We need to disable virtualization on all CPUs.
@@ -88,6 +115,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 
kdump_nmi_shootdown_cpus();
 
+   /*
+* VMCLEAR vmcss loaded on this cpu if needed.
+*/
+   cpu_emergency_clear_loaded_vmcss();
+
/* Booting kdump kernel with VMX or SVM enabled won't work,
 * because (among other limitations) we can't disable paging
 * with the virt flags.
-- 
1.7.1



[PATCH v4 2/2] KVM: set/unset crash_clear_loaded_vmcss and vmclear_skipped in kvm_intel module

2012-11-14 Thread zhangyanfei
Signed-off-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
---
 arch/x86/kvm/vmx.c |   32 
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4ff0ab9..029ec7b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -41,6 +41,7 @@
 #include <asm/i387.h>
 #include <asm/xcr.h>
 #include <asm/perf_event.h>
+#include <asm/kexec.h>
 
 #include "trace.h"
 
@@ -963,6 +964,20 @@ static void vmcs_load(struct vmcs *vmcs)
   vmcs, phys_addr);
 }
 
+static inline void enable_vmclear_in_kdump(int cpu)
+{
+#ifdef CONFIG_KEXEC
+   per_cpu(vmclear_skipped, cpu) = 0;
+#endif
+}
+
+static inline void disable_vmclear_in_kdump(int cpu)
+{
+#ifdef CONFIG_KEXEC
+   per_cpu(vmclear_skipped, cpu) = 1;
+#endif
+}
+
 static void __loaded_vmcs_clear(void *arg)
 {
struct loaded_vmcs *loaded_vmcs = arg;
@@ -972,8 +987,10 @@ static void __loaded_vmcs_clear(void *arg)
return; /* vcpu migration can race with cpu offline */
 	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
+	disable_vmclear_in_kdump(cpu);
 	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
 	loaded_vmcs_init(loaded_vmcs);
+	enable_vmclear_in_kdump(cpu);
 }
 
 static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
@@ -1491,8 +1508,10 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
local_irq_disable();
+   disable_vmclear_in_kdump(cpu);
 	list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
 		 &per_cpu(loaded_vmcss_on_cpu, cpu));
+   enable_vmclear_in_kdump(cpu);
local_irq_enable();
 
/*
@@ -2302,6 +2321,9 @@ static int hardware_enable(void *garbage)
return -EBUSY;
 
 	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+
+   enable_vmclear_in_kdump(cpu);
+
rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
 
test_bits = FEATURE_CONTROL_LOCKED;
@@ -2333,6 +2355,8 @@ static void vmclear_local_loaded_vmcss(void)
 	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
 loaded_vmcss_on_cpu_link)
__loaded_vmcs_clear(v);
+
+   disable_vmclear_in_kdump(cpu);
 }
 
 
@@ -7230,6 +7254,10 @@ static int __init vmx_init(void)
if (r)
goto out3;
 
+#ifdef CONFIG_KEXEC
+   crash_clear_loaded_vmcss = vmclear_local_loaded_vmcss;
+#endif
+
vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
@@ -7265,6 +7293,10 @@ static void __exit vmx_exit(void)
free_page((unsigned long)vmx_io_bitmap_b);
free_page((unsigned long)vmx_io_bitmap_a);
 
+#ifdef CONFIG_KEXEC
+   crash_clear_loaded_vmcss = NULL;
+#endif
+
kvm_exit();
 }
 
-- 
1.7.1


Re: [patch 10/16] x86: vdso: pvclock gettime support

2012-11-14 Thread Gleb Natapov
On Wed, Oct 31, 2012 at 08:47:06PM -0200, Marcelo Tosatti wrote:
 Improve performance of time system calls when using Linux pvclock, 
 by reading time info from fixmap visible copy of pvclock data.
 
 Originally from Jeremy Fitzhardinge.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 Index: vsyscall/arch/x86/vdso/vclock_gettime.c
 ===
 --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
 +++ vsyscall/arch/x86/vdso/vclock_gettime.c
 @@ -22,6 +22,7 @@
  #include asm/hpet.h
  #include asm/unistd.h
  #include asm/io.h
 +#include asm/pvclock.h
  
  #define gtod (VVAR(vsyscall_gtod_data))
  
 @@ -62,6 +63,70 @@ static notrace cycle_t vread_hpet(void)
   return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
  }
  
 +#ifdef CONFIG_PARAVIRT_CLOCK
 +
 +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
 +{
 + const aligned_pvti_t *pvti_base;
 + int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
 + int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
 +
 + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx  PVCLOCK_FIXMAP_END);
 +
 + pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
 +
 + return pvti_base[offset].info;
 +}
 +
 +static notrace cycle_t vread_pvclock(int *mode)
 +{
 + const struct pvclock_vsyscall_time_info *pvti;
 + cycle_t ret;
 + u64 last;
 + u32 version;
 + u32 migrate_count;
 + u8 flags;
 + unsigned cpu, cpu1;
 +
 +
 + /*
 +  * When looping to get a consistent (time-info, tsc) pair, we
 +  * also need to deal with the possibility we can switch vcpus,
 +  * so make sure we always re-fetch time-info for the current vcpu.
 +  */
 + do {
 + cpu = __getcpu()  VGETCPU_CPU_MASK;
 + pvti = get_pvti(cpu);
 +
 + migrate_count = pvti-migrate_count;
 +
 + version = __pvclock_read_cycles(pvti-pvti, ret, flags);
 +
 + /*
 +  * Test we're still on the cpu as well as the version.
 +  * We could have been migrated just after the first
 +  * vgetcpu but before fetching the version, so we
 +  * wouldn't notice a version change.
 +  */
 + cpu1 = __getcpu()  VGETCPU_CPU_MASK;
 + } while (unlikely(cpu != cpu1 ||
 +   (pvti-pvti.version  1) ||
 +   pvti-pvti.version != version ||
 +   pvti-migrate_count != migrate_count));
 +
We can put vcpu id into higher bits of pvti.version. This will
save a couple of cycles by getting rid of __getcpu() calls.

 + if (unlikely(!(flags  PVCLOCK_TSC_STABLE_BIT)))
 + *mode = VCLOCK_NONE;
 +
 + /* refer to tsc.c read_tsc() comment for rationale */
 + last = VVAR(vsyscall_gtod_data).clock.cycle_last;
 +
 + if (likely(ret = last))
 + return ret;
 +
 + return last;
 +}
 +#endif
 +
  notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
  {
   long ret;
 @@ -80,7 +145,7 @@ notrace static long vdso_fallback_gtod(s
  }
  
  
 -notrace static inline u64 vgetsns(void)
 +notrace static inline u64 vgetsns(int *mode)
  {
   long v;
   cycles_t cycles;
 @@ -88,6 +153,8 @@ notrace static inline u64 vgetsns(void)
   cycles = vread_tsc();
   else if (gtod-clock.vclock_mode == VCLOCK_HPET)
   cycles = vread_hpet();
 + else if (gtod-clock.vclock_mode == VCLOCK_PVCLOCK)
 + cycles = vread_pvclock(mode);
   else
   return 0;
   v = (cycles - gtod-clock.cycle_last)  gtod-clock.mask;
 @@ -107,7 +174,7 @@ notrace static int __always_inline do_re
   mode = gtod-clock.vclock_mode;
   ts-tv_sec = gtod-wall_time_sec;
   ns = gtod-wall_time_snsec;
 - ns += vgetsns();
 + ns += vgetsns(mode);
   ns = gtod-clock.shift;
   } while (unlikely(read_seqcount_retry(gtod-seq, seq)));
  
 @@ -127,7 +194,7 @@ notrace static int do_monotonic(struct t
   mode = gtod-clock.vclock_mode;
   ts-tv_sec = gtod-monotonic_time_sec;
   ns = gtod-monotonic_time_snsec;
 - ns += vgetsns();
 + ns += vgetsns(mode);
   ns = gtod-clock.shift;
   } while (unlikely(read_seqcount_retry(gtod-seq, seq)));
   timespec_add_ns(ts, ns);
 Index: vsyscall/arch/x86/include/asm/vsyscall.h
 ===
 --- vsyscall.orig/arch/x86/include/asm/vsyscall.h
 +++ vsyscall/arch/x86/include/asm/vsyscall.h
 @@ -33,6 +33,23 @@ extern void map_vsyscall(void);
   */
  extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
  
 +#define VGETCPU_CPU_MASK 0xfff
 +
 +static inline unsigned int __getcpu(void)
 +{
 + unsigned int p;
 +
 + if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
 + /* Load per CPU data from RDTSCP */
 + 

Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-14 Thread Marcelo Tosatti
On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
 Hi Marcelo,
 
 On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
  On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
  Do not drop a large spte until it can be replaced by small pages, so that
  the guest can happily read memory through it
 
  The idea is from Avi:
  | As I mentioned before, write-protecting a large spte is a good idea,
  | since it moves some work from protect-time to fault-time, so it reduces
  | jitter.  This removes the need for the return value.
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
  ---
   arch/x86/kvm/mmu.c |   34 +-
   1 files changed, 9 insertions(+), 25 deletions(-)
  
  It's likely that other 4k pages are mapped read-write in the 2mb range 
  covered by a read-only 2mb map. Therefore it's not entirely useful to
  map read-only. 
  
 
 A page fault is needed to install a pte even for a read access.
 After the change, that page fault can be avoided.
 
  Can you measure an improvement with this change?
 
 I have attached a test case that measures the read time.
 It maps 4k pages at first (dirty-logged), then switches to large sptes
 (dirty logging stopped), and at the end measures the read access time
 after write-protecting the sptes.
 
 Before: 23314111 ns   After: 11404197 ns

OK, I'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
that is:

- the large page must be destroyed when write-protecting due to a
  shadowed page.
- with shadow, it does not make sense to write-protect large sptes,
  as mentioned earlier.

So I wonder why this part of your patch

-	if (level > PT_PAGE_TABLE_LEVEL &&
-	    has_wrprotected_page(vcpu->kvm, gfn, level)) {
-		ret = 1;
-		drop_spte(vcpu->kvm, sptep);
-		goto done;
-	}

is necessary (assuming EPT is in use).



Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-14 Thread Marcelo Tosatti
On Wed, Nov 14, 2012 at 12:33:50AM +0900, Takuya Yoshikawa wrote:
 Ccing live migration developers who should be interested in this work,
 
 On Mon, 12 Nov 2012 21:10:32 -0200
 Marcelo Tosatti mtosa...@redhat.com wrote:
 
  On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
   Do not drop a large spte until it can be replaced by small pages, so that
   the guest can happily read memory through it
   
   The idea is from Avi:
   | As I mentioned before, write-protecting a large spte is a good idea,
   | since it moves some work from protect-time to fault-time, so it reduces
   | jitter.  This removes the need for the return value.
   
   Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
   ---
arch/x86/kvm/mmu.c |   34 +-
1 files changed, 9 insertions(+), 25 deletions(-)
  
  It's likely that other 4k pages are mapped read-write in the 2mb range 
  covered by a read-only 2mb map. Therefore it's not entirely useful to
  map read-only. 
  
  Can you measure an improvement with this change?
 
 What we discussed at KVM Forum last week was about the jitter we could
 measure right after starting live migration: both Isaku and Chegu reported
 such jitter.
 
 So if this patch reduces such jitter for some real workloads, by lazily
 dropping largepage mappings and saving read faults until that point, that
 would be very nice!
 
 But sadly, what they measured included interactions with the outside of the
 guest, and the main cause was due to the big QEMU lock problem, they guessed.
 The order is so different that an improvement by a kernel side effort may not
 be seen easily.
 
 FWIW: I am now changing the initial write protection by
 kvm_mmu_slot_remove_write_access() to rmap based as I proposed at KVM Forum.
 ftrace said that 1ms was improved to 250-350us by the change for 10GB guest.
 My code still drops largepage mappings, so the initial write protection time
 itself may not be a such big issue here, I think.
 
 Again, if we can eliminate read faults to such an extent that guests can see
 measurable improvement, that should be very nice!
 
 Any thoughts?
 
 Thanks,
   Takuya

OK, makes sense. I'm worried about shadow / oos interactions
with large read-only mappings (trying to remember what the case
was exactly; it might be non-existent now).



Re: [PULL 0/3] vfio-pci for 1.3-rc0

2012-11-14 Thread Anthony Liguori
Alex Williamson alex.william...@redhat.com writes:

 Hi Anthony,

 Please pull the tag below.  I posted the linux-headers update
 separately on Oct-15; since it hasn't been applied and should be
 non-controversial, I include it again here.  Thanks,

 Alex


Pulled. Thanks.

Regards,

Anthony Liguori

 The following changes since commit f5022a135e4309a54d433c69b2a056756b2d0d6b:

   aio: fix aio_ctx_prepare with idle bottom halves (2012-11-12 20:02:09 +0400)

 are available in the git repository at:

   git://github.com/awilliam/qemu-vfio.git tags/vfio-pci-for-qemu-1.3.0-rc0

 for you to fetch changes up to a771c51703cf9f91023c6570426258bdf5ec775b:

   vfio-pci: Use common msi_get_message (2012-11-13 12:27:40 -0700)

 
 vfio-pci: KVM INTx accel  common msi_get_message

 
 Alex Williamson (3):
   linux-headers: Update to 3.7-rc5
   vfio-pci: Add KVM INTx acceleration
   vfio-pci: Use common msi_get_message

  hw/vfio_pci.c| 210 
 +++
  linux-headers/asm-powerpc/kvm_para.h |   6 +-
  linux-headers/asm-s390/kvm_para.h|   8 +-
  linux-headers/asm-x86/kvm.h  |  17 +++
  linux-headers/linux/kvm.h|  25 -
  linux-headers/linux/kvm_para.h   |   6 +-
  linux-headers/linux/vfio.h   |   6 +-
  linux-headers/linux/virtio_config.h  |   6 +-
  linux-headers/linux/virtio_ring.h|   6 +-
  9 files changed, 241 insertions(+), 49 deletions(-)



Re: interrupt remapping support

2012-11-14 Thread Gleb Natapov
On Wed, Nov 14, 2012 at 05:46:45PM +0100, emdel wrote:
 On Tue, Nov 13, 2012 at 10:29 AM, Gleb Natapov g...@redhat.com wrote:
 
  KVM does not implement VT-d spec if this is your question. Any help with
  this will be appreciated.
 
 
 Hello everybody,
 following this link [1] it looks like we can configure pass-through
 devices for KVM guests. So if that is the case, and as you said KVM
 doesn't implement the VT-d specification, are there any protections in
 place against DMA attacks?
 
 
KVM uses VT-d on a host for device assignment. Guest running inside KVM
will not see VT-d though since KVM does not emulate it.

--
Gleb.


Re: interrupt remapping support

2012-11-14 Thread Gleb Natapov
On Wed, Nov 14, 2012 at 06:31:30PM +0100, emdel wrote:
 On Wed, Nov 14, 2012 at 6:06 PM, Gleb Natapov g...@redhat.com wrote:
 
  KVM uses VT-d on a host for device assignment. Guest running inside KVM
  will not see VT-d though since KVM does not emulate it.
 
 
 So another question comes to mind:
 what's the purpose of host device assignment if I cannot use it for
 the guest?
 
I do not think you understand what I am saying. You can use device
assignment for a guest.

--
Gleb.


Re: [PATCH 04/20] KVM/MIPS32: MIPS arch specific APIs for KVM

2012-11-14 Thread Sanjay Lal

On Nov 1, 2012, at 11:18 AM, Avi Kivity wrote:

 +
 +/* Set the appropriate status bits based on host CPU features, before 
 we hit the scheduler */
 +kvm_mips_set_c0_status();
 +
 +local_irq_enable();
 
 Ah, so you handle exits with interrupts enabled.  But that's not how we
 usually do it; the standard pattern is
 
 
 while (can continue)
 disable interrupts
 enter guest
 enable interrupts
 process exit

A bit more detail here. KVM/MIPS has its own set of exception handlers which 
are separate from the host kernel's handlers.  We switch between the 2 sets of 
handlers by setting the Exception Base Register (EBASE).  We enable host 
interrupts just before we switch to guest context so that we trap when the host 
gets a timer or I/O interrupt.  

When an exception does occur in guest context, the KVM/MIPS handlers will save 
the guest context, and switch back to the default host kernel exception 
handlers. We enter the C handler (kvm_mips_handle_exit()) with interrupts 
disabled, and explicitly enable them there.  This allows the host kernel to 
handle any pending interrupts.

The sequence is as follows
while (can continue)
disable interrupts
trampoline code to save host kernel context, load guest context
enable host interrupts
enter guest context
KVM/MIPS trap handler (called with interrupts disabled, per MIPS 
architecture)
Restore host Linux context, setup stack to handle exception
Jump to C handler
Enable interrupts before handling VM exit.


Regards
Sanjay





Re: [patch 10/16] x86: vdso: pvclock gettime support

2012-11-14 Thread Marcelo Tosatti
On Wed, Nov 14, 2012 at 12:42:48PM +0200, Gleb Natapov wrote:
 On Wed, Oct 31, 2012 at 08:47:06PM -0200, Marcelo Tosatti wrote:
  Improve performance of time system calls when using Linux pvclock, 
  by reading time info from fixmap visible copy of pvclock data.
  
  Originally from Jeremy Fitzhardinge.
  
  Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
  
  Index: vsyscall/arch/x86/vdso/vclock_gettime.c
  ===
  --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
  +++ vsyscall/arch/x86/vdso/vclock_gettime.c
  @@ -22,6 +22,7 @@
   #include asm/hpet.h
   #include asm/unistd.h
   #include asm/io.h
  +#include asm/pvclock.h
   
   #define gtod (VVAR(vsyscall_gtod_data))
   
  @@ -62,6 +63,70 @@ static notrace cycle_t vread_hpet(void)
  return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
   }
   
  +#ifdef CONFIG_PARAVIRT_CLOCK
  +
  +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
  +{
  +   const aligned_pvti_t *pvti_base;
  +   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
  +   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
  +
  +   BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx  PVCLOCK_FIXMAP_END);
  +
  +   pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
  +
  +   return pvti_base[offset].info;
  +}
  +
  +static notrace cycle_t vread_pvclock(int *mode)
  +{
  +   const struct pvclock_vsyscall_time_info *pvti;
  +   cycle_t ret;
  +   u64 last;
  +   u32 version;
  +   u32 migrate_count;
  +   u8 flags;
  +   unsigned cpu, cpu1;
  +
  +
  +   /*
  +* When looping to get a consistent (time-info, tsc) pair, we
  +* also need to deal with the possibility we can switch vcpus,
  +* so make sure we always re-fetch time-info for the current vcpu.
  +*/
  +   do {
  +   cpu = __getcpu()  VGETCPU_CPU_MASK;
  +   pvti = get_pvti(cpu);
  +
  +   migrate_count = pvti-migrate_count;
  +
  +   version = __pvclock_read_cycles(pvti-pvti, ret, flags);
  +
  +   /*
  +* Test we're still on the cpu as well as the version.
  +* We could have been migrated just after the first
  +* vgetcpu but before fetching the version, so we
  +* wouldn't notice a version change.
  +*/
  +   cpu1 = __getcpu()  VGETCPU_CPU_MASK;
  +   } while (unlikely(cpu != cpu1 ||
  + (pvti-pvti.version  1) ||
  + pvti-pvti.version != version ||
  + pvti-migrate_count != migrate_count));
  +
 We can put vcpu id into higher bits of pvti.version. This will
 save a couple of cycles by getting rid of __getcpu() calls.

Yes. Added as comment in the code.
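
For illustration, packing the vcpu id as Gleb suggests could look roughly
like this on the guest side (a sketch only: the bit layout is made up and
is not part of the current pvclock ABI):

/*
 * Hypothetical split of the 32-bit version word: the low bits keep the
 * existing seqcount-style counter (bit 0 = update in progress), the high
 * bits carry the vcpu id published by the host.
 */
#define PVTI_VCPU_SHIFT	24
#define PVTI_VCPU_MASK	0xffu

static inline unsigned int pvti_version_vcpu(u32 version)
{
	return (version >> PVTI_VCPU_SHIFT) & PVTI_VCPU_MASK;
}

static inline u32 pvti_version_seq(u32 version)
{
	return version & ((1u << PVTI_VCPU_SHIFT) - 1);
}

The reader loop above would then compare pvti_version_vcpu() of the two
samples instead of calling __getcpu() twice.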




Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-14 Thread Xiao Guangrong
On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
 On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
 Hi Marcelo,

 On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
 On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
 Do not drop a large spte until it can be replaced by small pages, so that
 the guest can happily read memory through it

 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   34 +-
  1 files changed, 9 insertions(+), 25 deletions(-)

 It's likely that other 4k pages are mapped read-write in the 2mb range 
 covered by a read-only 2mb map. Therefore it's not entirely useful to
 map read-only. 


 A page fault is needed to install a pte even for a read access.
 After the change, that page fault can be avoided.

 Can you measure an improvement with this change?

 I have attached a test case that measures the read time.
 It maps 4k pages at first (dirty-logged), then switches to large sptes
 (dirty logging stopped), and at the end measures the read access time
 after write-protecting the sptes.

 Before: 23314111 ns  After: 11404197 ns
 
 OK, I'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow),
 that is:
 
 - the large page must be destroyed when write-protecting due to a
   shadowed page.
 - with shadow, it does not make sense to write-protect large sptes,
   as mentioned earlier.
 

This case is gone now. The code at the time e49146dce8c3dc6f44 was applied was:
|
|		pt = sp->spt;
|		for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
|			/* avoid RMW */
|			if (is_writable_pte(pt[i]))
|				update_spte(&pt[i], pt[i] & ~PT_WRITABLE_MASK);
|	}

The real problem in this code is that it would write-protect the spte even
if it was not a last spte, which caused the middle-level shadow page table
to be write-protected. So e49146dce8c3dc6f44 added this code:
|		if (sp->role.level != PT_PAGE_TABLE_LEVEL)
|			continue;
|
which fixed this problem.

Now, the current code is:
|	for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
|		if (!is_shadow_present_pte(pt[i]) ||
|		      !is_last_spte(pt[i], sp->role.level))
|			continue;
|
|		spte_write_protect(kvm, &pt[i], &flush, false);
|	}
It only write-protects last sptes, so large sptes are allowed to exist.
(A large spte can be broken up by drop_large_spte() on the page-fault path.)

 So I wonder why this part of your patch
 
 -	if (level > PT_PAGE_TABLE_LEVEL &&
 -	    has_wrprotected_page(vcpu->kvm, gfn, level)) {
 -		ret = 1;
 -		drop_spte(vcpu->kvm, sptep);
 -		goto done;
 -	}
 
 is necessary (assuming EPT is in use).

This is safe; we changed the code to:

-	if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
+	if ((level > PT_PAGE_TABLE_LEVEL &&
+	      has_wrprotected_page(vcpu->kvm, gfn, level)) ||
+	      mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 		pgprintk("%s: found shadow page for %llx, marking ro\n",
 			 __func__, gfn);
 		ret = 1;

The spte becomes read-only, which ensures the shadowed gfn cannot be changed.

Btw, the original code allows creating a read-only spte in this case if
!(pte_access & WRITEABLE).



Re: [PATCH] KVM: MMU: lazily drop large spte

2012-11-14 Thread Xiao Guangrong
On 11/14/2012 10:44 PM, Marcelo Tosatti wrote:
 On Wed, Nov 14, 2012 at 12:33:50AM +0900, Takuya Yoshikawa wrote:
 Ccing live migration developers who should be interested in this work,

 On Mon, 12 Nov 2012 21:10:32 -0200
 Marcelo Tosatti mtosa...@redhat.com wrote:

 On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
 Do not drop a large spte until it can be replaced by small pages, so that
 the guest can happily read memory through it

 The idea is from Avi:
 | As I mentioned before, write-protecting a large spte is a good idea,
 | since it moves some work from protect-time to fault-time, so it reduces
 | jitter.  This removes the need for the return value.

 Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
 ---
  arch/x86/kvm/mmu.c |   34 +-
  1 files changed, 9 insertions(+), 25 deletions(-)

 It's likely that other 4k pages are mapped read-write in the 2mb range 
 covered by a read-only 2mb map. Therefore it's not entirely useful to
 map read-only. 

 Can you measure an improvement with this change?

 What we discussed at KVM Forum last week was about the jitter we could
 measure right after starting live migration: both Isaku and Chegu reported
 such jitter.

 So if this patch reduces such jitter for some real workloads, by lazily
 dropping largepage mappings and saving read faults until that point, that
 would be very nice!

 But sadly, what they measured included interactions with the outside of the
 guest, and the main cause was due to the big QEMU lock problem, they guessed.
 The order is so different that an improvement by a kernel side effort may not
 be seen easily.

 FWIW: I am now changing the initial write protection by
 kvm_mmu_slot_remove_write_access() to rmap based as I proposed at KVM Forum.
 ftrace said that 1ms was improved to 250-350us by the change for 10GB guest.
 My code still drops largepage mappings, so the initial write protection time
 itself may not be a such big issue here, I think.

 Again, if we can eliminate read faults to such an extent that guests can see
 measurable improvement, that should be very nice!

 Any thoughts?

 Thanks,
  Takuya
 
 OK, makes sense. I'm worried about shadow / oos interactions
 with large read-only mappings (trying to remember what the case
 was exactly; it might be non-existent now).

Marcelo, I guess commit 38187c830cab84daecb41169948467f1f19317e3 is what you
mentioned, but I do not see how it "Simplifies out of sync shadow".  :(



Re: PROBLEM: compilation issue, inline assembly arch/x86/kvm/emulate.c fails at -O0

2012-11-14 Thread H. Peter Anvin

On 11/14/2012 11:45 AM, Blower, Melanie wrote:

[1.] gcc -O0 on the inline assembly in arch/x86/kvm/emulate.c gets a
compilation failure -- incorrect register constraints
[2.] Full description of the problem/report:
I'm trying to compile this file at -O0, but gcc chokes during register
allocation at the inline assembly.

In the ordinary Linux build, this file compiles with gcc at -O2, without 
compilation errors.


Compiling with -O0 is not really expected to work (although -O1 *is*),
although what you are reporting is an actual bug: the "+a" : "a"
constraint pair should be either just "+a" or "=a" : "a".
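
For reference, the difference between the two legal forms looks like this
(a toy doubling example, not the emulate.c code):

/* One read-write operand: the value lives in %eax for the whole asm. */
static inline unsigned int dbl_rw(unsigned int val)
{
	asm("addl %0, %0" : "+a" (val));
	return val;
}

/* Separate output and input operands, both constrained to %eax. */
static inline unsigned int dbl_in_out(unsigned int val)
{
	unsigned int out;

	asm("addl %1, %0" : "=a" (out) : "a" (val));
	return out;
}

Using both a "+a" output and an additional "a" input asks the allocator to
place two operands in %eax at once, which is presumably what trips register
allocation at -O0.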

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



[PATCH 0/2] [PULL] qemu-kvm.git uq/master queue

2012-11-14 Thread Marcelo Tosatti
The following changes since commit ce34cf72fe508b27a78f83c184142e8d1e6a048a:

  Merge remote-tracking branch 'awilliam/tags/vfio-pci-for-qemu-1.3.0-rc0' into 
staging (2012-11-14 08:53:40 -0600)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

Jan Kiszka (1):
  kvm: Actually remove software breakpoints from list on cleanup

Marcelo Tosatti (1):
  acpi_piix4: fix migration of gpe fields

 hw/acpi_piix4.c |   50 ++
 kvm-all.c   |2 ++
 2 files changed, 48 insertions(+), 4 deletions(-)


[patch 11/18] x86: vdso: pvclock gettime support

2012-11-14 Thread Marcelo Tosatti
Improve performance of time system calls when using Linux pvclock, 
by reading time info from fixmap visible copy of pvclock data.

Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/vdso/vclock_gettime.c
===
--- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
+++ vsyscall/arch/x86/vdso/vclock_gettime.c
@@ -22,6 +22,7 @@
 #include <asm/hpet.h>
 #include <asm/unistd.h>
 #include <asm/io.h>
+#include <asm/pvclock.h>
 
 #define gtod (VVAR(vsyscall_gtod_data))
 
@@ -62,6 +63,76 @@ static notrace cycle_t vread_hpet(void)
return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
 }
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+
+static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+{
+   const struct pvclock_vsyscall_time_info *pvti_base;
+   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
+   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
+
+   BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
+
+   pvti_base = (struct pvclock_vsyscall_time_info *)
+   __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
+
+   return &pvti_base[offset];
+}
+
+static notrace cycle_t vread_pvclock(int *mode)
+{
+   const struct pvclock_vsyscall_time_info *pvti;
+   cycle_t ret;
+   u64 last;
+   u32 version;
+   u32 migrate_count;
+   u8 flags;
+   unsigned cpu, cpu1;
+
+
+   /*
+* When looping to get a consistent (time-info, tsc) pair, we
+* also need to deal with the possibility we can switch vcpus,
+* so make sure we always re-fetch time-info for the current vcpu.
+*/
+   do {
+   cpu = __getcpu() & VGETCPU_CPU_MASK;
+   /* TODO: We can put vcpu id into higher bits of pvti.version.
+* This will save a couple of cycles by getting rid of
+* __getcpu() calls (Gleb).
+*/
+
+   pvti = get_pvti(cpu);
+
+   migrate_count = pvti->migrate_count;
+
+   version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
+
+   /*
+* Test we're still on the cpu as well as the version.
+* We could have been migrated just after the first
+* vgetcpu but before fetching the version, so we
+* wouldn't notice a version change.
+*/
+   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
+   } while (unlikely(cpu != cpu1 ||
+                     (pvti->pvti.version & 1) ||
+                     pvti->pvti.version != version ||
+                     pvti->migrate_count != migrate_count));
+
+   if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+   *mode = VCLOCK_NONE;
+
+   /* refer to tsc.c read_tsc() comment for rationale */
+   last = VVAR(vsyscall_gtod_data).clock.cycle_last;
+
+   if (likely(ret >= last))
+   return ret;
+
+   return last;
+}
+#endif
+
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
@@ -80,7 +151,7 @@ notrace static long vdso_fallback_gtod(s
 }
 
 
-notrace static inline u64 vgetsns(void)
+notrace static inline u64 vgetsns(int *mode)
 {
long v;
cycles_t cycles;
@@ -88,6 +159,8 @@ notrace static inline u64 vgetsns(void)
cycles = vread_tsc();
 	else if (gtod->clock.vclock_mode == VCLOCK_HPET)
 		cycles = vread_hpet();
+	else if (gtod->clock.vclock_mode == VCLOCK_PVCLOCK)
+		cycles = vread_pvclock(mode);
 	else
 		return 0;
 	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
@@ -107,7 +180,7 @@ notrace static int __always_inline do_re
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->wall_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -127,7 +200,7 @@ notrace static int do_monotonic(struct t
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->monotonic_time_sec;
 		ns = gtod->monotonic_time_snsec;
-		ns += vgetsns();
+		ns += vgetsns(&mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
timespec_add_ns(ts, ns);
Index: vsyscall/arch/x86/include/asm/vsyscall.h
===
--- vsyscall.orig/arch/x86/include/asm/vsyscall.h
+++ vsyscall/arch/x86/include/asm/vsyscall.h
@@ -33,6 +33,23 @@ extern void map_vsyscall(void);
  */
 extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
 
+#define VGETCPU_CPU_MASK 0xfff
+
+static inline unsigned int __getcpu(void)
+{
+   unsigned int p;
+
+   if (VVAR(vgetcpu_mode) == 

[patch 07/18] x86: pvclock: add note about rdtsc barriers

2012-11-14 Thread Marcelo Tosatti
As noted by Gleb, not advertising SSE2 support implies
no RDTSC barriers.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -74,6 +74,9 @@ unsigned __pvclock_read_cycles(const str
u8 ret_flags;
 
 	version = src->version;
+   /* Note: emulated platforms which do not advertise SSE2 support
+* result in kvmclock not using the necessary RDTSC barriers.
+*/
rdtsc_barrier();
offset = pvclock_get_nsec_offset(src);
ret = src-system_time + offset;




[patch 09/18] x86: pvclock: generic pvclock vsyscall initialization

2012-11-14 Thread Marcelo Tosatti
Originally from Jeremy Fitzhardinge.

Introduce generic, non hypervisor specific, pvclock initialization 
routines.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -17,6 +17,10 @@
 
 #include <linux/kernel.h>
 #include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/bootmem.h>
 #include <asm/pvclock.h>
 
 static u8 valid_flags __read_mostly = 0;
@@ -122,3 +126,68 @@ void pvclock_read_wallclock(struct pvclo
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
+
+static struct pvclock_vsyscall_time_info *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info *
+pvclock_get_vsyscall_user_time_info(int cpu)
+{
+   if (!pvclock_vdso_info) {
+   BUG();
+   return NULL;
+   }
+
+   return &pvclock_vdso_info[cpu];
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+   return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
+int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v)
+{
+   struct task_migration_notifier *mn = v;
+   struct pvclock_vsyscall_time_info *pvti;
+
+   pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+   /* this is NULL when pvclock vsyscall is not initialized */
+   if (unlikely(pvti == NULL))
+   return NOTIFY_DONE;
+
+   pvti->migrate_count++;
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+   .notifier_call = pvclock_task_migrate,
+};
+
+/*
+ * Initialize the generic pvclock vsyscall state.  This will allocate
+ * a/some page(s) for the per-vcpu pvclock information, set up a
+ * fixmap mapping for the page(s)
+ */
+
+int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
+int size)
+{
+   int idx;
+
+   WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
+
+   pvclock_vdso_info = i;
+
+   for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
+   __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
+__pa_symbol(i) + (idx*PAGE_SIZE),
+PAGE_KERNEL_VVAR);
+   }
+
+
+   register_task_migration_notifier(&pvclock_migrate);
+
+   return 0;
+}
Index: vsyscall/arch/x86/include/asm/fixmap.h
===
--- vsyscall.orig/arch/x86/include/asm/fixmap.h
+++ vsyscall/arch/x86/include/asm/fixmap.h
@@ -19,6 +19,7 @@
 #include <asm/acpi.h>
 #include <asm/apicdef.h>
 #include <asm/page.h>
+#include <asm/pvclock.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -81,6 +82,10 @@ enum fixed_addresses {
VVAR_PAGE,
VSYSCALL_HPET,
 #endif
+#ifdef CONFIG_PARAVIRT_CLOCK
+   PVCLOCK_FIXMAP_BEGIN,
+   PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
+#endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -85,4 +85,16 @@ unsigned __pvclock_read_cycles(const str
return version;
 }
 
+struct pvclock_vsyscall_time_info {
+   struct pvclock_vcpu_time_info pvti;
+   u32 migrate_count;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+#define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
+#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
+
+int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
+int size);
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
+
 #endif /* _ASM_X86_PVCLOCK_H */
Index: vsyscall/arch/x86/include/asm/clocksource.h
===
--- vsyscall.orig/arch/x86/include/asm/clocksource.h
+++ vsyscall/arch/x86/include/asm/clocksource.h
@@ -8,6 +8,7 @@
 #define VCLOCK_NONE 0  /* No vDSO clock available. */
 #define VCLOCK_TSC  1  /* vDSO should use vread_tsc.   */
 #define VCLOCK_HPET 2  /* vDSO should use vread_hpet.  */
+#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
 
 struct arch_clocksource_data {
int vclock_mode;
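
As a quick sanity check of the sizing introduced above (illustrative
numbers only: PAGE_SIZE = 4096, SMP_CACHE_BYTES = 64, NR_CPUS = 256):

  PVTI_SIZE                 = sizeof(struct pvclock_vsyscall_time_info)
                            = 64 bytes once cache-aligned
  entries per page          = PAGE_SIZE / PVTI_SIZE = 4096 / 64 = 64
  PVCLOCK_VSYSCALL_NR_PAGES = ((256 - 1) / 64) + 1 = 4 pages
  vdso lookup for cpu 100   -> fixmap page idx = 100 / 64 = 1,
                               slot within page = 100 % 64 = 36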




[PATCH 2/2] kvm: Actually remove software breakpoints from list on cleanup

2012-11-14 Thread Marcelo Tosatti
From: Jan Kiszka jan.kis...@siemens.com

So far we only removed them from the guest, leaving its states in the
list. This made it impossible for gdb to re-enable breakpoints on the
same address after re-attaching.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index b6d0483..3bc3347 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1905,6 +1905,8 @@ void kvm_remove_all_breakpoints(CPUArchState *current_env)
 }
 }
 }
+        QTAILQ_REMOVE(&s->kvm_sw_breakpoints, bp, entry);
+        g_free(bp);
 }
 kvm_arch_remove_all_hw_breakpoints();
 
-- 
1.7.6.4



[patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4)

2012-11-14 Thread Marcelo Tosatti

This patchset, based on earlier work by Jeremy Fitzhardinge, implements
paravirtual clock vsyscall support.

It should be possible to implement Xen support relatively easily.

It reduces clock_gettime from 500 cycles to 200 cycles
on my testbox.
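
For reference, a rough way to reproduce that kind of number from inside a
guest (this is not the benchmark used above, just a minimal sketch; build
with gcc -O2, and add -lrt on older glibc):

#include <stdio.h>
#include <time.h>

static inline unsigned long long rdtsc_cycles(void)
{
	unsigned int lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
	enum { LOOPS = 1000000 };
	struct timespec ts;
	unsigned long long start, end;
	int i;

	start = rdtsc_cycles();
	for (i = 0; i < LOOPS; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
	end = rdtsc_cycles();

	printf("~%llu cycles per clock_gettime() call\n",
	       (end - start) / LOOPS);
	return 0;
}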

Please review.

v4:
- remove aligned_pvti structure, align directly (Glauber)
- add comments to migration notifier (Glauber)
- mark migration notifier condition as unlikely (Glauber)
- add comment about rdtsc barrier dependency on sse2 (Gleb)
- add idea to improve vdso gettime call (Gleb)
- remove new msr interface, reuse kernel copy of pvclock
data (Glauber)
- move copying of timekeeping data from generic timekeeping 
code to kvm code (John)

v3:
- fix PVCLOCK_VSYSCALL_NR_PAGES definition (glommer)
- fold flags race fix into pvclock refactoring (avi)
- remove CONFIG_PARAVIRT_CLOCK_VSYSCALL (glommer)
- add reference to tsc.c from vclock_gettime.c about cycle_last rationale
(glommer)
- fix whitespace damage (glommer)


v2:
- Do not allow visibility of different system_timestamp, tsc_timestamp
tuples.
- Add option to disable vsyscall.





[patch 08/18] sched: add notifier for cross-cpu migrations

2012-11-14 Thread Marcelo Tosatti
Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/include/linux/sched.h
===
--- vsyscall.orig/include/linux/sched.h
+++ vsyscall/include/linux/sched.h
@@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+   struct task_struct *task;
+   int from_cpu;
+   int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;
Index: vsyscall/kernel/sched/core.c
===
--- vsyscall.orig/kernel/sched/core.c
+++ vsyscall/kernel/sched/core.c
@@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
 		rq->skip_clock_update = 1;
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+   atomic_notifier_chain_register(&task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
trace_sched_migrate_task(p, new_cpu);
 
if (task_cpu(p) != new_cpu) {
+   struct task_migration_notifier tmn;
+
 		p->se.nr_migrations++;
perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+
+   tmn.task = p;
+   tmn.from_cpu = task_cpu(p);
+   tmn.to_cpu = new_cpu;
+
+		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
}
 
__set_task_cpu(p, new_cpu);




[patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag

2012-11-14 Thread Marcelo Tosatti
KVM added a global variable to guarantee monotonicity in the guest. 
One of the reasons for that is that the time between

1. ktime_get_ts(timespec);
2. rdtscll(tsc);

Is variable. That is, given a host with stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances 
executing in different VCPUS (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when 
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single (system_timestamp, tsc_timestamp) tuple is visible to all 
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs. 

Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.
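
A concrete instance of the problem, with made-up numbers (1 TSC tick =
1ns for simplicity):

  VCPU0: timespec0 = 100, tsc0 = 1000
  VCPU1: timespec1 = 105 (N = 5), but its rdtscll() lands at
         tsc1 = 1010 (M = 10) because of a cache miss or interrupt

  at some later TSC value R:
    ret0 = 100 + (R - 1000) = R - 900
    ret1 = 105 + (R - 1010) = R - 905 = ret0 - 5

  so whenever M > N, a read on VCPU1 can return a smaller time than an
  earlier read on VCPU0, i.e. monotonicity is violated.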

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1186,21 +1186,166 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static cycle_t read_tsc(void)
+{
+   cycle_t ret;
+   u64 last;
+
+   /*
+* Empirically, a fence (of type that depends on the CPU)
+* before rdtsc is enough to ensure that rdtsc is ordered
+* with respect to loads.  The various CPU manuals are unclear
+* as to whether rdtsc can be reordered with later loads,
+* but no one has ever seen it happen.
+*/
+   rdtsc_barrier();
+   ret = (cycle_t)vget_cycles();
+
+   last = pvclock_gtod_data.clock.cycle_last;
+
+   if (likely(ret >= last))
+   return ret;
+
+   /*
+* GCC likes to generate cmov here, but this branch is extremely
+* predictable (it's just a function of time and the likely is
+* very likely) and there's a data dependence, so force GCC
+* to generate a branch instead.  I don't barrier() because
+* we don't actually need a barrier, and if this function
+* ever gets inlined it will generate worse code.
+*/
+   asm volatile ("");
+   return last;
+}
+
+static inline u64 vgettsc(cycle_t *cycle_now)
+{
+   long v;
+   struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+   *cycle_now = read_tsc();
+
+   v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
+   return v * gtod->clock.mult;
+}
+
+static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+{
+   unsigned long seq;
+   u64 ns;
+   int mode;
+   struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+   ts->tv_nsec = 0;
+   do {
+       seq = read_seqcount_begin(&gtod->seq);
+       mode = gtod->clock.vclock_mode;
+       ts->tv_sec = gtod->monotonic_time_sec;
+       ns = gtod->monotonic_time_snsec;
+       ns += vgettsc(cycle_now);
+       ns >>= gtod->clock.shift;
+   } while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+   timespec_add_ns(ts, ns);
+
+   return mode;
+}
+
+/* returns true if host is using tsc clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+   struct timespec ts;
+
+   /* checked again under seqlock below */
+   if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+   return false;
+
+   if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
+       return false;
+
+   monotonic_to_bootbased(&ts);
+   *kernel_ns = timespec_to_ns(&ts);
+
+   return true;
+}
+
+
+/*
+ *
+ * Assuming a stable TSC across physical CPUS, the following condition
+ * is possible. Each numbered line represents an event visible to both
+ * CPUs at the next numbered event.
+ *
+ * timespecX represents host monotonic time. tscX represents
+ * RDTSC value.
+ *
+ * VCPU0 on CPU0   |   VCPU1 on CPU1
+ *
+ * 1.  read timespec0,tsc0
+ * 2.  | timespec1 = timespec0 + N
+ * | tsc1 = tsc0 + M
+ * 3. transition to guest  | transition to guest
+ * 4. ret0 = timespec0 + (rdtsc - tsc0) |
+ * 5.  | ret1 = timespec1 + (rdtsc - tsc1)
+ *                       | ret1 = timespec0 + N + (rdtsc - (tsc0 + M))
+ *
+ * Since ret0 update is visible to VCPU1 at time 5, to obey monotonicity:
+ *
+ * - ret0 < ret1
+ * - timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
+ * ...
+ * - 0 < N - M => M < N
+ *
+ * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
+ * always the case (the difference between two distinct xtime instances
+ * might be smaller than the difference between corresponding 

[patch 13/18] time: export time information for KVM pvclock

2012-11-14 Thread Marcelo Tosatti
As suggested by John, export time data similarly to how it is done by
vsyscall support. This allows KVM to retrieve the necessary information
to implement vsyscall support in KVM guests.
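
A minimal sketch of a consumer of this interface (hypothetical; KVM's
actual user is added later in the series, and the callback runs under the
timekeeper write seqlock, so it should only copy data out):

#include <linux/notifier.h>
#include <linux/pvclock_gtod.h>

static int my_gtod_update(struct notifier_block *nb, unsigned long unused,
			  void *priv)
{
	/*
	 * priv is the struct timekeeper *, per update_pvclock_gtod()
	 * below; snapshot whichever fields are needed here.
	 */
	return NOTIFY_DONE;
}

static struct notifier_block my_gtod_notifier = {
	.notifier_call = my_gtod_update,
};

static int __init my_consumer_init(void)
{
	return pvclock_gtod_register_notifier(&my_gtod_notifier);
}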

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/include/linux/pvclock_gtod.h
===
--- /dev/null
+++ vsyscall/include/linux/pvclock_gtod.h
@@ -0,0 +1,9 @@
+#ifndef _PVCLOCK_GTOD_H
+#define _PVCLOCK_GTOD_H
+
+#include <linux/notifier.h>
+
+extern int pvclock_gtod_register_notifier(struct notifier_block *nb);
+extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb);
+
+#endif /* _PVCLOCK_GTOD_H */
Index: vsyscall/kernel/time/timekeeping.c
===
--- vsyscall.orig/kernel/time/timekeeping.c
+++ vsyscall/kernel/time/timekeeping.c
@@ -21,6 +21,7 @@
 #include <linux/time.h>
 #include <linux/tick.h>
 #include <linux/stop_machine.h>
+#include <linux/pvclock_gtod.h>
 
 
 static struct timekeeper timekeeper;
@@ -180,6 +181,54 @@ static inline s64 timekeeping_get_ns_raw
return nsec + arch_gettimeoffset();
 }
 
+static RAW_NOTIFIER_HEAD(pvclock_gtod_chain);
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+   raw_notifier_call_chain(&pvclock_gtod_chain, 0, tk);
+}
+
+/**
+ * pvclock_gtod_register_notifier - register a pvclock timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_register_notifier(struct notifier_block *nb)
+{
+   struct timekeeper *tk = &timekeeper;
+   unsigned long flags;
+   int ret;
+
+   write_seqlock_irqsave(&tk->lock, flags);
+   ret = raw_notifier_chain_register(&pvclock_gtod_chain, nb);
+   /* update timekeeping data */
+   update_pvclock_gtod(tk);
+   write_sequnlock_irqrestore(&tk->lock, flags);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier);
+
+/**
+ * pvclock_gtod_unregister_notifier - unregister a pvclock
+ * timedata update listener
+ *
+ * Must hold write on timekeeper.lock
+ */
+int pvclock_gtod_unregister_notifier(struct notifier_block *nb)
+{
+   struct timekeeper *tk = &timekeeper;
+   unsigned long flags;
+   int ret;
+
+   write_seqlock_irqsave(&tk->lock, flags);
+   ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
+   write_sequnlock_irqrestore(&tk->lock, flags);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
@@ -188,6 +237,7 @@ static void timekeeping_update(struct ti
ntp_clear();
}
update_vsyscall(tk);
+   update_pvclock_gtod(tk);
 }
 
 /**




[patch 06/18] x86: pvclock: introduce helper to read flags

2012-11-14 Thread Marcelo Tosatti
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com


Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -45,6 +45,19 @@ void pvclock_resume(void)
atomic64_set(last_value, 0);
 }
 
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
+{
+   unsigned version;
+   cycle_t ret;
+   u8 flags;
+
+   do {
+   version = __pvclock_read_cycles(src, &ret, &flags);
+   } while ((src->version & 1) || version != src->version);
+
+   return flags & valid_flags;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
unsigned version;
Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,




[patch 02/18] x86: kvmclock: allocate pvclock shared memory area

2012-11-14 Thread Marcelo Tosatti
We want to expose the pvclock shared memory areas, which 
the hypervisor periodically updates, to userspace.

For a linear mapping from userspace, it is necessary that
entire page-sized regions are used for the array of pvclock
structures.

There is no such guarantee with per cpu areas, therefore move
to memblock_alloc based allocation.
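
The constraint the patch is after can be summarized by a rough sketch like
this (illustrative only, not the exact call used below; PAGE_ALIGN and the
names are just there to make the sizing explicit):

	unsigned long size = sizeof(struct pvclock_vcpu_time_info) * NR_CPUS;
	unsigned long mem  = memblock_alloc(PAGE_ALIGN(size), PAGE_SIZE);

	if (mem)
		hv_clock = __va(mem);	/* page-aligned, physically contiguous */

A page-aligned, physically contiguous array is something a per-cpu
allocation cannot guarantee, hence the switch.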

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/kvmclock.c
===
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -23,6 +23,7 @@
 #include <asm/apic.h>
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
+#include <linux/memblock.h>
 
 #include <asm/x86_init.h>
 #include <asm/reboot.h>
@@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
+struct pvclock_aligned_vcpu_time_info {
+   struct pvclock_vcpu_time_info clock;
+} __attribute__((__aligned__(SMP_CACHE_BYTES)));
+
+static struct pvclock_aligned_vcpu_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v
struct pvclock_vcpu_time_info *vcpu_time;
struct timespec ts;
int low, high;
+   int cpu;
+
+   preempt_disable();
+   cpu = smp_processor_id();
 
low = (int)__pa_symbol(&wall_clock);
high = ((u64)__pa_symbol(&wall_clock) >> 32);
 
native_write_msr(msr_kvm_wall_clock, low, high);
 
-   vcpu_time = &get_cpu_var(hv_clock);
+   vcpu_time = &hv_clock[cpu].clock;
pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
-   put_cpu_var(hv_clock);
+
+   preempt_enable();
 
return ts.tv_sec;
 }
@@ -74,9 +84,11 @@ static cycle_t kvm_clock_read(void)
 {
struct pvclock_vcpu_time_info *src;
cycle_t ret;
+   int cpu;
 
preempt_disable_notrace();
-   src = &__get_cpu_var(hv_clock);
+   cpu = smp_processor_id();
+   src = &hv_clock[cpu].clock;
ret = pvclock_clocksource_read(src);
preempt_enable_notrace();
return ret;
@@ -99,8 +111,15 @@ static cycle_t kvm_clock_get_cycles(stru
 static unsigned long kvm_get_tsc_khz(void)
 {
struct pvclock_vcpu_time_info *src;
-   src = &per_cpu(hv_clock, 0);
-   return pvclock_tsc_khz(src);
+   int cpu;
+   unsigned long tsc_khz;
+
+   preempt_disable();
+   cpu = smp_processor_id();
+   src = &hv_clock[cpu].clock;
+   tsc_khz = pvclock_tsc_khz(src);
+   preempt_enable();
+   return tsc_khz;
 }
 
 static void kvm_get_preset_lpj(void)
@@ -119,10 +138,14 @@ bool kvm_check_and_clear_guest_paused(vo
 {
bool ret = false;
struct pvclock_vcpu_time_info *src;
+   int cpu = smp_processor_id();
 
-   src = &__get_cpu_var(hv_clock);
+   if (!hv_clock)
+   return ret;
+
+   src = &hv_clock[cpu].clock;
if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-   __this_cpu_and(hv_clock.flags, ~PVCLOCK_GUEST_STOPPED);
+   src->flags &= ~PVCLOCK_GUEST_STOPPED;
ret = true;
}
 
@@ -141,9 +164,10 @@ int kvm_register_clock(char *txt)
 {
int cpu = smp_processor_id();
int low, high, ret;
+   struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
 
-   low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
-   high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+   low = (int)__pa(src) | 1;
+   high = ((u64)__pa(src) >> 32);
ret = native_write_msr_safe(msr_kvm_system_time, low, high);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
   cpu, high, low, txt);
@@ -197,9 +221,17 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
+   unsigned long mem;
+
if (!kvm_para_available())
return;
 
+   mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+PAGE_SIZE);
+   if (!mem)
+   return;
+   hv_clock = __va(mem);
+
if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;




[patch 17/18] KVM: x86: require matched TSC offsets for master clock

2012-11-14 Thread Marcelo Tosatti
With master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

Where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple 
per VM: the master clock.

Given a host with synchronized TSCs, it's obvious that the guest TSCs
must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.
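
A quick numeric check of why mismatched offsets break the guarantee
(made-up values, mult/shift scaling ignored):

/*
 * Shared master tuple: system_timestamp = 1000, tsc_timestamp = 5000.
 * Host rdtsc at the moment both vcpus read the clock: 6000.
 *
 *   vcpu0, tsc_offset = 0:   ret0 = 1000 + ((6000 + 0)  - 5000) = 2000
 *   vcpu1, tsc_offset = -50: ret1 = 1000 + ((6000 - 50) - 5000) = 1950
 *
 * A guest task migrating from vcpu0 to vcpu1 would see time jump
 * backwards by 50 units, hence the requirement that all guest TSC
 * offsets match before the master clock is switched on.
 */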

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -560,6 +560,7 @@ struct kvm_arch {
u64 cur_tsc_write;
u64 cur_tsc_offset;
u8  cur_tsc_generation;
+   int nr_vcpus_matched_tsc;
 
spinlock_t pvclock_gtod_sync_lock;
bool use_master_clock;
Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1097,12 +1097,38 @@ static u64 compute_guest_tsc(struct kvm_
return tsc;
 }
 
+void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
+{
+   bool vcpus_matched;
+   bool do_request = false;
+   struct kvm_arch *ka = &vcpu->kvm->arch;
+   struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+   vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+atomic_read(&vcpu->kvm->online_vcpus));
+
+   if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
+   if (!ka->use_master_clock)
+   do_request = 1;
+
+   if (!vcpus_matched && ka->use_master_clock)
+   do_request = 1;
+
+   if (do_request)
+   kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+
+   trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
+   atomic_read(&vcpu->kvm->online_vcpus),
+   ka->use_master_clock, gtod->clock.vclock_mode);
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
struct kvm *kvm = vcpu->kvm;
u64 offset, ns, elapsed;
unsigned long flags;
s64 usdiff;
+   bool matched;
 
raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
@@ -1145,6 +1171,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
+   matched = true;
} else {
/*
 * We split periods of matched TSC writes into generations.
@@ -1159,6 +1186,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
kvm->arch.cur_tsc_nsec = ns;
kvm->arch.cur_tsc_write = data;
kvm->arch.cur_tsc_offset = offset;
+   matched = false;
pr_debug("kvm: new tsc generation %u, clock %llu\n",
 kvm->arch.cur_tsc_generation, data);
}
@@ -1182,6 +1210,15 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 
kvm_x86_ops->write_tsc_offset(vcpu, offset);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+   spin_lock(&kvm->arch.pvclock_gtod_sync_lock);
+   if (matched)
+   kvm->arch.nr_vcpus_matched_tsc++;
+   else
+   kvm->arch.nr_vcpus_matched_tsc = 0;
+
+   kvm_track_tsc_matching(vcpu);
+   spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
 }
 
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
@@ -1271,8 +1308,9 @@ static bool kvm_get_time_and_clockread(s
 
 /*
  *
- * Assuming a stable TSC across physical CPUS, the following condition
- * is possible. Each numbered line represents an event visible to both
+ * Assuming a stable TSC across physical CPUS, and a stable TSC
+ * across virtual CPUs, the following condition is possible.
+ * Each numbered line represents an event visible to both
  * CPUs at the next numbered event.
  *
  * timespecX represents host monotonic time. tscX represents
@@ -1305,7 +1343,7 @@ static bool kvm_get_time_and_clockread(s
  * copy of host monotonic time values. Update that master copy
  * in lockstep.
  *
- * Rely on synchronization of host TSCs for monotonicity.
+ * Rely on synchronization of host TSCs and guest TSCs for monotonicity.
  *
  */
 
@@ -1313,20 +1351,27 @@ static void pvclock_update_vm_gtod_copy(
 {
struct kvm_arch *ka = kvm-arch;
int vclock_mode;
+   bool host_tsc_clocksource, vcpus_matched;
+
+   vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+   atomic_read(&kvm->online_vcpus));
 
/*
 * If the host uses TSC clock, then passthrough TSC as stable
 * to the guest.
 */
-   ka->use_master_clock = kvm_get_time_and_clockread(
+   host_tsc_clocksource = kvm_get_time_and_clockread(
  

[patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization

2012-11-14 Thread Marcelo Tosatti
TSC initialization will soon make use of online_vcpus.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/ia64/kvm/kvm-ia64.c
===
--- vsyscall.orig/arch/ia64/kvm/kvm-ia64.c
+++ vsyscall/arch/ia64/kvm/kvm-ia64.c
@@ -1330,6 +1330,11 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
return 0;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+   return 0;
+}
+
 int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 {
return -EINVAL;
Index: vsyscall/arch/powerpc/kvm/powerpc.c
===
--- vsyscall.orig/arch/powerpc/kvm/powerpc.c
+++ vsyscall/arch/powerpc/kvm/powerpc.c
@@ -354,6 +354,11 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
return vcpu;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+   return 0;
+}
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 {
/* Make sure we're not using the vcpu anymore */
Index: vsyscall/arch/s390/kvm/kvm-s390.c
===
--- vsyscall.orig/arch/s390/kvm/kvm-s390.c
+++ vsyscall/arch/s390/kvm/kvm-s390.c
@@ -355,6 +355,11 @@ static void kvm_s390_vcpu_initial_reset(
atomic_set_mask(CPUSTAT_STOPPED, vcpu-arch.sie_block-cpuflags);
 }
 
+void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+   return 0;
+}
+
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
 {
atomic_set(vcpu-arch.sie_block-cpuflags, CPUSTAT_ZARCH |
Index: vsyscall/arch/x86/kvm/svm.c
===
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -1254,7 +1254,6 @@ static struct kvm_vcpu *svm_create_vcpu(
svm-vmcb_pa = page_to_pfn(page)  PAGE_SHIFT;
svm-asid_generation = 0;
init_vmcb(svm);
-   kvm_write_tsc(svm-vcpu, 0);
 
err = fx_init(svm-vcpu);
if (err)
Index: vsyscall/arch/x86/kvm/vmx.c
===
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -3896,8 +3896,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
set_cr4_guest_host_mask(vmx);
 
-   kvm_write_tsc(vmx-vcpu, 0);
-
return 0;
 }
 
Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -6289,6 +6289,19 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu 
return r;
 }
 
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
+{
+   int r;
+
+   r = vcpu_load(vcpu);
+   if (r)
+   return r;
+   kvm_write_tsc(vcpu, 0);
+   vcpu_put(vcpu);
+
+   return r;
+}
+
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 {
int r;
Index: vsyscall/include/linux/kvm_host.h
===
--- vsyscall.orig/include/linux/kvm_host.h
+++ vsyscall/include/linux/kvm_host.h
@@ -583,6 +583,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id);
 int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu);
 
 int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu);
Index: vsyscall/virt/kvm/kvm_main.c
===
--- vsyscall.orig/virt/kvm/kvm_main.c
+++ vsyscall/virt/kvm/kvm_main.c
@@ -1855,6 +1855,7 @@ static int kvm_vm_ioctl_create_vcpu(stru
atomic_inc(&kvm->online_vcpus);
 
mutex_unlock(&kvm->lock);
+   kvm_arch_vcpu_postcreate(vcpu);
return r;
 
 unlock_vcpu_destroy:




[patch 04/18] x86: pvclock: remove pvclock_shadow_time

2012-11-14 Thread Marcelo Tosatti
Originally from Jeremy Fitzhardinge.

We can copy the information directly from struct pvclock_vcpu_time_info, 
remove pvclock_shadow_time.

Reviewed-by: Glauber Costa glom...@parallels.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -19,21 +19,6 @@
 #include <linux/percpu.h>
 #include <asm/pvclock.h>
 
-/*
- * These are perodically updated
- *xen: magic shared_info page
- *kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
-   u64 tsc_timestamp; /* TSC at last update of time vals.  */
-   u64 system_timestamp;  /* Time, in nanosecs, since boot.*/
-   u32 tsc_to_nsec_mul;
-   int tsc_shift;
-   u32 version;
-   u8  flags;
-};
-
 static u8 valid_flags __read_mostly = 0;
 
 void pvclock_set_flags(u8 flags)
@@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags)
valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
 {
-   u64 delta = native_read_tsc() - shadow->tsc_timestamp;
-   return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
-  shadow->tsc_shift);
-}
-
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
-   struct pvclock_vcpu_time_info *src)
-{
-   do {
-   dst->version = src->version;
-   rmb();  /* fetch version before data */
-   dst->tsc_timestamp = src->tsc_timestamp;
-   dst->system_timestamp  = src->system_time;
-   dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
-   dst->tsc_shift = src->tsc_shift;
-   dst->flags = src->flags;
-   rmb();  /* test version after fetching data */
-   } while ((src->version & 1) || (dst->version != src->version));
-
-   return dst->version;
+   u64 delta = native_read_tsc() - src->tsc_timestamp;
+   return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+  src->tsc_shift);
 }
 
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
@@ -90,21 +54,22 @@ void pvclock_resume(void)
 
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
-   struct pvclock_shadow_time shadow;
unsigned version;
cycle_t ret, offset;
u64 last;
+   u8 flags;
 
do {
-   version = pvclock_get_time_values(&shadow, src);
+   version = src->version;
rdtsc_barrier();
-   offset = pvclock_get_nsec_offset(&shadow);
-   ret = shadow.system_timestamp + offset;
+   offset = pvclock_get_nsec_offset(src);
+   ret = src->system_time + offset;
+   flags = src->flags;
rdtsc_barrier();
-   } while (version != src->version);
+   } while ((src->version & 1) || version != src->version);
 
if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
-   (shadow.flags & PVCLOCK_TSC_STABLE_BIT))
+   (flags & PVCLOCK_TSC_STABLE_BIT))
return ret;
 
/*




[patch 10/18] x86: kvm guest: pvclock vsyscall support

2012-11-14 Thread Marcelo Tosatti
Hook into the generic pvclock vsyscall code, with the aim of allowing
userspace visibility into pvclock data.
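
The eventual consumer is the guest's vDSO clock read; conceptually the
userspace side ends up doing something like the loop below (simplified
sketch, not the actual vDSO code from this series; pvti is the mapped
per-cpu pvclock_vcpu_time_info entry):

	do {
		version = pvti->version;	/* odd means update in progress */
		rdtsc_barrier();
		delta = __native_read_tsc() - pvti->tsc_timestamp;
		ns = pvti->system_time +
		     pvclock_scale_delta(delta, pvti->tsc_to_system_mul,
					 pvti->tsc_shift);
		flags = pvti->flags;
		rdtsc_barrier();
	} while ((pvti->version & 1) || version != pvti->version);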

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/kvmclock.c
===
--- vsyscall.orig/arch/x86/kernel/kvmclock.c
+++ vsyscall/arch/x86/kernel/kvmclock.c
@@ -40,11 +40,7 @@ static int parse_no_kvmclock(char *arg)
 early_param("no-kvmclock", parse_no_kvmclock);
 
 /* The hypervisor will put information about time periodically here */
-struct pvclock_aligned_vcpu_time_info {
-   struct pvclock_vcpu_time_info clock;
-} __attribute__((__aligned__(SMP_CACHE_BYTES)));
-
-static struct pvclock_aligned_vcpu_time_info *hv_clock;
+static struct pvclock_vsyscall_time_info *hv_clock;
 static struct pvclock_wall_clock wall_clock;
 
 /*
@@ -67,7 +63,7 @@ static unsigned long kvm_get_wallclock(v
 
native_write_msr(msr_kvm_wall_clock, low, high);
 
-   vcpu_time = &hv_clock[cpu].clock;
+   vcpu_time = &hv_clock[cpu].pvti;
pvclock_read_wallclock(wall_clock, vcpu_time, ts);
 
preempt_enable();
@@ -88,7 +84,7 @@ static cycle_t kvm_clock_read(void)
 
preempt_disable_notrace();
cpu = smp_processor_id();
-   src = &hv_clock[cpu].clock;
+   src = &hv_clock[cpu].pvti;
ret = pvclock_clocksource_read(src);
preempt_enable_notrace();
return ret;
@@ -116,7 +112,7 @@ static unsigned long kvm_get_tsc_khz(voi
 
preempt_disable();
cpu = smp_processor_id();
-   src = &hv_clock[cpu].clock;
+   src = &hv_clock[cpu].pvti;
tsc_khz = pvclock_tsc_khz(src);
preempt_enable();
return tsc_khz;
@@ -143,7 +139,7 @@ bool kvm_check_and_clear_guest_paused(vo
if (!hv_clock)
return ret;
 
-   src = &hv_clock[cpu].clock;
+   src = &hv_clock[cpu].pvti;
if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
src->flags &= ~PVCLOCK_GUEST_STOPPED;
ret = true;
@@ -164,7 +160,7 @@ int kvm_register_clock(char *txt)
 {
int cpu = smp_processor_id();
int low, high, ret;
-   struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
+   struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 
low = (int)__pa(src) | 1;
high = ((u64)__pa(src) >> 32);
@@ -226,7 +222,7 @@ void __init kvmclock_init(void)
if (!kvm_para_available())
return;
 
-   mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+   mem = memblock_alloc(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS,
 PAGE_SIZE);
if (!mem)
return;
@@ -265,3 +261,36 @@ void __init kvmclock_init(void)
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
 }
+
+int kvm_setup_vsyscall_timeinfo(void)
+{
+   int cpu;
+   int ret;
+   u8 flags;
+   struct pvclock_vcpu_time_info *vcpu_time;
+   unsigned int size;
+
+   size = sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS;
+
+   preempt_disable();
+   cpu = smp_processor_id();
+
+   vcpu_time = &hv_clock[cpu].pvti;
+   flags = pvclock_read_flags(vcpu_time);
+
+   if (!(flags & PVCLOCK_TSC_STABLE_BIT)) {
+   preempt_enable();
+   return 1;
+   }
+
+   if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
+   preempt_enable();
+   return ret;
+   }
+
+   preempt_enable();
+
+   kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
+   return 0;
+}
+
Index: vsyscall/arch/x86/kernel/kvm.c
===
--- vsyscall.orig/arch/x86/kernel/kvm.c
+++ vsyscall/arch/x86/kernel/kvm.c
@@ -42,6 +42,7 @@
 #include <asm/apic.h>
 #include <asm/apicdef.h>
 #include <asm/hypervisor.h>
+#include <asm/kvm_guest.h>
 
 static int kvmapf = 1;
 
@@ -62,6 +63,15 @@ static int parse_no_stealacc(char *arg)
 
 early_param("no-steal-acc", parse_no_stealacc);
 
+static int kvmclock_vsyscall = 1;
+static int parse_no_kvmclock_vsyscall(char *arg)
+{
+kvmclock_vsyscall = 0;
+return 0;
+}
+
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
 static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
 static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
 static int has_steal_clock = 0;
@@ -468,6 +478,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
+   if (kvmclock_vsyscall)
+   kvm_setup_vsyscall_timeinfo();
+
 #ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);
Index: vsyscall/arch/x86/include/asm/kvm_guest.h
===
--- 

[patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration

2012-11-14 Thread Marcelo Tosatti
As requested by Glauber, do not update the kvmclock area on vcpu->pcpu
migration, in case the host has a stable TSC.

This is to reduce cacheline bouncing.
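
The way I read the new condition (schematic, mirroring the hunk below):

	/* master clock => pvclock contents identical on every pCPU */
	if (!vcpu->kvm->arch.use_master_clock ||	/* per-pCPU data differs */
	    vcpu->cpu == -1)				/* first load of this vcpu */
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);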

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -2615,7 +2615,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
kvm_x86_ops->write_tsc_offset(vcpu, offset);
vcpu->arch.tsc_catchup = 1;
}
-   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+   /*
+* On a host with synchronized TSC, there is no need to update
+* kvmclock on vcpu->cpu migration
+*/
+   if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
+   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
if (vcpu->cpu != cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;




[patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory

2012-11-14 Thread Marcelo Tosatti
Otherwise it's possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
migration) to clear the bit.

Noticed by Paolo Bonzini.
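
For context, the race being closed looks roughly like this (schematic
reconstruction, not taken from the changelog):

/*
 * 1. userspace issues KVM_KVMCLOCK_CTRL
 *      -> vcpu->pvclock_set_guest_stopped_request = true
 * 2. kvm_guest_time_update() writes PVCLOCK_GUEST_STOPPED into the
 *    guest's pvclock area and clears the request
 * 3. an unrelated clock update (e.g. after vcpu migration) rebuilds the
 *    flags from scratch and, before this patch, dropped
 *    PVCLOCK_GUEST_STOPPED before the guest ever read it
 *
 * The fix below re-reads the bit from the guest copy and carries it
 * forward.
 */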

Reviewed-by: Gleb Natapov g...@redhat.com
Reviewed-by: Glauber Costa glom...@parallels.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct 
unsigned long this_tsc_khz;
s64 kernel_ns, max_kernel_ns;
u64 tsc_timestamp;
+   struct pvclock_vcpu_time_info *guest_hv_clock;
u8 pvclock_flags;
 
/* Keep irq disabled to prevent changes to the clock */
@@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct 
vcpu->last_kernel_ns = kernel_ns;
vcpu->last_guest_tsc = tsc_timestamp;
 
-   pvclock_flags = 0;
-   if (vcpu->pvclock_set_guest_stopped_request) {
-   pvclock_flags |= PVCLOCK_GUEST_STOPPED;
-   vcpu->pvclock_set_guest_stopped_request = false;
-   }
-
-   vcpu->hv_clock.flags = pvclock_flags;
 
/*
 * The interface expects us to write an even number signaling that the
@@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct 
 
shared_kaddr = kmap_atomic(vcpu->time_page);
 
+   guest_hv_clock = shared_kaddr + vcpu->time_offset;
+
+   /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
+   pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED);
+
+   if (vcpu->pvclock_set_guest_stopped_request) {
+   pvclock_flags |= PVCLOCK_GUEST_STOPPED;
+   vcpu->pvclock_set_guest_stopped_request = false;
+   }
+
+   vcpu->hv_clock.flags = pvclock_flags;
+
memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock,
   sizeof(vcpu->hv_clock));
 




[patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc

2012-11-14 Thread Marcelo Tosatti
Allow the caller to pass the host TSC value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/kvm_host.h
===
--- vsyscall.orig/arch/x86/include/asm/kvm_host.h
+++ vsyscall/arch/x86/include/asm/kvm_host.h
@@ -703,7 +703,7 @@ struct kvm_x86_ops {
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
-   u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+   u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
 
void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2);
 
Index: vsyscall/arch/x86/kvm/lapic.c
===
--- vsyscall.orig/arch/x86/kvm/lapic.c
+++ vsyscall/arch/x86/kvm/lapic.c
@@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_
local_irq_save(flags);
 
now = apic->lapic_timer.timer.base->get_time();
-   guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+   guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc());
if (likely(tscdeadline > guest_tsc)) {
ns = (tscdeadline - guest_tsc) * 1000000ULL;
do_div(ns, this_tsc_khz);
Index: vsyscall/arch/x86/kvm/svm.c
===
--- vsyscall.orig/arch/x86/kvm/svm.c
+++ vsyscall/arch/x86/kvm/svm.c
@@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct
return 0;
 }
 
-u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
return vmcb->control.tsc_offset +
-   svm_scale_tsc(vcpu, native_read_tsc());
+   svm_scale_tsc(vcpu, host_tsc);
 }
 
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
Index: vsyscall/arch/x86/kvm/vmx.c
===
--- vsyscall.orig/arch/x86/kvm/vmx.c
+++ vsyscall/arch/x86/kvm/vmx.c
@@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void)
  * Like guest_read_tsc, but always returns L1's notion of the timestamp
  * counter, even if a nested guest (L2) is currently running.
  */
-u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
-   u64 host_tsc, tsc_offset;
+   u64 tsc_offset;
 
-   rdtscll(host_tsc);
tsc_offset = is_guest_mode(vcpu) ?
to_vmx(vcpu)->nested.vmcs01_tsc_offset :
vmcs_read64(TSC_OFFSET);
Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1175,7 +1175,7 @@ static int kvm_guest_time_update(struct 
 
/* Keep irq disabled to prevent changes to the clock */
local_irq_save(flags);
-   tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
+   tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc());
kernel_ns = get_kernel_ns();
this_tsc_khz = __get_cpu_var(cpu_tsc_khz);
if (unlikely(this_tsc_khz == 0)) {
@@ -5429,7 +5429,8 @@ static int vcpu_enter_guest(struct kvm_v
if (hw_breakpoint_active())
hw_breakpoint_restore();
 
-   vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+   vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu,
+  native_read_tsc());
 
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();




[patch 14/18] KVM: x86: notifier for clocksource changes

2012-11-14 Thread Marcelo Tosatti
Register a notifier for clocksource change event. In case
the host switches to clock other than TSC, disable master
clock usage.
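
One design note, as I read the patch: the notifier fires from
timekeeping_update() with the timekeeper write seqlock held (see patch
13/18 above), so the callback only copies the data and defers the heavier
work to a workqueue item:

/*
 * pvclock_gtod_notify()    - timekeeper lock held; copy fields, then
 *                            queue pvclock_gtod_work if the host left
 *                            the TSC clocksource
 * pvclock_gtod_update_fn() - process context; safe to take KVM locks and
 *                            actually disable the master clock (the body
 *                            is still empty here and presumably gets
 *                            filled in by a later patch)
 */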

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -46,6 +46,8 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/pci.h>
+#include <linux/timekeeper_internal.h>
+#include <linux/pvclock_gtod.h>
 #include <trace/events/kvm.h>
 
 #define CREATE_TRACE_POINTS
@@ -899,6 +901,53 @@ static int do_set_msr(struct kvm_vcpu *v
return kvm_set_msr(vcpu, index, *data);
 }
 
+struct pvclock_gtod_data {
+   seqcount_t  seq;
+
+   struct { /* extract of a clocksource struct */
+   int vclock_mode;
+   cycle_t cycle_last;
+   cycle_t mask;
+   u32 mult;
+   u32 shift;
+   } clock;
+
+   /* open coded 'struct timespec' */
+   u64 monotonic_time_snsec;
+   time_t  monotonic_time_sec;
+};
+
+static struct pvclock_gtod_data pvclock_gtod_data;
+
+static void update_pvclock_gtod(struct timekeeper *tk)
+{
+   struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
+
+   write_seqcount_begin(&vdata->seq);
+
+   /* copy pvclock gtod data */
+   vdata->clock.vclock_mode = tk->clock->archdata.vclock_mode;
+   vdata->clock.cycle_last  = tk->clock->cycle_last;
+   vdata->clock.mask        = tk->clock->mask;
+   vdata->clock.mult        = tk->mult;
+   vdata->clock.shift       = tk->shift;
+
+   vdata->monotonic_time_sec   = tk->xtime_sec
+   + tk->wall_to_monotonic.tv_sec;
+   vdata->monotonic_time_snsec = tk->xtime_nsec
+   + (tk->wall_to_monotonic.tv_nsec
+   << tk->shift);
+   while (vdata->monotonic_time_snsec >=
+   (((u64)NSEC_PER_SEC) << tk->shift)) {
+   vdata->monotonic_time_snsec -=
+   ((u64)NSEC_PER_SEC) << tk->shift;
+   vdata->monotonic_time_sec++;
+   }
+
+   write_seqcount_end(&vdata->seq);
+}
+
+
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
 {
int version;
@@ -995,6 +1044,8 @@ static inline u64 get_kernel_ns(void)
return timespec_to_ns(&ts);
 }
 
+static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
+
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
 unsigned long max_tsc_khz;
 
@@ -1227,7 +1278,6 @@ static int kvm_guest_time_update(struct 
vcpu-last_kernel_ns = kernel_ns;
vcpu-last_guest_tsc = tsc_timestamp;
 
-
/*
 * The interface expects us to write an even number signaling that the
 * update is finished. Since the guest won't see the intermediate
@@ -4894,6 +4944,37 @@ static void kvm_set_mmio_spte_mask(void)
kvm_mmu_set_mmio_spte_mask(mask);
 }
 
+static void pvclock_gtod_update_fn(struct work_struct *work)
+{
+}
+
+static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
+
+/*
+ * Notification about pvclock gtod data update.
+ */
+static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
+  void *priv)
+{
+   struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+   struct timekeeper *tk = priv;
+
+   update_pvclock_gtod(tk);
+
+   /* disable master clock if host does not trust, or does not
+* use, TSC clocksource
+*/
+   if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+   atomic_read(&kvm_guest_has_master_clock) != 0)
+   queue_work(system_long_wq, &pvclock_gtod_work);
+
+   return 0;
+}
+
+static struct notifier_block pvclock_gtod_notifier = {
+   .notifier_call = pvclock_gtod_notify,
+};
+
 int kvm_arch_init(void *opaque)
 {
int r;
@@ -4935,6 +5016,8 @@ int kvm_arch_init(void *opaque)
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
 
kvm_lapic_init();
+   pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
return 0;
 
 out:
@@ -4949,6 +5032,7 @@ void kvm_arch_exit(void)
cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+   pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
kvm_x86_ops = NULL;
kvm_mmu_module_exit();
 }




Re: [patch 13/18] time: export time information for KVM pvclock

2012-11-14 Thread John Stultz

On 11/14/2012 04:08 PM, Marcelo Tosatti wrote:

As suggested by John, export time data similarly to how its
done by vsyscall support. This allows KVM to retrieve necessary
information to implement vsyscall support in KVM guests.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Thanks for the updates here.  The notifier method is interesting, and if 
it works well, we may want to extend it later to cover the vsyscall code 
too, but that can be done in a later iteration.


Acked-by: John Stultz johns...@us.ibm.com

thanks
-john



[PATCH 1/2] acpi_piix4: fix migration of gpe fields

2012-11-14 Thread Marcelo Tosatti
Migrate 16 bytes for en/sts fields (which is the correct size),
increase version to 3, and document how to support incoming
migration from qemu-kvm 1.2.

Acked-by: Paolo Bonzini pbonz...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 hw/acpi_piix4.c |   50 ++
 1 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 15275cf..519269a 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
 @@ -235,10 +235,9 @@ static int vmstate_acpi_post_load(void *opaque, int version_id)
  {   \
  .name   = (stringify(_field)),  \
  .version_id = 0,\
- .num= GPE_LEN,  \
 .info   = &vmstate_info_uint16, \
  .size   = sizeof(uint16_t), \
- .flags  = VMS_ARRAY | VMS_POINTER,  \
+ .flags  = VMS_SINGLE | VMS_POINTER, \
  .offset = vmstate_offset_pointer(_state, _field, uint8_t),  \
  }
 
@@ -267,11 +266,54 @@ static const VMStateDescription vmstate_pci_status = {
 }
 };
 
+static int acpi_load_old(QEMUFile *f, void *opaque, int version_id)
+{
+PIIX4PMState *s = opaque;
+int ret, i;
+uint16_t temp;
+
+ret = pci_device_load(&s->dev, f);
+if (ret < 0) {
+return ret;
+}
+qemu_get_be16s(f, &s->ar.pm1.evt.sts);
+qemu_get_be16s(f, &s->ar.pm1.evt.en);
+qemu_get_be16s(f, &s->ar.pm1.cnt.cnt);
+
+ret = vmstate_load_state(f, &vmstate_apm, opaque, 1);
+if (ret) {
+return ret;
+}
+
+qemu_get_timer(f, s->ar.tmr.timer);
+qemu_get_sbe64s(f, &s->ar.tmr.overflow_time);
+
+qemu_get_be16s(f, (uint16_t *)s->ar.gpe.sts);
+for (i = 0; i < 3; i++) {
+qemu_get_be16s(f, &temp);
+}
+
+qemu_get_be16s(f, (uint16_t *)s->ar.gpe.en);
+for (i = 0; i < 3; i++) {
+qemu_get_be16s(f, &temp);
+}
+
+ret = vmstate_load_state(f, &vmstate_pci_status, opaque, 1);
+return ret;
+}
+
+/* qemu-kvm 1.2 uses version 3 but advertised as 2
+ * To support incoming qemu-kvm 1.2 migration, change version_id
+ * and minimum_version_id to 2 below (which breaks migration from
+ * qemu 1.2).
+ *
+ */
 static const VMStateDescription vmstate_acpi = {
 .name = "piix4_pm",
-.version_id = 2,
-.minimum_version_id = 1,
+.version_id = 3,
+.minimum_version_id = 3,
 .minimum_version_id_old = 1,
+.load_state_old = acpi_load_old,
 .post_load = vmstate_acpi_post_load,
 .fields  = (VMStateField []) {
 VMSTATE_PCI_DEVICE(dev, PIIX4PMState),
-- 
1.7.6.4



[patch 03/18] x86: pvclock: make sure rdtsc doesn't speculate out of region

2012-11-14 Thread Marcelo Tosatti
Originally from Jeremy Fitzhardinge.

pvclock_get_time_values, which contains the memory barriers,
will be removed by the next patch.
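
Background on why the swap matters (my summary, not from the changelog):
barrier() is only a compiler barrier, while rdtsc_barrier() also emits a
CPU fence, so the TSC read really is confined to the version-checked
window:

/*
 *   version = src->version;
 *   rdtsc_barrier();        <- TSC read below cannot be hoisted above this
 *   ... native_read_tsc() inside pvclock_get_nsec_offset() ...
 *   rdtsc_barrier();        <- ... nor sunk below the version re-check
 *   } while (version != src->version);
 */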

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct 
 
do {
version = pvclock_get_time_values(&shadow, src);
-   barrier();
+   rdtsc_barrier();
offset = pvclock_get_nsec_offset(shadow);
ret = shadow.system_timestamp + offset;
-   barrier();
+   rdtsc_barrier();
-   } while (version != src->version);
 
if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&




[patch 05/18] x86: pvclock: create helper for pvclock data retrieval

2012-11-14 Thread Marcelo Tosatti
Originally from Jeremy Fitzhardinge.

So code can be reused.
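
A note on the retry convention the helper preserves (my wording): the
hypervisor bumps src->version to an odd value while it rewrites the area
and back to an even value when done, so callers keep the pattern:

	do {
		version = __pvclock_read_cycles(src, &cycles, &flags);
	} while ((src->version & 1) || version != src->version);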

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags)
valid_flags = flags;
 }
 
-static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
-{
-   u64 delta = native_read_tsc() - src->tsc_timestamp;
-   return pvclock_scale_delta(delta, src->tsc_to_system_mul,
-  src->tsc_shift);
-}
-
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
 {
u64 pv_tsc_khz = 100ULL  32;
@@ -55,17 +48,12 @@ void pvclock_resume(void)
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
unsigned version;
-   cycle_t ret, offset;
+   cycle_t ret;
u64 last;
u8 flags;
 
do {
-   version = src->version;
-   rdtsc_barrier();
-   offset = pvclock_get_nsec_offset(src);
-   ret = src->system_time + offset;
-   flags = src->flags;
-   rdtsc_barrier();
+   version = __pvclock_read_cycles(src, &ret, &flags);
} while ((src->version & 1) || version != src->version);
 
if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -56,4 +56,32 @@ static inline u64 pvclock_scale_delta(u6
return product;
 }
 
+static __always_inline
+u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src)
+{
+   u64 delta = __native_read_tsc() - src->tsc_timestamp;
+   return pvclock_scale_delta(delta, src->tsc_to_system_mul,
+  src->tsc_shift);
+}
+
+static __always_inline
+unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
+  cycle_t *cycles, u8 *flags)
+{
+   unsigned version;
+   cycle_t ret, offset;
+   u8 ret_flags;
+
+   version = src->version;
+   rdtsc_barrier();
+   offset = pvclock_get_nsec_offset(src);
+   ret = src->system_time + offset;
+   ret_flags = src->flags;
+   rdtsc_barrier();
+
+   *cycles = ret;
+   *flags = ret_flags;
+   return version;
+}
+
 #endif /* _ASM_X86_PVCLOCK_H */




Re: [patch 08/18] sched: add notifier for cross-cpu migrations

2012-11-14 Thread Gleb Natapov
CCing Peter and Ingo.

On Wed, Nov 14, 2012 at 10:08:31PM -0200, Marcelo Tosatti wrote:
 Originally from Jeremy Fitzhardinge.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 Index: vsyscall/include/linux/sched.h
 ===
 --- vsyscall.orig/include/linux/sched.h
 +++ vsyscall/include/linux/sched.h
 @@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
  extern void calc_global_load(unsigned long ticks);
  extern void update_cpu_load_nohz(void);
  
 +/* Notifier for when a task gets migrated to a new CPU */
 +struct task_migration_notifier {
 + struct task_struct *task;
 + int from_cpu;
 + int to_cpu;
 +};
 +extern void register_task_migration_notifier(struct notifier_block *n);
 +
  extern unsigned long get_parent_ip(unsigned long addr);
  
  struct seq_file;
 Index: vsyscall/kernel/sched/core.c
 ===
 --- vsyscall.orig/kernel/sched/core.c
 +++ vsyscall/kernel/sched/core.c
 @@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
  rq->skip_clock_update = 1;
  }
  
 +static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
 +
 +void register_task_migration_notifier(struct notifier_block *n)
 +{
 + atomic_notifier_chain_register(&task_migration_notifier, n);
 +}
 +
  #ifdef CONFIG_SMP
  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
  {
 @@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
   trace_sched_migrate_task(p, new_cpu);
  
   if (task_cpu(p) != new_cpu) {
 + struct task_migration_notifier tmn;
 +
  p->se.nr_migrations++;
   perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
 +
 + tmn.task = p;
 + tmn.from_cpu = task_cpu(p);
 + tmn.to_cpu = new_cpu;
 +
 + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
   }
  
   __set_task_cpu(p, new_cpu);
 

--
Gleb.
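
For what it's worth, a consumer of the new chain would look roughly like
the sketch below (illustrative only, names made up; the in-tree user is
the guest-side pvclock code elsewhere in the series):

#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/sched.h>

static int example_migrate_notify(struct notifier_block *nb,
				  unsigned long unused, void *data)
{
	struct task_migration_notifier *tmn = data;

	pr_debug("task %d moved from cpu %d to cpu %d\n",
		 tmn->task->pid, tmn->from_cpu, tmn->to_cpu);
	return NOTIFY_DONE;
}

static struct notifier_block example_migrate_nb = {
	.notifier_call = example_migrate_notify,
};

/* at init time: */
register_task_migration_notifier(&example_migrate_nb);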