Re: performance trouble

2012-02-03 Thread David Cure
Hello,

On Thu, Feb 02, 2012 at 11:41:48AM +0100, David Cure wrote:
> 
>   For kvm_stats, I mean it's better to have only one VM with one
> user for the test so I send this evening or tomorrow morning.

I'm attaching snapshots of kvm_stat: the first taken just before
running the function and the second at the end.

Here is the vmstat 1 output taken while the function was running:


procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  0  0 32601320  21360 1987760049  1123 14577 21601  8 4 88  0
 4  0  0 32601196  21360 19877600 0   141 16559 21315 12 3 84  0
 2  0  0 32601212  21360 19877600 0   128 17377 24949 10 4 86  0
 2  0  0 32601128  21360 19877600 0   120 13064 18986 13 3 85  0
 4  0  0 32601216  21360 19877600 0   284 16052 21569 11 4 85  0
 3  0  0 32601132  21360 19877600 0   192 15846 23323 10 3 87  0
 3  0  0 32601492  21360 19877600 0   564 19116 27478 10 4 85  0
 4  0  0 32601492  21360 19877600 0 0 16996 24286 14 5 81  0
 2  0  0 32601492  21360 19877600 0   661 15754 23371  9 4 87  0
 3  0  0 32601268  21360 1987760036 0 9808 13849 11 1 88  0
 2  0  0 32601304  21360 19877600 036 11846 13826 14 1 85  0
 3  0  0 32601304  21360 198776006437 9445 12263 18 1 81  0
 4  0  0 32601604  21360 19877600   29020 9959 14786 10 2 88  0

I hope you can see something strange ;)

David.

KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread Marcelo Tosatti

Increase recommended max vcpus from 64 to 160 (tested internally
at Red Hat).

Signed-off-by: Marcelo Tosatti 

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4610166..782d973 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -29,7 +29,7 @@
 #include 
 
 #define KVM_MAX_VCPUS 254
-#define KVM_SOFT_MAX_VCPUS 64
+#define KVM_SOFT_MAX_VCPUS 160
 #define KVM_MEMORY_SLOTS 32
 /* memory slots that does not exposed to userspace */
 #define KVM_PRIVATE_MEM_SLOTS 4



Re: KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread Sasha Levin
On Fri, 2012-02-03 at 12:28 -0200, Marcelo Tosatti wrote:
> Increase recommended max vcpus from 64 to 160 (tested internally
> at Red Hat).
> 
> Signed-off-by: Marcelo Tosatti 

Awesome!

Would it be possible to share the benchmarks?



Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2012-02-03 Thread Roopa Prabhu



On 2/2/12 10:58 AM, "John Fastabend"  wrote:

> On 2/2/2012 10:07 AM, Roopa Prabhu wrote:

..
>> 
>> My patches were trying to do just this (unless I am missing something).
>> 
> 
> Right I was trying enumerate the cases. Your patches 5,6 seem to use
> dev_{uc|mc}_{add|del} like this.
> 
Ok

>>> 
>>> I think this has some real advantages to the above scheme. First
>>> we get rid of _all_ the drivers having to add a bunch of new
>>> net_device ops and do it once in the layer above. This is nice
>>> for driver implementers but also because your feature becomes usable
>>> immediately and we don't have to wait for driver developers to implement
>>> it.
>> 
>> Yes my patches were targeting towards this too. I had macvlan implement the
>> netlink ops and macvlan internally was using the dev_uc_add and del routines
>> to pass the addr lists to lower device.
> 
> Yes. But I am missing why the VF ops and netlink extensions are useful. Or
> even the op/netlink extension into the PF for that matter.
> 

This was to support something that Intel wanted; I think Ben described that
in another email. It was added in the v3 version of the patch and is not
needed for the macvlan problem that this patch is trying to solve. We were
trying to make the new netlink interface fit other use cases if possible.

I believe at this point we are convinced that the hybrid acceleration with
PF/VF that Ben described can probably be solved in other ways.

>> 
>>> 
>>> Also it prunes down the number of netlink extensions being added
>>> here. 
>>> 
>>> Additionally the existing semantics seem a bit strange to me on the
>>> netlink message side. Taking a quick look at the macvlan implementation
>>> it looks like every set has to have a complete list of address. But
>>> the dev_uc_add and dev_uc_del seem to be using a refcnt scheme so
>>> if I want to add a second address and then latter a third address
>>> how does that work?
>> 
>> Every set has a complete list of addresses because, for macvlan non-passthru
>> modes, in future we might want to have macvlan driver do the filtering (This
>> is for the case when we have a single lower device and multiple macvlans)
>> 
> 
> hmm but lists seem problematic when hooked up to the netdev uc and mc addr
> lists. Consider this case
> 
> read uc_list  <--- thread1: dumps unicast table
> add vlan  <--- thread2: adds a vlan maybe inserting a uc addr
> write uc_list <--- thread1: writes the table back + 1 addr
> 
> Does the uc addr of the vlan get deleted? And this case

It should.

> 
> read uc_list   <--- dump table
> write uc_list  <--- add a new filter A to the uc list
> read uc_list   <--- dump table
> write uc_list  <--- add a second filter B to the uc list
> 
> Now based on your patch 4,5 it looks like the refcnt on the address A is
> two so to remove it I have to call set filters twice without the A addr.

I think this depends on the implementation of the handler. The macvtap
handler will remove A and add B if A is not part of the second filter.


> 
> read  uc_list   <--- dump table
> write uc_list   <--- list without A
> write uc_list   <--- list without A
> 
> This seems really easy to get screwed up and it doesn't look like user
> space can learn the refcnt (at least in this series).
> 
> 


The sequences you are describing above are possible, but I think they depend
on how the filter handler is implemented.
For macvlan, this filter op could just populate the internal filter, except
that for passthru mode it tries to program the lower device's hw filter
using the uc/mc lists. In passthru mode it assumes that it owns the lower
device and tries to make sure that it adds or deletes only addresses it
knows about. I think if another thread is also adding/deleting addresses
on the lower device while it is assigned to macvtap in passthru mode, that
thread might see inconsistent results.

The netlink filter interface which the patch was trying to add was not a
replacement for the existing uc/mc lists. It was really targeted at virtual
devices that want to do filtering on their own.
I believe the uc/mc lists are used to set/unset hw filters and are not used
in forwarding lookups in the kernel (please correct me if I am wrong). The
filter that this netlink msg was trying to set is for devices that want to
do filtering/forwarding lookups the way the hw does.
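
(A toy model of the refcount semantics being discussed above -- this is not
the kernel's dev_uc_add()/dev_uc_del() implementation, just a self-contained
sketch of why an address that has been written twice needs two deletions
before it really disappears.)

#include <stdio.h>
#include <string.h>

#define MAX_ADDRS 8

struct entry {
        unsigned char addr[6];
        int refcount;
};

static struct entry uc_list[MAX_ADDRS];

/* add: bump the refcount if the address is already present */
static int uc_add(const unsigned char *addr)
{
        int i;

        for (i = 0; i < MAX_ADDRS; i++)
                if (uc_list[i].refcount && !memcmp(uc_list[i].addr, addr, 6)) {
                        uc_list[i].refcount++;
                        return 0;
                }
        for (i = 0; i < MAX_ADDRS; i++)
                if (!uc_list[i].refcount) {
                        memcpy(uc_list[i].addr, addr, 6);
                        uc_list[i].refcount = 1;
                        return 0;
                }
        return -1;              /* list full */
}

/* del: the entry only goes away once the refcount reaches zero */
static int uc_del(const unsigned char *addr)
{
        int i;

        for (i = 0; i < MAX_ADDRS; i++)
                if (uc_list[i].refcount && !memcmp(uc_list[i].addr, addr, 6)) {
                        uc_list[i].refcount--;
                        return 0;
                }
        return -1;
}

int main(void)
{
        const unsigned char a[6] = { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 };

        uc_add(a);
        uc_add(a);      /* a second "set" that carries the same address */
        uc_del(a);
        printf("after one delete:  refcount %d\n", uc_list[0].refcount);
        uc_del(a);
        printf("after two deletes: refcount %d\n", uc_list[0].refcount);
        return 0;
}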

>>> 
>>> Is the expected flow from user space 'read uc_list -> write uc_list'?
>>> This seems risky because with two adders in user space you might
>>> lose addresses unless they are somehow kept in sync. IMHO it is likely
>>> easier to implement an ADD and DEL attribute rather than a table
>>> approach.
>> 
>> The ADD and DEL will work for macvlan passthru mode because it maps 1-1 with
>> the lowerdev uc and mc list. The table was for non passthru modes when
>> macvlan driver might need to do filtering. So my patchset started with
>> macvlan filter table for all macvlan modes (hopefully) with passthru mode as
>> a specific case of offloading everything to the lower

[patch 0/8] KVM: Remaining body of TSC emulation work (Zachary Amsden)

2012-02-03 Thread Marcelo Tosatti
Rebase Zachary's last TSC series. From his original email:

"This is the remaining bulk of work I have related to TSC emulation.
In summary, I believe this fixes all known issues with TSC.  A few
rather subtle issues are cleaned up, S4 suspend is fixed, and the
API for adjust_tsc_offset is expanded to take arguments in either
guest cycles or optionally host cycles.  The final patch adds
software TSC emulation, with a configurable default setting.

Note that TSC trapping will only be used when other methods fail
due to unstable TSC or lack of scaling hardware.  With the improved
offset matching, I was even able to get SMP guests to boot with a
TSC clock; cycle testing showed a maximum backwards leap of around
0.25 ms, which is actually fairly good.  With the last patch applied,
software TSC emulation kicks in and the backwards TSC count even on
my broken hardware dropped to zero.

Some of this code (the S4 suspend compensation) has already been
ported into RHEL to fix various bugs - however upstream had diverged
a bit from the original course I planned; I had to add a few things
that had been optimized out of upstream back in (last_host_tsc).

In the course of doing this, I think the new code looks much
cleaner, with well documented and easy to understand variables.
Yes, there are a lot of tracking variables to maintain this whole
infrastructure - and yes, some of them appear to be redundant or
easily computable from others - but in actuality, the information
they provide is easy to understand, and the resulting code is much
easier to verify than a complex system where some quantities may
be undefined or computed on the fly and thus causing subtle races.
"

The last patch which introduces TSC trapping has not been included: 
it unconditionally enables TSC trapping if the guest frequency 
differs from the host or the host TSC is unstable. It should instead
use tsc catchup if the guest frequency is higher than the host. Also, 
there is no need for trapping if the guest is using kvmclock.





Re: KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread Marcelo Tosatti
On Fri, Feb 03, 2012 at 10:28:57AM -0500, Sasha Levin wrote:
> On Fri, 2012-02-03 at 12:28 -0200, Marcelo Tosatti wrote:
> > Increase recommended max vcpus from 64 to 160 (tested internally
> > at Red Hat).
> > 
> > Signed-off-by: Marcelo Tosatti 
> 
> Awesome!
> 
> Would it be possible to share the benchmarks?

Not sure. Shak?



[patch 7/8] Dont mark TSC unstable due to S4 suspend

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden 

During a host suspend, TSC may go backwards, which KVM interprets
as an unstable TSC.  Technically, KVM should not be marking the
TSC unstable, which causes the TSC clocksource to go bad, but we
need to be adjusting the TSC offsets in such a case.

Dealing with this issue is a little tricky as the only place we
can reliably do it is before much of the timekeeping infrastructure
is up and running.  On top of this, we are not in a KVM thread
context, so we may not be able to safely access VCPU fields.
Instead, we compute our best known hardware offset at power-up and
stash it to be applied to all VCPUs when they actually start running.

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -423,6 +423,7 @@ struct kvm_vcpu_arch {
u64 last_tsc_nsec;
u64 last_tsc_write;
u64 last_host_tsc;
+   u64 tsc_offset_adjustment;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -2238,6 +2238,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
}
 
kvm_x86_ops->vcpu_load(vcpu, cpu);
+
+   /* Apply any externally detected TSC adjustments (due to suspend) */
+   if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
+   adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
+   vcpu->arch.tsc_offset_adjustment = 0;
+   set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+   }
+
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
native_read_tsc() - vcpu->arch.last_host_tsc;
@@ -5950,13 +5958,88 @@ int kvm_arch_hardware_enable(void *garba
struct kvm *kvm;
struct kvm_vcpu *vcpu;
int i;
+   int ret;
+   u64 local_tsc;
+   u64 max_tsc = 0;
+   bool stable, backwards_tsc = false;
 
kvm_shared_msr_cpu_online();
-   list_for_each_entry(kvm, &vm_list, vm_list)
-   kvm_for_each_vcpu(i, vcpu, kvm)
-   if (vcpu->cpu == smp_processor_id())
-   kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-   return kvm_x86_ops->hardware_enable(garbage);
+   ret = kvm_x86_ops->hardware_enable(garbage);
+   if (ret != 0)
+   return ret;
+
+   local_tsc = native_read_tsc();
+   stable = !check_tsc_unstable();
+   list_for_each_entry(kvm, &vm_list, vm_list) {
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!stable && vcpu->cpu == smp_processor_id())
+   set_bit(KVM_REQ_CLOCK_UPDATE, &vcpu->requests);
+   if (stable && vcpu->arch.last_host_tsc > local_tsc) {
+   backwards_tsc = true;
+   if (vcpu->arch.last_host_tsc > max_tsc)
+   max_tsc = vcpu->arch.last_host_tsc;
+   }
+   }
+   }
+
+   /*
+* Sometimes, even reliable TSCs go backwards.  This happens on
+* platforms that reset TSC during suspend or hibernate actions, but
+* maintain synchronization.  We must compensate.  Fortunately, we can
+* detect that condition here, which happens early in CPU bringup,
+* before any KVM threads can be running.  Unfortunately, we can't
+* bring the TSCs fully up to date with real time, as we aren't yet far
+* enough into CPU bringup that we know how much real time has actually
+* elapsed; our helper function, get_kernel_ns() will be using boot
+* variables that haven't been updated yet.
+*
+* So we simply find the maximum observed TSC above, then record the
+* adjustment to TSC in each VCPU.  When the VCPU later gets loaded,
+* the adjustment will be applied.  Note that we accumulate
+* adjustments, in case multiple suspend cycles happen before some VCPU
+* gets a chance to run again.  In the event that no KVM threads get a
+* chance to run, we will miss the entire elapsed period, as we'll have
+* reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
+* loose cycle time.  This isn't too big a deal, since the loss will be
+* uniform across all VCPUs (not to mention the scenario is extremely
+* unlikely). It is possible that a second hibernate recovery happens
+* much faster than a first, causing the observed TSC here to be
+* smaller; this would require additional padding adjustment, which is
+* why we set l

[patch 2/8] Improve TSC offset matching

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden 

There are a few improvements that can be made to the TSC offset
matching code.  First, we don't need to call the 128-bit multiply
(especially on a constant number); the code works much more cleanly
doing the computation in nanosecond units.

Second, the way everything is setup with software TSC rate scaling,
we currently have per-cpu rates.  Obviously this isn't too desirable
to use in practice, but if for some reason we do change the rate of
all VCPUs at runtime, then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.

Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.
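
(A small standalone illustration of the nanosecond-unit comparison this
patch introduces; the numbers are made up, only the arithmetic mirrors the
hunk below.)

#include <stdio.h>

int main(void)
{
        long long virtual_tsc_khz = 2500000;      /* 2.5 GHz guest TSC rate    */
        long long last_tsc_write  = 0;            /* previous guest TSC write  */
        long long data            = 2500000000LL; /* new guest TSC write value */
        long long elapsed_ns      = 1000100000LL; /* real time since last write */

        /* convert the cycle delta into nanoseconds of virtual time */
        long long nsdiff = ((data - last_tsc_write) * 1000) / virtual_tsc_khz;
        nsdiff -= elapsed_ns;
        if (nsdiff < 0)
                nsdiff = -nsdiff;

        /* 100000 ns is far below the one-second threshold, so this write is
         * treated as a synchronization attempt and the offset is matched. */
        printf("nsdiff = %lld ns\n", nsdiff);
        return 0;
}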

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -513,6 +513,7 @@ struct kvm_arch {
u64 last_tsc_nsec;
u64 last_tsc_offset;
u64 last_tsc_write;
+   u32 last_tsc_khz;
 
struct kvm_xen_hvm_config xen_hvm_config;
 
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1018,33 +1018,39 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
struct kvm *kvm = vcpu->kvm;
u64 offset, ns, elapsed;
unsigned long flags;
-   s64 sdiff;
+   s64 nsdiff;
 
raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
ns = get_kernel_ns();
elapsed = ns - kvm->arch.last_tsc_nsec;
-   sdiff = data - kvm->arch.last_tsc_write;
-   if (sdiff < 0)
-   sdiff = -sdiff;
+
+   /* n.b - signed multiplication and division required */
+   nsdiff = data - kvm->arch.last_tsc_write;
+   nsdiff = (nsdiff * 1000) / vcpu->arch.virtual_tsc_khz;
+   nsdiff -= elapsed;
+   if (nsdiff < 0)
+   nsdiff = -nsdiff;
 
/*
-* Special case: close write to TSC within 5 seconds of
-* another CPU is interpreted as an attempt to synchronize
-* The 5 seconds is to accommodate host load / swapping as
-* well as any reset of TSC during the boot process.
-*
-* In that case, for a reliable TSC, we can match TSC offsets,
-* or make a best guest using elapsed value.
-*/
-   if (sdiff < nsec_to_cycles(vcpu, 5ULL * NSEC_PER_SEC) &&
-   elapsed < 5ULL * NSEC_PER_SEC) {
+* Special case: TSC write with a small delta (1 second) of virtual
+* cycle time against real time is interpreted as an attempt to
+* synchronize the CPU.
+ *
+* For a reliable TSC, we can match TSC offsets, and for an unstable
+* TSC, we add elapsed time in this computation.  We could let the
+* compensation code attempt to catch up if we fall behind, but
+* it's better to try to match offsets from the beginning.
+ */
+   if (nsdiff < NSEC_PER_SEC &&
+   vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
if (!check_tsc_unstable()) {
offset = kvm->arch.last_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
u64 delta = nsec_to_cycles(vcpu, elapsed);
-   offset += delta;
+   data += delta;
+   offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
ns = kvm->arch.last_tsc_nsec;
@@ -1052,6 +1058,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
kvm->arch.last_tsc_offset = offset;
+   kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
kvm_x86_ops->write_tsc_offset(vcpu, offset);
raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 




[patch 4/8] Fix last_guest_tsc / tsc_offset semantics

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden 

The variable last_guest_tsc was being used as an ad-hoc indicator
that the guest TSC has been initialized and recorded correctly.  However,
it may not have been; it could be that the guest TSC has been set to some
large value, then back to a small value (by, say, a software reboot).

This defeats the logic and causes KVM to falsely assume that the
guest TSC has gone backwards, marking the host TSC unstable, which
is undesirable behavior.

In addition, rather than try to compute an offset adjustment for the
TSC on unstable platforms, just recompute the whole offset.  This
allows us to get rid of one callsite for adjust_tsc_offset, which
is problematic because the units it takes are in guest units, but
here, the computation was originally being done in host units.

Doing this, and also recording last_guest_tsc when the TSC is written,
allows us to remove the tricky logic which depended on last_guest_tsc
being zero to indicate a reset or an uninitialized value.

Instead, we now have the guarantee that the guest TSC offset is
always at least something which will get us last_guest_tsc.

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1065,6 +1065,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
vcpu->arch.hv_clock.tsc_timestamp = 0;
vcpu->arch.last_tsc_write = data;
vcpu->arch.last_tsc_nsec = ns;
+   vcpu->arch.last_guest_tsc = data;
 }
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
@@ -1133,7 +1134,7 @@ static int kvm_guest_time_update(struct 
 * observed by the guest and ensure the new system time is greater.
 */
max_kernel_ns = 0;
-   if (vcpu->hv_clock.tsc_timestamp && vcpu->last_guest_tsc) {
+   if (vcpu->hv_clock.tsc_timestamp) {
max_kernel_ns = vcpu->last_guest_tsc -
vcpu->hv_clock.tsc_timestamp;
max_kernel_ns = pvclock_scale_delta(max_kernel_ns,
@@ -2243,13 +2244,14 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
u64 tsc;
 
tsc = kvm_x86_ops->read_l1_tsc(vcpu);
-   tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
-tsc - vcpu->arch.last_guest_tsc;
+   tsc_delta = tsc - vcpu->arch.last_guest_tsc;
 
if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
-   kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
+   u64 offset = kvm_x86_ops->compute_tsc_offset(vcpu,
+   vcpu->arch.last_guest_tsc);
+   kvm_x86_ops->write_tsc_offset(vcpu, offset);
vcpu->arch.tsc_catchup = 1;
}
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);




Re: Need advice how to fix an access to uninitialized per_cpu clock

2012-02-03 Thread Marcelo Tosatti
On Thu, Feb 02, 2012 at 12:43:00PM +0100, Igor Mammedov wrote:
> While playing with kvm cpu hot-plug, I've probably stumbled on general
> kernel bug. So I'm lookng for advice on approach to fix it.
> 
> When kvm guest uses kvmclock, it may hang on cpu hot-plug at
> 
> BSP:
>  smp_apic_timer_interrupt
>...
>  -> do_timer
>-> update_wall_time
> 
> and a being on-lined CPU at waiting on sync with BSP at:
>  start_secondary:
>while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
> cpu_relax();
> 
> because of unusually big value returned by clock->read(clock) in
> update_wall_time.
> This happens due to overflow in pvclock_get_nsec_offset
> 
> u64 delta = tsc - shadow->tsc_timestamp;
> 
> when shadow->tsc_timestamp is bigger than tsc.
> And since pvclock_clocksource_read remembers and returns largest
> value of any clock that ever was returned, clock for affected guest
> virtually freezes.
> 
> Overflow happens due to reading undefined values from uninitialized
> per_cpu variable hv_clock. In case of cpu hot-plug, clock is read at
> least once on being on-lined cpu before it is initialized for this
> cpu:
> 
>   start_secondary
> |-> smp_callin
> |  ->  smp_store_cpu_info
> |-> identify_secondary_cpu
> |  -> mtrr_ap_init
> |-> mtrr_restore
> |  -> stop_machine_from_inactive_cpu
> |-> queue_stop_cpus_work
> |  ...
> |-> kvm_clock_read
> |...
> |x86_cpuinit.setup_percpu_clockev
> 
> full call chain is below:
> [0.002999]  [] pvclock_clocksource_read+0x9f/0x275
> [0.002999]  [] ? check_preempt_wakeup+0x11c/0x1c2
> [0.002999]  [] kvm_clock_read+0x21/0xd3
> [0.002999]  [] sched_clock+0x9/0xd
> [0.002999]  [] sched_clock_local+0x12/0x75
> [0.002999]  [] sched_clock_cpu+0x84/0xc6
> [0.002999]  [] update_rq_clock+0x28/0x108
> [0.002999]  [] enqueue_task+0x1d/0x64
> [0.002999]  [] activate_task+0x22/0x24
> [0.002999]  [] ttwu_do_activate.constprop.39+0x32/0x61
> [0.002999]  [] try_to_wake_up+0x17e/0x1e1
> [0.002999]  [] wake_up_process+0x15/0x17
> [0.002999]  [] cpu_stop_queue_work+0x3d/0x5f
> [0.002999]  [] queue_stop_cpus_work+0x8b/0xb6
> [0.002999]  [] stop_machine_from_inactive_cpu+0xb4/0xed
> [0.002999]  [] ? mtrr_restore+0x4a/0x4a
> [0.002999]  [] mtrr_ap_init+0x5a/0x5c
> [0.002999]  [] identify_secondary_cpu+0x19/0x1b
> [0.002999]  [] smp_store_cpu_info+0x3c/0x3e
> [0.002999]  [] start_secondary+0x122/0x263
> 
> Looking at native_smp_prepare_cpus in arch/x86/kernel/smpboot.c
> I see that it unconditionally calls set_mtrr_aps_delayed_init which
> in turn effectively converts calls to mtrr_ap_init to nop ops.
> And later call to native_smp_cpus_done -> mtrr_aps_init completes
> mtrr initialization.
> 
> The same pattern might be noticed in suspend/hibernate handlers, that
> call enable_nonboot_cpus.
> 
>   enable_nonboot_cpus
> -> arch_enable_nonboot_cpus_begin
>  -> set_mtrr_aps_delayed_init
> 
> -> boot secondary cpus
> 
> -> arch_enable_nonboot_cpus_end
>  -> mtrr_aps_init
> 
> So question is if the calling mtrr_ap_init so early in
> identify_secondary_cpu is really necessary?
> From current code it looks like it is never called at normal smp
> boot/resume time. And only path that triggers it is cpu hot-plug one.
> 
> I see following possible solutions:
> 
> 1. Could the call to mtrr_ap_init be just moved from identify_secondary_cpu
> to start_secondary right after x86_cpuinit.setup_percpu_clockev call?
>
> This will prevent an access to uninitialized per_cpu clock. Or move
> x86_cpuinit.setup_percpu_clockev before smp_callin?
> 
> 2. Another way to prevent access to uninitialized per_cpu clock is to
> introduce hook.early_percpu_clock_init that will be called before
> smp_callin or right before mtrr_ap_init.
> 
> I hope to see opinions about this matter from a more experienced people than 
> me.

It would be more robust to introduce x86_cpuinit.early_percpu_clock_init
and use it for KVM, replacing setup_percpu_clockev hook.

setup_percpu_clockev initializes the APIC timer, but clearly sched_clock
should be functional before that, for example for ftrace (in addition to
sched_clock from scheduler as in the trace above).
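
(To make the failure mode concrete, here is a toy, self-contained model --
hypothetical names, not the actual x86_cpuinit plumbing: if the per-cpu
clock area is read before the per-cpu init hook has run on a hot-plugged
CPU, a stale tsc_timestamp makes the unsigned delta underflow to a huge
value, which is exactly the freeze Igor describes.)

#include <stdio.h>

#define NR_CPUS 2

struct pvclock {
        unsigned long long tsc_timestamp;
        int initialized;
};

static struct pvclock per_cpu_clock[NR_CPUS];

/* stands in for a hypothetical early_percpu_clock_init() hook */
static void early_percpu_clock_init(int cpu)
{
        per_cpu_clock[cpu].tsc_timestamp = 1000;   /* pretend: current TSC */
        per_cpu_clock[cpu].initialized = 1;
}

static unsigned long long clock_read(int cpu, unsigned long long tsc)
{
        if (!per_cpu_clock[cpu].initialized)
                fprintf(stderr, "cpu%d: clock read before init!\n", cpu);
        /* unsigned subtraction: underflows if tsc_timestamp > tsc */
        return tsc - per_cpu_clock[cpu].tsc_timestamp;
}

int main(void)
{
        /* simulate stale/garbage contents in the hot-plugged CPU's area */
        per_cpu_clock[1].tsc_timestamp = 999999;

        printf("before init: delta = %llu\n", clock_read(1, 500));
        early_percpu_clock_init(1);
        printf("after  init: delta = %llu\n", clock_read(1, 1500));
        return 0;
}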



[patch 1/8] Infrastructure for software and hardware based TSC rate scaling

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden 

This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate.  Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code.  This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.

There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed.  Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from the user-requested rate at which scaling
will be used.  The default is 250 ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.

In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.
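
(A rough sketch of the ppm tolerance check described above; the names and
the exact comparison are illustrative, not lifted from the patch.)

#include <stdbool.h>
#include <stdio.h>

static unsigned int tsc_tolerance_ppm = 250;

/* scaling is only engaged if the requested rate is outside the tolerance */
static bool needs_scaling(unsigned int user_tsc_khz, unsigned int host_tsc_khz)
{
        unsigned long long slack = (unsigned long long)host_tsc_khz *
                                   tsc_tolerance_ppm / 1000000;

        return user_tsc_khz < host_tsc_khz - slack ||
               user_tsc_khz > host_tsc_khz + slack;
}

int main(void)
{
        /* 2.5 GHz host: 250 ppm allows 625 kHz of deviation either way */
        printf("%d\n", needs_scaling(2500400, 2500000));   /* 0: close enough  */
        printf("%d\n", needs_scaling(2600000, 2500000));   /* 1: scale/catchup */
        return 0;
}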

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -422,10 +422,11 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
-   u32 virtual_tsc_khz;
bool tsc_catchup;
-   u32  tsc_catchup_mult;
-   s8   tsc_catchup_shift;
+   bool tsc_always_catchup;
+   s8 virtual_tsc_shift;
+   u32 virtual_tsc_mult;
+   u32 virtual_tsc_khz;
 
atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
unsigned nmi_pending; /* NMI queued after currently running handler */
@@ -651,7 +652,7 @@ struct kvm_x86_ops {
 
bool (*has_wbinvd_exit)(void);
 
-   void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz);
+   void (*set_tsc_khz)(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale);
void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
 
u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc);
Index: kvm/arch/x86/kvm/svm.c
===
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -964,20 +964,25 @@ static u64 svm_scale_tsc(struct kvm_vcpu
return _tsc;
 }
 
-static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+static void svm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 {
struct vcpu_svm *svm = to_svm(vcpu);
u64 ratio;
u64 khz;
 
-   /* TSC scaling supported? */
-   if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR))
+   /* Guest TSC same frequency as host TSC? */
+   if (!scale) {
+   svm->tsc_ratio = TSC_RATIO_DEFAULT;
return;
+   }
 
-   /* TSC-Scaling disabled or guest TSC same frequency as host TSC? */
-   if (user_tsc_khz == 0) {
-   vcpu->arch.virtual_tsc_khz = 0;
-   svm->tsc_ratio = TSC_RATIO_DEFAULT;
+   /* TSC scaling supported? */
+   if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+   if (user_tsc_khz > tsc_khz) {
+   vcpu->arch.tsc_catchup = 1;
+   vcpu->arch.tsc_always_catchup = 1;
+   } else
+   WARN(1, "user requested TSC rate below hardware speed\n");
return;
}
 
@@ -992,7 +997,6 @@ static void svm_set_tsc_khz(struct kvm_v
user_tsc_khz);
return;
}
-   vcpu->arch.virtual_tsc_khz = user_tsc_khz;
svm->tsc_ratio = ratio;
 }
 
Index: kvm/arch/x86/kvm/vmx.c
===
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -1817,13 +1817,19 @@ u64 vmx_read_l1_tsc(struct kvm_vcpu *vcp
 }
 
 /*
- * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
- * ioctl. In this case the call-back should update internal vmx state to make
- * the changes effective.
+ * Engage any workarounds for mis-matched TSC rates.  Currently limited to
+ * software catchup for faster rates on slower CPUs.
  */
-static void vmx_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+static void vmx_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 {
-   /* Nothing to do here */
+   if (!scale)
+   return;
+
+   if (user_tsc_khz > tsc_khz) {
+   vcpu->arch.tsc_catchup = 1;
+   vcpu->arch.tsc_always_catchup = 1;
+   } else
+   WARN(1, "user requested TSC rate below hardware speed\n");

[patch 8/8] Track TSC synchronization in generations

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden 

This allows us to track the original nanosecond and counter values
at each phase of TSC writing by the guest.  This gets us perfect
offset matching for stable TSC systems, and perfect software
computed TSC matching for machines with unstable TSC.

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -420,10 +420,11 @@ struct kvm_vcpu_arch {
 
u64 last_guest_tsc;
u64 last_kernel_ns;
-   u64 last_tsc_nsec;
-   u64 last_tsc_write;
u64 last_host_tsc;
u64 tsc_offset_adjustment;
+   u64 this_tsc_nsec;
+   u64 this_tsc_write;
+   u8  this_tsc_generation;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
@@ -513,9 +514,12 @@ struct kvm_arch {
s64 kvmclock_offset;
raw_spinlock_t tsc_write_lock;
u64 last_tsc_nsec;
-   u64 last_tsc_offset;
u64 last_tsc_write;
u32 last_tsc_khz;
+   u64 cur_tsc_nsec;
+   u64 cur_tsc_write;
+   u64 cur_tsc_offset;
+   u8  cur_tsc_generation;
 
struct kvm_xen_hvm_config xen_hvm_config;
 
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1006,10 +1006,10 @@ static void kvm_set_tsc_khz(struct kvm_v
 
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
-   u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.last_tsc_nsec,
+   u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
  vcpu->arch.virtual_tsc_mult,
  vcpu->arch.virtual_tsc_shift);
-   tsc += vcpu->arch.last_tsc_write;
+   tsc += vcpu->arch.this_tsc_write;
return tsc;
 }
 
@@ -1045,7 +1045,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
if (nsdiff < NSEC_PER_SEC &&
vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
if (!check_tsc_unstable()) {
-   offset = kvm->arch.last_tsc_offset;
+   offset = kvm->arch.cur_tsc_offset;
pr_debug("kvm: matched tsc offset for %llu\n", data);
} else {
u64 delta = nsec_to_cycles(vcpu, elapsed);
@@ -1053,20 +1053,45 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
+   } else {
+   /*
+* We split periods of matched TSC writes into generations.
+* For each generation, we track the original measured
+* nanosecond time, offset, and write, so if TSCs are in
+* sync, we can match exact offset, and if not, we can match
+* exact software computaion in compute_guest_tsc()
+*
+* These values are tracked in kvm->arch.cur_xxx variables.
+*/
+   kvm->arch.cur_tsc_generation++;
+   kvm->arch.cur_tsc_nsec = ns;
+   kvm->arch.cur_tsc_write = data;
+   kvm->arch.cur_tsc_offset = offset;
+   pr_debug("kvm: new tsc generation %u, clock %llu\n",
+kvm->arch.cur_tsc_generation, data);
}
+
+   /*
+* We also track th most recent recorded KHZ, write and time to
+* allow the matching interval to be extended at each write.
+*/
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;
-   kvm->arch.last_tsc_offset = offset;
kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
-   kvm_x86_ops->write_tsc_offset(vcpu, offset);
-   raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
/* Reset of TSC must disable overshoot protection below */
vcpu->arch.hv_clock.tsc_timestamp = 0;
-   vcpu->arch.last_tsc_write = data;
-   vcpu->arch.last_tsc_nsec = ns;
vcpu->arch.last_guest_tsc = data;
+
+   /* Keep track of which generation this VCPU has synchronized to */
+   vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
+   vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
+   vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
+
+   kvm_x86_ops->write_tsc_offset(vcpu, offset);
+   raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 }
+
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)




[patch 6/8] Allow adjust_tsc_offset to be in host or guest cycles

2012-02-03 Thread Marcelo Tosatti
Redefine the API to take a parameter indicating whether an
adjustment is in host or guest cycles.

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -646,7 +646,7 @@ struct kvm_x86_ops {
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
int (*get_lpage_level)(void);
bool (*rdtscp_supported)(void);
-   void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment);
+   void (*adjust_tsc_offset)(struct kvm_vcpu *vcpu, s64 adjustment, bool host);
 
void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
 
@@ -676,6 +676,17 @@ struct kvm_arch_async_pf {
 
 extern struct kvm_x86_ops *kvm_x86_ops;
 
+static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
+  s64 adjustment)
+{
+   kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, false);
+}
+
+static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
+{
+   kvm_x86_ops->adjust_tsc_offset(vcpu, adjustment, true);
+}
+
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);
 
Index: kvm/arch/x86/kvm/svm.c
===
--- kvm.orig/arch/x86/kvm/svm.c
+++ kvm/arch/x86/kvm/svm.c
@@ -1016,10 +1016,14 @@ static void svm_write_tsc_offset(struct 
mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
-static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
+   WARN_ON(adjustment < 0);
+   if (host)
+   adjustment = svm_scale_tsc(vcpu, adjustment);
+
svm->vmcb->control.tsc_offset += adjustment;
if (is_guest_mode(vcpu))
svm->nested.hsave->control.tsc_offset += adjustment;
Index: kvm/arch/x86/kvm/vmx.c
===
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -1856,7 +1856,7 @@ static void vmx_write_tsc_offset(struct 
}
 }
 
-static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
+static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment, bool host)
 {
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1102,7 +1102,7 @@ static int kvm_guest_time_update(struct 
if (vcpu->tsc_catchup) {
u64 tsc = compute_guest_tsc(v, kernel_ns);
if (tsc > tsc_timestamp) {
-   kvm_x86_ops->adjust_tsc_offset(v, tsc - tsc_timestamp);
+   adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
tsc_timestamp = tsc;
}
}




[patch 3/8] Leave TSC synchronization window open with each new sync

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden 

Currently, when the TSC is written by the guest, the variable
ns is updated to force the current write to appear to have taken
place at the time of the first write in this sync phase.  This
leaves a cliff at the end of the match window where updates will
fall off the end.  There are two scenarios where this can be a
problem in practice - first, on a system with a large number of
VCPUs, the sync period may last for an extended period of time.

The second way this can happen is if the VM reboots very rapidly
and we catch a VCPU TSC synchronization just around the edge.
We may be unaware of the reboot, and thus the first VCPU might
synchronize with an old set of the timer (at, say 0.97 seconds
ago, when first powered on).  The second VCPU can come in 0.04
seconds later to try to synchronize, but it misses the window
because it is just over the threshold.

Instead, stop doing this artificial setback of the ns variable
and just update it with every write of the TSC.

It may be observed that doing so causes values computed by
compute_guest_tsc to diverge slightly across CPUs - note that
the last_tsc_ns and last_tsc_write variables are used here, and
now last_tsc_ns will be different for each VCPU, reflecting
the actual time of the update.

However, compute_guest_tsc is used only for guests which already
have TSC stability issues, and further, note that the previous
patch has caused last_tsc_write to be incremented by the difference
in nanoseconds, converted back into guest cycles.  As such, only
boundary rounding errors should be visible, which given the
resolution in nanoseconds, is going to only be a few cycles and
only visible in cross-CPU consistency tests.  The problem can be
fixed by adding a new set of variables to track the start offset
and start write value for the current sync cycle.

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -1053,7 +1053,6 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
offset = kvm_x86_ops->compute_tsc_offset(vcpu, data);
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
-   ns = kvm->arch.last_tsc_nsec;
}
kvm->arch.last_tsc_nsec = ns;
kvm->arch.last_tsc_write = data;




[patch 5/8] Add last_host_tsc tracking back to KVM

2012-02-03 Thread Marcelo Tosatti
From: Zachary Amsden 

The variable last_host_tsc was removed from upstream code.  I am adding
it back for two reasons.  First, it is unnecessary to use guest TSC
computation to conclude information about the host TSC.  The guest may
set the TSC backwards (this case is handled by the previous patch), but
the computation of guest TSC (and fetching an MSR) is significantly more
work and complexity than simply reading the hardware counter.  In addition,
we don't actually need the guest TSC for any part of the computation;
by always recomputing the offset, we can eliminate the need to deal with
the current offset and any scaling factors that may apply.

The second reason is that later on, we are going to be using the host
TSC value to restore TSC offsets after a host S4 suspend, so we need to
be reading the host values, not the guest values here.

Signed-off-by: Zachary Amsden 
Signed-off-by: Marcelo Tosatti 

Index: kvm/arch/x86/include/asm/kvm_host.h
===
--- kvm.orig/arch/x86/include/asm/kvm_host.h
+++ kvm/arch/x86/include/asm/kvm_host.h
@@ -422,6 +422,7 @@ struct kvm_vcpu_arch {
u64 last_kernel_ns;
u64 last_tsc_nsec;
u64 last_tsc_write;
+   u64 last_host_tsc;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
Index: kvm/arch/x86/kvm/x86.c
===
--- kvm.orig/arch/x86/kvm/x86.c
+++ kvm/arch/x86/kvm/x86.c
@@ -2239,13 +2239,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 
kvm_x86_ops->vcpu_load(vcpu, cpu);
if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
-   /* Make sure TSC doesn't go backwards */
-   s64 tsc_delta;
-   u64 tsc;
-
-   tsc = kvm_x86_ops->read_l1_tsc(vcpu);
-   tsc_delta = tsc - vcpu->arch.last_guest_tsc;
-
+   s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
+   native_read_tsc() - vcpu->arch.last_host_tsc;
if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable()) {
@@ -2268,7 +2263,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
 {
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
-   vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
+   vcpu->arch.last_host_tsc = native_read_tsc();
 }
 
 static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,




Re: [RFC] Next gen kvm api

2012-02-03 Thread Eric Northup
On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
[...]
>
> Moving to syscalls avoids these problems, but introduces new ones:
>
> - adding new syscalls is generally frowned upon, and kvm will need several
> - syscalls into modules are harder and rarer than into core kernel code
> - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> mm_struct
- Lost a good place to put access control (permissions on /dev/kvm)
for which user-mode processes can use KVM.

How would the ability to use sys_kvm_* be regulated?


[PATCH v3 0/4] Fix task switches into/out of VM86

2012-02-03 Thread Kevin Wolf
Kevin Wolf (4):
  KVM: x86 emulator: Fix task switch privilege checks
  KVM: x86 emulator: VM86 segments must have DPL 3
  KVM: SVM: Fix CPL updates
  KVM: x86 emulator: Allow PM/VM86 switch during task switch

 arch/x86/include/asm/kvm_emulate.h |3 +-
 arch/x86/include/asm/kvm_host.h|4 +-
 arch/x86/kvm/emulate.c |   75 ---
 arch/x86/kvm/svm.c |   25 ++--
 arch/x86/kvm/vmx.c |8 ++-
 arch/x86/kvm/x86.c |   12 -
 6 files changed, 107 insertions(+), 20 deletions(-)

-- 
1.7.6.5



[PATCH v3 1/4] KVM: x86 emulator: Fix task switch privilege checks

2012-02-03 Thread Kevin Wolf
Currently, all task switches check privileges against the DPL of the
TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
the DPL of this task gate is used for the check instead. Exceptions,
external interrupts and iret shouldn't perform any check.

Signed-off-by: Kevin Wolf 
---
 arch/x86/include/asm/kvm_emulate.h |2 +-
 arch/x86/include/asm/kvm_host.h|4 +-
 arch/x86/kvm/emulate.c |   53 +++-
 arch/x86/kvm/svm.c |5 +++-
 arch/x86/kvm/vmx.c |8 +++--
 arch/x86/kvm/x86.c |6 ++--
 6 files changed, 61 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index ab4092e..c8a9cf3 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -372,7 +372,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_INTERCEPTED 2
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
-u16 tss_selector, int reason,
+u16 tss_selector, int idt_index, int reason,
 bool has_error_code, u32 error_code);
 int emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq);
 #endif /* _ASM_X86_KVM_X86_EMULATE_H */
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 52d6640..0533fc4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -741,8 +741,8 @@ int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu);
 void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
 int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
 
-int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int reason,
-   bool has_error_code, u32 error_code);
+int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
+   int reason, bool has_error_code, u32 error_code);
 
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 05a562b..7097ca9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1151,6 +1151,22 @@ static int pio_in_emulated(struct x86_emulate_ctxt *ctxt,
return 1;
 }
 
+static int read_interrupt_descriptor(struct x86_emulate_ctxt *ctxt,
+u16 index, struct kvm_desc_struct *desc)
+{
+   struct kvm_desc_ptr dt;
+   ulong addr;
+
+   ctxt->ops->get_idt(ctxt, &dt);
+
+   if (dt.size < index * 8 + 7)
+   return emulate_gp(ctxt, index << 3 | 0x2);
+
+   addr = dt.address + index * 8;
+   return ctxt->ops->read_std(ctxt, addr, desc, sizeof *desc,
+  &ctxt->exception);
+}
+
 static void get_descriptor_table_ptr(struct x86_emulate_ctxt *ctxt,
 u16 selector, struct desc_ptr *dt)
 {
@@ -2350,7 +2366,7 @@ static int task_switch_32(struct x86_emulate_ctxt *ctxt,
 }
 
 static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt,
-  u16 tss_selector, int reason,
+  u16 tss_selector, int idt_index, int reason,
   bool has_error_code, u32 error_code)
 {
struct x86_emulate_ops *ops = ctxt->ops;
@@ -2372,12 +2388,35 @@ static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt,
 
/* FIXME: check that next_tss_desc is tss */
 
-   if (reason != TASK_SWITCH_IRET) {
-   if ((tss_selector & 3) > next_tss_desc.dpl ||
-   ops->cpl(ctxt) > next_tss_desc.dpl)
-   return emulate_gp(ctxt, 0);
+   /*
+* Check privileges. The three cases are task switch caused by...
+*
+* 1. jmp/call/int to task gate: Check against DPL of the task gate
+* 2. Exception/IRQ/iret: No check is performed
+* 3. jmp/call to TSS: Check agains DPL of the TSS
+*/
+   if (reason == TASK_SWITCH_GATE) {
+   if (idt_index != -1) {
+   /* Software interrupts */
+   struct kvm_desc_struct task_gate_desc;
+   int dpl;
+
+   ret = read_interrupt_descriptor(ctxt, idt_index,
+   &task_gate_desc);
+   if (ret != X86EMUL_CONTINUE)
+   return ret;
+
+   dpl = task_gate_desc.dpl;
+   if ((tss_selector & 3) > dpl || ops->cpl(ctxt) > dpl)
+   return emulate_gp(ctxt, (idt_index << 3) | 0x2);
+   }
+   } else if (reason != TASK_SWITCH_IRET) {
+   int dpl = next_tss_desc.dpl;
+

[PATCH v3 2/4] KVM: x86 emulator: VM86 segments must have DPL 3

2012-02-03 Thread Kevin Wolf
Setting the segment DPL to 0 for at least the VM86 code segment makes
the VM entry fail on VMX.

Signed-off-by: Kevin Wolf 
---
 arch/x86/kvm/emulate.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7097ca9..144a203 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1243,6 +1243,8 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
seg_desc.type = 3;
seg_desc.p = 1;
seg_desc.s = 1;
+   if (ctxt->mode == X86EMUL_MODE_VM86)
+   seg_desc.dpl = 3;
goto load;
}
 
-- 
1.7.6.5



[PATCH v3 3/4] KVM: SVM: Fix CPL updates

2012-02-03 Thread Kevin Wolf
Keep CPL at 0 in real mode and at 3 in VM86. In protected/long mode, use
RPL rather than DPL of the code segment.

Signed-off-by: Kevin Wolf 
---
 arch/x86/kvm/svm.c |   19 ---
 1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6a977c1..4124a7e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1263,6 +1263,21 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
 }
 
+static void svm_update_cpl(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_svm *svm = to_svm(vcpu);
+   int cpl;
+
+   if (!is_protmode(vcpu))
+   cpl = 0;
+   else if (svm->vmcb->save.rflags & X86_EFLAGS_VM)
+   cpl = 3;
+   else
+   cpl = svm->vmcb->save.cs.selector & 0x3;
+
+   svm->vmcb->save.cpl = cpl;
+}
+
 static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
 {
return to_svm(vcpu)->vmcb->save.rflags;
@@ -1538,9 +1553,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
}
if (seg == VCPU_SREG_CS)
-   svm->vmcb->save.cpl
-   = (svm->vmcb->save.cs.attrib
-  >> SVM_SELECTOR_DPL_SHIFT) & 3;
+   svm_update_cpl(vcpu);
 
mark_dirty(svm->vmcb, VMCB_SEG);
 }
-- 
1.7.6.5



[PATCH v3 4/4] KVM: x86 emulator: Allow PM/VM86 switch during task switch

2012-02-03 Thread Kevin Wolf
Task switches can switch between Protected Mode and VM86. The current
mode must be updated during the task switch emulation so that the new
segment selectors are interpreted correctly.

In order to let privilege checks succeed, rflags needs to be updated in
the vcpu struct as this causes a CPL update.

Signed-off-by: Kevin Wolf 
---
 arch/x86/include/asm/kvm_emulate.h |1 +
 arch/x86/kvm/emulate.c |   20 
 arch/x86/kvm/svm.c |1 +
 arch/x86/kvm/x86.c |6 ++
 4 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h
index c8a9cf3..4a21c7d 100644
--- a/arch/x86/include/asm/kvm_emulate.h
+++ b/arch/x86/include/asm/kvm_emulate.h
@@ -176,6 +176,7 @@ struct x86_emulate_ops {
void (*set_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
ulong (*get_cr)(struct x86_emulate_ctxt *ctxt, int cr);
int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val);
+   void (*set_rflags)(struct x86_emulate_ctxt *ctxt, ulong val);
int (*cpl)(struct x86_emulate_ctxt *ctxt);
int (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong *dest);
int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 144a203..a9fc21d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2273,6 +2273,8 @@ static int load_state_from_tss32(struct x86_emulate_ctxt *ctxt,
return emulate_gp(ctxt, 0);
ctxt->_eip = tss->eip;
ctxt->eflags = tss->eflags | 2;
+
+   /* General purpose registers */
ctxt->regs[VCPU_REGS_RAX] = tss->eax;
ctxt->regs[VCPU_REGS_RCX] = tss->ecx;
ctxt->regs[VCPU_REGS_RDX] = tss->edx;
@@ -2295,6 +2297,24 @@ static int load_state_from_tss32(struct x86_emulate_ctxt *ctxt,
set_segment_selector(ctxt, tss->gs, VCPU_SREG_GS);
 
/*
+* If we're switching between Protected Mode and VM86, we need to make
+* sure to update the mode before loading the segment descriptors so
+* that the selectors are interpreted correctly.
+*
+* Need to get rflags to the vcpu struct immediately because it
+* influences the CPL which is checked at least when loading the segment
+* descriptors and when pushing an error code to the new kernel stack.
+*
+* TODO Introduce a separate ctxt->ops->set_cpl callback
+*/
+   if (ctxt->eflags & X86_EFLAGS_VM)
+   ctxt->mode = X86EMUL_MODE_VM86;
+   else
+   ctxt->mode = X86EMUL_MODE_PROT32;
+
+   ctxt->ops->set_rflags(ctxt, ctxt->eflags);
+
+   /*
 * Now load segment descriptors. If fault happenes at this stage
 * it is handled in a context of new task
 */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 4124a7e..38d5ef3 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1286,6 +1286,7 @@ static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
 static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 {
to_svm(vcpu)->vmcb->save.rflags = rflags;
+   svm_update_cpl(vcpu);
 }
 
 static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dc3e945..502b5c3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4040,6 +4040,11 @@ static int emulator_set_cr(struct x86_emulate_ctxt *ctxt, int cr, ulong val)
return res;
 }
 
+static void emulator_set_rflags(struct x86_emulate_ctxt *ctxt, ulong val)
+{
+   kvm_set_rflags(emul_to_vcpu(ctxt), val);
+}
+
 static int emulator_get_cpl(struct x86_emulate_ctxt *ctxt)
 {
return kvm_x86_ops->get_cpl(emul_to_vcpu(ctxt));
@@ -4199,6 +4204,7 @@ static struct x86_emulate_ops emulate_ops = {
.set_idt = emulator_set_idt,
.get_cr  = emulator_get_cr,
.set_cr  = emulator_set_cr,
+   .set_rflags  = emulator_set_rflags,
.cpl = emulator_get_cpl,
.get_dr  = emulator_get_dr,
.set_dr  = emulator_set_dr,
-- 
1.7.6.5



[PATCH] kvm tools: Fix test for mmap failure

2012-02-03 Thread Cyrill Gorcunov
On error mmap returns MAP_FAILED so we
need a proper test here.
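
(For reference, a minimal standalone demonstration of the idiom -- not the
kvm tool code itself: mmap() reports failure by returning MAP_FAILED, i.e.
(void *)-1, never NULL, so a NULL check silently misses the error.)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* deliberately invalid: MAP_SHARED with fd -1 and no MAP_ANONYMOUS */
        void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, -1, 0);

        if (mem == MAP_FAILED) {        /* correct test; "mem == NULL" is wrong */
                perror("mmap");
                return 1;
        }
        munmap(mem, 4096);
        return 0;
}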

Signed-off-by: Cyrill Gorcunov 
---
 tools/kvm/hw/pci-shmem.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.git/tools/kvm/hw/pci-shmem.c
===
--- linux-2.6.git.orig/tools/kvm/hw/pci-shmem.c
+++ linux-2.6.git/tools/kvm/hw/pci-shmem.c
@@ -209,7 +209,7 @@ static void *setup_shmem(const char *key
   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
close(fd);
 
-   if (mem == NULL)
+   if (mem == MAP_FAILED)
pr_warning("Failed to mmap shared memory file");
 
return mem;
@@ -259,8 +259,9 @@ int pci_shmem__init(struct kvm *kvm)
/* Open shared memory and plug it into the guest */
mem = setup_shmem(shmem_region->handle, shmem_region->size,
shmem_region->create);
-   if (mem == NULL)
+   if (mem == MAP_FAILED)
return 0;
+
kvm__register_mem(kvm, shmem_region->phys_addr, shmem_region->size,
  mem);
return 1;


Re: [PATCH] kvm tools: Fix test for mmap failure

2012-02-03 Thread Cyrill Gorcunov
On Fri, Feb 03, 2012 at 11:15:41PM +0400, Cyrill Gorcunov wrote:
> On error mmap returns MAP_FAILED so we
> need a proper test here.
>

Pekka, pick this one instead -- a caller is expecting null/not-null
only.

Cyrill
---
kvm tools: Fix test for mmap failure

On error mmap returns MAP_FAILED so we
need a proper test here.

Signed-off-by: Cyrill Gorcunov 
---
 tools/kvm/hw/pci-shmem.c |7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6.git/tools/kvm/hw/pci-shmem.c
===
--- linux-2.6.git.orig/tools/kvm/hw/pci-shmem.c
+++ linux-2.6.git/tools/kvm/hw/pci-shmem.c
@@ -207,10 +207,11 @@ static void *setup_shmem(const char *key
}
mem = mmap(NULL, len,
   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
-   close(fd);
-
-   if (mem == NULL)
+   if (mem == MAP_FAILED) {
pr_warning("Failed to mmap shared memory file");
+   mem = NULL;
+   }
+   close(fd);
 
return mem;
 }


[PATCH] kvm tool: Make kvm structure to carry name copy

2012-02-03 Thread Cyrill Gorcunov
If a guest name is used (which is the default case), the kvm structure
might end up carrying a pointer to a name that is allocated on the stack:

kvm_cmd_run_init
  (on stack) default_name
  kvm__init(..., default_name)
    kvm->name = default_name

So I think it is better to let kvm carry its own copy of the guest name.
64 characters should be more than enough.

Signed-off-by: Cyrill Gorcunov 
---

I hope I didn't miss anything?

 tools/kvm/kvm.c  |2 +-
 tools/kvm/powerpc/include/kvm/kvm-arch.h |2 +-
 tools/kvm/x86/include/kvm/kvm-arch.h |2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6.git/tools/kvm/kvm.c
===
--- linux-2.6.git.orig/tools/kvm/kvm.c
+++ linux-2.6.git/tools/kvm/kvm.c
@@ -384,7 +384,7 @@ struct kvm *kvm__init(const char *kvm_de
 
kvm__arch_init(kvm, hugetlbfs_path, ram_size);
 
-   kvm->name = name;
+   strncpy(kvm->name, name, sizeof(kvm->name));
 
kvm_ipc__start(kvm__create_socket(kvm));
kvm_ipc__register_handler(KVM_IPC_PID, kvm__pid);
Index: linux-2.6.git/tools/kvm/powerpc/include/kvm/kvm-arch.h
===
--- linux-2.6.git.orig/tools/kvm/powerpc/include/kvm/kvm-arch.h
+++ linux-2.6.git/tools/kvm/powerpc/include/kvm/kvm-arch.h
@@ -64,7 +64,7 @@ struct kvm {
unsigned long   fdt_gra;
unsigned long   initrd_gra;
unsigned long   initrd_size;
-   const char  *name;
+   charname[64];
int vm_state;
 };
 
Index: linux-2.6.git/tools/kvm/x86/include/kvm/kvm-arch.h
===
--- linux-2.6.git.orig/tools/kvm/x86/include/kvm/kvm-arch.h
+++ linux-2.6.git/tools/kvm/x86/include/kvm/kvm-arch.h
@@ -48,7 +48,7 @@ struct kvm {
struct disk_image   **disks;
int nr_disks;
 
-   const char  *name;
+   charname[64];
 
int vm_state;
 };
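
To illustrate the lifetime problem the changelog describes, here is a minimal
standalone sketch (hypothetical names, not kvm tool code): keeping a pointer to
a caller's stack buffer leaves the structure dangling once that caller returns,
while copying the string is safe. The sketch also NUL-terminates explicitly,
since strncpy() leaves the destination unterminated when the source does not fit.

#include <stdio.h>
#include <string.h>

struct vm {
        char name[64];          /* own copy, as in this patch */
        const char *name_ptr;   /* the old, dangerous variant */
};

static void vm_init(struct vm *vm, const char *name)
{
        vm->name_ptr = name;    /* dangles if 'name' lives on the caller's stack */

        strncpy(vm->name, name, sizeof(vm->name));
        vm->name[sizeof(vm->name) - 1] = '\0';  /* strncpy may not NUL-terminate */
}

static void run_guest(struct vm *vm)
{
        char default_name[20];

        snprintf(default_name, sizeof(default_name), "guest-%d", 42);
        vm_init(vm, default_name);
}       /* default_name dies here; vm->name_ptr now dangles, vm->name stays valid */

int main(void)
{
        struct vm vm;

        run_guest(&vm);
        printf("copied name: %s\n", vm.name);   /* safe */
        /* printf("%s\n", vm.name_ptr); would read a dead stack buffer */
        return 0;
}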


Re: KVM: x86: increase recommended max vcpus to 160

2012-02-03 Thread shak

On 02/03/2012 10:55 AM, Marcelo Tosatti wrote:
> On Fri, Feb 03, 2012 at 10:28:57AM -0500, Sasha Levin wrote:
> > On Fri, 2012-02-03 at 12:28 -0200, Marcelo Tosatti wrote:
> > > Increase recommended max vcpus from 64 to 160 (tested internally
> > > at Red Hat).
> > >
> > > Signed-off-by: Marcelo Tosatti
> >
> > Awesome!
> >
> > Would it be possible to share the benchmarks?
>
> Not sure. Shak?

We have been using Linpack NxN but always hit HW barriers even on bare
metal:


http://software.intel.com/en-us/articles/running-the-hpl-benchmark-over-intel-mpi/?wapkw=(LINPACK+MPI)
http://software.intel.com/en-us/articles/intel-math-kernel-library-linpack-download

Shak




Re: [PATCH 1/2] virtio-pci: add setup_vqs flag in vp_try_to_find_vqs

2012-02-03 Thread Rusty Russell
On Wed, 1 Feb 2012 11:54:09 +0200, "Michael S. Tsirkin"  wrote:
> On Thu, Jan 12, 2012 at 09:20:06AM +0800, zanghongy...@huawei.com wrote:
> > From: Hongyong Zang 
> > 
> > changes in vp_try_to_find_vqs:
> > Virtio-serial's probe() calls it to request irqs and setup vqs of port0 and
> > controls; add_port() calls it to set up vqs of io_port.
> > it will not create virtqueue if the name is null.
> > 
> > Signed-off-by: Hongyong Zang 
> 
> This looks like a fragile hack, to me.
> 
> virtio spec also implied that VQ initialization is done
> before status is set to OK (during probe) and
> devices might have relied on that. So if we want to change
> that, I think we need a feature bit.

If virtio_serial doesn't use those virtqueues, I think it's OK.

I think I'd rather have vp_find_vqs take an offset, however.

> Besides, what about documentation? non pci transports?

I think this is a device-specific issue, but re-calling find_vqs will
have to be audited, which adding an offset arg should lead to anyway.

Thanks,
Rusty.


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-03 Thread Anthony Liguori

On 02/03/2012 12:07 PM, Eric Northup wrote:
> On Thu, Feb 2, 2012 at 8:09 AM, Avi Kivity  wrote:
> [...]
> > Moving to syscalls avoids these problems, but introduces new ones:
> >
> > - adding new syscalls is generally frowned upon, and kvm will need several
> > - syscalls into modules are harder and rarer than into core kernel code
> > - will need to add a vcpu pointer to task_struct, and a kvm pointer to
> > mm_struct
>
> - Lost a good place to put access control (permissions on /dev/kvm)
> for which user-mode processes can use KVM.
>
> How would the ability to use sys_kvm_* be regulated?

Why should it be regulated?

It's not a finite or privileged resource.

Regards,

Anthony Liguori







Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-03 Thread Takuya Yoshikawa
Hope to get comments from live migration developers,

Anthony Liguori  wrote:

> > Guest memory management
> > ---
> > Instead of managing each memory slot individually, a single API will be
> > provided that replaces the entire guest physical memory map atomically.
> > This matches the implementation (using RCU) and plugs holes in the
> > current API, where you lose the dirty log in the window between the last
> > call to KVM_GET_DIRTY_LOG and the call to KVM_SET_USER_MEMORY_REGION
> > that removes the slot.
> >
> > Slot-based dirty logging will be replaced by range-based and work-based
> > dirty logging; that is "what pages are dirty in this range, which may be
> > smaller than a slot" and "don't return more than N pages".
> >
> > We may want to place the log in user memory instead of kernel memory, to
> > reduce pinned memory and increase flexibility.
> 
> Since we really only support 64-bit hosts, what about just pointing the
> kernel at an address/size pair and rely on userspace to mmap() the range
> appropriately?
> 

Seems reasonable but the real problem is not how to set up the memory:
the problem is how to set a bit in user-space.

We need two things:
- introducing set_bit_user()
- changing mmu_lock from spin_lock to mutex_lock
  (mark_page_dirty() can be called with mmu_lock held)

The former is straightforward and I sent a patch last year.
The latter needs a fundamental change:  I heard (from Avi) that we can
change mmu_lock to mutex_lock if mmu_notifier becomes preemptible.
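
A rough idea of what such a helper could look like is sketched below (purely
illustrative, not the actual patch). Note that the user access may fault and
sleep, which is exactly why mmu_lock would have to become a mutex:

#include <linux/errno.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/*
 * Sketch only: a non-atomic read-modify-write of one bit in a
 * user-space bitmap.  get_user()/put_user() may fault and sleep,
 * so this cannot run under a spinlock such as today's mmu_lock.
 */
static inline int set_bit_user(long nr, void __user *addr)
{
        u8 __user *byte = (u8 __user *)addr + nr / 8;
        u8 val;

        if (get_user(val, byte))
                return -EFAULT;
        val |= 1u << (nr % 8);
        if (put_user(val, byte))
                return -EFAULT;
        return 0;
}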

So I was planning to restart this work when Peter's
"mm: Preemptibility"
http://lkml.org/lkml/2011/4/1/141
gets finished.

But even if we cannot achieve "without pinned memory", we may also want
to let user-space know how many pages are getting dirty.

For example, think about the last step of live migration.  We stop the
guest and send the remaining pages.  For this we do not need to
write-protect them any more; we just want to know which ones are dirty.

If user-space can read the bitmap, it does not need to do GET_DIRTY_LOG
because the guest is already stopped, so we can reduce the downtime.

Is this correct?


So I think we can do this in two steps:
1. just move the bitmap to user-space (and pin it)
2. un-pin it when the time comes

I can start 1 after "srcu-less dirty logging" gets finished.
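
To make the final-stage idea concrete, here is a rough user-space-side sketch
(the interface is hypothetical; it only assumes the dirty log lives in a buffer
the VMM can read directly):

#include <stdio.h>

/*
 * Sketch: the last phase of live migration with the guest stopped.
 * The dirty log is a bitmap, one bit per guest page, readable in
 * place -- no GET_DIRTY_LOG ioctl round-trip needed.
 */
static void send_remaining_pages(const unsigned long *bitmap,
                                 unsigned long npages)
{
        const unsigned int bits = 8 * sizeof(unsigned long);
        unsigned long i;

        for (i = 0; i < npages; i++)
                if (bitmap[i / bits] & (1UL << (i % bits)))
                        printf("page %lu is dirty, transfer it\n", i);
}

int main(void)
{
        unsigned long bitmap[2] = { 0x5, 0x1 }; /* pages 0, 2 and 64 dirty */

        send_remaining_pages(bitmap, 128);
        return 0;
}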


Takuya


Re: [PATCH] kvm tools: Fix test for mmap failure

2012-02-03 Thread Pekka Enberg

On Fri, Feb 03, 2012 at 11:15:41PM +0400, Cyrill Gorcunov wrote:
> On error mmap returns MAP_FAILED so we
> need a proper test here.

On Fri, 3 Feb 2012, Cyrill Gorcunov wrote:
> Pekka, pick this one instead -- a caller is expecting null/not-null
> only.

Applied, thanks!