Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern


On 04/23/2010 04:21 PM, BRUNO CESAR RIBAS wrote:
> 
> Could you try hpet? I had a similar problem with multicore and multi-CPU (per
> motherboard) hosts [even with constant_tsc].
> 
> Since I changed the guest to hpet, I have had no more problems.

It's stable in the sense that there have been no lockups yet, but hpet is a
much slower time source from a gettimeofday perspective than tsc and
jiffies (based on its speed, jiffies appears to be tsc-based).

David

> 
>>
>> David
>>
>>>
>>> Zach
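
For illustration, a minimal sketch of the kind of gettimeofday rate
comparison described above: time a burst of back-to-back calls, then repeat
with the guest switched to each clocksource. The iteration count and output
format are arbitrary.

/* Rough clocksource speed check: time N back-to-back gettimeofday() calls,
 * then repeat with the guest on kvmclock, tsc, hpet or jiffies and compare. */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    const long iters = 1000000;        /* illustrative iteration count */
    struct timeval start, end, tmp;
    long i, usec;

    gettimeofday(&start, NULL);
    for (i = 0; i < iters; i++)
        gettimeofday(&tmp, NULL);
    gettimeofday(&end, NULL);

    usec = (end.tv_sec - start.tv_sec) * 1000000L +
           (end.tv_usec - start.tv_usec);
    printf("%ld calls in %ld us (%.2f us/call)\n",
           iters, usec, (double)usec / iters);
    return 0;
}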


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread BRUNO CESAR RIBAS
On Fri, Apr 23, 2010 at 03:42:49PM -0600, David S. Ahern wrote:
> 
> 
> On 04/23/2010 03:39 PM, Zachary Amsden wrote:
> > On 04/23/2010 10:39 AM, Brian Jackson wrote:
> >> On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
> >>   
> >>> After a few days of debugging I think kvmclock is the source of lockups
> >>> for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
> >>> locks up on another.
> >>>
> >>> Server 1 - VM locks up repeatedly
> >>> -- DL580 G5
> >>> -- 4 quad-core X7350 processors at 2.93GHz
> >>> -- 48GB RAM
> >>>
> >>> Server 2 - VM works just fine
> >>> -- DL380 G6
> >>> -- 2 quad-core E5540 processors at 2.53GHz
> >>> -- 24GB RAM
> >>>
> >>> Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
> >>> kernel. I have tried various versions of qemu-kvm -- the version in
> >>> FC-12 and the version for FC-12 in virt-preview. In both cases the
> >>> qemu-kvm command line is identical.
> >>>
> >>> VM
> >>> - RHEL5.5, PAE kernel (also tried standard 32-bit)
> >>> - 2 vcpus
> >>> - 3GB RAM
> >>> - virtio network and disk
> >>>
> >>> When the VM locks up both vcpu threads are spinning at 100%. Changing
> >>> the clocksource to jiffies appears to have addressed the problem.
> >>>  
> >>
> >> Does changing the guest to -smp 1 help?
> >>
> >>
> > 
> > Based on our current understanding of the problem, it should help, but
> > it may not prevent the problem entirely.
> > 
> > There are three issues with kvmclock due to sampling:
> > 
> > 1) smp clock alignment may be slightly off due to timing conditions
> > 2) kvmclock is resampled at each switch of vcpu to another pcpu
> > 3) kvmclock granularity exceeds that of kernel timespec, which means
> > sampling errors may show even on UP
> > 
> > Recommend using a different clocksource (tsc is great if you have stable
> > TSC and don't migrate across different-speed machines) until we have all
> > the fixes in place.
> 
> That's my plan for now. As I recall, jiffies was the default in early
> RHEL5 versions; I'm not sure what that implies hardware-wise.
> 
> The biggest problem for me is that RHEL5.5 defaults to kvmclock; I'll
> find some workaround for it.

Could you try hpet? I had a similar problem with multicore and multi-CPU (per
motherboard) hosts [even with constant_tsc].

Since I changed the guest to hpet, I have had no more problems.

> 
> David
> 
> > 
> > Zach

-- 
Bruno Ribas - ri...@c3sl.ufpr.br
http://www.inf.ufpr.br/ribas
C3SL: http://www.c3sl.ufpr.br


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern


On 04/23/2010 03:39 PM, Zachary Amsden wrote:
> On 04/23/2010 10:39 AM, Brian Jackson wrote:
>> On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
>>   
>>> After a few days of debugging I think kvmclock is the source of lockups
>>> for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
>>> locks up on another.
>>>
>>> Server 1 - VM locks up repeatedly
>>> -- DL580 G5
>>> -- 4 quad-core X7350 processors at 2.93GHz
>>> -- 48GB RAM
>>>
>>> Server 2 - VM works just fine
>>> -- DL380 G6
>>> -- 2 quad-core E5540 processors at 2.53GHz
>>> -- 24GB RAM
>>>
>>> Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
>>> kernel. I have tried various versions of qemu-kvm -- the version in
>>> FC-12 and the version for FC-12 in virt-preview. In both cases the
>>> qemu-kvm command line is identical.
>>>
>>> VM
>>> - RHEL5.5, PAE kernel (also tried standard 32-bit)
>>> - 2 vcpus
>>> - 3GB RAM
>>> - virtio network and disk
>>>
>>> When the VM locks up both vcpu threads are spinning at 100%. Changing
>>> the clocksource to jiffies appears to have addressed the problem.
>>>  
>>
>> Does changing the guest to -smp 1 help?
>>
>>
> 
> Based on our current understanding of the problem, it should help, but
> it may not prevent the problem entirely.
> 
> There are three issues with kvmclock due to sampling:
> 
> 1) smp clock alignment may be slightly off due to timing conditions
> 2) kvmclock is resampled at each switch of vcpu to another pcpu
> 3) kvmclock granularity exceeds that of kernel timespec, which means
> sampling errors may show even on UP
> 
> Recommend using a different clocksource (tsc is great if you have stable
> TSC and don't migrate across different-speed machines) until we have all
> the fixes in place.

That's my plan for now. As I recall, jiffies was the default in early
RHEL5 versions; I'm not sure what that implies hardware-wise.

The biggest problem for me is that RHEL5.5 defaults to kvmclock; I'll
find some workaround for it.

David

> 
> Zach
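
For illustration, the usual form of such a workaround is to pin the guest
clocksource on the kernel command line. The grub entry below is only a
sketch: the kernel version, root device, and whether the RHEL5.5 kernel
honours clocksource= are assumptions to verify on the guest in question.

# guest /boot/grub/grub.conf (illustrative entry)
title Red Hat Enterprise Linux Server (2.6.18-194.el5PAE)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.el5PAE ro root=/dev/VolGroup00/LogVol00 clocksource=jiffies
        initrd /initrd-2.6.18-194.el5PAE.img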


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Zachary Amsden

On 04/23/2010 11:35 AM, Jeremy Fitzhardinge wrote:
> On 04/23/2010 02:31 PM, Zachary Amsden wrote:
>> Does lfence / mfence actually serialize?  I thought there was some
>> great confusion about that not being the case on all AMD processors,
>> and possibly not at all on Intel.
>>
>> A trap, however is a great way to serialize.
>>
>> I think, there is no serializing instruction which can be used from
>> userspace which does not trap, at least, I don't know one off the top
>> of my head.
>
> rsm is not technically privileged... but not quite usable from usermode ;)

rsm under hardware virtualization makes my head hurt


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread Zachary Amsden

On 04/23/2010 10:39 AM, Brian Jackson wrote:
> On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
>> After a few days of debugging I think kvmclock is the source of lockups
>> for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
>> locks up on another.
>>
>> Server 1 - VM locks up repeatedly
>> -- DL580 G5
>> -- 4 quad-core X7350 processors at 2.93GHz
>> -- 48GB RAM
>>
>> Server 2 - VM works just fine
>> -- DL380 G6
>> -- 2 quad-core E5540 processors at 2.53GHz
>> -- 24GB RAM
>>
>> Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
>> kernel. I have tried various versions of qemu-kvm -- the version in
>> FC-12 and the version for FC-12 in virt-preview. In both cases the
>> qemu-kvm command line is identical.
>>
>> VM
>> - RHEL5.5, PAE kernel (also tried standard 32-bit)
>> - 2 vcpus
>> - 3GB RAM
>> - virtio network and disk
>>
>> When the VM locks up both vcpu threads are spinning at 100%. Changing
>> the clocksource to jiffies appears to have addressed the problem.
>
> Does changing the guest to -smp 1 help?

Based on our current understanding of the problem, it should help, but
it may not prevent the problem entirely.

There are three issues with kvmclock due to sampling:

1) smp clock alignment may be slightly off due to timing conditions
2) kvmclock is resampled at each switch of vcpu to another pcpu
3) kvmclock granularity exceeds that of kernel timespec, which means
sampling errors may show even on UP

Recommend using a different clocksource (tsc is great if you have stable
TSC and don't migrate across different-speed machines) until we have all
the fixes in place.


Zach
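
For reference, a sketch of how the guest clocksource can be inspected and
switched at runtime, assuming the guest kernel exposes the standard sysfs
clocksource interface (whether it does, and which sources are listed,
depends on the guest kernel):

# inside the guest
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# switch the running system to another source (not persistent across reboots)
echo jiffies > /sys/devices/system/clocksource/clocksource0/current_clocksource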


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Jeremy Fitzhardinge
On 04/23/2010 02:31 PM, Zachary Amsden wrote:
> Does lfence / mfence actually serialize?  I thought there was some
> great confusion about that not being the case on all AMD processors,
> and possibly not at all on Intel.
>
> A trap, however is a great way to serialize.
>
> I think, there is no serializing instruction which can be used from
> userspace which does not trap, at least, I don't know one off the top
> of my head.

rsm is not technically privileged... but not quite usable from usermode ;)

J


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Zachary Amsden

On 04/22/2010 11:34 PM, Avi Kivity wrote:
> On 04/23/2010 04:44 AM, Zachary Amsden wrote:
>> Or apply this patch.
>> time-warp.patch
>>
>> diff -rup a/time-warp-test.c b/time-warp-test.c
>> --- a/time-warp-test.c    2010-04-15 16:30:13.955981607 -1000
>> +++ b/time-warp-test.c    2010-04-15 16:35:37.777982377 -1000
>> @@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
>>   {
>>   DECLARE_ARGS(val, low, high);
>>
>> -asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
>> +asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) ::
>> "ebx", "ecx");
>
> Plus, replace cpuid by lfence/mfence.  cpuid will trap.

Does lfence / mfence actually serialize?  I thought there was some great
confusion about that not being the case on all AMD processors, and
possibly not at all on Intel.

A trap, however is a great way to serialize.

I think, there is no serializing instruction which can be used from
userspace which does not trap, at least, I don't know one off the top of
my head.


Zach


Re: RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread Brian Jackson
On Friday 23 April 2010 12:08:22 David S. Ahern wrote:
> After a few days of debugging I think kvmclock is the source of lockups
> for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
> locks up on another.
> 
> Server 1 - VM locks up repeatedly
> -- DL580 G5
> -- 4 quad-core X7350 processors at 2.93GHz
> -- 48GB RAM
> 
> Server 2 - VM works just fine
> -- DL380 G6
> -- 2 quad-core E5540 processors at 2.53GHz
> -- 24GB RAM
> 
> Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
> kernel. I have tried various versions of qemu-kvm -- the version in
> FC-12 and the version for FC-12 in virt-preview. In both cases the
> qemu-kvm command line is identical.
> 
> VM
> - RHEL5.5, PAE kernel (also tried standard 32-bit)
> - 2 vcpus
> - 3GB RAM
> - virtio network and disk
> 
> When the VM locks up both vcpu threads are spinning at 100%. Changing
> the clocksource to jiffies appears to have addressed the problem.


Does changing the guest to -smp 1 help?


> 
> David


questions about implementing a zero-copy solution to transfer data between VMs

2010-04-23 Thread JinShan Xiong
Hi all,

This is probably a stale question, because I've heard we already have
some patches of this kind.

I want to implement a zero-copy mechanism to transfer data between
VMs. From what I've investigated, it's quite easy on the sender side,
because virtio is good at sharing data pages from the VM with the HOST;
however, a problem arises on the receiver side: we have to invent a
mechanism to share host pages with the VM. There are two solutions I can
think of, but I'm not sure if they really work:
1. modify the shadow page table of the receiver in the HOST, and
notify the receiver, so that the receiver can access the data pages
directly;
2. implement a PCI character device in the receiver. Before the receiver
is notified, the HOST has to rearrange the data pages from the sender and
set the base address space of that PCI device. In this way, after the
receiver maps the PCI memory, it can access the data directly.

Folks, would you please help me confirm whether these schemes would work,
and which one is better?

Thanks,
Jay


Don't confine mouse on grab

2010-04-23 Thread Harald Braumann
Hi,

I'd like to propose a small change in the grab behaviour of the SDL
window:

Currently, when the mouse is moved over the window, the guest mouse is
moved as well and the guest receives keyboard input (using absolute
mouse here and focus-follow-mouse). However, the keyboard is not
grabbed, and so the guest doesn't receive all key combinations (very
bad when you try to close a window in the guest and the whole SDL
window is closed instead). You have to click or use a key combination
to grab. Now, when the grab is active, the pointer is also confined
inside the SDL window. You again have to use a key combination to get
out.

The better behaviour, I think, would be the following:
Whenever the SDL window gets the focus, it should grab the keyboard
without confining the pointer. Then you could use all the key
combinations in the guest and you could still change the focus by
moving the mouse out and to another window. VMware does it like this;
VirtualBox, I think, does as well, and so does the rdesktop client.

However, I've checked the SDL documentation and it seems it only
provides SDL_WM_GrabInput which grabs the keyboard *and* confines the
mouse. So maybe SDL would need to be extended for this to work. If
that is in fact the case, a work-around would be, to never do a
grab. My window manager can be configured to deliver (almost) all key
combinations to certain windows, so I could still get the desired
behaviour.

Cheers,
harry
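
For context, a minimal sketch of the grab-on-focus behaviour proposed
above, using the SDL 1.2 API. As noted, SDL_WM_GrabInput() grabs the
keyboard but also confines the pointer, which is exactly the limitation
being discussed, so this only illustrates the event handling, not a full
solution:

/* Sketch: grab input when the SDL window gains focus, release it on
 * focus loss.  SDL 1.2 API; SDL_WM_GrabInput() also confines the mouse. */
#include <SDL.h>

static void handle_focus_event(const SDL_Event *ev)
{
    if (ev->type == SDL_ACTIVEEVENT && (ev->active.state & SDL_APPINPUTFOCUS)) {
        if (ev->active.gain)
            SDL_WM_GrabInput(SDL_GRAB_ON);   /* focus in: grab keyboard (and mouse) */
        else
            SDL_WM_GrabInput(SDL_GRAB_OFF);  /* focus out: release the grab */
    }
}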


[PATCHv5] add mergeable receiver buffers support to vhost

2010-04-23 Thread David L Stevens
This patch adds mergeable receive buffers support to vhost.

Signed-off-by: David L Stevens 

diff -ruNp net-next-v0/drivers/vhost/net.c net-next-v5/drivers/vhost/net.c
--- net-next-v0/drivers/vhost/net.c 2010-04-22 11:31:57.0 -0700
+++ net-next-v5/drivers/vhost/net.c 2010-04-22 12:41:17.0 -0700
@@ -109,7 +109,7 @@ static void handle_tx(struct vhost_net *
};
size_t len, total_len = 0;
int err, wmem;
-   size_t hdr_size;
+   size_t vhost_hlen;
struct socket *sock = rcu_dereference(vq->private_data);
if (!sock)
return;
@@ -128,13 +128,13 @@ static void handle_tx(struct vhost_net *
 
if (wmem < sock->sk->sk_sndbuf / 2)
tx_poll_stop(net);
-   hdr_size = vq->hdr_size;
+   vhost_hlen = vq->vhost_hlen;
 
for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-NULL, NULL);
+   head = vhost_get_desc(&net->dev, vq, vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in,
+ NULL, NULL);
/* Nothing new?  Wait for eventfd to tell us they refilled. */
if (head == vq->num) {
wmem = atomic_read(&sock->sk->sk_wmem_alloc);
@@ -155,20 +155,20 @@ static void handle_tx(struct vhost_net *
break;
}
/* Skip header. TODO: support TSO. */
-   s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
+   s = move_iovec_hdr(vq->iov, vq->hdr, vhost_hlen, out);
msg.msg_iovlen = out;
len = iov_length(vq->iov, out);
/* Sanity check */
if (!len) {
vq_err(vq, "Unexpected header len for TX: "
   "%zd expected %zd\n",
-  iov_length(vq->hdr, s), hdr_size);
+  iov_length(vq->hdr, s), vhost_hlen);
break;
}
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(NULL, sock, &msg, len);
if (unlikely(err < 0)) {
-   vhost_discard_vq_desc(vq);
+   vhost_discard_desc(vq, 1);
tx_poll_start(net, sock);
break;
}
@@ -187,12 +187,25 @@ static void handle_tx(struct vhost_net *
unuse_mm(net->dev.mm);
 }
 
+static int vhost_head_len(struct vhost_virtqueue *vq, struct sock *sk)
+{
+   struct sk_buff *head;
+   int len = 0;
+
+   lock_sock(sk);
+   head = skb_peek(&sk->sk_receive_queue);
+   if (head)
+   len = head->len + vq->sock_hlen;
+   release_sock(sk);
+   return len;
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
 {
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
-   unsigned head, out, in, log, s;
+   unsigned in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
.msg_name = NULL,
@@ -203,14 +216,14 @@ static void handle_rx(struct vhost_net *
.msg_flags = MSG_DONTWAIT,
};
 
-   struct virtio_net_hdr hdr = {
-   .flags = 0,
-   .gso_type = VIRTIO_NET_HDR_GSO_NONE
+   struct virtio_net_hdr_mrg_rxbuf hdr = {
+   .hdr.flags = 0,
+   .hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
};
 
size_t len, total_len = 0;
-   int err;
-   size_t hdr_size;
+   int err, headcount, datalen;
+   size_t vhost_hlen;
struct socket *sock = rcu_dereference(vq->private_data);
if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
return;
@@ -218,18 +231,18 @@ static void handle_rx(struct vhost_net *
use_mm(net->dev.mm);
mutex_lock(&vq->mutex);
vhost_disable_notify(vq);
-   hdr_size = vq->hdr_size;
+   vhost_hlen = vq->vhost_hlen;
 
vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
 
-   for (;;) {
-   head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
-ARRAY_SIZE(vq->iov),
-&out, &in,
-vq_log, &log);
+   while ((datalen = vhost_head_len(vq, sock->sk))) {
+   headcount = vhost_get_desc_n(vq, vq->heads, datalen+vhost_hlen,
+&in, vq_log, &log);
+   if (headcount < 0)
+   break;
 

Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Avi Kivity

On 04/23/2010 10:22 PM, Jeremy Fitzhardinge wrote:
> On 04/23/2010 02:34 AM, Avi Kivity wrote:
>>> diff -rup a/time-warp-test.c b/time-warp-test.c
>>> --- a/time-warp-test.c    2010-04-15 16:30:13.955981607 -1000
>>> +++ b/time-warp-test.c    2010-04-15 16:35:37.777982377 -1000
>>> @@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
>>>    {
>>>    DECLARE_ARGS(val, low, high);
>>>
>>> -asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
>>> +asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) ::
>>> "ebx", "ecx");
>>
>> Plus, replace cpuid by lfence/mfence.  cpuid will trap.
>
> I presume the clobbers are needed for cpuid; if you use [lm]fence then
> they shouldn't be needed, right?

Right.  But I'm not sure lfence/mfence are sufficient - from what I
understand rdtsc can pass right through them.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.
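
For illustration, the two userspace TSC-read variants being compared look
roughly like this. Whether the lfence variant is architecturally guaranteed
to order rdtsc is exactly the open question in this thread, so treat both
as sketches:

#include <stdint.h>

/* Serialize with cpuid (traps under virtualization), then read the TSC. */
static inline uint64_t rdtsc_cpuid(void)
{
    uint32_t lo, hi;
    asm volatile("cpuid\n\trdtsc"
                 : "=a"(lo), "=d"(hi)
                 : "a"(0)
                 : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Fence-based variant discussed above: cheaper because it does not trap,
 * but its ordering guarantees for rdtsc are what is being questioned. */
static inline uint64_t rdtsc_lfence(void)
{
    uint32_t lo, hi;
    asm volatile("lfence\n\trdtsc"
                 : "=a"(lo), "=d"(hi)
                 :
                 : "memory");
    return ((uint64_t)hi << 32) | lo;
}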



Re: PXE Boot Timeout Issue...

2010-04-23 Thread Stefan Hajnoczi
For reference, my libvirt managed virbr0 has forwarding delay 0.  This
is the default:

http://libvirt.org/formatnetwork.html#elementsConnect

I know that my VM is a leaf node (it only has one NIC and isn't going
to create a loop in the network) and therefore it makes sense to
eliminate the forwarding delay entirely.

You might be interested in this link about STP delay on physical networks:

http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a00800b1500.shtml

Stefan
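
The zero forwarding delay Stefan refers to maps onto the bridge definition:
for a libvirt-managed network it is the delay attribute on the <bridge>
element, and for a hand-rolled bridge the equivalent is brctl setfd. Bridge
and network names below are illustrative:

# hand-managed bridge: drop the STP forwarding delay
brctl setfd br0 0
brctl showstp br0 | grep 'forward delay'

# libvirt-managed network: edit the definition (virsh net-edit default) so
# the bridge element reads e.g.  <bridge name='virbr0' stp='on' delay='0'/>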


Re: [PATCH 1/5] Add a global synchronization point for pvclock

2010-04-23 Thread Jeremy Fitzhardinge
On 04/23/2010 02:34 AM, Avi Kivity wrote:
>>
>> diff -rup a/time-warp-test.c b/time-warp-test.c
>> --- a/time-warp-test.c2010-04-15 16:30:13.955981607 -1000
>> +++ b/time-warp-test.c2010-04-15 16:35:37.777982377 -1000
>> @@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
>>   {
>>   DECLARE_ARGS(val, low, high);
>>
>> -asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
>> +asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) ::
>> "ebx", "ecx");
>>
>>
>
>
> Plus, replace cpuid by lfence/mfence.  cpuid will trap.

I presume the clobbers are needed for cpuid; if you use [lm]fence then
they shouldn't be needed, right?

J



Re: [PATCH 0/8] More fixes for nested svm

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 13:04, Avi Kivity wrote:

> On 04/22/2010 01:33 PM, Joerg Roedel wrote:
>> Hi Avi, Marcelo,
>> 
>> here is another set of nested svm fixes. They fix NMI code to make UP Hyper-V
>> root domain booting. The patches also add better handling for nested entry
>> failures and mce intercepts.
>> Also in this patchset are the fixes for the supported cpuid reporting for svm
>> features. These patches were taken from the nested-npt patchset and slightly
>> modified. These patches are also marked for -stable backporting.
>> The probably most important fix is about exception reinjection. This didn't
>> work reliably before and is fixed with the patch in this series now. This fix
>> also touches common x86 code but that should be ok because it could be reused
>> by nested-vmx later.
>> Please review and give comments (or apply ;-).
>>   
> 
> Looks good.  I'll wait a day or two for more reviews (esp. Alex).

No complaints from me.


Alex



Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:51, Joerg Roedel wrote:

> On Fri, Apr 23, 2010 at 04:42:52PM +0200, Alexander Graf wrote:
>> 
>> On 23.04.2010, at 16:31, Joerg Roedel wrote:
>> 
>>> On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
 
>>>> On 23.04.2010, at 16:22, Joerg Roedel wrote:
>>>> 
>>>>> No, nested_svm_nmi runs in atomic context where we can't emulate a
>>>>> vmexit. We set exit_required and emulate the vmexit later.
>>>> 
>>>> So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
>>>> state broken?
>>> 
>>> No, the rflags are changed in enable_nmi_window which isn't called when
>>> we run nested and the nested hypervisor intercepts nmi. So it only runs
>>> in the !nested case where it can't corrupt L2 state.
>> 
>> Last time I checked the code enable_nmi_window was the function
>> triggering the #vmexit,
> 
> Yes, thats the bug which this patch fixes :-)
> 
>> so it should run in that exact scenario. If what you say is true, where
>> do we #vmexit instead then?
> 
> After setting exit_required we run into svm.c:svm_vcpu_run. There the
> exit_required flag is checked and, if set, the function immediately
> returns without doing a vmrun. A few cycles later we run into
> svm.c:handle_exit() where at the beginning exit_required is checked, and
> if set the vmexit is emulated.

Oh well, I guess I have to trust you on this one :)


Alex
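
A compilable paraphrase of the flow Joerg describes; the stand-in types
below are not the real svm.c structures, only the shape of the
exit_required handling:

#include <stdbool.h>

/* Minimal stand-in types so the sketch is self-contained; the real code
 * lives in arch/x86/kvm/svm.c. */
struct nested_state { bool exit_required; };
struct vcpu_svm     { struct nested_state nested; };

static void nested_svm_vmexit(struct vcpu_svm *svm) { (void)svm; /* emulate #VMEXIT */ }

/* Entry path: if a nested #VMEXIT is pending, skip VMRUN for this iteration. */
static void svm_vcpu_run(struct vcpu_svm *svm)
{
    if (svm->nested.exit_required)
        return;
    /* ... normal guest entry (VMRUN) would happen here ... */
}

/* Exit path: emulate the pending #VMEXIT once out of atomic context. */
static int handle_exit(struct vcpu_svm *svm)
{
    if (svm->nested.exit_required) {
        nested_svm_vmexit(svm);
        svm->nested.exit_required = false;
        return 1;
    }
    /* ... normal exit handling ... */
    return 1;
}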



Re: Mount and unmount CD

2010-04-23 Thread David S. Ahern
oops. the previous patch rides on top of this one.

David


On 04/23/2010 12:18 PM, David S. Ahern wrote:
> I saw this with RHEL5.3. I ended up hacking qemu to re_open the CD every
> so often. See attached.
> 
> David
> 
> 
> On 04/23/2010 09:10 AM, Matt Burkhardt wrote:
>> I'm having a problem with a virtual machine running under RHEL 5.4
>> 64-bit.  I take out the CD / insert a new and the main machine sees the
>> new cd and makes it available.  However, the virtual machines still see
>> the old CD.  I've tried mounting the new CD, but it just keeps mounting
>> what it "thinks" is in there - the old one.
>>
>> Any ideas?
>>
>>
>> Matt Burkhardt
>> Impari Systems, Inc.
>>
>> m...@imparisystems.com
>> http://www.imparisystems.com 
>> http://www.linkedin.com/in/mlburkhardt 
>> http://www.twitter.com/matthewboh
>> 502 Fairview Avenue
>> Frederick, MD  21701
>> work (301) 682-7901
>> cell   (301) 802-3235
>>
>>
>>
--- qemu/block-raw-posix.c.orig 2010-01-06 21:46:31.0 -0700
+++ qemu/block-raw-posix.c  2010-01-06 21:54:22.0 -0700
@@ -107,20 +107,24 @@
 int fd_got_error;
 int fd_media_changed;
 #endif
 uint8_t* aligned_buf;
 } BDRVRawState;
 
 static int posix_aio_init(void);
 
 static int fd_open(BlockDriverState *bs);
 
+#if defined(__linux__)
+int cdrom_reopen(BlockDriverState *bs);
+#endif
+
 static int raw_open(BlockDriverState *bs, const char *filename, int flags)
 {
 BDRVRawState *s = bs->opaque;
 int fd, open_flags, ret;
 
 posix_aio_init();
 
 s->lseek_err_cnt = 0;
 
 open_flags = O_BINARY;
@@ -212,29 +216,32 @@
 if (ret == count)
 goto label__raw_read__success;
 
 DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64
   "] read failed %d : %d = %s\n",
   s->fd, bs->filename, offset, buf, count,
   bs->total_sectors, ret, errno, strerror(errno));
 
 /* Try harder for CDrom. */
 if (bs->type == BDRV_TYPE_CDROM) {
-lseek(s->fd, offset, SEEK_SET);
-ret = read(s->fd, buf, count);
-if (ret == count)
-goto label__raw_read__success;
-lseek(s->fd, offset, SEEK_SET);
-ret = read(s->fd, buf, count);
-if (ret == count)
+int i;
+for (i = 0; i < 2; ++i) {
+#if defined(__linux__)
+ret = cdrom_reopen(bs);
+if (ret < 0)
 goto label__raw_read__success;
-
+#endif
+lseek(s->fd, offset, SEEK_SET);
+ret = read(s->fd, buf, count);
+if (ret == count)
+goto label__raw_read__success;
+}
 DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64
   "] retry read failed %d : %d = %s\n",
   s->fd, bs->filename, offset, buf, count,
   bs->total_sectors, ret, errno, strerror(errno));
 }
 
 label__raw_read__success:
 
 return ret;
 }
@@ -1025,20 +1032,27 @@
 printf("Floppy opened\n");
 #endif
 }
 if (!last_media_present)
 s->fd_media_changed = 1;
 s->fd_open_time = qemu_get_clock(rt_clock);
 s->fd_got_error = 0;
 return 0;
 }
 
+int cdrom_reopen(BlockDriverState *bs)
+{
+/* mimics a 'change' monitor command - without the eject */
+bdrv_close(bs);
+return bdrv_open2(bs, bs->filename, 0, bs->drv);
+}
+
 static int raw_is_inserted(BlockDriverState *bs)
 {
 BDRVRawState *s = bs->opaque;
 int ret;
 
 switch(s->type) {
 case FTYPE_CD:
 ret = ioctl(s->fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
 if (ret == CDS_DISC_OK)
 return 1;
--- qemu/hw/ide.c.orig  2010-01-06 21:54:33.0 -0700
+++ qemu/hw/ide.c   2010-01-06 21:56:16.0 -0700
@@ -29,20 +29,24 @@
 #include "pcmcia.h"
 #include "block.h"
 #include "block_int.h"
 #include "qemu-timer.h"
 #include "sysemu.h"
 #include "ppc_mac.h"
 #include "sh.h"
 #include 
 #include 
 
+#if defined(__linux__)
+int cdrom_reopen(BlockDriverState *bs);
+#endif
+
 /* debug IDE devices */
 //#define DEBUG_IDE
 //#define DEBUG_IDE_ATAPI
 //#define DEBUG_AIO
 #define USE_DMA_CDROM
 
 /* Bits of HD_STATUS */
 #define ERR_STAT   0x01
 #define INDEX_STAT 0x02
 #define ECC_STAT   0x04/* Corrected error */
@@ -1363,20 +1367,25 @@
 /* ATAPI DMA support */
 
 /* XXX: handle read errors */
 static void ide_atapi_cmd_read_dma_cb(void *opaque, int ret)
 {
 BMDMAState *bm = opaque;
 IDEState *s = bm->ide_if;
 int data_offset, n;
 
 if (ret < 0) {
+#if defined(__linux__)
+/* on EIO failure try re-opening file */
+if (ret == -EIO)
+(void) cdrom_reopen(s->bs);
+#endif
 ide_atapi_io_error(s, ret);
 goto eot;
 }
 

Re: Mount and unmount CD

2010-04-23 Thread David S. Ahern
I saw this with RHEL5.3. I ended up hacking qemu to re_open the CD every
so often. See attached.

David


On 04/23/2010 09:10 AM, Matt Burkhardt wrote:
> I'm having a problem with a virtual machine running under RHEL 5.4
> 64-bit.  I take out the CD / insert a new one, and the main machine sees the
> new CD and makes it available.  However, the virtual machines still see
> the old CD.  I've tried mounting the new CD, but it just keeps mounting
> what it "thinks" is in there - the old one.
> 
> Any ideas?
> 
> 
> Matt Burkhardt
> Impari Systems, Inc.
> 
> m...@imparisystems.com
> http://www.imparisystems.com 
> http://www.linkedin.com/in/mlburkhardt 
> http://www.twitter.com/matthewboh
> 502 Fairview Avenue
> Frederick, MD  21701
> work (301) 682-7901
> cell   (301) 802-3235
> 
> 
> 
--- qemu/block-raw-posix.c.orig 2010-01-06 22:27:56.0 -0700
+++ qemu/block-raw-posix.c  2010-01-06 22:29:51.0 -0700
@@ -193,20 +193,40 @@
 static int raw_pread_aligned(BlockDriverState *bs, int64_t offset,
  uint8_t *buf, int count)
 {
 BDRVRawState *s = bs->opaque;
 int ret;
 
 ret = fd_open(bs);
 if (ret < 0)
 return ret;
 
+/* media changes are only detected at the host layer when
+ * somethin reopens the cdrom device. Without an event 
+ * notice, we need a heuristic. Try the following which mimics
+ * what is done for floppy drives. Here we reopen the cdrom
+ * after 3 seconds of elapsed time - this should be short
+ * enough to cover a user inserting a new disk and then accessing
+ * it via the CLI/GUI.
+ */
+if (bs->type == BDRV_TYPE_CDROM) {
+static int64_t last = 0;
+int64_t now = qemu_get_clock(rt_clock);
+if ((now - last) > 3000)
+ret = cdrom_reopen(bs);
+else
+   ret = 0;
+last = now;
+if (ret < 0)
+   return ret;
+}
+
 if (offset >= 0 && lseek(s->fd, offset, SEEK_SET) == (off_t)-1) {
 ++(s->lseek_err_cnt);
 if(s->lseek_err_cnt <= 10) {
 DEBUG_BLOCK_PRINT("raw_pread(%d:%s, %" PRId64 ", %p, %d) [%" PRId64
   "] lseek failed : %d = %s\n",
   s->fd, bs->filename, offset, buf, count,
   bs->total_sectors, errno, strerror(errno));
 }
 return -1;
 }
--- qemu/hw/ide.c.orig  2010-01-06 22:28:02.0 -0700
+++ qemu/hw/ide.c   2010-01-06 22:30:45.0 -0700
@@ -1456,20 +1456,28 @@
 s->cd_sector_size = sector_size;
 
 /* XXX: check if BUSY_STAT should be set */
 s->status = READY_STAT | SEEK_STAT | DRQ_STAT | BUSY_STAT;
 ide_dma_start(s, ide_atapi_cmd_read_dma_cb);
 }
 
 static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
int sector_size)
 {
+if (s->is_cdrom) {
+static int64_t last = 0;
+int64_t now = qemu_get_clock(rt_clock);
+if ((now - last) > 3000)
+(void) cdrom_reopen(s->bs);
+last = now;
+}
+
 #ifdef DEBUG_IDE_ATAPI
 printf("read %s: LBA=%d nb_sectors=%d\n", s->atapi_dma ? "dma" : "pio",
lba, nb_sectors);
 #endif
 if (s->atapi_dma) {
 ide_atapi_cmd_read_dma(s, lba, nb_sectors, sector_size);
 } else {
 ide_atapi_cmd_read_pio(s, lba, nb_sectors, sector_size);
 }
 }


Re: [PATCH v4] Shared memory uio_pci driver

2010-04-23 Thread Cam Macdonell
On Mon, Apr 12, 2010 at 2:57 PM, Avi Kivity  wrote:
>
> There is work now to bring msi to the generic pci 2.3 driver, perhaps we can
> use that instead.  From a quick look it looks fine.
>

I'd be interested to follow this development.  I can't find anything
on LKML, is it being discussed anywhere?

Thanks,
Cam


Re: PXE Boot Timeout Issue...

2010-04-23 Thread Stuart Sheldon
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Jim Paris wrote:
> Stuart Sheldon wrote:
>> Stefan Hajnoczi wrote:
>>> On Fri, Apr 23, 2010 at 1:45 AM, Stuart Sheldon  wrote:
>>>> Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
>>>> to netboot a guest, it appears that the pxe rom is timing out on dhcp
>>>> before the bridge has enough time to come up.
> 
>> 3) after about 15 seconds, tap device enters promiscuous mode on host,
>> and at same time, gPXE reports "No bootable device." on guest
> ...
>>  forward delay 15.00 bridge forward delay
>>   15.00
> 
> I've run into this issue as well and lowering the forwarding delay
> with "brctl setfd br0 5" fixes it.
> 
> -jim
> 

Thanks Jim! I think that has it...

Stu


- --
Hey nineteen Thats retha franklin, She dont remember The queen of soul,
Its hard times befallen The sole survivors, She thinks Im crazy,
But Im just growing old
   -- Steely Dan - "Hey 19 - Lyrics"
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iQIcBAEBCAAGBQJL0dkFAAoJEFKVLITDJSGSwJAQAIBL16jFmTavFWX0dU50syP+
VtK3EqQcv02bJdpAXdbfKuO1QDdjmucG2zEbNK4lMXKEwBsGcKtxDhUOVaNS07fK
OdfvsO+/wX9mC8JmqCah6r8Pwppd3F07nzP3HkPnswV/rgtDVrSWB2JInDy+d3R7
sd0DisPOeaHoXSQqJ8REGElVR2adzsdmRWUKpFmtu9tJCz48wsbkDLpcr5iMt7jA
sAPtTm1ENUEF3X3l6mlGB4kuYsLum93hPpgrxWmMvGO3uokVzRJYm3x2Yf5J++3v
yc8reMPPsQe1tI9r+ZtCKXE2XLqI/X4Pdap0IIjtid2245dBOMKys+aIXYZ1d7jL
YwngcgFDdse6B3Gt9/Tk5RhqqYQeYLsZckMOhy89DN2bftYxsX70xIWKKVxPEkNg
GniSN8pCg/Rx4ZOFkygwXRUayOx5Hi5nR6hetKyBqRM6mWYZvfvzRYH3O05JAOIz
3qKRObyuKHYFGkdWZ1LaTBoQoOdq+iWyuPK53kvpHFbdks6mZZg3YO70M6+ziqyT
zy9UvuumjzquuFwRu91R0qi02SZlNifvTQZ9aUpsPvRAhijroidqIzsH/5CoDuDe
AyQEKJgq64/ViA2QrYgpo/IFLAUbhCpMRIr7Tt5Te2TT+s43Lefyg6aY2XA2UwY0
6LwIWdBWwITjvP7XUU5B
=hwfS
-END PGP SIGNATURE-


Re: PXE Boot Timeout Issue...

2010-04-23 Thread Jim Paris
Stuart Sheldon wrote:
> Stefan Hajnoczi wrote:
> > On Fri, Apr 23, 2010 at 1:45 AM, Stuart Sheldon  wrote:
> >> Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
> >> to netboot a guest, it appears that the pxe rom is timing out on dhcp
> >> before the bridge has enough time to come up.

> 3) after about 15 seconds, tap device enters promiscuous mode on host,
> and at same time, gPXE reports "No bootable device." on guest
...
>  forward delay  15.00 bridge forward delay
>   15.00

I've run into this issue as well and lowering the forwarding delay
with "brctl setfd br0 5" fixes it.

-jim


KVM: convert ioapic lock to spinlock

2010-04-23 Thread Marcelo Tosatti

Ralf, can you give this patch a try, please? (revert the first one
first).

-

kvm_set_irq is used from non-sleepable contexts, so convert the ioapic
lock from a mutex to a spinlock.

Signed-off-by: Marcelo Tosatti 

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 3db15a8..8edd6fd 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -196,7 +196,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, 
int level)
union kvm_ioapic_redirect_entry entry;
int ret = 1;
 
-   mutex_lock(&ioapic->lock);
+   spin_lock(&ioapic->lock);
if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
entry = ioapic->redirtbl[irq];
level ^= entry.fields.polarity;
@@ -213,7 +213,7 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, 
int level)
}
trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
}
-   mutex_unlock(&ioapic->lock);
+   spin_unlock(&ioapic->lock);
 
return ret;
 }
@@ -237,9 +237,9 @@ static void __kvm_ioapic_update_eoi(struct kvm_ioapic 
*ioapic, int vector,
 * is dropped it will be put into irr and will be delivered
 * after ack notifier returns.
 */
-   mutex_unlock(&ioapic->lock);
+   spin_unlock(&ioapic->lock);
kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, i);
-   mutex_lock(&ioapic->lock);
+   spin_lock(&ioapic->lock);
 
if (trigger_mode != IOAPIC_LEVEL_TRIG)
continue;
@@ -258,9 +258,9 @@ void kvm_ioapic_update_eoi(struct kvm *kvm, int vector, int 
trigger_mode)
smp_rmb();
if (!test_bit(vector, ioapic->handled_vectors))
return;
-   mutex_lock(&ioapic->lock);
+   spin_lock(&ioapic->lock);
__kvm_ioapic_update_eoi(ioapic, vector, trigger_mode);
-   mutex_unlock(&ioapic->lock);
+   spin_unlock(&ioapic->lock);
 }
 
 static inline struct kvm_ioapic *to_ioapic(struct kvm_io_device *dev)
@@ -286,7 +286,7 @@ static int ioapic_mmio_read(struct kvm_io_device *this, 
gpa_t addr, int len,
ASSERT(!(addr & 0xf));  /* check alignment */
 
addr &= 0xff;
-   mutex_lock(&ioapic->lock);
+   spin_lock(&ioapic->lock);
switch (addr) {
case IOAPIC_REG_SELECT:
result = ioapic->ioregsel;
@@ -300,7 +300,7 @@ static int ioapic_mmio_read(struct kvm_io_device *this, 
gpa_t addr, int len,
result = 0;
break;
}
-   mutex_unlock(&ioapic->lock);
+   spin_unlock(&ioapic->lock);
 
switch (len) {
case 8:
@@ -337,7 +337,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, 
gpa_t addr, int len,
}
 
addr &= 0xff;
-   mutex_lock(&ioapic->lock);
+   spin_lock(&ioapic->lock);
switch (addr) {
case IOAPIC_REG_SELECT:
ioapic->ioregsel = data;
@@ -355,7 +355,7 @@ static int ioapic_mmio_write(struct kvm_io_device *this, 
gpa_t addr, int len,
default:
break;
}
-   mutex_unlock(&ioapic->lock);
+   spin_unlock(&ioapic->lock);
return 0;
 }
 
@@ -385,7 +385,7 @@ int kvm_ioapic_init(struct kvm *kvm)
ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL);
if (!ioapic)
return -ENOMEM;
-   mutex_init(&ioapic->lock);
+   spin_lock_init(&ioapic->lock);
kvm->arch.vioapic = ioapic;
kvm_ioapic_reset(ioapic);
kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
@@ -418,9 +418,9 @@ int kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state 
*state)
if (!ioapic)
return -EINVAL;
 
-   mutex_lock(&ioapic->lock);
+   spin_lock(&ioapic->lock);
memcpy(state, ioapic, sizeof(struct kvm_ioapic_state));
-   mutex_unlock(&ioapic->lock);
+   spin_unlock(&ioapic->lock);
return 0;
 }
 
@@ -430,9 +430,9 @@ int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state 
*state)
if (!ioapic)
return -EINVAL;
 
-   mutex_lock(&ioapic->lock);
+   spin_lock(&ioapic->lock);
memcpy(ioapic, state, sizeof(struct kvm_ioapic_state));
update_handled_vectors(ioapic);
-   mutex_unlock(&ioapic->lock);
+   spin_unlock(&ioapic->lock);
return 0;
 }
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 8a751b7..0b190c3 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -45,7 +45,7 @@ struct kvm_ioapic {
struct kvm_io_device dev;
struct kvm *kvm;
void (*ack_notifier)(void *opaque, int irq);
-   struct mutex lock;
+   spinlock_t lock;
DECLARE_BITMAP(handled_vectors, 256);
 };
 


[uq/master patch 4/5] kvm: port qemu-kvm's bitmap scanning

2010-04-23 Thread Marcelo Tosatti
Which is significantly faster.

Signed-off-by: Marcelo Tosatti 

Index: qemu-kvm/kvm-all.c
===
--- qemu-kvm.orig/kvm-all.c
+++ qemu-kvm/kvm-all.c
@@ -26,6 +26,7 @@
 #include "hw/hw.h"
 #include "gdbstub.h"
 #include "kvm.h"
+#include "bswap.h"
 
 /* KVM uses PAGE_SIZE in it's definition of COALESCED_MMIO_MAX */
 #define PAGE_SIZE TARGET_PAGE_SIZE
@@ -305,11 +306,41 @@ static int kvm_set_migration_log(int ena
 return 0;
 }
 
-static int test_le_bit(unsigned long nr, unsigned char *addr)
-{
-return (addr[nr >> 3] >> (nr & 7)) & 1;
+/* get kvm's dirty pages bitmap and update qemu's */
+static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
+ unsigned long *bitmap,
+ unsigned long offset,
+ unsigned long mem_size)
+{
+unsigned int i, j;
+unsigned long page_number, addr, addr1, c;
+ram_addr_t ram_addr;
+unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + HOST_LONG_BITS - 1) /
+HOST_LONG_BITS;
+
+/*
+ * bitmap-traveling is faster than memory-traveling (for addr...)
+ * especially when most of the memory is not dirty.
+ */
+for (i = 0; i < len; i++) {
+if (bitmap[i] != 0) {
+c = leul_to_cpu(bitmap[i]);
+do {
+j = ffsl(c) - 1;
+c &= ~(1ul << j);
+page_number = i * HOST_LONG_BITS + j;
+addr1 = page_number * TARGET_PAGE_SIZE;
+addr = offset + addr1;
+ram_addr = cpu_get_physical_page_desc(addr);
+cpu_physical_memory_set_dirty(ram_addr);
+} while (c != 0);
+}
+}
+return 0;
 }
 
+#define ALIGN(x, y)  (((x)+(y)-1) & ~((y)-1))
+
 /**
  * kvm_physical_sync_dirty_bitmap - Grab dirty bitmap from kernel space
  * This function updates qemu's dirty bitmap using 
cpu_physical_memory_set_dirty().
@@ -323,8 +354,6 @@ static int kvm_physical_sync_dirty_bitma
 {
 KVMState *s = kvm_state;
 unsigned long size, allocated_size = 0;
-target_phys_addr_t phys_addr;
-ram_addr_t addr;
 KVMDirtyLog d;
 KVMSlot *mem;
 int ret = 0;
@@ -336,7 +365,7 @@ static int kvm_physical_sync_dirty_bitma
 break;
 }
 
-size = ((mem->memory_size >> TARGET_PAGE_BITS) + 7) / 8;
+size = ALIGN(((mem->memory_size) >> TARGET_PAGE_BITS), HOST_LONG_BITS) 
/ 8;
 if (!d.dirty_bitmap) {
 d.dirty_bitmap = qemu_malloc(size);
 } else if (size > allocated_size) {
@@ -353,17 +382,9 @@ static int kvm_physical_sync_dirty_bitma
 break;
 }
 
-for (phys_addr = mem->start_addr, addr = mem->phys_offset;
- phys_addr < mem->start_addr + mem->memory_size;
- phys_addr += TARGET_PAGE_SIZE, addr += TARGET_PAGE_SIZE) {
-unsigned char *bitmap = (unsigned char *)d.dirty_bitmap;
-unsigned nr = (phys_addr - mem->start_addr) >> TARGET_PAGE_BITS;
-
-if (test_le_bit(nr, bitmap)) {
-cpu_physical_memory_set_dirty(addr);
-}
-}
-start_addr = phys_addr;
+kvm_get_dirty_pages_log_range(mem->start_addr, d.dirty_bitmap,
+  mem->start_addr, mem->memory_size);
+start_addr = mem->start_addr + mem->memory_size;
 }
 qemu_free(d.dirty_bitmap);
 




[uq/master patch 5/5] introduce qemu_ram_map

2010-04-23 Thread Marcelo Tosatti
Which allows drivers to register an mmaped region into ram block mappings.
To be used by device assignment driver.

CC: Cam Macdonell 
Signed-off-by: Marcelo Tosatti 

Index: qemu-kvm/cpu-common.h
===
--- qemu-kvm.orig/cpu-common.h
+++ qemu-kvm/cpu-common.h
@@ -32,6 +32,7 @@ static inline void cpu_register_physical
 }
 
 ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
+ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(ram_addr_t);
 void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device.  */
Index: qemu-kvm/exec.c
===
--- qemu-kvm.orig/exec.c
+++ qemu-kvm/exec.c
@@ -2710,6 +2710,34 @@ static void *file_ram_alloc(ram_addr_t m
 }
 #endif
 
+ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
+{
+RAMBlock *new_block;
+
+size = TARGET_PAGE_ALIGN(size);
+new_block = qemu_malloc(sizeof(*new_block));
+
+new_block->host = host;
+
+new_block->offset = last_ram_offset;
+new_block->length = size;
+
+new_block->next = ram_blocks;
+ram_blocks = new_block;
+
+phys_ram_dirty = qemu_realloc(phys_ram_dirty,
+(last_ram_offset + size) >> TARGET_PAGE_BITS);
+memset(phys_ram_dirty + (last_ram_offset >> TARGET_PAGE_BITS),
+   0xff, size >> TARGET_PAGE_BITS);
+
+last_ram_offset += size;
+
+if (kvm_enabled())
+kvm_setup_guest_memory(new_block->host, size);
+
+return new_block->offset;
+}
+
 ram_addr_t qemu_ram_alloc(ram_addr_t size)
 {
 RAMBlock *new_block;




[uq/master patch 0/5] prepare for qemu-kvm's usage of upstream memslot code

2010-04-23 Thread Marcelo Tosatti



[uq/master patch 3/5] introduce leul_to_cpu

2010-04-23 Thread Marcelo Tosatti
To be used by next patch.

Signed-off-by: Marcelo Tosatti 

Index: qemu-memslot/bswap.h
===
--- qemu-memslot.orig/bswap.h
+++ qemu-memslot/bswap.h
@@ -205,8 +205,10 @@ static inline void cpu_to_be32wu(uint32_
 
 #ifdef HOST_WORDS_BIGENDIAN
 #define cpu_to_32wu cpu_to_be32wu
+#define leul_to_cpu(v) le ## HOST_LONG_BITS ## _to_cpu(v)
 #else
 #define cpu_to_32wu cpu_to_le32wu
+#define leul_to_cpu(v) (v)
 #endif
 
 #undef le_bswap




[uq/master patch 2/5] kvm: add logging count to slots

2010-04-23 Thread Marcelo Tosatti
Otherwise there is no way to differentiate between global and
slot-specific logging, so, for example,

vga dirty log start
migration start
migration fail

disables dirty logging for the vga slot.

Signed-off-by: Marcelo Tosatti 

Index: qemu-kvm/kvm-all.c
===
--- qemu-kvm.orig/kvm-all.c
+++ qemu-kvm/kvm-all.c
@@ -47,6 +47,7 @@ typedef struct KVMSlot
 ram_addr_t phys_offset;
 int slot;
 int flags;
+int logging_count;
 } KVMSlot;
 
 typedef struct kvm_dirty_log KVMDirtyLog;
@@ -218,20 +219,11 @@ err:
 /*
  * dirty pages logging control
  */
-static int kvm_dirty_pages_log_change(target_phys_addr_t phys_addr,
-  ram_addr_t size, int flags, int mask)
+static int kvm_dirty_pages_log_change(KVMSlot *mem, int flags, int mask)
 {
 KVMState *s = kvm_state;
-KVMSlot *mem = kvm_lookup_matching_slot(s, phys_addr, phys_addr + size);
 int old_flags;
 
-if (mem == NULL)  {
-fprintf(stderr, "BUG: %s: invalid parameters " TARGET_FMT_plx "-"
-TARGET_FMT_plx "\n", __func__, phys_addr,
-(target_phys_addr_t)(phys_addr + size - 1));
-return -EINVAL;
-}
-
 old_flags = mem->flags;
 
 flags = (mem->flags & ~mask) | flags;
@@ -250,16 +242,42 @@ static int kvm_dirty_pages_log_change(ta
 
 int kvm_log_start(target_phys_addr_t phys_addr, ram_addr_t size)
 {
-return kvm_dirty_pages_log_change(phys_addr, size,
-  KVM_MEM_LOG_DIRTY_PAGES,
-  KVM_MEM_LOG_DIRTY_PAGES);
+KVMState *s = kvm_state;
+KVMSlot *mem = kvm_lookup_matching_slot(s, phys_addr, phys_addr + size);
+
+if (mem == NULL)  {
+fprintf(stderr, "BUG: %s: invalid parameters " TARGET_FMT_plx "-"
+TARGET_FMT_plx "\n", __func__, phys_addr,
+(target_phys_addr_t)(phys_addr + size - 1));
+return -EINVAL;
+}
+
+if (mem->logging_count++)
+return 0;
+
+return kvm_dirty_pages_log_change(mem,
+  KVM_MEM_LOG_DIRTY_PAGES,
+  KVM_MEM_LOG_DIRTY_PAGES);
 }
 
 int kvm_log_stop(target_phys_addr_t phys_addr, ram_addr_t size)
 {
-return kvm_dirty_pages_log_change(phys_addr, size,
-  0,
-  KVM_MEM_LOG_DIRTY_PAGES);
+KVMState *s = kvm_state;
+KVMSlot *mem = kvm_lookup_matching_slot(s, phys_addr, phys_addr + size);
+
+if (mem == NULL)  {
+fprintf(stderr, "BUG: %s: invalid parameters " TARGET_FMT_plx "-"
+TARGET_FMT_plx "\n", __func__, phys_addr,
+(target_phys_addr_t)(phys_addr + size - 1));
+return -EINVAL;
+}
+
+if (--mem->logging_count)
+return 0;
+
+return kvm_dirty_pages_log_change(mem,
+  0,
+  KVM_MEM_LOG_DIRTY_PAGES);
 }
 
 static int kvm_set_migration_log(int enable)
@@ -273,12 +291,15 @@ static int kvm_set_migration_log(int ena
 for (i = 0; i < ARRAY_SIZE(s->slots); i++) {
 mem = &s->slots[i];
 
-if (!!(mem->flags & KVM_MEM_LOG_DIRTY_PAGES) == enable) {
-continue;
-}
-err = kvm_set_user_memory_region(s, mem);
-if (err) {
-return err;
+if (mem->memory_size) {
+if (enable) {
+err = kvm_log_start(mem->start_addr, mem->memory_size);
+} else {
+err = kvm_log_stop(mem->start_addr, mem->memory_size);
+}
+if (err) {
+return err;
+}
 }
 }
 return 0;
@@ -442,6 +463,7 @@ static void kvm_set_phys_mem(target_phys
 
 /* unregister the overlapping slot */
 mem->memory_size = 0;
+mem->logging_count = 0;
 err = kvm_set_user_memory_region(s, mem);
 if (err) {
 fprintf(stderr, "%s: error unregistering overlapping slot: %s\n",




[uq/master patch 1/5] vga: fix typo in length passed to kvm_log_stop

2010-04-23 Thread Marcelo Tosatti
Signed-off-by: Marcelo Tosatti 

Index: qemu-kvm/hw/vga.c
===
--- qemu-kvm.orig/hw/vga.c
+++ qemu-kvm/hw/vga.c
@@ -1613,8 +1613,8 @@ void vga_dirty_log_stop(VGACommonState *
kvm_log_stop(s->map_addr, s->map_end - s->map_addr);
 
 if (kvm_enabled() && s->lfb_vram_mapped) {
-   kvm_log_stop(isa_mem_base + 0xa0000, 0x80000);
-   kvm_log_stop(isa_mem_base + 0xa8000, 0x80000);
+   kvm_log_stop(isa_mem_base + 0xa0000, 0x8000);
+   kvm_log_stop(isa_mem_base + 0xa8000, 0x8000);
 }
 
 #ifdef CONFIG_BOCHS_VBE




RHEL5.5, 32-bit VM repeatedly locks up due to kvmclock

2010-04-23 Thread David S. Ahern
After a few days of debugging I think kvmclock is the source of lockups
for a RHEL5.5-based VM. The VM works fine on one host, but repeatedly
locks up on another.

Server 1 - VM locks up repeatedly
-- DL580 G5
-- 4 quad-core X7350 processors at 2.93GHz
-- 48GB RAM

Server 2 - VM works just fine
-- DL380 G6
-- 2 quad-core E5540 processors at 2.53GHz
-- 24GB RAM

Both host servers are running Fedora Core 12, 2.6.32.11-99.fc12.x86_64
kernel. I have tried various versions of qemu-kvm -- the version in
FC-12 and the version for FC-12 in virt-preview. In both cases the
qemu-kvm command line is identical.

VM
- RHEL5.5, PAE kernel (also tried standard 32-bit)
- 2 vcpus
- 3GB RAM
- virtio network and disk

When the VM locks up both vcpu threads are spinning at 100%. Changing
the clocksource to jiffies appears to have addressed the problem.

David


Re: PXE Boot Timeout Issue...

2010-04-23 Thread Stuart Sheldon
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Stefan Hajnoczi wrote:
> On Fri, Apr 23, 2010 at 1:45 AM, Stuart Sheldon  wrote:
>> Just upgraded to 12.3 user space tools from 11.0, and now when I attempt
>> to netboot a guest, it appears that the pxe rom is timing out on dhcp
>> before the bridge has enough time to come up.
>>
>> Is there a command line switch to set the dhcp timeout, or a build
>> option that can be changed to set the timeout to a longer value, or
>> disable it entirely?
> 
> The bridge shouldn't need significant amounts of time to come up.  Can
> you describe the networking setup?  Are you using libvirt and with
> what network config?
> 
> If you have a bridge configured, can you show the output of:
> 
> $ sudo brctl showstp $bridge_name

The network is fairly straightforward: a PC with a single NIC directly
attached to a switch with a local DHCP server.

We do not use libvirt; we are using qemu with command-line switches.

Interestingly, it appears that the -option-rom command switch no longer
does anything; it is using gPXE by default only.

Here is our current command line we are testing with:

export QEMU_AUDIO_DRV=alsa
qemu-system-x86_64 \
-m 1024 \
-soundhw es1370 \
-name 'Net Boot Disk' \
-net nic,macaddr=52:54:00:12:31:06,model=virtio \
-net tap,ifname=tap6,script=/etc/kvm/kvm-ifup-br0 \
-option-rom /usr/local/share/qemu/pxe-virtio.bin \
-boot n \
-daemonize


Here is the exact chain of events:

1) Start guest from command line
2) tap adapter is added to br0 and enters learning state on host while
gPXE starts DHCP request on guest.
3) after about 15 seconds, tap device enters promiscuous mode on host,
and at same time, gPXE reports "No bootable device." on guest
4) Drop to guest console and signal a system_reset, guest gets DHCP
address and netboots normally.

Here is the info you requested, both before and after guest startup.

Before guest startup:

linus:~# brctl show br0
bridge name bridge id   STP enabled interfaces
br0 8000.001cc092ba91   no  eth0

linus:~# brctl showstp br0
br0
 bridge id              8000.001cc092ba91
 designated root        8000.001cc092ba91
 root port                 0                 path cost                 0
 max age                20.00                bridge max age        20.00
 hello time              2.00                bridge hello time      2.00
 forward delay          15.00                bridge forward delay  15.00
 ageing time           300.00
 hello timer             1.87                tcn timer              0.00
 topology change timer   0.00                gc timer               8.87
 flags


eth0 (1)
 port id                8001                 state                forwarding
 designated root        8000.001cc092ba91    path cost                19
 designated bridge      8000.001cc092ba91    message age timer      0.00
 designated port        8001                 forward delay timer    0.00
 designated cost           0                 hold timer             0.87
 flags

After guest startup:

linus:~# brctl show br0
bridge name bridge id   STP enabled interfaces
br0 8000.001cc092ba91   no  eth0
tap6
linus:~# brctl showstp br0
br0
 bridge id              8000.001cc092ba91
 designated root        8000.001cc092ba91
 root port                 0                 path cost                 0
 max age                20.00                bridge max age        20.00
 hello time              2.00                bridge hello time      2.00
 forward delay          15.00                bridge forward delay  15.00
 ageing time           300.00
 hello timer             1.00                tcn timer              0.00
 topology change timer   0.00                gc timer               2.00
 flags


eth0 (1)
 port id                8001                 state                forwarding
 designated root        8000.001cc092ba91    path cost                19
 designated bridge      8000.001cc092ba91    message age timer      0.00
 designated port        8001                 forward delay timer    0.00
 designated cost           0                 hold timer             0.00
 flags

tap6 (2)
 port id                8002                 state                forwarding
 designated root        8000.001cc092ba91    path cost               100
 designated bridge      8000.001cc092ba91    message age timer      0.00
 designated port        8002                 forward delay timer    0.00
 designated cost           0                 hold timer             0.00
 flags

Let me know if you need anything else...

Mount and unmount CD

2010-04-23 Thread Matt Burkhardt
I'm having a problem with a virtual machine running under RHEL 5.4
64-bit.  I take out the CD / insert a new one, and the main machine sees the
new CD and makes it available.  However, the virtual machines still see
the old CD.  I've tried mounting the new CD, but it just keeps mounting
what it "thinks" is in there - the old one.

Any ideas?


Matt Burkhardt
Impari Systems, Inc.

m...@imparisystems.com
http://www.imparisystems.com 
http://www.linkedin.com/in/mlburkhardt 
http://www.twitter.com/matthewboh
502 Fairview Avenue
Frederick, MD  21701
work (301) 682-7901
cell   (301) 802-3235





Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Jamie Lokier
Yoshiaki Tamura wrote:
> Jamie Lokier wrote:
> >Yoshiaki Tamura wrote:
> >>Dor Laor wrote:
> >>>On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>>>Event tapping is the core component of Kemari, and it decides on which
> >>>>event the
> >>>>primary should synchronize with the secondary. The basic assumption
> >>>>here is
> >>>>that outgoing I/O operations are idempotent, which is usually true for
> >>>>disk I/O
> >>>>and reliable network protocols such as TCP.
> >>>
> >>>IMO any type of network even should be stalled too. What if the VM runs
> >>>non tcp protocol and the packet that the master node sent reached some
> >>>remote client and before the sync to the slave the master failed?
> >>
> >>In current implementation, it is actually stalling any type of network
> >>that goes through virtio-net.
> >>
> >>However, if the application was using unreliable protocols, it should have
> >>its own recovering mechanism, or it should be completely stateless.
> >
> >Even with unreliable protocols, if slave takeover causes the receiver
> >to have received a packet that the sender _does not think it has ever
> >sent_, expect some protocols to break.
> >
> >If the slave replaying master's behaviour since the last sync means it
> >will definitely get into the same state of having sent the packet,
> >that works out.
> 
> That's something we're expecting now.
> 
> >But you still have to be careful that the other end's responses to
> >that packet are not seen by the slave too early during that replay.
> >Otherwise, for example, the slave may observe a TCP ACK to a packet
> >that it hasn't yet sent, which is an error.
> 
> Even with the current implementation, which syncs just before network output,
> what you pointed out could happen.  In this case, would the connection be
> lost, or would the client/server recover from it?  If the latter, it would be
> fine; otherwise I wonder how people doing similar things are handling this
> situation.

In the case of TCP in a "synchronised state", I think it will recover
according to the rules in RFC793.  In an "unsynchronised state"
(during connection), I'm not sure if it recovers or if it looks like a
"Connection reset" error.  I suspect it does recover but I'm not certain.

But that's TCP.  Other protocols, such as those layered over UDP, may behave
differently, because this is not behaviour they anticipate from the
network.

> >However there is one respect in which they're not idempotent:
> >
> >The TTL field should be decreased if packets are delayed.  Packets
> >should not appear to live in the network for longer than TTL seconds.
> >If they do, some protocols (like TCP) can react to the delayed ones
> >differently, such as sending a RST packet and breaking a connection.
> >
> >It is acceptable to reduce TTL faster than the minimum.  After all, it
> >is reduced by 1 on every forwarding hop, in addition to time delays.
> 
> So the problem is, when the slave takes over, it sends a packet with the same
> TTL as one the client may already have received.

Yes.  I guess this is a general problem with time-based protocols and
virtual machines getting stopped for 1 minute (say), without knowing
that real time has moved on for the other nodes.

Some application transaction, caching and locking protocols will give
wrong results when their time assumptions are discontinuous to such a
large degree.  It's a bit nasty to impose that on them after they
worked so hard on their reliability :-)

However, I think such implementations _could_ be made safe if those
programs can arrange to definitely be interrupted with a signal when
the discontinuity happens.  Of course, only if they're aware they may
be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each
time I try to write the next paragraph explaining it, some little
complication crops up and it needs more thought.  Something about
concurrent, asynchronous transactions to keep the master running while
recording the minimum states that replay needs to be safe, while
slewing the replaying slave's virtual clock back to real time quickly
during recovery mode.

-- Jamie


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 04:42:52PM +0200, Alexander Graf wrote:
> 
> On 23.04.2010, at 16:31, Joerg Roedel wrote:
> 
> > On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
> >> 
> >> On 23.04.2010, at 16:22, Joerg Roedel wrote:
> > 
> >>> No, nested_svm_nmi runs in atomic context where we can't emulate a
> >>> vmexit. We set exit_required and emulate the vmexit later.
> >> 
> >> So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
> >> state broken?
> > 
> > No, the rflags are changed in enable_nmi_window which isn't called when
> > we run nested and the nested hypervisor intercepts nmi. So it only runs
> > in the !nested case where it can't corrupt L2 state.
> 
> Last time I checked the code enable_nmi_window was the function
> triggering the #vmexit,

Yes, that's the bug which this patch fixes :-)

>so it should run in that exact scenario. If what you say is true, where
>do we #vmexit instead then?

After setting exit_required we run into svm.c:svm_vcpu_run. There the
exit_required flag is checked and, if set, the function immediately
returns without doing a vmrun. A few cycles later we run into
svm.c:handle_exit(), where exit_required is checked at the beginning and,
if set, the vmexit is emulated.
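
To make the control flow described above easier to follow, here is a minimal
sketch of the deferred-vmexit path (function names follow the thread; the
exact location of the exit_required flag and the surrounding code are
simplified assumptions, not the real kvm sources):

/* Sketch only: illustrates the "defer the emulated #VMEXIT" flow. */
static void svm_vcpu_run(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/* A nested intercept was detected earlier in atomic context, so
	 * the real VMRUN is skipped entirely for this iteration. */
	if (unlikely(svm->nested.exit_required))
		return;

	/* ... normal VMRUN path ... */
}

static int handle_exit(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/* Back in non-atomic context: the #VMEXIT into the nested
	 * hypervisor can now be emulated safely. */
	if (unlikely(svm->nested.exit_required)) {
		nested_svm_vmexit(svm);
		svm->nested.exit_required = false;
		return 1;
	}

	/* ... normal exit handling ... */
	return 1;
}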

Joerg




kvm apparently still has issues with blockdev >1Tb

2010-04-23 Thread Michael Tokarev
Not many details, but JFYI: at least 0.12 still has issues
(leading to silent data corruption) with block devices of
size >= 1Tb.  There were several such cases reported on
#...@oftc in the last two months.  Usually reducing the device
size to something less than 1Tb solves the issue (i.e.,
stops new data corruption from happening).

Thanks!

/mjt


Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:31, Joerg Roedel wrote:

> On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
>> 
>> On 23.04.2010, at 16:22, Joerg Roedel wrote:
> 
>>> No, nested_svm_nmi runs in atomic context where we can't emulate a
>>> vmexit. We set exit_required and emulate the vmexit later.
>> 
>> So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
>> state broken?
> 
> No, the rflags are changed in enable_nmi_window which isn't called when
> we run nested and the nested hypervisor intercepts nmi. So it only runs
> in the !nested case where it can't corrupt L2 state.

Last time I checked the code, enable_nmi_window was the function triggering the
#vmexit, so it should run in that exact scenario. If what you say is true,
where do we #vmexit instead then?


Alex



Re: [PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:27, Joerg Roedel wrote:

> On Fri, Apr 23, 2010 at 03:57:32PM +0200, Alexander Graf wrote:
>> 
>> On 22.04.2010, at 12:33, Joerg Roedel wrote:
>> 
>>> This patch adds logic to kvm/x86 which allows marking an
>>> injected exception as reinjected. This allows removing an
>>> ugly hack from svm_complete_interrupts that prevented
>>> exceptions from being reinjected at all in the nested case.
>>> The hack was necessary because a reinjected exception into
>>> the nested guest could cause a nested vmexit emulation. But
>>> reinjected exceptions must not be intercepted. The downside of
>>> the hack is that an exception that is injected could get
>>> lost.
>>> This patch fixes the problem and puts the code for it into
>>> generic x86 files, because Nested-VMX will likely have the
>>> same problem and could reuse the code.
>> 
>> So we always handle the reinjection from KVM code? Shouldn't the l1
>> hypervisor do this?
> 
> No. We only have the problem if we need to handle a nested intercept on
> the host level instead of reinjecting it. So the nested hypervisor
> couldn't be involved in the reinjection.

Hrm, makes sense. Not pretty.


Alex



Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 04:24:54PM +0200, Alexander Graf wrote:
> 
> On 23.04.2010, at 16:22, Joerg Roedel wrote:

> > No, nested_svm_nmi runs in atomic context where we can't emulate a
> > vmexit. We set exit_required and emulate the vmexit later.
> 
> So we modify the L2 rflags and then trigger a #vmexit, leaving the l2
> state broken?

No, the rflags are changed in enable_nmi_window which isn't called when
we run nested and the nested hypervisor intercepts nmi. So it only runs
in the !nested case where it can't corrupt L2 state.

Joerg




Re: [PATCH 8/8] KVM: SVM: Handle MCE intercepts always on host level

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:58:18PM +0200, Alexander Graf wrote:
> 
> On 22.04.2010, at 12:33, Joerg Roedel wrote:
> 
> > This patch prevents MCE intercepts from being propagated
> > into the L1 guest if they happened in an L2 guest.
> 
> Good catch. How did you stumble over this?

While thinking about another issue which I can't disclose
right now ;-) Stay tuned...

Joerg




Re: [PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:57:32PM +0200, Alexander Graf wrote:
> 
> On 22.04.2010, at 12:33, Joerg Roedel wrote:
> 
> > This patch adds logic to kvm/x86 which allows marking an
> > injected exception as reinjected. This allows removing an
> > ugly hack from svm_complete_interrupts that prevented
> > exceptions from being reinjected at all in the nested case.
> > The hack was necessary because a reinjected exception into
> > the nested guest could cause a nested vmexit emulation. But
> > reinjected exceptions must not be intercepted. The downside of
> > the hack is that an exception that is injected could get
> > lost.
> > This patch fixes the problem and puts the code for it into
> > generic x86 files, because Nested-VMX will likely have the
> > same problem and could reuse the code.
> 
> So we always handle the reinjection from KVM code? Shouldn't the l1
> hypervisor do this?

No. We only have the problem if we need to handle a nested intercept on
the host level instead of reinjecting it. So the nested hypervisor
couldn't be involved in the reinjection.
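
As an illustration of the mechanism described in the commit message, a rough
sketch (the reinject parameter name, the helper names and the VMCB constants
are assumptions based on this thread, not the actual patch):

/* Sketch: an exception that is being re-delivered after a failed injection
 * is flagged as "reinject"; the nested-intercept check is skipped for it,
 * so it cannot trigger an emulated #VMEXIT a second time and get lost. */
static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
				bool has_error_code, u32 error_code,
				bool reinject)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/* Only a freshly injected exception may be intercepted by the
	 * nested hypervisor; a reinjected one must be delivered as-is. */
	if (!reinject &&
	    nested_svm_check_exception(svm, nr, has_error_code, error_code))
		return;

	svm->vmcb->control.event_inj = nr
		| SVM_EVTINJ_VALID
		| (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
		| SVM_EVTINJ_TYPE_EXEPT;
	svm->vmcb->control.event_inj_err = error_code;
}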

Joerg




Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:22, Joerg Roedel wrote:

> On Fri, Apr 23, 2010 at 04:19:40PM +0200, Alexander Graf wrote:
>> 
>> On 23.04.2010, at 16:13, Joerg Roedel wrote:
>> 
>>> On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
 
 On 22.04.2010, at 12:33, Joerg Roedel wrote:
 
> The patch introducing nested nmi handling had a bug. The
> check does not belong to enable_nmi_window but must be in
> nmi_allowed. This patch fixes this.
> 
> Signed-off-by: Joerg Roedel 
> ---
> arch/x86/kvm/svm.c |   16 +---
> 1 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ab78eb8..ec20584 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
> {
>   struct vcpu_svm *svm = to_svm(vcpu);
>   struct vmcb *vmcb = svm->vmcb;
> - return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> - !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> + int ret;
> + ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> +   !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> + ret = ret && gif_set(svm) && nested_svm_nmi(svm);
> +
> + return ret;
> }
> 
> static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
> @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu 
> *vcpu)
>* Something prevents NMI from been injected. Single step over possible
>* problem (IRET or exception injection or interrupt shadow)
>*/
> - if (gif_set(svm) && nested_svm_nmi(svm)) {
> - svm->nmi_singlestep = true;
> - svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> - update_db_intercept(vcpu);
> - }
> + svm->nmi_singlestep = true;
> + svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> + update_db_intercept(vcpu);
 
 So we're always messing with the nested guest state when the host
 wants to inject an nmi into the l1 guest? Is that safe?
>>> 
>>> Why not? We can't inject an NMI directly into L2 if the nested
>>> hypervisor intercepts it.
>> 
>> So where did the code go that does the #vmexit in case the nested
>> hypervisor does intercept it? It used to be nested_svm_nmi(), right?
> 
> No, nested_svm_nmi runs in atomic context where we can't emulate a
> vmexit. We set exit_required and emulate the vmexit later.

So we modify the L2 rflags and then trigger a #vmexit, leaving the l2 state 
broken?

Alex



Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 04:19:40PM +0200, Alexander Graf wrote:
> 
> On 23.04.2010, at 16:13, Joerg Roedel wrote:
> 
> > On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
> >> 
> >> On 22.04.2010, at 12:33, Joerg Roedel wrote:
> >> 
> >>> The patch introducing nested nmi handling had a bug. The
> >>> check does not belong to enable_nmi_window but must be in
> >>> nmi_allowed. This patch fixes this.
> >>> 
> >>> Signed-off-by: Joerg Roedel 
> >>> ---
> >>> arch/x86/kvm/svm.c |   16 +---
> >>> 1 files changed, 9 insertions(+), 7 deletions(-)
> >>> 
> >>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >>> index ab78eb8..ec20584 100644
> >>> --- a/arch/x86/kvm/svm.c
> >>> +++ b/arch/x86/kvm/svm.c
> >>> @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
> >>> {
> >>>   struct vcpu_svm *svm = to_svm(vcpu);
> >>>   struct vmcb *vmcb = svm->vmcb;
> >>> - return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> >>> - !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> >>> + int ret;
> >>> + ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> >>> +   !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> >>> + ret = ret && gif_set(svm) && nested_svm_nmi(svm);
> >>> +
> >>> + return ret;
> >>> }
> >>> 
> >>> static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
> >>> @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu 
> >>> *vcpu)
> >>>* Something prevents NMI from been injected. Single step over possible
> >>>* problem (IRET or exception injection or interrupt shadow)
> >>>*/
> >>> - if (gif_set(svm) && nested_svm_nmi(svm)) {
> >>> - svm->nmi_singlestep = true;
> >>> - svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> >>> - update_db_intercept(vcpu);
> >>> - }
> >>> + svm->nmi_singlestep = true;
> >>> + svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> >>> + update_db_intercept(vcpu);
> >> 
> >> So we're always messing with the nested guest state when the host
> >> wants to inject an nmi into the l1 guest? Is that safe?
> > 
> > Why not? We can't inject an NMI directly into L2 if the nested
> > hypervisor intercepts it.
> 
> So where did the code go that does the #vmexit in case the nested
> hypervisor does intercept it? It used to be nested_svm_nmi(), right?

No, nested_svm_nmi runs in atomic context where we can't emulate a
vmexit. We set exit_required and emulate the vmexit later.

Joerg




Re: [PATCH 6/8] KVM: SVM: Report emulated SVM features to userspace

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:55:15PM +0200, Alexander Graf wrote:
> 
> On 22.04.2010, at 12:33, Joerg Roedel wrote:
> > static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 
> > *entry)
> > {
> > +   switch (func) {
> > +   case 0x8000000A:
> > +   entry->eax = 1; /* SVM revision 1 */
> > +   entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper
> > +  ASID emulation to nested SVM */
> 
> I completely forgot what we do now. What do we do? It shouldn't be too
> hard to keep a table around and just assign 8 host ASIDs to the guest.
> If possible lazily, so we can just flush the whole thing when we run
> out of entries.

Currently we have no ASID emulation. We just assign a new ASID at every
vmrun/vmexit. But I have a rough idea of how we can emulate ASIDs for
the L2 guest with a per-vcpu ASID cache.
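
A minimal sketch of what such a per-vcpu guest-ASID-to-host-ASID cache could
look like (purely illustrative; the names, the 8-entry size and the flush-all
eviction policy are assumptions, not code from this series):

/* Sketch: lazily map the guest-visible ASIDs advertised via CPUID onto
 * host ASIDs, flushing everything when the small table is exhausted. */
#define NESTED_ASID_SLOTS 8

struct nested_asid_cache {
	u32 guest_asid[NESTED_ASID_SLOTS];	/* ASID chosen by the L1 hypervisor */
	u32 host_asid[NESTED_ASID_SLOTS];	/* ASID actually programmed into the VMCB */
	unsigned int used;
};

static u32 nested_map_asid(struct vcpu_svm *svm, struct nested_asid_cache *c,
			   u32 guest_asid)
{
	unsigned int i;

	for (i = 0; i < c->used; i++)
		if (c->guest_asid[i] == guest_asid)
			return c->host_asid[i];	/* cache hit, TLB entries stay valid */

	if (c->used == NESTED_ASID_SLOTS) {
		/* Out of slots: drop all mappings and start over. */
		c->used = 0;
		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
	}

	i = c->used++;
	c->guest_asid[i] = guest_asid;
	c->host_asid[i]  = svm_allocate_host_asid();	/* hypothetical helper */
	return c->host_asid[i];
}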

> It's basically the same as my VSID mapping on ppc64 actually. See
> arch/powerpc/kvm/book3s_64_mmu_host.c and search for "slb".

Thanks, will have a look there.

> > +   entry->ecx = 0; /* Reserved */
> > +   entry->edx = 0; /* Do not support any additional features */
> 
> What about nnpt? Hasn't that been accepted yet?

I am currently working on it. If all goes well I can submit a new
version next week.

Joerg




Re: [PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:17, Joerg Roedel wrote:

> On Fri, Apr 23, 2010 at 03:50:20PM +0200, Alexander Graf wrote:
>> 
>> On 22.04.2010, at 12:33, Joerg Roedel wrote:
>> 
>>> This patch syncs cr0 and cr3 from the vmcb to the kvm state
>>> before nested intercept handling is done. This allows to
>>> simplify the vmexit path.
>>> 
>>> Signed-off-by: Joerg Roedel 
>>> ---
>>> arch/x86/kvm/svm.c |   15 ++-
>>> 1 files changed, 6 insertions(+), 9 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index c480d7f..5ad9d80 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>>> nested_vmcb->save.gdtr   = vmcb->save.gdtr;
>>> nested_vmcb->save.idtr   = vmcb->save.idtr;
>>> nested_vmcb->save.cr0= kvm_read_cr0(&svm->vcpu);
>>> -   if (npt_enabled)
>>> -   nested_vmcb->save.cr3= vmcb->save.cr3;
>>> -   else
>>> -   nested_vmcb->save.cr3= svm->vcpu.arch.cr3;
>>> +   nested_vmcb->save.cr3= svm->vcpu.arch.cr3;
>> 
>> 
>> Why don't we need this anymore again?
> 
> Because arch.cr3 always contains the current l2-cr3. This wasn't the
> case before, because the sync was done after the nested handling. But now
> we do it before a vmexit can happen, so it is safe to just use arch.cr3.

I see.


Alex



Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 16:13, Joerg Roedel wrote:

> On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
>> 
>> On 22.04.2010, at 12:33, Joerg Roedel wrote:
>> 
>>> The patch introducing nested nmi handling had a bug. The
>>> check does not belong to enable_nmi_window but must be in
>>> nmi_allowed. This patch fixes this.
>>> 
>>> Signed-off-by: Joerg Roedel 
>>> ---
>>> arch/x86/kvm/svm.c |   16 +---
>>> 1 files changed, 9 insertions(+), 7 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index ab78eb8..ec20584 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
>>> {
>>> struct vcpu_svm *svm = to_svm(vcpu);
>>> struct vmcb *vmcb = svm->vmcb;
>>> -   return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
>>> -   !(svm->vcpu.arch.hflags & HF_NMI_MASK);
>>> +   int ret;
>>> +   ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
>>> + !(svm->vcpu.arch.hflags & HF_NMI_MASK);
>>> +   ret = ret && gif_set(svm) && nested_svm_nmi(svm);
>>> +
>>> +   return ret;
>>> }
>>> 
>>> static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
>>> @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
>>>  * Something prevents NMI from been injected. Single step over possible
>>>  * problem (IRET or exception injection or interrupt shadow)
>>>  */
>>> -   if (gif_set(svm) && nested_svm_nmi(svm)) {
>>> -   svm->nmi_singlestep = true;
>>> -   svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
>>> -   update_db_intercept(vcpu);
>>> -   }
>>> +   svm->nmi_singlestep = true;
>>> +   svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
>>> +   update_db_intercept(vcpu);
>> 
>> So we're always messing with the nested guest state when the host
>> wants to inject an nmi into the l1 guest? Is that safe?
> 
> Why not? We can't inject an NMI directly into L2 if the nested
> hypervisor intercepts it.

So where did the code go that does the #vmexit in case the nested hypervisor 
does intercept it? It used to be nested_svm_nmi(), right?


Alex



Re: [PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:50:20PM +0200, Alexander Graf wrote:
> 
> On 22.04.2010, at 12:33, Joerg Roedel wrote:
> 
> > This patch syncs cr0 and cr3 from the vmcb to the kvm state
> > before nested intercept handling is done. This allows to
> > simplify the vmexit path.
> > 
> > Signed-off-by: Joerg Roedel 
> > ---
> > arch/x86/kvm/svm.c |   15 ++-
> > 1 files changed, 6 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index c480d7f..5ad9d80 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
> > nested_vmcb->save.gdtr   = vmcb->save.gdtr;
> > nested_vmcb->save.idtr   = vmcb->save.idtr;
> > nested_vmcb->save.cr0= kvm_read_cr0(&svm->vcpu);
> > -   if (npt_enabled)
> > -   nested_vmcb->save.cr3= vmcb->save.cr3;
> > -   else
> > -   nested_vmcb->save.cr3= svm->vcpu.arch.cr3;
> > +   nested_vmcb->save.cr3= svm->vcpu.arch.cr3;
> 
> 
> Why don't we need this anymore again?

Because arch.cr3 always contains the current l2-cr3. This wasn't the
case before, because the sync was done after the nested handling. But now
we do it before a vmexit can happen, so it is safe to just use arch.cr3.

Joerg




Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Joerg Roedel
On Fri, Apr 23, 2010 at 03:46:07PM +0200, Alexander Graf wrote:
> 
> On 22.04.2010, at 12:33, Joerg Roedel wrote:
> 
> > The patch introducing nested nmi handling had a bug. The
> > check does not belong to enable_nmi_window but must be in
> > nmi_allowed. This patch fixes this.
> > 
> > Signed-off-by: Joerg Roedel 
> > ---
> > arch/x86/kvm/svm.c |   16 +---
> > 1 files changed, 9 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index ab78eb8..ec20584 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
> > {
> > struct vcpu_svm *svm = to_svm(vcpu);
> > struct vmcb *vmcb = svm->vmcb;
> > -   return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> > -   !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> > +   int ret;
> > +   ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> > + !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> > +   ret = ret && gif_set(svm) && nested_svm_nmi(svm);
> > +
> > +   return ret;
> > }
> > 
> > static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
> > @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
> >  * Something prevents NMI from been injected. Single step over possible
> >  * problem (IRET or exception injection or interrupt shadow)
> >  */
> > -   if (gif_set(svm) && nested_svm_nmi(svm)) {
> > -   svm->nmi_singlestep = true;
> > -   svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> > -   update_db_intercept(vcpu);
> > -   }
> > +   svm->nmi_singlestep = true;
> > +   svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> > +   update_db_intercept(vcpu);
> 
> So we're always messing with the nested guest state when the host
> wants to inject an nmi into the l1 guest? Is that safe?

Why not? We can't inject an NMI directly into L2 if the nested
hypervisor intercepts it.

Joerg




Re: [PATCH 5/8] KVM: X86: Add callback to let modules decide over some supported cpuid bits

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:52 PM, Alexander Graf wrote:

On 22.04.2010, at 12:33, Joerg Roedel wrote:

   

This patch adds the get_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decision about some supported cpuid bits to the
architecture modules.

Cc: sta...@kernel.org
 

Please don't CC stable. There is a KVM stable tag now, so every stable 
submission goes through Avi/Marcelo.


   


It's not a problem.  stable@ will ignore kvm patches not coming from the 
maintainers, and we will recognize cc: stable as a stable request.


(the new approach is to collect the patches in kvm-updates/2.6.x, 
autotest, and forward to stable@).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 8/8] KVM: SVM: Handle MCE intercepts always on host level

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> This patch prevents MCE intercepts from being propagated
> into the L1 guest if they happened in an L2 guest.

Good catch. How did you stumble over this?

Ack.

Alex



Re: [PATCH 7/8] KVM: x86: Allow marking an exception as reinjected

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> This patch adds logic to kvm/x86 which allows marking an
> injected exception as reinjected. This allows removing an
> ugly hack from svm_complete_interrupts that prevented
> exceptions from being reinjected at all in the nested case.
> The hack was necessary because a reinjected exception into
> the nested guest could cause a nested vmexit emulation. But
> reinjected exceptions must not be intercepted. The downside of
> the hack is that an exception that is injected could get
> lost.
> This patch fixes the problem and puts the code for it into
> generic x86 files, because Nested-VMX will likely have the
> same problem and could reuse the code.

So we always handle the reinjection from KVM code? Shouldn't the l1 hypervisor 
do this?

Alex



Re: [PATCH 6/8] KVM: SVM: Report emulated SVM features to userspace

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> This patch implements the reporting of the emulated SVM
> features to userspace instead of the real hardware
> capabilities. Every real hardware capability needs emulation
> in nested svm so the old behavior was broken.
> 
> Cc: sta...@kernel.org

Again, please don't CC stable directly.

> Signed-off-by: Joerg Roedel 
> ---
> arch/x86/kvm/svm.c |   10 ++
> 1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 0fa2035..65fc114 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3154,6 +3154,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
> 
> static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
> {
> + switch (func) {
> + case 0x8000000A:
> + entry->eax = 1; /* SVM revision 1 */
> + entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper
> +ASID emulation to nested SVM */

I completely forgot what we do now. What do we do? It shouldn't be too hard to 
keep a table around and just assign 8 host ASIDs to the guest. If possible 
lazily, so we can just flush the whole thing when we run out of entries.

It's basically the same as my VSID mapping on ppc64 actually. See 
arch/powerpc/kvm/book3s_64_mmu_host.c and search for "slb".

> + entry->ecx = 0; /* Reserved */
> + entry->edx = 0; /* Do not support any additional features */

What about nnpt? Hasn't that been accepted yet?


Alex



Re: [PATCH 5/8] KVM: X86: Add callback to let modules decide over some supported cpuid bits

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> This patch adds the get_supported_cpuid callback to
> kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
> decision about some supported cpuid bits to the
> architecture modules.
> 
> Cc: sta...@kernel.org

Please don't CC stable. There is a KVM stable tag now, so every stable 
submission goes through Avi/Marcelo.


Alex



Re: [PATCH 4/8] KVM: SVM: Propagate nested entry failure into guest hypervisor

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> This patch implements propagation of a failed guest vmrun
> back into the guest instead of killing the whole guest.

Oh, nice one.

Ack.


Alex



Re: [PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> This patch syncs cr0 and cr3 from the vmcb to the kvm state
> before nested intercept handling is done. This allows to
> simplify the vmexit path.
> 
> Signed-off-by: Joerg Roedel 
> ---
> arch/x86/kvm/svm.c |   15 ++-
> 1 files changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index c480d7f..5ad9d80 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>   nested_vmcb->save.gdtr   = vmcb->save.gdtr;
>   nested_vmcb->save.idtr   = vmcb->save.idtr;
>   nested_vmcb->save.cr0= kvm_read_cr0(&svm->vcpu);
> - if (npt_enabled)
> - nested_vmcb->save.cr3= vmcb->save.cr3;
> - else
> - nested_vmcb->save.cr3= svm->vcpu.arch.cr3;
> + nested_vmcb->save.cr3= svm->vcpu.arch.cr3;


Why don't we need this anymore again?


Alex



Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:22 PM, Anthony Liguori wrote:

I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages being copied to the QEMUFile buf through qemu_put_buffer.
2. Calling write() every time, even when we want to send multiple pages
at once.


I think 2 may be negligible.
But 1 seems to be problematic if we want to make the latency as small
as possible, no?



Copying often has strange CPU characteristics depending on whether the
data is already in cache.  It's better to drive this sort of
optimization through performance measurement, because changes are not
always obvious.


Copying always introduces more cache pollution, so even if the data is
in the cache, avoiding the copy is worthwhile (not disagreeing with the need to measure).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 0/8] More fixes for nested svm

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 13:04, Avi Kivity wrote:

> On 04/22/2010 01:33 PM, Joerg Roedel wrote:
>> Hi Avi, Marcelo,
>> 
>> here is another set of nested svm fixes. They fix NMI code to make UP Hyper-V
>> root domain booting. The patches also add better handling for nested entry
>> failures and mce intercepts.
>> Also in this patchset are the fixes for the supported cpuid reporting for svm
>> features. These patches were taken from the nested-npt patchset and slightly
>> modified. These patches are also marked for -stable backporting.
>> The probably most important fix is about exception reinjection. This didn't
>> work reliably before and is fixed with the patch in this series now. This fix
>> also touches common x86 code but that should be ok because it could be reused
>> by nested-vmx later.
>> Please review and give comments (or apply ;-).
>>   
> 
> Looks good.  I'll wait a day or two for more reviews (esp. Alex).


Looking over it, but can't promise I'll be able to fully go through the 
feedback. Flying off for 2 weeks of vacation tomorrow morning.


Alex



Re: [PATCH 1/8] KVM: SVM: Fix nested nmi handling

2010-04-23 Thread Alexander Graf

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> The patch introducing nested nmi handling had a bug. The
> check does not belong to enable_nmi_window but must be in
> nmi_allowed. This patch fixes this.
> 
> Signed-off-by: Joerg Roedel 
> ---
> arch/x86/kvm/svm.c |   16 +---
> 1 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index ab78eb8..ec20584 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
> {
>   struct vcpu_svm *svm = to_svm(vcpu);
>   struct vmcb *vmcb = svm->vmcb;
> - return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> - !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> + int ret;
> + ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
> +   !(svm->vcpu.arch.hflags & HF_NMI_MASK);
> + ret = ret && gif_set(svm) && nested_svm_nmi(svm);
> +
> + return ret;
> }
> 
> static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
> @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
>* Something prevents NMI from been injected. Single step over possible
>* problem (IRET or exception injection or interrupt shadow)
>*/
> - if (gif_set(svm) && nested_svm_nmi(svm)) {
> - svm->nmi_singlestep = true;
> - svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> - update_db_intercept(vcpu);
> - }
> + svm->nmi_singlestep = true;
> + svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> + update_db_intercept(vcpu);

So we're always messing with the nested guest state when the host wants to 
inject an nmi into the l1 guest? Is that safe?


Alex



Re: [PATCH 0/8] More fixes for nested svm

2010-04-23 Thread Alexander Graf
Hey Joerg,

On 22.04.2010, at 12:33, Joerg Roedel wrote:

> Hi Avi, Marcelo,
> 
> here is another set of nested svm fixes. They fix NMI code to make UP Hyper-V
> root domain booting. The patches also add better handling for nested entry
> failures and mce intercepts.
> Also in this patchset are the fixes for the supported cpuid reporting for svm
> features. These patches were taken from the nested-npt patchset and slightly
> modified. These patches are also marked for -stable backporting.
> The probably most important fix is about exception reinjection. This didn't
> work reliably before and is fixed with the patch in this series now. This fix
> also touches common x86 code but that should be ok because it could be reused
> by nested-vmx later.
> Please review and give comments (or apply ;-).


Nice work. Please remember to keep me CC'ed when posting nested SVM patches.


Alex



Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:02 PM, Chris Lalancette wrote:

On 04/23/2010 07:05 AM, Avi Kivity wrote:
   

On 04/22/2010 10:55 PM, Gleb Natapov wrote:
 


   

What about converting PIC/IOAPIC mutexes into spinlocks?

   

Works for me, but on large guests the spinning will be noticeable.
I believe.

 

For interrupts going through IOPIC, but we know this is not scalable
anyway.

   

Yes.  We also wanted to convert the ioapic/pic to spinlocks so we could
queue the interrupt from the PIT directly instead of using
KVM_REQ_PENDING_TIMER which keeps confusing me.  Chris Lalancette posted
a patchset for this a while back but it was never completed.
 

Yeah, I'm sorry I never completed it.  It turns out that with the HPET
changes that went in around the time I was looking at it, that set of
patches wasn't really required to fix the problem I was seeing with kdump.

That being said, if it's useful to somebody, I can repost the patches
(though they are woefully out-of-date now).  Let me know if you want
to see them again.
   


Let's see if one of the alternatives works out.  I prefer to keep the 
critical sections short.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Avi Kivity

On 04/23/2010 04:21 PM, Xiao Guangrong wrote:


OK, i'll keep invlpg code in paging_tmpl.h and directly call FNAME(update_pte).

But, i don't see mmu_guess_page_from_pte_write() code depends on pte size. :-(
   


It doesn't indeed, I misremembered.  It's mmu_pte_write_new_pte() (which 
is no longer needed).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Anthony Liguori

On 04/23/2010 04:53 AM, Avi Kivity wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.



Why is this needed?  The real buffering is in the kernel anyways; this 
is only used to reduce the number of write() syscalls.


With vmstate, we really shouldn't need to do this magic anymore as an 
optimization.


Regards,

Anthony Liguori



Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Xiao Guangrong


Avi Kivity wrote:
> On 04/23/2010 02:27 PM, Avi Kivity wrote:
>> On 04/22/2010 09:12 AM, Xiao Guangrong wrote:
>>> Using '!sp->role.cr4_pae' replaces 'PTTYPE == 32' and using
>>> 'pte_size = sp->role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t)
>>>
>>> Then no need compile twice for this code
>>>
>> I think we should keep it in - kvm_mmu_pte_write() calls back to
>> FNAME(update_pte), we could make the call directly from here speed
>> things up, since we already have the spte and don't need to look it up.
>>
> 
> I see you do this in patches 9, 10 - but is it possible to use
> update_pte directly?  I think we'll need to make
> guess_page_from_pte_write() part of paging_tmpl.h (in general anything
> that depends on pte size is better off in paging_tmpl.h).
> 

OK, i'll keep invlpg code in paging_tmpl.h and directly call FNAME(update_pte).

But, i don't see mmu_guess_page_from_pte_write() code depends on pte size. :-(

Xiao


Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Avi Kivity

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch VCPU state is already proceeded before
synchronization, and after failover to the VM on the receiver, it
hangs because of this.
   


We discussed moving the barrier to the actual output device, instead of 
the I/O port.  This allows you to complete the I/O transaction before 
starting synchronization.


Does it not work for some reason?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.

2010-04-23 Thread Anthony Liguori

On 04/22/2010 11:02 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

For foolproofing purposes, qemu_put_vector_prepare should be called
before qemu_put_vector. Then, if any qemu_put_* function except this one is
called after qemu_put_vector_prepare, the program will abort().

Signed-off-by: Yoshiaki Tamura


I don't get it. What's this protecting against?


This was introduced to prevent mixing the order of normal write and 
vector write, and flush QEMUFile buffer before handling vectors.
While qemu_put_buffer copies data to QEMUFile buffer, 
qemu_put_vector() will bypass that buffer.


It's just a foolproofing measure for what we encountered at the beginning, and
if the user of qemu_put_vector() is careful enough, we can remove
qemu_put_vector_prepare().  While writing this message, I started to
think that just calling qemu_fflush() in qemu_put_vector() would be
enough...


I definitely think removing the vector stuff in the first version would 
simplify the process of getting everything merged.  I'd prefer not to 
have two apis so if vector operations were important from a performance 
perspective, I'd want to see everything converted to a vector API.


Regards,

Anthony Liguori



Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().

2010-04-23 Thread Anthony Liguori

On 04/22/2010 10:37 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

QEMUFile currently doesn't support writev(). For sending multiple pieces of
data, such as pages, using writev() should be more efficient.

Signed-off-by: Yoshiaki Tamura


Is there performance data that backs this up? Since QEMUFile uses a
linear buffer for most operations that's limited to 16k, I suspect you
wouldn't be able to observe a difference in practice.


I currently don't have data, but I'll prepare it.
There were two things I wanted to avoid.

1. Pages being copied to the QEMUFile buf through qemu_put_buffer.
2. Calling write() every time, even when we want to send multiple pages
at once.


I think 2 may be negligible.
But 1 seems to be problematic if we want to make the latency as small
as possible, no?


Copying often has strange CPU characteristics depending on whether the
data is already in cache.  It's better to drive this sort of
optimization through performance measurement, because changes are not
always obvious.


Regards,

Anthony Liguori



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:59 PM, Alexander Graf wrote:



Ah so the 31st bit is optional as far as userspace is concerned?  What does it 
mean? (just curious)
 

The 0x80000000 bit declares that a pointer is in 24-bit mode, so that
applications can use the spare upper bits for random data.

See http://en.wikipedia.org/wiki/31-bit for an explanation.
   


Interesting.  Luckily AMD made the top 16 bits of pointers reserved in 
x86-64.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Anthony Liguori

On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:

On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:

2010/4/22 Dor Laor:

On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're sending
this message to share what we have now and TODO lists. Hopefully, we would like
to get early feedback to keep us in the right direction. Although advanced
approaches in the TODO lists are fascinating, we would like to run this project
step by step while absorbing comments from the community. The current code is
based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol, and most of the control logic is
implemented in QEMU. However, we needed a hack in KVM to prevent rip from
proceeding before synchronizing VMs. It may also need some plumbing in the
kernel side to guarantee replayability of certain events and instructions,
integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
as for optimization purposes, for example.

[ snap]

The rest of this message describes TODO lists grouped by each topic.


=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which event the
primary should synchronize with the secondary. The basic assumption here is
that outgoing I/O operations are idempotent, which is usually true for disk I/O
and reliable network protocols such as TCP.

IMO any type of network event should be stalled too. What if the VM runs a
non-TCP protocol and the packet that the master node sent reached some
remote client, and before the sync to the slave the master failed?
In the current implementation, it is actually stalling any type of network
traffic that goes through virtio-net.

However, if the application was using unreliable protocols, it should
have its own recovery mechanism, or it should be completely stateless.

Why do you treat TCP differently? You can damage the entire VM this way -
think of a DHCP request that was dropped at the moment you switched between
the master and the slave.

I'm not trying to say that we should treat TCP differently, just that it's severe.
In the case of a DHCP request, the client would have a chance to retry after
failover, correct?
BTW, in the current implementation,


I'm slightly confused about the current implementation vs. my
recollection of the original paper with Xen. I had thought that all disk
and network I/O was buffered in such a way that at each checkpoint, the
I/O operations would be released in a burst. Otherwise, you would have
to synchronize after every I/O operation which is what it seems the
current implementation does.


Yes, you're almost right.
It's synchronizing before QEMU starts emulating I/O at each device model.


If NodeA is the master and NodeB is the slave, if NodeA sends a network 
packet, you'll checkpoint before the packet is actually sent, and then 
if a failure occurs before the next checkpoint, won't that result in 
both NodeA and NodeB sending out a duplicate version of the packet?


Regards,

Anthony Liguori



Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Avi Kivity

On 04/23/2010 12:59 PM, Yoshiaki Tamura wrote:

Avi Kivity wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB. It would be useful if it could
be flexible.



Why is this needed? The real buffering is in the kernel anyways; this is
only used to reduce the number of write() syscalls.


This was introduced to buffer the transferred guest image transactionally
on the receiver side.  The sender doesn't use it.

In case of an intermediate state, we just discard this buffer.


How large can it grow?

What's wrong with applying it (perhaps partially) to the guest state?  
The next state transfer will overwrite it completely, no?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Arnd Bergmann
On Friday 23 April 2010, Avi Kivity wrote:
> Ah so the 31st bit is optional as far as userspace is concerned?  What 
> does it mean? (just curious)

On data pointers it's ignored. When you branch to a function, this bit
determines whether the target function is run in 24 or 31 bit mode.
This allows linking to legacy code on older operating systems that
also support 24 bit libraries.

> What happens on the opposite conversion?  is it restored?
> 
> What about
> 
> int compare_pointer(void *a, void *b)
> {
> unsigned long ai = (unsigned long)a;
> void *aia = (void *)ai;
> 
> return a == aia; /* true if the round trip preserves the pointer value */
> }

Some instructions set the bit, others clear it, so aia and a may not
be bitwise identical.

> Does gcc mask the big in pointer comparisons as well?

Yes. To stay in the above example:

a == aia;   /* true */
(unsigned long)a == (unsigned long)aia; /* true */
*(unsigned long *)&a == *(unsigned long *)&aia; /* undefined on s390 */

Arnd


Re: [RFC PATCH 00/20] Kemari for KVM v0.1

2010-04-23 Thread Anthony Liguori

On 04/22/2010 07:45 PM, Yoshiaki Tamura wrote:

Anthony Liguori wrote:


I think it would make sense to separate out the things that are actually
optimizations (like the dirty bitmap changes and the writev/readv
changes) and to attempt to justify them with actual performance data.


I agree with the separation plan.

For the dirty bitmap change, Avi and I discussed a patchset for upstream
QEMU while you were offline (sorry if I got that wrong).  Could you also
take a look?


Yes, I've seen it and I don't disagree.  That said, there ought to be 
perf data in the commit log so that down the road, the justification is 
understood.



http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html

Regarding writev, I agree that it should be backed by actual data,
otherwise it should be removed.  We attempted to do everything that might
reduce the overhead of the transaction.



I'd prefer not to modify the live migration protocol ABI and it doesn't
seem to be necessary if we're willing to add options to the -incoming
flag. We also want to be a bit more generic with respect to IO.


I totally agree with your approach not to change the protocol ABI.  
Can we add an option to -incoming?  Like, -incoming ft_mode, for example

Regarding the IO, let me reply to the next message.


Otherwise, the series looks very close to being mergable.


Thank you for your comment on each patch.

To be honest, I wasn't that confident, because I'm a newbie to KVM/QEMU
and struggled with how to implement it in an acceptable way.


The series looks very good.  I'm eager to see this functionality merged.

Regards,

Anthony Liguori


Thanks,

Yoshi



Regards,

Anthony Liguori









Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-23 Thread Chris Lalancette
On 04/23/2010 07:05 AM, Avi Kivity wrote:
> On 04/22/2010 10:55 PM, Gleb Natapov wrote:
>>
>>
 What about converting PIC/IOAPIC mutexes into spinlocks?

>>> Works for me, but on large guests the spinning will be noticeable.
>>> I believe.
>>>  
>> For interrupts going through IOPIC, but we know this is not scalable
>> anyway.
>>
> 
> Yes.  We also wanted to convert the ioapic/pic to spinlocks so we could
> queue the interrupt from the PIT directly instead of using
> KVM_REQ_PENDING_TIMER which keeps confusing me.  Chris Lalancette posted
> a patchset for this a while back but it was never completed.

Yeah, I'm sorry I never completed it.  It turns out that with the HPET
changes that went in around the time I was looking at it, that set of
patches wasn't really required to fix the problem I was seeing with kdump.

That being said, if it's useful to somebody, I can repost the patches
(though they are woefully out-of-date now).  Let me know if you want
to see them again.

-- 
Chris Lalancette


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 14:53, Avi Kivity wrote:

> On 04/23/2010 03:46 PM, Arnd Bergmann wrote:
>> On Friday 23 April 2010, Avi Kivity wrote:
>>   
>>> On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
>>> 
   
> Using a 64-bit integer avoids the problem (though perhaps not sufficient
> for s390, Arnd?)
> 
> 
 When there is only a __u64 for the address, it will work on s390 as well,
 gcc is smart enough to clear the upper bit on a cast from long to pointer.
 
   
>>> Ah, much more convenient than compat_ioctl.  I assume it part of the
>>> ABI, not a gccism?
>>> 
>> I don't think it's part of the ABI, but it's required to guarantee
>> that code like this works:
>> 
>> int compare_pointer(void *a, void *b)
>> {
>>  unsigned long ai = (unsigned long)a, bi = (unsigned long)b;
>> 
>>  return ai == bi; /* true if a and b point to the same object */
>> }
>> 
>> We certainly rely on this already.
>>   
> 
> Ah so the 31st bit is optional as far as userspace is concerned?  What does 
> it mean? (just curious)

The 0x80000000 bit declares that a pointer is in 24-bit mode, so that 
applications can use the spare upper bits for random data.

See http://en.wikipedia.org/wiki/31-bit for an explanation.

Alex



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:46 PM, Arnd Bergmann wrote:

On Friday 23 April 2010, Avi Kivity wrote:
   

On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
 
   

Using a 64-bit integer avoids the problem (though perhaps not sufficient
for s390, Arnd?)

 

When there is only a __u64 for the address, it will work on s390 as well,
gcc is smart enough to clear the upper bit on a cast from long to pointer.

   

Ah, much more convenient than compat_ioctl.  I assume it's part of the
ABI, not a gccism?
 

I don't think it's part of the ABI, but it's required to guarantee
that code like this works:

int compare_pointer(void *a, void *b)
{
unsigned long ai = (unsigned long)a, bi = (unsigned long)b;

return ai == bi; /* true if a and b point to the same object */
}

We certainly rely on this already.
   


Ah so the 31st bit is optional as far as userspace is concerned?  What 
does it mean? (just curious)


What happens on the opposite conversion?  Is it restored?

What about

int compare_pointer(void *a, void *b)
{
unsigned long ai = (unsigned long)a;
void *aia = (void *)ai;

return aia == b; /* true if a and b point to the same object */
}


Does gcc mask the bit in pointer comparisons as well?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Arnd Bergmann
On Friday 23 April 2010, Avi Kivity wrote:
> On 04/23/2010 03:27 PM, Arnd Bergmann wrote:
> >
> >> Using a 64-bit integer avoids the problem (though perhaps not sufficient
> >> for s390, Arnd?)
> >>  
> > When there is only a __u64 for the address, it will work on s390 as well,
> > gcc is smart enough to clear the upper bit on a cast from long to pointer.
> >
> 
> Ah, much more convenient than compat_ioctl.  I assume it's part of the
> ABI, not a gccism?

I don't think it's part of the ABI, but it's required to guarantee
that code like this works:

int compare_pointer(void *a, void *b)
{
unsigned long ai = (unsigned long)a, bi = (unsigned long)b;

return ai == bi; /* true if a and b point to the same object */
}

We certainly rely on this already.

Arnd


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:27 PM, Arnd Bergmann wrote:



Using a 64-bit integer avoids the problem (though perhaps not sufficient
for s390, Arnd?)
 

When there is only a __u64 for the address, it will work on s390 as well,
gcc is smart enough to clear the upper bit on a cast from long to pointer.
   


Ah, much more convenient than compat_ioctl.  I assume it's part of the
ABI, not a gccism?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Arnd Bergmann
On Friday 23 April 2010, Avi Kivity wrote:
> On 04/23/2010 01:20 PM, Alexander Graf wrote:
> >
> >> I would say the reason is that if we did not convert the user-space 
> >> pointer to
> >> a "void *" kvm_get_dirty_log() would end up copying the dirty log to
> >>
> >> (log->dirty_bitmap<<  32) | 0x
> >>  
> > Well yes, that was the problem. If we always set the __u64 value to the 
> > pointer we're safe though.
> >
> > union {
> >void *p;
> >__u64 q;
> > }
> >
> > void x(void *r)
> > {
> >// breaks:
> >p = r;
> >
> >// works:
> >q = (ulong)r;
> > }
> >
> 
> In that case it's better to avoid p altogether, since users will 
> naturally assign to the pointer.

Right.
 
> Using a 64-bit integer avoids the problem (though perhaps not sufficient 
> for s390, Arnd?)

When there is only a __u64 for the address, it will work on s390 as well,
gcc is smart enough to clear the upper bit on a cast from long to pointer.

The simple rule is to never put any 'long' or pointer into data structures
that you pass to an ioctl, and to add padding to multiples of 64 bit to
align the data structure for the x86 alignment problem.
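
Concretely, for the structure discussed in this thread that boils down to
something like the following (a sketch; making dirty_bitmap a plain __u64
follows Avi's suggestion, the variable names are illustrative):

    /* userspace: store the buffer address in the __u64 field explicitly */
    log.dirty_bitmap = (__u64)(unsigned long)bitmap_buf;

    /* kernel: convert it back before touching user memory */
    void __user *bitmap = (void __user *)(unsigned long)log.dirty_bitmap;

The double cast through unsigned long is exactly what makes the s390 31-bit
case work, per the gcc behaviour described above.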

Arnd


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 13:57, Avi Kivity wrote:

> On 04/23/2010 01:20 PM, Alexander Graf wrote:
>> 
>>> I would say the reason is that if we did not convert the user-space pointer 
>>> to
>>> a "void *" kvm_get_dirty_log() would end up copying the dirty log to
>>> 
>>> (log->dirty_bitmap<<  32) | 0x
>>> 
>> Well yes, that was the problem. If we always set the __u64 value to the 
>> pointer we're safe though.
>> 
>> union {
>>   void *p;
>>   __u64 q;
>> }
>> 
>> void x(void *r)
>> {
>>   // breaks:
>>   p = r;
>> 
>>   // works:
>>   q = (ulong)r;
>> }
>>   
> 
> In that case it's better to avoid p altogether, since users will naturally 
> assign to the pointer.
> 
> Using a 64-bit integer avoids the problem (though perhaps not sufficient for 
> s390, Arnd?)

We only support 64-bit on S390, so we should be safe here. Even if not, compat 
mode has 31-bit pointers, so padding them to 64 bits should work too.


Alex



Re: [PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code

2010-04-23 Thread Avi Kivity

On 04/23/2010 03:05 PM, Xiao Guangrong wrote:




@@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu,
gva_t gva)
   ((level == PT_DIRECTORY_LEVEL&&   is_large_pte(*sptep))) ||
   ((level == PT_PDPE_LEVEL&&   is_large_pte(*sptep)))) {
   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+int offset = 0;
+
+if (PTTYPE == 32)
+offset = sp->role.quadrant<<   PT64_LEVEL_BITS;;

   

Wrong for PT_DIRECTORY_LEVEL (should be q<<  8).  Also, too many
semicolons.

 

I guess you mean 'PT64_LEVEL_BITS' not 'PT_DIRECTORY_LEVEL' here :-)
   


No, I mean if level == PT_DIRECTORY_LEVEL, then we want role.quadrant << 
8, not 9.



Should it be q << 8 here? That is hard to understand; take level = 1 for
example: a 32-bit guest PTE page table maps a 2^(10+12) range, while a PAE PTE
page table maps a 2^(9+12) range,


For level == PT_DIRECTORY_LEVEL, quadrant is in the range 0..3.  Each sp 
maps 1GB, while the guest page table maps 4GB.  So the upper two bits 
become the quadrant.



  so I think it's quadrant << 9 here, and other
functions like FNAME(prefetch_page) and FNAME(sync_page) also use q << 9
   


They only work for PT_PAGE_TABLE_LEVEL, so for them 9 is correct.
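
To put the geometry in one place (my summary of the explanation above, not
code from the patch), a helper along these lines would capture the intended
shifts for 32-bit non-PAE guests:

    /* hypothetical helper: offset (in guest ptes) of this shadow page's
     * slice within the guest page table it shadows */
    static int guest_pte_offset(struct kvm_mmu_page *sp, int level)
    {
            if (level == PT_PAGE_TABLE_LEVEL)
                    /* guest PT: 1024 ptes, each sp covers 512 -> 2 quadrants of 512 */
                    return sp->role.quadrant << 9;
            /* guest PD: 1024 pdes over 4GB, each sp covers 1GB -> 4 quadrants of 256 */
            return sp->role.quadrant << 8;
    }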

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 8/10] KVM MMU: allow more page become unsync at getting sp time

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:13 AM, Xiao Guangrong wrote:

Allow more pages to become unsync at shadow-page allocation time: if we need
to create a new shadow page for a gfn but it is not allowed to be unsync
(level > 0), we should unsync all of the gfn's unsync pages
   


This is something I wanted for a long time.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code

2010-04-23 Thread Xiao Guangrong


Avi Kivity wrote:
> On 04/22/2010 09:12 AM, Xiao Guangrong wrote:
>> If the guest is 32-bit, we should use 'quadrant' to adjust gpa
>> offset
>>
>>
> 
> Good catch.  Only affects kvm_mmu_pte_write(), so I don't think this had
> ill effects other than not prefetching the correct address?
> 

Yes

>> @@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu,
>> gva_t gva)
>>   ((level == PT_DIRECTORY_LEVEL&&  is_large_pte(*sptep))) ||
>>   ((level == PT_PDPE_LEVEL&&  is_large_pte(*sptep)))) {
>>   struct kvm_mmu_page *sp = page_header(__pa(sptep));
>> +int offset = 0;
>> +
>> +if (PTTYPE == 32)
>> +offset = sp->role.quadrant<<  PT64_LEVEL_BITS;;
>>
> 
> Wrong for PT_DIRECTORY_LEVEL (should be q << 8).  Also, too many
> semicolons.
> 

I guess you mean 'PT64_LEVEL_BITS' not 'PT_DIRECTORY_LEVEL' here :-)

Should it be q << 8 here? That is hard to understand; take level = 1 for
example: a 32-bit guest PTE page table maps a 2^(10+12) range, while a PAE PTE
page table maps a 2^(9+12) range, so I think it's quadrant << 9 here, and other
functions like FNAME(prefetch_page) and FNAME(sync_page) also use q << 9.

Sorry for the double semicolons here, will fix it

Thanks,
Xiao


Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/22/2010 12:34 PM, Takuya Yoshikawa wrote:

Thanks, now I know the basic rules about kvm/api.

(2010/04/21 20:46), Avi Kivity wrote:




+Type: vm ioctl
+Parameters: struct kvm_dirty_log (in/out)
+Returns: 0 on success, -1 on error
+
+/* for KVM_SWITCH_DIRTY_LOG */
+struct kvm_dirty_log {
+ __u32 slot;
+ __u32 padding;


Please put a flags field here (and verify it is zero in the ioctl
implementation). This allows us to extend it later.


+ union {
+ void __user *dirty_bitmap; /* one bit per page */
+ __u64 addr;
+ };


Just make dirty_bitmap a __u64. With the union there is the risk that
someone forgets to zero the structure so we get some random bits in the
pointer, and also issues with big endian and s390 compat.

Reserve some extra space here for future expansion.

Hm. I see that this is the existing kvm_dirty_log structure. So we can't
play with it, ignore my comments about it.


So, introducing a new structure to export (and get) two bitmap addresses
with a flag is the way to go?

1. Qemu calls an ioctl to get the two addresses.
2. Qemu calls an ioctl to make KVM switch the dirty bitmaps.
   --> in this case, qemu has to change the address it got in step 1.
   ...
3. Qemu calls an ioctl (we can reuse 1. with a different command flag) to
free the bitmaps.



In my plan, we don't let qemu malloc() dirty bitmaps: at least, I want to make
that another patch set because this patch set already has some dependencies on
other issues.


But, yes, I can design the struct with future expansion in mind if needed.


I guess it's better, since this patch is a "future expansion" of
KVM_GET_DIRTY_LOG.  If we had a flags field and some room there, we
could keep the old ioctl.
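
One possible shape for such an extensible structure (a sketch only; the struct
name, the amount of reserved space and the new ioctl's semantics are
assumptions, not part of this series):

    struct kvm_dirty_log2 {
            __u32 slot;
            __u32 flags;            /* must be zero today                  */
            __u64 dirty_bitmap;     /* userspace address stored as a __u64 */
            __u64 reserved[6];      /* room for future expansion           */
    };

    /* in the ioctl handler: refuse unknown flags so they stay meaningful */
    if (log->flags)
            return -EINVAL;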


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Avi Kivity

On 04/23/2010 01:20 PM, Alexander Graf wrote:



I would say the reason is that if we did not convert the user-space pointer to
a "void *" kvm_get_dirty_log() would end up copying the dirty log to

(log->dirty_bitmap<<  32) | 0x
 

Well yes, that was the problem. If we always set the __u64 value to the pointer 
we're safe though.

union {
   void *p;
   __u64 q;
}

void x(void *r)
{
   // breaks:
   p = r;

   // works:
   q = (ulong)r;
}
   


In that case it's better to avoid p altogether, since users will 
naturally assign to the pointer.


Using a 64-bit integer avoids the problem (though perhaps not sufficient 
for s390, Arnd?)


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Avi Kivity

On 04/23/2010 02:14 PM, Takuya Yoshikawa wrote:



Do you have performance numbers? I'm interested in both measurements of
KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for
example, total guest throughput improvement under Kemari).



Currently, I'm just checking the performance visually.
  - on laptops, we can feel the speed directly: my favorite is installing
ubuntu or debian by alternate installers

  - live migration with heavy work load

Now that I've got the overall design, I want to measure the 
performance by numbers.



About performance under Kemari: we(oss.ntt.co.jp staffs: me and 
Fernando) are
now concentrating on improving the basic live-migration 
infrastructures and

they(lab.ntt.co.jp staffs) are working hard for building Kemari itself.

  - We are also interested in using live-migration with HA software 
and this needs

light, stable live-migration: same as Kemari!



General live migration improvements are also interesting; I just expect the
improvement would be more pronounced under Kemari.  Looking forward to
seeing the results.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Avi Kivity

On 04/23/2010 02:27 PM, Avi Kivity wrote:

On 04/22/2010 09:12 AM, Xiao Guangrong wrote:

Using '!sp->role.cr4_pae' replaces 'PTTYPE == 32' and using
'pte_size = sp->role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t).

Then there is no need to compile this code twice.

I think we should keep it in - kvm_mmu_pte_write() calls back to
FNAME(update_pte); we could make the call directly from here to speed
things up, since we already have the spte and don't need to look it up.




I see you do this in patches 9, 10 - but is it possible to use 
update_pte directly?  I think we'll need to make 
guess_page_from_pte_write() part of paging_tmpl.h (in general anything 
that depends on pte size is better off in paging_tmpl.h).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:13 AM, Xiao Guangrong wrote:

If we have a new mapping to an unsync page (i.e., we add a new parent), just
update the page from sp->gfn but do not write-protect the gfn; and if we need
to create a new shadow page from sp->gfn, we should sync it
   


Sorry, I don't understand this patch.  Can you clarify?  for example, 
the situation before adding the new parent, what the guest action was, 
and the new situation.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Yoshiaki Tamura
2010/4/23 Takuya Yoshikawa :
> (2010/04/23 19:28), Avi Kivity wrote:
>
>>>
>>> OK, I will do in the next version. In this RFC, I would be happy if I can
>>> know the overall design is right or not.
>>>
>>
>> Everything looks reasonable to me.
>>
>
> Thank you!
>
>> Do you have performance numbers? I'm interested in both measurements of
>> KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for
>> example, total guest throughput improvement under Kemari).
>>
>
> Currently, I'm just checking the performance visually.
>  - on laptops, we can feel the speed directly: my favorite is installing
>    ubuntu or debian by alternate installers
>
>  - live migration with heavy work load
>
> Now that I've got the overall design, I want to measure the performance by
> numbers.
>
>
> About performance under Kemari: we(oss.ntt.co.jp staffs: me and Fernando)
> are
> now concentrating on improving the basic live-migration infrastructures and
> they(lab.ntt.co.jp staffs) are working hard for building Kemari itself.
>
>  - We are also interested in using live-migration with HA software and this
> needs
>    light, stable live-migration: same as Kemari!
>
>
> So measuring Kemari's performance with our patch-set will need some more
> time,
> ** How about Tamura-san? **

You can measure your patch quite easily.

1. Run a simple program that mallocs a huge buffer (2GB for example), then
goes into an infinite loop that touches pages in a deliberately bad pattern,
e.g. only the odd page numbers (see the sketch below).
2. Start a live migration, which usually won't finish. Similar to Kemari.
3. Measure the response time of your ioctl().

I used 1 and 2 to measure the dirty physmap speed-up in QEMU.
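
A minimal version of the dirtying program from step 1 (my sketch, assuming 4KB
pages and a 2GB working set):

    #include <stdlib.h>

    #define SZ   (2UL << 30)     /* 2 GB working set */
    #define PAGE 4096UL          /* assume 4KB pages */

    int main(void)
    {
            char *buf = malloc(SZ);
            unsigned long off;

            if (!buf)
                    return 1;
            for (;;)                              /* never terminates, as in step 1 */
                    for (off = PAGE; off < SZ; off += 2 * PAGE)
                            buf[off]++;           /* dirty only odd-numbered pages */
            return 0;
    }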

Thanks,

Yoshi

> So at this stage, I'll show you performance improvement(I hope to improve)
> by
> another test cases.
>
> If possible next week!
>
> Thanks,
>  Takuya
>


Re: [PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:12 AM, Xiao Guangrong wrote:

Using '!sp->role.cr4_pae' replaces 'PTTYPE == 32' and using
'pte_size = sp->role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t).

Then there is no need to compile this code twice.

Signed-off-by: Xiao Guangrong
---
  arch/x86/kvm/mmu.c |   60 ++-
  arch/x86/kvm/paging_tmpl.h |   56 -
  2 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index abf8bd4..fac7c09 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2256,6 +2256,62 @@ static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 
gpte, int level)
return (gpte&  vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
  }

+static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
+{
+   struct kvm_shadow_walk_iterator iterator;
+   gpa_t pte_gpa = -1;
+   int level;
+   u64 *sptep;
+   int need_flush = 0;
+   unsigned pte_size = 0;
+
+   spin_lock(&vcpu->kvm->mmu_lock);
+
+   for_each_shadow_entry(vcpu, gva, iterator) {
+   level = iterator.level;
+   sptep = iterator.sptep;
+
+   if (level == PT_PAGE_TABLE_LEVEL  ||
+   ((level == PT_DIRECTORY_LEVEL&&  is_large_pte(*sptep))) ||
+   ((level == PT_PDPE_LEVEL&&  is_large_pte(*sptep)))) {
+   struct kvm_mmu_page *sp = page_header(__pa(sptep));
+   int offset = 0;
+
+   if (!sp->role.cr4_pae)
+   offset = sp->role.quadrant<<  PT64_LEVEL_BITS;;
+   pte_size = sp->role.cr4_pae ? 8 : 4;
+   pte_gpa = (sp->gfn<<  PAGE_SHIFT);
+   pte_gpa += (sptep - sp->spt + offset) * pte_size;
+
+   if (is_shadow_present_pte(*sptep)) {
+   rmap_remove(vcpu->kvm, sptep);
+   if (is_large_pte(*sptep))
+   --vcpu->kvm->stat.lpages;
+   need_flush = 1;
+   }
+   __set_spte(sptep, shadow_trap_nonpresent_pte);
+   break;
+   }
+
+   if (!is_shadow_present_pte(*sptep))
+   break;
+   }
+
+   if (need_flush)
+   kvm_flush_remote_tlbs(vcpu->kvm);
+
+   atomic_inc(&vcpu->kvm->arch.invlpg_counter);
+
+   spin_unlock(&vcpu->kvm->mmu_lock);
+
+   if (pte_gpa == -1)
+   return;
+
+   if (mmu_topup_memory_caches(vcpu))
+   return;
+   kvm_mmu_pte_write(vcpu, pte_gpa, NULL, pte_size, 0);
+}
+
   


I think we should keep it in - kvm_mmu_pte_write() calls back to
FNAME(update_pte); we could make the call directly from here to speed
things up, since we already have the spte and don't need to look it up.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code

2010-04-23 Thread Avi Kivity

On 04/22/2010 09:12 AM, Xiao Guangrong wrote:

If the guest is 32-bit, we should use 'quadrant' to adjust gpa
offset

   


Good catch.  Only affects kvm_mmu_pte_write(), so I don't think this had 
ill effects other than not prefetching the correct address?



@@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
((level == PT_DIRECTORY_LEVEL&&  is_large_pte(*sptep))) ||
((level == PT_PDPE_LEVEL&&  is_large_pte(*sptep)))) {
struct kvm_mmu_page *sp = page_header(__pa(sptep));
+   int offset = 0;
+
+   if (PTTYPE == 32)
+   offset = sp->role.quadrant<<  PT64_LEVEL_BITS;;
   


Wrong for PT_DIRECTORY_LEVEL (should be q << 8).  Also, too many 
semicolons.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Takuya Yoshikawa

(2010/04/23 19:28), Avi Kivity wrote:



OK, I will do in the next version. In this RFC, I would be happy if I can
know the overall design is right or not.



Everything looks reasonable to me.



Thank you!


Do you have performance numbers? I'm interested in both measurements of
KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for
example, total guest throughput improvement under Kemari).



Currently, I'm just checking the performance visually.
  - on laptops, we can feel the speed directly: my favorite is installing
ubuntu or debian by alternate installers

  - live migration with heavy work load

Now that I've got the overall design, I want to measure the performance by 
numbers.


About performance under Kemari: we(oss.ntt.co.jp staffs: me and Fernando) are
now concentrating on improving the basic live-migration infrastructures and
they(lab.ntt.co.jp staffs) are working hard for building Kemari itself.

  - We are also interested in using live-migration with HA software and this 
needs
light, stable live-migration: same as Kemari!


So measuring Kemari's performance with our patch-set will need some more time,
** How about Tamura-san? **


So at this stage, I'll show you the performance improvement (which I hope to
achieve) with some other test cases.

If possible next week!

Thanks,
  Takuya



Re: [PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes

2010-04-23 Thread Avi Kivity

On 04/23/2010 12:15 AM, Eric Northup wrote:

I've been reading the x86's mmu.c recently and had been wondering
about something.  Avi's recent mmu documentation (thanks!) seems to
have confirmed my understanding of how the shadow paging is supposed
to be working.


Wasn't it a lot more fun to understand it from the code?  Reading the
documentation is way too easy.



In TDP mode, when mmu_alloc_roots() calls
kvm_mmu_get_page(), why does it pass (vcpu->arch.cr3>>  PAGE_SHIFT) or
(vcpu->arch.mmu.pae_root[i]) as gfn?
   


Historical accident.  Luckily, no harmful effects other than wasting a 
bit of cpu cycles every now and then.



It seems to me that in TDP mode, gfn should be either zero for the
root page table, or 0/1GB/2GB/3GB (for PAE page tables).

The existing behavior can lead to multiple, semantically-identical TDP
roots being created by mmu_alloc_roots, depending on the VCPU's CR3 at
the time that mmu_alloc_roots was called.  But the nested page tables
should be* independent of the VCPU state. That wastes some memory and
causes extra page faults while populating the extra copies of the page
tables.
   


Yeah.


*assuming that we aren't modeling per-VCPU state that might change the
physical address map as seen by that VCPU, such as setting the APIC
base to an address overlapping RAM.
   


We don't actually support that.


All feedback would be welcome, since I'm new to this system!  A
strawman patch follows.

   


Patch is correct, but I have already fixed this in a more extensive 
patch set (that also folds the 32-bit and 64-bit cases together, etc.) 
and I'm too lazy to rebase on top of yours.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-23 Thread Avi Kivity

On 04/22/2010 10:55 PM, Gleb Natapov wrote:




What about converting PIC/IOAPIC mutexes into spinlocks?
   

Works for me, but on large guests the spinning will be noticeable.
I believe.
 

For interrupts going through the IOAPIC, but we know this is not scalable
anyway.
   


Yes.  We also wanted to convert the ioapic/pic to spinlocks so we could 
queue the interrupt from the PIT directly instead of using 
KVM_REQ_PENDING_TIMER which keeps confusing me.  Chris Lalancette posted 
a patchset for this a while back but it was never completed.


I'm not really happy with adding lots of spin_lock_irqsave()s though, 
especially on the ioapic which may iterate over all vcpus (not worried 
about scaling, but about a malicious guest hurting host latency).


An alternative is to make kvm_set_irq() irq-safe: if MSI, do things
immediately, otherwise post a work item.  So we can call kvm_set_irq()
directly from the interrupt.
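
A rough shape of that first alternative (entirely hypothetical; the helper and
field names are mine, not from any posted patch):

    struct deferred_irq {
            struct work_struct work;
            struct kvm *kvm;
            int irq_source_id;
            u32 irq;
            int level;
    };

    static void deferred_set_irq(struct work_struct *work)
    {
            struct deferred_irq *d = container_of(work, struct deferred_irq, work);

            kvm_set_irq(d->kvm, d->irq_source_id, d->irq, d->level);
            kfree(d);
    }

    /* callable from hardirq context: MSI goes straight through, the
     * ioapic/pic path is bounced to process context */
    static void kvm_set_irq_atomic(struct kvm *kvm, int irq_source_id,
                                   u32 irq, int level, bool is_msi)
    {
            struct deferred_irq *d;

            if (is_msi) {
                    kvm_set_irq(kvm, irq_source_id, irq, level);
                    return;
            }
            d = kzalloc(sizeof(*d), GFP_ATOMIC);
            if (!d)
                    return;
            INIT_WORK(&d->work, deferred_set_irq);
            d->kvm = kvm;
            d->irq_source_id = irq_source_id;
            d->irq = irq;
            d->level = level;
            schedule_work(&d->work);
    }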


Alternative alternative (perhaps better for short term): switch 
assigned_dev_lock to a mutex (we're already in a work handler, no need 
for spinlock).  The race between the irq and removal of an assigned 
device is closed by free_irq():



 lock
 mark assigned device as going away
 unlock
 free_irq()
 actually kill it

like invalid mmu pages.

--

Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH V5 1/3] perf & kvm: Enhance perf to collect KVM guest os statistics from host side

2010-04-23 Thread Avi Kivity

On 04/22/2010 01:27 PM, Liu Yu-B13201 wrote:


I met this error when built kernel. Anything wrong?

   CC  init/main.o
In file included from include/linux/ftrace_event.h:8,
  from include/trace/syscall.h:6,
  from include/linux/syscalls.h:75,
  from init/main.c:16:
include/linux/perf_event.h: In function 'perf_register_guest_info_callbacks':
include/linux/perf_event.h:1019: error: parameter name omitted
include/linux/perf_event.h: In function 'perf_unregister_guest_info_callbacks':
include/linux/perf_event.h:1021: error: parameter name omitted
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2
   


I merged tip/perf/core which may fix this.  Find it in kvm.git next branch.
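
(For reference, gcc's "parameter name omitted" error means a function
definition, as opposed to a mere declaration, has an unnamed parameter, which
C forbids; the inline stubs for the !CONFIG_PERF_EVENTS case are the likely
culprit.  A sketch of the shape of the fix, not the actual header:)

    /* breaks: a definition with an unnamed parameter */
    static inline void
    perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *) { }

    /* builds: give the parameter a name */
    static inline void
    perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs) { }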

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH] KVM: no need to test bit before setting dirty bit these days

2010-04-23 Thread Avi Kivity

On 04/23/2010 11:48 AM, Takuya Yoshikawa wrote:

As Avi pointed out, the test-bit part in mark_page_dirty() was important
in the days of shadow paging, but currently EPT and NPT have already become
common and the chance of faulting a page more than once per iteration is
small. So let's remove the test bit to avoid the extra access.
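
The change itself is tiny; roughly (paraphrased from the description above;
the exact helper names in mark_page_dirty() may differ):

    /* before: read the bit first to avoid a needless write */
    if (!test_bit(rel_gfn, memslot->dirty_bitmap))
            set_bit(rel_gfn, memslot->dirty_bitmap);

    /* after: just set it; with EPT/NPT a page rarely faults twice in one
     * iteration, so the extra read buys nothing */
    set_bit(rel_gfn, memslot->dirty_bitmap);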
   


Applied, thanks.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space

2010-04-23 Thread Avi Kivity

On 04/22/2010 12:07 PM, Takuya Yoshikawa wrote:





+ slots->memslots[i].dirty_bitmap = NULL;
+ slots->memslots[i].dirty_bitmap_old = NULL;
kvm_free_physmem_slot(&slots->memslots[i], NULL);
+ }


+/*
+ * Please use generic *_user bitops once they become available.
+ * Be careful setting the bit won't be done atomically.
+ */


Please introduce the user bitops as part of this patchset.
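
One possible shape for such a helper (a hypothetical sketch, not the requested
patch): a non-atomic set_bit on a user-space buffer.

    static int set_bit_user_non_atomic(int nr, void __user *addr)
    {
            u8 __user *p = (u8 __user *)addr + nr / 8;
            u8 val;

            if (get_user(val, p))
                    return -EFAULT;
            val |= 1 << (nr % 8);
            if (put_user(val, p))
                    return -EFAULT;
            return 0;
    }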



OK, I will do in the next version. In this RFC, I would be happy if I can
know the overall design is right or not.



Everything looks reasonable to me.

Do you have performance numbers?  I'm interested in both measurements of 
KVM_SWITCH_DIRTY_LOG under various conditions and macro benchmarks (for 
example, total guest throughput improvement under Kemari).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 3/6] KVM: introduce a wrapper function to copy dirty bitmaps to user space

2010-04-23 Thread Avi Kivity

On 04/22/2010 11:57 AM, Takuya Yoshikawa wrote:


And about x86_32 copy_in_user().

They are using copy_user_generic() which is implemented only for 64bit.

So I'll show them current copy_from_user() and copy_to_user() version and
see their response.

If rejected, though I hope not, please give me another option, OK?



If we can justify it, I doubt it will be rejected, but in case it does, 
we'll work something out.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH] KVM: Document mmu

2010-04-23 Thread Avi Kivity

On 04/23/2010 10:12 AM, Gui Jianfeng wrote:



+Guest memory (gpa) is part of user address space of the process that is using
+kvm.  Userspace defines the translation between guest addresses and user
+addresses (gpa->gva); note that two gpas may alias to the same gva, but not
 

Do you mean (gpa->hva)?
   


I do.  Fixed.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH] KVM: Document mmu

2010-04-23 Thread Avi Kivity

On 04/22/2010 11:16 PM, Marcelo Tosatti wrote:


Looks good otherwise. Perhaps add a pointer to Joerg's NPT slides,
although they're AMD specific.
   


Fixed all the comments, added a "Further reading" section and applied.
Note that this is still incomplete (example: large pages); patches
(especially to Documentation/kvm/locking.txt) are more than welcome.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Alexander Graf

On 23.04.2010, at 12:17, Fernando Luis Vázquez Cao wrote:

> On 04/23/2010 08:29 AM, Alexander Graf wrote:
>> 
>> On 22.04.2010, at 08:09, Fernando Luis Vázquez Cao wrote:
>> 
>>> On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
 On 04/21/2010 06:41 PM, Alexander Graf wrote:
> On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:
> 
>> On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
>>> @@ -318,7 +318,7 @@ struct kvm_dirty_log {
>>> __u32 padding1;
>>> union {
>>> void __user *dirty_bitmap; /* one bit per page */
>>> -   __u64 padding2;
>>> +   __u64 addr;
>> 
>> This can break on x86_32 and x86_64-compat. addr is a long not a __u64.
> 
> So the high 32 bits are zero. Where's the problem?
 
 If we are careful enough to cast the addr appropriately we should be fine,
 even if we keep the padding field in the union. I am not saying that it
 breaks 32 architectures but that it can potentially be problematic.
 
>>> +   case KVM_SWITCH_DIRTY_LOG: {
>>> +   struct kvm_dirty_log log;
>>> +
>>> +   r = -EFAULT;
>>> +   if (copy_from_user(&log, argp, sizeof log))
>>> +   goto out;
>>> +   r = kvm_vm_ioctl_switch_dirty_log(kvm, &log);
>>> +   if (r)
>>> +   goto out;
>>> +   r = -EFAULT;
>>> +   if (copy_to_user(argp, &log, sizeof log))
>>> +   goto out;
>>> +   r = 0;
>>> +   break;
>>> +   }
>> 
>> In x86_64-compat mode we are handling 32bit user-space addresses
>> so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.
> 
> The compat code just forwards everything to the generic ioctls.
 
 The compat code uses struct compat_kvm_dirty_log instead of
 struct kvm_dirty_log to communicate with user space so
 the necessary conversions needs to be done before invoking
 the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl).
 
 By the way we probably should move the definition of struct
 compat_kvm_dirty_log to a header file.
>>> 
>>> It seems that it was you and Arnd who added the kvm_vm compat ioctl :-).
>>> Are you considering a different approach to tackle the issues that we
>>> have with a big-endian userspace?
>> 
>> IIRC the issue was a pointer inside of a nested structure, no?
> 
> I would say the reason is that if we did not convert the user-space pointer to
> a "void *" kvm_get_dirty_log() would end up copying the dirty log to
> 
> (log->dirty_bitmap << 32) | 0x

Well yes, that was the problem. If we always set the __u64 value to the pointer 
we're safe though.

union {
  void *p;
  __u64 q;
}

void x(void *r)
{
  // breaks:
  p = r;

  // works:
  q = (ulong)r;
}

Alex



Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps

2010-04-23 Thread Fernando Luis Vázquez Cao
On 04/23/2010 08:29 AM, Alexander Graf wrote:
> 
> On 22.04.2010, at 08:09, Fernando Luis Vázquez Cao wrote:
> 
>> On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
>>> On 04/21/2010 06:41 PM, Alexander Graf wrote:
 On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:

> On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
>> @@ -318,7 +318,7 @@ struct kvm_dirty_log {
>>  __u32 padding1;
>>  union {
>>  void __user *dirty_bitmap; /* one bit per page */
>> -__u64 padding2;
>> +__u64 addr;
>
> This can break on x86_32 and x86_64-compat. addr is a long not a __u64.

 So the high 32 bits are zero. Where's the problem?
>>>
>>> If we are careful enough to cast the addr appropriately we should be fine,
>>> even if we keep the padding field in the union. I am not saying that it
>>> breaks 32 architectures but that it can potentially be problematic.
>>>
>> +case KVM_SWITCH_DIRTY_LOG: {
>> +struct kvm_dirty_log log;
>> +
>> +r = -EFAULT;
>> +if (copy_from_user(&log, argp, sizeof log))
>> +goto out;
>> +r = kvm_vm_ioctl_switch_dirty_log(kvm, &log);
>> +if (r)
>> +goto out;
>> +r = -EFAULT;
>> +if (copy_to_user(argp, &log, sizeof log))
>> +goto out;
>> +r = 0;
>> +break;
>> +}
>
> In x86_64-compat mode we are handling 32bit user-space addresses
> so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.

 The compat code just forwards everything to the generic ioctls.
>>>
>>> The compat code uses struct compat_kvm_dirty_log instead of
>>> struct kvm_dirty_log to communicate with user space so
>>> the necessary conversions needs to be done before invoking
>>> the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl).
>>>
>>> By the way we probably should move the definition of struct
>>> compat_kvm_dirty_log to a header file.
>>
>> It seems that it was you and Arnd who added the kvm_vm compat ioctl :-).
>> Are you considering a different approach to tackle the issues that we
>> have with a big-endian userspace?
> 
> IIRC the issue was a pointer inside of a nested structure, no?

I would say the reason is that if we did not convert the user-space pointer to
a "void *" kvm_get_dirty_log() would end up copying the dirty log to

(log->dirty_bitmap << 32) | 0x

Am I missing something?
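
(For what it's worth, the existing KVM_GET_DIRTY_LOG handling in
kvm_vm_compat_ioctl() avoids exactly this by converting explicitly; a minimal
sketch, assuming the compat_kvm_dirty_log layout referenced above:)

    struct compat_kvm_dirty_log compat_log;
    struct kvm_dirty_log log;

    if (copy_from_user(&compat_log, argp, sizeof(compat_log)))
            return -EFAULT;
    log.slot = compat_log.slot;
    log.padding1 = compat_log.padding1;
    /* widen the 32-bit user pointer explicitly instead of relying on the union */
    log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);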

Thanks,

Fernando


Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Yoshiaki Tamura

Avi Kivity wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB. It would be useful if it could
be flexible.



Why is this needed? The real buffering is in the kernel anyways; this is
only used to reduce the number of write() syscalls.


This was introduced to buffer the transferred guest image transactionally on
the receiver side.  The sender doesn't use it.

In case of an intermediate state, we just discard this buffer.


Re: [RFC PATCH 04/20] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().

2010-04-23 Thread Avi Kivity

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Currently buf size is fixed at 32KB.  It would be useful if it could
be flexible.

   


Why is this needed?  The real buffering is in the kernel anyways; this 
is only used to reduce the number of write() syscalls.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


