Re: FreeBSD-amd64 fails to start with SMP on qemu-kvm

2013-01-13 Thread Gleb Natapov
On Fri, Jan 11, 2013 at 03:57:05PM +0100, Artur Samborski wrote:
 W dniu 09.01.2013 11:15, Gleb Natapov pisze:
 On Tue, Jan 08, 2013 at 09:45:35PM +0100, Artur Samborski wrote:
 W dniu 08.01.2013 00:00, Marcelo Tosatti pisze:
 On Mon, Jan 07, 2013 at 06:13:22PM +0100, Artur Samborski wrote:
 Hello,
 
 When I try to run FreeBSD-amd64 with more than 1 vcpu in qemu-kvm
 (Fedora Core 17), e.g. to run FreeBSD-9.0-RELEASE-amd64 with:
 
 qemu-kvm -m 1024m -cpu host -smp 2 -cdrom
 /storage/iso/FreeBSD-9.0-RELEASE-amd64-dvd1.iso
 
 it freezes KVM with:
 
 KVM internal error. Suberror: 1
 emulation failure
 RAX=80b0d4c0 RBX=0009f000 RCX=c080 RDX=
 RSI=d238 RDI= RBP= RSP=
 R8 = R9 = R10= R11=
 R12= R13= R14= R15=
 RIP=0009f076 RFL=00010086 [--S--P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   f300 DPL=3 DS16 [-WA]
 CS =0008   00209900 DPL=0 CS64 [--A]
 SS =9f00 0009f000  f300 DPL=3 DS16 [-WA]
 DS =0018   00c09300 DPL=0 DS   [-WA]
 FS =   f300 DPL=3 DS16 [-WA]
 GS =   f300 DPL=3 DS16 [-WA]
 LDT=   8200 DPL=0 LDT
 TR =   8b00 DPL=0 TSS64-busy
 GDT= 0009f080 0020
 IDT=  
 CR0=8011 CR2= CR3=0009c000 CR4=0030
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=0501
 Code=00 00 00 80 0f 22 c0 ea 70 f0 09 00 08 00 48 b8 c0 d4 b0 80
 ff ff ff ff ff e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 99 20 00 ff ff 00 00
 
 Artur,
 
 Can you check whether
 
 https://patchwork-mail.kernel.org/patch/1942681/
 
 fixes your problem?
 
 
 Hi, thanks for the reply.
 
 Unfortunately, the patch does not help. An attempt to start FreeBSD
 amd64 via qemu-kvm with an -smp parameter greater than 1 fails in
 exactly the same way as before.
 
 The patch was applied to the kernel from the 3.6.11-1.fc17.src.rpm package.
 
 Do I need some additional patches?
 
 
 Can you try the queue branch from kvm.git?
 
 
 Thanks for the advice - I used the kernel from kvm.git (queue branch) and was
 able to run a FreeBSD-amd64 guest with SMP on my kvm-host server without
 the previous problems. Unfortunately, after a few hours the server hung.
 The case requires further investigation, but generally speaking we have
 made progress. Unfortunately, experiments on the server are difficult,
 because it is used for everyday work.
 
Thanks for testing. This hang indeed looks like a completely different
problem. What workload was the VM running?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 2/2] vhost: handle polling errors

2013-01-13 Thread Michael S. Tsirkin
On Mon, Jan 07, 2013 at 11:04:32PM +0800, Jason Wang wrote:
 On 01/07/2013 10:55 PM, Michael S. Tsirkin wrote:
  On Mon, Jan 07, 2013 at 12:38:17PM +0800, Jason Wang wrote:
  On 01/06/2013 09:22 PM, Michael S. Tsirkin wrote:
  On Sun, Jan 06, 2013 at 03:18:38PM +0800, Jason Wang wrote:
  Polling errors were ignored by vhost/vhost_net; this may lead to a crash
  when trying to remove vhost from the waitqueue after polling has failed.
  Solve this problem by:
 
  - checking poll->wqh before trying to remove from the waitqueue
  - reporting an error when poll() returns POLLERR in vhost_poll_start()
  - reporting an error when vhost_poll_start() fails in
    vhost_vring_ioctl()/vhost_net_set_backend(), which is used to notify
    the failure to userspace.
  - reporting an error in the vhost_net data path when polling errors are met.
 
  After those changes, we can safely drop the tx polling state in
  vhost_net, since it is replaced by the check of poll->wqh.
 
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   drivers/vhost/net.c   |   74 
  
   drivers/vhost/vhost.c |   31 +++-
   drivers/vhost/vhost.h |2 +-
   3 files changed, 49 insertions(+), 58 deletions(-)
 
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index d10ad6f..125c1e5 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -64,20 +64,10 @@ enum {
   	VHOST_NET_VQ_MAX = 2,
   };
   
  -enum vhost_net_poll_state {
  -	VHOST_NET_POLL_DISABLED = 0,
  -	VHOST_NET_POLL_STARTED = 1,
  -	VHOST_NET_POLL_STOPPED = 2,
  -};
  -
   struct vhost_net {
   	struct vhost_dev dev;
   	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
   	struct vhost_poll poll[VHOST_NET_VQ_MAX];
  -	/* Tells us whether we are polling a socket for TX.
  -	 * We only do this when socket buffer fills up.
  -	 * Protected by tx vq lock. */
  -	enum vhost_net_poll_state tx_poll_state;
   	/* Number of TX recently submitted.
   	 * Protected by tx vq lock. */
   	unsigned tx_packets;
  @@ -155,24 +145,6 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
   	}
   }
   
  -/* Caller must have TX VQ lock */
  -static void tx_poll_stop(struct vhost_net *net)
  -{
  -	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
  -		return;
  -	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
  -	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
  -}
  -
  -/* Caller must have TX VQ lock */
  -static void tx_poll_start(struct vhost_net *net, struct socket *sock)
  -{
  -	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
  -		return;
  -	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
  -	net->tx_poll_state = VHOST_NET_POLL_STARTED;
  -}
  -
   /* In case of DMA done not in order in lower device driver for some reason.
    * upend_idx is used to track end of used idx, done_idx is used to track head
    * of used idx. Once lower device DMA done contiguously, we will signal KVM
  @@ -227,6 +199,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
   static void handle_tx(struct vhost_net *net)
   {
   	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
  +	struct vhost_poll *poll = net->poll + VHOST_NET_VQ_TX;
   	unsigned out, in, s;
   	int head;
   	struct msghdr msg = {
  @@ -252,7 +225,8 @@ static void handle_tx(struct vhost_net *net)
   	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
   	if (wmem >= sock->sk->sk_sndbuf) {
   		mutex_lock(&vq->mutex);
  -		tx_poll_start(net, sock);
  +		if (vhost_poll_start(poll, sock->file))
  +			vq_err(vq, "Fail to start TX polling\n");
  s/Fail/Failed/
 
  A question though: how can this happen? Could you clarify please?
  Maybe we can find a way to prevent this error?
   Two conditions under which I think this can happen:
  
   1) a buggy userspace disables a queue through TUNSETQUEUE
   2) the net device is gone
  
   For 1, it looks like we can delay the disabling until the refcnt goes to
   zero. For 2, it may need more changes.
   I'd expect keeping a socket reference would prevent both issues.
   Doesn't it?
  
  I don't think it works for 2: the socket doesn't hold a refcnt on the
  device, so the device can go away at any time. We could change
  this, but that was the behaviour before multiqueue support.

Hmm there's one scenario that does seem to
trigger this: queue can get disabled
and then poll fails.

Is this the only issue?


 
   Not sure it's worth doing this work;
   maybe a warning is enough, just like for other failures.
   With other failures, you normally can correct the error and then
   kick to have it restart. This is something that would not
   work here.
  
  If userspace is written correctly (e.g. passing an fd with correct state)
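
The crash mode under discussion - vhost_poll_stop() removing an entry that was never queued because polling failed - can be modeled in plain userspace C. This is an illustrative sketch only: the struct layout and the poll_err flag are stand-ins invented here, not the kernel's vhost API.

```c
#include <stddef.h>

/* Userspace model of the fix: vhost_poll_stop() must only remove the
 * poll from a waitqueue if it was actually queued.  wqh doubles as the
 * "are we on a waitqueue?" marker, which is the check the patch adds. */
struct vhost_poll {
	void *wqh;   /* non-NULL iff currently on a waitqueue */
	int stops;   /* counts real removals, for the demo only */
};

/* Model of vhost_poll_start(): poll_err stands in for a file whose
 * poll() reports POLLERR.  On failure the poll is never queued and the
 * error is propagated to the caller, as the patch proposes. */
static int vhost_poll_start(struct vhost_poll *poll, int poll_err)
{
	if (poll_err)
		return -1;              /* report the error, stay unqueued */
	poll->wqh = &poll->wqh;         /* pretend we joined a waitqueue */
	return 0;
}

/* The fix: check wqh before removing, so stopping after a failed start
 * is a harmless no-op instead of unlinking an entry never added. */
static void vhost_poll_stop(struct vhost_poll *poll)
{
	if (poll->wqh) {
		poll->stops++;          /* stands in for remove_wait_queue() */
		poll->wqh = NULL;
	}
}
```

Without the wqh check, the stop path would unconditionally call remove_wait_queue() on an entry that was never added, which is the crash the commit message describes.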
 

Re: [PATCH V3 2/2] vhost: handle polling errors

2013-01-13 Thread Michael S. Tsirkin
On Mon, Jan 07, 2013 at 11:04:32PM +0800, Jason Wang wrote:
 On 01/07/2013 10:55 PM, Michael S. Tsirkin wrote:
  On Mon, Jan 07, 2013 at 12:38:17PM +0800, Jason Wang wrote:
  On 01/06/2013 09:22 PM, Michael S. Tsirkin wrote:
  On Sun, Jan 06, 2013 at 03:18:38PM +0800, Jason Wang wrote:
  Polling errors were ignored by vhost/vhost_net; this may lead to a crash
  when trying to remove vhost from the waitqueue after polling has failed.
  Solve this problem by:
 
  - checking poll->wqh before trying to remove from the waitqueue
  - reporting an error when poll() returns POLLERR in vhost_poll_start()
  - reporting an error when vhost_poll_start() fails in
    vhost_vring_ioctl()/vhost_net_set_backend(), which is used to notify
    the failure to userspace.
  - reporting an error in the vhost_net data path when polling errors are met.
 
  After those changes, we can safely drop the tx polling state in
  vhost_net, since it is replaced by the check of poll->wqh.
 
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   drivers/vhost/net.c   |   74 
  
   drivers/vhost/vhost.c |   31 +++-
   drivers/vhost/vhost.h |2 +-
   3 files changed, 49 insertions(+), 58 deletions(-)
 
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index d10ad6f..125c1e5 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -64,20 +64,10 @@ enum {
   	VHOST_NET_VQ_MAX = 2,
   };
   
  -enum vhost_net_poll_state {
  -	VHOST_NET_POLL_DISABLED = 0,
  -	VHOST_NET_POLL_STARTED = 1,
  -	VHOST_NET_POLL_STOPPED = 2,
  -};
  -
   struct vhost_net {
   	struct vhost_dev dev;
   	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
   	struct vhost_poll poll[VHOST_NET_VQ_MAX];
  -	/* Tells us whether we are polling a socket for TX.
  -	 * We only do this when socket buffer fills up.
  -	 * Protected by tx vq lock. */
  -	enum vhost_net_poll_state tx_poll_state;
   	/* Number of TX recently submitted.
   	 * Protected by tx vq lock. */
   	unsigned tx_packets;
  @@ -155,24 +145,6 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
   	}
   }
   
  -/* Caller must have TX VQ lock */
  -static void tx_poll_stop(struct vhost_net *net)
  -{
  -	if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
  -		return;
  -	vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
  -	net->tx_poll_state = VHOST_NET_POLL_STOPPED;
  -}
  -
  -/* Caller must have TX VQ lock */
  -static void tx_poll_start(struct vhost_net *net, struct socket *sock)
  -{
  -	if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
  -		return;
  -	vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
  -	net->tx_poll_state = VHOST_NET_POLL_STARTED;
  -}
  -
   /* In case of DMA done not in order in lower device driver for some reason.
    * upend_idx is used to track end of used idx, done_idx is used to track head
    * of used idx. Once lower device DMA done contiguously, we will signal KVM
  @@ -227,6 +199,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
   static void handle_tx(struct vhost_net *net)
   {
   	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
  +	struct vhost_poll *poll = net->poll + VHOST_NET_VQ_TX;
   	unsigned out, in, s;
   	int head;
   	struct msghdr msg = {
  @@ -252,7 +225,8 @@ static void handle_tx(struct vhost_net *net)
   	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
   	if (wmem >= sock->sk->sk_sndbuf) {
   		mutex_lock(&vq->mutex);
  -		tx_poll_start(net, sock);
  +		if (vhost_poll_start(poll, sock->file))
  +			vq_err(vq, "Fail to start TX polling\n");
  s/Fail/Failed/
 
  A question though: how can this happen? Could you clarify please?
  Maybe we can find a way to prevent this error?
   Two conditions under which I think this can happen:
  
   1) a buggy userspace disables a queue through TUNSETQUEUE
   2) the net device is gone
  
   For 1, it looks like we can delay the disabling until the refcnt goes to
   zero. For 2, it may need more changes.
   I'd expect keeping a socket reference would prevent both issues.
   Doesn't it?
  
  I don't think it works for 2: the socket doesn't hold a refcnt on the
  device, so the device can go away at any time.

Hmm, I don't really understand.
All we care about is that the socket is around, no?
Could you please show how a problematic case can
be triggered?


 We could change this,
 but that was the behaviour before multiqueue support.
 
   Not sure it's worth doing this work;
   maybe a warning is enough, just like for other failures.
   With other failures, you normally can correct the error and then
   kick to have it restart. This is something that would not
   work here.
  
  If userspace is written correctly (e.g. passing an fd with correct state)

[PATCH v6 1/3] KVM: x86: clean up reexecute_instruction

2013-01-13 Thread Xiao Guangrong
Little cleanup of reexecute_instruction; also use gpa_to_gfn in
retry_instruction.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |   13 ++---
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1c9c834..08cacd9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4761,19 +4761,18 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva)
 	if (tdp_enabled)
 		return false;
 
+	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
+	if (gpa == UNMAPPED_GVA)
+		return true; /* let cpu generate fault */
+
 	/*
 	 * if emulation was due to access to shadowed page table
 	 * and it failed try to unshadow page and re-enter the
 	 * guest to let CPU execute the instruction.
 	 */
-	if (kvm_mmu_unprotect_page_virt(vcpu, gva))
+	if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
 		return true;
 
-	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
-
-	if (gpa == UNMAPPED_GVA)
-		return true; /* let cpu generate fault */
-
 	/*
 	 * Do not retry the unhandleable instruction if it faults on the
 	 * readonly host memory, otherwise it will goto a infinite loop:
@@ -4828,7 +4827,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
 	if (!vcpu->arch.mmu.direct_map)
 		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
 
-	kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
+	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
 
 	return true;
 }
-- 
1.7.7.6
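
For readers unfamiliar with the helper this patch switches to: gpa_to_gfn() simply shifts away the page offset, exactly what the open-coded `gpa >> PAGE_SHIFT` did. A standalone sketch, with userspace typedefs and 4 KiB pages assumed:

```c
#include <stdint.h>

typedef uint64_t gpa_t;   /* guest physical address */
typedef uint64_t gfn_t;   /* guest frame number */

#define PAGE_SHIFT 12     /* 4 KiB pages */

/* Same shape as the kernel helper: a gfn is the guest physical
 * address with the in-page offset shifted away. */
static inline gfn_t gpa_to_gfn(gpa_t gpa)
{
	return (gfn_t)(gpa >> PAGE_SHIFT);
}
```

Using the named helper instead of the open-coded shift is purely a readability cleanup; the generated code is identical.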



[PATCH v6 2/3] KVM: x86: let reexecute_instruction work for tdp

2013-01-13 Thread Xiao Guangrong
Currently, reexecute_instruction refuses to retry any instruction if
tdp is enabled. But if nested npt is used, the emulation may be caused by
a shadow page and can be fixed by dropping that shadow page. The only
condition under which tdp can not retry the instruction is an access
fault on an error pfn.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |   61 ---
 1 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 08cacd9..6f13e03 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4753,25 +4753,25 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
 	return r;
 }
 
-static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva)
+static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2)
 {
-	gpa_t gpa;
+	gpa_t gpa = cr2;
 	pfn_t pfn;
 
-	if (tdp_enabled)
-		return false;
-
-	gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
-	if (gpa == UNMAPPED_GVA)
-		return true; /* let cpu generate fault */
+	if (!vcpu->arch.mmu.direct_map) {
+		/*
+		 * Write permission should be allowed since only
+		 * write access need to be emulated.
+		 */
+		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
 
-	/*
-	 * if emulation was due to access to shadowed page table
-	 * and it failed try to unshadow page and re-enter the
-	 * guest to let CPU execute the instruction.
-	 */
-	if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa)))
-		return true;
+		/*
+		 * If the mapping is invalid in guest, let cpu retry
+		 * it to generate fault.
+		 */
+		if (gpa == UNMAPPED_GVA)
+			return true;
+	}
 
 	/*
 	 * Do not retry the unhandleable instruction if it faults on the
@@ -4780,12 +4780,37 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t gva)
 	 * instruction -> ...
 	 */
 	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
-	if (!is_error_noslot_pfn(pfn)) {
-		kvm_release_pfn_clean(pfn);
+
+	/*
+	 * If the instruction failed on the error pfn, it can not be fixed,
+	 * report the error to userspace.
+	 */
+	if (is_error_noslot_pfn(pfn))
+		return false;
+
+	kvm_release_pfn_clean(pfn);
+
+	/* The instructions are well-emulated on direct mmu. */
+	if (vcpu->arch.mmu.direct_map) {
+		unsigned int indirect_shadow_pages;
+
+		spin_lock(&vcpu->kvm->mmu_lock);
+		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
+		spin_unlock(&vcpu->kvm->mmu_lock);
+
+		if (indirect_shadow_pages)
+			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
+
 		return true;
 	}
 
-	return false;
+	/*
+	 * if emulation was due to access to shadowed page table
+	 * and it failed try to unshadow page and re-enter the
+	 * guest to let CPU execute the instruction.
+	 */
+	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
+	return true;
 }

 static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
-- 
1.7.7.6



[PATCH v6 3/3] KVM: x86: improve reexecute_instruction

2013-01-13 Thread Xiao Guangrong
The current reexecute_instruction can not reliably detect failed instruction
emulation: it allows the guest to retry all instructions except those that
access an error pfn.

For example, consider nested write protection: if the page we want to
write is used as a PDE but chains to itself, we should stop the emulation
and report the case to userspace.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |7 +++
 arch/x86/kvm/paging_tmpl.h  |   27 ---
 arch/x86/kvm/x86.c  |   22 ++
 3 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c431b33..d6ab8d2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -502,6 +502,13 @@ struct kvm_vcpu_arch {
 		u64 msr_val;
 		struct gfn_to_hva_cache data;
 	} pv_eoi;
+
+	/*
+	 * Indicate whether the access faults on its page table in guest
+	 * which is set when fix page fault and used to detect unhandleable
+	 * instruction.
+	 */
+	bool write_fault_to_shadow_pgtable;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3d1a352..ca69dcc 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -497,26 +497,34 @@ out_gpte_changed:
  * created when kvm establishes shadow page table that stop kvm using large
  * page size. Do it early can avoid unnecessary #PF and emulation.
  *
+ * @write_fault_to_shadow_pgtable will return true if the fault gfn is
+ * currently used as its page table.
+ *
  * Note: the PDPT page table is not checked for PAE-32 bit guest. It is ok
  * since the PDPT is always shadowed, that means, we can not use large page
  * size to map the gfn which is used as PDPT.
  */
 static bool
 FNAME(is_self_change_mapping)(struct kvm_vcpu *vcpu,
-			      struct guest_walker *walker, int user_fault)
+			      struct guest_walker *walker, int user_fault,
+			      bool *write_fault_to_shadow_pgtable)
 {
 	int level;
 	gfn_t mask = ~(KVM_PAGES_PER_HPAGE(walker->level) - 1);
+	bool self_changed = false;
 
 	if (!(walker->pte_access & ACC_WRITE_MASK ||
	      (!is_write_protection(vcpu) && !user_fault)))
 		return false;
 
-	for (level = walker->level; level <= walker->max_level; level++)
-		if (!((walker->gfn ^ walker->table_gfn[level - 1]) & mask))
-			return true;
+	for (level = walker->level; level <= walker->max_level; level++) {
+		gfn_t gfn = walker->gfn ^ walker->table_gfn[level - 1];
+
+		self_changed |= !(gfn & mask);
+		*write_fault_to_shadow_pgtable |= !gfn;
+	}
 
-	return false;
+	return self_changed;
 }
 
 /*
@@ -544,7 +552,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	int level = PT_PAGE_TABLE_LEVEL;
 	int force_pt_level;
 	unsigned long mmu_seq;
-	bool map_writable;
+	bool map_writable, is_self_change_mapping;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 
@@ -572,9 +580,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 		return 0;
 	}
 
+	vcpu->arch.write_fault_to_shadow_pgtable = false;
+
+	is_self_change_mapping = FNAME(is_self_change_mapping)(vcpu,
+	      &walker, user_fault, &vcpu->arch.write_fault_to_shadow_pgtable);
+
 	if (walker.level >= PT_DIRECTORY_LEVEL)
 		force_pt_level = mapping_level_dirty_bitmap(vcpu, walker.gfn)
-		   || FNAME(is_self_change_mapping)(vcpu, &walker, user_fault);
+		   || is_self_change_mapping;
 	else
 		force_pt_level = 1;
 	if (!force_pt_level) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6f13e03..139a5f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4753,7 +4753,8 @@ static int handle_emulation_failure(struct kvm_vcpu *vcpu)
 	return r;
 }
 
-static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2)
+static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
+				  bool write_fault_to_shadow_pgtable)
 {
 	gpa_t gpa = cr2;
 	pfn_t pfn;
@@ -4810,7 +4811,13 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2)
 	 * guest to let CPU execute the instruction.
 	 */
 	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
-	return true;
+
+	/*
+	 * If the access faults on its page table, it can not
+	 * be fixed by unprotecting shadow page and it should
+	 * be reported to userspace.
+	 */
+	return !write_fault_to_shadow_pgtable;
 }
 
 static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
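
The detection logic this series adds in is_self_change_mapping() can be exercised in userspace. The sketch below keeps only the gfn-comparison core (the write-permission precondition of the kernel version is omitted) and invents a small walk structure in place of guest_walker; it is illustrative, not the kernel code:

```c
#include <stdint.h>

typedef uint64_t gfn_t;

/* 1, 512, 512*512, ... pages per huge page at each level, as in KVM. */
#define PAGES_PER_HPAGE(level) (1ULL << (((level) - 1) * 9))

/* Stand-in for guest_walker: the gfn being written plus the gfn of the
 * guest page table used at each level of the walk (index level - 1). */
struct walk {
	int level, max_level;
	gfn_t gfn;
	gfn_t table_gfn[4];
};

/* Core of FNAME(is_self_change_mapping): the mapping is self changing
 * if the written gfn lands in the same large-page region as any table
 * gfn of the walk; it hits the walked page table itself (and so can
 * not be fixed by unprotecting) when the gfns are exactly equal. */
static int is_self_change_mapping(const struct walk *w,
				  int *write_fault_to_pgtable)
{
	gfn_t mask = ~(PAGES_PER_HPAGE(w->level) - 1);
	int self_changed = 0, level;

	*write_fault_to_pgtable = 0;
	for (level = w->level; level <= w->max_level; level++) {
		gfn_t gfn = w->gfn ^ w->table_gfn[level - 1];

		self_changed |= !(gfn & mask);
		*write_fault_to_pgtable |= !gfn;
	}
	return self_changed;
}
```

The two outputs are used differently: self_changed only forces 4K mappings, while write_fault_to_pgtable is what makes reexecute_instruction() give up and report the unhandleable case to userspace.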

qemu-kvm hangs at start up under 3.8.0-rc3-00074-gb719f43 (works with CONFIG_LOCKDEP)

2013-01-13 Thread Andrew Clayton
When running qemu-kvm under 64-bit Fedora 16 on a current 3.8 kernel, it just
hangs at start up. Doing a ps -ef hangs at the point where it
would display the qemu process (trying to list the qemu-kvm /proc pid
directory contents just hangs ls).

I also noticed some other weirdness at this point like Firefox hanging
for many seconds at a time and increasing load average.

The qemu command I was trying to run was

$ qemu-kvm -m 512 -smp 2 -vga vmware -k en-gb -drive
file=/home/andrew/machines/qemu/f16-i386.img,if=virtio

Here's the last few lines of a strace on it at start up.

open(/home/andrew/machines/qemu/f16-i386.img, O_RDWR|O_DSYNC|O_CLOEXEC) = 8
lseek(8, 0, SEEK_END)   = 9100722176
pread(8, 
QFI\373\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\20\0\0\0\2\200\0\0\0..., 512, 
0) = 512
pread(8, 
\200\0\0\0\0\4\0\0\200\0\0\0\0\10\0\0\200\0\0\0\2\210\0\0\200\0\0\0\2\233\0\0...,
 512, 65536) = 512
brk(0)  = 0x7faf12db
brk(0x7faf12ddd000

It's hanging in that brk syscall. The load average also then starts to increase.


However, I can make it run fine if I enable CONFIG_LOCKDEP. But the only
thing in dmesg I get is the usual:

kvm: SMP vm created on host with unstable TSC; guest TSC will not be reliable

I've attached both working and non-working .configs, the only difference being
the lock checking enabled in config.good.

The most recent kernel I had it working in was 3.7.0

System is a Quad Core Intel running 64bit Fedora 16.

Cheers,
Andrew

 

config.bad.gz
Description: GNU Zip compressed data


config.good.gz
Description: GNU Zip compressed data


RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support

2013-01-13 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-01-12:
 On Fri, Jan 11, 2013 at 07:36:44AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-01-10:
 On Thu, Jan 10, 2013 at 12:22:51PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-01-10:
 On Thu, Jan 10, 2013 at 11:54:21AM +, Zhang, Yang Z wrote:
 The right logic should be:
 When register virtualization is enabled, reads of all apic msrs (except the
 apic id reg and tmcct, which have hooks) do not cause a vmexit, and only
 writes to TPR do not cause a vmexit.
 When vid is enabled, writes to TPR, EOI and SELF IPI do not cause a vmexit.
 It's better to put the patch enabling virtualized x2apic mode first.
 
 There is no point whatsoever in enabling virtualized x2apic without
 allowing at least one non-intercepted access to the x2apic MSR range, and
 this is what your patch is doing when run on a cpu without the vid capability.
 No, at least read/write of TPR will not cause a vmexit if virtualized
 x2apic mode is enabled.
 For that you need to disable the 808H MSR intercept, which your patch
 does not do.
 I want to do this in the next patch.
 
 Then don't. There is no point in a patch that enables virtualized
 x2apic and does not allow at least one non-intercepted MSR access.
 As I said, read/write of TPR will not cause a vmexit if virtual x2apic is set.
 
 From my reading of the spec you need to disable 808H intercept for that.
 Is this not the case?
 
 I am not sure whether I understood your comments right in the previous
 discussion; here is my thinking:
 1. Enable virtualized x2apic mode if the guest is really using x2apic, and
    clear TPR in the msr read and write bitmaps. This will benefit reading TPR.
 2. If APIC register virtualization is enabled, clear all bits in the range
    0x800-0x8ff (except the apic id reg and tmcct).
 Clear all read bits in the range.
 
 3. If vid is enabled, clear EOI and SELF IPI in the msr write map.
 
 Looks OK.
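
Points 1-3 above amount to a policy over the per-MSR intercept bitmaps for the x2apic range. The sketch below is a userspace model of that policy, not the KVM code - the structure and helper names are invented; only the MSR offsets (x2APIC range 0x800-0x8FF, APIC ID 0x802, TPR 0x808, EOI 0x80B, TMCCT 0x839, SELF IPI 0x83F) come from the SDM:

```c
#include <stdint.h>
#include <string.h>

#define MSR_X2APIC_BASE     0x800u
#define MSR_X2APIC_END      0x8ffu
#define MSR_X2APIC_ID       0x802u  /* has a read hook, keep intercepting */
#define MSR_X2APIC_TPR      0x808u
#define MSR_X2APIC_EOI      0x80bu
#define MSR_X2APIC_TMCCT    0x839u  /* timer current count, has a hook */
#define MSR_X2APIC_SELF_IPI 0x83fu

/* One bit per MSR in the 0x800-0x8ff range; bit set = intercept (vmexit). */
struct x2apic_bitmap {
	uint8_t rd[32];
	uint8_t wr[32];
};

static void clear_bit8(uint8_t *map, unsigned int idx)
{
	map[idx / 8] &= (uint8_t)~(1u << (idx % 8));
}

static int test_bit8(const uint8_t *map, unsigned int idx)
{
	return (map[idx / 8] >> (idx % 8)) & 1;
}

static void x2apic_bitmap_setup(struct x2apic_bitmap *bm,
				int virt_x2apic, int reg_virt, int vid)
{
	unsigned int msr;

	memset(bm, 0xff, sizeof(*bm));  /* default: intercept everything */
	if (!virt_x2apic)
		return;                 /* nothing may be passed through */

	/* 1. virtualized x2apic mode: TPR accesses need no vmexit */
	clear_bit8(bm->rd, MSR_X2APIC_TPR - MSR_X2APIC_BASE);
	clear_bit8(bm->wr, MSR_X2APIC_TPR - MSR_X2APIC_BASE);

	/* 2. APIC-register virtualization: all reads pass through,
	 * except APIC ID and TMCCT which keep their hooks */
	if (reg_virt)
		for (msr = MSR_X2APIC_BASE; msr <= MSR_X2APIC_END; msr++)
			if (msr != MSR_X2APIC_ID && msr != MSR_X2APIC_TMCCT)
				clear_bit8(bm->rd, msr - MSR_X2APIC_BASE);

	/* 3. virtual-interrupt delivery: EOI and SELF IPI writes too */
	if (vid) {
		clear_bit8(bm->wr, MSR_X2APIC_EOI - MSR_X2APIC_BASE);
		clear_bit8(bm->wr, MSR_X2APIC_SELF_IPI - MSR_X2APIC_BASE);
	}
}
```

The guard on virt_x2apic mirrors the thread's conclusion: without virtualized x2apic mode, no MSR in the range may be passed through, regardless of the other capabilities.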
 
 One concern you mentioned is that vid is enabled and virtualized x2apic is
 disabled, but the guest still uses x2apic. In this case, we still use the
 MSR bitmap to intercept x2apic. This means the vEOI update will never
 happen. But we can still benefit from the interrupt window.
 
 What interrupt window? Without virtualized x2apic, TPR/EOI
 virtualization will not happen, and we rely on that in the code.
 Yes, but we can eliminate the vmexit of the interrupt window. An interrupt
 can still be delivered to the guest without a vmexit when the guest enables
 interrupts, if vid is enabled. See SDM vol 3, 29.2.2.
 
 What does it matter that you eliminated the interrupt-window vmexit if
 you broke everything else? The vid patch assumes that apic emulation
 either happens entirely in software when vid is disabled or in the CPU
 when vid is enabled. Mixed mode will not work (the apic->isr_count = 1
 trick will break things, for instance). And it is not worth complicating
 the code to make it work.
 Yes, you are right. It is too complicated.
 Another question: why not hide the x2apic capability from the guest when vid
 is supported and virtual x2apic mode is not? It would be more
 reasonable than disabling vid when virtual x2apic mode is unavailable.
 
 Because I think there will be no such HW. If such HW is expected to be
 common, we can go this route.
Yes. No HW will support vid without supporting virtual x2apic mode, so we can
just ignore this case.

Best regards,
Yang




RE: [PATCH v9 2/3] x86, apicv: add virtual x2apic support

2013-01-13 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-01-12:
 On Fri, Jan 11, 2013 at 07:36:44AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-01-10:
 On Thu, Jan 10, 2013 at 12:22:51PM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-01-10:
 On Thu, Jan 10, 2013 at 11:54:21AM +, Zhang, Yang Z wrote:
 The right logic should be:
 When register virtualization is enabled, reads of all apic msrs (except the
 apic id reg and tmcct, which have hooks) do not cause a vmexit, and only
 writes to TPR do not cause a vmexit.
 When vid is enabled, writes to TPR, EOI and SELF IPI do not cause a vmexit.
 It's better to put the patch enabling virtualized x2apic mode first.
 
 There is no point whatsoever in enabling virtualized x2apic without
 allowing at least one non-intercepted access to the x2apic MSR range, and
 this is what your patch is doing when run on a cpu without the vid capability.
 No, at least read/write of TPR will not cause a vmexit if virtualized
 x2apic mode is enabled.
 For that you need to disable the 808H MSR intercept, which your patch
 does not do.
 I want to do this in the next patch.
 
 Then don't. There is no point in a patch that enables virtualized
 x2apic and does not allow at least one non-intercepted MSR access.
 As I said, read/write of TPR will not cause a vmexit if virtual x2apic is set.
 
 From my reading of the spec you need to disable 808H intercept for that.
 Is this not the case?
After thinking about it: since Linux doesn't use TPR, there is no point in
doing this. No real guest will benefit from it.

 I am not sure whether I understood your comments right in the previous
 discussion; here is my thinking:
 1. Enable virtualized x2apic mode if the guest is really using x2apic, and
    clear TPR in the msr read and write bitmaps. This will benefit reading TPR.
 2. If APIC register virtualization is enabled, clear all bits in the range
    0x800-0x8ff (except the apic id reg and tmcct).
 Clear all read bits in the range.
 
 3. If vid is enabled, clear EOI and SELF IPI in the msr write map.
 
 Looks OK.
 
 One concern you mentioned is that vid is enabled and virtualized x2apic is
 disabled, but the guest still uses x2apic. In this case, we still use the
 MSR bitmap to intercept x2apic. This means the vEOI update will never
 happen. But we can still benefit from the interrupt window.
 
 What interrupt window? Without virtualized x2apic, TPR/EOI
 virtualization will not happen, and we rely on that in the code.
 Yes, but we can eliminate the vmexit of the interrupt window. An interrupt
 can still be delivered to the guest without a vmexit when the guest enables
 interrupts, if vid is enabled. See SDM vol 3, 29.2.2.
 
 What does it matter that you eliminated the interrupt-window vmexit if
 you broke everything else? The vid patch assumes that apic emulation
 either happens entirely in software when vid is disabled or in the CPU
 when vid is enabled. Mixed mode will not work (the apic->isr_count = 1
 trick will break things, for instance). And it is not worth complicating
 the code to make it work.
 Yes, you are right. It is too complicated.
 Another question: why not hide the x2apic capability from the guest when vid
 is supported and virtual x2apic mode is not? It would be more
 reasonable than disabling vid when virtual x2apic mode is unavailable.
 
 Because I think there will be no such HW. If such HW is expected to be
 common, we can go this route.
 
 --
   Gleb.


Best regards,
Yang




Re: [PATCH V3 2/2] vhost: handle polling errors

2013-01-13 Thread Jason Wang
On 01/13/2013 07:10 PM, Michael S. Tsirkin wrote:
 On Mon, Jan 07, 2013 at 11:04:32PM +0800, Jason Wang wrote:
 On 01/07/2013 10:55 PM, Michael S. Tsirkin wrote:
 On Mon, Jan 07, 2013 at 12:38:17PM +0800, Jason Wang wrote:
 On 01/06/2013 09:22 PM, Michael S. Tsirkin wrote:
 On Sun, Jan 06, 2013 at 03:18:38PM +0800, Jason Wang wrote:
 Polling errors were ignored by vhost/vhost_net; this may lead to a crash
 when trying to remove vhost from the waitqueue after polling has failed.
 Solve this problem by:

 - checking poll->wqh before trying to remove from the waitqueue
 - reporting an error when poll() returns POLLERR in vhost_poll_start()
 - reporting an error when vhost_poll_start() fails in
   vhost_vring_ioctl()/vhost_net_set_backend(), which is used to notify
   the failure to userspace.
 - reporting an error in the vhost_net data path when polling errors are met.

 After those changes, we can safely drop the tx polling state in
 vhost_net, since it is replaced by the check of poll->wqh.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/vhost/net.c   |   74 
 
  drivers/vhost/vhost.c |   31 +++-
  drivers/vhost/vhost.h |2 +-
  3 files changed, 49 insertions(+), 58 deletions(-)

 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index d10ad6f..125c1e5 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -64,20 +64,10 @@ enum {
  VHOST_NET_VQ_MAX = 2,
  };
  
 -enum vhost_net_poll_state {
 -VHOST_NET_POLL_DISABLED = 0,
 -VHOST_NET_POLL_STARTED = 1,
 -VHOST_NET_POLL_STOPPED = 2,
 -};
 -
  struct vhost_net {
  struct vhost_dev dev;
  struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
  struct vhost_poll poll[VHOST_NET_VQ_MAX];
 -/* Tells us whether we are polling a socket for TX.
 - * We only do this when socket buffer fills up.
 - * Protected by tx vq lock. */
 -enum vhost_net_poll_state tx_poll_state;
  /* Number of TX recently submitted.
   * Protected by tx vq lock. */
  unsigned tx_packets;
 @@ -155,24 +145,6 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
  }
  }
  
 -/* Caller must have TX VQ lock */
 -static void tx_poll_stop(struct vhost_net *net)
 -{
 -        if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
 -                return;
 -        vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
 -        net->tx_poll_state = VHOST_NET_POLL_STOPPED;
 -}
 -
 -/* Caller must have TX VQ lock */
 -static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 -{
 -        if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
 -                return;
 -        vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
 -        net->tx_poll_state = VHOST_NET_POLL_STARTED;
 -}
 -
  /* In case of DMA done not in order in lower device driver for some 
 reason.
   * upend_idx is used to track end of used idx, done_idx is used to 
 track head
   * of used idx. Once lower device DMA done contiguously, we will signal 
 KVM
 @@ -227,6 +199,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
  static void handle_tx(struct vhost_net *net)
  {
         struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 +       struct vhost_poll *poll = net->poll + VHOST_NET_VQ_TX;
         unsigned out, in, s;
         int head;
         struct msghdr msg = {
 @@ -252,7 +225,8 @@ static void handle_tx(struct vhost_net *net)
         wmem = atomic_read(&sock->sk->sk_wmem_alloc);
         if (wmem >= sock->sk->sk_sndbuf) {
                 mutex_lock(&vq->mutex);
 -               tx_poll_start(net, sock);
 +               if (vhost_poll_start(poll, sock->file))
 +                       vq_err(vq, "Fail to start TX polling\n");
 s/Fail/Failed/

 A question though: how can this happen? Could you clarify please?
 Maybe we can find a way to prevent this error?
 There are two conditions under which I think this can happen:

 1) a buggy userspace disables a queue through TUNSETQUEUE
 2) the net device is gone

 For 1, it looks like we can delay the disabling until the refcnt goes to
 zero. For 2, more changes may be needed.
 I'd expect keeping a socket reference would prevent both issues.
 Doesn't it?
 I don't think that works for 2: the socket doesn't hold a refcnt on the
 device, so the device can go away at any time. We could change this, but
 it's the behaviour from before multiqueue support.
 Hmm there's one scenario that does seem to
 trigger this: queue can get disabled
 and then poll fails.

 Is this the only issue?

Another one I think we can trigger is:

- start a vhost thread
- run "ip link del dev tap0" to delete the tap device

In this case, the netdevice is unregistered but the file/socket still exists.
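For readers following the thread: the crash the patch fixes comes from removing vhost from a waitqueue it never joined. The guard can be sketched in plain C; this is a minimal userspace model with simplified stand-in types (not the kernel code itself), showing why a poll->wqh check makes vhost_poll_stop() safe after a failed start.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's wait-queue types. */
struct wait_queue_head { int nr_waiters; };

struct vhost_poll {
        struct wait_queue_head *wqh;  /* non-NULL only after a successful start */
};

/* Model of vhost_poll_start(): may fail, e.g. when poll() reports POLLERR. */
static int vhost_poll_start(struct vhost_poll *poll,
                            struct wait_queue_head *wqh, int poll_err)
{
        if (poll_err)
                return -1;  /* error propagated to the caller; wqh stays NULL */
        wqh->nr_waiters++;
        poll->wqh = wqh;
        return 0;
}

/* Model of vhost_poll_stop(): the poll->wqh check makes it a no-op when the
 * start failed or never happened, so there is no bogus waitqueue removal. */
static void vhost_poll_stop(struct vhost_poll *poll)
{
        if (poll->wqh) {
                poll->wqh->nr_waiters--;
                poll->wqh = NULL;
        }
}
```

This also illustrates why the tx_poll_state enum becomes redundant: whether polling is active is already encoded by poll->wqh being non-NULL.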

 I'm not sure it's worth doing this work; maybe a warning is enough, just
 like for other failures. With other failures, you can normally correct
 the error and then kick to have it 

Re: [PATCH V3 2/2] vhost: handle polling errors

2013-01-13 Thread Michael S. Tsirkin
On Mon, Jan 14, 2013 at 10:59:02AM +0800, Jason Wang wrote:
 On 01/13/2013 07:10 PM, Michael S. Tsirkin wrote:
  On Mon, Jan 07, 2013 at 11:04:32PM +0800, Jason Wang wrote:
  On 01/07/2013 10:55 PM, Michael S. Tsirkin wrote:
  On Mon, Jan 07, 2013 at 12:38:17PM +0800, Jason Wang wrote:
  On 01/06/2013 09:22 PM, Michael S. Tsirkin wrote:
  On Sun, Jan 06, 2013 at 03:18:38PM +0800, Jason Wang wrote:
  Polling errors were ignored by vhost/vhost_net; this may lead to a crash
  when trying to remove vhost from the waitqueue after polling has failed.
  Solve this problem by:
 
  - checking poll->wqh before trying to remove from the waitqueue
  - reporting an error when poll() returns POLLERR in vhost_start_poll()
  - reporting an error when vhost_start_poll() fails in
    vhost_vring_ioctl()/vhost_net_set_backend(), which is used to notify
    the failure to userspace
  - reporting an error in the data path in vhost_net when polling errors
    are met
 
  After those changes, we can safely drop the tx polling state in vhost_net,
  since it has been replaced by the check of poll->wqh.
 
  Signed-off-by: Jason Wang jasow...@redhat.com
  ---
   drivers/vhost/net.c   |   74 
  
   drivers/vhost/vhost.c |   31 +++-
   drivers/vhost/vhost.h |2 +-
   3 files changed, 49 insertions(+), 58 deletions(-)
 
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index d10ad6f..125c1e5 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -64,20 +64,10 @@ enum {
 VHOST_NET_VQ_MAX = 2,
   };
   
  -enum vhost_net_poll_state {
  -  VHOST_NET_POLL_DISABLED = 0,
  -  VHOST_NET_POLL_STARTED = 1,
  -  VHOST_NET_POLL_STOPPED = 2,
  -};
  -
   struct vhost_net {
 struct vhost_dev dev;
 struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 struct vhost_poll poll[VHOST_NET_VQ_MAX];
  -  /* Tells us whether we are polling a socket for TX.
  -   * We only do this when socket buffer fills up.
  -   * Protected by tx vq lock. */
  -  enum vhost_net_poll_state tx_poll_state;
 /* Number of TX recently submitted.
  * Protected by tx vq lock. */
 unsigned tx_packets;
  @@ -155,24 +145,6 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
 }
   }
   
  -/* Caller must have TX VQ lock */
  -static void tx_poll_stop(struct vhost_net *net)
  -{
  -        if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
  -                return;
  -        vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
  -        net->tx_poll_state = VHOST_NET_POLL_STOPPED;
  -}
  -
  -/* Caller must have TX VQ lock */
  -static void tx_poll_start(struct vhost_net *net, struct socket *sock)
  -{
  -        if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
  -                return;
  -        vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
  -        net->tx_poll_state = VHOST_NET_POLL_STARTED;
  -}
  -
   /* In case of DMA done not in order in lower device driver for some 
  reason.
* upend_idx is used to track end of used idx, done_idx is used to 
  track head
* of used idx. Once lower device DMA done contiguously, we will 
  signal KVM
  @@ -227,6 +199,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
   static void handle_tx(struct vhost_net *net)
   {
          struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
  +       struct vhost_poll *poll = net->poll + VHOST_NET_VQ_TX;
          unsigned out, in, s;
          int head;
          struct msghdr msg = {
  @@ -252,7 +225,8 @@ static void handle_tx(struct vhost_net *net)
          wmem = atomic_read(&sock->sk->sk_wmem_alloc);
          if (wmem >= sock->sk->sk_sndbuf) {
                  mutex_lock(&vq->mutex);
  -               tx_poll_start(net, sock);
  +               if (vhost_poll_start(poll, sock->file))
  +                       vq_err(vq, "Fail to start TX polling\n");
  s/Fail/Failed/
 
  A question though: how can this happen? Could you clarify please?
  Maybe we can find a way to prevent this error?
  There are two conditions under which I think this can happen:
 
  1) a buggy userspace disables a queue through TUNSETQUEUE
  2) the net device is gone
 
  For 1, it looks like we can delay the disabling until the refcnt goes to
  zero. For 2, more changes may be needed.
  I'd expect keeping a socket reference would prevent both issues.
  Doesn't it?
  I don't think that works for 2: the socket doesn't hold a refcnt on the
  device, so the device can go away at any time. We could change this, but
  it's the behaviour from before multiqueue support.
  Hmm there's one scenario that does seem to
  trigger this: queue can get disabled
  and then poll fails.
 
  Is this the only issue?
 
 Another one I think we can trigger is:
 
 - start a vhost thread
 - run "ip link del dev tap0" to delete the tap device
 
 In this case, the netdevice is unregistered but the file/socket still exists.

Yes but in this case poll_wait is called so 

[PATCH v10 0/3] x86, apicv: Add APIC virtualization support

2013-01-13 Thread Yang Zhang
From: Yang Zhang yang.z.zh...@intel.com

APIC virtualization is a new feature which can eliminate most VM exits
when a vcpu handles an interrupt:

APIC register virtualization:
APIC read accesses don't cause APIC-access VM exits.
APIC writes become trap-like.

Virtual interrupt delivery:
Virtual interrupt delivery avoids the need for KVM to inject vAPIC
interrupts manually; this is fully taken care of by the hardware.

Please refer to Intel SDM volume 3, chapter 29 for more details.
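
As a concrete illustration of the capability check the series performs (cpu_has_vmx_apic_register_virt() in patch 1): APIC-register virtualization is advertised as bit 8 (0x00000100) of the secondary processor-based VM-execution controls, reported through the IA32_VMX_PROCBASED_CTLS2 MSR. The following sketch decodes such an MSR value, assuming the SDM convention that the high 32 bits are the allowed-1 settings; the helper names are illustrative, not from the patch.

```c
#include <stdint.h>

/* Control bits as defined in the patches (arch/x86/include/asm/vmx.h). */
#define SECONDARY_EXEC_APIC_REGISTER_VIRT       0x00000100u
#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY    0x00000200u

/* For VMX capability MSRs, the low 32 bits report the allowed-0 settings
 * and the high 32 bits the allowed-1 settings; a control is usable only
 * if its bit may be set to 1. */
static uint32_t allowed1_controls(uint64_t msr_val)
{
        return (uint32_t)(msr_val >> 32);
}

static int has_apic_register_virt(uint64_t procbased_ctls2)
{
        return (allowed1_controls(procbased_ctls2) &
                SECONDARY_EXEC_APIC_REGISTER_VIRT) != 0;
}
```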
Changes v9 to v10:
 * Enable virtualized x2apic mode when the guest is using x2apic and apicv:
   there is no point in enabling it when apicv is disabled.
 * Take ioapic_lock when traversing ioapic entries to set the eoi exit bitmap
 * Rebased on top of KVM upstream

Changes v8 to v9:
 * Update eoi exit bitmap by vcpu itself.
 * Enable virtualized x2apic mode when the guest is using x2apic.
 * Rebase on top of KVM upstream

Changes v7 to v8:
 * Following Marcelo's suggestion, add comments for irr_pending and isr_count,
   since the two variables have different meanings when using apicv.
 * Set the highest bit in vISR to SVI after migration.
 * Use a spinlock to access the eoi exit bitmap synchronously.
 * Enable virtualized x2apic mode when the guest is using x2apic.
 * Rebased on top of KVM upstream.

Changes v6 to v7:
 * Fix a bug when setting the exit bitmap.
 * Rebased on top of KVM upstream.

Changes v5 to v6:
 * Minor adjustments according to Gleb's comments.
 * Rebased on top of KVM upstream.

Changes v4 to v5:
 * Set eoi exit bitmap when an interrupt has notifier registered.
 * Use request to track ioapic entry's modification.
 * Rebased on top of KVM upstream.

Changes v3 to v4:
 * Use one option to control both register virtualization and virtual
   interrupt delivery.
 * Update the eoi exit bitmap when programming the ioapic or the apic's
   id/dfr/ldr.
 * Rebased on top of KVM upstream.

Changes v2 to v3:
 * Drop the Posted Interrupt patch from v3.
   Following Gleb's suggestion, we will use a global vector for all VCPUs as
   the notification event vector, so we will rewrite the Posted Interrupt
   patch and resend it later.
 * Use the TMR to set the eoi exiting bitmap. We only want to set the eoi
   exiting bitmap for those interrupts which are level-triggered or have a
   notifier in the EOI write path, so the TMR is enough to distinguish the
   interrupt trigger mode.
 * Simplify some code according to Gleb's comments.
 * Rebased on top of KVM upstream.

Changes v1 to v2:
 * Add Posted Interrupt support in this patch series.
 * Since there is a notifier hook in the vAPIC EOI path for the PIT
   interrupt, always set the PIT interrupt in the eoi exit bitmap to force
   a vmexit on EOI of that interrupt.
 * Rebased on top of KVM upstream.

Yang Zhang (3):
  x86, apicv: add APICv register virtualization support
  x86, apicv: add virtual x2apic support
  x86, apicv: add virtual interrupt delivery support

 arch/ia64/kvm/lapic.h   |6 +
 arch/x86/include/asm/kvm_host.h |8 +
 arch/x86/include/asm/vmx.h  |   14 ++
 arch/x86/kvm/irq.c  |   56 ++-
 arch/x86/kvm/lapic.c|   99 +++
 arch/x86/kvm/lapic.h|   25 +++
 arch/x86/kvm/svm.c  |   31 
 arch/x86/kvm/vmx.c  |  366 ++-
 arch/x86/kvm/x86.c  |   11 +-
 arch/x86/kvm/x86.h  |2 +
 include/linux/kvm_host.h|3 +
 virt/kvm/ioapic.c   |   38 
 virt/kvm/ioapic.h   |1 +
 virt/kvm/irq_comm.c |   25 +++
 virt/kvm/kvm_main.c |5 +
 15 files changed, 641 insertions(+), 49 deletions(-)



[PATCH v10 1/3] x86, apicv: add APICv register virtualization support

2013-01-13 Thread Yang Zhang
- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/vmx.h |2 ++
 arch/x86/kvm/lapic.c   |   15 +++
 arch/x86/kvm/lapic.h   |2 ++
 arch/x86/kvm/vmx.c |   33 -
 4 files changed, 51 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index e385df9..44c3f7e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -66,6 +66,7 @@
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD  54
 #define EXIT_REASON_XSETBV  55
+#define EXIT_REASON_APIC_WRITE  56
 #define EXIT_REASON_INVPCID 58
 
 #define VMX_EXIT_REASONS \
@@ -141,6 +142,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID  0x0020
 #define SECONDARY_EXEC_WBINVD_EXITING  0x0040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST  0x0080
+#define SECONDARY_EXEC_APIC_REGISTER_VIRT   0x0100
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9392f52..0664c13 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1212,6 +1212,21 @@ void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
 
+/* emulate APIC access in a trap manner */
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
+{
+   u32 val = 0;
+
+   /* hw has done the conditional check and inst decode */
+       offset &= 0xff0;
+
+       apic_reg_read(vcpu->arch.apic, offset, 4, &val);
+
+       /* TODO: optimize to just emulate side effect w/o one more write */
+       apic_reg_write(vcpu->arch.apic, offset, val);
+}
+EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
	struct kvm_lapic *apic = vcpu->arch.apic;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index e5ebf9f..9a8ee22 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -64,6 +64,8 @@ int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
 void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
 
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+
 void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
 void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
 void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dd2a85c..0403634 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -84,6 +84,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static bool __read_mostly fasteoi = 1;
 module_param(fasteoi, bool, S_IRUGO);
 
+static bool __read_mostly enable_apicv_reg = 1;
+module_param(enable_apicv_reg, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -764,6 +767,12 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
                SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_apic_register_virt(void)
+{
+       return vmcs_config.cpu_based_2nd_exec_ctrl &
+               SECONDARY_EXEC_APIC_REGISTER_VIRT;
+}
+
 static inline bool cpu_has_vmx_flexpriority(void)
 {
	return cpu_has_vmx_tpr_shadow() &&
@@ -2540,7 +2549,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
SECONDARY_EXEC_UNRESTRICTED_GUEST |
SECONDARY_EXEC_PAUSE_LOOP_EXITING |
SECONDARY_EXEC_RDTSCP |
-   SECONDARY_EXEC_ENABLE_INVPCID;
+   SECONDARY_EXEC_ENABLE_INVPCID |
+   SECONDARY_EXEC_APIC_REGISTER_VIRT;
if (adjust_vmx_controls(min2, opt2,
MSR_IA32_VMX_PROCBASED_CTLS2,
_cpu_based_2nd_exec_control)  0)
@@ -2551,6 +2561,11 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
                        SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
                _cpu_based_exec_control &= ~CPU_BASED_TPR_SHADOW;
 #endif
+
+       if (!(_cpu_based_exec_control & CPU_BASED_TPR_SHADOW))
+               _cpu_based_2nd_exec_control &= ~(
+                               SECONDARY_EXEC_APIC_REGISTER_VIRT);
+
        if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
                /* CR3 accesses and invlpg don't need to cause VM Exits when EPT
                   enabled */
@@ -2748,6 +2763,9 @@ static __init int hardware_setup(void)
if (!cpu_has_vmx_ple())
ple_gap = 0;
 
+   if (!cpu_has_vmx_apic_register_virt())
+   enable_apicv_reg 

[PATCH v10 2/3] x86, apicv: add virtual x2apic support

2013-01-13 Thread Yang Zhang
From: Yang Zhang yang.z.zh...@intel.com

Basically, to benefit from APICv, we need to enable virtualized x2apic mode.
Currently, we only enable it when the guest is really using x2apic.

Also, clear the MSR bitmap for the corresponding x2apic MSRs when the guest
has enabled x2apic:
0x800 - 0x8ff: no read intercept for APICv register virtualization,
   except APIC ID and TMCCT.
APIC ID and TMCCT: need software's assistance to get the right value.
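
The MSR range mentioned above maps an xAPIC MMIO register offset `reg` to MSR `0x800 + (reg >> 4)`, so APIC ID (offset 0x20) is MSR 0x802 and TMCCT (offset 0x390) is MSR 0x839. A hypothetical helper sketching the intercept policy described in the commit message (the function names are mine, not from the patch):

```c
#include <stdint.h>

#define X2APIC_MSR_BASE         0x800u
#define APIC_ID_OFFSET          0x20u   /* -> MSR 0x802 */
#define APIC_TMCCT_OFFSET       0x390u  /* -> MSR 0x839 */

/* Map an xAPIC MMIO register offset to its x2APIC MSR number. */
static uint32_t x2apic_msr(uint32_t reg_offset)
{
        return X2APIC_MSR_BASE + (reg_offset >> 4);
}

/* Under the policy described above, reads in 0x800-0x8ff are not
 * intercepted, except APIC ID and TMCCT which need software's help. */
static int read_stays_intercepted(uint32_t msr)
{
        if (msr < 0x800u || msr > 0x8ffu)
                return 1;  /* outside the virtualized x2apic range */
        return msr == x2apic_msr(APIC_ID_OFFSET) ||
               msr == x2apic_msr(APIC_TMCCT_OFFSET);
}
```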

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/include/asm/vmx.h  |1 +
 arch/x86/kvm/lapic.c|   15 +++-
 arch/x86/kvm/svm.c  |6 ++
 arch/x86/kvm/vmx.c  |  162 +--
 5 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c431b33..35aa8e6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -697,6 +697,7 @@ struct kvm_x86_ops {
void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
void (*enable_irq_window)(struct kvm_vcpu *vcpu);
void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+   void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 44c3f7e..0a54df0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -139,6 +139,7 @@
 #define SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES 0x0001
 #define SECONDARY_EXEC_ENABLE_EPT   0x0002
 #define SECONDARY_EXEC_RDTSCP  0x0008
+#define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE   0x0010
 #define SECONDARY_EXEC_ENABLE_VPID  0x0020
 #define SECONDARY_EXEC_WBINVD_EXITING  0x0040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST  0x0080
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0664c13..2ef5e2b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1323,12 +1323,17 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
        if (!kvm_vcpu_is_bsp(apic->vcpu))
                value &= ~MSR_IA32_APICBASE_BSP;
 
-       vcpu->arch.apic_base = value;
-       if (apic_x2apic_mode(apic)) {
-               u32 id = kvm_apic_id(apic);
-               u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
-               kvm_apic_set_ldr(apic, ldr);
+       if ((vcpu->arch.apic_base ^ value) & X2APIC_ENABLE) {
+               if (value & X2APIC_ENABLE) {
+                       u32 id = kvm_apic_id(apic);
+                       u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));
+                       kvm_apic_set_ldr(apic, ldr);
+                       kvm_x86_ops->set_virtual_x2apic_mode(vcpu, true);
+               } else
+                       kvm_x86_ops->set_virtual_x2apic_mode(vcpu, false);
        }
+
+       vcpu->arch.apic_base = value;
        apic->base_address = apic->vcpu->arch.apic_base &
                             MSR_IA32_APICBASE_BASE;
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d29d3cd..38407e9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3571,6 +3571,11 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, 
int tpr, int irr)
set_cr_intercept(svm, INTERCEPT_CR8_WRITE);
 }
 
+static void svm_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
+{
+   return;
+}
+
 static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
@@ -4290,6 +4295,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.enable_nmi_window = enable_nmi_window,
.enable_irq_window = enable_irq_window,
.update_cr8_intercept = update_cr8_intercept,
+   .set_virtual_x2apic_mode = svm_set_virtual_x2apic_mode,
 
.set_tss_addr = svm_set_tss_addr,
.get_tdp_level = get_npt_level,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 0403634..847022e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -767,6 +767,12 @@ static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
                SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
 }
 
+static inline bool cpu_has_vmx_virtualize_x2apic_mode(void)
+{
+       return vmcs_config.cpu_based_2nd_exec_ctrl &
+               SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
+}
+
 static inline bool cpu_has_vmx_apic_register_virt(void)
 {
return vmcs_config.cpu_based_2nd_exec_ctrl 
@@ -2543,6 +2549,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
if (_cpu_based_exec_control  CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
min2 = 0;
opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+   

[PATCH v10 3/3] x86, apicv: add virtual interrupt delivery support

2013-01-13 Thread Yang Zhang
From: Yang Zhang yang.z.zh...@intel.com

Virtual interrupt delivery avoids the need for KVM to inject vAPIC
interrupts manually; this is fully taken care of by the hardware. This
needs some special awareness in the existing interrupt injection path:

- for a pending interrupt, instead of direct injection, we may need to
  update architecture-specific indicators before resuming the guest.

- A pending interrupt which is masked by the ISR should also be
  considered in the above update action, since the hardware will decide
  when to inject it at the right time. Currently, has_interrupt and
  get_interrupt only return a valid vector from the injection p.o.v.
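
With virtual interrupt delivery, those architecture-specific indicators are RVI and SVI in the 16-bit guest interrupt status VMCS field (GUEST_INTR_STATUS, added by this patch): per the Intel SDM, the low byte holds the requesting vector (RVI) and the high byte the in-service vector (SVI). A small sketch of packing that field, with illustrative helper names:

```c
#include <stdint.h>

/* GUEST_INTR_STATUS layout: bits 7:0 = RVI, bits 15:8 = SVI. */
static uint16_t set_rvi(uint16_t status, uint8_t max_irr)
{
        return (uint16_t)((status & 0xff00u) | max_irr);
}

static uint16_t set_svi(uint16_t status, uint8_t isr_vec)
{
        return (uint16_t)((status & 0x00ffu) | ((uint16_t)isr_vec << 8));
}

static uint8_t get_rvi(uint16_t status) { return (uint8_t)status; }
static uint8_t get_svi(uint16_t status) { return (uint8_t)(status >> 8); }
```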

Signed-off-by: Kevin Tian kevin.t...@intel.com
Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/ia64/kvm/lapic.h   |6 ++
 arch/x86/include/asm/kvm_host.h |7 ++
 arch/x86/include/asm/vmx.h  |   11 +++
 arch/x86/kvm/irq.c  |   56 +++-
 arch/x86/kvm/lapic.c|   69 --
 arch/x86/kvm/lapic.h|   23 +
 arch/x86/kvm/svm.c  |   25 +
 arch/x86/kvm/vmx.c  |  189 +--
 arch/x86/kvm/x86.c  |   11 ++-
 arch/x86/kvm/x86.h  |2 +
 include/linux/kvm_host.h|3 +
 virt/kvm/ioapic.c   |   38 
 virt/kvm/ioapic.h   |1 +
 virt/kvm/irq_comm.c |   25 +
 virt/kvm/kvm_main.c |5 +
 15 files changed, 426 insertions(+), 45 deletions(-)

diff --git a/arch/ia64/kvm/lapic.h b/arch/ia64/kvm/lapic.h
index c5f92a9..c3e2935 100644
--- a/arch/ia64/kvm/lapic.h
+++ b/arch/ia64/kvm/lapic.h
@@ -27,4 +27,10 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct 
kvm_lapic_irq *irq);
 #define kvm_apic_present(x) (true)
 #define kvm_lapic_enabled(x) (true)
 
+static inline bool kvm_apic_vid_enabled(void)
+{
+       /* IA64 has no APICv support, do nothing here */
+   return false;
+}
+
 #endif
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35aa8e6..d90bfa8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -697,6 +697,12 @@ struct kvm_x86_ops {
void (*enable_nmi_window)(struct kvm_vcpu *vcpu);
void (*enable_irq_window)(struct kvm_vcpu *vcpu);
void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr);
+   int (*has_virtual_interrupt_delivery)(void);
+   void (*update_apic_irq)(struct kvm_vcpu *vcpu, int max_irr);
+   void (*update_eoi_exitmap)(struct kvm_vcpu *vcpu);
+   void (*set_svi)(int isr);
+   void (*check_ioapic_entry)(struct kvm_vcpu *vcpu,
+  struct kvm_lapic_irq *irq);
void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
@@ -992,6 +998,7 @@ int kvm_age_hva(struct kvm *kvm, unsigned long hva);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
+int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0a54df0..694586c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -62,6 +62,7 @@
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS 44
+#define EXIT_REASON_EOI_INDUCED 45
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_WBINVD  54
@@ -144,6 +145,7 @@
 #define SECONDARY_EXEC_WBINVD_EXITING  0x0040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST  0x0080
 #define SECONDARY_EXEC_APIC_REGISTER_VIRT   0x0100
+#define SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY0x0200
 #define SECONDARY_EXEC_PAUSE_LOOP_EXITING  0x0400
 #define SECONDARY_EXEC_ENABLE_INVPCID  0x1000
 
@@ -181,6 +183,7 @@ enum vmcs_field {
GUEST_GS_SELECTOR   = 0x080a,
GUEST_LDTR_SELECTOR = 0x080c,
GUEST_TR_SELECTOR   = 0x080e,
+   GUEST_INTR_STATUS   = 0x0810,
HOST_ES_SELECTOR= 0x0c00,
HOST_CS_SELECTOR= 0x0c02,
HOST_SS_SELECTOR= 0x0c04,
@@ -208,6 +211,14 @@ enum vmcs_field {
APIC_ACCESS_ADDR_HIGH   = 0x2015,
EPT_POINTER = 0x201a,
EPT_POINTER_HIGH= 0x201b,
+   EOI_EXIT_BITMAP0= 0x201c,
+   EOI_EXIT_BITMAP0_HIGH   = 0x201d,
+   EOI_EXIT_BITMAP1= 0x201e,
+

Re: [Qemu-devel] [PATCH v12 0/8] pv event to notify host when the guest is panicked

2013-01-13 Thread Hu Tao
Hi Marcelo,

Sorry for the late reply.

On Tue, Dec 25, 2012 at 07:52:05PM -0200, Marcelo Tosatti wrote:
 On Thu, Dec 20, 2012 at 03:53:59PM +0800, Hu Tao wrote:
  Hi,
  
  Any comments?
 
 As far as i can see, items 2 and 3 of
 
 https://lkml.org/lkml/2012/11/12/588
 
 Have not been addressed.
 
 https://lkml.org/lkml/2012/11/20/653 contains discussions on those
 items.
 
 2) Format of the interface for other architectures (you can choose
 a different KVM supported architecture and write an example). It was
 your choice to choose an I/O port, which is x86 specific.

Unfortunately I don't have hardware other than x86 in hand to test it.

 
 3) Clear/documented management interface for the feature.

Patch #8 adds a document describing the usage of the pv event. But since
you are asking again, I must be misunderstanding what you mean by
management interface. Can you clarify?

 
 Note 3 is for management, not the guest-host interface.
 
  On Wed, Dec 12, 2012 at 02:13:43PM +0800, Hu Tao wrote:
   This series implements a new interface, kvm pv event, to notify host when
   some events happen in guest. Right now there is one supported event: guest
   panic.
   
   changes from v11:
   
 - add a new patch 'save/load cpu runstate'
 - fix a bug of null-dereference when no -machine option is supplied
 - reserve RUN_STATE_GUEST_PANICKED during migration
 - add doc of enable_pv_event option
 - disable reboot-on-panic if pv_event is on
   
   v11: http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg04361.html
   
   Hu Tao (7):
 save/load cpu runstate
 update kernel headers
 add a new runstate: RUN_STATE_GUEST_PANICKED
 add a new qevent: QEVENT_GUEST_PANICKED
 introduce a new qom device to deal with panicked event
  allow the user to disable pv event support
 pv event: add document to describe the usage
   
   Wen Congyang (1):
 start vm after resetting it
   
block.h  |   2 +
docs/pv-event.txt|  17 
hw/kvm/Makefile.objs |   2 +-
hw/kvm/pv_event.c| 197 
   +++
hw/pc_piix.c |  11 +++
kvm-stub.c   |   4 +
kvm.h|   2 +
linux-headers/asm-x86/kvm_para.h |   1 +
linux-headers/linux/kvm_para.h   |   6 ++
migration.c  |   7 +-
monitor.c|   6 +-
monitor.h|   1 +
qapi-schema.json |   6 +-
qemu-config.c|   4 +
qemu-options.hx  |   3 +-
qmp.c|   5 +-
savevm.c |   1 +
sysemu.h |   2 +
vl.c |  52 ++-
19 files changed, 312 insertions(+), 17 deletions(-)
create mode 100644 docs/pv-event.txt
create mode 100644 hw/kvm/pv_event.c
   
   -- 
   1.8.0.1.240.ge8a1f5a


Re: [PATCH V3 2/2] vhost: handle polling errors

2013-01-13 Thread Jason Wang
On 01/14/2013 02:57 PM, Michael S. Tsirkin wrote:
 On Mon, Jan 14, 2013 at 10:59:02AM +0800, Jason Wang wrote:
 On 01/13/2013 07:10 PM, Michael S. Tsirkin wrote:
 On Mon, Jan 07, 2013 at 11:04:32PM +0800, Jason Wang wrote:
 On 01/07/2013 10:55 PM, Michael S. Tsirkin wrote:
 On Mon, Jan 07, 2013 at 12:38:17PM +0800, Jason Wang wrote:
 On 01/06/2013 09:22 PM, Michael S. Tsirkin wrote:
 On Sun, Jan 06, 2013 at 03:18:38PM +0800, Jason Wang wrote:
 Polling errors were ignored by vhost/vhost_net; this may lead to a crash
 when trying to remove vhost from the waitqueue after polling has failed.
 Solve this problem by:

 - checking poll->wqh before trying to remove from the waitqueue
 - reporting an error when poll() returns POLLERR in vhost_start_poll()
 - reporting an error when vhost_start_poll() fails in
   vhost_vring_ioctl()/vhost_net_set_backend(), which is used to notify
   the failure to userspace
 - reporting an error in the data path in vhost_net when polling errors
   are met

 After those changes, we can safely drop the tx polling state in vhost_net,
 since it has been replaced by the check of poll->wqh.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  drivers/vhost/net.c   |   74 
 
  drivers/vhost/vhost.c |   31 +++-
  drivers/vhost/vhost.h |2 +-
  3 files changed, 49 insertions(+), 58 deletions(-)

 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index d10ad6f..125c1e5 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -64,20 +64,10 @@ enum {
VHOST_NET_VQ_MAX = 2,
  };
  
 -enum vhost_net_poll_state {
 -  VHOST_NET_POLL_DISABLED = 0,
 -  VHOST_NET_POLL_STARTED = 1,
 -  VHOST_NET_POLL_STOPPED = 2,
 -};
 -
  struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
 -  /* Tells us whether we are polling a socket for TX.
 -   * We only do this when socket buffer fills up.
 -   * Protected by tx vq lock. */
 -  enum vhost_net_poll_state tx_poll_state;
/* Number of TX recently submitted.
 * Protected by tx vq lock. */
unsigned tx_packets;
 @@ -155,24 +145,6 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
}
  }
  
 -/* Caller must have TX VQ lock */
 -static void tx_poll_stop(struct vhost_net *net)
 -{
 -        if (likely(net->tx_poll_state != VHOST_NET_POLL_STARTED))
 -                return;
 -        vhost_poll_stop(net->poll + VHOST_NET_VQ_TX);
 -        net->tx_poll_state = VHOST_NET_POLL_STOPPED;
 -}
 -
 -/* Caller must have TX VQ lock */
 -static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 -{
 -        if (unlikely(net->tx_poll_state != VHOST_NET_POLL_STOPPED))
 -                return;
 -        vhost_poll_start(net->poll + VHOST_NET_VQ_TX, sock->file);
 -        net->tx_poll_state = VHOST_NET_POLL_STARTED;
 -}
 -
  /* In case of DMA done not in order in lower device driver for some 
 reason.
   * upend_idx is used to track end of used idx, done_idx is used to 
 track head
   * of used idx. Once lower device DMA done contiguously, we will 
 signal KVM
 @@ -227,6 +199,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
  static void handle_tx(struct vhost_net *net)
  {
         struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 +       struct vhost_poll *poll = net->poll + VHOST_NET_VQ_TX;
         unsigned out, in, s;
         int head;
         struct msghdr msg = {
 @@ -252,7 +225,8 @@ static void handle_tx(struct vhost_net *net)
         wmem = atomic_read(&sock->sk->sk_wmem_alloc);
         if (wmem >= sock->sk->sk_sndbuf) {
                 mutex_lock(&vq->mutex);
 -               tx_poll_start(net, sock);
 +               if (vhost_poll_start(poll, sock->file))
 +                       vq_err(vq, "Fail to start TX polling\n");
 s/Fail/Failed/

 A question though: how can this happen? Could you clarify please?
 Maybe we can find a way to prevent this error?
 There are two conditions under which I think this can happen:

 1) a buggy userspace disables a queue through TUNSETQUEUE
 2) the net device is gone

 For 1, it looks like we can delay the disabling until the refcnt goes to
 zero. For 2, more changes may be needed.
 I'd expect keeping a socket reference would prevent both issues.
 Doesn't it?
 I don't think that works for 2: the socket doesn't hold a refcnt on the
 device, so the device can go away at any time. We could change this, but
 it's the behaviour from before multiqueue support.
 Hmm there's one scenario that does seem to
 trigger this: queue can get disabled
 and then poll fails.

 Is this the only issue?
 Another one I think we can trigger is:

 - start a vhost thread
 - run "ip link del dev tap0" to delete the tap device

 In this case, the netdevice is unregistered but the file/socket still exists.
 Yes but in this case poll_wait is called so apparently no issue
 with existing code? We only have an issue if poll_wait 
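The "delay the disabling until the refcnt goes to zero" idea raised above for case 1 can be sketched in plain userspace C. This is purely illustrative: the names (queue_get, queue_put, queue_request_disable) and the struct are hypothetical stand-ins, not the vhost or tun API, and the real kernel code would need proper locking rather than this simplified atomic scheme.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-queue state: a disable request is only honoured
 * once the last user drops its reference. */
struct queue {
	atomic_int refcnt;
	atomic_bool disable_requested;
	bool disabled;
};

static void queue_get(struct queue *q)
{
	atomic_fetch_add(&q->refcnt, 1);
}

/* Drop a reference; returns true if this put applied a pending disable. */
static bool queue_put(struct queue *q)
{
	if (atomic_fetch_sub(&q->refcnt, 1) == 1 &&
	    atomic_load(&q->disable_requested)) {
		q->disabled = true;	/* safe: no users left */
		return true;
	}
	return false;
}

/* TUNSETQUEUE-style disable: deferred while the queue is still in use. */
static void queue_request_disable(struct queue *q)
{
	atomic_store(&q->disable_requested, true);
	if (atomic_load(&q->refcnt) == 0)
		q->disabled = true;
}
```

With this shape, a buggy userspace issuing the disable while vhost still holds a reference would not yank the queue out from under the poller; the disable takes effect only on the final put.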

Re: [Qemu-devel] [PATCH v12 0/8] pv event to notify host when the guest is panicked

2013-01-13 Thread Hu Tao
On Tue, Dec 25, 2012 at 07:54:20PM -0200, Marcelo Tosatti wrote:
 On Thu, Dec 20, 2012 at 03:53:59PM +0800, Hu Tao wrote:
  Hi,
  
  Any comments?
 
 Did you verify possibilities listed at 
 https://lkml.org/lkml/2012/11/20/653 ?

Yes, except for the EIO one you mentioned; I don't know how to reproduce it.

 
 If so, a summary in the patchset would be helpful.

Thanks for the reminder; I'll add it in the next version.

Basically, the run state is saved and loaded again so that we know the
state when restoring a VM. This is done by patch #1 and is independent
of guest panic. Could you review patch #1 before the next version?
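The idea in patch #1, saving the run state on the source and restoring it on load so that a panicked state survives migration, can be illustrated with a minimal round-trip in C. This is a sketch only: the snapshot struct and the save/load helpers are hypothetical, not QEMU code; only the RUN_STATE_GUEST_PANICKED name mirrors the series.

```c
/* Subset of run states relevant to the series. */
enum run_state {
	RUN_STATE_RUNNING,
	RUN_STATE_PAUSED,
	RUN_STATE_GUEST_PANICKED,	/* new state added by the series */
	RUN_STATE_MAX
};

/* Hypothetical snapshot buffer standing in for the migration stream. */
struct vm_snapshot {
	int run_state;
};

static void save_run_state(struct vm_snapshot *snap, enum run_state s)
{
	snap->run_state = s;
}

/* On load, restore the saved state instead of unconditionally resuming,
 * so the destination knows the guest had panicked before migration. */
static enum run_state load_run_state(const struct vm_snapshot *snap)
{
	if (snap->run_state < 0 || snap->run_state >= RUN_STATE_MAX)
		return RUN_STATE_PAUSED;	/* defensive default */
	return (enum run_state)snap->run_state;
}
```

The key design point is that the loader trusts the saved state rather than resetting to RUN_STATE_RUNNING, which is why the series must also reserve RUN_STATE_GUEST_PANICKED during migration.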

 
  On Wed, Dec 12, 2012 at 02:13:43PM +0800, Hu Tao wrote:
   This series implements a new interface, kvm pv event, to notify host when
   some events happen in guest. Right now there is one supported event: guest
   panic.
   
   changes from v11:
   
 - add a new patch 'save/load cpu runstate'
 - fix a bug of null-dereference when no -machine option is supplied
 - reserve RUN_STATE_GUEST_PANICKED during migration
 - add doc of enable_pv_event option
 - disable reboot-on-panic if pv_event is on
   
   v11: http://lists.gnu.org/archive/html/qemu-devel/2012-10/msg04361.html
   
   Hu Tao (7):
 save/load cpu runstate
 update kernel headers
 add a new runstate: RUN_STATE_GUEST_PANICKED
 add a new qevent: QEVENT_GUEST_PANICKED
 introduce a new qom device to deal with panicked event
  allow the user to disable pv event support
 pv event: add document to describe the usage
   
   Wen Congyang (1):
 start vm after resetting it
   
block.h  |   2 +
docs/pv-event.txt|  17 
hw/kvm/Makefile.objs |   2 +-
hw/kvm/pv_event.c| 197 +++
hw/pc_piix.c |  11 +++
kvm-stub.c   |   4 +
kvm.h|   2 +
linux-headers/asm-x86/kvm_para.h |   1 +
linux-headers/linux/kvm_para.h   |   6 ++
migration.c  |   7 +-
monitor.c|   6 +-
monitor.h|   1 +
qapi-schema.json |   6 +-
qemu-config.c|   4 +
qemu-options.hx  |   3 +-
qmp.c|   5 +-
savevm.c |   1 +
sysemu.h |   2 +
vl.c |  52 ++-
19 files changed, 312 insertions(+), 17 deletions(-)
create mode 100644 docs/pv-event.txt
create mode 100644 hw/kvm/pv_event.c
   
   -- 
   1.8.0.1.240.ge8a1f5a
  --
  To unsubscribe from this list: send the line unsubscribe kvm in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html