date:20130508

Re: [PATCH v2] KVM: Fix kvm_irqfd_init initialization

2013-05-08 Thread Cornelia Huck

On Wed,  8 May 2013 10:57:29 +0800
Asias He as...@redhat.com wrote:

 Since commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when
 kvm_init() is called a second time (e.g. with both kvm-amd.ko and
 kvm-intel.ko), kvm_arch_init() fails with -EEXIST and kvm_irqfd_exit() is
 then called on the error handling path. As a result, the kvm_irqfd
 subsystem is left not ready.
 
 This patch fixes the following:
 
 BUG: unable to handle kernel NULL pointer dereference at   (null)
 IP: [81c0721e] _raw_spin_lock+0xe/0x30
 PGD 0
 Oops: 0002 [#1] SMP
 Modules linked in: vhost_net
 CPU 6
 Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. 
 OptiPlex 790/0V5HMK
 RIP: 0010:[81c0721e]  [81c0721e] _raw_spin_lock+0xe/0x30
 RSP: 0018:880221721cc8  EFLAGS: 00010046
 RAX: 0100 RBX: 88022dcc003f RCX: 880221734950
 RDX: 8802208f6ca8 RSI: 7fff RDI: 
 RBP: 880221721cc8 R08: 0002 R09: 0002
 R10: 7f7fd01087e0 R11: 0246 R12: 8802208f6ca8
 R13: 0080 R14: 880223e2a900 R15: 
 FS:  7f7fd38488e0() GS:88022dcc() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 00022309f000 CR4: 000427e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process qemu-system-x86 (pid: 4257, threadinfo 88022172, task 
 880222bd5640)
 Stack:
  880221721d08 810ac5c5 88022431dc00 0086
  0080 880223e2a900 8802208f6ca8 
  880221721d48 810ac8fe  880221734000
 Call Trace:
  [810ac5c5] __queue_work+0x45/0x2d0
  [810ac8fe] queue_work_on+0x8e/0xa0
  [810ac949] queue_work+0x19/0x20
  [81009b6b] irqfd_deactivate+0x4b/0x60
  [8100a69d] kvm_irqfd+0x39d/0x580
  [81007a27] kvm_vm_ioctl+0x207/0x5b0
  [810c9545] ? update_curr+0xf5/0x180
  [811b66e8] do_vfs_ioctl+0x98/0x550
  [810c1f5e] ? finish_task_switch+0x4e/0xe0
  [81c054aa] ? __schedule+0x2ea/0x710
  [811b6bf7] sys_ioctl+0x57/0x90
  [8140ae9e] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [81c0f602] system_call_fastpath+0x16/0x1b
 Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 
 c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00 f0 66 0f c1 07 89 
 c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f
 RIP  [81c0721e] _raw_spin_lock+0xe/0x30
 RSP 880221721cc8
 CR2: 
 ---[ end trace 13fb1e4b6e5ab21f ]---
 
 Signed-off-by: Asias He as...@redhat.com

Acked-by: Cornelia Huck cornelia.h...@de.ibm.com

 ---
  virt/kvm/kvm_main.c | 18 +-
  1 file changed, 13 insertions(+), 5 deletions(-)
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 8fd325a..85b93d2 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -3078,13 +3078,21 @@ int kvm_init(void *opaque, unsigned vcpu_size, 
 unsigned vcpu_align,
   int r;
   int cpu;
 
 - r = kvm_irqfd_init();
 - if (r)
 - goto out_irqfd;
   r = kvm_arch_init(opaque);
   if (r)
   goto out_fail;
 
 + /*
 +  * kvm_arch_init makes sure there's at most one caller
 +  * for architectures that support multiple implementations,
 +  * like intel and amd on x86.
 +  * kvm_arch_init must be called before kvm_irqfd_init to avoid creating
 +  * conflicts in case kvm is already setup for another implementation.
 +  */
 + r = kvm_irqfd_init();
 + if (r)
 + goto out_irqfd;
 +
   if (!zalloc_cpumask_var(cpus_hardware_enabled, GFP_KERNEL)) {
   r = -ENOMEM;
   goto out_free_0;
 @@ -3159,10 +3167,10 @@ out_free_1:
  out_free_0a:
   free_cpumask_var(cpus_hardware_enabled);
  out_free_0:
 - kvm_arch_exit();
 -out_fail:
   kvm_irqfd_exit();
  out_irqfd:
 + kvm_arch_exit();
 +out_fail:
   return r;
  }
  EXPORT_SYMBOL_GPL(kvm_init);

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
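The ordering problem described in the commit message above can be reproduced in a small userspace sketch. Everything here is a hypothetical stand-in (the toy_* functions and the two flags are not kernel code); it only assumes, as the commit message states, that the arch init step rejects a second caller with -EEXIST and that the irqfd state is shared between callers.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for shared state; not the kernel's real data. */
static bool arch_owner_registered; /* set by the first successful arch init */
static bool irqfd_ready;           /* shared irqfd state */

static int toy_arch_init(void)
{
    if (arch_owner_registered)
        return -EEXIST;            /* a second implementation is rejected */
    arch_owner_registered = true;
    return 0;
}

static void toy_arch_exit(void)  { arch_owner_registered = false; }
static int  toy_irqfd_init(void) { irqfd_ready = true; return 0; }
static void toy_irqfd_exit(void) { irqfd_ready = false; }

/* Old ordering: irqfd first, so a rejected second caller unwinds irqfd. */
static int toy_init_old(void)
{
    int r;

    r = toy_irqfd_init();
    if (r)
        goto out_irqfd;
    r = toy_arch_init();
    if (r)
        goto out_fail;
    return 0;

out_fail:
    toy_irqfd_exit();              /* tears down state the first caller uses */
out_irqfd:
    return r;
}

/* New ordering: arch first, so a rejected second caller touches nothing. */
static int toy_init_new(void)
{
    int r;

    r = toy_arch_init();
    if (r)
        goto out_fail;
    r = toy_irqfd_init();
    if (r)
        goto out_irqfd;
    return 0;

out_irqfd:
    toy_arch_exit();
out_fail:
    return r;
}

/* Reset the toy state between experiments. */
static void toy_reset(void)
{
    arch_owner_registered = false;
    irqfd_ready = false;
}
```

Calling toy_init_old() twice leaves irqfd_ready false even though the first call succeeded, while calling toy_init_new() twice leaves it true. The unwind labels mirror the patch: each label undoes only the steps that completed before the failing one.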

Re: [PATCH] vhost-test: Make vhost/test.c work

2013-05-08 Thread Asias He

On Tue, May 07, 2013 at 02:22:32PM +0300, Michael S. Tsirkin wrote:
 On Tue, May 07, 2013 at 02:52:45PM +0800, Asias He wrote:
  Fix it by:
  1) switching to use the new device specific fields per vq
  2) not including vhost.c, instead make vhost-test.ko depend on vhost.ko.
 
 Please split this up.
 1. make test work for 3.10
 2. make test work for 3.11
 
 thanks!

okay.

  ---
   drivers/vhost/test.c | 37 +
   1 file changed, 25 insertions(+), 12 deletions(-)
  
  diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
  index 1ee45bc..dc526eb 100644
  --- a/drivers/vhost/test.c
  +++ b/drivers/vhost/test.c
  @@ -18,7 +18,7 @@
   #include <linux/slab.h>
   
   #include "test.h"
  -#include "vhost.c"
  +#include "vhost.h"
   
   /* Max number of bytes transferred before requeueing the job.
* Using this limit prevents one virtqueue from starving others. */
  @@ -29,16 +29,20 @@ enum {
  VHOST_TEST_VQ_MAX = 1,
   };
   
  +struct vhost_test_virtqueue {
  +   struct vhost_virtqueue vq;
  +};
  +
 
 This isn't needed or useful. Drop above change pls and patch
 size will shrink.

The difference is:

 drivers/vhost/test.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

 drivers/vhost/test.c | 35 ---
 1 file changed, 24 insertions(+), 11 deletions(-)

which is not significant.

So I think it is better to code this the same way as we do in vhost-net and
vhost-scsi, which keeps the device-specific usage consistent.
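The pattern being argued for here, a per-device wrapper that embeds the generic virtqueue, can be sketched outside the kernel as follows. The toy_* types and the test_private field are hypothetical; they only illustrate how a generic core can receive an array of pointers while the device code recovers its own wrapper from such a pointer.

```c
#include <stddef.h>

#define TOY_VQ_MAX 2

/* Generic part that a core layer would operate on (hypothetical). */
struct toy_virtqueue {
    int num;
};

/* Device-specific wrapper; test_private is an invented example field. */
struct toy_test_virtqueue {
    struct toy_virtqueue vq;
    int test_private;
};

struct toy_test {
    struct toy_test_virtqueue vqs[TOY_VQ_MAX];
};

/* Fill the pointer array that would be handed to the generic core. */
static void toy_collect_vqs(struct toy_test *n, struct toy_virtqueue **out)
{
    for (int i = 0; i < TOY_VQ_MAX; i++)
        out[i] = &n->vqs[i].vq;
}

/* Recover the wrapper from a pointer to its embedded generic part. */
static struct toy_test_virtqueue *toy_to_test_vq(struct toy_virtqueue *vq)
{
    return (struct toy_test_virtqueue *)
        ((char *)vq - offsetof(struct toy_test_virtqueue, vq));
}
```

When the wrapper has no extra fields, as in the test module under discussion, the wrapper adds nothing functionally; the argument for it is only consistency with drivers that do carry per-queue private data.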

   struct vhost_test {
  struct vhost_dev dev;
  -   struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
  +   struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
   };
   
   /* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
   static void handle_vq(struct vhost_test *n)
   {
  -   struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
  +   struct vhost_virtqueue *vq = n->dev.vqs[VHOST_TEST_VQ];
  unsigned out, in;
  int head;
  size_t len, total_len = 0;
  @@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work *work)
   static int vhost_test_open(struct inode *inode, struct file *f)
   {
  struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
  +   struct vhost_virtqueue **vqs;
  struct vhost_dev *dev;
  int r;
   
  if (!n)
  return -ENOMEM;
   
  +   vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
  +   if (!vqs) {
  +   kfree(n);
  +   return -ENOMEM;
  +   }
  +
  dev = &n->dev;
  -   n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
  -   r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
  +   vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ].vq;
  +   n->vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
  +   r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
  if (r < 0) {
  kfree(n);
  return r;
  @@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test *n,
   
   static void vhost_test_stop(struct vhost_test *n, void **privatep)
   {
  -   *privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ);
  +   *privatep = vhost_test_stop_vq(n, &n->vqs[VHOST_TEST_VQ].vq);
   }
   
   static void vhost_test_flush_vq(struct vhost_test *n, int index)
   {
  -   vhost_poll_flush(&n->dev.vqs[index].poll);
  +   vhost_poll_flush(&n->vqs[index].vq.poll);
   }
   
   static void vhost_test_flush(struct vhost_test *n)
  @@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, 
  struct file *f)
  /* We do an extra flush before freeing memory,
   * since jobs can re-queue themselves. */
  vhost_test_flush(n);
  +   kfree(n->dev.vqs);
  kfree(n);
  return 0;
   }
  @@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, int 
  test)
   
  for (index = 0; index < n->dev.nvqs; ++index) {
  /* Verify that ring has been setup correctly. */
  -   if (!vhost_vq_access_ok(&n->vqs[index])) {
  +   if (!vhost_vq_access_ok(&n->vqs[index].vq)) {
  r = -EFAULT;
  goto err;
  }
  }
   
  for (index = 0; index < n->dev.nvqs; ++index) {
  -   vq = n->vqs + index;
  +   vq = &n->vqs[index].vq;
  mutex_lock(&vq->mutex);
  priv = test ? n : NULL;
   
  @@ -195,7 +208,7 @@ static long vhost_test_run(struct vhost_test *n, int 
  test)
  
  lockdep_is_held(&vq->mutex));
  rcu_assign_pointer(vq->private_data, priv);
   
  -   r = vhost_init_used(&n->vqs[index]);
  +   r = vhost_init_used(&n->vqs[index].vq);
   
  mutex_unlock(&vq->mutex);
   
  @@ -268,14 +281,14 @@ static long vhost_test_ioctl(struct file *f, unsigned 
  int ioctl,
  return -EFAULT;
  return vhost_test_run(n, test);
  case VHOST_GET_FEATURES:
  -   features = VHOST_NET_FEATURES;
  +   features =

[PATCH v2] vhost-test: Make vhost/test.c work

2013-05-08 Thread Asias He

Fix it by switching to use the new device specific fields per vq

Signed-off-by: Asias He as...@redhat.com
---

This is for 3.10.

 drivers/vhost/test.c | 35 ---
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..7b49d10 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -29,16 +29,20 @@ enum {
VHOST_TEST_VQ_MAX = 1,
 };
 
+struct vhost_test_virtqueue {
+   struct vhost_virtqueue vq;
+};
+
 struct vhost_test {
struct vhost_dev dev;
-   struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
+   struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
 };
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-   struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+   struct vhost_virtqueue *vq = n->dev.vqs[VHOST_TEST_VQ];
unsigned out, in;
int head;
size_t len, total_len = 0;
@@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work *work)
 static int vhost_test_open(struct inode *inode, struct file *f)
 {
struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
+   struct vhost_virtqueue **vqs;
struct vhost_dev *dev;
int r;
 
if (!n)
return -ENOMEM;
 
+   vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
+   if (!vqs) {
+   kfree(n);
+   return -ENOMEM;
+   }
+
dev = &n->dev;
-   n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-   r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+   vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ].vq;
+   n->vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
+   r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
if (r < 0) {
kfree(n);
return r;
@@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test *n,
 
 static void vhost_test_stop(struct vhost_test *n, void **privatep)
 {
-   *privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ);
+   *privatep = vhost_test_stop_vq(n, &n->vqs[VHOST_TEST_VQ].vq);
 }
 
 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-   vhost_poll_flush(&n->dev.vqs[index].poll);
+   vhost_poll_flush(&n->vqs[index].vq.poll);
 }
 
 static void vhost_test_flush(struct vhost_test *n)
@@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, struct 
file *f)
/* We do an extra flush before freeing memory,
 * since jobs can re-queue themselves. */
vhost_test_flush(n);
+   kfree(n->dev.vqs);
kfree(n);
return 0;
 }
@@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, int test)
 
for (index = 0; index < n->dev.nvqs; ++index) {
/* Verify that ring has been setup correctly. */
-   if (!vhost_vq_access_ok(&n->vqs[index])) {
+   if (!vhost_vq_access_ok(&n->vqs[index].vq)) {
r = -EFAULT;
goto err;
}
}
 
for (index = 0; index < n->dev.nvqs; ++index) {
-   vq = n->vqs + index;
+   vq = &n->vqs[index].vq;
mutex_lock(&vq->mutex);
priv = test ? n : NULL;
 
@@ -195,7 +208,7 @@ static long vhost_test_run(struct vhost_test *n, int test)

lockdep_is_held(&vq->mutex));
rcu_assign_pointer(vq->private_data, priv);
 
-   r = vhost_init_used(&n->vqs[index]);
+   r = vhost_init_used(&n->vqs[index].vq);
 
mutex_unlock(&vq->mutex);
 
@@ -268,14 +281,14 @@ static long vhost_test_ioctl(struct file *f, unsigned int 
ioctl,
return -EFAULT;
return vhost_test_run(n, test);
case VHOST_GET_FEATURES:
-   features = VHOST_NET_FEATURES;
+   features = VHOST_FEATURES;
if (copy_to_user(featurep, &features, sizeof features))
return -EFAULT;
return 0;
case VHOST_SET_FEATURES:
if (copy_from_user(&features, featurep, sizeof features))
return -EFAULT;
-   if (features & ~VHOST_NET_FEATURES)
+   if (features & ~VHOST_FEATURES)
return -EOPNOTSUPP;
return vhost_test_set_features(n, features);
case VHOST_RESET_OWNER:
-- 
1.8.1.4
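The VHOST_SET_FEATURES branch in the patch above rejects any request that contains a bit outside the supported mask. A minimal userspace sketch of that check follows; the TOY_* bit values and the toy_set_features helper are invented for illustration and are not the real vhost feature bits or API.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bits; the real vhost masks are defined elsewhere. */
#define TOY_F_A      (1ULL << 0)
#define TOY_F_B      (1ULL << 3)
#define TOY_FEATURES (TOY_F_A | TOY_F_B)

/*
 * Accept the requested feature set only if every requested bit is in the
 * supported mask; otherwise leave *acked untouched and report the error.
 */
static int toy_set_features(uint64_t requested, uint64_t *acked)
{
    if (requested & ~TOY_FEATURES)
        return -EOPNOTSUPP;
    *acked = requested;
    return 0;
}
```

The expression requested & ~TOY_FEATURES is nonzero exactly when at least one unsupported bit is set, which is why switching the mask from the net-specific one to the generic one changes which requests the test module accepts.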


Re: regression in v3.9? a guest stuck in BIOS if emulate_invalid_guest_state=Y

2013-05-08 Thread Jun'ichi Nomura

On 05/08/13 12:22, Jun'ichi Nomura wrote:
 Il 07/05/2013 14:06, Gleb Natapov ha scritto:
 What is the output of virsh qemu-monitor-command vm12 --hmp x/i $pc
 when it hangs?
 
 # virsh qemu-monitor-command vm12 --hmp x/4i \$pc
 0x000c06ca:  aam    $0xa
 0x000c06cc:  mov    %ax,%bx
 0x000c06ce:  mov    %bh,%al
 0x000c06d0:  aam    $0xa
 
 # virsh qemu-monitor-command vm12 --hmp x/8b \$pc
 000c06ca: 0xd4 0x0a 0x89 0xc3 0x88 0xf8 0xd4 0x0a

I could also reproduce the problem with the following:

# dd if=/dev/zero of=/root/empty.img bs=1M count=1
# /usr/libexec/qemu-kvm -enable-kvm -nographic -nodefconfig -nodefaults 
-chardev socket,id=cmon,host=localhost,port=,server,nowait -mon 
chardev=cmon,mode=readline -drive file=/root/empty.img -chardev stdio,id=ser0 
-device isa-serial,chardev=ser0

With the v3.8 kernel, the guest reaches the point of showing "No bootable
device" (as expected).
With the v3.9 kernel, no visible characters appear on the console.

The EIP of the stalled guest points to a different instruction than in the
previously reported case, though:

(qemu) info registers
info registers
EAX=f000e81b EBX=0130 ECX=fa2b EDX=031b
ESI=00ed EDI=0050 EBP= ESP=6eaa
EIP=0564 EFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0040 0400  00809300
CS =c000 000c  00809b00
SS =   00809300
DS =c000 000c  00809300
FS =   00809300
GS =   00809300
LDT=   8200
TR =   8b00
GDT= 000fc558 0037
IDT=  03ff
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2= DR3= 
DR6=0ff0 DR7=0400
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00= XMM01=
XMM02= XMM03=
XMM04= XMM05=
XMM06= XMM07=
(qemu) 
(qemu) x/8b $pc
x/8b $pc
000c0564: 0xd7 0x1f 0x24 0x7f 0x88 0xc4 0x88 0xd0
(qemu) 
(qemu) x/i $pc
x/i $pc
0x000c0564:  xlat   %ds:(%bx)

-- 
Jun'ichi Nomura, NEC Corporation
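For readers less familiar with the two instructions in the dumps above: the bytes 0xd4 0x0a decode as AAM with an immediate of 10, and 0xd7 decodes as XLAT. The sketch below models their architectural effect on a toy 16-bit register file. The toy_regs type and function names are invented, flags are not modelled, and nothing here is taken from the KVM emulator; it only illustrates what an emulator has to compute for these opcodes.

```c
#include <stdint.h>

/* Toy register file: only the registers these two instructions touch. */
struct toy_regs {
    uint16_t ax;
    uint16_t bx;
};

/*
 * AAM imm8: AH = AL / imm8, AL = AL % imm8.
 * imm must be nonzero; a real CPU raises a divide error for imm == 0.
 */
static void toy_aam(struct toy_regs *r, uint8_t imm)
{
    uint8_t al = r->ax & 0xff;

    r->ax = (uint16_t)((al / imm) << 8 | (al % imm));
}

/*
 * XLAT with 16-bit addressing: AL = byte at DS:(BX + zero-extended AL).
 * ds_base points at a 64 KiB buffer standing in for the DS segment.
 */
static void toy_xlat(struct toy_regs *r, const uint8_t *ds_base)
{
    uint16_t off = (uint16_t)(r->bx + (r->ax & 0xff));

    r->ax = (uint16_t)((r->ax & 0xff00) | ds_base[off]);
}
```

With AL = 123, toy_aam(&r, 10) leaves AH = 12 and AL = 3. The aam/mov/mov/aam sequence in the first dump therefore looks like a binary-to-decimal digit conversion.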

Re: [PATCH v2] vhost-test: Make vhost/test.c work

2013-05-08 Thread Michael S. Tsirkin

On Wed, May 08, 2013 at 03:24:33PM +0800, Asias He wrote:
 Fix it by switching to use the new device specific fields per vq
 
 Signed-off-by: Asias He as...@redhat.com
 ---
 
 This is for 3.10.
 
  drivers/vhost/test.c | 35 ---
  1 file changed, 24 insertions(+), 11 deletions(-)
 
 diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
 index 1ee45bc..7b49d10 100644
 --- a/drivers/vhost/test.c
 +++ b/drivers/vhost/test.c
 @@ -29,16 +29,20 @@ enum {
   VHOST_TEST_VQ_MAX = 1,
  };
  
 +struct vhost_test_virtqueue {
 + struct vhost_virtqueue vq;
 +};
 +

Well there are no test specific fields here,
so this structure is not needed. Here's what I queued:

---

vhost-test: fix up test module after API change

Recent vhost API changes broke vhost test module.
Update it to the new APIs.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index be65414..c2c3d91 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -38,7 +38,7 @@ struct vhost_test {
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-   struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+   struct vhost_virtqueue *vq = &n->vqs[VHOST_TEST_VQ];
unsigned out, in;
int head;
size_t len, total_len = 0;
@@ -102,6 +102,7 @@ static int vhost_test_open(struct inode *inode, struct file 
*f)
 {
struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
struct vhost_dev *dev;
+   struct vhost_virtqueue *vqs[VHOST_TEST_VQ_MAX];
int r;
 
if (!n)
@@ -109,7 +110,8 @@ static int vhost_test_open(struct inode *inode, struct file 
*f)
 
dev = &n->dev;
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-   r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+   vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
+   r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
if (r < 0) {
kfree(n);
return r;
@@ -140,7 +142,7 @@ static void vhost_test_stop(struct vhost_test *n, void 
**privatep)
 
 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-   vhost_poll_flush(&n->dev.vqs[index].poll);
+   vhost_poll_flush(&n->vqs[index].poll);
 }
 
 static void vhost_test_flush(struct vhost_test *n)
@@ -268,21 +270,21 @@ static long vhost_test_ioctl(struct file *f, unsigned int 
ioctl,
return -EFAULT;
return vhost_test_run(n, test);
case VHOST_GET_FEATURES:
-   features = VHOST_NET_FEATURES;
+   features = VHOST_FEATURES;
if (copy_to_user(featurep, &features, sizeof features))
return -EFAULT;
return 0;
case VHOST_SET_FEATURES:
if (copy_from_user(&features, featurep, sizeof features))
return -EFAULT;
-   if (features & ~VHOST_NET_FEATURES)
+   if (features & ~VHOST_FEATURES)
return -EOPNOTSUPP;
return vhost_test_set_features(n, features);
case VHOST_RESET_OWNER:
return vhost_test_reset_owner(n);
default:
mutex_lock(&n->dev.mutex);
-   r = vhost_dev_ioctl(&n->dev, ioctl, arg);
+   r = vhost_dev_ioctl(&n->dev, ioctl, argp);
vhost_test_flush(n);
mutex_unlock(&n->dev.mutex);
return r;

Re: [PATCH] vhost-test: Make vhost/test.c work

2013-05-08 Thread Michael S. Tsirkin

On Wed, May 08, 2013 at 03:14:58PM +0800, Asias He wrote:
 On Tue, May 07, 2013 at 02:22:32PM +0300, Michael S. Tsirkin wrote:
  On Tue, May 07, 2013 at 02:52:45PM +0800, Asias He wrote:
   Fix it by:
   1) switching to use the new device specific fields per vq
   2) not including vhost.c, instead make vhost-test.ko depend on vhost.ko.
  
  Please split this up.
  1. make test work for 3.10
  2. make test work for 3.11
  
  thanks!
 
 okay.
 
   ---
drivers/vhost/test.c | 37 +
1 file changed, 25 insertions(+), 12 deletions(-)
   
   diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
   index 1ee45bc..dc526eb 100644
   --- a/drivers/vhost/test.c
   +++ b/drivers/vhost/test.c
   @@ -18,7 +18,7 @@
#include linux/slab.h

#include test.h
   -#include vhost.c
   +#include vhost.h

/* Max number of bytes transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others. */
   @@ -29,16 +29,20 @@ enum {
 VHOST_TEST_VQ_MAX = 1,
};

   +struct vhost_test_virtqueue {
   + struct vhost_virtqueue vq;
   +};
   +
  
  This isn't needed or useful. Drop above change pls and patch
  size will shrink.
 
 The difference is:
 
  drivers/vhost/test.c | 23 ---
  1 file changed, 16 insertions(+), 7 deletions(-)
 
  drivers/vhost/test.c | 35 ---
  1 file changed, 24 insertions(+), 11 deletions(-)
 
 which is not significant.

I did it like this:
 test.c |   14 --
 1 file changed, 8 insertions(+), 6 deletions(-)


 So, I think it is better to code the same way as we do in vhost-net and
 vhost-scsi which makes the device specific usage more consistent.
 
struct vhost_test {
 struct vhost_dev dev;
   - struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
   + struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
};

/* Expects to be always run from workqueue - which acts as
 * read-size critical section for our kind of RCU. */
static void handle_vq(struct vhost_test *n)
{
   - struct vhost_virtqueue *vq = n-dev.vqs[VHOST_TEST_VQ];
   + struct vhost_virtqueue *vq = n-dev.vqs[VHOST_TEST_VQ];
 unsigned out, in;
 int head;
 size_t len, total_len = 0;
   @@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work *work)
static int vhost_test_open(struct inode *inode, struct file *f)
{
 struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
   + struct vhost_virtqueue **vqs;
 struct vhost_dev *dev;
 int r;

 if (!n)
 return -ENOMEM;

   + vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
   + if (!vqs) {
   + kfree(n);
   + return -ENOMEM;
   + }
   +
 dev = n-dev;
   - n-vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
   - r = vhost_dev_init(dev, n-vqs, VHOST_TEST_VQ_MAX);
   + vqs[VHOST_TEST_VQ] = n-vqs[VHOST_TEST_VQ].vq;
   + n-vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
   + r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 if (r  0) {
 kfree(n);
 return r;
   @@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test 
   *n,

static void vhost_test_stop(struct vhost_test *n, void **privatep)
{
   - *privatep = vhost_test_stop_vq(n, n-vqs + VHOST_TEST_VQ);
   + *privatep = vhost_test_stop_vq(n, n-vqs[VHOST_TEST_VQ].vq);
}

static void vhost_test_flush_vq(struct vhost_test *n, int index)
{
   - vhost_poll_flush(n-dev.vqs[index].poll);
   + vhost_poll_flush(n-vqs[index].vq.poll);
}

static void vhost_test_flush(struct vhost_test *n)
   @@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, 
   struct file *f)
 /* We do an extra flush before freeing memory,
  * since jobs can re-queue themselves. */
 vhost_test_flush(n);
   + kfree(n-dev.vqs);
 kfree(n);
 return 0;
}
   @@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, 
   int test)

 for (index = 0; index  n-dev.nvqs; ++index) {
 /* Verify that ring has been setup correctly. */
   - if (!vhost_vq_access_ok(n-vqs[index])) {
   + if (!vhost_vq_access_ok(n-vqs[index].vq)) {
 r = -EFAULT;
 goto err;
 }
 }

 for (index = 0; index  n-dev.nvqs; ++index) {
   - vq = n-vqs + index;
   + vq = n-vqs[index].vq;
 mutex_lock(vq-mutex);
 priv = test ? n : NULL;

   @@ -195,7 +208,7 @@ static long vhost_test_run(struct vhost_test *n, int 
   test)
 
   lockdep_is_held(vq-mutex));
 rcu_assign_pointer(vq-private_data, priv);

   - r = vhost_init_used(n-vqs[index]);
   + r = vhost_init_used(n-vqs[index].vq);

 mutex_unlock(vq-mutex);

   @@ -268,14 +281,14 @@ static long vhost_test_ioctl(struct file *f, 
   unsigned

Re: [PATCH] vhost-test: Make vhost/test.c work

2013-05-08 Thread Asias He

On Wed, May 08, 2013 at 10:59:03AM +0300, Michael S. Tsirkin wrote:
 On Wed, May 08, 2013 at 03:14:58PM +0800, Asias He wrote:
  On Tue, May 07, 2013 at 02:22:32PM +0300, Michael S. Tsirkin wrote:
   On Tue, May 07, 2013 at 02:52:45PM +0800, Asias He wrote:
Fix it by:
1) switching to use the new device specific fields per vq
2) not including vhost.c, instead make vhost-test.ko depend on vhost.ko.
   
   Please split this up.
   1. make test work for 3.10
   2. make test work for 3.11
   
   thanks!
  
  okay.
  
---
 drivers/vhost/test.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..dc526eb 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -18,7 +18,7 @@
 #include linux/slab.h
 
 #include test.h
-#include vhost.c
+#include vhost.h
 
 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
@@ -29,16 +29,20 @@ enum {
VHOST_TEST_VQ_MAX = 1,
 };
 
+struct vhost_test_virtqueue {
+   struct vhost_virtqueue vq;
+};
+
   
   This isn't needed or useful. Drop above change pls and patch
   size will shrink.
  
  The difference is:
  
   drivers/vhost/test.c | 23 ---
   1 file changed, 16 insertions(+), 7 deletions(-)
  
   drivers/vhost/test.c | 35 ---
   1 file changed, 24 insertions(+), 11 deletions(-)
  
  which is not significant.
 
 I did it like this:
  test.c |   14 --
  1 file changed, 8 insertions(+), 6 deletions(-)

The extra 8 insertions are for the vqs allocation, which can be dropped.

Well, if you prefer shorter code over consistency, go ahead.

 
  So, I think it is better to code the same way as we do in vhost-net and
  vhost-scsi which makes the device specific usage more consistent.
  
 struct vhost_test {
struct vhost_dev dev;
-   struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
+   struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
 };
 
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-   struct vhost_virtqueue *vq = n-dev.vqs[VHOST_TEST_VQ];
+   struct vhost_virtqueue *vq = n-dev.vqs[VHOST_TEST_VQ];
unsigned out, in;
int head;
size_t len, total_len = 0;
@@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work 
*work)
 static int vhost_test_open(struct inode *inode, struct file *f)
 {
struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
+   struct vhost_virtqueue **vqs;
struct vhost_dev *dev;
int r;
 
if (!n)
return -ENOMEM;
 
+   vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
+   if (!vqs) {
+   kfree(n);
+   return -ENOMEM;
+   }
+
dev = n-dev;
-   n-vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-   r = vhost_dev_init(dev, n-vqs, VHOST_TEST_VQ_MAX);
+   vqs[VHOST_TEST_VQ] = n-vqs[VHOST_TEST_VQ].vq;
+   n-vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
+   r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
if (r  0) {
kfree(n);
return r;
@@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test 
*n,
 
 static void vhost_test_stop(struct vhost_test *n, void **privatep)
 {
-   *privatep = vhost_test_stop_vq(n, n-vqs + VHOST_TEST_VQ);
+   *privatep = vhost_test_stop_vq(n, n-vqs[VHOST_TEST_VQ].vq);
 }
 
 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-   vhost_poll_flush(n-dev.vqs[index].poll);
+   vhost_poll_flush(n-vqs[index].vq.poll);
 }
 
 static void vhost_test_flush(struct vhost_test *n)
@@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, 
struct file *f)
/* We do an extra flush before freeing memory,
 * since jobs can re-queue themselves. */
vhost_test_flush(n);
+   kfree(n-dev.vqs);
kfree(n);
return 0;
 }
@@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, 
int test)
 
for (index = 0; index  n-dev.nvqs; ++index) {
/* Verify that ring has been setup correctly. */
-   if (!vhost_vq_access_ok(n-vqs[index])) {
+   if (!vhost_vq_access_ok(n-vqs[index].vq)) {
r = -EFAULT;
goto err;
}
}
 
for (index

Re: [PATCH v2] vhost-test: Make vhost/test.c work

2013-05-08 Thread Asias He

On Wed, May 08, 2013 at 10:56:19AM +0300, Michael S. Tsirkin wrote:
 On Wed, May 08, 2013 at 03:24:33PM +0800, Asias He wrote:
  Fix it by switching to use the new device specific fields per vq
  
  Signed-off-by: Asias He as...@redhat.com
  ---
  
  This is for 3.10.
  
   drivers/vhost/test.c | 35 ---
   1 file changed, 24 insertions(+), 11 deletions(-)
  
  diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
  index 1ee45bc..7b49d10 100644
  --- a/drivers/vhost/test.c
  +++ b/drivers/vhost/test.c
  @@ -29,16 +29,20 @@ enum {
  VHOST_TEST_VQ_MAX = 1,
   };
   
  +struct vhost_test_virtqueue {
  +   struct vhost_virtqueue vq;
  +};
  +
 
 Well there are no test specific fields here,
 so this structure is not needed. Here's what I queued:

Could you push the queue to your git repo?

 ---
 
 vhost-test: fix up test module after API change
 
 Recent vhost API changes broke vhost test module.
 Update it to the new APIs.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 ---
 
 diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
 index be65414..c2c3d91 100644
 --- a/drivers/vhost/test.c
 +++ b/drivers/vhost/test.c
 @@ -38,7 +38,7 @@ struct vhost_test {
   * read-size critical section for our kind of RCU. */
  static void handle_vq(struct vhost_test *n)
  {
 - struct vhost_virtqueue *vq = n-dev.vqs[VHOST_TEST_VQ];
 + struct vhost_virtqueue *vq = n-vqs[VHOST_TEST_VQ];
   unsigned out, in;
   int head;
   size_t len, total_len = 0;
 @@ -102,6 +102,7 @@ static int vhost_test_open(struct inode *inode, struct 
 file *f)
  {
   struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
   struct vhost_dev *dev;
 + struct vhost_virtqueue *vqs[VHOST_TEST_VQ_MAX];
   int r;
  
   if (!n)
 @@ -109,7 +110,8 @@ static int vhost_test_open(struct inode *inode, struct 
 file *f)
  
   dev = n-dev;
   n-vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
 - r = vhost_dev_init(dev, n-vqs, VHOST_TEST_VQ_MAX);
 + vqs[VHOST_TEST_VQ] = n-vqs[VHOST_TEST_VQ];
 + r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
   if (r  0) {
   kfree(n);
   return r;
 @@ -140,7 +142,7 @@ static void vhost_test_stop(struct vhost_test *n, void 
 **privatep)
  
  static void vhost_test_flush_vq(struct vhost_test *n, int index)
  {
 - vhost_poll_flush(n-dev.vqs[index].poll);
 + vhost_poll_flush(n-vqs[index].poll);
  }
  
  static void vhost_test_flush(struct vhost_test *n)
 @@ -268,21 +270,21 @@ static long vhost_test_ioctl(struct file *f, unsigned 
 int ioctl,
   return -EFAULT;
   return vhost_test_run(n, test);
   case VHOST_GET_FEATURES:
 - features = VHOST_NET_FEATURES;
 + features = VHOST_FEATURES;
   if (copy_to_user(featurep, features, sizeof features))
   return -EFAULT;
   return 0;
   case VHOST_SET_FEATURES:
   if (copy_from_user(features, featurep, sizeof features))
   return -EFAULT;
 - if (features  ~VHOST_NET_FEATURES)
 + if (features  ~VHOST_FEATURES)
   return -EOPNOTSUPP;
   return vhost_test_set_features(n, features);
   case VHOST_RESET_OWNER:
   return vhost_test_reset_owner(n);
   default:
   mutex_lock(n-dev.mutex);
 - r = vhost_dev_ioctl(n-dev, ioctl, arg);
 + r = vhost_dev_ioctl(n-dev, ioctl, argp);
   vhost_test_flush(n);
   mutex_unlock(n-dev.mutex);
   return r;

-- 
Asias

Re: [PATCH v2] vhost-test: Make vhost/test.c work

2013-05-08 Thread Michael S. Tsirkin

On Wed, May 08, 2013 at 04:17:19PM +0800, Asias He wrote:
 On Wed, May 08, 2013 at 10:56:19AM +0300, Michael S. Tsirkin wrote:
  On Wed, May 08, 2013 at 03:24:33PM +0800, Asias He wrote:
   Fix it by switching to use the new device specific fields per vq
   
   Signed-off-by: Asias He as...@redhat.com
   ---
   
   This is for 3.10.
   
drivers/vhost/test.c | 35 ---
1 file changed, 24 insertions(+), 11 deletions(-)
   
   diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
   index 1ee45bc..7b49d10 100644
   --- a/drivers/vhost/test.c
   +++ b/drivers/vhost/test.c
   @@ -29,16 +29,20 @@ enum {
 VHOST_TEST_VQ_MAX = 1,
};

   +struct vhost_test_virtqueue {
   + struct vhost_virtqueue vq;
   +};
   +
  
  Well there are no test specific fields here,
  so this structure is not needed. Here's what I queued:
 
 Could you push the queue to your git repo ?

done
branch vhost

  ---
  
  vhost-test: fix up test module after API change
  
  Recent vhost API changes broke vhost test module.
  Update it to the new APIs.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  ---
  
  diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
  index be65414..c2c3d91 100644
  --- a/drivers/vhost/test.c
  +++ b/drivers/vhost/test.c
  @@ -38,7 +38,7 @@ struct vhost_test {
* read-size critical section for our kind of RCU. */
   static void handle_vq(struct vhost_test *n)
   {
  -   struct vhost_virtqueue *vq = n-dev.vqs[VHOST_TEST_VQ];
  +   struct vhost_virtqueue *vq = n-vqs[VHOST_TEST_VQ];
  unsigned out, in;
  int head;
  size_t len, total_len = 0;
  @@ -102,6 +102,7 @@ static int vhost_test_open(struct inode *inode, struct 
  file *f)
   {
  struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
  struct vhost_dev *dev;
  +   struct vhost_virtqueue *vqs[VHOST_TEST_VQ_MAX];
  int r;
   
  if (!n)
  @@ -109,7 +110,8 @@ static int vhost_test_open(struct inode *inode, struct 
  file *f)
   
  dev = n-dev;
  n-vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
  -   r = vhost_dev_init(dev, n-vqs, VHOST_TEST_VQ_MAX);
  +   vqs[VHOST_TEST_VQ] = n-vqs[VHOST_TEST_VQ];
  +   r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
  if (r  0) {
  kfree(n);
  return r;
  @@ -140,7 +142,7 @@ static void vhost_test_stop(struct vhost_test *n, void **privatep)
   
   static void vhost_test_flush_vq(struct vhost_test *n, int index)
   {
  -   vhost_poll_flush(&n->dev.vqs[index].poll);
  +   vhost_poll_flush(&n->vqs[index].poll);
   }
   
   static void vhost_test_flush(struct vhost_test *n)
  @@ -268,21 +270,21 @@ static long vhost_test_ioctl(struct file *f, unsigned int ioctl,
  return -EFAULT;
  return vhost_test_run(n, test);
  case VHOST_GET_FEATURES:
  -   features = VHOST_NET_FEATURES;
  +   features = VHOST_FEATURES;
  if (copy_to_user(featurep, &features, sizeof features))
  return -EFAULT;
  return 0;
  case VHOST_SET_FEATURES:
  if (copy_from_user(&features, featurep, sizeof features))
  return -EFAULT;
  -   if (features & ~VHOST_NET_FEATURES)
  +   if (features & ~VHOST_FEATURES)
  return -EOPNOTSUPP;
  return vhost_test_set_features(n, features);
  case VHOST_RESET_OWNER:
  return vhost_test_reset_owner(n);
  default:
  mutex_lock(&n->dev.mutex);
  -   r = vhost_dev_ioctl(&n->dev, ioctl, arg);
  +   r = vhost_dev_ioctl(&n->dev, ioctl, argp);
  vhost_test_flush(n);
  mutex_unlock(&n->dev.mutex);
  return r;
 
 -- 
 Asias
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: regression in v3.9? a guest stuck in BIOS if emulate_invalid_guest_state=Y

2013-05-08 Thread Paolo Bonzini

On 08/05/2013 09:34, Jun'ichi Nomura wrote:
 On 05/08/13 12:22, Jun'ichi Nomura wrote:
 On 07/05/2013 14:06, Gleb Natapov wrote:
 What is the output of virsh qemu-monitor-command vm12 --hmp x/i $pc
 when it hangs?

 # virsh qemu-monitor-command vm12 --hmp x/4i \$pc
 0x000c06ca:  aam    $0xa
 0x000c06cc:  mov    %ax,%bx
 0x000c06ce:  mov    %bh,%al
 0x000c06d0:  aam    $0xa

 # virsh qemu-monitor-command vm12 --hmp x/8b \$pc
 000c06ca: 0xd4 0x0a 0x89 0xc3 0x88 0xf8 0xd4 0x0a

 (qemu) x/8b $pc
 x/8b $pc
 000c0564: 0xd7 0x1f 0x24 0x7f 0x88 0xc4 0x88 0xd0
 (qemu) 
 (qemu) x/i $pc
 x/i $pc
 0x000c0564:  xlat   %ds:(%bx)

Both of these sequences are found in sgabios.  The second goes on as
follows:

  popw %ds
  andb $0x7f, %al
  movb %al, %ah
  movb %dl, %al

Thanks for the report!

Paolo

Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to Guest

2013-05-08 Thread tiejun.chen

On 05/08/2013 05:20 PM, Caraman Mihai Claudiu-B02008 wrote:

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
Behalf Of tiejun.chen
Sent: Wednesday, May 08, 2013 4:54 AM
To: Wood Scott-B07421
Cc: ag...@suse.de; kvm-...@vger.kernel.org; kvm@vger.kernel.org;
linuxppc-...@lists.ozlabs.org
Subject: Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to
Guest

On 05/08/2013 07:40 AM, Scott Wood wrote:

On 05/07/2013 06:06:30 AM, Tiejun Chen wrote:

We also can direct ISI exception to Guest like DSI.

Signed-off-by: Tiejun Chen tiejun.c...@windriver.com
---
  arch/powerpc/kvm/booke_emulate.c |3 +++
  arch/powerpc/kvm/e500mc.c|3 ++-
  2 files changed, 5 insertions(+), 1 deletion(-)

Are you seeing a real performance improvement from this?  This will

interfere

No. But after we reduce the exit to host, shouldn't this improve
performance?

We lose some flexibility with this, so it makes sense only if we gain
measurable improvements.

Sounds like we have much more work to do.

somewhat with using the VF bit, if we were to ever do so, since VF only

affects

Sorry, what is the VF you mentioned?

VF stands for virtualization fault (see MAS8[VF]), and we may use it for
virtualized

I almost forgot this point :)

MMIO. The hypervisor should deny execute access on pages marked with VF. 
Accordingly
in this case guest ISI exceptions should be handled by the hypervisor.

Thanks for your information.

Tiejun


Re: Kernel 3.9 - can't boot qemu with accel=kvm _and_ networking enabled

2013-05-08 Thread Paolo Bonzini

 Paolo,
 
 The full command line is:
 qemu-system-x86_64 -machine accel=kvm -m 1024m  \
  -net tap -net nic \
  -drive file=/dev/zpool/testsrv,index=0,cache=writethrough \
  -k en-us \
  -no-kvm-irqchip \
  -vga cirrus
 
 I've tried every combination of -net options, but the result is always
 the same. I think this is somehow related to
 http://article.gmane.org/gmane.comp.emulators.kvm.devel/109461, as
 setting emulate_invalid_guest_state=0 solves the problem. However, I'm
 not aware of any consequences of this change.

Actually, the other bug involves sgabios and you are not using it.
Please try executing the following commands from the monitor (you can
use -monitor stdio to make cut-and-paste simpler):

   x/8i \$pc
   x/64b \$pc

and include the output in the reply to this message.

Thanks,

Paolo

[PATCH] virtio-balloon spec: rework VIRTIO_BALLOON_F_MUST_TELL_HOST feature, support silent deflation

2013-05-08 Thread Paolo Bonzini

The idea of the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is to let drivers
skip usage of the deflate queue when leaking the balloon (silent
deflation).  Guests may benefit from silent deflate by aggressively
inflating the balloon; they know that they will be able to use ballooned
pages without issuing a (blocking) request to the device.

The problem is that this feature is a negative feature: if
set, the guest _may not_ use ballooned pages directly.  Negative features
are not safe against migration; here is an explanation why this is so.

For a positive feature, migration is possible if the destination
supports it, or the source didn't set it:

dest support  source set  ok?
     T            T        T
     T            F        T
     F            T        F
     F            F        T

For a negative feature, migration is possible if the destination
does not support it, or the source set it:

dest support  source set  ok?
     T            T        T
     T            F        F
     F            T        T
     F            F        T

However, the F/T line violates the virtio specification because the
negotiated features are supposed to be the AND of the device-
and driver-supported features.
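
The two tables can be checked mechanically. The following is only a minimal Python sketch, not part of the patch: the function names are made up for illustration, and the predicates are read directly off the table rows rather than taken from any virtio implementation.

```python
from itertools import product


def positive_ok(dest_supports: bool, source_set: bool) -> bool:
    """Positive feature: fails only when the source negotiated a feature
    that the destination cannot offer."""
    return dest_supports or not source_set


def negative_ok(dest_supports: bool, source_set: bool) -> bool:
    """Negative feature, as in the second table: fails only when the
    destination offers (that is, requires) it and the source left it unset."""
    return (not dest_supports) or source_set


def negotiated(device_offers: bool, driver_accepts: bool) -> bool:
    """virtio rule: negotiated features are the AND of both sides."""
    return device_offers and driver_accepts


def table(rule):
    """Evaluate a rule in the row order used above: T/T, T/F, F/T, F/F."""
    return [rule(d, s) for d, s in product((True, False), repeat=2)]


if __name__ == "__main__":
    print("positive:", table(positive_ok))
    print("negative:", table(negative_ok))
    # F/T row of the negative table: migration is nominally allowed, yet the
    # destination device cannot end up with the feature negotiated.
    print("F/T negotiable on destination:", negotiated(False, True))
```

The last line shows the conflict described above: the negative-feature rule accepts the F/T row, while the AND rule says the destination can never have that feature negotiated.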

Furthermore, this assumes that the destination host knows which features
are positive and which are negative, which obviously cannot be the
case in general.  (The original spec assumed that every device supports
VIRTIO_BALLOON_F_MUST_TELL_HOST, but this was not explicitly documented
and in practice it turns out not to be the case).

Not all is lost, however.  First, all known device implementations support
silent deflation, hence they do not negotiate the feature.  We are thus
somewhat free to redefine what the host should do about this feature.

Second, by chance, coincidence or an evil plot, the only known driver
that does not negotiate VIRTIO_BALLOON_F_MUST_TELL_HOST is also using
pages before telling the host.  Thus, even though the feature used to be
just for communication from the host, known drivers are really using it
to communicate in the other direction, as if the feature was named
VIRTIO_BALLOON_F_GUEST_TELLS_HOST.

Adjust the spec to conform, and add a new feature bit for the host to
tell the drivers if silent deflation is actually supported.  With this
new feature bit, the host can distinguish all three cases: will never
do silent deflation, will do silent deflation if available, will always
do silent deflation (as in the above buggy driver).
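
As a rough illustration of how a host could tell the three cases apart, here is a small Python sketch. The bit positions (0 and 2) come from the spec change in this patch; the mapping from acknowledged bits to behaviour is an assumed reading of the description above, and the function and label names are hypothetical.

```python
GUEST_TELLS_HOST = 1 << 0   # feature bit 0 in the patched spec
SILENT_DEFLATE = 1 << 2     # feature bit 2, introduced by this patch


def deflate_behaviour(acked: int) -> str:
    """Classify a driver from the feature bits it acknowledged.

    Assumed mapping, not quoted from the spec:
    - bit 0 not acked: driver uses pages before telling the host
    - bit 2 acked:     driver deflates silently because the host offered it
    - otherwise:       driver always tells the host first
    """
    if not acked & GUEST_TELLS_HOST:
        return "always silent"
    if acked & SILENT_DEFLATE:
        return "silent if available"
    return "never silent"


if __name__ == "__main__":
    for bits in (0, GUEST_TELLS_HOST, GUEST_TELLS_HOST | SILENT_DEFLATE):
        print(bin(bits), "->", deflate_behaviour(bits))
```

A driver can only acknowledge bit 2 when the host offered it, which is why that case is labelled "if available".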

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 virtio-spec.lyx | 264 ++--
 1 file changed, 258 insertions(+), 6 deletions(-)

diff --git a/virtio-spec.lyx b/virtio-spec.lyx
index 73e22e7..033362f 100644
--- a/virtio-spec.lyx
+++ b/virtio-spec.lyx
@@ -63,7 +63,7 @@
 \author -385801441 Cornelia Huck cornelia.h...@de.ibm.com
 \author 460276516 Dmitry Fleytman dfley...@redhat.com
 \author 1112500848 Rusty Russell ru...@rustcorp.com.au
-\author 1531152142 Paolo Bonzini,,, 
+\author 1531152142 Paolo Bonzini pbonz...@redhat.com
 \author 1717892615 Alexey Zaytsev,,, 
 \author 1986246365 Michael S. Tsirkin 
 \end_header
@@ -7179,11 +7179,49 @@ bits
 
 \begin_deeper
 \begin_layout Description
-VIRTIO_BALLOON_F_MUST_TELL_HOST
+VIRTIO_BALLOON_F_
+\change_deleted 1531152142 1347020601
+MUST
+\change_inserted 1531152142 1347020602
+GUEST
+\change_unchanged
+_TELL
+\change_inserted 1531152142 1368004486
+S
+\change_unchanged
+_HOST
 \begin_inset space ~
 \end_inset
 
-(0) Host must be told before pages from the balloon are used.
+(0) 
+\change_deleted 1531152142 1347020625
+Host must be told
+\change_inserted 1531152142 1347020617
+Guest will tell host
+\change_unchanged
+ before pages from the balloon are used.
+
+\change_inserted 1531152142 1368005603
+ The host should always propose this feature.
+\begin_inset Foot
+status open
+
+\begin_layout Plain Layout
+
+\change_inserted 1531152142 1347022389
+This feature used to be named VIRTIO_BALLOON_F_\SpecialChar \-
+MUST_TELL_HOST.
+ However, after a few years it was observed that drivers were not using
+ it as specified.
+ The virtio-balloon spec was then adjusted to what the drivers had been
+ doing.
+\end_layout
+
+\end_inset
+
+
+\change_unchanged
+
 \end_layout
 
 \begin_layout Description
@@ -7192,6 +7230,20 @@ VIRTIO_BALLOON_F_STATS_VQ
 \end_inset
 
 (1) A virtqueue for reporting guest memory statistics is present.
+\change_inserted 1531152142 1347020627
+
+\end_layout
+
+\begin_layout Description
+
+\change_inserted 1531152142 1347020648
+VIRTIO_BALLOON_F_SILENT_DEFLATE
+\begin_inset space ~
+\end_inset
+
+(2) Guest does not need to tell host before pages from the balloon are used.
+\change_unchanged
+
 \end_layout
 
 \end_deeper
@@ -7342,9 +7394,27 @@ The driver constructs an array of addresses of memory 
pages it

Re: [PATCH v2] KVM: Fix kvm_irqfd_init initialization

2013-05-08 Thread Gleb Natapov

On Wed, May 08, 2013 at 10:57:29AM +0800, Asias He wrote:
 In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when
 kvm_init() is called the second time (e.g. kvm-amd.ko and kvm-intel.ko),
 kvm_arch_init() will fail with -EEXIST, then kvm_irqfd_exit() will be
 called on the error handling path. This way, the kvm_irqfd system will
 not be ready.
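
The ordering problem can be reproduced with a toy model. This is a minimal Python sketch, not kernel code: the irqfd subsystem is reduced to a single boolean, and `KvmState`, `kvm_init_old`, `kvm_init_new` and `load_both` are hypothetical names used only for illustration.

```python
import errno


class KvmState:
    """Global state shared by all vendor modules (toy model)."""

    def __init__(self):
        self.arch_owner = None     # vendor module that registered first
        self.irqfd_ready = False   # stands in for the irqfd subsystem being set up


def kvm_arch_init(state, module):
    """Only one implementation may register; later callers get -EEXIST."""
    if state.arch_owner is not None:
        return -errno.EEXIST
    state.arch_owner = module
    return 0


def kvm_init_old(state, module):
    """Order before the fix: irqfd first, then the arch check."""
    state.irqfd_ready = True           # kvm_irqfd_init()
    r = kvm_arch_init(state, module)
    if r:
        state.irqfd_ready = False      # kvm_irqfd_exit() on the error path
    return r


def kvm_init_new(state, module):
    """Order after the fix: arch check first, irqfd only on success."""
    r = kvm_arch_init(state, module)
    if r:
        return r                       # nothing irqfd-related to undo
    state.irqfd_ready = True           # kvm_irqfd_init()
    return 0


def load_both(kvm_init):
    """Load two vendor modules in a row and report the resulting state."""
    state = KvmState()
    results = [kvm_init(state, m) for m in ("kvm-intel", "kvm-amd")]
    return results, state.irqfd_ready


if __name__ == "__main__":
    print("old order:", load_both(kvm_init_old))
    print("new order:", load_both(kvm_init_new))
```

In both orders the second module fails with -EEXIST, but only the old order leaves the first module without a working irqfd subsystem.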
 
 This patch fixes the following:
 
Applied, thanks.

 BUG: unable to handle kernel NULL pointer dereference at   (null)
 IP: [81c0721e] _raw_spin_lock+0xe/0x30
 PGD 0
 Oops: 0002 [#1] SMP
 Modules linked in: vhost_net
 CPU 6
 Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. 
 OptiPlex 790/0V5HMK
 RIP: 0010:[81c0721e]  [81c0721e] _raw_spin_lock+0xe/0x30
 RSP: 0018:880221721cc8  EFLAGS: 00010046
 RAX: 0100 RBX: 88022dcc003f RCX: 880221734950
 RDX: 8802208f6ca8 RSI: 7fff RDI: 
 RBP: 880221721cc8 R08: 0002 R09: 0002
 R10: 7f7fd01087e0 R11: 0246 R12: 8802208f6ca8
 R13: 0080 R14: 880223e2a900 R15: 
 FS:  7f7fd38488e0() GS:88022dcc() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 00022309f000 CR4: 000427e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process qemu-system-x86 (pid: 4257, threadinfo 88022172, task 
 880222bd5640)
 Stack:
  880221721d08 810ac5c5 88022431dc00 0086
  0080 880223e2a900 8802208f6ca8 
  880221721d48 810ac8fe  880221734000
 Call Trace:
  [810ac5c5] __queue_work+0x45/0x2d0
  [810ac8fe] queue_work_on+0x8e/0xa0
  [810ac949] queue_work+0x19/0x20
  [81009b6b] irqfd_deactivate+0x4b/0x60
  [8100a69d] kvm_irqfd+0x39d/0x580
  [81007a27] kvm_vm_ioctl+0x207/0x5b0
  [810c9545] ? update_curr+0xf5/0x180
  [811b66e8] do_vfs_ioctl+0x98/0x550
  [810c1f5e] ? finish_task_switch+0x4e/0xe0
  [81c054aa] ? __schedule+0x2ea/0x710
  [811b6bf7] sys_ioctl+0x57/0x90
  [8140ae9e] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [81c0f602] system_call_fastpath+0x16/0x1b
 Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 
 c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00 f0 66 0f c1 07 89 
 c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f
 RIP  [81c0721e] _raw_spin_lock+0xe/0x30
 RSP 880221721cc8
 CR2: 
 ---[ end trace 13fb1e4b6e5ab21f ]---
 
 Signed-off-by: Asias He as...@redhat.com
 ---
  virt/kvm/kvm_main.c | 18 +-
  1 file changed, 13 insertions(+), 5 deletions(-)
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index 8fd325a..85b93d2 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -3078,13 +3078,21 @@ int kvm_init(void *opaque, unsigned vcpu_size, 
 unsigned vcpu_align,
   int r;
   int cpu;
  
 - r = kvm_irqfd_init();
 - if (r)
 - goto out_irqfd;
   r = kvm_arch_init(opaque);
   if (r)
   goto out_fail;
  
 + /*
 +  * kvm_arch_init makes sure there's at most one caller
 +  * for architectures that support multiple implementations,
 +  * like intel and amd on x86.
 +  * kvm_arch_init must be called before kvm_irqfd_init to avoid creating
 +  * conflicts in case kvm is already setup for another implementation.
 +  */
 + r = kvm_irqfd_init();
 + if (r)
 + goto out_irqfd;
 +
    if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL)) {
   r = -ENOMEM;
   goto out_free_0;
 @@ -3159,10 +3167,10 @@ out_free_1:
  out_free_0a:
   free_cpumask_var(cpus_hardware_enabled);
  out_free_0:
 - kvm_arch_exit();
 -out_fail:
   kvm_irqfd_exit();
  out_irqfd:
 + kvm_arch_exit();
 +out_fail:
   return r;
  }
  EXPORT_SYMBOL_GPL(kvm_init);
 -- 
 1.8.1.4

--
Gleb.

Re: KVM: x86: fix maintenance of guest/host xcr0 state

2013-05-08 Thread Gleb Natapov

On Mon, Apr 15, 2013 at 11:30:13PM -0300, Marcelo Tosatti wrote:
 
 ** Untested **.
 
 Emulation of xcr0 writes zeroes the guest_xcr0_loaded variable so that
 the subsequent VM-entry reloads the CPU's xcr0 with the guest's xcr0 value.
 
 However, this is incorrect because the guest_xcr0_loaded variable is
 read to decide whether to reload the host's xcr0.
 
 In case the vcpu thread is scheduled out after the guest_xcr0_loaded = 0
 assignment, and the scheduler decides to preload the FPU:
 
 switch_to
 {
   __switch_to
     __math_state_restore
       restore_fpu_checking
         fpu_restore_checking
           if (use_xsave())
             fpu_xrstor_checking
               xrstor64 with CPU's xcr0 == guest's xcr0
 
 Fix by properly restoring the host's xcr0 during emulation of xcr0 writes.
 
 Analyzed-by: Ulrich Obergfell uober...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
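
The stale-flag problem described above can be modelled in a few lines. This is a simplified Python sketch under stated assumptions, not the kernel implementation: the CR4.OSXSAVE check is omitted, the xcr0 values are arbitrary, and the put helper is assumed to run when the vcpu thread is scheduled out. The helper names mirror those in the patch; `Cpu`, `Vcpu`, `set_xcr_old`, `set_xcr_new` and `host_view_after_write` are made up for illustration.

```python
HOST_XCR0 = 0x7   # arbitrary example value for the host's xcr0


class Cpu:
    def __init__(self):
        self.xcr0 = HOST_XCR0          # value currently in the hardware register


class Vcpu:
    def __init__(self, xcr0):
        self.xcr0 = xcr0               # guest's view of xcr0
        self.guest_xcr0_loaded = False


def kvm_load_guest_xcr0(cpu, vcpu):
    if not vcpu.guest_xcr0_loaded:
        cpu.xcr0 = vcpu.xcr0
        vcpu.guest_xcr0_loaded = True


def kvm_put_guest_xcr0(cpu, vcpu):
    if vcpu.guest_xcr0_loaded:
        if vcpu.xcr0 != HOST_XCR0:
            cpu.xcr0 = HOST_XCR0
        vcpu.guest_xcr0_loaded = False


def set_xcr_old(cpu, vcpu, value):
    """Before the fix: only clear the flag, leaving the guest value in the CPU."""
    vcpu.xcr0 = value
    vcpu.guest_xcr0_loaded = False


def set_xcr_new(cpu, vcpu, value):
    """After the fix: restore the host value first, then record the new guest value."""
    kvm_put_guest_xcr0(cpu, vcpu)
    vcpu.xcr0 = value


def host_view_after_write(set_xcr):
    """Return the hardware xcr0 the host sees once the vcpu is scheduled out."""
    cpu, vcpu = Cpu(), Vcpu(xcr0=0x3)
    kvm_load_guest_xcr0(cpu, vcpu)     # VM entry loads the guest value
    set_xcr(cpu, vcpu, 0x1)            # emulated guest write to xcr0
    kvm_put_guest_xcr0(cpu, vcpu)      # vcpu thread is scheduled out
    return cpu.xcr0


if __name__ == "__main__":
    print("old:", hex(host_view_after_write(set_xcr_old)))
    print("new:", hex(host_view_after_write(set_xcr_new)))
```

With the old write path the flag is already clear at schedule-out, so the put helper does nothing and the host keeps running with the guest's previous xcr0.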
 
Applied, thanks.

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 999d124..222926a 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -555,6 +555,25 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
  }
  EXPORT_SYMBOL_GPL(kvm_lmsw);
  
 +static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 +{
 + if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
 + !vcpu->guest_xcr0_loaded) {
 + /* kvm_set_xcr() also depends on this */
 + xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 + vcpu->guest_xcr0_loaded = 1;
 + }
 +}
 +
 +static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
 +{
 + if (vcpu->guest_xcr0_loaded) {
 + if (vcpu->arch.xcr0 != host_xcr0)
 + xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
 + vcpu->guest_xcr0_loaded = 0;
 + }
 +}
 +
  int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
  {
   u64 xcr0;
 @@ -571,8 +590,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
   return 1;
   if (xcr0 & ~host_xcr0)
   return 1;
 + kvm_put_guest_xcr0(vcpu);
   vcpu->arch.xcr0 = xcr0;
 - vcpu->guest_xcr0_loaded = 0;
   return 0;
  }
  
 @@ -5600,25 +5619,6 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
   }
  }
  
 -static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 -{
 - if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
 - !vcpu->guest_xcr0_loaded) {
 - /* kvm_set_xcr() also depends on this */
 - xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 - vcpu->guest_xcr0_loaded = 1;
 - }
 -}
 -
 -static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
 -{
 - if (vcpu->guest_xcr0_loaded) {
 - if (vcpu->arch.xcr0 != host_xcr0)
 - xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
 - vcpu->guest_xcr0_loaded = 0;
 - }
 -}
 -
  static void process_nmi(struct kvm_vcpu *vcpu)
  {
   unsigned limit = 2;

--
Gleb.

Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages

2013-05-08 Thread Gleb Natapov

On Tue, May 07, 2013 at 12:09:29PM -0300, Marcelo Tosatti wrote:
 On Tue, May 07, 2013 at 05:56:08PM +0300, Gleb Natapov wrote:
Yes, I am missing what Marcelo means there too. We cannot free memslot
until we unmap its rmap one way or the other.
   
   I do not understand what are you optimizing for, given the four possible
   cases we discussed at
   
   https://lkml.org/lkml/2013/4/18/280
   
  We are optimizing mmu_lock holding time for all of those cases.
  
  But you cannot just zap roots + sp gen number increase on slot
  deletion because you need to transfer access/dirty information from the rmap
  that is going to be deleted to the actual page before
  kvm_set_memory_region() returns to the caller.
  
   That is, why a simple for_each_all_shadow_page(zap_page) is not 
   sufficient.
  With a lock break? It is. We tried to optimize that by zapping only pages
  that reference memslot that is going to be deleted and zap all other
  later when recycling old sps, but if you think this is premature
  optimization I am fine with it.
 
 If it can be shown that it's not a premature optimization, I am fine with
 it.
 
 AFAICS all cases are 1) rare and 2) not latency sensitive (as in there
 is no requirement for those cases to finish in a short period of time).
OK, let's start with a simple version. The one that goes through rmap
turned out to be more complicated than we expected.

--
Gleb.

kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Tomas Papan

I have the same issue, with 3.9.1 (3.9.0 too) it hangs right after seabios...
 (no problem in 3.8.11)

qemu-1.4.1
seabios-1.7.2.1

after setting emulate_invalid_guest_state=0 everything works just fine.

virsh # qemu-monitor-command vm-jack --hmp x/8i \$pc
0x000fc46b:  lgdtw  %cs:-0x2c60
0x000fc471:  mov    %cr0,%eax
0x000fc474:  or     $0x1,%eax
0x000fc478:  mov    %eax,%cr0
0x000fc47b:  ljmpl  $0x8,$0xfc483
0x000fc483:  mov    $0x10,%ax
0x000fc486:  add    %al,(%bx,%si)
0x000fc488:  mov    %ax,%ds


virsh # qemu-monitor-command vm-jack --hmp x/64b \$pc
0x000fc46b:  lgdtw  %cs:-0x2c60
0x000fc471:  mov    %cr0,%eax
0x000fc474:  or     $0x1,%eax
0x000fc478:  mov    %eax,%cr0
0x000fc47b:  ljmpl  $0x8,$0xfc483
0x000fc483:  mov    $0x10,%ax
0x000fc486:  add    %al,(%bx,%si)
0x000fc488:  mov    %ax,%ds
0x000fc48a:  mov    %ax,%es
0x000fc48c:  mov    %ax,%ss
0x000fc48e:  mov    %ax,%fs
0x000fc490:  mov    %ax,%gs
0x000fc492:  mov    %cx,%ax
0x000fc494:  jmp    *%dx
0x000fc496:  mov    %ax,%cx
0x000fc498:  mov    $0x20,%ax
0x000fc49b:  add    %al,(%bx,%si)
0x000fc49d:  mov    %ax,%ds
0x000fc49f:  mov    %ax,%es
0x000fc4a1:  mov    %ax,%ss
0x000fc4a3:  mov    %ax,%fs
0x000fc4a5:  mov    %ax,%gs
0x000fc4a7:  ljmpl  $0xc189,$0x18c4c4
0x000fc4af:  mov    $0x30,%ax
0x000fc4b2:  add    %al,(%bx,%si)
0x000fc4b4:  mov    %ax,%ds
0x000fc4b6:  mov    %ax,%es
0x000fc4b8:  mov    %ax,%ss
0x000fc4ba:  mov    %ax,%fs
0x000fc4bc:  mov    %ax,%gs
0x000fc4be:  ljmpl  $0x200f,$0x28c4c4
0x000fc4c6:  shlb   $0xe0,-0x7d(%bp)
0x000fc4ca:  decb   (%bx)
0x000fc4cc:  and    %al,%al
0x000fc4ce:  ljmp   $0xf000,$0xc4d3
0x000fc4d3:  lidtw  %cs:-0x2c18
0x000fc4d9:  xor    %ax,%ax
0x000fc4db:  mov    %ax,%fs
0x000fc4dd:  mov    %ax,%gs
0x000fc4df:  mov    %ax,%es
0x000fc4e1:  mov    %ax,%ds
0x000fc4e3:  mov    %ax,%ss
0x000fc4e5:  mov    %ecx,%eax
0x000fc4e8:  jmpl   *%edx
0x000fc4eb:  push   %ebp
0x000fc4ed:  push   %eax
0x000fc4ef:  pushl  %es
0x000fc4f1:  push   %cs
0x000fc4f2:  push   $0xc536
0x000fc4f5:  addr32 pushw %es:0x24(%eax)
0x000fc4fa:  addr32 pushl %es:0x20(%eax)
0x000fc500:  addr32 mov %es:0x4(%eax),%edi
0x000fc506:  addr32 mov %es:0x8(%eax),%esi
0x000fc50c:  addr32 mov %es:0xc(%eax),%ebp
0x000fc512:  addr32 mov %es:0x10(%eax),%ebx
0x000fc518:  addr32 mov %es:0x14(%eax),%edx
0x000fc51e:  addr32 mov %es:0x18(%eax),%ecx
0x000fc524:  addr32 mov %es:(%eax),%ds
0x000fc528:  addr32 pushl %es:0x1c(%eax)
0x000fc52e:  addr32 mov %es:0x2(%eax),%es
0x000fc533:  pop    %eax
0x000fc535:  iret
0x000fc536:  pushf
0x000fc537:  cli


Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Gleb Natapov

On Wed, May 08, 2013 at 11:22:01AM +, Tomas Papan wrote:
 I have the same issue, with 3.9.1 (3.9.0 too) it hangs right after seabios...
  (no problem in 3.8.11)
 
 qemu-1.4.1
 seabios-1.7.2.1
 
Is there anything interesting in libvirt logfile?

Also please send the output of qemu-monitor-command vm-jack --hmp info
registers

And, just in case, can you send me your bios.bin image? Mine works.


--
Gleb.

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Tomas Papan

Hi,

I found this in the libvirt log (but the same messages appear with 3.8.x):
anakin libvirt # cat libvirtd.log
2013-05-08 11:59:29.645+: 3750: info : libvirt version: 1.0.5
2013-05-08 11:59:29.645+: 3750: error : udevGetDMIData:1548 : Failed to get udev device for syspath '/sys/devices/virtual/dmi/id' or '/sys/class/dmi/id'
2013-05-08 11:59:29.680+: 3750: warning : ebiptablesDriverInitCLITools:4225 : Could not find 'ebtables' executable

virsh # qemu-monitor-command vm-jack --hmp info registers
EAX=0002 EBX=64a1 ECX=6e08 EDX=000fc5ab
ESI=c5b8 EDI=6eec EBP=dffd83e0 ESP=6df8
EIP=c46b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT= 000fd3a8 0037
IDT= 000fd3e6 
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00= XMM01=
XMM02= XMM03=
XMM04= XMM05=
XMM06= XMM07=

bios.bin can be found here http://papan.sk/share/bios.bin

I should mention that I'm using Gentoo and libvirt 1.0.5.

I'm sorry if the Gmail interface breaks the output.

Regards
Tomas


Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Gleb Natapov

On Wed, May 08, 2013 at 02:08:55PM +0200, Tomas Papan wrote:
 Hi,
 
 I found this in the libvirt (but those messages are same in 3.8.x)
 anakin libvirt # cat libvirtd.log
 2013-05-08 11:59:29.645+: 3750: info : libvirt version: 1.0.5
 2013-05-08 11:59:29.645+: 3750: error : udevGetDMIData:1548 :
 Failed to get udev device for syspath '/sys/devices/virtual/dmi/id' or
 '/sys/class/dmi/id'
 2013-05-08 11:59:29.680+: 3750: warning :
 ebiptablesDriverInitCLITools:4225 : Could not find 'ebtables'
 executable
 
Nothing about KVM internal error?
 
A couple more things, please:
1. Output of qemu-monitor-command vm-jack --hmp info status.
2. command line.
3. trace http://www.linux-kvm.org/page/Tracing

--
Gleb.

Re: Kernel 3.9 - can't boot qemu with accel=kvm _and_ networking enabled

2013-05-08 Thread Vladimir

Here they are:

(qemu) x/8i $pc
0x000fc49b:  lgdtw  %cs:-0x2c60
0x000fc4a1:  mov    %cr0,%eax
0x000fc4a4:  or     $0x1,%eax
0x000fc4a8:  mov    %eax,%cr0
0x000fc4ab:  ljmpl  $0x8,$0xfc4b3
0x000fc4b3:  mov    $0x10,%ax
0x000fc4b6:  add    %al,(%bx,%si)
0x000fc4b8:  mov    %ax,%ds

(qemu) x/64b $pc
0x000fc49b:  lgdtw  %cs:-0x2c60
0x000fc4a1:  mov    %cr0,%eax
0x000fc4a4:  or     $0x1,%eax
0x000fc4a8:  mov    %eax,%cr0
0x000fc4ab:  ljmpl  $0x8,$0xfc4b3
0x000fc4b3:  mov    $0x10,%ax
0x000fc4b6:  add    %al,(%bx,%si)
0x000fc4b8:  mov    %ax,%ds
0x000fc4ba:  mov    %ax,%es
0x000fc4bc:  mov    %ax,%ss
0x000fc4be:  mov    %ax,%fs
0x000fc4c0:  mov    %ax,%gs
0x000fc4c2:  mov    %cx,%ax
0x000fc4c4:  jmp    *%dx
0x000fc4c6:  mov    %ax,%cx
0x000fc4c8:  mov    $0x20,%ax
0x000fc4cb:  add    %al,(%bx,%si)
0x000fc4cd:  mov    %ax,%ds
0x000fc4cf:  mov    %ax,%es
0x000fc4d1:  mov    %ax,%ss
0x000fc4d3:  mov    %ax,%fs
0x000fc4d5:  mov    %ax,%gs
0x000fc4d7:  ljmpl  $0xc189,$0x18c4f4
0x000fc4df:  mov    $0x30,%ax
0x000fc4e2:  add    %al,(%bx,%si)
0x000fc4e4:  mov    %ax,%ds
0x000fc4e6:  mov    %ax,%es
0x000fc4e8:  mov    %ax,%ss
0x000fc4ea:  mov    %ax,%fs
0x000fc4ec:  mov    %ax,%gs
0x000fc4ee:  ljmpl  $0x200f,$0x28c4f4
0x000fc4f6:  shlb   $0xe0,-0x7d(%bp)
0x000fc4fa:  decb   (%bx)
0x000fc4fc:  and    %al,%al
0x000fc4fe:  ljmp   $0xf000,$0xc503
0x000fc503:  lidtw  %cs:-0x2c18
0x000fc509:  xor    %ax,%ax
0x000fc50b:  mov    %ax,%fs
0x000fc50d:  mov    %ax,%gs
0x000fc50f:  mov    %ax,%es
0x000fc511:  mov    %ax,%ds
0x000fc513:  mov    %ax,%ss
0x000fc515:  mov    %ecx,%eax
0x000fc518:  jmpl   *%edx
0x000fc51b:  push   %ebp
0x000fc51d:  push   %eax
0x000fc51f:  pushl  %es
0x000fc521:  push   %cs
0x000fc522:  push   $0xc566
0x000fc525:  addr32 pushw %es:0x24(%eax)
0x000fc52a:  addr32 pushl %es:0x20(%eax)
0x000fc530:  addr32 mov %es:0x4(%eax),%edi
0x000fc536:  addr32 mov %es:0x8(%eax),%esi
0x000fc53c:  addr32 mov %es:0xc(%eax),%ebp
0x000fc542:  addr32 mov %es:0x10(%eax),%ebx
0x000fc548:  addr32 mov %es:0x14(%eax),%edx
0x000fc54e:  addr32 mov %es:0x18(%eax),%ecx
0x000fc554:  addr32 mov %es:(%eax),%ds
0x000fc558:  addr32 pushl %es:0x1c(%eax)
0x000fc55e:  addr32 mov %es:0x2(%eax),%es
0x000fc563:  pop    %eax
0x000fc565:  iret
0x000fc566:  pushf
0x000fc567:  cli



Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Tomas Papan

Hi,

No, nothing. I checked all the logs (even syslog).

1) virsh # qemu-monitor-command vm-jack --hmp info status
VM status: running

2) morpheus@anakin ~ $ ps aux | grep vm-jack
qemu  3822  0.5  0.1 8952256 23600 ?   Sl   13:59   0:08
/usr/bin/qemu-system-x86_64 -machine accel=kvm -name vm-jack -S
-machine pc-0.14,accel=kvm,usb=off -cpu
Nehalem,+rdtscp,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 8192 -smp 4,sockets=4,cores=1,threads=1 -uuid
03196c23-24ba-d398-a000-582b0e88b0e7 -no-user-config -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm-jack.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot order=c,menu=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/libvirt/images/jack.img,if=none,id=drive-virtio-disk0,format=raw
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive 
file=/var/lib/libvirt/images/kernel.img,if=none,id=drive-virtio-disk1,format=raw
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
-drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev
tap,fd=19,id=hostnet0 -device
e1000,netdev=hostnet0,id=net0,mac=52:54:00:21:1c:e0,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -k en-us
-vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

3) it took some time, I didn't have debug_fs, then tracing... but the
file is stored here (15 MB) http://papan.sk/share/trace.dat.tar.gz

Regards
Tomas

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Daniel P. Berrange

On Wed, May 08, 2013 at 02:08:55PM +0200, Tomas Papan wrote:
 Hi,
 
 I found this in the libvirt (but those messages are same in 3.8.x)
 anakin libvirt # cat libvirtd.log
 2013-05-08 11:59:29.645+: 3750: info : libvirt version: 1.0.5
 2013-05-08 11:59:29.645+: 3750: error : udevGetDMIData:1548 :
 Failed to get udev device for syspath '/sys/devices/virtual/dmi/id' or
 '/sys/class/dmi/id'
 2013-05-08 11:59:29.680+: 3750: warning :
 ebiptablesDriverInitCLITools:4225 : Could not find 'ebtables'
 executable

You need to look at /var/log/libvirt/qemu/$GUESTNAME.log for
QEMU related messages. The libvirtd.log file only has the
libvirt related messages.

Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Tomas Papan

Sorry, I didn't write that well, I checked that log too... nothing is there...

anakin qemu # cat vm-jack.log
2013-05-08 13:02:52.358+: starting up
LC_ALL=C 
PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin
HOME=/root USER=root QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -name
vm-jack -S -machine pc-0.14,accel=kvm,usb=off -cpu
Nehalem,+rdtscp,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
-m 8192 -smp 4,sockets=4,cores=1,threads=1 -uuid
03196c23-24ba-d398-a000-582b0e88b0e7 -no-user-config -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm-jack.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot order=c,menu=on -device
piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/var/lib/libvirt/images/jack.img,if=none,id=drive-virtio-disk0,format=raw
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive 
file=/var/lib/libvirt/images/kernel.img,if=none,id=drive-virtio-disk1,format=raw
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
-drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev
tap,fd=19,id=hostnet0 -device
e1000,netdev=hostnet0,id=net0,mac=52:54:00:21:1c:e0,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -k en-us
-vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
char device redirected to /dev/pts/3 (label charserial0)

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Gleb Natapov

On Wed, May 08, 2013 at 02:51:48PM +0200, Tomas Papan wrote:
 Hi,
 
 No nothing, I check all logs (even syslog)
 
Yeah, since the status of the VM is running, you are not supposed to see
anything there.

 1) virsh # qemu-monitor-command vm-jack --hmp info status
 VM status: running
 
 2) morpheus@anakin ~ $ ps aux | grep vm-jack
 qemu  3822  0.5  0.1 8952256 23600 ?   Sl   13:59   0:08
 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name vm-jack -S
 -machine pc-0.14,accel=kvm,usb=off -cpu
 Nehalem,+rdtscp,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
 -m 8192 -smp 4,sockets=4,cores=1,threads=1 -uuid
 03196c23-24ba-d398-a000-582b0e88b0e7 -no-user-config -nodefaults
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm-jack.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-shutdown -boot order=c,menu=on -device
 piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
 file=/var/lib/libvirt/images/jack.img,if=none,id=drive-virtio-disk0,format=raw
 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
 -drive 
 file=/var/lib/libvirt/images/kernel.img,if=none,id=drive-virtio-disk1,format=raw
 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1
 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device
 ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev
 tap,fd=19,id=hostnet0 -device
 e1000,netdev=hostnet0,id=net0,mac=52:54:00:21:1c:e0,bus=pci.0,addr=0x3
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -k en-us
 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 
 3) it took some time, I didn't have debug_fs, then tracing... but the
 file is stored here (15 MB) http://papan.sk/share/trace.dat.tar.gz
 
Very interesting. In the middle of the run the vcpu decides that it does not
want to run any more. How much CPU time does qemu take when this happens? If
it is 100%, can you do the following:

1. run qemu-monitor-command vm-jack --hmp info cpus
2. note thread id for cpu #0
3. run trace-cmd record -P $pid -p function where $pid is the
   thread id that you found in step 2.
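
In case it helps with step 2, here is a small C sketch of pulling the
thread id out of one line of `info cpus` output. The line shape shown in
the comment is an assumption about what the monitor prints, and
toy_parse_thread_id() is a made-up helper, not part of QEMU or libvirt.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Extract the host thread id from one line of "info cpus" output.
 * Assumed line shape (not checked against every QEMU version):
 *   "* CPU #0: pc=0x00000000000fc525 thread_id=3825"
 * Returns -1 if there is no "thread_id=" field or it carries no digits.
 */
static long toy_parse_thread_id(const char *line)
{
	static const char key[] = "thread_id=";
	const char *p = strstr(line, key);
	char *end;
	long tid;

	if (!p)
		return -1;

	p += sizeof(key) - 1;		/* skip past "thread_id=" */
	tid = strtol(p, &end, 10);
	if (end == p)			/* no digits followed the key */
		return -1;

	return tid;
}
```

The returned value is what would be passed to trace-cmd as -P $pid.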

--
Gleb.

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Tomas Papan

OK, the CPU stays at 0% when it hangs; there is only one 100% CPU peak,
which happens when the VM starts (I think this is quite normal).

However, I ran the following command and stopped it right when it hung:
anakin trace2 # virsh start vm-jack; pid=`virsh qemu-monitor-command
vm-jack --hmp info cpus | grep '\*' | awk '{print $5}' | cut -d\=
-f2`; trace-cmd record -P $pid -p function

if anyone is interested it produces a 1.6 GB file (the compressed
version can be found here: http://papan.sk/share/trace2.dat.tar.gz
(150 MB))

Tomas

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Gleb Natapov

On Wed, May 08, 2013 at 03:50:47PM +0200, Tomas Papan wrote:
 Ok, the cpu stays at 0% when it hangs, there is only one 100% cpu peak
 which happens when the vm starts ( I think this is quite normal).
 
 However I run following command, and I stop it right when it hangs:
 anakin trace2 # virsh start vm-jack; pid=`virsh qemu-monitor-command
 vm-jack --hmp info cpus | grep '\*' | awk '{print $5}' | cut -d\=
 -f2`; trace-cmd record -P $pid -p function
 
 if anyone is interested it produces a 1.6 GB file (the compressed
 version can be found here: http://papan.sk/share/trace2.dat.tar.gz
 (150 MB))
 
Thanks! Can you test the patch below:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..0af1807 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5197,6 +5197,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
return 0;
}
 
+   if (vcpu->arch.halt_request) {
+   vcpu->arch.halt_request = 0;
+   ret = kvm_emulate_halt(vcpu);
+   goto out;
+   }
+
if (signal_pending(current))
goto out;
if (need_resched())
--
Gleb.

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Tomas Papan

patch is working :)

Thank you very much Gleb.

Regards
Tomas

Re: kernel 3.9.x kvm hangs after seabios

2013-05-08 Thread Gleb Natapov

On Wed, May 08, 2013 at 04:52:52PM +0200, Tomas Papan wrote:
 patch is working :)
 
 Thank you very much Gleb.
 
Thank you for your patience. Curious bug it was.
 
--
Gleb.

[PATCH] KVM: VMX: fix halt emulation while emulating invalid guest sate

2013-05-08 Thread Gleb Natapov

The invalid guest state emulation loop does not check halt_request
which causes 100% cpu loop while guest is in halt and in invalid
state, but more serious issue is that this leaves halt_request set, so
random instruction emulated by vm86 #GP exit can be interpreted
as halt which causes guest hang. Fix both problems by handling
halt_request in emulation loop.

Reported-by: Tomas Papan tomas.pa...@gmail.com
Tested-by: Tomas Papan tomas.pa...@gmail.com
CC: sta...@vger.kernel.org
Signed-off-by: Gleb Natapov g...@redhat.com
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5a87a58..a9fa4bc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5312,6 +5312,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
return 0;
}
 
+   if (vcpu->arch.halt_request) {
+   vcpu->arch.halt_request = 0;
+   ret = kvm_emulate_halt(vcpu);
+   goto out;
+   }
+
if (signal_pending(current))
goto out;
if (need_resched())
--
Gleb.
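
For anyone who wants to see the failure mode in isolation, a rough
userspace model of the emulation loop follows. It is only a sketch: the
toy_* names are invented for illustration and nothing here is kernel
code. It shows that with the check the halt request is consumed right
after the instruction that raised it, and that without the check the
flag is left set for whatever runs next.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins; none of these are kernel types. */
enum toy_insn { TOY_NOP, TOY_HLT };

struct toy_vcpu {
	bool halt_request;	/* set when the emulator emulates HLT */
	bool halted;		/* set once the halt was actually handled */
};

static void toy_emulate_one(struct toy_vcpu *v, enum toy_insn insn)
{
	if (insn == TOY_HLT)
		v->halt_request = true;
}

/*
 * Emulate up to n instructions. With check_halt set, the loop consumes
 * halt_request right after the instruction that raised it; without it,
 * the flag is left behind for some later code path to trip over.
 * Returns the number of instructions emulated.
 */
static int toy_emulation_loop(struct toy_vcpu *v, const enum toy_insn *prog,
			      int n, bool check_halt)
{
	int i;

	for (i = 0; i < n; i++) {
		toy_emulate_one(v, prog[i]);
		if (check_halt && v->halt_request) {
			v->halt_request = false;
			v->halted = true;
			return i + 1;
		}
	}
	return i;
}
```

Running a NOP, HLT, NOP sequence with check_halt off leaves halt_request
set after the loop, which is the stale state the commit message describes.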

Re: VFIO VGA test branches

2013-05-08 Thread Alex Williamson

A few notes for anyone trying this...

  * I recommend the q35 machine type and using the default config
file found in the docs directory.  This means your command line
should include:

 -M q35 -nodefconfig -readconfig /path/to/qemu.git/docs/q35-chipset.cfg

  * You're likely passing through a graphics card that is attached
to the host system below a root port, so make it appear that way
to the guest too.  If your graphics card has a graphics function
and audio function, assign them as:

-device vfio-pci,host=2:00.0,x-vga=on,multifunction=on,bus=ich9-pcie-port-1,addr=0.0 \
-device vfio-pci,host=2:00.1,bus=ich9-pcie-port-1,addr=0.1

The bus name comes from the q35-chipset.cfg above.  If your
graphics doesn't include a separate audio device, drop the
second line and the multifunction option of the first (addr is
also optional at that point, 0.0 will be the default).

  * If you follow both of the above, your VGA device is now below a
root port, but the version of seabios in qemu doesn't support
initializing VGA routing to that device.  To fix, use upstream
seabios: git://git.seabios.org/seabios.git  The default config
should work.  Then add the following to your qemu commandline:

-L /path/to/seabios.git/out/ -L /path/to/qemu/bios/files/

(the latter is likely /usr/local/share/qemu/)

  * You can use -nographic to prevent QEMU from trying to start SDL
or needing a vnc parameter.  You can also specify a -vnc option and
use the window for mouse input.

  * Use -vga none.  At this point I'm not really interested in
dual-headed VMs unless you're interested in working on it.
Having an emulated VGA means we're not really testing VGA
support through VFIO.

  * Do not use the vfio-pci romfile option unless you need it (i.e.
try w/o first).  Option ROMs check an internal signature against
the hardware.  If they don't match, the ROM isn't run.  If you
download a ROM from the internet, you may get nowhere.  If you
do need a ROM, it's best to scrape it off the device you're
using.  You can do this through the rom file in sysfs for the
device.  echo 1 > rom to enable it, then read it as cat rom >
/tmp/rom.  To do this, it should be a secondary graphics
device and be untouched by host drivers.  You may have better
luck booting from an install CD to get an environment where the
device is untouched for this.

  * USB passthrough is handy for input and easier than figuring out
which ports are connected to which USB controllers for vfio-pci
assignment.  Use lsusb to find the devices, note the bus and
device numbers, then use:

-device usb-host,hostbus=8,hostaddr=2

I think that's it.  Feel free to reply with other best practices.
Thanks,

Alex

On Fri, 2013-05-03 at 16:56 -0600, Alex Williamson wrote:
 Hi folks,
 
 A number of people have been trying VFIO's VGA support, a few have even
 been successful.  Resetting devices has been a problem and makes it
 very, very difficult to really use VGA assignment effectively.  The code
 in the branches below attempts to address this.  Discrete graphics
 devices are typically on their own bus, which we can reset so we
 theoretically get something pretty close to a power-on state for the GPU
 on each run (or after each guest reset).  With this I'm able to get
 multiple runs on my HD7850 with no need to reset the host.  Hopefully
 this will also clean up after any host uses of the device so we can
 unload drivers rather than blacklisting them.
 
 If you've been playing with VFIO and VGA, please give the branches below
 a shot and report successes and failures.  Note that this new reset is
 only enabled with the x-vga=on option, so should not do gratuitous bus
 resets for other devices.  Thanks,
 
 Alex
 
 git://github.com/awilliam/linux-vfio.git vfio-vga-reset
 git://github.com/awilliam/qemu-vfio.git vfio-vga-reset
 
 PS - The above linux branch is v3.9 based which has a known kvm emulator
 bug.  If you're on Intel and nothing happens, try:
 
 sudo modprobe -r kvm_intel
 sudo modprobe kvm_intel emulate_invalid_guest_state=0
 
 This is required to execute the VGA BIOS on my HD7850.
 
 If things still don't work, apply the following patch:
 
 --- a/hw/misc/vfio.c
 +++ b/hw/misc/vfio.c
 @@ -40,7 +40,7 @@
  #include "sysemu/kvm.h"
  #include "sysemu/sysemu.h"
  
 -/* #define DEBUG_VFIO */
 +#define DEBUG_VFIO
  #ifdef DEBUG_VFIO
  #define DPRINTF(fmt, ...) \
  do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
 
 And log the output (there will be lots).
 
 Also, AMD/ATI and Nvidia are the only devices expected to have a
 reasonable shot at working.  I'm seeing reports of success on AMD/ATI HD
 5xxx, 6xxx, and 7xxx, as well as Nvidia

Re: [PATCH] KVM: VMX: fix halt emulation while emulating invalid guest sate

2013-05-08 Thread Paolo Bonzini



- Original Message -
 From: Gleb Natapov g...@redhat.com
 To: kvm@vger.kernel.org
 Cc: pbonz...@redhat.com, sta...@vger.kernel.org
 Sent: Wednesday, 8 May 2013 17:38:44
 Subject: [PATCH] KVM: VMX: fix halt emulation while emulating invalid guest sate
 
 The invalid guest state emulation loop does not check halt_request
 which causes 100% cpu loop while guest is in halt and in invalid
 state, but more serious issue is that this leaves halt_request set, so
 random instruction emulated by vm86 #GP exit can be interpreted
 as halt which causes guest hang. Fix both problems by handling
 halt_request in emulation loop.
 
 Reported-by: Tomas Papan tomas.pa...@gmail.com
 Tested-by: Tomas Papan tomas.pa...@gmail.com
 CC: sta...@vger.kernel.org
 Signed-off-by: Gleb Natapov g...@redhat.com
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 5a87a58..a9fa4bc 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -5312,6 +5312,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu
 *vcpu)
   return 0;
   }
  
 + if (vcpu->arch.halt_request) {
 + vcpu->arch.halt_request = 0;
 + ret = kvm_emulate_halt(vcpu);
 + goto out;
 + }
 +
   if (signal_pending(current))
   goto out;
   if (need_resched())
 --
   Gleb.
 

Reviewed-by: Paolo Bonzini pbonz...@redhat.com

Re: Problems while booting a linux system on fast models based CortexA15

2013-05-08 Thread Christoffer Dall

On Wed, May 8, 2013 at 12:07 AM, Mai Daftedar mai.dafte...@gmail.com wrote:
 Dear All,
  I am facing a problem with booting a fully working Linux system on the Fast
 Models based Cortex-A15 simulation platform.

 I'm using the KVM on ARM guide to configure KVM on the ARM fast models
 with CortexA15, however I get the following kernel panic error  when I use
 NFS to boot the kernel.

 VFS: Unable to mount root fs via NFS, trying floppy.

 Noting that the kernel semi-hosting arguments used are as follows:
 kernel uImage --fdt host-a15.dtb -- earlyprintk console=ttyAMA0 mem=2048M
 root=/dev/nfs nfsroot=192.168.x.x:/srv/nfsroot/ rw ip=dhcp

 Where can I be going wrong?

Is that literally the command line you use?

You may want to change those x's then to the actual IP address of your
host machine :)

You also need to make sure that your host machine has NFS configured properly.

-Christoffer

[PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1

2013-05-08 Thread Jun Nakajima

Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577,
switches the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/vmx.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e53a5f7..51b8b4f0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
nested_vmx_exit_ctls_high = 0;
 #endif
-   nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+   nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+ VM_EXIT_LOAD_IA32_EFER);
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_entry_ctls_high =
VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
-   nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+   nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+  VM_ENTRY_LOAD_IA32_EFER);
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7486,10 +7487,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
vcpu->arch.cr0_guest_owned_bits = ~vmcs12->cr0_guest_host_mask;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-   /* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-   vmcs_write32(VM_EXIT_CONTROLS,
-   vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-   vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+   /* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+* we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+* bits are further modified by vmx_set_efer() below.
+*/
+   vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+   /* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+* emulated by vmx_set_efer(), below.
+*/
+   vmcs_write32(VM_ENTRY_CONTROLS,
+   (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+   ~VM_ENTRY_IA32E_MODE) |
(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
-- 
1.8.1.2
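
The entry-controls masking in the last hunk is easy to misread, so here
is a standalone sketch of just that expression. The bit values are the
ones I believe the kernel headers use for these VM-entry controls;
toy_vmcs02_entry_controls() and its parameter names are invented for
illustration and are not kernel functions.

```c
#include <stdint.h>

/* VM-entry control bits, assumed to match the kernel's definitions. */
#define VM_ENTRY_IA32E_MODE	0x00000200u
#define VM_ENTRY_LOAD_IA32_PAT	0x00004000u
#define VM_ENTRY_LOAD_IA32_EFER	0x00008000u

/*
 * Combine what L1 asked for (vmcs12_entry) with what the host wants
 * (host_entry). L1's LOAD_IA32_EFER and IA32E_MODE bits are dropped
 * because they are emulated separately; the host's IA32E_MODE bit is
 * dropped for the same reason.
 */
static uint32_t toy_vmcs02_entry_controls(uint32_t vmcs12_entry,
					  uint32_t host_entry)
{
	return (vmcs12_entry & ~VM_ENTRY_LOAD_IA32_EFER &
			~VM_ENTRY_IA32E_MODE) |
		(host_entry & ~VM_ENTRY_IA32E_MODE);
}
```

Note that only L1's copy of LOAD_IA32_EFER is masked; a host-side
LOAD_IA32_EFER bit passes through unchanged.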


[PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h

2013-05-08 Thread Jun Nakajima

For preparation, we just move gpte_access() and prefetch_invalid_gpte() from 
mmu.c to paging_tmpl.h.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c | 30 --
 arch/x86/kvm/paging_tmpl.h | 40 +++-
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 004cc87..117233f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2488,26 +2488,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
return gfn_to_pfn_memslot_atomic(slot, gfn);
 }
 
-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp, u64 *spte,
- u64 gpte)
-{
-   if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-   goto no_present;
-
-   if (!is_present_gpte(gpte))
-   goto no_present;
-
-   if (!(gpte & PT_ACCESSED_MASK))
-   goto no_present;
-
-   return false;
-
-no_present:
-   drop_spte(vcpu->kvm, spte);
-   return true;
-}
-
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp,
u64 *start, u64 *end)
@@ -3408,16 +3388,6 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, 
unsigned access,
return false;
 }
 
-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
-   unsigned access;
-
-   access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
-   access &= ~(gpte >> PT64_NX_SHIFT);
-
-   return access;
-}
-
 static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned 
gpte)
 {
unsigned index;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index da20860..df34d4a 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -103,6 +103,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
return (ret != orig_pte);
 }
 
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *spte,
+ u64 gpte)
+{
+   if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+   goto no_present;
+
+   if (!is_present_gpte(gpte))
+   goto no_present;
+
+   if (!(gpte & PT_ACCESSED_MASK))
+   goto no_present;
+
+   return false;
+
+no_present:
+   drop_spte(vcpu->kvm, spte);
+   return true;
+}
+
+static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+   unsigned access;
+
+   access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+   access &= ~(gpte >> PT64_NX_SHIFT);
+
+   return access;
+}
+
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 struct kvm_mmu *mmu,
 struct guest_walker *walker,
@@ -225,7 +255,7 @@ retry_walk:
}
 
accessed_dirty &= pte;
-   pte_access = pt_access & gpte_access(vcpu, pte);
+   pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
 
walker->ptes[walker->level - 1] = pte;
} while (!is_last_gpte(mmu, walker->level, pte));
@@ -309,13 +339,13 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp,
gfn_t gfn;
pfn_t pfn;
 
-   if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
+   if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
return false;
 
pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
gfn = gpte_to_gfn(gpte);
-   pte_access = sp->role.access & gpte_access(vcpu, gpte);
+   pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
protect_clean_gpte(&pte_access, gpte);
pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
no_dirty_log && (pte_access & ACC_WRITE_MASK));
@@ -782,14 +812,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
  sizeof(pt_element_t)))
return -EINVAL;
 
-   if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) {
+   if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
vcpu->kvm->tlbs_dirty++;
continue;
}
 
gfn = gpte_to_gfn(gpte);
pte_access = sp->role.access;
-   pte_access &= gpte_access(vcpu, gpte);
+   pte_access &= FNAME(gpte_access)(vcpu, gpte);
protect_clean_gpte(&pte_access, gpte);
 
if (sync_mmio_spte(&sp->spt[i], gfn,

[PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h

2013-05-08 Thread Jun Nakajima

This is the first patch in a series which adds nested EPT support to KVM's
nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use
EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest
to set its own cr3 and take its own page faults without either of L0 or L1
getting involved. This often significantly improves L2's performance over the
previous two alternatives (shadow page tables over EPT, and shadow page
tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code
for 32-bit and 64-bit tables is very similar, but not identical, so
paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once
with PTTYPE=64, and this generates the two sets of similar functions.

There are subtle but important differences between the format of EPT tables
and that of ordinary x86 64-bit page tables, so for nested EPT we need a
third set of functions to read the guest EPT table and to write the shadow
EPT table.

So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed
with EPT) which correctly read and write EPT tables.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c |  5 +
 arch/x86/kvm/paging_tmpl.h | 43 +--
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 117233f..6c1670f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
return mmu->last_pte_bitmap & (1 << index);
 }
 
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index df34d4a..4c45654 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -50,6 +50,22 @@
#define PT_LEVEL_BITS PT32_LEVEL_BITS
#define PT_MAX_FULL_LEVELS 2
#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+   #define pt_element_t u64
+   #define guest_walker guest_walkerEPT
+   #define FNAME(name) EPT_##name
+   #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+   #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+   #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+   #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+   #define PT_LEVEL_BITS PT64_LEVEL_BITS
+   #ifdef CONFIG_X86_64
+   #define PT_MAX_FULL_LEVELS 4
+   #define CMPXCHG cmpxchg
+   #else
+   #define CMPXCHG cmpxchg64
+   #define PT_MAX_FULL_LEVELS 2
+   #endif
 #else
#error Invalid PTTYPE value
 #endif
@@ -80,6 +96,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
 }
 
+#if PTTYPE != PTTYPE_EPT
+/*
+ *  Comment out this for EPT because update_accessed_dirty_bits() is not used.
+ */
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
   pt_element_t __user *ptep_user, unsigned index,
   pt_element_t orig_pte, pt_element_t new_pte)
@@ -102,6 +122,7 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
 
return (ret != orig_pte);
 }
+#endif
 
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
  struct kvm_mmu_page *sp, u64 *spte,
@@ -126,13 +147,21 @@ no_present:
 static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
 {
unsigned access;
-
+#if PTTYPE == PTTYPE_EPT
+   access = (gpte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+ VMX_EPT_EXECUTABLE_MASK));
+#else
access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
access &= ~(gpte >> PT64_NX_SHIFT);
+#endif
 
return access;
 }
 
+#if PTTYPE != PTTYPE_EPT
+/*
+ * EPT A/D bit support is not implemented.
+ */
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 struct kvm_mmu *mmu,
 struct guest_walker *walker,
@@ -169,6 +198,7 @@ static int FNAME(update_accessed_dirty_bits)(struct 
kvm_vcpu *vcpu,
}
return 0;
 }
+#endif
 
 /*
  * Fetch a guest pte for a guest virtual address
@@ -177,7 +207,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker 
*walker,
struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gva_t addr, u32 access)
 {
-   int ret;
pt_element_t pte;
pt_element_t __user *uninitialized_var(ptep_user);
gfn_t table_gfn;
@@
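
For readers new to the paging_tmpl.h trick, a single-file approximation
may help. The kernel re-includes one header several times with a
different PTTYPE each time; the sketch below gets a similar effect with
a function-generating macro instead, so it is not the same mechanism,
only the same idea of stamping out one function per page-table flavour
under a per-flavour name prefix. All toy_* names and the address masks
are made up for illustration.

```c
#include <stdint.h>

/* Two-step paste so macro arguments are expanded before ## is applied. */
#define TOY_CAT_(a, b)	a##b
#define TOY_CAT(a, b)	TOY_CAT_(a, b)

/*
 * Generate PREFIX##gpte_to_gfn for one page-table flavour. ELEM_T plays
 * the role of pt_element_t and ADDR_MASK the role of PT_BASE_ADDR_MASK.
 */
#define TOY_DEFINE_WALKER(PREFIX, ELEM_T, ADDR_MASK)			\
	static uint64_t TOY_CAT(PREFIX, gpte_to_gfn)(ELEM_T gpte)	\
	{								\
		return ((uint64_t)gpte & (ADDR_MASK)) >> 12;		\
	}

/* One instantiation per flavour, mirroring PTTYPE 32, 64 and EPT. */
TOY_DEFINE_WALKER(toy_paging32_, uint32_t, 0xfffff000ull)
TOY_DEFINE_WALKER(toy_paging64_, uint64_t, 0x000ffffffffff000ull)
TOY_DEFINE_WALKER(toy_ept_, uint64_t, 0x000ffffffffff000ull)
```

Each instantiation yields its own function (toy_paging32_gpte_to_gfn,
toy_paging64_gpte_to_gfn, toy_ept_gpte_to_gfn), much as FNAME(name)
expands to a differently prefixed symbol on each inclusion.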

[PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()

2013-05-08 Thread Jun Nakajima

Since link_shadow_page() is used by a routine in mmu.c, add an
EPT-specific link_shadow_page() in paging_tmpl.h, rather than moving
it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/paging_tmpl.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 4c45654..dc495f9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, 
struct guest_walker *gw,
}
 }
 
+#if PTTYPE == PTTYPE_EPT
+static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
+{
+   u64 spte;
+
+   spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+   VMX_EPT_EXECUTABLE_MASK;
+
+   mmu_spte_set(sptep, spte);
+}
+#endif
+
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  * If the guest tries to write a write-protected page, we need to
@@ -513,7 +525,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
goto out_gpte_changed;
 
if (sp)
+#if PTTYPE == PTTYPE_EPT
+   FNAME(link_shadow_page)(it.sptep, sp);
+#else
link_shadow_page(it.sptep, sp);
+#endif
}
 
for (;
@@ -533,7 +549,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
  true, direct_access, it.sptep);
+#if PTTYPE == PTTYPE_EPT
+   FNAME(link_shadow_page)(it.sptep, sp);
+#else
link_shadow_page(it.sptep, sp);
+#endif
}
 
clear_sp_write_flooding_count(it.sptep);
-- 
1.8.1.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 05/13] nEPT: MMU context for nested EPT

2013-05-08 Thread Jun Nakajima

KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new MMU context for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c | 38 ++
 arch/x86/kvm/mmu.h |  1 +
 arch/x86/kvm/vmx.c | 54 +-
 3 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c1670f..37f8d7f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3653,6 +3653,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct 
kvm_mmu *context)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+{
+   ASSERT(vcpu);
+   ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+   context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+   context->nx = is_nx(vcpu); /* TODO: ? */
+   context->new_cr3 = paging_new_cr3;
+   context->page_fault = EPT_page_fault;
+   context->gva_to_gpa = EPT_gva_to_gpa;
+   context->sync_page = EPT_sync_page;
+   context->invlpg = EPT_invlpg;
+   context->update_pte = EPT_update_pte;
+   context->free = paging_free;
+   context->root_level = context->shadow_root_level;
+   context->root_hpa = INVALID_PAGE;
+   context->direct_map = false;
+
+   /* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
+  something different.
+*/
+   reset_rsvds_bits_mask(vcpu, context);
+
+
+   /* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
+  they are done, or why they write to vcpu->arch.mmu and not context
+*/
+   vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
+   vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
+   vcpu->arch.mmu.base_role.smep_andnot_wp =
+   kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
+   !is_write_protection(vcpu);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2adcbc2..8fc94dd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -54,6 +54,7 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 
addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool 
direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 51b8b4f0..80ab5b1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1045,6 +1045,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct 
vmcs12 *vmcs12,
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -7305,6 +7310,46 @@ static void vmx_set_supported_cpuid(u32 func, struct 
kvm_cpuid_entry2 *entry)
entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+   /* return the page table to be shadowed - in our case, EPT12 */
+   return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+   struct x86_exception *fault)
+{
+   struct vmcs12 *vmcs12;
+   nested_vmx_vmexit(vcpu);
+   vmcs12 = get_vmcs12(vcpu);
+   /*
+* Note no need to set vmcs12->vm_exit_reason as it is already copied
+* from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+*/
+   vmcs12->exit_qualification = fault->error_code;
+   vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+   int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu);
+
+   vcpu->arch.mmu.set_cr3   = vmx_set_cr3;
+   vcpu->arch.mmu.get_cr3   = nested_ept_get_cr3;
+   vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+   vcpu->arch.walk_mmu  =

[PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry

2013-05-08 Thread Jun Nakajima

The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).
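
The save path described above can be sketched as follows. This is only an
illustration under stated assumptions: struct toy_vmcs and
sync_l2_paging_state() are hypothetical stand-ins for vmcs02/vmcs12 and for
the code added to prepare_vmcs12(), and the l0_uses_ept flag plays the role
of enable_ept.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the guest-state fields the patch touches. */
struct toy_vmcs {
    uint64_t guest_cr3;
    uint64_t guest_pdptr[4];
};

/*
 * On nested exit, copy CR3 and the PDPTEs from the hardware VMCS (vmcs02)
 * into L1's view (vmcs12), but only when L0 uses EPT. With shadow paging,
 * GUEST_CR3 in vmcs02 belongs to L0 and must not leak into vmcs12.
 */
static void sync_l2_paging_state(struct toy_vmcs *vmcs12,
                                 const struct toy_vmcs *vmcs02,
                                 bool l0_uses_ept)
{
    int i;

    if (!l0_uses_ept)
        return;

    vmcs12->guest_cr3 = vmcs02->guest_cr3;
    for (i = 0; i < 4; i++)
        vmcs12->guest_pdptr[i] = vmcs02->guest_pdptr[i];
}
```

The restore path on nested entry would be the mirror image, writing the
saved PDPTEs back under the same condition.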

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/vmx.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 80ab5b1..db8df4c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7602,6 +7602,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
kvm_set_cr3(vcpu, vmcs12->guest_cr3);
kvm_mmu_reset_context(vcpu);
 
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+*/
+   if (enable_ept) {
+   vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+   vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+   vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+   vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+   }
+
kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7924,6 +7935,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
vmcs12->guest_pending_dbg_exceptions =
vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+   /*
+* In some cases (usually, nested EPT), L2 is allowed to change its
+* own CR3 without exiting. If it has changed it, we must keep it.
+* Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+* by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+*/
+   if (enable_ept)
+   vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+   /*
+* Additionally, except when L0 is using shadow page tables, L1 or
+* L2 control guest_cr3 for L2, so save their PDPTEs
+*/
+   if (enable_ept) {
+   vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+   vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+   vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+   vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+   }
+
vmcs12->vm_entry_controls =
(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
-- 
1.8.1.2

[PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3

2013-05-08 Thread Jun Nakajima

kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/x86.c | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 94f35d2..ab09003 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -664,17 +664,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 */
}
 
-   /*
-* Does the new cr3 value map to physical memory? (Note, we
-* catch an invalid cr3 even in real-mode, because it would
-* cause trouble later on when we turn on paging anyway.)
-*
-* A real CPU would silently accept an invalid cr3 and would
-* attempt to use it - with largely undefined (and often hard
-* to debug) behavior on the guest side.
-*/
-   if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-   return 1;
vcpu->arch.cr3 = cr3;
__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
vcpu->arch.mmu.new_cr3(vcpu);
-- 
1.8.1.2

[PATCH v3 08/13] nEPT: Some additional comments

2013-05-08 Thread Jun Nakajima

Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention shadow on either EPT or shadow as the only two options.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/vmx.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index db8df4c..17d8b89 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6534,7 +6534,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
*vcpu)
return nested_cpu_has2(vmcs12,
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
case EXIT_REASON_EPT_VIOLATION:
+   /*
+* L0 always deals with the EPT violation. If nested EPT is
+* used, and the nested mmu code discovers that the address is
+* missing in the guest EPT table (EPT12), the EPT violation
+* will be injected with nested_ept_inject_page_fault()
+*/
+   return 0;
case EXIT_REASON_EPT_MISCONFIG:
+   /*
+* L2 never uses directly L1's EPT, but rather L0's own EPT
+* table (shadow on EPT) or a merged EPT table that L0 built
+* (EPT on EPT). So any problems with the structure of the
+* table is L0's fault.
+*/
return 0;
case EXIT_REASON_PREEMPTION_TIMER:
return vmcs12->pin_based_vm_exec_control &
-- 
1.8.1.2

[PATCH v3 09/13] nEPT: Advertise EPT to L1

2013-05-08 Thread Jun Nakajima

Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.
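
The capability value exposed to L1 can be sketched as "what nested EPT
emulates, restricted to what the host hardware reports". The sketch below
is an illustration only: nested_ept_caps() and the EPT_CAP_* names are
hypothetical, and the bit positions are assumptions taken from the
VMX_EPT_*_BIT definitions in asm/vmx.h (page-walk length 4 at bit 6,
INVEPT at bit 20, the extent types at bits 24-26).

```c
#include <stdint.h>

/* Assumed bit positions within IA32_VMX_EPT_VPID_CAP. */
#define EPT_CAP_PAGE_WALK_4        (1u << 6)
#define EPT_CAP_INVEPT             (1u << 20)
#define EPT_CAP_EXTENT_INDIVIDUAL  (1u << 24)
#define EPT_CAP_EXTENT_CONTEXT     (1u << 25)
#define EPT_CAP_EXTENT_GLOBAL      (1u << 26)

/*
 * Hypothetical helper: compute the EPT capabilities advertised to L1.
 * Nothing is advertised when L0 itself runs without EPT; otherwise the
 * emulated feature set is masked by the host's own capabilities.
 */
static uint32_t nested_ept_caps(int ept_enabled, uint32_t hw_caps)
{
    uint32_t caps;

    if (!ept_enabled)
        return 0;

    caps = EPT_CAP_PAGE_WALK_4 | EPT_CAP_INVEPT |
           EPT_CAP_EXTENT_GLOBAL | EPT_CAP_EXTENT_CONTEXT |
           EPT_CAP_EXTENT_INDIVIDUAL;
    return caps & hw_caps;
}
```

Masking with the hardware value means L1 never sees a feature that L0
could not back with a real EPT capability.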

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/include/asm/vmx.h |  2 ++
 arch/x86/kvm/vmx.c | 17 +++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f3e01a2..4aec45d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -394,7 +394,9 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT (1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT   (1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT   (1ull << 17)
+#define VMX_EPT_INVEPT_BIT (1ull << 20)
 #define VMX_EPT_AD_BIT (1ull << 21)
+#define VMX_EPT_EXTENT_INDIVIDUAL_BIT  (1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT  (1ull << 26)
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 17d8b89..136fc25 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2155,6 +2155,7 @@ static u32 nested_vmx_pinbased_ctls_low, 
nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
 static u32 nested_vmx_misc_low, nested_vmx_misc_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
/*
@@ -2242,6 +2243,18 @@ static __init void nested_vmx_setup_ctls_msrs(void)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_WBINVD_EXITING;
 
+   if (enable_ept) {
+   /* nested EPT: emulate EPT also to L1 */
+   nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+   nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+   nested_vmx_ept_caps |=
+   VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+   VMX_EPT_EXTENT_CONTEXT_BIT |
+   VMX_EPT_EXTENT_INDIVIDUAL_BIT;
+   nested_vmx_ept_caps &= vmx_capability.ept;
+   } else
+   nested_vmx_ept_caps = 0;
+
/* miscellaneous data */
rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
@@ -2347,8 +2360,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 
msr_index, u64 *pdata)
nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
-   /* Currently, no nested ept or nested vpid */
-   *pdata = 0;
+   /* Currently, no nested vpid support */
+   *pdata = nested_vmx_ept_caps;
break;
default:
return 0;
-- 
1.8.1.2

[PATCH v3 10/13] nEPT: Nested INVEPT

2013-05-08 Thread Jun Nakajima

If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table for
L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course
of this modification already calls INVEPT. Therefore, when L1 calls INVEPT,
we don't really need to do anything. In particular we *don't* need to call
the real INVEPT again. All we do in our INVEPT is verify the validity of the
call, and its parameters, and then do nothing.

At KVM Forum 2010, Dong et al. presented "Nested Virtualization Friendly KVM"
and classified our current nested EPT implementation as "shadow-like virtual
EPT". They recommended a different approach instead, which they called
"VTLB-like virtual EPT". If we had taken that alternative approach, INVEPT
would have had a bigger role: L0 would only rebuild the shadow EPT table when
L1 calls INVEPT.
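
The "validate, then do nothing" behaviour of the emulated INVEPT can be
sketched as a pure check of the requested type against the advertised
capabilities. The names below are hypothetical, and the type values
(0 = individual address, 1 = context, 2 = global) and capability bit
positions are assumptions modelled on the VMX_EPT_EXTENT_* constants this
series relies on.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed INVEPT type encodings. */
enum invept_type {
    INVEPT_INDIVIDUAL_ADDR = 0,
    INVEPT_CONTEXT         = 1,
    INVEPT_GLOBAL          = 2,
};

/* Assumed capability bits advertised to L1. */
#define CAP_EXTENT_INDIVIDUAL  (1u << 24)
#define CAP_EXTENT_CONTEXT     (1u << 25)
#define CAP_EXTENT_GLOBAL      (1u << 26)

/*
 * Hypothetical helper: return true if the INVEPT type requested by L1 is
 * one we advertised. A true result maps to "succeed without flushing";
 * false maps to failing the instruction with an invalid-operand error.
 */
static bool invept_type_supported(unsigned long type, uint32_t caps)
{
    switch (type) {
    case INVEPT_GLOBAL:
        return (caps & CAP_EXTENT_GLOBAL) != 0;
    case INVEPT_CONTEXT:
        return (caps & CAP_EXTENT_CONTEXT) != 0;
    case INVEPT_INDIVIDUAL_ADDR:
        return (caps & CAP_EXTENT_INDIVIDUAL) != 0;
    default:
        return false;
    }
}
```

No flush is modelled here because, as the message explains, the shadow
EPT02 table is already invalidated whenever L1 modifies EPT12.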

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/include/uapi/asm/vmx.h |  1 +
 arch/x86/kvm/vmx.c  | 83 +
 2 files changed, 84 insertions(+)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d651082..7a34e8f 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,7 @@
 #define EXIT_REASON_EOI_INDUCED 45
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
+#define EXIT_REASON_INVEPT  50
 #define EXIT_REASON_PREEMPTION_TIMER52
 #define EXIT_REASON_WBINVD  54
 #define EXIT_REASON_XSETBV  55
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 136fc25..9ceab42 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6245,6 +6245,87 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+   u32 vmx_instruction_info;
+   unsigned long type;
+   gva_t gva;
+   struct x86_exception e;
+   struct {
+   u64 eptp, gpa;
+   } operand;
+
+   if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+   !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   /* According to the Intel VMX instruction reference, the memory
+* operand is read even if it isn't needed (e.g., for type==global)
+*/
+   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmx_instruction_info, &gva))
+   return 1;
+   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+   sizeof(operand), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+
+   type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+   switch (type) {
+   case VMX_EPT_EXTENT_GLOBAL:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   /*
+* Do nothing: when L1 changes EPT12, we already
+* update EPT02 (the shadow EPT table) and call INVEPT.
+* So when L1 calls INVEPT, there's nothing left to do.
+*/
+   nested_vmx_succeed(vcpu);
+   }
+   break;
+   case VMX_EPT_EXTENT_CONTEXT:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   /* Do nothing */
+   nested_vmx_succeed(vcpu);
+   }
+   break;
+   case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
+   if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   else {
+   /* Do nothing */
+   nested_vmx_succeed(vcpu);
+   }
+   break;
+   default:
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   }
+
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
 /*
  * The exit handlers return

[PATCH v3 11/13] nEPT: Miscelleneous cleanups

2013-05-08 Thread Jun Nakajima

Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/vmx.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9ceab42..ca49e19 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -714,7 +714,6 @@ static void nested_release_page_clean(struct page *page)
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
struct kvm_segment *var, int seg);
@@ -1039,8 +1038,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, 
u32 bit)
(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-   struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6731,7 +6729,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
!(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-   get_vmcs12(vcpu), vcpu)))) {
+   get_vmcs12(vcpu))))) {
if (vmx_interrupt_allowed(vcpu)) {
vmx->soft_vnmi_blocked = 0;
} else if (vmx->vnmi_blocked_time > 1000000000LL &&
-- 
1.8.1.2

[PATCH v3 12/13] nEPT: Move is_rsvd_bits_set() to paging_tmpl.h

2013-05-08 Thread Jun Nakajima

Move is_rsvd_bits_set() to paging_tmpl.h so that it can be used to check
reserved bits in EPT page table entries as well.
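
As a rough illustration of the check being moved, the sketch below models
the reserved-bit test on a guest page table entry. struct toy_mmu is a
hypothetical stand-in for the rsvd_bits_mask table in struct kvm_mmu, and
the table dimensions and the example mask are assumptions made for the
sake of a self-contained example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build a mask with bits s..e (inclusive) set. */
static uint64_t rsvd_bits(int s, int e)
{
    /* Keep the shift width below 64 to avoid undefined behaviour. */
    assert(s >= 0 && s <= e && e < 64 && (e - s) < 63);
    return ((1ULL << (e - s + 1)) - 1) << s;
}

/* Toy stand-in: one mask per (bit 7 value, paging level) pair. */
struct toy_mmu {
    uint64_t rsvd_bits_mask[2][4];
};

/*
 * Return true if the entry has any reserved bit set. Bit 7 of the entry
 * (the large-page bit at most levels) selects which mask row applies,
 * and level is 1-based.
 */
static bool is_rsvd_bits_set(const struct toy_mmu *mmu, uint64_t gpte,
                             int level)
{
    int bit7 = (gpte >> 7) & 1;

    return (gpte & mmu->rsvd_bits_mask[bit7][level - 1]) != 0;
}
```

Making the function a per-PTTYPE FNAME() instance is what later allows an
EPT build of the template to substitute a different check.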

Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c |  8 
 arch/x86/kvm/paging_tmpl.h | 12 ++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 37f8d7f..93d6abf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2468,14 +2468,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
mmu_free_roots(vcpu);
 }
 
-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
-   int bit7;
-
-   bit7 = (gpte >> 7) & 1;
-   return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
 static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 bool no_dirty_log)
 {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index dc495f9..2432d49 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -124,11 +124,19 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
 }
 #endif
 
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+   int bit7;
+
+   bit7 = (gpte >> 7) & 1;
+   return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
  struct kvm_mmu_page *sp, u64 *spte,
  u64 gpte)
 {
-   if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+   if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
goto no_present;
 
if (!is_present_gpte(gpte))
@@ -279,7 +287,7 @@ retry_walk:
if (unlikely(!is_present_gpte(pte)))
goto error;
 
-   if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+   if (unlikely(FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, pte,
  walker->level))) {
errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
goto error;
-- 
1.8.1.2

[PATCH v3 13/13] nEPT: Inject EPT violation/misconfigration

2013-05-08 Thread Jun Nakajima

Add code to detect an EPT misconfiguration and inject it into the L1 VMM.
Also inject a more accurate exit qualification into the L1 VMM on an EPT
violation. L1 can now correctly reach its ept_misconfig handler (instead of
wrongly going to fast_page_fault), where it first tries to handle an MMIO
page fault; if that fails, the exit is a real EPT misconfiguration.
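
The exit qualification injected on an EPT violation is assembled from two
sources, which the following sketch tries to make explicit. The function
name nested_exit_qualification() is hypothetical; the bit layout follows
the comment in the patch: bits [2:0] come from the hardware exit
qualification and bits [5:3] from the permissions found while walking
L1's EPT tables.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine the access type reported by hardware
 * (bits 2:0 of the real exit qualification) with the permissions
 * computed from the guest EPT walk (placed in bits 5:3). All other
 * bits are left clear.
 */
static uint64_t nested_exit_qualification(uint64_t hw_exit_qual,
                                          unsigned int pte_access)
{
    return ((uint64_t)(pte_access & 0x7) << 3) | (hw_exit_qual & 0x7);
}
```

For example, a write access (bit 1) to a page that L1 mapped as readable
and executable (0x5) would yield 0x2a under this layout.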

Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/mmu.c  |  5 ---
 arch/x86/kvm/mmu.h  |  5 +++
 arch/x86/kvm/paging_tmpl.h  | 26 ++
 arch/x86/kvm/vmx.c  | 79 +++--
 5 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..1d03202 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -262,6 +262,8 @@ struct kvm_mmu {
void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
   u64 *spte, const void *pte);
+   bool (*check_tdp_pte)(u64 pte, int level);
+
hpa_t root_hpa;
int root_level;
int shadow_root_level;
@@ -503,6 +505,8 @@ struct kvm_vcpu_arch {
 * instruction.
 */
bool write_fault_to_shadow_pgtable;
+
+   unsigned long exit_qualification; /* set at EPT violation at this point */
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 93d6abf..3a3b11f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -233,11 +233,6 @@ static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t 
pfn, unsigned access)
return false;
 }
 
-static inline u64 rsvd_bits(int s, int e)
-{
-   return ((1ULL << (e - s + 1)) - 1) << s;
-}
-
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
u64 dirty_mask, u64 nx_mask, u64 x_mask)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 8fc94dd..559e2e0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -88,6 +88,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
 }
 
+static inline u64 rsvd_bits(int s, int e)
+{
+   return ((1ULL << (e - s + 1)) - 1) << s;
+}
+
 /*
  * Will a fault with a given page-fault error code (pfec) cause a permission
  * fault with the given access (in ACC_* format)?
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 2432d49..067b1f8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -126,10 +126,14 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, 
struct kvm_mmu *mmu,
 
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 {
+#if PTTYPE == PTTYPE_EPT
+   return (mmu->check_tdp_pte(gpte, level));
+#else
int bit7;
 
bit7 = (gpte >> 7) & 1;
return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+#endif
 }
 
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
@@ -352,6 +356,28 @@ error:
walker->fault.vector = PF_VECTOR;
walker->fault.error_code_valid = true;
walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+   /*
+* Use PFERR_RSVD_MASK in error_code to tell if an EPT
+* misconfiguration needs to be injected. The detection is
+* done by is_rsvd_bits_set() above.
+*
+* We set up the value of exit_qualification to inject:
+* [2:0] -- Derive from [2:0] of real exit_qualification at EPT violation
+* [5:3] -- Calculated by the page walk of the guest EPT page tables
+* [7:8] -- Clear to 0.
+*
+* The other bits are set to 0.
+*/
+   if (!(errcode & PFERR_RSVD_MASK)) {
+   unsigned long exit_qualification = vcpu->arch.exit_qualification;
+
+   pte_access = pt_access & pte;
+   vcpu->arch.exit_qualification = ((pte_access & 0x7) << 3) |
+   (exit_qualification & 0x7);
+   }
+#endif
walker->fault.address = addr;
walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca49e19..a44e7fd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5310,6 +5310,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
/* ept page table is present? */
error_code |= (exit_qualification >> 3) & 0x1;
 
+   vcpu->arch.exit_qualification = exit_qualification;
+
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -7426,7 +7428,7 @@ static unsigned long nested_ept_get_cr3(struct kvm_vcpu 
*vcpu)
 }
 
 static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
-   struct x86_exception *fault)
+   struct x86_exception *fault)
 {
struct vmcs12 *vmcs12;

Re: [PATCH] kvm/ppc: interrupt disabling fixes

2013-05-08 Thread Benjamin Herrenschmidt

On Wed, 2013-05-08 at 19:35 -0500, Scott Wood wrote:

 Sigh, and then there's this:
 
 #ifdef CONFIG_PPC64
  /* lazy EE magic */
  hard_irq_disable();
  if (lazy_irq_pending()) {
  /* Got an interrupt in between, try again */
  local_irq_enable();
  hard_irq_disable();
  kvm_guest_exit();
  continue;
  }
 
  trace_hardirqs_on();
 #endif
 
 Alex, could you be a bit more descriptive than magic please?  Can  
 this chunk of code be removed if we do the other changes being  
 discussed?  Or should we leave this in and drop the pre-enter  
 hard_irq_disable portion of the proposed changes?
 
 Why are you calling trace_hardirqs_on() here and not in  
 kvmppc_lazy_ee_enable()?  Why are you calling kvm_guest_exit() before  
 you've called kvm_guest_enter()?

I think I originated that magic... it more or less mimics prep_for_idle,
the goal was to hard disable (because we had soft disabled earlier) and
check if anything happened in between... if it did, abort, and try
again, but it's a bit fishy really.

Ben.


Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to Guest

2013-05-08 Thread tiejun.chen

On 05/08/2013 05:20 PM, Caraman Mihai Claudiu-B02008 wrote:

-Original Message-
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
Behalf Of tiejun.chen
Sent: Wednesday, May 08, 2013 4:54 AM
To: Wood Scott-B07421
Cc: ag...@suse.de; kvm-ppc@vger.kernel.org; k...@vger.kernel.org;
linuxppc-...@lists.ozlabs.org
Subject: Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to
Guest

On 05/08/2013 07:40 AM, Scott Wood wrote:

On 05/07/2013 06:06:30 AM, Tiejun Chen wrote:

We also can direct ISI exception to Guest like DSI.

Signed-off-by: Tiejun Chen tiejun.c...@windriver.com
---
  arch/powerpc/kvm/booke_emulate.c |3 +++
  arch/powerpc/kvm/e500mc.c|3 ++-
  2 files changed, 5 insertions(+), 1 deletion(-)

Are you seeing a real performance improvement from this?  This will

interfere

No. But after we reduce the exit to host, shouldn't this improve
performance?

We lose some flexibility for this so it make sense only if we gain
measurable improvements.

Sounds we have much more works to do.

somewhat with using the VF bit, if we were to ever do so, since VF only

affects

Sorry, what is the VF you said?

VF stands for virtualization fault see MAS8[VF] and we may use it for 
virtualized

I almost forget this point :)

MMIO. The hypervisor should deny execute access on pages marked with VF. 
Accordingly
in this case guest ISI exceptions should be handled by the hypervisor.

Thanks for your information.

Tiejun
