Re: [PATCH v2] KVM: Fix kvm_irqfd_init initialization
On Wed, 8 May 2013 10:57:29 +0800 Asias He as...@redhat.com wrote:

In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when
kvm_init() is called the second time (e.g. kvm-amd.ko and kvm-intel.ko),
kvm_arch_init() will fail with -EEXIST, then kvm_irqfd_exit() will be
called on the error handling path. This way, the kvm_irqfd system will
not be ready.

This patch fixes the following:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [81c0721e] _raw_spin_lock+0xe/0x30
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: vhost_net
CPU 6
Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. OptiPlex 790/0V5HMK
RIP: 0010:[81c0721e] [81c0721e] _raw_spin_lock+0xe/0x30
RSP: 0018:880221721cc8 EFLAGS: 00010046
RAX: 0100 RBX: 88022dcc003f RCX: 880221734950
RDX: 8802208f6ca8 RSI: 7fff RDI:
RBP: 880221721cc8 R08: 0002 R09: 0002
R10: 7f7fd01087e0 R11: 0246 R12: 8802208f6ca8
R13: 0080 R14: 880223e2a900 R15:
FS: 7f7fd38488e0() GS:88022dcc() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 00022309f000 CR4: 000427e0
DR0: DR1: DR2:
DR3: DR6: 0ff0 DR7: 0400
Process qemu-system-x86 (pid: 4257, threadinfo 88022172, task 880222bd5640)
Stack:
 880221721d08 810ac5c5 88022431dc00 0086 0080 880223e2a900 8802208f6ca8 880221721d48 810ac8fe 880221734000
Call Trace:
 [810ac5c5] __queue_work+0x45/0x2d0
 [810ac8fe] queue_work_on+0x8e/0xa0
 [810ac949] queue_work+0x19/0x20
 [81009b6b] irqfd_deactivate+0x4b/0x60
 [8100a69d] kvm_irqfd+0x39d/0x580
 [81007a27] kvm_vm_ioctl+0x207/0x5b0
 [810c9545] ? update_curr+0xf5/0x180
 [811b66e8] do_vfs_ioctl+0x98/0x550
 [810c1f5e] ? finish_task_switch+0x4e/0xe0
 [81c054aa] ? __schedule+0x2ea/0x710
 [811b6bf7] sys_ioctl+0x57/0x90
 [8140ae9e] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [81c0f602] system_call_fastpath+0x16/0x1b
Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00 f0 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f
RIP [81c0721e] _raw_spin_lock+0xe/0x30
 RSP 880221721cc8
CR2:
---[ end trace 13fb1e4b6e5ab21f ]---

Signed-off-by: Asias He as...@redhat.com
Acked-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 virt/kvm/kvm_main.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8fd325a..85b93d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3078,13 +3078,21 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	int r;
 	int cpu;

-	r = kvm_irqfd_init();
-	if (r)
-		goto out_irqfd;
 	r = kvm_arch_init(opaque);
 	if (r)
 		goto out_fail;

+	/*
+	 * kvm_arch_init makes sure there's at most one caller
+	 * for architectures that support multiple implementations,
+	 * like intel and amd on x86.
+	 * kvm_arch_init must be called before kvm_irqfd_init to avoid creating
+	 * conflicts in case kvm is already setup for another implementation.
+	 */
+	r = kvm_irqfd_init();
+	if (r)
+		goto out_irqfd;
+
 	if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL)) {
 		r = -ENOMEM;
 		goto out_free_0;
@@ -3159,10 +3167,10 @@ out_free_1:
 out_free_0a:
 	free_cpumask_var(cpus_hardware_enabled);
 out_free_0:
-	kvm_arch_exit();
-out_fail:
 	kvm_irqfd_exit();
 out_irqfd:
+	kvm_arch_exit();
+out_fail:
 	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_init);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
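The reordering in the patch above follows the usual rule that error labels unwind in the reverse order of initialization, so a caller that fails early never tears down state it did not set up. Below is a minimal user-space sketch of that idea. The names `arch_init`, `irqfd_init` and `module_init_sketch` are hypothetical stand-ins, not the KVM functions, and the two flags only model "an implementation is registered" and "the irqfd machinery is usable".

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kvm_arch_init()/kvm_irqfd_init(). */
static bool arch_registered;   /* set by the first successful caller */
static bool irqfd_ready;       /* models the shared irqfd state */

static int arch_init(void)
{
	if (arch_registered)
		return -EEXIST;    /* a second implementation is refused */
	arch_registered = true;
	return 0;
}

static void arch_exit(void)  { arch_registered = false; }
static int  irqfd_init(void) { irqfd_ready = true; return 0; }
static void irqfd_exit(void) { irqfd_ready = false; }

/* Same ordering as the patched kvm_init(): arch first, irqfd second. */
static int module_init_sketch(bool fail_late)
{
	int r;

	r = arch_init();
	if (r)
		goto out_fail;     /* nothing of ours to undo yet */

	r = irqfd_init();
	if (r)
		goto out_irqfd;

	if (fail_late) {           /* models a later allocation failure */
		r = -ENOMEM;
		goto out_free;
	}
	return 0;

out_free:
	irqfd_exit();
out_irqfd:
	arch_exit();
out_fail:
	return r;
}
```

With this ordering, a second call that fails with -EEXIST returns through `out_fail` and leaves `irqfd_ready` untouched, which is the property the commit message describes.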
Re: [PATCH] vhost-test: Make vhost/test.c work
On Tue, May 07, 2013 at 02:22:32PM +0300, Michael S. Tsirkin wrote:

On Tue, May 07, 2013 at 02:52:45PM +0800, Asias He wrote:

Fix it by:
1) switching to use the new device specific fields per vq
2) not including vhost.c, instead make vhost-test.ko depend on vhost.ko.

Please split this up.
1. make test work for 3.10
2. make test work for 3.11

thanks!

okay.

---
 drivers/vhost/test.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..dc526eb 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -18,7 +18,7 @@
 #include <linux/slab.h>

 #include "test.h"
-#include "vhost.c"
+#include "vhost.h"

 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
@@ -29,16 +29,20 @@ enum {
 	VHOST_TEST_VQ_MAX = 1,
 };

+struct vhost_test_virtqueue {
+	struct vhost_virtqueue vq;
+};
+

This isn't needed or useful. Drop above change pls and patch
size will shrink.

The difference is:

 drivers/vhost/test.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

 drivers/vhost/test.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

which is not significant.

So, I think it is better to code the same way as we do in vhost-net and
vhost-scsi which makes the device specific usage more consistent.

 struct vhost_test {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
+	struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
 };

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+	struct vhost_virtqueue *vq = n->dev.vqs[VHOST_TEST_VQ];
 	unsigned out, in;
 	int head;
 	size_t len, total_len = 0;
@@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work *work)
 static int vhost_test_open(struct inode *inode, struct file *f)
 {
 	struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_virtqueue **vqs;
 	struct vhost_dev *dev;
 	int r;

 	if (!n)
 		return -ENOMEM;

+	vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		kfree(n);
+		return -ENOMEM;
+	}
+
 	dev = &n->dev;
-	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ].vq;
+	n->vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
+	r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
@@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test *n,

 static void vhost_test_stop(struct vhost_test *n, void **privatep)
 {
-	*privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ);
+	*privatep = vhost_test_stop_vq(n, &n->vqs[VHOST_TEST_VQ].vq);
 }

 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	vhost_poll_flush(&n->vqs[index].vq.poll);
 }

 static void vhost_test_flush(struct vhost_test *n)
@@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_test_flush(n);
+	kfree(n->dev.vqs);
 	kfree(n);
 	return 0;
 }
@@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, int test)

 	for (index = 0; index < n->dev.nvqs; ++index) {
 		/* Verify that ring has been setup correctly. */
-		if (!vhost_vq_access_ok(&n->vqs[index])) {
+		if (!vhost_vq_access_ok(&n->vqs[index].vq)) {
 			r = -EFAULT;
 			goto err;
 		}
 	}

 	for (index = 0; index < n->dev.nvqs; ++index) {
-		vq = n->vqs + index;
+		vq = &n->vqs[index].vq;
 		mutex_lock(&vq->mutex);
 		priv = test ? n : NULL;

@@ -195,7 +208,7 @@ static long vhost_test_run(struct vhost_test *n, int test)
 						lockdep_is_held(&vq->mutex));
 		rcu_assign_pointer(vq->private_data, priv);

-		r = vhost_init_used(&n->vqs[index]);
+		r = vhost_init_used(&n->vqs[index].vq);

 		mutex_unlock(&vq->mutex);

@@ -268,14 +281,14 @@ static long vhost_test_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		return vhost_test_run(n, test);
 	case VHOST_GET_FEATURES:
-		features = VHOST_NET_FEATURES;
+		features =
[PATCH v2] vhost-test: Make vhost/test.c work
Fix it by switching to use the new device specific fields per vq

Signed-off-by: Asias He as...@redhat.com
---
This is for 3.10.

 drivers/vhost/test.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..7b49d10 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -29,16 +29,20 @@ enum {
 	VHOST_TEST_VQ_MAX = 1,
 };

+struct vhost_test_virtqueue {
+	struct vhost_virtqueue vq;
+};
+
 struct vhost_test {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
+	struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
 };

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+	struct vhost_virtqueue *vq = n->dev.vqs[VHOST_TEST_VQ];
 	unsigned out, in;
 	int head;
 	size_t len, total_len = 0;
@@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work *work)
 static int vhost_test_open(struct inode *inode, struct file *f)
 {
 	struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_virtqueue **vqs;
 	struct vhost_dev *dev;
 	int r;

 	if (!n)
 		return -ENOMEM;

+	vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		kfree(n);
+		return -ENOMEM;
+	}
+
 	dev = &n->dev;
-	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ].vq;
+	n->vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
+	r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
@@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test *n,

 static void vhost_test_stop(struct vhost_test *n, void **privatep)
 {
-	*privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ);
+	*privatep = vhost_test_stop_vq(n, &n->vqs[VHOST_TEST_VQ].vq);
 }

 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	vhost_poll_flush(&n->vqs[index].vq.poll);
 }

 static void vhost_test_flush(struct vhost_test *n)
@@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_test_flush(n);
+	kfree(n->dev.vqs);
 	kfree(n);
 	return 0;
 }
@@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, int test)

 	for (index = 0; index < n->dev.nvqs; ++index) {
 		/* Verify that ring has been setup correctly. */
-		if (!vhost_vq_access_ok(&n->vqs[index])) {
+		if (!vhost_vq_access_ok(&n->vqs[index].vq)) {
 			r = -EFAULT;
 			goto err;
 		}
 	}

 	for (index = 0; index < n->dev.nvqs; ++index) {
-		vq = n->vqs + index;
+		vq = &n->vqs[index].vq;
 		mutex_lock(&vq->mutex);
 		priv = test ? n : NULL;

@@ -195,7 +208,7 @@ static long vhost_test_run(struct vhost_test *n, int test)
 						lockdep_is_held(&vq->mutex));
 		rcu_assign_pointer(vq->private_data, priv);

-		r = vhost_init_used(&n->vqs[index]);
+		r = vhost_init_used(&n->vqs[index].vq);

 		mutex_unlock(&vq->mutex);

@@ -268,14 +281,14 @@ static long vhost_test_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		return vhost_test_run(n, test);
 	case VHOST_GET_FEATURES:
-		features = VHOST_NET_FEATURES;
+		features = VHOST_FEATURES;
 		if (copy_to_user(featurep, &features, sizeof features))
 			return -EFAULT;
 		return 0;
 	case VHOST_SET_FEATURES:
 		if (copy_from_user(&features, featurep, sizeof features))
 			return -EFAULT;
-		if (features & ~VHOST_NET_FEATURES)
+		if (features & ~VHOST_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_test_set_features(n, features);
 	case VHOST_RESET_OWNER:
--
1.8.1.4
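The layout used in the patch above, a device-specific wrapper that embeds the generic virtqueue plus a separately allocated array of pointers handed to the core, can be sketched in plain user-space C. Everything here is a hypothetical stand-in (`struct virtqueue`, `struct dev`, `dev_init`, `private_counter`); none of it is the vhost API, and `calloc`/`free` replace `kmalloc`/`kfree`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the vhost definitions. */
struct virtqueue { int num; };
struct dev { struct virtqueue **vqs; int nvqs; };

#define TEST_VQ_MAX 1

/* Device-specific wrapper: the generic vq plus room for private fields. */
struct test_virtqueue {
	struct virtqueue vq;
	int private_counter;       /* illustrative device-specific field */
};

struct test_dev {
	struct dev dev;
	struct test_virtqueue vqs[TEST_VQ_MAX];
};

/* Recover the wrapper from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The core only ever sees an array of pointers to generic vqs. */
static void dev_init(struct dev *d, struct virtqueue **vqs, int nvqs)
{
	d->vqs = vqs;
	d->nvqs = nvqs;
}

static struct test_dev *test_open(void)
{
	struct test_dev *n = calloc(1, sizeof(*n));
	struct virtqueue **vqs;
	int i;

	if (!n)
		return NULL;
	vqs = calloc(TEST_VQ_MAX, sizeof(*vqs));
	if (!vqs) {
		free(n);
		return NULL;
	}
	for (i = 0; i < TEST_VQ_MAX; i++)
		vqs[i] = &n->vqs[i].vq;
	dev_init(&n->dev, vqs, TEST_VQ_MAX);
	return n;
}

static void test_release(struct test_dev *n)
{
	free(n->dev.vqs);          /* the pointer array is owned by the device */
	free(n);
}
```

The pointer array is heap-allocated in this sketch because the core keeps the pointer after `dev_init` returns, so its lifetime has to match the device's.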
Re: regression in v3.9? a guest stuck in BIOS if emulate_invalid_guest_state=Y
On 05/08/13 12:22, Jun'ichi Nomura wrote:

On 07/05/2013 14:06, Gleb Natapov wrote:

What is the output of virsh qemu-monitor-command vm12 --hmp x/i $pc
when it hangs?

# virsh qemu-monitor-command vm12 --hmp x/4i \$pc
0x000c06ca:  aam    $0xa
0x000c06cc:  mov    %ax,%bx
0x000c06ce:  mov    %bh,%al
0x000c06d0:  aam    $0xa

# virsh qemu-monitor-command vm12 --hmp x/8b \$pc
000c06ca: 0xd4 0x0a 0x89 0xc3 0x88 0xf8 0xd4 0x0a

I could also reproduce the problem with the following:

# dd if=/dev/zero of=/root/empty.img bs=1M count=1
# /usr/libexec/qemu-kvm -enable-kvm -nographic -nodefconfig -nodefaults -chardev socket,id=cmon,host=localhost,port=,server,nowait -mon chardev=cmon,mode=readline -drive file=/root/empty.img -chardev stdio,id=ser0 -device isa-serial,chardev=ser0

With the v3.8 kernel, it reaches the point of showing "No bootable device" (as expected).
With the v3.9 kernel, no visible characters appear on the console.

The EIP of the stalled guest points to a different instruction than in
the previously reported case, though:

(qemu) info registers
info registers
EAX=f000e81b EBX=0130 ECX=fa2b EDX=031b
ESI=00ed EDI=0050 EBP= ESP=6eaa
EIP=0564 EFL=0046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0040 0400 00809300
CS =c000 000c 00809b00
SS = 00809300
DS =c000 000c 00809300
FS = 00809300
GS = 00809300
LDT= 8200
TR = 8b00
GDT= 000fc558 0037
IDT= 03ff
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0= FPR1= FPR2= FPR3=
FPR4= FPR5= FPR6= FPR7=
XMM00= XMM01= XMM02= XMM03=
XMM04= XMM05= XMM06= XMM07=
(qemu)
(qemu) x/8b $pc
x/8b $pc
000c0564: 0xd7 0x1f 0x24 0x7f 0x88 0xc4 0x88 0xd0
(qemu)
(qemu) x/i $pc
x/i $pc
0x000c0564:  xlat   %ds:(%bx)

--
Jun'ichi Nomura, NEC Corporation
Re: [PATCH v2] vhost-test: Make vhost/test.c work
On Wed, May 08, 2013 at 03:24:33PM +0800, Asias He wrote:

Fix it by switching to use the new device specific fields per vq

Signed-off-by: Asias He as...@redhat.com
---
This is for 3.10.

 drivers/vhost/test.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..7b49d10 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -29,16 +29,20 @@ enum {
 	VHOST_TEST_VQ_MAX = 1,
 };

+struct vhost_test_virtqueue {
+	struct vhost_virtqueue vq;
+};
+

Well there are no test specific fields here, so this structure
is not needed. Here's what I queued:

---

vhost-test: fix up test module after API change

Recent vhost API changes broke vhost test module.
Update it to the new APIs.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index be65414..c2c3d91 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -38,7 +38,7 @@ struct vhost_test {
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+	struct vhost_virtqueue *vq = &n->vqs[VHOST_TEST_VQ];
 	unsigned out, in;
 	int head;
 	size_t len, total_len = 0;
@@ -102,6 +102,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 {
 	struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
 	struct vhost_dev *dev;
+	struct vhost_virtqueue *vqs[VHOST_TEST_VQ_MAX];
 	int r;

 	if (!n)
@@ -109,7 +110,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)

 	dev = &n->dev;
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
+	r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
@@ -140,7 +142,7 @@ static void vhost_test_stop(struct vhost_test *n, void **privatep)

 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	vhost_poll_flush(&n->vqs[index].poll);
 }

 static void vhost_test_flush(struct vhost_test *n)
@@ -268,21 +270,21 @@ static long vhost_test_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		return vhost_test_run(n, test);
 	case VHOST_GET_FEATURES:
-		features = VHOST_NET_FEATURES;
+		features = VHOST_FEATURES;
 		if (copy_to_user(featurep, &features, sizeof features))
 			return -EFAULT;
 		return 0;
 	case VHOST_SET_FEATURES:
 		if (copy_from_user(&features, featurep, sizeof features))
 			return -EFAULT;
-		if (features & ~VHOST_NET_FEATURES)
+		if (features & ~VHOST_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_test_set_features(n, features);
 	case VHOST_RESET_OWNER:
 		return vhost_test_reset_owner(n);
 	default:
 		mutex_lock(&n->dev.mutex);
-		r = vhost_dev_ioctl(&n->dev, ioctl, arg);
+		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
 		vhost_test_flush(n);
 		mutex_unlock(&n->dev.mutex);
 		return r;
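The VHOST_SET_FEATURES branch quoted in the patch above accepts a feature word only if every set bit is inside the supported mask, and returns -EOPNOTSUPP otherwise. A small sketch of that check follows; the `SKETCH_*` bit values and the `set_features` helper are made up for illustration and are not the real VHOST_FEATURES definition.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical feature bits; not the real VHOST_FEATURES value. */
#define SKETCH_F_A       (1ULL << 0)
#define SKETCH_F_B       (1ULL << 1)
#define SKETCH_FEATURES  (SKETCH_F_A | SKETCH_F_B)

/* Accept the request only if no bit falls outside the supported mask. */
static int set_features(uint64_t *acked, uint64_t requested)
{
	if (requested & ~SKETCH_FEATURES)
		return -EOPNOTSUPP;
	*acked = requested;
	return 0;
}
```

Masking with the complement of the supported set catches any unknown bit in a single test, so the check does not need updating when a supported bit is added to the mask.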
Re: [PATCH] vhost-test: Make vhost/test.c work
On Wed, May 08, 2013 at 03:14:58PM +0800, Asias He wrote:

On Tue, May 07, 2013 at 02:22:32PM +0300, Michael S. Tsirkin wrote:

On Tue, May 07, 2013 at 02:52:45PM +0800, Asias He wrote:

Fix it by:
1) switching to use the new device specific fields per vq
2) not including vhost.c, instead make vhost-test.ko depend on vhost.ko.

Please split this up.
1. make test work for 3.10
2. make test work for 3.11

thanks!

okay.

---
 drivers/vhost/test.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..dc526eb 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -18,7 +18,7 @@
 #include <linux/slab.h>

 #include "test.h"
-#include "vhost.c"
+#include "vhost.h"

 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
@@ -29,16 +29,20 @@ enum {
 	VHOST_TEST_VQ_MAX = 1,
 };

+struct vhost_test_virtqueue {
+	struct vhost_virtqueue vq;
+};
+

This isn't needed or useful. Drop above change pls and patch
size will shrink.

The difference is:

 drivers/vhost/test.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

 drivers/vhost/test.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

which is not significant.

I did it like this:

 test.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

So, I think it is better to code the same way as we do in vhost-net and
vhost-scsi which makes the device specific usage more consistent.

 struct vhost_test {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
+	struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
 };

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+	struct vhost_virtqueue *vq = n->dev.vqs[VHOST_TEST_VQ];
 	unsigned out, in;
 	int head;
 	size_t len, total_len = 0;
@@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work *work)
 static int vhost_test_open(struct inode *inode, struct file *f)
 {
 	struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_virtqueue **vqs;
 	struct vhost_dev *dev;
 	int r;

 	if (!n)
 		return -ENOMEM;

+	vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		kfree(n);
+		return -ENOMEM;
+	}
+
 	dev = &n->dev;
-	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ].vq;
+	n->vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
+	r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
@@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test *n,

 static void vhost_test_stop(struct vhost_test *n, void **privatep)
 {
-	*privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ);
+	*privatep = vhost_test_stop_vq(n, &n->vqs[VHOST_TEST_VQ].vq);
 }

 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	vhost_poll_flush(&n->vqs[index].vq.poll);
 }

 static void vhost_test_flush(struct vhost_test *n)
@@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_test_flush(n);
+	kfree(n->dev.vqs);
 	kfree(n);
 	return 0;
 }
@@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, int test)

 	for (index = 0; index < n->dev.nvqs; ++index) {
 		/* Verify that ring has been setup correctly. */
-		if (!vhost_vq_access_ok(&n->vqs[index])) {
+		if (!vhost_vq_access_ok(&n->vqs[index].vq)) {
 			r = -EFAULT;
 			goto err;
 		}
 	}

 	for (index = 0; index < n->dev.nvqs; ++index) {
-		vq = n->vqs + index;
+		vq = &n->vqs[index].vq;
 		mutex_lock(&vq->mutex);
 		priv = test ? n : NULL;

@@ -195,7 +208,7 @@ static long vhost_test_run(struct vhost_test *n, int test)
 						lockdep_is_held(&vq->mutex));
 		rcu_assign_pointer(vq->private_data, priv);

-		r = vhost_init_used(&n->vqs[index]);
+		r = vhost_init_used(&n->vqs[index].vq);

 		mutex_unlock(&vq->mutex);

@@ -268,14 +281,14 @@ static long vhost_test_ioctl(struct file *f, unsigned
Re: [PATCH] vhost-test: Make vhost/test.c work
On Wed, May 08, 2013 at 10:59:03AM +0300, Michael S. Tsirkin wrote:

On Wed, May 08, 2013 at 03:14:58PM +0800, Asias He wrote:

On Tue, May 07, 2013 at 02:22:32PM +0300, Michael S. Tsirkin wrote:

On Tue, May 07, 2013 at 02:52:45PM +0800, Asias He wrote:

Fix it by:
1) switching to use the new device specific fields per vq
2) not including vhost.c, instead make vhost-test.ko depend on vhost.ko.

Please split this up.
1. make test work for 3.10
2. make test work for 3.11

thanks!

okay.

---
 drivers/vhost/test.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..dc526eb 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -18,7 +18,7 @@
 #include <linux/slab.h>

 #include "test.h"
-#include "vhost.c"
+#include "vhost.h"

 /* Max number of bytes transferred before requeueing the job.
  * Using this limit prevents one virtqueue from starving others. */
@@ -29,16 +29,20 @@ enum {
 	VHOST_TEST_VQ_MAX = 1,
 };

+struct vhost_test_virtqueue {
+	struct vhost_virtqueue vq;
+};
+

This isn't needed or useful. Drop above change pls and patch
size will shrink.

The difference is:

 drivers/vhost/test.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

 drivers/vhost/test.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

which is not significant.

I did it like this:

 test.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

The extra 8 insertions is for vqs allocation which can be dropped.
Well, if you prefer shorter code over consistency. Go ahead.

So, I think it is better to code the same way as we do in vhost-net and
vhost-scsi which makes the device specific usage more consistent.

 struct vhost_test {
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[VHOST_TEST_VQ_MAX];
+	struct vhost_test_virtqueue vqs[VHOST_TEST_VQ_MAX];
 };

 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+	struct vhost_virtqueue *vq = n->dev.vqs[VHOST_TEST_VQ];
 	unsigned out, in;
 	int head;
 	size_t len, total_len = 0;
@@ -101,15 +105,23 @@ static void handle_vq_kick(struct vhost_work *work)
 static int vhost_test_open(struct inode *inode, struct file *f)
 {
 	struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
+	struct vhost_virtqueue **vqs;
 	struct vhost_dev *dev;
 	int r;

 	if (!n)
 		return -ENOMEM;

+	vqs = kmalloc(VHOST_TEST_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		kfree(n);
+		return -ENOMEM;
+	}
+
 	dev = &n->dev;
-	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ].vq;
+	n->vqs[VHOST_TEST_VQ].vq.handle_kick = handle_vq_kick;
+	r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
@@ -135,12 +147,12 @@ static void *vhost_test_stop_vq(struct vhost_test *n,

 static void vhost_test_stop(struct vhost_test *n, void **privatep)
 {
-	*privatep = vhost_test_stop_vq(n, n->vqs + VHOST_TEST_VQ);
+	*privatep = vhost_test_stop_vq(n, &n->vqs[VHOST_TEST_VQ].vq);
 }

 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	vhost_poll_flush(&n->vqs[index].vq.poll);
 }

 static void vhost_test_flush(struct vhost_test *n)
@@ -159,6 +171,7 @@ static int vhost_test_release(struct inode *inode, struct file *f)
 	/* We do an extra flush before freeing memory,
 	 * since jobs can re-queue themselves. */
 	vhost_test_flush(n);
+	kfree(n->dev.vqs);
 	kfree(n);
 	return 0;
 }
@@ -179,14 +192,14 @@ static long vhost_test_run(struct vhost_test *n, int test)

 	for (index = 0; index < n->dev.nvqs; ++index) {
 		/* Verify that ring has been setup correctly. */
-		if (!vhost_vq_access_ok(&n->vqs[index])) {
+		if (!vhost_vq_access_ok(&n->vqs[index].vq)) {
 			r = -EFAULT;
 			goto err;
 		}
 	}

 	for (index
Re: [PATCH v2] vhost-test: Make vhost/test.c work
On Wed, May 08, 2013 at 10:56:19AM +0300, Michael S. Tsirkin wrote:

On Wed, May 08, 2013 at 03:24:33PM +0800, Asias He wrote:

Fix it by switching to use the new device specific fields per vq

Signed-off-by: Asias He as...@redhat.com
---
This is for 3.10.

 drivers/vhost/test.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..7b49d10 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -29,16 +29,20 @@ enum {
 	VHOST_TEST_VQ_MAX = 1,
 };

+struct vhost_test_virtqueue {
+	struct vhost_virtqueue vq;
+};
+

Well there are no test specific fields here, so this structure
is not needed. Here's what I queued:

Could you push the queue to your git repo?

---

vhost-test: fix up test module after API change

Recent vhost API changes broke vhost test module.
Update it to the new APIs.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index be65414..c2c3d91 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -38,7 +38,7 @@ struct vhost_test {
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+	struct vhost_virtqueue *vq = &n->vqs[VHOST_TEST_VQ];
 	unsigned out, in;
 	int head;
 	size_t len, total_len = 0;
@@ -102,6 +102,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 {
 	struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
 	struct vhost_dev *dev;
+	struct vhost_virtqueue *vqs[VHOST_TEST_VQ_MAX];
 	int r;

 	if (!n)
@@ -109,7 +110,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)

 	dev = &n->dev;
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
+	r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
@@ -140,7 +142,7 @@ static void vhost_test_stop(struct vhost_test *n, void **privatep)

 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	vhost_poll_flush(&n->vqs[index].poll);
 }

 static void vhost_test_flush(struct vhost_test *n)
@@ -268,21 +270,21 @@ static long vhost_test_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		return vhost_test_run(n, test);
 	case VHOST_GET_FEATURES:
-		features = VHOST_NET_FEATURES;
+		features = VHOST_FEATURES;
 		if (copy_to_user(featurep, &features, sizeof features))
 			return -EFAULT;
 		return 0;
 	case VHOST_SET_FEATURES:
 		if (copy_from_user(&features, featurep, sizeof features))
 			return -EFAULT;
-		if (features & ~VHOST_NET_FEATURES)
+		if (features & ~VHOST_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_test_set_features(n, features);
 	case VHOST_RESET_OWNER:
 		return vhost_test_reset_owner(n);
 	default:
 		mutex_lock(&n->dev.mutex);
-		r = vhost_dev_ioctl(&n->dev, ioctl, arg);
+		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
 		vhost_test_flush(n);
 		mutex_unlock(&n->dev.mutex);
 		return r;

--
Asias
Re: [PATCH v2] vhost-test: Make vhost/test.c work
On Wed, May 08, 2013 at 04:17:19PM +0800, Asias He wrote:

On Wed, May 08, 2013 at 10:56:19AM +0300, Michael S. Tsirkin wrote:

On Wed, May 08, 2013 at 03:24:33PM +0800, Asias He wrote:

Fix it by switching to use the new device specific fields per vq

Signed-off-by: Asias He as...@redhat.com
---
This is for 3.10.

 drivers/vhost/test.c | 35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 1ee45bc..7b49d10 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -29,16 +29,20 @@ enum {
 	VHOST_TEST_VQ_MAX = 1,
 };

+struct vhost_test_virtqueue {
+	struct vhost_virtqueue vq;
+};
+

Well there are no test specific fields here, so this structure
is not needed. Here's what I queued:

Could you push the queue to your git repo?

done
branch vhost

---

vhost-test: fix up test module after API change

Recent vhost API changes broke vhost test module.
Update it to the new APIs.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index be65414..c2c3d91 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -38,7 +38,7 @@ struct vhost_test {
  * read-size critical section for our kind of RCU. */
 static void handle_vq(struct vhost_test *n)
 {
-	struct vhost_virtqueue *vq = &n->dev.vqs[VHOST_TEST_VQ];
+	struct vhost_virtqueue *vq = &n->vqs[VHOST_TEST_VQ];
 	unsigned out, in;
 	int head;
 	size_t len, total_len = 0;
@@ -102,6 +102,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 {
 	struct vhost_test *n = kmalloc(sizeof *n, GFP_KERNEL);
 	struct vhost_dev *dev;
+	struct vhost_virtqueue *vqs[VHOST_TEST_VQ_MAX];
 	int r;

 	if (!n)
@@ -109,7 +110,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)

 	dev = &n->dev;
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	r = vhost_dev_init(dev, n->vqs, VHOST_TEST_VQ_MAX);
+	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
+	r = vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
 	if (r < 0) {
 		kfree(n);
 		return r;
@@ -140,7 +142,7 @@ static void vhost_test_stop(struct vhost_test *n, void **privatep)

 static void vhost_test_flush_vq(struct vhost_test *n, int index)
 {
-	vhost_poll_flush(&n->dev.vqs[index].poll);
+	vhost_poll_flush(&n->vqs[index].poll);
 }

 static void vhost_test_flush(struct vhost_test *n)
@@ -268,21 +270,21 @@ static long vhost_test_ioctl(struct file *f, unsigned int ioctl,
 			return -EFAULT;
 		return vhost_test_run(n, test);
 	case VHOST_GET_FEATURES:
-		features = VHOST_NET_FEATURES;
+		features = VHOST_FEATURES;
 		if (copy_to_user(featurep, &features, sizeof features))
 			return -EFAULT;
 		return 0;
 	case VHOST_SET_FEATURES:
 		if (copy_from_user(&features, featurep, sizeof features))
 			return -EFAULT;
-		if (features & ~VHOST_NET_FEATURES)
+		if (features & ~VHOST_FEATURES)
 			return -EOPNOTSUPP;
 		return vhost_test_set_features(n, features);
 	case VHOST_RESET_OWNER:
 		return vhost_test_reset_owner(n);
 	default:
 		mutex_lock(&n->dev.mutex);
-		r = vhost_dev_ioctl(&n->dev, ioctl, arg);
+		r = vhost_dev_ioctl(&n->dev, ioctl, argp);
 		vhost_test_flush(n);
 		mutex_unlock(&n->dev.mutex);
 		return r;

--
Asias
Re: regression in v3.9? a guest stuck in BIOS if emulate_invalid_guest_state=Y
On 08/05/2013 09:34, Jun'ichi Nomura wrote:

On 05/08/13 12:22, Jun'ichi Nomura wrote:

On 07/05/2013 14:06, Gleb Natapov wrote:

What is the output of virsh qemu-monitor-command vm12 --hmp x/i $pc
when it hangs?

# virsh qemu-monitor-command vm12 --hmp x/4i \$pc
0x000c06ca:  aam    $0xa
0x000c06cc:  mov    %ax,%bx
0x000c06ce:  mov    %bh,%al
0x000c06d0:  aam    $0xa

# virsh qemu-monitor-command vm12 --hmp x/8b \$pc
000c06ca: 0xd4 0x0a 0x89 0xc3 0x88 0xf8 0xd4 0x0a

(qemu) x/8b $pc
x/8b $pc
000c0564: 0xd7 0x1f 0x24 0x7f 0x88 0xc4 0x88 0xd0
(qemu)
(qemu) x/i $pc
x/i $pc
0x000c0564:  xlat   %ds:(%bx)

Both of these sequences are found in sgabios. The second goes on
as follows:

	popw %ds
	andb $0x7f, %al
	movb %al, %ah
	movb %dl, %al

Thanks for the report!

Paolo
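For readers less familiar with the two instructions the guest stalls on above, here is a rough C model of their architectural effect on the 8-bit and 16-bit registers involved. This is only an illustration: `struct regs16`, `emulate_aam` and `emulate_xlat` are hypothetical names, the arithmetic flags that AAM updates are left out, and this is not the code path of KVM's instruction emulator.

```c
#include <stdint.h>

/* Hypothetical register file for illustration; not KVM's emulator state. */
struct regs16 {
	uint8_t al, ah;
	uint16_t bx;
};

/*
 * AAM imm8: AH = AL / imm8, AL = AL % imm8.
 * A zero immediate raises a divide error on hardware; modelled as -1 here.
 */
static int emulate_aam(struct regs16 *r, uint8_t imm)
{
	uint8_t al = r->al;

	if (imm == 0)
		return -1;
	r->ah = al / imm;
	r->al = al % imm;
	return 0;
}

/*
 * XLAT: AL = byte at DS:[BX + AL], with the offset wrapping at 16 bits.
 * 'seg' must point at 64 KiB representing the DS segment.
 */
static void emulate_xlat(struct regs16 *r, const uint8_t *seg)
{
	r->al = seg[(uint16_t)(r->bx + r->al)];
}
```

With `aam $0xa`, for example, AL = 42 yields AH = 4 and AL = 2, which is why sgabios-style code uses it to split a value into decimal digits.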
Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to Guest
On 05/08/2013 05:20 PM, Caraman Mihai Claudiu-B02008 wrote:

-----Original Message-----
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of tiejun.chen
Sent: Wednesday, May 08, 2013 4:54 AM
To: Wood Scott-B07421
Cc: ag...@suse.de; kvm-...@vger.kernel.org; kvm@vger.kernel.org; linuxppc-...@lists.ozlabs.org
Subject: Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to Guest

On 05/08/2013 07:40 AM, Scott Wood wrote:
On 05/07/2013 06:06:30 AM, Tiejun Chen wrote:

We also can direct ISI exception to Guest like DSI.

Signed-off-by: Tiejun Chen tiejun.c...@windriver.com
---
 arch/powerpc/kvm/booke_emulate.c | 3 +++
 arch/powerpc/kvm/e500mc.c        | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

Are you seeing a real performance improvement from this? This will interfere somewhat with using the VF bit, if we were to ever do so, since VF only affects

No. But after we reduce the exits to the host, shouldn't this improve performance?

We lose some flexibility for this, so it makes sense only if we gain measurable improvements.

Sounds like we have much more work to do.

Sorry, what is the VF you mentioned?

VF stands for virtualization fault, see MAS8[VF], and we may use it for virtualized MMIO. The hypervisor should deny execute access on pages marked with VF. Accordingly, in this case guest ISI exceptions should be handled by the hypervisor.

I almost forgot this point :)

Thanks for your information.

Tiejun
Re: Kernel 3.9 - can't boot qemu with accel=kvm _and_ networking enabled
Paolo,

The full command line is:

qemu-system-x86_64 -machine accel=kvm -m 1024m \
  -net tap -net nic \
  -drive file=/dev/zpool/testsrv,index=0,cache=writethrough \
  -k en-us \
  -no-kvm-irqchip \
  -vga cirrus

I've tried various combinations of -net options, but the result is always the same. I think this is somehow related to http://article.gmane.org/gmane.comp.emulators.kvm.devel/109461, as setting emulate_invalid_guest_state=0 solves the problem. However, I'm not aware of any consequences of this change.

Actually, the other bug involves sgabios and you are not using it. Please try executing the following commands from the monitor (you can use -monitor stdio to make cut-and-paste simpler):

x/8i \$pc
x/64b \$pc

and include the output in the reply to this message.

Thanks,

Paolo
[PATCH] virtio-balloon spec: rework VIRTIO_BALLOON_F_MUST_TELL_HOST feature, support silent deflation
The idea of the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is to let drivers skip usage of the deflate queue when leaking the balloon (silent deflation). Guests may benefit from silent deflation by aggressively inflating the balloon; they know that they will be able to use ballooned pages without issuing a (blocking) request to the device.

The problem is that this feature is a negative feature: if set, the guest _may not_ use ballooned pages directly. Negative features are not safe against migration; here is an explanation of why this is so.

For a positive feature, migration is possible if the destination supports it, or the source didn't set it:

    dest support   source set   ok?
    T              T            T
    T              F            T
    F              T            F
    F              F            T

For a negative feature, migration is possible if the destination does not support it, or the source set it:

    dest support   source set   ok?
    T              T            T
    T              F            F
    F              T            T
    F              F            T

However, the F/T line violates the virtio specification because the negotiated features are supposed to be the AND of the device- and driver-supported features. Furthermore, this assumes that the destination host knows which features are positive and which are negative, which obviously cannot be the case in general. (The original spec assumed that every device supports VIRTIO_BALLOON_F_MUST_TELL_HOST, but this was not explicitly documented and in practice it turns out not to be the case).

Not all is lost, however. First, all known device implementations support silent deflation, hence they do not negotiate the feature. We are thus somewhat free to redefine what the host should do about this feature. Second, by chance, coincidence or an evil plot, the only known driver that does not negotiate VIRTIO_BALLOON_F_MUST_TELL_HOST is also using pages before telling the host.

Thus, even though the feature used to be just for communication from the host, known drivers are really using it to communicate in the other direction, as if the feature was named VIRTIO_BALLOON_F_GUEST_TELLS_HOST.
Adjust the spec to conform, and add a new feature bit for the host to tell the drivers if silent deflation is actually supported. With this new feature bit, the host can distinguish all three cases: will never do silent deflation, will do silent deflation if available, will always do silent deflation (as in the above buggy driver). Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- virtio-spec.lyx | 264 ++-- 1 file changed, 258 insertions(+), 6 deletions(-) diff --git a/virtio-spec.lyx b/virtio-spec.lyx index 73e22e7..033362f 100644 --- a/virtio-spec.lyx +++ b/virtio-spec.lyx @@ -63,7 +63,7 @@ \author -385801441 Cornelia Huck cornelia.h...@de.ibm.com \author 460276516 Dmitry Fleytman dfley...@redhat.com \author 1112500848 Rusty Russell ru...@rustcorp.com.au -\author 1531152142 Paolo Bonzini,,, +\author 1531152142 Paolo Bonzini pbonz...@redhat.com \author 1717892615 Alexey Zaytsev,,, \author 1986246365 Michael S. Tsirkin \end_header @@ -7179,11 +7179,49 @@ bits \begin_deeper \begin_layout Description -VIRTIO_BALLOON_F_MUST_TELL_HOST +VIRTIO_BALLOON_F_ +\change_deleted 1531152142 1347020601 +MUST +\change_inserted 1531152142 1347020602 +GUEST +\change_unchanged +_TELL +\change_inserted 1531152142 1368004486 +S +\change_unchanged +_HOST \begin_inset space ~ \end_inset -(0) Host must be told before pages from the balloon are used. +(0) +\change_deleted 1531152142 1347020625 +Host must be told +\change_inserted 1531152142 1347020617 +Guest will tell host +\change_unchanged + before pages from the balloon are used. + +\change_inserted 1531152142 1368005603 + The host should always propose this feature. +\begin_inset Foot +status open + +\begin_layout Plain Layout + +\change_inserted 1531152142 1347022389 +This feature used to be named VIRTIO_BALLOON_F_\SpecialChar \- +MUST_TELL_HOST. + However, after a few years it was observed that drivers were not using + it as specified. + The virtio-balloon spec was then adjusted to what the drivers had been + doing. 
+\end_layout + +\end_inset + + +\change_unchanged + \end_layout \begin_layout Description @@ -7192,6 +7230,20 @@ VIRTIO_BALLOON_F_STATS_VQ \end_inset (1) A virtqueue for reporting guest memory statistics is present. +\change_inserted 1531152142 1347020627 + +\end_layout + +\begin_layout Description + +\change_inserted 1531152142 1347020648 +VIRTIO_BALLOON_F_SILENT_DEFLATE +\begin_inset space ~ +\end_inset + +(2) Guest does not need to tell host before pages from the balloon are used. +\change_unchanged + \end_layout \end_deeper @@ -7342,9 +7394,27 @@ The driver constructs an array of addresses of memory pages it
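The two migration truth tables in the commit message of this patch can be written down as predicates. The following is a minimal sketch, not code from any virtio implementation: the function names are hypothetical, and only the bit numbers (0 and 2) are taken from the spec change itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as given in the spec change above. */
#define BALLOON_F_GUEST_TELLS_HOST	(1u << 0)
#define BALLOON_F_SILENT_DEFLATE	(1u << 2)

/* virtio negotiation: the result is the AND of what both sides support. */
static uint32_t negotiate(uint32_t device_features, uint32_t driver_features)
{
	return device_features & driver_features;
}

/*
 * Positive feature, first table: migration works when the destination
 * supports the feature or the source never set it.
 */
static bool migration_ok_positive(bool dest_supports, bool source_set)
{
	return dest_supports || !source_set;
}

/*
 * Negative feature, second table: the only failing row is a destination
 * that supports (requires) the feature while the source did not set it.
 */
static bool migration_ok_negative(bool dest_supports, bool source_set)
{
	return !dest_supports || source_set;
}
```

The F/T row of the second table is the problematic one: migration_ok_negative(false, true) is true, yet negotiate() can never produce a feature set containing a bit the destination does not offer, which is the conflict with the AND rule that the commit message points out.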
Re: [PATCH v2] KVM: Fix kvm_irqfd_init initialization
On Wed, May 08, 2013 at 10:57:29AM +0800, Asias He wrote:

In commit a0f155e96 'KVM: Initialize irqfd from kvm_init()', when kvm_init() is called the second time (e.g. kvm-amd.ko and kvm-intel.ko), kvm_arch_init() will fail with -EEXIST, then kvm_irqfd_exit() will be called on the error handling path. This way, the kvm_irqfd system will not be ready.

This patch fixes the following:

Applied, thanks.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [81c0721e] _raw_spin_lock+0xe/0x30
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: vhost_net
CPU 6
Pid: 4257, comm: qemu-system-x86 Not tainted 3.9.0-rc3+ #757 Dell Inc. OptiPlex 790/0V5HMK
RIP: 0010:[81c0721e] [81c0721e] _raw_spin_lock+0xe/0x30
RSP: 0018:880221721cc8 EFLAGS: 00010046
RAX: 0100 RBX: 88022dcc003f RCX: 880221734950
RDX: 8802208f6ca8 RSI: 7fff RDI:
RBP: 880221721cc8 R08: 0002 R09: 0002
R10: 7f7fd01087e0 R11: 0246 R12: 8802208f6ca8
R13: 0080 R14: 880223e2a900 R15:
FS: 7f7fd38488e0() GS:88022dcc() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 00022309f000 CR4: 000427e0
DR0: DR1: DR2:
DR3: DR6: 0ff0 DR7: 0400
Process qemu-system-x86 (pid: 4257, threadinfo 88022172, task 880222bd5640)
Stack:
 880221721d08 810ac5c5 88022431dc00 0086 0080 880223e2a900 8802208f6ca8 880221721d48 810ac8fe 880221734000
Call Trace:
 [810ac5c5] __queue_work+0x45/0x2d0
 [810ac8fe] queue_work_on+0x8e/0xa0
 [810ac949] queue_work+0x19/0x20
 [81009b6b] irqfd_deactivate+0x4b/0x60
 [8100a69d] kvm_irqfd+0x39d/0x580
 [81007a27] kvm_vm_ioctl+0x207/0x5b0
 [810c9545] ? update_curr+0xf5/0x180
 [811b66e8] do_vfs_ioctl+0x98/0x550
 [810c1f5e] ? finish_task_switch+0x4e/0xe0
 [81c054aa] ? __schedule+0x2ea/0x710
 [811b6bf7] sys_ioctl+0x57/0x90
 [8140ae9e] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [81c0f602] system_call_fastpath+0x16/0x1b
Code: c1 ea 08 38 c2 74 0f 66 0f 1f 44 00 00 f3 90 0f b6 03 38 c2 75 f7 48 83 c4 08 5b c9 c3 55 48 89 e5 66 66 66 66 90 b8 00 01 00 00 f0 66 0f c1 07 89 c2 66 c1 ea 08 38 c2 74 0c 0f 1f 00 f3 90 0f
RIP [81c0721e] _raw_spin_lock+0xe/0x30
 RSP 880221721cc8
CR2:
---[ end trace 13fb1e4b6e5ab21f ]---

Signed-off-by: Asias He as...@redhat.com
---
 virt/kvm/kvm_main.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8fd325a..85b93d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3078,13 +3078,21 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	int r;
 	int cpu;

-	r = kvm_irqfd_init();
-	if (r)
-		goto out_irqfd;
 	r = kvm_arch_init(opaque);
 	if (r)
 		goto out_fail;

+	/*
+	 * kvm_arch_init makes sure there's at most one caller
+	 * for architectures that support multiple implementations,
+	 * like intel and amd on x86.
+	 * kvm_arch_init must be called before kvm_irqfd_init to avoid creating
+	 * conflicts in case kvm is already setup for another implementation.
+	 */
+	r = kvm_irqfd_init();
+	if (r)
+		goto out_irqfd;
+
 	if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL)) {
 		r = -ENOMEM;
 		goto out_free_0;
@@ -3159,10 +3167,10 @@ out_free_1:
 out_free_0a:
 	free_cpumask_var(cpus_hardware_enabled);
 out_free_0:
-	kvm_arch_exit();
-out_fail:
 	kvm_irqfd_exit();
 out_irqfd:
+	kvm_arch_exit();
+out_fail:
 	return r;
 }
 EXPORT_SYMBOL_GPL(kvm_init);
--
1.8.1.4

--
Gleb.
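To see why the order of the two init calls matters, here is a minimal userspace sketch of the goto-unwind pattern in this patch. Everything prefixed with mock_ is hypothetical: the two booleans merely stand in for the state that kvm_arch_init() and kvm_irqfd_init() manage in the kernel, and the real functions do far more.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static bool arch_registered;	/* stands in for "an implementation is already set up" */
static bool irqfd_ready;	/* stands in for the shared irqfd state */

static int mock_arch_init(void)
{
	if (arch_registered)
		return -EEXIST;	/* a second module tries to register */
	arch_registered = true;
	return 0;
}

static void mock_arch_exit(void) { arch_registered = false; }
static int mock_irqfd_init(void) { irqfd_ready = true; return 0; }
static void mock_irqfd_exit(void) { irqfd_ready = false; }

/* Ordering before the patch: irqfd first, arch second. */
static int mock_init_old(void)
{
	int r;

	r = mock_irqfd_init();
	if (r)
		goto out_irqfd;
	r = mock_arch_init();
	if (r)
		goto out_fail;
	return 0;

out_fail:
	mock_irqfd_exit();	/* tears down state the first caller still uses */
out_irqfd:
	return r;
}

/* Ordering after the patch: arch first, irqfd second. */
static int mock_init_new(void)
{
	int r;

	r = mock_arch_init();
	if (r)
		goto out_fail;	/* a failed second caller never touches irqfd */
	r = mock_irqfd_init();
	if (r)
		goto out_irqfd;
	return 0;

out_irqfd:
	mock_arch_exit();
out_fail:
	return r;
}
```

Calling mock_init_old() twice leaves irqfd_ready false even though the first caller succeeded, which mirrors the situation behind the oops; calling mock_init_new() twice leaves it true.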
Re: KVM: x86: fix maintenance of guest/host xcr0 state
On Mon, Apr 15, 2013 at 11:30:13PM -0300, Marcelo Tosatti wrote:

** Untested **.

Emulation of xcr0 writes zeroes the guest_xcr0_loaded variable so that the subsequent VM-entry reloads the CPU's xcr0 with the guest's xcr0 value.

However, this is incorrect because the guest_xcr0_loaded variable is read to decide whether to reload the host's xcr0.

In case the vcpu thread is scheduled out after the guest_xcr0_loaded = 0 assignment, and the scheduler decides to preload the FPU:

switch_to {
  __switch_to
    __math_state_restore
      restore_fpu_checking
        fpu_restore_checking
          if (use_xsave())
            fpu_xrstor_checking
              xrstor64 with CPU's xcr0 == guest's xcr0

Fix by properly restoring the host's xcr0 during emulation of xcr0 writes.

Analyzed-by: Ulrich Obergfell uober...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Applied, thanks.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 999d124..222926a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -555,6 +555,25 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
 }
 EXPORT_SYMBOL_GPL(kvm_lmsw);

+static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
+{
+	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
+			!vcpu->guest_xcr0_loaded) {
+		/* kvm_set_xcr() also depends on this */
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
+		vcpu->guest_xcr0_loaded = 1;
+	}
+}
+
+static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
+{
+	if (vcpu->guest_xcr0_loaded) {
+		if (vcpu->arch.xcr0 != host_xcr0)
+			xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+		vcpu->guest_xcr0_loaded = 0;
+	}
+}
+
 int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
 	u64 xcr0;
@@ -571,8 +590,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 		return 1;
 	if (xcr0 & ~host_xcr0)
 		return 1;
+	kvm_put_guest_xcr0(vcpu);
 	vcpu->arch.xcr0 = xcr0;
-	vcpu->guest_xcr0_loaded = 0;
 	return 0;
 }

@@ -5600,25 +5619,6 @@ static void inject_pending_event(struct kvm_vcpu *vcpu)
 	}
 }

-static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
-{
-	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
-			!vcpu->guest_xcr0_loaded) {
-		/* kvm_set_xcr() also depends on this */
-		xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
-		vcpu->guest_xcr0_loaded = 1;
-	}
-}
-
-static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
-{
-	if (vcpu->guest_xcr0_loaded) {
-		if (vcpu->arch.xcr0 != host_xcr0)
-			xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
-		vcpu->guest_xcr0_loaded = 0;
-	}
-}
-
 static void process_nmi(struct kvm_vcpu *vcpu)
 {
 	unsigned limit = 2;

--
Gleb.
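The role of the guest_xcr0_loaded flag in this patch can be illustrated with a small userspace model. This is a sketch under stated assumptions: cpu_xcr0 is a plain variable standing in for the hardware register, HOST_XCR0 is an arbitrary example value, the CR4.OSXSAVE check is omitted, and all mock_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HOST_XCR0 0x7ULL		/* assumed host value, for illustration */

static uint64_t cpu_xcr0 = HOST_XCR0;	/* stands in for the real XCR0 register */

struct mock_vcpu {
	uint64_t xcr0;			/* the guest's view of xcr0 */
	bool guest_xcr0_loaded;		/* true while cpu_xcr0 holds the guest value */
};

/* Load the guest value into the "register" if it is not loaded yet. */
static void mock_load_guest_xcr0(struct mock_vcpu *v)
{
	if (!v->guest_xcr0_loaded) {
		cpu_xcr0 = v->xcr0;
		v->guest_xcr0_loaded = true;
	}
}

/* Restore the host value, but only if the flag says the guest value is loaded. */
static void mock_put_guest_xcr0(struct mock_vcpu *v)
{
	if (v->guest_xcr0_loaded) {
		if (v->xcr0 != HOST_XCR0)
			cpu_xcr0 = HOST_XCR0;
		v->guest_xcr0_loaded = false;
	}
}

/* Before the patch: clear the flag while the guest value is still loaded. */
static void mock_set_xcr_old(struct mock_vcpu *v, uint64_t val)
{
	v->xcr0 = val;
	v->guest_xcr0_loaded = false;
}

/* After the patch: restore the host value first, then record the new value. */
static void mock_set_xcr_new(struct mock_vcpu *v, uint64_t val)
{
	mock_put_guest_xcr0(v);
	v->xcr0 = val;
}
```

With the old variant, a later mock_put_guest_xcr0() sees the cleared flag and does nothing, so the "register" keeps the guest value while host code runs. With the new variant the host value is back in place as soon as the emulated write completes, and the next mock_load_guest_xcr0() installs the updated guest value.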
Re: [PATCH v4 4/6] KVM: MMU: fast invalid all shadow pages
On Tue, May 07, 2013 at 12:09:29PM -0300, Marcelo Tosatti wrote:
On Tue, May 07, 2013 at 05:56:08PM +0300, Gleb Natapov wrote:

Yes, I am missing what Marcelo means there too. We cannot free the memslot until we unmap its rmap one way or the other.

I do not understand what are you optimizing for, given the four possible cases we discussed at https://lkml.org/lkml/2013/4/18/280

We are optimizing mmu_lock holding time for all of those cases. But you cannot just do "zap roots + sp gen number increase" on slot deletion, because you need to transfer access/dirty information from the rmap that is going to be deleted to the actual page before kvm_set_memory_region() returns to the caller.

That is, why a simple for_each_all_shadow_page(zap_page) is not sufficient.

With a lock break? It is. We tried to optimize that by zapping only pages that reference the memslot that is going to be deleted and zapping all others later when recycling old sps, but if you think this is premature optimization I am fine with it.

If it can be shown that it's not premature optimization, I am fine with it. AFAICS all cases are 1) rare and 2) not latency sensitive (as in there is no requirement for those cases to finish in a short period of time).

OK, let's start from a simple version. The one that goes through rmap turned out to be more complicated than we expected.

--
Gleb.
kernel 3.9.x kvm hangs after seabios
I have the same issue, with 3.9.1 (3.9.0 too) it hangs right after seabios... (no problem in 3.8.11) qemu-1.4.1 seabios-1.7.2.1 after setting emulate_invalid_guest_state=0 everything works just fine. virsh # qemu-monitor-command vm-jack --hmp x/8i \$pc 0x000fc46b: lgdtw %cs:-0x2c60 0x000fc471: mov%cr0,%eax 0x000fc474: or $0x1,%eax 0x000fc478: mov%eax,%cr0 0x000fc47b: ljmpl $0x8,$0xfc483 0x000fc483: mov$0x10,%ax 0x000fc486: add%al,(%bx,%si) 0x000fc488: mov%ax,%ds virsh # qemu-monitor-command vm-jack --hmp x/64b \$pc 0x000fc46b: lgdtw %cs:-0x2c60 0x000fc471: mov%cr0,%eax 0x000fc474: or $0x1,%eax 0x000fc478: mov%eax,%cr0 0x000fc47b: ljmpl $0x8,$0xfc483 0x000fc483: mov$0x10,%ax 0x000fc486: add%al,(%bx,%si) 0x000fc488: mov%ax,%ds 0x000fc48a: mov%ax,%es 0x000fc48c: mov%ax,%ss 0x000fc48e: mov%ax,%fs 0x000fc490: mov%ax,%gs 0x000fc492: mov%cx,%ax 0x000fc494: jmp*%dx 0x000fc496: mov%ax,%cx 0x000fc498: mov$0x20,%ax 0x000fc49b: add%al,(%bx,%si) 0x000fc49d: mov%ax,%ds 0x000fc49f: mov%ax,%es 0x000fc4a1: mov%ax,%ss 0x000fc4a3: mov%ax,%fs 0x000fc4a5: mov%ax,%gs 0x000fc4a7: ljmpl $0xc189,$0x18c4c4 0x000fc4af: mov$0x30,%ax 0x000fc4b2: add%al,(%bx,%si) 0x000fc4b4: mov%ax,%ds 0x000fc4b6: mov%ax,%es 0x000fc4b8: mov%ax,%ss 0x000fc4ba: mov%ax,%fs 0x000fc4bc: mov%ax,%gs 0x000fc4be: ljmpl $0x200f,$0x28c4c4 0x000fc4c6: shlb $0xe0,-0x7d(%bp) 0x000fc4ca: decb (%bx) 0x000fc4cc: and%al,%al 0x000fc4ce: ljmp $0xf000,$0xc4d3 0x000fc4d3: lidtw %cs:-0x2c18 0x000fc4d9: xor%ax,%ax 0x000fc4db: mov%ax,%fs 0x000fc4dd: mov%ax,%gs 0x000fc4df: mov%ax,%es 0x000fc4e1: mov%ax,%ds 0x000fc4e3: mov%ax,%ss 0x000fc4e5: mov%ecx,%eax 0x000fc4e8: jmpl *%edx 0x000fc4eb: push %ebp 0x000fc4ed: push %eax 0x000fc4ef: pushl %es 0x000fc4f1: push %cs 0x000fc4f2: push $0xc536 0x000fc4f5: addr32 pushw %es:0x24(%eax) 0x000fc4fa: addr32 pushl %es:0x20(%eax) 0x000fc500: addr32 mov %es:0x4(%eax),%edi 0x000fc506: addr32 mov %es:0x8(%eax),%esi 0x000fc50c: addr32 mov %es:0xc(%eax),%ebp 0x000fc512: addr32 mov %es:0x10(%eax),%ebx 
0x000fc518: addr32 mov %es:0x14(%eax),%edx 0x000fc51e: addr32 mov %es:0x18(%eax),%ecx 0x000fc524: addr32 mov %es:(%eax),%ds 0x000fc528: addr32 pushl %es:0x1c(%eax) 0x000fc52e: addr32 mov %es:0x2(%eax),%es 0x000fc533: pop%eax 0x000fc535: iret 0x000fc536: pushf 0x000fc537: cli -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 3.9.x kvm hangs after seabios
On Wed, May 08, 2013 at 11:22:01AM +, Tomas Papan wrote: I have the same issue, with 3.9.1 (3.9.0 too) it hangs right after seabios... (no problem in 3.8.11) qemu-1.4.1 seabios-1.7.2.1 Is there anything interesting in libvirt logfile? Also please send the output of qemu-monitor-command vm-jack --hmp info registers And, just in case, can you send me your bios.bin image. Mine work. after setting emulate_invalid_guest_state=0 everything works just fine. virsh # qemu-monitor-command vm-jack --hmp x/8i \$pc 0x000fc46b: lgdtw %cs:-0x2c60 0x000fc471: mov%cr0,%eax 0x000fc474: or $0x1,%eax 0x000fc478: mov%eax,%cr0 0x000fc47b: ljmpl $0x8,$0xfc483 0x000fc483: mov$0x10,%ax 0x000fc486: add%al,(%bx,%si) 0x000fc488: mov%ax,%ds virsh # qemu-monitor-command vm-jack --hmp x/64b \$pc 0x000fc46b: lgdtw %cs:-0x2c60 0x000fc471: mov%cr0,%eax 0x000fc474: or $0x1,%eax 0x000fc478: mov%eax,%cr0 0x000fc47b: ljmpl $0x8,$0xfc483 0x000fc483: mov$0x10,%ax 0x000fc486: add%al,(%bx,%si) 0x000fc488: mov%ax,%ds 0x000fc48a: mov%ax,%es 0x000fc48c: mov%ax,%ss 0x000fc48e: mov%ax,%fs 0x000fc490: mov%ax,%gs 0x000fc492: mov%cx,%ax 0x000fc494: jmp*%dx 0x000fc496: mov%ax,%cx 0x000fc498: mov$0x20,%ax 0x000fc49b: add%al,(%bx,%si) 0x000fc49d: mov%ax,%ds 0x000fc49f: mov%ax,%es 0x000fc4a1: mov%ax,%ss 0x000fc4a3: mov%ax,%fs 0x000fc4a5: mov%ax,%gs 0x000fc4a7: ljmpl $0xc189,$0x18c4c4 0x000fc4af: mov$0x30,%ax 0x000fc4b2: add%al,(%bx,%si) 0x000fc4b4: mov%ax,%ds 0x000fc4b6: mov%ax,%es 0x000fc4b8: mov%ax,%ss 0x000fc4ba: mov%ax,%fs 0x000fc4bc: mov%ax,%gs 0x000fc4be: ljmpl $0x200f,$0x28c4c4 0x000fc4c6: shlb $0xe0,-0x7d(%bp) 0x000fc4ca: decb (%bx) 0x000fc4cc: and%al,%al 0x000fc4ce: ljmp $0xf000,$0xc4d3 0x000fc4d3: lidtw %cs:-0x2c18 0x000fc4d9: xor%ax,%ax 0x000fc4db: mov%ax,%fs 0x000fc4dd: mov%ax,%gs 0x000fc4df: mov%ax,%es 0x000fc4e1: mov%ax,%ds 0x000fc4e3: mov%ax,%ss 0x000fc4e5: mov%ecx,%eax 0x000fc4e8: jmpl *%edx 0x000fc4eb: push %ebp 0x000fc4ed: push %eax 0x000fc4ef: pushl %es 0x000fc4f1: push %cs 0x000fc4f2: push $0xc536 
0x000fc4f5: addr32 pushw %es:0x24(%eax) 0x000fc4fa: addr32 pushl %es:0x20(%eax) 0x000fc500: addr32 mov %es:0x4(%eax),%edi 0x000fc506: addr32 mov %es:0x8(%eax),%esi 0x000fc50c: addr32 mov %es:0xc(%eax),%ebp 0x000fc512: addr32 mov %es:0x10(%eax),%ebx 0x000fc518: addr32 mov %es:0x14(%eax),%edx 0x000fc51e: addr32 mov %es:0x18(%eax),%ecx 0x000fc524: addr32 mov %es:(%eax),%ds 0x000fc528: addr32 pushl %es:0x1c(%eax) 0x000fc52e: addr32 mov %es:0x2(%eax),%es 0x000fc533: pop%eax 0x000fc535: iret 0x000fc536: pushf 0x000fc537: cli -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 3.9.x kvm hangs after seabios
Hi, I found this in the libvirt (but those messages are same in 3.8.x) anakin libvirt # cat libvirtd.log 2013-05-08 11:59:29.645+: 3750: info : libvirt version: 1.0.5 2013-05-08 11:59:29.645+: 3750: error : udevGetDMIData:1548 : Failed to get udev device for syspath '/sys/devices/virtual/dmi/id' or '/sys/class/dmi/id' 2013-05-08 11:59:29.680+: 3750: warning : ebiptablesDriverInitCLITools:4225 : Could not find 'ebtables' executable virsh # qemu-monitor-command vm-jack --hmp info registers EAX=0002 EBX=64a1 ECX=6e08 EDX=000fc5ab ESI=c5b8 EDI=6eec EBP=dffd83e0 ESP=6df8 EIP=c46b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=1 ES = 9300 CS =f000 000f 9b00 SS = 9300 DS = 9300 FS = 9300 GS = 9300 LDT= 8200 TR = 8b00 GDT= 000fd3a8 0037 IDT= 000fd3e6 CR0=0010 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80 FPR0= FPR1= FPR2= FPR3= FPR4= FPR5= FPR6= FPR7= XMM00= XMM01= XMM02= XMM03= XMM04= XMM05= XMM06= XMM07= bios.bin can be found here http://papan.sk/share/bios.bin I should mentioned that I'm using gentoo and libvirt 1.0.5. I'm sorry if gmail interface breaks output. Regrads Tomas On Wed, May 8, 2013 at 1:55 PM, Gleb Natapov g...@redhat.com wrote: On Wed, May 08, 2013 at 11:22:01AM +, Tomas Papan wrote: I have the same issue, with 3.9.1 (3.9.0 too) it hangs right after seabios... (no problem in 3.8.11) qemu-1.4.1 seabios-1.7.2.1 Is there anything interesting in libvirt logfile? Also please send the output of qemu-monitor-command vm-jack --hmp info registers And, just in case, can you send me your bios.bin image. Mine work. after setting emulate_invalid_guest_state=0 everything works just fine. 
virsh # qemu-monitor-command vm-jack --hmp x/8i \$pc 0x000fc46b: lgdtw %cs:-0x2c60 0x000fc471: mov%cr0,%eax 0x000fc474: or $0x1,%eax 0x000fc478: mov%eax,%cr0 0x000fc47b: ljmpl $0x8,$0xfc483 0x000fc483: mov$0x10,%ax 0x000fc486: add%al,(%bx,%si) 0x000fc488: mov%ax,%ds virsh # qemu-monitor-command vm-jack --hmp x/64b \$pc 0x000fc46b: lgdtw %cs:-0x2c60 0x000fc471: mov%cr0,%eax 0x000fc474: or $0x1,%eax 0x000fc478: mov%eax,%cr0 0x000fc47b: ljmpl $0x8,$0xfc483 0x000fc483: mov$0x10,%ax 0x000fc486: add%al,(%bx,%si) 0x000fc488: mov%ax,%ds 0x000fc48a: mov%ax,%es 0x000fc48c: mov%ax,%ss 0x000fc48e: mov%ax,%fs 0x000fc490: mov%ax,%gs 0x000fc492: mov%cx,%ax 0x000fc494: jmp*%dx 0x000fc496: mov%ax,%cx 0x000fc498: mov$0x20,%ax 0x000fc49b: add%al,(%bx,%si) 0x000fc49d: mov%ax,%ds 0x000fc49f: mov%ax,%es 0x000fc4a1: mov%ax,%ss 0x000fc4a3: mov%ax,%fs 0x000fc4a5: mov%ax,%gs 0x000fc4a7: ljmpl $0xc189,$0x18c4c4 0x000fc4af: mov$0x30,%ax 0x000fc4b2: add%al,(%bx,%si) 0x000fc4b4: mov%ax,%ds 0x000fc4b6: mov%ax,%es 0x000fc4b8: mov%ax,%ss 0x000fc4ba: mov%ax,%fs 0x000fc4bc: mov%ax,%gs 0x000fc4be: ljmpl $0x200f,$0x28c4c4 0x000fc4c6: shlb $0xe0,-0x7d(%bp) 0x000fc4ca: decb (%bx) 0x000fc4cc: and%al,%al 0x000fc4ce: ljmp $0xf000,$0xc4d3 0x000fc4d3: lidtw %cs:-0x2c18 0x000fc4d9: xor%ax,%ax 0x000fc4db: mov%ax,%fs 0x000fc4dd: mov%ax,%gs 0x000fc4df: mov%ax,%es 0x000fc4e1: mov%ax,%ds 0x000fc4e3: mov%ax,%ss 0x000fc4e5: mov%ecx,%eax 0x000fc4e8: jmpl *%edx 0x000fc4eb: push %ebp 0x000fc4ed: push %eax 0x000fc4ef: pushl %es 0x000fc4f1: push %cs 0x000fc4f2: push $0xc536 0x000fc4f5: addr32 pushw %es:0x24(%eax) 0x000fc4fa: addr32 pushl %es:0x20(%eax) 0x000fc500: addr32 mov %es:0x4(%eax),%edi 0x000fc506:
Re: kernel 3.9.x kvm hangs after seabios
On Wed, May 08, 2013 at 02:08:55PM +0200, Tomas Papan wrote:

Hi, I found this in the libvirt log (but those messages are the same in 3.8.x)

anakin libvirt # cat libvirtd.log
2013-05-08 11:59:29.645+: 3750: info : libvirt version: 1.0.5
2013-05-08 11:59:29.645+: 3750: error : udevGetDMIData:1548 : Failed to get udev device for syspath '/sys/devices/virtual/dmi/id' or '/sys/class/dmi/id'
2013-05-08 11:59:29.680+: 3750: warning : ebiptablesDriverInitCLITools:4225 : Could not find 'ebtables' executable

Nothing about a KVM internal error? A couple more things, please:

1. Output of qemu-monitor-command vm-jack --hmp info status.
2. Command line.
3. Trace: http://www.linux-kvm.org/page/Tracing

--
Gleb.
Re: Kernel 3.9 - can't boot qemu with accel=kvm _and_ networking enabled
Here they are: (qemu) x/8i $pc 0x000fc49b: lgdtw %cs:-0x2c60 0x000fc4a1: mov%cr0,%eax 0x000fc4a4: or $0x1,%eax 0x000fc4a8: mov%eax,%cr0 0x000fc4ab: ljmpl $0x8,$0xfc4b3 0x000fc4b3: mov$0x10,%ax 0x000fc4b6: add%al,(%bx,%si) 0x000fc4b8: mov%ax,%ds (qemu) x/64b $pc 0x000fc49b: lgdtw %cs:-0x2c60 0x000fc4a1: mov%cr0,%eax 0x000fc4a4: or $0x1,%eax 0x000fc4a8: mov%eax,%cr0 0x000fc4ab: ljmpl $0x8,$0xfc4b3 0x000fc4b3: mov$0x10,%ax 0x000fc4b6: add%al,(%bx,%si) 0x000fc4b8: mov%ax,%ds 0x000fc4ba: mov%ax,%es 0x000fc4bc: mov%ax,%ss 0x000fc4be: mov%ax,%fs 0x000fc4c0: mov%ax,%gs 0x000fc4c2: mov%cx,%ax 0x000fc4c4: jmp*%dx 0x000fc4c6: mov%ax,%cx 0x000fc4c8: mov$0x20,%ax 0x000fc4cb: add%al,(%bx,%si) 0x000fc4cd: mov%ax,%ds 0x000fc4cf: mov%ax,%es 0x000fc4d1: mov%ax,%ss 0x000fc4d3: mov%ax,%fs 0x000fc4d5: mov%ax,%gs 0x000fc4d7: ljmpl $0xc189,$0x18c4f4 0x000fc4df: mov$0x30,%ax 0x000fc4e2: add%al,(%bx,%si) 0x000fc4e4: mov%ax,%ds 0x000fc4e6: mov%ax,%es 0x000fc4e8: mov%ax,%ss 0x000fc4ea: mov%ax,%fs 0x000fc4ec: mov%ax,%gs 0x000fc4ee: ljmpl $0x200f,$0x28c4f4 0x000fc4f6: shlb $0xe0,-0x7d(%bp) 0x000fc4fa: decb (%bx) 0x000fc4fc: and%al,%al 0x000fc4fe: ljmp $0xf000,$0xc503 0x000fc503: lidtw %cs:-0x2c18 0x000fc509: xor%ax,%ax 0x000fc50b: mov%ax,%fs 0x000fc50d: mov%ax,%gs 0x000fc50f: mov%ax,%es 0x000fc511: mov%ax,%ds 0x000fc513: mov%ax,%ss 0x000fc515: mov%ecx,%eax 0x000fc518: jmpl *%edx 0x000fc51b: push %ebp 0x000fc51d: push %eax 0x000fc51f: pushl %es 0x000fc521: push %cs 0x000fc522: push $0xc566 0x000fc525: addr32 pushw %es:0x24(%eax) 0x000fc52a: addr32 pushl %es:0x20(%eax) 0x000fc530: addr32 mov %es:0x4(%eax),%edi 0x000fc536: addr32 mov %es:0x8(%eax),%esi 0x000fc53c: addr32 mov %es:0xc(%eax),%ebp 0x000fc542: addr32 mov %es:0x10(%eax),%ebx 0x000fc548: addr32 mov %es:0x14(%eax),%edx 0x000fc54e: addr32 mov %es:0x18(%eax),%ecx 0x000fc554: addr32 mov %es:(%eax),%ds 0x000fc558: addr32 pushl %es:0x1c(%eax) 0x000fc55e: addr32 mov %es:0x2(%eax),%es 0x000fc563: pop%eax 0x000fc565: iret 0x000fc566: pushf 
0x000fc567: cli On 08/05/13 11:57, Paolo Bonzini wrote: Paolo, The full command line is: qemu-system-x86_64 -machine accel=kvm -m 1024m \ -net tap -net nic \ -drive file=/dev/zpool/testsrv,index=0,cache=writethrough \ -k en-us \ -no-kvm-irqchip \ -vga cirrus I've tried any combinations of -net options, but the result is always the same. I think this somehow related to http://article.gmane.org/gmane.comp.emulators.kvm.devel/109461, as setting emulate_invalid_guest_state=0 solves the problem However, I'm not aware of any consequences of this change. Actually, the other bug involves sgabios and you are not using it. Please try executing the following commands from the monitor (you can use -monitor stdio to make cut-and-paste simpler): x/8i \$pc x/64b \$pc and include the output in the reply to this message. Thanks, Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 3.9.x kvm hangs after seabios
Hi, No nothing, I check all logs (even syslog) 1) virsh # qemu-monitor-command vm-jack --hmp info status VM status: running 2) morpheus@anakin ~ $ ps aux | grep vm-jack qemu 3822 0.5 0.1 8952256 23600 ? Sl 13:59 0:08 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name vm-jack -S -machine pc-0.14,accel=kvm,usb=off -cpu Nehalem,+rdtscp,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 8192 -smp 4,sockets=4,cores=1,threads=1 -uuid 03196c23-24ba-d398-a000-582b0e88b0e7 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm-jack.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=c,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/jack.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/var/lib/libvirt/images/kernel.img,if=none,id=drive-virtio-disk1,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=19,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:21:1c:e0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 3) it took some time, I didn't have debug_fs, then tracing... but the file is stored here (15 MB) http://papan.sk/share/trace.dat.tar.gz Regards Tomas -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 3.9.x kvm hangs after seabios
On Wed, May 08, 2013 at 02:08:55PM +0200, Tomas Papan wrote:

Hi, I found this in the libvirt (but those messages are same in 3.8.x)

anakin libvirt # cat libvirtd.log
2013-05-08 11:59:29.645+: 3750: info : libvirt version: 1.0.5
2013-05-08 11:59:29.645+: 3750: error : udevGetDMIData:1548 : Failed to get udev device for syspath '/sys/devices/virtual/dmi/id' or '/sys/class/dmi/id'
2013-05-08 11:59:29.680+: 3750: warning : ebiptablesDriverInitCLITools:4225 : Could not find 'ebtables' executable

You need to look at /var/log/libvirt/qemu/$GUESTNAME.log for QEMU related messages. The libvirtd.log file only has the libvirt related messages.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: kernel 3.9.x kvm hangs after seabios
Sorry, I didn't write that well, I checked that log too... nothing is there... anakin qemu # cat vm-jack.log 2013-05-08 13:02:52.358+: starting up LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin HOME=/root USER=root QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -name vm-jack -S -machine pc-0.14,accel=kvm,usb=off -cpu Nehalem,+rdtscp,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 8192 -smp 4,sockets=4,cores=1,threads=1 -uuid 03196c23-24ba-d398-a000-582b0e88b0e7 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm-jack.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=c,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/jack.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/var/lib/libvirt/images/kernel.img,if=none,id=drive-virtio-disk1,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=19,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:21:1c:e0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 char device redirected to /dev/pts/3 (label charserial0) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 3.9.x kvm hangs after seabios
On Wed, May 08, 2013 at 02:51:48PM +0200, Tomas Papan wrote:
> Hi,
>
> No nothing, I check all logs (even syslog)

Yeah, since the status of the VM is "running", you are not supposed to see anything there.

> 1) virsh # qemu-monitor-command vm-jack --hmp info status
> VM status: running
>
> 2) morpheus@anakin ~ $ ps aux | grep vm-jack
> qemu 3822 0.5 0.1 8952256 23600 ? Sl 13:59 0:08 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name vm-jack -S -machine pc-0.14,accel=kvm,usb=off -cpu Nehalem,+rdtscp,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 8192 -smp 4,sockets=4,cores=1,threads=1 -uuid 03196c23-24ba-d398-a000-582b0e88b0e7 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm-jack.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=c,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/jack.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/var/lib/libvirt/images/kernel.img,if=none,id=drive-virtio-disk1,format=raw -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=19,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:21:1c:e0,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
>
> 3) it took some time, I didn't have debug_fs, then tracing... but the file is stored here (15 MB)
> http://papan.sk/share/trace.dat.tar.gz

Very interesting. In the middle of the run the vcpu decides that it does not want to run any more. How much CPU time does qemu take when it happens? If it is 100%, can you do the following:

1. run qemu-monitor-command vm-jack --hmp info cpus
2. note the thread id for cpu #0
3. run trace-cmd record -P $pid -p function, where $pid is the thread id that you found in step 2.

-- 
Gleb.
Re: kernel 3.9.x kvm hangs after seabios
OK, the CPU stays at 0% when it hangs; there is only one 100% CPU peak, which happens when the VM starts (I think this is quite normal). However, I ran the following command and stopped it right when the VM hung:

anakin trace2 # virsh start vm-jack; pid=`virsh qemu-monitor-command vm-jack --hmp info cpus | grep '\*' | awk '{print $5}' | cut -d\= -f2`; trace-cmd record -P $pid -p function

If anyone is interested, it produces a 1.6 GB file (the compressed version, 150 MB, can be found here: http://papan.sk/share/trace2.dat.tar.gz).

Tomas
Re: kernel 3.9.x kvm hangs after seabios
On Wed, May 08, 2013 at 03:50:47PM +0200, Tomas Papan wrote:
> Ok, the cpu stays at 0% when it hangs, there is only one 100% cpu peak which happens when the vm starts ( I think this is quite normal).
> However I run following command, and I stop it right when it hangs:
> anakin trace2 # virsh start vm-jack; pid=`virsh qemu-monitor-command vm-jack --hmp info cpus | grep '\*' | awk '{print $5}' | cut -d\= -f2`; trace-cmd record -P $pid -p function
>
> if anyone is interested it produces a 1.6 GB file (the compressed version can be found here: http://papan.sk/share/trace2.dat.tar.gz (150 MB))

Thanks! Can you test the patch below:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..0af1807 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5197,6 +5197,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 			return 0;
 		}
 
+		if (vcpu->arch.halt_request) {
+			vcpu->arch.halt_request = 0;
+			ret = kvm_emulate_halt(vcpu);
+			goto out;
+		}
+
 		if (signal_pending(current))
 			goto out;
 		if (need_resched())

-- 
Gleb.
Re: kernel 3.9.x kvm hangs after seabios
The patch is working :) Thank you very much, Gleb.

Regards,
Tomas
Re: kernel 3.9.x kvm hangs after seabios
On Wed, May 08, 2013 at 04:52:52PM +0200, Tomas Papan wrote:
> patch is working :) Thank you very much Gleb.

Thank you for your patience. Curious bug it was.

-- 
Gleb.
[PATCH] KVM: VMX: fix halt emulation while emulating invalid guest sate
The invalid guest state emulation loop does not check halt_request which causes 100% cpu loop while guest is in halt and in invalid state, but more serious issue is that this leaves halt_request set, so random instruction emulated by vm86 #GP exit can be interpreted as halt which causes guest hang. Fix both problems by handling halt_request in emulation loop.

Reported-by: Tomas Papan tomas.pa...@gmail.com
Tested-by: Tomas Papan tomas.pa...@gmail.com
CC: sta...@vger.kernel.org
Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5a87a58..a9fa4bc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5312,6 +5312,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 			return 0;
 		}
 
+		if (vcpu->arch.halt_request) {
+			vcpu->arch.halt_request = 0;
+			ret = kvm_emulate_halt(vcpu);
+			goto out;
+		}
+
 		if (signal_pending(current))
 			goto out;
 		if (need_resched())

-- 
Gleb.
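To make the failure mode described in the commit message concrete, here is a minimal user-space sketch in C. It is not the kernel code: struct toy_vcpu, toy_emulate_insn() and toy_loop() are hypothetical names invented for this illustration, and the real loop also deals with signals, rescheduling and emulation failures. The sketch only shows why an unchecked halt_request lets the loop keep spinning and stays set afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few vcpu fields that matter here. */
struct toy_vcpu {
	bool halt_request;	/* raised by the emulator when it emulates HLT */
	bool halted;		/* set once the halt request has been handled */
	int emulated;		/* number of instructions emulated so far */
};

/* Pretend to emulate one instruction; instruction number hlt_at is HLT. */
static void toy_emulate_insn(struct toy_vcpu *vcpu, int hlt_at)
{
	vcpu->emulated++;
	if (vcpu->emulated == hlt_at)
		vcpu->halt_request = true;
}

/*
 * Toy emulation loop. With check_halt == false it models the loop before
 * the patch; with check_halt == true it consumes halt_request right after
 * the instruction that raised it, which is what the patch adds.
 */
static void toy_loop(struct toy_vcpu *vcpu, int budget, int hlt_at,
		     bool check_halt)
{
	assert(vcpu && budget >= 0);

	while (budget-- > 0) {
		toy_emulate_insn(vcpu, hlt_at);

		if (check_halt && vcpu->halt_request) {
			vcpu->halt_request = false;
			vcpu->halted = true;
			return;
		}
	}
}
```

With check_halt false and HLT as the third instruction, the loop uses up its whole budget and returns with halt_request still set, so a later, unrelated emulation could be mistaken for a halt. With check_halt true it stops at the HLT and clears the flag.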
Re: VFIO VGA test branches
A few notes for anyone trying this...

 * I recommend the q35 machine type and using the default config file found in the docs directory. This means your command line should include:

   -M q35 -nodefconfig -readconfig /path/to/qemu.git/docs/q35-chipset.cfg

 * You're likely passing through a graphics card that is attached to the host system below a root port, so make it appear that way to the guest too. If your graphics card has a graphics function and an audio function, assign them as:

   -device vfio-pci,host=2:00.0,x-vga=on,multifunction=on,bus=ich9-pcie-port-1,addr=0.0 \
   -device vfio-pci,host=2:00.1,bus=ich9-pcie-port-1,addr=0.1

   The bus name comes from the q35-chipset.cfg above. If your graphics card doesn't include a separate audio device, drop the second line and the multifunction option of the first (addr is also optional at that point, 0.0 will be the default).

 * If you follow both of the above, your VGA device is now below a root port, but the version of seabios in qemu doesn't support initializing VGA routing to that device. To fix, use upstream seabios:

   git://git.seabios.org/seabios.git

   The default config should work. Then add the following to your qemu command line:

   -L /path/to/seabios.git/out/ -L /path/to/qemu/bios/files/

   (the latter is likely /usr/local/share/qemu/)

 * You can use -nographic to prevent QEMU from trying to start SDL or needing a vnc parameter. You can also specify a -vnc option and use the window for mouse input.

 * Use -vga none. At this point I'm not really interested in dual-headed VMs unless you're interested in working on it. Having an emulated VGA means we're not really testing VGA support through VFIO.

 * Do not use the vfio-pci romfile option unless you need it (i.e. try without it first). Option ROMs check an internal signature against the hardware. If they don't match, the ROM isn't run. If you download a ROM from the internet, you may get nowhere. If you do need a ROM, it's best to scrape it off the device you're using. You can do this through the rom file in sysfs for the device: echo 1 > rom to enable it, then read it with cat rom > /tmp/rom. To do this, it should be a secondary graphics device and be untouched by host drivers. You may have better luck booting from an install CD to get an environment where the device is untouched for this.

 * USB passthrough is handy for input and easier than figuring out which ports are connected to which USB controllers for vfio-pci assignment. Use lsusb to find the devices, note the bus and device numbers, then use:

   -device usb-host,hostbus=8,hostaddr=2

I think that's it. Feel free to reply with other best practices.

Thanks,
Alex

On Fri, 2013-05-03 at 16:56 -0600, Alex Williamson wrote:
> Hi folks,
>
> A number of people have been trying VFIO's VGA support, a few have even been successful. Resetting devices has been a problem and makes it very, very difficult to really use VGA assignment effectively. The code in the branches below attempts to address this.
>
> Discrete graphics devices are typically on their own bus, which we can reset so we theoretically get something pretty close to a power-on state for the GPU on each run (or after each guest reset). With this I'm able to get multiple runs on my HD7850 with no need to reset the host. Hopefully this will also cleanup after any host uses of the device so we can unload driver rather than blacklisting them.
>
> If you've been playing with VFIO and VGA, please give the branches below a shot and report successes and failures. Note that this new reset is only enable with the x-vga=on option, so should not do gratuitous bus resets for other devices. Thanks,
>
> Alex
>
> git://github.com/awilliam/linux-vfio.git vfio-vga-reset
> git://github.com/awilliam/qemu-vfio.git vfio-vga-reset
>
> PS - The above linux branch is v3.9 based which has a known kvm emulator bug. If you're on Intel and nothing happens, try:
>
> sudo modprobe -r kvm_intel
> sudo modprobe kvm_intel emulate_invalid_guest_state=0
>
> This is required to execute the VGA BIOS on my HD7850. If things still don't work, apply the following patch:
>
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -40,7 +40,7 @@
>  #include "sysemu/kvm.h"
>  #include "sysemu/sysemu.h"
>  
> -/* #define DEBUG_VFIO */
> +#define DEBUG_VFIO
>  #ifdef DEBUG_VFIO
>  #define DPRINTF(fmt, ...) \
>      do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
>
> And log the output (there will be lots). Also, AMD/ATI and Nvidia are the only devices expected to have a reasonable shot at working. I'm seeing reports of success on AMD/ATI HD 5xxx, 6xxx, and 7xxx, as well as Nvidia
Re: [PATCH] KVM: VMX: fix halt emulation while emulating invalid guest sate
----- Original Message -----
> From: Gleb Natapov g...@redhat.com
> To: kvm@vger.kernel.org
> Cc: pbonz...@redhat.com, sta...@vger.kernel.org
> Sent: Wednesday, 8 May 2013 17:38:44
> Subject: [PATCH] KVM: VMX: fix halt emulation while emulating invalid guest sate
>
> The invalid guest state emulation loop does not check halt_request which causes 100% cpu loop while guest is in halt and in invalid state, but more serious issue is that this leaves halt_request set, so random instruction emulated by vm86 #GP exit can be interpreted as halt which causes guest hang. Fix both problems by handling halt_request in emulation loop.
>
> Reported-by: Tomas Papan tomas.pa...@gmail.com
> Tested-by: Tomas Papan tomas.pa...@gmail.com
> CC: sta...@vger.kernel.org
> Signed-off-by: Gleb Natapov g...@redhat.com
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5a87a58..a9fa4bc 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -5312,6 +5312,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>  			return 0;
>  		}
>  
> +		if (vcpu->arch.halt_request) {
> +			vcpu->arch.halt_request = 0;
> +			ret = kvm_emulate_halt(vcpu);
> +			goto out;
> +		}
> +
>  		if (signal_pending(current))
>  			goto out;
>  		if (need_resched())
>
> --
> Gleb.

Reviewed-by: Paolo Bonzini pbonz...@redhat.com
Re: Problems while booting a linux system on fast models based CortexA15
On Wed, May 8, 2013 at 12:07 AM, Mai Daftedar mai.dafte...@gmail.com wrote:
> Dear All,
>
> I am facing a problem with booting a fully working Linux system on the Fast Models based Cortex-A15 simulation platform.
> I'm using the KVM on ARM guide to configure KVM on the ARM fast models with CortexA15, however I get the following kernel panic error when I use NFS to boot the kernel.
>
> VFS: Unable to mount root fs via NFS, trying floppy.
>
> Noting that the kernel semi-hosting arguments used are as follows:
> kernel uImage --fdt host-a15.dtb -- earlyprintk console=ttyAMA0 mem=2048M root=/dev/nfs nfsroot=192.168.x.x:/srv/nfsroot/ rw ip=dhcp
>
> Where can I be going wrong?

Is that literally the command line you use? You may want to change those x's then to the actual IP address of your host machine :)

You also need to make sure that your host machine has NFS configured properly.

-Christoffer
[PATCH v3 01/13] nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577, switches the EFER MSR when EPT is used and the host and guest have different NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2) and want to be able to run recent KVM as L1, we need to allow L1 to use this EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available, and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state, respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using vmx_set_efer (which itself sets one of several vmcs02 fields), so we always support this feature, regardless of whether the host supports it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/vmx.c | 23
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e53a5f7..51b8b4f0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2192,7 +2192,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 #else
 	nested_vmx_exit_ctls_high = 0;
 #endif
-	nested_vmx_exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
+	nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
+				      VM_EXIT_LOAD_IA32_EFER);
 
 	/* entry controls */
 	rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
@@ -2201,8 +2202,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 	nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
 	nested_vmx_entry_ctls_high &=
 		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_IA32E_MODE;
-	nested_vmx_entry_ctls_high |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
-
+	nested_vmx_entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |
+				       VM_ENTRY_LOAD_IA32_EFER);
 	/* cpu-based controls */
 	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
 		nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -7486,10 +7487,18 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask;
 	vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 
-	/* Note: IA32_MODE, LOAD_IA32_EFER are modified by vmx_set_efer below */
-	vmcs_write32(VM_EXIT_CONTROLS,
-		vmcs12->vm_exit_controls | vmcs_config.vmexit_ctrl);
-	vmcs_write32(VM_ENTRY_CONTROLS, vmcs12->vm_entry_controls |
+	/* L2->L1 exit controls are emulated - the hardware exit is to L0 so
+	 * we should use its exit controls. Note that IA32_MODE, LOAD_IA32_EFER
+	 * bits are further modified by vmx_set_efer() below.
+	 */
+	vmcs_write32(VM_EXIT_CONTROLS, vmcs_config.vmexit_ctrl);
+
+	/* vmcs12's VM_ENTRY_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE are
+	 * emulated by vmx_set_efer(), below.
+	 */
+	vmcs_write32(VM_ENTRY_CONTROLS,
+		(vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_EFER &
+			~VM_ENTRY_IA32E_MODE) |
 		(vmcs_config.vmentry_ctrl & ~VM_ENTRY_IA32E_MODE));
 
 	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)
-- 
1.8.1.2
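As a rough illustration of what "advertising" a control bit to L1 means, the following user-space sketch in C models the low/high check that a requested control word has to pass, plus the masking that the last hunk applies when building the vmcs02 entry controls. The toy_* function names are hypothetical, and the bit values are written from memory of the kernel headers, so treat them as illustrative assumptions rather than authoritative constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions (assumed, not copied from this patch). */
#define VM_ENTRY_IA32E_MODE		0x00000200u
#define VM_ENTRY_LOAD_IA32_EFER		0x00008000u

/*
 * A control word is acceptable when every bit set in "low" is set in it
 * (must-be-one bits) and no bit outside "high" is set (may-be-one bits).
 */
static bool toy_control_ok(uint32_t control, uint32_t low, uint32_t high)
{
	assert((low & ~high) == 0);	/* must-be-one bits have to be allowed */
	return ((control & high) | low) == control;
}

/*
 * Build the entry controls for vmcs02 the way the last hunk does: the two
 * bits that L0 emulates are dropped from L1's request, and IA32E_MODE is
 * dropped from the host defaults as well.
 */
static uint32_t toy_vmcs02_entry_controls(uint32_t vmcs12_ctl,
					  uint32_t host_ctl)
{
	const uint32_t emulated = VM_ENTRY_LOAD_IA32_EFER |
				  VM_ENTRY_IA32E_MODE;

	return (vmcs12_ctl & ~emulated) | (host_ctl & ~VM_ENTRY_IA32E_MODE);
}
```

Before the patch, LOAD_IA32_EFER is not part of the advertised "high" word, so an L1 that requests it fails the check; once the bit is added to "high", the same request passes, while the bit itself never reaches the hardware controls directly.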
[PATCH v3 02/13] nEPT: Move gpte_access() and prefetch_invalid_gpte() to paging_tmpl.h
For preparation, we just move gpte_access() and prefetch_invalid_gpte() from mmu.c to paging_tmpl.h.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c         | 30
 arch/x86/kvm/paging_tmpl.h | 40
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 004cc87..117233f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2488,26 +2488,6 @@ static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 	return gfn_to_pfn_memslot_atomic(slot, gfn);
 }
 
-static bool prefetch_invalid_gpte(struct kvm_vcpu *vcpu,
-				  struct kvm_mmu_page *sp, u64 *spte,
-				  u64 gpte)
-{
-	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
-		goto no_present;
-
-	if (!is_present_gpte(gpte))
-		goto no_present;
-
-	if (!(gpte & PT_ACCESSED_MASK))
-		goto no_present;
-
-	return false;
-
-no_present:
-	drop_spte(vcpu->kvm, spte);
-	return true;
-}
-
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
 				    struct kvm_mmu_page *sp,
 				    u64 *start, u64 *end)
@@ -3408,16 +3388,6 @@ static bool sync_mmio_spte(u64 *sptep, gfn_t gfn, unsigned access,
 	return false;
 }
 
-static inline unsigned gpte_access(struct kvm_vcpu *vcpu, u64 gpte)
-{
-	unsigned access;
-
-	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
-	access &= ~(gpte >> PT64_NX_SHIFT);
-
-	return access;
-}
-
 static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
 {
 	unsigned index;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index da20860..df34d4a 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -103,6 +103,36 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return (ret != orig_pte);
 }
 
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu_page *sp, u64 *spte,
+				  u64 gpte)
+{
+	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+		goto no_present;
+
+	if (!is_present_gpte(gpte))
+		goto no_present;
+
+	if (!(gpte & PT_ACCESSED_MASK))
+		goto no_present;
+
+	return false;
+
+no_present:
+	drop_spte(vcpu->kvm, spte);
+	return true;
+}
+
+static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+	unsigned access;
+
+	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
+	access &= ~(gpte >> PT64_NX_SHIFT);
+
+	return access;
+}
+
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 					     struct kvm_mmu *mmu,
 					     struct guest_walker *walker,
@@ -225,7 +255,7 @@ retry_walk:
 		}
 
 		accessed_dirty &= pte;
-		pte_access = pt_access & gpte_access(vcpu, pte);
+		pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
 
 		walker->ptes[walker->level - 1] = pte;
 	} while (!is_last_gpte(mmu, walker->level, pte));
@@ -309,13 +339,13 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	pfn_t pfn;
 
-	if (prefetch_invalid_gpte(vcpu, sp, spte, gpte))
+	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
 	gfn = gpte_to_gfn(gpte);
-	pte_access = sp->role.access & gpte_access(vcpu, gpte);
+	pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
 	protect_clean_gpte(&pte_access, gpte);
 	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
 			no_dirty_log && (pte_access & ACC_WRITE_MASK));
@@ -782,14 +812,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 					  sizeof(pt_element_t)))
 			return -EINVAL;
 
-		if (prefetch_invalid_gpte(vcpu, sp, &sp->spt[i], gpte)) {
+		if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) {
 			vcpu->kvm->tlbs_dirty++;
 			continue;
 		}
 
 		gfn = gpte_to_gfn(gpte);
 		pte_access = sp->role.access;
-		pte_access &= gpte_access(vcpu, gpte);
+		pte_access &= FNAME(gpte_access)(vcpu, gpte);
 		protect_clean_gpte(&pte_access, gpte);
 
 		if (sync_mmio_spte(&sp->spt[i], gfn,
[PATCH v3 03/13] nEPT: Add EPT tables support to paging_tmpl.h
This is the first patch in a series which adds nested EPT support to KVM's nested VMX. Nested EPT means emulating EPT for an L1 guest so that L1 can use EPT when running a nested guest L2. When L1 uses EPT, it allows the L2 guest to set its own cr3 and take its own page faults without either L0 or L1 getting involved. This often significantly improves L2's performance over the previous two alternatives (shadow page tables over EPT, and shadow page tables over shadow page tables).

This patch adds EPT support to paging_tmpl.h.

paging_tmpl.h contains the code for reading and writing page tables. The code for 32-bit and 64-bit tables is very similar, but not identical, so paging_tmpl.h is #include'd twice in mmu.c, once with PTTYPE=32 and once with PTTYPE=64, and this generates the two sets of similar functions.

There are subtle but important differences between the format of EPT tables and that of ordinary x86 64-bit page tables, so for nested EPT we need a third set of functions to read the guest EPT table and to write the shadow EPT table.

So this patch adds a third PTTYPE, PTTYPE_EPT, which creates functions (prefixed with EPT) which correctly read and write EPT tables.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c         |  5
 arch/x86/kvm/paging_tmpl.h | 43
 2 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 117233f..6c1670f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3397,6 +3397,11 @@ static inline bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gp
 	return mmu->last_pte_bitmap & (1 << index);
 }
 
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index df34d4a..4c45654 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -50,6 +50,22 @@
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
 	#define PT_MAX_FULL_LEVELS 2
 	#define CMPXCHG cmpxchg
+#elif PTTYPE == PTTYPE_EPT
+	#define pt_element_t u64
+	#define guest_walker guest_walkerEPT
+	#define FNAME(name) EPT_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#ifdef CONFIG_X86_64
+	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
+	#else
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 2
+	#endif
 #else
 	#error Invalid PTTYPE value
 #endif
@@ -80,6 +96,10 @@ static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
 	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
 }
 
+#if PTTYPE != PTTYPE_EPT
+/*
+ *  Comment out this for EPT because update_accessed_dirty_bits() is not used.
+ */
 static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			       pt_element_t __user *ptep_user, unsigned index,
 			       pt_element_t orig_pte, pt_element_t new_pte)
@@ -102,6 +122,7 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 
 	return (ret != orig_pte);
 }
+#endif
 
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu_page *sp, u64 *spte,
@@ -126,13 +147,21 @@ no_present:
 static inline unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, u64 gpte)
 {
 	unsigned access;
-
+#if PTTYPE == PTTYPE_EPT
+	access = (gpte & (VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+			  VMX_EPT_EXECUTABLE_MASK));
+#else
 	access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
 	access &= ~(gpte >> PT64_NX_SHIFT);
+#endif
 
 	return access;
 }
 
+#if PTTYPE != PTTYPE_EPT
+/*
+ * EPT A/D bit support is not implemented.
+ */
 static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 					     struct kvm_mmu *mmu,
 					     struct guest_walker *walker,
@@ -169,6 +198,7 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 	}
 	return 0;
 }
+#endif
 
 /*
  * Fetch a guest pte for a guest virtual address
@@ -177,7 +207,6 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 				    struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 				    gva_t addr, u32 access)
 {
-	int ret;
 	pt_element_t pte;
 	pt_element_t __user *uninitialized_var(ptep_user);
 	gfn_t table_gfn;
@@
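The PTTYPE/FNAME() templating that the commit message describes can be imitated in a single file. The sketch below is only an approximation in user-space C: instead of including a header several times, it expands one body macro (the hypothetical TOY_TEMPLATE) under different FNAME() definitions. All TOY_* names, the toy_fmt enum and the bit values are assumptions made for this example, and the two access expressions simply mirror the two branches of gpte_access() in the diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit layouts (assumed for this sketch). */
#define TOY_PT_WRITABLE		(1ull << 1)
#define TOY_PT_USER		(1ull << 2)
#define TOY_PT64_NX_SHIFT	63
#define TOY_ACC_EXEC		1ull
#define TOY_EPT_READABLE	(1ull << 0)
#define TOY_EPT_WRITABLE	(1ull << 1)
#define TOY_EPT_EXECUTABLE	(1ull << 2)

/* The "template": expanded once per format, like paging_tmpl.h. */
#define TOY_TEMPLATE							\
	static unsigned FNAME(gpte_access)(uint64_t gpte)		\
	{								\
		return (unsigned)TOY_GPTE_ACCESS(gpte);			\
	}

/* First instantiation: ordinary 64-bit page tables (NX clears exec). */
#define FNAME(name) paging64_##name
#define TOY_GPTE_ACCESS(g) \
	((((g) & (TOY_PT_WRITABLE | TOY_PT_USER)) | TOY_ACC_EXEC) & \
	 ~((g) >> TOY_PT64_NX_SHIFT))
TOY_TEMPLATE
#undef FNAME
#undef TOY_GPTE_ACCESS

/* Second instantiation: EPT entries carry explicit R/W/X bits. */
#define FNAME(name) ept_##name
#define TOY_GPTE_ACCESS(g) \
	((g) & (TOY_EPT_READABLE | TOY_EPT_WRITABLE | TOY_EPT_EXECUTABLE))
TOY_TEMPLATE
#undef FNAME
#undef TOY_GPTE_ACCESS

enum toy_fmt { TOY_FMT_PAGING64, TOY_FMT_EPT };

/* Pick the generated function for a format, as an MMU context would. */
static unsigned toy_gpte_access(enum toy_fmt fmt, uint64_t gpte)
{
	assert(fmt == TOY_FMT_PAGING64 || fmt == TOY_FMT_EPT);
	return fmt == TOY_FMT_EPT ? ept_gpte_access(gpte)
				  : paging64_gpte_access(gpte);
}
```

The point of the design is that the shared body is written once, while each format supplies only its own name prefix and bit definitions. The real header gets the same effect through repeated #include with a different PTTYPE each time.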
[PATCH v3 04/13] nEPT: Define EPT-specific link_shadow_page()
Since link_shadow_page() is used by a routine in mmu.c, add an EPT-specific link_shadow_page() in paging_tmpl.h, rather than moving it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/paging_tmpl.h | 20
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 4c45654..dc495f9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -461,6 +461,18 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 	}
 }
 
+#if PTTYPE == PTTYPE_EPT
+static void FNAME(link_shadow_page)(u64 *sptep, struct kvm_mmu_page *sp)
+{
+	u64 spte;
+
+	spte = __pa(sp->spt) | VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK |
+		VMX_EPT_EXECUTABLE_MASK;
+
+	mmu_spte_set(sptep, spte);
+}
+#endif
+
 /*
  * Fetch a shadow pte for a specific level in the paging hierarchy.
  * If the guest tries to write a write-protected page, we need to
@@ -513,7 +525,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			goto out_gpte_changed;
 
 		if (sp)
+#if PTTYPE == PTTYPE_EPT
+			FNAME(link_shadow_page)(it.sptep, sp);
+#else
 			link_shadow_page(it.sptep, sp);
+#endif
 	}
 
 	for (;
@@ -533,7 +549,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 		sp = kvm_mmu_get_page(vcpu, direct_gfn, addr, it.level-1,
 				      true, direct_access, it.sptep);
+#if PTTYPE == PTTYPE_EPT
+		FNAME(link_shadow_page)(it.sptep, sp);
+#else
 		link_shadow_page(it.sptep, sp);
+#endif
 	}
 
 	clear_sp_write_flooding_count(it.sptep);
-- 
1.8.1.2
[PATCH v3 05/13] nEPT: MMU context for nested EPT
KVM's existing shadow MMU code already supports nested TDP. To use it, we need to set up a new MMU context for nested EPT, and create a few callbacks for it (nested_ept_*()). This context should also use the EPT versions of the page table access functions (defined in the previous patch).

Then, we need to switch back and forth between this nested context and the regular MMU context when switching between L1 and L2 (when L1 runs this L2 with EPT).

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c | 38
 arch/x86/kvm/mmu.h |  1
 arch/x86/kvm/vmx.c | 54
 3 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6c1670f..37f8d7f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3653,6 +3653,44 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
 }
 EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
 
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
+{
+	ASSERT(vcpu);
+	ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
+
+	context->shadow_root_level = kvm_x86_ops->get_tdp_level();
+
+	context->nx = is_nx(vcpu); /* TODO: ? */
+	context->new_cr3 = paging_new_cr3;
+	context->page_fault = EPT_page_fault;
+	context->gva_to_gpa = EPT_gva_to_gpa;
+	context->sync_page = EPT_sync_page;
+	context->invlpg = EPT_invlpg;
+	context->update_pte = EPT_update_pte;
+	context->free = paging_free;
+	context->root_level = context->shadow_root_level;
+	context->root_hpa = INVALID_PAGE;
+	context->direct_map = false;
+
+	/* TODO: reset_rsvds_bits_mask() is not built for EPT, we need
+	   something different.
+	 */
+	reset_rsvds_bits_mask(vcpu, context);
+
+
+	/* TODO: I copied these from kvm_init_shadow_mmu, I don't know why
+	   they are done, or why they write to vcpu->arch.mmu and not context
+	 */
+	vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
+	vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);
+	vcpu->arch.mmu.base_role.smep_andnot_wp =
+		kvm_read_cr4_bits(vcpu, X86_CR4_SMEP) &&
+		!is_write_protection(vcpu);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_EPT_mmu);
+
 static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
 {
 	int r = kvm_init_shadow_mmu(vcpu, vcpu->arch.walk_mmu);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2adcbc2..8fc94dd 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -54,6 +54,7 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int kvm_init_shadow_EPT_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 51b8b4f0..80ab5b1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1045,6 +1045,11 @@ static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline int nested_cpu_has_ept(struct vmcs12 *vmcs12)
+{
+	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_EPT);
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -7305,6 +7310,46 @@ static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 		entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
+/* Callbacks for nested_ept_init_mmu_context: */
+
+static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
+{
+	/* return the page table to be shadowed - in our case, EPT12 */
+	return get_vmcs12(vcpu)->ept_pointer;
+}
+
+static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
+	struct x86_exception *fault)
+{
+	struct vmcs12 *vmcs12;
+	nested_vmx_vmexit(vcpu);
+	vmcs12 = get_vmcs12(vcpu);
+	/*
+	 * Note no need to set vmcs12->vm_exit_reason as it is already copied
+	 * from vmcs02 in nested_vmx_vmexit() above, i.e., EPT_VIOLATION.
+	 */
+	vmcs12->exit_qualification = fault->error_code;
+	vmcs12->guest_physical_address = fault->address;
+}
+
+static int nested_ept_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+	int r = kvm_init_shadow_EPT_mmu(vcpu, &vcpu->arch.mmu);
+
+	vcpu->arch.mmu.set_cr3 = vmx_set_cr3;
+	vcpu->arch.mmu.get_cr3 = nested_ept_get_cr3;
+	vcpu->arch.mmu.inject_page_fault = nested_ept_inject_page_fault;
+
+	vcpu->arch.walk_mmu =
[PATCH v3 06/13] nEPT: Fix cr3 handling in nested exit and entry
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/vmx.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 80ab5b1..db8df4c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7602,6 +7602,17 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	kvm_set_cr3(vcpu, vmcs12->guest_cr3);
 	kvm_mmu_reset_context(vcpu);
 
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so they may also have saved PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0);
+		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
+		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
+		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+	}
+
 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);
 	kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->guest_rip);
 }
@@ -7924,6 +7935,25 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
+	/*
+	 * In some cases (usually, nested EPT), L2 is allowed to change its
+	 * own CR3 without exiting. If it has changed it, we must keep it.
+	 * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined
+	 * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12.
+	 */
+	if (enable_ept)
+		vmcs12->guest_cr3 = vmcs_read64(GUEST_CR3);
+	/*
+	 * Additionally, except when L0 is using shadow page tables, L1 or
+	 * L2 control guest_cr3 for L2, so save their PDPTEs
+	 */
+	if (enable_ept) {
+		vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0);
+		vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1);
+		vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2);
+		vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3);
+	}
+
 	vmcs12->vm_entry_controls =
 		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
 		(vmcs_read32(VM_ENTRY_CONTROLS) & VM_ENTRY_IA32E_MODE);
--
1.8.1.2
[PATCH v3 07/13] nEPT: Fix wrong test in kvm_set_cr3
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem was
avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the nested
NPT case, and instead set the vmcb (and arch.cr3) directly, thus circumventing
the problem. Additional potential calls to the buggy function are avoided
in that we don't trap cr3 modifications when nested NPT is enabled. However,
because in nested VMX we did want to use kvm_set_cr3() (as requested in
Avi Kivity's review of the original nested VMX patches), we can't avoid this
problem and need to fix it.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/x86.c | 11 ---
 1 file changed, 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 94f35d2..ab09003 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -664,17 +664,6 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		 */
 	}
 
-	/*
-	 * Does the new cr3 value map to physical memory? (Note, we
-	 * catch an invalid cr3 even in real-mode, because it would
-	 * cause trouble later on when we turn on paging anyway.)
-	 *
-	 * A real CPU would silently accept an invalid cr3 and would
-	 * attempt to use it - with largely undefined (and often hard
-	 * to debug) behavior on the guest side.
-	 */
-	if (unlikely(!gfn_to_memslot(vcpu->kvm, cr3 >> PAGE_SHIFT)))
-		return 1;
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	vcpu->arch.mmu.new_cr3(vcpu);
--
1.8.1.2
[PATCH v3 08/13] nEPT: Some additional comments
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention shadow on either EPT or shadow as the only two options.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/vmx.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index db8df4c..17d8b89 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6534,7 +6534,20 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
 		return nested_cpu_has2(vmcs12,
 			SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES);
 	case EXIT_REASON_EPT_VIOLATION:
+		/*
+		 * L0 always deals with the EPT violation. If nested EPT is
+		 * used, and the nested mmu code discovers that the address is
+		 * missing in the guest EPT table (EPT12), the EPT violation
+		 * will be injected with nested_ept_inject_page_fault()
+		 */
+		return 0;
 	case EXIT_REASON_EPT_MISCONFIG:
+		/*
+		 * L2 never uses directly L1's EPT, but rather L0's own EPT
+		 * table (shadow on EPT) or a merged EPT table that L0 built
+		 * (EPT on EPT). So any problems with the structure of the
+		 * table is L0's fault.
+		 */
 		return 0;
 	case EXIT_REASON_PREEMPTION_TIMER:
 		return vmcs12->pin_based_vm_exec_control &
--
1.8.1.2
[PATCH v3 09/13] nEPT: Advertise EPT to L1
Advertise the support of EPT to the L1 guest, through the appropriate MSR.

This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/include/asm/vmx.h |  2 ++
 arch/x86/kvm/vmx.c         | 17 +++--
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index f3e01a2..4aec45d 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -394,7 +394,9 @@ enum vmcs_field {
 #define VMX_EPTP_WB_BIT				(1ull << 14)
 #define VMX_EPT_2MB_PAGE_BIT			(1ull << 16)
 #define VMX_EPT_1GB_PAGE_BIT			(1ull << 17)
+#define VMX_EPT_INVEPT_BIT			(1ull << 20)
 #define VMX_EPT_AD_BIT				(1ull << 21)
+#define VMX_EPT_EXTENT_INDIVIDUAL_BIT		(1ull << 24)
 #define VMX_EPT_EXTENT_CONTEXT_BIT		(1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT		(1ull << 26)
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 17d8b89..136fc25 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2155,6 +2155,7 @@ static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
 static u32 nested_vmx_misc_low, nested_vmx_misc_high;
+static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
 {
 	/*
@@ -2242,6 +2243,18 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 		SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
 		SECONDARY_EXEC_WBINVD_EXITING;
 
+	if (enable_ept) {
+		/* nested EPT: emulate EPT also to L1 */
+		nested_vmx_secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT;
+		nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT;
+		nested_vmx_ept_caps |=
+			VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT |
+			VMX_EPT_EXTENT_CONTEXT_BIT |
+			VMX_EPT_EXTENT_INDIVIDUAL_BIT;
+		nested_vmx_ept_caps &= vmx_capability.ept;
+	} else
+		nested_vmx_ept_caps = 0;
+
 	/* miscellaneous data */
 	rdmsr(MSR_IA32_VMX_MISC, nested_vmx_misc_low, nested_vmx_misc_high);
 	nested_vmx_misc_low &= VMX_MISC_PREEMPTION_TIMER_RATE_MASK |
@@ -2347,8 +2360,8 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 					nested_vmx_secondary_ctls_high);
 		break;
 	case MSR_IA32_VMX_EPT_VPID_CAP:
-		/* Currently, no nested ept or nested vpid */
-		*pdata = 0;
+		/* Currently, no nested vpid support */
+		*pdata = nested_vmx_ept_caps;
 		break;
 	default:
 		return 0;
--
1.8.1.2
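A minimal standalone sketch of the capability computation in patch 09/13, not part of the posted patch. It assumes the last step in nested_vmx_setup_ctls_msrs() intersects the offered bits with what the host CPU reports. The names nested_ept_caps() and hw_caps are hypothetical stand-ins for nested_vmx_ept_caps and vmx_capability.ept, and the bit position of VMX_EPT_PAGE_WALK_4_BIT (bit 6) is an assumption taken from the kernel header, since it is not shown in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as listed in the vmx.h hunk; bit 6 is an assumption. */
#define VMX_EPT_PAGE_WALK_4_BIT        (1ull << 6)
#define VMX_EPT_INVEPT_BIT             (1ull << 20)
#define VMX_EPT_EXTENT_INDIVIDUAL_BIT  (1ull << 24)
#define VMX_EPT_EXTENT_CONTEXT_BIT     (1ull << 25)
#define VMX_EPT_EXTENT_GLOBAL_BIT      (1ull << 26)

/* Hypothetical helper: EPT capabilities L0 would report to L1. */
static inline uint64_t nested_ept_caps(int enable_ept, uint64_t hw_caps)
{
	uint64_t caps;

	if (!enable_ept)
		return 0;	/* L0 runs without EPT, nothing to offer L1 */

	caps = VMX_EPT_PAGE_WALK_4_BIT | VMX_EPT_INVEPT_BIT |
	       VMX_EPT_EXTENT_GLOBAL_BIT | VMX_EPT_EXTENT_CONTEXT_BIT |
	       VMX_EPT_EXTENT_INDIVIDUAL_BIT;

	/* Never advertise a feature the physical CPU lacks. */
	return caps & hw_caps;
}

/* Usage: a host without INVEPT support must not expose it to L1. */
void nested_ept_caps_demo(void)
{
	uint64_t host = VMX_EPT_PAGE_WALK_4_BIT | VMX_EPT_EXTENT_GLOBAL_BIT;

	assert(nested_ept_caps(0, host) == 0);
	assert(nested_ept_caps(1, host) == host);
}
```

Returning the mask from a pure function, rather than writing a global as the patch does, is only a convenience for the sketch.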
[PATCH v3 10/13] nEPT: Nested INVEPT
If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table for
L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in the course
of this modification already calls INVEPT. Therefore, when L1 calls INVEPT,
we don't really need to do anything. In particular we *don't* need to call
the real INVEPT again. All we do in our INVEPT is verify the validity of the
call, and its parameters, and then do nothing.

In KVM Forum 2010, Dong et al. presented "Nested Virtualization Friendly KVM"
and classified our current nested EPT implementation as "shadow-like virtual
EPT". He recommended instead a different approach, which he called "VTLB-like
virtual EPT". If we had taken that alternative approach, INVEPT would have had
a bigger role: L0 would only rebuild the shadow EPT table when L1 calls INVEPT.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/include/uapi/asm/vmx.h |  1 +
 arch/x86/kvm/vmx.c              | 83 +
 2 files changed, 84 insertions(+)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index d651082..7a34e8f 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -65,6 +65,7 @@
 #define EXIT_REASON_EOI_INDUCED         45
 #define EXIT_REASON_EPT_VIOLATION       48
 #define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_INVEPT              50
 #define EXIT_REASON_PREEMPTION_TIMER    52
 #define EXIT_REASON_WBINVD              54
 #define EXIT_REASON_XSETBV              55
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 136fc25..9ceab42 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6245,6 +6245,87 @@ static int handle_vmptrst(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+/* Emulate the INVEPT instruction */
+static int handle_invept(struct kvm_vcpu *vcpu)
+{
+	u32 vmx_instruction_info;
+	unsigned long type;
+	gva_t gva;
+	struct x86_exception e;
+	struct {
+		u64 eptp, gpa;
+	} operand;
+
+	if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
+	    !(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	if (!nested_vmx_check_permission(vcpu))
+		return 1;
+
+	if (!kvm_read_cr0_bits(vcpu, X86_CR0_PE)) {
+		kvm_queue_exception(vcpu, UD_VECTOR);
+		return 1;
+	}
+
+	/* According to the Intel VMX instruction reference, the memory
+	 * operand is read even if it isn't needed (e.g., for type==global)
+	 */
+	vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+	if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+			vmx_instruction_info, &gva))
+		return 1;
+	if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &operand,
+				sizeof(operand), &e)) {
+		kvm_inject_page_fault(vcpu, &e);
+		return 1;
+	}
+
+	type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+	switch (type) {
+	case VMX_EPT_EXTENT_GLOBAL:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/*
+			 * Do nothing: when L1 changes EPT12, we already
+			 * update EPT02 (the shadow EPT table) and call INVEPT.
+			 * So when L1 calls INVEPT, there's nothing left to do.
+			 */
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	case VMX_EPT_EXTENT_CONTEXT:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/* Do nothing */
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	case VMX_EPT_EXTENT_INDIVIDUAL_ADDR:
+		if (!(nested_vmx_ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT))
+			nested_vmx_failValid(vcpu,
+				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+		else {
+			/* Do nothing */
+			nested_vmx_succeed(vcpu);
+		}
+		break;
+	default:
+		nested_vmx_failValid(vcpu,
+			VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+	}
+
+	skip_emulated_instruction(vcpu);
+	return 1;
+}
+
 /*
  * The exit handlers return
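A minimal standalone model of the type check at the end of handle_invept(), not part of the posted patch. The emulation only validates the requested invalidation type against the capabilities advertised to L1 and then succeeds without touching hardware. The helper name invept_type_ok() is hypothetical, and the numeric type encodings (0, 1, 2) are an assumption, since the thread does not show the VMX_EPT_EXTENT_* values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability bits as in the vmx.h hunk of patch 09/13. */
#define VMX_EPT_EXTENT_INDIVIDUAL_BIT  (1ull << 24)
#define VMX_EPT_EXTENT_CONTEXT_BIT     (1ull << 25)
#define VMX_EPT_EXTENT_GLOBAL_BIT      (1ull << 26)

/* INVEPT type encodings: assumed values, not shown in this thread. */
enum invept_type {
	INVEPT_INDIVIDUAL_ADDR = 0,
	INVEPT_CONTEXT         = 1,
	INVEPT_GLOBAL          = 2,
};

/*
 * Hypothetical helper: true if the emulated INVEPT should report success,
 * false if it should fail with an invalid-operand error. No invalidation
 * is modelled: per the commit message, L0 already flushes whenever it
 * rebuilds EPT02.
 */
static inline bool invept_type_ok(unsigned long type, uint64_t ept_caps)
{
	switch (type) {
	case INVEPT_GLOBAL:
		return (ept_caps & VMX_EPT_EXTENT_GLOBAL_BIT) != 0;
	case INVEPT_CONTEXT:
		return (ept_caps & VMX_EPT_EXTENT_CONTEXT_BIT) != 0;
	case INVEPT_INDIVIDUAL_ADDR:
		return (ept_caps & VMX_EPT_EXTENT_INDIVIDUAL_BIT) != 0;
	default:
		return false;	/* unknown type is always rejected */
	}
}

/* Usage: only the types advertised to L1 are accepted. */
void invept_demo(void)
{
	uint64_t caps = VMX_EPT_EXTENT_GLOBAL_BIT | VMX_EPT_EXTENT_CONTEXT_BIT;

	assert(invept_type_ok(INVEPT_GLOBAL, caps));
	assert(!invept_type_ok(INVEPT_INDIVIDUAL_ADDR, caps));
	assert(!invept_type_ok(3, caps));
}
```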
[PATCH v3 11/13] nEPT: Miscellaneous cleanups
Some trivial code cleanups not really related to nested EPT.

Signed-off-by: Nadav Har'El n...@il.ibm.com
Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
Reviewed-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/vmx.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9ceab42..ca49e19 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -714,7 +714,6 @@ static void nested_release_page_clean(struct page *page)
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
-static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
@@ -1039,8 +1038,7 @@ static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
 		(vmcs12->secondary_vm_exec_control & bit);
 }
 
-static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12,
-	struct kvm_vcpu *vcpu)
+static inline bool nested_cpu_has_virtual_nmis(struct vmcs12 *vmcs12)
 {
 	return vmcs12->pin_based_vm_exec_control & PIN_BASED_VIRTUAL_NMIS;
 }
@@ -6731,7 +6729,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
 
 	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked &&
 	    !(is_guest_mode(vcpu) && nested_cpu_has_virtual_nmis(
-					get_vmcs12(vcpu), vcpu)))) {
+					get_vmcs12(vcpu))))) {
 		if (vmx_interrupt_allowed(vcpu)) {
 			vmx->soft_vnmi_blocked = 0;
 		} else if (vmx->vnmi_blocked_time > 1000000000LL &&
--
1.8.1.2
[PATCH v3 12/13] nEPT: Move is_rsvd_bits_set() to paging_tmpl.h
Move is_rsvd_bits_set() to paging_tmpl.h so that it can be used to check
reserved bits in EPT page table entries as well.

Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/kvm/mmu.c         |  8 
 arch/x86/kvm/paging_tmpl.h | 12 ++--
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 37f8d7f..93d6abf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2468,14 +2468,6 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 	mmu_free_roots(vcpu);
 }
 
-static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
-{
-	int bit7;
-
-	bit7 = (gpte >> 7) & 1;
-	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
-}
-
 static pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 				     bool no_dirty_log)
 {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index dc495f9..2432d49 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -124,11 +124,19 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 #endif
 
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+	int bit7;
+
+	bit7 = (gpte >> 7) & 1;
+	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+}
+
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
 				  struct kvm_mmu_page *sp, u64 *spte,
 				  u64 gpte)
 {
-	if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
+	if (FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL))
 		goto no_present;
 
 	if (!is_present_gpte(gpte))
@@ -279,7 +287,7 @@ retry_walk:
 		if (unlikely(!is_present_gpte(pte)))
 			goto error;
 
-		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+		if (unlikely(FNAME(is_rsvd_bits_set)(&vcpu->arch.mmu, pte,
 					      walker->level))) {
 			errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
 			goto error;
--
1.8.1.2
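A minimal standalone sketch of the reserved-bit test moved by patch 12/13, not part of the posted patch. It shows how bit 7 of a page-table entry selects which row of masks applies at a given level. The struct mmu_model type is a hypothetical cut-down stand-in for struct kvm_mmu, and the mask value used in the demo is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LEVELS 4

/* Hypothetical stand-in for struct kvm_mmu: only the reserved-bit masks. */
struct mmu_model {
	uint64_t rsvd_bits_mask[2][MAX_LEVELS];
};

/* Same test as is_rsvd_bits_set(): bit 7 of the entry picks the mask row. */
static inline bool is_rsvd_bits_set(const struct mmu_model *mmu,
				    uint64_t gpte, int level)
{
	int bit7 = (gpte >> 7) & 1;

	assert(level >= 1 && level <= MAX_LEVELS);
	return (gpte & mmu->rsvd_bits_mask[bit7][level - 1]) != 0;
}

/* Usage: the same bits can be legal or reserved depending on bit 7. */
void rsvd_bits_demo(void)
{
	struct mmu_model mmu = {{{0}}};

	/* Made-up mask: at level 2, bits 13..20 are reserved when bit 7 is set. */
	mmu.rsvd_bits_mask[1][1] = 0xffULL << 13;

	assert(is_rsvd_bits_set(&mmu, (1ULL << 7) | (1ULL << 13), 2));
	assert(!is_rsvd_bits_set(&mmu, 1ULL << 13, 2));
}
```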
[PATCH v3 13/13] nEPT: Inject EPT violation/misconfiguration
Add code to detect EPT misconfiguration and inject it to L1 VMM. Also,
it injects more correct exit qualification upon EPT violation to L1
VMM. Now L1 can correctly go to ept_misconfig handler (instead of
wrongly going to fast_page_fault), it will try to handle mmio page
fault, if failed, it is a real EPT misconfiguration.

Signed-off-by: Jun Nakajima jun.nakaj...@intel.com
Signed-off-by: Xinhao Xu xinhao...@intel.com
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/mmu.c              |  5 ---
 arch/x86/kvm/mmu.h              |  5 +++
 arch/x86/kvm/paging_tmpl.h      | 26 ++
 arch/x86/kvm/vmx.c              | 79 +++--
 5 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3741c65..1d03202 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -262,6 +262,8 @@ struct kvm_mmu {
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
 	void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			   u64 *spte, const void *pte);
+	bool (*check_tdp_pte)(u64 pte, int level);
+
 	hpa_t root_hpa;
 	int root_level;
 	int shadow_root_level;
@@ -503,6 +505,8 @@ struct kvm_vcpu_arch {
 	 * instruction.
 	 */
 	bool write_fault_to_shadow_pgtable;
+
+	unsigned long exit_qualification; /* set at EPT violation at this point */
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 93d6abf..3a3b11f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -233,11 +233,6 @@ static bool set_mmio_spte(u64 *sptep, gfn_t gfn, pfn_t pfn, unsigned access)
 	return false;
 }
 
-static inline u64 rsvd_bits(int s, int e)
-{
-	return ((1ULL << (e - s + 1)) - 1) << s;
-}
-
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask)
 {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 8fc94dd..559e2e0 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -88,6 +88,11 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
 	return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
 }
 
+static inline u64 rsvd_bits(int s, int e)
+{
+	return ((1ULL << (e - s + 1)) - 1) << s;
+}
+
 /*
  * Will a fault with a given page-fault error code (pfec) cause a permission
  * fault with the given access (in ACC_* format)?
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 2432d49..067b1f8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -126,10 +126,14 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 
 static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
 {
+#if PTTYPE == PTTYPE_EPT
+	return (mmu->check_tdp_pte(gpte, level));
+#else
 	int bit7;
 
 	bit7 = (gpte >> 7) & 1;
 	return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
+#endif
 }
 
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
@@ -352,6 +356,28 @@ error:
 	walker->fault.vector = PF_VECTOR;
 	walker->fault.error_code_valid = true;
 	walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+	/*
+	 * Use PFERR_RSVD_MASK in erorr_code to to tell if EPT
+	 * misconfiguration requires to be injected. The detection is
+	 * done by is_rsvd_bits_set() above.
+	 *
+	 * We set up the value of exit_qualification to inject:
+	 * [2:0] -- Derive from [2:0] of real exit_qualification at EPT violation
+	 * [5:3] -- Calculated by the page walk of the guest EPT page tables
+	 * [7:8] -- Clear to 0.
+	 *
+	 * The other bits are set to 0.
+	 */
+	if (!(errcode & PFERR_RSVD_MASK)) {
+		unsigned long exit_qualification = vcpu->arch.exit_qualification;
+
+		pte_access = pt_access & pte;
+		vcpu->arch.exit_qualification = ((pte_access & 0x7) << 3) |
+			(exit_qualification & 0x7);
+	}
+#endif
 	walker->fault.address = addr;
 	walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca49e19..a44e7fd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5310,6 +5310,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	/* ept page table is present? */
 	error_code |= (exit_qualification >> 3) & 0x1;
 
+	vcpu->arch.exit_qualification = exit_qualification;
+
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -7426,7 +7428,7 @@ static unsigned long nested_ept_get_cr3(struct kvm_vcpu *vcpu)
 }
 
 static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu,
-	struct x86_exception *fault)
+		struct x86_exception *fault)
 {
 	struct vmcs12 *vmcs12;
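A minimal standalone sketch of the exit-qualification composition described in the paging_tmpl.h comment of patch 13/13, not part of the posted patch. The low three bits (the access type) come from the hardware exit qualification, and bits 5:3 come from the permissions found while walking L1's EPT tables. The helper name nested_ept_exit_qual() is hypothetical; pt_access and pte mirror the variables named in the hunk.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: exit qualification handed to L1 for an EPT
 * violation found while walking EPT12.
 *   bits 2:0  access type, copied from the hardware exit qualification
 *   bits 5:3  permissions accumulated during the walk (pt_access & pte)
 *   others    zero
 */
static inline uint64_t nested_ept_exit_qual(uint64_t hw_exit_qual,
					    uint64_t pt_access, uint64_t pte)
{
	uint64_t pte_access = pt_access & pte;

	return ((pte_access & 0x7) << 3) | (hw_exit_qual & 0x7);
}

/* Usage: a write access (bit 1) to a translation that is read-only. */
void exit_qual_demo(void)
{
	assert(nested_ept_exit_qual(0x182, 0x7, 0x1) == 0x0a);
}
```

In the demo, the upper bits of the hardware value (0x180) are dropped, the write bit (0x2) is kept, and the read-only permission (0x1) lands in bit 3.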
Re: [PATCH] kvm/ppc: interrupt disabling fixes
On Wed, 2013-05-08 at 19:35 -0500, Scott Wood wrote:
> Sigh, and then there's this:
>
> #ifdef CONFIG_PPC64
>                 /* lazy EE magic */
>                 hard_irq_disable();
>                 if (lazy_irq_pending()) {
>                         /* Got an interrupt in between, try again */
>                         local_irq_enable();
>                         hard_irq_disable();
>                         kvm_guest_exit();
>                         continue;
>                 }
>
>                 trace_hardirqs_on();
> #endif
>
> Alex, could you be a bit more descriptive than "magic" please? Can this
> chunk of code be removed if we do the other changes being discussed? Or
> should we leave this in and drop the pre-enter hard_irq_disable portion
> of the proposed changes?
>
> Why are you calling trace_hardirqs_on() here and not in
> kvmppc_lazy_ee_enable()? Why are you calling kvm_guest_exit() before
> you've called kvm_guest_enter()?

I think I originated that magic... it more/less mimics prep_for_idle,
the goal was to hard disable (because we had soft disabled earlier) and
check if anything happened in between... if it did, abort, and try
again, but it's a bit fishy really.

Ben.
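A toy model of the retry pattern Ben describes, not kernel code and not taken from the thread. It assumes a simplification: events latched while interrupts were only soft-disabled are replayed one per aborted attempt. The struct and function names are hypothetical; the comments map each step to the calls quoted above.

```c
#include <assert.h>

/* Toy model: interrupts latched while only soft-disabled. */
struct lazy_irq_model {
	int latched;	/* events recorded but not yet replayed */
	int attempts;	/* how many times guest entry was tried */
};

/*
 * Hypothetical model of the loop: returns the number of attempts needed
 * before one finds nothing latched, i.e. before guest entry is safe.
 */
static inline int try_enter_guest(struct lazy_irq_model *m)
{
	for (;;) {
		m->attempts++;
		/* hard_irq_disable(): from here nothing new can be latched */
		if (m->latched > 0) {
			/* lazy_irq_pending(): abort, replay one event, retry */
			m->latched--;
			continue;
		}
		return m->attempts;
	}
}

/* Usage: two latched events cost two aborted attempts. */
void lazy_irq_demo(void)
{
	struct lazy_irq_model m = { .latched = 2, .attempts = 0 };

	assert(try_enter_guest(&m) == 3);
	assert(m.latched == 0);
}
```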
Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to Guest
On 05/08/2013 05:20 PM, Caraman Mihai Claudiu-B02008 wrote:
>> -----Original Message-----
>> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
>> Behalf Of tiejun.chen
>> Sent: Wednesday, May 08, 2013 4:54 AM
>> To: Wood Scott-B07421
>> Cc: ag...@suse.de; kvm-ppc@vger.kernel.org; k...@vger.kernel.org;
>> linuxppc-...@lists.ozlabs.org
>> Subject: Re: [v1][KVM][PATCH 1/1] kvm:ppc:booehv: direct ISI exception to
>> Guest
>>
>> On 05/08/2013 07:40 AM, Scott Wood wrote:
>>> On 05/07/2013 06:06:30 AM, Tiejun Chen wrote:
>>>> We also can direct ISI exception to Guest like DSI.
>>>>
>>>> Signed-off-by: Tiejun Chen tiejun.c...@windriver.com
>>>> ---
>>>>  arch/powerpc/kvm/booke_emulate.c |3 +++
>>>>  arch/powerpc/kvm/e500mc.c|3 ++-
>>>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>>
>>> Are you seeing a real performance improvement from this? This will
>>> interfere
>>
>> No. But after we reduce the exit to host, shouldn't this improve
>> performance?
>
> We lose some flexibility for this so it make sense only if we gain
> measurable improvements.

Sounds we have much more works to do.

>>> somewhat with using the VF bit, if we were to ever do so, since VF only
>>> affects
>>
>> Sorry, what is the VF you said?
>
> VF stands for virtualization fault see MAS8[VF] and we may use it for
> virtualized

I almost forget this point :)

> MMIO. The hypervisor should deny execute access on pages marked with VF.
> Accordingly in this case guest ISI exceptions should be handled by the
> hypervisor.

Thanks for your information.

Tiejun
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html