[Bug 60830] New: L2 rhel6u4(32bit) guest reboot continuously
https://bugzilla.kernel.org/show_bug.cgi?id=60830

            Bug ID: 60830
           Summary: L2 rhel6u4 (32-bit) guest reboots continuously
           Product: Virtualization
           Version: unspecified
    Kernel Version: 3.11.0-rc1
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: kvm
          Assignee: virtualization_...@kernel-bugs.osdl.org
          Reporter: chao.z...@intel.com
        Regression: No

Environment:
    Host OS (ia32/ia32e/IA64): ia32e
    Guest OS (ia32/ia32e/IA64): ia32e
    Guest OS Type (Linux/Windows): Linux
    kvm.git commit: cc2df20c7c4ce594c3e17e9cc260c330646012c8
    qemu.git commit: f7ad538e1ea130c8b6f3abb06ad6c856242c799e
    Host kernel version: 3.11.0-rc1
    Hardware: Romley_EP

Bug detailed description:
--------------------------
When creating an L1 guest with "-cpu host" and then creating a 32-bit rhel6u4
guest as the L2 guest, the L2 guest reboots continuously.

This commit introduced the bug:

    commit afa61f752ba62549e4143d9f9378a8d1d710d6eb
    Author: Nadav Har'El <n...@il.ibm.com>
    Date:   Wed Aug 7 14:59:22 2013 +0200

        Advertise the support of EPT to the L1 guest, through the appropriate
        MSR. This is the last patch of the basic Nested EPT feature, so as to
        allow bisection through this patch series: The guest will not see EPT
        support until this last patch, and will not attempt to use the
        half-applied feature.

Notes:
1. Creating a 32-bit RHEL6u3 as the L2 guest: the guest also reboots
   continuously.
2. Creating a 64-bit rhel6u4 guest as the L2 guest: the L2 guest works fine.
3. This should be a kernel bug:
       kvm      + qemu     = result
       cc2df20c + f7ad538e = bad
       205befd9 + f7ad538e = good

Reproduce steps:
1. Create the L1 guest:
       qemu-system-x86_64 -enable-kvm -m 8G -smp 4 \
           -net nic,macaddr=00:12:41:51:14:16 -net tap,script=/etc/kvm/qemu-ifup \
           ia32e_nested-kvm.img -cpu host
2. Create the L2 guest (inside L1):
       qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none ia32p_rhel6u4.img

Current result: the 32-bit rhel6u4 L2 guest reboots continuously.
Expected result: the 32-bit rhel6u4 L2 guest works fine.

Basic root-causing log:
--------------------------

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 2/6] vhost_net: use vhost_add_used_and_signal_n() in vhost_zerocopy_signal_used()
On 09/02/2013 01:50 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 30, 2013 at 12:29:18PM +0800, Jason Wang wrote:
>> We tend to batch the used adding and signaling in
>> vhost_zerocopy_callback() which may result more than 100 used buffers
>> to be updated in vhost_zerocopy_signal_used() in some cases. So wwitch
>> to use
>
> switch

Ok.

>> vhost_add_used_and_signal_n() to avoid multiple calls to
>> vhost_add_used_and_signal(). Which means much less times of used index
>> updating and memory barriers.
>
> pls put info on perf gain in commit log too

Sure.
Re: [PATCH V2 1/6] vhost_net: make vhost_zerocopy_signal_used() returns void
On 09/02/2013 01:51 PM, Michael S. Tsirkin wrote:
> tweak subj s/returns/return/
>
> On Fri, Aug 30, 2013 at 12:29:17PM +0800, Jason Wang wrote:
>> None of its caller use its return value, so let it return void.
>>
>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>> ---

Will correct it in v3.
Re: [PATCH V2 6/6] vhost_net: correctly limit the max pending buffers
On 09/02/2013 01:56 PM, Michael S. Tsirkin wrote:
> On Fri, Aug 30, 2013 at 12:29:22PM +0800, Jason Wang wrote:
>> As Michael points out, we used to limit the max pending DMAs to get
>> better cache utilization. But it was not done correctly since it was
>> only done when there were no new buffers submitted from the guest. The
>> guest can easily exceed the limitation by keeping sending packets. So
>> this patch moves the check into the main loop. Tests show about 5%-10%
>> improvement on per-cpu throughput for guest tx. But a 5% drop on
>> per-cpu transaction rate for a single session TCP_RR.
>
> Any explanation for the drop? single session TCP_RR is unlikely to
> exceed VHOST_MAX_PEND, correct?

Unlikely to exceed. Rechecked the result; looks like it was not stable
enough. Will re-test and report.

>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>> ---
>>  drivers/vhost/net.c | 15 ++++-----------
>>  1 files changed, 4 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index d09c17c..592e1f2 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -363,6 +363,10 @@ static void handle_tx(struct vhost_net *net)
>>  		if (zcopy)
>>  			vhost_zerocopy_signal_used(net, vq);
>>
>> +		if ((nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV ==
>> +		    nvq->done_idx)
>> +			break;
>> +
>>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>>  					 ARRAY_SIZE(vq->iov),
>>  					 &out, &in,
>> @@ -372,17 +376,6 @@ static void handle_tx(struct vhost_net *net)
>>  			break;
>>  		/* Nothing new?  Wait for eventfd to tell us they refilled. */
>>  		if (head == vq->num) {
>> -			int num_pends;
>> -
>> -			/* If more outstanding DMAs, queue the work.
>> -			 * Handle upend_idx wrap around
>> -			 */
>> -			num_pends = likely(nvq->upend_idx >= nvq->done_idx) ?
>> -				    (nvq->upend_idx - nvq->done_idx) :
>> -				    (nvq->upend_idx + UIO_MAXIOV -
>> -				     nvq->done_idx);
>> -			if (unlikely(num_pends > VHOST_MAX_PEND))
>> -				break;
>>  			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>>  				vhost_disable_notify(&net->dev, vq);
>>  				continue;
>> --
>> 1.7.1
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On 08/30/2013 07:38 PM, Gleb Natapov wrote:
On Thu, Aug 29, 2013 at 07:26:40PM +0800, Xiao Guangrong wrote:
On 08/29/2013 05:51 PM, Gleb Natapov wrote:
On Thu, Aug 29, 2013 at 05:31:42PM +0800, Xiao Guangrong wrote:

As Documentation/RCU/whatisRCU.txt says:

    As with rcu_assign_pointer(), an important function of
    rcu_dereference() is to document which pointers are protected by
    RCU, in particular, flagging a pointer that is subject to changing
    at any time, including immediately after the rcu_dereference().
    And, again like rcu_assign_pointer(), rcu_dereference() is
    typically used indirectly, via the _rcu list-manipulation
    primitives, such as list_for_each_entry_rcu().

The documentation aspect of rcu_assign_pointer()/rcu_dereference() is
important. The code is complicated, so self-documentation will not hurt.
I want to see what is actually protected by rcu here. Freeing shadow
pages with call_rcu() further complicates matters: does it mean that
shadow pages are also protected by rcu?

Yes, it stops shadow pages from being freed while we do write-protection
on them.

Yeah, I got the trick. What I am saying is that we have a data structure
here protected by RCU, but we do not use RCU functions to access it...

Yes, they are not used when inserting a spte into the rmap and getting
the rmap from the entry... but do we need to use these functions to
guarantee the order?
The worst case is: we fetch the spte from the desc but the spte is not
updated yet; we can happily skip this spte since it will set the
dirty-bitmap later. This is guaranteed by the barrier between
mmu_spte_update() and mark_page_dirty(); the code is:

set_spte():

	if (mmu_spte_update(sptep, spte))
		kvm_flush_remote_tlbs(vcpu->kvm);

	if (!remap) {
		if (rmap_add(vcpu, sptep, gfn) > RMAP_RECYCLE_THRESHOLD)
			rmap_recycle(vcpu, sptep, gfn);

		if (level > PT_PAGE_TABLE_LEVEL)
			++vcpu->kvm->stat.lpages;
	}

	smp_wmb();

	if (pte_access & ACC_WRITE_MASK)
		mark_page_dirty(vcpu->kvm, gfn);

So, I guess if we can guarantee the order by ourselves, we do not need
to call the rcu functions explicitly... But the memory barriers in the
rcu functions are really light on x86 (a store cannot be reordered with
another store), so I do not mind explicitly using them if you think this
way is safer. :)

I think the self-documentation aspect of using rcu functions is also
important.

Okay. I will use these rcu functions and measure them to see whether
they cause a performance issue.

BTW why not allocate sp->spt from a SLAB_DESTROY_BY_RCU cache too? We
may switch write protection on a random spt occasionally if a page is
deleted and reused for another spt, though. For a last-level spt it
should not be a problem, and for a non-last-level one we have the
is_last_spte() check in __rmap_write_protect_lockless(). Can it work?

Yes, I also considered this way. It can work if we handle is_last_spte()
properly. Since the sp->spt can be reused, we cannot get the mapping
level from sp. We need to encode the mapping level into the spte so that
cmpxchg can understand if the page table has been moved to another
mapping level.

Isn't one bit that says that the spte is the last one enough? IIRC we
have one more ignored bit to spare in the spte.

Right. But I also want to use this way in fast_page_fault where the
mapping level is needed. Could you allow me to make this optimization
separately after this patchset is merged?
If you think it will complicate the initial version I am fine with
postponing it for later.

Thank you, Gleb!
Re: Is fallback vhost_net to qemu for live migrate available?
On Sat, Aug 31, 2013 at 12:45:11PM +0800, Qin Chuanyu wrote:
On 2013/8/30 0:08, Anthony Liguori wrote:

Hi Qin,

By changing the memory copy and notify mechanism, currently virtio-net
with vhost_net can run on Xen with good performance.

I think the key in doing this would be to implement a proper ioeventfd
and irqfd interface in the driver domain kernel. Just hacking vhost_net
with Xen-specific knowledge would be pretty nasty IMHO.

Yes, I added a kernel module which persists the virtio-net pio_addr and
msix address as the kvm module did. The guest wakes up the vhost thread
by adding a hook func in evtchn_interrupt.

Did you modify the front end driver to do grant table mapping, or is
this all being done by mapping the domain's memory?

There is nothing changed in the front end driver. Currently I use
alloc_vm_area to get address space, and map the domain's memory as qemu
did.

You mean you're using xc_map_foreign_range and friends in the backend to
map guest memory? That's not very desirable as it violates Xen's
security model. It would not be too hard to pass grant references
instead of guest physical memory addresses IMHO.

Wei.
Re: Is fallback vhost_net to qemu for live migrate available?
On Mon, Sep 02, 2013 at 08:57:22AM +0100, Wei Liu wrote:
> [...]
> You mean you're using xc_map_foreign_range and friends in the backend
> to map guest memory? That's not very desirable as it violates Xen's
> security model. It would not be too hard to pass grant references
> instead of guest physical memory addresses IMHO.
>
> Wei.

It's a start and it should make it fast and work with existing
infrastructure in the host, though.

-- 
MST
Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
> Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the

Not a typo :) That's what Avi asked for during the initial nested VMX
review: http://markmail.org/message/hhidqyhbo2mrgxxc

But there is at least one transition check that kvm_set_cr0() does that
should not be done during vmexit emulation, namely the CS.L bit check,
so I tend to agree that kvm_set_cr0() is not appropriate here, at least
not as it is. But can we skip the other checks kvm_set_cr0() does? For
instance, what prevents us from loading CR0.PG = 1, EFER.LME = 1 and
CR4.PAE = 0 during nested vmexit? What _should_ prevent it is the
vmentry check from 26.2.4:

    If the "host address-space size" VM-exit control is 1, the following
    must hold:
    - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.

But I do not see that we do that check on vmentry. What about NW/CD bit
checks, or reserved bits checks? 27.5.1 says:

    The following bits are not modified: For CR0, ET, CD, NW; bits 63:32
    (on processors that support Intel 64 architecture), 28:19, 17, and
    15:6; and any bits that are fixed in VMX operation (see Section
    23.8).

But again, the current vmexit code does not emulate this properly and
just sets everything from host_cr0. vmentry should also preserve all
those bits, but it looks like it doesn't either.

> state transition that may prevent loading L1's cr0.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  arch/x86/kvm/vmx.c | 2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 57b4e12..d001b019 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8185,7 +8185,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>  	 * fpu_active (which may have changed).
>  	 * Note that vmx_set_cr0 refers to efer set above.
>  	 */
> -	kvm_set_cr0(vcpu, vmcs12->host_cr0);
> +	vmx_set_cr0(vcpu, vmcs12->host_cr0);
>  	/*
>  	 * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
>  	 * to apply the same changes to L1's vmcs. We just set cr0 correctly,
> -- 
> 1.7.3.4

-- 
	Gleb.
Re: [PATCH V2 6/6] vhost_net: correctly limit the max pending buffers
On 09/02/2013 02:30 PM, Jason Wang wrote:
> On 09/02/2013 01:56 PM, Michael S. Tsirkin wrote:
>> On Fri, Aug 30, 2013 at 12:29:22PM +0800, Jason Wang wrote:
>>> [...] Tests show about 5%-10% improvement on per-cpu throughput for
>>> guest tx. But a 5% drop on per-cpu transaction rate for a single
>>> session TCP_RR.
>>
>> Any explanation for the drop? single session TCP_RR is unlikely to
>> exceed VHOST_MAX_PEND, correct?
>
> Unlikely to exceed. Rechecked the result; looks like it was not stable
> enough. Will re-test and report.

Re-tested with more iterations; the result shows no difference on the
TCP_RR test and a 5%-10% improvement on per-cpu throughput for guest tx.
Will post V3 soon.
Re: [PATCH 09/12] KVM: MMU: introduce pte-list lockless walker
On 08/30/2013 07:44 PM, Gleb Natapov wrote:
On Thu, Aug 29, 2013 at 08:02:30PM +0800, Xiao Guangrong wrote:
On 08/29/2013 07:33 PM, Xiao Guangrong wrote:
On 08/29/2013 05:31 PM, Gleb Natapov wrote:
On Thu, Aug 29, 2013 at 02:50:51PM +0800, Xiao Guangrong wrote:

After more thinking, I still think rcu_assign_pointer() is unneeded when
an entry is removed. The remove API does not care about the order
between unlinking the entry and the changes to its fields. It is the
caller's responsibility:

- in the case of rcuhlist, the caller uses call_rcu()/synchronize_rcu(),
  etc. to ensure all lookups have exited, so a later change on that
  entry is invisible to the lookups.

- in the case of rculist_nulls, it seems a refcounter is used to
  guarantee the order (see the example in
  Documentation/RCU/rculist_nulls.txt).

- in our case, we allow the lookup to see the deleted desc even while it
  is in the slab cache, being initialized, or being re-added.

BTW is it a good idea? We can access a deleted desc while it is
allocated and initialized to zero by kmem_cache_zalloc(); are we sure we
cannot see a partially initialized desc->sptes[] entry? On a related
note, what about 32-bit systems? They do not have atomic access to
desc->sptes[].

Ah... wait. desc is an array of pointers:

struct pte_list_desc {
	u64 *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;
};

Yep, I misread it to be u64 bits and wondered why we use u64 to store
pointers.

Assigning a pointer is always atomic, but we should carefully initialize
it as you said. I will introduce a constructor for the desc slab cache
which initializes the struct like this:

	for (i = 0; i < PTE_LIST_EXT; i++)
		desc->sptes[i] = NULL;

It is okay. I hope slab does not write anything into allocated memory
internally if a constructor is present.
If only a constructor is present (no SLAB_DESTROY_BY_RCU), slab will
temporarily write the poison value into the memory and then call the
constructor to initialize it again, e.g., in slab.c:

static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep,
				gfp_t flags, void *objp, unsigned long caller)
{
	if (cachep->flags & SLAB_POISON) {
		...
		poison_obj(cachep, objp, POISON_INUSE);
	}
	...
	if (cachep->ctor && cachep->flags & SLAB_POISON)
		cachep->ctor(objp);
	...
}

But SLAB_DESTROY_BY_RCU forces the allocator not to touch the memory;
this is true in our case.

BTW do you know what happens when SLAB debug is enabled and
SLAB_DESTROY_BY_RCU is set?

When SLAB debug is enabled, these 3 flags may be set:

#define SLAB_DEBUG_FREE	0x00000100UL	/* DEBUG: Perform (expensive) checks on free */
#define SLAB_RED_ZONE	0x00000400UL	/* DEBUG: Red zone objs in a cache */
#define SLAB_POISON	0x00000800UL	/* DEBUG: Poison objects */

Only SLAB_POISON can write something into the memory, and ...

Does the poison value get written into a freed object (freed to slab,
but not yet to the page allocator)?

SLAB_POISON is cleared if SLAB_DESTROY_BY_RCU is used.

- In slab, there is this code in __kmem_cache_create():

	if (flags & SLAB_DESTROY_BY_RCU)
		BUG_ON(flags & SLAB_POISON);

- In slub, the code is in calculate_sizes():

	/*
	 * Determine if we can poison the object itself. If the user of
	 * the slab may touch the object after free or before allocation
	 * then we should never poison the object itself.
	 */
	if ((flags & SLAB_POISON) && !(flags & SLAB_DESTROY_BY_RCU) &&
	    !s->ctor)
		s->flags |= __OBJECT_POISON;
	else
		s->flags &= ~__OBJECT_POISON;

- In slob, it seems SLAB debug is not supported.
[PATCH V3 5/6] vhost_net: poll vhost queue after marking DMA is done
We used to poll the vhost queue before marking DMA as done. This is racy
if the vhost thread is woken up before the DMA is marked done, which can
cause the signal to be missed. Fix this by always marking DMA as done
before polling the vhost queue.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
- The patch is needed for stable 3.4+
---
 drivers/vhost/net.c | 9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 3f89dea..8e9dc55 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -308,6 +308,11 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	struct vhost_virtqueue *vq = ubufs->vq;
 	int cnt = atomic_read(&ubufs->kref.refcount);
 
+	/* set len to mark this desc buffers done DMA */
+	vq->heads[ubuf->desc].len = success ?
+		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
+	vhost_net_ubuf_put(ubufs);
+
 	/*
 	 * Trigger polling thread if guest stopped submitting new buffers:
 	 * in this case, the refcount after decrement will eventually reach 1
@@ -318,10 +323,6 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	 */
 	if (cnt <= 2 || !(cnt % 16))
 		vhost_poll_queue(&vq->poll);
-	/* set len to mark this desc buffers done DMA */
-	vq->heads[ubuf->desc].len = success ?
-		VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
-	vhost_net_ubuf_put(ubufs);
 }
 
 /* Expects to be always run from workqueue - which acts as
-- 
1.7.1
[PATCH V3 3/6] vhost: switch to use vhost_add_used_n()
Let vhost_add_used() use vhost_add_used_n() to reduce code duplication.
To avoid the overhead brought by __copy_to_user(), we use put_user()
when a single used element needs to be added.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/vhost.c | 54 ++++++++++++--------------------------------
 1 files changed, 12 insertions(+), 42 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 448efe0..9a9502a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1332,48 +1332,9 @@ EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
  * want to notify the guest, using eventfd. */
 int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
 {
-	struct vring_used_elem __user *used;
+	struct vring_used_elem heads = { head, len };
 
-	/* The virtqueue contains a ring of used buffers.  Get a pointer to the
-	 * next entry in that used ring. */
-	used = &vq->used->ring[vq->last_used_idx % vq->num];
-	if (__put_user(head, &used->id)) {
-		vq_err(vq, "Failed to write used id");
-		return -EFAULT;
-	}
-	if (__put_user(len, &used->len)) {
-		vq_err(vq, "Failed to write used len");
-		return -EFAULT;
-	}
-	/* Make sure buffer is written before we update index. */
-	smp_wmb();
-	if (__put_user(vq->last_used_idx + 1, &vq->used->idx)) {
-		vq_err(vq, "Failed to increment used idx");
-		return -EFAULT;
-	}
-	if (unlikely(vq->log_used)) {
-		/* Make sure data is seen before log. */
-		smp_wmb();
-		/* Log used ring entry write. */
-		log_write(vq->log_base,
-			  vq->log_addr +
-			   ((void __user *)used - (void __user *)vq->used),
-			  sizeof *used);
-		/* Log used index update. */
-		log_write(vq->log_base,
-			  vq->log_addr + offsetof(struct vring_used, idx),
-			  sizeof vq->used->idx);
-		if (vq->log_ctx)
-			eventfd_signal(vq->log_ctx, 1);
-	}
-	vq->last_used_idx++;
-	/* If the driver never bothers to signal in a very long while,
-	 * used index might wrap around. If that happens, invalidate
-	 * signalled_used index we stored. TODO: make sure driver
-	 * signals at least once in 2^16 and remove this. */
-	if (unlikely(vq->last_used_idx == vq->signalled_used))
-		vq->signalled_used_valid = false;
-	return 0;
+	return vhost_add_used_n(vq, &heads, 1);
 }
 EXPORT_SYMBOL_GPL(vhost_add_used);
 
@@ -1387,7 +1348,16 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
 
 	start = vq->last_used_idx % vq->num;
 	used = vq->used->ring + start;
-	if (__copy_to_user(used, heads, count * sizeof *used)) {
+	if (count == 1) {
+		if (__put_user(heads[0].id, &used->id)) {
+			vq_err(vq, "Failed to write used id");
+			return -EFAULT;
+		}
+		if (__put_user(heads[0].len, &used->len)) {
+			vq_err(vq, "Failed to write used len");
+			return -EFAULT;
+		}
+	} else if (__copy_to_user(used, heads, count * sizeof *used)) {
 		vq_err(vq, "Failed to write used");
 		return -EFAULT;
 	}
-- 
1.7.1
[PATCH V3 4/6] vhost_net: determine whether or not to use zerocopy at one time
Currently, even if the packet length is smaller than VHOST_GOODCOPY_LEN,
if upend_idx != done_idx we still set zcopy_used to true and roll back
this choice later. This can be avoided by deciding on zerocopy once, by
checking all conditions at one time beforehand.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 47 ++++++++++++++++-----------------------------
 1 files changed, 20 insertions(+), 27 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8a6dd0d..3f89dea 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -404,43 +404,36 @@ static void handle_tx(struct vhost_net *net)
 			       iov_length(nvq->hdr, s), hdr_size);
 			break;
 		}
-		zcopy_used = zcopy && (len >= VHOST_GOODCOPY_LEN ||
-				       nvq->upend_idx != nvq->done_idx);
+
+		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
+				   && (nvq->upend_idx + 1) % UIO_MAXIOV !=
+				      nvq->done_idx
+				   && vhost_net_tx_select_zcopy(net);
 
 		/* use msg_control to pass vhost zerocopy ubuf info to skb */
 		if (zcopy_used) {
+			struct ubuf_info *ubuf;
+			ubuf = nvq->ubuf_info + nvq->upend_idx;
+
 			vq->heads[nvq->upend_idx].id = head;
-			if (!vhost_net_tx_select_zcopy(net) ||
-			    len < VHOST_GOODCOPY_LEN) {
-				/* copy don't need to wait for DMA done */
-				vq->heads[nvq->upend_idx].len =
-							VHOST_DMA_DONE_LEN;
-				msg.msg_control = NULL;
-				msg.msg_controllen = 0;
-				ubufs = NULL;
-			} else {
-				struct ubuf_info *ubuf;
-				ubuf = nvq->ubuf_info + nvq->upend_idx;
-
-				vq->heads[nvq->upend_idx].len =
-					VHOST_DMA_IN_PROGRESS;
-				ubuf->callback = vhost_zerocopy_callback;
-				ubuf->ctx = nvq->ubufs;
-				ubuf->desc = nvq->upend_idx;
-				msg.msg_control = ubuf;
-				msg.msg_controllen = sizeof(ubuf);
-				ubufs = nvq->ubufs;
-				kref_get(&ubufs->kref);
-			}
+			vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+			ubuf->callback = vhost_zerocopy_callback;
+			ubuf->ctx = nvq->ubufs;
+			ubuf->desc = nvq->upend_idx;
+			msg.msg_control = ubuf;
+			msg.msg_controllen = sizeof(ubuf);
+			ubufs = nvq->ubufs;
+			kref_get(&ubufs->kref);
 			nvq->upend_idx = (nvq->upend_idx + 1) % UIO_MAXIOV;
-		} else
+		} else {
 			msg.msg_control = NULL;
+			ubufs = NULL;
+		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
 			if (zcopy_used) {
-				if (ubufs)
-					vhost_net_ubuf_put(ubufs);
+				vhost_net_ubuf_put(ubufs);
 				nvq->upend_idx = ((unsigned)nvq->upend_idx - 1) %
 					UIO_MAXIOV;
 			}
-- 
1.7.1
[PATCH V3 6/6] vhost_net: correctly limit the max pending buffers
As Michael points out, we used to limit the max pending DMAs to get
better cache utilization. But it was not done correctly, since it was
only done when there were no new buffers submitted from the guest. The
guest can easily exceed the limitation by keeping sending packets. So
this patch moves the check into the main loop. Tests show about 5%-10%
improvement on per-cpu throughput for guest tx.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 18 +++++++-----------
 1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8e9dc55..831eb4f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -363,6 +363,13 @@ static void handle_tx(struct vhost_net *net)
 		if (zcopy)
 			vhost_zerocopy_signal_used(net, vq);
 
+		/* If more outstanding DMAs, queue the work.
+		 * Handle upend_idx wrap around
+		 */
+		if (unlikely((nvq->upend_idx + vq->num - VHOST_MAX_PEND)
+			     % UIO_MAXIOV == nvq->done_idx))
+			break;
+
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
@@ -372,17 +379,6 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			int num_pends;
-
-			/* If more outstanding DMAs, queue the work.
-			 * Handle upend_idx wrap around
-			 */
-			num_pends = likely(nvq->upend_idx >= nvq->done_idx) ?
-				    (nvq->upend_idx - nvq->done_idx) :
-				    (nvq->upend_idx + UIO_MAXIOV -
-				     nvq->done_idx);
-			if (unlikely(num_pends > VHOST_MAX_PEND))
-				break;
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				vhost_disable_notify(&net->dev, vq);
 				continue;
-- 
1.7.1
Re: [PATCH v2 0/4] kvm-unit-tests: Add a series of test cases
Hi Gleb, Paolo and Jan,

Would you please review this series when you can spare some time? Jan
has reviewed it and, of course, further suggestions are welcome.

Arthur

On Thu, Aug 15, 2013 at 7:45 PM, Arthur Chunqi Li <yzt...@gmail.com> wrote:
Add a series of test cases for nested VMX in kvm-unit-tests.

Arthur Chunqi Li (4):
  kvm-unit-tests: VMX: Add test cases for PAT and EFER
  kvm-unit-tests: VMX: Add test cases for CR0/4 shadowing
  kvm-unit-tests: VMX: Add test cases for I/O bitmaps
  kvm-unit-tests: VMX: Add test cases for instruction interception

 lib/x86/vm.h    |   4 +
 x86/vmx.c       |   3 +-
 x86/vmx.h       |  20 +-
 x86/vmx_tests.c | 714 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 736 insertions(+), 5 deletions(-)

-- 
1.7.9.5
Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On 2013-09-02 10:21, Gleb Natapov wrote:
> On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
>> Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the
>
> Not a typo :) That's what Avi asked for during the initial nested VMX
> review: http://markmail.org/message/hhidqyhbo2mrgxxc

Yeah, I should rephrase this.

> But there is at least one transition check that kvm_set_cr0() does that
> should not be done during vmexit emulation, namely the CS.L bit check,
> so I tend to agree that kvm_set_cr0() is not appropriate here, at least
> not as it is.

kvm_set_cr0() is for emulating explicit guest changes. It is not the
proper interface for implicit, vendor-dependent changes like this one.

> But can we skip the other checks kvm_set_cr0() does? For instance, what
> prevents us from loading CR0.PG = 1, EFER.LME = 1 and CR4.PAE = 0
> during nested vmexit? What _should_ prevent it is the vmentry check
> from 26.2.4:
>
>     If the "host address-space size" VM-exit control is 1, the
>     following must hold:
>     - Bit 5 of the CR4 field (corresponding to CR4.PAE) is 1.
>
> But I do not see that we do that check on vmentry. What about NW/CD bit
> checks, or reserved bits checks? 27.5.1 says:
>
>     The following bits are not modified: For CR0, ET, CD, NW; bits
>     63:32 (on processors that support Intel 64 architecture), 28:19,
>     17, and 15:6; and any bits that are fixed in VMX operation (see
>     Section 23.8).
>
> But again, the current vmexit code does not emulate this properly and
> just sets everything from host_cr0. vmentry should also preserve all
> those bits, but it looks like it doesn't either.

Yes, there is surely more to improve. Do you think the lacking checks
can cause trouble for L0, or is this just imprecise emulation that can
be addressed separately?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
[PATCH V3 2/6] vhost_net: use vhost_add_used_and_signal_n() in vhost_zerocopy_signal_used()
We tend to batch the used adding and signaling in
vhost_zerocopy_callback(), which may result in more than 100 used
buffers being updated in vhost_zerocopy_signal_used() in some cases. So
switch to using vhost_add_used_and_signal_n() to avoid multiple calls to
vhost_add_used_and_signal(). This means many fewer used-index updates
and memory barriers. A 2% performance improvement was seen on a netperf
TCP_RR test.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 280ee66..8a6dd0d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -281,7 +281,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 {
 	struct vhost_net_virtqueue *nvq =
 		container_of(vq, struct vhost_net_virtqueue, vq);
-	int i;
+	int i, add;
 	int j = 0;
 
 	for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
@@ -289,14 +289,17 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
 			vhost_net_tx_err(net);
 		if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
 			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
-			vhost_add_used_and_signal(vq->dev, vq,
-						  vq->heads[i].id, 0);
 			++j;
 		} else
 			break;
 	}
-	if (j)
-		nvq->done_idx = i;
+	while (j) {
+		add = min(UIO_MAXIOV - nvq->done_idx, j);
+		vhost_add_used_and_signal_n(vq->dev, vq,
+					    &vq->heads[nvq->done_idx], add);
+		nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
+		j -= add;
+	}
 }
 
 static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
-- 
1.7.1
[PATCH V3 1/6] vhost_net: make vhost_zerocopy_signal_used() return void
None of its callers uses its return value, so let it return void.

Signed-off-by: Jason Wang <jasow...@redhat.com>
---
 drivers/vhost/net.c | 5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 969a859..280ee66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -276,8 +276,8 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
  * of used idx. Once lower device DMA done contiguously, we will signal KVM
  * guest used idx.
  */
-static int vhost_zerocopy_signal_used(struct vhost_net *net,
-				      struct vhost_virtqueue *vq)
+static void vhost_zerocopy_signal_used(struct vhost_net *net,
+				       struct vhost_virtqueue *vq)
 {
 	struct vhost_net_virtqueue *nvq =
 		container_of(vq, struct vhost_net_virtqueue, vq);
@@ -297,7 +297,6 @@ static int vhost_zerocopy_signal_used(struct vhost_net *net,
 	}
 	if (j)
 		nvq->done_idx = i;
-	return j;
 }
 
 static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
-- 
1.7.1
[Bug 60833] New: Window guest time passed very quickly, when restore vm that saved 10 minutes ago.
https://bugzilla.kernel.org/show_bug.cgi?id=60833

            Bug ID: 60833
           Summary: Window guest time passed very quickly, when restore vm
                    that saved 10 minutes ago.
           Product: Virtualization
           Version: unspecified
    Kernel Version: qemu-kvm-0.12.1.2-2.355.el6.x86_64
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: kvm
          Assignee: virtualization_...@kernel-bugs.osdl.org
          Reporter: y...@jhinno.com
        Regression: No

Recently, I found a problem with Windows guests (both WinXP and Win7):
guest time passes very quickly after restoring a VM that was saved an
hour ago.

KVM server OS: Red Hat Enterprise Linux Server release 6.4

Reproduce steps:
1. Install a new Windows XP VM named winxp.
2. Power on the VM.
3. Save the VM with the command "virsh save winxp /tmp/winxp.save".
4. Wait 10 minutes.
5. Restore the VM with the command "virsh restore /tmp/winxp.save".
6. Note the clock on the task bar: time passes about 10 times faster
   than real time.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
[PATCH V3 0/6] vhost code cleanup and minor enhancement
This series tries to unify and simplify the vhost code, especially for
zerocopy. With this series, a 5% - 10% improvement in per-cpu throughput
was seen during a netperf guest sending test.

Please review.

Changes from V2:
- Typo fixes and code style fixes
- Add the performance gain to the commit log of patch 2/6
- Retest and update the result in patch 6/6

Changes from V1:
- Fix the zerocopy enabling check by changing the check of
  upend_idx != done_idx to (upend_idx + 1) % UIO_MAXIOV == done_idx.
- Switch to using put_user() in __vhost_add_used_n() if there's only one
  used descriptor
- Keep the max pending check based on Michael's suggestion.

Jason Wang (6):
  vhost_net: make vhost_zerocopy_signal_used() return void
  vhost_net: use vhost_add_used_and_signal_n() in
    vhost_zerocopy_signal_used()
  vhost: switch to use vhost_add_used_n()
  vhost_net: determine whether or not to use zerocopy at one time
  vhost_net: poll vhost queue after marking DMA is done
  vhost_net: correctly limit the max pending buffers

 drivers/vhost/net.c   | 92 ++--
 drivers/vhost/vhost.c | 54 ++--
 2 files changed, 54 insertions(+), 92 deletions(-)
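The V1 change to the zerocopy enabling check is the classic one-slot-sacrifice trick for index-based rings: `upend_idx != done_idx` only detects a non-empty ring, while `(upend_idx + 1) % UIO_MAXIOV == done_idx` detects a full one. A toy model (with `RING_SIZE` standing in for UIO_MAXIOV):

```c
/* With bare producer/consumer indices, "empty" and "full" would both
 * look like upend_idx == done_idx if the producer could fully catch up
 * with the consumer, so one slot is sacrificed: the ring counts as full
 * when advancing upend_idx once more would make the indices equal. */
#define RING_SIZE 8  /* stands in for UIO_MAXIOV */

static int ring_full(int upend_idx, int done_idx)
{
    return (upend_idx + 1) % RING_SIZE == done_idx;
}

static int ring_empty(int upend_idx, int done_idx)
{
    return upend_idx == done_idx;
}
```

So with this check, at most RING_SIZE - 1 zerocopy buffers can be pending at once, and equal indices unambiguously mean empty.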
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
> Page tables in a read-only memory slot will currently cause a triple
> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>
> OVMF uses such a page table; however, real hardware seems to be fine with
> that as long as the accessed/dirty bits are set. Save whether the slot is
> readonly, and later check it when updating the accessed and dirty bits.

Paolo, do you know why OVMF is using readonly memory like this?

AFAIK, the fault triggered by this kind of access can hardly be fixed by
userspace, since the fault is triggered by page-table walking, not by the
current instruction. Do you have any idea how to let userspace emulate it
properly?
[Bug 60833] Window guest time passed very quickly, when restore vm that saved 10 minutes ago.
https://bugzilla.kernel.org/show_bug.cgi?id=60833

Gleb <g...@redhat.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |g...@redhat.com

--- Comment #1 from Gleb <g...@redhat.com> ---
This is how it is supposed to work. If you do not want this behaviour,
change the time catchup policy in libvirt. BTW this bugzilla is for
upstream issues, not vendor kernels.
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>> Page tables in a read-only memory slot will currently cause a triple
>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>
>> OVMF uses such a page table; however, real hardware seems to be fine with
>> that as long as the accessed/dirty bits are set. Save whether the slot is
>> readonly, and later check it when updating the accessed and dirty bits.
>
> Paolo, do you know why OVMF is using readonly memory like this?
>
Just a guess, but perhaps they want to move to paging mode as early as
possible, even before the memory controller is fully initialized.

> AFAIK, the fault triggered by this kind of access can hardly be fixed by
> userspace, since the fault is triggered by page-table walking, not by the
> current instruction. Do you have any idea how to let userspace emulate it
> properly?
>
Not sure what userspace you mean here, but there shouldn't be a fault in
the first place if ROM page tables have the accessed/dirty bits set, and
they do.

-- 
	Gleb.
[PATCH] kvm-unit-tests: VMX: Add the framework of EPT
Add a framework of EPT in nested VMX testing, including a set of
functions to construct and read EPT paging structures and a simple
read/write test of EPT remapping from guest to host.

Signed-off-by: Arthur Chunqi Li <yzt...@gmail.com>
---
 x86/vmx.c       | 132 ++--
 x86/vmx.h       |  76 +++
 x86/vmx_tests.c | 156 +++
 3 files changed, 360 insertions(+), 4 deletions(-)

diff --git a/x86/vmx.c b/x86/vmx.c
index ca36d35..a156b71 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -143,6 +143,132 @@ asm(
 	"call hypercall\n\t"
 );
 
+/* EPT paging structure related functions */
+/* install_ept_entry : Install a page to a given level in EPT
+		@pml4 : addr of pml4 table
+		@pte_level : level of PTE to set
+		@guest_addr : physical address of guest
+		@pte : pte value to set
+		@pt_page : address of page table, NULL for a new page
+ */
+void install_ept_entry(unsigned long *pml4,
+		int pte_level,
+		unsigned long guest_addr,
+		unsigned long pte,
+		unsigned long *pt_page)
+{
+	int level;
+	unsigned long *pt = pml4;
+	unsigned offset;
+
+	for (level = EPT_PAGE_LEVEL; level > pte_level; --level) {
+		offset = (guest_addr >> ((level-1) * EPT_PGDIR_WIDTH + 12))
+				& EPT_PGDIR_MASK;
+		if (!(pt[offset] & (EPT_RA | EPT_WA | EPT_EA))) {
+			unsigned long *new_pt = pt_page;
+			if (!new_pt)
+				new_pt = alloc_page();
+			else
+				pt_page = 0;
+			memset(new_pt, 0, PAGE_SIZE);
+			pt[offset] = virt_to_phys(new_pt)
+					| EPT_RA | EPT_WA | EPT_EA;
+		}
+		pt = phys_to_virt(pt[offset] & 0xffffffffff000ull);
+	}
+	offset = ((unsigned long)guest_addr >> ((level-1) *
+			EPT_PGDIR_WIDTH + 12)) & EPT_PGDIR_MASK;
+	pt[offset] = pte;
+}
+
+/* Map a page, @perm is the permission of the page */
+void install_ept(unsigned long *pml4,
+		unsigned long phys,
+		unsigned long guest_addr,
+		u64 perm)
+{
+	install_ept_entry(pml4, 1, guest_addr, (phys & PAGE_MASK) | perm, 0);
+}
+
+/* Map a 1G-size page */
+void install_1g_ept(unsigned long *pml4,
+		unsigned long phys,
+		unsigned long guest_addr,
+		u64 perm)
+{
+	install_ept_entry(pml4, 3, guest_addr,
+			(phys & PAGE_MASK) | perm | EPT_LARGE_PAGE, 0);
+}
+
+/* Map a 2M-size page */
+void install_2m_ept(unsigned long *pml4,
+		unsigned long phys,
+		unsigned long guest_addr,
+		u64 perm)
+{
+	install_ept_entry(pml4, 2, guest_addr,
+			(phys & PAGE_MASK) | perm | EPT_LARGE_PAGE, 0);
+}
+
+/* setup_ept_range : Setup a range of 1:1 mapped page to EPT paging structure.
+		@start : start address of guest page
+		@len : length of address to be mapped
+		@map_1g : whether 1G page map is used
+		@map_2m : whether 2M page map is used
+		@perm : permission for every page
+ */
+int setup_ept_range(unsigned long *pml4, unsigned long start,
+		unsigned long len, int map_1g, int map_2m, u64 perm)
+{
+	u64 phys = start;
+	u64 max = (u64)len + (u64)start;
+
+	if (map_1g) {
+		while (phys + PAGE_SIZE_1G <= max) {
+			install_1g_ept(pml4, phys, phys, perm);
+			phys += PAGE_SIZE_1G;
+		}
+	}
+	if (map_2m) {
+		while (phys + PAGE_SIZE_2M <= max) {
+			install_2m_ept(pml4, phys, phys, perm);
+			phys += PAGE_SIZE_2M;
+		}
+	}
+	while (phys + PAGE_SIZE <= max) {
+		install_ept(pml4, phys, phys, perm);
+		phys += PAGE_SIZE;
+	}
+	return 0;
+}
+
+/* get_ept_pte : Get the PTE of a given level in EPT,
+   @level == 1 means get the latest level */
+unsigned long get_ept_pte(unsigned long *pml4,
+		unsigned long guest_addr, int level)
+{
+	int l;
+	unsigned long *pt = pml4, pte;
+	unsigned offset;
+
+	for (l = EPT_PAGE_LEVEL; l > 1; --l) {
+		offset = (guest_addr >> (((l-1) * EPT_PGDIR_WIDTH) + 12))
+				& EPT_PGDIR_MASK;
+		pte = pt[offset];
+		if (!(pte & (EPT_RA | EPT_WA | EPT_EA)))
+			return 0;
+		if (l == level)
+			return pte;
+		if (l < 4 && (pte & EPT_LARGE_PAGE))
+			return pte;
+		pt = (unsigned long
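The offset computation that install_ept_entry() and get_ept_pte() share can be isolated into a tiny, host-testable helper. This is a standalone sketch, not part of the patch: it assumes the patch's 4-level layout with 9 index bits per level (EPT_PGDIR_WIDTH = 9) above a 4 KiB page offset.

```c
#include <stdint.h>

#define EPT_PGDIR_WIDTH 9      /* 9 index bits per level, as in the patch */
#define EPT_PGDIR_MASK  511    /* (1 << EPT_PGDIR_WIDTH) - 1 */

/* Index into the EPT table at a given level (4 = PML4 ... 1 = PTE).
 * Level N consumes the 9 guest-physical address bits starting at
 * bit (N-1)*9 + 12, mirroring the offset computation in
 * install_ept_entry(). */
static unsigned ept_offset(uint64_t guest_addr, int level)
{
    return (guest_addr >> ((level - 1) * EPT_PGDIR_WIDTH + 12))
            & EPT_PGDIR_MASK;
}
```

For example, an address built as `(1 << 39) | (2 << 30) | (3 << 21) | (4 << 12)` indexes entry 1 of the PML4, entry 2 of the PDPT, entry 3 of the PD, and entry 4 of the PT.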
Re: [PATCH v3 1/6] KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
On Mon, Sep 02, 2013 at 11:06:53AM +0200, Jan Kiszka wrote:
> On 2013-09-02 10:21, Gleb Natapov wrote:
>> On Thu, Aug 08, 2013 at 04:26:28PM +0200, Jan Kiszka wrote:
>>> Likely a typo, but a fatal one as kvm_set_cr0 performs checks on the
>> Not a typo :) That is what Avi asked for during the initial nested VMX
>> review: http://markmail.org/message/hhidqyhbo2mrgxxc
>
> Yeah, I should rephrase this.
>
>> But there is at least one transition check that kvm_set_cr0() does that
>> should not be done during vmexit emulation, namely the CS.L bit check,
>> so I tend to agree that kvm_set_cr0() is not appropriate here, at least
>> not as it is.
>
> kvm_set_cr0() is for emulating explicit guest changes. It is not the
> proper interface for implicit, vendor-dependent changes like this one.
>
Agree. The problem is that we do not have a proper interface for
implicit changes like this one (though I do not see why it is
vendor-dependent; SVM also restores host state in a similar way).

>> But can we skip the other checks kvm_set_cr0() does? [...] But again,
>> the current vmexit code does not emulate this properly and just sets
>> everything from host_cr0. vmentry should also preserve all those bits,
>> but it looks like it doesn't either.
>
> Yes, there is surely more to improve. Do you think the lacking checks
> can cause trouble for L0, or is this just imprecise emulation that can
> be addressed separately?
>
The lacking checks may cause L0 to fail guest entry, which will trigger
an internal error. If it is exploitable by L0 userspace it is a serious
problem; if only the L0 kernel can trigger it, then less so. I remember
Avi was concerned that KVM code may depend on all registers being
consistent, otherwise it can be exploited. I cannot prove or disprove
this theory :), but if it is the case then even the L0 kernel case is
problematic.

-- 
	Gleb.
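The SDM 26.2.4 host-state check Gleb quotes can be written as a tiny predicate. This is a toy illustration of the rule being discussed, not KVM code; `CR4_PAE` mirrors the architectural bit position of CR4.PAE.

```c
#define CR4_PAE (1ul << 5)  /* architectural position of CR4.PAE */

/* SDM 26.2.4: if the "host address-space size" VM-exit control is 1,
 * bit 5 of the host CR4 field (CR4.PAE) must be 1. Without this check,
 * a nested vmexit could load the inconsistent combination
 * CR0.PG = 1, EFER.LME = 1, CR4.PAE = 0 that the thread worries about. */
static int host_state_valid(int host_addr_space_size,
                            unsigned long host_cr4)
{
    if (host_addr_space_size && !(host_cr4 & CR4_PAE))
        return 0;
    return 1;
}
```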
Re: [PATCH] kvm-unit-tests: VMX: Add the framework of EPT
There must be some minor revisions still to be done in this patch, so
this is mainly an RFC mail. Besides, I'm not quite clear on what we
should test in the nested EPT modules, and I bet the writers of nested
EPT must have ideas on how to continue and refine this testing part.
Any suggestions on which parts to test and how to test nested EPT are
welcome. Please help me CC EPT-related guys if anyone knows them.

Thanks,
Arthur

On Mon, Sep 2, 2013 at 5:26 PM, Arthur Chunqi Li <yzt...@gmail.com> wrote:
> Add a framework of EPT in nested VMX testing, including a set of
> functions to construct and read EPT paging structures and a simple
> read/write test of EPT remapping from guest to host.
>
> Signed-off-by: Arthur Chunqi Li <yzt...@gmail.com>
> [...]
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On 09/01/2013 05:17 PM, Gleb Natapov wrote:
> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
>> Page tables in a read-only memory slot will currently cause a triple
>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>
>> OVMF uses such a page table; however, real hardware seems to be fine with
>> that as long as the accessed/dirty bits are set. Save whether the slot is
>> readonly, and later check it when updating the accessed and dirty bits.
>
> The fix looks OK to me, but some comments below.
>
>> Cc: sta...@vger.kernel.org
>> Cc: g...@redhat.com
>> Cc: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
>> ---
>> 	CCing to stable@ since the regression was introduced with
>> 	support for readonly memory slots.
>>
>>  arch/x86/kvm/paging_tmpl.h | 7 ++++++-
>>  include/linux/kvm_host.h   | 1 +
>>  virt/kvm/kvm_main.c        | 14 +++++++++++++-
>>  3 files changed, 16 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 0433301..dadc5c0 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -99,6 +99,7 @@ struct guest_walker {
>>  	pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
>>  	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
>>  	pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
>> +	bool pte_writable[PT_MAX_FULL_LEVELS];
>>  	unsigned pt_access;
>>  	unsigned pte_access;
>>  	gfn_t gfn;
>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>>  		if (pte == orig_pte)
>>  			continue;
>>  
>> +		if (unlikely(!walker->pte_writable[level - 1]))
>> +			return -EACCES;
>> +
>>  		ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
>>  		if (ret)
>>  			return ret;
>> @@ -309,7 +313,8 @@ retry_walk:
>>  			goto error;
>>  		real_gfn = gpa_to_gfn(real_gfn);
>>  
>> -		host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
>> +		host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
>> +					    &walker->pte_writable[walker->level - 1]);
>
> The use of gfn_to_hva_read is misleading. The code can still write into
> gfn. Let's rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
> to gfn_to_hva_write().

Yes, I agree. This makes me think: are there other places where
gfn_to_hva() was used, but gfn_to_hva_prot() should have been?

> - kvm_host_page_size() looks incorrect. We never use huge pages to map
>   read-only memory slots currently.

It only checks whether the gfn has been mapped. I think we can use
gfn_to_hva_read() instead; the real permission will be checked when we
translate the gfn to a pfn.

> - kvm_handle_bad_page() also looks incorrect and may cause an incorrect
>   address to be reported to userspace.

I have no idea on this point. kvm_handle_bad_page() is called when it
failed to translate the target gfn to a pfn, then the emulator can
detect the error on the target gfn properly, no? Or did I misunderstand
your meaning?

> - kvm_setup_async_pf() is also incorrect. It makes all page faults on
>   a read-only slot synchronous.
>
> - The kvm_vm_fault() one looks OK since the function assumes write-only
>   slots, but it is obsolete and should be deleted anyway.
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
> On 09/01/2013 05:17 PM, Gleb Natapov wrote:
>> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
>>> Page tables in a read-only memory slot will currently cause a triple
>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>> [...]
>>
>> The use of gfn_to_hva_read is misleading. The code can still write into
>> gfn. Let's rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
>> to gfn_to_hva_write().
>
> Yes, I agree. This makes me think: are there other places where
> gfn_to_hva() was used, but gfn_to_hva_prot() should have been?
>
>> - kvm_host_page_size() looks incorrect. We never use huge pages to map
>>   read-only memory slots currently.
>
> It only checks whether the gfn has been mapped. I think we can use
> gfn_to_hva_read() instead; the real permission will be checked when we
> translate the gfn to a pfn.
>
Yes, all the cases I listed should be changed to use a function that
looks at both regular and RO slots.

>> - kvm_handle_bad_page() also looks incorrect and may cause an incorrect
>>   address to be reported to userspace.
>
> I have no idea on this point. kvm_handle_bad_page() is called when it
> failed to translate the target gfn to a pfn, then the emulator can
> detect the error on the target gfn properly, no? Or did I misunderstand
> your meaning?
>
I am talking about the following code:

	if (pfn == KVM_PFN_ERR_HWPOISON) {
		kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
		return 0;
	}

pfn will be KVM_PFN_ERR_HWPOISON if gfn is backed by faulty memory. We
need to report the linear address of the faulty memory to userspace
here, but if gfn is in a RO slot gfn_to_hva() will not return the
correct address.

-- 
	Gleb.
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On 09/02/2013 05:25 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>>> Page tables in a read-only memory slot will currently cause a triple
>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>>
>>> OVMF uses such a page table; however, real hardware seems to be fine with
>>> that as long as the accessed/dirty bits are set. Save whether the slot is
>>> readonly, and later check it when updating the accessed and dirty bits.
>>
>> Paolo, do you know why OVMF is using readonly memory like this?
>>
> Just a guess, but perhaps they want to move to paging mode as early as
> possible, even before the memory controller is fully initialized.
>
>> AFAIK, the fault triggered by this kind of access can hardly be fixed by
>> userspace, since the fault is triggered by page-table walking, not by the
>> current instruction. Do you have any idea how to let userspace emulate it
>> properly?
>
> Not sure what userspace you mean here, but there shouldn't be a fault
> in the

I just wonder how to fix this kind of fault. The current patch returns
-EACCES, but that will crash the guest. I think we'd better let
userspace fix this error (let userspace set the D/A bits).

> first place if ROM page tables have the accessed/dirty bits set, and
> they do.

Yes, so we can not call x86_emulate_instruction() to fix this fault
(that function emulates the access in the first place). Need we
directly return an MMIO exit to userspace when we meet this fault? What
happens if this fault on page-table walking is triggered inside
x86_emulate_instruction()?
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On 09/02/2013 05:49 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
>> [...]
>> I have no idea on this point. kvm_handle_bad_page() is called when it
>> failed to translate the target gfn to a pfn, then the emulator can
>> detect the error on the target gfn properly, no? Or did I misunderstand
>> your meaning?
>
> I am talking about the following code:
>
> 	if (pfn == KVM_PFN_ERR_HWPOISON) {
> 		kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
> 		return 0;
> 	}
>
> pfn will be KVM_PFN_ERR_HWPOISON if gfn is backed by faulty memory. We
> need to report the linear address of the faulty memory to userspace
> here, but if gfn is in a RO slot gfn_to_hva() will not return the
> correct address.

Got it, thanks for your explanation.

BTW, if you and Paolo are busy on other things, I am happy to fix these
issues. :)
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
> On 09/02/2013 05:25 PM, Gleb Natapov wrote:
>> [...]
>> Not sure what userspace you mean here, but there shouldn't be a fault
>> in the
>
> I just wonder how to fix this kind of fault. The current patch returns
> -EACCES, but that will crash the guest. I think we'd better let
> userspace fix this error (let userspace set the D/A bits).
>
Ugh, this is not good. Missed that. Don't know what real HW will do
here, but the easy thing for us to do would be to just return success.

>> first place if ROM page tables have the accessed/dirty bits set, and
>> they do.
>
> Yes, so we can not call x86_emulate_instruction() to fix this fault
> (that function emulates the access in the first place). Need we
> directly return an MMIO exit to userspace when we meet this fault?
> What happens if this fault on page-table walking is triggered inside
> x86_emulate_instruction()?
>
I think we should not return an MMIO exit to userspace. Either ignore
the write attempt or kill the guest.

-- 
	Gleb.
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On Mon, Sep 02, 2013 at 06:05:10PM +0800, Xiao Guangrong wrote:
> On 09/02/2013 05:49 PM, Gleb Natapov wrote:
>> On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
>>> On 09/01/2013 05:17 PM, Gleb Natapov wrote:
>>>> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
>>>>> Page tables in a read-only memory slot will currently cause a triple
>>>>> fault because the page walker uses gfn_to_hva and it fails on such a
>>>>> slot.  OVMF uses such a page table; however, real hardware seems to be
>>>>> fine with that as long as the accessed/dirty bits are set.  Save
>>>>> whether the slot is readonly, and later check it when updating the
>>>>> accessed and dirty bits.
>>>> The fix looks OK to me, but some comments below.
>>>>> Cc: sta...@vger.kernel.org
>>>>> Cc: g...@redhat.com
>>>>> Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
>>>>> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
>>>>> ---
>>>>> 	CCing to stable@ since the regression was introduced with
>>>>> 	support for readonly memory slots.
>>>>>
>>>>>  arch/x86/kvm/paging_tmpl.h | 7 ++-
>>>>>  include/linux/kvm_host.h   | 1 +
>>>>>  virt/kvm/kvm_main.c        | 14 +-
>>>>>  3 files changed, 16 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>>>> index 0433301..dadc5c0 100644
>>>>> --- a/arch/x86/kvm/paging_tmpl.h
>>>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>>>> @@ -99,6 +99,7 @@ struct guest_walker {
>>>>>  	pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
>>>>>  	gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
>>>>>  	pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
>>>>> +	bool pte_writable[PT_MAX_FULL_LEVELS];
>>>>>  	unsigned pt_access;
>>>>>  	unsigned pte_access;
>>>>>  	gfn_t gfn;
>>>>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>>>>>  		if (pte == orig_pte)
>>>>>  			continue;
>>>>>
>>>>> +		if (unlikely(!walker->pte_writable[level - 1]))
>>>>> +			return -EACCES;
>>>>> +
>>>>>  		ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
>>>>>  		if (ret)
>>>>>  			return ret;
>>>>> @@ -309,7 +313,8 @@ retry_walk:
>>>>>  			goto error;
>>>>>
>>>>>  		real_gfn = gpa_to_gfn(real_gfn);
>>>>> -		host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
>>>>> +		host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
>>>>> +					    &walker->pte_writable[walker->level - 1]);
>>>> The use of gfn_to_hva_read is misleading.  The code can still write
>>>> into gfn.  Let's rename gfn_to_hva_read to gfn_to_hva_prot() and
>>>> gfn_to_hva() to gfn_to_hva_write().
>>> Yes.  I agreed.
>>>> This makes me think: are there other places where gfn_to_hva() was
>>>> used, but gfn_to_hva_prot() should have been?
>>>>
>>>> - kvm_host_page_size() looks incorrect.  We never use huge pages to
>>>>   map read-only memory slots currently.
>>> It only checks whether the gfn has been mapped; I think we can use
>>> gfn_to_hva_read() instead.  The real permission will be checked when we
>>> translate the gfn to a pfn.
>> Yes, all the cases I listed should be changed to use a function that
>> looks at both regular and RO slots.
>>>> - kvm_handle_bad_page() also looks incorrect and may cause an
>>>>   incorrect address to be reported to userspace.
>>> I have no idea on this point.  kvm_handle_bad_page() is called when it
>>> failed to translate the target gfn to a pfn, then the emulator can
>>> detect the error on the target gfn properly, no?  Or did I
>>> misunderstand your meaning?
>> I am talking about the following code:
>>
>> 	if (pfn == KVM_PFN_ERR_HWPOISON) {
>> 		kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
>> 		return 0;
>> 	}
>>
>> pfn will be KVM_PFN_ERR_HWPOISON if gfn is backed by faulty memory.  We
>> need to report the linear address of the faulty memory to userspace
>> here, but if gfn is in a RO slot, gfn_to_hva() will not return the
>> correct address.
> Got it, thanks for your explanation.
>
> BTW, if you and Paolo are busy on other things, i am happy to fix these
> issues. :)
I am busy with reviews mostly :).  If you are not too busy with lockless
write protection then fine with me.  Let's wait for Paolo's input on the
proposed API though.

--
	Gleb.
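The rename discussed above (gfn_to_hva_prot() returning the host address while reporting writability out-of-band) can be sketched as follows. This is an illustrative stand-in, not kernel code: the memslot struct, field names, and the flat lookup loop are simplifications invented here; KVM's real slot search and HVA computation differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define BAD_HVA UINT64_MAX

/* Toy memslot table; field names are modeled on KVM's struct
 * kvm_memory_slot, but this is an illustrative stand-in. */
struct memslot {
    gfn_t    base_gfn;
    uint64_t npages;
    uint64_t userspace_addr;
    bool     readonly;
};

/* Sketch of the proposed gfn_to_hva_prot(): translate through BOTH
 * regular and read-only slots, and report writability so the page
 * walker can refuse A/D-bit updates on read-only mappings instead of
 * failing the translation outright. */
static uint64_t gfn_to_hva_prot(const struct memslot *slots, size_t n,
                                gfn_t gfn, bool *writable)
{
    for (size_t i = 0; i < n; i++) {
        const struct memslot *s = &slots[i];
        if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages) {
            if (writable)
                *writable = !s->readonly;
            return s->userspace_addr + ((gfn - s->base_gfn) << 12);
        }
    }
    return BAD_HVA;  /* no slot at all: caller injects a fault */
}
```

The point of the split is visible in the signature: translation success and writability are independent results, which is exactly what the page walker needs to store into walker->pte_writable[].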
[PATCH] KVM: nEPT: reset PDPTR register cache on nested vmentry emulation
After nested vmentry a stale cache can be used to reload L2 PDPTR
pointers, which will cause the L2 guest to fail.  Fix it by invalidating
the cache on nested vmentry emulation.

https://bugzilla.kernel.org/show_bug.cgi?id=60830

Signed-off-by: Gleb Natapov g...@redhat.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 57b4e12..6f69aac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7765,6 +7765,10 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 		vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1);
 		vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2);
 		vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3);
+		__clear_bit(VCPU_EXREG_PDPTR,
+			    (unsigned long *)&vcpu->arch.regs_avail);
+		__clear_bit(VCPU_EXREG_PDPTR,
+			    (unsigned long *)&vcpu->arch.regs_dirty);
 	}

 	kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->guest_rsp);

--
	Gleb.
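The availability-bitmap pattern the fix relies on can be shown in miniature. The sketch below is a toy model, not KVM code: the struct, register index, and helper names are invented for illustration; only the invalidation idiom (clear the "available" bit so the next access reloads from the authoritative source) mirrors what the patch does with VCPU_EXREG_PDPTR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register index, modeled on KVM's VCPU_EXREG_PDPTR. */
enum { REG_PDPTR = 0, NR_REGS = 1 };

struct vcpu {
    uint64_t cached[NR_REGS]; /* cached register values */
    uint64_t avail;           /* bitmap: cache entry is valid */
    uint64_t vmcs_pdptr;      /* authoritative value; stands in for the VMCS */
    int reloads;              /* counts cache misses, for demonstration */
};

static uint64_t read_pdptr(struct vcpu *v)
{
    if (!(v->avail & (UINT64_C(1) << REG_PDPTR))) { /* cache miss: reload */
        v->cached[REG_PDPTR] = v->vmcs_pdptr;
        v->avail |= UINT64_C(1) << REG_PDPTR;
        v->reloads++;
    }
    return v->cached[REG_PDPTR];
}

/* What the fix adds on nested vmentry: drop the cached entry so the next
 * access reloads the just-written L2 value instead of stale L1 data. */
static void nested_vmentry(struct vcpu *v, uint64_t new_pdptr)
{
    v->vmcs_pdptr = new_pdptr;
    v->avail &= ~(UINT64_C(1) << REG_PDPTR);
}
```

Without the `avail` bit being cleared, the second read after vmentry would silently return the stale value, which is the failure mode the 32-bit L2 guest hit.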
Re: [Qemu-devel] [PATCH] i386: forward CPUID cache leaves when -cpu host is used
Hi,

target-i386: please.

On 27.08.2013 22:38, Benoît Canet wrote:
> Some users running cpu intensive tasks checking the cache CPUID leaves
> at startup and making decisions based on the result reported that the
> guest was not reflecting the host CPUID leaves when -cpu host is used.
>
> This patch fixes this.
>
> Signed-off-by: Benoit Canet ben...@irqsave.net
> ---
>  target-i386/cpu.c | 19 +++
>  target-i386/cpu.h |  1 +
>  2 files changed, 20 insertions(+)
>
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 42c5de0..2c8eaf7 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -374,6 +374,7 @@ typedef struct x86_def_t {
>      int stepping;
>      FeatureWordArray features;
>      char model_id[48];
> +    bool fwd_cpuid_cache_leaves;
>  } x86_def_t;
>
>  #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
> @@ -1027,6 +1028,7 @@ static void kvm_cpu_fill_host(x86_def_t *x86_cpu_def)
>      assert(kvm_enabled());
>
>      x86_cpu_def->name = "host";
> +    x86_cpu_def->fwd_cpuid_cache_leaves = true;
>      host_cpuid(0x0, 0, &eax, &ebx, &ecx, &edx);
>      x86_cpu_vendor_words2str(x86_cpu_def->vendor, ebx, edx, ecx);
> @@ -1776,6 +1778,7 @@ static void cpu_x86_register(X86CPU *cpu, const char *name, Error **errp)
>      env->features[FEAT_C000_0001_EDX] = def->features[FEAT_C000_0001_EDX];
>      env->features[FEAT_7_0_EBX] = def->features[FEAT_7_0_EBX];
>      env->cpuid_xlevel2 = def->xlevel2;
> +    env->fwd_cpuid_cache_leaves = def->fwd_cpuid_cache_leaves;
>
>      object_property_set_str(OBJECT(cpu), def->model_id, "model-id", errp);
>  }
> @@ -1949,6 +1952,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          }
>          break;
>      case 2:
> +        if (env->fwd_cpuid_cache_leaves) {
> +            host_cpuid(0x2, 0, eax, ebx, ecx, edx);
> +            break;
> +        }
>          /* cache info: needed for Pentium Pro compatibility */
>          *eax = 1;
>          *ebx = 0;
> @@ -1956,6 +1963,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          *edx = 0x2c307d;
>          break;
>      case 4:
> +        if (env->fwd_cpuid_cache_leaves) {
> +            host_cpuid(0x4, count, eax, ebx, ecx, edx);
> +            break;
> +        }
>          /* cache info: needed for Core compatibility */
>          if (cs->nr_cores > 1) {
>              *eax = (cs->nr_cores - 1) << 26;
> @@ -2102,6 +2113,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          break;
>      case 0x80000005:
>          /* cache info (L1 cache) */
> +        if (env->fwd_cpuid_cache_leaves) {
> +            host_cpuid(0x80000005, 0, eax, ebx, ecx, edx);
> +            break;
> +        }
>          *eax = 0x01ff01ff;
>          *ebx = 0x01ff01ff;
>          *ecx = 0x40020140;
> @@ -2109,6 +2124,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
>          break;
>      case 0x80000006:
>          /* cache info (L2 cache) */
> +        if (env->fwd_cpuid_cache_leaves) {
> +            host_cpuid(0x80000006, 0, eax, ebx, ecx, edx);
> +            break;
> +        }
>          *eax = 0;
>          *ebx = 0x42004200;
>          *ecx = 0x02008140;

This hunk may trivially conflict with Eduardo's cache flags cleanup.

> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 8a3d0fd..1ec32fa 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -865,6 +865,7 @@ typedef struct CPUX86State {
>      bool tsc_valid;
>      int tsc_khz;
>      void *kvm_xsave_buf;
> +    bool fwd_cpuid_cache_leaves;
>
>      /* in order to simplify APIC support, we leave this pointer to the
>         user */

Please place the field in X86CPU instead and document it.

Otherwise patch looks okay to me on a brief sight; but since this is
about -cpu host I would prefer this to go through uq/master once fixed or
at least to get some acks.

Regards,
Andreas

--
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
Re: [Qemu-devel] [PATCH] i386: forward CPUID cache leaves when -cpu host is used
On Mon, Sep 02, 2013 at 02:55:36PM +0200, Andreas Färber wrote:
[...]
>> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
>> index 8a3d0fd..1ec32fa 100644
>> --- a/target-i386/cpu.h
>> +++ b/target-i386/cpu.h
>> @@ -865,6 +865,7 @@ typedef struct CPUX86State {
>>      bool tsc_valid;
>>      int tsc_khz;
>>      void *kvm_xsave_buf;
>> +    bool fwd_cpuid_cache_leaves;
>>
>>      /* in order to simplify APIC support, we leave this pointer to the
>>         user */
> Please place the field in X86CPU instead and document it.

While moving it, I believe the name can be made clearer.  I would name it
fwd_host_cache_info or something like that, to make it clear that it will
expose the _host CPU_ cache information to the guest.

--
Eduardo
Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER
On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
> We need to update EFER.NX before building the nEPT state via
> nested_ept_init_mmu_context.  Otherwise, we risk to create an MMU
> context that claims to have NX disabled while the guest EPT used NX.
> This will cause spurious faults for L2.
Hmm, I do not see how the nested ept mmu depends on the guest's EFER.NX
setting.  It just sets mmu->nx to true.

> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
> ---
>  arch/x86/kvm/vmx.c | 10 +-
>  1 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6c42518..363fe19 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -7727,11 +7727,6 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  		vmx_flush_tlb(vcpu);
>  	}
>
> -	if (nested_cpu_has_ept(vmcs12)) {
> -		kvm_mmu_unload(vcpu);
> -		nested_ept_init_mmu_context(vcpu);
> -	}
> -
>  	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)
>  		vcpu->arch.efer = vmcs12->guest_ia32_efer;
>  	else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE)
> @@ -7741,6 +7736,11 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  	/* Note: modifies VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */
>  	vmx_set_efer(vcpu, vcpu->arch.efer);
>
> +	if (nested_cpu_has_ept(vmcs12)) {
> +		kvm_mmu_unload(vcpu);
> +		nested_ept_init_mmu_context(vcpu);
> +	}
> +
>  	/*
>  	 * This sets GUEST_CR0 to vmcs12->guest_cr0, with possibly a modified
>  	 * TS bit (for lazy fpu) and bits which we consider mandatory enabled.
> --
> 1.7.3.4

--
	Gleb.
[PATCH V3] forward cpuid leaves when using -cpu host
This patch uses cpuid_host directly to forward the information instead of
storing a variable number of leaves in the cpu state.

v3: s/i386/target-i386/ [Andreas]
    move forward field to X86CPU structure and document it [Andreas]
    rename forward field [Eduardo]
    rebase on top of Eduardo's cache flags cleanup [Andreas]

v2: use index as argument to cpuid_host

Benoît Canet (1):
  target-i386: forward CPUID cache leaves when -cpu host is used

 target-i386/cpu-qom.h |  3 +++
 target-i386/cpu.c     | 19 +++
 2 files changed, 22 insertions(+)

--
1.7.10.4
[PATCH V3] target-i386: forward CPUID cache leaves when -cpu host is used
Some users running cpu intensive tasks checking the cache CPUID leaves at
startup and making decisions based on the result reported that the guest
was not reflecting the host CPUID leaves when -cpu host is used.

This patch fixes this.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 target-i386/cpu-qom.h |  3 +++
 target-i386/cpu.c     | 19 +++
 2 files changed, 22 insertions(+)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c4447c2..b1d1bd8 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -70,6 +70,9 @@ typedef struct X86CPU {
     bool hyperv_relaxed_timing;
     int hyperv_spinlock_attempts;

+    /* if true the CPUID code directly forwards host cache leaves to the guest */
+    bool fwd_host_cache_info;
+
     /* Features that were filtered out because of missing host capabilities */
     uint32_t filtered_features[FEATURE_WORDS];

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index c36345e..f0df4db 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -486,6 +486,7 @@ typedef struct x86_def_t {
     int stepping;
     FeatureWordArray features;
     char model_id[48];
+    bool fwd_host_cache_info;
 } x86_def_t;

 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
@@ -1139,6 +1140,7 @@ static void kvm_cpu_fill_host(x86_def_t *x86_cpu_def)
     assert(kvm_enabled());

     x86_cpu_def->name = "host";
+    x86_cpu_def->fwd_host_cache_info = true;
     host_cpuid(0x0, 0, &eax, &ebx, &ecx, &edx);
     x86_cpu_vendor_words2str(x86_cpu_def->vendor, ebx, edx, ecx);
@@ -1888,6 +1890,7 @@ static void cpu_x86_register(X86CPU *cpu, const char *name, Error **errp)
     env->features[FEAT_C000_0001_EDX] = def->features[FEAT_C000_0001_EDX];
     env->features[FEAT_7_0_EBX] = def->features[FEAT_7_0_EBX];
     env->cpuid_xlevel2 = def->xlevel2;
+    cpu->fwd_host_cache_info = def->fwd_host_cache_info;

     object_property_set_str(OBJECT(cpu), def->model_id, "model-id", errp);
 }
@@ -2062,6 +2065,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         break;
     case 2:
         /* cache info: needed for Pentium Pro compatibility */
+        if (cpu->fwd_host_cache_info) {
+            host_cpuid(index, 0, eax, ebx, ecx, edx);
+            break;
+        }
         *eax = 1; /* Number of CPUID[EAX=2] calls required */
         *ebx = 0;
         *ecx = 0;
@@ -2071,6 +2078,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         break;
     case 4:
         /* cache info: needed for Core compatibility */
+        if (cpu->fwd_host_cache_info) {
+            host_cpuid(index, count, eax, ebx, ecx, edx);
+            break;
+        }
         if (cs->nr_cores > 1) {
             *eax = (cs->nr_cores - 1) << 26;
         } else {
@@ -2228,6 +2239,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         break;
     case 0x80000005:
         /* cache info (L1 cache) */
+        if (cpu->fwd_host_cache_info) {
+            host_cpuid(index, 0, eax, ebx, ecx, edx);
+            break;
+        }
         *eax = (L1_DTLB_2M_ASSOC << 24) | (L1_DTLB_2M_ENTRIES << 16) | \
                (L1_ITLB_2M_ASSOC << 8) | (L1_ITLB_2M_ENTRIES);
         *ebx = (L1_DTLB_4K_ASSOC << 24) | (L1_DTLB_4K_ENTRIES << 16) | \
@@ -2239,6 +2254,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         break;
     case 0x80000006:
         /* cache info (L2 cache) */
+        if (cpu->fwd_host_cache_info) {
+            host_cpuid(index, 0, eax, ebx, ecx, edx);
+            break;
+        }
         *eax = (AMD_ENC_ASSOC(L2_DTLB_2M_ASSOC) << 28) | \
                (L2_DTLB_2M_ENTRIES << 16) | \
                (AMD_ENC_ASSOC(L2_ITLB_2M_ASSOC) << 12) | \
--
1.7.10.4
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
Il 02/09/2013 12:07, Gleb Natapov ha scritto:
> On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
>> On 09/02/2013 05:25 PM, Gleb Natapov wrote:
>>> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
>>>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>>>>> Page tables in a read-only memory slot will currently cause a triple
>>>>> fault because the page walker uses gfn_to_hva and it fails on such a
>>>>> slot.  OVMF uses such a page table; however, real hardware seems to
>>>>> be fine with that as long as the accessed/dirty bits are set.  Save
>>>>> whether the slot is readonly, and later check it when updating the
>>>>> accessed and dirty bits.
>>>> Paolo, do you know why OVMF is using readonly memory like this?
>>> Just a guess, but perhaps they want to move to paging mode as early as
>>> possible, even before the memory controller is fully initialized.
>>>> AFAIK, the fault triggered by this kind of access can hardly be fixed
>>>> by userspace since the fault is triggered by pagetable walking, not
>>>> by the current instruction.  Do you have any idea to let userspace
>>>> emulate it properly?
>>> Not sure what userspace you mean here, but there shouldn't be a fault
>>> in the first place if ROM page tables have the access/dirty bits set,
>>> and they do.
>> I just wonder how to fix this kind of fault.  The current patch returns
>> -EACCES but that will crash the guest.  I think we'd better let
>> userspace fix this error (let userspace set the D/A bit.)
> Ugh, this is not good.  Missed that.  Don't know what real HW will do
> here, but the easy thing for us to do would be to just return success.

Real hardware would just do a memory write.  What happens depends on what
is on the bus, i.e. on what the ROM is used for.

QEMU uses read-only slots for two things: actual read-only memory, where
writes go to the bitbucket, and ROMD memory, where writes are treated as
MMIO.  So, in the first case we would ignore the write.  In the second we
would do an MMIO exit to userspace.

But ignoring the write isn't always correct, and doing an MMIO exit is
complicated, so I would just kill the guest.

EPT will probably write to the read-only slots without thinking much
about it.  My patch injects a page fault, which is very likely to
escalate to a triple fault.  This is probably never what you want---on
the other hand, I wasn't sure what level of accuracy we want in this
case, given that EPT does it wrong too.

Paolo

>> Yes, so we can not call x86_emulate_instruction() to fix this fault
>> (that function emulates the access in the first place).  Need to
>> directly return an MMIO exit to userspace when we meet this fault?
>> What happens if this fault on pagetable walking is triggered in
>> x86_emulate_instruction()?
> I think we should not return MMIO-exit to userspace.  Either ignore the
> write attempt or kill the guest.
>
> --
> 	Gleb.
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
Il 02/09/2013 12:11, Gleb Natapov ha scritto:
>> Got it, thanks for your explanation.
>>
>> BTW, if you and Paolo are busy on other things, i am happy to fix these
>> issues. :)
> I am busy with reviews mostly :).  If you are not too busy with lockless
> write protection then fine with me.  Let's wait for Paolo's input on the
> proposed API though.

No problem, I'm happy to do this kind of exercise.  I like the API you
proposed.  I can do this change on top of the API change.

However, since this bug is in the wild it may be applicable to stable@ as
well.  If you agree, I will also post v2 of this patch for stable kernels
as soon as we agree on the desired faulting behavior.

Paolo
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
Il 02/09/2013 11:25, Gleb Natapov ha scritto:
> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>>> Page tables in a read-only memory slot will currently cause a triple
>>> fault because the page walker uses gfn_to_hva and it fails on such a
>>> slot.  OVMF uses such a page table; however, real hardware seems to be
>>> fine with that as long as the accessed/dirty bits are set.  Save
>>> whether the slot is readonly, and later check it when updating the
>>> accessed and dirty bits.
>> Paolo, do you know why OVMF is using readonly memory like this?
> Just a guess, but perhaps they want to move to paging mode as early as
> possible, even before the memory controller is fully initialized.

More precisely, they want to move to 64-bit mode as early as possible,
and that requires paging.

>> AFAIK, the fault triggered by this kind of access can hardly be fixed
>> by userspace since the fault is triggered by pagetable walking, not by
>> the current instruction.  Do you have any idea to let userspace emulate
>> it properly?
> Not sure what userspace you mean here, but there shouldn't be a fault in
> the first place if ROM page tables have the access/dirty bits set, and
> they do.

Yep.  Actually they do not set dirty on the PD/PDP/PML4, only on the
bottom level (which is also fine).

Paolo
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On Mon, Sep 02, 2013 at 06:00:08PM +0200, Paolo Bonzini wrote:
> Il 02/09/2013 11:25, Gleb Natapov ha scritto:
> [...]
>> Not sure what userspace you mean here, but there shouldn't be a fault
>> in the first place if ROM page tables have the access/dirty bits set,
>> and they do.
> Yep.  Actually they do not set dirty on the PD/PDP/PML4, only on the
> bottom level (which is also fine).
They do it for a good reason, the dirty bit exists only on the bottom
level :)

--
	Gleb.
Re: [PATCH V3] target-i386: forward CPUID cache leaves when -cpu host is used
On Mon, Sep 02, 2013 at 05:06:37PM +0200, Benoît Canet wrote:
> Some users running cpu intensive tasks checking the cache CPUID leaves
> at startup and making decisions based on the result reported that the
> guest was not reflecting the host CPUID leaves when -cpu host is used.
>
> This patch fixes this.
>
> Signed-off-by: Benoit Canet ben...@irqsave.net
Reviewed-by: Eduardo Habkost ehabk...@redhat.com

--
Eduardo
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On Mon, Sep 02, 2013 at 05:56:07PM +0200, Paolo Bonzini wrote:
> Il 02/09/2013 12:07, Gleb Natapov ha scritto:
> [...]
>> Ugh, this is not good.  Missed that.  Don't know what real HW will do
>> here, but the easy thing for us to do would be to just return success.
>
> Real hardware would just do a memory write.  What happens depends on
> what is on the bus, i.e. on what the ROM is used for.
>
> QEMU uses read-only slots for two things: actual read-only memory,
> where writes go to the bitbucket, and ROMD memory, where writes are
> treated as MMIO.  So, in the first case we would ignore the write.  In
> the second we would do an MMIO exit to userspace.
>
> But ignoring the write isn't always correct, and doing an MMIO exit is
> complicated, so I would just kill the guest.
>
> EPT will probably write to the read-only slots without thinking much
> about it.  My patch injects a page fault, which is very likely to
> escalate to a triple fault.  This is probably never what you want---on
> the other hand, I wasn't sure what level of accuracy we want in this
> case, given that EPT does it wrong too.
Just ignoring a write into a RO slot is OK.  We do not handle page tables
in MMIO memory, so handling page tables in ROMD memory shouldn't be a
priority either.

--
	Gleb.
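The options discussed in this exchange (commit the write, silently ignore it, or escalate to an MMIO exit) can be stated as a tiny decision function. This is an illustrative model only: the struct, enum, and function are invented here, and real KVM/QEMU represent slot flags quite differently.

```c
#include <assert.h>
#include <stdbool.h>

enum wr_action { WR_COMMIT, WR_IGNORE, WR_MMIO_EXIT };

/* Toy slot descriptor: readonly marks a KVM_MEM_READONLY-style slot,
 * romd marks QEMU-style ROMD memory where writes are treated as MMIO. */
struct slot {
    bool readonly;
    bool romd;
};

/* Policy the thread converges on for A/D-bit writes during page walks:
 * writes to plain read-only slots are dropped; ROMD would need an MMIO
 * exit, which the page-walk path does not attempt to support. */
static enum wr_action classify_write(const struct slot *s)
{
    if (!s->readonly)
        return WR_COMMIT;
    return s->romd ? WR_MMIO_EXIT : WR_IGNORE;
}
```

The key simplification Gleb argues for is that the WR_MMIO_EXIT branch never needs to be taken for guest page tables, so the walker only has to distinguish "commit" from "ignore".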
Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots
On Mon, Sep 02, 2013 at 05:58:41PM +0200, Paolo Bonzini wrote:
> Il 02/09/2013 12:11, Gleb Natapov ha scritto:
> [...]
> No problem, I'm happy to do this kind of exercise.  I like the API you
> proposed.  I can do this change on top of the API change.
>
> However, since this bug is in the wild it may be applicable to stable@
> as well.  If you agree, I will also post v2 of this patch for stable
> kernels as soon as we agree on the desired faulting behavior.
I am OK with having a minimal patch for backporting to stable and doing
the API changes + fixes for the other call sites on top.

--
	Gleb.
Re: [Qemu-devel] [PATCH V3] target-i386: forward CPUID cache leaves when -cpu host is used
>> Signed-off-by: Benoit Canet ben...@irqsave.net
> Reviewed-by: Eduardo Habkost ehabk...@redhat.com

Thanks.

Do you have an idea on how QEMU could reflect the real host clock
frequency to the guest when the host cpu scaling governor kicks in?
Giving a false value to cloud customers is mildly annoying.

Best regards

Benoît
Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER
On 2013-09-02 15:16, Gleb Natapov wrote:
> On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
>> We need to update EFER.NX before building the nEPT state via
>> nested_ept_init_mmu_context.  Otherwise, we risk to create an MMU
>> context that claims to have NX disabled while the guest EPT used NX.
>> This will cause spurious faults for L2.
> Hmm, I do not see how the nested ept mmu depends on the guest's EFER.NX
> setting.  It just sets mmu->nx to true.

Don't ask me for the details behind this, but update_permission_bitmask,
called by kvm_init_shadow_ept_mmu, is using it, e.g.  And the
before-after effect was clearly visible when L2 and L1 were using
different NX settings.  Maybe Arthur can write a test for this.

Jan

>> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
>> [...]
> --
> 	Gleb.

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER
On Mon, Sep 02, 2013 at 07:58:30PM +0200, Jan Kiszka wrote:
> On 2013-09-02 15:16, Gleb Natapov wrote:
>> On Thu, Aug 08, 2013 at 04:26:30PM +0200, Jan Kiszka wrote:
>>> We need to update EFER.NX before building the nEPT state via
>>> nested_ept_init_mmu_context.  Otherwise, we risk to create an MMU
>>> context that claims to have NX disabled while the guest EPT used NX.
>>> This will cause spurious faults for L2.
>> Hmm, I do not see how the nested ept mmu depends on the guest's
>> EFER.NX setting.  It just sets mmu->nx to true.
> Don't ask me for the details behind this, but update_permission_bitmask,
> called by kvm_init_shadow_ept_mmu, is using it, e.g.  And the
It uses it only in the !ept case though, and never looks at a guest
setting as far as I can tell.  Is it possible that this was an artifact
of old nEPT code and the latest one does not need this patch?

> before-after effect was clearly visible when L2 and L1 were using
> different NX settings.  Maybe Arthur can write a test for this.
>
> Jan
>
>>> [...]

--
	Gleb.
Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER
On 2013-09-02 20:09, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 07:58:30PM +0200, Jan Kiszka wrote:
>> On 2013-09-02 15:16, Gleb Natapov wrote:
>>> [...]
>>> Hmm, I do not see how the nested ept mmu depends on the guest's
>>> EFER.NX setting.  It just sets mmu->nx to true.
>> Don't ask me for the details behind this, but
>> update_permission_bitmask, called by kvm_init_shadow_ept_mmu, is using
>> it, e.g.  And the before-after effect was clearly visible when L2 and
>> L1 were using different NX settings.  Maybe Arthur can write a test for
>> this.
> It uses it only in the !ept case though, and never looks at a guest
> setting as far as I can tell.  Is it possible that this was an artifact
> of old nEPT code and the latest one does not need this patch?

Hmm, possibly.  Let me recheck, hope I can find the reproduction pattern
again...

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
Re: [PATCH v3 3/6] KVM: nVMX: Load nEPT state after EFER
On 2013-09-02 20:20, Jan Kiszka wrote:
> On 2013-09-02 20:09, Gleb Natapov wrote:
> > On Mon, Sep 02, 2013 at 07:58:30PM +0200, Jan Kiszka wrote:
> > > Don't ask me for the details behind this, but update_permission_bitmask,
> > > called by kvm_init_shadow_ept_mmu, is using it, e.g.
> > It uses it only in the !ept case though, and never looks at a guest setting
> > as far as I can tell. Is it possible that this was an artifact of older
> > nEPT code and the latest one does not need this patch?
> Hmm, possibly. Let me recheck, hope I can find the reproduction pattern
> again...

Yeah, drop it. Things work fine without it, and I just found an old version
of nEPT where kvm_init_shadow_EPT_mmu did this:

	context->nx = is_nx(vcpu); /* TODO: ? */

Obviously, this patch was a workaround for that issue.

Jan

--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
[PATCH v2] KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg
This reserves space in the get/set_one_reg ioctl for the extra guest state
needed for POWER8. It doesn't implement these at all; it just reserves them
so that the ABI is defined now.

A few things to note here:

- This adds *a lot* of state for transactional memory. Because of TM suspend
  mode this is unavoidable: you can't simply roll back all transactions and
  store only the checkpointed state. I've added this all to get/set_one_reg
  (including GPRs) rather than creating a new ioctl which returns a struct
  kvm_regs like KVM_GET_REGS does. This means if we need to extract the TM
  state, we are going to need a bucket load of ioctls. Hopefully most of the
  time this will not be needed, as we can look at the MSR to see if TM is
  active and only grab the state when needed. If this becomes a bottleneck
  in future we can add another ioctl to grab all this state in one go.
- The TM state is offset by 0x8000.
- For TM, I've done away with VMX and FP and created a single 64x128-bit
  VSX register space.
- I've left a space of 1 (at 0x9c) since Paulus needs to add a value which
  applies to POWER7 as well.
Signed-off-by: Michael Neuling <mi...@neuling.org>

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..341058c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1810,6 +1810,45 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TLB3PS    | 32
   PPC   | KVM_REG_PPC_EPTCFG    | 32
   PPC   | KVM_REG_PPC_ICP_STATE | 64
+  PPC   | KVM_REG_PPC_SPMC1     | 32
+  PPC   | KVM_REG_PPC_SPMC2     | 32
+  PPC   | KVM_REG_PPC_IAMR      | 64
+  PPC   | KVM_REG_PPC_TFHAR     | 64
+  PPC   | KVM_REG_PPC_TFIAR     | 64
+  PPC   | KVM_REG_PPC_TEXASR    | 64
+  PPC   | KVM_REG_PPC_FSCR      | 64
+  PPC   | KVM_REG_PPC_PSPB      | 32
+  PPC   | KVM_REG_PPC_EBBHR     | 64
+  PPC   | KVM_REG_PPC_EBBRR     | 64
+  PPC   | KVM_REG_PPC_BESCR     | 64
+  PPC   | KVM_REG_PPC_TAR       | 64
+  PPC   | KVM_REG_PPC_DPDES     | 64
+  PPC   | KVM_REG_PPC_DAWR      | 64
+  PPC   | KVM_REG_PPC_DAWRX     | 64
+  PPC   | KVM_REG_PPC_CIABR     | 64
+  PPC   | KVM_REG_PPC_IC        | 64
+  PPC   | KVM_REG_PPC_VTB       | 64
+  PPC   | KVM_REG_PPC_CSIGR     | 64
+  PPC   | KVM_REG_PPC_TACR      | 64
+  PPC   | KVM_REG_PPC_TCSCR     | 64
+  PPC   | KVM_REG_PPC_PID       | 64
+  PPC   | KVM_REG_PPC_ACOP      | 64
+  PPC   | KVM_REG_PPC_TM_GPR0   | 64
+  ...
+  PPC   | KVM_REG_PPC_TM_GPR31  | 64
+  PPC   | KVM_REG_PPC_TM_VSR0   | 128
+  ...
+  PPC   | KVM_REG_PPC_TM_VSR63  | 128
+  PPC   | KVM_REG_PPC_TM_CR     | 64
+  PPC   | KVM_REG_PPC_TM_LR     | 64
+  PPC   | KVM_REG_PPC_TM_CTR    | 64
+  PPC   | KVM_REG_PPC_TM_FPSCR  | 64
+  PPC   | KVM_REG_PPC_TM_AMR    | 64
+  PPC   | KVM_REG_PPC_TM_PPR    | 64
+  PPC   | KVM_REG_PPC_TM_VRSAVE | 64
+  PPC   | KVM_REG_PPC_TM_VSCR   | 32
+  PPC   | KVM_REG_PPC_TM_DSCR   | 64
+  PPC   | KVM_REG_PPC_TM_TAR    | 64

 ARM registers are mapped using the lower 32 bits.
The upper 16 of that is the register group type, or coprocessor number:

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..8e687a1 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -429,6 +429,11 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_MMCR0	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 #define KVM_REG_PPC_MMCR1	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
 #define KVM_REG_PPC_MMCRA	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+#define KVM_REG_PPC_MMCR2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
+#define KVM_REG_PPC_MMCRS	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x14)
+#define KVM_REG_PPC_SIAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x15)
+#define KVM_REG_PPC_SDAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x16)
+#define KVM_REG_PPC_SIER	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x17)

 #define KVM_REG_PPC_PMC1	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
 #define KVM_REG_PPC_PMC2	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
@@ -499,6 +504,55 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TLB3PS	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a)
 #define KVM_REG_PPC_EPTCFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9b)

+/* POWER8 registers */
+#define KVM_REG_PPC_SPMC1	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9d)
+#define KVM_REG_PPC_SPMC2	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9e)
+#define KVM_REG_PPC_IAMR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9f)
+#define KVM_REG_PPC_TFHAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa0)
+#define KVM_REG_PPC_TFIAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa1)
+#define KVM_REG_PPC_TEXASR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa2)
+#define KVM_REG_PPC_FSCR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa3)
+#define KVM_REG_PPC_PSPB	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xa4)
+#define KVM_REG_PPC_EBBHR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa5)
+#define KVM_REG_PPC_EBBRR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa6)
+#define
Re: Is fallback vhost_net to qemu for live migrate available?
On 2013/9/2 15:57, Wei Liu wrote:
> On Sat, Aug 31, 2013 at 12:45:11PM +0800, Qin Chuanyu wrote:
> > On 2013/8/30 0:08, Anthony Liguori wrote:
> > > Hi Qin,
> > >
> > > > By changing the memory copy and notify mechanism, virtio-net with
> > > > vhost_net can currently run on Xen with good performance.
> > >
> > > I think the key in doing this would be to implement a proper ioeventfd
> > > and irqfd interface in the driver domain kernel. Just hacking vhost_net
> > > with Xen-specific knowledge would be pretty nasty IMHO.
> > Yes, I added a kernel module which persists the virtio-net pio_addr and
> > MSI-X address, as the kvm module does. The guest wakes up the vhost
> > thread via a hook function added in evtchn_interrupt.
> > > Did you modify the front-end driver to do grant-table mapping, or is
> > > this all being done by mapping the domain's memory?
> > Nothing is changed in the front-end driver. Currently I use alloc_vm_area
> > to get address space, and map the domain's memory as qemu does.
> You mean you're using xc_map_foreign_range and friends in the backend to
> map guest memory? That's not very desirable, as it violates Xen's security
> model. It would not be too hard to pass grant references instead of guest
> physical memory addresses IMHO.
In fact, I did what virtio-net has done in qemu. I think security is a
pseudo-question here, because Dom0 is under our control. A KVM host can
access guest memory much more easily than Xen can, yet I haven't heard
anyone say KVM is insecure.

Regards
Qin Chuanyu
Re: KVM: x86: update masterclock when kvmclock_offset is calculated
On Wed, Aug 28, 2013 at 02:37:20PM +0200, Paolo Bonzini wrote:
> Il 28/08/2013 04:52, Marcelo Tosatti ha scritto:
> > On Thu, Aug 22, 2013 at 07:05:20PM +0200, Paolo Bonzini wrote:
> > > Il 20/08/2013 20:20, Marcelo Tosatti ha scritto:
> > > > The offset to add to the host's monotonic time, kvmclock_offset, is
> > > > calculated against the monotonic time at KVM_SET_CLOCK ioctl time.
> > > > Request a master clock update at this time, to reduce a potentially
> > > > unbounded difference between the values of the masterclock and the
> > > > clock value used to calculate kvmclock_offset.
> > > >
> > > > Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
> > > >
> > > > Index: linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c
> > > > ===
> > > > --- linux-2.6-kvmclock-fixes.orig/arch/x86/kvm/x86.c
> > > > +++ linux-2.6-kvmclock-fixes/arch/x86/kvm/x86.c
> > > > @@ -3806,6 +3806,7 @@ long kvm_arch_vm_ioctl(struct file *filp
> > > >  		delta = user_ns.clock - now_ns;
> > > >  		local_irq_enable();
> > > >  		kvm->arch.kvmclock_offset = delta;
> > > > +		kvm_gen_update_masterclock(kvm);
> > > >  		break;
> > > >  	}
> > > >  	case KVM_GET_CLOCK: {
> > >
> > > Reviewed-by: Paolo Bonzini <pbonz...@redhat.com>
> > >
> > > While reviewing this patch, which BTW looks good, I noticed the handling
> > > of KVM_REQ_MCLOCK_INPROGRESS, the dummy request that is never processed
> > > and is only used to block guest entry. It seems to me that this bit is
> > > not necessary. After KVM_REQ_CLOCK_UPDATE is issued, no guest entries
> > > will happen because kvm_guest_time_update will try to take the
> > > pvclock_gtod_sync_lock, currently taken by kvm_gen_update_masterclock.
> > Not entirely clear; to cancel guest entry the bit is necessary:
> >
> >	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
> >	    || need_resched() || signal_pending(current)) {
> >		vcpu->mode = OUTSIDE_GUEST_MODE;
> >		smp_wmb();
> >		local_irq_enable();
> >		preempt_enable();
> >		r = 1;
> >		goto cancel_injection;
> >	}
> Yes, KVM_REQ_CLOCK_UPDATE is enough to cancel guest entries too. See code
> below.
> > > Thus, you do not need the dummy request. You can simply issue
> > > KVM_REQ_CLOCK_UPDATE before calling pvclock_update_vm_gtod_copy (with
> > > the side effect of exiting VCPUs). VCPUs will stall in
> > > kvm_guest_time_update until pvclock_gtod_sync_lock is released by
> > > kvm_gen_update_masterclock. What do you think?
> > Not sure it's safe. Can you describe the safety of your proposal in more
> > detail?
> Here is the code I was thinking of:
>
>	spin_lock(&ka->pvclock_gtod_sync_lock);
>	make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
>	/*
>	 * No guest entries from this point: VCPUs will be spinning
>	 * on pvclock_gtod_sync_lock in kvm_guest_time_update.
>	 */
>	pvclock_update_vm_gtod_copy(kvm);
>	/*
>	 * Let kvm_guest_time_update continue: entering the guest
>	 * is now allowed too.
>	 */
>	spin_unlock(&ka->pvclock_gtod_sync_lock);
>
> KVM_REQ_CLOCK_UPDATE is used to cancel guest entry and execute
> kvm_guest_time_update. But kvm_guest_time_update will spin on
> pvclock_gtod_sync_lock until pvclock_update_vm_gtod_copy exits and
> kvm_gen_update_masterclock releases the spinlock.

Not safe, because there are places which set KVM_REQ_CLOCK_UPDATE without
kicking the target vcpu out of guest mode -- unless you use a modified
make_all_cpus_request. The point of the REQ_MCLOCK_INPROGRESS request is to
guarantee that the following is not possible:

- 2 vcpus in guest mode with per-vcpu kvmclock areas with different
  {system_timestamp, tsc_offset} values.

To achieve that:

- Kick all vcpus out of guest mode (via a request bit that can't be cleared).
- Update the {system_timestamp, tsc_offset} values.
- Clear the request bit.

> > > On top of this, optionally the spinlock could become an rw_semaphore so
> > > that clock updates for different VCPUs will not be serialized. The
> > > effect is probably not visible, though.
> > Still not clear on the benefits, but this area certainly welcomes
> > performance improvements (the global kick is one thing we discussed and
> > that should be improved).
> This unfortunately does not eliminate the global kick, so there is likely
> no performance improvement yet. It simplifies the logic a bit though. The
> change I suggested above is to make pvclock_gtod_sync_lock an rwsem or,
> probably better, a seqlock. VCPUs reading ka->use_master_clock,
> ka->master_cycle_now, ka->master_kernel_ns can then run concurrently, with
> no locked operations in kvm_guest_time_update.

Good point.
[PATCH v3] KVM: PPC: Book3S HV: Reserve POWER8 space in get/set_one_reg
This reserves space in the get/set_one_reg ioctl for the extra guest state
needed for POWER8. It doesn't implement these at all; it just reserves them
so that the ABI is defined now.

A few things to note here:

- This adds *a lot* of state for transactional memory. Because of TM suspend
  mode this is unavoidable: you can't simply roll back all transactions and
  store only the checkpointed state. I've added this all to get/set_one_reg
  (including GPRs) rather than creating a new ioctl which returns a struct
  kvm_regs like KVM_GET_REGS does. This means if we need to extract the TM
  state, we are going to need a bucket load of ioctls. Hopefully most of the
  time this will not be needed, as we can look at the MSR to see if TM is
  active and only grab the state when needed. If this becomes a bottleneck
  in future we can add another ioctl to grab all this state in one go.
- The TM state is offset by 0x8000.
- For TM, I've done away with VMX and FP and created a single 64x128-bit
  VSX register space.
- I've left a space of 1 (at 0x9c) since Paulus needs to add a value which
  applies to POWER7 as well.

Signed-off-by: Michael Neuling <mi...@neuling.org>
---
The last one was screwed up... sorry...

v3: fix naming mistake and whitespace screwage.
v2: integrate feedback

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index ef925ea..341058c 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1810,6 +1810,45 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TLB3PS    | 32
   PPC   | KVM_REG_PPC_EPTCFG    | 32
   PPC   | KVM_REG_PPC_ICP_STATE | 64
+  PPC   | KVM_REG_PPC_SPMC1     | 32
+  PPC   | KVM_REG_PPC_SPMC2     | 32
+  PPC   | KVM_REG_PPC_IAMR      | 64
+  PPC   | KVM_REG_PPC_TFHAR     | 64
+  PPC   | KVM_REG_PPC_TFIAR     | 64
+  PPC   | KVM_REG_PPC_TEXASR    | 64
+  PPC   | KVM_REG_PPC_FSCR      | 64
+  PPC   | KVM_REG_PPC_PSPB      | 32
+  PPC   | KVM_REG_PPC_EBBHR     | 64
+  PPC   | KVM_REG_PPC_EBBRR     | 64
+  PPC   | KVM_REG_PPC_BESCR     | 64
+  PPC   | KVM_REG_PPC_TAR       | 64
+  PPC   | KVM_REG_PPC_DPDES     | 64
+  PPC   | KVM_REG_PPC_DAWR      | 64
+  PPC   | KVM_REG_PPC_DAWRX     | 64
+  PPC   | KVM_REG_PPC_CIABR     | 64
+  PPC   | KVM_REG_PPC_IC        | 64
+  PPC   | KVM_REG_PPC_VTB       | 64
+  PPC   | KVM_REG_PPC_CSIGR     | 64
+  PPC   | KVM_REG_PPC_TACR      | 64
+  PPC   | KVM_REG_PPC_TCSCR     | 64
+  PPC   | KVM_REG_PPC_PID       | 64
+  PPC   | KVM_REG_PPC_ACOP      | 64
+  PPC   | KVM_REG_PPC_TM_GPR0   | 64
+  ...
+  PPC   | KVM_REG_PPC_TM_GPR31  | 64
+  PPC   | KVM_REG_PPC_TM_VSR0   | 128
+  ...
+  PPC   | KVM_REG_PPC_TM_VSR63  | 128
+  PPC   | KVM_REG_PPC_TM_CR     | 64
+  PPC   | KVM_REG_PPC_TM_LR     | 64
+  PPC   | KVM_REG_PPC_TM_CTR    | 64
+  PPC   | KVM_REG_PPC_TM_FPSCR  | 64
+  PPC   | KVM_REG_PPC_TM_AMR    | 64
+  PPC   | KVM_REG_PPC_TM_PPR    | 64
+  PPC   | KVM_REG_PPC_TM_VRSAVE | 64
+  PPC   | KVM_REG_PPC_TM_VSCR   | 32
+  PPC   | KVM_REG_PPC_TM_DSCR   | 64
+  PPC   | KVM_REG_PPC_TM_TAR    | 64

 ARM registers are mapped using the lower 32 bits.
The upper 16 of that is the register group type, or coprocessor number:

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..7ed41c0 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -429,6 +429,11 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_MMCR0	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x10)
 #define KVM_REG_PPC_MMCR1	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x11)
 #define KVM_REG_PPC_MMCRA	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x12)
+#define KVM_REG_PPC_MMCR2	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x13)
+#define KVM_REG_PPC_MMCRS	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x14)
+#define KVM_REG_PPC_SIAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x15)
+#define KVM_REG_PPC_SDAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x16)
+#define KVM_REG_PPC_SIER	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x17)

 #define KVM_REG_PPC_PMC1	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x18)
 #define KVM_REG_PPC_PMC2	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x19)
@@ -499,6 +504,55 @@ struct kvm_get_htab_header {
 #define KVM_REG_PPC_TLB3PS	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9a)
 #define KVM_REG_PPC_EPTCFG	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9b)

+/* POWER8 registers */
+#define KVM_REG_PPC_SPMC1	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9d)
+#define KVM_REG_PPC_SPMC2	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0x9e)
+#define KVM_REG_PPC_IAMR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x9f)
+#define KVM_REG_PPC_TFHAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa0)
+#define KVM_REG_PPC_TFIAR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa1)
+#define KVM_REG_PPC_TEXASR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa2)
+#define KVM_REG_PPC_FSCR	(KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xa3)
+#define KVM_REG_PPC_PSPB	(KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xa4)
+#define KVM_REG_PPC_EBBHR