Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Joerg Roedel j...@8bytes.org wrote: On Tue, Mar 16, 2010 at 12:25:00PM +0100, Ingo Molnar wrote: Hm, that sounds rather messy if we want to use it to basically expose kernel functionality in a guest/host unified way. Is the qemu process discoverable in some secure way? Can we trust it? Is there some proper tooling available to do it, or do we have to push it through 2-3 packages to get such a useful feature done? Since we want to implement a pmu usable for the guest anyway, why don't we just use the guest's perf to get all the information we want? [...] Look at the previous posting of this patch, this is something new and rather unique. The main power in the 'perf kvm' kind of instrumentation is to profile _both_ the host and the guest on the host, using the same tool (often using the same kernel) and using similar workloads, and do profile comparisons using 'perf diff'. Note that KVM's in-kernel design makes it easy to offer this kind of host/guest shared implementation that Yanmin has created. Other virtualization solutions with a poorer design (for example where the hypervisor code base is split away from the guest implementation) will have it much harder to create something similar. That kind of integrated approach can result in very interesting finds straight away, see: http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/00613.html ( the profile there demos the need for spinlock accelerators for example - there's clearly asymmetrically large overhead in guest spinlock code. Guess how much else we'll be able to find with a full 'perf kvm' implementation. ) One of the main goals of a virtualization implementation is to eliminate as many performance differences from the host kernel as possible. From the first day KVM was released the overriding question from users was always: 'how much slower is it than native, and which workloads are hit worst, and why, and could you pretty please speed up important workload XYZ'.
'perf kvm' helps exactly that kind of development workflow. Note that with oprofile you can already do separate guest space and host space profiling (with the timer-driven fallback in the guest). One idea with 'perf kvm' is to change that paradigm of forced separation and forced duplication and to support the workflow that most developers employ: use the host space for development and unify instrumentation in an intuitive framework. Yanmin's 'perf kvm' patch is a very good step towards that goal. Anyway ... look at the patches, try them and see it for yourself. Back in the days when i did KVM performance work i wish i had something like Yanmin's 'perf kvm' feature. I'd probably still be hacking KVM today ;-) So, the code is there, it's useful, and it's up to you guys whether you make use of this opportunity - the perf developers are certainly eager to help out with the details. There are already tons of per-kernel-subsystem perf helper tools: perf sched, perf kmem, perf lock, perf bench, perf timechart. 'perf kvm' is really a natural and good next step IMO that underlines the main design goodness KVM brought to the world of virtualization: proper guest/host code base integration. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] KVM: Make locked operations truly atomic
Avi Kivity wrote: Once upon a time, locked operations were emulated while holding the mmu mutex. Since mmu pages were write protected, it was safe to emulate the writes in a non-atomic manner, since there could be no other writer, either in the guest or in the kernel. These days emulation takes place without holding the mmu spinlock, so the write could be preempted by an unshadowing event, which exposes the page to writes by the guest. This may cause corruption of guest page tables. Fix by using an atomic cmpxchg for these operations. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c | 69 1 files changed, 48 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d02cc7..d724a52 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3299,41 +3299,68 @@ int emulator_write_emulated(unsigned long addr, } EXPORT_SYMBOL_GPL(emulator_write_emulated); +#define CMPXCHG_TYPE(t, ptr, old, new) \ + (cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old)) + +#ifdef CONFIG_X86_64 +# define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new) +#else +# define CMPXCHG64(ptr, old, new) \ + (cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u *)(new)) == *(u64 *)(old)) ^^ This should cause the 32-bit build breakage I see with the current next branch. Jan signature.asc Description: OpenPGP digital signature
[PATCH v2] KVM: cleanup {kvm_vm_ioctl, kvm}_get_dirty_log()
Using bitmap_empty() to see whether memslot->dirty_bitmap is empty. Changelog: cleanup x86-specific kvm_vm_ioctl_get_dirty_log() and fix a local parameter's type, addressing Takuya Yoshikawa's suggestion. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 17 - virt/kvm/kvm_main.c | 7 ++- 2 files changed, 6 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bcf52d1..e6cbbd4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2644,22 +2644,17 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm, int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { - int r, n, i; + int r, n, is_dirty = 0; struct kvm_memory_slot *memslot; - unsigned long is_dirty = 0; unsigned long *dirty_bitmap = NULL; mutex_lock(&kvm->slots_lock); - r = -EINVAL; - if (log->slot >= KVM_MEMORY_SLOTS) + r = kvm_get_dirty_log(kvm, log, &is_dirty); + if (r) goto out; memslot = kvm->memslots->memslots[log->slot]; - r = -ENOENT; - if (!memslot->dirty_bitmap) - goto out; - n = ALIGN(memslot->npages, BITS_PER_LONG) / 8; r = -ENOMEM; @@ -2668,9 +2663,6 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, goto out; memset(dirty_bitmap, 0, n); - for (i = 0; !is_dirty && i < n/sizeof(long); i++) - is_dirty = memslot->dirty_bitmap[i]; - /* If nothing is dirty, don't bother messing with page tables. */ if (is_dirty) { struct kvm_memslots *slots, *old_slots; @@ -2694,8 +2686,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, } r = 0; - if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n)) - r = -EFAULT; + out_free: vfree(dirty_bitmap); out: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bcd08b8..b08a7de 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -767,9 +767,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log, int *is_dirty) { struct kvm_memory_slot *memslot; - int r, i; - int n; - unsigned long any = 0; + int r, n, any = 0; r = -EINVAL; if (log->slot >= KVM_MEMORY_SLOTS) @@ -782,8 +780,7 @@ int kvm_get_dirty_log(struct kvm *kvm, n = ALIGN(memslot->npages, BITS_PER_LONG) / 8; - for (i = 0; !any && i < n/sizeof(long); ++i) - any = memslot->dirty_bitmap[i]; + any = !bitmap_empty(memslot->dirty_bitmap, memslot->npages); r = -EFAULT; if (copy_to_user(log->dirty_bitmap, memslot->dirty_bitmap, n)) -- 1.6.1.2
Re: [PATCH 2/5] KVM: Make locked operations truly atomic
On 03/17/2010 09:45 AM, Jan Kiszka wrote: Avi Kivity wrote: Once upon a time, locked operations were emulated while holding the mmu mutex. Since mmu pages were write protected, it was safe to emulate the writes in a non-atomic manner, since there could be no other writer, either in the guest or in the kernel. These days emulation takes place without holding the mmu spinlock, so the write could be preempted by an unshadowing event, which exposes the page to writes by the guest. This may cause corruption of guest page tables. Fix by using an atomic cmpxchg for these operations. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/x86.c | 69 1 files changed, 48 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 9d02cc7..d724a52 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3299,41 +3299,68 @@ int emulator_write_emulated(unsigned long addr, } EXPORT_SYMBOL_GPL(emulator_write_emulated); +#define CMPXCHG_TYPE(t, ptr, old, new) \ + (cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old)) + +#ifdef CONFIG_X86_64 +# define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new) +#else +# define CMPXCHG64(ptr, old, new) \ + (cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u *)(new)) == *(u64 *)(old)) ^^ This should cause the 32-bit build breakage I see with the current next branch. Also, Marcelo sees autotest breakage, so it's also broken on 64-bit somehow. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
[RFC] Unify KVM kernel-space and user-space code into a single project
* Anthony Liguori anth...@codemonkey.ws wrote: On 03/16/2010 12:39 PM, Ingo Molnar wrote: If we look at the use-case, it's going to be something like: a user is creating virtual machines and wants to get performance information about them. Having to run a separate tool like perf is not going to be what they would expect they had to do. Instead, they would either use their existing GUI tool (like virt-manager) or they would use their management interface (either QMP or libvirt). The complexity of interaction is due to the fact that perf shouldn't be a stand-alone tool. It should be a library or something with a programmatic interface that another tool can make use of. But ... a GUI interface/integration is of course possible too, and it's being worked on. perf is mainly a kernel developer tool, and kernel developers generally don't use GUIs to do their stuff: which is the (sole) reason why the first ~850 commits of tools/perf/ were done without a GUI. We go where our developers are. In any case it's not an excuse to have no proper command-line tooling. In fact if you cannot get simpler, more atomic command-line tooling right then you'll probably doubly suck at doing a GUI as well. It's about who owns the user interface. If qemu owns the user interface, then we can satisfy this in a very simple way by adding a perf monitor command. If we have to support third party tools, then it significantly complicates things. Of course illogical modularization complicates things 'significantly'. I wish both you and Avi looked back 3-4 years and realized what made KVM so successful back then and why the hearts and minds of virtualization developers were captured by KVM almost overnight. KVM's main strength back then was that it was a surprisingly functional piece of code offered by a 10 KLOC patch - right on the very latest upstream kernel. Code was shared with upstream, there was version parity, and it all was in the same single repo which was (and is) a pleasure to develop on.
Unlike Xen, which was a 200+ KLOC patch on top of a forked 10 MLOC kernel a few upstream versions back. Xen had constant version friction due to that fork and due to that forced/false separation/modularization: Xen _itself_ was a fork of Linux to begin with. (for example Xen still had my copyrights last i checked, which it got from old Linux code i worked on) That forced separation and version friction in Xen was a development and productization nightmare, and developing on KVM was a truly refreshing experience. (I'll go out on a limb to declare that you won't find a _single_ developer on this list who will tell us otherwise.) Fast forward to 2010. The kernel side of KVM is maximum goodness - by far the worst-quality remaining aspects of KVM are precisely in areas that you mention: 'if we have to support third party tools, then it significantly complicates things'. You kept Qemu as an external 'third party' entity to KVM, and KVM is clearly hurting from that - just see the recent KVM usability thread for examples about suckage. So a similar 'complication' is the crux of the matter behind KVM quality problems: you've not followed through with the original KVM vision and you have not applied that concept to Qemu! And please realize that the user does not care that KVM's kernel bits are top notch, if the rest of the package has sucky aspects: it's always the weakest link of the chain that matters to the user. Xen sucked because of such design shortsightedness on the kernel level, and now KVM suffers from it on the user space level. If you want to jump to the next level of technological quality you need to fix this attitude and you need to go back to the design roots of KVM. Concentrate on Qemu (as that is the weakest link now), make it a first class member of the KVM repo and simplify your development model by having a single repo: - move a clean (and minimal) version of the Qemu code base to tools/kvm/, in the upstream kernel repo, and work on that from that point on.
- co-develop new features within the same patch. Release new versions of kvm-qemu and the kvm bits at the same time (together with the upstream kernel), at well defined points in time. - encourage kernel-space and user-space KVM developers to work on both user-space and kernel-space bits as a single unit. It's one project and a single experience to the user. - [ and probably libvirt should go there too ] If KVM's hypervisor and guest kernel code can enjoy the benefits of a single repository, why cannot the rest of KVM enjoy the same developer goodness? Only fixing that will bring the break-through in quality - not more manpower really. Yes, i've read a thousand excuses for why this is an absolutely impossible and a bad thing to do, and none of them was really convincing to me - and you also have become rather emotional about all the arguments so it's hard to argue about it on
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Frank Ch. Eigler f...@redhat.com wrote: Hi - On Tue, Mar 16, 2010 at 06:04:10PM -0500, Anthony Liguori wrote: [...] The only way to really address this is to change the interaction. Instead of running perf externally to qemu, we should support a perf command in the qemu monitor that can then tie directly to the perf tooling. That gives us the best possible user experience. To what extent could this be solved with less crossing of isolation/abstraction layers, if the perfctr facilities were properly virtualized? [...] Note, 'perfctr' is a different out-of-tree Linux kernel project run by someone else: it offers the /dev/perfctr special-purpose device that allows raw, unabstracted, low-level access to the PMU. I suspect the one you wanted to mention here is called 'perf' or 'perf events'. (and used to be called 'performance counters' or 'perfcounters' until it got renamed about a year ago) Thanks, Ingo
Re: [PATCH v2] KVM: cleanup {kvm_vm_ioctl, kvm}_get_dirty_log()
Xiao Guangrong wrote: Using bitmap_empty() to see whether memslot->dirty_bitmap is empty. Changelog: cleanup x86-specific kvm_vm_ioctl_get_dirty_log() and fix a local parameter's type, addressing Takuya Yoshikawa's suggestion. Oh, for such a tiny comment. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 17 - virt/kvm/kvm_main.c | 7 ++- 2 files changed, 6 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bcf52d1..e6cbbd4 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c What I said was just that you may be able to use bitmap_empty() instead of - for (i = 0; !is_dirty && i < n/sizeof(long); i++) - is_dirty = memslot->dirty_bitmap[i]; for x86's code too, if your patch for kvm_get_dirty_log() was correct. @@ -2644,22 +2644,17 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm, int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { - int r, n, i; + int r, n, is_dirty = 0; struct kvm_memory_slot *memslot; - unsigned long is_dirty = 0; unsigned long *dirty_bitmap = NULL; mutex_lock(&kvm->slots_lock); - r = -EINVAL; - if (log->slot >= KVM_MEMORY_SLOTS) + r = kvm_get_dirty_log(kvm, log, &is_dirty); + if (r) goto out; memslot = kvm->memslots->memslots[log->slot]; - r = -ENOENT; - if (!memslot->dirty_bitmap) - goto out; - n = ALIGN(memslot->npages, BITS_PER_LONG) / 8; r = -ENOMEM; @@ -2668,9 +2663,6 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, goto out; memset(dirty_bitmap, 0, n); - for (i = 0; !is_dirty && i < n/sizeof(long); i++) - is_dirty = memslot->dirty_bitmap[i]; - /* If nothing is dirty, don't bother messing with page tables. */ if (is_dirty) { struct kvm_memslots *slots, *old_slots; @@ -2694,8 +2686,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, } r = 0; - if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n)) - r = -EFAULT; + out_free: vfree(dirty_bitmap); out: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bcd08b8..b08a7de 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -767,9 +767,7 @@ int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log, int *is_dirty) { struct kvm_memory_slot *memslot; - int r, i; - int n; - unsigned long any = 0; + int r, n, any = 0; r = -EINVAL; if (log->slot >= KVM_MEMORY_SLOTS) @@ -782,8 +780,7 @@ int kvm_get_dirty_log(struct kvm *kvm, n = ALIGN(memslot->npages, BITS_PER_LONG) / 8; - for (i = 0; !any && i < n/sizeof(long); ++i) - any = memslot->dirty_bitmap[i]; + any = !bitmap_empty(memslot->dirty_bitmap, memslot->npages); r = -EFAULT; if (copy_to_user(log->dirty_bitmap, memslot->dirty_bitmap, n))
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is they who improve KVM code, not guest kernel users. Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 10:16 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is they who improve KVM code, not guest kernel users. Of course I understood it. My point was that 'perf kvm' serves a tiny minority of users. That doesn't mean it isn't useful, just that it doesn't satisfy all needs by itself. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [PATCH v2] KVM: cleanup {kvm_vm_ioctl, kvm}_get_dirty_log()
Takuya Yoshikawa wrote: Oh, for such a tiny comment. Your comment is valuable although it's tiny :-) What I said was just that you may be able to use bitmap_empty() instead of - for (i = 0; !is_dirty && i < n/sizeof(long); i++) - is_dirty = memslot->dirty_bitmap[i]; for x86's code too, if your patch for kvm_get_dirty_log() was correct. While looking into x86's code, i found we can directly call kvm_get_dirty_log() in kvm_vm_ioctl_get_dirty_log() to remove some unnecessary code; this is a better cleanup way. Thanks, Xiao
2 serial ports?
Since 0.12, it appears that kvm does not allow more than 2 serial ports for a guest: $ kvm \ -serial unix:s1,server,nowait \ -serial unix:s2,server,nowait \ -serial unix:s3,server,nowait isa irq 4 already assigned Is there a work-around for this? Thanks! /mjt
Re: [PATCH v2] KVM: cleanup {kvm_vm_ioctl, kvm}_get_dirty_log()
Xiao Guangrong wrote: Takuya Yoshikawa wrote: Oh, for such a tiny comment. Your comment is valuable although it's tiny :-) What I said was just that you may be able to use bitmap_empty() instead of - for (i = 0; !is_dirty && i < n/sizeof(long); i++) - is_dirty = memslot->dirty_bitmap[i]; for x86's code too, if your patch for kvm_get_dirty_log() was correct. While looking into x86's code, i found we can directly call kvm_get_dirty_log() in kvm_vm_ioctl_get_dirty_log() to remove some unnecessary code; this is a better cleanup way. Ah, probably checking the git log will explain why it is like that! Marcelo's work? IIRC. Thanks, Xiao
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote: If the batch size is larger than the virtio queue size, or if there are no flushes at all, then yes the huge write cache gives more opportunity for reordering. But we're already talking hundreds of requests here. Yes. And remember those don't have to come from the same host. Also remember that we rather limit excessive reordering of O_DIRECT requests in the I/O scheduler because they are synchronous-type I/O, while we don't do that for pagecache writeback. And we don't have an unlimited virtio queue size; in fact it's quite limited.
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Anthony Liguori aligu...@linux.vnet.ibm.com wrote: If you want to use a synthetic filesystem as the management interface for qemu, that's one thing. But you suggested exposing the guest filesystem in its entirety and that's what I disagreed with. What did you think, that it would be world-readable? Why would we do such a stupid thing? Any mounted content should at minimum match whatever policy covers the image file. The mounting of contents is not a privilege escalation and it is already possible today - just not integrated properly and not practical. (and apparently not implemented for all the wrong 'security' reasons) The guest may encrypt its disk image. It still ought to be possible to run perf against that guest, no? _In_ the guest you can of course run it just fine. (once paravirt bits are in place) That has no connection to 'perf kvm' though, which this patch submission is about ... If you want unified profiling of both host and guest then you need access to both the guest and the host. This is what the 'perf kvm' patch is about. Please read the patch, i think you might be misunderstanding what it does ... Regarding encrypted contents - that's really a distraction, but the host has absolute, 100% control over the guest and there's nothing the guest can do about that - unless you are thinking about the sub-sub-case of Orwellian DRM-locked-down systems - in which case there's nothing for the host to mount and the guest can reject any requests for information on itself and impose additional policy that way. So it's a security non-issue. Note that DRM is pretty much the worst place to look at when it comes to usability: DRM lock-down is the antithesis of usability. Do you really want KVM to match the mind-set of the RIAA and MPAA? Why do you pretend that a developer cannot mount his own disk image? Pretty please, help Linux instead, where development is driven by usability and accessibility ...
Thanks, Ingo
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
* Avi Kivity a...@redhat.com wrote: On 03/17/2010 10:16 AM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: Monitoring guests from the host is useful for kvm developers, but less so for users. Guest space profiling is easy, and 'perf kvm' is not about that. (plain 'perf' will work if a proper paravirt channel is opened to the host) I think you might have misunderstood the purpose and role of the 'perf kvm' patch here? 'perf kvm' is aimed at KVM developers: it is they who improve KVM code, not guest kernel users. Of course I understood it. My point was that 'perf kvm' serves a tiny minority of users. [...] I hope you won't be disappointed to learn that 100% of Linux, all 13+ million lines of it, was and is being developed by a tiny, tiny, tiny minority of users ;-) [...] That doesn't mean it isn't useful, just that it doesn't satisfy all needs by itself. Of course - and it doesn't bring world peace either. One step at a time. Thanks, Ingo
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 10:49 AM, Christoph Hellwig wrote: On Tue, Mar 16, 2010 at 01:08:28PM +0200, Avi Kivity wrote: If the batch size is larger than the virtio queue size, or if there are no flushes at all, then yes the huge write cache gives more opportunity for reordering. But we're already talking hundreds of requests here. Yes. And remember those don't have to come from the same host. Also remember that we rather limit excessive reordering of O_DIRECT requests in the I/O scheduler because they are synchronous-type I/O, while we don't do that for pagecache writeback. Maybe we should relax that for kvm. Perhaps some of the problem comes from the fact that we call io_submit() once per request. And we don't have an unlimited virtio queue size, in fact it's quite limited. That can be extended easily if it fixes the problem. -- error compiling committee.c: too many arguments to function
Re: 2 serial ports?
On 03/17/10 09:38, Michael Tokarev wrote: Since 0.12, it appears that kvm does not allow more than 2 serial ports for a guest: $ kvm \ -serial unix:s1,server,nowait \ -serial unix:s2,server,nowait \ -serial unix:s3,server,nowait isa irq 4 already assigned Is there a work-around for this? Oh, well, yes, I remember. qemu is more strict on ISA irq sharing now. A bit too strict. /me goes dig out an old patch which never made it upstream for some reason I forgot. Attached. HTH, Gerd From 7d5d53e8a23544ac6413487a8ecdd43537ade9f3 Mon Sep 17 00:00:00 2001 From: Gerd Hoffmann kra...@redhat.com Date: Fri, 11 Sep 2009 13:43:46 +0200 Subject: [PATCH] isa: refine irq reservations There are a few cases where IRQ sharing on the ISA bus is used and possible. In general only devices of the same kind can do that. A few use cases: * serial lines 1+3 share irq 4 * serial lines 2+4 share irq 3 * parallel ports share irq 7 * ppc/prep: ide ports share irq 13 This patch refines the irq reservation mechanism for the isa bus to handle those cases. It keeps track of the driver which owns the IRQ in question and allows irq sharing for devices handled by the same driver. Signed-off-by: Gerd Hoffmann kra...@redhat.com --- hw/isa-bus.c | 16 +--- 1 files changed, 13 insertions(+), 3 deletions(-) diff --git a/hw/isa-bus.c b/hw/isa-bus.c index 4d489d2..bd2f69c 100644 --- a/hw/isa-bus.c +++ b/hw/isa-bus.c @@ -26,6 +26,7 @@ struct ISABus { BusState qbus; qemu_irq *irqs; uint32_t assigned; +DeviceInfo *irq_owner[16]; }; static ISABus *isabus; @@ -71,7 +72,9 @@ qemu_irq isa_reserve_irq(int isairq) exit(1); } if (isabus->assigned & (1 << isairq)) { -fprintf(stderr, "isa irq %d already assigned\n", isairq); +DeviceInfo *owner = isabus->irq_owner[isairq]; +fprintf(stderr, "isa irq %d already assigned (%s)\n", +isairq, owner ? owner->name : "unknown"); exit(1); } isabus->assigned |= (1 << isairq); @@ -82,10 +85,17 @@ void isa_init_irq(ISADevice *dev, qemu_irq *p, int isairq) { assert(dev->nirqs < ARRAY_SIZE(dev->isairq)); if (isabus->assigned & (1 << isairq)) { -fprintf(stderr, "isa irq %d already assigned\n", isairq); -exit(1); +DeviceInfo *owner = isabus->irq_owner[isairq]; +if (owner == dev->qdev.info) { +/* irq sharing is ok in case the same driver handles both */; +} else { +fprintf(stderr, "isa irq %d already assigned (%s)\n", +isairq, owner ? owner->name : "unknown"); +exit(1); +} } isabus->assigned |= (1 << isairq); +isabus->irq_owner[isairq] = dev->qdev.info; dev->isairq[dev->nirqs] = isairq; *p = isabus->irqs[isairq]; dev->nirqs++; -- 1.6.6.1
Re: 2 serial ports?
May I ask if it is possible to bind a real physical serial port to a guest? Thanks, Neo On Wed, Mar 17, 2010 at 1:38 AM, Michael Tokarev m...@tls.msk.ru wrote: Since 0.12, it appears that kvm does not allow more than 2 serial ports for a guest: $ kvm \ -serial unix:s1,server,nowait \ -serial unix:s2,server,nowait \ -serial unix:s3,server,nowait isa irq 4 already assigned Is there a work-around for this? Thanks! /mjt -- I would remember that if researchers were not ambitious probably today we haven't the technology we are using!
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com Based on the discussion in the KVM community, I worked out the patch to support perf collecting guest os statistics from the host side. This patch is implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help. The patch adds the new subcommand kvm to perf: perf kvm top perf kvm record perf kvm report perf kvm diff The new perf can profile the guest os kernel except guest os user space, but it can summarize guest os user space utilization per guest os. Below are some examples. 1) perf kvm top [r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top Thanks for your kind comments. Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran the same kernels). How would we support multiple guests with different kernels? With the patch, 'perf kvm report --sort pid' can show summary statistics for all guest os instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance. Sorry, I found that currently --pid isn't process-wide but per-thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' can collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter which is similar to --pid to filter out process data in userspace. Yeah. For maximum utility i'd suggest to extend --pid to include this, and introduce --tid for the previous, limited-to-a-single-task functionality.
Most users would expect --pid to work like a 'late attach' - i.e. to work like strace -f or like a gdb attach. Thanks Ingo, Avi. I worked out the below patch against tip/master of March 15th. Subject: [PATCH] Change perf's parameter --pid to process-wide collection From: Zhang, Yanmin yanmin_zh...@linux.intel.com Change parameter -p (--pid) to mean a real process pid and add -t (--tid) meaning thread id. Now --pid means perf collects the statistics of all threads of the process, while --tid means perf collects the statistics of just that thread. BTW, the patch fixes a bug in 'perf stat -p'. 'perf stat' always configures attr->disabled = 1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while (!done) loop in run_perf_stat consumes 100% of a single cpu, which has a bad impact on the running workload. I added a sleep(1) in the loop. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup linux-2.6_tipmaster0315/tools/perf/builtin-record.c linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c --- linux-2.6_tipmaster0315/tools/perf/builtin-record.c 2010-03-16 08:59:54.896488489 +0800 +++ linux-2.6_tipmaster0315_perfpid/tools/perf/builtin-record.c 2010-03-17 16:30:17.71706 +0800 @@ -27,7 +27,7 @@ #include <unistd.h> #include <sched.h> -static int fd[MAX_NR_CPUS][MAX_COUNTERS]; +static int *fd[MAX_NR_CPUS][MAX_COUNTERS]; static long default_interval = 0; @@ -43,6 +43,9 @@ static int raw_samples = 0; static int system_wide = 0; static int profile_cpu = -1; static pid_t target_pid = -1; +static pid_t target_tid = -1; +static int *all_tids = NULL; +static int thread_num = 0; static pid_t child_pid = -1; static int inherit = 1; static int force = 0; @@ -60,7 +63,7 @@ static struct timeval this_read; static u64 bytes_written = 0; -static struct pollfd event_array[MAX_NR_CPUS * MAX_COUNTERS]; +static struct pollfd *event_array; static int nr_poll = 0; static int nr_cpu = 0; @@ -77,7 +80,7 @@ struct mmap_data
{ unsigned int
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the scope is very narrow; I will change it to use the flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure whether vmexit breaks NMI context or not. A hardware NMI context isn't reentrant until an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain the NMI blocked state for the host. That means, if an NMI happens while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between the VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code, because int $2 has no effect in blocking a following NMI. And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current after-vmexit code. Letting the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... -- regards Yang, Sheng -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 11:28 AM, Sheng Yang wrote: I'm not sure whether vmexit breaks NMI context or not. A hardware NMI context isn't reentrant until an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain the NMI blocked state for the host. That means, if an NMI happens while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between the VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code, because int $2 has no effect in blocking a following NMI. That's pretty bad, as the NMI runs on a separate stack (via IST). So if another NMI happens while our int $2 is running, the stack will be corrupted. And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current after-vmexit code. Letting the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. -- error compiling committee.c: too many arguments to function
RE: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
Michael, I don't use the kiocb that comes from sendmsg/recvmsg, since I have embedded the kiocb in the page_info structure and allocate it when the page_info is allocated. So what I suggested was that vhost allocates and tracks the iocbs, and passes them to your device with sendmsg/recvmsg calls. This way your device won't need to share structures and locking strategy with vhost: you get an iocb, handle it, and invoke a callback to notify vhost about completion. This also gets rid of the 'receiver' callback. I'm not sure the receiver callback can be removed here. The patch describes a work flow like this: netif_receive_skb() gets the packet; it does nothing but queue the skb and wake up the handle_rx() of vhost. handle_rx() then calls the receiver callback to deal with the skb and get the necessary notify info into a list; vhost owns the list and uses it to complete, in the same handle_rx() context. We use the receiver callback here because only handle_rx() is woken up from netif_receive_skb(), and we need mp device context to deal with the skb and the notify info attached to it. We also take some locks in the callback function. If I removed the receiver callback, I could only deal with the skb and notify info in netif_receive_skb(), but that function runs in interrupt context, where I think taking those locks is not allowed - and I cannot remove the locks. Please have a review, and thanks for the instructions on replying to email, which helped me a lot.
Thanks, Xiaohui drivers/vhost/net.c | 159 +++-- drivers/vhost/vhost.h | 12 2 files changed, 166 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 22d5fef..5483848 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -17,11 +17,13 @@ #include <linux/workqueue.h> #include <linux/rcupdate.h> #include <linux/file.h> +#include <linux/aio.h> #include <linux/net.h> #include <linux/if_packet.h> #include <linux/if_arp.h> #include <linux/if_tun.h> +#include <linux/mpassthru.h> #include <net/sock.h> @@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net->tx_poll_state = VHOST_NET_POLL_STARTED; } +static void handle_async_rx_events_notify(struct vhost_net *net, +struct vhost_virtqueue *vq); + +static void handle_async_tx_events_notify(struct vhost_net *net, +struct vhost_virtqueue *vq); + A couple of style comments: - It's better to arrange functions in such an order that forward declarations aren't necessary. Since we don't have recursion, this should always be possible. - Continuation lines should be indented at least to the position of '(' on the previous line. Thanks. I'll correct that. /* Expects to be always run from workqueue - which acts as * read-size critical section for our kind of RCU. */ static void handle_tx(struct vhost_net *net) @@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net) tx_poll_stop(net); hdr_size = vq->hdr_size; +handle_async_tx_events_notify(net, vq); + for (;;) { head = vhost_get_vq_desc(net->dev, vq, vq->iov, ARRAY_SIZE(vq->iov), @@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net) /* Skip header. TODO: support TSO. */ s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out); msg.msg_iovlen = out; + +if (vq->link_state == VHOST_VQ_LINK_ASYNC) { +vq->head = head; +msg.msg_control = (void *)vq; So here a device gets a pointer to the vhost_virtqueue structure. If it got an iocb and invoked a callback, it would not need to care about vhost internals.
+} + len = iov_length(vq->iov, out); /* Sanity check */ if (!len) { @@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net) tx_poll_start(net, sock); break; } + +if (vq->link_state == VHOST_VQ_LINK_ASYNC) +continue; + if (err != len) pr_err("Truncated TX packet: len %d != %zd\n", err, len); @@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net) } } +handle_async_tx_events_notify(net, vq); + mutex_unlock(&vq->mutex); unuse_mm(net->dev.mm); } @@ -206,7 +228,8 @@ static void handle_rx(struct vhost_net *net) int err; size_t hdr_size; struct socket *sock = rcu_dereference(vq->private_data); -if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue)) +if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) && +vq->link_state == VHOST_VQ_LINK_SYNC)) return; use_mm(net->dev.mm); @@ -214,9 +237,18 @@ static void handle_rx(struct vhost_net
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wednesday 17 March 2010 17:41:58 Avi Kivity wrote: On 03/17/2010 11:28 AM, Sheng Yang wrote: I'm not sure whether vmexit breaks NMI context or not. A hardware NMI context isn't reentrant until an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain the NMI blocked state for the host. That means, if an NMI happens while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between the VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code, because int $2 has no effect in blocking a following NMI. That's pretty bad, as the NMI runs on a separate stack (via IST). So if another NMI happens while our int $2 is running, the stack will be corrupted. Though hardware doesn't provide this kind of blocking, software at least would warn about it... nmi_enter() would still be executed by int $2, and result in a BUG() if we are already in NMI context (OK, it is a little better than a mysterious crash due to a corrupted stack). And if the NMI sequence is not important (I think so), then we need to generate a real NMI in the current after-vmexit code. Letting the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. apic_send_IPI_self() already takes care of APIC_DM_NMI. And the NMI handler would block the following NMI? -- regards Yang, Sheng
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/17/2010 11:51 AM, Sheng Yang wrote: I think you need DM_NMI for that to work correctly. An alternative is to call the NMI handler directly. apic_send_IPI_self() already took care of APIC_DM_NMI. So it does (though not for x2apic?). I don't see why it doesn't work. And the NMI handler would block the following NMI? It wouldn't - won't work without extensive changes. -- error compiling committee.c: too many arguments to function
Re: [PATCH v2] KVM: cleanup {kvm_vm_ioctl, kvm}_get_dirty_log()
Takuya Yoshikawa wrote: Ah, probably checking the git log will explain to you why it is like that! Marcelo's work, IIRC. Oh, I find this commit: commit 706831a7faec7ac0d3057d20df8234c45bbbc3c5 Author: Marcelo Tosatti mtosa...@redhat.com Date: Wed Dec 23 14:35:22 2009 -0200 KVM: use SRCU for dirty log Signed-off-by: Marcelo Tosatti mtosa...@redhat.com But I don't know why Marcelo separated kvm_get_dirty_log()'s code into kvm_vm_ioctl_get_dirty_log(). :-( Thanks, Xiao
Re: [PATCH v1 2/3] Provides multiple submits and asynchronous notifications.
On Wed, Mar 17, 2010 at 05:48:10PM +0800, Xin, Xiaohui wrote: Michael, I don't use the kiocb that comes from sendmsg/recvmsg, since I have embedded the kiocb in the page_info structure and allocate it when the page_info is allocated. So what I suggested was that vhost allocates and tracks the iocbs, and passes them to your device with sendmsg/recvmsg calls. This way your device won't need to share structures and locking strategy with vhost: you get an iocb, handle it, and invoke a callback to notify vhost about completion. This also gets rid of the 'receiver' callback. I'm not sure the receiver callback can be removed here. The patch describes a work flow like this: netif_receive_skb() gets the packet; it does nothing but queue the skb and wake up the handle_rx() of vhost. handle_rx() then calls the receiver callback to deal with the skb and get the necessary notify info into a list; vhost owns the list and uses it to complete, in the same handle_rx() context. We use the receiver callback here because only handle_rx() is woken up from netif_receive_skb(), and we need mp device context to deal with the skb and the notify info attached to it. We also take some locks in the callback function. If I removed the receiver callback, I could only deal with the skb and notify info in netif_receive_skb(), but that function runs in interrupt context, where I think taking those locks is not allowed - and I cannot remove the locks. The basic idea is that vhost passes an iocb to recvmsg and the backend completes the iocb to signal that data is ready. That completion could be in interrupt context and so we need to switch to a workqueue to handle the event, it is true, but the code to do this would live in vhost.c or net.c. With this structure your device won't depend on vhost, and can go under drivers/net/, opening up the possibility of using it for zero copy without vhost in the future. Please have a review, and thanks for the instructions on replying to email, which helped me a lot.
Thanks, Xiaohui drivers/vhost/net.c | 159 +++-- drivers/vhost/vhost.h | 12 2 files changed, 166 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 22d5fef..5483848 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -17,11 +17,13 @@ #include <linux/workqueue.h> #include <linux/rcupdate.h> #include <linux/file.h> +#include <linux/aio.h> #include <linux/net.h> #include <linux/if_packet.h> #include <linux/if_arp.h> #include <linux/if_tun.h> +#include <linux/mpassthru.h> #include <net/sock.h> @@ -91,6 +93,12 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net->tx_poll_state = VHOST_NET_POLL_STARTED; } +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq); + +static void handle_async_tx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq); + A couple of style comments: - It's better to arrange functions in such an order that forward declarations aren't necessary. Since we don't have recursion, this should always be possible. - Continuation lines should be indented at least to the position of '(' on the previous line. Thanks. I'll correct that. /* Expects to be always run from workqueue - which acts as * read-size critical section for our kind of RCU. */ static void handle_tx(struct vhost_net *net) @@ -124,6 +132,8 @@ static void handle_tx(struct vhost_net *net) tx_poll_stop(net); hdr_size = vq->hdr_size; + handle_async_tx_events_notify(net, vq); + for (;;) { head = vhost_get_vq_desc(net->dev, vq, vq->iov, ARRAY_SIZE(vq->iov), @@ -151,6 +161,12 @@ static void handle_tx(struct vhost_net *net) /* Skip header. TODO: support TSO. */ s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out); msg.msg_iovlen = out; + + if (vq->link_state == VHOST_VQ_LINK_ASYNC) { + vq->head = head; + msg.msg_control = (void *)vq; So here a device gets a pointer to the vhost_virtqueue structure. If it got an iocb and invoked a callback, it would not need to care about vhost internals.
+ } + len = iov_length(vq->iov, out); /* Sanity check */ if (!len) { @@ -166,6 +182,10 @@ static void handle_tx(struct vhost_net *net) tx_poll_start(net, sock); break; } + + if (vq->link_state == VHOST_VQ_LINK_ASYNC) + continue; + if (err != len) pr_err("Truncated TX packet: len %d != %zd\n", err, len); @@ -177,6 +197,8 @@ static void handle_tx(struct vhost_net *net) }
Re: 2 serial ports?
Gerd Hoffmann wrote: On 03/17/10 09:38, Michael Tokarev wrote: Since 0.12, it appears that kvm does not allow more than 2 serial ports for a guest: $ kvm \ -serial unix:s1,server,nowait \ -serial unix:s2,server,nowait \ -serial unix:s3,server,nowait isa irq 4 already assigned Is there a work-around for this? Oh, well, yes, I remember. qemu is more strict on ISA irq sharing now. A bit too strict. /me goes dig out an old patch which never made it upstream for some reason I forgot. Attached. I tried the patch, and it now appears to work. I did not try to run various stress tests so far, but basic tests are fine. Thank you Gerd! And I think it's time to push it finally :) /mjt
Re: 2 serial ports?
Neo Jia wrote: May I ask if it is possible to bind a real physical serial port to a guest? It is all described in the documentation - quite a long list of various things you can attach to a virtual serial port, incl. a real one. /mjt
Re: [Qemu-devel] Re: 2 serial ports?
Oh, well, yes, I remember. qemu is more strict on ISA irq sharing now. A bit too strict. /me goes dig out an old patch which never made it upstream for some reason I forgot. Attached. This is wrong. Two devices should never be manipulating the same qemu_irq object. If you want multiple devices connected to the same IRQ then you need an explicit multiplexer, e.g. arm_timer.c:sp804_set_irq. Paul
[PATCH] KVM test: Make qcow2 check image non critical
Instead of forcing the vms to shut down due to the qemu-img check step, just make the postprocess step non-critical, ie, don't make the test fail because of it. The check is still there, but it won't mask the results of the tests themselves, while still providing useful additional info. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/tests_base.cfg.sample | 3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index beae786..bb455e6 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -1049,8 +1049,7 @@ variants: post_command = python scripts/check_image.py; remove_image = no post_command_timeout = 600 -kill_vm = yes -kill_vm_gracefully = yes +post_command_noncritical = yes - vmdk: only Fedora Ubuntu Windows only smp2 -- 1.6.6.1
Re: [Autotest] [PATCH] KVM-test: SR-IOV: Fix a bug that wrongly check VFs count
On Thu, Mar 11, 2010 at 2:54 AM, Yolkfull Chow yz...@redhat.com wrote: The parameter 'devices_requested' is unrelated to the driver option 'max_vfs' of 'igb'. The 82576 NIC has two network interfaces and each can be virtualized into up to 7 virtual functions, therefore we multiply the value of driver option 'max_vfs' by two and can thus get the total number of VFs. Applied, thanks! Signed-off-by: Yolkfull Chow yz...@redhat.com --- client/tests/kvm/kvm_utils.py | 19 +-- 1 files changed, 13 insertions(+), 6 deletions(-) diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py index 4565dc1..1813ed1 100644 --- a/client/tests/kvm/kvm_utils.py +++ b/client/tests/kvm/kvm_utils.py @@ -1012,17 +1012,22 @@ class PciAssignable(object): Get VFs count number according to lspci. + # FIXME: Need to think out a method of identifying which + # 'virtual function' belongs to which physical card, considering + # that the host may have more than one 82576 card. PCI_ID? cmd = "lspci | grep 'Virtual Function' | wc -l" - # For each VF we'll see 2 prints of 'Virtual Function', so let's - # divide the result per 2 - return int(commands.getoutput(cmd)) / 2 + return int(commands.getoutput(cmd)) def check_vfs_count(self): Check VFs count number according to the parameter driver_options. - return (self.get_vfs_count == self.devices_requested) + # Network card 82576 has two network interfaces and each can be + # virtualized up to 7 virtual functions, therefore we multiply + # by two the value of driver_option 'max_vfs'.
+ expected_count = int(re.findall("(\d+)", self.driver_option)[0]) * 2 + return (self.get_vfs_count() == expected_count) def is_binded_to_stub(self, full_id): @@ -1054,15 +1059,17 @@ class PciAssignable(object): elif not self.check_vfs_count(): os.system("modprobe -r %s" % self.driver) re_probe = True + else: + return True # Re-probe driver with proper number of VFs if re_probe: cmd = "modprobe %s %s" % (self.driver, self.driver_option) + logging.info("Loading the driver '%s' with option '%s'" % + (self.driver, self.driver_option)) s, o = commands.getstatusoutput(cmd) if s: return False - if not self.check_vfs_count(): - return False return True -- 1.7.0.1 ___ Autotest mailing list autot...@test.kernel.org http://test.kernel.org/cgi-bin/mailman/listinfo/autotest -- Lucas
Re: [Autotest] [PATCH 1/2] KVM test: Refactoring the 'autotest' subtest
On Fri, Feb 26, 2010 at 1:13 AM, sudhir kumar smalik...@gmail.com wrote: Looks good to me. It will definitely boost test speed for certain tests and give flexibility to use existing autotest strengths in a more granular way. Thank you! FYI, this patch was applied, mainly because it's not dependent on the cpu_set test itself: http://autotest.kernel.org/changeset/4308 On Fri, Feb 26, 2010 at 1:13 AM, Lucas Meneghel Rodrigues l...@redhat.com wrote: Refactor the autotest subtest into a utility function, so other KVM subtests can run autotest control files in hosts as part of their routine. This arrangement was made to accommodate the upcoming 'cpu_set' test. Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com --- client/tests/kvm/kvm_test_utils.py | 165 +++- client/tests/kvm/tests/autotest.py | 153 ++--- 2 files changed, 171 insertions(+), 147 deletions(-) diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py index 7d96d6e..71d6303 100644 --- a/client/tests/kvm/kvm_test_utils.py +++ b/client/tests/kvm/kvm_test_utils.py @@ -24,7 +24,7 @@ More specifically: import time, os, logging, re, commands from autotest_lib.client.common_lib import error from autotest_lib.client.bin import utils -import kvm_utils, kvm_vm, kvm_subprocess +import kvm_utils, kvm_vm, kvm_subprocess, scan_results def get_living_vm(env, vm_name): @@ -237,3 +237,166 @@ def get_memory_info(lvms): meminfo = meminfo[0:-2] + "}" return meminfo + + +def run_autotest(vm, session, control_path, timeout, test_name, outputdir): + Run an autotest control file inside a guest (linux only utility). + + @param vm: VM object. + @param session: A shell session on the VM provided. + @param control_path: An autotest control file. + @param timeout: Timeout under which the autotest test must complete. + @param test_name: Autotest client test name. + @param outputdir: Path on host where we should copy the guest autotest + results to.
+ + def copy_if_size_differs(vm, local_path, remote_path): + Copy a file to a guest if it doesn't exist or if its size differs. + + @param vm: VM object. + @param local_path: Local path. + @param remote_path: Remote path. + + copy = False + basename = os.path.basename(local_path) + local_size = os.path.getsize(local_path) + output = session.get_command_output("ls -l %s" % remote_path) + if "such file" in output: + logging.info("Copying %s to guest (remote file is missing)" % + basename) + copy = True + else: + try: + remote_size = output.split()[4] + remote_size = int(remote_size) + except (IndexError, ValueError): + logging.error("Check for remote path size %s returned %s. " + "Cannot process.", remote_path, output) + raise error.TestFail("Failed to check for %s (Guest died?)" % + remote_path) + if remote_size != local_size: + logging.debug("Copying %s to guest due to size mismatch " + "(remote size %s, local size %s)" % + (basename, remote_size, local_size)) + copy = True + + if copy: + if not vm.copy_files_to(local_path, remote_path): + raise error.TestFail("Could not copy %s to guest" % local_path) + + + def extract(vm, remote_path, dest_dir="."): + Extract a .tar.bz2 file on the guest. + + @param vm: VM object + @param remote_path: Remote file path + @param dest_dir: Destination dir for the contents + + basename = os.path.basename(remote_path) + logging.info("Extracting %s..." % basename) + (status, output) = session.get_command_status_output( + "tar xjvf %s -C %s" % (remote_path, dest_dir)) + if status != 0: + logging.error("Uncompress output:\n%s" % output) + raise error.TestFail("Could not extract %s on guest" % basename) + + if not os.path.isfile(control_path): + raise error.TestError("Invalid path to autotest control file: %s" % + control_path) + + tarred_autotest_path = "/tmp/autotest.tar.bz2" + tarred_test_path = "/tmp/%s.tar.bz2" % test_name + + # To avoid problems, let's make the test use the current AUTODIR + # (autotest client path) location + autotest_path = os.environ['AUTODIR'] + tests_path = os.path.join(autotest_path, 'tests') + test_path = os.path.join(tests_path, test_name) + + # tar the contents of bindir/autotest + cmd = "tar cvjf %s %s/*" % (tarred_autotest_path,
[PATCHv6 0/4] qemu-kvm: vhost net port
This is a port of the vhost v6 patch set I posted previously to qemu-kvm, for those that want to get good performance out of it :) This patchset needs to be applied once the qemu.git one gets merged, as that includes the irqchip support. Changes from the previous version: - check kvm_enabled in the irqfd call Michael S. Tsirkin (4): qemu-kvm: add vhost.h header kvm: irqfd support msix: add mask/unmask notifiers virtio-pci: irqfd support hw/msix.c | 36 - hw/msix.h | 1 + hw/pci.h | 6 ++ hw/virtio-pci.c | 27 + kvm-all.c | 19 +++ kvm.h | 10 kvm/include/linux/vhost.h | 130 + 7 files changed, 228 insertions(+), 1 deletions(-) create mode 100644 kvm/include/linux/vhost.h
[PATCHv6 1/4] qemu-kvm: add vhost.h header
This makes it possible to build vhost support on systems which do not have this header. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- kvm/include/linux/vhost.h | 130 + 1 files changed, 130 insertions(+), 0 deletions(-) create mode 100644 kvm/include/linux/vhost.h diff --git a/kvm/include/linux/vhost.h b/kvm/include/linux/vhost.h new file mode 100644 index 000..165a484 --- /dev/null +++ b/kvm/include/linux/vhost.h @@ -0,0 +1,130 @@ +#ifndef _LINUX_VHOST_H +#define _LINUX_VHOST_H +/* Userspace interface for in-kernel virtio accelerators. */ + +/* vhost is used to reduce the number of system calls involved in virtio. + * + * Existing virtio net code is used in the guest without modification. + * + * This header includes interface used by userspace hypervisor for + * device configuration. + */ + +#include <linux/types.h> + +#include <linux/ioctl.h> +#include <linux/virtio_config.h> +#include <linux/virtio_ring.h> + +struct vhost_vring_state { + unsigned int index; + unsigned int num; +}; + +struct vhost_vring_file { + unsigned int index; + int fd; /* Pass -1 to unbind from file. */ + +}; + +struct vhost_vring_addr { + unsigned int index; + /* Option flags. */ + unsigned int flags; + /* Flag values: */ + /* Whether log address is valid. If set enables logging. */ +#define VHOST_VRING_F_LOG 0 + + /* Start of array of descriptors (virtually contiguous) */ + __u64 desc_user_addr; + /* Used structure address. Must be 32 bit aligned */ + __u64 used_user_addr; + /* Available structure address. Must be 16 bit aligned */ + __u64 avail_user_addr; + /* Logging support. */ + /* Log writes to used structure, at offset calculated from specified + * address. Address must be 32 bit aligned. */ + __u64 log_guest_addr; +}; + +struct vhost_memory_region { + __u64 guest_phys_addr; + __u64 memory_size; /* bytes */ + __u64 userspace_addr; + __u64 flags_padding; /* No flags are currently specified. */ +}; + +/* All region addresses and sizes must be 4K aligned. */ +#define VHOST_PAGE_SIZE 0x1000 + +struct vhost_memory { + __u32 nregions; + __u32 padding; + struct vhost_memory_region regions[0]; +}; + +/* ioctls */ + +#define VHOST_VIRTIO 0xAF + +/* Features bitmask for forward compatibility. Transport bits are used for + * vhost specific features. */ +#define VHOST_GET_FEATURES _IOR(VHOST_VIRTIO, 0x00, __u64) +#define VHOST_SET_FEATURES _IOW(VHOST_VIRTIO, 0x00, __u64) + +/* Set current process as the (exclusive) owner of this file descriptor. This + * must be called before any other vhost command. Further calls to + * VHOST_OWNER_SET fail until VHOST_OWNER_RESET is called. */ +#define VHOST_SET_OWNER _IO(VHOST_VIRTIO, 0x01) +/* Give up ownership, and reset the device to default values. + * Allows subsequent call to VHOST_OWNER_SET to succeed. */ +#define VHOST_RESET_OWNER _IO(VHOST_VIRTIO, 0x02) + +/* Set up/modify memory layout */ +#define VHOST_SET_MEM_TABLE _IOW(VHOST_VIRTIO, 0x03, struct vhost_memory) + +/* Write logging setup. */ +/* Memory writes can optionally be logged by setting bit at an offset + * (calculated from the physical address) from specified log base. + * The bit is set using an atomic 32 bit operation. */ +/* Set base address for logging. */ +#define VHOST_SET_LOG_BASE _IOW(VHOST_VIRTIO, 0x04, __u64) +/* Specify an eventfd file descriptor to signal on log write. */ +#define VHOST_SET_LOG_FD _IOW(VHOST_VIRTIO, 0x07, int) + +/* Ring setup. */ +/* Set number of descriptors in ring. This parameter can not + * be modified while ring is running (bound to a device). */ +#define VHOST_SET_VRING_NUM _IOW(VHOST_VIRTIO, 0x10, struct vhost_vring_state) +/* Set addresses for the ring. */ +#define VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vhost_vring_addr) +/* Base value where queue looks for available descriptors */ +#define VHOST_SET_VRING_BASE _IOW(VHOST_VIRTIO, 0x12, struct vhost_vring_state) +/* Get accessor: reads index, writes value in num */ +#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct vhost_vring_state) + +/* The following ioctls use eventfd file descriptors to signal and poll + * for events. */ + +/* Set eventfd to poll for added buffers */ +#define VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vhost_vring_file) +/* Set eventfd to signal when buffers have been used */ +#define VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vhost_vring_file) +/* Set eventfd to signal an error */ +#define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file) + +/* VHOST_NET specific defines */ + +/* Attach virtio net ring to a raw socket, or tap device. + * The socket must be already bound to an ethernet device, this device will be + * used for transmit. Pass fd -1 to unbind from the socket and the transmit + * device. This can be used to stop the ring
[PATCHv6 2/4] kvm: irqfd support
Add an API to assign/deassign an irqfd to kvm, plus a stub so that users do not have to use ifdefs.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 kvm-all.c | 19 +++
 kvm.h     | 10 ++
 2 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 7b05462..1a15662 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1200,5 +1200,24 @@ int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool assign)
 }
 #endif
 
+#if defined(KVM_IRQFD)
+int kvm_set_irqfd(int gsi, int fd, bool assigned)
+{
+    struct kvm_irqfd irqfd = {
+        .fd = fd,
+        .gsi = gsi,
+        .flags = assigned ? 0 : KVM_IRQFD_FLAG_DEASSIGN,
+    };
+    int r;
+    if (!kvm_enabled() || !kvm_irqchip_in_kernel())
+        return -ENOSYS;
+
+    r = kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+    if (r < 0)
+        return r;
+    return 0;
+}
+#endif
+
 #undef PAGE_SIZE
 #include "qemu-kvm.c"
diff --git a/kvm.h b/kvm.h
index 0951380..72dcaca 100644
--- a/kvm.h
+++ b/kvm.h
@@ -180,4 +180,14 @@ int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool assign)
 }
 #endif
 
+#if defined(KVM_IRQFD) && defined(CONFIG_KVM)
+int kvm_set_irqfd(int gsi, int fd, bool assigned);
+#else
+static inline
+int kvm_set_irqfd(int gsi, int fd, bool assigned)
+{
+    return -ENOSYS;
+}
+#endif
+
 #endif
-- 
1.7.0.18.g0d53a5
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
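The wrapper above only packages a (gsi, fd) pair into a kvm_irqfd struct and hands it to the VM ioctl. The signalling primitive it builds on is an ordinary Linux eventfd; a minimal, KVM-free sketch of the producer/consumer behaviour an irqfd relies on (the helper names are made up):

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the eventfd a device model would hand to KVM_IRQFD. */
static int make_notifier(void)
{
    return eventfd(0, 0);
}

/* Signalling an interrupt is a single 8-byte write of a counter
 * increment; with irqfd the kernel consumes it and injects the GSI,
 * otherwise a userspace poller reads it instead. */
static int notify(int fd)
{
    uint64_t one = 1;
    return write(fd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}

/* Reading returns the accumulated count and resets it to zero
 * (default, non-semaphore eventfd semantics). */
static uint64_t consume(int fd)
{
    uint64_t val = 0;
    if (read(fd, &val, sizeof(val)) != sizeof(val))
        return 0;
    return val;
}
```

This coalescing behaviour (two notifies, one read of 2) is why an edge-triggered interrupt source maps naturally onto an eventfd counter.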
[PATCHv6 3/4] msix: add mask/unmask notifiers
Support per-vector callbacks for msix mask/unmask. Will be used for vhost net.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/msix.c | 36 +++-
 hw/msix.h |  1 +
 hw/pci.h  |  6 ++
 3 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/hw/msix.c b/hw/msix.c
index faee0b2..3ec8805 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -317,6 +317,13 @@ static void msix_mmio_writel(void *opaque, target_phys_addr_t addr,
     if (kvm_enabled() && kvm_irqchip_in_kernel()) {
         kvm_msix_update(dev, vector, was_masked, msix_is_masked(dev, vector));
     }
+    if (was_masked != msix_is_masked(dev, vector) &&
+        dev->msix_mask_notifier && dev->msix_mask_notifier_opaque[vector]) {
+        int r = dev->msix_mask_notifier(dev, vector,
+                                        dev->msix_mask_notifier_opaque[vector],
+                                        msix_is_masked(dev, vector));
+        assert(r >= 0);
+    }
     msix_handle_mask_update(dev, vector);
 }
 
@@ -355,10 +362,18 @@ void msix_mmio_map(PCIDevice *d, int region_num,
 static void msix_mask_all(struct PCIDevice *dev, unsigned nentries)
 {
-    int vector;
+    int vector, r;
     for (vector = 0; vector < nentries; ++vector) {
         unsigned offset = vector * MSIX_ENTRY_SIZE + MSIX_VECTOR_CTRL;
+        int was_masked = msix_is_masked(dev, vector);
         dev->msix_table_page[offset] |= MSIX_VECTOR_MASK;
+        if (was_masked != msix_is_masked(dev, vector) &&
+            dev->msix_mask_notifier && dev->msix_mask_notifier_opaque[vector]) {
+            r = dev->msix_mask_notifier(dev, vector,
+                                        dev->msix_mask_notifier_opaque[vector],
+                                        msix_is_masked(dev, vector));
+            assert(r >= 0);
+        }
     }
 }
 
@@ -381,6 +396,9 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
                                     sizeof *dev->msix_irq_entries);
     }
 #endif
+    dev->msix_mask_notifier_opaque =
+        qemu_mallocz(nentries * sizeof *dev->msix_mask_notifier_opaque);
+    dev->msix_mask_notifier = NULL;
     dev->msix_entry_used = qemu_mallocz(MSIX_MAX_ENTRIES *
                                         sizeof *dev->msix_entry_used);
@@ -443,6 +461,8 @@ int msix_uninit(PCIDevice *dev)
     dev->msix_entry_used = NULL;
     qemu_free(dev->msix_irq_entries);
     dev->msix_irq_entries = NULL;
+    qemu_free(dev->msix_mask_notifier_opaque);
+    dev->msix_mask_notifier_opaque = NULL;
     dev->cap_present &= ~QEMU_PCI_CAP_MSIX;
     return 0;
 }
@@ -586,3 +606,17 @@ void msix_unuse_all_vectors(PCIDevice *dev)
         return;
     msix_free_irq_entries(dev);
 }
+
+int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
+{
+    int r = 0;
+    if (vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])
+        return 0;
+
+    if (dev->msix_mask_notifier)
+        r = dev->msix_mask_notifier(dev, vector, opaque,
+                                    msix_is_masked(dev, vector));
+    if (r >= 0)
+        dev->msix_mask_notifier_opaque[vector] = opaque;
+    return r;
+}
diff --git a/hw/msix.h b/hw/msix.h
index a9f7993..f167231 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -33,4 +33,5 @@ void msix_reset(PCIDevice *dev);
 extern int msix_supported;
 
+int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque);
 #endif
diff --git a/hw/pci.h b/hw/pci.h
index 1eab8f2..100104c 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -136,6 +136,9 @@ enum {
 #define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
 #define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10
 
+typedef int (*msix_mask_notifier_func)(PCIDevice *, unsigned vector,
+                                       void *opaque, int masked);
+
 struct PCIDevice {
     DeviceState qdev;
     /* PCI config space */
@@ -201,6 +204,9 @@ struct PCIDevice {
     struct kvm_irq_routing_entry *msix_irq_entries;
 
+    void **msix_mask_notifier_opaque;
+    msix_mask_notifier_func msix_mask_notifier;
+
     /* Device capability configuration space */
     struct {
         int supported;
-- 
1.7.0.18.g0d53a5
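The plumbing above is a plain callback-plus-opaque-pointer pattern; a reduced model (hypothetical names, no PCI involved) showing the key property the patch preserves, namely that the notifier fires only on actual mask-state transitions and only when both a notifier and a per-vector opaque are registered:

```c
#include <stddef.h>

#define NVEC 4

typedef int (*mask_notifier_func)(void *opaque, int masked);

struct fake_dev {
    int masked[NVEC];
    mask_notifier_func notifier;
    void *opaque[NVEC];
};

/* Mirror of the msix_mmio_writel() logic: invoke the notifier only
 * when the mask state actually changed and a notifier is registered
 * for this vector. */
static void set_mask(struct fake_dev *d, unsigned vec, int masked)
{
    int was_masked = d->masked[vec];
    d->masked[vec] = masked;
    if (was_masked != masked && d->notifier && d->opaque[vec])
        d->notifier(d->opaque[vec], masked);
}

/* Example notifier that counts invocations via its opaque pointer. */
static int count_notifier(void *opaque, int masked)
{
    (void)masked;
    (*(int *)opaque)++;
    return 0;
}
```

Redundant writes of the same mask bit (a guest re-writing vector control) therefore cost nothing, which matters because virtio-pci's notifier tears down and re-establishes an irqfd on every transition.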
[PATCHv6 4/4] virtio-pci: irqfd support
Use irqfd when supported by the kernel. This uses msix mask notifiers: when a vector is masked, we poll it from userspace; when it is unmasked, we poll it from the kernel.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/virtio-pci.c | 27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 4255d98..f8d8022 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -402,6 +402,27 @@ static void virtio_pci_guest_notifier_read(void *opaque)
     }
 }
 
+static int virtio_pci_mask_notifier(PCIDevice *dev, unsigned vector,
+                                    void *opaque, int masked)
+{
+    VirtQueue *vq = opaque;
+    EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
+    int r = kvm_set_irqfd(dev->msix_irq_entries[vector].gsi,
+                          event_notifier_get_fd(notifier),
+                          !masked);
+    if (r < 0) {
+        return (r == -ENOSYS) ? 0 : r;
+    }
+    if (masked) {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            virtio_pci_guest_notifier_read, NULL, vq);
+    } else {
+        qemu_set_fd_handler(event_notifier_get_fd(notifier),
+                            NULL, NULL, NULL);
+    }
+    return 0;
+}
+
 static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
 {
     VirtIOPCIProxy *proxy = opaque;
@@ -415,7 +436,11 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
         }
         qemu_set_fd_handler(event_notifier_get_fd(notifier),
                             virtio_pci_guest_notifier_read, NULL, vq);
+        msix_set_mask_notifier(&proxy->pci_dev,
+                               virtio_queue_vector(proxy->vdev, n), vq);
     } else {
+        msix_set_mask_notifier(&proxy->pci_dev,
+                               virtio_queue_vector(proxy->vdev, n), NULL);
         qemu_set_fd_handler(event_notifier_get_fd(notifier),
                             NULL, NULL, NULL);
         event_notifier_cleanup(notifier);
@@ -500,6 +525,8 @@ static void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev,
 
     proxy->pci_dev.config_write = virtio_write_config;
 
+    proxy->pci_dev.msix_mask_notifier = virtio_pci_mask_notifier;
+
     size = VIRTIO_PCI_REGION_SIZE(&proxy->pci_dev) + vdev->config_len;
     if (size & (size - 1))
         size = 1 << qemu_fls(size);
-- 
1.7.0.18.g0d53a5
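The sizing logic at the end of the patch rounds the BAR region up to a power of two with `if (size & (size - 1)) size = 1 << qemu_fls(size);`. A standalone sketch of that rounding, with a local stand-in for qemu_fls (which returns one plus the index of the most significant set bit):

```c
#include <stdint.h>

/* Stand-in for qemu's qemu_fls(): one plus the index of the most
 * significant set bit, or 0 for an input of 0. */
static int fls32(uint32_t i)
{
    int n = 0;
    while (i) {
        i >>= 1;
        n++;
    }
    return n;
}

/* PCI BAR sizes must be powers of two; round up when needed,
 * mirroring the check at the end of virtio_init_pci().
 * (size & (size - 1)) is nonzero exactly when size is not a
 * power of two. */
static uint32_t round_to_pow2(uint32_t size)
{
    if (size & (size - 1))
        size = 1u << fls32(size);
    return size;
}
```

So a 0x101-byte region becomes a 0x200-byte BAR, while an already power-of-two size passes through unchanged.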
Re: Corrupt qcow2 image, recovery?
Liang Guo gave me this advice some weeks ago: you may use kvm-nbd or qemu-nbd to present a kvm image as an NBD device, so that you can use nbd-client to access it, e.g.:

kvm-nbd /vm/sid1.img
modprobe nbd
nbd-client localhost 1024 /dev/nbd0
fdisk -l /dev/nbd0

Didn't work for me because I always got a segfault, but maybe it works for you. - Robert

On 03/16/10 19:21, Christian Nilsson wrote: Hi! I'm running kvm / qemu-kvm on a couple of production servers and everything (or at least most things) works as it should. However, today someone thought it was a good idea to restart one of the servers, and after that the Windows 2k3 guest on that server doesn't boot anymore. kvm on this server is a bit outdated: QEMU PC emulator version 0.9.1 (kvm-83). (I guess this is one of the qcow2 corruption bugs, and I can only blame myself for not upgrading kvm sooner.) The guest.qcow2 is a 21 GiB file for a 60 GiB disk. I have tried a couple of things:

kvm-img convert -f qcow2 -O raw guest.qcow2 guest.raw

This stops and does nothing after creating a guest.raw that is 60 GiB but only using 60 MiB. So I mounted the fs from another server running QEMU PC emulator version 0.12.1 (qemu-kvm-0.12.1.2) and ran qemu-img with the same options as above; after a few secs I got "qemu-img: error while reading" and the same 60 MiB used by guest.raw. I also tried booting qemu-kvm with a Linux guest and this qcow2 image, but I only get I/O errors (and no partitions found).

# qemu-img check guest.qcow2
ERROR: invalid cluster offset=0x10a000
ERROR OFLAG_COPIED: l2_offset=ee73 refcount=1
ERROR l2_offset=ee73: Table is not cluster aligned; L1 entry corrupted
ERROR: invalid cluster offset=0x11d44100080
ERROR: invalid cluster offset=0x11d61600080
ERROR: invalid cluster offset=0x11d68600080
ERROR: invalid cluster offset=0x11d95300080

(and a lot more in this style; the full log can be provided if it would be of help to anybody)

Is there any possibility to repair this file, or convert it to a RAW file (even with parts padded that are not
safe from the qcow2 image), or as a last resort, are there any debug tools for qcow2 images that might be of use? I have read up on the qcow file format, but right now I'm a bit short of time, and I need the data in this guest's disk image (or at least the MS SQL data files that are on this disk). I have also checked the qcow2 file and it does contain an NTLDR string and a lot of other NTFS-recognisable strings, so I know that not all data is gone. The question is: how can I access it as a filesystem again? Any help would be appreciated! Regards Christian Nilsson
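For anyone poking at a damaged image by hand, the fixed-size, big-endian qcow2 header is easy to read without qemu's help. A minimal sketch based on the published qcow2 (v2) layout; the helper names are made up, and this is a debugging aid, not a repair tool:

```c
#include <stdint.h>

#define QCOW2_MAGIC 0x514649fbu   /* the bytes "QFI\xfb" */

/* qcow2 header fields are stored big-endian. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* First sanity check: the magic at offset 0. */
static int qcow2_magic_ok(const uint8_t *hdr)
{
    return be32(hdr) == QCOW2_MAGIC;
}

/* Sanity check in the spirit of qemu-img check: L1/L2/refcount table
 * offsets must be aligned to the cluster size, where
 * cluster_size = 1 << cluster_bits (cluster_bits lives at header
 * offset 20). */
static int offset_cluster_aligned(uint64_t offset, uint32_t cluster_bits)
{
    uint64_t cluster_size = (uint64_t)1 << cluster_bits;
    return (offset & (cluster_size - 1)) == 0;
}
```

Note that the `l2_offset=ee73` value flagged above fails exactly this alignment check for any plausible cluster size, which is why qemu-img reports the L1 entry as corrupted.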
Re: [PATCH] KVM: x86: Add KVM_GET/SET_VCPU_EVENTS
Avi Kivity wrote: On 11/12/2009 02:04 AM, Jan Kiszka wrote: This new IOCTL exports all yet user-invisible states related to exceptions, interrupts, and NMIs. Together with appropriate user space changes, this fixes sporadic problems of vmsave/restore, live migration and system reset. Applied, thanks. I added a flags field to the structure in case we discover a new bit that needs to fit in there. Please take a look (separate commit in kvm-next). So without this patch migration fails? Sounds like a stable candidate to me. Same goes for the follow-up that adds the shadow field. Alex
Re: [Autotest] [Autotest PATCH] KVM-test: Add a subtest 'qemu_img'
Copying Michael on the message. Hi Yolkfull, I have reviewed this patch and I have some comments to make on it, similar to the ones I made on an earlier version of it. One of the things that I noticed is that this patch doesn't work very well out of the box:

[...@freedom kvm]$ ./scan_results.py
Test                                          Status   Seconds  Info
--------------------------------------------  -------  -------  ----
(Result file: ../../results/default/status)
smp2.Fedora.11.64.qemu_img.check              GOOD     47       completed successfully
smp2.Fedora.11.64.qemu_img.create             GOOD     44       completed successfully
smp2.Fedora.11.64.qemu_img.convert.to_qcow2   FAIL     45       Image converted failed; Command: /usr/bin/qemu-img convert -f qcow2 -O qcow2 /tmp/kvm_autotest_root/images/fc11-64.qcow2 /tmp/kvm_autotest_root/images/fc11-64.qcow2.converted_qcow2; Output is: qemu-img: Could not open '/tmp/kvm_autotest_root/images/fc11-64.qcow2'
smp2.Fedora.11.64.qemu_img.convert.to_raw     FAIL     46       Image converted failed; Command: /usr/bin/qemu-img convert -f qcow2 -O raw /tmp/kvm_autotest_root/images/fc11-64.qcow2 /tmp/kvm_autotest_root/images/fc11-64.qcow2.converted_raw; Output is: qemu-img: Could not open '/tmp/kvm_autotest_root/images/fc11-64.qcow2'
smp2.Fedora.11.64.qemu_img.snapshot           FAIL     44       Create snapshot failed via command: /usr/bin/qemu-img snapshot -c snapshot0 /tmp/kvm_autotest_root/images/fc11-64.qcow2; Output is: qemu-img: Could not open '/tmp/kvm_autotest_root/images/fc11-64.qcow2'
smp2.Fedora.11.64.qemu_img.commit             GOOD     44       completed successfully
smp2.Fedora.11.64.qemu_img.info               FAIL     44       Unhandled str: Unhandled TypeError: argument of type 'NoneType' is not iterable
smp2.Fedora.11.64.qemu_img.rebase             TEST_NA  43       Current kvm user space version does not support 'rebase' subcommand
                                              GOOD     412

We need to fix that before upstream inclusion. Also, one thing that I've noticed is that this test doesn't depend on any other variants, so we don't need to repeat it for every combination of guest and qemu command line options.
Michael, can you think of a way to get this test out of the variants block, so that it gets executed only once per job and not for every combination of guest and other qemu options?

On Fri, Jan 29, 2010 at 4:00 AM, Yolkfull Chow yz...@redhat.com wrote: This is designed to test all subcommands of 'qemu-img'; however, so far 'commit' is not implemented.

* For the 'check' subcommand test, it will 'dd' to create a file with a specified size and see whether it's supported to be checked. Then convert it to the supported formats ('qcow2' and 'raw' so far) and see whether there are errors after conversion.

* For the 'convert' subcommand test, it will convert both to 'qcow2' and to 'raw' from the format specified in the config file, and only check 'qcow2' after conversion.

* For the 'snapshot' subcommand test, it will create two snapshots and list them, and finally delete them if no errors are found.

* For the 'info' subcommand test, it will check the image format and size according to the output of the 'info' subcommand on the specified image file.

* For the 'rebase' subcommand test, it will create a first snapshot 'sn1' based on the original base_img, and a second snapshot sn2 based on sn1, and then rebase sn2 to base_img. After the rebase it checks the backing_file of sn2. This supports two rebase modes, unsafe mode and safe mode:

Unsafe mode: with -u an unsafe mode is enabled that doesn't require the backing files to exist. It merely changes the backing file reference in the COW image. This is useful for renaming or moving the backing file. The user is responsible for making sure that the new backing file has no changes compared to the old one, or corruption may occur.

Safe mode: both the current and the new backing file need to exist, and after the rebase, the COW image is guaranteed to have the same guest-visible content as before. To achieve this, old and new backing file are compared and, if necessary, data is copied from the old backing file into the COW image.
Signed-off-by: Yolkfull Chow yz...@redhat.com --- client/tests/kvm/tests/qemu_img.py | 235 client/tests/kvm/tests_base.cfg.sample | 40 ++ 2 files changed, 275 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/qemu_img.py diff --git a/client/tests/kvm/tests/qemu_img.py b/client/tests/kvm/tests/qemu_img.py new file mode 100644 index 000..e6352a0 --- /dev/null +++ b/client/tests/kvm/tests/qemu_img.py @@ -0,0 +1,235 @@ +import re, os, logging, commands +from autotest_lib.client.common_lib import utils, error +import
Re: [PATCH 05/10] Don't call apic functions directly from kvm code
On Tue, Mar 09, 2010 at 03:27:02PM +0200, Avi Kivity wrote: On 02/26/2010 10:12 PM, Glauber Costa wrote: It is actually not necessary to call a tpr function to save and load cr8, as cr8 is part of the processor state, and thus, it is much easier to just add it to CPUState. As for apic base, wrap kvm usages, so we can call either the qemu device, or the in kernel version. } +static void kvm_set_apic_base(CPUState *env, uint64_t val) +{ +if (!kvm_irqchip_in_kernel()) +cpu_set_apic_base(env, val); What if it is in kernel? Just ignored? Doesn't seem right. At this point it is right, because there is no irqchip in kernel yet. In a later patch, irqchip in kernel begins to exist, and this function gets filled.
[PATCH] vhost: fix error handling in vring ioctls
Stanse found a locking problem in vhost_set_vring: several returns from VHOST_SET_VRING_KICK, VHOST_SET_VRING_CALL and VHOST_SET_VRING_ERR with the vq->mutex held. Fix these up.

Reported-by: Jiri Slaby jirisl...@gmail.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/vhost.c | 18 --
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7cd55e0..7bd7a1e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -476,8 +476,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		if (r < 0)
 			break;
 		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
-		if (IS_ERR(eventfp))
-			return PTR_ERR(eventfp);
+		if (IS_ERR(eventfp)) {
+			r = PTR_ERR(eventfp);
+			break;
+		}
 		if (eventfp != vq->kick) {
 			pollstop = filep = vq->kick;
 			pollstart = vq->kick = eventfp;
@@ -489,8 +491,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		if (r < 0)
 			break;
 		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
-		if (IS_ERR(eventfp))
-			return PTR_ERR(eventfp);
+		if (IS_ERR(eventfp)) {
+			r = PTR_ERR(eventfp);
+			break;
+		}
 		if (eventfp != vq->call) {
 			filep = vq->call;
 			ctx = vq->call_ctx;
@@ -505,8 +509,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		if (r < 0)
 			break;
 		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
-		if (IS_ERR(eventfp))
-			return PTR_ERR(eventfp);
+		if (IS_ERR(eventfp)) {
+			r = PTR_ERR(eventfp);
+			break;
+		}
 		if (eventfp != vq->error) {
 			filep = vq->error;
 			vq->error = eventfp;
-- 
1.7.0.18.g0d53a5
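The bug class here is generic: inside a lock-protected switch statement, `return` skips the common unlock while `break` falls through to it. A reduced pthreads model (a hypothetical function, not the vhost code itself) showing the break-not-return pattern the fix restores:

```c
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Every error path sets r and breaks, so the single unlock at the
 * bottom always runs. Returning directly from a case (as the buggy
 * code did) would leave the mutex held forever. */
static int do_op(int op)
{
    int r;

    pthread_mutex_lock(&lock);
    switch (op) {
    case 0:
        r = 0;
        break;
    case 1:
        r = -EINVAL;   /* was: return -EINVAL, leaking the lock */
        break;
    default:
        r = -ENOTTY;
        break;
    }
    pthread_mutex_unlock(&lock);
    return r;
}
```

Static checkers like Stanse find these because the lock/unlock pairing becomes path-sensitive the moment one branch returns early.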
[PATCH] vhost: fix interrupt mitigation with raw sockets
A thinko in the code means we never trigger interrupt mitigation. Fix this.

Reported-by: Juan Quintela quint...@redhat.com
Reported-by: Unai Uribarri unai.uriba...@optenet.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/net.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index fcafb6b..a6a88df 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -125,7 +125,7 @@ static void handle_tx(struct vhost_net *net)
 	mutex_lock(&vq->mutex);
 	vhost_disable_notify(vq);
 
-	if (wmem < sock->sk->sk_sndbuf * 2)
+	if (wmem < sock->sk->sk_sndbuf / 2)
 		tx_poll_stop(net);
 
 	hdr_size = vq->hdr_size;
-- 
1.7.0.18.g0d53a5
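The one-character thinko is easier to see in isolation: the intent is to stop polling the socket for write space only when the queued bytes have dropped below half the send buffer. With `* 2` instead of `/ 2` the condition is effectively always true, so polling always stopped and the mitigation path never engaged. A trivial sketch of the corrected threshold:

```c
/* Stop tx polling only when queued bytes (wmem) are below half the
 * socket send buffer -- i.e. there is plenty of room, so mitigation
 * can batch further work. The buggy '* 2' made this almost always
 * true regardless of how full the buffer was. */
static int should_stop_tx_poll(unsigned long wmem, unsigned long sndbuf)
{
    return wmem < sndbuf / 2;
}
```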
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
Anthony Liguori anth...@codemonkey.ws writes: This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write-caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery backed write caches to prevent this. In the set up you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. Hi Anthony. I suspected my post might spark an interesting discussion! Before considering anything like this, we did quite a bit of testing with OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool power off to kill the host. I didn't manage to corrupt any ext3, ext4 or NTFS filesystems despite these efforts. Is your claim here that:- (a) qemu doesn't emulate a disk write cache correctly; or (b) operating systems are inherently unsafe running on top of a disk with a write-cache; or (c) installations that are already broken and lose data with a physical drive with a write-cache can lose much more in this case because the write cache is much bigger? Following Christoph Hellwig's patch series from last September, I'm pretty convinced that (a) isn't true apart from the inability to disable the write-cache at run-time, which is something that neither recent linux nor windows seem to want to do out-of-the box. Given that modern SATA drives come with fairly substantial write-caches nowadays which operating systems leave on without widespread disaster, I don't really believe in (b) either, at least for the ide and scsi case. Filesystems know they have to flush the disk cache to avoid corruption. (Virtio makes the write cache invisible to the OS except in linux 2.6.32+ so I know virtio-blk has to be avoided for current windows and obsolete linux when writeback caching is on.) 
I can certainly imagine (c) might be the case, although when I use strace to watch the IO to the block device, I see pretty regular fdatasyncs being issued by the guests, interleaved with the writes, so I'm not sure how likely the problem would be in practice. Perhaps my test guests were unrepresentatively well-behaved. However, the potentially unlimited time-window for loss of incorrectly unsynced data is also something one could imagine fixing at the qemu level. Perhaps I should be implementing something like cache=writeback,flushtimeout=N which, upon a write being issued to the block device, starts an N second timer if it isn't already running. The timer is destroyed on flush, and if it expires before it's destroyed, a gratuitous flush is sent. Do you think this is worth doing? Just a simple 'while sleep 10; do sync; done' on the host even! We've used cache=none and cache=writethrough, and whilst performance is fine with a single guest accessing a disk, when we chop the disks up with LVM and run a even a small handful of guests, the constant seeking to serve tiny synchronous IOs leads to truly abysmal throughput---we've seen less than 700kB/s streaming write rates within guests when the backing store is capable of 100MB/s. With cache=writeback, there's still IO contention between guests, but the write granularity is a bit coarser, so the host's elevator seems to get a bit more of a chance to help us out and we can at least squeeze out 5-10MB/s from two or three concurrently running guests, getting a total of 20-30% of the performance of the underlying block device rather than a total of around 5%. Cheers, Chris.
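The proposed `flushtimeout=N` semantics can be sketched without qemu: a write arms a one-shot timer if none is pending, a guest-initiated flush disarms it, and expiry forces a gratuitous flush. A simulated-clock model (all names hypothetical, not qemu code):

```c
struct flush_timer {
    long deadline;       /* simulated time of forced flush, -1 if unarmed */
    long timeout;        /* the N in flushtimeout=N */
    int forced_flushes;  /* gratuitous flushes actually issued */
};

/* A write to the block device starts the timer if it isn't running. */
static void ft_write(struct flush_timer *t, long now)
{
    if (t->deadline < 0)
        t->deadline = now + t->timeout;
}

/* A guest flush destroys the timer: the data is already synced. */
static void ft_guest_flush(struct flush_timer *t)
{
    t->deadline = -1;
}

/* Clock tick: if the timer expired, issue the gratuitous flush. */
static void ft_tick(struct flush_timer *t, long now)
{
    if (t->deadline >= 0 && now >= t->deadline) {
        t->forced_flushes++;
        t->deadline = -1;
    }
}
```

This bounds the window of unsynced data at N seconds while leaving well-behaved guests (which flush regularly, as the strace observation above suggests) essentially unaffected.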
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
Avi Kivity a...@redhat.com writes: On 03/15/2010 10:23 PM, Chris Webb wrote: Wasteful duplication of page cache between guest and host notwithstanding, turning on cache=writeback is a spectacular performance win for our guests. Is this with qcow2, raw file, or direct volume access? This is with direct access to logical volumes. No file systems or qcow2 in the stack. Our typical host has a couple of SATA disks, combined in md RAID1, chopped up into volumes with LVM2 (really just dm linear targets). The performance measured outside qemu is excellent, inside qemu-kvm is fine too until multiple guests are trying to access their drives at once, but then everything starts to grind badly. I can understand it for qcow2, but for direct volume access this shouldn't happen. The guest schedules as many writes as it can, followed by a sync. The host (and disk) can then reschedule them whether they are in the writeback cache or in the block layer, and must sync in the same way once completed. I don't really understand what's going on here, but I wonder if the underlying problem might be that all the O_DIRECT/O_SYNC writes from the guests go down into the same block device at the bottom of the device mapper stack, and thus can't be reordered with respect to one another. For our purposes,

  Guest A    Guest B  |  Guest A    Guest B  |  Guest A    Guest B
  write A1            |  write A1            |             write B1
             write B1 |  write A2            |  write A1
  write A2            |             write B1 |  write A2

are all equivalent, but the system isn't allowed to reorder in this way because there isn't a separate request queue for each logical volume, just the one at the bottom. (I don't know whether nested request queues would behave remotely reasonably either, though!)
Also, if my guest kernel issues (say) three small writes, one at the start of the disk, one in the middle, one at the end, and then does a flush, can virtio really express this as one non-contiguous O_DIRECT write (the three components of which can be reordered by the elevator with respect to one another) rather than three distinct O_DIRECT writes which can't be permuted? Can qemu issue a write like that? cache=writeback + flush allows this to be optimised by the block layer in the normal way. Cheers, Chris.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 10:14 AM, Chris Webb wrote: Anthony Liguori anth...@codemonkey.ws writes: This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write-caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery-backed write caches to prevent this. In the set up you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. Hi Anthony. I suspected my post might spark an interesting discussion! Before considering anything like this, we did quite a bit of testing with OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool power off to kill the host. I didn't manage to corrupt any ext3, ext4 or NTFS filesystems despite these efforts. Is your claim here that:- (a) qemu doesn't emulate a disk write cache correctly; or (b) operating systems are inherently unsafe running on top of a disk with a write-cache; or (c) installations that are already broken and lose data with a physical drive with a write-cache can lose much more in this case because the write cache is much bigger? This is the closest to the most accurate. It basically boils down to this: most enterprises use disks with battery-backed write caches. Having the host act as a giant write cache means that you can lose data. I agree that a well-behaved file system will not become corrupt, but my contention is that for many types of applications, data loss == corruption, and not all file systems are well behaved. And it's certainly valid to argue about whether common filesystems are broken, but from a purely pragmatic perspective, this is going to be the case.
Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH] Fix SIGFPE for vnc display of width/height = 1
On 03/08/2010 08:34 AM, Chris Webb wrote: During boot, the screen gets resized to height 1 and a mouse click at this point will cause a division by zero when calculating the absolute pointer position from the pixel (x, y). Return a click in the middle of the screen instead in this case. Signed-off-by: Chris Webb ch...@arachsys.com Applied. Thanks. Regards, Anthony Liguori
---
 vnc.c | 6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/vnc.c b/vnc.c
index 01353a9..676a707 100644
--- a/vnc.c
+++ b/vnc.c
@@ -1457,8 +1457,10 @@ static void pointer_event(VncState *vs, int button_mask, int x, int y)
         dz = 1;
 
     if (vs->absolute) {
-        kbd_mouse_event(x * 0x7FFF / (ds_get_width(vs->ds) - 1),
-                        y * 0x7FFF / (ds_get_height(vs->ds) - 1),
+        kbd_mouse_event(ds_get_width(vs->ds) > 1 ?
+                            x * 0x7FFF / (ds_get_width(vs->ds) - 1) : 0x4000,
+                        ds_get_height(vs->ds) > 1 ?
+                            y * 0x7FFF / (ds_get_height(vs->ds) - 1) : 0x4000,
                         dz, buttons);
     } else if (vnc_has_feature(vs, VNC_FEATURE_POINTER_TYPE_CHANGE)) {
         x -= 0x7FFF;
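The fix maps a pixel coordinate onto the RFB absolute-pointer range [0, 0x7FFF] and must special-case a dimension of 1 (or 0), since the scaling divides by extent - 1. The mapping in isolation:

```c
/* Scale pixel coordinate v in [0, extent-1] onto the RFB absolute
 * pointer range [0, 0x7FFF]. Degenerate extents (<= 1 pixel) map to
 * the midpoint 0x4000 instead of dividing by zero, as in the fix. */
static int abs_pointer(int v, int extent)
{
    return extent > 1 ? v * 0x7FFF / (extent - 1) : 0x4000;
}
```

The edge pixels land exactly on 0 and 0x7FFF, and a click during the height-1 boot screen safely reports the centre of the (degenerate) axis.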
Re: [Qemu-devel] Re: [PATCH 2/6] qemu-kvm: Modify and introduce wrapper functions to access phys_ram_dirty.
On 03/16/2010 10:10 PM, Blue Swirl wrote: Yes, and is what tlb_protect_code() does and it's called from tb_alloc_page(), which is what's called when a TB is created. Just a tangential note: a long time ago, I tried to disable self modifying code detection for Sparc. On most RISC architectures, SMC needs explicit flushing so in theory we need not track code memory writes. However, during exceptions the translator needs to access the original unmodified code that was used to generate the TB. But maybe there are other ways to avoid SMC tracking; on x86 it's still needed. On x86 you're supposed to execute a serializing instruction (one of INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the exception of MOV CR8), MOV (to debug register), WBINVD, WRMSR, CPUID, IRET, and RSM) before running modified code. Last time I checked, a jump instruction was sufficient to ensure coherency within a core. Serializing instructions are only required for coherency between cores on SMP systems. QEMU effectively has a very large physically tagged icache[1] with very expensive cache loads. AFAIK the only practical way to maintain that cache on x86 targets is to do write snooping via dirty bits. On targets that mandate explicit icache invalidation we might be able to get away with this, however I doubt it actually gains you anything - a correctly written guest is going to invalidate at least as much as we get from dirty tracking, and we still need to provide correct behaviour when executing with cache disabled. but I suppose SMC is pretty rare. Every time you demand load a code page from disk, you're running self modifying code (though it usually doesn't exist in the tlb, so there's no previous version that can cause trouble). I think you're confusing TLB flushes with TB flushes. Paul [1] Even modern x86 only has a relatively small icache. The large L2/L3 caches aren't relevant as they are unified I/D caches.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 05:24 PM, Chris Webb wrote: Avi Kivity a...@redhat.com writes: On 03/15/2010 10:23 PM, Chris Webb wrote: Wasteful duplication of page cache between guest and host notwithstanding, turning on cache=writeback is a spectacular performance win for our guests. Is this with qcow2, raw file, or direct volume access? This is with direct access to logical volumes. No file systems or qcow2 in the stack. Our typical host has a couple of SATA disks, combined in md RAID1, chopped up into volumes with LVM2 (really just dm linear targets). The performance measured outside qemu is excellent, inside qemu-kvm is fine too until multiple guests are trying to access their drives at once, but then everything starts to grind badly. OK. I can understand it for qcow2, but for direct volume access this shouldn't happen. The guest schedules as many writes as it can, followed by a sync. The host (and disk) can then reschedule them whether they are in the writeback cache or in the block layer, and must sync in the same way once completed. I don't really understand what's going on here, but I wonder if the underlying problem might be that all the O_DIRECT/O_SYNC writes from the guests go down into the same block device at the bottom of the device mapper stack, and thus can't be reordered with respect to one another. They should be reorderable. Otherwise host filesystems on several volumes would suffer the same problems. Whether the filesystem is in the host or guest shouldn't matter. For our purposes,

  Guest A    Guest B  |  Guest A    Guest B  |  Guest A    Guest B
  write A1            |  write A1            |             write B1
             write B1 |  write A2            |  write A1
  write A2            |             write B1 |  write A2

are all equivalent, but the system isn't allowed to reorder in this way because there isn't a separate request queue for each logical volume, just the one at the bottom. (I don't know whether nested request queues would behave remotely reasonably either, though!)
Also, if my guest kernel issues (say) three small writes, one at the start of the disk, one in the middle, one at the end, and then does a flush, can virtio really express this as one non-contiguous O_DIRECT write (the three components of which can be reordered by the elevator with respect to one another) rather than three distinct O_DIRECT writes which can't be permuted? Can qemu issue a write like that? cache=writeback + flush allows this to be optimised by the block layer in the normal way. Guest side virtio will send this as three requests followed by a flush. Qemu will issue these as three distinct requests and then flush. The requests are marked, as Christoph says, in a way that limits their reorderability, and perhaps if we fix these two problems performance will improve. Something that comes to mind is merging of flush requests. If N guests issue one write and one flush each, we should issue N writes and just one flush - a flush for the disk applies to all volumes on that disk. -- error compiling committee.c: too many arguments to function
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
* Anthony Liguori anth...@codemonkey.ws [2010-03-17 10:55:47]: On 03/17/2010 10:14 AM, Chris Webb wrote: Anthony Liguori anth...@codemonkey.ws writes: This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write-caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery backed write caches to prevent this. In the set up you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. Hi Anthony. I suspected my post might spark an interesting discussion! Before considering anything like this, we did quite a bit of testing with OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool power off to kill the host. I didn't manage to corrupt any ext3, ext4 or NTFS filesystems despite these efforts. Is your claim here that:- (a) qemu doesn't emulate a disk write cache correctly; or (b) operating systems are inherently unsafe running on top of a disk with a write-cache; or (c) installations that are already broken and lose data with a physical drive with a write-cache can lose much more in this case because the write cache is much bigger? This is the closest to the most accurate. It basically boils down to this: most enterprises use disks with battery backed write caches. Having the host act as a giant write cache means that you can lose data. Dirty limits can help control how much we lose, but also affect how much we write out. I agree that a well behaved file system will not become corrupt, but my contention is that for many types of applications, data loss == corruption and not all file systems are well behaved. And it's certainly valid to argue about whether common filesystems are broken but from a purely pragmatic perspective, this is going to be the case. 
I think it is a trade-off for end users to decide on. cache=writeback does provide performance benefits, but can cause data loss. -- Three Cheers, Balbir
Re: [Qemu-devel] Re: [PATCH 2/6] qemu-kvm: Modify and introduce wrapper functions to access phys_ram_dirty.
On 03/17/2010 06:06 PM, Paul Brook wrote: On 03/16/2010 10:10 PM, Blue Swirl wrote: Yes, and is what tlb_protect_code() does and it's called from tb_alloc_page(), which is what's called when a TB is created. Just a tangential note: a long time ago, I tried to disable self modifying code detection for Sparc. On most RISC architectures, SMC needs explicit flushing so in theory we need not track code memory writes. However, during exceptions the translator needs to access the original unmodified code that was used to generate the TB. But maybe there are other ways to avoid SMC tracking; on x86 it's still needed. On x86 you're supposed to execute a serializing instruction (one of INVD, INVEPT, INVLPG, INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the exception of MOV CR8), MOV (to debug register), WBINVD, WRMSR, CPUID, IRET, and RSM) before running modified code. Last time I checked, a jump instruction was sufficient to ensure coherency within a core. Serializing instructions are only required for coherency between cores on SMP systems. Yeah, the docs say either a jump or a serializing instruction is needed. QEMU effectively has a very large physically tagged icache[1] with very expensive cache loads. AFAIK the only practical way to maintain that cache on x86 targets is to do write snooping via dirty bits. On targets that mandate explicit icache invalidation we might be able to get away with this, however I doubt it actually gains you anything - a correctly written guest is going to invalidate at least as much as we get from dirty tracking, and we still need to provide correct behaviour when executing with cache disabled. Agreed. but I suppose SMC is pretty rare. Every time you demand load a code page from disk, you're running self modifying code (though it usually doesn't exist in the tlb, so there's no previous version that can cause trouble). I think you're confusing TLB flushes with TB flushes. 
No - my thinking was page fault, load page, invlpg, continue. But the invlpg is unneeded, and continue has to include a jump anyway.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
Anthony Liguori anth...@codemonkey.ws writes: On 03/17/2010 10:14 AM, Chris Webb wrote: (c) installations that are already broken and lose data with a physical drive with a write-cache can lose much more in this case because the write cache is much bigger? This is the closest to the most accurate. It basically boils down to this: most enterprises use disks with battery backed write caches. Having the host act as a giant write cache means that you can lose data. I agree that a well behaved file system will not become corrupt, but my contention is that for many types of applications, data loss == corruption and not all file systems are well behaved. And it's certainly valid to argue about whether common filesystems are broken but from a purely pragmatic perspective, this is going to be the case. Okay. What I was driving at in describing these systems as 'already broken' is that they will already lose data (in this sense) if they're run on bare metal with normal commodity SATA disks with their 32MB write caches on. That configuration surely describes the vast majority of PC-class desktops and servers! If I understand correctly, your point here is that the small cache on a real SATA drive gives a relatively small time window for data loss, whereas the worry with cache=writeback is that the host page cache can be gigabytes, so the time window for unsynced data to be lost is potentially enormous. Isn't the fix for that just forcing periodic sync on the host to put an upper bound on the time window for unsynced data loss in the guest? Cheers, Chris.
Re: [PATCH 05/10] Don't call apic functions directly from kvm code
On 03/17/2010 04:00 PM, Glauber Costa wrote: On Tue, Mar 09, 2010 at 03:27:02PM +0200, Avi Kivity wrote: On 02/26/2010 10:12 PM, Glauber Costa wrote: It is actually not necessary to call a tpr function to save and load cr8, as cr8 is part of the processor state, and thus, it is much easier to just add it to CPUState. As for apic base, wrap kvm usages, so we can call either the qemu device, or the in kernel version.

 }
+static void kvm_set_apic_base(CPUState *env, uint64_t val)
+{
+    if (!kvm_irqchip_in_kernel())
+        cpu_set_apic_base(env, val);

What if it is in kernel? Just ignored? Doesn't seem right. At this point it is right, because there is no irqchip in kernel yet. In a later patch, irqchip in kernel begins to exist, and this function gets filled. Ok. In the future please code things like that without the if (), and add it when you introduce the other side. Helps fend off nit-pickers.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 06:22 PM, Avi Kivity wrote: Also, if my guest kernel issues (say) three small writes, one at the start of the disk, one in the middle, one at the end, and then does a flush, can virtio really express this as one non-contiguous O_DIRECT write (the three components of which can be reordered by the elevator with respect to one another) rather than three distinct O_DIRECT writes which can't be permuted? Can qemu issue a write like that? cache=writeback + flush allows this to be optimised by the block layer in the normal way. Guest side virtio will send this as three requests followed by a flush. Qemu will issue these as three distinct requests and then flush. The requests are marked, as Christoph says, in a way that limits their reorderability, and perhaps if we fix these two problems performance will improve. Something that comes to mind is merging of flush requests. If N guests issue one write and one flush each, we should issue N writes and just one flush - a flush for the disk applies to all volumes on that disk. Chris, can you carry out an experiment? Write a program that pwrite()s a byte to a file at the same location repeatedly, with the file opened using O_SYNC. Measure the write rate, and run blktrace on the host to see what the disk (/dev/sda, not the volume) sees. Should be a (write, flush, write, flush) per pwrite pattern or similar (for writing the data and a journal block, perhaps even three writes will be needed). Then scale this across multiple guests, measure and trace again. If we're lucky, the flushes will be coalesced, if not, we need to work on it.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
Avi Kivity a...@redhat.com writes: Chris, can you carry out an experiment? Write a program that pwrite()s a byte to a file at the same location repeatedly, with the file opened using O_SYNC. Measure the write rate, and run blktrace on the host to see what the disk (/dev/sda, not the volume) sees. Should be a (write, flush, write, flush) per pwrite pattern or similar (for writing the data and a journal block, perhaps even three writes will be needed). Then scale this across multiple guests, measure and trace again. If we're lucky, the flushes will be coalesced, if not, we need to work on it. Sure, sounds like an excellent plan. I don't have a test machine at the moment as the last host I was using for this has gone into production, but I'm due to get another one to install later today or first thing tomorrow which would be ideal for doing this. I'll follow up with the results once I have them. Cheers, Chris.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote: They should be reorderable. Otherwise host filesystems on several volumes would suffer the same problems. They are reorderable, just not as extremely as the page cache. Remember that the request queue really is just a relatively small queue of outstanding I/O, and that is absolutely intentional. Large scale _caching_ is done by the VM in the pagecache, with all the usual aging, pressure, etc algorithms applied to it. The block devices have a relatively small fixed size request queue associated with them to facilitate request merging and limited reordering and having fully set up I/O requests for the device.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 06:47 PM, Chris Webb wrote: Avi Kivity a...@redhat.com writes: Chris, can you carry out an experiment? Write a program that pwrite()s a byte to a file at the same location repeatedly, with the file opened using O_SYNC. Measure the write rate, and run blktrace on the host to see what the disk (/dev/sda, not the volume) sees. Should be a (write, flush, write, flush) per pwrite pattern or similar (for writing the data and a journal block, perhaps even three writes will be needed). Then scale this across multiple guests, measure and trace again. If we're lucky, the flushes will be coalesced, if not, we need to work on it. Sure, sounds like an excellent plan. I don't have a test machine at the moment as the last host I was using for this has gone into production, but I'm due to get another one to install later today or first thing tomorrow which would be ideal for doing this. I'll follow up with the results once I have them. Meanwhile I looked at the code, and it looks bad. There is an IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before issuing it. In any case, qemu doesn't use it as far as I could tell, and even if it did, device-mapper doesn't implement the needed ->aio_fsync() operation. So, there's a lot of plumbing needed before we can get cache flushes merged into each other. Given cache=writeback does allow merging, I think we explained part of the problem at least.
Re: [Qemu-devel] [PATCH] Fix SIGFPE for vnc display of width/height = 1
Anthony Liguori wrote: On 03/08/2010 08:34 AM, Chris Webb wrote: During boot, the screen gets resized to height 1 and a mouse click at this point will cause a division by zero when calculating the absolute pointer position from the pixel (x, y). Return a click in the middle of the screen instead in this case. Signed-off-by: Chris Webb ch...@arachsys.com Applied. Thanks. Also queued it to stable? Alex
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote: Chris, can you carry out an experiment? Write a program that pwrite()s a byte to a file at the same location repeatedly, with the file opened using O_SYNC. Measure the write rate, and run blktrace on the host to see what the disk (/dev/sda, not the volume) sees. Should be a (write, flush, write, flush) per pwrite pattern or similar (for writing the data and a journal block, perhaps even three writes will be needed). Then scale this across multiple guests, measure and trace again. If we're lucky, the flushes will be coalesced, if not, we need to work on it. As the person who has written quite a bit of the current O_SYNC implementation and also reviewed the rest of it I can tell you that those flushes won't be coalesced. If we always rewrite the same block we do the cache flush from the fsync method and there is nothing to coalesce there. If you actually do modify metadata (e.g. by using the new real O_SYNC instead of the old one that always was O_DSYNC that I introduced in 2.6.33 but that isn't picked up by userspace yet) you might hit a very limited transaction merging window in some filesystems, but it's generally very small for a good reason. If it were too large we'd make one process wait for I/O in another just because we might expect transactions to coalesce later. There's been some long discussion about that fsync transaction batching tuning for ext3 a while ago.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote: Meanwhile I looked at the code, and it looks bad. There is an IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before issuing it. In any case, qemu doesn't use it as far as I could tell, and even if it did, device-mapper doesn't implement the needed ->aio_fsync() operation. No one implements it, and all surrounding code is dead wood. It would require us to do asynchronous pagecache operations, which involve major surgery of the VM code. Patches to do this were rejected multiple times.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 06:58 PM, Christoph Hellwig wrote: On Wed, Mar 17, 2010 at 06:53:34PM +0200, Avi Kivity wrote: Meanwhile I looked at the code, and it looks bad. There is an IO_CMD_FDSYNC, but it isn't tagged, so we have to drain the queue before issuing it. In any case, qemu doesn't use it as far as I could tell, and even if it did, device-mapper doesn't implement the needed ->aio_fsync() operation. No one implements it, and all surrounding code is dead wood. It would require us to do asynchronous pagecache operations, which involve major surgery of the VM code. Patches to do this were rejected multiple times. Pity. What about the O_DIRECT aio case? It's ridiculous that you can submit async write requests but have to wait synchronously for them to actually hit the disk if you have a write cache.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 06:52 PM, Christoph Hellwig wrote: On Wed, Mar 17, 2010 at 06:22:29PM +0200, Avi Kivity wrote: They should be reorderable. Otherwise host filesystems on several volumes would suffer the same problems. They are reorderable, just not as extremely as the page cache. Remember that the request queue really is just a relatively small queue of outstanding I/O, and that is absolutely intentional. Large scale _caching_ is done by the VM in the pagecache, with all the usual aging, pressure, etc algorithms applied to it. We already have the large scale caching and stuff running in the guest. We have a stream of optimized requests coming out of guests, running the same algorithm again shouldn't improve things. The host has an opportunity to do inter-guest optimization, but given each guest has its own disk area, I don't see how any reordering or merging could help here (beyond sorting guests according to disk order). The block devices have a relatively small fixed size request queue associated with them to facilitate request merging and limited reordering and having fully set up I/O requests for the device. We should enlarge the queues, increase request reorderability, and merge flushes (delay flushes until after unrelated writes, then adjacent flushes can be collapsed). Collapsing flushes should get us better than linear scaling (since we collapse N writes + M flushes into N writes and 1 flush). However the writes themselves scale worse than linearly, since they now span a larger disk space and cause higher seek penalties.
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On 03/17/2010 06:57 PM, Christoph Hellwig wrote: On Wed, Mar 17, 2010 at 06:40:30PM +0200, Avi Kivity wrote: Chris, can you carry out an experiment? Write a program that pwrite()s a byte to a file at the same location repeatedly, with the file opened using O_SYNC. Measure the write rate, and run blktrace on the host to see what the disk (/dev/sda, not the volume) sees. Should be a (write, flush, write, flush) per pwrite pattern or similar (for writing the data and a journal block, perhaps even three writes will be needed). Then scale this across multiple guests, measure and trace again. If we're lucky, the flushes will be coalesced, if not, we need to work on it. As the person who has written quite a bit of the current O_SYNC implementation and also reviewed the rest of it I can tell you that those flushes won't be coalesced. If we always rewrite the same block we do the cache flush from the fsync method and there is nothing to coalesce there. If you actually do modify metadata (e.g. by using the new real O_SYNC instead of the old one that always was O_DSYNC that I introduced in 2.6.33 but that isn't picked up by userspace yet) you might hit a very limited transaction merging window in some filesystems, but it's generally very small for a good reason. If it were too large we'd make one process wait for I/O in another just because we might expect transactions to coalesce later. There's been some long discussion about that fsync transaction batching tuning for ext3 a while ago. I definitely don't expect flush merging for a single guest, but for multiple guests there is certainly an opportunity for merging. Most likely we don't take advantage of it and that's one of the problems. Copying data into pagecache so that we can merge the flushes seems like a very unsatisfactory implementation. 
Re: [PATCH][RF C/T/D] Unmapped page cache control - via boot parameter
On Wed, Mar 17, 2010 at 03:14:10PM +, Chris Webb wrote: Anthony Liguori anth...@codemonkey.ws writes: This really gets down to your definition of safe behaviour. As it stands, if you suffer a power outage, it may lead to guest corruption. While we are correct in advertising a write-cache, write-caches are volatile and should a drive lose power, it could lead to data corruption. Enterprise disks tend to have battery backed write caches to prevent this. In the set up you're emulating, the host is acting as a giant write cache. Should your host fail, you can get data corruption. Hi Anthony. I suspected my post might spark an interesting discussion! Before considering anything like this, we did quite a bit of testing with OSes in qemu-kvm guests running filesystem-intensive work, using an ipmitool power off to kill the host. I didn't manage to corrupt any ext3, ext4 or NTFS filesystems despite these efforts. Is your claim here that:- (a) qemu doesn't emulate a disk write cache correctly; or (b) operating systems are inherently unsafe running on top of a disk with a write-cache; or (c) installations that are already broken and lose data with a physical drive with a write-cache can lose much more in this case because the write cache is much bigger? Following Christoph Hellwig's patch series from last September, I'm pretty convinced that (a) isn't true apart from the inability to disable the write-cache at run-time, which is something that neither recent linux nor windows seem to want to do out of the box. Given that modern SATA drives come with fairly substantial write-caches nowadays which operating systems leave on without widespread disaster, I don't really believe in (b) either, at least for the ide and scsi case. Filesystems know they have to flush the disk cache to avoid corruption. (Virtio makes the write cache invisible to the OS except in linux 2.6.32+ so I know virtio-blk has to be avoided for current windows and obsolete linux when writeback caching is on.) 
I can certainly imagine (c) might be the case, although when I use strace to watch the IO to the block device, I see pretty regular fdatasyncs being issued by the guests, interleaved with the writes, so I'm not sure how likely the problem would be in practice. Perhaps my test guests were unrepresentatively well-behaved. However, the potentially unlimited time-window for loss of incorrectly unsynced data is also something one could imagine fixing at the qemu level. Perhaps I should be implementing something like cache=writeback,flushtimeout=N which, upon a write being issued to the block device, starts an N second timer if it isn't already running. The timer is destroyed on flush, and if it expires before it's destroyed, a gratuitous flush is sent. Do you think this is worth doing? Just a simple 'while sleep 10; do sync; done' on the host even! We've used cache=none and cache=writethrough, and whilst performance is fine with a single guest accessing a disk, when we chop the disks up with LVM and run even a small handful of guests, the constant seeking to serve tiny synchronous IOs leads to truly abysmal throughput---we've seen less than 700kB/s streaming write rates within guests when the backing store is capable of 100MB/s. With cache=writeback, there's still IO contention between guests, but the write granularity is a bit coarser, so the host's elevator seems to get a bit more of a chance to help us out and we can at least squeeze out 5-10MB/s from two or three concurrently running guests, getting a total of 20-30% of the performance of the underlying block device rather than a total of around 5%. Hi Chris, Are you using CFQ in the host? What is the host kernel version? I am not sure what is the problem here but you might want to play with the IO controller and put these guests in individual cgroups and see if you get better throughput even with cache=writethrough. 
If the problem is that if sync writes from different guests get intermixed resulting in more seeks, IO controller might help as these writes will now go on different group service trees and in CFQ, we try to service requests from one service tree at a time for a period before we switch the service tree. The issue will be that all the logic is in CFQ and it works at leaf nodes of storage stack and not at LVM nodes. So first you might want to try it with single partitioned disk. If it helps, then it might help with LVM configuration also (IO control working at leaf nodes). Thanks Vivek
Re: [PATCH] vhost: fix error handling in vring ioctls
Acked-by: cha...@google.com

On Wed, Mar 17, 2010 at 7:42 AM, Michael S. Tsirkin m...@redhat.com wrote: Stanse found a locking problem in vhost_set_vring: several returns from VHOST_SET_VRING_KICK, VHOST_SET_VRING_CALL, VHOST_SET_VRING_ERR with the vq->mutex held. Fix these up.

Reported-by: Jiri Slaby jirisl...@gmail.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/vhost/vhost.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7cd55e0..7bd7a1e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -476,8 +476,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		if (r < 0)
 			break;
 		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
-		if (IS_ERR(eventfp))
-			return PTR_ERR(eventfp);
+		if (IS_ERR(eventfp)) {
+			r = PTR_ERR(eventfp);
+			break;
+		}
 		if (eventfp != vq->kick) {
 			pollstop = filep = vq->kick;
 			pollstart = vq->kick = eventfp;
@@ -489,8 +491,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		if (r < 0)
 			break;
 		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
-		if (IS_ERR(eventfp))
-			return PTR_ERR(eventfp);
+		if (IS_ERR(eventfp)) {
+			r = PTR_ERR(eventfp);
+			break;
+		}
 		if (eventfp != vq->call) {
 			filep = vq->call;
 			ctx = vq->call_ctx;
@@ -505,8 +509,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 		if (r < 0)
 			break;
 		eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd);
-		if (IS_ERR(eventfp))
-			return PTR_ERR(eventfp);
+		if (IS_ERR(eventfp)) {
+			r = PTR_ERR(eventfp);
+			break;
+		}
 		if (eventfp != vq->error) {
 			filep = vq->error;
 			vq->error = eventfp;
--
1.7.0.18.g0d53a5
Re: [PATCH 2/3] kvm: svm: reset cr0 properly on vcpu reset
Eduardo Habkost wrote: svm_vcpu_reset() was not properly resetting the contents of the guest-visible cr0 register, causing the following issue: https://bugzilla.redhat.com/show_bug.cgi?id=525699 Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine with paging enabled, making the vcpu get a pagefault exception while trying to run it. Instead of setting vmcb->save.cr0 directly, the new code just resets kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0(). kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode. Signed-off-by: Eduardo Habkost ehabk...@redhat.com Should this go into -stable? Alex
Re: qemu-kvm crashes with Assertion ... failed.
On Sun, Mar 14, 2010 at 09:57:52AM +0100, André Weidemann wrote: Hi, I cloned the qemu-kvm git repository today with git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git qemu-kvm-2010-03-14, ran configure and compiled it and did a make install. Everything went fine without warnings or errors. For configure output take a look here: http://pastebin.com/BL4DYCRY Here is my Server Hardware: Asus P5Q Mainboard Intel Q9300 8GB RAM RAID5 with mdadm consisting of 4x 1TB disks The volume /dev/storage/Windows7test mentioned below is on this RAID5. I ran my virtual machine with the following command: qemu-system-x86_64 -cpu core2duo -vga cirrus -boot order=ndc -vnc 192.168.3.42:2 -k de -smp 4,cores=4 -drive file=/vmware/Windows7Test_600G.img,if=ide,index=0,cache=writeback -m 1024 -net nic,model=e1000,macaddr=DE:AD:BE:EF:12:3A -net tap,script=/usr/local/bin/qemu-ifup -monitor pty -name Windows7test,process=Windows7test -drive file=/dev/storage/Windows7test,if=ide,index=1,cache=none,aio=native Andre, Can you try qemu-kvm-0.12.3 ? Windows7Test_600G.img is a qcow2 file and contains a Windows 7 Pro image. /dev/storage/Windows7test is formatted with XFS After starting the machine with the above command line, I booted into an Ubuntu 9.10 x86_64 Live Image via PXE and mounted /dev/sdb1 (/dev/storage/Windows7test) under /mnt. I then did cd /mnt/ and ran iozone -Ra -g 2G -b /tmp/iozone-aoi-linux-xls iozone ran some tests and then kvm simply quit with the following error message: qemu-system-x86_64: /usr/local/src/qemu-kvm-2010-03-10/hw/ide/internal.h:510: bmdma_active_if: Assertion `bmdma-unit != (uint8_t)-1' failed. 
/var/log/syslog contained the following: Mar 14 09:18:14 server kernel: [318080.627468] kvm: 1361: cpu0 kvm_set_msr_common: MSR_IA32_MCG_STATUS 0x0, nop Mar 14 09:18:14 server kernel: [318080.627473] kvm: 1361: cpu0 kvm_set_msr_common: MSR_IA32_MCG_CTL 0x, nop Mar 14 09:18:14 server kernel: [318080.627476] kvm: 1361: cpu0 unhandled wrmsr: 0x400 data Mar 14 09:18:14 server kernel: [318080.627506] kvm: 1361: cpu1 kvm_set_msr_common: MSR_IA32_MCG_STATUS 0x0, nop Mar 14 09:18:14 server kernel: [318080.627509] kvm: 1361: cpu1 kvm_set_msr_common: MSR_IA32_MCG_CTL 0x, nop Mar 14 09:18:14 server kernel: [318080.627511] kvm: 1361: cpu1 unhandled wrmsr: 0x400 data Mar 14 09:18:14 server kernel: [318080.627538] kvm: 1361: cpu2 kvm_set_msr_common: MSR_IA32_MCG_STATUS 0x0, nop Mar 14 09:18:14 server kernel: [318080.627540] kvm: 1361: cpu2 kvm_set_msr_common: MSR_IA32_MCG_CTL 0x, nop Mar 14 09:18:14 server kernel: [318080.627543] kvm: 1361: cpu2 unhandled wrmsr: 0x400 data I was able to reproduce this error 3 times in a row. Regards, André
Re: [PATCH] KVM: MMU: Disassociate direct maps from guest levels
On Sun, Mar 14, 2010 at 10:22:52AM +0200, Avi Kivity wrote: Direct maps are linear translations for a section of memory, used for real mode or with large pages. As such, they are independent of the guest levels. Teach the mmu about this by making page->role.glevels = 0 for direct maps. This allows direct maps to be shared among real mode and the various paging modes. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/mmu.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b137515..a984bc1 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1328,6 +1328,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, role = vcpu->arch.mmu.base_role; role.level = level; role.direct = direct; + if (role.direct) + role.glevels = 0; role.access = access; if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) { quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level)); -- 1.7.0.2 Isn't this what happens already, since for tdp base_role.glevels is not initialized?
Re: [PATCH] KVM: VMX: Disable unrestricted guest when EPT disabled
Marcelo Tosatti wrote: On Fri, Nov 27, 2009 at 04:46:26PM +0800, Sheng Yang wrote: Otherwise would cause VMEntry failure when using ept=0 on unrestricted guest supported processors. Signed-off-by: Sheng Yang sh...@linux.intel.com Applied, thanks. So without this patch kvm breaks with ept=0? Sounds like a stable candidate to me. Alex
Re: [PATCH] vhost: fix error handling in vring ioctls
Acked-by: Laurent Chavey cha...@google.com On Wed, Mar 17, 2010 at 10:54 AM, Laurent Chavey cha...@google.com wrote: Acked-by: cha...@google.com On Wed, Mar 17, 2010 at 7:42 AM, Michael S. Tsirkin m...@redhat.com wrote: Stanse found a locking problem in vhost_set_vring: several returns from VHOST_SET_VRING_KICK, VHOST_SET_VRING_CALL, VHOST_SET_VRING_ERR with the vq->mutex held. Fix these up. Reported-by: Jiri Slaby jirisl...@gmail.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/vhost/vhost.c | 18 -- 1 files changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 7cd55e0..7bd7a1e 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -476,8 +476,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) if (r < 0) break; eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd); - if (IS_ERR(eventfp)) - return PTR_ERR(eventfp); + if (IS_ERR(eventfp)) { + r = PTR_ERR(eventfp); + break; + } if (eventfp != vq->kick) { pollstop = filep = vq->kick; pollstart = vq->kick = eventfp; @@ -489,8 +491,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) if (r < 0) break; eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd); - if (IS_ERR(eventfp)) - return PTR_ERR(eventfp); + if (IS_ERR(eventfp)) { + r = PTR_ERR(eventfp); + break; + } if (eventfp != vq->call) { filep = vq->call; ctx = vq->call_ctx; @@ -505,8 +509,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp) if (r < 0) break; eventfp = f.fd == -1 ? NULL : eventfd_fget(f.fd); - if (IS_ERR(eventfp)) - return PTR_ERR(eventfp); + if (IS_ERR(eventfp)) { + r = PTR_ERR(eventfp); + break; + } if (eventfp != vq->error) { filep = vq->error; vq->error = eventfp; -- 1.7.0.18.g0d53a5
Re: 2 serial ports?
On Wed, Mar 17, 2010 at 3:35 AM, Michael Tokarev m...@tls.msk.ru wrote: Neo Jia wrote: May I ask if it is possible to bind a real physical serial port to a guest? It is all described in the documentation, quite a long list of various things you can attach to a virtual serial port, incl. a real one. I have tried -serial /dev/ttyS0 but I can't use it to debug my Windows guest. Thanks, Neo /mjt -- I would remember that if researchers were not ambitious probably today we haven't the technology we are using!
Re: [PATCH][RFC/T/D] Unmapped page cache control - via boot parameter
Vivek Goyal vgo...@redhat.com writes: Are you using CFQ in the host? What is the host kernel version? I am not sure what is the problem here but you might want to play with IO controller and put these guests in individual cgroups and see if you get better throughput even with cache=writethrough. Hi. We're using the deadline IO scheduler on 2.6.32.7. We got better performance from deadline than from cfq when we last tested, which was admittedly around the 2.6.30 timescale so is now a rather outdated measurement. If the problem is that if sync writes from different guests get intermixed resulting in more seeks, IO controller might help as these writes will now go on different group service trees and in CFQ, we try to service requests from one service tree at a time for a period before we switch the service tree. Thanks for the suggestion: I'll have a play with this. I currently use /sys/kernel/uids/N/cpu_share with one UID per guest to divide up the CPU between guests, but this could just as easily be done with a cgroup per guest if a side-effect is to provide a hint about IO independence to CFQ. Best wishes, Chris.
Re: 2 serial ports?
Neo Jia wrote: On Wed, Mar 17, 2010 at 3:35 AM, Michael Tokarev m...@tls.msk.ru wrote: Neo Jia wrote: May I ask if it is possible to bind a real physical serial port to a guest? It is all described in the documentation, quite a long list of various things you can attach to a virtual serial port, incl. a real one. I have tried -serial /dev/ttyS0 but I can't use it to debug my Windows guest. That's an entirely different issue -- inability to debug windows guests. Please don't hijack other threads for unrelated issues -- it makes finding information and replying more difficult. If it does not work for you, ask in a new thread. But before that, try to research the issue a bit; I've seen several discussions about debugging guests over serial port in kvm. Besides, I've no idea what you are really trying to do - debugging a guest is much easier in kvm than setting up another HOST and connecting two HOSTS over a null-modem serial cable /mjt
Re: [PATCH] add xchg ax, reg emulator test
On Tue, Mar 16, 2010 at 02:42:52PM +0200, Gleb Natapov wrote: Add test for opcodes 0x90-0x9f emulation Signed-off-by: Gleb Natapov g...@redhat.com diff --git a/kvm/user/test/x86/realmode.c b/kvm/user/test/x86/realmode.c index bc6b27f..bfc2942 100644 Applied, thanks.
Re: Broken loadvm ?
On Tue, Mar 16, 2010 at 05:25:13PM +0200, Alpár Török wrote: PS: It just occurred to me that it does indeed freeze and cause 100% CPU usage. At least I can say for sure that neither network, serial line, keyboard, nor mouse work if loadvm is loaded from the command line. If loaded from the monitor, everything seems to work, except the mouse. After a -loadvm from the command line, repeating the command from the monitor doesn't unfreeze it. I am really stuck with this. Any help is greatly appreciated, as downgrading is not an option. Upgrade to qemu-kvm-0.12.3?
Re: [PATCH rework] KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s error handling
On Mon, Mar 15, 2010 at 10:13:30PM +0900, Takuya Yoshikawa wrote: kvm_coalesced_mmio_init() keeps holding the addresses of a coalesced mmio ring page and dev even after it has freed them. Also, if this function fails, though it might be rare, it seems to suggest a serious system state: so we'd better stop the work following kvm_create_vm(). This patch fixes these problems. We move the coalesced mmio's initialization out of kvm_create_vm(). This seems natural because it includes a registration which can be done only when the vm is successfully created. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- virt/kvm/coalesced_mmio.c | 2 ++ virt/kvm/kvm_main.c | 12 2 files changed, 10 insertions(+), 4 deletions(-) Applied, thanks.
Re: [PATCH] KVM: Cleanup: change to use bool return values
On Mon, Mar 15, 2010 at 05:29:09PM +0800, Gui Jianfeng wrote: Make use of bool as return values, and remove some useless bool value converting. Thanks to Avi for pointing this out. Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com Applied, thanks.
qemu-kvm 0.12.3, Slackware 13 host and Windows XP guest - time drift a lot
Hey all, I'm working on moving from a mixture of physical servers and virtualized servers running on Virtualbox, to a pure KVM setup. But I'm having some problems with my Windows XP guests in my test-setup. This is the host I'm testing on: CPU: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz RAM: 8GB 2x320GB WD SATA disks (one for host OS and one for KVM guest images) 2x1GBs Intel nics (bonded) Host OS is Slackware 13 with the following kernels: 2.6.29.6-huge, 2.6.29.6-generic, 2.6.33 and 2.6.33.1 qemu-kvm is 0.12.3 My Linux guests work like a charm. When they boot up I do a single ntpdate -b europe.pool.ntp.org and after that the time stays in near perfect sync with the host, with no ntpd running on the guests. My Windows XP guests on the other hand drift backwards in time, especially when there's load on the guest, for example when I'm copying a large file from my samba server to the Windows XP guest. The guest can easily lose 10 minutes while copying a 600MB file. Or if I start a few browsers and point them at some horrible flash heavy sites and just let them sit there, then the VM also starts losing a lot of time real fast. This is the commandline I use to start the Windows XP guests: qemu-system-x86_64 -hda winxppro.raw -boot c -m 1024 -vnc :1 -k da -smp 1 -localtime -daemonize -name qemu_winxppro,process=qemu_winxppro -net nic,macaddr=de:ad:be:ef:00:01,model=rtl8139 -net tap -runas kvm I use the same commandline for my Linux guests, except the nic is virtio. I'm at my wits' end. I've tried the -tdf option with no success. I've tried setting various -rtc options with no success. Could it be I'm missing some key-component in the kernel? Or is there perhaps some dev version of qemu-kvm I could/should try? According to some of the #kvm residents, this should just work (tm), but I simply cannot make it happen. Any and all advice are more than welcome.
:o) /Thomas
Re: qemu-kvm 0.12.3, Slackware 13 host and Windows XP guest - time drift a lot
On 03/17/2010 09:22 AM, Thomas Løcke wrote: Hey all, I'm working on moving from a mixture of physical servers and virtualized servers running on Virtualbox, to a pure KVM setup. But I'm having some problems with my Windows XP guests in my test-setup. This is the host I'm testing on: CPU: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz RAM: 8GB 2x320GB WD SATA disks (one for host OS and one for KVM guest images) 2x1GBs Intel nics (bonded) Host OS is Slackware 13 with the following kernels: 2.6.29.6-huge, 2.6.29.6-generic, 2.6.33 and 2.6.33.1 qemu-kvm is 0.12.3 qemu's been changing a lot, might be best to build from the actual git repository, which is 0.12.50 now. My Linux guests work like a charm. When they boot up I do a single ntpdate -b europe.pool.ntp.org and after that the time stays in near perfect sync with the host, with no ntpd running on the guests. My Windows XP guests on the other hand drift backwards in time, especially when there's load on the guest, for example when I'm copying a large file from my samba server to the Windows XP guest. The guest can easily lose 10 minutes while copying a 600MB file. Or if I start a few browsers and point them at some horrible flash heavy sites and just let them sit there, then the VM also starts losing a lot of time real fast. What does your host CPU load get up to? You only have a single core? This is the commandline I use to start the Windows XP guests: qemu-system-x86_64 -hda winxppro.raw -boot c -m 1024 -vnc :1 -k da -smp 1 -localtime -daemonize -name qemu_winxppro,process=qemu_winxppro -net nic,macaddr=de:ad:be:ef:00:01,model=rtl8139 -net tap -runas kvm I use the same commandline for my Linux guests, except the nic is virtio. I'm at my wits' end. I've tried the -tdf option with no success. I've tried setting various -rtc options with no success. Including -rtc-td-hack ? Could it be I'm missing some key-component in the kernel? Or is there perhaps some dev version of qemu-kvm I could/should try?
According to some of the #kvm residents, this should just work (tm), but I simply cannot make it happen. Any and all advice are more than welcome. As always, make sure you are running the latest and greatest modules, those matter even more than the kernel, and check for any warning messages in dmesg and qemu output. Zach
[ kvm-Bugs-2972152 ] guest crash when -cpu kvm64
Bugs item #2972152, was opened at 2010-03-17 14:43 Message generated for change (Tracker Item Submitted) made by high33 You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2972152&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: libkvm Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: hugohiggins (high33) Assigned to: Nobody/Anonymous (nobody) Summary: guest crash when -cpu kvm64 Initial Comment: When using -cpu kvm64 guest crashes when X starts up. dmesg on hypervisor says: [6149047.906364] kvm: 29020: cpu0 unhandled rdmsr: 0xc0010112 Guest boots OK without -cpu parameter cpu: dual opteron 2435 (12 cores total) ram: 32gig host dist: ubuntu 9.04 host kernel: 2.6.28-16-generic #55-Ubuntu SMP guest dist: xubuntu-9.10-amd64 # /usr/local/qemu-kvm-0.12.3/bin/qemu-system-x86_64 -name ubu64 localhost:69 -M pc \ -m 2048 -boot d -vga std \ -net nic,macaddr=BA:DD:C0:FF:EE:F6,model=virtio -net vde \ -drive file=/dev/sdp,if=scsi,boot=on \ -cpu kvm64 \ -cdrom iso/xubuntu-9.10-desktop-amd64.iso -k en-us -localtime -sdl -vnc localhost:69 -daemonize -usbdevice tablet -- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2972152&group_id=180599
SIGSEGV with -smp 17+, and error handling around...
When run with -smp 17 or greater, kvm fails like this: $ kvm -smp 17 kvm_create_vcpu: Invalid argument kvm_setup_mce FAILED: Invalid argument KVM_SET_LAPIC failed Segmentation fault $ _ In qemu-kvm.c, the kvm_create_vcpu() routine (which is used in a vcpu thread to set up the vcpu) is declared as void, i.e., no error return. And the code that calls it blindly assumes that it will never fail... But the first error message above is from the kernel, which - apparently - refuses to create the 17th vCPU. Hence we've a vcpu thread which is empty/dummy and not even fully initialized... so it fails later in the game. This all looks quite... raw, not polished ;) Can we somehow handle the (several possible) errors in that (and other) places, and how can we ever act on them? Abort? Warn the user and reduce the number of vcpus accordingly (seems wrong, esp. if it were some first vcpus or ones in the middle which failed)... Thanks! /mjt
-enable-kvm - can it be a required option?
What I mean is: if asked to enable kvm but kvm can't be initialized for some reason (lack of virt extensions on the cpu, permission denied and so on), can we stop with a fatal error instead of continuing in emulated mode? Or maybe with another option, like -require-kvm? I understand that -enable-kvm is now in upstream qemu too, and _there_ it means something different, that is, it enables something that is disabled by default. But even with that, if user asks for something and that something isn't available, it seems like a good idea to stop here instead of producing a warning and continuing... This is especially true for kvm where -enable-kvm is the default anyway. I see more and more people are using this option now in a hope that kvm will actually stop when no virt extensions are available. It was my first reaction too, wow, now I can force it to require kvm extensions instead of running 1000 times slower!. So this has something to think about, it looks like... ;) Thanks! /mjt
Re: -enable-kvm - can it be a required option?
On 03/17/2010 03:18 PM, Michael Tokarev wrote: What I mean is: if asked to enable kvm but kvm can't be initialized for some reason (lack of virt extensions on the cpu, permission denied and so on), can we stop with a fatal error instead of continuing in emulated mode? What I've been thinking, is that we should make kvm enablement a -cpu option. Something like: -cpu host,accel=kvm -cpu host,accel=tcg -cpu host,accel=kvm:tcg (1) would be KVM only, (2) would be TCG only, (3) would be KVM falling back to TCG. What's nice about this approach, is that we already pull CPU model definitions from a global config file which means that you could tweak this parameter to your liking. Regards, Anthony Liguori
Re: qemu-kvm crashes with Assertion ... failed.
On 17.03.2010 19:22, Marcelo Tosatti wrote: On Sun, Mar 14, 2010 at 09:57:52AM +0100, André Weidemann wrote: Hi, I cloned the qemu-kvm git repository today with git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git qemu-kvm-2010-03-14, ran configure and compiled it and did a make install. Everything went fine without warnings or errors. For configure output take a look here: http://pastebin.com/BL4DYCRY Here is my Server Hardware: Asus P5Q Mainboard Intel Q9300 8GB RAM RAID5 with mdadm consisting of 4x 1TB disks The volume /dev/storage/Windows7test mentioned below is on this RAID5. I ran my virtual machine with the following command: qemu-system-x86_64 -cpu core2duo -vga cirrus -boot order=ndc -vnc 192.168.3.42:2 -k de -smp 4,cores=4 -drive file=/vmware/Windows7Test_600G.img,if=ide,index=0,cache=writeback -m 1024 -net nic,model=e1000,macaddr=DE:AD:BE:EF:12:3A -net tap,script=/usr/local/bin/qemu-ifup -monitor pty -name Windows7test,process=Windows7test -drive file=/dev/storage/Windows7test,if=ide,index=1,cache=none,aio=native Andre, Can you try qemu-kvm-0.12.3 ? I did the following: git clone git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git qemu-kvm-2010-03-17 cd qemu-kvm-2010-03-17 git checkout -b test qemu-kvm-0.12.3 ./configure make -j6 make install I started the VM again exactly as I did the last time and it crashed again with the same error message. qemu-system-x86_64: /usr/local/src/qemu-kvm-2010-03-17/hw/ide/internal.h:507: bmdma_active_if: Assertion `bmdma->unit != (uint8_t)-1' failed. André
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote: Right, but there is a scope between kvm_guest_enter and really running in the guest os, where a perf event might overflow. Anyway, the scope is very narrow, I will change it to use the flag PF_VCPU. There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky. I'm not sure if vmexit breaks NMI context or not. A hardware NMI context isn't reentrant till an IRET. YangSheng would like to double check it. After more checking, I think VMX won't retain NMI blocking state for the host. That means, if an NMI happened while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop the re-entrance of NMI handling code because int $2 doesn't block a following NMI. And if the NMI ordering is not important (I think so), then we need to generate a real NMI in the current after-vmexit code. Having the APIC send an NMI IPI to itself seems a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening... You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to. Zach
Re: [PATCH 2/3] kvm: svm: reset cr0 properly on vcpu reset
On Wed, Mar 17, 2010 at 07:17:32PM +0100, Alexander Graf wrote: Eduardo Habkost wrote: svm_vcpu_reset() was not properly resetting the contents of the guest-visible cr0 register, causing the following issue: https://bugzilla.redhat.com/show_bug.cgi?id=525699 Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine with paging enabled, making the vcpu get a pagefault exception while trying to run it. Instead of setting vmcb->save.cr0 directly, the new code just resets kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0(). kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode. Signed-off-by: Eduardo Habkost ehabk...@redhat.com Should this go into -stable? I think so. The patch is from October, was -stable branched before that? -- Eduardo
Re: [PATCH 2/3] kvm: svm: reset cr0 properly on vcpu reset
On 17.03.2010, at 22:42, Eduardo Habkost wrote: On Wed, Mar 17, 2010 at 07:17:32PM +0100, Alexander Graf wrote: Eduardo Habkost wrote: svm_vcpu_reset() was not properly resetting the contents of the guest-visible cr0 register, causing the following issue: https://bugzilla.redhat.com/show_bug.cgi?id=525699 Without resetting cr0 properly, the vcpu was running the SIPI bootstrap routine with paging enabled, making the vcpu get a pagefault exception while trying to run it. Instead of setting vmcb->save.cr0 directly, the new code just resets kvm->arch.cr0 and calls kvm_set_cr0(). The bits that were set/cleared on vmcb->save.cr0 (PG, WP, !CD, !NW) will be set properly by svm_set_cr0(). kvm_set_cr0() is used instead of calling svm_set_cr0() directly to make sure kvm_mmu_reset_context() is called to reset the mmu to nonpaging mode. Signed-off-by: Eduardo Habkost ehabk...@redhat.com Should this go into -stable? I think so. The patch is from October, was -stable branched before that? If I read the diff log correctly 2.6.32 kvm development was branched off end of July 2009. The important question is if this patch fixes a regression introduced by some speedup magic. Alex
Re: [PATCH 23/42] KVM: Activate Virtualization On Demand
On 17.03.2010, at 22:57, Dieter Ries wrote: Am 16.11.2009 13:19, schrieb Avi Kivity: From: Alexander Graf ag...@suse.de diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index f54c4f9..59fe4d5 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -316,7 +316,7 @@ static void svm_hardware_disable(void *garbage) cpu_svm_disable(); } -static void svm_hardware_enable(void *garbage) +static int svm_hardware_enable(void *garbage) { struct svm_cpu_data *svm_data; @@ -325,16 +325,20 @@ static void svm_hardware_enable(void *garbage) struct desc_struct *gdt; int me = raw_smp_processor_id(); + rdmsrl(MSR_EFER, efer); + if (efer & EFER_SVME) + return -EBUSY; + Hi, This is breaking KVM on my Phenom II X4 955. When I start kvm I get this on the terminal: kvm_create_vm: Device or resource busy Could not initialize KVM, will disable KVM support And in dmesg: [ 67.980732] kvm: enabling virtualization on CPU0 failed I commented out the if() and return, and I added 2 printk's there for debugging, and now that's what I see in dmesg when I start kvm: [ 3341.740112] efer is 3329 [ 3341.740113] efer is 3329 [ 3341.740117] efer is 3329 [ 3341.740119] EFER_SVME is 4096 [ 3341.740121] EFER_SVME is 4096 [ 3341.740124] EFER_SVME is 4096 [ 3341.740130] efer is 3329 [ 3341.740132] EFER_SVME is 4096 In hex the values are 0x1000 and 0x0d01 KVM has been working well on this machine before, and it still works well after commenting that part out. I am not sure what the value of this register is supposed to be, but are you sure if (efer & EFER_SVME) is the right condition? According to the printks you show above the condition should never apply. Are you 100% sure you don't have vmware, virtualbox, parallels, whatever running in parallel on that machine? Alex
Re: [PATCH 23/42] KVM: Activate Virtualization On Demand
Am 16.11.2009 13:19, schrieb Avi Kivity: From: Alexander Graf ag...@suse.de diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index f54c4f9..59fe4d5 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -316,7 +316,7 @@ static void svm_hardware_disable(void *garbage) cpu_svm_disable(); } -static void svm_hardware_enable(void *garbage) +static int svm_hardware_enable(void *garbage) { struct svm_cpu_data *svm_data; @@ -325,16 +325,20 @@ static void svm_hardware_enable(void *garbage) struct desc_struct *gdt; int me = raw_smp_processor_id(); + rdmsrl(MSR_EFER, efer); + if (efer & EFER_SVME) + return -EBUSY; + Hi, This is breaking KVM on my Phenom II X4 955. When I start kvm I get this on the terminal: kvm_create_vm: Device or resource busy Could not initialize KVM, will disable KVM support And in dmesg: [ 67.980732] kvm: enabling virtualization on CPU0 failed I commented out the if() and return, and I added 2 printk's there for debugging, and now that's what I see in dmesg when I start kvm: [ 3341.740112] efer is 3329 [ 3341.740113] efer is 3329 [ 3341.740117] efer is 3329 [ 3341.740119] EFER_SVME is 4096 [ 3341.740121] EFER_SVME is 4096 [ 3341.740124] EFER_SVME is 4096 [ 3341.740130] efer is 3329 [ 3341.740132] EFER_SVME is 4096 In hex the values are 0x1000 and 0x0d01 KVM has been working well on this machine before, and it still works well after commenting that part out. I am not sure what the value of this register is supposed to be, but are you sure if (efer & EFER_SVME) is the right condition? cu Dieter
Re: qemu-kvm 0.12.3, Slackware 13 host and Windows XP guest - time drift a lot
On Wed, Mar 17, 2010 at 8:33 PM, Zachary Amsden zams...@redhat.com wrote: What does your host CPU load get up to? You only have a single core? Dual core. If I only run a single Windows VM, the host load is pretty low. Sure it goes up a bit when for example copying a file, but it's nothing serious. It's not getting hammered in any way. Including -rtc-td-hack ? Yup, tried that as suggested by one of the #kvm users. Didn't fix it. But come to think of it, I didn't change any of the other options. Should I have dropped the -localtime and/or -tdf options? I will try again tomorrow. As always, make sure you are running the latest and greatest modules, those matter even more than the kernel, and check for any warning messages in dmesg and qemu output. But don't the latest kvm modules come with the kernel? So if I compile a new kernel, the kvm modules should be updated too, yes? I will try the latest qemu-kvm. /Thomas
Re: [PATCH 23/42] KVM: Activate Virtualization On Demand
On Wed, Mar 17, 2010 at 11:02:40PM +0100, Alexander Graf wrote: On 17.03.2010, at 22:57, Dieter Ries wrote:

Hi, This is breaking KVM on my Phenom II X4 955. When I start kvm I get this on the terminal: kvm_create_vm: Device or resource busy Could not initialize KVM, will disable KVM support And in dmesg: [ 67.980732] kvm: enabling virtualization on CPU0 failed I commented out the if() and return, and I added 2 printk's there for debugging, and now that's what I see in dmesg when I start kvm: [ 3341.740112] efer is 3329 [ 3341.740113] efer is 3329 [ 3341.740117] efer is 3329 [ 3341.740119] EFER_SVME is 4096 [ 3341.740121] EFER_SVME is 4096 [ 3341.740124] EFER_SVME is 4096 [ 3341.740130] efer is 3329 [ 3341.740132] EFER_SVME is 4096 In hex the values are 0x1000 and 0x0d01 KVM has been working well on this machine before, and it still works well after commenting that part out. I am not sure what the value of this register is supposed to be, but are you sure if (efer & EFER_SVME) is the right condition?

According to the printks you show above the condition should never apply. Are you 100% sure you don't have vmware, virtualbox, parallels, whatever running in parallel on that machine?

Definitely. I have virtualbox installed, but haven't used it in months. The others I don't use at all, so they are not installed either. There is nothing running which could cause that. Behaviour is the same when I don't log into KDE but just try this without X, where nearly nothing is started. I noted something more now: When I comment it out once, and start kvm like that, and then remove the comments again, then it works. So I guess the dmesg parts I wrote were not perfect.
It's more like:

I: After reboot, with debugging printk and if condition:

[ 42.089423] efer is d01
[ 42.089425] efer is d01
[ 42.089428] efer is d01
[ 42.089430] EFER_SVME is 1000
[ 42.089431] EFER_SVME is 1000
[ 42.089433] EFER_SVME is 1000
[ 42.089436] efer is 1d01
[ 42.089438] EFER_SVME is 1000
[ 42.089440] kvm: enabling virtualization on CPU0 failed

II: debugging printk, no if condition:

[ 317.355519] efer is d01
[ 317.355522] efer is d01
[ 317.355524] efer is d01
[ 317.355527] EFER_SVME is 1000
[ 317.355528] EFER_SVME is 1000
[ 317.355531] EFER_SVME is 1000
[ 317.355534] efer is 1d01
[ 317.355536] EFER_SVME is 1000

III: debugging printk and if condition:

[ 421.955433] efer is d01
[ 421.955437] efer is d01
[ 421.955440] efer is d01
[ 421.955442] EFER_SVME is 1000
[ 421.955443] EFER_SVME is 1000
[ 421.955445] EFER_SVME is 1000
[ 421.955449] efer is d01
[ 421.955451] EFER_SVME is 1000

This is without reboots in between. So before I use the commented-out version for the first time, it doesn't work; the 2nd time it works. Maybe some initialization problem...

Alex

cu Dieter
Re: qemu-kvm 0.12.3, Slackware 13 host and Windows XP guest - time drift a lot
On 03/17/2010 12:17 PM, Thomas Løcke wrote: On Wed, Mar 17, 2010 at 8:33 PM, Zachary Amsden zams...@redhat.com wrote:

What does your host CPU load get up to? You only have a single core?

Dual core. If I only run a single Windows VM, the host load is pretty low. Sure it goes up a bit when for example copying a file, but it's nothing serious. It's not getting hammered in any way.

Including -rtc-td-hack?

Yup, tried that as suggested by one of the #kvm users. Didn't fix it. But come to think of it, I didn't change any of the other options. Should I have dropped the -localtime and/or -tdf options? I will try again tomorrow.

-rtc localtime is required for Windows to get the proper RTC time, and -tdf should have no effect on Windows guests. You might try -rtc localtime,clock=host,driftfix=slew

As always, make sure you are running the latest and greatest modules, those matter even more than the kernel, and check for any warning messages in dmesg and qemu output.

But don't the latest kvm modules come with the kernel? So if I compile a new kernel, the kvm modules should be updated too, yes? I will try the latest qemu-kvm.

I use git://git.kernel.org/pub/scm/virt/kvm/kvm-kmod.git and track a 2.6 kernel branch directly so I always have the latest module source regardless of host kernel.

Zach
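For reference, Zachary's suggestion maps onto an invocation along these lines (disk image and memory size are placeholders, not from this thread; newer qemu spells the option with an explicit base=):

```shell
# Placeholder image name and memory size; only the -rtc options
# are the ones suggested above.
qemu-kvm -m 1024 -hda winxp.img \
    -rtc base=localtime,clock=host,driftfix=slew
```

driftfix=slew applies to the RTC periodic interrupt and is the usual knob for Windows guest time drift; it replaces the older -rtc-td-hack flag.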
Re: [PATCH 23/42] KVM: Activate Virtualization On Demand
On 17.03.2010, at 23:40, Dieter Ries wrote: On Wed, Mar 17, 2010 at 11:02:40PM +0100, Alexander Graf wrote: On 17.03.2010, at 22:57, Dieter Ries wrote:

Hi, This is breaking KVM on my Phenom II X4 955. When I start kvm I get this on the terminal: kvm_create_vm: Device or resource busy Could not initialize KVM, will disable KVM support And in dmesg: [ 67.980732] kvm: enabling virtualization on CPU0 failed I commented out the if() and return, and I added 2 printk's there for debugging, and now that's what I see in dmesg when I start kvm: [ 3341.740112] efer is 3329 [ 3341.740113] efer is 3329 [ 3341.740117] efer is 3329 [ 3341.740119] EFER_SVME is 4096 [ 3341.740121] EFER_SVME is 4096 [ 3341.740124] EFER_SVME is 4096 [ 3341.740130] efer is 3329 [ 3341.740132] EFER_SVME is 4096 In hex the values are 0x1000 and 0x0d01 KVM has been working well on this machine before, and it still works well after commenting that part out. I am not sure what the value of this register is supposed to be, but are you sure if (efer & EFER_SVME) is the right condition?

According to the printks you show above the condition should never apply. Are you 100% sure you don't have vmware, virtualbox, parallels, whatever running in parallel on that machine?

Definitely. I have virtualbox installed, but haven't used it in months. The others I don't use at all, so they are not installed either. There is nothing running which could cause that. Behaviour is the same when I don't log into KDE but just try this without X, where nearly nothing is started. I noted something more now: When I comment it out once, and start kvm like that, and then remove the comments again, then it works. So I guess the dmesg parts I wrote were not perfect.
It's more like:

I: After reboot, with debugging printk and if condition:

[ 42.089423] efer is d01
[ 42.089425] efer is d01
[ 42.089428] efer is d01
[ 42.089430] EFER_SVME is 1000
[ 42.089431] EFER_SVME is 1000
[ 42.089433] EFER_SVME is 1000
[ 42.089436] efer is 1d01
[ 42.089438] EFER_SVME is 1000
[ 42.089440] kvm: enabling virtualization on CPU0 failed

II: debugging printk, no if condition:

[ 317.355519] efer is d01
[ 317.355522] efer is d01
[ 317.355524] efer is d01
[ 317.355527] EFER_SVME is 1000
[ 317.355528] EFER_SVME is 1000
[ 317.355531] EFER_SVME is 1000
[ 317.355534] efer is 1d01
[ 317.355536] EFER_SVME is 1000

III: debugging printk and if condition:

[ 421.955433] efer is d01
[ 421.955437] efer is d01
[ 421.955440] efer is d01
[ 421.955442] EFER_SVME is 1000
[ 421.955443] EFER_SVME is 1000
[ 421.955445] EFER_SVME is 1000
[ 421.955449] efer is d01
[ 421.955451] EFER_SVME is 1000

This is without reboots in between. So before I use the commented-out version for the first time, it doesn't work; the 2nd time it works. Maybe some initialization problem...

It looks like one of your CPUs has EFER_SVME enabled on bootup already. I'm not aware of code clearing EFER, so if there's garbage in there on boot it stays there. Could you please add the current CPU number to your printk? I bet it's always the same one. If that's the case I'd say you have a broken BIOS or bootloader.

Alex
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Thursday 18 March 2010 05:14:52 Zachary Amsden wrote: On 03/16/2010 11:28 PM, Sheng Yang wrote: On Wednesday 17 March 2010 10:34:33 Zhang, Yanmin wrote: On Tue, 2010-03-16 at 11:32 +0200, Avi Kivity wrote: On 03/16/2010 09:48 AM, Zhang, Yanmin wrote:

Right, but there is a scope between kvm_guest_enter and really running in the guest OS, where a perf event might overflow. Anyway, the scope is very narrow; I will change it to use the flag PF_VCPU.

There is also a window between setting the flag and calling 'int $2' where an NMI might happen and be accounted incorrectly. Perhaps separate the 'int $2' into a direct call into perf and another call for the rest of NMI handling. I don't see how it would work on svm though - AFAICT the NMI is held whereas vmx swallows it. I guess NMIs will be disabled until the next IRET so it isn't racy, just tricky.

I'm not sure whether vmexit breaks NMI context or not. Hardware NMI context isn't reentrant till an IRET. YangSheng would like to double check it.

After more checking, I think VMX won't retain the NMI blocked state for the host. That means, if an NMI happens while the processor is in VMX non-root mode, it would only result in a VMExit, with a reason indicating that it's due to an NMI, but no further state change in the host. So in that sense, there _is_ a window between VMExit and KVM handling the NMI. Moreover, I think we _can't_ stop re-entrance of the NMI handling code, because int $2 doesn't block a following NMI. And if the NMI ordering is not important (I think so), then we need to generate a real NMI in the current post-vmexit code. Letting the APIC send an NMI IPI to itself seems like a good idea. I am debugging a patch based on apic->send_IPI_self(NMI_VECTOR) to replace int $2. Something unexpected is happening...

You can't use the APIC to send vectors 0x00-0x1f, or at least, aren't supposed to be able to.

Um? Why? Especially since the kernel is already using it to deliver NMIs.
-- regards Yang, Sheng
Re: [PATCH] KVM: VMX: Disable unrestricted guest when EPT disabled
On Thursday 18 March 2010 02:37:10 Alexander Graf wrote: Marcelo Tosatti wrote: On Fri, Nov 27, 2009 at 04:46:26PM +0800, Sheng Yang wrote:

Otherwise it would cause a VMEntry failure when using ept=0 on processors that support unrestricted guest.

Signed-off-by: Sheng Yang sh...@linux.intel.com

Applied, thanks.

So without this patch kvm breaks with ept=0? Sounds like a stable candidate to me.

Seems the unrestricted guest code isn't in v2.6.31-stable, and v2.6.32 has already fixed this issue. So it should be fine.

-- regards Yang, Sheng
[PATCH] KVM test: Parallel install of guest OS v3
From: yogi anant...@linux.vnet.ibm.com

This patch enables doing multiple installs of guest OS in parallel. It adds four more options to tests_base.cfg: a port redirection entry guest_port_unattend_shell for the host to communicate with the guest during installation, and pxe_dir, pxe_image and pxe_initrd to specify locations for the kernel and initrd. For parallel installation to work in unattended mode, the floppy image and PXE boot path also have to be unique for each guest. All the relevant unattended post-install steps for guests were changed; they are now server based.

Notes:
* Yogi, I am going to remove the SLES patch, and will wait for you to send a new patchset with both the SLES files and the openSUSE ones, OK? Thanks.

Changes from v2:
* Following Michael Goldish's comments, handled a possible socket.error exception that could be generated during the unattended install test.
* Modified the floppy image names to be contained inside the same directory that might hold the tftp root for each OS, making the needed changes in unattended.py.
* Added floppy names for Windows-based OSs, which were lacking in previous patches.

Changes from v1:
* Fixed the logic for the new unattended install test (the original implementation would hang indefinitely if the guest died in the middle of the install).
* Fixed the config changes to make sure the unattended install port actually gets redirected so the test can work; also made the config specific to unattended install.
* Merged the finish.exe patch, including a binary patch that changes the binary shipped to the new version.
* Changed all unattended install files to use the parallel mechanism.

Tested with Windows 7 and Fedora 11 guests. I (lmr) am going to keep this in the queue for a bit so I can test it more in the internal test farm and everybody can take a look at the patch.
Signed-off-by: Yogananth Subramanian anant...@linux.vnet.ibm.com
Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/deps/finish.cpp                   | 111 +++-
 client/tests/kvm/deps/finish.exe                   | Bin 26913 -> 26926 bytes
 client/tests/kvm/kvm_utils.py                      |   4 +-
 client/tests/kvm/scripts/unattended.py             |  59 ++-
 client/tests/kvm/tests/unattended_install.py       |  45 
 client/tests/kvm/tests_base.cfg.sample             |  81 +--
 client/tests/kvm/unattended/Fedora-10.ks           |  12 +-
 client/tests/kvm/unattended/Fedora-11.ks           |  11 +-
 client/tests/kvm/unattended/Fedora-12.ks           |  11 +-
 client/tests/kvm/unattended/Fedora-8.ks            |  11 +-
 client/tests/kvm/unattended/Fedora-9.ks            |  11 +-
 client/tests/kvm/unattended/RHEL-3-series.ks       |  12 +-
 client/tests/kvm/unattended/RHEL-4-series.ks       |  11 +-
 client/tests/kvm/unattended/RHEL-5-series.ks       |  11 +-
 client/tests/kvm/unattended/win2003-32.sif         |   2 +-
 client/tests/kvm/unattended/win2003-64.sif         |   2 +-
 .../kvm/unattended/win2008-32-autounattend.xml     |   2 +-
 .../kvm/unattended/win2008-64-autounattend.xml     |   2 +-
 .../kvm/unattended/win2008-r2-autounattend.xml     |   2 +-
 .../tests/kvm/unattended/win7-32-autounattend.xml  |   2 +-
 .../tests/kvm/unattended/win7-64-autounattend.xml  |   2 +-
 .../kvm/unattended/winvista-32-autounattend.xml    |   2 +-
 .../kvm/unattended/winvista-64-autounattend.xml    |   2 +-
 client/tests/kvm/unattended/winxp32.sif            |   2 +-
 client/tests/kvm/unattended/winxp64.sif            |   2 +-
 25 files changed, 242 insertions(+), 170 deletions(-)

diff --git a/client/tests/kvm/deps/finish.cpp b/client/tests/kvm/deps/finish.cpp
index 9c2867c..e5ba128 100644
--- a/client/tests/kvm/deps/finish.cpp
+++ b/client/tests/kvm/deps/finish.cpp
@@ -1,12 +1,13 @@
-// Simple app that only sends an ack string to the KVM unattended install
-// watch code.
+// Simple application that creates a server socket, listening for connections
+// of the unattended install test. Once it gets a client connected, the
+// app will send back an ACK string, indicating the install process is done.
 //
 // You must link this code with Ws2_32.lib, Mswsock.lib, and Advapi32.lib
 //
 // Author: Lucas Meneghel Rodrigues l...@redhat.com
 // Code was adapted from an MSDN sample.
-// Usage: finish.exe [Host OS IP]
+// Usage: finish.exe
 
 // MinGW's ws2tcpip.h only defines getaddrinfo and other functions only for
 // the case _WIN32_WINNT >= 0x0501.
@@ -21,24 +22,18 @@
 #include <stdlib.h>
 #include <stdio.h>
 
-#define DEFAULT_BUFLEN 512
 #define DEFAULT_PORT 12323
-
 int main(int argc, char **argv)
 {
     WSADATA wsaData;
-    SOCKET ConnectSocket = INVALID_SOCKET;
-    struct addrinfo *result = NULL,
-                    *ptr = NULL,
-                    hints;
+    SOCKET ListenSocket = INVALID_SOCKET, ClientSocket = INVALID_SOCKET;
+    struct addrinfo *result = NULL, hints;
     char *sendbuf = "done";
-    char
Re: [PATCH] KVM test: Parallel install of guest OS v3
FYI, patch applied, see: http://autotest.kernel.org/changeset/4309
Re: [PATCH] Enhance perf to collect KVM guest os statistics from host side
On Wed, 2010-03-17 at 17:26 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 10:47 +0100, Ingo Molnar wrote: * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Tue, 2010-03-16 at 15:48 +0800, Zhang, Yanmin wrote: On Tue, 2010-03-16 at 07:41 +0200, Avi Kivity wrote: On 03/16/2010 07:27 AM, Zhang, Yanmin wrote: From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Based on the discussion in the KVM community, I worked out the patch to support perf collecting guest OS statistics from the host side. This patch was implemented with Ingo, Peter and some other guys' kind help. Yang Sheng pointed out a critical bug and provided good suggestions with other guys. I really appreciate their kind help.

The patch adds a new subcommand kvm to perf:

perf kvm top
perf kvm record
perf kvm report
perf kvm diff

The new perf can profile the guest OS kernel, though not guest user space; it can, however, summarize guest user-space utilization per guest OS instance. Below are some examples.

1) perf kvm top
[r...@lkp-ne01 norm]# perf kvm --host --guest --guestkallsyms=/home/ymzhang/guest/kallsyms --guestmodules=/home/ymzhang/guest/modules top

Thanks for your kind comments.

Excellent, support for guest kernel != host kernel is critical (I can't remember the last time I ran the same kernels). How would we support multiple guests with different kernels?

With the patch, 'perf kvm report --sort pid' could show summary statistics for all guest OS instances. Then, use the --pid parameter of 'perf kvm record' to collect data for a single problematic instance.

Sorry. I found that currently --pid isn't a process but a thread (the main thread). Ingo, is it possible to support a new parameter or extend --inherit, so 'perf record' and 'perf top' could collect data on all threads of a process while the process is running? If not, I need to add a new ugly parameter similar to --pid to filter out process data in userspace.

Yeah.
For maximum utility I'd suggest extending --pid to include this, and introducing --tid for the previous, limited-to-a-single-task functionality. Most users would expect --pid to work like a 'late attach', i.e. to work like strace -f or like a gdb attach.

Thanks Ingo, Avi. I worked out the patch below against tip/master of March 15th.

Subject: [PATCH] Change perf's parameter --pid to process-wide collection
From: Zhang, Yanmin yanmin_zh...@linux.intel.com

Change parameter -p (--pid) to a real process pid and add -t (--tid) meaning thread id. Now --pid means perf collects the statistics of all threads of the process, while --tid means perf just collects the statistics of that thread.

BTW, the patch fixes a bug in 'perf stat -p'. 'perf stat' always configures attr->disabled=1 if it isn't a system-wide collection. If there is a '-p' and no forks, 'perf stat -p' doesn't collect any data. In addition, the while(!done) loop in run_perf_stat consumes 100% of a single CPU, which has a bad impact on the running workload. I added a sleep(1) in the loop.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com

Ingo, sorry, the patch has bugs. I need to do a better job and will work out 2 separate patches for the 2 issues.

Yanmin
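Under the proposed semantics, the two flags would be used like this (the PID and TID values are hypothetical, purely for illustration):

```shell
# 4321 is a hypothetical qemu process, 4325 one of its vcpu threads.
# --pid follows all threads of the process (late-attach, like strace -f);
# --tid profiles only the named thread.
perf kvm --host --guest record --pid 4321
perf kvm --host --guest record --tid 4325
```

The --pid form is what the 'perf kvm report --sort pid' workflow described above needs: first find the problematic guest instance, then attach to that whole qemu process.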
Re: [PATCH v2] KVM MMU: check reserved bits only when CR4.PSE=1 or CR4.PAE=1
On Wed, Mar 17, 2010 at 11:43:06AM +0800, Xiao Guangrong wrote:

- The RSV bit is possibly set in the error code when #PF occurs only if CR4.PSE=1 or CR4.PAE=1
- context->rsvd_bits_mask[1][0] is always 0

Changelog:
Moved this operation to reset_rsvds_bits_mask() to address Avi Kivity's suggestion

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b137515..c49f8ec 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2288,18 +2288,26 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
 	if (!is_nx(vcpu))
 		exb_bit_rsvd = rsvd_bits(63, 63);
+
+	context->rsvd_bits_mask[1][0] = 0;

So if the guest enables PAT at PTE level you completely disable reserved bit checking? You should only disable checking for [1][1] if !PSE.