Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
Ingo Molnar wrote: This is a little too good to be true. Were both runs with the same KVM_NUM_MMU_PAGES? yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' run should have been labeled as: 'cr3 tree with paravirt turned off'. That's not completely 'trunk' but close to it, and all other changes (like elimination of unnecessary TLB flushes) are fairly applied to both. Ok. I guess there's a switch/switch back pattern in there. i also did a run with much less MMU cache pages of 256, and hackbench 1 stayed the same, while hackbench 5 numbers started fluctuating badly (i think that workload if trashing the MMU cache badly). Yes, 256 is too low. -u64 *pae_root; +u64 *pae_root[KVM_CR3_CACHE_SIZE]; hmm. wouldn't it be simpler to have pae_root always point at the current root? does that guarantee that it's available? I wanted to 'pin' the root itself this way, to make sure that if a guest switches to it via the cache, that it's truly available and a valid root. cr3 addresses are non-virtual so this is the only mechanism available to guarantee that the host-side memory truly contains a root pagetable. I meant u64 *pae_root_cache; u64 *pae_root; /* == pae_root_cache + 4*cache_index */ so that the rest of the code need not worry about the cache. +vcpu-mmu.pae_root[j][i] = INVALID_PAGE; +} } vcpu-mmu.root_hpa = INVALID_PAGE; } You keep the page directories pinned here. [...] yes. [...] This can be a problem if a guest frees a page directory, and then starts using it as a regular page. kvm sometimes chooses not to emulate a write to a guest page table, but instead to zap it, which is impossible when the page is freed. You need to either unpin the page when that happens, or add a hypercall to let kvm know when a page directory is freed. the cache is zapped upon pagefaults anyway, so unpinning ought to be possible. Which one would you prefer? It's zapped by the equivalent of mmu_free_roots(), right? That's effectively unpinning it (by zeroing -root_count). However, kvm takes pagefaults even for silly things like setting (in hardware) or clearing (in software) the dirty bit. +#define KVM_API_MAGIC 0x87654321 + linux/kvm.h is the vmm userspace interface. The guest/host interface should probably go somewhere else. yeah. kvm_para.h? Sounds good. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
* Avi Kivity [EMAIL PROTECTED] wrote: the cache is zapped upon pagefaults anyway, so unpinning ought to be possible. Which one would you prefer? It's zapped by the equivalent of mmu_free_roots(), right? That's effectively unpinning it (by zeroing -root_count). no, right now only the guest-visible cache is zapped - the roots are zapped by natural rotation. I guess they should be zapped in kvm_cr3_cache_clear() - but i wanted to keep that function an invariant to the other MMU state, to make it easier to call it from whatever mmu codepath. However, kvm takes pagefaults even for silly things like setting (in hardware) or clearing (in software) the dirty bit. yeah. I think it also does some TLB flushes that are not needed. For example in rmap_write_protect() we do this: rmap_remove(vcpu, spte); kvm_arch_ops-tlb_flush(vcpu); but AFAICS rmap_write_protect() is only ever called if we write a new cr3 - hence a TLB flush will happen anyway, because we do a vmcs_writel(GUEST_CR3, new_cr3). Am i missing something? I didnt want to remove it as part of the cr3 patches (to keep things simpler), but that flush looks quite unnecessary to me. The patch below seems to work in light testing. Ingo Index: linux/drivers/kvm/mmu.c === --- linux.orig/drivers/kvm/mmu.c +++ linux/drivers/kvm/mmu.c @@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv BUG_ON(!(*spte PT_WRITABLE_MASK)); rmap_printk(rmap_write_protect: spte %p %llx\n, spte, *spte); rmap_remove(vcpu, spte); - kvm_arch_ops-tlb_flush(vcpu); + /* +* While we removed a mapping there's no need to explicitly +* flush the TLB here, because this codepath only triggers +* if we write a new cr3 - which will flush the TLB anyway. +*/ *spte = ~(u64)PT_WRITABLE_MASK; } } - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
Ingo Molnar wrote: * Avi Kivity [EMAIL PROTECTED] wrote: but AFAICS rmap_write_protect() is only ever called if we write a new cr3 - hence a TLB flush will happen anyway, because we do a vmcs_writel(GUEST_CR3, new_cr3). Am i missing something? No, rmap_write_protect() is called whenever we shadow a new guest page table (the mechanism used to maintain coherency is write faults on page tables). hm. Where? I only see rmap_write_protect() called from kvm_mmu_get_page(), which is only called from nonpaging_map() [but it doesnt call the rmap function in that case, due to metaphysical], and mmu_alloc_roots(). mmu_alloc_roots() is only called from context init (init-only thing) or from paging_new_cr3(). ahh ... i missed paging_tmpl.h. how about the patch below then? Looks like a lot of complexity for very little gain. I'm not sure what the vmwrite cost is, cut it can't be that high compared to vmexit. I think there are two better options: 1. Find out if tlb_flush() can be implemented as a no-op on intel -- I'm fairly sure it can. 2. If not, implement tlb_flush() as a counter increment like on amd. Then, successive tlb flushes and context switches are folded into one. furthermore, shouldnt we pass in the flush area size from the pagefault handler? In most cases it would be 4096 bytes, that would mean an invlpg is enough, not a full cr3 flush. Although ... i dont know how to invlpg a guest-side TLB. On VMX it probably doesnt matter at all. On SVM, is there a way (or a need) to flush a single TLB of another context ID? Yes (and yes), invlpga. A complication is that a single shadow pte can be used to map multiple guest virtual addresses (by mapping the page table using multiple pdes), so the kvm_mmu_page object has to keep track of the page table gva, and whether it is multiply mapped or not. I plan to do that later. Ingo Index: linux/drivers/kvm/mmu.c === --- linux.orig/drivers/kvm/mmu.c +++ linux/drivers/kvm/mmu.c @@ -378,7 +378,7 @@ static void rmap_remove(struct kvm_vcpu } } -static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn) +static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn, int flush_tlb) { struct kvm *kvm = vcpu-kvm; struct page *page; @@ -404,7 +404,13 @@ static void rmap_write_protect(struct kv BUG_ON(!(*spte PT_WRITABLE_MASK)); rmap_printk(rmap_write_protect: spte %p %llx\n, spte, *spte); rmap_remove(vcpu, spte); - kvm_arch_ops-tlb_flush(vcpu); + /* + * While we removed a mapping there's no need to explicitly + * flush the TLB here if we write a new cr3. It is needed + * from the pagefault path though. + */ + if (flush_tlb) + kvm_arch_ops-tlb_flush(vcpu); *spte = ~(u64)PT_WRITABLE_MASK; } } @@ -561,7 +567,8 @@ static struct kvm_mmu_page *kvm_mmu_get_ gva_t gaddr, unsigned level, int metaphysical, - u64 *parent_pte) + u64 *parent_pte, + int flush_tlb) { union kvm_mmu_page_role role; unsigned index; @@ -597,7 +604,7 @@ static struct kvm_mmu_page *kvm_mmu_get_ page-role = role; hlist_add_head(page-hash_link, bucket); if (!metaphysical) - rmap_write_protect(vcpu, gfn); + rmap_write_protect(vcpu, gfn, flush_tlb); return page; } @@ -764,7 +771,7 @@ static int nonpaging_map(struct kvm_vcpu PAGE_SHIFT; new_table = kvm_mmu_get_page(vcpu, pseudo_gfn, v, level - 1, - 1, table[index]); + 1, table[index], 0); if (!new_table) { pgprintk(nonpaging_map: ENOMEM\n); return -ENOMEM; @@ -838,7 +845,7 @@ static void mmu_alloc_roots(struct kvm_v ASSERT(!VALID_PAGE(root)); page = kvm_mmu_get_page(vcpu, root_gfn, 0, - PT64_ROOT_LEVEL, 0, NULL); + PT64_ROOT_LEVEL, 0, NULL, 0); root = page-page_hpa; ++page-root_count; vcpu-mmu.root_hpa = root; @@ -857,7 +864,7 @@ static void mmu_alloc_roots(struct kvm_v root_gfn = 0; page = kvm_mmu_get_page(vcpu, root_gfn, i 30, PT32_ROOT_LEVEL, !is_paging(vcpu),
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
On Sat, 6 Jan 2007, Pavel Machek wrote: Does this make Xen obsolete? I mean... we have xen patches in suse kernels, should we keep updating them, or just drop them in favour of KVM? Pavel Xen is duplicating basic OS components like the scheduler etc. As a result its difficult to maintain and not well integrated with Linux. KVM looks like a better approach. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
On Sun, 2007-01-07 at 14:20 +0200, Avi Kivity wrote: Well, you did say it was ad-hoc. For reference, this is how I see the hypercall API: [snip] - Guest/host communications is by guest physical addressed, as the virtual-physical translation is much cheaper on the guest (__pa() vs a page table walk). Strongly agreed. One of the major problems we had with the PowerPC Xen port was that Xen passes virtual addresses (even userspace virtual addresses!) to the hypervisor. Performing a MMU search on PowerPC is far more convoluted than x86's table walk and is not feasible in software. I'm anxious to avoid the same mistake wherever possible. Of course, even with physical addresses, data structures that straddle page boundaries prevent the hypervisor from mapping contiguous physical pages to discontiguous machine pages (or whatever terminology you want to use). IBM's Power Hypervisor passes hypercall data in registers most of the time. In the places it communicates via memory, I believe the parameter is actually a physical frame number, and the size is explicitly limited to one page. -Hollis - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
On Sat, Jan 06, 2007 at 01:08:18PM +, Pavel Machek wrote: Does this make Xen obsolete? I mean... we have xen patches in suse kernels, should we keep updating them, or just drop them in favour of KVM? After all the Novell Marketing Hype you'll probably have to keep Xen ;-) Except for that I suspect a paravirt kvm or lhype might be the better hypervisor choice in the long term. - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
Ingo Molnar wrote: * Zachary Amsden [EMAIL PROTECTED] wrote: What you really want is more like EXPORT_SYMBOL_READABLE_GPL(paravirt_ops); yep. Not a big issue - what is important is to put the paravirt ops into the read-only section so that it's somewhat harder for rootkits to modify. (Also, it needs to be made clear that this is fundamental, lowlevel system functionality written by people under the GPLv2, so that if you utilize it beyond its original purpose, using its internals, you likely create a work derived from the kernel. Something simple as irq disabling probably doesnt qualify, and that we exported to modules for a long time, but lots of other details do. So the existence of paravirt_ops isnt a free-for all.) I agree completely. It would be nice to have a way to make certain kernel structures available, but non-mutable to non-GPL modules. But I'm not sure that is technically feasible yet. The kvm code should probably go in kvm.c instead of paravirt.c. no. This is fundamental architecture boot code, not module code. kvm.c should eventually go into kernel/ and arch/*/kernel, not the other way around. What I meant was kvm.c in arch/i386/kernel - as symmetric to the other paravirt-ops modules, which live in arch/i386/kernel/vmi.c / lhype.c, etc. Either that, or we should move them to be symmetric, but I don't think paravirt.c is the proper place for kvm specific code. Index: linux/drivers/serial/8250.c === --- linux.orig/drivers/serial/8250.c +++ linux/drivers/serial/8250.c @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt( l = l-next; -if (l == i-head pass_counter++ PASS_LIMIT) { +if (!kvm_paravirt Is this a bug that might happen under other virtualizations as well, not just kvm? Perhaps it deserves a disable feature instead of a kvm specific check. yes - this limit is easily triggered via the KVM/Qemu virtual serial drivers. You can think of kvm_paravirt as Linux paravirt, it's just a flag. Can't you just test paravirt_enabled() in that case? Zach - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
Ingo Molnar wrote: i'm pleased to announce the first release of paravirtualized KVM (Linux under Linux), which includes support for the hardware cr3-cache feature of Intel-VMX CPUs. (which speeds up context switches and TLB flushes) the patch is against 2.6.20-rc3 + KVM trunk and can be found at: http://redhat.com/~mingo/kvm-paravirt-patches/ Some aspects of the code are still a bit ad-hoc and incomplete, but the code is stable enough in my testing and i'd like to have some feedback. Your code looks generally good. I have some comments. You can't do this, even though you want to: -EXPORT_SYMBOL(paravirt_ops); +EXPORT_SYMBOL_GPL(paravirt_ops); The problem is it makes all modules GPL - or at least all modules that use any kind of locking, pull in the basic definitions to enable and disable interrupts, thus the paravirt_ops symbol, so basically all modules. What you really want is more like EXPORT_SYMBOL_READABLE_GPL(paravirt_ops); But I'm not sure that is technically feasible yet. The kvm code should probably go in kvm.c instead of paravirt.c. Index: linux/drivers/serial/8250.c === --- linux.orig/drivers/serial/8250.c +++ linux/drivers/serial/8250.c @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt( l = l-next; - if (l == i-head pass_counter++ PASS_LIMIT) { + if (!kvm_paravirt Is this a bug that might happen under other virtualizations as well, not just kvm? Perhaps it deserves a disable feature instead of a kvm specific check. Which also gets rid of the need for this unusually placed extern: Index: linux/include/linux/sched.h === --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -1911,6 +1911,11 @@ static inline void set_task_cpu(struct t #endif /* CONFIG_SMP */ +/* + * Is paravirtualization active? + */ +extern int kvm_paravirt; + - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux
* Zachary Amsden [EMAIL PROTECTED] wrote: What you really want is more like EXPORT_SYMBOL_READABLE_GPL(paravirt_ops); yep. Not a big issue - what is important is to put the paravirt ops into the read-only section so that it's somewhat harder for rootkits to modify. (Also, it needs to be made clear that this is fundamental, lowlevel system functionality written by people under the GPLv2, so that if you utilize it beyond its original purpose, using its internals, you likely create a work derived from the kernel. Something simple as irq disabling probably doesnt qualify, and that we exported to modules for a long time, but lots of other details do. So the existence of paravirt_ops isnt a free-for all.) I agree completely. It would be nice to have a way to make certain kernel structures available, but non-mutable to non-GPL modules. the thing is, we are not 'making these available to non-GPL modules'. The _GPL postfix does not turn the other normal exports from grey into white. The _GPL postfix turns exports into almost-black for non-GPL modules. The rest is still grey. in this case, i'd only make local_irq_*() available to modules (of any type), not the other paravirt ops. Exporting the whole of paravirt_ops was a mistake, and this has to be fixed before 2.6.20. I'll cook up a patch. yes - this limit is easily triggered via the KVM/Qemu virtual serial drivers. You can think of kvm_paravirt as Linux paravirt, it's just a flag. Can't you just test paravirt_enabled() in that case? yep - i've changed this in my tree. Ingo - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ kvm-devel mailing list kvm-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/kvm-devel