Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Avi Kivity
Ingo Molnar wrote:

 

 This is a little too good to be true.  Were both runs with the same 
 KVM_NUM_MMU_PAGES?
 

 yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' 
 run should have been labeled as: 'cr3 tree with paravirt turned off'. 
 That's not completely 'trunk' but close to it, and all other changes 
 (like elimination of unnecessary TLB flushes) are fairly applied to 
 both.
   

Ok.  I guess there's a switch/switch back pattern in there.

 i also did a run with much less MMU cache pages of 256, and hackbench 1 
 stayed the same, while hackbench 5 numbers started fluctuating badly (i 
 think that workload if trashing the MMU cache badly).
   

Yes, 256 is too low.

   
 -u64 *pae_root;
 +u64 *pae_root[KVM_CR3_CACHE_SIZE];
   
 hmm.  wouldn't it be simpler to have pae_root always point at the 
 current root?
 

 does that guarantee that it's available? I wanted to 'pin' the root 
 itself this way, to make sure that if a guest switches to it via the 
 cache, that it's truly available and a valid root. cr3 addresses are 
 non-virtual so this is the only mechanism available to guarantee that 
 the host-side memory truly contains a root pagetable.

   

I meant

   u64 *pae_root_cache;
   u64 *pae_root; /* == pae_root_cache + 4*cache_index */

so that the rest of the code need not worry about the cache.

 +vcpu-mmu.pae_root[j][i] = INVALID_PAGE;
 +}
 }
 vcpu-mmu.root_hpa = INVALID_PAGE;
 }
   
 You keep the page directories pinned here. [...]
 

 yes.

   
 [...]  This can be a problem if a guest frees a page directory, and 
 then starts using it as a regular page.  kvm sometimes chooses not to 
 emulate a write to a guest page table, but instead to zap it, which is 
 impossible when the page is freed.  You need to either unpin the page 
 when that happens, or add a hypercall to let kvm know when a page 
 directory is freed.
 

 the cache is zapped upon pagefaults anyway, so unpinning ought to be 
 possible. Which one would you prefer?
   

It's zapped by the equivalent of mmu_free_roots(), right?  That's 
effectively unpinning it (by zeroing -root_count).

However, kvm takes pagefaults even for silly things like setting (in 
hardware) or clearing (in software) the dirty bit.


   
 +#define KVM_API_MAGIC 0x87654321
 +
   
 linux/kvm.h is the vmm userspace interface.  The guest/host 
 interface should probably go somewhere else.
 

 yeah. kvm_para.h?

   

Sounds good.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Ingo Molnar

* Avi Kivity [EMAIL PROTECTED] wrote:

  the cache is zapped upon pagefaults anyway, so unpinning ought to be 
  possible. Which one would you prefer?
 
 It's zapped by the equivalent of mmu_free_roots(), right?  That's 
 effectively unpinning it (by zeroing -root_count).

no, right now only the guest-visible cache is zapped - the roots are 
zapped by natural rotation. I guess they should be zapped in 
kvm_cr3_cache_clear() - but i wanted to keep that function an invariant 
to the other MMU state, to make it easier to call it from whatever mmu 
codepath.

 However, kvm takes pagefaults even for silly things like setting (in 
 hardware) or clearing (in software) the dirty bit.

yeah. I think it also does some TLB flushes that are not needed. For 
example in rmap_write_protect() we do this:

rmap_remove(vcpu, spte);
kvm_arch_ops-tlb_flush(vcpu);

but AFAICS rmap_write_protect() is only ever called if we write a new 
cr3 - hence a TLB flush will happen anyway, because we do a 
vmcs_writel(GUEST_CR3, new_cr3). Am i missing something? I didnt want to 
remove it as part of the cr3 patches (to keep things simpler), but that 
flush looks quite unnecessary to me. The patch below seems to work in 
light testing.

Ingo

Index: linux/drivers/kvm/mmu.c
===
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv
BUG_ON(!(*spte  PT_WRITABLE_MASK));
rmap_printk(rmap_write_protect: spte %p %llx\n, spte, *spte);
rmap_remove(vcpu, spte);
-   kvm_arch_ops-tlb_flush(vcpu);
+   /*
+* While we removed a mapping there's no need to explicitly
+* flush the TLB here, because this codepath only triggers
+* if we write a new cr3 - which will flush the TLB anyway.
+*/
*spte = ~(u64)PT_WRITABLE_MASK;
}
 }

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Avi Kivity
Ingo Molnar wrote:
 * Avi Kivity [EMAIL PROTECTED] wrote:

   
 but AFAICS rmap_write_protect() is only ever called if we write a new 
 cr3 - hence a TLB flush will happen anyway, because we do a 
 vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?
   
 No, rmap_write_protect() is called whenever we shadow a new guest page 
 table (the mechanism used to maintain coherency is write faults on 
 page tables).
 

 hm. Where? I only see rmap_write_protect() called from 
 kvm_mmu_get_page(), which is only called from nonpaging_map() [but it 
 doesnt call the rmap function in that case, due to metaphysical], and 
 mmu_alloc_roots(). mmu_alloc_roots() is only called from context init 
 (init-only thing) or from paging_new_cr3().

 ahh ... i missed paging_tmpl.h.

 how about the patch below then?
   

Looks like a lot of complexity for very little gain.  I'm not sure what 
the vmwrite cost is, cut it can't be that high compared to vmexit.

I think there are two better options:

1.  Find out if tlb_flush() can be implemented as a no-op on intel -- 
I'm fairly sure it can.
2.  If not, implement tlb_flush() as a counter increment like on amd.  
Then, successive tlb flushes and context switches are folded into one.


 furthermore, shouldnt we pass in the flush area size from the pagefault 
 handler? In most cases it would be 4096 bytes, that would mean an invlpg 
 is enough, not a full cr3 flush. Although ... i dont know how to invlpg 
 a guest-side TLB. On VMX it probably doesnt matter at all. On SVM, is 
 there a way (or a need) to flush a single TLB of another context ID?
   

Yes (and yes), invlpga.

A complication is that a single shadow pte can be used to map multiple 
guest virtual addresses (by mapping the page table using multiple pdes), 
so the kvm_mmu_page object has to keep track of the page table gva, and 
whether it is multiply mapped or not.  I plan to do that later.

   Ingo

 Index: linux/drivers/kvm/mmu.c
 ===
 --- linux.orig/drivers/kvm/mmu.c
 +++ linux/drivers/kvm/mmu.c
 @@ -378,7 +378,7 @@ static void rmap_remove(struct kvm_vcpu 
   }
  }
  
 -static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
 +static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn, int flush_tlb)
  {
   struct kvm *kvm = vcpu-kvm;
   struct page *page;
 @@ -404,7 +404,13 @@ static void rmap_write_protect(struct kv
   BUG_ON(!(*spte  PT_WRITABLE_MASK));
   rmap_printk(rmap_write_protect: spte %p %llx\n, spte, *spte);
   rmap_remove(vcpu, spte);
 - kvm_arch_ops-tlb_flush(vcpu);
 + /*
 +  * While we removed a mapping there's no need to explicitly
 +  * flush the TLB here if we write a new cr3. It is needed
 +  * from the pagefault path though.
 +  */
 + if (flush_tlb)
 + kvm_arch_ops-tlb_flush(vcpu);
   *spte = ~(u64)PT_WRITABLE_MASK;
   }
  }
 @@ -561,7 +567,8 @@ static struct kvm_mmu_page *kvm_mmu_get_
gva_t gaddr,
unsigned level,
int metaphysical,
 -  u64 *parent_pte)
 +  u64 *parent_pte,
 +  int flush_tlb)
  {
   union kvm_mmu_page_role role;
   unsigned index;
 @@ -597,7 +604,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
   page-role = role;
   hlist_add_head(page-hash_link, bucket);
   if (!metaphysical)
 - rmap_write_protect(vcpu, gfn);
 + rmap_write_protect(vcpu, gfn, flush_tlb);
   return page;
  }
  
 @@ -764,7 +771,7 @@ static int nonpaging_map(struct kvm_vcpu
PAGE_SHIFT;
   new_table = kvm_mmu_get_page(vcpu, pseudo_gfn,
v, level - 1,
 -  1, table[index]);
 +  1, table[index], 0);
   if (!new_table) {
   pgprintk(nonpaging_map: ENOMEM\n);
   return -ENOMEM;
 @@ -838,7 +845,7 @@ static void mmu_alloc_roots(struct kvm_v
  
   ASSERT(!VALID_PAGE(root));
   page = kvm_mmu_get_page(vcpu, root_gfn, 0,
 - PT64_ROOT_LEVEL, 0, NULL);
 + PT64_ROOT_LEVEL, 0, NULL, 0);
   root = page-page_hpa;
   ++page-root_count;
   vcpu-mmu.root_hpa = root;
 @@ -857,7 +864,7 @@ static void mmu_alloc_roots(struct kvm_v
   root_gfn = 0;
   page = kvm_mmu_get_page(vcpu, root_gfn, i  30,
   PT32_ROOT_LEVEL, !is_paging(vcpu),
 

Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Christoph Lameter
On Sat, 6 Jan 2007, Pavel Machek wrote:

 Does this make Xen obsolete? I mean... we have xen patches in suse
 kernels, should we keep updating them, or just drop them in favour of
 KVM?
   Pavel

Xen is duplicating basic OS components like the scheduler etc. As 
a result its difficult to maintain and not well integrated with Linux. 
KVM looks like a better approach.



-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-07 Thread Hollis Blanchard
On Sun, 2007-01-07 at 14:20 +0200, Avi Kivity wrote:
 
 
 Well, you did say it was ad-hoc.  For reference, this is how I see the
 hypercall API:
[snip]
 - Guest/host communications is by guest physical addressed, as the 
 virtual-physical translation is much cheaper on the guest (__pa() vs
 a page table walk). 

Strongly agreed. One of the major problems we had with the PowerPC Xen
port was that Xen passes virtual addresses (even userspace virtual
addresses!) to the hypervisor. Performing a MMU search on PowerPC is far
more convoluted than x86's table walk and is not feasible in software.

I'm anxious to avoid the same mistake wherever possible.

Of course, even with physical addresses, data structures that straddle
page boundaries prevent the hypervisor from mapping contiguous physical
pages to discontiguous machine pages (or whatever terminology you want
to use).

IBM's Power Hypervisor passes hypercall data in registers most of the
time. In the places it communicates via memory, I believe the parameter
is actually a physical frame number, and the size is explicitly limited
to one page.

-Hollis


-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-07 Thread Christoph Hellwig
On Sat, Jan 06, 2007 at 01:08:18PM +, Pavel Machek wrote:
 Does this make Xen obsolete? I mean... we have xen patches in suse
 kernels, should we keep updating them, or just drop them in favour of
 KVM?

After all the Novell Marketing Hype you'll probably have to keep Xen ;-)
Except for that I suspect a paravirt kvm or lhype might be the better
hypervisor choice in the long term.


-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Zachary Amsden
Ingo Molnar wrote:
 * Zachary Amsden [EMAIL PROTECTED] wrote:

   
 What you really want is more like 
 EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
 

 yep. Not a big issue - what is important is to put the paravirt ops into 
 the read-only section so that it's somewhat harder for rootkits to 
 modify. (Also, it needs to be made clear that this is fundamental, 
 lowlevel system functionality written by people under the GPLv2, so that 
 if you utilize it beyond its original purpose, using its internals, you 
 likely create a work derived from the kernel. Something simple as irq 
 disabling probably doesnt qualify, and that we exported to modules for a 
 long time, but lots of other details do. So the existence of 
 paravirt_ops isnt a free-for all.)
   

I agree completely.  It would be nice to have a way to make certain 
kernel structures available, but non-mutable to non-GPL modules.

 But I'm not sure that is technically feasible yet.

 The kvm code should probably go in kvm.c instead of paravirt.c.
 

 no. This is fundamental architecture boot code, not module code. kvm.c 
 should eventually go into kernel/ and arch/*/kernel, not the other way 
 around.
   

What I meant was kvm.c in arch/i386/kernel - as symmetric to the other 
paravirt-ops modules, which live in arch/i386/kernel/vmi.c / lhype.c, 
etc.  Either that, or we should move them to be symmetric, but I don't 
think paravirt.c is the proper place for kvm specific code.


   
 Index: linux/drivers/serial/8250.c
 ===
 --- linux.orig/drivers/serial/8250.c
 +++ linux/drivers/serial/8250.c
 @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(

  l = l-next;

 -if (l == i-head  pass_counter++  PASS_LIMIT) {
 +if (!kvm_paravirt 

 Is this a bug that might happen under other virtualizations as well, 
 not just kvm? Perhaps it deserves a disable feature instead of a kvm 
 specific check.
 

 yes - this limit is easily triggered via the KVM/Qemu virtual serial 
 drivers. You can think of kvm_paravirt as Linux paravirt, it's just 
 a flag.
   

Can't you just test paravirt_enabled() in that case?


Zach

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Zachary Amsden
Ingo Molnar wrote:
 i'm pleased to announce the first release of paravirtualized KVM (Linux 
 under Linux), which includes support for the hardware cr3-cache feature 
 of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)

 the patch is against 2.6.20-rc3 + KVM trunk and can be found at:

http://redhat.com/~mingo/kvm-paravirt-patches/

 Some aspects of the code are still a bit ad-hoc and incomplete, but the 
 code is stable enough in my testing and i'd like to have some feedback. 
   

Your code looks generally good.  I have some comments.

You can't do this, even though you want to:

-EXPORT_SYMBOL(paravirt_ops);
+EXPORT_SYMBOL_GPL(paravirt_ops);


The problem is it makes all modules GPL - or at least all modules that 
use any kind of locking, pull in the basic definitions to enable and 
disable interrupts, thus the paravirt_ops symbol, so basically all modules.

What you really want is more like EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);

But I'm not sure that is technically feasible yet.

The kvm code should probably go in kvm.c instead of paravirt.c.


Index: linux/drivers/serial/8250.c
===
--- linux.orig/drivers/serial/8250.c
+++ linux/drivers/serial/8250.c
@@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
 
l = l-next;
 
-   if (l == i-head  pass_counter++  PASS_LIMIT) {
+   if (!kvm_paravirt 



Is this a bug that might happen under other virtualizations as well, not 
just kvm? Perhaps it deserves a disable feature instead of a kvm 
specific check.

Which also gets rid of the need for this unusually placed extern:

Index: linux/include/linux/sched.h
===
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1911,6 +1911,11 @@ static inline void set_task_cpu(struct t
 
 #endif /* CONFIG_SMP */
 
+/*
+ * Is paravirtualization active?
+ */
+extern int kvm_paravirt;

+




-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Ingo Molnar

* Zachary Amsden [EMAIL PROTECTED] wrote:

 What you really want is more like 
 EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
 
 
 yep. Not a big issue - what is important is to put the paravirt ops 
 into the read-only section so that it's somewhat harder for rootkits 
 to modify. (Also, it needs to be made clear that this is fundamental, 
 lowlevel system functionality written by people under the GPLv2, so 
 that if you utilize it beyond its original purpose, using its 
 internals, you likely create a work derived from the kernel. 
 Something simple as irq disabling probably doesnt qualify, and that 
 we exported to modules for a long time, but lots of other details do. 
 So the existence of paravirt_ops isnt a free-for all.)
 
 I agree completely.  It would be nice to have a way to make certain 
 kernel structures available, but non-mutable to non-GPL modules.

the thing is, we are not 'making these available to non-GPL modules'. 
The _GPL postfix does not turn the other normal exports from grey into 
white. The _GPL postfix turns exports into almost-black for non-GPL 
modules. The rest is still grey.

in this case, i'd only make local_irq_*() available to modules (of any 
type), not the other paravirt ops. Exporting the whole of paravirt_ops 
was a mistake, and this has to be fixed before 2.6.20. I'll cook up a 
patch.

 yes - this limit is easily triggered via the KVM/Qemu virtual serial 
 drivers. You can think of kvm_paravirt as Linux paravirt, it's 
 just a flag.
 
 Can't you just test paravirt_enabled() in that case?

yep - i've changed this in my tree.

Ingo

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel