Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Christoph Lameter
On Sat, 6 Jan 2007, Pavel Machek wrote:

> Does this make Xen obsolete? I mean... we have xen patches in suse
> kernels, should we keep updating them, or just drop them in favour of
> KVM?
>   Pavel

Xen is duplicating basic OS components like the scheduler etc. As 
a result it's difficult to maintain and not well integrated with Linux. 
KVM looks like a better approach.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Ingo Molnar

* Avi Kivity <[EMAIL PROTECTED]> wrote:

> Looks like a lot of complexity for very little gain.  I'm not sure 
> what the vmwrite cost is, but it can't be that high compared to 
> vmexit.

while i disagree with characterising one extra parameter passed down 
plus one extra branch as 'a lot of complexity', i agree that making the 
flush a NOP on VMX is even better. (if it's possible without breaking 
future hardware) I guess you are right that the only true cost here is 
the vmwrite cost, and that there's no impact on CR3 flushing (because it 
happens unconditionally). Forget about this patch.

Ingo


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Avi Kivity

Ingo Molnar wrote:

> * Avi Kivity <[EMAIL PROTECTED]> wrote:
>
> > > but AFAICS rmap_write_protect() is only ever called if we write a new 
> > > cr3 - hence a TLB flush will happen anyway, because we do a 
> > > vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?
> >
> > No, rmap_write_protect() is called whenever we shadow a new guest page 
> > table (the mechanism used to maintain coherency is write faults on 
> > page tables).
>
> hm. Where? I only see rmap_write_protect() called from 
> kvm_mmu_get_page(), which is only called from nonpaging_map() [but it 
> doesnt call the rmap function in that case, due to metaphysical], and 
> mmu_alloc_roots(). mmu_alloc_roots() is only called from context init 
> (init-only thing) or from paging_new_cr3().
>
> ahh ... i missed paging_tmpl.h.
>
> how about the patch below then?

Looks like a lot of complexity for very little gain.  I'm not sure what 
the vmwrite cost is, but it can't be that high compared to vmexit.


I think there are two better options:

1.  Find out if tlb_flush() can be implemented as a no-op on intel -- 
I'm fairly sure it can.
2.  If not, implement tlb_flush() as a counter increment like on amd.  
Then, successive tlb flushes and context switches are folded into one.



> furthermore, shouldnt we pass in the flush area size from the pagefault 
> handler? In most cases it would be 4096 bytes, that would mean an invlpg 
> is enough, not a full cr3 flush. Although ... i dont know how to invlpg 
> a guest-side TLB. On VMX it probably doesnt matter at all. On SVM, is 
> there a way (or a need) to flush a single TLB of another context ID?


Yes (and yes), invlpga.

A complication is that a single shadow pte can be used to map multiple 
guest virtual addresses (by mapping the page table using multiple pdes), 
so the kvm_mmu_page object has to keep track of the page table gva, and 
whether it is multiply mapped or not.  I plan to do that later.



Ingo

Index: linux/drivers/kvm/mmu.c
===
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -378,7 +378,7 @@ static void rmap_remove(struct kvm_vcpu 
 	}

 }
 
-static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn, int flush_tlb)
 {
struct kvm *kvm = vcpu->kvm;
struct page *page;
@@ -404,7 +404,13 @@ static void rmap_write_protect(struct kv
BUG_ON(!(*spte & PT_WRITABLE_MASK));
rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
rmap_remove(vcpu, spte);
-   kvm_arch_ops->tlb_flush(vcpu);
+   /*
+* While we removed a mapping there's no need to explicitly
+* flush the TLB here if we write a new cr3. It is needed
+* from the pagefault path though.
+*/
+   if (flush_tlb)
+   kvm_arch_ops->tlb_flush(vcpu);
*spte &= ~(u64)PT_WRITABLE_MASK;
}
 }
@@ -561,7 +567,8 @@ static struct kvm_mmu_page *kvm_mmu_get_
 gva_t gaddr,
 unsigned level,
 int metaphysical,
-u64 *parent_pte)
+u64 *parent_pte,
+int flush_tlb)
 {
union kvm_mmu_page_role role;
unsigned index;
@@ -597,7 +604,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
page->role = role;
hlist_add_head(&page->hash_link, bucket);
if (!metaphysical)
-   rmap_write_protect(vcpu, gfn);
+   rmap_write_protect(vcpu, gfn, flush_tlb);
return page;
 }
 
@@ -764,7 +771,7 @@ static int nonpaging_map(struct kvm_vcpu

>> PAGE_SHIFT;
new_table = kvm_mmu_get_page(vcpu, pseudo_gfn,
 v, level - 1,
-1, &table[index]);
+1, &table[index], 0);
if (!new_table) {
pgprintk("nonpaging_map: ENOMEM\n");
return -ENOMEM;
@@ -838,7 +845,7 @@ static void mmu_alloc_roots(struct kvm_v
 
 		ASSERT(!VALID_PAGE(root));

page = kvm_mmu_get_page(vcpu, root_gfn, 0,
-   PT64_ROOT_LEVEL, 0, NULL);
+   PT64_ROOT_LEVEL, 0, NULL, 0);
root = page->page_hpa;
++page->root_count;
vcpu->mmu.root_hpa = root;
@@ -857,7 +864,7 @@ static void mmu_alloc_roots(struct kvm_v
root_gfn = 0;
page = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
PT32_ROOT_LEVEL, 

Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Ingo Molnar

* Avi Kivity <[EMAIL PROTECTED]> wrote:

> >but AFAICS rmap_write_protect() is only ever called if we write a new 
> >cr3 - hence a TLB flush will happen anyway, because we do a 
> >vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?
> 
> No, rmap_write_protect() is called whenever we shadow a new guest page 
> table (the mechanism used to maintain coherency is write faults on 
> page tables).

hm. Where? I only see rmap_write_protect() called from 
kvm_mmu_get_page(), which is only called from nonpaging_map() [but it 
doesnt call the rmap function in that case, due to metaphysical], and 
mmu_alloc_roots(). mmu_alloc_roots() is only called from context init 
(init-only thing) or from paging_new_cr3().

ahh ... i missed paging_tmpl.h.

how about the patch below then?

furthermore, shouldnt we pass in the flush area size from the pagefault 
handler? In most cases it would be 4096 bytes, that would mean an invlpg 
is enough, not a full cr3 flush. Although ... i dont know how to invlpg 
a guest-side TLB. On VMX it probably doesnt matter at all. On SVM, is 
there a way (or a need) to flush a single TLB of another context ID?

Ingo

Index: linux/drivers/kvm/mmu.c
===
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -378,7 +378,7 @@ static void rmap_remove(struct kvm_vcpu 
}
 }
 
-static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn)
+static void rmap_write_protect(struct kvm_vcpu *vcpu, u64 gfn, int flush_tlb)
 {
struct kvm *kvm = vcpu->kvm;
struct page *page;
@@ -404,7 +404,13 @@ static void rmap_write_protect(struct kv
BUG_ON(!(*spte & PT_WRITABLE_MASK));
rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
rmap_remove(vcpu, spte);
-   kvm_arch_ops->tlb_flush(vcpu);
+   /*
+* While we removed a mapping there's no need to explicitly
+* flush the TLB here if we write a new cr3. It is needed
+* from the pagefault path though.
+*/
+   if (flush_tlb)
+   kvm_arch_ops->tlb_flush(vcpu);
*spte &= ~(u64)PT_WRITABLE_MASK;
}
 }
@@ -561,7 +567,8 @@ static struct kvm_mmu_page *kvm_mmu_get_
 gva_t gaddr,
 unsigned level,
 int metaphysical,
-u64 *parent_pte)
+u64 *parent_pte,
+int flush_tlb)
 {
union kvm_mmu_page_role role;
unsigned index;
@@ -597,7 +604,7 @@ static struct kvm_mmu_page *kvm_mmu_get_
page->role = role;
hlist_add_head(&page->hash_link, bucket);
if (!metaphysical)
-   rmap_write_protect(vcpu, gfn);
+   rmap_write_protect(vcpu, gfn, flush_tlb);
return page;
 }
 
@@ -764,7 +771,7 @@ static int nonpaging_map(struct kvm_vcpu
>> PAGE_SHIFT;
new_table = kvm_mmu_get_page(vcpu, pseudo_gfn,
 v, level - 1,
-1, &table[index]);
+1, &table[index], 0);
if (!new_table) {
pgprintk("nonpaging_map: ENOMEM\n");
return -ENOMEM;
@@ -838,7 +845,7 @@ static void mmu_alloc_roots(struct kvm_v
 
ASSERT(!VALID_PAGE(root));
page = kvm_mmu_get_page(vcpu, root_gfn, 0,
-   PT64_ROOT_LEVEL, 0, NULL);
+   PT64_ROOT_LEVEL, 0, NULL, 0);
root = page->page_hpa;
++page->root_count;
vcpu->mmu.root_hpa = root;
@@ -857,7 +864,7 @@ static void mmu_alloc_roots(struct kvm_v
root_gfn = 0;
page = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
PT32_ROOT_LEVEL, !is_paging(vcpu),
-   NULL);
+   NULL, 0);
root = page->page_hpa;
++page->root_count;
vcpu->mmu.pae_root[j][i] = root | PT_PRESENT_MASK;
Index: linux/drivers/kvm/paging_tmpl.h
===
--- linux.orig/drivers/kvm/paging_tmpl.h
+++ linux/drivers/kvm/paging_tmpl.h
@@ -245,7 +245,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu
table_gfn = walker->table_gfn[level - 2];
}
shadow_page = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
-  metaphysical, shadow_ent);
+

Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Avi Kivity

Ingo Molnar wrote:

> * Avi Kivity <[EMAIL PROTECTED]> wrote:
>
> > > the cache is zapped upon pagefaults anyway, so unpinning ought to be 
> > > possible. Which one would you prefer?
> >
> > It's zapped by the equivalent of mmu_free_roots(), right?  That's 
> > effectively unpinning it (by zeroing ->root_count).
>
> no, right now only the guest-visible cache is zapped - the roots are 
> zapped by natural rotation. I guess they should be zapped in 
> kvm_cr3_cache_clear() - but i wanted to keep that function an invariant 
> to the other MMU state, to make it easier to call it from whatever mmu 
> codepath.

Ok.  Some mechanism is then needed to unpin the pages in case they are 
recycled by the guest.

> > However, kvm takes pagefaults even for silly things like setting (in 
> > hardware) or clearing (in software) the dirty bit.
>
> yeah. I think it also does some TLB flushes that are not needed. For 
> example in rmap_write_protect() we do this:
>
>         rmap_remove(vcpu, spte);
>         kvm_arch_ops->tlb_flush(vcpu);
>
> but AFAICS rmap_write_protect() is only ever called if we write a new 
> cr3 - hence a TLB flush will happen anyway, because we do a 
> vmcs_writel(GUEST_CR3, new_cr3). Am i missing something?

No, rmap_write_protect() is called whenever we shadow a new guest page 
table (the mechanism used to maintain coherency is write faults on page 
tables).


> I didnt want to 
> remove it as part of the cr3 patches (to keep things simpler), but that 
> flush looks quite unnecessary to me. The patch below seems to work in 
> light testing.
>
> Ingo

Index: linux/drivers/kvm/mmu.c
===
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv
BUG_ON(!(*spte & PT_WRITABLE_MASK));
rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
rmap_remove(vcpu, spte);
-   kvm_arch_ops->tlb_flush(vcpu);
+   /*
+* While we removed a mapping there's no need to explicitly
+* flush the TLB here, because this codepath only triggers
+* if we write a new cr3 - which will flush the TLB anyway.
+*/
*spte &= ~(u64)PT_WRITABLE_MASK;
}
 }
  


This will kill svm.

I don't think the tlb flushes are really necessary on today's Intel 
machines, as they probably flush the tlb on each vmx switch.  But AMD 
keeps the TLB, and the flushes are critical there.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Ingo Molnar

* Avi Kivity <[EMAIL PROTECTED]> wrote:

> > the cache is zapped upon pagefaults anyway, so unpinning ought to be 
> > possible. Which one would you prefer?
> 
> It's zapped by the equivalent of mmu_free_roots(), right?  That's 
> effectively unpinning it (by zeroing ->root_count).

no, right now only the guest-visible cache is zapped - the roots are 
zapped by natural rotation. I guess they should be zapped in 
kvm_cr3_cache_clear() - but i wanted to keep that function an invariant 
to the other MMU state, to make it easier to call it from whatever mmu 
codepath.

> However, kvm takes pagefaults even for silly things like setting (in 
> hardware) or clearing (in software) the dirty bit.

yeah. I think it also does some TLB flushes that are not needed. For 
example in rmap_write_protect() we do this:

rmap_remove(vcpu, spte);
kvm_arch_ops->tlb_flush(vcpu);

but AFAICS rmap_write_protect() is only ever called if we write a new 
cr3 - hence a TLB flush will happen anyway, because we do a 
vmcs_writel(GUEST_CR3, new_cr3). Am i missing something? I didnt want to 
remove it as part of the cr3 patches (to keep things simpler), but that 
flush looks quite unnecessary to me. The patch below seems to work in 
light testing.

Ingo

Index: linux/drivers/kvm/mmu.c
===
--- linux.orig/drivers/kvm/mmu.c
+++ linux/drivers/kvm/mmu.c
@@ -404,7 +404,11 @@ static void rmap_write_protect(struct kv
BUG_ON(!(*spte & PT_WRITABLE_MASK));
rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
rmap_remove(vcpu, spte);
-   kvm_arch_ops->tlb_flush(vcpu);
+   /*
+* While we removed a mapping there's no need to explicitly
+* flush the TLB here, because this codepath only triggers
+* if we write a new cr3 - which will flush the TLB anyway.
+*/
*spte &= ~(u64)PT_WRITABLE_MASK;
}
 }


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-08 Thread Avi Kivity

Ingo Molnar wrote:

> > This is a little too good to be true.  Were both runs with the same 
> > KVM_NUM_MMU_PAGES?
>
> yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' 
> run should have been labeled as: 'cr3 tree with paravirt turned off'. 
> That's not completely 'trunk' but close to it, and all other changes 
> (like elimination of unnecessary TLB flushes) are fairly applied to 
> both.

Ok.  I guess there's a switch/switch back pattern in there.

> i also did a run with much less MMU cache pages of 256, and hackbench 1 
> stayed the same, while hackbench 5 numbers started fluctuating badly (i 
> think that workload is thrashing the MMU cache badly).

Yes, 256 is too low.

> > > -u64 *pae_root;
> > > +u64 *pae_root[KVM_CR3_CACHE_SIZE];
> >
> > hmm.  wouldn't it be simpler to have pae_root always point at the 
> > current root?
>
> does that guarantee that it's available? I wanted to 'pin' the root 
> itself this way, to make sure that if a guest switches to it via the 
> cache, that it's truly available and a valid root. cr3 addresses are 
> non-virtual so this is the only mechanism available to guarantee that 
> the host-side memory truly contains a root pagetable.


I meant

  u64 *pae_root_cache;
  u64 *pae_root; /* == pae_root_cache + 4*cache_index */

so that the rest of the code need not worry about the cache.


> > > +vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> > > +}
> > > }
> > > vcpu->mmu.root_hpa = INVALID_PAGE;
> > > }
> >
> > You keep the page directories pinned here. [...]
>
> yes.
>
> > [...]  This can be a problem if a guest frees a page directory, and 
> > then starts using it as a regular page.  kvm sometimes chooses not to 
> > emulate a write to a guest page table, but instead to zap it, which is 
> > impossible when the page is freed.  You need to either unpin the page 
> > when that happens, or add a hypercall to let kvm know when a page 
> > directory is freed.
>
> the cache is zapped upon pagefaults anyway, so unpinning ought to be 
> possible. Which one would you prefer?


It's zapped by the equivalent of mmu_free_roots(), right?  That's 
effectively unpinning it (by zeroing ->root_count).


However, kvm takes pagefaults even for silly things like setting (in 
hardware) or clearing (in software) the dirty bit.




  

> > > +#define KVM_API_MAGIC 0x87654321
> > > +
> >
> > linux/kvm.h is the vmm userspace interface.  The guest/host 
> > interface should probably go somewhere else.
>
> yeah. kvm_para.h?


Sounds good.



Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-07 Thread Christoph Hellwig
On Sat, Jan 06, 2007 at 01:08:18PM +, Pavel Machek wrote:
> Does this make Xen obsolete? I mean... we have xen patches in suse
> kernels, should we keep updating them, or just drop them in favour of
> KVM?

After all the Novell Marketing Hype you'll probably have to keep Xen ;-)
Except for that I suspect a paravirt kvm or lhype might be the better
hypervisor choice in the long term.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-07 Thread Hollis Blanchard
On Sun, 2007-01-07 at 14:20 +0200, Avi Kivity wrote:
> 
> 
> Well, you did say it was ad-hoc.  For reference, this is how I see the
> hypercall API:
[snip]
> - Guest/host communications is by guest physical addressed, as the 
> virtual->physical translation is much cheaper on the guest (__pa() vs
> a page table walk). 

Strongly agreed. One of the major problems we had with the PowerPC Xen
port was that Xen passes virtual addresses (even userspace virtual
addresses!) to the hypervisor. Performing an MMU search on PowerPC is far
more convoluted than x86's table walk and is not feasible in software.

I'm anxious to avoid the same mistake wherever possible.

Of course, even with physical addresses, data structures that straddle
page boundaries prevent the hypervisor from mapping contiguous physical
pages to discontiguous machine pages (or whatever terminology you want
to use).

IBM's Power Hypervisor passes hypercall data in registers most of the
time. In the places it communicates via memory, I believe the parameter
is actually a physical frame number, and the size is explicitly limited
to one page.

-Hollis

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
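
Hollis's point about passing physical addresses can be made concrete. A hedged sketch (the helper names are hypothetical, not from any hypervisor ABI): hypercall data is identified by a physical frame number, and a parameter block must not straddle a page boundary so the hypervisor can map exactly one machine page:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Frame number a page-aligned parameter block would be passed as. */
static inline uint64_t addr_to_pfn(uint64_t phys)
{
	return phys >> PAGE_SHIFT;
}

/* True if [phys, phys + len) crosses a page boundary, i.e. the block
 * could land on two discontiguous machine pages. */
static inline int straddles_page(uint64_t phys, size_t len)
{
	return (phys >> PAGE_SHIFT) != ((phys + len - 1) >> PAGE_SHIFT);
}
```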


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-07 Thread Ingo Molnar

* Avi Kivity <[EMAIL PROTECTED]> wrote:

> >2-task context-switch performance (in microseconds, lower is better):
> >
> > native:   1.11
> > --
> > Qemu:61.18
> > KVM upstream:53.01
> > KVM trunk:6.36
> > KVM trunk+paravirt/cr3:   1.60
> >
> >i.e. 2-task context-switch performance is faster by a factor of 4, and 
> >is now quite close to native speed!
> >  
> 
> Very impressive!  The gain probably comes not only from avoiding the 
> vmentry/vmexit, but also from avoiding the flushing of the global page 
> tlb entries.

90% of the win comes from the avoidance of the VM exit. To quantify this 
more precisely i have added an artificial __flush_tlb_global() call to 
after switch_to(), just to see how much impact an extra global flush has 
on the native kernel. Context-switch cost went from 1.11 usecs to 1.65 
usecs. Then i added a __flush_tlb(), which made the cost go to 1.75, 
which means that the global flush component is at around 0.5 usecs.
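
As a sanity check of the subtraction behind those figures (numbers copied from the measurements above, nothing new):

```c
#include <assert.h>

/* Cost attributed to an extra flush: the difference between the
 * instrumented and the baseline context-switch time, in microseconds. */
static double flush_cost(double instrumented, double baseline)
{
	return instrumented - baseline;
}
```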

> >"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):
> >
> > native:   0.25
> > --
> > Qemu: 7.8
> > KVM upstream: 2.8
> > KVM trunk:0.55
> > KVM paravirt/cr3: 0.36
> >
> >almost twice as fast.
> >
> >"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):
> >
> > native:   0.9
> > --
> > Qemu:35.2
> > KVM upstream: 9.4
> > KVM trunk:2.8
> > KVM paravirt/cr3: 2.2
> >
> >still a 30% improvement - which isnt too bad considering that 200 tasks 
> >are context-switching in this workload and the cr3 cache in current CPUs 
> >is only 4 entries.
> >  
> 
> This is a little too good to be true.  Were both runs with the same 
> KVM_NUM_MMU_PAGES?

yes, both had the same elevated KVM_NUM_MMU_PAGES of 2048. The 'trunk' 
run should have been labeled as: 'cr3 tree with paravirt turned off'. 
That's not completely 'trunk' but close to it, and all other changes 
(like elimination of unnecessary TLB flushes) are fairly applied to 
both.

i also did a run with much less MMU cache pages of 256, and hackbench 1 
stayed the same, while hackbench 5 numbers started fluctuating badly (i 
think that workload is thrashing the MMU cache badly).

> I'm also concerned that at this point in time the cr3 optimizations 
> will only show an improvement in microbenchmarks.  In real life 
> workloads a context switch is usually preceded by an I/O, and with the 
> current sorry state of kvm I/O the context switch time would be 
> dominated by the I/O time.

oh, i agreed completely - but in my opinion accelerating virtual I/O is 
really easy. Accelerating the context-switch path (and basic syscall 
overhead like KVM does) is /hard/. So i wanted to see whether KVM runs 
well in all the hard cases, before looking at the low hanging 
performance fruits in the I/O area =B-)

also note that there's lots of internal reasons why application 
workloads can be heavily context-switching - it's not just I/O that 
generates them. (pipes, critical sections / futexes, etc.) So having 
near-native performance for context-switches is very important.

> >+if (irq & 8) {
> >+outb(cached_slave_mask, PIC_SLAVE_IMR);
> >+outb(0x60+(irq&7),PIC_SLAVE_CMD);/* 'Specific EOI' to slave */
> >+outb(0x60+PIC_CASCADE_IR,PIC_MASTER_CMD); /* 'Specific EOI' 
> >to master-IRQ2 */
> >+} else {
> >+outb(cached_master_mask, PIC_MASTER_IMR);
> >+/* 'Specific EOI' to master: */
> >+outb(0x60+irq, PIC_MASTER_CMD);
> >+}
> >+spin_unlock_irqrestore(&i8259A_lock, flags);
> >+}
> 
> Any reason this can't be applied to mainline?  There's probably no 
> downside to native, and it would benefit all virtualization solutions 
> equally.

this is legacy stuff ...

> >-u64 *pae_root;
> >+u64 *pae_root[KVM_CR3_CACHE_SIZE];
> 
> hmm.  wouldn't it be simpler to have pae_root always point at the 
> current root?

does that guarantee that it's available? I wanted to 'pin' the root 
itself this way, to make sure that if a guest switches to it via the 
cache, that it's truly available and a valid root. cr3 addresses are 
non-virtual so this is the only mechanism available to guarantee that 
the host-side memory truly contains a root pagetable.

> >+vcpu->mmu.pae_root[j][i] = INVALID_PAGE;
> >+}
> > }
> > vcpu->mmu.root_hpa = INVALID_PAGE;
> > }
> 
> You keep the page directories pinned here. [...]

yes.

> [...]  This can be a problem if a guest frees a page directory, and 
> then starts using it as a regular page.  kvm sometimes chooses not to 
> emulate a write to a guest page table, but instead to zap it, which is 
> impossible when the page is freed.  You need to either unpin the page
> when that happens, or add a hypercall to let kvm know when a page
> directory is freed.

Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-07 Thread Avi Kivity

Ingo Molnar wrote:
i'm pleased to announce the first release of paravirtualized KVM (Linux 
under Linux), which includes support for the hardware cr3-cache feature 
of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)


the patch is against 2.6.20-rc3 + KVM trunk and can be found at:

   http://redhat.com/~mingo/kvm-paravirt-patches/

Some aspects of the code are still a bit ad-hoc and incomplete, but the 
code is stable enough in my testing and i'd like to have some feedback. 


Firstly, here are some numbers:

2-task context-switch performance (in microseconds, lower is better):

 native:   1.11
 --
 Qemu:61.18
 KVM upstream:53.01
 KVM trunk:6.36
 KVM trunk+paravirt/cr3:   1.60

i.e. 2-task context-switch performance is faster by a factor of 4, and 
is now quite close to native speed!
  


Very impressive!  The gain probably comes not only from avoiding the 
vmentry/vmexit, but also from avoiding the flushing of the global page 
tlb entries.



"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):

 native:   0.25
 --
 Qemu: 7.8
 KVM upstream: 2.8
 KVM trunk:0.55
 KVM paravirt/cr3: 0.36

almost twice as fast.

"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):

 native:   0.9
 --
 Qemu:35.2
 KVM upstream: 9.4
 KVM trunk:2.8
 KVM paravirt/cr3: 2.2

still a 30% improvement - which isnt too bad considering that 200 tasks 
are context-switching in this workload and the cr3 cache in current CPUs 
is only 4 entries.
  


This is a little too good to be true.  Were both runs with the same 
KVM_NUM_MMU_PAGES?


I'm also concerned that at this point in time the cr3 optimizations will 
only show an improvement in microbenchmarks.  In real life workloads a 
context switch is usually preceded by an I/O, and with the current sorry 
state of kvm I/O the context switch time would be dominated by the I/O time.



the patchset does the following:

- it provides an ad-hoc paravirtualization hypercall API between a Linux 
  guest and a Linux host. (this will be replaced with a proper
  hypercall later on.)

- using the hypercall API it utilizes the "cr3 target cache" feature in 
  Intel VMX CPUs, and extends KVM to make use of that cache. This 
  feature allows the avoidance of expensive VM exits into hypervisor 
  context. (The guest needs to be 'aware' and the cache has to be
  shared between the guest and the hypervisor. So fully emulated OSs
  wont benefit from this feature.)

- a few simpler paravirtualization changes are done for Linux guests: IO 
  port delays do not cause a VM exit anymore, the i8259A IRQ controller 
  code got simplified (this will be replaced with a proper, hypercall
  based and host-maintained IRQ controller implementation) and TLB 
  flushes are more efficient, because no cr3 reads happen which would 
  otherwise cause a VM exit. These changes have a visible effect
  already: they reduce qemu's CPU usage when a guest idles in HLT, by 
  about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has 
  HZ=1000)


Paravirtualization is triggered via the kvm_paravirt=1 boot option (for 
now, this too is ad-hoc) - if that is passed then the KVM guest will 
probe for paravirtualization availability on the hypervisor side - and 
will use it if found. (If the guest does not find KVM-paravirt support 
on the hypervisor side then it will continue as a fully emulated guest.)


Issues: i only tested this on 32-bit VMX. (64-bit should work with not 
too many changes, the paravirt.c bits can be carried over to 64-bit 
almost as-is. But i didnt want to spread the code too wide.)


Comments, suggestions are welcome!

Ingo
  


Some comments below on the code.


+
+/*
+ * Special, register-to-cr3 instruction based hypercall API
+ * variant to the KVM host. This utilizes the cr3 filter capability
+ * of the hardware - if this works out then no VM exit happens,
+ * if a VM exit happens then KVM will get the virtual address too.
+ */
+static void kvm_write_cr3(unsigned long guest_cr3)
+{
+struct kvm_vcpu_para_state *para_state = &get_cpu_var(para_state);
+struct kvm_cr3_cache *cache = &para_state->cr3_cache;
+int idx;
+
+/*
+ * Check the cache (maintained by the host) for a matching
+ * guest_cr3 => host_cr3 mapping. Use it if found:
+ */
+for (idx = 0; idx < cache->max_idx; idx++) {
+if (cache->entry[idx].guest_cr3 == guest_cr3) {
+/*
+ * Cache-hit: we load the cached host-CR3 value.
+ * This never causes any VM exit. (if it does then the
+ * hypervisor could do nothing with this instruction
+
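
The lookup loop quoted above is cut off by the archive; as a userspace model of the same idea (struct layout inferred from the field names visible in the patch, so treat it as a sketch):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_CR3_CACHE_SIZE 4	/* current VMX CPUs expose 4 cr3-target slots */

struct kvm_cr3_cache_entry {
	uint64_t guest_cr3;
	uint64_t host_cr3;
};

struct kvm_cr3_cache {
	uint32_t max_idx;
	struct kvm_cr3_cache_entry entry[KVM_CR3_CACHE_SIZE];
};

/* Guest-side lookup: a hit returns the cached host cr3, which can be
 * loaded without a VM exit; a miss returns 0 and the caller takes the
 * ordinary, exiting cr3-write path. */
static uint64_t cr3_cache_lookup(const struct kvm_cr3_cache *c,
				 uint64_t guest_cr3)
{
	for (uint32_t idx = 0; idx < c->max_idx; idx++)
		if (c->entry[idx].guest_cr3 == guest_cr3)
			return c->entry[idx].host_cr3;
	return 0;
}
```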


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-06 Thread Pavel Machek
Hi!

> i'm pleased to announce the first release of paravirtualized KVM (Linux 
> under Linux), which includes support for the hardware cr3-cache feature 
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
> 
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
> 
>http://redhat.com/~mingo/kvm-paravirt-patches/
> 
> Some aspects of the code are still a bit ad-hoc and incomplete, but the 
> code is stable enough in my testing and i'd like to have some feedback. 
> 
> Firstly, here are some numbers:
> 
> 2-task context-switch performance (in microseconds, lower is better):
> 
>  native:   1.11
>  --
>  Qemu:61.18
>  KVM upstream:53.01
>  KVM trunk:6.36
>  KVM trunk+paravirt/cr3:   1.60

Does this make Xen obsolete? I mean... we have xen patches in suse
kernels, should we keep updating them, or just drop them in favour of
KVM?
Pavel
-- 
Thanks for all the (sleeping) penguins.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Ingo Molnar

* Zachary Amsden <[EMAIL PROTECTED]> wrote:

> >>What you really want is more like 
> >>EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
> >>
> >
> >yep. Not a big issue - what is important is to put the paravirt ops 
> >into the read-only section so that it's somewhat harder for rootkits 
> >to modify. (Also, it needs to be made clear that this is fundamental, 
> >lowlevel system functionality written by people under the GPLv2, so 
> >that if you utilize it beyond its original purpose, using its 
> >internals, you likely create a work derived from the kernel. 
> >Something simple as irq disabling probably doesnt qualify, and that 
> >we exported to modules for a long time, but lots of other details do. 
> >So the existence of paravirt_ops isnt a free-for all.)
> 
> I agree completely.  It would be nice to have a way to make certain 
> kernel structures available, but non-mutable to non-GPL modules.

the thing is, we are not 'making these available to non-GPL modules'. 
The _GPL postfix does not turn the other normal exports from grey into 
white. The _GPL postfix turns exports into almost-black for non-GPL 
modules. The rest is still grey.

in this case, i'd only make local_irq_*() available to modules (of any 
type), not the other paravirt ops. Exporting the whole of paravirt_ops 
was a mistake, and this has to be fixed before 2.6.20. I'll cook up a 
patch.

> >yes - this limit is easily triggered via the KVM/Qemu virtual serial 
> >drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's 
> >just a flag.
> 
> Can't you just test paravirt_enabled() in that case?

yep - i've changed this in my tree.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kvm-devel] [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Anthony Liguori
This is pretty cool.  I've read the VT spec a number of times and never 
understood why they included the CR3 caching :-)  I suspect that you may 
even be faster than Xen for context switches because of the hardware 
assistance here.  Any chance you can run your benchmarks against Xen?


You may already know this, but there isn't a CR3 cache in SVM AFAIK so a 
fair bit of the code will probably have to be enabled conditionally.  
Otherwise, I don't think SVM support would be that hard.  The only other 
odd bit I noticed was:



+   magic_val = KVM_API_MAGIC;
+   para_state->guest_version = KVM_PARA_API_VERSION;
+   para_state->host_version = -1;
+   para_state->size = sizeof(*para_state);
+
+   asm volatile ("movl %0, %%cr3"
+   : "=" (ret)
+   : "a" (para_state),
+ "0" (magic_val)
+   );


If I read this correctly, you're using a CR3 write as a hypercall (and 
relying on registers being set/returned that aren't part of the actual 
CR3 move)?  Any reason for not just using vmcall?  It seems a bit less 
awkward.  Even a PIO operation would be a bit cleaner IMHO.  PIO exits 
tend to be fast in VT/SVM so that's an added benefit.
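
The probe handshake implied by the quoted snippet can be sketched as follows (the field names and KVM_PARA_API_VERSION come from the snippet; the init helper is hypothetical, and whether the trapping instruction is a cr3 write, vmcall, or PIO is exactly the open question above):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_PARA_API_VERSION 1	/* illustrative version number */

struct kvm_vcpu_para_state {
	uint32_t guest_version;	/* filled in by the guest before the probe */
	int32_t  host_version;	/* -1 until a paravirt-aware host stamps it */
	uint32_t size;		/* lets host and guest agree on the layout */
};

static void para_state_init(struct kvm_vcpu_para_state *ps)
{
	ps->guest_version = KVM_PARA_API_VERSION;
	ps->host_version = -1;
	ps->size = sizeof(*ps);
}

/* After the trapping instruction returns: paravirt is available only if
 * the host wrote its version back; otherwise the guest falls back to
 * running fully emulated. */
static int paravirt_probed_ok(const struct kvm_vcpu_para_state *ps)
{
	return ps->host_version != -1;
}
```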


Regards,

Anthony Liguori

Ingo Molnar wrote:

i'm pleased to announce the first release of paravirtualized KVM (Linux
under Linux), which includes support for the hardware cr3-cache feature
of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)

the patch is against 2.6.20-rc3 + KVM trunk and can be found at:

   http://redhat.com/~mingo/kvm-paravirt-patches/

Some aspects of the code are still a bit ad-hoc and incomplete, but the
code is stable enough in my testing and i'd like to have some feedback.

Firstly, here are some numbers:

2-task context-switch performance (in microseconds, lower is better):

 native:   1.11
 --
 Qemu:61.18
 KVM upstream:53.01
 KVM trunk:6.36
 KVM trunk+paravirt/cr3:   1.60

i.e. 2-task context-switch performance is faster by a factor of 4, and
is now quite close to native speed!

"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):

 native:   0.25
 --
 Qemu: 7.8
 KVM upstream: 2.8
 KVM trunk:0.55
 KVM paravirt/cr3: 0.36

almost twice as fast.

"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):

 native:   0.9
 --
 Qemu:35.2
 KVM upstream: 9.4
 KVM trunk:2.8
 KVM paravirt/cr3: 2.2

still a 30% improvement - which isnt too bad considering that 200 tasks
are context-switching in this workload and the cr3 cache in current CPUs
is only 4 entries.

the patchset does the following:

- it provides an ad-hoc paravirtualization hypercall API between a Linux
  guest and a Linux host. (this will be replaced with a proper
  hypercall later on.)

- using the hypercall API it utilizes the "cr3 target cache" feature in
  Intel VMX CPUs, and extends KVM to make use of that cache. This
  feature allows the avoidance of expensive VM exits into hypervisor
  context. (The guest needs to be 'aware' and the cache has to be
  shared between the guest and the hypervisor. So fully emulated OSs
  wont benefit from this feature.)

- a few simpler paravirtualization changes are done for Linux guests: IO
  port delays do not cause a VM exit anymore, the i8259A IRQ controller
  code got simplified (this will be replaced with a proper, hypercall
  based and host-maintained IRQ controller implementation) and TLB
  flushes are more efficient, because no cr3 reads happen which would
  otherwise cause a VM exit. These changes have a visible effect
  already: they reduce qemu's CPU usage when a guest idles in HLT, by
  about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has
  HZ=1000)

Paravirtualization is triggered via the kvm_paravirt=1 boot option (for
now, this too is ad-hoc) - if that is passed then the KVM guest will
probe for paravirtualization availability on the hypervisor side - and
will use it if found. (If the guest does not find KVM-paravirt support
on the hypervisor side then it will continue as a fully emulated guest.)

Issues: i only tested this on 32-bit VMX. (64-bit should work with not
too many changes, the paravirt.c bits can be carried over to 64-bit
almost as-is. But i didn't want to spread the code too wide.)

Comments, suggestions are welcome!

Ingo

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash

Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Zachary Amsden

Ingo Molnar wrote:
> * Zachary Amsden <[EMAIL PROTECTED]> wrote:
>
> > What you really want is more like 
> > EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
>
> yep. Not a big issue - what is important is to put the paravirt ops into 
> the read-only section so that it's somewhat harder for rootkits to 
> modify. (Also, it needs to be made clear that this is fundamental, 
> lowlevel system functionality written by people under the GPLv2, so that 
> if you utilize it beyond its original purpose, using its internals, you 
> likely create a work derived from the kernel. Something simple as irq 
> disabling probably doesn't qualify, and that we exported to modules for a 
> long time, but lots of other details do. So the existence of 
> paravirt_ops isn't a free-for-all.)

I agree completely.  It would be nice to have a way to make certain 
kernel structures available, but non-mutable to non-GPL modules.

> > But I'm not sure that is technically feasible yet.
> >
> > The kvm code should probably go in kvm.c instead of paravirt.c.
>
> no. This is fundamental architecture boot code, not module code. kvm.c 
> should eventually go into kernel/ and arch/*/kernel, not the other way 
> around.

What I meant was kvm.c in arch/i386/kernel - as symmetric to the other 
paravirt-ops modules, which live in arch/i386/kernel/vmi.c / lhype.c, 
etc.  Either that, or we should move them to be symmetric, but I don't 
think paravirt.c is the proper place for kvm specific code.

> > Index: linux/drivers/serial/8250.c
> > ===================================================================
> > --- linux.orig/drivers/serial/8250.c
> > +++ linux/drivers/serial/8250.c
> > @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
> >
> > 	l = l->next;
> >
> > -	if (l == i->head && pass_counter++ > PASS_LIMIT) {
> > +	if (!kvm_paravirt && l == i->head && pass_counter++ > PASS_LIMIT) {
> >
> > Is this a bug that might happen under other virtualizations as well, 
> > not just kvm? Perhaps it deserves a disable feature instead of a kvm 
> > specific check.
>
> yes - this limit is easily triggered via the KVM/Qemu virtual serial 
> drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's just 
> a flag.

Can't you just test paravirt_enabled() in that case?


Zach


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Ingo Molnar

* Zachary Amsden <[EMAIL PROTECTED]> wrote:

> What you really want is more like 
> EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);

yep. Not a big issue - what is important is to put the paravirt ops into 
the read-only section so that it's somewhat harder for rootkits to 
modify. (Also, it needs to be made clear that this is fundamental, 
lowlevel system functionality written by people under the GPLv2, so that 
if you utilize it beyond its original purpose, using its internals, you 
likely create a work derived from the kernel. Something simple as irq 
disabling probably doesn't qualify, and that we exported to modules for a 
long time, but lots of other details do. So the existence of 
paravirt_ops isn't a free-for-all.)

> But I'm not sure that is technically feasible yet.
> 
> The kvm code should probably go in kvm.c instead of paravirt.c.

no. This is fundamental architecture boot code, not module code. kvm.c 
should eventually go into kernel/ and arch/*/kernel, not the other way 
around.

> Index: linux/drivers/serial/8250.c
> ===
> --- linux.orig/drivers/serial/8250.c
> +++ linux/drivers/serial/8250.c
> @@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(
> 
>   l = l->next;
> 
> - if (l == i->head && pass_counter++ > PASS_LIMIT) {
> + if (!kvm_paravirt && l == i->head && pass_counter++ > PASS_LIMIT) {
> 
> Is this a bug that might happen under other virtualizations as well, 
> not just kvm? Perhaps it deserves a disable feature instead of a kvm 
> specific check.

yes - this limit is easily triggered via the KVM/Qemu virtual serial 
drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's just 
a flag.

Ingo


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Zachary Amsden

Ingo Molnar wrote:
> i'm pleased to announce the first release of paravirtualized KVM (Linux 
> under Linux), which includes support for the hardware cr3-cache feature 
> of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)
>
> the patch is against 2.6.20-rc3 + KVM trunk and can be found at:
>
>    http://redhat.com/~mingo/kvm-paravirt-patches/
>
> Some aspects of the code are still a bit ad-hoc and incomplete, but the 
> code is stable enough in my testing and i'd like to have some feedback. 

Your code looks generally good.  I have some comments.

You can't do this, even though you want to:

-EXPORT_SYMBOL(paravirt_ops);
+EXPORT_SYMBOL_GPL(paravirt_ops);

The problem is it makes all modules GPL - or at least all modules that 
use any kind of locking, pull in the basic definitions to enable and 
disable interrupts, thus the paravirt_ops symbol, so basically all modules.

What you really want is more like EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);

But I'm not sure that is technically feasible yet.

The kvm code should probably go in kvm.c instead of paravirt.c.

Index: linux/drivers/serial/8250.c
===================================================================
--- linux.orig/drivers/serial/8250.c
+++ linux/drivers/serial/8250.c
@@ -1371,7 +1371,7 @@ static irqreturn_t serial8250_interrupt(

	l = l->next;

-	if (l == i->head && pass_counter++ > PASS_LIMIT) {
+	if (!kvm_paravirt && l == i->head && pass_counter++ > PASS_LIMIT) {

Is this a bug that might happen under other virtualizations as well, not 
just kvm? Perhaps it deserves a disable feature instead of a kvm 
specific check.

Which also gets rid of the need for this unusually placed extern:

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1911,6 +1911,11 @@ static inline void set_task_cpu(struct t

 #endif /* CONFIG_SMP */

+/*
+ * Is paravirtualization active?
+ */
+extern int kvm_paravirt;
+





[announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Ingo Molnar

i'm pleased to announce the first release of paravirtualized KVM (Linux 
under Linux), which includes support for the hardware cr3-cache feature 
of Intel-VMX CPUs. (which speeds up context switches and TLB flushes)

the patch is against 2.6.20-rc3 + KVM trunk and can be found at:

   http://redhat.com/~mingo/kvm-paravirt-patches/

Some aspects of the code are still a bit ad-hoc and incomplete, but the 
code is stable enough in my testing and i'd like to have some feedback. 

Firstly, here are some numbers:

2-task context-switch performance (in microseconds, lower is better):

 native:   1.11
 --
 Qemu:61.18
 KVM upstream:53.01
 KVM trunk:6.36
 KVM trunk+paravirt/cr3:   1.60

i.e. 2-task context-switch performance is faster by a factor of 4, and 
is now quite close to native speed!

"hackbench 1" (utilizes 40 tasks, numbers in seconds, lower is better):

 native:   0.25
 --
 Qemu: 7.8
 KVM upstream: 2.8
 KVM trunk:0.55
 KVM paravirt/cr3: 0.36

almost twice as fast.

"hackbench 5" (utilizes 200 tasks, numbers in seconds, lower is better):

 native:   0.9
 --
 Qemu:35.2
 KVM upstream: 9.4
 KVM trunk:2.8
 KVM paravirt/cr3: 2.2

still a 30% improvement - which isn't too bad considering that 200 tasks 
are context-switching in this workload and the cr3 cache in current CPUs 
is only 4 entries.

the patchset does the following:

- it provides an ad-hoc paravirtualization hypercall API between a Linux 
  guest and a Linux host. (this will be replaced with a proper
  hypercall later on.)

- using the hypercall API it utilizes the "cr3 target cache" feature in 
  Intel VMX CPUs, and extends KVM to make use of that cache. This 
  feature allows the avoidance of expensive VM exits into hypervisor 
  context. (The guest needs to be 'aware' and the cache has to be
  shared between the guest and the hypervisor. So fully emulated OSs
  won't benefit from this feature.)

- a few simpler paravirtualization changes are done for Linux guests: IO 
  port delays do not cause a VM exit anymore, the i8259A IRQ controller 
  code got simplified (this will be replaced with a proper, hypercall
  based and host-maintained IRQ controller implementation) and TLB 
  flushes are more efficient, because no cr3 reads happen which would 
  otherwise cause a VM exit. These changes have a visible effect
  already: they reduce qemu's CPU usage when a guest idles in HLT, by 
  about 25%. (from ~20% CPU usage to 14% CPU usage if an -rt guest has 
  HZ=1000)

Paravirtualization is triggered via the kvm_paravirt=1 boot option (for 
now, this too is ad-hoc) - if that is passed then the KVM guest will 
probe for paravirtualization availability on the hypervisor side - and 
will use it if found. (If the guest does not find KVM-paravirt support 
on the hypervisor side then it will continue as a fully emulated guest.)

Issues: i only tested this on 32-bit VMX. (64-bit should work with not 
too many changes, the paravirt.c bits can be carried over to 64-bit 
almost as-is. But i didn't want to spread the code too wide.)

Comments, suggestions are welcome!

Ingo


Re: [announce] [patch] KVM paravirtualization for Linux

2007-01-05 Thread Ingo Molnar

* Zachary Amsden <[EMAIL PROTECTED]> wrote:

> > > What you really want is more like 
> > > EXPORT_SYMBOL_READABLE_GPL(paravirt_ops);
> >
> > yep. Not a big issue - what is important is to put the paravirt ops 
> > into the read-only section so that it's somewhat harder for rootkits 
> > to modify. (Also, it needs to be made clear that this is fundamental, 
> > lowlevel system functionality written by people under the GPLv2, so 
> > that if you utilize it beyond its original purpose, using its 
> > internals, you likely create a work derived from the kernel. 
> > Something simple as irq disabling probably doesn't qualify, and that 
> > we exported to modules for a long time, but lots of other details do. 
> > So the existence of paravirt_ops isn't a free-for-all.)
>
> I agree completely.  It would be nice to have a way to make certain 
> kernel structures available, but non-mutable to non-GPL modules.

the thing is, we are not 'making these available to non-GPL modules'. 
The _GPL postfix does not turn the other normal exports from grey into 
white. The _GPL postfix turns exports into almost-black for non-GPL 
modules. The rest is still grey.

in this case, i'd only make local_irq_*() available to modules (of any 
type), not the other paravirt ops. Exporting the whole of paravirt_ops 
was a mistake, and this has to be fixed before 2.6.20. I'll cook up a 
patch.

> > yes - this limit is easily triggered via the KVM/Qemu virtual serial 
> > drivers. You can think of "kvm_paravirt" as "Linux paravirt", it's 
> > just a flag.
>
> Can't you just test paravirt_enabled() in that case?

yep - i've changed this in my tree.

Ingo