Re: KVM and NUMA
Hi Daniel,

thanks for your response.

On Thursday, 15.07.2010, at 20:31 +0100, Daniel P. Berrange wrote:

> If numactl --hardware works, then libvirt should work,
> since libvirt uses the numactl library to query topology

Ok. I did not know that, and in my case it does not seem to work. See below.

> The NUMA topology does not get put inside the <cpu> element. It
> is one level up in a <topology> element. eg
>
> [example XML lost in the archive]

In my case (Ubuntu 10.04 LTS) it is just put inside the <cpu> element. Full
host listing (the capabilities XML was mangled by the archive; only the text
content survived): x86_64 core2duo tcp apparmor 0

> > I guess this is the fact, because QEMU does not recognize the
> > NUMA-Architecture (QEMU-Monitor):
> > (qemu) info numa
> > 0 nodes

Thanks for the clarification.

> There are two aspects to NUMA. 1. Placing QEMU on appropriate NUMA
> nodes. 2. defining guest NUMA topology

Right. I am interested in placing QEMU on the appropriate node.

> By default QEMU will float freely across any CPUs and all the guest
> RAM will appear in one node. This can be bad for performance,
> especially if you are benchmarking.
> So for performance testing you definitely want to bind QEMU to the
> CPUs within a single NUMA node at startup; this will mean that all
> memory accesses are local to the node -- unless you give the guest
> more virtual RAM than there is free RAM on the local NUMA node.
> Since you suggest you're using libvirt, the low level way to do
> this is in the guest XML at the <vcpu> element.

Ok. But will my QEMU instance use the appropriate RAM, since it does not
recognize the architecture?

> For further performance you also really want to enable hugepages on
> your host (eg mount hugetlbfs at /dev/hugepages), then restart the
> libvirtd daemon, and then add the following to your guest XML just
> after the <currentMemory> element:
>
>   <memoryBacking>
>     <hugepages/>
>   </memoryBacking>

I have played with that, too. I could mount the hugetlbfs filesystem and
define the mountpoint in libvirt. The guest started ok, but I could not
verify that hugepages were actually used. /proc/meminfo always showed 100%
free huge pages whether the guest was running or not. Shouldn't these pages
be used while the guest is running? As I said: Ubuntu, not RHEL.

Kind regards,
Ralf
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 6/6] KVM: MMU: use __xchg_spte() more smartly
Sometimes setting the spte atomically is not needed; this patch calls
__xchg_spte() more smartly.

Note: if the old mapping's accessed bit is already set, we need no atomic
operation, since the accessed bit cannot be lost.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fb3ae54..71e9268 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -682,9 +682,14 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 static void set_spte_track_bits(u64 *sptep, u64 new_spte)
 {
 	pfn_t pfn;
-	u64 old_spte;
+	u64 old_spte = *sptep;
+
+	if (!shadow_accessed_mask || !is_shadow_present_pte(old_spte) ||
+	      old_spte & shadow_accessed_mask) {
+		__set_spte(sptep, new_spte);
+	} else
+		old_spte = __xchg_spte(sptep, new_spte);
 
-	old_spte = __xchg_spte(sptep, new_spte);
 	if (!is_rmap_spte(old_spte))
 		return;
 	pfn = spte_to_pfn(old_spte);
-- 
1.6.1.2
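The idea of the patch can be modeled in plain userspace C. The names and bit positions below are stand-ins, not the kernel's; `__atomic_exchange_n` plays the role of `__xchg_spte()`:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in bit positions; the real masks live in the KVM MMU code. */
#define SPTE_PRESENT  (1ULL << 0)
#define SPTE_ACCESSED (1ULL << 5)

/*
 * Userspace model of the patch's idea: an atomic exchange is only needed
 * when hardware could concurrently set the accessed bit, i.e. when the old
 * spte is present and its accessed bit is still clear. Otherwise a plain
 * store loses nothing.
 */
static uint64_t set_spte_track_bits_model(uint64_t *sptep, uint64_t new_spte,
                                          int *used_atomic)
{
    uint64_t old_spte = *sptep;

    if (!(old_spte & SPTE_PRESENT) || (old_spte & SPTE_ACCESSED)) {
        *sptep = new_spte;      /* plain store: the A bit cannot be lost */
        *used_atomic = 0;
    } else {
        old_spte = __atomic_exchange_n(sptep, new_spte, __ATOMIC_SEQ_CST);
        *used_atomic = 1;       /* must observe a racing accessed-bit set */
    }
    return old_spte;
}
```

The model keeps only the present/accessed cases; the kernel condition also short-circuits on `!shadow_accessed_mask` (EPT without A/D bits), where there is no accessed bit to lose in the first place.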
[PATCH v2 5/6] KVM: MMU: cleanup spte set and accessed/dirty tracking
Introduce set_spte_track_bits() to clean up the current code.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c |   17 ++++++++---------
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4123e8e..fb3ae54 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -679,7 +679,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 	}
 }
 
-static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
+static void set_spte_track_bits(u64 *sptep, u64 new_spte)
 {
 	pfn_t pfn;
 	u64 old_spte;
@@ -692,6 +692,11 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
 		kvm_set_pfn_accessed(pfn);
 	if (is_writable_pte(old_spte))
 		kvm_set_pfn_dirty(pfn);
+}
+
+static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
+{
+	set_spte_track_bits(sptep, new_spte);
 	rmap_remove(kvm, sptep);
 }
 
@@ -791,7 +796,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			   unsigned long data)
 {
 	int need_flush = 0;
-	u64 *spte, new_spte, old_spte;
+	u64 *spte, new_spte;
 	pte_t *ptep = (pte_t *)data;
 	pfn_t new_pfn;
 
@@ -812,13 +817,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			new_spte &= ~PT_WRITABLE_MASK;
 			new_spte &= ~SPTE_HOST_WRITEABLE;
 			new_spte &= ~shadow_accessed_mask;
-			if (is_writable_pte(*spte))
-				kvm_set_pfn_dirty(spte_to_pfn(*spte));
-			old_spte = __xchg_spte(spte, new_spte);
-			if (is_shadow_present_pte(old_spte)
-			    && (!shadow_accessed_mask ||
-				  old_spte & shadow_accessed_mask))
-				mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
+			set_spte_track_bits(spte, new_spte);
 			spte = rmap_next(kvm, rmapp, spte);
 		}
 	}
-- 
1.6.1.2
[PATCH v2 4/6] KVM: MMU: don't atomically set spte if it's not present
If the old mapping is not present, spte.a cannot be lost, so no atomic
operation is needed to set the spte.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b3896bf..4123e8e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -307,9 +307,10 @@ static void update_spte(u64 *sptep, u64 new_spte)
 {
 	u64 old_spte;
 
-	if (!shadow_accessed_mask || (new_spte & shadow_accessed_mask)) {
+	if (!shadow_accessed_mask || (new_spte & shadow_accessed_mask) ||
+	      !is_rmap_spte(*sptep))
 		__set_spte(sptep, new_spte);
-	} else {
+	else {
 		old_spte = __xchg_spte(sptep, new_spte);
 		if (old_spte & shadow_accessed_mask)
 			mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
-- 
1.6.1.2
[PATCH v2 3/6] KVM: MMU: fix page dirty tracking lost while syncing page
In the sync-page path, if spte.writable is changed, page dirty tracking can
be lost. For example: assume spte.writable = 0 in an unsync page. When it is
synced, the spte is mapped writable (spte.writable = 1); later the guest
writes spte.gfn, so the gfn is dirty; then the guest changes this mapping to
read-only, and after it is synced again, spte.writable = 0.

So when the host releases the spte, it sees spte.writable = 0 and does not
mark the page dirty.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c |   10 +++-------
 1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 5937054..b3896bf 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1981,6 +1981,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		mark_page_dirty(vcpu->kvm, gfn);
 
 set_pte:
+	if (is_writable_pte(*sptep) && !is_writable_pte(spte))
+		kvm_set_pfn_dirty(pfn);
 	update_spte(sptep, spte);
 done:
 	return ret;
@@ -1994,7 +1996,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 bool reset_host_protection)
 {
 	int was_rmapped = 0;
-	int was_writable = is_writable_pte(*sptep);
 	int rmap_count;
 
 	pgprintk("%s: spte %llx access %x write_fault %d"
@@ -2044,15 +2045,10 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	page_header_update_slot(vcpu->kvm, sptep, gfn);
 	if (!was_rmapped) {
 		rmap_count = rmap_add(vcpu, sptep, gfn);
-		kvm_release_pfn_clean(pfn);
 		if (rmap_count > RMAP_RECYCLE_THRESHOLD)
 			rmap_recycle(vcpu, sptep, gfn);
-	} else {
-		if (was_writable)
-			kvm_release_pfn_dirty(pfn);
-		else
-			kvm_release_pfn_clean(pfn);
 	}
+	kvm_release_pfn_clean(pfn);
 	if (speculative) {
 		vcpu->arch.last_pte_updated = sptep;
 		vcpu->arch.last_pte_gfn = gfn;
-- 
1.6.1.2
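The scenario in the changelog boils down to one rule: the dirty information carried by a writable spte must be transferred to the page at the moment the spte turns read-only. A minimal userspace model of that rule (names hypothetical, the `kvm_set_pfn_dirty()` side effect simulated by a flag):

```c
#include <assert.h>
#include <stdint.h>

#define SPTE_WRITABLE (1ULL << 1)   /* stand-in for the writable bit */

static int page_dirty;              /* simulates kvm_set_pfn_dirty() */

/*
 * Model of the fix in set_spte(): when a writable spte is replaced by a
 * read-only one, the guest may already have written through the old
 * mapping, so record the dirtiness before the old value is overwritten.
 */
static void update_spte_model(uint64_t *sptep, uint64_t new_spte)
{
    if ((*sptep & SPTE_WRITABLE) && !(new_spte & SPTE_WRITABLE))
        page_dirty = 1;             /* would be kvm_set_pfn_dirty(pfn) */
    *sptep = new_spte;
}
```

Without the transition check, a writable-then-read-only sequence forgets that the page was ever writable, which is exactly the lost-dirty-bit bug described above.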
[PATCH v2 2/6] KVM: MMU: fix page accessed tracking lost if ept is enabled
In the current code, if EPT is enabled (shadow_accessed_mask == 0), page
accessed tracking is lost.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1a4b42e..5937054 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -687,7 +687,7 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
 	if (!is_rmap_spte(old_spte))
 		return;
 	pfn = spte_to_pfn(old_spte);
-	if (old_spte & shadow_accessed_mask)
+	if (!shadow_accessed_mask || old_spte & shadow_accessed_mask)
 		kvm_set_pfn_accessed(pfn);
 	if (is_writable_pte(old_spte))
 		kvm_set_pfn_dirty(pfn);
@@ -815,7 +815,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 				kvm_set_pfn_dirty(spte_to_pfn(*spte));
 			old_spte = __xchg_spte(spte, new_spte);
 			if (is_shadow_present_pte(old_spte)
-			    && (old_spte & shadow_accessed_mask))
+			    && (!shadow_accessed_mask ||
+				  old_spte & shadow_accessed_mask))
 				mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
 			spte = rmap_next(kvm, rmapp, spte);
 		}
-- 
1.6.1.2
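The condition that appears twice in the diff is worth spelling out: with EPT (and no accessed/dirty bit support), shadow_accessed_mask is 0, so the spte carries no accessed bit at all and KVM cannot know the page was *not* accessed — it must conservatively report it as accessed. A one-function model of the fixed predicate:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the fixed condition: when shadow_accessed_mask is 0 (EPT
 * without A/D-bit tracking) there is no accessed information in the
 * spte, so the page must conservatively be reported as accessed.
 */
static int treat_as_accessed(uint64_t shadow_accessed_mask, uint64_t old_spte)
{
    return !shadow_accessed_mask ||
           (old_spte & shadow_accessed_mask) != 0;
}
```

The pre-patch code used only the second clause, so under EPT the test was always false and `kvm_set_pfn_accessed()` was never called.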
[PATCH v2 1/6] KVM: MMU: fix forgotten reserved bits check in speculative path
In the speculative path, we should check the guest pte's reserved bits just
as the real processor does.

Reported-by: Marcelo Tosatti
Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c         |    9 ++++++++-
 arch/x86/kvm/paging_tmpl.h |    5 +++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d16efbe..1a4b42e 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2693,6 +2693,9 @@ static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
 		return;
 	}
 
+	if (is_rsvd_bits_set(vcpu, *(u64 *)new, PT_PAGE_TABLE_LEVEL))
+		return;
+
 	++vcpu->kvm->stat.mmu_pte_updated;
 	if (!sp->role.cr4_pae)
 		paging32_update_pte(vcpu, sp, spte, new);
@@ -2771,6 +2774,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		       bool guest_initiated)
 {
 	gfn_t gfn = gpa >> PAGE_SHIFT;
+	union kvm_mmu_page_role mask = { .word = 0 };
 	struct kvm_mmu_page *sp;
 	struct hlist_node *node;
 	LIST_HEAD(invalid_list);
@@ -2845,6 +2849,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		}
 	}
 
+	mask.cr0_wp = mask.cr4_pae = mask.nxe = 1;
 	for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) {
 		pte_size = sp->role.cr4_pae ? 8 : 4;
 		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
@@ -2892,7 +2897,9 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 		while (npte--) {
 			entry = *spte;
 			mmu_pte_write_zap_pte(vcpu, sp, spte);
-			if (gentry)
+			if (gentry &&
+			      !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
+			      & mask.word))
 				mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
 			if (!remote_flush && need_remote_flush(entry, *spte))
 				remote_flush = true;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a09e04c..231fce1 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -638,8 +638,9 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			return -EINVAL;
 
 		gfn = gpte_to_gfn(gpte);
-		if (gfn != sp->gfns[i] ||
-		      !is_present_gpte(gpte) || !(gpte & PT_ACCESSED_MASK)) {
+		if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL)
+		      || gfn != sp->gfns[i] || !is_present_gpte(gpte)
+		      || !(gpte & PT_ACCESSED_MASK)) {
 			u64 nonpresent;
 
 			if (is_present_gpte(gpte) || !clear_unsync)
-- 
1.6.1.2
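As a rough illustration of what a reserved-bits check looks like — this is not the kernel's is_rsvd_bits_set(), which also handles NX, large pages and per-mode quirks — a 4K-page 64-bit pte reserves the physical-address bits from MAXPHYADDR (reported by CPUID) up to bit 51, and a gpte with any of them set would fault on real hardware, so a speculative update must be dropped:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified reserved-bits test for a 64-bit 4K-page pte: bits
 * [maxphyaddr, 51] are reserved. Illustrative only; the real
 * is_rsvd_bits_set() has per-level and per-mode cases.
 */
static int rsvd_bits_set(uint64_t gpte, int maxphyaddr)
{
    uint64_t rsvd = ((1ULL << 52) - 1) & ~((1ULL << maxphyaddr) - 1);
    return (gpte & rsvd) != 0;
}
```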
Re: [PATCH] test: add test for xchg instruction
On Thu, Jul 15, 2010 at 08:58:47AM +0800, Wei Yongjun wrote:
> This patch adds a test for the xchg instruction.
>
> Signed-off-by: Wei Yongjun
> ---
>  kvm/test/x86/emulator.c |   52 ++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 52 insertions(+), 0 deletions(-)

Applied, thanks.
Re: [PATCHv2] KVM: x86 emulator: fix xchg instruction emulation
On Thu, Jul 15, 2010 at 08:51:58AM +0800, Wei Yongjun wrote:
> If the destination is a memory operand and the memory cannot
> map to a valid page, the xchg instruction emulation and locked
> instructions will not work on I/O regions and will get stuck in
> an endless loop. We should emulate exchange as write to fix it.
>
> Signed-off-by: Wei Yongjun
> Acked-by: Gleb Natapov

Applied, thanks.
Re: [PATCH] kvm: Don't walk memory_size == 0 slots in kvm_client_migration_log
On Wed, Jul 14, 2010 at 01:36:49PM -0600, Alex Williamson wrote:
> If we've unregistered a memory area, we should avoid calling
> qemu_get_ram_ptr() on the left over phys_offset cruft in the
> slot array. Now that we support removing ramblocks, the
> phys_offset ram_addr_t can go away and cause a lookup fault
> and abort.
>
> Signed-off-by: Alex Williamson

Applied to uq/master branch, thanks.
Re: [PATCH v2] KVM: x86: Call mask notifiers from pic.
On Thu, Jul 15, 2010 at 12:24:37PM +0300, Gleb Natapov wrote:
> If the PIT delivers an interrupt while the PIC is masking it, the OS
> will never do EOI and the ack notifier will not be called, so when the
> PIT is unmasked no PIT interrupts will be delivered any more. Calling
> the mask notifiers solves this issue.
>
> Signed-off-by: Gleb Natapov

Applied, thanks.
[PATCH 5/6] kvm, x86: use ro page and don't copy shared page
On page fault, we always call get_user_pages(write=1).

Actually, we don't need to do this when it is not a write fault.
get_user_pages(write=1) will cause a shared (KSM) page to be copied. If the
page is never modified afterwards, the copying and the copied page are just
wasted; KSM may then scan and re-merge them, which can cause thrashing.

In this patch, if the page is read-only for the host VMM and it is not a
write fault for the guest, we use the RO page; otherwise we use a writable
page.

Signed-off-by: Lai Jiangshan
---
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8ba9b0d..6382140 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1832,6 +1832,45 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 	}
 }
 
+/* get a current mapped page fast, and test whether the page is writable. */
+static struct page *get_user_page_and_protection(unsigned long addr,
+					int *writable)
+{
+	struct page *page[1];
+
+	if (__get_user_pages_fast(addr, 1, 1, page) == 1) {
+		*writable = 1;
+		return page[0];
+	}
+	if (__get_user_pages_fast(addr, 1, 0, page) == 1) {
+		*writable = 0;
+		return page[0];
+	}
+	return NULL;
+}
+
+static pfn_t kvm_get_pfn_for_page_fault(struct kvm *kvm, gfn_t gfn,
+					int write_fault, int *host_writable)
+{
+	unsigned long addr;
+	struct page *page;
+
+	if (!write_fault) {
+		addr = gfn_to_hva(kvm, gfn);
+		if (kvm_is_error_hva(addr)) {
+			get_page(bad_page);
+			return page_to_pfn(bad_page);
+		}
+
+		page = get_user_page_and_protection(addr, host_writable);
+		if (page)
+			return page_to_pfn(page);
+	}
+
+	*host_writable = 1;
+	return kvm_get_pfn_for_gfn(kvm, gfn);
+}
+
 static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 				  bool can_unsync)
 {
@@ -2085,6 +2124,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 	int level;
 	pfn_t pfn;
 	unsigned long mmu_seq;
+	int host_writable;
 
 	level = mapping_level(vcpu, gfn);
 
@@ -2099,7 +2139,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn);
+	pfn = kvm_get_pfn_for_page_fault(vcpu->kvm, gfn, write, &host_writable);
 
 	/* mmio */
 	if (is_error_pfn(pfn))
@@ -2109,7 +2149,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
 	kvm_mmu_free_some_pages(vcpu);
-	r = __direct_map(vcpu, v, write, level, gfn, pfn, true);
+	r = __direct_map(vcpu, v, write, level, gfn, pfn, host_writable);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
@@ -2307,6 +2347,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	int level;
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
+	int write_fault = error_code & PFERR_WRITE_MASK;
+	int host_writable;
 
 	ASSERT(vcpu);
 	ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
@@ -2321,15 +2363,16 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn);
+	pfn = kvm_get_pfn_for_page_fault(vcpu->kvm, gfn, write_fault,
+					&host_writable);
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
 	kvm_mmu_free_some_pages(vcpu);
-	r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK,
-			 level, gfn, pfn, true);
+	r = __direct_map(vcpu, gpa, write_fault,
+			 level, gfn, pfn, host_writable);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	return r;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index a9dbaa0..1874f51 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -430,6 +430,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	pfn_t pfn;
 	int level = PT_PAGE_TABLE_LEVEL;
 	unsigned long mmu_seq;
+	int host_writable;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 	kvm_mmu_audit(vcpu, "pre page fault");
@@ -461,7 +462,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	pfn = kvm_get_pfn_for_gfn(vcpu->kvm, walker.gfn);
+	pfn = kvm_get_pfn_for_page_fault(vcpu->kvm, walker.gfn, write_fault,
[rest of the patch truncated in the archive]
[PATCH 6/6] kvm, faster and simpler version of get_user_page_and_protection()
A lighter-weight version of get_user_page_and_protection().

Signed-off-by: Lai Jiangshan
---
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a34c785..d0e4f2f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -618,6 +618,8 @@ static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 	memcpy(dst, src, count * sizeof(pgd_t));
 }
 
+extern
+struct page *get_user_page_and_protection(unsigned long addr, int *writable);
 
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6382140..de44847 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1832,23 +1832,6 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 	}
 }
 
-/* get a current mapped page fast, and test whether the page is writable. */
-static struct page *get_user_page_and_protection(unsigned long addr,
-					int *writable)
-{
-	struct page *page[1];
-
-	if (__get_user_pages_fast(addr, 1, 1, page) == 1) {
-		*writable = 1;
-		return page[0];
-	}
-	if (__get_user_pages_fast(addr, 1, 0, page) == 1) {
-		*writable = 0;
-		return page[0];
-	}
-	return NULL;
-}
-
 static pfn_t kvm_get_pfn_for_page_fault(struct kvm *kvm, gfn_t gfn,
 					int write_fault, int *host_writable)
 {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index a4ce19f..34b05c7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -275,7 +275,6 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	return nr;
 }
-EXPORT_SYMBOL_GPL(__get_user_pages_fast);
 
 /**
  * get_user_pages_fast() - pin user pages in memory
@@ -375,3 +374,83 @@ slow_irqon:
 		return ret;
 	}
 }
+
+/*
+ * get a current mapped page fast, and test whether the page is writable.
+ * equivalent version (but slower):
+ *	{
+ *		struct page *page[1];
+ *
+ *		if (__get_user_pages_fast(addr, 1, 1, page) == 1) {
+ *			*writable = 1;
+ *			return page[0];
+ *		}
+ *		if (__get_user_pages_fast(addr, 1, 0, page) == 1) {
+ *			*writable = 0;
+ *			return page[0];
+ *		}
+ *		return NULL;
+ *	}
+ */
+struct page *get_user_page_and_protection(unsigned long addr, int *writable)
+{
+	unsigned long flags;
+	struct mm_struct *mm = current->mm;
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t pte, *ptep;
+
+	unsigned long mask = _PAGE_PRESENT | _PAGE_USER;
+	unsigned long offset = 0;
+	struct page *head, *page = NULL;
+
+	addr &= PAGE_MASK;
+
+	local_irq_save(flags);
+	pgdp = pgd_offset(mm, addr);
+	if (!pgd_present(*pgdp))
+		goto out;
+
+	pudp = pud_offset(pgdp, addr);
+	if (!pud_present(*pudp))
+		goto out;
+
+	if (unlikely(pud_large(*pudp))) {
+		pte = *(pte_t *)pudp;
+		offset = ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+		goto verify;
+	}
+
+	pmdp = pmd_offset(pudp, addr);
+	if (!pmd_present(*pmdp))
+		goto out;
+
+	if (unlikely(pmd_large(*pmdp))) {
+		pte = *(pte_t *)pmdp;
+		offset = ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+		goto verify;
+	}
+
+	ptep = pte_offset_map(pmdp, addr);
+	pte = gup_get_pte(ptep);
+	pte_unmap(ptep);
+
+verify:
+	if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask)
+		goto out;
+
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	head = pte_page(pte);
+	page = head + offset;
+	VM_BUG_ON(compound_head(page) != head);
+	get_page(page);
+	*writable = !!(pte_flags(pte) & _PAGE_RW);
+
+out:
+	local_irq_restore(flags);
+	return page;
+}
+EXPORT_SYMBOL_GPL(get_user_page_and_protection);
[PATCH 4/6] kvm: add host_writable parameter
Add a host_writable parameter to some functions; no functional change. This
prepares for using RO pages.

Signed-off-by: Lai Jiangshan
---
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0867ced..8ba9b0d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1861,7 +1861,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int user_fault,
 		    int write_fault, int dirty, int level,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
-		    bool can_unsync, bool reset_host_protection)
+		    bool can_unsync, bool host_writable)
 {
 	u64 spte;
 	int ret = 0;
@@ -1888,8 +1888,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
 			kvm_is_mmio_pfn(pfn));
 
-	if (reset_host_protection)
+	if (host_writable)
 		spte |= SPTE_HOST_WRITEABLE;
+	else
+		pte_access &= ~ACC_WRITE_MASK;
 
 	spte |= (u64)pfn << PAGE_SHIFT;
 
@@ -1942,7 +1944,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 int user_fault, int write_fault, int dirty,
 			 int *ptwrite, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative,
-			 bool reset_host_protection)
+			 bool host_writable)
 {
 	int was_rmapped = 0;
 	int was_writable = is_writable_pte(*sptep);
@@ -1978,7 +1980,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
 		      dirty, level, gfn, pfn, speculative, true,
-		      reset_host_protection)) {
+		      host_writable)) {
 		if (write_fault)
 			*ptwrite = 1;
 		kvm_mmu_flush_tlb(vcpu);
@@ -2015,7 +2017,7 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-			int level, gfn_t gfn, pfn_t pfn)
+			int level, gfn_t gfn, pfn_t pfn, bool host_writable)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
@@ -2026,7 +2028,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 		if (iterator.level == level) {
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
 				     0, write, 1, &pt_write,
-				     level, gfn, pfn, false, true);
+				     level, gfn, pfn, false, host_writable);
 			++vcpu->stat.pf_fixed;
 			break;
 		}
@@ -2107,7 +2109,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
 	kvm_mmu_free_some_pages(vcpu);
-	r = __direct_map(vcpu, v, write, level, gfn, pfn);
+	r = __direct_map(vcpu, v, write, level, gfn, pfn, true);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
@@ -2327,7 +2329,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 		goto out_unlock;
 	kvm_mmu_free_some_pages(vcpu);
 	r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK,
-			 level, gfn, pfn);
+			 level, gfn, pfn, true);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	return r;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ac24158..a9dbaa0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -291,7 +291,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		return;
 	kvm_get_pfn(pfn);
 	/*
-	 * we call mmu_set_spte() with reset_host_protection = true beacuse that
+	 * we call mmu_set_spte() with host_writable = true beacuse that
 	 * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
 	 */
 	mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
@@ -305,7 +305,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int user_fault, int write_fault, int hlevel,
-			 int *ptwrite, pfn_t pfn)
+			 int *ptwrite, pfn_t pfn, bool host_writable)
 {
 	unsigned access = gw->pt_access;
 	struct kvm_mmu_page *sp;
@@ -334,7 +334,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			     gw->pte_access & access, user_fault, write_fault,
 			     dirty, ptwrite, level,
[rest of the patch truncated in the archive]
[PATCH 3/6] kvm: rename gfn_to_pfn() etc.
gfn_to_pfn() actually increases the reference count of the page, but the
name "gfn_to_pfn" misses this semantic. So we rename it to
kvm_get_pfn_for_gfn(), which makes more sense. gfn_to_page() and
hva_to_pfn() are also renamed. (No behavior changed.)

Signed-off-by: Lai Jiangshan
---
diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
index 5cb5865..fe220d6 100644
--- a/arch/ia64/kvm/kvm-ia64.c
+++ b/arch/ia64/kvm/kvm-ia64.c
@@ -1589,7 +1589,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		return -ENOMEM;
 
 	for (i = 0; i < npages; i++) {
-		pfn = gfn_to_pfn(kvm, base_gfn + i);
+		pfn = kvm_get_pfn_for_gfn(kvm, base_gfn + i);
 		if (!kvm_is_mmio_pfn(pfn)) {
 			kvm_set_pmt_entry(kvm, base_gfn + i,
 					  pfn << PAGE_SHIFT,
diff --git a/arch/powerpc/kvm/44x_tlb.c b/arch/powerpc/kvm/44x_tlb.c
index 8123125..aeb67aa 100644
--- a/arch/powerpc/kvm/44x_tlb.c
+++ b/arch/powerpc/kvm/44x_tlb.c
@@ -314,7 +314,7 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 gvaddr, gpa_t gpaddr,
 
 	/* Get reference to new page. */
 	gfn = gpaddr >> PAGE_SHIFT;
-	new_page = gfn_to_page(vcpu->kvm, gfn);
+	new_page = kvm_get_page_for_gfn(vcpu->kvm, gfn);
 	if (is_error_page(new_page)) {
 		printk(KERN_ERR "Couldn't get guest page for gfn %lx!\n", gfn);
 		kvm_release_page_clean(new_page);
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index a3cef30..1611292 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -414,7 +414,7 @@ static void kvmppc_patch_dcbz(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte)
 	u32 *page;
 	int i;
 
-	hpage = gfn_to_page(vcpu->kvm, pte->raddr >> PAGE_SHIFT);
+	hpage = kvm_get_page_for_gfn(vcpu->kvm, pte->raddr >> PAGE_SHIFT);
 	if (is_error_page(hpage))
 		return;
 
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c
index 0b51ef8..94b4907 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -147,7 +147,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte)
 	struct hpte_cache *pte;
 
 	/* Get host physical address for gpa */
-	hpaddr = gfn_to_pfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT);
+	hpaddr = kvm_get_pfn_for_gfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT);
 	if (kvm_is_error_hva(hpaddr)) {
 		printk(KERN_INFO "Couldn't get guest page for gfn %lx!\n",
 		       orig_pte->eaddr);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 384179a..414a001 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -101,7 +101,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte)
 	struct kvmppc_sid_map *map;
 
 	/* Get host physical address for gpa */
-	hpaddr = gfn_to_pfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT);
+	hpaddr = kvm_get_pfn_for_gfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT);
 	if (kvm_is_error_hva(hpaddr)) {
 		printk(KERN_INFO "Couldn't get guest page for gfn %lx!\n",
 		       orig_pte->eaddr);
 		return -EINVAL;
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index f11ca0f..026e5c7 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -299,7 +299,7 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	stlbe = &vcpu_e500->shadow_tlb[tlbsel][esel];
 
 	/* Get reference to new page. */
-	new_page = gfn_to_page(vcpu_e500->vcpu.kvm, gfn);
+	new_page = kvm_get_page_for_gfn(vcpu_e500->vcpu.kvm, gfn);
 	if (is_error_page(new_page)) {
 		printk(KERN_ERR "Couldn't get guest page for gfn %lx!\n", gfn);
 		kvm_release_page_clean(new_page);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1f3cbb8..0867ced 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2097,7 +2097,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn);
 
 	/* mmio */
 	if (is_error_pfn(pfn))
@@ -2319,7 +2319,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	pfn = gfn_to_pfn(vcpu->kvm, gfn);
+	pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn);
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
 	spin_lock(&vcpu->kvm->mmu_lock);
@@ -2696,7 +2696,7 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
[rest of the patch truncated in the archive]
[PATCH 2/6] kvm, ept: remove the default write bit
When EPT is enabled, the current code sets shadow_base_present_pte including
the write bit, thus all pte entries have the writable bit set. This means
the guest OS can always write to any mapped page, even when the VMM maps RO
pages for the guest. We will use RO pages in the future, so fix it.

Signed-off-by: Lai Jiangshan
---
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 502e53f..62cc947 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -534,7 +534,6 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
 int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_setup(struct kvm_vcpu *vcpu);
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte);
-void kvm_mmu_set_base_ptes(u64 base_pte);
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b93b94f..1f3cbb8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -181,7 +181,6 @@ static struct kmem_cache *mmu_page_header_cache;
 
 static u64 __read_mostly shadow_trap_nonpresent_pte;
 static u64 __read_mostly shadow_notrap_nonpresent_pte;
-static u64 __read_mostly shadow_base_present_pte;
 static u64 __read_mostly shadow_nx_mask;
 static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
@@ -200,12 +199,6 @@ void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_nonpresent_ptes);
 
-void kvm_mmu_set_base_ptes(u64 base_pte)
-{
-	shadow_base_present_pte = base_pte;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_set_base_ptes);
-
 void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 		u64 dirty_mask, u64 nx_mask, u64 x_mask)
 {
@@ -1878,7 +1871,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	 * whether the guest actually used the pte (in order to detect
 	 * demand paging).
 	 */
-	spte = shadow_base_present_pte | shadow_dirty_mask;
+	spte = PT_PRESENT_MASK | shadow_dirty_mask;
 	if (!speculative)
 		spte |= shadow_accessed_mask;
 	if (!dirty)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2fdcc98..856e427 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4419,8 +4419,6 @@ static int __init vmx_init(void)
 
 	if (enable_ept) {
 		bypass_guest_pf = 0;
-		kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
-			VMX_EPT_WRITABLE_MASK);
 		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
 				VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb08316..5f2fb50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4237,7 +4237,6 @@ int kvm_arch_init(void *opaque)
 	kvm_x86_ops = ops;
 
 	kvm_mmu_set_nonpresent_ptes(0ull, 0ull);
-	kvm_mmu_set_base_ptes(PT_PRESENT_MASK);
 	kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK, PT_DIRTY_MASK,
 			PT64_NX_MASK, 0);
[PATCH 1/6] kvm: pass error code to handler
handle_ept_violation() does not pass the error code to the handler tdp_page_fault(), so tdp_page_fault() handles the page fault while ignoring the error code. It will not handle the page fault completely correctly, and may cause an endless page fault. Signed-off-by: Lai Jiangshan --- diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 856e427..b40731e 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3521,7 +3521,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); trace_kvm_page_fault(gpa, exit_qualification); - return kvm_mmu_page_fault(vcpu, gpa & PAGE_MASK, 0); + return kvm_mmu_page_fault(vcpu, gpa & PAGE_MASK, + exit_qualification & 0x2); } static u64 ept_rsvd_mask(u64 spte, int level) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: MMU: track dirty page in speculative path properly
Marcelo Tosatti wrote: >> It uses access bit to track both page accessed and page dirty, and it's >> rather cheap... > > Xiao, > > I don't understand it. What are you trying to achieve? > Marcelo, The issue we try to fix in this patch is that we mark the page dirty in the speculative path. I'm not sure whether this bug is gone after Lai's patch; if it is, this patch is not needed anymore, otherwise the later patch is the cheaper way to help us fix this issue. In the later patch, we use pte.a to track whether the speculative mapping is accessed; if it is accessed, we mark the page dirty, just like tracking page-accessed. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM vs. PCI-E Function Level Reset (FLR) ...
On Thursday 15 July 2010 23:39:36 Casey Leedom wrote: > | From: Sheng Yang > | Date: Wednesday, July 14, 2010 06:31 pm > | > | On Thursday 15 July 2010 02:01:29 Casey Leedom wrote: > | > | From: Sheng Yang > | > | Date: Tuesday, July 13, 2010 05:53 pm > | > | (Please use reply to all next time.) > > Sorry, old habit. > > | > | On Wednesday 14 July 2010 04:41:01 Casey Leedom wrote: > | > Hhrrmmm, this seems like a semantic error. "Resetting" the VM should > | > > | > be morally equivalent to resetting a real physical machine. And when > | > a real physical machine is reset, all of its buses, etc. get reset > | > which propagates to Device Resets on those buses ... I think that > | > Assigned Devices should be reset for reboots and power off/on ... > | > | Yes, it should be done when reset. And power on/off should be there, > | because that means creating and destroying the guest. > > Okay, cool. I've looked through the kernel code and I don't see any > likely places for putting such a VM Reboot/Reset feature in so I assume > that this is also controlled by QEmu and it would need to be responsible > for either doing a KVM_DEASSIGN_PCI_DEVICE/KVM_ASSIGN_PCI_DEVICE ioctl() > pair for each assigned device or issuing a new > KVM_RESET_ASSIGNED_PCI_DEVICE ioctl() for Reboot/Reset ... Yeah, the detection of reset is not that straightforward... Maybe we need an ioctl for the reset event in qemu-kvm kvm_reset_vcpu(). We don't need to assign/deassign the device on reboot/reset; we only need to notify KVM that the reset is happening, then KVM knows what to do. > | > Assigned Devices. For PCI-E SR-IOV Virtual Functions which are > | > assigned to a VM, they need to be reset at reboot/power off/power on > | > and the Configuration Space emulation needs to support the Guest OS/VM > | > Device Driver issuing an FLR ... > | > | You can add the FLR support for your device if you need to call it in the > | guest. But I don't quite understand why you need to call it in the guest. 
> | KVM should have already done that, and it's not necessary for native > | device. > > This was mostly for device driver load/unload support. I.e. being able > to issue a Function Level Reset to a PCI Device Function (regardless of it > being an SR-IOV Virtual Function or not) is a nice way to zap the device > back to a canonical state. OK. > > | So you want to issue FLR(rather than probing the feature) when > | driver->probe() called? Current seems KVM is the only user of > | pci_reset_function(), so I think we can update it after your driver > | posted to the mailing list. Also I am not sure why you want to reset > | function when probing. KVM should cover them all I think. > > Remember, the "probe" _argument_ in pci_dev_reset() has nothing to do > with whether it's being called from a device driver's probe() routine. > The "probe" _argument_ in pci_dev_reset() is used for the two-stage effort > in > pci_reset_function() which does: > > 1. "try to see if it's possible to reset this function by itself" > Call pci_dev_reset(dev, probe=1); > > 2. "go ahead and do the function reset" > Call pci_dev_reset(dev, probe=0); > > And yes, I'd noticed that KVM was the only current subscriber of the > kernel interface but, as I noted above, it is useful for resetting the > device function into a known state -- or at least it _was_ useful till the > deadlock got introduced in 2.6.31 ... :-( And again, the use of this > actually has no dependency on KVM: I would want to be able to call > pci_reset_function() in any device driver's probe() routine ... It just > happens that I ran into this need (and unfortunate 2.6.31 deadlock) with > our PCI-E SR-IOV Virtual Function Driver. What I meant was, before your driver, there was no such requirement in the code. And no one can predict every future usage of the code, so it's quite reasonable that you say the "deadlock" is there. And I can't say it's a "deadlock" because there is no code in the current tree that makes it a "deadlock", IIUR. 
So now that you have this requirement, you can modify it to fit your need. That's quite straightforward... > > Oh, and the driver has been posted to net-next. I'm guessing that it > _should_ get merged into 2.6.35 ... or maybe 2.6.36 ... I'm really not > sure of how the merge schedule between net-next and the core kernel runs > ... That's good. So you can modify the function to provide a lockless version. That makes sense now. :) -- regards Yang, Sheng > > Casey -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM and NUMA
On Thu, Jul 15, 2010 at 07:10:35PM +0200, Ralf Spenneberg wrote: > Hi, > > I just had a chance to play with KVM on Ubuntu 10.04 LTS on some new HP > 360 g6 with Nehalem processors. I have a feeling that KVM and NUMA on > these machines do not play well together. > > Doing some benchmarks I got bizarre numbers. Sometimes the VMs were > performing fine and some times the performance was very bad! Apparently > KVM does not recognize the NUMA-architecture and places memory and > process randomly and therefore often on different numa cells. > > First a couple of specs of the machine: > Two Nehalem sockets with E5520, Hyperthreading turned off, 4 cores per > socket, all in all 8 processors. > > > Linux recognizes the NUMA-architecture: > # numactl --hardware > available: 2 nodes (0-1) > node 0 cpus: 0 2 4 6 > node 0 size: 12277 MB > node 0 free: 9183 MB > node 1 cpus: 1 3 5 7 > node 1 size: 12287 MB > node 1 free: 8533 MB > node distances: > node 0 1 > 0: 10 20 > 1: 20 10 If numactl --hardware works, then libvirt should work, since libvirt uses the numactl library to query topology > > So I have got two cells with 4 cores each. > > Virsh does not recognize the topology: > # virsh capabilities > > > > x86_64 > core2duo > > > .. The NUMA topology does not get put inside the <cpu> element. It is one level up in a <topology> element. eg ...snip... This shows 2 NUMA nodes (cells in libvirt terminology), each with 4 CPUs. You can also query the free RAM in each node/cell: # virsh freecell 0 0: 1922084 kB # virsh freecell 1 1: 1035700 kB From both of these you can then decide where to place the guest. > I guess this is the fact, because QEMU does not recognize the > NUMA-Architecture (QEMU-Monitor): > (qemu) info numa > 0 nodes IIRC this is reporting the guest NUMA topology, which is completely independent of the host NUMA topology. > So apparently KVM does not utilize the NUMA-architecture. Did I do > something wrong. Is KVM missing a patch? 
Do I need to activate something > in KVM to recognize the NUMA-Architecture? There are two aspects to NUMA. 1. Placing QEMU on appropriate NUMA nodes. 2. Defining the guest NUMA topology. By default QEMU will float freely across any CPUs and all the guest RAM will appear in one node. This can be bad for performance, especially if you are benchmarking. So for performance testing you definitely want to bind QEMU to the CPUs within a single NUMA node at startup; this will mean that all memory accesses are local to the node, unless you give the guest more virtual RAM than there is free RAM on the local NUMA node. Since you suggest you're using libvirt, the low level way to do this is in the guest XML at the <vcpu> element. In my capabilities XML example above you can see 2 numa nodes, each with 4 cpus. So if I want to restrict the guest to the first NUMA node, which has CPU numbers 0, 1, 2, 3, then I'd do <name>rhel6x86_64</name> <uuid>0bbf8187-bce1-bc77-2a2c-fb033816f7f4</uuid> <memory>819200</memory> <currentMemory>819200</currentMemory> <vcpu cpuset='0-3'>2</vcpu> ...snip... You can verify the pinning with virsh vcpuinfo: # virsh vcpuinfo rhel5xen VCPU: 0 CPU: 1 State: running CPU time: 15.9s CPU Affinity: VCPU: 1 CPU: 2 State: running CPU time: 9.5s CPU Affinity: snip rest... It is not yet possible to define the guest-visible NUMA topology via libvirt, but that shouldn't be too critical for performance unless you need your guest to be able to span multiple host nodes. For further performance you also really want to enable hugepages on your host (eg mount hugetlbfs at /dev/hugepages), then restart the libvirtd daemon, and then add the following to your guest XML just after the <currentMemory> element: <memoryBacking> <hugepages/> </memoryBacking> This will make it pre-allocate hugepages for all guest RAM at startup. NB the downside is that you can't overcommit RAM, but that's a tradeoff between maximising utilization and maximising performance. 
Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM Call agenda for July 13th
On 07/15/2010 01:51 PM, Aurelien Jarno wrote: On Thu, Jul 15, 2010 at 01:43:28PM -0500, Justin M. Forbes wrote: On Tue, Jul 13, 2010 at 12:19:21PM -0500, Brian Jackson wrote: On Tuesday, July 13, 2010 12:01:22 pm Avi Kivity wrote: On 07/13/2010 07:57 PM, Anthony Liguori wrote: I'd like to see more frequent stable releases, at least if the stable branch contains fixes to user-reported bugs (or of course security or data integrity fixes). Would you like to see more frequent stable releases or more frequent master releases? Yes. But in this context I'm interested in stable releases. We have bugs reported, fixed, and the fix applied, yet the fixes are unreachable to users. Especially so since qemu-kvm 0.12-stable hasn't been merged with qemu basically since 0.12.4 came out. I was trying to help one of the Gentoo maintainers find post 0.12.4 patches the other day and had to point them to the upstream qemu stable tree. I have offered to take care of this, but so far I do not have commit access. You don't necessarily need commit access for that. Just create your own tree with backported patches, and then send a stable pull request to the mailing list. Precisely. In the case of stable, that means watching the mailing list and backporting patches as appropriate and periodically doing pull requests. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM Call agenda for July 13th
On Thu, Jul 15, 2010 at 01:43:28PM -0500, Justin M. Forbes wrote: > On Tue, Jul 13, 2010 at 12:19:21PM -0500, Brian Jackson wrote: > > On Tuesday, July 13, 2010 12:01:22 pm Avi Kivity wrote: > > > On 07/13/2010 07:57 PM, Anthony Liguori wrote: > > > >> I'd like to see more frequent stable releases, at least if the stable > > > >> branch contains fixes to user-reported bugs (or of course security or > > > >> data integrity fixes). > > > > > > > > Would you like to see more frequent stable releases or more frequent > > > > master releases? > > > > > > Yes. But in this context I'm interested in stable releases. We have > > > bugs reported, fixed, and the fix applied, yet the fixes are unreachable > > > to users. > > > > Especially so since qemu-kvm 0.12-stable hasn't been merged with qemu > > basically since 0.12.4 came out. I was trying to help one of the Gentoo > > maintainers find post 0.12.4 patches the other day and had to point them to > > the upstream qemu stable tree. > > I have offered to take care of this, but so far I do not have commit > access. > You don't necessarily need commit access for that. Just create your own tree with backported patches, and then send a stable pull request to the mailing list. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM Call agenda for July 13th
On Tue, Jul 13, 2010 at 12:19:21PM -0500, Brian Jackson wrote: > On Tuesday, July 13, 2010 12:01:22 pm Avi Kivity wrote: > > On 07/13/2010 07:57 PM, Anthony Liguori wrote: > > >> I'd like to see more frequent stable releases, at least if the stable > > >> branch contains fixes to user-reported bugs (or of course security or > > >> data integrity fixes). > > > > > > Would you like to see more frequent stable releases or more frequent > > > master releases? > > > > Yes. But in this context I'm interested in stable releases. We have > > bugs reported, fixed, and the fix applied, yet the fixes are unreachable > > to users. > > Especially so since qemu-kvm 0.12-stable hasn't been merged with qemu > basically since 0.12.4 came out. I was trying to help one of the Gentoo > maintainers find post 0.12.4 patches the other day and had to point them to > the upstream qemu stable tree. I have offered to take care of this, but so far I do not have commit access. Justin -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-net: avoid flush under lock
On Thu, 2010-07-15 at 15:19 +0300, Michael S. Tsirkin wrote: > We flush under vq mutex when changing backends. > This creates a deadlock as workqueue being flushed > needs this lock as well. > > https://bugzilla.redhat.com/show_bug.cgi?id=612421 > > Drop the vq mutex before flush: we have the device mutex > which is sufficient to prevent another ioctl from touching > the vq. Why do we need to flush the vq when trying to set the backend and we find that it is already set? Is this just an optimization? Thanks Sridhar > > Signed-off-by: Michael S. Tsirkin > --- > drivers/vhost/net.c | 5 + > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index 28d7786..50df58e6 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, > unsigned index, int fd) > rcu_assign_pointer(vq->private_data, sock); > vhost_net_enable_vq(n, vq); > done: > + mutex_unlock(&vq->mutex); > + > if (oldsock) { > vhost_net_flush_vq(n, index); > fput(oldsock->file); > } > > + mutex_unlock(&n->dev.mutex); > + return 0; > + > err_vq: > mutex_unlock(&vq->mutex); > err: -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: MMU: track dirty page in speculative path properly
On Thu, Jul 15, 2010 at 03:44:48PM +0800, Xiao Guangrong wrote: > > > Marcelo Tosatti wrote: > > >> How about just track access bit for speculative path, we set page both > >> accessed and > >> dirty(if it's writable) only if the access bit is set? > > > > A useful thing to do would be to allow read-only mappings, in the fault > > path (Lai sent a few patches in that direction sometime ago but there > > was no follow up). > > > > So in the case of a read-only fault from the guest, you'd inform > > get_user_pages() that read-only access is acceptable (so swapcache pages > > can be mapped, or qemu can mprotect(PROT_READ) guest memory). > > > > Yeah, it's a great work, i guess Lai will post the new version soon. > > And, even we do this, i think the page dirty track is still needed, right? > Then, how about my new idea to track page dirty for speculative path, just > as below draft patch does: > > @@ -687,10 +687,11 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 > new_spte) > if (!is_rmap_spte(old_spte)) > return; > pfn = spte_to_pfn(old_spte); > - if (old_spte & shadow_accessed_mask) > + if (old_spte & shadow_accessed_mask) { > kvm_set_pfn_accessed(pfn); > - if (is_writable_pte(old_spte)) > - kvm_set_pfn_dirty(pfn); > + if (is_writable_pte(old_spte)) > + kvm_set_pfn_dirty(pfn); > + } > rmap_remove(kvm, sptep); > } > > @@ -1920,8 +1921,11 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, > * demand paging). > */ > spte = shadow_base_present_pte | shadow_dirty_mask; > - if (!speculative) > + if (!speculative) { > spte |= shadow_accessed_mask; > + if (is_writable_pte(*sptep)) > + kvm_set_pfn_dirty(pfn); > + } > if (!dirty) > pte_access &= ~ACC_WRITE_MASK; > if (pte_access & ACC_EXEC_MASK) > > It uses access bit to track both page accessed and page dirty, and it's > rather cheap... Xiao, I don't understand it. What are you trying to achieve? 
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 11:45 AM, Eduard - Gabriel Munteanu wrote: On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote: No. PCI devices should never call cpu_physical_memory*. PCI devices should call pci_memory*. ISA devices should call isa_memory*. All device memory accesses should go through their respective buses. There can be multiple IOMMUs at different levels of the device hierarchy. If you don't provide bus-level memory access functions that chain through the hierarchy, it's extremely difficult to implement all the necessary hooks to perform the translations at different places. Regards, Anthony Liguori I liked Paul's initial approach more, at least if I understood him correctly. Basically I'm suggesting a single memory_* function that simply asks the bus for I/O and translation. Say you have something like this:

+ Bus 1 | Memory 1
|
---+ Bus 2 bridge | Memory 2
   |
   ---+ Bus 3 bridge | Device

Say Device wants to write to memory. If we have the DeviceState we needn't concern ourselves with whether this is a BusOneDevice or BusTwoDevice from the device code itself. We would just call memory_rw(dev_state, addr, buf, size, is_write); I dislike this API for a few reasons:

1) Buses have different types of addresses with different address ranges. This API would have to take a generic address type.
2) dev_state would be the qdev device state. This means qdev needs to have memory hook mechanisms that are chainable. I think it's unnecessary at the qdev level.
3) Users have upcasted device states, so it's more natural to pass PCIDevice than DeviceState.
4) There's an assumption that all devices can get to DeviceState. That's not always true today.

which simply recurses through DeviceState's and BusState's through their parent pointers. The actual bus can set up those to provide identification information and perhaps hooks for translation and access checking. 
So memory_rw() looks like this (pseudocode):

static void memory_rw(DeviceState *dev, target_phys_addr_t addr,
                      uint8_t *buf, int size, int is_write)
{
    BusState *bus = get_parent_bus_of_dev(dev);
    DeviceState *pdev = get_parent_dev(dev);
    target_phys_addr_t taddr;

    if (!bus) {
        /* This shouldn't happen. */
        assert(0);
    }

    if (bus->responsible_for(addr)) {
        raw_physical_memory_rw(addr, buf, size, is_write);
        return;
    }

    taddr = bus->translate(dev, addr);
    memory_rw(pdev, taddr, buf, size, is_write);

This is too simplistic because you sometimes have layering that doesn't fit into the bus model. For instance, virtio + pci. We really want a virtio_memory_rw that calls either syborg_memory_rw or pci_memory_rw based on the transport. In your proposal, we would have to model virtio-pci as a bus with a single device which appears awkward to me. Regards, Anthony Liguori

}

If we do this, it seems there's no need to provide separate functions. The actual buses must instead initialize those hooks properly. Translation here is something inherent to the bus, that handles arbitration between possibly multiple IOMMUs. Our memory would normally reside on / belong to the top-level bus. What do you think? (Naming could be better though.) Eduard -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 10:17:10AM -0700, Chris Wright wrote: > * Avi Kivity (a...@redhat.com) wrote: > > > > For emulated device, it seems like we can ignore ATS completely, no? > > Not if you want to emulate an ATS capable device ;) > > Eariler upthread I said: > > IOW, if qemu ever had a device with ATS support... > > So, that should've been a much bigger _IF_ > > thanks, > -chris I think we can augment some devices with ATS capability if there are performance gains in doing so. This doesn't seem to be a detail the actual guest OS would be interested in, so we can do it even for devices that existed long before the AMD IOMMU came into existence. But I'm not really sure about this, it's just a thought. Linux seems to be issuing IOTLB invalidation commands anyway. Eduard -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Avi Kivity (a...@redhat.com) wrote: > On 07/15/2010 08:17 PM, Chris Wright wrote: > > > >>For emulated device, it seems like we can ignore ATS completely, no? > >Not if you want to emulate an ATS capable device ;) > > What I meant was that the whole request translation, invalidate, dma > using a translated address thing is invisible to software. We can > emulate an ATS capable device by going through the iommu every time. Well, I don't see any reason to completely ignore it. It'd be really useful for testing (I'd use it that way). Esp to verify the invalidation of the device IOTLBs. But I think it's not a difficult thing to emulate once we have a proper api encapsulating a device's dma request. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 08:02:05PM +0300, Avi Kivity wrote: > For emulated device, it seems like we can ignore ATS completely, no? An important use-case for emulation is software testing, and the caching done by IOMMUs is an important part that needs to be handled in software. For this purpose it makes sense to emulate the behavior of the caches too. So we probably should not ignore the possibility of an emulated ATS device completely. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 08:17 PM, Chris Wright wrote: For emulated device, it seems like we can ignore ATS completely, no? Not if you want to emulate an ATS capable device ;) What I meant was that the whole request translation, invalidate, dma using a translated address thing is invisible to software. We can emulate an ATS capable device by going through the iommu every time. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM and NUMA
Hi, I just had a chance to play with KVM on Ubuntu 10.04 LTS on some new HP 360 g6 with Nehalem processors. I have a feeling that KVM and NUMA on these machines do not play well together. Doing some benchmarks I got bizarre numbers. Sometimes the VMs were performing fine and sometimes the performance was very bad! Apparently KVM does not recognize the NUMA architecture and places memory and processes randomly, and therefore often on different NUMA cells. First a couple of specs of the machine: two Nehalem sockets with E5520, Hyperthreading turned off, 4 cores per socket, all in all 8 processors.

# cat /proc/cpuinfo
processor  : 0
vendor_id  : GenuineIntel
cpu family : 6
model      : 26
model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
stepping   : 5
...

Linux recognizes the NUMA architecture:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6
node 0 size: 12277 MB
node 0 free: 9183 MB
node 1 cpus: 1 3 5 7
node 1 size: 12287 MB
node 1 free: 8533 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

So I have got two cells with 4 cores each. Virsh does not recognize the topology: # virsh capabilities x86_64 core2duo .. I guess this is the fact, because QEMU does not recognize the NUMA architecture (QEMU monitor): (qemu) info numa 0 nodes This is done on an Ubuntu 10.04 LTS: Linux lxkvm01 2.6.32-23-server #37-Ubuntu SMP Fri Jun 11 09:11:11 UTC 2010 x86_64 GNU/Linux The package used was: qemu-kvm 0.12.3+noroms-0ubuntu9.2 When doing benchmarks, numastat shows a lot of misses when the performance is bad.

# numastat
                   node0     node1
numa_hit        15527158  17015505
numa_miss        7032982   3512950
numa_foreign     3512950   7032982
interleave_hit      8078      8264
local_node      15525187  17006655
other_node       7034953   3521800

So apparently KVM does not utilize the NUMA architecture. Did I do something wrong? Is KVM missing a patch? Do I need to activate something in KVM to recognize the NUMA architecture? I tried the newest kernel module 2.6.32.15 and qemu-kvm 0.12.4 without a change. Any hints? 
Kind regards, Ralf -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Avi Kivity (a...@redhat.com) wrote: > On 07/15/2010 07:52 PM, Chris Wright wrote: > > > >>Really? Can you provide an documentation to support this claim? > >>My impression is that there is no difference between translated and > >>untranslated devices, and the translation is explicitly disabled by > >>software. > >ATS allows an I/O device to request a translation from the IOMMU. > >The device can then cache that translation and use the translated address > >in a PCIe memory transaction. PCIe uses a couple of previously reserved > >bits in the transaction layer packet header to describe the address > >type for memory transactions. The default (00) maps to legacy PCIe and > >describes the memory address as untranslated. This is the normal mode, > >and could then incur a translation if an IOMMU is present and programmed > >w/ page tables, etc. as it passes through the host bridge. > > > >Another type is simply a transaction requesting a translation. This is > >new, and allows a device to request (and cache) a translation from the > >IOMMU for subsequent use. > > > >The third type is a memory transaction tagged as already translated. > >This is the type of transaction an ATS capable I/O device will generate > >when it was able to translate the memory address from its own cache. > > > >Of course, there's also an invalidation request that the IOMMU can send > >to ATS capable I/O devices to invalidate the cached translation. > > For emulated device, it seems like we can ignore ATS completely, no? Not if you want to emulate an ATS capable device ;) Earlier upthread I said: IOW, if qemu ever had a device with ATS support... So, that should've been a much bigger _IF_ thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Chris Wright (chr...@sous-sol.org) wrote:
> * Paul Brook (p...@codesourcery.com) wrote:
> > > > The right approach IMHO is to convert devices to use bus-specific
> > > > functions to access memory. The bus specific functions should have
> > > > a device argument as the first parameter.
> > >
> > > As for ATS, the internal api to handle the device's dma request needs
> > > a notion of a translated vs. an untranslated request. IOW, if qemu ever
> > > had a device with ATS support, the device would use its local cache to
> > > translate the dma address and then submit a translated request to the
> > > pci bus (effectively doing a raw cpu physical memory* in that case).
> >
> > Really? Can you provide any documentation to support this claim?

Wow...color me surprised...there's actually some apparently public "training" docs that might help give a more complete view:

http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=0ab681ba7001e40cdb297ddaf279a8de82a7dc40

ATS discussion starts on slide 23.

> > My impression is that there is no difference between translated and
> > untranslated devices, and the translation is explicitly disabled by
> > software.

And now that I re-read that sentence, I see what you are talking about. Yes, there is the above notion as well. A device can live in a 1:1 mapping of device address space to physical memory. This could be achieved in a few ways (all done by the OS software programming the IOMMU). One is to simply create a set of page tables that map 1:1 all of device memory to physical memory. Another is to somehow mark the device as special (either omit translation tables or mark the translation entry as effectively "do not translate"). This is often referred to as Pass Through mode.

But this is not the same as ATS. Pass Through mode is the functional equivalent of disabling the translation/isolation capabilities of the IOMMU. It's typically used when an OS wants to keep a device for itself and isn't interested in the isolation properties of the IOMMU. It then only creates isolating translation tables for devices it's giving to unprivileged software (e.g. Linux/KVM giving a device to a guest, Linux giving a device to a user space process, etc.)

> ATS allows an I/O device to request a translation from the IOMMU.
> The device can then cache that translation and use the translated address
> in a PCIe memory transaction. PCIe uses a couple of previously reserved
> bits in the transaction layer packet header to describe the address
> type for memory transactions. The default (00) maps to legacy PCIe and
> describes the memory address as untranslated. This is the normal mode,
> and could then incur a translation if an IOMMU is present and programmed
> w/ page tables, etc. as it passes through the host bridge.
>
> Another type is simply a transaction requesting a translation. This is
> new, and allows a device to request (and cache) a translation from the
> IOMMU for subsequent use.
>
> The third type is a memory transaction tagged as already translated.
> This is the type of transaction an ATS capable I/O device will generate
> when it was able to translate the memory address from its own cache.
>
> Of course, there's also an invalidation request that the IOMMU can send
> to ATS capable I/O devices to invalidate the cached translation.

thanks,
-chris
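[Editor's note] The three transaction-layer address types described above (untranslated, translation request, translated) and the device-side Address Translation Cache can be sketched in a small toy model. This is a hedged illustration only: the `AT_*` constants mirror the PCIe AT field encodings, but `AtsDevice`, `iommu_translate` and the caching policy are invented for the example — this is not QEMU or kernel code.

```python
# PCIe AT field encodings for memory transactions (per the ATS spec).
AT_UNTRANSLATED = 0b00         # default: IOMMU may translate at the host bridge
AT_TRANSLATION_REQUEST = 0b01  # device asks the IOMMU for a translation
AT_TRANSLATED = 0b10           # address already translated by the device's ATC


class AtsDevice:
    """Toy model of an ATS-capable device (illustrative names only)."""

    def __init__(self, iommu_translate):
        self.iommu_translate = iommu_translate  # callable: device addr -> host addr
        self.atc = {}  # Address Translation Cache

    def dma(self, addr):
        """Return (at_field, bus_address) for a memory transaction."""
        if addr in self.atc:
            # ATC hit: tag the transaction as already translated.
            return AT_TRANSLATED, self.atc[addr]
        # ATC miss: issue an ordinary untranslated transaction (the IOMMU
        # translates it at the host bridge), and also send a translation
        # request (AT_TRANSLATION_REQUEST on the wire) to fill the cache
        # for subsequent use.
        self.atc[addr] = self.iommu_translate(addr)
        return AT_UNTRANSLATED, addr

    def invalidate(self, addr):
        """IOMMU-initiated invalidation of a cached translation."""
        self.atc.pop(addr, None)
```

In this model the first access to an address goes out untranslated while the translation is fetched and cached; later accesses to the same address go out pre-translated until the IOMMU invalidates the entry.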
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 07:52 PM, Chris Wright wrote:
>> Really? Can you provide any documentation to support this claim?
>> My impression is that there is no difference between translated and
>> untranslated devices, and the translation is explicitly disabled by
>> software.
>
> ATS allows an I/O device to request a translation from the IOMMU.
> The device can then cache that translation and use the translated address
> in a PCIe memory transaction. PCIe uses a couple of previously reserved
> bits in the transaction layer packet header to describe the address
> type for memory transactions. The default (00) maps to legacy PCIe and
> describes the memory address as untranslated. This is the normal mode,
> and could then incur a translation if an IOMMU is present and programmed
> w/ page tables, etc. as it passes through the host bridge.
>
> Another type is simply a transaction requesting a translation. This is
> new, and allows a device to request (and cache) a translation from the
> IOMMU for subsequent use.
>
> The third type is a memory transaction tagged as already translated.
> This is the type of transaction an ATS capable I/O device will generate
> when it was able to translate the memory address from its own cache.
>
> Of course, there's also an invalidation request that the IOMMU can send
> to ATS capable I/O devices to invalidate the cached translation.

For emulated devices, it seems like we can ignore ATS completely, no?

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Paul Brook (p...@codesourcery.com) wrote:
> > > The right approach IMHO is to convert devices to use bus-specific
> > > functions to access memory. The bus specific functions should have
> > > a device argument as the first parameter.
> >
> > As for ATS, the internal api to handle the device's dma request needs
> > a notion of a translated vs. an untranslated request. IOW, if qemu ever
> > had a device with ATS support, the device would use its local cache to
> > translate the dma address and then submit a translated request to the
> > pci bus (effectively doing a raw cpu physical memory* in that case).
>
> Really? Can you provide any documentation to support this claim?
> My impression is that there is no difference between translated and
> untranslated devices, and the translation is explicitly disabled by software.

ATS allows an I/O device to request a translation from the IOMMU. The device can then cache that translation and use the translated address in a PCIe memory transaction. PCIe uses a couple of previously reserved bits in the transaction layer packet header to describe the address type for memory transactions. The default (00) maps to legacy PCIe and describes the memory address as untranslated. This is the normal mode, and could then incur a translation if an IOMMU is present and programmed w/ page tables, etc. as it passes through the host bridge.

Another type is simply a transaction requesting a translation. This is new, and allows a device to request (and cache) a translation from the IOMMU for subsequent use.

The third type is a memory transaction tagged as already translated. This is the type of transaction an ATS capable I/O device will generate when it was able to translate the memory address from its own cache.

Of course, there's also an invalidation request that the IOMMU can send to ATS capable I/O devices to invalidate the cached translation.

thanks,
-chris
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote:
> > No. PCI devices should never call cpu_physical_memory*.
> > PCI devices should call pci_memory*.
> > ISA devices should call isa_memory*.
> > All device memory accesses should go through their respective buses.
>
> There can be multiple IOMMUs at different levels of the device
> hierarchy. If you don't provide bus-level memory access functions that
> chain through the hierarchy, it's extremely difficult to implement all
> the necessary hooks to perform the translations at different places.
>
> Regards,
>
> Anthony Liguori

I liked Paul's initial approach more, at least if I understood him correctly. Basically I'm suggesting a single memory_* function that simply asks the bus for I/O and translation. Say you have something like this:

+ Bus 1
|  Memory 1
|
---+ Bus 2 bridge
   |  Memory 2
   |
   ---+ Bus 3 bridge
      |
      Device

Say Device wants to write to memory. If we have the DeviceState, we needn't concern ourselves in device code with whether this is a BusOneDevice or a BusTwoDevice. We would just call

    memory_rw(dev_state, addr, buf, size, is_write);

which simply recurses through DeviceState's and BusState's through their parent pointers. The actual bus can set those up to provide identification information and perhaps hooks for translation and access checking. So memory_rw() looks like this (pseudocode):

static void memory_rw(DeviceState *dev, target_phys_addr_t addr,
                      uint8_t *buf, int size, int is_write)
{
    BusState *bus = get_parent_bus_of_dev(dev);
    DeviceState *pdev = get_parent_dev(dev);
    target_phys_addr_t taddr;

    if (!bus) {
        /* This shouldn't happen. */
        assert(0);
    }

    if (bus->responsible_for(addr)) {
        raw_physical_memory_rw(addr, buf, size, is_write);
        return;
    }

    taddr = bus->translate(dev, addr);
    memory_rw(pdev, taddr, buf, size, is_write);
}

If we do this, it seems there's no need to provide separate functions. The actual buses must instead initialize those hooks properly. Translation here is something inherent to the bus, which handles arbitration between possibly multiple IOMMUs. Our memory would normally reside on / belong to the top-level bus.

What do you think? (Naming could be better though.)

Eduard
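[Editor's note] The recursion sketched above can be exercised with a small self-contained model. The `Bus`/`Device` classes below are stand-ins for QEMU's BusState/DeviceState; the names, the dict-backed memory, and the `translate` hook are assumptions for illustration, not the real QEMU API.

```python
class Bus:
    """Toy bus: either owns memory or forwards (with translation) upward."""

    def __init__(self, memory=None, parent_dev=None, translate=None):
        self.memory = memory          # dict if this bus owns memory, else None
        self.parent_dev = parent_dev  # the bridge device on the parent bus
        self.translate = translate or (lambda dev, addr: addr)

    def responsible_for(self, addr):
        return self.memory is not None


class Device:
    def __init__(self, bus):
        self.bus = bus


def memory_rw(dev, addr, buf, is_write):
    """Recurse up the bus hierarchy, translating at each bus, until a bus
    that is responsible for the address performs the raw access."""
    bus = dev.bus
    if bus.responsible_for(addr):
        if is_write:
            bus.memory[addr] = buf[0]
        else:
            buf[0] = bus.memory.get(addr, 0)
        return
    # Translate through this bus's IOMMU hook, then recurse via the bridge.
    taddr = bus.translate(dev, addr)
    memory_rw(bus.parent_dev, taddr, buf, is_write)


# Usage: a top-level bus with memory, and a child bus behind a bridge whose
# "IOMMU" adds an offset of 0x100 to every device address.
top = Bus(memory={})
bridge = Device(top)
child = Bus(parent_dev=bridge, translate=lambda dev, addr: addr + 0x100)
dev = Device(child)

memory_rw(dev, 0x10, [0xAB], is_write=True)   # lands at 0x110 on the top bus
```

The point of the sketch is the shape of the recursion: each level either claims the address or translates it and hands it to its parent, so stacked IOMMUs compose naturally.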
Re: KVM vs. PCI-E Function Level Reset (FLR) ...
[[Sorry for any duplicates -- something is going wrong with my local mail relay and my outbound mail is being rejected. I'm heading over to our IT guy's office to have a "Serious Talk" with him ... -- Casey]]

| From: Sheng Yang
| Date: Wednesday, July 14, 2010 06:31 pm
|
| On Thursday 15 July 2010 02:01:29 Casey Leedom wrote:
| > | From: Sheng Yang
| > | Date: Tuesday, July 13, 2010 05:53 pm
|
| (Please use reply to all next time.)

Sorry, old habit.

| > | On Wednesday 14 July 2010 04:41:01 Casey Leedom wrote:
|
| > Hhrrmmm, this seems like a semantic error. "Resetting" the VM should
| > be morally equivalent to resetting a real physical machine. And when
| > a real physical machine is reset, all of its buses, etc. get reset, which
| > propagates to Device Resets on those buses ... I think that Assigned
| > Devices should be reset for reboots and power off/on ...
|
| Yes, it should be done on reset. And power on/off should be there,
| because that means creating and destroying the guest.

Okay, cool. I've looked through the kernel code and I don't see any likely places for putting such a VM Reboot/Reset feature in, so I assume that this is also controlled by QEmu and it would need to be responsible for either doing a KVM_DEASSIGN_PCI_DEVICE/KVM_ASSIGN_PCI_DEVICE ioctl() pair for each assigned device or issuing a new KVM_RESET_ASSIGNED_PCI_DEVICE ioctl() for Reboot/Reset ...

| > Assigned Devices. For PCI-E SR-IOV Virtual Functions which are assigned
| > to a VM, they need to be reset at reboot/power off/power on and the
| > Configuration Space emulation needs to support the Guest OS/VM Device
| > Driver issuing an FLR ...
|
| You can add the FLR support for your device if you need to call it in the
| guest. But I don't quite understand why you need to call it in the guest.
| KVM should have already done that, and it's not necessary for a native
| device.

This was mostly for device driver load/unload support. I.e. being able to issue a Function Level Reset to a PCI Device Function (regardless of whether it's an SR-IOV Virtual Function or not) is a nice way to zap the device back to a canonical state.

| So you want to issue FLR (rather than probing the feature) when
| driver->probe() is called? Currently KVM seems to be the only user of
| pci_reset_function(), so I think we can update it after your driver is
| posted to the mailing list. Also I am not sure why you want to reset the
| function when probing. KVM should cover them all I think.

Remember, the "probe" argument in pci_dev_reset() has nothing to do with whether it's being called from a device driver's probe() routine. The "probe" argument in pci_dev_reset() is used for the two-stage effort in pci_reset_function() which does:

1. "try to see if it's possible to reset this function by itself"
   Call pci_dev_reset(dev, probe=1);

2. "go ahead and do the function reset"
   Call pci_dev_reset(dev, probe=0);

And yes, I'd noticed that KVM was the only current subscriber of the kernel interface but, as I noted above, it is useful for resetting the device function into a known state -- or at least it was useful till the deadlock got introduced in 2.6.31 ... :-(

And again, the use of this actually has no dependency on KVM: I would want to be able to call pci_reset_function() in any device driver's probe() routine ... It just happens that I ran into this need (and the unfortunate 2.6.31 deadlock) with our PCI-E SR-IOV Virtual Function Driver.

Oh, and the driver (cxgb4vf) has been posted to net-next. I'm guessing that it should get merged into 2.6.35 ... or maybe 2.6.36 ... I'm really not sure of how the merge schedule between net-next and the core kernel runs ...

Casey
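[Editor's note] The two-stage probe=1/probe=0 convention Casey describes can be illustrated with a short sketch. This only mimics the calling convention; the dict-based device and the tuple of candidate reset methods are illustrative stand-ins, not the actual kernel implementation of pci_dev_reset()/pci_reset_function().

```python
def pci_dev_reset(dev, probe):
    """Try each per-function reset method in order (illustrative names).

    With probe=1, only report whether *some* method exists, without
    touching the device; with probe=0, actually perform the reset.
    """
    for method in (dev.get("flr"), dev.get("pm_reset"), dev.get("slot_reset")):
        if method is None:
            continue
        if probe:
            return 0   # a usable reset method exists; don't touch the device
        method()       # actually reset the function
        return 0
    return -1          # no per-function reset available


def pci_reset_function(dev):
    # Stage 1: "try to see if it's possible to reset this function by itself"
    if pci_dev_reset(dev, probe=1) != 0:
        return -1
    # (The real kernel also saves/restores config space around the reset.)
    # Stage 2: "go ahead and do the function reset"
    return pci_dev_reset(dev, probe=0)
```

The probe stage is what lets the kernel refuse early (without side effects) when a function has no self-contained reset, which is exactly why the argument name has nothing to do with driver probe() routines.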
Re: Cannot start new VMs when the host system is under load
I just noticed that if I leave off the -nodefaults from the qemu command line, the guest will reliably start up again. I don't know what reasons the libvirt developers had for putting it there, though...
[KVM-AUTOTEST PATCH 9/9] [RFC] KVM test: add WHQL test definitions to tests_base.cfg.sample
The parameters that define submissions (dd_name_*, dd_data_*, desc_path_*) were collected from manually created submissions. I haven't yet collected the parameters for Windows 2003 and 2008, so for now WHQL tests are disabled for these versions.

This patch also adds a comment in tests.cfg.sample, encouraging the user to specify the DTM server's address and ports there.

Note that this patch renames WinVista.32sp1 to WinVista.32.sp1, WinVista.32sp2 to WinVista.32.sp2, WinVista.64sp1 to WinVista.64.sp1 and WinVista.64sp2 to WinVista.64.sp2.

Signed-off-by: Michael Goldish

---
 client/tests/kvm/tests.cfg.sample      |   8 +
 client/tests/kvm/tests_base.cfg.sample | 230 
 2 files changed, 184 insertions(+), 54 deletions(-)

diff --git a/client/tests/kvm/tests.cfg.sample b/client/tests/kvm/tests.cfg.sample
index e01406e..cdb7b0a 100644
--- a/client/tests/kvm/tests.cfg.sample
+++ b/client/tests/kvm/tests.cfg.sample
@@ -73,6 +73,14 @@ variants:
         only Fedora.13.64
         only unattended_install.cdrom boot shutdown
 
+# You may provide information about the DTM server for WHQL tests here:
+#whql:
+#server_address = 10.20.30.40
+#server_shell_port = 10022
+#server_file_transfer_port = 10023
+# Note that the DTM server must run rss.exe (available under deps/),
+# preferably with administrator privileges.
+
 # Uncomment the following lines to enable abort-on-error mode:
 #abort_on_error = yes
 #kill_vm.* ?= no

diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample
index d0b8acb..dabcdc4 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -285,6 +285,80 @@ variants:
                 iozone_cmd = "D:\IOzone\iozone.exe -a"
                 iozone_timeout = 3600
 
+    - @whql: install setup unattended_install.cdrom
+        nic_mode = tap
+        # Replace this with the address of an installed DTM server
+        server_address = 10.20.30.40
+        # The server should run rss.exe like a regular Windows VM, preferably
+        # with administrator privileges (or at least with permission to write
+        # to the DTM studio directory)
+        server_shell_port = 10022
+        server_file_transfer_port = 10023
+        server_studio_path = %programfiles%\Microsoft Driver Test Manager\Studio
+        wtt_services = wttsvc
+        variants:
+            - whql_client_install:
+                type = whql_client_install
+                dsso_delete_machine_binary = deps/whql_delete_machine_15.exe
+                # The username and password are required for accessing the DTM client
+                # installer binary shared by the server
+                server_username = administrator
+                server_password = 1q2w3eP
+                # This path refers to a shared directory on the server
+                # (the final cmd will be something like \\servername\DTMInstall\...)
+                install_cmd = \DTMInstall\Client\Setup.exe /passive
+                install_timeout = 1800
+            - whql_submission: whql_client_install
+                type = whql_submission
+                dsso_test_binary = deps/whql_submission_15.exe
+                test_timeout = 3600
+                device_data = cat0 cat1 cat2 cat3 logoarch logoos whqlos whqlqual prog desc filter virt
+                descriptors = desc1 desc2 desc3
+                # DeviceData names
+                dd_name_cat0 = Category
+                dd_name_cat1 = Category
+                dd_name_cat2 = Category
+                dd_name_cat3 = Category
+                dd_name_logoarch = LogoProcessorArchitecture
+                dd_name_logoos = LogoOperatingSystem
+                dd_name_whqlos = WhqlOs
+                dd_name_whqlqual = WhqlQualification
+                dd_name_prog = LogoProgramId
+                dd_name_desc = LogoProgramDescription
+                dd_name_filter = WDKFilterAttribute
+                dd_name_virt = ParaVirtualizationDriver
+                # Common DeviceData data
+                dd_data_filter = FilterIfNoInf
+                dd_data_virt = True
+                variants:
+                    - keyboard:
+                        # test_device is a regular expression that should match a device's
+                        # name as it appears in device manager. The first device that matches
+                        # is used.
+                        test_device = keyboard
+                        dd_data_cat0 = Input\Keyboard
+                        dd_data_cat1 = Device Fundamentals
+                        dd_data_cat2 = System Fundamentals\Dynamic Partitioning
+                        dd_data_prog = InputKbd
+                        dd_data_desc = Input > Keyboard
+                    - hdd:
+                        test_device =
[KVM-AUTOTEST PATCH 8/9] [RFC] KVM test: add whql_client_install test
whql_client_install installs the DTM client on a guest. It requires a functioning external DTM server which runs rss.exe like regular Windows VMs.

Signed-off-by: Michael Goldish

---
 client/tests/kvm/tests/whql_client_install.py | 110 +
 1 files changed, 110 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/whql_client_install.py

diff --git a/client/tests/kvm/tests/whql_client_install.py b/client/tests/kvm/tests/whql_client_install.py
new file mode 100644
index 000..f46939e
--- /dev/null
+++ b/client/tests/kvm/tests/whql_client_install.py
@@ -0,0 +1,110 @@
+import logging, time, os, re
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils, kvm_utils, rss_file_transfer
+
+
+def run_whql_client_install(test, params, env):
+    """
+    WHQL DTM client installation:
+    1) Log into the guest (the client machine) and into a DTM server machine
+    2) Stop the DTM client service (wttsvc) on the client machine
+    3) Delete the client machine from the server's data store
+    4) Rename the client machine (give it a randomly generated name)
+    5) Move the client machine into the server's workgroup
+    6) Reboot the client machine
+    7) Install the DTM client software
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm, 0, 240)
+
+    # Collect test params
+    server_address = params.get("server_address")
+    server_shell_port = int(params.get("server_shell_port"))
+    server_file_transfer_port = int(params.get("server_file_transfer_port"))
+    server_studio_path = params.get("server_studio_path", "%programfiles%\\ "
+                                    "Microsoft Driver Test Manager\\Studio")
+    server_username = params.get("server_username")
+    server_password = params.get("server_password")
+    dsso_delete_machine_binary = params.get("dsso_delete_machine_binary",
+                                            "deps/whql_delete_machine_15.exe")
+    dsso_delete_machine_binary = kvm_utils.get_path(test.bindir,
+                                                    dsso_delete_machine_binary)
+    install_timeout = float(params.get("install_timeout", 600))
+    install_cmd = params.get("install_cmd")
+    wtt_services = params.get("wtt_services")
+
+    # Stop WTT service(s) on client
+    for svc in wtt_services.split():
+        kvm_test_utils.stop_windows_service(session, svc)
+
+    # Copy dsso_delete_machine_binary to server
+    rss_file_transfer.upload(server_address, server_file_transfer_port,
+                             dsso_delete_machine_binary, server_studio_path,
+                             timeout=60)
+
+    # Open a shell session with server
+    server_session = kvm_utils.remote_login("nc", server_address,
+                                            server_shell_port, "", "",
+                                            session.prompt, session.linesep)
+
+    # Get server and client information
+    cmd = "echo %computername%"
+    server_name = server_session.get_command_output(cmd).strip()
+    client_name = session.get_command_output(cmd).strip()
+    cmd = "wmic computersystem get workgroup"
+    server_workgroup = server_session.get_command_output(cmd).strip()
+
+    # Delete the client machine from the server's data store (if it's there)
+    server_session.get_command_output("cd %s" % server_studio_path)
+    cmd = "%s %s %s" % (os.path.basename(dsso_delete_machine_binary),
+                        server_name, client_name)
+    server_session.get_command_output(cmd, print_func=logging.info)
+    server_session.close()
+
+    # Rename the client machine
+    client_name = "autotest_%s" % kvm_utils.generate_random_string(4)
+    logging.info("Renaming client machine to '%s'" % client_name)
+    cmd = ('wmic computersystem where name="%%computername%%" rename name="%s"'
+           % client_name)
+    s = session.get_command_status(cmd, timeout=600)
+    if s != 0:
+        raise error.TestError("Could not rename the client machine")
+
+    # Join the server's workgroup
+    logging.info("Joining workgroup '%s'" % server_workgroup)
+    cmd = ('wmic computersystem where name="%%computername%%" call '
+           'joindomainorworkgroup name="%s"' % server_workgroup)
+    s = session.get_command_status(cmd, timeout=600)
+    if s != 0:
+        raise error.TestError("Could not change the client's workgroup")
+
+    # Reboot
+    session = kvm_test_utils.reboot(vm, session)
+
+    # Access shared resources on the server machine
+    logging.info("Attempting to access remote share on server")
+    cmd = r"net use \\%s /user:%s %s" % (server_name, server_username,
+                                         server_password)
+    end_time = time.time() + 120
+    while time.time() <
[KVM-AUTOTEST PATCH 7/9] [RFC] KVM test: add whql_submission test
whql_submission runs a submission on a given device. It requires a functioning external DTM server which runs rss.exe like regular Windows VMs, preferably with administrator permissions.

The submission is defined by descriptors and device_data objects, which are specified in the config file(s). All jobs of the submission are executed. When all jobs complete, or when the timeout expires, HTML reports are generated and copied to test.debugdir (client/results/default/kvm...whql_submission/debug) and the raw test logs (wtl or xml files) are copied to test.debugdir as well.

Signed-off-by: Michael Goldish

---
 client/tests/kvm/tests/whql_submission.py | 176 +
 1 files changed, 176 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/whql_submission.py

diff --git a/client/tests/kvm/tests/whql_submission.py b/client/tests/kvm/tests/whql_submission.py
new file mode 100644
index 000..fd238ff
--- /dev/null
+++ b/client/tests/kvm/tests/whql_submission.py
@@ -0,0 +1,176 @@
+import logging, time, os, re
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils, kvm_utils, rss_file_transfer
+
+
+def run_whql_submission(test, params, env):
+    """
+    WHQL submission test:
+    1) Log into the guest (the client machine) and into a DTM server machine
+    2) Copy the automation program binary (dsso_test_binary) to the server machine
+    3) Run the automation program
+    4) Pass the program all relevant parameters (e.g. device_data)
+    5) Wait for the program to terminate
+    6) Parse and report job results
+       (logs and HTML reports are placed in test.bindir)
+
+    @param test: kvm test object
+    @param params: Dictionary with the test parameters
+    @param env: Dictionary with test environment.
+    """
+    vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+    session = kvm_test_utils.wait_for_login(vm, 0, 240)
+
+    # Collect parameters
+    server_address = params.get("server_address")
+    server_shell_port = int(params.get("server_shell_port"))
+    server_file_transfer_port = int(params.get("server_file_transfer_port"))
+    server_studio_path = params.get("server_studio_path", "%programfiles%\\ "
+                                    "Microsoft Driver Test Manager\\Studio")
+    dsso_test_binary = params.get("dsso_test_binary",
+                                  "deps/whql_submission_15.exe")
+    dsso_test_binary = kvm_utils.get_path(test.bindir, dsso_test_binary)
+    test_device = params.get("test_device")
+    test_timeout = float(params.get("test_timeout", 600))
+    wtt_services = params.get("wtt_services")
+
+    # Restart WTT service(s) on the client
+    logging.info("Restarting WTT services on client")
+    for svc in wtt_services.split():
+        kvm_test_utils.stop_windows_service(session, svc)
+    for svc in wtt_services.split():
+        kvm_test_utils.start_windows_service(session, svc)
+
+    # Copy dsso_test_binary to the server
+    rss_file_transfer.upload(server_address, server_file_transfer_port,
+                             dsso_test_binary, server_studio_path, timeout=60)
+
+    # Open a shell session with the server
+    server_session = kvm_utils.remote_login("nc", server_address,
+                                            server_shell_port, "", "",
+                                            session.prompt, session.linesep)
+
+    # Get the computer names of the server and client
+    cmd = "echo %computername%"
+    server_name = server_session.get_command_output(cmd).strip()
+    client_name = session.get_command_output(cmd).strip()
+    session.close()
+
+    # Run the automation program on the server
+    server_session.get_command_output("cd %s" % server_studio_path)
+    cmd = "%s %s %s %s %s %s" % (os.path.basename(dsso_test_binary),
+                                 server_name,
+                                 client_name,
+                                 "%s_pool" % client_name,
+                                 "%s_submission" % client_name,
+                                 test_timeout)
+    server_session.sendline(cmd)
+
+    # Helper function: wait for a given prompt and raise an exception if an
+    # error occurs
+    def find_prompt(prompt):
+        m, o = server_session.read_until_last_line_matches(
+            [prompt, server_session.prompt], print_func=logging.info,
+            timeout=600)
+        if m != 0:
+            errors = re.findall("^Error:.*$", o, re.I | re.M)
+            if errors:
+                raise error.TestError(errors[0])
+            else:
+                raise error.TestError("Error running automation program: could "
+                                      "not find '%s' prompt" % prompt)
+
+    # Tell the automation program which device to test
+    find_prompt("Device to test:")
+    server_session.sendline(test_device)
+
+    # Give the automation program all the device data supplied by the user
+    find_prompt("D
[KVM-AUTOTEST PATCH 6/9] [RFC] KVM test: add utility functions start_windows_service() and stop_windows_service()
These utilities use sc to stop and start Windows services. They're used by whql_submission and whql_client_install to stop or restart wttsvc on the client machine.

Signed-off-by: Michael Goldish

---
 client/tests/kvm/kvm_test_utils.py | 46 
 1 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py
index 53c11ae..9fd3a74 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -242,6 +242,52 @@ def migrate(vm, env=None, mig_timeout=3600, mig_protocol="tcp",
             raise
 
 
+def stop_windows_service(session, service, timeout=120):
+    """
+    Stop a Windows service using sc.
+    If the service is already stopped or is not installed, do nothing.
+
+    @param service: The name of the service
+    @param timeout: Time duration to wait for service to stop
+    @raise: error.TestError is raised if the service can't be stopped
+    """
+    end_time = time.time() + timeout
+    while time.time() < end_time:
+        o = session.get_command_output("sc stop %s" % service, timeout=60)
+        # FAILED 1060 means the service isn't installed.
+        # FAILED 1062 means the service hasn't been started.
+        if re.search(r"\bFAILED (1060|1062)\b", o, re.I):
+            break
+        time.sleep(1)
+    else:
+        raise error.TestError("Could not stop service '%s'" % service)
+
+
+def start_windows_service(session, service, timeout=120):
+    """
+    Start a Windows service using sc.
+    If the service is already running, do nothing.
+    If the service isn't installed, fail.
+
+    @param service: The name of the service
+    @param timeout: Time duration to wait for service to start
+    @raise: error.TestError is raised if the service can't be started
+    """
+    end_time = time.time() + timeout
+    while time.time() < end_time:
+        o = session.get_command_output("sc start %s" % service, timeout=60)
+        # FAILED 1060 means the service isn't installed.
+        if re.search(r"\bFAILED 1060\b", o, re.I):
+            raise error.TestError("Could not start service '%s' "
+                                  "(service not installed)" % service)
+        # FAILED 1056 means the service is already running.
+        if re.search(r"\bFAILED 1056\b", o, re.I):
+            break
+        time.sleep(1)
+    else:
+        raise error.TestError("Could not start service '%s'" % service)
+
+
 def get_time(session, time_command, time_filter_re, time_format):
     """
     Return the host time and guest time. If the guest time cannot be fetched
-- 
1.5.4.1
[KVM-AUTOTEST PATCH 5/9] [RFC] KVM test: DTM machine deletion tool for WHQL tests
This C# program should run on a DTM server. It's used by the whql_client_install test to delete the client machine from the server's data store (if listed there) prior to client installation. This seems to be necessary to prevent trouble during testing (like failure to contact the client machine).

Note: the binary is copied to the server at run time, so it doesn't need to be packaged in winutils.iso.

Signed-off-by: Michael Goldish

---
 client/tests/kvm/deps/whql_delete_machine_15.cs  |  82 ++
 client/tests/kvm/deps/whql_delete_machine_15.exe | Bin 0 -> 10240 bytes
 2 files changed, 82 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/deps/whql_delete_machine_15.cs
 create mode 100644 client/tests/kvm/deps/whql_delete_machine_15.exe

diff --git a/client/tests/kvm/deps/whql_delete_machine_15.cs b/client/tests/kvm/deps/whql_delete_machine_15.cs
new file mode 100644
index 000..1d78a6d
--- /dev/null
+++ b/client/tests/kvm/deps/whql_delete_machine_15.cs
@@ -0,0 +1,82 @@
+// DTM machine deletion tool
+// Author: Michael Goldish
+// Based on sample code by Microsoft.
+
+using System;
+using System.Collections.Generic;
+using System.Text.RegularExpressions;
+using Microsoft.DistributedAutomation.DeviceSelection;
+using Microsoft.DistributedAutomation.SqlDataStore;
+
+namespace automate0
+{
+    class AutoJob
+    {
+        static int Main(string[] args)
+        {
+            if (args.Length != 2)
+            {
+                Console.WriteLine("Error: incorrect number of command line arguments");
+                Console.WriteLine("Usage: {0} serverName clientName",
+                    System.Environment.GetCommandLineArgs()[0]);
+                return 1;
+            }
+            string serverName = args[0];
+            string clientName = args[1];
+
+            try
+            {
+                // Initialize DeviceScript and connect to data store
+                Console.WriteLine("Initializing DeviceScript object");
+                DeviceScript script = new DeviceScript();
+                Console.WriteLine("Connecting to data store");
+                script.ConnectToNamedDataStore(serverName);
+
+                // Find the client machine
+                IResourcePool rootPool = script.GetResourcePoolByName("$");
+                Console.WriteLine("Looking for client machine '{0}'", clientName);
+                IResource machine = rootPool.GetResourceByName(clientName);
+                if (machine == null)
+                {
+                    Console.WriteLine("Client machine not found");
+                    return 0;
+                }
+                Console.WriteLine("Client machine '{0}' found ({1}, {2})",
+                    clientName, machine.OperatingSystem, machine.ProcessorArchitecture);
+
+                // Change the client machine's status to 'unsafe'
+                Console.WriteLine("Changing the client machine's status to 'Unsafe'");
+                try
+                {
+                    machine.ChangeResourceStatus("Unsafe");
+                }
+                catch (Exception e)
+                {
+                    Console.WriteLine("Warning: " + e.Message);
+                }
+                while (machine.Status != "Unsafe")
+                {
+                    try
+                    {
+                        machine = rootPool.GetResourceByName(clientName);
+                    }
+                    catch (Exception e)
+                    {
+                        Console.WriteLine("Warning: " + e.Message);
+                    }
+                    System.Threading.Thread.Sleep(1000);
+                }
+
+                // Delete the client machine from datastore
+                Console.WriteLine("Deleting client machine from data store");
+                script.DeleteResource(machine.Id);
+                return 0;
+            }
+            catch (Exception e)
+            {
+                Console.WriteLine("Error: " + e.Message);
+                return 1;
+            }
+        }
+    }
+}

diff --git a/client/tests/kvm/deps/whql_delete_machine_15.exe b/client/tests/kvm/deps/whql_delete_machine_15.exe
new file mode 100644
index ..3817ac42d7748b87af3683e86e1137757ed6073a
GIT binary patch
literal 10240
[base85-encoded binary data omitted]
[KVM-AUTOTEST PATCH 4/9] [RFC] KVM test: DTM automation program for WHQL tests
This C# program should run on a DTM/WHQL server. It's used by the whql_submission test to schedule and monitor device submission jobs. Note: the binary is copied to the server at run time, so it doesn't need to be packaged in winutils.iso. Signed-off-by: Michael Goldish --- client/tests/kvm/deps/whql_submission_15.cs | 254 ++ client/tests/kvm/deps/whql_submission_15.exe | Bin 0 -> 10240 bytes 2 files changed, 254 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/deps/whql_submission_15.cs create mode 100644 client/tests/kvm/deps/whql_submission_15.exe diff --git a/client/tests/kvm/deps/whql_submission_15.cs b/client/tests/kvm/deps/whql_submission_15.cs new file mode 100644 index 000..540674a --- /dev/null +++ b/client/tests/kvm/deps/whql_submission_15.cs @@ -0,0 +1,254 @@ +// DTM submission automation program +// Author: Michael Goldish +// Based on sample code by Microsoft. + +// Note: this program has only been tested with DTM version 1.5. +// It might fail to work with other versions, specifically because it uses +// a few undocumented methods/attributes. 
+
+using System;
+using System.Collections.Generic;
+using System.Text.RegularExpressions;
+using Microsoft.DistributedAutomation.DeviceSelection;
+using Microsoft.DistributedAutomation.SqlDataStore;
+
+namespace automate0
+{
+    class AutoJob
+    {
+        static int Main(string[] args)
+        {
+            if (args.Length != 5)
+            {
+                Console.WriteLine("Error: incorrect number of command line arguments");
+                Console.WriteLine("Usage: {0} serverName clientName machinePoolName submissionName timeout",
+                    System.Environment.GetCommandLineArgs()[0]);
+                return 1;
+            }
+            string serverName = args[0];
+            string clientName = args[1];
+            string machinePoolName = args[2];
+            string submissionName = args[3];
+            double timeout = Convert.ToDouble(args[4]);
+
+            try
+            {
+                // Initialize DeviceScript and connect to data store
+                Console.WriteLine("Initializing DeviceScript object");
+                DeviceScript script = new DeviceScript();
+                Console.WriteLine("Connecting to data store");
+                script.ConnectToNamedDataStore(serverName);
+
+                // Find client machine
+                IResourcePool rootPool = script.GetResourcePoolByName("$");
+                Console.WriteLine("Looking for client machine '{0}'", clientName);
+                IResource machine = null;
+                while (true)
+                {
+                    try
+                    {
+                        machine = rootPool.GetResourceByName(clientName);
+                    }
+                    catch (Exception e)
+                    {
+                        Console.WriteLine("Warning: " + e.Message);
+                    }
+                    // Make sure the machine is valid
+                    if (machine != null &&
+                        machine.OperatingSystem != null &&
+                        machine.OperatingSystem.Length > 0 &&
+                        machine.ProcessorArchitecture != null &&
+                        machine.ProcessorArchitecture.Length > 0 &&
+                        machine.GetDevices().Length > 0)
+                        break;
+                    System.Threading.Thread.Sleep(1000);
+                }
+                Console.WriteLine("Client machine '{0}' found ({1}, {2})",
+                    clientName, machine.OperatingSystem, machine.ProcessorArchitecture);
+
+                // Create machine pool and add client machine to it
+                // (this must be done because jobs cannot be scheduled for machines in the
+                // default pool)
+                try
+                {
+                    script.CreateResourcePool(machinePoolName, rootPool.ResourcePoolId);
+                }
+                catch (Exception e)
+                {
+                    Console.WriteLine("Warning: " + e.Message);
+                }
+                IResourcePool newPool = script.GetResourcePoolByName(machinePoolName);
+                Console.WriteLine("Moving the client machine to pool '{0}'", machinePoolName);
+                machine.ChangeResourcePool(newPool);
+
+                // Reset client machine
+                if (machine.Status != "Ready")
+                {
+                    Console.WriteLine("Changing the client machine's status to 'Reset'");
+                    while (true)
+                    {
+                        try
+                        {
+                            machine.ChangeResourceStatus("Reset");
+                            break;
+                        }
+                        catch (Exception e)
+                        {
+                            Console.WriteLine("Warning: " + e.Message);
[KVM-AUTOTEST PATCH 3/9] KVM test: rss_file_transfer.py: add convenience functions upload() and download()
Signed-off-by: Michael Goldish
---
 client/tests/kvm/rss_file_transfer.py |   32 ++--
 1 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/client/tests/kvm/rss_file_transfer.py b/client/tests/kvm/rss_file_transfer.py
index c584397..3de8259 100755
--- a/client/tests/kvm/rss_file_transfer.py
+++ b/client/tests/kvm/rss_file_transfer.py
@@ -395,6 +395,30 @@ class FileDownloadClient(FileTransferClient):
             raise
 
 
+def upload(address, port, src_pattern, dst_path, timeout=60,
+           connect_timeout=10):
+    """
+    Connect to server and upload files.
+
+    @see: FileUploadClient
+    """
+    client = FileUploadClient(address, port, connect_timeout)
+    client.upload(src_pattern, dst_path, timeout)
+    client.close()
+
+
+def download(address, port, src_pattern, dst_path, timeout=60,
+             connect_timeout=10):
+    """
+    Connect to server and download files.
+
+    @see: FileDownloadClient
+    """
+    client = FileDownloadClient(address, port, connect_timeout)
+    client.download(src_pattern, dst_path, timeout)
+    client.close()
+
+
 def main():
     import optparse
@@ -418,13 +442,9 @@ def main():
     port = int(port)
 
     if options.download:
-        client = FileDownloadClient(address, port)
-        client.download(src_pattern, dst_path, timeout=options.timeout)
-        client.close()
+        download(address, port, src_pattern, dst_path, options.timeout)
     elif options.upload:
-        client = FileUploadClient(address, port)
-        client.upload(src_pattern, dst_path, timeout=options.timeout)
-        client.close()
+        upload(address, port, src_pattern, dst_path, options.timeout)
 
 if __name__ == "__main__":
-- 
1.5.4.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM-AUTOTEST PATCH 2/9] KVM test: allow definition of multiple cdroms
Instead of using 'cdrom_extra', define 'cdroms' similarly to 'images'. Also set -drive indices for cd1 and image1. Also fix the regular expression in tests.cfg.sample so it doesn't match 'cdroms' when it shouldn't.

Note: if you use your own tests.cfg, you should change the line

    cdrom.* ?<= ...

to

    cdrom(_.*)? ?<= ...

It won't hurt to do the same for image_name.

Signed-off-by: Michael Goldish
---
 client/tests/kvm/kvm_vm.py             |   19 +++
 client/tests/kvm/tests.cfg.sample      |    4 ++--
 client/tests/kvm/tests_base.cfg.sample |   10 --
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 248aeca..bdc9aab 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -414,18 +414,13 @@ class VM:
         if smp:
             qemu_cmd += add_smp(help, smp)
 
-        iso = params.get("cdrom")
-        if iso:
-            iso = kvm_utils.get_path(root_dir, iso)
-            qemu_cmd += add_cdrom(help, iso, 2)
-
-        # Even though this is not a really scalable approach,
-        # it doesn't seem like we are going to need more than
-        # 2 CDs active on the same VM.
-        iso_extra = params.get("cdrom_extra")
-        if iso_extra:
-            iso_extra = kvm_utils.get_path(root_dir, iso_extra)
-            qemu_cmd += add_cdrom(help, iso_extra, 3)
+        cdroms = kvm_utils.get_sub_dict_names(params, "cdroms")
+        for cdrom in cdroms:
+            cdrom_params = kvm_utils.get_sub_dict(params, cdrom)
+            iso = cdrom_params.get("cdrom")
+            if iso:
+                qemu_cmd += add_cdrom(help, kvm_utils.get_path(root_dir, iso),
+                                      cdrom_params.get("drive_index"))
 
         # We may want to add {floppy_otps} parameter for -fda
         # {fat:floppy:}/path/. However vvfat is not usually recommended.
diff --git a/client/tests/kvm/tests.cfg.sample b/client/tests/kvm/tests.cfg.sample
index 6d5f244..e01406e 100644
--- a/client/tests/kvm/tests.cfg.sample
+++ b/client/tests/kvm/tests.cfg.sample
@@ -13,8 +13,8 @@ include cdkeys.cfg
 # * All image files are expected under /tmp/kvm_autotest_root/images/
 # * All iso files are expected under /tmp/kvm_autotest_root/isos/
 qemu_img_binary = /usr/bin/qemu-img
-image_name.* ?<= /tmp/kvm_autotest_root/images/
-cdrom.* ?<= /tmp/kvm_autotest_root/isos/
+image_name(_.*)? ?<= /tmp/kvm_autotest_root/images/
+cdrom(_.*)? ?<= /tmp/kvm_autotest_root/isos/
 
 # Here are the test sets variants. The variant 'qemu_kvm_windows_quick' is
 # fully commented, the following ones have comments only on noteworthy points

diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample
index c2def29..d0b8acb 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -3,9 +3,10 @@
 # Define the objects we'll be using
 vms = vm1
 images = image1
+cdroms = cd1
 nics = nic1
 monitors = humanmonitor1
-login_timeout = 360
+
 # Choose the main VM and monitor
 main_vm = vm1
 main_monitor = humanmonitor1
@@ -33,8 +34,10 @@ qemu_img_binary = qemu-img
 smp = 1
 mem = 512
 image_size = 10G
+drive_index_image1 = 0
 shell_port = 22
 display = vnc
+drive_index_cd1 = 2
 
 # Monitor params
 monitor_type = human
@@ -56,6 +59,7 @@ run_tcpdump = yes
 
 # Misc
 profilers = kvm_stat
+login_timeout = 360
 
 # Tests
 
@@ -1044,7 +1048,9 @@ variants:
     unattended_install.cdrom:
         timeout = 7200
         finish_program = deps/finish.exe
-        cdrom_extra = windows/winutils.iso
+        cdroms += " extracd"
+        cdrom_extracd = windows/winutils.iso
+        drive_index_extracd = 3
     migrate:
         migration_test_command = ver && vol
         migration_bg_command = start ping -t localhost
-- 
1.5.4.1
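The parameter scheme in this patch follows the autotest convention that `cdroms = cd1 extracd` names objects whose per-object settings live in suffixed keys (`cdrom_cd1`, `drive_index_cd1`, ...). A minimal standalone sketch of how such parameters could resolve into `-drive` options; `get_sub_dict` and `get_sub_dict_names` here are simplified stand-ins for the kvm_utils helpers, not the real implementations:

```python
# Minimal sketch of per-cdrom parameter resolution. The helpers below mimic
# the kvm_utils naming convention but are simplified for illustration.

def get_sub_dict_names(params, key):
    # "cdroms = cd1 extracd" -> ["cd1", "extracd"]
    return params.get(key, "").split()

def get_sub_dict(params, name):
    # collect "cdrom_cd1", "drive_index_cd1", ... and strip the "_cd1" suffix
    suffix = "_" + name
    sub = dict(params)
    for key, value in params.items():
        if key.endswith(suffix):
            sub[key[:-len(suffix)]] = value
    return sub

def build_cdrom_options(params):
    cmd = ""
    for cdrom in get_sub_dict_names(params, "cdroms"):
        cdrom_params = get_sub_dict(params, cdrom)
        iso = cdrom_params.get("cdrom")
        if iso:
            cmd += " -drive file='%s',media=cdrom" % iso
            index = cdrom_params.get("drive_index")
            if index is not None:
                cmd += ",index=%s" % index
    return cmd

params = {
    "cdroms": "cd1 extracd",
    "cdrom_cd1": "linux/install.iso",
    "drive_index_cd1": "2",
    "cdrom_extracd": "windows/winutils.iso",
    "drive_index_extracd": "3",
}
print(build_cdrom_options(params))
```

With the sample parameters this yields one `-drive ...,media=cdrom,index=N` option per defined cdrom, which is the behavior the patch introduces in kvm_vm.py.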
[KVM-AUTOTEST PATCH 1/9] KVM test: kvm_vm.py: make -drive index optional for both images and cdrom ISOs
Signed-off-by: Michael Goldish
---
 client/tests/kvm/kvm_vm.py |   15 +--
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 6cd0688..248aeca 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -214,16 +214,18 @@ class VM:
         def add_smp(help, smp):
             return " -smp %s" % smp
 
-        def add_cdrom(help, filename, index=2):
+        def add_cdrom(help, filename, index=None):
             if has_option(help, "drive"):
-                return " -drive file='%s',index=%d,media=cdrom" % (filename,
-                                                                   index)
+                cmd = " -drive file='%s',media=cdrom" % filename
+                if index is not None: cmd += ",index=%s" % index
+                return cmd
             else:
                 return " -cdrom '%s'" % filename
 
-        def add_drive(help, filename, format=None, cache=None, werror=None,
-                      serial=None, snapshot=False, boot=False):
+        def add_drive(help, filename, index=None, format=None, cache=None,
+                      werror=None, serial=None, snapshot=False, boot=False):
             cmd = " -drive file='%s'" % filename
+            if index is not None: cmd += ",index=%s" % index
             if format: cmd += ",if=%s" % format
             if cache: cmd += ",cache=%s" % cache
             if werror: cmd += ",werror=%s" % werror
@@ -362,6 +364,7 @@ class VM:
                 continue
             qemu_cmd += add_drive(help,
                                   get_image_filename(image_params, root_dir),
+                                  image_params.get("drive_index"),
                                   image_params.get("drive_format"),
                                   image_params.get("drive_cache"),
                                   image_params.get("drive_werror"),
@@ -414,7 +417,7 @@ class VM:
         iso = params.get("cdrom")
         if iso:
             iso = kvm_utils.get_path(root_dir, iso)
-            qemu_cmd += add_cdrom(help, iso)
+            qemu_cmd += add_cdrom(help, iso, 2)
 
         # Even though this is not a really scalable approach,
         # it doesn't seem like we are going to need more than
-- 
1.5.4.1
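For reference, the new optional-index behavior can be rendered as a self-contained function. `has_option` is stubbed to always report `-drive` support; in kvm_vm.py it inspects the qemu help output:

```python
# Standalone sketch of the optional-index logic from the patch.
# has_option is stubbed out; the real one checks the qemu -help text.

def has_option(help, option):
    return True  # assume a qemu that supports -drive

def add_cdrom(help, filename, index=None):
    if has_option(help, "drive"):
        cmd = " -drive file='%s',media=cdrom" % filename
        if index is not None:
            cmd += ",index=%s" % index
        return cmd
    else:
        return " -cdrom '%s'" % filename

print(add_cdrom("", "cd.iso"))     # no index= suffix: qemu assigns one
print(add_cdrom("", "cd.iso", 2))  # explicit index, as used for cd1
```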
Cannot start new VMs when the host system is under load
Hi,

I've got a problem where I cannot start any new VMs with KVM if the host machine is under high CPU load. The problem is not 100% reproducible (it works sometimes), but under load conditions it happens most of the time - roughly 95%. I'm usually using libvirt to start and stop KVM VMs. When using virsh to start a new VM under those conditions, the output looks like this:

virsh # start testserver-a
error: Failed to start domain testserver-a
error: monitor socket did not show up.: Connection refused

(There is a very long wait after the command has been sent until the error message shows up.)

This is (an example of) the command line that libvirtd uses to start up qemu:

- snip -
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/root USER=root LOGNAME=root QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.12 -enable-kvm -m 256 -smp 1,sockets=1,cores=1,threads=1 -name testserver-a -uuid 7cbb3665-4d58-86b8-ce8f-20541995a99c -nodefaults -chardev socket,id=monitor,path=/usr/local/var/lib/libvirt/qemu/testserver-a.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -no-acpi -boot c -device lsi,id=scsi0,bus=pci.0,addr=0x7 -drive file=/data/testserver-a-system.img,if=none,id=drive-scsi0-0-1,boot=on -device scsi-disk,bus=scsi0.0,scsi-id=1,drive=drive-scsi0-0-1,id=scsi0-0-1 -drive file=/data/testserver-a-data1.img,if=none,id=drive-virtio-disk1 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk1,id=virtio-disk1 -drive file=/data/testserver-a-data2.img,if=none,id=drive-virtio-disk2 -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk2,id=virtio-disk2 -drive file=/data/gentoo-install-amd64-minimal-20100408.iso,if=none,media=cdrom,id=drive-ide0-0-0,readonly=on -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/data/testserver-a_configfloppy.img,if=none,id=drive-fdc0-0-0 -global isa-fdc.driveA=drive-fdc0-0-0 -device e1000,vlan=0,id=net0,mac=52:54:00:84:6d:69,bus=pci.0,addr=0x6 -net tap,fd=24,vlan=0,name=hostnet0 -usb -vnc 127.0.0.1:1,password -k de -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
- snip -

Copy-pasting this to a command line on the host to start qemu manually leads to a non-functional qemu process that "just sits there" with nothing happening. The monitor socket /usr/local/var/lib/libvirt/qemu/testserver-a.monitor will, indeed, not show up.

I've tried starting qemu with the same command line but without the parameters for redirecting the monitor to a socket, without the fd parameter for the network interface, and without the vnc parameter. This resulted in a black window with the title "QEMU (testserver-a) [Stopped]". I could not access the monitor console in graphical mode either.

Some experimentation I've done suggests that this problem only happens if the high CPU load is caused by another KVM process, not if it is caused by something else running on the machine. The host machine I'm running this on has got 16 cores in total. It looks like it is sufficient for this bug to surface if at least one of these cores is brought to near 100% use by a KVM process.

The version of qemu I'm using is qemu-kvm 0.12.4, built from source. Libvirt is version 0.8.1, built from source as well. The host OS is Fedora 12. The kernel version is 2.6.32.12-115.fc12.x86_64.

Does anybody have any idea what might be causing this and what might be done against it?

Regards,
Guido
Re: [PATCH] netfilter: add CHECKSUM target
Am 11.07.2010 12:47, schrieb Michael S. Tsirkin:
> On Fri, Jul 09, 2010 at 05:17:36PM +0200, Patrick McHardy wrote:
>> Am 09.07.2010 00:29, schrieb Michael S. Tsirkin:
>>> This adds a `CHECKSUM' target, which can be used in the iptables mangle
>>> table.
>>>
>>> You can use this target to compute and fill in the checksum in
>>> an IP packet that lacks a checksum. This is particularly useful
>>> if you need to work around old applications such as dhcp clients
>>> that do not work well with checksum offloads, but don't want to
>>> disable checksum offload in your device.
>>>
>>> The problem happens in the field with virtualized applications.
>>> For reference, see Red Hat bz 60, as well as
>>> http://www.spinics.net/lists/kvm/msg37660.html
>>>
>>> Typical expected use (helps old dhclient binary running in a VM):
>>> iptables -A POSTROUTING -t mangle -p udp --dport 68 -j CHECKSUM
>>>   --checksum-fill
>>
>> I'm not sure this is something we want to merge upstream and
>> support indefinitely. Dave suggested this as a temporary
>> out-of-tree workaround until the majority of guest dhcp clients
>> are fixed. Has anything changed that makes this course of
>> action impractical?
>
> If I understand what Dave said correctly, it's up to you ...
>
> The arguments for putting this upstream are:
>
> Given the track record, I wouldn't hope for a quick fix in the majority of
> guest dhcp clients, unfortunately :(. We are talking years here.
> Even after that, one of the uses of virtualization is
> to keep old guests running. So yes, I think we'll
> keep using work-arounds for this for a very long time.
>
> Further, since we have to add the module and we have to teach management
> to program it, it will be much less painful for everyone
> involved if we can put the code upstream, rather than forking
> management code.

Fair enough, it's simple enough that I don't expect much maintenance overhead.
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 11:49:20AM +0100, Paul Brook wrote:
> > An IOMMU is not necessarily part of a bus bridge. By concept an IOMMU
> > can also be implemented on a plugin-card translating only that card.
> > Real implementations that I am aware of always implement the IOMMU in
> > the PCI root bridge, though.
>
> If the IOMMU is implemented on the card, then it isn't an interesting case.
> It's effectively just a complex form of scatter-gather.
>
> If the on-card MMU can delegate pagetable walks to an external device then IMO
> that's also an unrelated feature, and requires an additional data channel.

But that would be handled by the same IOMMU emulation code, so the hooks need to be usable there too.

> My point still stands. You should not be pushing the IOMMU handling into
> device specific code. All you need to do is make the memory access routines
> aware of which device caused the access.

Right, the device does not need to know too much about the IOMMU in the general case. The iommu_get/iommu_read/iommu_write interface should be replaced by the pci_memory* functions, as suggested by Anthony.

> The fact that the GART can translate CPU accesses proves my point. If you
> have separate code for CPU and devices, then you need to duplicate the GART
> handling code.

You can configure the GART to translate device accesses only, cpu accesses only, or to translate both. This is hard to handle if cpu and device emulation use the same memory access functions.

Joerg
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote:
> On 07/15/2010 04:10 AM, Joerg Roedel wrote:
>> If this means a separate interface for device dma accesses and not fold
>> that functionality into the cpu_physical_memory* interface I agree too :-)
>
> No. PCI devices should never call cpu_physical_memory*.

Fully agreed.

> PCI devices should call pci_memory*.
>
> ISA devices should call isa_memory*.

This is a separate interface. I like the idea, and as you stated below it has clear advantages, so let's go this way.

> All device memory accesses should go through their respective buses.
> There can be multiple IOMMUs at different levels of the device
> hierarchy. If you don't provide bus-level memory access functions that
> chain through the hierarchy, it's extremely difficult to implement all
> the necessary hooks to perform the translations at different places.

Joerg
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> >>> Depending on how we decide to handle IOMMU invalidation, it may also
> >>> be necessary to augment the memory_map API to allow the system to
> >>> request a mapping be revoked. However this issue is not specific to
> >>> the IOMMU implementation. Such bugs are already present on any system
> >>> that allows dynamic reconfiguration of the address space, e.g. by
> >>> changing PCI BARs.
> >>
> >> That's why the memory_map API today does not allow mappings to persist
> >> after trips back to the main loop.
> >
> > Sure it does. If you can't combine zero-copy memory access with
> > asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c
>
> DMA's a very special case.

Special compared to what? The whole purpose of this API is to provide DMA.

> DMA is performed asynchronously to the
> execution of the CPU so you generally can't make any guarantees about
> what state the transaction is in until it's completed. That gives us a
> fair bit of wiggle room when dealing with a DMA operation to a region of
> physical memory where the physical memory mapping is altered in some way
> during the transaction.

You do have ordering constraints though. While it may not be possible to directly determine whether the DMA completed before or after the remapping, and you might not be able to make any assumptions about the atomicity of the transaction as a whole, it is reasonable to assume that any writes to the old mapping will occur before the remapping operation completes. While things like store buffers potentially allow reordering and deferral of accesses, there are generally fairly tight constraints on this. For example a PCI host bridge may buffer CPU writes. However it will guarantee that those writes have been flushed out before a subsequent read operation completes.

Consider the case where the hypervisor allows passthrough of a device, using the IOMMU to support DMA from that device into virtual machine RAM. When that virtual machine is destroyed, the IOMMU mapping for that device will be invalidated. Once the invalidation has completed, that RAM can be reused by the hypervisor for other purposes. This may happen before the device is reset. We probably don't really care what happens to the device in this case, but we do need to prevent the device stomping on ram it no longer owns.

There are two ways this can be handled: If your address translation mechanism allows updates to be deferred indefinitely, then we can stall until all relevant DMA transactions have completed. This is probably sufficient for well behaved guests, but potentially opens up a significant window for DoS attacks. If you need the remapping to occur in a finite timeframe (in the PCI BAR case this is probably before the next CPU access to that bus), then you need some mechanism for revoking the host mapping provided by cpu_physical_memory_map.

Note that a QEMU DMA transaction typically encompasses a whole block of data. The transaction is started when the AIO request is issued, and remains live until the transfer completes. This includes the time taken to fetch the data from external media/devices. On real hardware a DMA transaction typically only covers a single burst memory write (maybe 16 bytes). This will generally not start until the device has buffered sufficient data to satisfy the burst (or has sufficient buffer space to receive the whole burst).

Paul
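The second option described here — revoking a host mapping within a finite timeframe — could be modeled as a mapping object carrying a revoke callback. This is a conceptual Python sketch of the idea only, not QEMU's actual memory_map API:

```python
# Conceptual sketch: a guest-physical region whose zero-copy mappings can
# be revoked, e.g. when an IOMMU entry is invalidated or a BAR moves.
# All names here are illustrative, not QEMU code.

class Mapping:
    def __init__(self, region, buf, on_revoke):
        self.region = region
        self.buf = buf              # zero-copy view handed to the DMA client
        self.on_revoke = on_revoke  # notifies the client to abandon the view
        self.valid = True

class Region:
    def __init__(self, size):
        self.mem = bytearray(size)
        self.mappings = []

    def map(self, on_revoke):
        m = Mapping(self, self.mem, on_revoke)
        self.mappings.append(m)
        return m

    def revoke_all(self):
        # triggered by an invalidation: every outstanding mapping is
        # cancelled before the backing memory may be reused
        for m in self.mappings:
            m.valid = False
            m.on_revoke(m)
        self.mappings.clear()

cancelled = []
region = Region(4096)
mapping = region.map(on_revoke=cancelled.append)
region.revoke_all()
print(mapping.valid, len(cancelled))  # False 1
```

The callback gives the mapping owner (e.g. an in-flight AIO request) a chance to cancel or redirect its transfer, which is what distinguishes this from the stall-until-complete approach.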
Re: Freezing Windows 2008 x64bit guest
On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote:
> Hi,
>
> we are running several kvm servers with kvm 0.12.4 and kernel 2.6.32
> (actually we run proxmox ve)
>
> These proxmox servers are running all kinds of vms rock-solid, be it
> linux or windows 32 platforms.
>
> But one Windows 2008 64 Bit Server Standard is freezing regularly.
> This happens sometimes 3 times a day, sometimes it takes 2 days
> until freeze. The Windows Machine is a clean fresh install.
>
> By Freezing I mean that the Machine console still shows the
> windows logo but does not accept any keystroke and does not answer
> to ping or rdp or whatever. The virtual cpu-load is then shown
> as 100% (by proxmox).
>
> We tried several combinations of ide or virtio, several virtual network cards
> (intel and realtek) and single-cpu vs multi cpu. The Problem persists.
>
> The Windows Machine is a quite fresh install but runs exchange and 10 rdp
> sessions.
>
> We are experienced linux-admins but we dont find any hints on the kvm servers
> what might be the problem, no syslogs, dmesgs, nothing. Also there are no
> dumps on the windows guest.
>
> Do you have any hints what might be the problem and what we could do to debug
> the problem and to get more information of the root cause ?
>
When the hang occurs, ensure that the problematic vm is the only one on the server and run kvm_stat. Send the output here. Also do the following:

# mount -t debugfs debugfs /sys/kernel/debug
# echo kvm > /sys/kernel/debug/tracing/set_event
# sleep 1
# cat /sys/kernel/debug/tracing/trace > /tmp/trace

Send /tmp/trace here too, but it may be huge, so send only the last 1000 lines.

--
			Gleb.
Freezing Windows 2008 x64bit guest
Hi,

we are running several kvm servers with kvm 0.12.4 and kernel 2.6.32 (actually we run proxmox ve).

These proxmox servers are running all kinds of vms rock-solid, be it linux or windows 32 platforms. But one Windows 2008 64 Bit Server Standard is freezing regularly. This happens sometimes 3 times a day, sometimes it takes 2 days until it freezes. The Windows machine is a clean fresh install.

By freezing I mean that the machine console still shows the windows logo but does not accept any keystroke and does not answer to ping or rdp or whatever. The virtual cpu load is then shown as 100% (by proxmox).

We tried several combinations of ide or virtio, several virtual network cards (intel and realtek) and single-cpu vs multi-cpu. The problem persists.

The Windows machine is a quite fresh install but runs exchange and 10 rdp sessions.

We are experienced linux admins but we don't find any hints on the kvm servers as to what might be the problem - no syslogs, dmesgs, nothing. Also there are no dumps on the windows guest.

Do you have any hints what might be the problem and what we could do to debug the problem and to get more information about the root cause?

Thanks
Christoph
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 04:10 AM, Joerg Roedel wrote:
> On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote:
>> On 07/14/2010 03:13 PM, Paul Brook wrote:
>>> Well, ok, the function name needs fixing too. However I think the only
>>> thing missing from the current API is that it does not provide a way to
>>> determine which device is performing the access.
>>
>> I agree with Paul. The right approach IMHO is to convert devices to use
>> bus-specific functions to access memory. The bus-specific functions should
>> have a device argument as the first parameter.
>
> If this means a separate interface for device dma accesses and not fold
> that functionality into the cpu_physical_memory* interface I agree too :-)

No. PCI devices should never call cpu_physical_memory*.

PCI devices should call pci_memory*.

ISA devices should call isa_memory*.

All device memory accesses should go through their respective buses. There can be multiple IOMMUs at different levels of the device hierarchy. If you don't provide bus-level memory access functions that chain through the hierarchy, it's extremely difficult to implement all the necessary hooks to perform the translations at different places.

Regards,

Anthony Liguori
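The chaining idea — device accesses entering at the device's own bus, with each bus level free to apply an IOMMU translation before the access reaches system memory — can be illustrated with a toy model. All names here are illustrative, not the proposed QEMU API:

```python
# Toy model of bus-level memory access chaining through nested IOMMUs.
# Each Bus may carry a translate hook; an access walks up the hierarchy,
# translated at every level, until it reaches system memory at the root.

RAM = bytearray(64)

class Bus:
    def __init__(self, parent=None, translate=None):
        self.parent = parent
        self.translate = translate   # per-bus IOMMU hook; None = identity

    def read(self, dev, addr):
        if self.translate:
            addr = self.translate(dev, addr)
        if self.parent:
            return self.parent.read(dev, addr)
        return RAM[addr]             # root of the hierarchy: system memory

def pci_memory_read(bus, dev, addr):
    # what a device would call instead of cpu_physical_memory_read;
    # the device identity is available to every translation hook
    return bus.read(dev, addr)

system_bus = Bus()
# root PCI bridge whose IOMMU offsets this device's view by 16
pci_bus = Bus(parent=system_bus, translate=lambda dev, a: a + 16)
# a bridge below it with its own remapping
sub_bus = Bus(parent=pci_bus, translate=lambda dev, a: a + 4)

RAM[20] = 0xAB
print(hex(pci_memory_read(sub_bus, "nic0", 0)))  # 0 -> +4 -> +16 -> RAM[20]
```

Because the device argument is passed down the chain, any level can translate per-device, which is the property Paul and Anthony both want; the device itself never sees the translation.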
Re: [PATCH] vhost-net: avoid flush under lock
On Thu, Jul 15, 2010 at 03:19:12PM +0300, Michael S. Tsirkin wrote:
> We flush under vq mutex when changing backends.
> This creates a deadlock as the workqueue being flushed
> needs this lock as well.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=612421
>
> Drop the vq mutex before flush: we have the device mutex
> which is sufficient to prevent another ioctl from touching
> the vq.
>
> Signed-off-by: Michael S. Tsirkin

Dave, just to clarify: I'll send a pull request to merge this through my tree; there's no need for you to bother with it.

> ---
>  drivers/vhost/net.c |    5 +
>  1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 28d7786..50df58e6 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  	rcu_assign_pointer(vq->private_data, sock);
>  	vhost_net_enable_vq(n, vq);
>  done:
> +	mutex_unlock(&vq->mutex);
> +
>  	if (oldsock) {
>  		vhost_net_flush_vq(n, index);
>  		fput(oldsock->file);
>  	}
>
> +	mutex_unlock(&n->dev.mutex);
> +	return 0;
> +
>  err_vq:
>  	mutex_unlock(&vq->mutex);
>  err:
> --
> 1.7.2.rc0.14.g41c1c
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 05:33 AM, Paul Brook wrote:
>>> Depending on how we decide to handle IOMMU invalidation, it may also
>>> be necessary to augment the memory_map API to allow the system to
>>> request a mapping be revoked. However this issue is not specific to
>>> the IOMMU implementation. Such bugs are already present on any system
>>> that allows dynamic reconfiguration of the address space, e.g. by
>>> changing PCI BARs.
>>
>> That's why the memory_map API today does not allow mappings to persist
>> after trips back to the main loop.
>
> Sure it does. If you can't combine zero-copy memory access with
> asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c

DMA's a very special case. DMA is performed asynchronously to the execution of the CPU, so you generally can't make any guarantees about what state the transaction is in until it's completed. That gives us a fair bit of wiggle room when dealing with a DMA operation to a region of physical memory where the physical memory mapping is altered in some way during the transaction.

However, that is not true in the general case.

Regards,

Anthony Liguori
[PATCH] vhost-net: avoid flush under lock
We flush under vq mutex when changing backends. This creates a deadlock as the workqueue being flushed needs this lock as well.

https://bugzilla.redhat.com/show_bug.cgi?id=612421

Drop the vq mutex before flush: we have the device mutex which is sufficient to prevent another ioctl from touching the vq.

Signed-off-by: Michael S. Tsirkin
---
 drivers/vhost/net.c |    5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 28d7786..50df58e6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	rcu_assign_pointer(vq->private_data, sock);
 	vhost_net_enable_vq(n, vq);
 done:
+	mutex_unlock(&vq->mutex);
+
 	if (oldsock) {
 		vhost_net_flush_vq(n, index);
 		fput(oldsock->file);
 	}
 
+	mutex_unlock(&n->dev.mutex);
+	return 0;
+
 err_vq:
 	mutex_unlock(&vq->mutex);
 err:
-- 
1.7.2.rc0.14.g41c1c
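The locking change can be illustrated outside the kernel with a small threading analogue (a hypothetical Python sketch, not the vhost code): the flush must happen after the vq mutex is dropped, because the pending work item itself takes that mutex.

```python
import threading

class Vq:
    def __init__(self, backend):
        self.mutex = threading.Lock()
        self.backend = backend

def work_item(vq, log):
    # simulated vhost work: it needs the vq mutex to make progress, which
    # is why flushing while still holding that mutex would deadlock
    with vq.mutex:
        log.append("flushed")

def set_backend(dev_mutex, vq, new_backend, flush):
    with dev_mutex:                  # device mutex keeps other ioctls away
        with vq.mutex:
            old, vq.backend = vq.backend, new_backend
        # vq.mutex is released here, BEFORE the flush, mirroring the patch
        if old is not None:
            flush()                  # wait for work queued on the old backend
        return old

log = []
vq = Vq("oldsock")
pending = threading.Thread(target=work_item, args=(vq, log))
pending.start()
old = set_backend(threading.Lock(), vq, "newsock", flush=pending.join)
print(old, vq.backend, log)  # oldsock newsock ['flushed']
```

Holding the device mutex across the flush is safe here because the worker only needs the vq mutex, which matches the patch's argument that the device mutex alone suffices to keep other ioctls away from the vq.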
Re: [RFC PATCH 4/7] ide: IOMMU support
> On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote:
> > > Memory accesses must go through the IOMMU layer.
> >
> > No. Devices should not know or care whether an IOMMU is present.
>
> They don't really care. iommu_get() et al. are convenience functions
> which can and do return NULL when there's no IOMMU and device code can
> pass that NULL around without checking.

Devices should not need to know any of this. You're introducing a significant amount of duplication and complexity into every device.

The assumption that all accesses will go through the same IOMMU is also false. Accesses to devices on the same bus will not be translated by the IOMMU. Currently there are probably also other things that will break in this case, but your API seems fundamentally incapable of handling this.

> I could've probably made the r/w
> functions take only the DeviceState in addition to normal args, but
> wanted to avoid looking up the related structures on each I/O operation.

That's exactly what you should be doing. If this is inefficient then there are much better ways of fixing it, e.g. by not having the device perform so many accesses, or by adding some sort of translation cache.

> > You should be adding a DeviceState argument to
> > cpu_physical_memory_{rw,map}. This should then handle IOMMU translation
> > transparently.
> >
> > You also need to accommodate the case where multiple IOMMUs are
> > present.
>
> We don't assume there's a single IOMMU in the generic layer. The
> callbacks within 'struct iommu' could very well dispatch the request to
> one of multiple, coexisting IOMMUs.

So you've now introduced yet another copy of the translation code. Not only does every device have to be IOMMU aware, but every IOMMU also has to be aware of nested IOMMUs.

Paul
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:
> > A device performs a memory access on its local bus. It has no
> > knowledge of how that access is routed to its destination. The device
> > should not be aware of any IOMMUs, in the same way that it doesn't
> > know whether it happens to be accessing RAM or memory mapped
> > peripherals on another device.
>
> Right.
>
> > Each IOMMU is fundamentally part of a bus bridge. For example the
> > bridge between a PCI bus and the system bus. It provides an address
> > mapping from one bus to another.
>
> An IOMMU is not necessarily part of a bus bridge. In concept an IOMMU
> can also be implemented on a plugin card, translating only that card.
> Real implementations that I am aware of always implement the IOMMU in
> the PCI root bridge, though.

If the IOMMU is implemented on the card, then it isn't an interesting
case. It's effectively just a complex form of scatter-gather. If the
on-card MMU can delegate pagetable walks to an external device then IMO
that's also an unrelated feature, and requires an additional data
channel.

> > There should be no direct interaction between an IOMMU and a device
> > (ignoring ATS, which is effectively a separate data channel).
> > Everything should be done via the cpu_physical_memory_* code.
> > Likewise on a system with multiple nested IOMMUs there should be no
> > direct interaction between these. cpu_physical_memory_* should walk
> > the device/bus tree to determine where the access terminates,
> > applying mappings appropriately.
>
> That's the point where I disagree. I think there should be a separate
> set of functions, independent from cpu_physical_memory_*, to handle
> device memory accesses. This would keep the changes small and
> non-intrusive. Besides that, real memory controllers can also handle
> CPU memory accesses differently from device memory accesses. The AMD
> northbridge GART uses this to decide whether it needs to remap a
> request or not.
> The GART can be configured to translate CPU and device accesses
> separately.

My point still stands. You should not be pushing the IOMMU handling into
device-specific code. All you need to do is make the memory access
routines aware of which device caused the access.

The fact that the GART can translate CPU accesses proves my point. If
you have separate code for CPU and devices, then you need to duplicate
the GART handling code.

Paul
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> > Depending how we decide to handle IOMMU invalidation, it may also be
> > necessary to augment the memory_map API to allow the system to
> > request a mapping be revoked. However this issue is not specific to
> > the IOMMU implementation. Such bugs are already present on any system
> > that allows dynamic reconfiguration of the address space, e.g. by
> > changing PCI BARs.
>
> That's why the memory_map API today does not allow mappings to persist
> after trips back to the main loop.

Sure it does. If you can't combine zero-copy memory access with
asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c

Paul
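[Editor's note] One minimal scheme for the revocation problem raised above is a generation counter: a mapping records the layout generation at map time, and asynchronous completions re-check it before touching guest memory. This is not how QEMU implements it — just an illustrative sketch with hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

static uint8_t ram[4096];
static unsigned map_generation;  /* bumped whenever the layout changes */

typedef struct DmaMapping {
    uint8_t *ptr;
    unsigned generation;         /* layout generation at map time */
} DmaMapping;

static DmaMapping memory_map(uint64_t addr)
{
    DmaMapping m = { &ram[addr], map_generation };
    return m;
}

/* An AIO completion must re-check before touching the mapping. */
static bool mapping_still_valid(const DmaMapping *m)
{
    return m->generation == map_generation;
}

/* E.g. the guest rewrote a PCI BAR: all outstanding mappings are
 * implicitly revoked. */
static void address_space_changed(void)
{
    map_generation++;
}
```

A real implementation would also need to notify holders so they can cancel or bounce in-flight requests, but the check alone already prevents stale writes.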
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> > The right approach IMHO is to convert devices to use bus-specific
> > functions to access memory. The bus specific functions should have
> > a device argument as the first parameter.
>
> As for ATS, the internal api to handle the device's dma request needs
> a notion of a translated vs. an untranslated request. IOW, if qemu ever
> had a device with ATS support, the device would use its local cache to
> translate the dma address and then submit a translated request to the
> pci bus (effectively doing a raw cpu physical memory* in that case).

Really? Can you provide any documentation to support this claim? My
impression is that there is no difference between translated and
untranslated devices, and the translation is explicitly disabled by
software.

Paul
Re: several messages
On Thursday 2010-07-15 11:36, Patrick McHardy wrote:
>> +struct xt_CHECKSUM_info {
>> +	u_int8_t operation;	/* bitset of operations */
>
> Please use __u8 in public header files.
>
>> +#include
>
> Do you really need in.h?
>
>> + * $Id$
>
> Please no CVS IDs.
>
>> +	switch (c) {
>> +	case 'F':
>> +		if (*flags)
>> +			xtables_error(PARAMETER_PROBLEM,
>> +				"CHECKSUM target: Only use --checksum-fill ONCE!");
>
> There is a helper function called xtables_param_act for checking double
> arguments and printing a standardized error message.

I took care of these for Xt-a.
Re: [PATCHv3] extensions: libxt_CHECKSUM extension
Am 11.07.2010 17:14, schrieb Michael S. Tsirkin:
> diff --git a/extensions/libxt_CHECKSUM.c b/extensions/libxt_CHECKSUM.c
> new file mode 100644
> index 000..00fbd8f
> --- /dev/null
> +++ b/extensions/libxt_CHECKSUM.c
> @@ -0,0 +1,99 @@
> +/* Shared library add-on to xtables for CHECKSUM
> + *
> + * (C) 2002 by Harald Welte
> + * (C) 2010 by Red Hat, Inc
> + * Author: Michael S. Tsirkin
> + *
> + * This program is distributed under the terms of GNU GPL v2, 1991
> + *
> + * libxt_CHECKSUM.c borrowed some bits from libipt_ECN.c
> + *
> + * $Id$

Please no CVS IDs.

> + */
> +#include
> +#include
> +#include
> +#include
> +
> +#include
> +#include
> +
> +static void CHECKSUM_help(void)
> +{
> +	printf(
> +"CHECKSUM target options\n"
> +"  --checksum-fill		Fill in packet checksum.\n");
> +}
> +
> +static const struct option CHECKSUM_opts[] = {
> +	{ "checksum-fill", 0, NULL, 'F' },
> +	{ .name = NULL }
> +};
> +
> +static int CHECKSUM_parse(int c, char **argv, int invert, unsigned int *flags,
> +			  const void *entry, struct xt_entry_target **target)
> +{
> +	struct xt_CHECKSUM_info *einfo
> +		= (struct xt_CHECKSUM_info *)(*target)->data;
> +
> +	switch (c) {
> +	case 'F':
> +		if (*flags)
> +			xtables_error(PARAMETER_PROBLEM,
> +				"CHECKSUM target: Only use --checksum-fill ONCE!");

There is a helper function called xtables_param_act for checking double
arguments and printing a standardized error message.

> +		einfo->operation = XT_CHECKSUM_OP_FILL;
> +		*flags |= XT_CHECKSUM_OP_FILL;
> +		break;
> +	default:
> +		return 0;
> +	}
> +
> +	return 1;
> +}
> +
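[Editor's note] For illustration, here is roughly what the parse function looks like with the suggested helper in place of the hand-rolled double-argument check. In a real extension the helper is libxtables' `xtables_param_act()` (with `XTF_ONLY_ONCE` semantics); it is stubbed here, along with the surrounding types, so the sketch builds stand-alone:

```c
#include <stdio.h>

#define XT_CHECKSUM_OP_FILL 0x01

static int parse_error;

/* Stub of libxtables' xtables_param_act(XTF_ONLY_ONCE, ...): print a
 * standardized message and fail if the option was already seen. */
static void param_act_only_once(const char *ext, const char *param,
                                unsigned int already_seen)
{
    if (already_seen) {
        fprintf(stderr, "%s: %s may only be specified once\n", ext, param);
        parse_error = 1;
    }
}

/* Simplified CHECKSUM_parse: argv/invert/target dropped for brevity. */
static int checksum_parse(int c, unsigned int *flags,
                          unsigned char *operation)
{
    switch (c) {
    case 'F':
        param_act_only_once("CHECKSUM", "--checksum-fill",
                            *flags & XT_CHECKSUM_OP_FILL);
        if (parse_error)
            return 0;
        *operation = XT_CHECKSUM_OP_FILL;
        *flags |= XT_CHECKSUM_OP_FILL;
        return 1;
    default:
        return 0;
    }
}
```

The benefit is a uniform error message across extensions rather than each one hand-rolling its own "Only use ... ONCE!" string.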
Re: [PATCHv2] netfilter: add CHECKSUM target
Am 11.07.2010 17:06, schrieb Michael S. Tsirkin:
> +#ifndef _IPT_CHECKSUM_TARGET_H
> +#define _IPT_CHECKSUM_TARGET_H
> +
> +#define XT_CHECKSUM_OP_FILL	0x01	/* fill in checksum in IP header */
> +
> +struct xt_CHECKSUM_info {
> +	u_int8_t operation;	/* bitset of operations */

Please use __u8 in public header files.

> +};
> +
> +#endif /* _IPT_CHECKSUM_TARGET_H */
> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
> index 8593a77..1cf4852 100644
> --- a/net/netfilter/Kconfig
> +++ b/net/netfilter/Kconfig
> @@ -294,7 +294,7 @@ endif # NF_CONNTRACK
>  config NETFILTER_TPROXY
>  	tristate "Transparent proxying support (EXPERIMENTAL)"
>  	depends on EXPERIMENTAL
> -	depends on IP_NF_MANGLE
> +	depends on IP_NF_MANGLE || IP6_NF_MANGLE

This does not seem to belong in this patch.

>  	depends on NETFILTER_ADVANCED
>  	help
>  	  This option enables transparent proxying support, that is,
> @@ -347,6 +347,21 @@ config NETFILTER_XT_CONNMARK
>
>  comment "Xtables targets"
>
> +config NETFILTER_XT_TARGET_CHECKSUM
> +	tristate "CHECKSUM target support"
> +	depends on NETFILTER_ADVANCED
> +	---help---
> +	  This option adds a `CHECKSUM' target, which can be used in the
> +	  iptables mangle table.

You should add a dependency on the mangle table then.

> +
> +	  You can use this target to compute and fill in the checksum in
> +	  a packet that lacks a checksum. This is particularly useful,
> +	  if you need to work around old applications such as dhcp clients,
> +	  that do not work well with checksum offloads, but don't want to
> +	  disable checksum offload in your device.
> +
> +	  To compile it as a module, choose M here. If unsure, say N.
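[Editor's note] The `__u8` request is about ABI: this struct crosses the kernel/user boundary, so public (UAPI) headers use the kernel's fixed-width types from `<linux/types.h>` rather than libc's `u_int8_t`. A sketch of the corrected header fragment, with a stand-in typedef so it builds outside a kernel tree:

```c
#include <stdint.h>

typedef uint8_t __u8;            /* normally supplied by <linux/types.h> */

#define XT_CHECKSUM_OP_FILL 0x01 /* fill in checksum in IP header */

struct xt_CHECKSUM_info {
    __u8 operation;              /* bitset of operations */
};
```

The struct stays exactly one byte on every platform, which is what the shared kernel/userspace `targetsize` depends on.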
> +
>  config NETFILTER_XT_TARGET_CLASSIFY
>  	tristate '"CLASSIFY" target support'
>  	depends on NETFILTER_ADVANCED
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index 14e3a8f..8eb541d 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -45,6 +45,7 @@ obj-$(CONFIG_NETFILTER_XT_MARK) += xt_mark.o
>  obj-$(CONFIG_NETFILTER_XT_CONNMARK) += xt_connmark.o
>
>  # targets
> +obj-$(CONFIG_NETFILTER_XT_TARGET_CHECKSUM) += xt_CHECKSUM.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_CLASSIFY) += xt_CLASSIFY.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_CONNSECMARK) += xt_CONNSECMARK.o
>  obj-$(CONFIG_NETFILTER_XT_TARGET_CT) += xt_CT.o
> diff --git a/net/netfilter/xt_CHECKSUM.c b/net/netfilter/xt_CHECKSUM.c
> new file mode 100644
> index 000..0fee1a7
> --- /dev/null
> +++ b/net/netfilter/xt_CHECKSUM.c
> @@ -0,0 +1,72 @@
> +/* iptables module for the packet checksum mangling
> + *
> + * (C) 2002 by Harald Welte
> + * (C) 2010 Red Hat, Inc.
> + *
> + * Author: Michael S. Tsirkin
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> +*/
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +#include

Do you really need in.h?

> +#include
> +#include
> +#include
> +
> +#include
> +#include
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Michael S. Tsirkin ");
> +MODULE_DESCRIPTION("Xtables: checksum modification");
> +MODULE_ALIAS("ipt_CHECKSUM");
> +MODULE_ALIAS("ip6t_CHECKSUM");
> +
> +static unsigned int
> +checksum_tg(struct sk_buff *skb, const struct xt_action_param *par)
> +{
> +	if (skb->ip_summed == CHECKSUM_PARTIAL)
> +		skb_checksum_help(skb);
> +
> +	return XT_CONTINUE;
> +}
> +
> +static int checksum_tg_check(const struct xt_tgchk_param *par)
> +{
> +	const struct xt_CHECKSUM_info *einfo = par->targinfo;
> +
> +	if (einfo->operation & ~XT_CHECKSUM_OP_FILL) {
> +		pr_info("unsupported CHECKSUM operation %x\n", einfo->operation);
> +		return -EINVAL;
> +	}
> +	if (!einfo->operation) {
> +		pr_info("no CHECKSUM operation enabled\n");
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static struct xt_target checksum_tg_reg __read_mostly = {
> +	.name		= "CHECKSUM",
> +	.family		= NFPROTO_UNSPEC,
> +	.target		= checksum_tg,
> +	.targetsize	= sizeof(struct xt_CHECKSUM_info),
> +	.table		= "mangle",
> +	.checkentry	= checksum_tg_check,
> +	.me		= THIS_MODULE,
> +};
> +
> +static int __init checksum_tg_init(void)
> +{
> +	return xt_register_target(&checksum_tg_reg);
> +}
> +
> +static void __exit checksum_tg_exit(void)
> +{
> +	xt_unregister_target(&checksum_tg_reg);
> +}
> +
> +module_init(checksum_tg_init);
> +module_exit(checksum_tg_exit);
[PATCH v2] KVM: x86: Call mask notifiers from pic.
If the PIT delivers an interrupt while the PIC is masking it, the OS will
never do EOI and the ack notifier will not be called, so when the PIT is
unmasked no more PIT interrupts will be delivered. Calling the mask
notifiers solves this issue.

Signed-off-by: Gleb Natapov

---
Changelog:
 v1->v2
  Properly handle slave pic.

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 819b748..8d10c06 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -363,10 +363,20 @@ static void pic_ioport_write(void *opaque, u32 addr, u32 val)
 		}
 	} else
 		switch (s->init_state) {
-		case 0:		/* normal mode */
+		case 0: {	/* normal mode */
+			u8 imr_diff = s->imr ^ val,
+				off = (s == &s->pics_state->pics[0]) ? 0 : 8;
 			s->imr = val;
+			for (irq = 0; irq < PIC_NUM_PINS/2; irq++)
+				if (imr_diff & (1 << irq))
+					kvm_fire_mask_notifiers(
+						s->pics_state->kvm,
+						SELECT_PIC(irq + off),
+						irq + off,
+						!!(s->imr & (1 << irq)));
 			pic_update_irq(s->pics_state);
 			break;
+		}
 		case 1:
 			s->irq_base = val & 0xf8;
 			s->init_state = 2;

--
	Gleb.
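[Editor's note] The core of the hunk is easier to see in isolation: on an IMR write, XOR old and new masks and notify only the pins that actually changed. A stand-alone model with `kvm_fire_mask_notifiers()` stubbed out (names here are illustrative, not the kernel's):

```c
#include <stdint.h>

/* Stub recording what kvm_fire_mask_notifiers() would be told. */
static int fired[16];
static int masked_state[16];

static void fire_mask_notifier(int gsi, int masked)
{
    fired[gsi]++;
    masked_state[gsi] = masked;
}

/* On an IMR write, notify only the pins whose mask bit changed.
 * `off` is 0 for the master PIC, 8 for the slave — the v2
 * "properly handle slave pic" fix. */
static void pic_write_imr(uint8_t *imr, uint8_t val, int off)
{
    uint8_t imr_diff = *imr ^ val;
    int irq;

    *imr = val;
    for (irq = 0; irq < 8; irq++)
        if (imr_diff & (1 << irq))
            fire_mask_notifier(irq + off, !!(*imr & (1 << irq)));
}
```

Notifying only on changed bits matters because the guest rewrites the IMR constantly; firing notifiers for unchanged pins would be pure overhead.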
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote:
> A device performs a memory access on its local bus. It has no knowledge
> of how that access is routed to its destination. The device should not
> be aware of any IOMMUs, in the same way that it doesn't know whether it
> happens to be accessing RAM or memory mapped peripherals on another
> device.

Right.

> Each IOMMU is fundamentally part of a bus bridge. For example the
> bridge between a PCI bus and the system bus. It provides an address
> mapping from one bus to another.

An IOMMU is not necessarily part of a bus bridge. In concept an IOMMU
can also be implemented on a plugin card, translating only that card.
Real implementations that I am aware of always implement the IOMMU in
the PCI root bridge, though.

> There should be no direct interaction between an IOMMU and a device
> (ignoring ATS, which is effectively a separate data channel).
> Everything should be done via the cpu_physical_memory_* code. Likewise
> on a system with multiple nested IOMMUs there should be no direct
> interaction between these. cpu_physical_memory_* should walk the
> device/bus tree to determine where the access terminates, applying
> mappings appropriately.

That's the point where I disagree. I think there should be a separate
set of functions, independent from cpu_physical_memory_*, to handle
device memory accesses. This would keep the changes small and
non-intrusive. Besides that, real memory controllers can also handle
CPU memory accesses differently from device memory accesses. The AMD
northbridge GART uses this to decide whether it needs to remap a request
or not. The GART can be configured to translate CPU and device accesses
separately.

Joerg
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote:
> On 07/14/2010 03:13 PM, Paul Brook wrote:
>> Well, ok, the function name needs fixing too. However I think the only
>> thing missing from the current API is that it does not provide a way
>> to determine which device is performing the access.
>
> I agree with Paul.
>
> The right approach IMHO is to convert devices to use bus-specific
> functions to access memory. The bus specific functions should have a
> device argument as the first parameter.

If this means a separate interface for device DMA accesses, and not
folding that functionality into the cpu_physical_memory* interface, I
agree too :-)

Joerg
Re: [PATCH 3/4] KVM: MMU: track dirty page in speculative path properly
Marcelo Tosatti wrote:
>> How about just track access bit for speculative path, we set page both
>> accessed and dirty (if it's writable) only if the access bit is set?
>
> A useful thing to do would be to allow read-only mappings, in the fault
> path (Lai sent a few patches in that direction some time ago but there
> was no follow up).
>
> So in the case of a read-only fault from the guest, you'd inform
> get_user_pages() that read-only access is acceptable (so swapcache
> pages can be mapped, or qemu can mprotect(PROT_READ) guest memory).

Yeah, it's great work, I guess Lai will post the new version soon.

And even if we do this, I think the page dirty tracking is still needed,
right? Then how about my new idea to track page dirty for the
speculative path, as the draft patch below does:

@@ -687,10 +687,11 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
 	if (!is_rmap_spte(old_spte))
 		return;
 	pfn = spte_to_pfn(old_spte);
-	if (old_spte & shadow_accessed_mask)
+	if (old_spte & shadow_accessed_mask) {
 		kvm_set_pfn_accessed(pfn);
-	if (is_writable_pte(old_spte))
-		kvm_set_pfn_dirty(pfn);
+		if (is_writable_pte(old_spte))
+			kvm_set_pfn_dirty(pfn);
+	}
 	rmap_remove(kvm, sptep);
 }

@@ -1920,8 +1921,11 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	 * demand paging).
 	 */
 	spte = shadow_base_present_pte | shadow_dirty_mask;
-	if (!speculative)
+	if (!speculative) {
 		spte |= shadow_accessed_mask;
+		if (is_writable_pte(*sptep))
+			kvm_set_pfn_dirty(pfn);
+	}
 	if (!dirty)
 		pte_access &= ~ACC_WRITE_MASK;
 	if (pte_access & ACC_EXEC_MASK)

It uses the access bit to track both page accessed and page dirty, and
it's rather cheap...
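[Editor's note] The semantic change in the first hunk can be modeled in a few lines: a pfn is only marked dirty on spte teardown if the spte was both accessed and writable, where before writable alone was enough. A toy model with stand-in names for the KVM internals:

```c
#include <stdint.h>

#define ACCESSED (1u << 0)
#define WRITABLE (1u << 1)

static int accessed_calls, dirty_calls;
static void set_pfn_accessed(void) { accessed_calls++; }
static void set_pfn_dirty(void)    { dirty_calls++; }

/* Mirrors the restructured drop_spte(): dirty now implies accessed,
 * so a speculative writable spte the guest never touched does not
 * mark the page dirty. */
static void drop_spte_model(uint32_t spte)
{
    if (spte & ACCESSED) {
        set_pfn_accessed();
        if (spte & WRITABLE)
            set_pfn_dirty();
    }
}
```

The interesting case is `WRITABLE` without `ACCESSED`: under the old code that would have dirtied the page; under the new nesting it does nothing, which is the whole point of the patch.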