Re: KVM and NUMA
Hi Daniel, thanks for your response. Am Donnerstag, den 15.07.2010, 20:31 +0100 schrieb Daniel P. Berrange: > If numactl --hardware works, then libvirt should work, > since libvirt uses the numactl library to query topology Ok. I did not know that, and in my case it does not seem to work. See below. > The NUMA topology does not get put inside the element. It > is one level up in a element. eg > In my case (Ubuntu 10.04 LTS) it is just put inside the cpu element. Full host listing: x86_64 core2duo tcp apparmor 0 > > I guess this is the fact, because QEMU does not recognize the > > NUMA-Architecture (QEMU-Monitor): > > (qemu) info numa > > 0 nodes Thanks for the clarification. > There are two aspects to NUMA. 1. Placing QEMU on appropriate NUMA > ndes. 2. defining guest NUMA topology Right. I am interested in placing Qemu on the appropriate node. > > By default QEMU will float freely across any CPUs and all the guest > RAM will appear in one node. This is can be bad for performance, > especially if you are benchmarking > So for performance testing you definitely want to bind QEMU to the > CPUs within a single NUMA node at startup, this will mean that all > memory accesses are local to the node. Unless you give the guest > more virtual RAM, than there is free RAM on the local NUMA node. > Since you suggest you're using libvirt, the low level way todo > this is in the guest XML at the element Ok. But will my Qemu implementation use the appropriate RAM since it does not recognize the architecture? > For further performance you also really want to enable hugepages on > your host (eg mount hugetlbfs at /dev/hugepages), then restart > libvirtd daemon, and then add the following to your guest XML just > after the element: > > > > I have played with that, too. I could mount the hugetlbfs filesystem and define the mountpoint in libvirt. The guest started ok but I could verify that it was actually used. /proc/meminfo always showed 100% free huge pages whether the guest was running or not. Shouldn't these pages be used when the guest is running? As I said: Ubuntu not RHEL. Kind regards, Ralf -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 6/6] KVM: MMU: using __xchg_spte more smarter
Sometimes, atomically set spte is not needed, this patch call __xchg_spte() more smartly Note: if the old mapping's access bit is already set, we no need atomic operation since the access bit is not lost Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c |9 +++-- 1 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index fb3ae54..71e9268 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -682,9 +682,14 @@ static void rmap_remove(struct kvm *kvm, u64 *spte) static void set_spte_track_bits(u64 *sptep, u64 new_spte) { pfn_t pfn; - u64 old_spte; + u64 old_spte = *sptep; + + if (!shadow_accessed_mask || !is_shadow_present_pte(old_spte) || + old_spte & shadow_accessed_mask) { + __set_spte(sptep, new_spte); + } else + old_spte = __xchg_spte(sptep, new_spte); - old_spte = __xchg_spte(sptep, new_spte); if (!is_rmap_spte(old_spte)) return; pfn = spte_to_pfn(old_spte); -- 1.6.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/6] KVM: MMU: cleanup spte set and accssed/dirty tracking
Introduce set_spte_track_bits() to cleanup current code Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 17 - 1 files changed, 8 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 4123e8e..fb3ae54 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -679,7 +679,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte) } } -static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) +static void set_spte_track_bits(u64 *sptep, u64 new_spte) { pfn_t pfn; u64 old_spte; @@ -692,6 +692,11 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) kvm_set_pfn_accessed(pfn); if (is_writable_pte(old_spte)) kvm_set_pfn_dirty(pfn); +} + +static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) +{ + set_spte_track_bits(sptep, new_spte); rmap_remove(kvm, sptep); } @@ -791,7 +796,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, unsigned long data) { int need_flush = 0; - u64 *spte, new_spte, old_spte; + u64 *spte, new_spte; pte_t *ptep = (pte_t *)data; pfn_t new_pfn; @@ -812,13 +817,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, new_spte &= ~PT_WRITABLE_MASK; new_spte &= ~SPTE_HOST_WRITEABLE; new_spte &= ~shadow_accessed_mask; - if (is_writable_pte(*spte)) - kvm_set_pfn_dirty(spte_to_pfn(*spte)); - old_spte = __xchg_spte(spte, new_spte); - if (is_shadow_present_pte(old_spte) - && (!shadow_accessed_mask || - old_spte & shadow_accessed_mask)) - mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte))); + set_spte_track_bits(spte, new_spte); spte = rmap_next(kvm, rmapp, spte); } } -- 1.6.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/6] KVM: MMU: don't atomicly set spte if it's not present
If the old mapping is not present, the spte.a is not lost, so no need atomic operation to set it Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b3896bf..4123e8e 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -307,9 +307,10 @@ static void update_spte(u64 *sptep, u64 new_spte) { u64 old_spte; - if (!shadow_accessed_mask || (new_spte & shadow_accessed_mask)) { + if (!shadow_accessed_mask || (new_spte & shadow_accessed_mask) || + !is_rmap_spte(*sptep)) __set_spte(sptep, new_spte); - } else { + else { old_spte = __xchg_spte(sptep, new_spte); if (old_spte & shadow_accessed_mask) mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte))); -- 1.6.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/6] KVM: MMU: fix page dirty tracking lost while sync page
In sync-page path, if spte.writable is changed, it will lose page dirty tracking, for example: assume spte.writable = 0 in a unsync-page, when it's synced, it map spte to writable(that is spte.writable = 1), later guest write spte.gfn, it means spte.gfn is dirty, then guest changed this mapping to read-only, after it's synced, spte.writable = 0 So, when host release the spte, it detect spte.writable = 0 and not mark page dirty Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c | 10 +++--- 1 files changed, 3 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5937054..b3896bf 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1981,6 +1981,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, mark_page_dirty(vcpu->kvm, gfn); set_pte: + if (is_writable_pte(*sptep) && !is_writable_pte(spte)) + kvm_set_pfn_dirty(pfn); update_spte(sptep, spte); done: return ret; @@ -1994,7 +1996,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, bool reset_host_protection) { int was_rmapped = 0; - int was_writable = is_writable_pte(*sptep); int rmap_count; pgprintk("%s: spte %llx access %x write_fault %d" @@ -2044,15 +2045,10 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, page_header_update_slot(vcpu->kvm, sptep, gfn); if (!was_rmapped) { rmap_count = rmap_add(vcpu, sptep, gfn); - kvm_release_pfn_clean(pfn); if (rmap_count > RMAP_RECYCLE_THRESHOLD) rmap_recycle(vcpu, sptep, gfn); - } else { - if (was_writable) - kvm_release_pfn_dirty(pfn); - else - kvm_release_pfn_clean(pfn); } + kvm_release_pfn_clean(pfn); if (speculative) { vcpu->arch.last_pte_updated = sptep; vcpu->arch.last_pte_gfn = gfn; -- 1.6.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/6] KVM: MMU: fix page accessed tracking lost if ept is enabled
In current code, if ept is enabled(shadow_accessed_mask = 0), the page accessed tracking is lost Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c |5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 1a4b42e..5937054 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -687,7 +687,7 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) if (!is_rmap_spte(old_spte)) return; pfn = spte_to_pfn(old_spte); - if (old_spte & shadow_accessed_mask) + if (!shadow_accessed_mask || old_spte & shadow_accessed_mask) kvm_set_pfn_accessed(pfn); if (is_writable_pte(old_spte)) kvm_set_pfn_dirty(pfn); @@ -815,7 +815,8 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, kvm_set_pfn_dirty(spte_to_pfn(*spte)); old_spte = __xchg_spte(spte, new_spte); if (is_shadow_present_pte(old_spte) - && (old_spte & shadow_accessed_mask)) + && (!shadow_accessed_mask || + old_spte & shadow_accessed_mask)) mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte))); spte = rmap_next(kvm, rmapp, spte); } -- 1.6.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/6] KVM: MMU: fix forgot reserved bits check in speculative path
In the speculative path, we should check guest pte's reserved bits just as the real processor does Reported-by: Marcelo Tosatti Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c |9 - arch/x86/kvm/paging_tmpl.h |5 +++-- 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index d16efbe..1a4b42e 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2693,6 +2693,9 @@ static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu, return; } + if (is_rsvd_bits_set(vcpu, *(u64 *)new, PT_PAGE_TABLE_LEVEL)) + return; + ++vcpu->kvm->stat.mmu_pte_updated; if (!sp->role.cr4_pae) paging32_update_pte(vcpu, sp, spte, new); @@ -2771,6 +2774,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, bool guest_initiated) { gfn_t gfn = gpa >> PAGE_SHIFT; + union kvm_mmu_page_role mask = { .word = 0 }; struct kvm_mmu_page *sp; struct hlist_node *node; LIST_HEAD(invalid_list); @@ -2845,6 +2849,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, } } + mask.cr0_wp = mask.cr4_pae = mask.nxe = 1; for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn, node) { pte_size = sp->role.cr4_pae ? 8 : 4; misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1); @@ -2892,7 +2897,9 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, while (npte--) { entry = *spte; mmu_pte_write_zap_pte(vcpu, sp, spte); - if (gentry) + if (gentry && + !(sp->role.word ^ vcpu->arch.mmu.base_role.word) + & mask.word) mmu_pte_write_new_pte(vcpu, sp, spte, &gentry); if (!remote_flush && need_remote_flush(entry, *spte)) remote_flush = true; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index a09e04c..231fce1 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -638,8 +638,9 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, return -EINVAL; gfn = gpte_to_gfn(gpte); - if (gfn != sp->gfns[i] || - !is_present_gpte(gpte) || !(gpte & PT_ACCESSED_MASK)) { + if (is_rsvd_bits_set(vcpu, gpte, PT_PAGE_TABLE_LEVEL) + || gfn != sp->gfns[i] || !is_present_gpte(gpte) + || !(gpte & PT_ACCESSED_MASK)) { u64 nonpresent; if (is_present_gpte(gpte) || !clear_unsync) -- 1.6.1.2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] test: add test for xchg instruction
On Thu, Jul 15, 2010 at 08:58:47AM +0800, Wei Yongjun wrote: > This patch add test for xchg instruction. > > Signed-off-by: Wei Yongjun > --- > kvm/test/x86/emulator.c | 52 > +++ > 1 files changed, 52 insertions(+), 0 deletions(-) Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] KVM: x86 emulator: fix xchg instruction emulation
On Thu, Jul 15, 2010 at 08:51:58AM +0800, Wei Yongjun wrote: > If the destination is a memory operand and the memory cannot > map to a valid page, the xchg instruction emulation and locked > instruction will not work on io regions and stuck in endless > loop. We should emulate exchange as write to fix it. > > Signed-off-by: Wei Yongjun > Acked-by: Gleb Natapov Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: Don't walk memory_size == 0 slots in kvm_client_migration_log
On Wed, Jul 14, 2010 at 01:36:49PM -0600, Alex Williamson wrote: > If we've unregistered a memory area, we should avoid calling > qemu_get_ram_ptr() on the left over phys_offset cruft in the > slot array. Now that we support removing ramblocks, the > phys_offset ram_addr_t can go away and cause a lookup fault > and abort. > > Signed-off-by: Alex Williamson Applied to uq/master branch, thanks. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] KVM: x86: Call mask notifiers from pic.
On Thu, Jul 15, 2010 at 12:24:37PM +0300, Gleb Natapov wrote: > If pit delivers interrupt while pic is masking it OS will never do EOI > and ack notifier will not be called so when pit will be unmasked no pit > interrupts will be delivered any more. Calling mask notifiers solves this > issue. > > Signed-off-by: Gleb Natapov Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] kvm, x86: use ro page and don't copy shared page
When page fault, we always call get_user_pages(write=1). Actually, we don't need to do this when it is not write fault. get_user_pages(write=1) will cause shared page(ksm) copied. If this page is not modified in future, this copying and the copied page are just wasted. Ksm may scan and merge them and may cause thrash. In this patch, if the page is RO for host VMM and it not write fault for guest, we will use RO page, otherwise we use a writable page. Signed-off-by: Lai Jiangshan --- diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 8ba9b0d..6382140 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1832,6 +1832,45 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn) } } +/* get a current mapped page fast, and test whether the page is writable. */ +static struct page *get_user_page_and_protection(unsigned long addr, + int *writable) +{ + struct page *page[1]; + + if (__get_user_pages_fast(addr, 1, 1, page) == 1) { + *writable = 1; + return page[0]; + } + if (__get_user_pages_fast(addr, 1, 0, page) == 1) { + *writable = 0; + return page[0]; + } + return NULL; +} + +static pfn_t kvm_get_pfn_for_page_fault(struct kvm *kvm, gfn_t gfn, + int write_fault, int *host_writable) +{ + unsigned long addr; + struct page *page; + + if (!write_fault) { + addr = gfn_to_hva(kvm, gfn); + if (kvm_is_error_hva(addr)) { + get_page(bad_page); + return page_to_pfn(bad_page); + } + + page = get_user_page_and_protection(addr, host_writable); + if (page) + return page_to_pfn(page); + } + + *host_writable = 1; + return kvm_get_pfn_for_gfn(kvm, gfn); +} + static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync) { @@ -2085,6 +2124,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) int level; pfn_t pfn; unsigned long mmu_seq; + int host_writable; level = mapping_level(vcpu, gfn); @@ -2099,7 +2139,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn); + pfn = kvm_get_pfn_for_page_fault(vcpu->kvm, gfn, write, &host_writable); /* mmio */ if (is_error_pfn(pfn)) @@ -2109,7 +2149,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; kvm_mmu_free_some_pages(vcpu); - r = __direct_map(vcpu, v, write, level, gfn, pfn, true); + r = __direct_map(vcpu, v, write, level, gfn, pfn, host_writable); spin_unlock(&vcpu->kvm->mmu_lock); @@ -2307,6 +2347,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, int level; gfn_t gfn = gpa >> PAGE_SHIFT; unsigned long mmu_seq; + int write_fault = error_code & PFERR_WRITE_MASK; + int host_writable; ASSERT(vcpu); ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa)); @@ -2321,15 +2363,16 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn); + pfn = kvm_get_pfn_for_page_fault(vcpu->kvm, gfn, write_fault, + &host_writable); if (is_error_pfn(pfn)) return kvm_handle_bad_page(vcpu->kvm, gfn, pfn); spin_lock(&vcpu->kvm->mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; kvm_mmu_free_some_pages(vcpu); - r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK, -level, gfn, pfn, true); + r = __direct_map(vcpu, gpa, write_fault, +level, gfn, pfn, host_writable); spin_unlock(&vcpu->kvm->mmu_lock); return r; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index a9dbaa0..1874f51 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -430,6 +430,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, pfn_t pfn; int level = PT_PAGE_TABLE_LEVEL; unsigned long mmu_seq; + int host_writable; pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code); kvm_mmu_audit(vcpu, "pre page fault"); @@ -461,7 +462,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - pfn = kvm_get_pfn_for_gfn(vcpu->kvm, walker.gfn); + pfn = kvm_get_pfn_for_page_fault(vcpu->kvm, walker.gfn, write_fault, +
[PATCH 6/6] kvm, faster and simpler version of get_user_page_and_protection()
a light weight version of get_user_page_and_protection() Signed-off-by: Lai Jiangshan --- diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a34c785..d0e4f2f 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -618,6 +618,8 @@ static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count) memcpy(dst, src, count * sizeof(pgd_t)); } +extern +struct page *get_user_page_and_protection(unsigned long addr, int *writable); #include #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 6382140..de44847 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1832,23 +1832,6 @@ static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn) } } -/* get a current mapped page fast, and test whether the page is writable. */ -static struct page *get_user_page_and_protection(unsigned long addr, - int *writable) -{ - struct page *page[1]; - - if (__get_user_pages_fast(addr, 1, 1, page) == 1) { - *writable = 1; - return page[0]; - } - if (__get_user_pages_fast(addr, 1, 0, page) == 1) { - *writable = 0; - return page[0]; - } - return NULL; -} - static pfn_t kvm_get_pfn_for_page_fault(struct kvm *kvm, gfn_t gfn, int write_fault, int *host_writable) { diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c index a4ce19f..34b05c7 100644 --- a/arch/x86/mm/gup.c +++ b/arch/x86/mm/gup.c @@ -275,7 +275,6 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write, return nr; } -EXPORT_SYMBOL_GPL(__get_user_pages_fast); /** * get_user_pages_fast() - pin user pages in memory @@ -375,3 +374,83 @@ slow_irqon: return ret; } } + +/* + * get a current mapped page fast, and test whether the page is writable. + * equivalent version(but slower): + * { + * struct page *page[1]; + * + * if (__get_user_pages_fast(addr, 1, 1, page) == 1) { + * *writable = 1; + * return page[0]; + * } + * if (__get_user_pages_fast(addr, 1, 0, page) == 1) { + * *writable = 0; + * return page[0]; + * } + * return NULL; + * } + */ +struct page *get_user_page_and_protection(unsigned long addr, int *writable) +{ + unsigned long flags; + struct mm_struct *mm = current->mm; + pgd_t *pgdp; + pud_t *pudp; + pmd_t *pmdp; + pte_t pte, *ptep; + + unsigned long mask = _PAGE_PRESENT | _PAGE_USER; + unsigned long offset = 0; + struct page *head, *page = NULL; + + addr &= PAGE_MASK; + + local_irq_save(flags); + pgdp = pgd_offset(mm, addr); + if (!pgd_present(*pgdp)) + goto out; + + pudp = pud_offset(pgdp, addr); + if (!pud_present(*pudp)) + goto out; + + if (unlikely(pud_large(*pudp))) { + pte = *(pte_t *)pudp; + offset = ((addr & ~PUD_MASK) >> PAGE_SHIFT); + goto verify; + } + + pmdp = pmd_offset(pudp, addr); + if (!pmd_present(*pmdp)) + goto out; + + if (unlikely(pmd_large(*pmdp))) { + pte = *(pte_t *)pmdp; + offset = ((addr & ~PMD_MASK) >> PAGE_SHIFT); + goto verify; + } + + ptep = pte_offset_map(pmdp, addr); + pte = gup_get_pte(ptep); + pte_unmap(ptep); + +verify: + if ((pte_flags(pte) & (mask | _PAGE_SPECIAL)) != mask) + goto out; + + VM_BUG_ON(!pfn_valid(pte_pfn(pte))); + + head = pte_page(pte); + page = head + offset; + VM_BUG_ON(compound_head(page) != head); + get_page(page); + *writable = !!(pte_flags(pte) & _PAGE_RW); + +out: + local_irq_restore(flags); + return page; +} +EXPORT_SYMBOL_GPL(get_user_page_and_protection); + -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] kvm: add host_writable parameter
add host_writable parameter for some functions, no functionality changed, prepare for using RO pages. Signed-off-by: Lai Jiangshan --- diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 0867ced..8ba9b0d 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1861,7 +1861,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned pte_access, int user_fault, int write_fault, int dirty, int level, gfn_t gfn, pfn_t pfn, bool speculative, - bool can_unsync, bool reset_host_protection) + bool can_unsync, bool host_writable) { u64 spte; int ret = 0; @@ -1888,8 +1888,10 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn, kvm_is_mmio_pfn(pfn)); - if (reset_host_protection) + if (host_writable) spte |= SPTE_HOST_WRITEABLE; + else + pte_access &= ~ACC_WRITE_MASK; spte |= (u64)pfn << PAGE_SHIFT; @@ -1942,7 +1944,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, int user_fault, int write_fault, int dirty, int *ptwrite, int level, gfn_t gfn, pfn_t pfn, bool speculative, -bool reset_host_protection) +bool host_writable) { int was_rmapped = 0; int was_writable = is_writable_pte(*sptep); @@ -1978,7 +1980,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault, dirty, level, gfn, pfn, speculative, true, - reset_host_protection)) { + host_writable)) { if (write_fault) *ptwrite = 1; kvm_mmu_flush_tlb(vcpu); @@ -2015,7 +2017,7 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) } static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, - int level, gfn_t gfn, pfn_t pfn) + int level, gfn_t gfn, pfn_t pfn, bool host_writable) { struct kvm_shadow_walk_iterator iterator; struct kvm_mmu_page *sp; @@ -2026,7 +2028,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write, if (iterator.level == level) { mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL, 0, write, 1, &pt_write, -level, gfn, pfn, false, true); +level, gfn, pfn, false, host_writable); ++vcpu->stat.pf_fixed; break; } @@ -2107,7 +2109,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; kvm_mmu_free_some_pages(vcpu); - r = __direct_map(vcpu, v, write, level, gfn, pfn); + r = __direct_map(vcpu, v, write, level, gfn, pfn, true); spin_unlock(&vcpu->kvm->mmu_lock); @@ -2327,7 +2329,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, goto out_unlock; kvm_mmu_free_some_pages(vcpu); r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK, -level, gfn, pfn); +level, gfn, pfn, true); spin_unlock(&vcpu->kvm->mmu_lock); return r; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index ac24158..a9dbaa0 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -291,7 +291,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, return; kvm_get_pfn(pfn); /* -* we call mmu_set_spte() with reset_host_protection = true beacuse that +* we call mmu_set_spte() with host_writable = true beacuse that * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1). */ mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0, @@ -305,7 +305,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, struct guest_walker *gw, int user_fault, int write_fault, int hlevel, -int *ptwrite, pfn_t pfn) +int *ptwrite, pfn_t pfn, bool host_writable) { unsigned access = gw->pt_access; struct kvm_mmu_page *sp; @@ -334,7 +334,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, gw->pte_access & access, user_fault, write_fault, dirty, ptwrite, level, -
[PATCH 3/6] kvm: rename gfn_to_pfn() etc.
gfn_to_pfn() does actually increase the reference of the page. But "gfn_to_pfn" is questionable, it misses this semantic. So we rename it to kvm_get_pfn_for_gfn() which make more sense. gfn_to_page() and hva_to_pfn() are also renamed. (no behavior changed) Signed-off-by: Lai Jiangshan --- diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 5cb5865..fe220d6 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1589,7 +1589,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, return -ENOMEM; for (i = 0; i < npages; i++) { - pfn = gfn_to_pfn(kvm, base_gfn + i); + pfn = kvm_get_pfn_for_gfn(kvm, base_gfn + i); if (!kvm_is_mmio_pfn(pfn)) { kvm_set_pmt_entry(kvm, base_gfn + i, pfn << PAGE_SHIFT, diff --git a/arch/powerpc/kvm/44x_tlb.c b/arch/powerpc/kvm/44x_tlb.c index 8123125..aeb67aa 100644 --- a/arch/powerpc/kvm/44x_tlb.c +++ b/arch/powerpc/kvm/44x_tlb.c @@ -314,7 +314,7 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 gvaddr, gpa_t gpaddr, /* Get reference to new page. */ gfn = gpaddr >> PAGE_SHIFT; - new_page = gfn_to_page(vcpu->kvm, gfn); + new_page = kvm_get_page_for_gfn(vcpu->kvm, gfn); if (is_error_page(new_page)) { printk(KERN_ERR "Couldn't get guest page for gfn %lx!\n", gfn); kvm_release_page_clean(new_page); diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index a3cef30..1611292 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -414,7 +414,7 @@ static void kvmppc_patch_dcbz(struct kvm_vcpu *vcpu, struct kvmppc_pte *pte) u32 *page; int i; - hpage = gfn_to_page(vcpu->kvm, pte->raddr >> PAGE_SHIFT); + hpage = kvm_get_page_for_gfn(vcpu->kvm, pte->raddr >> PAGE_SHIFT); if (is_error_page(hpage)) return; diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c b/arch/powerpc/kvm/book3s_32_mmu_host.c index 0b51ef8..94b4907 100644 --- a/arch/powerpc/kvm/book3s_32_mmu_host.c +++ b/arch/powerpc/kvm/book3s_32_mmu_host.c @@ -147,7 +147,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte) struct hpte_cache *pte; /* Get host physical address for gpa */ - hpaddr = gfn_to_pfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT); + hpaddr = kvm_get_pfn_for_gfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT); if (kvm_is_error_hva(hpaddr)) { printk(KERN_INFO "Couldn't get guest page for gfn %lx!\n", orig_pte->eaddr); diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c index 384179a..414a001 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_host.c +++ b/arch/powerpc/kvm/book3s_64_mmu_host.c @@ -101,7 +101,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte) struct kvmppc_sid_map *map; /* Get host physical address for gpa */ - hpaddr = gfn_to_pfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT); + hpaddr = kvm_get_pfn_for_gfn(vcpu->kvm, orig_pte->raddr >> PAGE_SHIFT); if (kvm_is_error_hva(hpaddr)) { printk(KERN_INFO "Couldn't get guest page for gfn %lx!\n", orig_pte->eaddr); return -EINVAL; diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c index f11ca0f..026e5c7 100644 --- a/arch/powerpc/kvm/e500_tlb.c +++ b/arch/powerpc/kvm/e500_tlb.c @@ -299,7 +299,7 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, stlbe = &vcpu_e500->shadow_tlb[tlbsel][esel]; /* Get reference to new page. */ - new_page = gfn_to_page(vcpu_e500->vcpu.kvm, gfn); + new_page = kvm_get_page_for_gfn(vcpu_e500->vcpu.kvm, gfn); if (is_error_page(new_page)) { printk(KERN_ERR "Couldn't get guest page for gfn %lx!\n", gfn); kvm_release_page_clean(new_page); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 1f3cbb8..0867ced 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2097,7 +2097,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn) mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - pfn = gfn_to_pfn(vcpu->kvm, gfn); + pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn); /* mmio */ if (is_error_pfn(pfn)) @@ -2319,7 +2319,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - pfn = gfn_to_pfn(vcpu->kvm, gfn); + pfn = kvm_get_pfn_for_gfn(vcpu->kvm, gfn); if (is_error_pfn(pfn)) return kvm_handle_bad_page(vcpu->kvm, gfn, pfn); spin_lock(&vcpu->kvm->mmu_lock); @@ -2696,7 +2696,7 @@ static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, vcp
[PATCH 2/6] kvm, ept: remove the default write bit
When ept enabled, current code set shadow_base_present_pte including the write bit, thus all pte entries have writabe bit, and it means guest os can always write to any mapped page (even VMM maps RO pages for the guest.) we will use RO pages future, fix it. Signed-off-by: Lai Jiangshan --- diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 502e53f..62cc947 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -534,7 +534,6 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu); int kvm_mmu_create(struct kvm_vcpu *vcpu); int kvm_mmu_setup(struct kvm_vcpu *vcpu); void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte); -void kvm_mmu_set_base_ptes(u64 base_pte); void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 dirty_mask, u64 nx_mask, u64 x_mask); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b93b94f..1f3cbb8 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -181,7 +181,6 @@ static struct kmem_cache *mmu_page_header_cache; static u64 __read_mostly shadow_trap_nonpresent_pte; static u64 __read_mostly shadow_notrap_nonpresent_pte; -static u64 __read_mostly shadow_base_present_pte; static u64 __read_mostly shadow_nx_mask; static u64 __read_mostly shadow_x_mask;/* mutual exclusive with nx_mask */ static u64 __read_mostly shadow_user_mask; @@ -200,12 +199,6 @@ void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte) } EXPORT_SYMBOL_GPL(kvm_mmu_set_nonpresent_ptes); -void kvm_mmu_set_base_ptes(u64 base_pte) -{ - shadow_base_present_pte = base_pte; -} -EXPORT_SYMBOL_GPL(kvm_mmu_set_base_ptes); - void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, u64 dirty_mask, u64 nx_mask, u64 x_mask) { @@ -1878,7 +1871,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, * whether the guest actually used the pte (in order to detect * demand paging). */ - spte = shadow_base_present_pte | shadow_dirty_mask; + spte = PT_PRESENT_MASK | shadow_dirty_mask; if (!speculative) spte |= shadow_accessed_mask; if (!dirty) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fdcc98..856e427 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4419,8 +4419,6 @@ static int __init vmx_init(void) if (enable_ept) { bypass_guest_pf = 0; - kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK | - VMX_EPT_WRITABLE_MASK); kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull, VMX_EPT_EXECUTABLE_MASK); kvm_enable_tdp(); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index fb08316..5f2fb50 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4237,7 +4237,6 @@ int kvm_arch_init(void *opaque) kvm_x86_ops = ops; kvm_mmu_set_nonpresent_ptes(0ull, 0ull); - kvm_mmu_set_base_ptes(PT_PRESENT_MASK); kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK, PT_DIRTY_MASK, PT64_NX_MASK, 0); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] kvm: pass error code to handler
handle_ept_violation() does not pass error code to the handler tdp_page_fault(). It means tdp_page_fault() handles the page fault with ignoring the error code, It will not handle the page fault completely correctly, and may causes endless page fault. Signed-off-by: Lai Jiangshan --- diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 856e427..b40731e 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3521,7 +3521,8 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS); trace_kvm_page_fault(gpa, exit_qualification); - return kvm_mmu_page_fault(vcpu, gpa & PAGE_MASK, 0); + return kvm_mmu_page_fault(vcpu, gpa & PAGE_MASK, + exit_qualification & 0x2); } static u64 ept_rsvd_mask(u64 spte, int level) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: MMU: track dirty page in speculative path properly
Marcelo Tosatti wrote: >> It uses access bit to track both page accessed and page dirty, and it's >> rather cheap... > > Xiao, > > I don't understand it. What are you trying to achieve? > Marcelo, The issue which we try to fix in this patch is we mark the page dirty in speculative path, i'm not sure that after Lai's patch this bug is gone, if it's true, this patch is not needed anymore, otherwise the later patch is the cheaper way help us to fix this issue. In the later patch, we use pte.a to track whether the speculative mapping is accessed, if it's accessed, we mark the page dirty, just like tracking page-accessed. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM vs. PCI-E Function Level Reset (FLR) ...
On Thursday 15 July 2010 23:39:36 Casey Leedom wrote: > | From: Sheng Yang > | Date: Wednesday, July 14, 2010 06:31 pm > | > | On Thursday 15 July 2010 02:01:29 Casey Leedom wrote: > | > | From: Sheng Yang > | > | Date: Tuesday, July 13, 2010 05:53 pm > | > | (Please use reply to all next time.) > > Sorry, old habit. > > | > | On Wednesday 14 July 2010 04:41:01 Casey Leedom wrote: > | > Hhrrmmm, this seems like a semantic error. "Resetting" the Vm should > | > > | > be morally equivalent to resetting a real physical machine. And when > | > a real physical machine is reset, all of its buses, etc. get reset > | > which propagates to Device Resets on those buses ... I think that > | > Assigned Devices should be reset for reboots and power off/on ... > | > | Yes, it should be done when reset. And power on/off should be there, > | because that's means the create and destroy the guest. > > Okay, cool. I've looked through the kernel code and I don't see any > likely places for putting such a VM Reboot/Reset feature in so I assume > that this is also controlled by QEmu and it would need to be responsible > for either doing a KVM_DEASSIGN_PCI_DEVICE/KVM_ASSIGN_PCI_DEVICE ioctl() > pair for each assigned device or issuing a new > KVM_RESET_ASSIGNED_PCI_DEVICE ioctl() for Reboot/Reset ... Yeah, the detection of reset is not that straightforward... Maybe we need an ioctl for reset event in qemu-kvm kvm_reset_vcpu(). We don't need assign/deassign device when reboot/reset, we only need to notify KVM that the reset is happening, then KVM know what's to do. > | > Assigned Devices. For PCI-E SR-IOV Virtual Functions which are > | > assigned to a VM, they need to be reset at reboot/power off/power on > | > and the Configuration Space emulation needs to support the Guest OS/VM > | > Device Driver issuing an FLR ... > | > | You can add the FLR support for your device if you need to call it in the > | guest. But I don't quite understand why you need to call it in the guest. > | KVM should has already done that, and it's not necessary for native > | device. > > This was mostly for device driver load/unload support. I.e. being able > to issue a Function Level Reset to a PCI Device Function (regardless of it > being an SR-IOV Virtual Function or not) is a nice way to zap the device > back to a canonical state. OK. > > | So you want to issue FLR(rather than probing the feature) when > | driver->probe() called? Current seems KVM is the only user of > | pci_reset_function(), so I think we can update it after your driver > | posted to the mailing list. Also I am not sure why you want to reset > | function when probing. KVM should cover them all I think. > > Remember, the "probe" _argument_ in pci_dev_reset() has nothing to do > with whether it's being called from a device driver's probe() routine. > The "probe" _argument_ in pci_dev_reset() is used for the two-stage effort > in > pci_reset_function() which does: > > 1. "try to see if it's possible to reset this function by itself" > Call pci_dev_reset(dev, probe=1); > > 2. "go ahead and do the function reset" > Call pci_dev_reset(dev, probe=0); > > And yes, I'd noticed that KVM was the only current subscriber of the > kernel interface but, as I noted above, it is useful for resetting the > device function into a known state -- or at least it _was_ useful till the > deadlock got introduced in 2.6.31 ... :-( And again, the use of this > actually has no dependency on KVM: I would want to be able to call > pci_reset_function() in any device driver's probe() routine ... It just > happens that I ran into this need (and unfortunate 2.6.31 deadlock) with > our PCI-E SR-IOV Virtual Function Driver. What I meant was, before your driver, there is no such requirement in the code. And no one can predict every usage of the code in the future, so it's quite reasonable you called the "deadlock" is there. And I can't say it's a "deadlock" because there is no code in the current tree make it "deadlock" IIUR. So now you have this requirement, you can modify it to fit your need. That's quite straightforward... > > Oh, and the driver has been posted to net-next. I'm guessing that it > _should_ get merged into 2.6.35 ... or maybe 2.6.36 ... I'm really not > sure of how the merge schedule between net-next and the core kernel runs > ... That's good. So you can modify the function to provide a lockless version. That make sense now. :) -- regards Yang, Sheng > > Casey -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM and NUMA
On Thu, Jul 15, 2010 at 07:10:35PM +0200, Ralf Spenneberg wrote: > Hi, > > I just had a chance to play with KVM on Ubuntu 10.04 LTS on some new HP > 360 g6 with Nehalem processors. I have a feeling that KVM and NUMA on > these machines do not play well together. > > Doing some benchmarks I got bizarre numbers. Sometimes the VMs were > performing fine and some times the performance was very bad! Apparently > KVM does not recognize the NUMA-architecture and places memory and > process randomly and therefore often on different numa cells. > > First a couple of specs of the machine: > Two Nehalem sockets with E5520, Hyperthreading turned off, 4 cores per > socket, all in all 8 processors. > > > Linux recognizes the NUMA-architecture: > # numactl --hardware > available: 2 nodes (0-1) > node 0 cpus: 0 2 4 6 > node 0 size: 12277 MB > node 0 free: 9183 MB > node 1 cpus: 1 3 5 7 > node 1 size: 12287 MB > node 1 free: 8533 MB > node distances: > node 0 1 > 0: 10 20 > 1: 20 10 If numactl --hardware works, then libvirt should work, since libvirt uses the numactl library to query topology > > So I have got two cells with 4 cores each. > > Virsh does not recognize the topology: > # virsh capabilities > > > > x86_64 > core2duo > > > .. The NUMA topology does not get put inside the element. It is one level up in a element. eg x86_64 snip ...snip... This shows 2 numa nodes (cells in libvirt terminology) each with 4 CPUs. You can also query free RAM in each node/cell # virsh freecell 0 0: 1922084 kB # virsh freecell 1 1: 1035700 kB >From both of these you can then decide where to place the guest > I guess this is the fact, because QEMU does not recognize the > NUMA-Architecture (QEMU-Monitor): > (qemu) info numa > 0 nodes IIRC this is reporting the guest NUMA topology which is completely independant of host NUMA topology. > So apparently KVM does not utilize the NUMA-architecture. Did I do > something wrong. Is KVM missing a patch? Do I need to activate something > in KVM to recognize the NUMA-Architecture? There are two aspects to NUMA. 1. Placing QEMU on appropriate NUMA ndes. 2. defining guest NUMA topology By default QEMU will float freely across any CPUs and all the guest RAM will appear in one node. This is can be bad for performance, especially if you are benchmarking So for performance testing you definitely want to bind QEMU to the CPUs within a single NUMA node at startup, this will mean that all memory accesses are local to the node. Unless you give the guest more virtual RAM, than there is free RAM on the local NUMA node. Since you suggest you're using libvirt, the low level way todo this is in the guest XML at the element In my capabilities XML example above you can see 2 numa nodes, each with 4 cpus. So if I want to restrict the guest to the first NUMA node which has CPU numbers 0, 1, 2, 3, then I'd do rhel6x86_64 0bbf8187-bce1-bc77-2a2c-fb033816f7f4 819200 819200 2 ...snip... You can verify the pinning with virsh cpuinfo # virsh vcpuinfo rhel5xen VCPU: 0 CPU:1 State: running CPU time: 15.9s CPU Affinity: VCPU: 1 CPU:2 State: running CPU time: 9.5s CPU Affinity: snip rest... It is not yet possible to define the guest visible NUMA topology via libvirt, but that shouldn't be too critical for performance unless you needed to your guest to be able to span multiple host nodes. For further performance you also really want to enable hugepages on your host (eg mount hugetlbfs at /dev/hugepages), then restart libvirtd daemon, and then add the following to your guest XML just after the element: This will make it pre-allocate hugepages for all guest RAM at startup. NB the downside is that you can't overcommit RAM, but that's a tradeoff between maximising utilization and maximising performance. Daniel -- |: Red Hat, Engineering, London-o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org-o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :| -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM Call agenda for July 13th
On 07/15/2010 01:51 PM, Aurelien Jarno wrote: On Thu, Jul 15, 2010 at 01:43:28PM -0500, Justin M. Forbes wrote: On Tue, Jul 13, 2010 at 12:19:21PM -0500, Brian Jackson wrote: On Tuesday, July 13, 2010 12:01:22 pm Avi Kivity wrote: On 07/13/2010 07:57 PM, Anthony Liguori wrote: I'd like to see more frequent stable releases, at least if the stable branch contains fixes to user-reported bugs (or of course security or data integrity fixes). Would you like to see more frequent stable releases or more frequent master releases? Yes. But in this context I'm interested in stable releases. We have bugs reported, fixed, and the fix applied, yet the fixes are unreachable to users. Especially so since qemu-kvm 0.12-stable hasn't been merged with qemu basically since 0.12.4 came out. I was trying to help one of the Gentoo maintainers find post 0.12.4 patches the other day and had to point them to the upstream qemu stable tree. I have offered to take care of this, but so far I do not have commit access. You don't necessarily need commit access for that. Just create your own tree with backported patches, and then send a stable pull request to the mailing list. Precisely. In the case of stable, that means watching the mailing list and backporting patches as appropriate and periodically doing pull requests. Regards, Anthony Liguori -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM Call agenda for July 13th
On Thu, Jul 15, 2010 at 01:43:28PM -0500, Justin M. Forbes wrote: > On Tue, Jul 13, 2010 at 12:19:21PM -0500, Brian Jackson wrote: > > On Tuesday, July 13, 2010 12:01:22 pm Avi Kivity wrote: > > > On 07/13/2010 07:57 PM, Anthony Liguori wrote: > > > >> I'd like to see more frequent stable releases, at least if the stable > > > >> branch contains fixes to user-reported bugs (or of course security or > > > >> data integrity fixes). > > > > > > > > Would you like to see more frequent stable releases or more frequent > > > > master releases? > > > > > > Yes. But in this context I'm interested in stable releases. We have > > > bugs reported, fixed, and the fix applied, yet the fixes are unreachable > > > to users. > > > > Especially so since qemu-kvm 0.12-stable hasn't been merged with qemu > > basically since 0.12.4 came out. I was trying to help one of the Gentoo > > maintainers find post 0.12.4 patches the other day and had to point them to > > the upstream qemu stable tree. > > I have offered to take care of this, but so far I do not have commit > access. > You don't necessarily need commit access for that. Just create your own tree with backported patches, and then send a stable pull request to the mailing list. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurel...@aurel32.net http://www.aurel32.net -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: KVM Call agenda for July 13th
On Tue, Jul 13, 2010 at 12:19:21PM -0500, Brian Jackson wrote: > On Tuesday, July 13, 2010 12:01:22 pm Avi Kivity wrote: > > On 07/13/2010 07:57 PM, Anthony Liguori wrote: > > >> I'd like to see more frequent stable releases, at least if the stable > > >> branch contains fixes to user-reported bugs (or of course security or > > >> data integrity fixes). > > > > > > Would you like to see more frequent stable releases or more frequent > > > master releases? > > > > Yes. But in this context I'm interested in stable releases. We have > > bugs reported, fixed, and the fix applied, yet the fixes are unreachable > > to users. > > Especially so since qemu-kvm 0.12-stable hasn't been merged with qemu > basically since 0.12.4 came out. I was trying to help one of the Gentoo > maintainers find post 0.12.4 patches the other day and had to point them to > the upstream qemu stable tree. I have offered to take care of this, but so far I do not have commit access. Justin -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-net: avoid flush under lock
On Thu, 2010-07-15 at 15:19 +0300, Michael S. Tsirkin wrote: > We flush under vq mutex when changing backends. > This creates a deadlock as workqueue being flushed > needs this lock as well. > > https://bugzilla.redhat.com/show_bug.cgi?id=612421 > > Drop the vq mutex before flush: we have the device mutex > which is sufficient to prevent another ioctl from touching > the vq. Why do we need to flush the vq when trying to set the backend and we find that it is already set. Is this just an optimization? Thanks Sridhar > > Signed-off-by: Michael S. Tsirkin > --- > drivers/vhost/net.c |5 + > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index 28d7786..50df58e6 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, > unsigned index, int fd) > rcu_assign_pointer(vq->private_data, sock); > vhost_net_enable_vq(n, vq); > done: > + mutex_unlock(&vq->mutex); > + > if (oldsock) { > vhost_net_flush_vq(n, index); > fput(oldsock->file); > } > > + mutex_unlock(&n->dev.mutex); > + return 0; > + > err_vq: > mutex_unlock(&vq->mutex); > err: -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: MMU: track dirty page in speculative path properly
On Thu, Jul 15, 2010 at 03:44:48PM +0800, Xiao Guangrong wrote: > > > Marcelo Tosatti wrote: > > >> How about just track access bit for speculative path, we set page both > >> accessed and > >> dirty(if it's writable) only if the access bit is set? > > > > A useful thing to do would be to allow read-only mappings, in the fault > > path (Lai sent a few patches in that direction sometime ago but there > > was no follow up). > > > > So in the case of a read-only fault from the guest, you'd inform > > get_user_pages() that read-only access is acceptable (so swapcache pages > > can be mapped, or qemu can mprotect(PROT_READ) guest memory). > > > > Yeah, it's a great work, i guess Lai will post the new version soon. > > And, even we do this, i think the page dirty track is still needed, right? > Then, how about my new idea to track page dirty for speculative path, just > as below draft patch does: > > @@ -687,10 +687,11 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 > new_spte) > if (!is_rmap_spte(old_spte)) > return; > pfn = spte_to_pfn(old_spte); > - if (old_spte & shadow_accessed_mask) > + if (old_spte & shadow_accessed_mask) { > kvm_set_pfn_accessed(pfn); > - if (is_writable_pte(old_spte)) > - kvm_set_pfn_dirty(pfn); > + if (is_writable_pte(old_spte)) > + kvm_set_pfn_dirty(pfn); > + } > rmap_remove(kvm, sptep); > } > > @@ -1920,8 +1921,11 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, > * demand paging). > */ > spte = shadow_base_present_pte | shadow_dirty_mask; > - if (!speculative) > + if (!speculative) { > spte |= shadow_accessed_mask; > + if (is_writable_pte(*sptep)) > + kvm_set_pfn_dirty(pfn); > + } > if (!dirty) > pte_access &= ~ACC_WRITE_MASK; > if (pte_access & ACC_EXEC_MASK) > > It uses access bit to track both page accessed and page dirty, and it's > rather cheap... Xiao, I don't understand it. What are you trying to achieve? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 11:45 AM, Eduard - Gabriel Munteanu wrote: On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote: No. PCI devices should never call cpu_physical_memory*. PCI devices should call pci_memory*. ISA devices should call isa_memory*. All device memory accesses should go through their respective buses. There can be multiple IOMMUs at different levels of the device hierarchy. If you don't provide bus-level memory access functions that chain through the hierarchy, it's extremely difficult to implement all the necessary hooks to perform the translations at different places. Regards, Anthony Liguori I liked Paul's initial approach more, at least if I understood him correctly. Basically I'm suggesting a single memory_* function that simply asks the bus for I/O and translation. Say you have something like this: + Bus 1 | Memory 1 | ---+ Bus 2 bridge | Memory 2 | ---+ Bus 3 bridge | Device Say Device wants to write to memory. If we have the DeviceState we needn't concern whether this is a BusOneDevice or BusTwoDevice from device code itself. We would just call memory_rw(dev_state, addr, buf, size, is_write); I dislike this API for a few reasons: 1) buses have different types of addresses with different address ranges. this api would have to take a generic address type. 2) dev_state would be the qdev device state. this means qdev needs to have memory hook mechanisms that's chainable. I think it's unnecessary at the qdev level 3) users have upcasted device states, so it's more natural to pass PCIDevice than DeviceState. 4) there's an assumption that all devices can get to DeviceState. that's not always true today. which simply recurses through DeviceState's and BusState's through their parent pointers. The actual bus can set up those to provide identification information and perhaps hooks for translation and access checking. So memory_rw() looks like this (pseudocode): static void memory_rw(DeviceState *dev, target_phys_addr_t addr, uint8_t *buf, int size, int is_write) { BusState *bus = get_parent_bus_of_dev(dev); DeviceState *pdev = get_parent_dev(dev); target_phys_addr_t taddr; if (!bus) { /* This shouldn't happen. */ assert(0); } if (bus->responsible_for(addr)) { raw_physical_memory_rw(addr, buf, size, is_write); return; } taddr = bus->translate(dev, addr); memory_rw(pdev, taddr, buf, size, is_write); This is too simplistic because you sometimes have layering that doesn't fit into the bus model. For instance, virtio + pci. We really want a virtio_memory_rw that calls either syborg_memory_rw or pci_memory_rw based on the transport. In your proposal, we would have to model virtio-pci as a bus with a single device which appears awkward to me. Regards, Anthony Liguori } If we do this, it seems there's no need to provide separate functions. The actual buses must instead initialize those hooks properly. Translation here is something inherent to the bus, that handles arbitration between possibly multiple IOMMUs. Our memory would normally reside on / belong to the top-level bus. What do you think? (Naming could be better though.) Eduard -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 10:17:10AM -0700, Chris Wright wrote: > * Avi Kivity (a...@redhat.com) wrote: > > > > For emulated device, it seems like we can ignore ATS completely, no? > > Not if you want to emulate an ATS capable device ;) > > Eariler upthread I said: > > IOW, if qemu ever had a device with ATS support... > > So, that should've been a much bigger _IF_ > > thanks, > -chris I think we can augment some devices with ATS capability if there are performance gains in doing so. This doesn't seem to be a detail the actual guest OS would be interested in, so we can do it even for devices that existed long before the AMD IOMMU came into existence. But I'm not really sure about this, it's just a thought. Linux seems to be issuing IOTLB invalidation commands anyway. Eduard -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Avi Kivity (a...@redhat.com) wrote: > On 07/15/2010 08:17 PM, Chris Wright wrote: > > > >>For emulated device, it seems like we can ignore ATS completely, no? > >Not if you want to emulate an ATS capable device ;) > > What I meant was that the whole request translation, invalidate, dma > using a translated address thing is invisible to software. We can > emulate an ATS capable device by going through the iommu every time. Well, I don't see any reason to completely ignore it. It'd be really useful for testing (I'd use it that way). Esp to verify the invalidation of the device IOTLBs. But I think it's not a difficult thing to emulate once we have a proper api encapsulating a device's dma request. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 08:02:05PM +0300, Avi Kivity wrote: > For emulated device, it seems like we can ignore ATS completely, no? An important use-case for emulation is software testing and caching of iommu's is an important part that needs to be handled in software. For this purpose it makes sense to emulate the behavior of caches too. So we probably should not ignore the possibility of an emulated ATS device completly. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 08:17 PM, Chris Wright wrote: For emulated device, it seems like we can ignore ATS completely, no? Not if you want to emulate an ATS capable device ;) What I meant was that the whole request translation, invalidate, dma using a translated address thing is invisible to software. We can emulate an ATS capable device by going through the iommu every time. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM and NUMA
Hi, I just had a chance to play with KVM on Ubuntu 10.04 LTS on some new HP 360 g6 with Nehalem processors. I have a feeling that KVM and NUMA on these machines do not play well together. Doing some benchmarks I got bizarre numbers. Sometimes the VMs were performing fine and some times the performance was very bad! Apparently KVM does not recognize the NUMA-architecture and places memory and process randomly and therefore often on different numa cells. First a couple of specs of the machine: Two Nehalem sockets with E5520, Hyperthreading turned off, 4 cores per socket, all in all 8 processors. # cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 26 model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz stepping: 5 ... Linux recognizes the NUMA-architecture: # numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 2 4 6 node 0 size: 12277 MB node 0 free: 9183 MB node 1 cpus: 1 3 5 7 node 1 size: 12287 MB node 1 free: 8533 MB node distances: node 0 1 0: 10 20 1: 20 10 So I have got two cells with 4 cores each. Virsh does not recognize the topology: # virsh capabilities x86_64 core2duo .. I guess this is the fact, because QEMU does not recognize the NUMA-Architecture (QEMU-Monitor): (qemu) info numa 0 nodes This is done on an Ubuntu 10.04 LTS: Linux lxkvm01 2.6.32-23-server #37-Ubuntu SMP Fri Jun 11 09:11:11 UTC 2010 x86_64 GNU/Linux The package used was: qemu-kvm 0.12.3+noroms-0ubuntu9.2 When doing benchmarks numastat shows a lot of misses when the performance is bad. # numastat node0 node1 numa_hit1552715817015505 numa_miss7032982 3512950 numa_foreign 3512950 7032982 interleave_hit 80788264 local_node 1552518717006655 other_node 7034953 3521800 So apparently KVM does not utilize the NUMA-architecture. Did I do something wrong. Is KVM missing a patch? Do I need to activate something in KVM to recognize the NUMA-Architecture? I tried the newest Kernel Module 2.6.32.15 and Qemu-kvm 0.12.4 without a change. Any hints? Kind regards, Ralf -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Avi Kivity (a...@redhat.com) wrote: > On 07/15/2010 07:52 PM, Chris Wright wrote: > > > >>Really? Can you provide an documentation to support this claim? > >>My impression is that there is no difference between translated and > >>untranslated devices, and the translation is explicitly disabled by > >>software. > >ATS allows an I/O device to request a translation from the IOMMU. > >The device can then cache that translation and use the translated address > >in a PCIe memory transaction. PCIe uses a couple of previously reserved > >bits in the transaction layer packet header to describe the address > >type for memory transactions. The default (00) maps to legacy PCIe and > >describes the memory address as untranslated. This is the normal mode, > >and could then incur a translation if an IOMMU is present and programmed > >w/ page tables, etc. as is passes through the host bridge. > > > >Another type is simply a transaction requesting a translation. This is > >new, and allows a device to request (and cache) a translation from the > >IOMMU for subsequent use. > > > >The third type is a memory transaction tagged as already translated. > >This is the type of transaction an ATS capable I/O device will generate > >when it was able to translate the memory address from its own cache. > > > >Of course, there's also an invalidation request that the IOMMU can send > >to ATS capable I/O devices to invalidate the cached translation. > > For emulated device, it seems like we can ignore ATS completely, no? Not if you want to emulate an ATS capable device ;) Eariler upthread I said: IOW, if qemu ever had a device with ATS support... So, that should've been a much bigger _IF_ thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Chris Wright (chr...@sous-sol.org) wrote: > * Paul Brook (p...@codesourcery.com) wrote: > > > > The right approach IMHO is to convert devices to use bus-specific > > > > functions to access memory. The bus specific functions should have > > > > a device argument as the first parameter. > > > > > > As for ATS, the internal api to handle the device's dma reqeust needs > > > a notion of a translated vs. an untranslated request. IOW, if qemu ever > > > had a device with ATS support, the device would use its local cache to > > > translate the dma address and then submit a translated request to the > > > pci bus (effectively doing a raw cpu physical memory* in that case). > > > > Really? Can you provide an documentation to support this claim? Wow...color me surprised...there's actually some apparently public "training" docs that might help give a more complete view: http://www.pcisig.com/developers/main/training_materials/get_document?doc_id=0ab681ba7001e40cdb297ddaf279a8de82a7dc40 ATS discussion starts on slide 23. > > My impression is that there is no difference between translated and > > untranslated devices, and the translation is explicitly disabled by > > software. And now that I re-read that sentence, I see what you are talking about. Yes, there is the above notion as well. A device can live in a 1:1 mapping of device address space to physical memory. This could be achieved in a few ways (all done by the OS software programming the IOMMU). One is to simply create a set of page tables that map 1:1 all of device memory to physical memory. Another is to somehow mark the device as special (either omit translation tables or mark the translation entry as effectively "do not translate"). This is often referred to as Pass Through mode. But this is not the same as ATS. Pass Through mode is the functional equivalent of disabling the translation/isolation capabilities of the IOMMU. It's typically used when an OS wants to keep a device for itself and isn't interested in the isolation properties of the IOMMU. It then only creates isolating translation tables for devices it's giving to unprivileged software (e.g. Linux/KVM giving a device to a guest, Linux giving a device to user space process, etc.) > ATS allows an I/O device to request a translation from the IOMMU. > The device can then cache that translation and use the translated address > in a PCIe memory transaction. PCIe uses a couple of previously reserved > bits in the transaction layer packet header to describe the address > type for memory transactions. The default (00) maps to legacy PCIe and > describes the memory address as untranslated. This is the normal mode, > and could then incur a translation if an IOMMU is present and programmed > w/ page tables, etc. as is passes through the host bridge. > > Another type is simply a transaction requesting a translation. This is > new, and allows a device to request (and cache) a translation from the > IOMMU for subsequent use. > > The third type is a memory transaction tagged as already translated. > This is the type of transaction an ATS capable I/O device will generate > when it was able to translate the memory address from its own cache. > > Of course, there's also an invalidation request that the IOMMU can send > to ATS capable I/O devices to invalidate the cached translation. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 07:52 PM, Chris Wright wrote: Really? Can you provide an documentation to support this claim? My impression is that there is no difference between translated and untranslated devices, and the translation is explicitly disabled by software. ATS allows an I/O device to request a translation from the IOMMU. The device can then cache that translation and use the translated address in a PCIe memory transaction. PCIe uses a couple of previously reserved bits in the transaction layer packet header to describe the address type for memory transactions. The default (00) maps to legacy PCIe and describes the memory address as untranslated. This is the normal mode, and could then incur a translation if an IOMMU is present and programmed w/ page tables, etc. as is passes through the host bridge. Another type is simply a transaction requesting a translation. This is new, and allows a device to request (and cache) a translation from the IOMMU for subsequent use. The third type is a memory transaction tagged as already translated. This is the type of transaction an ATS capable I/O device will generate when it was able to translate the memory address from its own cache. Of course, there's also an invalidation request that the IOMMU can send to ATS capable I/O devices to invalidate the cached translation. For emulated device, it seems like we can ignore ATS completely, no? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
* Paul Brook (p...@codesourcery.com) wrote: > > > The right approach IMHO is to convert devices to use bus-specific > > > functions to access memory. The bus specific functions should have > > > a device argument as the first parameter. > > > > As for ATS, the internal api to handle the device's dma reqeust needs > > a notion of a translated vs. an untranslated request. IOW, if qemu ever > > had a device with ATS support, the device would use its local cache to > > translate the dma address and then submit a translated request to the > > pci bus (effectively doing a raw cpu physical memory* in that case). > > Really? Can you provide an documentation to support this claim? > My impression is that there is no difference between translated and > untranslated devices, and the translation is explicitly disabled by software. ATS allows an I/O device to request a translation from the IOMMU. The device can then cache that translation and use the translated address in a PCIe memory transaction. PCIe uses a couple of previously reserved bits in the transaction layer packet header to describe the address type for memory transactions. The default (00) maps to legacy PCIe and describes the memory address as untranslated. This is the normal mode, and could then incur a translation if an IOMMU is present and programmed w/ page tables, etc. as is passes through the host bridge. Another type is simply a transaction requesting a translation. This is new, and allows a device to request (and cache) a translation from the IOMMU for subsequent use. The third type is a memory transaction tagged as already translated. This is the type of transaction an ATS capable I/O device will generate when it was able to translate the memory address from its own cache. Of course, there's also an invalidation request that the IOMMU can send to ATS capable I/O devices to invalidate the cached translation. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote: > > No. PCI devices should never call cpu_physical_memory*. > > PCI devices should call pci_memory*. > > ISA devices should call isa_memory*. > > All device memory accesses should go through their respective buses. > There can be multiple IOMMUs at different levels of the device > hierarchy. If you don't provide bus-level memory access functions that > chain through the hierarchy, it's extremely difficult to implement all > the necessary hooks to perform the translations at different places. > > Regards, > > Anthony Liguori > I liked Paul's initial approach more, at least if I understood him correctly. Basically I'm suggesting a single memory_* function that simply asks the bus for I/O and translation. Say you have something like this: + Bus 1 | Memory 1 | ---+ Bus 2 bridge | Memory 2 | ---+ Bus 3 bridge | Device Say Device wants to write to memory. If we have the DeviceState we needn't concern whether this is a BusOneDevice or BusTwoDevice from device code itself. We would just call memory_rw(dev_state, addr, buf, size, is_write); which simply recurses through DeviceState's and BusState's through their parent pointers. The actual bus can set up those to provide identification information and perhaps hooks for translation and access checking. So memory_rw() looks like this (pseudocode): static void memory_rw(DeviceState *dev, target_phys_addr_t addr, uint8_t *buf, int size, int is_write) { BusState *bus = get_parent_bus_of_dev(dev); DeviceState *pdev = get_parent_dev(dev); target_phys_addr_t taddr; if (!bus) { /* This shouldn't happen. */ assert(0); } if (bus->responsible_for(addr)) { raw_physical_memory_rw(addr, buf, size, is_write); return; } taddr = bus->translate(dev, addr); memory_rw(pdev, taddr, buf, size, is_write); } If we do this, it seems there's no need to provide separate functions. The actual buses must instead initialize those hooks properly. Translation here is something inherent to the bus, that handles arbitration between possibly multiple IOMMUs. Our memory would normally reside on / belong to the top-level bus. What do you think? (Naming could be better though.) Eduard -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM vs. PCI-E Function Level Reset (FLR) ...
[[Sorry for any duplicates -- something is going wrong with my local mail relay and my outbound mail is being rejected. I'm heading over to our IT guy's office to have a "Serious Talk" with him ... -- Casey]] | From: Sheng Yang | Date: Wednesday, July 14, 2010 06:31 pm | | On Thursday 15 July 2010 02:01:29 Casey Leedom wrote: | > | From: Sheng Yang | > | Date: Tuesday, July 13, 2010 05:53 pm | | (Please use reply to all next time.) Sorry, old habit. | > | On Wednesday 14 July 2010 04:41:01 Casey Leedom wrote: | | > Hhrrmmm, this seems like a semantic error. "Resetting" the Vm should | > be morally equivalent to resetting a real physical machine. And when | > a real physical machine is reset, all of its buses, etc. get reset which | > propagates to Device Resets on those buses ... I think that Assigned | > Devices should be reset for reboots and power off/on ... | | Yes, it should be done when reset. And power on/off should be there, | because that's means the create and destroy the guest. Okay, cool. I've looked through the kernel code and I don't see any likely places for putting such a VM Reboot/Reset feature in so I assume that this is also controlled by QEmu and it would need to be responsible for either doing a KVM_DEASSIGN_PCI_DEVICE/KVM_ASSIGN_PCI_DEVICE ioctl() pair for each assigned device or issuing a new KVM_RESET_ASSIGNED_PCI_DEVICE ioctl() for Reboot/Reset ... | > Assigned Devices. For PCI-E SR-IOV Virtual Functions which are assigned | > to a VM, they need to be reset at reboot/power off/power on and the | > Configuration Space emulation needs to support the Guest OS/VM Device | > Driver issuing an FLR ... | | You can add the FLR support for your device if you need to call it in the | guest. But I don't quite understand why you need to call it in the guest. | KVM should has already done that, and it's not necessary for native | device. This was mostly for device driver load/unload support. I.e. being able to issue a Function Level Reset to a PCI Device Function (regardless of it being an SR-IOV Virtual Function or not) is a nice way to zap the device back to a canonical state. | So you want to issue FLR(rather than probing the feature) when | driver->probe() called? Current seems KVM is the only user of | pci_reset_function(), so I think we can update it after your driver posted | to the mailing list. Also I am not sure why you want to reset function | when probing. KVM should cover them all I think. Remember, the "probe" argument in pci_dev_reset() has nothing to do with whether it's being called from a device driver's probe() routine. The "probe" argument in pci_dev_reset() is used for the two-stage effort in pci_reset_function() which does: 1. "try to see if it's possible to reset this function by itself" Call pci_dev_reset(dev, probe=1); 2. "go ahead and do the function reset" Call pci_dev_reset(dev, probe=0); And yes, I'd noticed that KVM was the only current subscriber of the kernel interface but, as I noted above, it is useful for resetting the device function into a known state -- or at least it was useful till the deadlock got introduced in 2.6.31 ... :-( And again, the use of this actually has no dependency on KVM: I would want to be able to call pci_reset_function() in any device driver's probe() routine ... It just happens that I ran into this need (and unfortunate 2.6.31 deadlock) with our PCI-E SR-IOV Virtual Function Driver. Oh, and the driver (cxgb4vf) has been posted to net-next. I'm guessing that it should get merged into 2.6.35 ... or maybe 2.6.36 ... I'm really not sure of how the merge schedule between net-next and the core kernel runs ... Casey -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Cannot start new VMs when the host system is under load
I just noticed that if I leave off the -nodefaults from the qemu command line, the guest will reliably start up again. I don't know what reasons the libvirt developers had for putting it there, though... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM-AUTOTEST PATCH 9/9] [RFC] KVM test: add WHQL test definitions to tests_base.cfg.sample
The parameters that define submissions (dd_name_*, dd_data_*, desc_path_*) were collected from manually created submissions. I haven't yet collected the parameters for Windows 2003 and 2008, so for now WHQL tests are disabled for these versions. This patch also adds a comment in tests.cfg.sample, encouraging the user to specify the DTM server's address and ports there. Note that this patch renames WinVista.32sp1 to WinVista.32.sp1, WinVista.32sp2 to WinVista.32.sp2, WinVista.64sp1 to WinVista.64.sp1 and WinVista.64sp2 to WinVista.64.sp2. Signed-off-by: Michael Goldish --- client/tests/kvm/tests.cfg.sample |8 + client/tests/kvm/tests_base.cfg.sample | 230 2 files changed, 184 insertions(+), 54 deletions(-) diff --git a/client/tests/kvm/tests.cfg.sample b/client/tests/kvm/tests.cfg.sample index e01406e..cdb7b0a 100644 --- a/client/tests/kvm/tests.cfg.sample +++ b/client/tests/kvm/tests.cfg.sample @@ -73,6 +73,14 @@ variants: only Fedora.13.64 only unattended_install.cdrom boot shutdown +# You may provide information about the DTM server for WHQL tests here: +#whql: +#server_address = 10.20.30.40 +#server_shell_port = 10022 +#server_file_transfer_port = 10023 +# Note that the DTM server must run rss.exe (available under deps/), +# preferably with administrator privileges. + # Uncomment the following lines to enable abort-on-error mode: #abort_on_error = yes #kill_vm.* ?= no diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index d0b8acb..dabcdc4 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -285,6 +285,80 @@ variants: iozone_cmd = "D:\IOzone\iozone.exe -a" iozone_timeout = 3600 +- @whql: install setup unattended_install.cdrom +nic_mode = tap +# Replace this with the address of an installed DTM server +server_address = 10.20.30.40 +# The server should run rss.exe like a regular Windows VM, preferably +# with administrator privileges (or at least with permission to write +# to the DTM studio directory) +server_shell_port = 10022 +server_file_transfer_port = 10023 +server_studio_path = %programfiles%\Microsoft Driver Test Manager\Studio +wtt_services = wttsvc +variants: +- whql_client_install: +type = whql_client_install +dsso_delete_machine_binary = deps/whql_delete_machine_15.exe +# The username and password are required for accessing the DTM client +# installer binary shared by the server +server_username = administrator +server_password = 1q2w3eP +# This path refers to a shared directory on the server +# (the final cmd will be something like \\servername\DTMInstall\...) +install_cmd = \DTMInstall\Client\Setup.exe /passive +install_timeout = 1800 +- whql_submission:whql_client_install +type = whql_submission +dsso_test_binary = deps/whql_submission_15.exe +test_timeout = 3600 +device_data = cat0 cat1 cat2 cat3 logoarch logoos whqlos whqlqual prog desc filter virt +descriptors = desc1 desc2 desc3 +# DeviceData names +dd_name_cat0 = Category +dd_name_cat1 = Category +dd_name_cat2 = Category +dd_name_cat3 = Category +dd_name_logoarch = LogoProcessorArchitecture +dd_name_logoos = LogoOperatingSystem +dd_name_whqlos = WhqlOs +dd_name_whqlqual = WhqlQualification +dd_name_prog = LogoProgramId +dd_name_desc = LogoProgramDescription +dd_name_filter = WDKFilterAttribute +dd_name_virt = ParaVirtualizationDriver +# Common DeviceData data +dd_data_filter = FilterIfNoInf +dd_data_virt = True +variants: +- keyboard: +# test_device is a regular expression that should match a device's +# name as it appears in device manager. The first device that matches +# is used. +test_device = keyboard +dd_data_cat0 = Input\Keyboard +dd_data_cat1 = Device Fundamentals +dd_data_cat2 = System Fundamentals\Dynamic Partitioning +dd_data_prog = InputKbd +dd_data_desc = Input > Keyboard +- hdd: +test_device =
[KVM-AUTOTEST PATCH 8/9] [RFC] KVM test: add whql_client_install test
whql_client_install installs the DTM client on a guest. It requires a functioning external DTM server which runs rss.exe like regular Windows VMs. Signed-off-by: Michael Goldish --- client/tests/kvm/tests/whql_client_install.py | 110 + 1 files changed, 110 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/whql_client_install.py diff --git a/client/tests/kvm/tests/whql_client_install.py b/client/tests/kvm/tests/whql_client_install.py new file mode 100644 index 000..f46939e --- /dev/null +++ b/client/tests/kvm/tests/whql_client_install.py @@ -0,0 +1,110 @@ +import logging, time, os, re +from autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils, rss_file_transfer + + +def run_whql_client_install(test, params, env): +""" +WHQL DTM client installation: +1) Log into the guest (the client machine) and into a DTM server machine +2) Stop the DTM client service (wttsvc) on the client machine +3) Delete the client machine from the server's data store +4) Rename the client machine (give it a randomly generated name) +5) Move the client machine into the server's workgroup +6) Reboot the client machine +7) Install the DTM client software + +@param test: kvm test object +@param params: Dictionary with the test parameters +@param env: Dictionary with test environment. +""" +vm = kvm_test_utils.get_living_vm(env, params.get("main_vm")) +session = kvm_test_utils.wait_for_login(vm, 0, 240) + +# Collect test params +server_address = params.get("server_address") +server_shell_port = int(params.get("server_shell_port")) +server_file_transfer_port = int(params.get("server_file_transfer_port")) +server_studio_path = params.get("server_studio_path", "%programfiles%\\ " +"Microsoft Driver Test Manager\\Studio") +server_username = params.get("server_username") +server_password = params.get("server_password") +dsso_delete_machine_binary = params.get("dsso_delete_machine_binary", +"deps/whql_delete_machine_15.exe") +dsso_delete_machine_binary = kvm_utils.get_path(test.bindir, +dsso_delete_machine_binary) +install_timeout = float(params.get("install_timeout", 600)) +install_cmd = params.get("install_cmd") +wtt_services = params.get("wtt_services") + +# Stop WTT service(s) on client +for svc in wtt_services.split(): +kvm_test_utils.stop_windows_service(session, svc) + +# Copy dsso_delete_machine_binary to server +rss_file_transfer.upload(server_address, server_file_transfer_port, + dsso_delete_machine_binary, server_studio_path, + timeout=60) + +# Open a shell session with server +server_session = kvm_utils.remote_login("nc", server_address, +server_shell_port, "", "", +session.prompt, session.linesep) + +# Get server and client information +cmd = "echo %computername%" +server_name = server_session.get_command_output(cmd).strip() +client_name = session.get_command_output(cmd).strip() +cmd = "wmic computersystem get workgroup" +server_workgroup = server_session.get_command_output(cmd).strip() + +# Delete the client machine from the server's data store (if it's there) +server_session.get_command_output("cd %s" % server_studio_path) +cmd = "%s %s %s" % (os.path.basename(dsso_delete_machine_binary), +server_name, client_name) +server_session.get_command_output(cmd, print_func=logging.info) +server_session.close() + +# Rename the client machine +client_name = "autotest_%s" % kvm_utils.generate_random_string(4) +logging.info("Renaming client machine to '%s'" % client_name) +cmd = ('wmic computersystem where name="%%computername%%" rename name="%s"' + % client_name) +s = session.get_command_status(cmd, timeout=600) +if s != 0: +raise error.TestError("Could not rename the client machine") + +# Join the server's workgroup +logging.info("Joining workgroup '%s'" % server_workgroup) +cmd = ('wmic computersystem where name="%%computername%%" call ' + 'joindomainorworkgroup name="%s"' % server_workgroup) +s = session.get_command_status(cmd, timeout=600) +if s != 0: +raise error.TestError("Could not change the client's workgroup") + +# Reboot +session = kvm_test_utils.reboot(vm, session) + +# Access shared resources on the server machine +logging.info("Attempting to access remote share on server") +cmd = r"net use \\%s /user:%s %s" % (server_name, server_username, + server_password) +end_time = time.time() + 120 +while time.time() <
[KVM-AUTOTEST PATCH 7/9] [RFC] KVM test: add whql_submission test
whql_submission runs a submission on a given device. It requires a functioning external DTM server which runs rss.exe like regular Windows VMs, preferably with administrator permissions. The submission is defined by descriptors and device_data objects, which are specified in the config file(s). All jobs of the submission are executed. When all jobs complete, or when the timeout expires, HTML reports are generated and copied to test.debugdir (client/results/default/kvm...whql_submission/debug) and the raw test logs (wtl or xml files) are copied to test.debugdir as well. Signed-off-by: Michael Goldish --- client/tests/kvm/tests/whql_submission.py | 176 + 1 files changed, 176 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/whql_submission.py diff --git a/client/tests/kvm/tests/whql_submission.py b/client/tests/kvm/tests/whql_submission.py new file mode 100644 index 000..fd238ff --- /dev/null +++ b/client/tests/kvm/tests/whql_submission.py @@ -0,0 +1,176 @@ +import logging, time, os, re +from autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils, rss_file_transfer + + +def run_whql_submission(test, params, env): +""" +WHQL submission test: +1) Log into the guest (the client machine) and into a DTM server machine +2) Copy the automation program binary (dsso_test_binary) to the server machine +3) Run the automation program +4) Pass the program all relevant parameters (e.g. device_data) +5) Wait for the program to terminate +6) Parse and report job results +(logs and HTML reports are placed in test.bindir) + +@param test: kvm test object +@param params: Dictionary with the test parameters +@param env: Dictionary with test environment. +""" +vm = kvm_test_utils.get_living_vm(env, params.get("main_vm")) +session = kvm_test_utils.wait_for_login(vm, 0, 240) + +# Collect parameters +server_address = params.get("server_address") +server_shell_port = int(params.get("server_shell_port")) +server_file_transfer_port = int(params.get("server_file_transfer_port")) +server_studio_path = params.get("server_studio_path", "%programfiles%\\ " +"Microsoft Driver Test Manager\\Studio") +dsso_test_binary = params.get("dsso_test_binary", + "deps/whql_submission_15.exe") +dsso_test_binary = kvm_utils.get_path(test.bindir, dsso_test_binary) +test_device = params.get("test_device") +test_timeout = float(params.get("test_timeout", 600)) +wtt_services = params.get("wtt_services") + +# Restart WTT service(s) on the client +logging.info("Restarting WTT services on client") +for svc in wtt_services.split(): +kvm_test_utils.stop_windows_service(session, svc) +for svc in wtt_services.split(): +kvm_test_utils.start_windows_service(session, svc) + +# Copy dsso_test_binary to the server +rss_file_transfer.upload(server_address, server_file_transfer_port, + dsso_test_binary, server_studio_path, timeout=60) + +# Open a shell session with the server +server_session = kvm_utils.remote_login("nc", server_address, +server_shell_port, "", "", +session.prompt, session.linesep) + +# Get the computer names of the server and client +cmd = "echo %computername%" +server_name = server_session.get_command_output(cmd).strip() +client_name = session.get_command_output(cmd).strip() +session.close() + +# Run the automation program on the server +server_session.get_command_output("cd %s" % server_studio_path) +cmd = "%s %s %s %s %s %s" % (os.path.basename(dsso_test_binary), + server_name, + client_name, + "%s_pool" % client_name, + "%s_submission" % client_name, + test_timeout) +server_session.sendline(cmd) + +# Helper function: wait for a given prompt and raise an exception if an +# error occurs +def find_prompt(prompt): +m, o = server_session.read_until_last_line_matches( +[prompt, server_session.prompt], print_func=logging.info, +timeout=600) +if m != 0: +errors = re.findall("^Error:.*$", o, re.I | re.M) +if errors: +raise error.TestError(errors[0]) +else: +raise error.TestError("Error running automation program: could " + "not find '%s' prompt" % prompt) + +# Tell the automation program which device to test +find_prompt("Device to test:") +server_session.sendline(test_device) + +# Give the automation program all the device data supplied by the user +find_prompt("D
[KVM-AUTOTEST PATCH 6/9] [RFC] KVM test: add utility functions start_windows_service() and stop_windows_service()
These utilities use sc to stop and start windows services. They're used by whql_submission and whql_client_install to stop or restart wttsvc on the client machine. Signed-off-by: Michael Goldish --- client/tests/kvm/kvm_test_utils.py | 46 1 files changed, 46 insertions(+), 0 deletions(-) diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py index 53c11ae..9fd3a74 100644 --- a/client/tests/kvm/kvm_test_utils.py +++ b/client/tests/kvm/kvm_test_utils.py @@ -242,6 +242,52 @@ def migrate(vm, env=None, mig_timeout=3600, mig_protocol="tcp", raise +def stop_windows_service(session, service, timeout=120): +""" +Stop a Windows service using sc. +If the service is already stopped or is not installed, do nothing. + +@param service: The name of the service +@param timeout: Time duration to wait for service to stop +@raise: error.TestError is raised if the service can't be stopped +""" +end_time = time.time() + timeout +while time.time() < end_time: +o = session.get_command_output("sc stop %s" % service, timeout=60) +# FAILED 1060 means the service isn't installed. +# FAILED 1062 means the service hasn't been started. +if re.search(r"\bFAILED (1060|1062)\b", o, re.I): +break +time.sleep(1) +else: +raise error.TestError("Could not stop service '%s'" % service) + + +def start_windows_service(session, service, timeout=120): +""" +Start a Windows service using sc. +If the service is already running, do nothing. +If the service isn't installed, fail. + +@param service: The name of the service +@param timeout: Time duration to wait for service to start +@raise: error.TestError is raised if the service can't be started +""" +end_time = time.time() + timeout +while time.time() < end_time: +o = session.get_command_output("sc start %s" % service, timeout=60) +# FAILED 1060 means the service isn't installed. +if re.search(r"\bFAILED 1060\b", o, re.I): +raise error.TestError("Could not start service '%s' " + "(service not installed)" % service) +# FAILED 1056 means the service is already running. +if re.search(r"\bFAILED 1056\b", o, re.I): +break +time.sleep(1) +else: +raise error.TestError("Could not start service '%s'" % service) + + def get_time(session, time_command, time_filter_re, time_format): """ Return the host time and guest time. If the guest time cannot be fetched -- 1.5.4.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM-AUTOTEST PATCH 5/9] [RFC] KVM test: DTM machine deletion tool for WHQL tests
This C# program should run on a DTM server. It's used by the whql_client_install test to delete the client machine from the server's data store (if listed there) prior to client installation. This seems to be necessary to prevent trouble during testing (like failure to contact the client machine). Note: the binary is copied to the server at run time, so it doesn't need to be packaged in winutils.iso. Signed-off-by: Michael Goldish --- client/tests/kvm/deps/whql_delete_machine_15.cs | 82 ++ client/tests/kvm/deps/whql_delete_machine_15.exe | Bin 0 -> 10240 bytes 2 files changed, 82 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/deps/whql_delete_machine_15.cs create mode 100644 client/tests/kvm/deps/whql_delete_machine_15.exe diff --git a/client/tests/kvm/deps/whql_delete_machine_15.cs b/client/tests/kvm/deps/whql_delete_machine_15.cs new file mode 100644 index 000..1d78a6d --- /dev/null +++ b/client/tests/kvm/deps/whql_delete_machine_15.cs @@ -0,0 +1,82 @@ +// DTM machine deletion tool +// Author: Michael Goldish +// Based on sample code by Microsoft. + +using System; +using System.Collections.Generic; +using System.Text.RegularExpressions; +using Microsoft.DistributedAutomation.DeviceSelection; +using Microsoft.DistributedAutomation.SqlDataStore; + +namespace automate0 +{ +class AutoJob +{ +static int Main(string[] args) +{ +if (args.Length != 2) +{ +Console.WriteLine("Error: incorrect number of command line arguments"); +Console.WriteLine("Usage: {0} serverName clientName", +System.Environment.GetCommandLineArgs()[0]); +return 1; +} +string serverName = args[0]; +string clientName = args[1]; + +try +{ +// Initialize DeviceScript and connect to data store +Console.WriteLine("Initializing DeviceScript object"); +DeviceScript script = new DeviceScript(); +Console.WriteLine("Connecting to data store"); +script.ConnectToNamedDataStore(serverName); + +// Find the client machine +IResourcePool rootPool = script.GetResourcePoolByName("$"); +Console.WriteLine("Looking for client machine '{0}'", clientName); +IResource machine = rootPool.GetResourceByName(clientName); +if (machine == null) +{ +Console.WriteLine("Client machine not found"); +return 0; +} +Console.WriteLine("Client machine '{0}' found ({1}, {2})", +clientName, machine.OperatingSystem, machine.ProcessorArchitecture); + +// Change the client machine's status to 'unsafe' +Console.WriteLine("Changing the client machine's status to 'Unsafe'"); +try +{ +machine.ChangeResourceStatus("Unsafe"); +} +catch (Exception e) +{ +Console.WriteLine("Warning: " + e.Message); +} +while (machine.Status != "Unsafe") +{ +try +{ +machine = rootPool.GetResourceByName(clientName); +} +catch (Exception e) +{ +Console.WriteLine("Warning: " + e.Message); +} +System.Threading.Thread.Sleep(1000); +} + +// Delete the client machine from datastore +Console.WriteLine("Deleting client machine from data store"); +script.DeleteResource(machine.Id); +return 0; +} +catch (Exception e) +{ +Console.WriteLine("Error: " + e.Message); +return 1; +} +} +} +} diff --git a/client/tests/kvm/deps/whql_delete_machine_15.exe b/client/tests/kvm/deps/whql_delete_machine_15.exe new file mode 100644 index ..3817ac42d7748b87af3683e86e1137757ed6073a GIT binary patch literal 10240 zcmehm...@zb$@%j;~h`rS#u{HC5p6`q$D0i@<@tGBPo(iilmft_#yKs*>0pr%exh| z%H8dGc26e#(I^N~IDn(rKw1=PTGvI>HU)y(v5Uq{Z5RD;QNRw8B1K&kR^7sIjMS)u zHvLM_sJ}P6%R5q%tDya-K$hIMZ{EE3=6%h~?%eFpK1oeP6vlo1I?>nh080{?zX`8ed(Vd{+!Pm7T06Ok*+zfO)LZ02gl>Pjq%eX=3tBJ-x zOSD=2*PnjVW1+{PH_!Mu(Hk6ujn_G%R?yx!Ow_rt?}OUcT+?f+NcX2?I*f+jD)EGDaeb^6>CSfO9HKC5Zbsj8H)f*U(J9g z-o7`Kz8Vj;U)>bicC|IM{c2mN<7y^~o>W-=TOj731mdZPJfQh{M5Nhn0Zl~YMbPXx z_c...@tk(eB^4FTQF{~9qt`l8G5IX|M9fBX;+ZxEY|k|k(ZRzaYA4Vb9P<(T4z$vv zyU;}G#0Piyh)-`tQ-FTON{_Va?;=k...@+p6m}s`Q0JFCO?14co zow?zmnw?l...@`{_6)rw_cZy&$WsB9>3}`$&JMs- zn_K}|wE1~TKm|uLrKeNf0|$jJy05w*XsJ6yo4CWql#AP#BJsg?aYrg4e}{!VveGb+ z5Z&$...@bsa(63n
[KVM-AUTOTEST PATCH 4/9] [RFC] KVM test: DTM automation program for WHQL tests
This C# program should run on a DTM/WHQL server. It's used by the whql_submission test to schedule and monitor device submission jobs. Note: the binary is copied to the server at run time, so it doesn't need to be packaged in winutils.iso. Signed-off-by: Michael Goldish --- client/tests/kvm/deps/whql_submission_15.cs | 254 ++ client/tests/kvm/deps/whql_submission_15.exe | Bin 0 -> 10240 bytes 2 files changed, 254 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/deps/whql_submission_15.cs create mode 100644 client/tests/kvm/deps/whql_submission_15.exe diff --git a/client/tests/kvm/deps/whql_submission_15.cs b/client/tests/kvm/deps/whql_submission_15.cs new file mode 100644 index 000..540674a --- /dev/null +++ b/client/tests/kvm/deps/whql_submission_15.cs @@ -0,0 +1,254 @@ +// DTM submission automation program +// Author: Michael Goldish +// Based on sample code by Microsoft. + +// Note: this program has only been tested with DTM version 1.5. +// It might fail to work with other versions, specifically because it uses +// a few undocumented methods/attributes. + +using System; +using System.Collections.Generic; +using System.Text.RegularExpressions; +using Microsoft.DistributedAutomation.DeviceSelection; +using Microsoft.DistributedAutomation.SqlDataStore; + +namespace automate0 +{ +class AutoJob +{ +static int Main(string[] args) +{ +if (args.Length != 5) +{ +Console.WriteLine("Error: incorrect number of command line arguments"); +Console.WriteLine("Usage: {0} serverName clientName machinePoolName submissionName timeout", +System.Environment.GetCommandLineArgs()[0]); +return 1; +} +string serverName = args[0]; +string clientName = args[1]; +string machinePoolName = args[2]; +string submissionName = args[3]; +double timeout = Convert.ToDouble(args[4]); + +try +{ +// Initialize DeviceScript and connect to data store +Console.WriteLine("Initializing DeviceScript object"); +DeviceScript script = new DeviceScript(); +Console.WriteLine("Connecting to data store"); + +script.ConnectToNamedDataStore(serverName); + +// Find client machine +IResourcePool rootPool = script.GetResourcePoolByName("$"); +Console.WriteLine("Looking for client machine '{0}'", clientName); +IResource machine = null; +while (true) +{ +try +{ +machine = rootPool.GetResourceByName(clientName); +} +catch (Exception e) +{ +Console.WriteLine("Warning: " + e.Message); +} +// Make sure the machine is valid +if (machine != null && +machine.OperatingSystem != null && +machine.OperatingSystem.Length > 0 && +machine.ProcessorArchitecture != null && +machine.ProcessorArchitecture.Length > 0 && +machine.GetDevices().Length > 0) +break; +System.Threading.Thread.Sleep(1000); +} +Console.WriteLine("Client machine '{0}' found ({1}, {2})", +clientName, machine.OperatingSystem, machine.ProcessorArchitecture); + +// Create machine pool and add client machine to it +// (this must be done because jobs cannot be scheduled for machines in the +// default pool) +try +{ +script.CreateResourcePool(machinePoolName, rootPool.ResourcePoolId); +} +catch (Exception e) +{ +Console.WriteLine("Warning: " + e.Message); +} +IResourcePool newPool = script.GetResourcePoolByName(machinePoolName); +Console.WriteLine("Moving the client machine to pool '{0}'", machinePoolName); +machine.ChangeResourcePool(newPool); + +// Reset client machine +if (machine.Status != "Ready") +{ +Console.WriteLine("Changing the client machine's status to 'Reset'"); +while (true) +{ +try +{ +machine.ChangeResourceStatus("Reset"); +break; +} +catch (Exception e) +{ +Console.WriteLine("Warning: " + e.Message);
[KVM-AUTOTEST PATCH 3/9] KVM test: rss_file_transfer.py: add convenience functions upload() and download()
Signed-off-by: Michael Goldish --- client/tests/kvm/rss_file_transfer.py | 32 ++-- 1 files changed, 26 insertions(+), 6 deletions(-) diff --git a/client/tests/kvm/rss_file_transfer.py b/client/tests/kvm/rss_file_transfer.py index c584397..3de8259 100755 --- a/client/tests/kvm/rss_file_transfer.py +++ b/client/tests/kvm/rss_file_transfer.py @@ -395,6 +395,30 @@ class FileDownloadClient(FileTransferClient): raise +def upload(address, port, src_pattern, dst_path, timeout=60, + connect_timeout=10): +""" +Connect to server and upload files. + +@see: FileUploadClient +""" +client = FileUploadClient(address, port, connect_timeout) +client.upload(src_pattern, dst_path, timeout) +client.close() + + +def download(address, port, src_pattern, dst_path, timeout=60, + connect_timeout=10): +""" +Connect to server and upload files. + +@see: FileDownloadClient +""" +client = FileDownloadClient(address, port, connect_timeout) +client.download(src_pattern, dst_path, timeout) +client.close() + + def main(): import optparse @@ -418,13 +442,9 @@ def main(): port = int(port) if options.download: -client = FileDownloadClient(address, port) -client.download(src_pattern, dst_path, timeout=options.timeout) -client.close() +download(address, port, src_pattern, dst_path, options.timeout) elif options.upload: -client = FileUploadClient(address, port) -client.upload(src_pattern, dst_path, timeout=options.timeout) -client.close() +upload(address, port, src_pattern, dst_path, options.timeout) if __name__ == "__main__": -- 1.5.4.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM-AUTOTEST PATCH 2/9] KVM test: allow definition of multiple cdroms
Instead of using 'cdrom_extra', define 'cdroms' similarly to 'images'. Also set -drive indices for cd1 and image1. Also fix regular expression in tests.cfg.sample, so it doesn't match 'cdroms' when it shouldn't. Note: if you use your own tests.cfg, you should change the line cdrom.* ?<= ... to cdrom(_.*)? ?<= ... It won't hurt to do the same for image_name. Signed-off-by: Michael Goldish --- client/tests/kvm/kvm_vm.py | 19 +++ client/tests/kvm/tests.cfg.sample |4 ++-- client/tests/kvm/tests_base.cfg.sample | 10 -- 3 files changed, 17 insertions(+), 16 deletions(-) diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py index 248aeca..bdc9aab 100755 --- a/client/tests/kvm/kvm_vm.py +++ b/client/tests/kvm/kvm_vm.py @@ -414,18 +414,13 @@ class VM: if smp: qemu_cmd += add_smp(help, smp) -iso = params.get("cdrom") -if iso: -iso = kvm_utils.get_path(root_dir, iso) -qemu_cmd += add_cdrom(help, iso, 2) - -# Even though this is not a really scalable approach, -# it doesn't seem like we are going to need more than -# 2 CDs active on the same VM. -iso_extra = params.get("cdrom_extra") -if iso_extra: -iso_extra = kvm_utils.get_path(root_dir, iso_extra) -qemu_cmd += add_cdrom(help, iso_extra, 3) +cdroms = kvm_utils.get_sub_dict_names(params, "cdroms") +for cdrom in cdroms: +cdrom_params = kvm_utils.get_sub_dict(params, cdrom) +iso = cdrom_params.get("cdrom") +if iso: +qemu_cmd += add_cdrom(help, kvm_utils.get_path(root_dir, iso), + cdrom_params.get("drive_index")) # We may want to add {floppy_otps} parameter for -fda # {fat:floppy:}/path/. However vvfat is not usually recommended. diff --git a/client/tests/kvm/tests.cfg.sample b/client/tests/kvm/tests.cfg.sample index 6d5f244..e01406e 100644 --- a/client/tests/kvm/tests.cfg.sample +++ b/client/tests/kvm/tests.cfg.sample @@ -13,8 +13,8 @@ include cdkeys.cfg # * All image files are expected under /tmp/kvm_autotest_root/images/ # * All iso files are expected under /tmp/kvm_autotest_root/isos/ qemu_img_binary = /usr/bin/qemu-img -image_name.* ?<= /tmp/kvm_autotest_root/images/ -cdrom.* ?<= /tmp/kvm_autotest_root/isos/ +image_name(_.*)? ?<= /tmp/kvm_autotest_root/images/ +cdrom(_.*)? ?<= /tmp/kvm_autotest_root/isos/ # Here are the test sets variants. The variant 'qemu_kvm_windows_quick' is # fully commented, the following ones have comments only on noteworthy points diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample index c2def29..d0b8acb 100644 --- a/client/tests/kvm/tests_base.cfg.sample +++ b/client/tests/kvm/tests_base.cfg.sample @@ -3,9 +3,10 @@ # Define the objects we'll be using vms = vm1 images = image1 +cdroms = cd1 nics = nic1 monitors = humanmonitor1 -login_timeout = 360 + # Choose the main VM and monitor main_vm = vm1 main_monitor = humanmonitor1 @@ -33,8 +34,10 @@ qemu_img_binary = qemu-img smp = 1 mem = 512 image_size = 10G +drive_index_image1 = 0 shell_port = 22 display = vnc +drive_index_cd1 = 2 # Monitor params monitor_type = human @@ -56,6 +59,7 @@ run_tcpdump = yes # Misc profilers = kvm_stat +login_timeout = 360 # Tests @@ -1044,7 +1048,9 @@ variants: unattended_install.cdrom: timeout = 7200 finish_program = deps/finish.exe -cdrom_extra = windows/winutils.iso +cdroms += " extracd" +cdrom_extracd = windows/winutils.iso +drive_index_extracd = 3 migrate: migration_test_command = ver && vol migration_bg_command = start ping -t localhost -- 1.5.4.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM-AUTOTEST PATCH 1/9] KVM test: kvm_vm.py: make -drive index optional for both images and cdrom ISOs
Signed-off-by: Michael Goldish --- client/tests/kvm/kvm_vm.py | 15 +-- 1 files changed, 9 insertions(+), 6 deletions(-) diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py index 6cd0688..248aeca 100755 --- a/client/tests/kvm/kvm_vm.py +++ b/client/tests/kvm/kvm_vm.py @@ -214,16 +214,18 @@ class VM: def add_smp(help, smp): return " -smp %s" % smp -def add_cdrom(help, filename, index=2): +def add_cdrom(help, filename, index=None): if has_option(help, "drive"): -return " -drive file='%s',index=%d,media=cdrom" % (filename, - index) +cmd = " -drive file='%s',media=cdrom" % filename +if index is not None: cmd += ",index=%s" % index +return cmd else: return " -cdrom '%s'" % filename -def add_drive(help, filename, format=None, cache=None, werror=None, - serial=None, snapshot=False, boot=False): +def add_drive(help, filename, index=None, format=None, cache=None, + werror=None, serial=None, snapshot=False, boot=False): cmd = " -drive file='%s'" % filename +if index is not None: cmd += ",index=%s" % index if format: cmd += ",if=%s" % format if cache: cmd += ",cache=%s" % cache if werror: cmd += ",werror=%s" % werror @@ -362,6 +364,7 @@ class VM: continue qemu_cmd += add_drive(help, get_image_filename(image_params, root_dir), + image_params.get("drive_index"), image_params.get("drive_format"), image_params.get("drive_cache"), image_params.get("drive_werror"), @@ -414,7 +417,7 @@ class VM: iso = params.get("cdrom") if iso: iso = kvm_utils.get_path(root_dir, iso) -qemu_cmd += add_cdrom(help, iso) +qemu_cmd += add_cdrom(help, iso, 2) # Even though this is not a really scalable approach, # it doesn't seem like we are going to need more than -- 1.5.4.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Cannot start new VMs when the host system is under load
Hi, I've got a problem where I cannot start any new VMs with KVM if the host machine is under high CPU load. The problem is not 100% reproducible (it works sometimes), but under load conditions, it happens most of the time - roughly 95%. I'm usually using libvirt to start and stop KVM VMs. When using virsh to start a new VM under those conditions, the output looks like this: virsh # start testserver-a error: Failed to start domain testserver-a error: monitor socket did not show up.: Connection refused (There is a very long wait after the command has been sent until the error message shows up.) This is (an example of) the command line that libvirtd uses to start up qemu: - snip - LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin HOME=/root USER=root LOGNAME=root QEMU_AUDIO_DRV=none /usr/bin/qemu-kvm -S -M pc-0.12 -enable-kvm -m 256 -smp 1,sockets=1,cores=1,threads=1 -name testserver-a -uuid 7cbb3665-4d58-86b8- ce8f-20541995a99c -nodefaults -chardev socket,id=monitor,path=/usr/local/var/lib/libvirt/qemu/testserver- a.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -no- acpi -boot c -device lsi,id=scsi0,bus=pci.0,addr=0x7 -drive file=/data/testserver-a-system.img,if=none,id=drive-scsi0-0-1,boot=on -device scsi-disk,bus=scsi0.0,scsi-id=1,drive=drive-scsi0-0-1,id=scsi0-0-1 -drive file=/data/testserver-a-data1.img,if=none,id=drive-virtio-disk1 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk1,id=virtio-disk1 - drive file=/data/testserver-a-data2.img,if=none,id=drive-virtio-disk2 -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk2,id=virtio-disk2 - drive file=/data/gentoo-install-amd64- minimal-20100408.iso,if=none,media=cdrom,id=drive-ide0-0-0,readonly=on -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/data/testserver-a_configfloppy.img,if=none,id=drive-fdc0-0-0 -global isa-fdc.driveA=drive-fdc0-0-0 -device e1000,vlan=0,id=net0,mac=52:54:00:84:6d:69,bus=pci.0,addr=0x6 -net tap,fd=24,vlan=0,name=hostnet0 -usb -vnc 127.0.0.1:1,password -k de -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 - snip - Copy-pasting this to a commandline on the host to start qemu manually leads to a non-functional qemu process that "just sits there" with nothing happening. The monitor socket /usr/local/var/lib/libvirt/qemu/testserver-a.monitor will, indeed, not show up. I've tried starting qemu with the same commandline but without the parameters for redirecting the monitor to a socket, without the fd parameter for the network interface and without the vnc parameter. This resulted in a black window with the title "QEMU (testserver-a) [Stopped]". I could not access the monitor console in graphical mode either. Some experimentation I've done suggests that this problem only happens if the high cpu load is caused by another KVM process, not if it is caused by something else running on the machine. The host machine I'm running this on has got 16 cores in total. It looks like it is sufficient for this bug to surface if at least one of these cores is brought to near 100% use by a KVM process. The version of qemu I'm using is qemu-kvm 0.12.4, built from source. Libvirt is version 0.8.1, built from source as well. The host OS is Fedora 12. The Kernel version is 2.6.32.12-115.fc12.x86_64. Does anybody have any idea what might be causing this and what might be done against it? Regards, Guido -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] netfilter: add CHECKSUM target
Am 11.07.2010 12:47, schrieb Michael S. Tsirkin: > On Fri, Jul 09, 2010 at 05:17:36PM +0200, Patrick McHardy wrote: >> Am 09.07.2010 00:29, schrieb Michael S. Tsirkin: >>> This adds a `CHECKSUM' target, which can be used in the iptables mangle >>> table. >>> >>> You can use this target to compute and fill in the checksum in >>> an IP packet that lacks a checksum. This is particularly useful, >>> if you need to work around old applications such as dhcp clients, >>> that do not work well with checksum offloads, but don't want to >>> disable checksum offload in your device. >>> >>> The problem happens in the field with virtualized applications. >>> For reference, see Red Hat bz 60, as well as >>> http://www.spinics.net/lists/kvm/msg37660.html >>> >>> Typical expected use (helps old dhclient binary running in a VM): >>> iptables -A POSTROUTING -t mangle -p udp --dport 68 -j CHECKSUM >>> --checksum-fill >> >> I'm not sure this is something we want to merge upstream and >> support indefinitely. Dave suggested this as a temporary >> out-of-tree workaround until the majority of guest dhcp clients >> are fixed. Has anything changed that makes this course of >> action impractical? > > If I understand what Dave said correctly, it's up to you ... > > The arguments for putting this upstream are: > > Given the track record, I wouldn't hope for quick fix in the majority of > guest dhcp clients, unfortunately :(. We are talking years here. > Even after that, one of the uses of virtualization is > to keep old guests running. So yes, I think we'll > keep using work-arounds for this for a very long time. > > Further, since we have to add the module and we have to teach management > to program it, it will be much less painful for everyone > involved if we can put the code upstream, rather than forking > management code. Fair enough, its simple enough that I don't expect much maintenance overhead. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 11:49:20AM +0100, Paul Brook wrote: > > An IOMMU is not necessarily part of a bus bridge. By concept an IOMMU > > can also be implemented on a plugin-card translating only that card. > > Real implementations that I am aware of always implement the IOMMU in > > the PCI root bridge, though. > > If the IOMMU is implemented on the card, then it isn't an interesting case. > It's effectively just a complex form of scatter-gather. > > If the on-card MMU can delegate pagetable walks to an external device then > IMO > that's also an unrelated feature, and requires an additional data channel. But that would be handled by the same IOMMU emulation code, so the hooks need to be usable there too. > My point still stands. You should not be pushing the IOMMU handling into > device specific code. All you need to do is make the memory access routines > aware of which device caused the access. Right, the device does not need to know too much about the IOMMU in the general case. The iommu_get/iommu_read/iommu_write interface should replaced by the pci_memory* functions like suggested by Anthony. > The fact that the GART can translate CPU accesses proves my point. If you > have separate code for CPU and devices, then you need to duplicate the GART > handling code. You can configure the GART to translate device accesses only, cpu accesses only, or to translate both. This is hard to handle if cpu and device emulation use the same memory access functions. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Thu, Jul 15, 2010 at 07:45:06AM -0500, Anthony Liguori wrote: > On 07/15/2010 04:10 AM, Joerg Roedel wrote: >> If this means a seperate interface for device dma accesses and not fold >> that functionality into the cpu_physical_memory* interface I agree too :-) >> > No. PCI devices should never call cpu_physical_memory*. Fully agreed. > PCI devices should call pci_memory*. > > ISA devices should call isa_memory*. This is a seperate interface. I like the idea and as you stated below it has clear advantages, so lets go this way. > All device memory accesses should go through their respective buses. > There can be multiple IOMMUs at different levels of the device > hierarchy. If you don't provide bus-level memory access functions that > chain through the hierarchy, it's extremely difficult to implement all > the necessary hooks to perform the translations at different places. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> >>> Depending how the we decide to handle IOMMU invalidation, it may also > >>> be necessary to augment the memory_map API to allow the system to > >>> request a mapping be revoked. However this issue is not specific to > >>> the IOMMU implementation. Such bugs are already present on any system > >>> that allows dynamic reconfiguration of the address space, e.g. by > >>> changing PCI BARs. > >> > >> That's why the memory_map API today does not allow mappings to persist > >> after trips back to the main loop. > > > > Sure it does. If you can't combine zero-copy memory access with > > asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c > > DMA's a very special case. Special compared to what? The whole purpose of this API is to provide DMA. > DMA is performed asynchronously to the > execution of the CPU so you generally can't make any guarantees about > what state the transaction is in until it's completed. That gives us a > fair bit of wiggle room when dealing with a DMA operation to a region of > physical memory where the physical memory mapping is altered in some way > during the transaction. You do have ordering constraints though. While it may not be possible to directly determine whether the DMA completed before or after the remapping, and you might not be able to make any assumptions about the atomicity of the transaction as a whole, it is reasonable to assume that any writes to the old mapping will occur before the remapping operation completes. While things like store buffers potentially allows reordering and deferral of accesses, there are generally fairly tight constraints on this. For example a PCI hast bridge may buffer CPU writes. However it will guarantee that those writes have been flushed out before a subsequent read operation completes. Consider the case where the hypervisor allows passthough of a device, using the IOMMU to support DMA from that device into virtual machine RAM. When that virtual machine is destroyed the IOMMU mapping for that device will be invalidated. Once the invalidation has completed that RAM can be reused by the hypervisor for other purposes. This may happen before the device is reset. We probably don't really care what happens to the device in this case, but we do need to prevent the device stomping on ram it no longer owns. There are two ways this can be handled: If your address translation mechanism allows updates to be deferred indefinitely then we can stall until all relevant DMA transactions have completed. This is probably sufficient for well behaved guests, but potentially opens up a significant window for DoS attacks. If you need the remapping to occur in a finite timeframe (in the PCI BAR case this is probably before the next CPU access to that bus) then you need some mechanism for revoking the host mapping provided by cpu_physical_memory_map. Note that a QEMU DMA transaction typically encompasses a whole block of data. The transaction is started when the AIO request is issued, and remains live until the transfer completes. This includes the time taken to fetch the data from external media/devices. On real hardware a DMA transaction typically only covers a single burst memory write (maybe 16 bytes). This will generally not start until the device has buffered sufficient data to satisfy the burst (or has sufficient buffer space to receive the whole burst). Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Freezing Windows 2008 x64bit guest
On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote: > Hi, > > we are running several kvm servers with kvm 0.12.4 and kernel 2.6.32 > (actually we run proxmox ve) > > These proxmox servers are running all kinds of vms rock-solid, be it > linux or windows 32 platforms. > > But one Windows 2008 64 Bit Server Standard is freezing regularly. > This happens sometimes 3 times a day, sometimes it takes 2 days > until freeze. The Windows Machine is a clean fresh install. > > By Freezing I mean that the Machine console still shows the > windows logo but does not accept any keystroke and does not answer > to ping or rdp or whatever. The virtual cpu-load is then shown > as 100% (by proxmox). > > We tried several combinations of ide or virtio, several virtual network cards > (intel and realtek) and single-cpu vs multi cpu. The Problem persists. > > The Windows Machine is a quite fresh install but runs exchange and 10 rdp > sessions. > > We are experienced linux-admins but we dont find any hints on the kvm servers > what might be the problem, no syslogs,dmesgs, nothing. Also there are no > dumps on the windows guest. > > Do you have any hints what might be the problem and what we could do to debug > the problem and to get more information of the root cause ? > When hang occurs ensure that problematic vm is the only on on the server and run kvm_stat. Send output here. Also do the following: # mount -t debugfs debugfs /sys/kernel/debug # echo kvm > /sys/kernel/debug/tracing/set_event # sleep 1 # cat /sys/kernel/debug/tracing/trace > /tmp/trace Send /tmp/trace here too, but it may be huge, so send only last 1000 lines. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Freezing Windows 2008 x64bit guest
Hi, we are running several kvm servers with kvm 0.12.4 and kernel 2.6.32 (actually we run proxmox ve) These proxmox servers are running all kinds of vms rock-solid, be it linux or windows 32 platforms. But one Windows 2008 64 Bit Server Standard is freezing regularly. This happens sometimes 3 times a day, sometimes it takes 2 days until freeze. The Windows Machine is a clean fresh install. By Freezing I mean that the Machine console still shows the windows logo but does not accept any keystroke and does not answer to ping or rdp or whatever. The virtual cpu-load is then shown as 100% (by proxmox). We tried several combinations of ide or virtio, several virtual network cards (intel and realtek) and single-cpu vs multi cpu. The Problem persists. The Windows Machine is a quite fresh install but runs exchange and 10 rdp sessions. We are experienced linux-admins but we dont find any hints on the kvm servers what might be the problem, no syslogs,dmesgs, nothing. Also there are no dumps on the windows guest. Do you have any hints what might be the problem and what we could do to debug the problem and to get more information of the root cause ? Thanks Christoph -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 04:10 AM, Joerg Roedel wrote: On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote: On 07/14/2010 03:13 PM, Paul Brook wrote: Well, ok, the function name needs fixing too. However I think the only thing missing from the current API is that it does not provide a way to determine which device is performing the access. I agree with Paul. The right approach IMHO is to convert devices to use bus-specific functions to access memory. The bus specific functions should have a device argument as the first parameter. If this means a seperate interface for device dma accesses and not fold that functionality into the cpu_physical_memory* interface I agree too :-) No. PCI devices should never call cpu_physical_memory*. PCI devices should call pci_memory*. ISA devices should call isa_memory*. All device memory accesses should go through their respective buses. There can be multiple IOMMUs at different levels of the device hierarchy. If you don't provide bus-level memory access functions that chain through the hierarchy, it's extremely difficult to implement all the necessary hooks to perform the translations at different places. Regards, Anthony Liguori Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vhost-net: avoid flush under lock
On Thu, Jul 15, 2010 at 03:19:12PM +0300, Michael S. Tsirkin wrote: > We flush under vq mutex when changing backends. > This creates a deadlock as workqueue being flushed > needs this lock as well. > > https://bugzilla.redhat.com/show_bug.cgi?id=612421 > > Drop the vq mutex before flush: we have the device mutex > which is sufficient to prevent another ioctl from touching > the vq. > > Signed-off-by: Michael S. Tsirkin Dave, just to clarify, I'll send pull request to merge it through my tree, there's no need for you to bother with this. > --- > drivers/vhost/net.c |5 + > 1 files changed, 5 insertions(+), 0 deletions(-) > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > index 28d7786..50df58e6 100644 > --- a/drivers/vhost/net.c > +++ b/drivers/vhost/net.c > @@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, > unsigned index, int fd) > rcu_assign_pointer(vq->private_data, sock); > vhost_net_enable_vq(n, vq); > done: > + mutex_unlock(&vq->mutex); > + > if (oldsock) { > vhost_net_flush_vq(n, index); > fput(oldsock->file); > } > > + mutex_unlock(&n->dev.mutex); > + return 0; > + > err_vq: > mutex_unlock(&vq->mutex); > err: > -- > 1.7.2.rc0.14.g41c1c -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On 07/15/2010 05:33 AM, Paul Brook wrote: Depending how the we decide to handle IOMMU invalidation, it may also be necessary to augment the memory_map API to allow the system to request a mapping be revoked. However this issue is not specific to the IOMMU implementation. Such bugs are already present on any system that allows dynamic reconfiguration of the address space, e.g. by changing PCI BARs. That's why the memory_map API today does not allow mappings to persist after trips back to the main loop. Sure it does. If you can't combine zero-copy memory access with asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c DMA's a very special case. DMA is performed asynchronously to the execution of the CPU so you generally can't make any guarantees about what state the transaction is in until it's completed. That gives us a fair bit of wiggle room when dealing with a DMA operation to a region of physical memory where the physical memory mapping is altered in some way during the transaction. However, that is not true in the general case. Regards, Anthony Liguori Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vhost-net: avoid flush under lock
We flush under vq mutex when changing backends. This creates a deadlock as workqueue being flushed needs this lock as well. https://bugzilla.redhat.com/show_bug.cgi?id=612421 Drop the vq mutex before flush: we have the device mutex which is sufficient to prevent another ioctl from touching the vq. Signed-off-by: Michael S. Tsirkin --- drivers/vhost/net.c |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 28d7786..50df58e6 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd) rcu_assign_pointer(vq->private_data, sock); vhost_net_enable_vq(n, vq); done: + mutex_unlock(&vq->mutex); + if (oldsock) { vhost_net_flush_vq(n, index); fput(oldsock->file); } + mutex_unlock(&n->dev.mutex); + return 0; + err_vq: mutex_unlock(&vq->mutex); err: -- 1.7.2.rc0.14.g41c1c -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 4/7] ide: IOMMU support
> On Wed, Jul 14, 2010 at 02:53:03PM +0100, Paul Brook wrote: > > > Memory accesses must go through the IOMMU layer. > > > > No. Devices should not know or care whether an IOMMU is present. > > They don't really care. iommu_get() et al. are convenience functions > which can and do return NULL when there's no IOMMU and device code can > pass that NULL around without checking. Devices should not need to know any of this. You're introducing a significant amount of duplication and complexity into every device. The assumption that all accesses will go through the same IOMMU is also false. Accesses to devices on the same bus will not be translated by the IOMMU. Currently there are probably also other things that will break in this case, but your API seems fundamentally incapable of handling this. > I could've probably made the r/w > functions take only the DeviceState in addition to normal args, but > wanted to avoid looking up the related structures on each I/O operation. That's exactly what you should be doing. If this is inefficient then there are much better ways of fixing this. e.g. by not having the device perform so many accesses, or by adding some sort of translation cache. > > You should be adding a DeviceState argument to > > cpu_physical_memory_{rw,map}. This should then handle IOMMU translation > > transparently. > > > > You also need to accomodate the the case where multiple IOMMU are > > present. > > We don't assume there's a single IOMMU in the generic layer. The > callbacks within 'struct iommu' could very well dispatch the request to > one of multiple, coexisting IOMMUs. So you've now introduced yet another copy of the translation code. Not only does every device have to be IOMMU aware, but every IOMMU also has to be aware of nested IOMMUs. Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote: > > A device performs a memory access on its local bus. It has no knowledge > > of how that access is routed to its destination. The device should not > > be aware of any IOMMUs, in the same way that it doesn't know whether it > > happens to be accessing RAM or memory mapped peripherals on another > > device. > > Right. > > > Each IOMMU is fundamentally part of a bus bridge. For example the bridge > > between a PCI bus and the system bus. It provides a address mapping from > > one bus to another. > > An IOMMU is not necessarily part of a bus bridge. By concept an IOMMU > can also be implemented on a plugin-card translating only that card. > Real implementations that I am aware of always implement the IOMMU in > the PCI root bridge, though. If the IOMMU is implemented on the card, then it isn't an interesting case. It's effectively just a complex form of scatter-gather. If the on-card MMU can delegate pagetable walks to an external device then IMO that's also an unrelated feature, and requires an additional data channel. > > There should be no direct interaction between an IOMMU and a device > > (ignoring ATS, which is effectively a separate data channel). > > Everything should be done via the cpu_phsycial_memory_* code. Likewise > > on a system with multiple nested IOMMUs there should be no direct > > interatcion between these. > > cpu_physical_memory_* should walk the device/bus tree to determine where > > the access terminates, applying mappings appropriately. > > Thats the point where I disagree. I think there should be a seperate set > of functions independent from cpu_physical_memory_* to handle device > memory accesses. This would keep the changes small and non-intrusive. > Beside that, real memory controlers can also handle cpu memory accesses > different from device memory accesses. The AMD northbridge GART uses > this to decide whether it needs to remap a request or not. The GART can > be configured to translate cpu and device accesses seperatly. My point still stands. You should not be pushing the IOMMU handling into device specific code. All you need to do is make the memory access routines aware of which device caused the access. The fact that the GART can translate CPU accesses proves my point. If you have separate code for CPU and devices, then you need to duplicate the GART handling code. Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> > Depending how the we decide to handle IOMMU invalidation, it may also be > > necessary to augment the memory_map API to allow the system to request a > > mapping be revoked. However this issue is not specific to the IOMMU > > implementation. Such bugs are already present on any system that allows > > dynamic reconfiguration of the address space, e.g. by changing PCI BARs. > > That's why the memory_map API today does not allow mappings to persist > after trips back to the main loop. Sure it does. If you can't combine zero-copy memory access with asynchronous IO then IMO it's fairly useless. See e.g. dma-helpers.c Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
> > The right approach IMHO is to convert devices to use bus-specific > > functions to access memory. The bus specific functions should have > > a device argument as the first parameter. > > As for ATS, the internal api to handle the device's dma reqeust needs > a notion of a translated vs. an untranslated request. IOW, if qemu ever > had a device with ATS support, the device would use its local cache to > translate the dma address and then submit a translated request to the > pci bus (effectively doing a raw cpu physical memory* in that case). Really? Can you provide an documentation to support this claim? My impression is that there is no difference between translated and untranslated devices, and the translation is explicitly disabled by software. Paul -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: several messages
On Thursday 2010-07-15 11:36, Patrick McHardy wrote: >> +struct xt_CHECKSUM_info { >> +u_int8_t operation; /* bitset of operations */ > >Please use __u8 in public header files. > >> +#include > >Do you really need in.h? > >> + * $Id$ > >Please no CVS IDs. > >> +switch (c) { >> +case 'F': >> +if (*flags) >> +xtables_error(PARAMETER_PROBLEM, >> +"CHECKSUM target: Only use --checksum-fill >> ONCE!"); > >There is a helper function called xtables_param_act for checking double >arguments and printing a standarized error message. I took care of these for Xt-a. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv3] extensions: libxt_CHECKSUM extension
Am 11.07.2010 17:14, schrieb Michael S. Tsirkin: > diff --git a/extensions/libxt_CHECKSUM.c b/extensions/libxt_CHECKSUM.c > new file mode 100644 > index 000..00fbd8f > --- /dev/null > +++ b/extensions/libxt_CHECKSUM.c > @@ -0,0 +1,99 @@ > +/* Shared library add-on to xtables for CHECKSUM > + * > + * (C) 2002 by Harald Welte > + * (C) 2010 by Red Hat, Inc > + * Author: Michael S. Tsirkin > + * > + * This program is distributed under the terms of GNU GPL v2, 1991 > + * > + * libxt_CHECKSUM.c borrowed some bits from libipt_ECN.c > + * > + * $Id$ Please no CVS IDs. > + */ > +#include > +#include > +#include > +#include > + > +#include > +#include > + > +static void CHECKSUM_help(void) > +{ > + printf( > +"CHECKSUM target options\n" > +" --checksum-fill Fill in packet checksum.\n"); > +} > + > +static const struct option CHECKSUM_opts[] = { > + { "checksum-fill", 0, NULL, 'F' }, > + { .name = NULL } > +}; > + > +static int CHECKSUM_parse(int c, char **argv, int invert, unsigned int > *flags, > + const void *entry, struct xt_entry_target **target) > +{ > + struct xt_CHECKSUM_info *einfo > + = (struct xt_CHECKSUM_info *)(*target)->data; > + > + switch (c) { > + case 'F': > + if (*flags) > + xtables_error(PARAMETER_PROBLEM, > + "CHECKSUM target: Only use --checksum-fill > ONCE!"); There is a helper function called xtables_param_act for checking double arguments and printing a standarized error message. > + einfo->operation = XT_CHECKSUM_OP_FILL; > + *flags |= XT_CHECKSUM_OP_FILL; > + break; > + default: > + return 0; > + } > + > + return 1; > +} > + -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] netfilter: add CHECKSUM target
Am 11.07.2010 17:06, schrieb Michael S. Tsirkin: > +#ifndef _IPT_CHECKSUM_TARGET_H > +#define _IPT_CHECKSUM_TARGET_H > + > +#define XT_CHECKSUM_OP_FILL 0x01/* fill in checksum in IP header */ > + > +struct xt_CHECKSUM_info { > + u_int8_t operation; /* bitset of operations */ Please use __u8 in public header files. > +}; > + > +#endif /* _IPT_CHECKSUM_TARGET_H */ > diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig > index 8593a77..1cf4852 100644 > --- a/net/netfilter/Kconfig > +++ b/net/netfilter/Kconfig > @@ -294,7 +294,7 @@ endif # NF_CONNTRACK > config NETFILTER_TPROXY > tristate "Transparent proxying support (EXPERIMENTAL)" > depends on EXPERIMENTAL > - depends on IP_NF_MANGLE > + depends on IP_NF_MANGLE || IP6_NF_MANGLE This does not seem to belong into this patch. > depends on NETFILTER_ADVANCED > help > This option enables transparent proxying support, that is, > @@ -347,6 +347,21 @@ config NETFILTER_XT_CONNMARK > > comment "Xtables targets" > > +config NETFILTER_XT_TARGET_CHECKSUM > + tristate "CHECKSUM target support" > + depends on NETFILTER_ADVANCED > + ---help--- > + This option adds a `CHECKSUM' target, which can be used in the > iptables mangle > + table. You should add a dependency on the mangle table then. > + > + You can use this target to compute and fill in the checksum in > + a packet that lacks a checksum. This is particularly useful, > + if you need to work around old applications such as dhcp clients, > + that do not work well with checksum offloads, but don't want to > disable > + checksum offload in your device. > + > + To compile it as a module, choose M here. If unsure, say N. > + > config NETFILTER_XT_TARGET_CLASSIFY > tristate '"CLASSIFY" target support' > depends on NETFILTER_ADVANCED > diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile > index 14e3a8f..8eb541d 100644 > --- a/net/netfilter/Makefile > +++ b/net/netfilter/Makefile > @@ -45,6 +45,7 @@ obj-$(CONFIG_NETFILTER_XT_MARK) += xt_mark.o > obj-$(CONFIG_NETFILTER_XT_CONNMARK) += xt_connmark.o > > # targets > +obj-$(CONFIG_NETFILTER_XT_TARGET_CHECKSUM) += xt_CHECKSUM.o > obj-$(CONFIG_NETFILTER_XT_TARGET_CLASSIFY) += xt_CLASSIFY.o > obj-$(CONFIG_NETFILTER_XT_TARGET_CONNSECMARK) += xt_CONNSECMARK.o > obj-$(CONFIG_NETFILTER_XT_TARGET_CT) += xt_CT.o > diff --git a/net/netfilter/xt_CHECKSUM.c b/net/netfilter/xt_CHECKSUM.c > new file mode 100644 > index 000..0fee1a7 > --- /dev/null > +++ b/net/netfilter/xt_CHECKSUM.c > @@ -0,0 +1,72 @@ > +/* iptables module for the packet checksum mangling > + * > + * (C) 2002 by Harald Welte > + * (C) 2010 Red Hat, Inc. > + * > + * Author: Michael S. Tsirkin > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > +*/ > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > +#include Do you really need in.h? > +#include > +#include > +#include > + > +#include > +#include > + > +MODULE_LICENSE("GPL"); > +MODULE_AUTHOR("Michael S. Tsirkin "); > +MODULE_DESCRIPTION("Xtables: checksum modification"); > +MODULE_ALIAS("ipt_CHECKSUM"); > +MODULE_ALIAS("ip6t_CHECKSUM"); > + > +static unsigned int > +checksum_tg(struct sk_buff *skb, const struct xt_action_param *par) > +{ > + if (skb->ip_summed == CHECKSUM_PARTIAL) > + skb_checksum_help(skb); > + > + return XT_CONTINUE; > +} > + > +static int checksum_tg_check(const struct xt_tgchk_param *par) > +{ > + const struct xt_CHECKSUM_info *einfo = par->targinfo; > + > + if (einfo->operation & ~XT_CHECKSUM_OP_FILL) { > + pr_info("unsupported CHECKSUM operation %x\n", > einfo->operation); > + return -EINVAL; > + } > + if (!einfo->operation) { > + pr_info("no CHECKSUM operation enabled\n"); > + return -EINVAL; > + } > + return 0; > +} > + > +static struct xt_target checksum_tg_reg __read_mostly = { > + .name = "CHECKSUM", > + .family = NFPROTO_UNSPEC, > + .target = checksum_tg, > + .targetsize = sizeof(struct xt_CHECKSUM_info), > + .table = "mangle", > + .checkentry = checksum_tg_check, > + .me = THIS_MODULE, > +}; > + > +static int __init checksum_tg_init(void) > +{ > + return xt_register_target(&checksum_tg_reg); > +} > + > +static void __exit checksum_tg_exit(void) > +{ > + xt_unregister_target(&checksum_tg_reg); > +} > + > +module_init(checksum_tg_init); > +module_exit(checksum_tg_exit); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] KVM: x86: Call mask notifiers from pic.
If pit delivers interrupt while pic is masking it OS will never do EOI and ack notifier will not be called so when pit will be unmasked no pit interrupts will be delivered any more. Calling mask notifiers solves this issue. Signed-off-by: Gleb Natapov --- Changelog: v1->v2 Properly handle slave pic. diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c index 819b748..8d10c06 100644 --- a/arch/x86/kvm/i8259.c +++ b/arch/x86/kvm/i8259.c @@ -363,10 +363,20 @@ static void pic_ioport_write(void *opaque, u32 addr, u32 val) } } else switch (s->init_state) { - case 0: /* normal mode */ + case 0: { /* normal mode */ + u8 imr_diff = s->imr ^ val, + off = (s == &s->pics_state->pics[0]) ? 0 : 8; s->imr = val; + for (irq = 0; irq < PIC_NUM_PINS/2; irq++) + if (imr_diff & (1 << irq)) + kvm_fire_mask_notifiers( + s->pics_state->kvm, + SELECT_PIC(irq + off), + irq + off, + !!(s->imr & (1 << irq))); pic_update_irq(s->pics_state); break; + } case 1: s->irq_base = val & 0xf8; s->init_state = 2; -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Wed, Jul 14, 2010 at 09:13:44PM +0100, Paul Brook wrote: > A device performs a memory access on its local bus. It has no knowledge of > how > that access is routed to its destination. The device should not be aware of > any IOMMUs, in the same way that it doesn't know whether it happens to be > accessing RAM or memory mapped peripherals on another device. Right. > Each IOMMU is fundamentally part of a bus bridge. For example the bridge > between a PCI bus and the system bus. It provides a address mapping from one > bus to another. An IOMMU is not necessarily part of a bus bridge. By concept an IOMMU can also be implemented on a plugin-card translating only that card. Real implementations that I am aware of always implement the IOMMU in the PCI root bridge, though. > There should be no direct interaction between an IOMMU and a device (ignoring > ATS, which is effectively a separate data channel). Everything should be > done > via the cpu_phsycial_memory_* code. Likewise on a system with multiple > nested > IOMMUs there should be no direct interatcion between these. > cpu_physical_memory_* should walk the device/bus tree to determine where the > access terminates, applying mappings appropriately. Thats the point where I disagree. I think there should be a seperate set of functions independent from cpu_physical_memory_* to handle device memory accesses. This would keep the changes small and non-intrusive. Beside that, real memory controlers can also handle cpu memory accesses different from device memory accesses. The AMD northbridge GART uses this to decide whether it needs to remap a request or not. The GART can be configured to translate cpu and device accesses seperatly. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: [RFC PATCH 4/7] ide: IOMMU support
On Wed, Jul 14, 2010 at 04:29:18PM -0500, Anthony Liguori wrote: > On 07/14/2010 03:13 PM, Paul Brook wrote: >> Well, ok, the function name needs fixing too. However I think the only thing >> missing from the current API is that it does not provide a way to determine >> which device is performing the access. > > I agree with Paul. > > The right approach IMHO is to convert devices to use bus-specific > functions to access memory. The bus specific functions should have a > device argument as the first parameter. If this means a seperate interface for device dma accesses and not fold that functionality into the cpu_physical_memory* interface I agree too :-) Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: MMU: track dirty page in speculative path properly
Marcelo Tosatti wrote: >> How about just track access bit for speculative path, we set page both >> accessed and >> dirty(if it's writable) only if the access bit is set? > > A useful thing to do would be to allow read-only mappings, in the fault > path (Lai sent a few patches in that direction sometime ago but there > was no follow up). > > So in the case of a read-only fault from the guest, you'd inform > get_user_pages() that read-only access is acceptable (so swapcache pages > can be mapped, or qemu can mprotect(PROT_READ) guest memory). > Yeah, it's a great work, i guess Lai will post the new version soon. And, even we do this, i think the page dirty track is still needed, right? Then, how about my new idea to track page dirty for speculative path, just as below draft patch does: @@ -687,10 +687,11 @@ static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte) if (!is_rmap_spte(old_spte)) return; pfn = spte_to_pfn(old_spte); - if (old_spte & shadow_accessed_mask) + if (old_spte & shadow_accessed_mask) { kvm_set_pfn_accessed(pfn); - if (is_writable_pte(old_spte)) - kvm_set_pfn_dirty(pfn); + if (is_writable_pte(old_spte)) + kvm_set_pfn_dirty(pfn); + } rmap_remove(kvm, sptep); } @@ -1920,8 +1921,11 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, * demand paging). */ spte = shadow_base_present_pte | shadow_dirty_mask; - if (!speculative) + if (!speculative) { spte |= shadow_accessed_mask; + if (is_writable_pte(*sptep)) + kvm_set_pfn_dirty(pfn); + } if (!dirty) pte_access &= ~ACC_WRITE_MASK; if (pte_access & ACC_EXEC_MASK) It uses access bit to track both page accessed and page dirty, and it's rather cheap... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html