Re: [PATCH 02/10] nEPT: MMU context for nested EPT
On 11/13/2011 08:26 PM, Orit Wasserman wrote:

 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
 int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 507e2b8..70d4cfd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -39,6 +39,21 @@
 	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
+#elif PTTYPE == EPT
+	#define pt_element_t u64
+	#define FNAME(name) EPT_##name
+	#define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
+	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
+	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
+	#define PT_LEVEL_BITS PT64_LEVEL_BITS
+	#ifdef CONFIG_X86_64
+	#define PT_MAX_FULL_LEVELS 4
+	#define CMPXCHG cmpxchg
+	#else
+	#define CMPXCHG cmpxchg64
+	#define PT_MAX_FULL_LEVELS 2
+	#endif

The various masks should be defined here, to avoid lots of #ifdefs later.

That's what I did first, but then I was afraid that the mask would be changed for mmu.c too, so I decided on ifdefs. The more I think about it, the more I think we need a wrapper function for mask checking (at least for this file). What do you think?

Either should work, as long as the main logic is clean.

 	for (;;) {
 		gfn_t real_gfn;

@@ -186,9 +215,14 @@ retry_walk:
 		pte_gpa = gfn_to_gpa(table_gfn) + offset;
 		walker->table_gfn[walker->level - 1] = table_gfn;
 		walker->pte_gpa[walker->level - 1] = pte_gpa;
-
+#if PTTYPE == EPT
+		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
+					      EPT_WRITABLE_MASK);
+#else
 		real_gfn = mmu->translate_gpa(vcpu, gfn_to_gpa(table_gfn),
 					      PFERR_USER_MASK|PFERR_WRITE_MASK);
+#endif
+

Unneeded, I think.

Is it because translate_nested_gpa always sets USER_MASK?

Yes...
maybe that function needs to do something like access |= mmu->default_access;

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
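The wrapper floated above could hide the PTTYPE-specific permission check behind one helper. A minimal userspace sketch, assuming illustrative EPT bit values and a hypothetical helper name (not the actual KVM code):

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative EPT permission bits; the real values live in the VMX headers. */
#define EPT_READABLE_MASK   0x1ULL
#define EPT_WRITABLE_MASK   0x2ULL
#define EPT_EXECUTABLE_MASK 0x4ULL

/*
 * Hypothetical wrapper: callers ask "does this EPT PTE grant all of these
 * access rights?" instead of open-coding mask arithmetic behind
 * #ifdef PTTYPE == EPT in paging_tmpl.h.
 */
static bool ept_pte_allows(uint64_t pte, uint64_t access)
{
	/* every requested access bit must be present in the PTE */
	return (pte & access) == access;
}
```

With such a wrapper, the `#if PTTYPE == EPT` in the walker above would collapse into a single call whose EPT/legacy split lives in one place.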
[PATCH 0/4] KVM: Dirty logging optimization using rmap
This is a revised version of my previous work. I hope that the patches are more self explanatory than before.

Thanks,
	Takuya
[PATCH 1/4] KVM: MMU: Clean up BUG_ON() conditions in rmap_write_protect()
Remove redundant checks and use the is_large_pte() macro.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/mmu.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a9b3a32..973f254 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1027,7 +1027,6 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 	spte = rmap_next(kvm, rmapp, NULL);
 	while (spte) {
-		BUG_ON(!spte);
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
 		rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
 		if (is_writable_pte(*spte)) {
@@ -1043,9 +1042,8 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 		rmapp = gfn_to_rmap(kvm, gfn, i);
 		spte = rmap_next(kvm, rmapp, NULL);
 		while (spte) {
-			BUG_ON(!spte);
 			BUG_ON(!(*spte & PT_PRESENT_MASK));
-			BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
+			BUG_ON(!is_large_pte(*spte));
 			pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
 			if (is_writable_pte(*spte)) {
 				drop_spte(kvm, spte);
--
1.7.5.4
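The equivalence this cleanup relies on can be checked in a standalone sketch: once PT_PRESENT_MASK has already been asserted, testing the PSE bit alone is the same as the old compound check. The bit values mirror the x86 definitions; both functions are userspace stand-ins, not the kernel code itself:

```c
#include <stdint.h>
#include <stdbool.h>

#define PT_PRESENT_MASK   (1ULL << 0)
#define PT_PAGE_SIZE_MASK (1ULL << 7)	/* PSE bit: 2M/1G mapping */

/* Userspace stand-in for the kernel's is_large_pte(). */
static bool is_large_pte(uint64_t pte)
{
	return pte & PT_PAGE_SIZE_MASK;
}

/* The old compound check: both PRESENT and PAGE_SIZE must be set. */
static bool old_check(uint64_t pte)
{
	uint64_t m = PT_PAGE_SIZE_MASK | PT_PRESENT_MASK;
	return (pte & m) == m;
}
```

For any spte that already passed BUG_ON(!(*spte & PT_PRESENT_MASK)), the two predicates agree, which is why the patch is a pure cleanup.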
[PATCH 2/4] KVM: MMU: Split gfn_to_rmap() into two functions
rmap_write_protect() calls gfn_to_rmap() for each level with the gfn fixed. This results in calling gfn_to_memslot() repeatedly with that gfn. This patch introduces __gfn_to_rmap(), which takes the slot as an argument to avoid this. This is also needed for the following dirty logging optimization.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/mmu.c |   26 +-
 1 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 973f254..fa71085 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -958,23 +958,29 @@ static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
 	}
 }

-/*
- * Take gfn and return the reverse mapping to it.
- */
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
+static unsigned long *__gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level,
+				    struct kvm_memory_slot *slot)
 {
-	struct kvm_memory_slot *slot;
 	struct kvm_lpage_info *linfo;

-	slot = gfn_to_memslot(kvm, gfn);
 	if (likely(level == PT_PAGE_TABLE_LEVEL))
 		return &slot->rmap[gfn - slot->base_gfn];

 	linfo = lpage_info_slot(gfn, slot, level);
-
 	return &linfo->rmap_pde;
 }

+/*
+ * Take gfn and return the reverse mapping to it.
+ */
+static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
+{
+	struct kvm_memory_slot *slot;
+
+	slot = gfn_to_memslot(kvm, gfn);
+	return __gfn_to_rmap(kvm, gfn, level, slot);
+}
+
 static bool rmap_can_add(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu_memory_cache *cache;
@@ -1019,12 +1025,14 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)

 static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
+	struct kvm_memory_slot *slot;
 	unsigned long *rmapp;
 	u64 *spte;
 	int i, write_protected = 0;

-	rmapp = gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL);
+	slot = gfn_to_memslot(kvm, gfn);
+	rmapp = __gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL, slot);
 	spte = rmap_next(kvm, rmapp, NULL);
 	while (spte) {
 		BUG_ON(!(*spte & PT_PRESENT_MASK));
@@ -1039,7 +1047,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 	/* check for huge page mappings */
 	for (i = PT_DIRECTORY_LEVEL;
 	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
-		rmapp = gfn_to_rmap(kvm, gfn, i);
+		rmapp = __gfn_to_rmap(kvm, gfn, i, slot);
 		spte = rmap_next(kvm, rmapp, NULL);
 		while (spte) {
 			BUG_ON(!(*spte & PT_PRESENT_MASK));
--
1.7.5.4
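The shape of the optimization, looking the slot up once and then indexing into it per level, can be sketched in plain C. The types are simplified and only the 4K-page case is modeled:

```c
#include <stdint.h>

/* Simplified memslot: a base gfn plus one rmap head per 4K page. */
struct slot_sim {
	uint64_t base_gfn;
	unsigned long rmap[16];
};

/*
 * Like __gfn_to_rmap() at PT_PAGE_TABLE_LEVEL: with the slot already
 * known, the lookup is a single array index, no repeated slot search.
 */
static unsigned long *gfn_to_rmap_cached(struct slot_sim *slot, uint64_t gfn)
{
	return &slot->rmap[gfn - slot->base_gfn];
}
```

A caller that needs the rmap for several levels of the same gfn does the slot search once and then calls this per level, which is exactly the saving the patch buys.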
[PATCH 4/4] KVM: Optimize dirty logging by rmap_write_protect()
Currently, write protecting a slot needs to walk all the shadow pages and check the ones which have a pte mapping a page in it. The walk is overly heavy when dirty pages in that slot are not so many, and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only the sptes which can be reached from gfns marked in the dirty bitmap when the number of dirty pages is less than that of shadow pages. This criterion is reasonable in its meaning and worked well in our test: write protection became several times faster than before when the ratio of dirty pages was low, and was not worse even when the ratio was near the criterion.

Note that the locking for this write protection becomes fine grained. The reason why this is safe is described in the comments.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/mmu.c              |   14 +++---
 arch/x86/kvm/x86.c              |   58 ++-
 3 files changed, 63 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d83264..69b6525 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -648,6 +648,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 int kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
+int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
+			       struct kvm_memory_slot *slot);
 void kvm_mmu_zap_all(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fa71085..aecdea2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1023,15 +1023,13 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
 	rmap_remove(kvm, sptep);
 }

-static int rmap_write_protect(struct kvm *kvm, u64 gfn)
+int kvm_mmu_rmap_write_protect(struct kvm *kvm, u64 gfn,
+			       struct kvm_memory_slot *slot)
 {
-	struct kvm_memory_slot *slot;
 	unsigned long *rmapp;
 	u64 *spte;
 	int i, write_protected = 0;

-	slot = gfn_to_memslot(kvm, gfn);
-
 	rmapp = __gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL, slot);
 	spte = rmap_next(kvm, rmapp, NULL);
 	while (spte) {
@@ -1066,6 +1064,14 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 	return write_protected;
 }

+static int rmap_write_protect(struct kvm *kvm, u64 gfn)
+{
+	struct kvm_memory_slot *slot;
+
+	slot = gfn_to_memslot(kvm, gfn);
+	return kvm_mmu_rmap_write_protect(kvm, gfn, slot);
+}
+
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			   unsigned long data)
 {

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5fa6801..1985ea1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3461,6 +3461,50 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 	return 0;
 }

+/**
+ * write_protect_slot - write protect a slot for dirty logging
+ * @kvm: the kvm instance
+ * @memslot: the slot we protect
+ * @dirty_bitmap: the bitmap indicating which pages are dirty
+ * @nr_dirty_pages: the number of dirty pages
+ *
+ * We have two ways to find all sptes to protect:
+ * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
+ *    checks ones that have a spte mapping a page in the slot.
+ * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ *
+ * Generally speaking, if there are not so many dirty pages compared to the
+ * number of shadow pages, we should use the latter.
+ *
+ * Note that letting others write into a page marked dirty in the old bitmap
+ * by using the remaining tlb entry is not a problem. That page will become
+ * write protected again when we flush the tlb and then be reported dirty to
+ * the user space by copying the old bitmap.
+ */
+static void write_protect_slot(struct kvm *kvm,
+			       struct kvm_memory_slot *memslot,
+			       unsigned long *dirty_bitmap,
+			       unsigned long nr_dirty_pages)
+{
+	/* Not many dirty pages compared to # of shadow pages. */
+	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
+		unsigned long gfn_offset;
+
+		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
+			unsigned long gfn = memslot->base_gfn + gfn_offset;
+
+			spin_lock(&kvm->mmu_lock);
+			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
+			spin_unlock(&kvm->mmu_lock);
+		}
+		kvm_flush_remote_tlbs(kvm);
+	} else {
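The control flow of write_protect_slot() can be mimicked in userspace: walk only the set bits of the dirty bitmap when they are few, otherwise take one bulk pass. The bitmap helper below is a simplified stand-in for the kernel's for_each_set_bit(); the threshold is the same comparison the patch uses:

```c
#include <stdint.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Simplified stand-in for test_bit(). */
static int test_bit_sim(unsigned long nr, const unsigned long *bitmap)
{
	return (bitmap[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

/*
 * Returns the number of per-gfn protections performed, or -1 when the
 * bulk kvm_mmu_slot_remove_write_access()-style path would be taken.
 */
static long protect_sim(const unsigned long *dirty_bitmap,
			unsigned long npages,
			unsigned long nr_dirty_pages,
			unsigned long n_used_mmu_pages)
{
	if (nr_dirty_pages < n_used_mmu_pages) {
		long done = 0;
		for (unsigned long i = 0; i < npages; i++)
			if (test_bit_sim(i, dirty_bitmap))
				done++;	/* per-gfn rmap write protect */
		return done;
	}
	return -1;	/* bulk slot-wide walk */
}
```

The per-gfn path does work proportional to the number of dirty pages, which is why it wins when the dirty ratio is low.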
[PATCH 3/4] KVM: Count the number of dirty pages for dirty logging
Needed for the next patch which uses this number to decide how to write protect a slot.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/x86.c       |    9 +++--
 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |    4 +++-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9eff4af..5fa6801 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3467,10 +3467,9 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 			       struct kvm_dirty_log *log)
 {
-	int r, i;
+	int r;
 	struct kvm_memory_slot *memslot;
 	unsigned long n;
-	unsigned long is_dirty = 0;

 	mutex_lock(&kvm->slots_lock);

@@ -3485,11 +3484,8 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	n = kvm_dirty_bitmap_bytes(memslot);

-	for (i = 0; !is_dirty && i < n/sizeof(long); i++)
-		is_dirty = memslot->dirty_bitmap[i];
-
 	/* If nothing is dirty, don't bother messing with page tables. */
-	if (is_dirty) {
+	if (memslot->nr_dirty_pages) {
 		struct kvm_memslots *slots, *old_slots;
 		unsigned long *dirty_bitmap;
@@ -3504,6 +3500,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 			goto out;
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
+		slots->memslots[log->slot].nr_dirty_pages = 0;
 		slots->generation++;

 		old_slots = kvm->memslots;

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c6a2ec9..7c654aa 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -181,6 +181,7 @@ struct kvm_memory_slot {
 	unsigned long *rmap;
 	unsigned long *dirty_bitmap;
 	unsigned long *dirty_bitmap_head;
+	unsigned long nr_dirty_pages;
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 	unsigned long userspace_addr;
 	int user_alloc;

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4c5b9a2..af5c988 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -625,6 +625,7 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 		return -ENOMEM;
 	memslot->dirty_bitmap_head = memslot->dirty_bitmap;
+	memslot->nr_dirty_pages = 0;
 	return 0;
 }
 #endif /* !CONFIG_S390 */
@@ -1491,7 +1492,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;

-		__set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (!__test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
+			memslot->nr_dirty_pages++;
 	}
 }
--
1.7.5.4
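The counting trick in mark_page_dirty_in_slot(), bumping the counter only on a 0-to-1 transition, can be verified with a userspace model of __test_and_set_bit_le(). This sketch is non-atomic and ignores endianness, matching the double-underscore kernel helper's contract on a little-endian box:

```c
#include <stdbool.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Non-atomic test-and-set, modeled on __test_and_set_bit(). */
static bool test_and_set_bit_sim(unsigned long nr, unsigned long *bitmap)
{
	unsigned long *word = &bitmap[nr / BITS_PER_LONG];
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	bool old = *word & mask;

	*word |= mask;
	return old;	/* previous value of the bit */
}

/* Sketch of the patched mark_page_dirty_in_slot() logic. */
static void mark_dirty_sim(unsigned long *bitmap, unsigned long rel_gfn,
			   unsigned long *nr_dirty_pages)
{
	if (!test_and_set_bit_sim(rel_gfn, bitmap))
		(*nr_dirty_pages)++;
}
```

Marking the same gfn repeatedly leaves the counter unchanged after the first hit, so nr_dirty_pages tracks the number of set bits, not the number of write events.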
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
Am 12.11.2011 11:25, schrieb Avi Kivity: On 11/11/2011 12:15 PM, Kevin Wolf wrote: Am 10.11.2011 22:30, schrieb Anthony Liguori: Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach. I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen(). Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? Not sure what results and what computation you mean, but let me clarify a bit about bdrv_reopen: The main purpose of bdrv_reopen() is to change flags, for example toggle O_SYNC during runtime in order to allow the guest to toggle WCE. This doesn't necessarily mean a close()/open() sequence if there are other means to change the flags, like fcntl() (or even using other protocols than files). The idea here was to extend this to invalidate all caches if some specific flag is set. As you don't change any other flag, this will usually not be a reopen on a lower level. If we need to use open() though, and it fails (this is really the only different result that comes to mind) then bdrv_reopen() would fail and the old fd would stay in use. Migration would have to fail, but I don't think this case is ever needed for reopening after migration. What's wrong with just delaying the open? Nothing, except that with today's code it's harder to do. 
Kevin
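Changing flags without a close()/open() cycle, as mentioned above, is what fcntl(F_SETFL) provides for plain files. A minimal sketch; note that Linux honors only a subset of flags this way (e.g. O_NONBLOCK, O_APPEND, O_DIRECT) and notably ignores O_SYNC, which is one reason a real bdrv_reopen() cannot always avoid an actual reopen:

```c
#include <fcntl.h>
#include <unistd.h>

/* Toggle one status flag on an open fd in place; returns 0 on success. */
static int set_fd_flag(int fd, int flag, int on)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	flags = on ? (flags | flag) : (flags & ~flag);
	return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}
```

If fcntl() fails, the old flags stay in effect and the fd remains usable, which mirrors the "bdrv_reopen() fails, the old fd stays in use" behavior described above.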
Re: [PATCH 3/4] KVM: Count the number of dirty pages for dirty logging
On 11/14/2011 11:23 AM, Takuya Yoshikawa wrote:

Needed for the next patch which uses this number to decide how to write protect a slot.

 	/* If nothing is dirty, don't bother messing with page tables. */
-	if (is_dirty) {
+	if (memslot->nr_dirty_pages) {
 		struct kvm_memslots *slots, *old_slots;
 		unsigned long *dirty_bitmap;
@@ -3504,6 +3500,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 			goto out;
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
+		slots->memslots[log->slot].nr_dirty_pages = 0;
 		slots->generation++;

 #endif /* !CONFIG_S390 */
@@ -1491,7 +1492,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;

-		__set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (!__test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
+			memslot->nr_dirty_pages++;
 	}
 }

The two assignments to ->nr_dirty_pages can race, no? Nothing protects it.

btw, mark_page_dirty() itself seems to assume mmu_lock protection that doesn't exist. Marcelo?

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 10:58:16AM +0100, Kevin Wolf wrote: Am 12.11.2011 11:25, schrieb Avi Kivity: On 11/11/2011 12:15 PM, Kevin Wolf wrote: Am 10.11.2011 22:30, schrieb Anthony Liguori: Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach. I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen(). Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? Not sure what results and what computation you mean, but let me clarify a bit about bdrv_reopen: The main purpose of bdrv_reopen() is to change flags, for example toggle O_SYNC during runtime in order to allow the guest to toggle WCE. This doesn't necessarily mean a close()/open() sequence if there are other means to change the flags, like fcntl() (or even using other protocols than files). The idea here was to extend this to invalidate all caches if some specific flag is set. As you don't change any other flag, this will usually not be a reopen on a lower level. If we need to use open() though, and it fails (this is really the only different result that comes to mind) then bdrv_reopen() would fail and the old fd would stay in use. Migration would have to fail, but I don't think this case is ever needed for reopening after migration. What's wrong with just delaying the open? Nothing, except that with today's code it's harder to do. Kevin It seems cleaner, though, doesn't it? 
-- 
MST
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
On Mon, 2011-11-14 at 10:04 +0800, Asias He wrote:

Hi, Sasha

On 11/13/2011 11:00 PM, Sasha Levin wrote:

On Sun, 2011-11-13 at 12:24 +0200, Michael S. Tsirkin wrote:

On Sat, Nov 12, 2011 at 12:12:01AM +0200, Sasha Levin wrote:

This is a patch based on Krishna Kumar's patch series which implements multiple VQ support for virtio-net. The patch was tested with ver3 of the patch.

Cc: Krishna Kumar krkum...@in.ibm.com
Cc: Michael S. Tsirkin m...@redhat.com
Cc: Rusty Russell ru...@rustcorp.com.au
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Signed-off-by: Sasha Levin levinsasha...@gmail.com

Any performance numbers?

I tried finding a box with more than two cores so I could test it on something like that as well. From what I see this patch causes a performance regression on my 2 core box. I'll send an updated KVM tools patch in a bit as well.

Before:

# netperf -H 192.168.33.4,ipv4 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    11160.63
16384  87380

# netperf -H 192.168.33.4,ipv4 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

122880 122880 1        1       10.00    12072.64
229376 229376

# netperf -H 192.168.33.4,ipv4 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    4654.50

# netperf -H 192.168.33.4,ipv4 -t TCP_STREAM -- -m 128
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384    128    10.00     635.45

# netperf -H 192.168.33.4,ipv4 -t UDP_STREAM
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

122880   65507   10.00      113894      0    5968.54
229376           10.00       89373           4683.54

# netperf -H 192.168.33.4,ipv4 -t UDP_STREAM -- -m 128
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

122880     128   10.00      550634      0      56.38
229376           10.00      398786             40.84

After:

# netperf -H 192.168.33.4,ipv4 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    8952.47
16384  87380

# netperf -H 192.168.33.4,ipv4 -t UDP_RR
MIGRATED UDP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

122880 122880 1        1       10.00    9534.52
229376 229376

# netperf -H 192.168.33.4,ipv4 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.13    2278.23

# netperf -H 192.168.33.4,ipv4 -t TCP_STREAM -- -m 128
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.33.4 (192.168.33.4) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384    128    10.00     623.27

# netperf -H 192.168.33.4,ipv4 -t UDP_STREAM
MIGRATED UDP STREAM TEST
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:

On 11/11/2011 12:15 PM, Kevin Wolf wrote:

Am 10.11.2011 22:30, schrieb Anthony Liguori:

Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.

I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().

Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? What's wrong with just delaying the open?

If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to roll back to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.

Daniel
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [PATCH 4/4] KVM: Optimize dirty logging by rmap_write_protect()
On 11/14/2011 11:24 AM, Takuya Yoshikawa wrote:

Currently, write protecting a slot needs to walk all the shadow pages and check the ones which have a pte mapping a page in it. The walk is overly heavy when dirty pages in that slot are not so many, and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only the sptes which can be reached from gfns marked in the dirty bitmap when the number of dirty pages is less than that of shadow pages. This criterion is reasonable in its meaning and worked well in our test: write protection became several times faster than before when the ratio of dirty pages was low, and was not worse even when the ratio was near the criterion.

Note that the locking for this write protection becomes fine grained. The reason why this is safe is described in the comments.

+/**
+ * write_protect_slot - write protect a slot for dirty logging
+ * @kvm: the kvm instance
+ * @memslot: the slot we protect
+ * @dirty_bitmap: the bitmap indicating which pages are dirty
+ * @nr_dirty_pages: the number of dirty pages
+ *
+ * We have two ways to find all sptes to protect:
+ * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
+ *    checks ones that have a spte mapping a page in the slot.
+ * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
+ *
+ * Generally speaking, if there are not so many dirty pages compared to the
+ * number of shadow pages, we should use the latter.
+ *
+ * Note that letting others write into a page marked dirty in the old bitmap
+ * by using the remaining tlb entry is not a problem. That page will become
+ * write protected again when we flush the tlb and then be reported dirty to
+ * the user space by copying the old bitmap.
+ */
+static void write_protect_slot(struct kvm *kvm,
+			       struct kvm_memory_slot *memslot,
+			       unsigned long *dirty_bitmap,
+			       unsigned long nr_dirty_pages)
+{
+	/* Not many dirty pages compared to # of shadow pages. */
+	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {

Seems a reasonable heuristic. In particular, this is always true for vga, yes? That will get the code exercised.

+		unsigned long gfn_offset;
+
+		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
+			unsigned long gfn = memslot->base_gfn + gfn_offset;
+
+			spin_lock(&kvm->mmu_lock);
+			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
+			spin_unlock(&kvm->mmu_lock);
+		}
+		kvm_flush_remote_tlbs(kvm);
+	} else {
+		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
+		spin_unlock(&kvm->mmu_lock);
+	}
+}
+

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 10:16:10AM +0000, Daniel P. Berrange wrote:

On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:

On 11/11/2011 12:15 PM, Kevin Wolf wrote:

Am 10.11.2011 22:30, schrieb Anthony Liguori:

Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach.

I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen().

Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? What's wrong with just delaying the open?

If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to roll back to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails.

Daniel

I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker.

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [PATCH 3/4] KVM: Count the number of dirty pages for dirty logging
On 11/14/2011 12:07 PM, Avi Kivity wrote:

On 11/14/2011 11:23 AM, Takuya Yoshikawa wrote:

Needed for the next patch which uses this number to decide how to write protect a slot.

 	/* If nothing is dirty, don't bother messing with page tables. */
-	if (is_dirty) {
+	if (memslot->nr_dirty_pages) {
 		struct kvm_memslots *slots, *old_slots;
 		unsigned long *dirty_bitmap;
@@ -3504,6 +3500,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 			goto out;
 		memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 		slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
+		slots->memslots[log->slot].nr_dirty_pages = 0;
 		slots->generation++;

 #endif /* !CONFIG_S390 */
@@ -1491,7 +1492,8 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	if (memslot && memslot->dirty_bitmap) {
 		unsigned long rel_gfn = gfn - memslot->base_gfn;

-		__set_bit_le(rel_gfn, memslot->dirty_bitmap);
+		if (!__test_and_set_bit_le(rel_gfn, memslot->dirty_bitmap))
+			memslot->nr_dirty_pages++;
 	}
 }

The two assignments to ->nr_dirty_pages can race, no? Nothing protects it.

er, it can't, it's before the rcu_assign_pointer().

--
error compiling committee.c: too many arguments to function
Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap
On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote:

This is a revised version of my previous work. I hope that the patches are more self explanatory than before.

It looks good. I'll let Marcelo (or anyone else?) review it as well before applying. Do you have performance measurements?

--
error compiling committee.c: too many arguments to function
Re: [PATCH 4/4] KVM: Optimize dirty logging by rmap_write_protect()
(2011/11/14 19:22), Avi Kivity wrote:

+ *
+ * Generally speaking, if there are not so many dirty pages compared to the
+ * number of shadow pages, we should use the latter.
+ *
+ * Note that letting others write into a page marked dirty in the old bitmap
+ * by using the remaining tlb entry is not a problem. That page will become
+ * write protected again when we flush the tlb and then be reported dirty to
+ * the user space by copying the old bitmap.
+ */
+static void write_protect_slot(struct kvm *kvm,
+			       struct kvm_memory_slot *memslot,
+			       unsigned long *dirty_bitmap,
+			       unsigned long nr_dirty_pages)
+{
+	/* Not many dirty pages compared to # of shadow pages. */
+	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {

Seems a reasonable heuristic. In particular, this is always true for vga, yes? That will get the code exercised. Almost always yes, once the guest warms up and shadow pages are allocated. Takuya

+		unsigned long gfn_offset;
+
+		for_each_set_bit(gfn_offset, dirty_bitmap, memslot->npages) {
+			unsigned long gfn = memslot->base_gfn + gfn_offset;
+
+			spin_lock(&kvm->mmu_lock);
+			kvm_mmu_rmap_write_protect(kvm, gfn, memslot);
+			spin_unlock(&kvm->mmu_lock);
+		}
+		kvm_flush_remote_tlbs(kvm);
+	} else {
+		spin_lock(&kvm->mmu_lock);
+		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
+		spin_unlock(&kvm->mmu_lock);
+	}
+}
+
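The heuristic in the patch above can be isolated into a small user-space sketch. The helper names are illustrative (the kernel uses `for_each_set_bit()` and `kvm->arch.n_used_mmu_pages`); what matters is the decision rule: when the number of dirty pages is below the number of shadow pages, the sparse per-gfn rmap path is chosen, otherwise the whole slot is write-protected in one pass:

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Count how many gfns the sparse path would visit: one per set bit in the
 * dirty bitmap (open-coded analog of for_each_set_bit()). */
static unsigned long count_set_gfns(const unsigned long *bitmap,
                                    unsigned long npages)
{
    unsigned long gfn, n = 0;

    for (gfn = 0; gfn < npages; gfn++)
        if (bitmap[gfn / SKETCH_BITS_PER_LONG] &
            (1UL << (gfn % SKETCH_BITS_PER_LONG)))
            n++;
    return n;
}

/* The patch's criterion: 1 = per-gfn rmap write protection,
 * 0 = whole-slot kvm_mmu_slot_remove_write_access() path. */
static int choose_sparse_path(unsigned long nr_dirty_pages,
                              unsigned long n_used_mmu_pages)
{
    return nr_dirty_pages < n_used_mmu_pages;
}
```

For a quiet VGA framebuffer only a handful of bits are set while a warmed-up guest has many shadow pages, so the sparse path wins, matching the 30-40us to 3-5us improvement reported later in the thread.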
Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap
(2011/11/14 19:25), Avi Kivity wrote: On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote: This is a revised version of my previous work. I hope that the patches are more self-explanatory than before. It looks good. I'll let Marcelo (or anyone else?) review it as well before applying. Do you have performance measurements? For VGA, 30-40us became 3-5us when the display was quiet, with a sufficiently warmed-up guest. Near the criterion, the numbers did not differ much from the original version. For live migration, I forgot the exact numbers but the result was good. My test cases were not enough to cover every pattern, though, so I changed the criterion to be a bit conservative. More tests may be able to find a better criterion. I am not in a hurry about this, so it is OK to add some tests before merging this. But what I did not like was holding the spin lock for more than 100us or so with the original version. With this version, at least, that problem should be mitigated to some extent. Takuya One note: kvm-unit-tests' dirty logging test was broken for 32-bit boxes: compile error. I changed an idt to boot_idt and used it. I do not know kvm-unit-tests well, so I want somebody to fix that officially.
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote: On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: On 11/11/2011 12:15 PM, Kevin Wolf wrote: Am 10.11.2011 22:30, schrieb Anthony Liguori: Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach. I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen(). Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? What's wrong with just delaying the open? If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to roll back to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five-stage migration handshake to cope with rollback when 'cont' fails. Daniel I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker. If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'. Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent.
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
Am 14.11.2011 12:08, schrieb Daniel P. Berrange: On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote: I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker. If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'. Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent. Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file? I'm asking because for avoiding the former, things like access() could be enough, whereas for the latter we'd have to do a full open. Kevin
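Kevin's distinction between the two error classes can be made concrete. The sketch below uses POSIX `access()` as the cheap probe he mentions: it catches missing files (bad NFS mounts) and permission problems (SELinux, read-only mounts) without opening the image, but says nothing about image content; only a full open of the format layer would catch, say, a corrupt qcow2 header. The function name and paths are illustrative, not QEMU code:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Cheap pre-flight probe for an image path: returns 0 if the file exists
 * and is readable, or a negative errno otherwise. Deliberately does NOT
 * validate the image format - that requires a real open. */
static int probe_image_path(const char *path)
{
    if (access(path, R_OK) != 0)
        return -errno;   /* e.g. -ENOENT for a missing NFS mount */
    return 0;
}
```

Such a probe would be enough to catch the configuration-mistake class of failures Daniel describes, while deferring content-level validation to the real open at 'cont'.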
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote: Am 14.11.2011 12:08, schrieb Daniel P. Berrange: If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'. Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent. Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file? Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start anyway, since QEMU is likely to just report them to the guest OS instead. Daniel
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 11:08:02AM +, Daniel P. Berrange wrote: If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont'. Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent. Regards, Daniel IIUC, the 'cont' that we were discussing is the startup of the VM at the destination after migration completes. A failure results in migration failure, which libvirt has been able to handle since forever. In case of the 'cont' command on the source upon migration failure, qemu was running there previously, so its configuration is likely OK. Am I confused? If no, libvirt seems unaffected.
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote: Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file? Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start anyway, since QEMU is likely to just report them to the guest OS instead. Daniel Do you run qemu with -S, then give a 'cont' command to start it?
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote: Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file? Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start anyway, since QEMU is likely to just report them to the guest OS instead. If started correctly, QEMU should not report them to the guest OS, but pause instead. -- Gleb.
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote: Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start anyway, since QEMU is likely to just report them to the guest OS instead. Do you run qemu with -S, then give a 'cont' command to start it? Yes Daniel
Re: KVM VMExit
x86 has an Undefined Instruction; its opcode is 0F 0B and it generates an invalid opcode exception. This instruction is provided for software testing to explicitly generate an invalid opcode exception. The opcode for this instruction is reserved for this purpose. Thanks Xin On Sun, Nov 13, 2011 at 8:50 PM, 王永博 wangyongb...@gmail.com wrote: 2011/11/14 Xin Tong xerox.time.t...@gmail.com I would like to know how I can make a UD2 (Undefined Instruction) cause a vmexit in KVM? Thanks Xin what is UD2 ?
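To the original question: on VMX, a guest #UD (vector 6, which is what UD2 raises) causes a vmexit only if the corresponding bit is set in the VMCS exception bitmap; KVM maintains that bitmap in update_exception_bitmap() in arch/x86/kvm/vmx.c, so intercepting UD2 means making that function set bit 6. The helper below is a hypothetical sketch of just the bit manipulation, not KVM code:

```c
#include <assert.h>
#include <stdint.h>

/* #UD is architectural vector 6 on x86. Setting the matching bit in the
 * VMCS exception bitmap makes a guest #UD (e.g. from ud2, opcode 0f 0b)
 * exit to the hypervisor instead of being delivered to the guest. */
#define UD_VECTOR 6

/* Illustrative helper: returns the exception bitmap with #UD intercept
 * enabled. In real KVM the result would be written to the VMCS with
 * vmcs_write32(EXCEPTION_BITMAP, ...). */
static uint32_t intercept_ud(uint32_t exception_bitmap)
{
    return exception_bitmap | (1u << UD_VECTOR);
}
```

With that bit set, the guest's ud2 produces an exception vmexit that the hypervisor's exception handler can recognize by its vector number.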
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote: Do you run qemu with -S, then give a 'cont' command to start it? Yes Daniel Probably in an attempt to improve reliability :) So this is in fact unrelated to migration. So we can either ignore this bug (assuming no distros ship cutting-edge qemu with an old libvirt), or special-case -S and do an open/close cycle on startup.
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote: Do you run qemu with -S, then give a 'cont' command to start it? Yes Daniel OK, so let's go back one step now - how is this related to 'rollback to source host'? -- MST
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 01:51:40PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote: Am 14.11.2011 12:08, schrieb Daniel P. Berrange: On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote: On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: On 11/11/2011 12:15 PM, Kevin Wolf wrote: Am 10.11.2011 22:30, schrieb Anthony Liguori: Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach. I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen(). Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? What's wrong with just delaying the open? If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails. Daniel I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker. 
If if the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already elminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont' Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent. Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file? Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start any, since QEMU is likely to just report them to the guest OS instead. Do you run qemu with -S, then give a 'cont' command to start it? Probably in an attempt to improve reliability :) Not really. We can't simply let QEMU start its own CPUs, because there are various tasks that need performing after the migration transfer finishes, but before the CPUs are allowed to be started. 
eg - Finish 802.1Qb{g,h} (VEPA) network port profile association on target - Release leases for any resources associated with the source QEMU via a configured lock manager (eg sanlock) - Acquire leases for any resources associated with the target QEMU via a configured lock manager (eg sanlock) Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 01:56:36PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote: Am 14.11.2011 12:08, schrieb Daniel P. Berrange: On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote: On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: On 11/11/2011 12:15 PM, Kevin Wolf wrote: Am 10.11.2011 22:30, schrieb Anthony Liguori: Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach. I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen(). Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? What's wrong with just delaying the open? If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails. Daniel I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker. 
If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont' Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent. Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file? Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start anyway, since QEMU is likely to just report them to the guest OS instead. Do you run qemu with -S, then give a 'cont' command to start it? Yes OK, so let's go back one step now - how is this related to 'rollback to source host'? In the old libvirt migration protocol, by the time we run 'cont' on the destination, the source QEMU has already been killed off, so there's nothing to resume on failure. Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Performance counter - vPMU
Hi, I saw a presentation on virtualizing performance counters at http://www.linux-kvm.org/wiki/images/6/6d/Kvm-forum-2011-performance-monitoring.pdf. Has the code been merged? Can I get something to play with/provide feedback? Balbir Singh
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On Mon, Nov 14, 2011 at 11:58:14AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 01:56:36PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 11:37:27AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 01:34:15PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 11:29:18AM +, Daniel P. Berrange wrote: On Mon, Nov 14, 2011 at 12:21:53PM +0100, Kevin Wolf wrote: Am 14.11.2011 12:08, schrieb Daniel P. Berrange: On Mon, Nov 14, 2011 at 12:24:22PM +0200, Michael S. Tsirkin wrote: On Mon, Nov 14, 2011 at 10:16:10AM +, Daniel P. Berrange wrote: On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: On 11/11/2011 12:15 PM, Kevin Wolf wrote: Am 10.11.2011 22:30, schrieb Anthony Liguori: Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach. I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen(). Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? What's wrong with just delaying the open? If you delay the 'open' until the mgmt app issues 'cont', then you loose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails. Daniel I guess reopen can fail as well, so this seems to me to be an important fix but not a blocker. 
If the initial open succeeds, then it is far more likely that a later re-open will succeed too, because you have already eliminated the possibility of configuration mistakes, and will have caught most storage runtime errors too. So there is a very significant difference in reliability between doing an 'open at startup + reopen at cont' vs just 'open at cont' Based on the bug reports I see, we want to be very good at detecting and gracefully handling open errors because they are pretty frequent. Do you have some more details on the kind of errors? Missing files, permissions, something like this? Or rather something related to the actual content of an image file? Missing files due to wrong/missing NFS mounts, or incorrect SAN / iSCSI setup. Access permissions due to incorrect user / group setup, or read-only mounts, or SELinux denials. Actual I/O errors are less common and are not so likely to cause QEMU to fail to start anyway, since QEMU is likely to just report them to the guest OS instead. Do you run qemu with -S, then give a 'cont' command to start it? Yes OK, so let's go back one step now - how is this related to 'rollback to source host'? In the old libvirt migration protocol, by the time we run 'cont' on the destination, the source QEMU has already been killed off, so there's nothing to resume on failure. Daniel I see. So again there are two solutions I see: 1. ignore old libvirt as it can't restart source reliably anyway 2. open files when migration is completed (after startup, but before cont) -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: Performance counter - vPMU
On Mon, Nov 14, 2011 at 05:35:59PM +0530, Balbir Singh wrote: Hi, I saw a presentation on virtualizing performance counters at http://www.linux-kvm.org/wiki/images/6/6d/Kvm-forum-2011-performance-monitoring.pdf. Has the code been merged? Can I get something to play with/provide feedback? Not yet merged. You can take it from here https://lkml.org/lkml/2011/11/10/215 -- Gleb.
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
On Mon, Nov 14, 2011 at 4:04 AM, Asias He asias.he...@gmail.com wrote: Why are both the bandwidth and latency performance dropping so dramatically with multiple VQ? What's the expected benefit from multiple VQs, i.e. why are we doing the patches, Sasha?
Re: [PATCH 0/4] KVM: Dirty logging optimization using rmap
On 11/14/2011 12:56 PM, Takuya Yoshikawa wrote: (2011/11/14 19:25), Avi Kivity wrote: On 11/14/2011 11:20 AM, Takuya Yoshikawa wrote: This is a revised version of my previous work. I hope that the patches are more self-explanatory than before. It looks good. I'll let Marcelo (or anyone else?) review it as well before applying. Do you have performance measurements? For VGA, 30-40us became 3-5us when the display was quiet, with a sufficiently warmed-up guest. That's a nice improvement. Near the criterion, the number was not much different from the original version. For live migration, I forgot the number but the result was good. But my test case was not enough to cover every pattern, so I changed the criterion to be a bit conservative. More tests may be able to find a better criterion. I am not in a hurry about this, so it is OK to add some tests before merging this. I think we can merge it as is, it's clear we get an improvement. But what I did not like was holding the spin lock for more than 100us or so with the original version. With this version, at least, the problem should be cured somewhat. There was a patchset from Peter Zijlstra that converted mmu notifiers to be preemptible; with that, we can convert the mmu spinlock to a mutex. I'll see what happened to it. There is a third method of doing write protection, and that is by write-protecting at the higher levels of the paging hierarchy. The advantage there is that write protection is O(1) no matter how large the guest is, or the number of dirty pages. To write protect all guest memory, we just write protect the 512 PTEs at the very top, and leave the rest alone. When the guest writes to a page, we allow writes for the top-level PTE that faulted, and write-protect all the PTEs that it points to. We can combine it with your method by having a small bitmap (say, just 64 bits) per shadow page. Each bit represents 8 PTEs (total 512 PTEs) and is set if any of those PTEs are writeable.
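To make the combined scheme concrete, here is a minimal sketch of the per-shadow-page summary bitmap Avi describes (64 bits covering 512 PTEs, 8 PTEs per bit). The names are hypothetical, not from the KVM source:

```c
#include <stdint.h>
#include <stdbool.h>

#define PTES_PER_SP  512
#define PTES_PER_BIT 8   /* 512 PTEs / 64 summary bits */

/* Set the summary bit covering the group of 8 PTEs that pte_index
 * falls into, recording that at least one of them is writable. */
static void mark_pte_writable(uint64_t *summary, unsigned pte_index)
{
        *summary |= 1ull << (pte_index / PTES_PER_BIT);
}

/* Cheap test: could any PTE in pte_index's group of 8 be writable?
 * Only groups whose bit is set need to be scanned on write-protect. */
static bool group_has_writable(uint64_t summary, unsigned pte_index)
{
        return (summary & (1ull << (pte_index / PTES_PER_BIT))) != 0;
}
```

With this, write-protecting a shadow page only has to visit the groups whose summary bit is set, instead of all 512 entries.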
Takuya One note: kvm-unit-tests' dirty logging test was broken on a 32-bit box: compile error. I changed an idt to boot_idt and used it. I do not know kvm-unit-tests well, so I want somebody to fix that officially. I'll look into it. -- error compiling committee.c: too many arguments to function
Re: KVM VMExit
On Mon, Nov 14, 2011, Xin Tong wrote about Re: KVM VMExit: x86 has an Undefined Instruction, its opcode is 0F 0B and it generates an invalid opcode exception. This instruction is provided for software testing to explicitly generate an invalid opcode exception. The opcode for this instruction is reserved for this purpose. Seeing that your recent questions on this list aren't really about KVM development, but rather about VMX (apparently) basics, I suggest you also grab yourself a copy of the Intel Software Manual - see volume 3 of: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html which answers all of your recent questions. About this question - as far as I know there is no way to specifically cause an exit only on the UD2 instruction. What you can do, however, is to cause an exit on any #UD exception, including the one generated by UD2. The VMCS has an exception bitmap which defines which exceptions cause an exit, and you can turn on the #UD bit to ask for this exit. -- Nadav Har'El|Monday, Nov 14 2011, n...@math.technion.ac.il |- Phone +972-523-790466, ICQ 13349191 |Willpower: The ability to eat only one http://nadav.harel.org.il |salted peanut.
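For illustration, the intercept Nadav describes boils down to setting bit 6 (the #UD vector number) in the 32-bit exception-bitmap value before it is written to the VMCS with vmwrite. A minimal sketch of just the bit manipulation (helper name hypothetical):

```c
#include <stdint.h>

#define UD_VECTOR 6  /* vector of the invalid-opcode exception (#UD) */

/* Return the exception bitmap with the #UD intercept enabled; the
 * result would then be written to the VMCS EXCEPTION_BITMAP field.
 * Any #UD in the guest (including UD2) then causes a vmexit. */
static uint32_t enable_ud_intercept(uint32_t exception_bitmap)
{
        return exception_bitmap | (1u << UD_VECTOR);
}
```

Note the exit fires for every #UD, so the handler must still decode the faulting instruction if it only cares about UD2.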
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
On Mon, Nov 14, 2011 at 02:25:17PM +0200, Pekka Enberg wrote: On Mon, Nov 14, 2011 at 4:04 AM, Asias He asias.he...@gmail.com wrote: Why are both the bandwidth and latency performance dropping so dramatically with multiple VQ? What's the expected benefit from multiple VQs Heh, the original patchset didn't mention this :) It really should. They are supposed to speed up networking for high smp guests. i.e. why are we doing the patches, Sasha? -- MST
Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
On 11/09/2011 01:23 AM, Scott Wood wrote: This prevents us from inappropriately blocking in a KVM_SET_REGS ioctl -- the MSR[WE] will take effect when the guest is next entered. It also causes SRR1[WE] to be set when we enter the guest's interrupt handler, which is what e500 hardware is documented to do. Signed-off-by: Scott Wood <scottw...@freescale.com> --- arch/powerpc/kvm/booke.c | 28 ++-- 1 files changed, 18 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index feaefc4..557f028 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -124,12 +124,6 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr) vcpu->arch.shared->msr = new_msr; kvmppc_mmu_msr_notify(vcpu, old_msr); - - if (vcpu->arch.shared->msr & MSR_WE) { - kvm_vcpu_block(vcpu); - kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS); - }; - kvmppc_vcpu_sync_spe(vcpu); } @@ -288,15 +282,12 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu, return allowed; } -/* Check pending exceptions and deliver one, if possible. */ -void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu) +static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu) { unsigned long *pending = &vcpu->arch.pending_exceptions; unsigned long old_pending = vcpu->arch.pending_exceptions; unsigned int priority; - WARN_ON_ONCE(!irqs_disabled()); - priority = __ffs(*pending); while (priority <= BOOKE_IRQPRIO_MAX) { if (kvmppc_booke_irqprio_deliver(vcpu, priority)) @@ -314,6 +305,23 @@ void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu) vcpu->arch.shared->int_pending = 0; } +/* Check pending exceptions and deliver one, if possible. */ +void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu) +{ + WARN_ON_ONCE(!irqs_disabled()); + + kvmppc_core_check_exceptions(vcpu); + + if (vcpu->arch.shared->msr & MSR_WE) { + local_irq_enable(); + kvm_vcpu_block(vcpu); + local_irq_disable(); Hrm. This specific irq enable/disable part isn't pretty but I can't think of a cleaner way either.
Unless you move it out of the prepare function, since I don't see a way it could race with an interrupt. + + kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS); + kvmppc_core_check_exceptions(vcpu); Shouldn't if (msr & MSR_WE) { ... } core_check_exceptions(vcpu); work just as well?
Re: [Qemu-devel] qemu and qemu.git - Migration + disk stress introduces qcow2 corruptions
On 11/14/2011 04:16 AM, Daniel P. Berrange wrote: On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote: On 11/11/2011 12:15 PM, Kevin Wolf wrote: Am 10.11.2011 22:30, schrieb Anthony Liguori: Live migration with qcow2 or any other image format is just not going to work right now even with proper clustered storage. I think doing a block level flush cache interface and letting block devices decide how to do it is the best approach. I would really prefer reusing the existing open/close code. It means less (duplicated) code, is existing code that is well tested and doesn't make migration much of a special case. If you want to avoid reopening the file on the OS level, we can reopen only the topmost layer (i.e. the format, but not the protocol) for now and in 1.1 we can use bdrv_reopen(). Intuitively I dislike _reopen style interfaces. If the second open yields different results from the first, does it invalidate any computations in between? What's wrong with just delaying the open? If you delay the 'open' until the mgmt app issues 'cont', then you lose the ability to rollback to the source host upon open failure for most deployed versions of libvirt. We only fairly recently switched to a five stage migration handshake to cope with rollback when 'cont' fails. Delayed open isn't a panacea. With the series I sent, we should be able to migrate with a qcow2 file on coherent shared storage. There are two other cases that we care about: migration with nfs cache!=none and direct attached storage with cache!=none. Whether the open is deferred matters less with NFS than if the open happens after the close on the source. To fix NFS cache!=none, we would have to do a bdrv_close() before sending the last byte of migration data and make sure that we bdrv_open() after receiving the last byte of migration data. The problem with this IMHO is it creates a large window where no one has the file open and you're critically vulnerable to losing your VM.
I'm much more in favor of a smarter caching policy. If we can fcntl() our way to O_DIRECT on NFS, that would be fairly interesting. I'm not sure if this is supported today but it's something we could look into adding in the kernel. That way we could force NFS to O_DIRECT during migration which would solve this problem robustly. Deferred open doesn't help with direct attached storage. There simply is no guarantee that there isn't data in the page cache. Again, I think defaulting DAS to cache=none|directsync is what makes the most sense here. We can even add a migration blocker for DAS with cache=on. If we can do dynamic toggling of the cache setting, then that's pretty friendly at the end of the day. Regards, Anthony Liguori Daniel
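The fcntl() idea Anthony floats can be sketched as below. This is an illustration of the pattern only, with a hypothetical helper name: whether F_SETFL actually honors O_DIRECT depends on the kernel and filesystem (which is exactly the open question in the mail), so callers must check the return value and fall back gracefully:

```c
#include <fcntl.h>

#ifndef O_DIRECT
#define O_DIRECT 040000  /* Linux/x86 value; fallback for illustration */
#endif

/* Toggle O_DIRECT on an already-open fd without reopening it.
 * Returns 0 on success, -1 if the fd or filesystem rejects it. */
static int set_direct_io(int fd, int on)
{
        int flags = fcntl(fd, F_GETFL);

        if (flags < 0)
                return -1;
        if (on)
                flags |= O_DIRECT;
        else
                flags &= ~O_DIRECT;
        return fcntl(fd, F_SETFL, flags);
}
```

During migration the block layer could call set_direct_io(fd, 1) before the final sync and flip it back afterwards, avoiding a close/open window.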
Re: [PATCH RFC] ndo: ndo_queue_xmit/ndo_flush_xmit (was Re: [RFC] [ver3 PATCH 0/6] Implement multiqueue virtio-net)
On Sun, Nov 13, 2011 at 07:48:28PM +0200, Michael S. Tsirkin wrote: @@ -863,6 +872,9 @@ struct net_device_ops { int (*ndo_stop)(struct net_device *dev); netdev_tx_t (*ndo_start_xmit) (struct sk_buff *skb, struct net_device *dev); + netdev_tx_t (*ndo_queue_xmit)(struct sk_buff *skb, + struct net_device *dev); + void (*ndo_flush_xmit)(struct net_device *dev); u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb); void (*ndo_change_rx_flags)(struct net_device *dev, @@ -2099,6 +2111,10 @@ extern int dev_set_mac_address(struct net_device *, An alternative I considered was to add a boolean flag to ndo_start_xmit 'bool queue' or something like this, plus ndo_flush_xmit. This will lead to cleaner code I think but will require all drivers to be changed, so for a proof of concept I decided to go for one that is less work. Let me know what looks more palatable ... -- MST
Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
On 11/14/2011 07:13 AM, Alexander Graf wrote: On 11/09/2011 01:23 AM, Scott Wood wrote: +/* Check pending exceptions and deliver one, if possible. */ +void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu) +{ + WARN_ON_ONCE(!irqs_disabled()); + + kvmppc_core_check_exceptions(vcpu); + + if (vcpu->arch.shared->msr & MSR_WE) { + local_irq_enable(); + kvm_vcpu_block(vcpu); + local_irq_disable(); Hrm. This specific irq enable/disable part isn't pretty but I can't think of a cleaner way either. Unless you move it out of the prepare function, since I don't see a way it could race with an interrupt. kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after that. We can't enable interrupts after kvmppc_core_check_exceptions (or rather, if we do, we need to check again once interrupts are re-disabled, as in the MSR_WE case) because otherwise we could have an exception delivered afterward and receive the resched IPI at just the wrong time to take any action on it (just like the signal check). + + kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS); + kvmppc_core_check_exceptions(vcpu); Shouldn't if (msr & MSR_WE) { ... } core_check_exceptions(vcpu); work just as well? That would have us pointlessly checking the exceptions twice in the non-WE case -- unless you mean only check after, which as described above means you'll fail to wake up. -Scott
KVM call agenda for November 15th
Hi Please send in any agenda items you are interested in covering. Proposal: - Migration debacle. Thanks, Juan.
Re: [Qemu-devel] KVM call agenda for November 15th
On 11/14/2011 11:44 AM, Juan Quintela wrote: Hi Please send in any agenda items you are interested in covering. Proposal: - Migration debacle. Ack. You read my mind :-) Regards, Anthony Liguori Thanks, Juan.
Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
On 14.11.2011, at 18:22, Scott Wood <scottw...@freescale.com> wrote: On 11/14/2011 07:13 AM, Alexander Graf wrote: On 11/09/2011 01:23 AM, Scott Wood wrote: +/* Check pending exceptions and deliver one, if possible. */ +void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu) +{ + WARN_ON_ONCE(!irqs_disabled()); + + kvmppc_core_check_exceptions(vcpu); + + if (vcpu->arch.shared->msr & MSR_WE) { + local_irq_enable(); + kvm_vcpu_block(vcpu); + local_irq_disable(); Hrm. This specific irq enable/disable part isn't pretty but I can't think of a cleaner way either. Unless you move it out of the prepare function, since I don't see a way it could race with an interrupt. kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after that. We can't enable interrupts after kvmppc_core_check_exceptions (or rather, if we do, we need to check again once interrupts are re-disabled, as in the MSR_WE case) because otherwise we could have an exception delivered afterward and receive the resched IPI at just the wrong time to take any action on it (just like the signal check). Well, but if you enable interrupts here you're basically rendering code that runs earlier in disabled-interrupt mode racy. How does this align with the signal handling? We could receive a signal here and not handle it the same as the race you fixed earlier, no? Alex + + kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS); + kvmppc_core_check_exceptions(vcpu); Shouldn't if (msr & MSR_WE) { ... } core_check_exceptions(vcpu); work just as well? That would have us pointlessly checking the exceptions twice in the non-WE case -- unless you mean only check after, which as described above means you'll fail to wake up. -Scott
[PATCHv2 RFC] virtio-pci: flexible configuration layout
Add a flexible mechanism to specify virtio configuration layout, using pci vendor-specific capability. A separate capability is used for each of common, device specific and data-path accesses. Warning: compiled only. This patch also needs to be split up, pci_iomap changes also need arch updates for non-x86. There might also be more spec changes. Posting here for early feedback, and to allow Sasha to proceed with his kvm tool work. Changes from v1: Updated to match v3 of the spec, see: Subject: [PATCHv3 RFC] virtio-spec: flexible configuration layout Message-ID: <2010122436.ga13...@redhat.com> In-Reply-To: <2009195901.ga28...@redhat.com> In particular: - split ISR out - reorg capability - add offset alignment Signed-off-by: Michael S. Tsirkin <m...@redhat.com> --- drivers/virtio/virtio_pci.c | 184 +++--- include/asm-generic/io.h | 4 + include/asm-generic/iomap.h | 11 +++ include/linux/pci_regs.h | 4 + include/linux/virtio_pci.h | 41 ++ lib/iomap.c | 40 - 6 files changed, 265 insertions(+), 19 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index ecb9254..bdd3876 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -37,8 +37,12 @@ struct virtio_pci_device struct virtio_device vdev; struct pci_dev *pci_dev; - /* the IO mapping for the PCI config space */ + /* the IO address for the common PCI config space */ void __iomem *ioaddr; + /* the IO address for device specific config */ + void __iomem *ioaddr_device; + /* the IO address to use for notifications operations */ + void __iomem *ioaddr_notify; /* a list of queues so we can dispatch IRQs */ spinlock_t lock; @@ -57,8 +61,158 @@ struct virtio_pci_device unsigned msix_used_vectors; /* Whether we have vector per vq */ bool per_vq_vectors; + + /* Various IO mappings: used for resource tracking only. */ + + /* Legacy BAR0: typically PIO. 
*/ + void __iomem *legacy_map; + + /* Mappings specified by device capabilities: typically in MMIO */ + void __iomem *isr_map; + void __iomem *notify_map; + void __iomem *common_map; + void __iomem *device_map; }; +/* + * With PIO, device-specific config moves as MSI-X is enabled/disabled. + * Use this accessor to keep pointer to that config in sync. + */ +static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int enabled) +{ + vp_dev->msix_enabled = enabled; + if (vp_dev->device_map) + vp_dev->ioaddr_device = vp_dev->device_map; + else + vp_dev->ioaddr_device = vp_dev->legacy_map + + VIRTIO_PCI_CONFIG(vp_dev); +} + +static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap_id, + u32 align) +{ + u32 size; + u32 offset; + u8 bir; + u8 cap_len; + int pos; + struct pci_dev *dev = vp_dev->pci_dev; + void __iomem *p; + + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); + pos > 0; + pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) { + u8 id; + pci_read_config_byte(dev, pos + PCI_VNDR_CAP_LEN, &cap_len); + if (cap_len < VIRTIO_PCI_CAP_ID + 1) + continue; + pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_ID, &id); + id >>= VIRTIO_PCI_CAP_ID_SHIFT; + id &= VIRTIO_PCI_CAP_ID_MASK; + if (id == cap_id) + break; + } + + if (pos <= 0) + return NULL; + + if (cap_len < VIRTIO_PCI_CAP_CFG_BIR + 1) + goto err; + pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_CFG_BIR, &bir); + if (cap_len < VIRTIO_PCI_CAP_CFG_OFF + 4) + goto err; + pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_OFF, &offset); + if (cap_len < VIRTIO_PCI_CAP_CFG_SIZE + 4) + goto err; + pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_SIZE, &size); + bir >>= VIRTIO_PCI_CAP_CFG_BIR_SHIFT; + bir &= VIRTIO_PCI_CAP_CFG_BIR_MASK; + size >>= VIRTIO_PCI_CAP_CFG_SIZE_SHIFT; + size &= VIRTIO_PCI_CAP_CFG_SIZE_MASK; + off >>= VIRTIO_PCI_CAP_CFG_OFF_SHIFT; + off &= VIRTIO_PCI_CAP_CFG_OFF_MASK; + /* Align offset appropriately */ + off &= ~(align - 1); + + /* It's possible for a device to supply a huge config space, +* 
but we'll never need to map more than a page ATM. */ + p = pci_iomap_range(dev, bir, offset, size, PAGE_SIZE); + if (!p) + dev_err(&vp_dev->vdev.dev, "Unable to map virtio pci memory"); + return p; +err: + dev_err(&vp_dev->vdev.dev, "Unable to parse virtio pci capability"); + return
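As a standalone illustration of the "Align offset appropriately" step added in v2 of the spec: it is plain power-of-two masking, rounding a capability-supplied offset down to the required alignment. Helper name is hypothetical:

```c
#include <stdint.h>

/* Round a capability-supplied config offset down to a power-of-two
 * alignment, mirroring the "off &= ~(align - 1)" step in the patch.
 * Assumes align is a power of two (align - 1 is then an all-ones
 * mask for the low bits). */
static uint32_t virtio_cap_align_down(uint32_t off, uint32_t align)
{
        return off & ~(align - 1);
}
```

This guarantees the mapped window starts on a boundary the transport can access atomically, regardless of where the device placed the capability's offset field.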
kvm pio
I am investigating how PIO is emulated in KVM and QEMU. When a PIO is encountered, it seems to me that its pio data is copied to vcpu->arch.pio_data and a fixed offset is assigned to vcpu->run->io.data_offset. static int emulator_pio_out_emulated(int size, unsigned short port, {... memcpy(vcpu->arch.pio_data, val, size * count); ... vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE; } Later in QEMU, it retrieves data from (uint8_t *)run + run->io.data_offset. How can we be sure that the memory the pio data is copied to (vcpu->arch.pio_data) is where (uint8_t *)run + run->io.data_offset is pointing? Also, it seems that there is something called fast pio in which kvm does not return to qemu. In what case does that happen? Thanks Xin
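The key to the question is that userspace mmap()s the vcpu fd as a single region: struct kvm_run sits at offset 0 and the kernel backs the page at KVM_PIO_PAGE_OFFSET * PAGE_SIZE of that same mapping with vcpu->arch.pio_data, so both sides see one page. A user-space-only simulation of the layout (no real KVM involved; names and struct are hypothetical stand-ins):

```c
#include <stdint.h>
#include <string.h>

#define FAKE_PAGE_SIZE 4096
#define KVM_PIO_PAGE_OFFSET 1   /* pio_data is one page into the vcpu mmap */

struct fake_run {
        uint64_t data_offset;   /* stands in for run->io.data_offset */
};

/* "Kernel side": emulate emulator_pio_out_emulated() copying the PIO
 * bytes into the pio_data page and publishing its offset.  vcpu_map
 * stands in for the userspace mmap of the vcpu fd. */
static void fake_pio_out(uint8_t *vcpu_map, struct fake_run *run,
                         const void *val, size_t len)
{
        memcpy(vcpu_map + KVM_PIO_PAGE_OFFSET * FAKE_PAGE_SIZE, val, len);
        run->data_offset = KVM_PIO_PAGE_OFFSET * FAKE_PAGE_SIZE;
}
```

Userspace then reads the same bytes at vcpu_map + run->data_offset, which is exactly what QEMU's (uint8_t *)run + run->io.data_offset does against the real mapping.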
Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
On 11/14/2011 12:01 PM, Alexander Graf wrote: On 14.11.2011, at 18:22, Scott Wood scottw...@freescale.com wrote: kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after that. We can't enable interrupts after kvmppc_core_check_exceptions (or rather, if we do, we need to check again once interrupts are re-disabled, as in the MSR_WE case) because otherwise we could have an exception delivered afterward and receive the resched IPI at just the wrong time to take any action on it (just like the signal check). Well, but if you enable interrupts here you're basically rendering code that runs earlier in disabled-interrupt mode racy. That's why once we wake, we disable interrupts and check again. How does this align with the signal handling? We could receive a signal here and not handle it the same as the race you fixed earlier, no? Signals are checked by the caller after calling prepare_to_enter. Moving the check into prepare_to_enter wouldn't buy us much since the method of returning is different, and so there'd still need to be code in the caller to deal with it. -Scott
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote: On 11/03/2011 03:12 PM, Alex Williamson wrote: +Many modern systems now provide DMA and interrupt remapping facilities +to help ensure I/O devices behave within the boundaries they've been +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as +well as POWER systems with Partitionable Endpoints (PEs) and even +embedded powerpc systems (technology name unknown). Maybe replace (technology name unknown) with (such as Freescale chips with PAMU) or similar? Or just leave out the parenthetical. I was hoping that comment would lead to an answer. Thanks for the info ;) +As documented in linux/vfio.h, several ioctls are provided on the +group chardev: + +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64) + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0) + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1) +#define VFIO_GROUP_MERGE _IOW(';', 101, int) +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int) +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103) +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *) This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a pointer to char rather than a pointer to an array of char (just as e.g. VFIO_GROUP_MERGE takes a pointer to an int, not just an int). I believe I was following the UI_SET_PHYS ioctl as an example, which is defined as a char *. I'll change to char and verify. +The IOMMU file descriptor provides this set of ioctls: + +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64) + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0) +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map) +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map) What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear? Is such an implementation supposed to add a new flag that describes its restrictions? If MAP_ANY is clear then I would expect a new flag is set defining a new mapping paradigm, probably with an ioctl to describe the restrictions/parameters.
MAP_ANY effectively means there are no restrictions. Can we get a way to turn DMA access off and on, short of unmapping everything, and then mapping it again? iommu_ops doesn't support such an interface, so no, not currently. +The GET_FLAGS ioctl returns basic information about the IOMMU domain. +We currently only support IOMMU domains that are able to map any +virtual address to any IOVA. This is indicated by the MAP_ANY flag. + +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping +and unmapping IOVAs to process virtual addresses: + +struct vfio_dma_map { +__u64 len;/* length of structure */ +__u64 vaddr; /* process virtual addr */ +__u64 dmaaddr;/* desired and/or returned dma address */ +__u64 size; /* size in bytes */ +__u64 flags; +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */ +}; What are the semantics of desired and/or returned dma address? I believe the original intention was that a user could leave dmaaddr clear and let the iommu layer provide an iova address. The iommu api has since evolved and that mapping scheme really isn't present anymore. We'll currently fail if we can't map the requested address. I'll update the docs to make that be the definition. Are we always supposed to provide a desired address, but it may be different on return? Or are there cases where we want to say give me whatever you want or give me this or fail? Exactly, that's what it used to be, but we don't really implement that any more. How much of this needs to be filled out for unmap? dmaaddr and size; will update docs. Note that the length of structure approach means that ioctl numbers will change whenever this grows -- perhaps we should avoid encoding the struct size into these ioctls? How so? What's described here is effectively the base size. If we later add feature foo requiring additional fields, we set a flag, change the size, and tack those fields onto the end.
The kernel side should balk if the size doesn't match what it expects from the flags it understands (which I think I probably need to be more strict about). +struct vfio_region_info { +__u32 len;/* length of structure */ +__u32 index; /* region number */ +__u64 size; /* size in bytes of region */ +__u64 offset; /* start offset of region */ +__u64 flags; +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0) +#define VFIO_REGION_INFO_FLAG_RO (1 << 1) +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2) +__u64 phys; /* physical address of region */ +}; In light of the above, this struct should not include
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote: On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote: On 11/03/2011 03:12 PM, Alex Williamson wrote: + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) { + unsigned long pfn = 0; + + ret = vaddr_get_pfn(vaddr, rdwr, &pfn); + if (ret) { + __vfio_dma_unmap(iommu, start, i, rdwr); + return ret; + } + + /* Only add actual locked pages to accounting */ + if (!is_invalid_reserved_pfn(pfn)) + locked++; + + ret = iommu_map(iommu->domain, iova, + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot); + if (ret) { + /* Back out mappings on error */ + put_pfn(pfn, rdwr); + __vfio_dma_unmap(iommu, start, i, rdwr); + return ret; + } + } There's no way to hand this stuff to the IOMMU driver in chunks larger than a page? That's going to be a problem for our IOMMU, which wants to deal with large windows. There is, this is just a simple implementation that maps individual pages. We just need to determine physically contiguous chunks and mlock them instead of using get_user_pages. The current implementation is much like how KVM maps iommu pages, but there shouldn't be a user API change to try to use larger chunks. We want this for IOMMU large page support too. Also, at one point intel-iommu didn't allow sub-ranges to be unmapped; an unmap of a single page would unmap the entire original mapping that contained that page. That made it easier to map each page individually for the flexibility it provided on unmap. I need to see if we still have that restriction. Thanks, Alex
Re: [RFC PATCH] vfio: VFIO Driver core framework
On 11/14/2011 02:54 PM, Alex Williamson wrote: On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote: What are the semantics of desired and/or returned dma address? I believe the original intention was that a user could leave dmaaddr clear and let the iommu layer provide an iova address. The iommu api has since evolved and that mapping scheme really isn't present anymore. We'll currently fail if we can't map the requested address. I'll update the docs to make that be the definition. OK... if there is any desire in the future to have the kernel pick an address (which could be useful for IOMMUs that don't set VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this, since zero could be a valid address to request (doesn't mean clear). Note that the length of structure approach means that ioctl numbers will change whenever this grows -- perhaps we should avoid encoding the struct size into these ioctls? How so? What's described here is effectively the base size. If we later add feature foo requiring additional fields, we set a flag, change the size, and tack those fields onto the end. The kernel side should balk if the size doesn't match what it expects from the flags it understands (which I think I probably need to be more strict about). The size of the struct is encoded into the ioctl number via the _IOWR() macro. If we want the struct to be growable in the future, we should leave that out and just use _IO(). Otherwise if the size of the struct changes, the ioctl number changes. This is annoying for old userspace plus new kernel (have to add compat entries to the switch), and broken for old kernel plus new userspace. Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO? It would still be useful for devices which don't do DMA, or where we accept the lack of protection/translation (e.g. we have a customer that wants to do KVM device assignment on one of our lower-end chips that lacks an IOMMU). Ugh.
I'm not really onboard with it given that we're trying to sell vfio as a secure user space driver interface with iommu-based protection. That's its main use case, but it doesn't make much sense to duplicate the non-iommu-related bits for other use cases. This applies at runtime too, some devices don't do DMA at all (and thus may not be part of an IOMMU group, even if there is an IOMMU present for other devices -- could be considered a standalone group of one device, with a null IOMMU backend). Support for such devices can wait, but it's good to keep the possibility in mind. -Scott
Re: [RFC PATCH] vfio: VFIO Driver core framework
On 14.11.2011 at 23:26, Scott Wood scottw...@freescale.com wrote: On 11/14/2011 02:54 PM, Alex Williamson wrote: On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote: What are the semantics of desired and/or returned dma address? I believe the original intention was that a user could leave dmaaddr clear and let the iommu layer provide an iova address. The iommu api has since evolved and that mapping scheme really isn't present anymore. We'll currently fail if we can't map the requested address. I'll update the docs to make that be the definition. OK... if there is any desire in the future to have the kernel pick an address (which could be useful for IOMMUs that don't set VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this, since zero could be a valid address to request (doesn't mean clear). Note that the length of structure approach means that ioctl numbers will change whenever this grows -- perhaps we should avoid encoding the struct size into these ioctls? How so? What's described here is effectively the base size. If we later add feature foo requiring additional fields, we set a flag, change the size, and tack those fields onto the end. The kernel side should balk if the size doesn't match what it expects from the flags it understands (which I think I probably need to be more strict about). The size of the struct is encoded into the ioctl number via the _IOWR() macro. If we want the struct to be growable in the future, we should leave that out and just use _IO(). Otherwise if the size of the struct changes, the ioctl number changes. This is annoying for old userspace plus new kernel (have to add compat entries to the switch), and broken for old kernel plus new userspace. Avi wanted to write up a patch for this to allow ioctls with arbitrary size, for exactly this purpose. Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?
It would still be useful for devices which don't do DMA, or where we accept the lack of protection/translation (e.g. we have a customer that wants to do KVM device assignment on one of our lower-end chips that lacks an IOMMU). Ugh. I'm not really onboard with it given that we're trying to sell vfio as a secure user space driver interface with iommu-based protection. That's its main use case, but it doesn't make much sense to duplicate the non-iommu-related bits for other use cases. This applies at runtime too, some devices don't do DMA at all (and thus may not be part of an IOMMU group, even if there is an IOMMU present for other devices -- could be considered a standalone group of one device, with a null IOMMU backend). Support for such devices can wait, but it's good to keep the possibility in mind. I agree. Potentially backing a device with a nop iommu also makes testing easier. Alex
RE: [RFC PATCH] vfio: VFIO Driver core framework
On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote: -Original Message- From: Alex Williamson [mailto:alex.william...@redhat.com] Sent: Friday, November 11, 2011 10:04 AM To: Christian Benvenuti (benve) Cc: chr...@sous-sol.org; a...@au1.ibm.com; p...@au1.ibm.com; d...@au1.ibm.com; joerg.roe...@amd.com; ag...@suse.de; Aaron Fabbri (aafabbri); b08...@freescale.com; b07...@freescale.com; a...@redhat.com; konrad.w...@oracle.com; kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org; linux-...@vger.kernel.org Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote: Here are a few minor comments on vfio_iommu.c ... Sorry, I've been poking sticks at trying to figure out a clean way to solve the force vfio driver attach problem. Attach or detach? Attach. For the case when a new device appears that belongs to a group that's already in use. I'll probably add a claim() operation to the vfio_device_ops that tells the driver to grab it. I was hoping for pci this would just add it to the dynamic ids, but that hits device lock problems. diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c new file mode 100644 index 000..029dae3 --- /dev/null +++ b/drivers/vfio/vfio_iommu.c snip + +#include "vfio_private.h" Doesn't the 'dma_' prefix belong to the generic DMA code? Sure, we could make these more vfio-centric. Like vfio_dma_map_page? Something like that, though _page doesn't seem appropriate as it tracks a region.
+struct dma_map_page { + struct list_head list; + dma_addr_t daddr; + unsigned long vaddr; + int npage; + int rdwr; +}; + +/* + * This code handles mapping and unmapping of user data buffers + * into DMA'ble space using the IOMMU + */ + +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT) + +struct vwork { + struct mm_struct *mm; + int npage; + struct work_struct work; +}; + +/* delayed decrement for locked_vm */ +static void vfio_lock_acct_bg(struct work_struct *work) +{ + struct vwork *vwork = container_of(work, struct vwork, work); + struct mm_struct *mm; + + mm = vwork->mm; + down_write(&mm->mmap_sem); + mm->locked_vm += vwork->npage; + up_write(&mm->mmap_sem); + mmput(mm); /* unref mm */ + kfree(vwork); +} + +static void vfio_lock_acct(int npage) +{ + struct vwork *vwork; + struct mm_struct *mm; + + if (!current->mm) { + /* process exited */ + return; + } + if (down_write_trylock(&current->mm->mmap_sem)) { + current->mm->locked_vm += npage; + up_write(&current->mm->mmap_sem); + return; + } + /* +* Couldn't get mmap_sem lock, so must setup to decrement ^ Increment? Yep Actually, side note, this is increment/decrement depending on the sign of the parameter. So update may be more appropriate. I think Tom originally used increment in one place and decrement in another to show it's dual use. snip +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start, + size_t size, struct dma_map_page *mlp) +{ + struct dma_map_page *split; + int npage_lo, npage_hi; + + /* Existing dma region is completely covered, unmap all */ This works. However, given how vfio_dma_map_dm implements the merging logic, I think it is impossible to have (start < mlp->daddr && start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) It's quite possible. This allows userspace to create a sparse mapping, then blow it all away with a single unmap from 0 to ~0.
I would prefer the user to use exact ranges in the unmap operations because it would make it easier to detect bugs/leaks in the map/unmap logic used by the callers. My assumptions are that: - the user always keeps track of the mappings My qemu code plays a little on the loose side here, acting as a passthrough for the internal memory client. But even there, worst case would probably be trying to unmap a non-existent entry, not unmapping a sparse range. - the user either unmaps one specific mapping or 'all of them'. The 'all of them' case would also take care of those cases where the user does _not_ keep track of mappings and simply uses the unmap from 0 to ~0 each time. Because of this you could still provide an exact map/unmap logic and
[RFC 2/5] virtio-pci: Fix compilation issue
Signed-off-by: Sasha Levin levinsasha...@gmail.com --- drivers/virtio/virtio_pci.c | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 7625434..d242fcc 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -129,10 +129,10 @@ static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap bir &= VIRTIO_PCI_CAP_CFG_BIR_MASK; size >>= VIRTIO_PCI_CAP_CFG_SIZE_SHIFT; size &= VIRTIO_PCI_CAP_CFG_SIZE_MASK; -off >>= VIRTIO_PCI_CAP_CFG_OFF_SHIFT; -off &= VIRTIO_PCI_CAP_CFG_OFF_MASK; +offset >>= VIRTIO_PCI_CAP_CFG_OFF_SHIFT; +offset &= VIRTIO_PCI_CAP_CFG_OFF_MASK; /* Align offset appropriately */ - off &= ~(align - 1); + offset &= ~(align - 1); /* It's possible for a device to supply a huge config space, * but we'll never need to map more than a page ATM. */ -- 1.7.8.rc1
[RFC 4/5] kvm tools: Free up the MSI-X PBA BAR
Free up the BAR to make space for the new virtio BARs. It isn't required to have the PBA and the table in the separate BARs, and uniting them will just give us extra BARs to play with. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/virtio-pci.h |1 - tools/kvm/virtio/pci.c | 35 ++- 2 files changed, 14 insertions(+), 22 deletions(-) diff --git a/tools/kvm/include/kvm/virtio-pci.h b/tools/kvm/include/kvm/virtio-pci.h index 2bbb271..73f7486 100644 --- a/tools/kvm/include/kvm/virtio-pci.h +++ b/tools/kvm/include/kvm/virtio-pci.h @@ -30,7 +30,6 @@ struct virtio_pci { u32 vq_vector[VIRTIO_PCI_MAX_VQ]; u32 gsis[VIRTIO_PCI_MAX_VQ]; u32 msix_io_block; - u32 msix_pba_block; u64 msix_pba; struct msix_table msix_table[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG]; diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c index 1660f06..da38ba5 100644 --- a/tools/kvm/virtio/pci.c +++ b/tools/kvm/virtio/pci.c @@ -214,23 +214,21 @@ static struct ioport_operations virtio_pci__io_ops = { static void callback_mmio_table(u64 addr, u8 *data, u32 len, u8 is_write, void *ptr) { struct virtio_pci *vpci = ptr; - void *table = vpci-msix_table; + void *table; + u32 offset; - if (is_write) - memcpy(table + addr - vpci-msix_io_block, data, len); - else - memcpy(data, table + addr - vpci-msix_io_block, len); -} - -static void callback_mmio_pba(u64 addr, u8 *data, u32 len, u8 is_write, void *ptr) -{ - struct virtio_pci *vpci = ptr; - void *pba = vpci-msix_pba; + if (addr vpci-msix_io_block + PCI_IO_SIZE) { + table = vpci-msix_pba; + offset = vpci-msix_io_block + PCI_IO_SIZE; + } else { + table = vpci-msix_table; + offset = vpci-msix_io_block; + } if (is_write) - memcpy(pba + addr - vpci-msix_pba_block, data, len); + memcpy(table + addr - offset, data, len); else - memcpy(data, pba + addr - vpci-msix_pba_block, len); + memcpy(data, table + addr - offset, len); } int virtio_pci__signal_vq(struct kvm *kvm, struct virtio_trans *vtrans, u32 vq) @@ -283,12 +281,10 @@ int 
virtio_pci__init(struct kvm *kvm, struct virtio_trans *vtrans, void *dev, u8 pin, line, ndev; vpci-dev = dev; - vpci-msix_io_block = pci_get_io_space_block(PCI_IO_SIZE); - vpci-msix_pba_block = pci_get_io_space_block(PCI_IO_SIZE); + vpci-msix_io_block = pci_get_io_space_block(PCI_IO_SIZE * 2); vpci-base_addr = ioport__register(IOPORT_EMPTY, virtio_pci__io_ops, IOPORT_SIZE, vtrans); kvm__register_mmio(kvm, vpci-msix_io_block, 0x100, callback_mmio_table, vpci); - kvm__register_mmio(kvm, vpci-msix_pba_block, 0x100, callback_mmio_pba, vpci); vpci-pci_hdr = (struct pci_device_header) { .vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET, @@ -299,10 +295,7 @@ int virtio_pci__init(struct kvm *kvm, struct virtio_trans *vtrans, void *dev, .subsys_vendor_id = PCI_SUBSYSTEM_VENDOR_ID_REDHAT_QUMRANET, .subsys_id = subsys_id, .bar[0] = vpci-base_addr | PCI_BASE_ADDRESS_SPACE_IO, - .bar[1] = vpci-msix_io_block | PCI_BASE_ADDRESS_SPACE_MEMORY - | PCI_BASE_ADDRESS_MEM_TYPE_64, - .bar[3] = vpci-msix_pba_block | PCI_BASE_ADDRESS_SPACE_MEMORY - | PCI_BASE_ADDRESS_MEM_TYPE_64, + .bar[1] = vpci-msix_io_block | PCI_BASE_ADDRESS_SPACE_MEMORY, .status = PCI_STATUS_CAP_LIST, .capabilities = (void *)vpci-pci_hdr.msix - (void *)vpci-pci_hdr, }; @@ -327,7 +320,7 @@ int virtio_pci__init(struct kvm *kvm, struct virtio_trans *vtrans, void *dev, * we're not in short of BARs */ vpci-pci_hdr.msix.table_offset = 1; /* Use BAR 1 */ - vpci-pci_hdr.msix.pba_offset = 3; /* Use BAR 3 */ + vpci-pci_hdr.msix.pba_offset = 1 | PCI_IO_SIZE; /* Use BAR 1 with offset */ vpci-config_vector = 0; if (irq__register_device(subsys_id, ndev, pin, line) 0) -- 1.7.8.rc1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 5/5] kvm tools: Support new virtio-pci configuration layout
Signed-off-by: Sasha Levin levinsasha...@gmail.com --- tools/kvm/include/kvm/pci.h| 13 ++- tools/kvm/include/kvm/virtio-pci.h |4 + tools/kvm/virtio/pci.c | 248 ++-- 3 files changed, 254 insertions(+), 11 deletions(-) diff --git a/tools/kvm/include/kvm/pci.h b/tools/kvm/include/kvm/pci.h index f71af0b..4f5a09f 100644 --- a/tools/kvm/include/kvm/pci.h +++ b/tools/kvm/include/kvm/pci.h @@ -39,6 +39,16 @@ struct msix_cap { u32 pba_offset; }; +struct virtio_cap { + u8 cap; + u8 next; + u8 cap_len; + u8 structure_id; + u8 bir; + u32 size:24; + u32 offset; +}; + struct pci_device_header { u16 vendor_id; u16 device_id; @@ -63,7 +73,8 @@ struct pci_device_header { u8 min_gnt; u8 max_lat; struct msix_cap msix; - u8 empty[136]; /* Rest of PCI config space */ + struct virtio_cap virtio[4]; + u8 empty[90]; /* Rest of PCI config space */ u32 bar_size[6]; }; diff --git a/tools/kvm/include/kvm/virtio-pci.h b/tools/kvm/include/kvm/virtio-pci.h index 73f7486..fc3c03f 100644 --- a/tools/kvm/include/kvm/virtio-pci.h +++ b/tools/kvm/include/kvm/virtio-pci.h @@ -19,6 +19,7 @@ struct virtio_pci_ioevent_param { struct virtio_pci { struct pci_device_header pci_hdr; void*dev; + struct kvm *kvm; u16 base_addr; u8 status; @@ -36,6 +37,9 @@ struct virtio_pci { /* virtio queue */ u16 queue_selector; struct virtio_pci_ioevent_param ioeventfds[VIRTIO_PCI_MAX_VQ]; + + u32 virtio_mmio; + u16 virtio_pio; }; int virtio_pci__init(struct kvm *kvm, struct virtio_trans *vtrans, void *dev, diff --git a/tools/kvm/virtio/pci.c b/tools/kvm/virtio/pci.c index da38ba5..8606d87 100644 --- a/tools/kvm/virtio/pci.c +++ b/tools/kvm/virtio/pci.c @@ -40,7 +40,7 @@ static int virtio_pci__init_ioeventfd(struct kvm *kvm, struct virtio_trans *vtra }; ioevent = (struct ioevent) { - .io_addr= vpci-base_addr + VIRTIO_PCI_QUEUE_NOTIFY, + .io_addr= vpci-virtio_pio, .io_len = sizeof(u16), .fn = virtio_pci__ioevent_callback, .fn_ptr = vpci-ioeventfds[vq], @@ -59,7 +59,7 @@ static inline bool virtio_pci__msix_enabled(struct 
virtio_pci *vpci) return vpci-pci_hdr.msix.ctrl PCI_MSIX_FLAGS_ENABLE; } -static bool virtio_pci__specific_io_in(struct kvm *kvm, struct virtio_trans *vtrans, u16 port, +static bool virtio_pci_legacy__specific_io_in(struct kvm *kvm, struct virtio_trans *vtrans, u16 port, void *data, int size, int offset) { u32 config_offset; @@ -89,7 +89,8 @@ static bool virtio_pci__specific_io_in(struct kvm *kvm, struct virtio_trans *vtr return false; } -static bool virtio_pci__io_in(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size) +static bool virtio_pci_legacy__io_in(struct ioport *ioport, struct kvm *kvm, + u16 port, void *data, int size) { unsigned long offset; bool ret = true; @@ -124,15 +125,15 @@ static bool virtio_pci__io_in(struct ioport *ioport, struct kvm *kvm, u16 port, vpci-isr = VIRTIO_IRQ_LOW; break; default: - ret = virtio_pci__specific_io_in(kvm, vtrans, port, data, size, offset); + ret = virtio_pci_legacy__specific_io_in(kvm, vtrans, port, data, size, offset); break; }; return ret; } -static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_trans *vtrans, u16 port, - void *data, int size, int offset) +static bool virtio_pci_legacy__specific_io_out(struct kvm *kvm, struct virtio_trans *vtrans, + u16 port, void *data, int size, int offset) { struct virtio_pci *vpci = vtrans-virtio; u32 config_offset, gsi, vec; @@ -166,7 +167,8 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_trans *vt return false; } -static bool virtio_pci__io_out(struct ioport *ioport, struct kvm *kvm, u16 port, void *data, int size) +static bool virtio_pci_legacy__io_out(struct ioport *ioport, + struct kvm *kvm, u16 port, void *data, int size) { unsigned long offset; bool ret = true; @@ -199,7 +201,76 @@ static bool virtio_pci__io_out(struct ioport *ioport, struct kvm *kvm, u16 port, vpci-status= ioport__read8(data); break; default: - ret =
[RFC 3/5] iomap: Don't ignore offset
Offset was ignored for start calculations, making all mappings start at offset 0 always. Signed-off-by: Sasha Levin levinsasha...@gmail.com --- lib/iomap.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/lib/iomap.c b/lib/iomap.c index f28720e..93ae915 100644 --- a/lib/iomap.c +++ b/lib/iomap.c @@ -272,6 +272,7 @@ void __iomem *pci_iomap_range(struct pci_dev *dev, int bar, if (len <= offset || !start) return NULL; len -= offset; + start += offset; if (len < minlen) return NULL; if (maxlen && len > maxlen) -- 1.7.8.rc1
[RFC 1/5] virtio-pci: flexible configuration layout
From: Michael S. Tsirkin m...@redhat.com Add a flexible mechanism to specify virtio configuration layout, using pci vendor-specific capability. A separate capability is used for each of common, device specific and data-path accesses. Warning: compiled only. This patch also needs to be split up, pci_iomap changes also need arch updates for non-x86. There might also be more spec changes. Posting here for early feedback, and to allow Sasha to proceed with his kvm tool work. Changes from v1: Updated to match v3 of the spec, see: Subject: [PATCHv3 RFC] virtio-spec: flexible configuration layout Message-ID: 2010122436.ga13...@redhat.com In-Reply-To: 2009195901.ga28...@redhat.com In particular: - split ISR out - reorg capability - add offset alignment Signed-off-by: Michael S. Tsirkin m...@redhat.com --- drivers/virtio/virtio_pci.c | 184 +++--- include/asm-generic/io.h|4 + include/asm-generic/iomap.h | 11 +++ include/linux/pci_regs.h|4 + include/linux/virtio_pci.h | 41 ++ lib/iomap.c | 40 - 6 files changed, 265 insertions(+), 19 deletions(-) diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c index 3d1bf41..7625434 100644 --- a/drivers/virtio/virtio_pci.c +++ b/drivers/virtio/virtio_pci.c @@ -37,8 +37,12 @@ struct virtio_pci_device struct virtio_device vdev; struct pci_dev *pci_dev; - /* the IO mapping for the PCI config space */ + /* the IO address for the common PCI config space */ void __iomem *ioaddr; + /* the IO address for device specific config */ + void __iomem *ioaddr_device; + /* the IO address to use for notifications operations */ + void __iomem *ioaddr_notify; /* a list of queues so we can dispatch IRQs */ spinlock_t lock; @@ -57,8 +61,158 @@ struct virtio_pci_device unsigned msix_used_vectors; /* Whether we have vector per vq */ bool per_vq_vectors; + + /* Various IO mappings: used for resource tracking only. */ + + /* Legacy BAR0: typically PIO. 
*/ + void __iomem *legacy_map; + + /* Mappings specified by device capabilities: typically in MMIO */ + void __iomem *isr_map; + void __iomem *notify_map; + void __iomem *common_map; + void __iomem *device_map; }; +/* + * With PIO, device-specific config moves as MSI-X is enabled/disabled. + * Use this accessor to keep pointer to that config in sync. + */ +static void virtio_pci_set_msix_enabled(struct virtio_pci_device *vp_dev, int enabled) +{ + vp_dev-msix_enabled = enabled; + if (vp_dev-device_map) + vp_dev-ioaddr_device = vp_dev-device_map; + else + vp_dev-ioaddr_device = vp_dev-legacy_map + + VIRTIO_PCI_CONFIG(vp_dev); +} + +static void __iomem *virtio_pci_map_cfg(struct virtio_pci_device *vp_dev, u8 cap_id, + u32 align) +{ +u32 size; +u32 offset; +u8 bir; +u8 cap_len; + int pos; + struct pci_dev *dev = vp_dev-pci_dev; + void __iomem *p; + + for (pos = pci_find_capability(dev, PCI_CAP_ID_VNDR); +pos 0; +pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_VNDR)) { + u8 id; + pci_read_config_byte(dev, pos + PCI_VNDR_CAP_LEN, cap_len); + if (cap_len VIRTIO_PCI_CAP_ID + 1) + continue; + pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_ID, id); + id = VIRTIO_PCI_CAP_ID_SHIFT; + id = VIRTIO_PCI_CAP_ID_MASK; + if (id == cap_id) + break; + } + + if (pos = 0) + return NULL; + + if (cap_len VIRTIO_PCI_CAP_CFG_BIR + 1) + goto err; +pci_read_config_byte(dev, pos + VIRTIO_PCI_CAP_CFG_BIR, bir); + if (cap_len VIRTIO_PCI_CAP_CFG_OFF + 4) + goto err; +pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_OFF, offset); + if (cap_len VIRTIO_PCI_CAP_CFG_SIZE + 4) + goto err; +pci_read_config_dword(dev, pos + VIRTIO_PCI_CAP_CFG_SIZE, size); +bir = VIRTIO_PCI_CAP_CFG_BIR_SHIFT; +bir = VIRTIO_PCI_CAP_CFG_BIR_MASK; +size = VIRTIO_PCI_CAP_CFG_SIZE_SHIFT; +size = VIRTIO_PCI_CAP_CFG_SIZE_MASK; +off = VIRTIO_PCI_CAP_CFG_OFF_SHIFT; +off = VIRTIO_PCI_CAP_CFG_OFF_MASK; + /* Align offset appropriately */ + off = ~(align - 1); + + /* It's possible for a device to supply a huge config space, +* 
but we'll never need to map more than a page ATM. */ + p = pci_iomap_range(dev, bir, offset, size, PAGE_SIZE); + if (!p) + dev_err(vp_dev-vdev.dev, Unable to map virtio pci memory); + return p; +err: + dev_err(vp_dev-vdev.dev, Unable to parse
[RFC 0/5] virtio-pci,kvm tools: Support new virtio-pci spec in kvm tools
This patch series contains two fixes for the RFC patch sent by Michael. These patches are pretty basic and should probably be merged into the next version of that patch. The other two patches add support for the new virtio-pci spec in KVM tools, allowing for easy testing of any future virtio-pci kernel side patches. The usermode code isn't complete, but it's working properly and should be enough for functionality testing. Michael S. Tsirkin (1): virtio-pci: flexible configuration layout Sasha Levin (4): virtio-pci: Fix compilation issue iomap: Don't ignore offset kvm tools: Free up the MSI-X PBA BAR kvm tools: Support new virtio-pci configuration layout drivers/virtio/virtio_pci.c| 184 ++-- include/asm-generic/io.h |4 + include/asm-generic/iomap.h| 11 ++ include/linux/pci_regs.h |4 + include/linux/virtio_pci.h | 41 ++ lib/iomap.c| 41 +- tools/kvm/include/kvm/pci.h| 13 ++- tools/kvm/include/kvm/virtio-pci.h |5 +- tools/kvm/virtio/pci.c | 275 9 files changed, 530 insertions(+), 48 deletions(-)
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote: Thanks Konrad! Comments inline. On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote: On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote: [snip] +The GET_NUM_REGIONS ioctl tells us how many regions the device supports: + +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int) Don't want __u32? It could be, not sure if it buys us anything maybe even restricts us. We likely don't need 2^32 regions (famous last words?), so we could later define 0 to something? As a rule, it's best to use explicit fixed width types for all ioctl() arguments, to avoid compat hell for 32-bit userland on 64-bit kernel setups. [snip] +Again, zero count entries are allowed (vfio-pci uses a static interrupt +type to index mapping). I am not really sure what that means. This is so PCI can expose: enum { VFIO_PCI_INTX_IRQ_INDEX, VFIO_PCI_MSI_IRQ_INDEX, VFIO_PCI_MSIX_IRQ_INDEX, VFIO_PCI_NUM_IRQS }; So like regions it always exposes 3 IRQ indexes where count=0 if the device doesn't actually support that type of interrupt. I just want to spell out that bus drivers have this kind of flexibility. I knew what you were aiming for, so I could see what you meant here, but I don't think the doco is very clearly expressed at all. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Mon, Nov 14, 2011 at 03:59:00PM -0700, Alex Williamson wrote: On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote: [snip]

- the user either unmaps one specific mapping or 'all of them'. The 'all of them' case would also take care of those cases where the user does _not_ keep track of mappings and simply uses the unmap from 0 to ~0 each time. Because of this you could still provide an exact map/unmap logic and allow such unmap from 0 to ~0 by making the latter a special case. However, if we want to allow any arbitrary/inexact unmap request, then OK.

I can't think of any good reasons we shouldn't be more strict. I think it was primarily just convenient to hit all the corner cases, since we merge all the requests together for tracking and need to be able to split them back apart. It does feel a little awkward to have a 0/~0 special case, though, but I don't think it's worth adding another ioctl to handle it.

Being strict, or at least enforcing strictness, requires that the infrastructure track all the maps, so that the unmaps can be matching. This is not a natural thing with the data structures you want for all IOMMUs. For example, on POWER the IOMMU (aka TCE table) is a simple 1-level pagetable: one pointer with a couple of permission bits per IOMMU page. Handling oddly overlapping operations on that data structure is natural; enforcing strict matching of maps and unmaps is not, and would require extra information to be stored by vfio. On POWER, the IOMMU operations often *are* a hot path, so manipulating those structures would have a real cost, too.

--
David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Tue, 2011-11-15 at 11:05 +1100, David Gibson wrote: Being strict, or at least enforcing strictness, requires that the infrastructure track all the maps, so that the unmaps can be matching. This is not a natural thing with the data structures you want for all IOMMUs. For example on POWER, the IOMMU (aka TCE table) is a simple 1-level pagetable. One pointer with a couple of permission bits per IOMMU page. Handling oddly overlapping operations on that data structure is natural, enforcing strict matching of maps and unmaps is not and would require extra information to be stored by vfio. On POWER, the IOMMU operations often *are* a hot path, so manipulating those structures would have a real cost, too.

In fact they are a very hot path even. There's no way we can afford the cost of tracking per-page mapping/unmapping (other than bumping the page count on a page that's currently mapped, or via some debug-only feature).

Cheers,
Ben.
Re: [RFC PATCH] vfio: VFIO Driver core framework
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote: On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote: On 11/03/2011 03:12 PM, Alex Williamson wrote:

+	int	(*get)(void *);
+	void	(*put)(void *);
+	ssize_t	(*read)(void *, char __user *,
+			size_t, loff_t *);
+	ssize_t	(*write)(void *, const char __user *,
+			 size_t, loff_t *);
+	long	(*ioctl)(void *, unsigned int, unsigned long);
+	int	(*mmap)(void *, struct vm_area_struct *);
+};

When defining an API, please do not omit parameter names.

ok

Should specify what the driver is supposed to do with get/put -- I guess not try to unbind when the count is nonzero? Races could still lead the unbinder to be blocked, but I guess it lets the driver know when it's likely to succeed.

Right, for the pci bus driver it's mainly for reference counting, including the module_get to prevent vfio-pci from being unloaded. On the first get for a device we also do a pci_enable(), and pci_disable() on last put. I'll try to clarify in the docs. Looking at these again, I should just rename them to open/release. That matches the points when they're called. I suspect I started with just reference counting and it grew to more of a full-blown open/release.

Thanks,
Alex
Re: Performance counter - vPMU
On Mon, Nov 14, 2011 at 5:48 PM, Gleb Natapov g...@redhat.com wrote: Not yet merged. You can take it from here: https://lkml.org/lkml/2011/11/10/215

Thank you very much, Gleb!

Balbir Singh.
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
Sasha Levin levinsasha...@gmail.com wrote on 11/14/2011 03:45:40 PM: Why both the bandwidth and latency performance are dropping so dramatically with multiple VQ? It looks like there's no hash sync between host and guest, which makes the RX VQ change for every packet. This is my guess.

Yes, I confirmed this happens for macvtap. I am using ixgbe - it calls skb_record_rx_queue when a skb is allocated, but sets rxhash when a packet arrives. Macvtap is relying on record_rx_queue first, ahead of rxhash (as part of my patch making macvtap multiqueue), hence different skbs result in macvtap selecting different vq's. Reordering macvtap to use rxhash first results in all packets going to the same VQ. The code snippet is:

{
	...
	if (!numvtaps)
		goto out;

	rxq = skb_get_rxhash(skb);
	if (rxq) {
		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
		if (tap)
			goto out;
	}

	if (likely(skb_rx_queue_recorded(skb))) {
		rxq = skb_get_rx_queue(skb);

		while (unlikely(rxq >= numvtaps))
			rxq -= numvtaps;

		tap = rcu_dereference(vlan->taps[rxq]);
		if (tap)
			goto out;
	}
}

I will submit a patch for macvtap separately. I am working towards the other issue pointed out - different vhost threads handling rx/tx of a single flow.

thanks,
- KK
Re: [PATCH 3/6] KVM: PPC: booke: Check for MSR[WE] in prepare_to_enter
On 14.11.2011, at 18:22, Scott Wood scottw...@freescale.com wrote: On 11/14/2011 07:13 AM, Alexander Graf wrote: On 11/09/2011 01:23 AM, Scott Wood wrote:

+/* Check pending exceptions and deliver one, if possible. */
+void kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
+{
+	WARN_ON_ONCE(!irqs_disabled());
+
+	kvmppc_core_check_exceptions(vcpu);
+
+	if (vcpu->arch.shared->msr & MSR_WE) {
+		local_irq_enable();
+		kvm_vcpu_block(vcpu);
+		local_irq_disable();

Hrm. This specific irq enable/disable part isn't pretty, but I can't think of a cleaner way either. Unless you move it out of the prepare function, since I don't see a way it could race with an interrupt.

kvmppc_core_check_exceptions can clear MSR_WE, so we need to check after that. We can't enable interrupts after kvmppc_core_check_exceptions (or rather, if we do, we need to check again once interrupts are re-disabled, as in the MSR_WE case), because otherwise we could have an exception delivered afterward and receive the resched IPI at just the wrong time to take any action on it (just like the signal check).

Well, but if you enable interrupts here you're basically rendering code that runs earlier in disabled-interrupt mode racy. How does this align with the signal handling? We could receive a signal here and not handle it, the same as the race you fixed earlier, no?

Alex

+
+		kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
+		kvmppc_core_check_exceptions(vcpu);

Shouldn't

	if (msr & MSR_WE) {
		...
	}
	kvmppc_core_check_exceptions(vcpu);

work just as well?

That would have us pointlessly checking the exceptions twice in the non-WE case -- unless you mean only check after, which as described above means you'll fail to wake up.

-Scott