Re: [Qemu-devel] Re: QEMU-KVM and video performance
On Wed, 21 Apr 2010, Avi Kivity wrote:
> On 04/21/2010 09:50 PM, Gerhard Wiesinger wrote:
>> I don't think changing the VGA window is a problem because there are
>> 500.000-1Mio changes/s possible.
>
> 1MB/s, 500k-1M changes/s. Coincidence? Is it taking a page fault or trap
> on every write?

To clarify: memory performance writing to segment A000 is about 1MB/s.

> That indicates a fault on every write (assuming 8-16 bit writes). If
> you're using 256 color VGA and not switching banks, this indicates a bug.

Yes, 256 color VGA and no bank switches involved.

>> Calling the INT 10h set/get window function with different windows (e.g.
>> toggling between window page 0 and 1) is about 500.000 to 1Mio function
>> calls per second.
>
> That's surprisingly fast. I'd expect 100-200k/sec.

Sorry, I mixed up the numbers:
1.) QEMU-KVM: ~111k
2.) QEMU only: 500k-1Mio

> Please run kvm_stat and report output for both tests to confirm.

See below. The 2nd column is the per-second statistic while running the test.

>> To get really good VGA performance both parameters should be:
>> - about 50MB/s for writes to segment A000
>> - ~500.000 bank switches per second
>
> First should be doable easily, second is borderline.

I think this is very easy to distinguish:
1.) VGA segment A000 is legacy and should be handled through QEMU and not through KVM (because it is much faster there). 16 color modes should also be fast enough there.
2.) All other flat PCI memory accesses should be handled through KVM (there is a specialized driver loaded for that PCI device in the non-legacy OS).

Is that easily possible?

> No. Code can run in either qemu or kvm, not both. You can switch between
> them based on access statistics (early versions of qemu-kvm did that,
> without the statistics part), but this isn't trivial.

Hmmm. OK, so 2 different opinions about the memory write performance: easy, or not possible?

Thanks for your help so far.
Ciao,
Gerhard
--
http://www.wiesinger.com/

kvm_stat: "Please mount debugfs ('mount -t debugfs debugfs /sys/kernel/debug') and ensure the kvm modules are loaded"

lsmod | grep -i kvm
kvm_amd    38276  0
kvm       162288  1 kvm_amd

mount | grep -i debug
=> mount -t debugfs debugfs /sys/kernel/debug

int10perf: INT10h performance tests:
kvm statistics
efer_reload                  0       0
exits                 37648629  456206
fpu_reload             8512535  455983
halt_exits                2084       0
halt_wakeup               2047       0
host_state_reload      8513213  456011
hypercalls                   0       0
insn_emulation        29182065       0
insn_emulation_fail          0       0
invlpg                       0       0
io_exits               8386082  455975
irq_exits                51713     214
irq_injections           21797      36
irq_window                   0       0
largepages                   0       0
mmio_exits              242781       0
mmu_cache_miss             150       0
mmu_flooded                  0       0
mmu_pde_zapped               0       0
mmu_pte_updated              0       0
mmu_pte_write             8192       0
mmu_recycled                 0       0
mmu_shadow_zapped          151       0
mmu_unsync                   0       0
mmu_unsync_global            0       0
nmi_injections               0       0
nmi_window                   0       0
pf_fixed                 16935       0
pf_guest                     0       0
remote_tlb_flush             2       0
request_irq                  0       0
request_nmi                  0       0
signal_exits                 1       0
tlb_flush                 2251       0

Running VGA memory tests in the same VGA page in video mode VESA 101h:
kvm statistics
efer_reload                  0       0
exits                 18470836  554582
fpu_reload            21478333     469
halt_exits                2083       0
halt_wakeup               2047       0
host_state_reload     21481863     470
hypercalls                   0       0
insn_emulation         7688203  554244
insn_emulation_fail          0       0
invlpg                       0       0
io_exits              10701583      18
irq_exits                50781     321
irq_injections           25251      18
irq_window                   0       0
largepages                   0       0
mmio_exits             1628473     241
mmu_cache_miss             154       0
mmu_flooded                  0       0
mmu_pde_zapped               0       0
mmu_pte_updated              0       0
mmu_pte_write             8192       0
mmu_recycled                 0       0
mmu_shadow_zapped          155       0
mmu_unsync                   0       0
mmu_unsync_global            0       0
nmi_injections               0       0
nmi_window                   0       0
pf_fixed                 16936       0
pf_guest                     0       0
remote_tlb_flush             5       0
request_irq                  0       0
request_nmi                  0
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote:
> On 04/21/2010 06:41 PM, Alexander Graf wrote:
>> On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote:
>>> On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote:
>>>> @@ -318,7 +318,7 @@ struct kvm_dirty_log {
>>>>  	__u32 padding1;
>>>>  	union {
>>>>  		void __user *dirty_bitmap; /* one bit per page */
>>>> -		__u64 padding2;
>>>> +		__u64 addr;
>>>
>>> This can break on x86_32 and x86_64-compat. addr is a long, not a __u64.
>>
>> So the high 32 bits are zero. Where's the problem?
>
> If we are careful enough to cast the addr appropriately we should be
> fine, even if we keep the padding field in the union. I am not saying
> that it breaks 32-bit architectures, but that it can potentially be
> problematic.
>
>>>> +	case KVM_SWITCH_DIRTY_LOG: {
>>>> +		struct kvm_dirty_log log;
>>>> +
>>>> +		r = -EFAULT;
>>>> +		if (copy_from_user(&log, argp, sizeof log))
>>>> +			goto out;
>>>> +		r = kvm_vm_ioctl_switch_dirty_log(kvm, &log);
>>>> +		if (r)
>>>> +			goto out;
>>>> +		r = -EFAULT;
>>>> +		if (copy_to_user(argp, &log, sizeof log))
>>>> +			goto out;
>>>> +		r = 0;
>>>> +		break;
>>>> +	}
>>>
>>> In x86_64-compat mode we are handling 32-bit user-space addresses, so
>>> we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too.
>>
>> The compat code just forwards everything to the generic ioctls.
>
> The compat code uses struct compat_kvm_dirty_log instead of struct
> kvm_dirty_log to communicate with user space, so the necessary
> conversions need to be done before invoking the generic ioctl (see
> KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl). By the way, we probably
> should move the definition of struct compat_kvm_dirty_log to a header
> file.

It seems that it was you and Arnd who added the kvm_vm compat ioctl :-). Are you considering a different approach to tackle the issues that we have with a big-endian userspace?

Thanks,
Fernando
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Re: QEMU-KVM and video performance
On Wed, 21 Apr 2010, Jamie Lokier wrote:
> Gerhard Wiesinger wrote:
>> Hmmm. I'm very new to QEMU and KVM, but at least accessing the virtual
>> HW of QEMU even from KVM must be possible (e.g. memory and port accesses
>> are done on nearly every virtual device), and therefore I end up in C
>> code in the QEMU hw/*.c directory. Therefore the VGA memory area should
>> also be accessible from KVM, but with the specialized and fast memory
>> access of QEMU. Am I missing something?
>
> What you're missing is that when KVM calls out to QEMU to handle hw/*.c
> traps, that call is very slow. It's because the hardware-VM support is a
> bit slow when the trap happens, and then the call from KVM in the kernel
> up to QEMU is a bit slow again. Then all the way back. It adds up to a
> lot, for every I/O operation.

Isn't that then a problem of KVM virtualization (or hardware virtualization) in general? Is this CPU dependent (AMD vs. Intel)?

> When QEMU does the same thing, it's fast because it's inside the same
> process; it's just a function call.

Yes, that's clear to me.

> That's why the most often called devices are emulated separately in
> KVM's kernel code, things like the interrupt controller, timer chip etc.
> It's also why individual instructions that need help are emulated in
> KVM's kernel code, instead of passing control up to QEMU just for one
> instruction.
>
>> BTW: It's still not clear why performance is low with KVM, since there
>> are no window changes involved in the testcase which could cause a
>> (slow) page fault.
>
> It sounds like a bug. Avi gave suggestions about what to look for. If it
> fixes my OS install speeds too, I'll be very happy :-)

See my other post for details.

> In 256-colour mode, KVM should be writing to the VGA memory at high
> speed a lot like normal RAM, not trapping at the hardware-VM level, and
> not calling up to the code in hw/*.c for every byte.

Yes, same picture to me: 256 color mode should be only a memory write (16 color mode is more difficult, as the pixel/byte mapping is not the same).
> But it looks like this isn't the case in this test scenario. You might
> double-check if your guest is using VGA Mode X. (See Wikipedia.)

Code:
  inregs.x.ax = 0x4F02;
  inregs.x.bx = 0xC000 | 0x101; /* bh = bit 15=0 (clear), bit 14=0 (windowed) */
  int86x(INT_SCREEN, &inregs, &outregs, &outsregs); /* Call INT 10h */

I can post the whole code/exes if you want (I already planned to post my whole tool set, but I have to do some cleanups before I publish the whole package).

> That was a way to accelerate VGA on real PCs, but it will be slow in KVM
> for the same reasons as 16-colour mode.

Which way do you mean?

Thanks.

Ciao,
Gerhard
--
http://www.wiesinger.com/
[PATCH 0/10] KVM MMU: allow more shadow pages to become asynchronous
In the current code, a shadow page can become asynchronous only if it is the only shadow page for its gfn. This rule is too strict; in fact, we can let every last-level mapping page (i.e. a pte page) become unsync, and sync them at invlpg or TLB flush time.

Following this idea, a gfn may have many shadow pages. For performance reasons, when one sp needs to be synced, we just write-protect sp->gfn and sync this sp, but keep the other shadow pages asynchronous; i.e. an sp is only synced at its own invlpg or TLB flush time.

Patches 1 ~ 6 are bugfixes/cleanups/optimizations for the current code.
Patches 7 ~ 10 implement this idea.
[PATCH 1/10] KVM MMU: fix for calculating gpa in invlpg code
If the guest is 32-bit, we should use 'quadrant' to adjust the gpa offset.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/paging_tmpl.h | 7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d0cc07e..46d80d6 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -478,9 +478,14 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
 		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
 			struct kvm_mmu_page *sp = page_header(__pa(sptep));
+			int offset = 0;
+
+			if (PTTYPE == 32)
+				offset = sp->role.quadrant << PT64_LEVEL_BITS;
 
 			pte_gpa = (sp->gfn << PAGE_SHIFT);
-			pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
+			pte_gpa += (sptep - sp->spt + offset) *
+					sizeof(pt_element_t);
 
 			if (is_shadow_present_pte(*sptep)) {
 				rmap_remove(vcpu->kvm, sptep);
--
1.6.1.2
[PATCH 2/10] KVM MMU: convert mmu tracepoints
Convert the mmu tracepoints to use DECLARE_EVENT_CLASS.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmutrace.h | 69 +++++++++++++++-------------------------
 1 files changed, 26 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
index bc4f7f0..d860a03 100644
--- a/arch/x86/kvm/mmutrace.h
+++ b/arch/x86/kvm/mmutrace.h
@@ -92,15 +92,15 @@ TRACE_EVENT(
 	TP_printk("pte %llx level %u", __entry->pte, __entry->level)
 );
 
-/* We set a pte accessed bit */
-TRACE_EVENT(
-	kvm_mmu_set_accessed_bit,
+DECLARE_EVENT_CLASS(kvm_mmu_set_bit_class,
 
 	TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
 
 	TP_ARGS(table_gfn, index, size),
 
 	TP_STRUCT__entry(
 		__field(__u64, gpa)
-	),
+	),
 
 	TP_fast_assign(
 		__entry->gpa = ((u64)table_gfn << PAGE_SHIFT)
@@ -110,22 +110,20 @@ TRACE_EVENT(
 	TP_printk("gpa %llx", __entry->gpa)
 );
 
-/* We set a pte dirty bit */
-TRACE_EVENT(
-	kvm_mmu_set_dirty_bit,
+/* We set a pte accessed bit */
+DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_accessed_bit,
 
 	TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
 
-	TP_ARGS(table_gfn, index, size),
-
-	TP_STRUCT__entry(
-		__field(__u64, gpa)
-	),
+	TP_ARGS(table_gfn, index, size)
+);
 
-	TP_fast_assign(
-		__entry->gpa = ((u64)table_gfn << PAGE_SHIFT)
-			+ index * size;
-	),
+/* We set a pte dirty bit */
+DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_dirty_bit,
 
-	TP_printk("gpa %llx", __entry->gpa)
+	TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
+
+	TP_ARGS(table_gfn, index, size)
 );
 
 TRACE_EVENT(
@@ -164,54 +162,39 @@ TRACE_EVENT(
 		__entry->created ? "new" : "existing")
 );
 
-TRACE_EVENT(
-	kvm_mmu_sync_page,
+DECLARE_EVENT_CLASS(kvm_mmu_page_class,
 
 	TP_PROTO(struct kvm_mmu_page *sp),
 
 	TP_ARGS(sp),
 
 	TP_STRUCT__entry(
 		KVM_MMU_PAGE_FIELDS
-	),
+	),
 
 	TP_fast_assign(
 		KVM_MMU_PAGE_ASSIGN(sp)
-	),
+	),
 
 	TP_printk("%s", KVM_MMU_PAGE_PRINTK())
 );
 
-TRACE_EVENT(
-	kvm_mmu_unsync_page,
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_sync_page,
 
 	TP_PROTO(struct kvm_mmu_page *sp),
 
-	TP_ARGS(sp),
-
-	TP_STRUCT__entry(
-		KVM_MMU_PAGE_FIELDS
-	),
-
-	TP_fast_assign(
-		KVM_MMU_PAGE_ASSIGN(sp)
-	),
-
-	TP_printk("%s", KVM_MMU_PAGE_PRINTK())
+	TP_ARGS(sp)
 );
 
-TRACE_EVENT(
-	kvm_mmu_zap_page,
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_unsync_page,
 
 	TP_PROTO(struct kvm_mmu_page *sp),
 
-	TP_ARGS(sp),
-
-	TP_STRUCT__entry(
-		KVM_MMU_PAGE_FIELDS
-	),
+	TP_ARGS(sp)
+);
 
-	TP_fast_assign(
-		KVM_MMU_PAGE_ASSIGN(sp)
-	),
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_zap_page,
+
+	TP_PROTO(struct kvm_mmu_page *sp),
 
-	TP_printk("%s", KVM_MMU_PAGE_PRINTK())
+	TP_ARGS(sp)
 );
-
 #endif /* _TRACE_KVMMMU_H */
 
 #undef TRACE_INCLUDE_PATH
--
1.6.1.2
[PATCH 3/10] KVM MMU: move unsync/sync tracepoints to proper place
Move the unsync/sync tracepoints to the proper place; this lets us observe the live time of an unsync page.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..abf8bd4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1189,6 +1189,7 @@ static struct kvm_mmu_page *kvm_mmu_lookup_page(struct kvm *kvm, gfn_t gfn)
 static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	WARN_ON(!sp->unsync);
+	trace_kvm_mmu_sync_page(sp);
 	sp->unsync = 0;
 	--kvm->stat.mmu_unsync;
 }
@@ -1202,7 +1203,6 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		return 1;
 	}
 
-	trace_kvm_mmu_sync_page(sp);
 	if (rmap_write_protect(vcpu->kvm, sp->gfn))
 		kvm_flush_remote_tlbs(vcpu->kvm);
 	kvm_unlink_unsync_page(vcpu->kvm, sp);
@@ -1730,7 +1730,6 @@ static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 	struct kvm_mmu_page *s;
 	struct hlist_node *node, *n;
 
-	trace_kvm_mmu_unsync_page(sp);
 	index = kvm_page_table_hashfn(sp->gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
 	/* don't unsync if pagetable is shadowed with multiple roles */
@@ -1740,6 +1739,7 @@ static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		if (s->role.word != sp->role.word)
 			return 1;
 	}
+	trace_kvm_mmu_unsync_page(sp);
 	++vcpu->kvm->stat.mmu_unsync;
 	sp->unsync = 1;
--
1.6.1.2
[PATCH 4/10] KVM MMU: Move invlpg code out of paging_tmpl.h
Using '!sp->role.cr4_pae' replaces 'PTTYPE == 32', and using 'pte_size = sp->role.cr4_pae ? 8 : 4' replaces sizeof(pt_element_t). Then there is no need to compile this code twice.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c         | 60 ++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/paging_tmpl.h | 56 ---------------------------------
 2 files changed, 58 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index abf8bd4..fac7c09 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2256,6 +2256,62 @@ static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
 	return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
 }
 
+static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
+{
+	struct kvm_shadow_walk_iterator iterator;
+	gpa_t pte_gpa = -1;
+	int level;
+	u64 *sptep;
+	int need_flush = 0;
+	unsigned pte_size = 0;
+
+	spin_lock(&vcpu->kvm->mmu_lock);
+
+	for_each_shadow_entry(vcpu, gva, iterator) {
+		level = iterator.level;
+		sptep = iterator.sptep;
+
+		if (level == PT_PAGE_TABLE_LEVEL ||
+		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
+		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
+			struct kvm_mmu_page *sp = page_header(__pa(sptep));
+			int offset = 0;
+
+			if (!sp->role.cr4_pae)
+				offset = sp->role.quadrant << PT64_LEVEL_BITS;
+			pte_size = sp->role.cr4_pae ? 8 : 4;
+			pte_gpa = (sp->gfn << PAGE_SHIFT);
+			pte_gpa += (sptep - sp->spt + offset) * pte_size;
+
+			if (is_shadow_present_pte(*sptep)) {
+				rmap_remove(vcpu->kvm, sptep);
+				if (is_large_pte(*sptep))
+					--vcpu->kvm->stat.lpages;
+				need_flush = 1;
+			}
+			__set_spte(sptep, shadow_trap_nonpresent_pte);
+			break;
+		}
+
+		if (!is_shadow_present_pte(*sptep))
+			break;
+	}
+
+	if (need_flush)
+		kvm_flush_remote_tlbs(vcpu->kvm);
+
+	atomic_inc(&vcpu->kvm->arch.invlpg_counter);
+
+	spin_unlock(&vcpu->kvm->mmu_lock);
+
+	if (pte_gpa == -1)
+		return;
+
+	if (mmu_topup_memory_caches(vcpu))
+		return;
+	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, pte_size, 0);
+}
+
 #define PTTYPE 64
 #include "paging_tmpl.h"
 #undef PTTYPE
@@ -2335,7 +2391,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
 	context->gva_to_gpa = paging64_gva_to_gpa;
 	context->prefetch_page = paging64_prefetch_page;
 	context->sync_page = paging64_sync_page;
-	context->invlpg = paging64_invlpg;
+	context->invlpg = paging_invlpg;
 	context->free = paging_free;
 	context->root_level = level;
 	context->shadow_root_level = level;
@@ -2360,7 +2416,7 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
 	context->free = paging_free;
 	context->prefetch_page = paging32_prefetch_page;
 	context->sync_page = paging32_sync_page;
-	context->invlpg = paging32_invlpg;
+	context->invlpg = paging_invlpg;
 	context->root_level = PT32_ROOT_LEVEL;
 	context->shadow_root_level = PT32E_ROOT_LEVEL;
 	context->root_hpa = INVALID_PAGE;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 46d80d6..d0df9cd 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -460,62 +460,6 @@ out_unlock:
 	return 0;
 }
 
-static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
-{
-	struct kvm_shadow_walk_iterator iterator;
-	gpa_t pte_gpa = -1;
-	int level;
-	u64 *sptep;
-	int need_flush = 0;
-
-	spin_lock(&vcpu->kvm->mmu_lock);
-
-	for_each_shadow_entry(vcpu, gva, iterator) {
-		level = iterator.level;
-		sptep = iterator.sptep;
-
-		if (level == PT_PAGE_TABLE_LEVEL ||
-		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
-		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
-			struct kvm_mmu_page *sp = page_header(__pa(sptep));
-			int offset = 0;
-
-			if (PTTYPE == 32)
-				offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
-			pte_gpa = (sp->gfn << PAGE_SHIFT);
-			pte_gpa += (sptep - sp->spt + offset) *
-					sizeof(pt_element_t);
-
-			if (is_shadow_present_pte(*sptep)) {
-				rmap_remove(vcpu->kvm, sptep);
[PATCH 5/10] KVM MMU: cleanup invlpg code
Using is_last_spte() to clean up the invlpg code.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fac7c09..fd027a6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2271,9 +2271,7 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 		level = iterator.level;
 		sptep = iterator.sptep;
 
-		if (level == PT_PAGE_TABLE_LEVEL ||
-		    ((level == PT_DIRECTORY_LEVEL && is_large_pte(*sptep))) ||
-		    ((level == PT_PDPE_LEVEL && is_large_pte(*sptep)))) {
+		if (is_last_spte(*sptep, level)) {
 			struct kvm_mmu_page *sp = page_header(__pa(sptep));
 			int offset = 0;
--
1.6.1.2
[PATCH 6/10] KVM MMU: don't write-protect if there is a new mapping to an unsync page
If there is a new mapping to an unsync page (i.e. a new parent is added), just update the page from sp->gfn but do not write-protect gfn; and if we need to create a new shadow page for sp->gfn, we should sync it.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 27 +++++++++++++++++++--------
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fd027a6..8607a64 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1196,16 +1196,20 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
-static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			 bool clear_unsync)
 {
 	if (sp->role.cr4_pae != !!is_pae(vcpu)) {
 		kvm_mmu_zap_page(vcpu->kvm, sp);
 		return 1;
 	}
 
-	if (rmap_write_protect(vcpu->kvm, sp->gfn))
-		kvm_flush_remote_tlbs(vcpu->kvm);
-	kvm_unlink_unsync_page(vcpu->kvm, sp);
+	if (clear_unsync) {
+		if (rmap_write_protect(vcpu->kvm, sp->gfn))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		kvm_unlink_unsync_page(vcpu->kvm, sp);
+	}
+
 	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
 		kvm_mmu_zap_page(vcpu->kvm, sp);
 		return 1;
@@ -1293,7 +1297,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			kvm_flush_remote_tlbs(vcpu->kvm);
 
 		for_each_sp(pages, sp, parents, i) {
-			kvm_sync_page(vcpu, sp);
+			kvm_sync_page(vcpu, sp, true);
 			mmu_pages_clear_parents(&parents);
 		}
 		cond_resched_lock(&vcpu->kvm->mmu_lock);
@@ -1313,7 +1317,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	unsigned index;
 	unsigned quadrant;
 	struct hlist_head *bucket;
-	struct kvm_mmu_page *sp;
+	struct kvm_mmu_page *sp, *unsync_sp = NULL;
 	struct hlist_node *node, *tmp;
 
 	role = vcpu->arch.mmu.base_role;
@@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
 		if (sp->gfn == gfn) {
 			if (sp->unsync)
-				if (kvm_sync_page(vcpu, sp))
-					continue;
+				unsync_sp = sp;
 
 			if (sp->role.word != role.word)
 				continue;
 
+			if (unsync_sp && kvm_sync_page(vcpu, unsync_sp, false)) {
+				unsync_sp = NULL;
+				continue;
+			}
+
 			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 			if (sp->unsync_children) {
 				set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
@@ -1346,6 +1354,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			trace_kvm_mmu_get_page(sp, false);
 			return sp;
 		}
+	if (unsync_sp)
+		kvm_sync_page(vcpu, unsync_sp, true);
+
 	++vcpu->kvm->stat.mmu_cache_miss;
 	sp = kvm_mmu_alloc_page(vcpu, parent_pte);
 	if (!sp)
--
1.6.1.2
[PATCH 7/10] KVM MMU: allow more pages to become unsync at gfn mapping time
In the current code, a shadow page can become asynchronous only if it is the only shadow page for its gfn. This rule is too strict; in fact, we can let every last-level mapping page (i.e. a pte page) become unsync, and sync them at invlpg or TLB flush time. This patch allows more pages to become asynchronous at gfn mapping time.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 81 +++++++++++++++++++++-------------------------
 1 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8607a64..13378e7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1166,26 +1166,6 @@ static int mmu_unsync_walk(struct kvm_mmu_page *sp,
 	return __mmu_unsync_walk(sp, pvec);
 }
 
-static struct kvm_mmu_page *kvm_mmu_lookup_page(struct kvm *kvm, gfn_t gfn)
-{
-	unsigned index;
-	struct hlist_head *bucket;
-	struct kvm_mmu_page *sp;
-	struct hlist_node *node;
-
-	pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
-	index = kvm_page_table_hashfn(gfn);
-	bucket = &kvm->arch.mmu_page_hash[index];
-	hlist_for_each_entry(sp, node, bucket, hash_link)
-		if (sp->gfn == gfn && !sp->role.direct
-		    && !sp->role.invalid) {
-			pgprintk("%s: found role %x\n",
-				 __func__, sp->role.word);
-			return sp;
-		}
-	return NULL;
-}
-
 static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	WARN_ON(!sp->unsync);
@@ -1734,47 +1714,60 @@ u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_get_guest_memory_type);
 
-static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static void __kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	trace_kvm_mmu_unsync_page(sp);
+	++vcpu->kvm->stat.mmu_unsync;
+	sp->unsync = 1;
+
+	kvm_mmu_mark_parents_unsync(sp);
+	mmu_convert_notrap(sp);
+}
+
+static void kvm_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	unsigned index;
 	struct hlist_head *bucket;
 	struct kvm_mmu_page *s;
 	struct hlist_node *node, *n;
+	unsigned index;
 
-	index = kvm_page_table_hashfn(sp->gfn);
+	index = kvm_page_table_hashfn(gfn);
 	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-	/* don't unsync if pagetable is shadowed with multiple roles */
+
 	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
-		if (s->gfn != sp->gfn || s->role.direct)
+		if (s->gfn != gfn || s->role.direct || s->unsync)
 			continue;
-		if (s->role.word != sp->role.word)
-			return 1;
+		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
+		__kvm_unsync_page(vcpu, s);
 	}
-	trace_kvm_mmu_unsync_page(sp);
-	++vcpu->kvm->stat.mmu_unsync;
-	sp->unsync = 1;
-
-	kvm_mmu_mark_parents_unsync(sp);
-
-	mmu_convert_notrap(sp);
-	return 0;
 }
 
 static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 				  bool can_unsync)
 {
-	struct kvm_mmu_page *shadow;
+	unsigned index;
+	struct hlist_head *bucket;
+	struct kvm_mmu_page *s;
+	struct hlist_node *node, *n;
+	bool need_unsync = false;
+
+	index = kvm_page_table_hashfn(gfn);
+	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
+	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
+		if (s->gfn != gfn || s->role.direct)
+			continue;
 
-	shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn);
-	if (shadow) {
-		if (shadow->role.level != PT_PAGE_TABLE_LEVEL)
+		if (s->role.level != PT_PAGE_TABLE_LEVEL)
 			return 1;
-		if (shadow->unsync)
-			return 0;
-		if (can_unsync && oos_shadow)
-			return kvm_unsync_page(vcpu, shadow);
-		return 1;
+
+		if (!need_unsync && !s->unsync) {
+			if (!can_unsync || !oos_shadow)
+				return 1;
+			need_unsync = true;
+		}
 	}
+	if (need_unsync)
+		kvm_unsync_pages(vcpu, gfn);
 	return 0;
 }
--
1.6.1.2
[PATCH 8/10] KVM MMU: allow more pages to become unsync at getting sp time
Allow more pages to become asynchronous at kvm_mmu_get_page() time: if we need to create a new shadow page for a gfn at a level that does not allow unsync (above the last level), we should sync all of that gfn's unsync pages.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 13378e7..e0bb4d8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1199,6 +1199,23 @@ static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	return 0;
 }
 
+static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	struct hlist_head *bucket;
+	struct kvm_mmu_page *s;
+	struct hlist_node *node, *n;
+	unsigned index;
+
+	index = kvm_page_table_hashfn(gfn);
+	bucket = &vcpu->kvm->arch.mmu_page_hash[index];
+	hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
+		if (s->gfn != gfn || !s->unsync)
+			continue;
+		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
+		kvm_sync_page(vcpu, s, true);
+	}
+}
+
 struct mmu_page_path {
 	struct kvm_mmu_page *parent[PT64_ROOT_LEVEL-1];
 	unsigned int idx[PT64_ROOT_LEVEL-1];
@@ -1334,8 +1351,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			trace_kvm_mmu_get_page(sp, false);
 			return sp;
 		}
-	if (unsync_sp)
-		kvm_sync_page(vcpu, unsync_sp, true);
+
+	if (!direct && level > PT_PAGE_TABLE_LEVEL && unsync_sp)
+		kvm_sync_pages(vcpu, gfn);
 
 	++vcpu->kvm->stat.mmu_cache_miss;
 	sp = kvm_mmu_alloc_page(vcpu, parent_pte);
--
1.6.1.2
[PATCH 9/10] KVM MMU: separate invlpg code from kvm_mmu_pte_write()
Let invlpg not depend on the kvm_mmu_pte_write path; a later patch will need this feature.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 40 ++++++++++++++++++++++++----------------
 1 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e0bb4d8..f092e71 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2278,14 +2278,21 @@ static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
 	return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
 }
 
+static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+					  u64 gpte);
+static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
+				  struct kvm_mmu_page *sp,
+				  u64 *spte,
+				  const void *new);
+
 static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 {
+	struct kvm_mmu_page *sp = NULL;
 	struct kvm_shadow_walk_iterator iterator;
-	gpa_t pte_gpa = -1;
-	int level;
-	u64 *sptep;
-	int need_flush = 0;
+	gfn_t gfn = -1;
+	u64 *sptep = NULL, gentry;
 	unsigned pte_size = 0;
+	int invlpg_counter, level, offset = 0, need_flush = 0;
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 
@@ -2294,14 +2301,14 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 		sptep = iterator.sptep;
 
 		if (is_last_spte(*sptep, level)) {
-			struct kvm_mmu_page *sp = page_header(__pa(sptep));
-			int offset = 0;
+
+			sp = page_header(__pa(sptep));
 
 			if (!sp->role.cr4_pae)
 				offset = sp->role.quadrant << PT64_LEVEL_BITS;
 			pte_size = sp->role.cr4_pae ? 8 : 4;
-			pte_gpa = (sp->gfn << PAGE_SHIFT);
-			pte_gpa += (sptep - sp->spt + offset) * pte_size;
+			gfn = (sp->gfn << PAGE_SHIFT);
+			offset = (sptep - sp->spt + offset) * pte_size;
 
 			if (is_shadow_present_pte(*sptep)) {
 				rmap_remove(vcpu->kvm, sptep);
@@ -2320,16 +2327,22 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 	if (need_flush)
 		kvm_flush_remote_tlbs(vcpu->kvm);
 
-	atomic_inc(&vcpu->kvm->arch.invlpg_counter);
+	invlpg_counter = atomic_add_return(1, &vcpu->kvm->arch.invlpg_counter);
 
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
-	if (pte_gpa == -1)
+	if (gfn == -1)
 		return;
 
 	if (mmu_topup_memory_caches(vcpu))
 		return;
-	kvm_mmu_pte_write(vcpu, pte_gpa, NULL, pte_size, 0);
+
+	kvm_read_guest_page(vcpu->kvm, gfn, &gentry, offset, pte_size);
+	mmu_guess_page_from_pte_write(vcpu, gfn_to_gpa(gfn) + offset, gentry);
+	spin_lock(&vcpu->kvm->mmu_lock);
+	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) == invlpg_counter)
+		mmu_pte_write_new_pte(vcpu, sp, sptep, &gentry);
+	spin_unlock(&vcpu->kvm->mmu_lock);
 }
 
 #define PTTYPE 64
@@ -2675,12 +2688,9 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	int flooded = 0;
 	int npte;
 	int r;
-	int invlpg_counter;
 
 	pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
 
-	invlpg_counter = atomic_read(&vcpu->kvm->arch.invlpg_counter);
-
 	/*
 	 * Assume that the pte write on a page table of the same type
 	 * as the current vcpu paging mode. This is nearly always true
@@ -2713,8 +2723,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	mmu_guess_page_from_pte_write(vcpu, gpa, gentry);
 	spin_lock(&vcpu->kvm->mmu_lock);
-	if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
-		gentry = 0;
 	kvm_mmu_access_page(vcpu, gfn);
 	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
--
1.6.1.2
[PATCH 10/10] KVM MMU: optimize sync/update unsync-page
invlpg only needs to update unsync pages; sp->unsync and sp->unsync_children can help us find them.

Now a gfn may have many shadow pages: when one sp needs to be synced, we write-protect sp->gfn and sync this sp, but keep the other shadow pages asynchronous. So, when a page fault happens on the gfn, do not touch unsync pages; unsync pages are only updated at invlpg/TLB flush time.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f092e71..5bdcc17 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2299,10 +2299,11 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 	for_each_shadow_entry(vcpu, gva, iterator) {
 		level = iterator.level;
 		sptep = iterator.sptep;
+		sp = page_header(__pa(sptep));
 
 		if (is_last_spte(*sptep, level)) {
-
-			sp = page_header(__pa(sptep));
+			if (!sp->unsync)
+				break;
 
 			if (!sp->role.cr4_pae)
 				offset = sp->role.quadrant << PT64_LEVEL_BITS;
@@ -2320,7 +2321,7 @@ static void paging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
 			break;
 		}
 
-		if (!is_shadow_present_pte(*sptep))
+		if (!is_shadow_present_pte(*sptep) || !sp->unsync_children)
 			break;
 	}
 
@@ -2744,7 +2745,8 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 
 restart:
 	hlist_for_each_entry_safe(sp, node, n, bucket, hash_link) {
-		if (sp->gfn != gfn || sp->role.direct || sp->role.invalid)
+		if (sp->gfn != gfn || sp->role.direct || sp->role.invalid ||
+		    sp->unsync)
 			continue;
 		pte_size = sp->role.cr4_pae ? 8 : 4;
 		misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
--
1.6.1.2
Re: KVM Page Fault Question
On 04/22/2010 08:26 AM, Marek Olszewski wrote: Under VMX without EPT, I am not seeing any VM exits due to task switches. Is there a way to enable these? I'm looking to intercept the guest whenever it does an iret. See EXIT_REASON_TASK_SWITCH. However, that won't fire on every iret, only on irets that generate task switches. You can ask for exits on irets by setting CPU_BASED_VIRTUAL_NMI_PENDING and GUEST_INTR_STATE_NMI, and looking for EXIT_REASON_NMI_WINDOW. -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Re: QEMU-KVM and video performance
On 04/22/2010 08:37 AM, Gerhard Wiesinger wrote: On Wed, 21 Apr 2010, Avi Kivity wrote: On 04/21/2010 09:14 PM, Gerhard Wiesinger wrote: Can you explain which code files/functions of KVM are involved in handling the VGA memory window and the page switching through the port write to the VGA window register (or is that part handled through QEMU)? A little bit of architecture explanation would be nice. qemu hw/vga.c and hw/cirrus_vga.c. Boring functions like vbe_ioport_write_data() and vga_ioport_write(). Yes, I was already in that code part, and those are very simple functions, as already explained, and are therefore very fast in QEMU alone. But I meant: what is the calling path from the KVM guest OS to hw/vga.c for memory and I/O accesses, and which parts are done directly in hardware (to understand the speed gap and maybe to find a solution)? The speed gap is mostly due to hardware constraints (it takes ~2000 cycles for an exit from guest mode, plus we need to switch a few MSRs to get to userspace). See vmx_vcpu_run(); the vmresume instruction is where an exit starts. BTW: in which KVM code parts is it decided whether direct code or emulated device code is used? Same place. Look for calls to cpu_register_physical_memory(). If the last argument was obtained by a call to cpu_register_io_memory(), then all writes trap. Otherwise, it was obtained by qemu_ram_alloc() and writes will not trap (except the first write to a page in a 30ms window, used to note that the page is dirty and needs redrawing). Ok, that finally ends in cpu_register_physical_memory_offset():

    ... // 0.12.3
    if (kvm_enabled())
        kvm_set_phys_mem(start_addr, size, phys_offset); // KVM
    cpu_notify_set_memory(start_addr, size, phys_offset);
    ...

I/O is always done through cpu_register_io_memory() -> cpu_register_io_memory_fixed(): ... No call to KVM? kvm_set_phys_mem() is a call to kvm. ... Where is the trap from KVM to QEMU? See kvm_cpu_exec().
Re: [Qemu-devel] Re: QEMU-KVM and video performance
On 04/22/2010 09:04 AM, Gerhard Wiesinger wrote: On Wed, 21 Apr 2010, Avi Kivity wrote: On 04/21/2010 09:50 PM, Gerhard Wiesinger wrote: I don't think changing the VGA window is a problem because there are 500.000-1Mio changes/s possible. 1MB/s, 500k-1M changes/s. Coincidence? Is it taking a page fault or trap on every write? To clarify: memory performance writing to segment A000 is about 1MB/s. That indicates a fault on every write (assuming 8-16 bit writes). If you're using 256 color vga and not switching banks, this indicates a bug. Yes, 256 color VGA and no bank switches involved. Calling the INT 10 set/get window function with different windows (e.g. toggling between window page 0 and 1) is about 500.000 to 1Mio function calls per second. That's surprisingly fast. I'd expect 100-200k/sec. Sorry, I mixed up the numbers: 1.) QEMU-KVM: ~111k 2.) QEMU only: 500k-1Mio Please run kvm_stat and report output for both tests to confirm. See below. 2nd column is the per-second statistic when running the test.

    efer_reload          0         0
    exits                18470836  554582
    fpu_reload           2147833   3469
    halt_exits           2083      0
    halt_wakeup          2047      0
    host_state_reload    2148186   3470
    hypercalls           0         0
    insn_emulation       7688203   554244

This indicates that kvm is emulating instead of direct mapping. That's probably a bug. If you fix it, performance will increase dramatically. To get really good VGA performance both parameters should be: about 50MB/s for writes to segment A000, and ~500.000 bank switches per second. The first should be doable easily; the second is borderline. I think this is very easy to distinguish: 1.) VGA segment A000 is legacy and should be handled through QEMU and not through KVM (because it is much faster). 16 color modes should also be fast enough there. 2.) All other flat PCI memory accesses should be handled through KVM (there is a specialized driver loaded for that PCI device in the non-legacy OS). Is that easily possible? No. Code can run in either qemu or kvm, not both.
You can switch between them based on access statistics (early versions of qemu-kvm did that, without the statistics part), but this isn't trivial. Hmmm. Ok, 2 different opinions about the memory write performance: easily or not possible? Switching between tcg and kvm is hard, but not needed. For 256 color modes, direct map is possible and should yield good performance. Bank switching can be improved perhaps 3x, but will never be fast.
Re: [PATCH] KVM: Document mmu
Avi Kivity wrote:

+Translation
+===========
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest.  Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+  host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+  guest virtual addresses, to host physical addresses (gva->gpa->hpa)

I think you meant 'to guest physical addresses' here.
Re: [PATCH] KVM: Document mmu
On 04/22/2010 10:13 AM, Karl Vogel wrote: Avi Kivity wrote:

+Translation
+===========
+
+The primary job of the mmu is to program the processor's mmu to translate
+addresses for the guest.  Different translations are required at different
+times:
+
+- when guest paging is disabled, we translate guest physical addresses to
+  host physical addresses (gpa->hpa)
+- when guest paging is enabled, we translate guest virtual addresses, to
+  guest virtual addresses, to host physical addresses (gva->gpa->hpa)

I think you meant 'to guest physical addresses' here.

Fixed.
Re: Timedrift in KVM guests after livemigration.
On Sunday 18 April 2010 11:33:44 Espen Berg wrote: All guests are Debian lenny with the latest upstream kernel, hvm/kvm. We are using kvm-clock as the guest clock source. cat /sys/devices/system/clocksource/clocksource0/current_clocksource kvm-clock I had to deactivate C1E (AMD CPUs) and use the acpi clocksource (for both servers and VMs, IIRC). If you can, you should give it a try. After that, live migration worked somewhat stably. regards, thomas
Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
From: Xin Xiaohui xiaohui@intel.com

Add a device to utilize the vhost-net backend driver for copy-less data transfer between guest FE and host NIC. It pins the guest user space to the host memory and provides proto_ops as sendmsg/recvmsg to vhost-net.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
Signed-off-by: Zhao Yu yzha...@gmail.com
Reviewed-by: Jeff Dike jd...@linux.intel.com
---

Michael, Thanks. I have updated the patch with your suggestion. It looks much cleaner now. Please have a review.

Thanks
Xiaohui

 drivers/vhost/Kconfig     |   10 +
 drivers/vhost/Makefile    |    2 +
 drivers/vhost/mpassthru.c | 1239 +
 include/linux/mpassthru.h |   29 +
 4 files changed, 1280 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/mpassthru.c
 create mode 100644 include/linux/mpassthru.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9f409f4..91806b1 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
 	  To compile this driver as a module, choose M here: the module will
 	  be called vhost_net.

+config MEDIATE_PASSTHRU
+	tristate "mediate passthru network driver (EXPERIMENTAL)"
+	depends on VHOST_NET
+	---help---
+	  zerocopy network I/O support; we call it "mediate passthru" to
+	  distinguish it from hardware passthru.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
 obj-$(CONFIG_VHOST_NET) += vhost_net.o
 vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 000..cc99b14
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,1239 @@
+/*
+ *  MPASSTHRU - Mediate passthrough device.
+ *  Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME        "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT   "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG  if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN   (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+	u16	offset;
+	u16	size;
+};
+
+struct page_ctor {
+	struct list_head	readq;
+	int			w_len;
+	int			r_len;
+	spinlock_t		read_lock;
+	struct kmem_cache	*cache;
+	/* record the locked pages */
+	int			lock_pages;
+	struct rlimit		o_rlim;
+	struct net_device	*dev;
+	struct mpassthru_port	port;
+};
+
+struct page_info {
+	struct list_head	list;
+	int			header;
+	/* indicate the actual length of bytes
+	 * send/recv in the user space buffers
+	 */
+	int			total;
+	int			offset;
+	struct page		*pages[MAX_SKB_FRAGS+1];
+	struct skb_frag_struct	frag[MAX_SKB_FRAGS+1];
+	struct sk_buff		*skb;
+	struct page_ctor	*ctor;
+
+	/* The pointer relayed to skb, to indicate
+	 * it's a user space allocated skb or kernel
+	 */
+	struct
RE: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.
Michael, Sorry, it's based on the suggestion to hook an iocb completion callback to handle the iocb list in vhost-net.

Thanks
Xiaohui

-----Original Message-----
From: Xin, Xiaohui
Sent: Thursday, April 22, 2010 4:24 PM
To: m...@redhat.com
Cc: a...@arndb.de; net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; mi...@elte.hu; da...@davemloft.net; jd...@linux.intel.com; Xin, Xiaohui
Subject: Re:[RFC][PATCH v3 1/3] A device for zero-copy based on KVM virtio-net.

[snap -- the quoted patch is identical to the previous message]
Re:[RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.
From: Xin Xiaohui xiaohui@intel.com

The vhost-net backend now only supports synchronous send/recv operations. The patch provides multiple submits and asynchronous notifications. This is needed for the zero-copy case.

Signed-off-by: Xin Xiaohui xiaohui@intel.com
---

Michael,
Can't vhost supply a kiocb completion callback that will handle the list?
Yes, thanks. And with it I also remove the vq->receiver finally.

Thanks
Xiaohui

 drivers/vhost/net.c   | 227 +++--
 drivers/vhost/vhost.c | 115 ++---
 drivers/vhost/vhost.h |  14 +++
 3 files changed, 301 insertions(+), 55 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 22d5fef..4a70f66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -17,11 +17,13 @@
 #include <linux/workqueue.h>
 #include <linux/rcupdate.h>
 #include <linux/file.h>
+#include <linux/aio.h>

 #include <linux/net.h>
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/mpassthru.h>

 #include <net/sock.h>

@@ -47,6 +49,7 @@ struct vhost_net {
 	struct vhost_dev dev;
 	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
 	struct vhost_poll poll[VHOST_NET_VQ_MAX];
+	struct kmem_cache *cache;
 	/* Tells us whether we are polling a socket for TX.
 	 * We only do this when socket buffer fills up.
 	 * Protected by tx vq lock. */
@@ -91,11 +94,132 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
 	net->tx_poll_state = VHOST_NET_POLL_STARTED;
 }

+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	if (!list_empty(&vq->notifier)) {
+		iocb = list_first_entry(&vq->notifier,
+				struct kiocb, ki_list);
+		list_del(&iocb->ki_list);
+	}
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+	return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+	struct vhost_virtqueue *vq = iocb->private;
+	unsigned long flags;
+
+	spin_lock_irqsave(&vq->notify_lock, flags);
+	list_add_tail(&iocb->ki_list, &vq->notifier);
+	spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq,
+					  struct socket *sock)
+{
+	struct kiocb *iocb = NULL;
+	struct vhost_log *vq_log = NULL;
+	int rx_total_len = 0;
+	unsigned int head, log, in, out;
+	int size;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	if (sock->sk->sk_data_ready)
+		sock->sk->sk_data_ready(sock->sk, 0);
+
+	vq_log = unlikely(vhost_has_feature(
+			&net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL;
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, iocb->ki_nbytes);
+		log = (int)(iocb->ki_user_data >> 32);
+		size = iocb->ki_nbytes;
+		head = iocb->ki_pos;
+		rx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+		kmem_cache_free(net->cache, iocb);
+
+		/* when log is enabled, recomputing the log info is needed,
+		 * since these buffers are in async queue, and may not get
+		 * the log info before.
+		 */
+		if (unlikely(vq_log)) {
+			if (!log)
+				__vhost_get_vq_desc(&net->dev, vq, vq->iov,
+						    ARRAY_SIZE(vq->iov),
+						    &out, &in, vq_log,
+						    &log, &head);
+			vhost_log_write(vq, vq_log, log, size);
+		}
+		if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+			vhost_poll_queue(&vq->poll);
+			break;
+		}
+	}
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+					  struct vhost_virtqueue *vq)
+{
+	struct kiocb *iocb = NULL;
+	int tx_total_len = 0;
+
+	if (vq->link_state != VHOST_VQ_LINK_ASYNC)
+		return;
+
+	while ((iocb = notify_dequeue(vq)) != NULL) {
+		vhost_add_used_and_signal(&net->dev, vq,
+				iocb->ki_pos, 0);
+		tx_total_len += iocb->ki_nbytes;
+
+		if (iocb->ki_dtor)
+			iocb->ki_dtor(iocb);
+
+		kmem_cache_free(net->cache, iocb);
+		if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
Re: [Qemu-devel] [PATCH][STABLE] block: Free iovec arrays allocated by multiwrite_merge()
On 21.04.2010 21:35, Stefan Hajnoczi wrote: A new iovec array is allocated when creating a merged write request. This patch ensures that the iovec array is deleted in addition to its qiov owner. Reported-by: Leszek Urbanski tyg...@moo.pl Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com Acked-by: Kevin Wolf kw...@redhat.com Picked it up for my block branch, too, but I think this should be committed immediately by Aurelien. Kevin
Re: [PATCH RFC v2 3/6] KVM: introduce a wrapper function to copy dirty bitmaps to user space
(2010/04/21 20:12), Avi Kivity wrote: On 04/20/2010 01:59 PM, Takuya Yoshikawa wrote: We will replace copy_to_user() with copy_in_user() when we move the dirty bitmaps to user space. But sadly, we have copy_in_user() only for 64-bit architectures. So this function should work as a wrapper to hide the ifdefs from the outside. Once we get copy_in_user() for 32-bit architectures, we can remove this wrapper and use copy_in_user() directly. I prefer a generic copy_in_user() instead of having multiple paths in kvm. I do too :). I checked ppc64's copy_in_user(). It looks like it just uses __copy_tofrom_user(), and that is also implemented for ppc32, IIUC. So yes, we may not need to implement it ourselves! Just ask them to make it usable for 32-bit too. And about x86_32 copy_in_user(): they are using copy_user_generic(), which is implemented only for 64-bit. So I'll show them the current copy_from_user() and copy_to_user() version and see their response. If rejected, though I hope not, please give me another option, OK? -- To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The transmission/transaction protocol, and most of the control logic is implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing in the kernel side to guarantee replayability of certain events and instructions, integrate the RAS capabilities of newer x86 hardware with the HA stack, as well as for optimization purposes, for example. [snap] The rest of this message describes TODO lists grouped by each topic. === event tapping === Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP. IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and the packet that the master node sent reached some remote client, and before the sync to the slave the master failed? [snap] === clock === Since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives. Why do you specifically care about the tsc sync?
When you sync all the IO model on snapshot it also synchronizes the tsc. In general, can you please explain the 'algorithm' for continuous snapshots (is that what you'd like to do?). A trivial one would be to:
- do X online snapshots/sec
- stall all I/O (disk/block) from the guest to the outside world until the previous snapshot reaches the slave
- snapshots are made of:
  - diff of dirty pages from the last snapshot
  - Qemu device model (+kvm's) diff from the last
You can do 'light' snapshots in between to send dirty pages to reduce snapshot time. I wrote the above to serve as a reference for your comments so it will map into my mind. Thanks, dor TODO: - Synchronization of clock sources (need to intercept TSC reads, etc). === usability === These are items that define how users interact with Kemari. TODO: - Kemarid daemon that takes care of the cluster management/monitoring side of things. - Some device emulators might need minor modifications to work well with Kemari. Use white(black)-listing to take the burden of choosing the right device model off the users. === optimizations === Although the big picture can be realized by completing the TODO list above, we need some optimizations/enhancements to make Kemari useful in the real world, and these are the items that need to be done for that. TODO: - SMP (for the sake of performance we might need to implement a synchronization protocol that can maintain two or more synchronization points active at any given moment) - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that are really dirty). Any comments/suggestions would be greatly appreciated. Thanks, Yoshi -- Kemari starts synchronizing VMs when QEMU handles I/O requests. Without this patch, VCPU state has already proceeded before synchronization, and after failover to the VM on the receiver, it hangs because of this.
Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/svm.c              | 11 ---
 arch/x86/kvm/vmx.c              | 11 ---
 arch/x86/kvm/x86.c              |  4
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
 	int in;
 	int port;
 	int size;
+	bool lazy_skip;
 };

 /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;

 	++svm->vcpu.stat.io_exits;
@@
RE: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
Michael, Yes, I think this packet split mode probably maps well to mergeable buffer support. Note that 1. Not all devices support large packets in this way; others might map to indirect buffers better. Are the indirect buffers intended to deal with skb->frag_list? So we have to figure out how migration is going to work. Yes, different guest virtio-net drivers may contain different features. Does the qemu migration work with different features supported by the virtio-net driver now? 2. It's up to the guest driver whether to enable features such as mergeable buffers and indirect buffers. So we have to figure out how to notify the guest which mode is optimal for a given device. Yes. When a device is bound, the mp device may query the capabilities from the driver. Actually, there is a structure now in the mp device that can do this; we can add some fields to support more. 3. We don't want to depend on jumbo frames for decent performance. So we probably should support GSO/GRO. GSO is for the tx side, right? I think the driver can handle it itself. For GRO, I'm not sure whether it's easy or not. Basically, the mp device we now support is doing what a raw socket is doing. The packets are not going to the host stack. -- MST
Re: [PATCH RFC v2 5/6] KVM: moving dirty bitmaps to user space
(2010/04/21 20:26), Avi Kivity wrote:

 	r = 0;
@@ -1858,7 +1866,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
 	if (memslot->is_dirty) {
 		kvm_flush_remote_tlbs(kvm);
 		n = kvm_dirty_bitmap_bytes(memslot);
-		memset(memslot->dirty_bitmap, 0, n);
+		clear_user(memslot->dirty_bitmap, n);
 		memslot->is_dirty = false;

Does this need an error check?

Yes, sorry I forgot.

@@ -468,8 +480,12 @@ void kvm_free_physmem(struct kvm *kvm)
 	int i;
 	struct kvm_memslots *slots = kvm->memslots;

-	for (i = 0; i < slots->nmemslots; ++i)
+	for (i = 0; i < slots->nmemslots; ++i) {
+		/* We don't munmap dirty bitmaps by ourselves. */

Why not? If we allocated them, we have to free them.

In the case of VM destruction, when the qemu process exits, it conflicts with the work of the usual (process) destructor's job. I struggled with this problem before sending the first version. So I have to differentiate between slot release and qemu process exit.

+		slots->memslots[i].dirty_bitmap = NULL;
+		slots->memslots[i].dirty_bitmap_old = NULL;
 		kvm_free_physmem_slot(&slots->memslots[i], NULL);
+	}

+/*
+ * Please use generic *_user bitops once they become available.
+ * Be careful setting the bit won't be done atomically.
+ */

Please introduce the user bitops as part of this patchset.

OK, I will do in the next version. In this RFC, I would be happy if I can know whether the overall design is right or not.
Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.
On Thu, Apr 22, 2010 at 04:57:56PM +0800, Xin, Xiaohui wrote: Michael, Yes, I think this packet split mode probably maps well to mergeable buffer support. Note that 1. Not all devices support large packets in this way, others might map to indirect buffers better. Are the indirect buffers intended to deal with skb->frag_list? We currently use skb->frags. So we have to figure out how migration is going to work. Yes, different guest virtio-net drivers may contain different features. Does the qemu migration work with different features supported by the virtio-net driver now? For now, you must have identical feature-sets for migration to work. And as long as we manage the buffers in software, we can always make features match. 2. It's up to the guest driver whether to enable features such as mergeable buffers and indirect buffers. So we have to figure out how to notify the guest which mode is optimal for a given device. Yes. When a device is bound, the mp device may query the capabilities from the driver. Actually, there is a structure now in the mp device that can do this; we can add some fields to support more. 3. We don't want to depend on jumbo frames for decent performance. So we probably should support GSO/GRO. GSO is for the tx side, right? I think the driver can handle it itself. For GRO, I'm not sure whether it's easy or not. Basically, the mp device we now support is doing what a raw socket is doing. The packets are not going to the host stack. See commit bfd5f4a3d605e0f6054df0b59fe0907ff7e696d3 (it doesn't currently work with vhost net, but that's a separate story). -- MST
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
Thanks, now I know the basic rules about kvm/api. (2010/04/21 20:46), Avi Kivity wrote: +Type: vm ioctl +Parameters: struct kvm_dirty_log (in/out) +Returns: 0 on success, -1 on error + +/* for KVM_SWITCH_DIRTY_LOG */ +struct kvm_dirty_log { + __u32 slot; + __u32 padding; Please put a flags field here (and verify it is zero in the ioctl implementation). This allows us to extend it later. + union { + void __user *dirty_bitmap; /* one bit per page */ + __u64 addr; + }; Just make dirty_bitmap a __u64. With the union there is the risk that someone forgets to zero the structure so we get some random bits in the pointer, and also issues with big endian and s390 compat. Reserve some extra space here for future expansion. Hm. I see that this is the existing kvm_dirty_log structure, so we can't play with it; ignore my comments about it. So, introducing a new structure to export (and get) two bitmap addresses with a flag is the thing? 1. Qemu calls the ioctl to get the two addresses. 2. Qemu calls the ioctl to make KVM switch the dirty bitmaps -- in this case, qemu has to change the address it got in step 1. ... 3. Qemu calls the ioctl (we can reuse 1. with a different command flag) to free the bitmaps. In my plan, we don't let qemu malloc() dirty bitmaps: at least, I want to make that another patch set because this patch set already has some dependencies on other issues. But, yes, I can make the struct considering future expansion if needed. - r = kvm_copy_dirty_bitmap(log->dirty_bitmap, dirty_bitmap_old, n); + if (need_copy) { + r = kvm_copy_dirty_bitmap(log->dirty_bitmap, + dirty_bitmap_old, n); + } else { + log->addr = (unsigned long)dirty_bitmap_old; + r = 0; + } Instead of passing need_copy everywhere, you might check a flag in log->flags. You'll need a flag to know whether to munmap() or not, no? To judge munmap() or not, putting it in the memory slot's flags may be more useful. But as you suggest, we won't use kvm_dirty_log so the parameters will change.
[PATCH 2/3] KVM MMU: fix sp->unsync type error in trace event definition.
sp->unsync is bool now, so update the trace event declaration. Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com --- arch/x86/kvm/mmutrace.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h index 3851f1f..9966e80 100644 --- a/arch/x86/kvm/mmutrace.h +++ b/arch/x86/kvm/mmutrace.h @@ -11,7 +11,7 @@ __field(__u64, gfn) \ __field(__u32, role) \ __field(__u32, root_count) \ - __field(__u32, unsync) + __field(bool, unsync) #define KVM_MMU_PAGE_ASSIGN(sp) \ __entry->gfn = sp->gfn; \ -- 1.6.5.2
[PATCH 3/3] KVM MMU: Take sp level into account when calculating quadrant
Take sp level into account when calculating the quadrant, because the quadrant is needed only when level == PT_PAGE_TABLE_LEVEL. Signed-off-by: Gui Jianfeng guijianf...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 640b82d..2a35a65 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1324,7 +1324,8 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, if (role.direct) role.cr4_pae = 0; role.access = access; - if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) { + if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL && + level == PT_PAGE_TABLE_LEVEL) { quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level)); quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1; role.quadrant = quadrant; -- 1.6.5.2
Re: [RFC][PATCH v3 2/3] Provides multiple submits and asynchronous notifications.
On Thu, Apr 22, 2010 at 04:37:16PM +0800, xiaohui@intel.com wrote: From: Xin Xiaohui xiaohui@intel.com The vhost-net backend now only supports synchronous send/recv operations. This patch provides multiple submits and asynchronous notifications, which are needed for the zero-copy case. Signed-off-by: Xin Xiaohui xiaohui@intel.com --- Michael, Can't vhost supply a kiocb completion callback that will handle the list? Yes, thanks. And with it I also removed vq->receiver finally. Thanks Xiaohui Nice progress. I commented on some minor issues below. Thanks! drivers/vhost/net.c | 227 +++-- drivers/vhost/vhost.c | 115 ++--- drivers/vhost/vhost.h | 14 +++ 3 files changed, 301 insertions(+), 55 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 22d5fef..4a70f66 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -17,11 +17,13 @@ #include <linux/workqueue.h> #include <linux/rcupdate.h> #include <linux/file.h> +#include <linux/aio.h> #include <linux/net.h> #include <linux/if_packet.h> #include <linux/if_arp.h> #include <linux/if_tun.h> +#include <linux/mpassthru.h> #include <net/sock.h> @@ -47,6 +49,7 @@ struct vhost_net { struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX]; struct vhost_poll poll[VHOST_NET_VQ_MAX]; + struct kmem_cache *cache; /* Tells us whether we are polling a socket for TX. * We only do this when socket buffer fills up. * Protected by tx vq lock.
*/ @@ -91,11 +94,132 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock) net->tx_poll_state = VHOST_NET_POLL_STARTED; } +struct kiocb *notify_dequeue(struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + unsigned long flags; + + spin_lock_irqsave(&vq->notify_lock, flags); + if (!list_empty(&vq->notifier)) { + iocb = list_first_entry(&vq->notifier, + struct kiocb, ki_list); + list_del(&iocb->ki_list); + } + spin_unlock_irqrestore(&vq->notify_lock, flags); + return iocb; +} + +static void handle_iocb(struct kiocb *iocb) +{ + struct vhost_virtqueue *vq = iocb->private; + unsigned long flags; + +spin_lock_irqsave(&vq->notify_lock, flags); +list_add_tail(&iocb->ki_list, &vq->notifier); +spin_unlock_irqrestore(&vq->notify_lock, flags); +} + checkpatch.pl does not complain about the above? +static void handle_async_rx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq, + struct socket *sock) continuation lines should start to the right of (. +{ + struct kiocb *iocb = NULL; + struct vhost_log *vq_log = NULL; + int rx_total_len = 0; + unsigned int head, log, in, out; + int size; + + if (vq->link_state != VHOST_VQ_LINK_ASYNC) + return; + + if (sock->sk->sk_data_ready) + sock->sk->sk_data_ready(sock->sk, 0); + + vq_log = unlikely(vhost_has_feature( + &net->dev, VHOST_F_LOG_ALL)) ? vq->log : NULL; split the above line at ?, continuation being to the left of ( looks ugly. + while ((iocb = notify_dequeue(vq)) != NULL) { + vhost_add_used_and_signal(&net->dev, vq, + iocb->ki_pos, iocb->ki_nbytes); + log = (int)(iocb->ki_user_data >> 32); how about we always do the recompute step, and not encode the log bit in ki_user_data? + size = iocb->ki_nbytes; + head = iocb->ki_pos; + rx_total_len += iocb->ki_nbytes; + + if (iocb->ki_dtor) + iocb->ki_dtor(iocb); + kmem_cache_free(net->cache, iocb); + + /* when log is enabled, recomputing the log info is needed, + * since these buffers are in the async queue, and may not get + * the log info before.
+ */ + if (unlikely(vq_log)) { + if (!log) + __vhost_get_vq_desc(&net->dev, vq, vq->iov, + ARRAY_SIZE(vq->iov), + &out, &in, vq_log, + &log, head); + vhost_log_write(vq, vq_log, log, size); + } + if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) { + vhost_poll_queue(&vq->poll); + break; + } + } +} + +static void handle_async_tx_events_notify(struct vhost_net *net, + struct vhost_virtqueue *vq) +{ + struct kiocb *iocb = NULL; + int tx_total_len = 0; + + if (vq->link_state != VHOST_VQ_LINK_ASYNC) + return; + + while
32-bit color graphic on KVM virtual machines
Hi. Is it possible to have 32-bit color graphics on KVM virtual machines? I installed a Windows virtual machine, but it allows me to configure only a 24-bit color display, and it does not have any display driver installed. Is there a way to solve this problem? Thank you very much! Bye.
RE: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side
-Original Message- From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Zhang, Yanmin Sent: Monday, April 19, 2010 1:33 PM To: Avi Kivity Cc: Ingo Molnar; Peter Zijlstra; Avi Kivity; Sheng Yang; linux-ker...@vger.kernel.org; kvm@vger.kernel.org; Marcelo Tosatti; oerg Roedel; Jes Sorensen; Gleb Natapov; Zachary Amsden; zhiteng.hu...@intel.com; tim.c.c...@intel.com; Arnaldo Carvalho de Melo Subject: [PATCH V5 1/3] perf kvm: Enhance perf to collect KVM guest os statistics from host side Below patch introduces perf_guest_info_callbacks and related register/unregister functions. Add more PERF_RECORD_MISC_XXX bits meaning guest kernel and guest user space. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com --- diff -Nraup --exclude-from=exclude.diff linux-2.6_tip0417/include/linux/perf_event.h linux-2.6_tip0417_perfkvm/include/linux/perf_event.h --- linux-2.6_tip0417/include/linux/perf_event.h 2010-04-19 09:51:59.544791000 +0800 +++ linux-2.6_tip0417_perfkvm/include/linux/perf_event.h 2010-04-19 09:53:59.691378953 +0800 @@ -932,6 +940,12 @@ static inline void perf_event_mmap(struc __perf_event_mmap(vma); } +extern struct perf_guest_info_callbacks *perf_guest_cbs; +extern int perf_register_guest_info_callbacks( + struct perf_guest_info_callbacks *); +extern int perf_unregister_guest_info_callbacks( + struct perf_guest_info_callbacks *); + extern void perf_event_comm(struct task_struct *tsk); extern void perf_event_fork(struct task_struct *tsk); @@ -1001,6 +1015,11 @@ perf_sw_event(u32 event_id, u64 nr, int static inline void perf_bp_event(struct perf_event *event, void *data) { } +static inline int perf_register_guest_info_callbacks +(struct perf_guest_info_callbacks *) {return 0; } +static inline int perf_unregister_guest_info_callbacks +(struct perf_guest_info_callbacks *) {return 0; } + static inline void perf_event_mmap(struct vm_area_struct *vma) { } static inline void perf_event_comm(struct task_struct *tsk) { } static inline void 
perf_event_fork(struct task_struct *tsk) { } Hi, I met this error when building the kernel. Anything wrong? CC init/main.o In file included from include/linux/ftrace_event.h:8, from include/trace/syscall.h:6, from include/linux/syscalls.h:75, from init/main.c:16: include/linux/perf_event.h: In function 'perf_register_guest_info_callbacks': include/linux/perf_event.h:1019: error: parameter name omitted include/linux/perf_event.h: In function 'perf_unregister_guest_info_callbacks': include/linux/perf_event.h:1021: error: parameter name omitted make[1]: *** [init/main.o] Error 1 make: *** [init] Error 2
[PATCH 1/8] KVM: SVM: Fix nested nmi handling
The patch introducing nested NMI handling had a bug: the check does not belong in enable_nmi_window but must be in nmi_allowed. This patch fixes this. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/svm.c | 16 +--- 1 files changed, 9 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ab78eb8..ec20584 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2771,8 +2771,12 @@ static int svm_nmi_allowed(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); struct vmcb *vmcb = svm->vmcb; - return !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) && - !(svm->vcpu.arch.hflags & HF_NMI_MASK); + int ret; + ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) && + !(svm->vcpu.arch.hflags & HF_NMI_MASK); + ret = ret && gif_set(svm) && nested_svm_nmi(svm); + + return ret; } static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu) @@ -2841,11 +2845,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu) * Something prevents NMI from being injected. Single step over possible * problem (IRET or exception injection or interrupt shadow) */ - if (gif_set(svm) && nested_svm_nmi(svm)) { - svm->nmi_singlestep = true; - svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); - update_db_intercept(vcpu); - } + svm->nmi_singlestep = true; + svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF); + update_db_intercept(vcpu); } static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr) -- 1.7.0.4
[PATCH 4/8] KVM: SVM: Propagate nested entry failure into guest hypervisor
This patch implements propagation of a failed guest vmrun back into the guest instead of killing the whole guest. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/svm.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 5ad9d80..b10d163 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1715,6 +1715,10 @@ static int nested_svm_intercept(struct vcpu_svm *svm) vmexit = NESTED_EXIT_DONE; break; } + case SVM_EXIT_ERR: { + vmexit = NESTED_EXIT_DONE; + break; + } default: { u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR); if (svm->nested.intercept & exit_bits) -- 1.7.0.4
[PATCH 3/8] KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling
This patch syncs cr0 and cr3 from the vmcb to the kvm state before nested intercept handling is done. This allows us to simplify the vmexit path. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/svm.c | 15 ++- 1 files changed, 6 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index c480d7f..5ad9d80 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1799,10 +1799,7 @@ static int nested_svm_vmexit(struct vcpu_svm *svm) nested_vmcb->save.gdtr = vmcb->save.gdtr; nested_vmcb->save.idtr = vmcb->save.idtr; nested_vmcb->save.cr0 = kvm_read_cr0(&svm->vcpu); - if (npt_enabled) - nested_vmcb->save.cr3 = vmcb->save.cr3; - else - nested_vmcb->save.cr3 = svm->vcpu.arch.cr3; + nested_vmcb->save.cr3 = svm->vcpu.arch.cr3; nested_vmcb->save.cr2 = vmcb->save.cr2; nested_vmcb->save.cr4 = svm->vcpu.arch.cr4; nested_vmcb->save.rflags = vmcb->save.rflags; @@ -2641,6 +2638,11 @@ static int handle_exit(struct kvm_vcpu *vcpu) trace_kvm_exit(exit_code, vcpu); + if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR0_MASK)) + vcpu->arch.cr0 = svm->vmcb->save.cr0; + if (npt_enabled) + vcpu->arch.cr3 = svm->vmcb->save.cr3; + if (unlikely(svm->nested.exit_required)) { nested_svm_vmexit(svm); svm->nested.exit_required = false; @@ -2668,11 +2670,6 @@ static int handle_exit(struct kvm_vcpu *vcpu) svm_complete_interrupts(svm); - if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR0_MASK)) - vcpu->arch.cr0 = svm->vmcb->save.cr0; - if (npt_enabled) - vcpu->arch.cr3 = svm->vmcb->save.cr3; - if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) { kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY; kvm_run->fail_entry.hardware_entry_failure_reason -- 1.7.0.4
[PATCH 7/8] KVM: x86: Allow marking an exception as reinjected
This patch adds logic to kvm/x86 which allows marking an injected exception as reinjected. This allows us to remove an ugly hack from svm_complete_interrupts that prevented exceptions from being reinjected at all in the nested case. The hack was necessary because an exception reinjected into the nested guest could cause a nested vmexit emulation, but reinjected exceptions must not intercept. The downside of the hack is that an injected exception could get lost. This patch fixes the problem and puts the code for it into generic x86 files because Nested-VMX will likely have the same problem and could reuse the code. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/include/asm/kvm_host.h | 6 +- arch/x86/kvm/svm.c | 12 ++-- arch/x86/kvm/vmx.c | 3 ++- arch/x86/kvm/x86.c | 23 +++ 4 files changed, 32 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 357573a..3f0007b 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -312,6 +312,7 @@ struct kvm_vcpu_arch { struct kvm_queued_exception { bool pending; bool has_error_code; + bool reinject; u8 nr; u32 error_code; } exception; @@ -514,7 +515,8 @@ struct kvm_x86_ops { void (*set_irq)(struct kvm_vcpu *vcpu); void (*set_nmi)(struct kvm_vcpu *vcpu); void (*queue_exception)(struct kvm_vcpu *vcpu, unsigned nr, - bool has_error_code, u32 error_code); + bool has_error_code, u32 error_code, + bool reinject); int (*interrupt_allowed)(struct kvm_vcpu *vcpu); int (*nmi_allowed)(struct kvm_vcpu *vcpu); bool (*get_nmi_mask)(struct kvm_vcpu *vcpu); @@ -617,6 +619,8 @@ void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags); void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr); void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code); +void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr); +void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code); void
kvm_inject_page_fault(struct kvm_vcpu *vcpu, unsigned long cr2, u32 error_code); bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 65fc114..30e49fe 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -338,7 +338,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu) } static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr, - bool has_error_code, u32 error_code) + bool has_error_code, u32 error_code, + bool reinject) { struct vcpu_svm *svm = to_svm(vcpu); @@ -346,7 +347,8 @@ static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr, * If we are within a nested VM we'd better #VMEXIT and let the guest * handle the exception */ - if (nested_svm_check_exception(svm, nr, has_error_code, error_code)) + if (!reinject && + nested_svm_check_exception(svm, nr, has_error_code, error_code)) return; if (nr == BP_VECTOR && !svm_has(SVM_FEATURE_NRIP)) { @@ -2918,8 +2920,6 @@ static void svm_complete_interrupts(struct vcpu_svm *svm) svm->vcpu.arch.nmi_injected = true; break; case SVM_EXITINTINFO_TYPE_EXEPT: - if (is_nested(svm)) - break; /* * In case of software exceptions, do not reinject the vector, * but re-execute the instruction instead.
Rewind RIP first @@ -2935,10 +2935,10 @@ static void svm_complete_interrupts(struct vcpu_svm *svm) } if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) { u32 err = svm->vmcb->control.exit_int_info_err; - kvm_queue_exception_e(&svm->vcpu, vector, err); + kvm_requeue_exception_e(&svm->vcpu, vector, err); } else - kvm_queue_exception(&svm->vcpu, vector); + kvm_requeue_exception(&svm->vcpu, vector); break; case SVM_EXITINTINFO_TYPE_INTR: kvm_queue_interrupt(&svm->vcpu, vector, false); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index dc023bc..285fe1a 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -919,7 +919,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu) } static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr, - bool has_error_code, u32 error_code) + bool has_error_code, u32
[PATCH 8/8] KVM: SVM: Handle MCE intercepts always on host level
This patch prevents MCE intercepts from being propagated into the L1 guest if they happened in an L2 guest. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/svm.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 30e49fe..889f660 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1651,6 +1651,7 @@ static int nested_svm_exit_special(struct vcpu_svm *svm) switch (exit_code) { case SVM_EXIT_INTR: case SVM_EXIT_NMI: + case SVM_EXIT_EXCP_BASE + MC_VECTOR: return NESTED_EXIT_HOST; case SVM_EXIT_NPF: /* For now we are always handling NPFs when using them */ -- 1.7.0.4
[PATCH 6/8] KVM: SVM: Report emulated SVM features to userspace
This patch implements the reporting of the emulated SVM features to userspace instead of the real hardware capabilities. Every real hardware capability needs emulation in nested svm, so the old behavior was broken. Cc: sta...@kernel.org Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/svm.c | 10 ++ 1 files changed, 10 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 0fa2035..65fc114 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3154,6 +3154,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) { + switch (func) { + case 0x8000000A: + entry->eax = 1; /* SVM revision 1 */ + entry->ebx = 8; /* Lets support 8 ASIDs in case we add proper + ASID emulation to nested SVM */ + entry->ecx = 0; /* Reserved */ + entry->edx = 0; /* Do not support any additional features */ + + break; + } } static const struct trace_print_flags svm_exit_reasons_str[] = { -- 1.7.0.4
[PATCH 5/8] KVM: X86: Add callback to let modules decide over some supported cpuid bits
This patch adds the get_supported_cpuid callback to kvm_x86_ops. It will be used in do_cpuid_ent to delegate the decision about some supported cpuid bits to the architecture modules. Cc: sta...@kernel.org Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm.c | 6 ++ arch/x86/kvm/vmx.c | 6 ++ arch/x86/kvm/x86.c | 3 +++ 4 files changed, 17 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index d47d087..357573a 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -528,6 +528,8 @@ struct kvm_x86_ops { int (*get_lpage_level)(void); bool (*rdtscp_supported)(void); + void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry); + const struct trace_print_flags *exit_reasons_str; }; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index b10d163..0fa2035 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3152,6 +3152,10 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu) { } +static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) +{ +} + static const struct trace_print_flags svm_exit_reasons_str[] = { { SVM_EXIT_READ_CR0, "read_cr0" }, { SVM_EXIT_READ_CR3, "read_cr3" }, @@ -3297,6 +3301,8 @@ static struct kvm_x86_ops svm_x86_ops = { .cpuid_update = svm_cpuid_update, .rdtscp_supported = svm_rdtscp_supported, + + .set_supported_cpuid = svm_set_supported_cpuid, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index d0a10b5..dc023bc 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4111,6 +4111,10 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu) } } +static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry) +{ +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -4183,6 +4187,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .cpuid_update
= vmx_cpuid_update, .rdtscp_supported = vmx_rdtscp_supported, + + .set_supported_cpuid = vmx_set_supported_cpuid, }; static int __init vmx_init(void) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cf37ac6..c2c3c29 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1960,6 +1960,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, entry->ecx = kvm_supported_word6_x86_features; break; } + + kvm_x86_ops->set_supported_cpuid(function, entry); + put_cpu(); } -- 1.7.0.4
[PATCH 0/8] More fixes for nested svm
Hi Avi, Marcelo, here is another set of nested svm fixes. They fix the NMI code to make the UP Hyper-V root domain boot. The patches also add better handling for nested entry failures and MCE intercepts. Also in this patchset are the fixes for the supported cpuid reporting for svm features. These patches were taken from the nested-npt patchset and slightly modified. These patches are also marked for -stable backporting. Probably the most important fix is the one for exception reinjection. This didn't work reliably before and is fixed with the patch in this series now. This fix also touches common x86 code, but that should be ok because it could be reused by nested-vmx later. Please review and give comments (or apply ;-). Thanks, Joerg Diffstat: arch/x86/include/asm/kvm_host.h | 8 +- arch/x86/kvm/svm.c | 72 +-- arch/x86/kvm/vmx.c | 9 +- arch/x86/kvm/x86.c | 26 +- 4 files changed, 83 insertions(+), 32 deletions(-) Shortlog: Joerg Roedel (8): KVM: SVM: Fix nested nmi handling KVM: SVM: Make sure rip is synced to vmcb before nested vmexit KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling KVM: SVM: Propagate nested entry failure into guest hypervisor KVM: X86: Add callback to let modules decide over some supported cpuid bits KVM: SVM: Report emulated SVM features to userspace KVM: x86: Allow marking an exception as reinjected KVM: SVM: Handle MCE intercepts always on host level
[PATCH 2/8] KVM: SVM: Make sure rip is synced to vmcb before nested vmexit
This patch fixes a bug where a nested guest always went over the same instruction because the rip was not advanced on a nested vmexit. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/kvm/svm.c | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ec20584..c480d7f 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2960,6 +2960,10 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu) u16 gs_selector; u16 ldt_selector; + svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX]; + svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; + svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; + /* * A vmexit emulation is required before the vcpu can be executed * again. @@ -2967,10 +2971,6 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu) if (unlikely(svm->nested.exit_required)) return; - svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX]; - svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP]; - svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP]; - pre_svm_run(svm); sync_lapic_to_cr8(vcpu); -- 1.7.0.4
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The transmission/transaction protocol and most of the control logic are implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing in the kernel side to guarantee replayability of certain events and instructions, to integrate the RAS capabilities of newer x86 hardware with the HA stack, as well as for optimization purposes, for example. [snip] The rest of this message describes TODO lists grouped by each topic. === event tapping === Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP. IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and the packet that the master node sent reached some remote client before the sync to the slave, and then the master failed? In the current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application was using unreliable protocols, it should have its own recovery mechanism, or it should be completely stateless.
[snip] === clock === Since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives. Why do you specifically care about the tsc sync? When you sync the whole IO model on snapshot, it also synchronizes the tsc. In general, can you please explain the 'algorithm' for continuous snapshots (is that what you'd like to do?): Yes, of course. Sorry for being less informative. A trivial one would be to: - do X online snapshots/sec I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running; if the guest was almost idle, there will be no snapshots in 5 sec. On the other hand, if the guest was running I/O intensive workloads (netperf, iozone for example), there will be about 50 snapshots/sec. - Stall all I/O (disk/block) from the guest to the outside world until the previous snapshot reaches the slave. Yes, it does. - Snapshots are made of the full device model + a diff of dirty pages from the last snapshot. - diff of dirty pages from last snapshot This also depends on the workload. In the case of I/O intensive workloads, dirty pages are usually less than 100. - Qemu device model (+kvm's) diff from last. We're currently sending a full copy because we're completely reusing this part of the existing live migration framework. Last time we measured, it was about 13KB. But it varies by which QEMU version is used. You can do 'light' snapshots in between to send dirty pages to reduce snapshot time. I agree. That's one of the advanced topics we would like to try too. I wrote the above to serve as a reference for your comments so it will map into my mind. Thanks, dor Thank you for the guidance. I hope this answers your question. At the same time, I would also be happy if we could discuss how to implement it too.
In fact, we needed a hack to prevent rip from proceeding in KVM, which turned out not to be the best workaround. Thanks, Yoshi TODO: - Synchronization of clock sources (need to intercept TSC reads, etc). === usability === These are items that define how users interact with Kemari. TODO: - Kemarid daemon that takes care of the cluster management/monitoring side of things. - Some device emulators might need minor modifications to work well with Kemari. Use white(black)-listing to take the burden of choosing the right device model off the users. === optimizations === Although the big picture can be realized by completing the TODO list above, we need some optimizations/enhancements to make Kemari useful in the real world, and these are the items that need to be done for that. TODO: - SMP (for the sake of performance might need to implement a synchronization protocol that can maintain two or more synchronization points active at any given moment) - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that are really
[PATCH] qemu-kvm: Ask kernel about supported svm features
This patch adds code to ask the kernel about the svm features it supports for its guests and propagates them to the guest. The new capability is necessary because the old behavior of the kernel was to just return the host svm features, but every svm feature needs emulation in the nested svm kernel code. The new capability indicates that the kernel is aware of that when returning svm cpuid information. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- qemu-kvm-x86.c | 14 ++++++++++++-- 1 files changed, 12 insertions(+), 2 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 748ff69..6eccd69 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv) qemu_kvm_cpuid_on_env(&copy); limit = copy.regs[R_EAX]; -for (i = 0x80000000; i <= limit; ++i) - do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy); +for (i = 0x80000000; i <= limit; ++i) { +do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy); +switch (i) { +case 0x8000000a: +cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX); +cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX); +cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX); +cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX); +break; +} +cpuid_nent += 1; +} kvm_setup_cpuid2(cenv, cpuid_nent, cpuid_ent); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] More fixes for nested svm
On 04/22/2010 01:33 PM, Joerg Roedel wrote: Hi Avi, Marcelo, here is another set of nested svm fixes. They fix the NMI code to make the UP Hyper-V root domain boot. The patches also add better handling for nested entry failures and mce intercepts. Also in this patchset are the fixes for the supported cpuid reporting for svm features. These patches were taken from the nested-npt patchset and slightly modified. These patches are also marked for -stable backporting. Probably the most important fix is the one about exception reinjection. This didn't work reliably before and is fixed with the patch in this series now. This fix also touches common x86 code, but that should be ok because it could be reused by nested-vmx later. Please review and give comments (or apply ;-). Looks good. I'll wait a day or two for more reviews (esp. Alex). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: Ask kernel about supported svm features
On 04/22/2010 01:57 PM, Joerg Roedel wrote: This patch adds code to ask the kernel about the svm features it supports for its guests and propagates them to the guest. The new capability is necessary because the old behavior of the kernel was to just return the host svm features, but every svm feature needs emulation in the nested svm kernel code. The new capability indicates that the kernel is aware of that when returning svm cpuid information. No new capability here. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- qemu-kvm-x86.c | 14 ++++++++++++-- 1 files changed, 12 insertions(+), 2 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 748ff69..6eccd69 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv) qemu_kvm_cpuid_on_env(&copy); limit = copy.regs[R_EAX]; -for (i = 0x80000000; i <= limit; ++i) - do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy); +for (i = 0x80000000; i <= limit; ++i) { +do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy); +switch (i) { +case 0x8000000a: +cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX); +cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX); +cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX); +cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX); +break; +} +cpuid_nent += 1; +} I don't understand why this is different compared to all other cpuid bits. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm vmx: free vpid when fail to create vcpu
On 04/17/2010 11:41 AM, Lai Jiangshan wrote: Fix bug of the exception path, free allocated vpid when fail to create vcpu. Applied, thanks. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: Ask kernel about supported svm features
On Thu, Apr 22, 2010 at 02:07:14PM +0300, Avi Kivity wrote: On 04/22/2010 01:57 PM, Joerg Roedel wrote: This patch adds code to ask the kernel about the svm features it supports for its guests and propagates them to the guest. The new capability is necessary because the old behavior of the kernel was to just return the host svm features, but every svm feature needs emulation in the nested svm kernel code. The new capability indicates that the kernel is aware of that when returning svm cpuid information. No new capability here. Copy/paste error, sorry. Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- qemu-kvm-x86.c | 14 ++++++++++++-- 1 files changed, 12 insertions(+), 2 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 748ff69..6eccd69 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv) qemu_kvm_cpuid_on_env(&copy); limit = copy.regs[R_EAX]; -for (i = 0x80000000; i <= limit; ++i) -do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy); +for (i = 0x80000000; i <= limit; ++i) { +do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy); +switch (i) { +case 0x8000000a: +cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX); +cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX); +cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX); +cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX); +break; +} +cpuid_nent += 1; +} I don't understand why this is different compared to all other cpuid bits. Because for the SVM features we report to the guest we need to ask the kernel which of them are supported. We can't just take the host cpuid because most of the additional svm features need special emulation in the kernel. Or do you think this should better be handled in target-i386/cpuid.c? 
Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: Ask kernel about supported svm features
On 04/22/2010 03:02 PM, Joerg Roedel wrote: Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- qemu-kvm-x86.c | 14 ++++++++++++-- 1 files changed, 12 insertions(+), 2 deletions(-) diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c index 748ff69..6eccd69 100644 --- a/qemu-kvm-x86.c +++ b/qemu-kvm-x86.c @@ -1327,8 +1327,18 @@ int kvm_arch_init_vcpu(CPUState *cenv) qemu_kvm_cpuid_on_env(&copy); limit = copy.regs[R_EAX]; -for (i = 0x80000000; i <= limit; ++i) - do_cpuid_ent(&cpuid_ent[cpuid_nent++], i, 0, &copy); +for (i = 0x80000000; i <= limit; ++i) { +do_cpuid_ent(&cpuid_ent[cpuid_nent], i, 0, &copy); +switch (i) { +case 0x8000000a: +cpuid_ent[cpuid_nent].eax = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EAX); +cpuid_ent[cpuid_nent].ebx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EBX); +cpuid_ent[cpuid_nent].ecx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_ECX); +cpuid_ent[cpuid_nent].edx = kvm_arch_get_supported_cpuid(cenv, 0x8000000a, R_EDX); +break; +} +cpuid_nent += 1; +} I don't understand why this is different compared to all other cpuid bits. Because for the SVM features we report to the guest we need to ask the kernel which of them are supported. That's true for all cpuid features. We can't just take the host cpuid because most of the additional svm features need special emulation in the kernel. Or do you think this should better be handled in target-i386/cpuid.c? Yes. -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it back to the vcpu configuration; others just take the qemu configuration, mask it with the supported bits, and pass it back (see check_features_against_host()). (need feature names for the bits, too, so you can enable or disable them from the command line) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote: Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although the advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The transmission/transaction protocol and most of the control logic is implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing in the kernel side to guarantee replayability of certain events and instructions, to integrate the RAS capabilities of newer x86 hardware with the HA stack, as well as for optimization purposes, for example. [snap] The rest of this message describes TODO lists grouped by each topic. === event tapping === Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP. IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol and a packet that the master node sent reaches some remote client, and the master fails before the sync to the slave? In the current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application is using unreliable protocols, it should have its own recovery mechanism, or it should be completely stateless. 
Why do you treat tcp differently? You can damage the entire VM this way - think of a DHCP request that was dropped at the moment you switched between the master and the slave. [snap] === clock === Since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives. Why do you specifically care about the tsc sync? When you sync all the IO model on snapshot it also synchronizes the tsc. So, do you agree that an extra clock synchronization is not needed since it is done anyway as part of the live migration state sync? In general, can you please explain the 'algorithm' for continuous snapshots (is that what you would like to do?): Yes, of course. Sorry for being less informative. A trivial one would be to: - do X online snapshots/sec I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running; if the guest is almost idle, there will be no snapshots in 5 sec. On the other hand, if the guest is running I/O-intensive workloads (netperf, iozone for example), there will be about 50 snapshots/sec. - Stall all IO (disk/block) from the guest to the outside world until the previous snapshot reaches the slave. Yes, it does. - Snapshots are made of the full device model + a diff of the dirty pages since the last snapshot. - diff of dirty pages from last snapshot This also depends on the workload. In the case of I/O-intensive workloads, dirty pages are usually fewer than 100. The hardest would be memory-intensive loads. So 100 snap/sec means a latency of 10 msec, right? (not that it's not ok, with faster hw and IB you'll be able to get much more) - Qemu device model (+kvm's) diff from last. We're currently sending a full copy because we're completely reusing this part of the existing live migration framework. Last time we measured, it was about 13KB, but it varies by which QEMU version is used. 
You can do 'light' snapshots in between to send dirty pages to reduce snapshot time. I agree. That's one of the advanced topics we would like to try too. I wrote the above to serve as a reference for your comments so it will map into my mind. Thanks, dor Thank you for the guidance. I hope this answers your question. At the same time, I would also be happy if we could discuss how to implement it too. In fact, we needed a hack to prevent rip from proceeding in KVM, which turned out not to be the best workaround. There are brute-force solutions like - stop the guest until you send all of the snapshot to the remote (like standard live migration) - Stop + fork + cont the father Or mark the recent dirty pages that were not sent to the remote as write-protected and copy them if touched. Thanks, Yoshi TODO: - Synchronization of clock sources (need to intercept TSC reads, etc). === usability ===
Re: [PATCH] qemu-kvm: Ask kernel about supported svm features
On Thu, Apr 22, 2010 at 03:13:14PM +0300, Avi Kivity wrote: On 04/22/2010 03:02 PM, Joerg Roedel wrote: We can't just take the host cpuid because most of the additional svm features need special emulation in the kernel. Or do you think this should better be handled in target-i386/cpuid.c? Yes. -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it back to the vcpu configuration; others just take the qemu configuration, mask it with the supported bits, and pass it back (see check_features_against_host()). Hmm, the plan was to enable, with -enable-nesting, all kernel-supported svm features for the guest (and add switches later to remove them individually). If we activate nested svm with -cpu host in the future, that's fine too (closed-source hypervisors need that anyway). But we should also define a cpu model with which we can migrate nested hypervisors between machines where the cpu is not completely identical. (need feature names for the bits, too, so you can enable or disable them from the command line) Yeah, I know. I omitted that for the first bring-up. It was planned for a later patch. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[qemu-kvm tests 0/3] qemu-kvm tests cleanup
Cleanup, mostly x86 oriented. Patches against 'next' branch. Naphtali Sprei (3): qemu-kvm tests cleanup qemu-kvm tests cleanup: adapt stringio test to kernel-mode run qemu-kvm tests cleanup: Added printing for passing tests Also typo fix kvm/user/README| 23 ++ kvm/user/balloon_ctl.c | 92 -- kvm/user/bootstrap.lds | 15 - kvm/user/config-x86-common.mak | 26 +-- kvm/user/config-x86_64.mak |5 +- kvm/user/main.c| 611 kvm/user/test/x86/README | 15 + kvm/user/test/x86/bootstrap.S | 137 - kvm/user/test/x86/exit.c |7 - kvm/user/test/x86/memtest1.S | 44 --- kvm/user/test/x86/realmode.c | 106 +++- kvm/user/test/x86/runtime.h|6 - kvm/user/test/x86/simple.S | 13 - kvm/user/test/x86/stringio.S | 13 +- kvm/user/test/x86/test32.S |8 - kvm/user/testdev.txt | 14 + 16 files changed, 170 insertions(+), 965 deletions(-) create mode 100644 kvm/user/README delete mode 100755 kvm/user/balloon_ctl.c delete mode 100644 kvm/user/bootstrap.lds delete mode 100644 kvm/user/main.c create mode 100644 kvm/user/test/x86/README delete mode 100644 kvm/user/test/x86/bootstrap.S delete mode 100644 kvm/user/test/x86/exit.c delete mode 100644 kvm/user/test/x86/memtest1.S delete mode 100644 kvm/user/test/x86/runtime.h delete mode 100644 kvm/user/test/x86/simple.S delete mode 100644 kvm/user/test/x86/test32.S create mode 100644 kvm/user/testdev.txt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[qemu-kvm tests 1/3] qemu-kvm tests cleanup
Mainly removed unused/unnecessary files and references to them Signed-off-by: Naphtali Sprei nsp...@redhat.com --- kvm/user/README | 23 ++ kvm/user/balloon_ctl.c | 92 -- kvm/user/bootstrap.lds | 15 - kvm/user/config-x86-common.mak | 24 +-- kvm/user/config-x86_64.mak | 5 +- kvm/user/main.c | 611 ---- kvm/user/test/x86/README | 15 + kvm/user/test/x86/bootstrap.S | 137 - kvm/user/test/x86/exit.c | 7 - kvm/user/test/x86/memtest1.S | 44 --- kvm/user/test/x86/runtime.h | 6 - kvm/user/test/x86/simple.S | 13 - kvm/user/test/x86/test32.S | 8 - kvm/user/testdev.txt | 14 + 14 files changed, 60 insertions(+), 954 deletions(-) create mode 100644 kvm/user/README delete mode 100755 kvm/user/balloon_ctl.c delete mode 100644 kvm/user/bootstrap.lds delete mode 100644 kvm/user/main.c create mode 100644 kvm/user/test/x86/README delete mode 100644 kvm/user/test/x86/bootstrap.S delete mode 100644 kvm/user/test/x86/exit.c delete mode 100644 kvm/user/test/x86/memtest1.S delete mode 100644 kvm/user/test/x86/runtime.h delete mode 100644 kvm/user/test/x86/simple.S delete mode 100644 kvm/user/test/x86/test32.S create mode 100644 kvm/user/testdev.txt diff --git a/kvm/user/README b/kvm/user/README new file mode 100644 index 000..6a83831 --- /dev/null +++ b/kvm/user/README @@ -0,0 +1,23 @@ +This directory contains sources for a kvm test suite. + +Tests for the x86 architecture are run as kernel images for a qemu that supports the multiboot format. +Tests use an infrastructure called from the BIOS code. The infrastructure initializes the system/CPUs, +switches to long mode and calls the 'main' function of the individual test. +Tests use qemu's virtual test device, named testdev, for services like printing, exiting, querying memory size, etc. +See the file testdev.txt for more details. + +To create the tests' images just type 'make' in this directory. 
+Tests' images are created in ./test/ARCH/*.flat + +An example of a test invocation: +qemu-system-x86_64 -device testdev,chardev=testlog -chardev file,id=testlog,path=msr.out -kernel ./test/x86/msr.flat +This invocation runs the msr test case. The test output is in file msr.out. + + + +Directory structure: +.: Makefile and config files for the tests +./test/lib: general services for the tests +./test/lib/ARCH: architecture dependent services for the tests +./test/ARCH: the sources of the tests and the created objects/images + diff --git a/kvm/user/balloon_ctl.c b/kvm/user/balloon_ctl.c deleted file mode 100755 index e65b08d..000 --- a/kvm/user/balloon_ctl.c +++ /dev/null @@ -1,92 +0,0 @@ -/* - * This binary provides access to the guest's balloon driver - * module. - * - * Copyright (C) 2007 Qumranet - * - * Author: - * - * Dor Laor dor.l...@qumranet.com - * - * This work is licensed under the GNU LGPL license, version 2. - */ - -#include <unistd.h> -#include <fcntl.h> -#include <stdio.h> -#include <stdlib.h> -#include <sys/mman.h> -#include <string.h> -#include <errno.h> -#include <sys/ioctl.h> - -#define __user -#include <linux/kvm.h> - -#define PAGE_SIZE 4096ul - - -static int balloon_op(int *fd, int bytes) -{ - struct kvm_balloon_op bop; - int r; - - bop.npages = bytes/PAGE_SIZE; - r = ioctl(*fd, KVM_BALLOON_OP, &bop); - if (r == -1) - return -errno; - printf("Balloon handled %d pages successfully\n", bop.npages); - - return 0; -} - -static int balloon_init(int *fd) -{ - *fd = open("/dev/kvm_balloon", O_RDWR); - if (*fd == -1) { - perror("open /dev/kvm_balloon"); - return -1; - } - - return 0; -} - -int main(int argc, char *argv[]) -{ - int fd; - int r; - int bytes; - - if (argc != 3) { - perror("Please provide op=[i|d], bytes\n"); - return 1; - } - bytes = atoi(argv[2]); - - switch (*argv[1]) { - case 'i': - break; - case 'd': - bytes = -bytes; - break; - default: - perror("Wrong op param\n"); - return 1; - } - - if (balloon_init(&fd)) { - perror("balloon_init failed\n"); - return 1; - } - - if ((r = balloon_op(&fd, bytes))) { - perror("balloon_op failed\n"); - goto out; - } - -out: - close(fd); - - return r; -} - diff --git a/kvm/user/bootstrap.lds b/kvm/user/bootstrap.lds deleted file mode 100644 index fd0a4f8..000 --- a/kvm/user/bootstrap.lds +++ /dev/null @@ -1,15 +0,0 @@ -OUTPUT_FORMAT(binary) - -SECTIONS -{ -. = 0; -stext = .; -.text : { *(.init) *(.text) } -. = ALIGN(4K); -.data : { *(.data) } -. = ALIGN(16); -.bss : { *(.bss) } -. = ALIGN(4K); -edata = .; -} - diff --git a/kvm/user/config-x86-common.mak
[qemu-kvm tests 2/3] qemu-kvm tests cleanup: adapt stringio test to kernel-mode run
Also use testdev for output, call exit to quit. Currently, test reboots endlessly because of a triple-fault. Need to run test with -no-reboot till issue fixed (in kvm ??) Signed-off-by: Naphtali Sprei nsp...@redhat.com --- kvm/user/config-x86-common.mak |2 +- kvm/user/test/x86/stringio.S | 13 + 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/kvm/user/config-x86-common.mak b/kvm/user/config-x86-common.mak index 960741e..d312681 100644 --- a/kvm/user/config-x86-common.mak +++ b/kvm/user/config-x86-common.mak @@ -54,7 +54,7 @@ $(TEST_DIR)/realmode.flat: $(TEST_DIR)/realmode.o $(TEST_DIR)/realmode.o: bits = 32 -$(TEST_DIR)/stringio.flat: $(TEST_DIR)/stringio.o +$(TEST_DIR)/stringio.flat: $(cstart.o) $(TEST_DIR)/stringio.o $(TEST_DIR)/msr.flat: $(cstart.o) $(TEST_DIR)/msr.o diff --git a/kvm/user/test/x86/stringio.S b/kvm/user/test/x86/stringio.S index 31ddc47..461621c 100644 --- a/kvm/user/test/x86/stringio.S +++ b/kvm/user/test/x86/stringio.S @@ -8,24 +8,29 @@ 1: .endm +TESTDEV_PORT = 0xf1 + str forward, forward str backward, backward .text - +.global main +main: cld movl forward, %ecx lea 4+forward, %rsi - movw $1, %dx + movw $TESTDEV_PORT, %dx rep outsb std movl backward, %ecx lea 4+backward-1(%rcx), %rsi - movw $2, %dx + movw $TESTDEV_PORT, %dx rep outsb - hlt + mov $0, %rsi + call exit + -- 1.6.3.3 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[qemu-kvm tests 3/3] qemu-kvm tests cleanup: Added printing for passing tests Also typo fix
Signed-off-by: Naphtali Sprei nsp...@redhat.com --- kvm/user/test/x86/realmode.c | 106 +++-- 1 files changed, 100 insertions(+), 6 deletions(-) diff --git a/kvm/user/test/x86/realmode.c b/kvm/user/test/x86/realmode.c index bfc2942..bc4ed97 100644 --- a/kvm/user/test/x86/realmode.c +++ b/kvm/user/test/x86/realmode.c @@ -160,6 +160,8 @@ void test_xchg(void) if (!regs_equal(inregs, outregs, 0)) print_serial(xchg test 1: FAIL\n); + else + print_serial(xchg test 1: PASS\n); exec_in_big_real_mode(inregs, outregs, insn_xchg_test2, @@ -169,6 +171,8 @@ void test_xchg(void) outregs.eax != inregs.ebx || outregs.ebx != inregs.eax) print_serial(xchg test 2: FAIL\n); + else + print_serial(xchg test 2: PASS\n); exec_in_big_real_mode(inregs, outregs, insn_xchg_test3, @@ -178,6 +182,8 @@ void test_xchg(void) outregs.eax != inregs.ecx || outregs.ecx != inregs.eax) print_serial(xchg test 3: FAIL\n); + else + print_serial(xchg test 3: PASS\n); exec_in_big_real_mode(inregs, outregs, insn_xchg_test4, @@ -187,6 +193,8 @@ void test_xchg(void) outregs.eax != inregs.edx || outregs.edx != inregs.eax) print_serial(xchg test 4: FAIL\n); + else + print_serial(xchg test 4: PASS\n); exec_in_big_real_mode(inregs, outregs, insn_xchg_test5, @@ -196,6 +204,8 @@ void test_xchg(void) outregs.eax != inregs.esi || outregs.esi != inregs.eax) print_serial(xchg test 5: FAIL\n); + else + print_serial(xchg test 5: PASS\n); exec_in_big_real_mode(inregs, outregs, insn_xchg_test6, @@ -205,6 +215,8 @@ void test_xchg(void) outregs.eax != inregs.edi || outregs.edi != inregs.eax) print_serial(xchg test 6: FAIL\n); + else + print_serial(xchg test 6: PASS\n); exec_in_big_real_mode(inregs, outregs, insn_xchg_test7, @@ -214,6 +226,8 @@ void test_xchg(void) outregs.eax != inregs.ebp || outregs.ebp != inregs.eax) print_serial(xchg test 7: FAIL\n); + else + print_serial(xchg test 7: PASS\n); exec_in_big_real_mode(inregs, outregs, insn_xchg_test8, @@ -223,6 +237,8 @@ void test_xchg(void) outregs.eax != inregs.esp || 
outregs.esp != inregs.eax) print_serial(xchg test 8: FAIL\n); + else + print_serial(xchg test 8: PASS\n); } void test_shld(void) @@ -234,9 +250,9 @@ void test_shld(void) insn_shld_test, insn_shld_test_end - insn_shld_test); if (outregs.eax != 0xbeef) - print_serial(shld: failure\n); + print_serial(shld: FAIL\n); else - print_serial(shld: success\n); + print_serial(shld: PASS\n); } void test_mov_imm(void) @@ -253,6 +269,8 @@ void test_mov_imm(void) insn_mov_r16_imm_1_end - insn_mov_r16_imm_1); if (!regs_equal(inregs, outregs, R_AX) || outregs.eax != 1234) print_serial(mov test 1: FAIL\n); + else + print_serial(mov test 1: PASS\n); /* test mov $imm, %eax */ exec_in_big_real_mode(inregs, outregs, @@ -260,6 +278,8 @@ void test_mov_imm(void) insn_mov_r32_imm_1_end - insn_mov_r32_imm_1); if (!regs_equal(inregs, outregs, R_AX) || outregs.eax != 1234567890) print_serial(mov test 2: FAIL\n); + else + print_serial(mov test 2: PASS\n); /* test mov $imm, %al/%ah */ exec_in_big_real_mode(inregs, outregs, @@ -267,16 +287,24 @@ void test_mov_imm(void) insn_mov_r8_imm_1_end - insn_mov_r8_imm_1); if (!regs_equal(inregs, outregs, R_AX) || outregs.eax != 0x1200) print_serial(mov test 3: FAIL\n); + else + print_serial(mov test 3: PASS\n); + exec_in_big_real_mode(inregs, outregs, insn_mov_r8_imm_2, insn_mov_r8_imm_2_end - insn_mov_r8_imm_2); if (!regs_equal(inregs, outregs, R_AX) || outregs.eax != 0x34) print_serial(mov test 4: FAIL\n); + else + print_serial(mov test 4: PASS\n); + exec_in_big_real_mode(inregs, outregs, insn_mov_r8_imm_3, insn_mov_r8_imm_3_end - insn_mov_r8_imm_3); if (!regs_equal(inregs, outregs,
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
2010/4/22 Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp: (2010/04/22 19:35), Yoshiaki Tamura wrote: A trivial one would we to : - do X online snapshots/sec I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running, and if the guest was almost idle, there will be no snapshots in 5sec. On the other hand, if the guest was running I/O intensive workloads (netperf, iozone for example), there will be about 50 snapshots/sec. 50 is too small: this depends on the synchronization speed and does not show how many snapshots we need, right? No it doesn't. It's an example data which I measured before. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] Add a global synchronization point for pvclock
On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote: On 04/20/2010 11:54 AM, Avi Kivity wrote: On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote: On 04/20/2010 02:31 AM, Avi Kivity wrote: btw, do you want this code in pvclock.c, or shall we keep it kvmclock specific? I think it's a pvclock-level fix. I'd been hoping to avoid having something like this, but I think it's ultimately necessary. Did you observe drift on Xen, or is this ultimately pointing at the future? People are reporting weirdnesses that clocksource=jiffies apparently resolves. Xen and KVM are faced with the same hardware constraints, and it wouldn't surprise me if there were small measurable non-monotonicities in the PV clock under Xen. May as well be safe. Of course, it kills any possibility of being able to usefully export this interface down to usermode. My main concern about this kind of simple fix is that if there's a long-term systematic drift between different CPUs' TSCs, then this will somewhat mask the problem while giving really awful time measurement on the slow CPU(s). In that case it really needs to adjust the scaling factor to correct for the drift (*not* update the offset). But if we're definitely only talking about fixed, relatively small time offsets then it is fine. Can you by any chance run Ingo's time warp test on those machines? You need to define TOD to 1 and leave out the TSC test. For me, warps exist on every machine out there except the Nehalems, so far. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] qemu-kvm: Ask kernel about supported svm features
On 04/22/2010 03:28 PM, Joerg Roedel wrote: Yes. -cpu host should take KVM_GET_SUPPORTED_CPUID output and loop it back to the vcpu configuration, others just take the qemu configuration, mask it with supported bits, and pass it back (see check_features_against_host()). Hmm, the plan was to enable with -enable-nesting all kernel supported svm features for the guest (and add switches later to remove them individually) If we activate nested svm with -cpu host in the future thats fine too (closed-source hypervisors need that anyway). But we should also define a cpu model in which we can migrate nested hypervisors between machines were the cpu is not completly indentical. You can use -cpu host, or -cpu kvm64,+svm,+npt,+other_feature... just as with all other features. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Yoshiaki Tamura wrote: Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP. IMO any type of network even should be stalled too. What if the VM runs non tcp protocol and the packet that the master node sent reached some remote client and before the sync to the slave the master failed? In current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application was using unreliable protocols, it should have its own recovering mechanism, or it should be completely stateless. Even with unreliable protocols, if slave takeover causes the receiver to have received a packet that the sender _does not think it has ever sent_, expect some protocols to break. If the slave replaying master's behaviour since the last sync means it will definitely get into the same state of having sent the packet, that works out. But you still have to be careful that the other end's responses to that packet are not seen by the slave too early during that replay. Otherwise, for example, the slave may observe a TCP ACK to a packet that it hasn't yet sent, which is an error. About IP idempotency: In general, IP packets are allowed to be lost or duplicated in the network. All IP protocols should be prepared for that; it is a basic property. However there is one respect in which they're not idempotent: The TTL field should be decreased if packets are delayed. Packets should not appear to live in the network for longer than TTL seconds. If they do, some protocols (like TCP) can react to the delayed ones differently, such as sending a RST packet and breaking a connection. It is acceptable to reduce TTL faster than the minimum. 
After all, it is reduced by 1 on every forwarding hop, in addition to time delays. I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running, and if the guest was almost idle, there will be no snapshots in 5sec. On the other hand, if the guest was running I/O intensive workloads (netperf, iozone for example), there will be about 50 snapshots/sec. That is a really satisfying number, thank you :-) Without this work I wouldn't have imagined that synchronised machines could work with such a low transaction rate. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section
On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote:
On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote:
On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote:
On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote:
On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote:
On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote:

Or could we make kvm_set_irq() atomic? Though the code path is a little long for a spinlock.

Yes, given the sleep-inside-RCU-protected-section bug from kvm_notify_acked_irq, either that or convert IRQ locking to SRCU. But as you said, the code paths are long and potentially slow, so SRCU is probably a better alternative. Gleb?

kvm_set_irq() was converted to RCU from a mutex to make MSI-X interrupt injection scalable.

We meant the ioapic lock. See the last report from Ralf on this thread.

Can we solve the problem by calling the ack notifier outside the RCU read section in kvm_notify_acked_irq()?

The unregister path does:
- remove_from_list(entry)
- synchronize_rcu
- kfree(entry)

So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the notifier entry can be freed.

What I mean is that kvm_notify_acked_irq() will iterate over all ack entries in the RCU read-protected section, but instead of calling the callbacks it will collect them into an array and call them outside the RCU read section. At that point it doesn't matter if an entry is unregistered, since the copy is used to actually call the notifier.

Here it is. But no, this trick can't be done safely for ack notifiers, because they are unregistered at runtime by device assignment. How does the freeing path know it can proceed to free its entry if there is no reliable way to know whether there is a reference to itself? (Think priv getting freed between rcu_read_unlock and ->irq_acked with the patch below.) I can convert to SRCU if you have no objections.
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 0150aff..900ac05 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -241,8 +241,8 @@ int pit_has_pending_timer(struct kvm_vcpu *vcpu)
 
 static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
 {
-	struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
-						 irq_ack_notifier);
+	struct kvm_kpit_state *ps = kian->priv;
+
 	raw_spin_lock(&ps->inject_lock);
 	if (atomic_dec_return(&ps->pit_timer.pending) < 0)
 		atomic_inc(&ps->pit_timer.pending);
@@ -636,6 +636,7 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
 		     CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
 	pit_state->irq_ack_notifier.gsi = 0;
 	pit_state->irq_ack_notifier.irq_acked = kvm_pit_ack_irq;
+	pit_state->irq_ack_notifier.priv = &pit->pit_state;
 	kvm_register_irq_ack_notifier(kvm, &pit_state->irq_ack_notifier);
 	pit_state->pit_timer.reinject = true;
 	mutex_unlock(&pit->pit_state.lock);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ce027d5..6c3fb06 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -404,6 +404,7 @@ struct kvm_irq_ack_notifier {
 	struct hlist_node link;
 	unsigned gsi;
 	void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
+	void *priv;
 };
 
 #define KVM_ASSIGNED_MSIX_PENDING 0x1
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 4d10b1e..779b749 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -121,8 +121,7 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
 	if (kian->gsi == -1)
 		return;
 
-	dev = container_of(kian, struct kvm_assigned_dev_kernel,
-			   ack_notifier);
+	dev = kian->priv;
 
 	kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);
 
@@ -563,6 +562,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 	match->irq_source_id = -1;
 	match->kvm = kvm;
 	match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
+	match->ack_notifier.priv = match;
 	INIT_WORK(&match->interrupt_work,
 		  kvm_assigned_dev_interrupt_work_handler);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index a0e8880..ebccea8 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -175,11 +175,14 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	return ret;
 }
 
+#define MAX_ACK_NOTIFIER_PER_GSI 4
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
+	struct kvm_irq_ack_notifier acks[MAX_ACK_NOTIFIER_PER_GSI];
 	struct hlist_node *n;
-	int gsi;
+	int gsi, i, acks_nr = 0;
 
 	trace_kvm_ack_irq(irqchip, pin);
 
@@ -188,9 +191,15 @@ void
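The copy-then-call idea under discussion (collect the ack entries inside the RCU read section, invoke them afterwards so they may sleep) can be sketched in plain userspace C. All names below are illustrative stand-ins, not the kernel API; the RCU calls are reduced to comments. The sketch also shows why the patch adds a priv field: the callback runs on a private copy of the notifier, so container_of() on the copy would point into stack memory, whereas priv still reaches the registered state.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for struct kvm_irq_ack_notifier with the
 * proposed priv field (names are ours, not the kernel's). */
struct ack_notifier {
    unsigned gsi;
    void (*irq_acked)(struct ack_notifier *kian);
    void *priv;
};

#define MAX_ACK_NOTIFIER_PER_GSI 4

static struct ack_notifier *registered[MAX_ACK_NOTIFIER_PER_GSI];
static int nr_registered;

static void register_ack(struct ack_notifier *n)
{
    registered[nr_registered++] = n;
}

/* Collect matching entries while the list is protected, then invoke
 * the callbacks afterwards.  In the kernel the collection loop would
 * sit between rcu_read_lock() and rcu_read_unlock(); the callbacks
 * run outside that section on private copies, which is exactly the
 * window in which the original entry may be unregistered and freed. */
static void notify_acked(unsigned gsi)
{
    struct ack_notifier acks[MAX_ACK_NOTIFIER_PER_GSI];
    int i, acks_nr = 0;

    /* rcu_read_lock(); */
    for (i = 0; i < nr_registered; i++)
        if (registered[i]->gsi == gsi)
            acks[acks_nr++] = *registered[i];
    /* rcu_read_unlock(); */

    for (i = 0; i < acks_nr; i++)
        acks[i].irq_acked(&acks[i]);   /* copy, hence the priv pointer */
}

static int fired;
static void count_ack(struct ack_notifier *kian)
{
    *(int *)kian->priv += 1;
}
```

Marcelo's objection applies unchanged to this sketch: nothing tells the unregister path when priv is no longer referenced, which is why the thread moves toward SRCU instead.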
[PATCH] Enable pvclock flags in vcpu_time_info structure
This patch removes one padding byte and transforms it into a flags field. New versions of guests using pvclock will query these flags upon each read. Flags, however, will only be interpreted when the guest decides to. It uses the pvclock_valid_flags function to signal that a specific set of flags should be taken into consideration. Which flags are valid is usually devised via HV negotiation.

Signed-off-by: Glauber Costa glom...@redhat.com
---
 arch/x86/include/asm/pvclock-abi.h |    3 ++-
 arch/x86/include/asm/pvclock.h     |    1 +
 arch/x86/kernel/pvclock.c          |    9 +
 3 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index 6d93508..ec5c41a 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -29,7 +29,8 @@ struct pvclock_vcpu_time_info {
 	u64   system_time;
 	u32   tsc_to_system_mul;
 	s8    tsc_shift;
-	u8    pad[3];
+	u8    flags;
+	u8    pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
 struct pvclock_wall_clock {
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 53235fd..c50823f 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+void pvclock_valid_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
 			    struct pvclock_vcpu_time_info *vcpu,
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 03801f2..8f4af7b 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -31,8 +31,16 @@ struct pvclock_shadow_time {
 	u32 tsc_to_nsec_mul;
 	int tsc_shift;
 	u32 version;
+	u8  flags;
 };
 
+static u8 valid_flags = 0;
+
+void pvclock_valid_flags(u8 flags)
+{
+	valid_flags = flags;
+}
+
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
@@ -91,6 +99,7 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
 		dst->system_timestamp  = src->system_time;
 		dst->tsc_to_nsec_mul   = src->tsc_to_system_mul;
 		dst->tsc_shift         = src->tsc_shift;
+		dst->flags             = src->flags;
 		rmb();		/* test version after fetching data */
 	} while ((src->version & 1) || (dst->version != src->version));
-- 
1.6.2.2
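The guest-side behaviour this patch enables can be sketched as follows. This is a minimal userspace model, not the kernel code: the struct mirrors the patched pvclock_vcpu_time_info layout with a reduced field set, set_valid_flags() stands in for pvclock_valid_flags(), and read_flags() reproduces the version-checked snapshot loop of pvclock_get_time_values() before masking with the negotiated flags.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified mirror of pvclock_vcpu_time_info after the patch:
 * one pad byte becomes flags. */
struct time_info {
    uint32_t version;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
};

static uint8_t valid_flags;

/* Counterpart of pvclock_valid_flags(): record which flags the guest
 * has negotiated with the hypervisor and is willing to interpret. */
static void set_valid_flags(uint8_t flags)
{
    valid_flags = flags;
}

/* Version-checked snapshot in the spirit of pvclock_get_time_values():
 * retry while the producer is mid-update (odd version) or the version
 * changed underneath us, then report only the negotiated flags. */
static uint8_t read_flags(volatile struct time_info *src)
{
    uint32_t version;
    uint8_t flags;

    do {
        version = src->version;
        flags = src->flags;
    } while ((src->version & 1) || (version != src->version));

    return flags & valid_flags;
}
```

The masking step is the point of the design: a flag the hypervisor sets is ignored until the guest explicitly opts in, so new flag bits stay backward compatible.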
Re: [PATCH v4] Add mergeable RX bufs support to vhost
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 05:02:25 AM:

On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote:
This patch adds the mergeable RX buffers feature to vhost.
Signed-off-by: David L Stevens dlstev...@us.ibm.com

Looks pretty clean to me. Could you send a checkpatch-clean version please?

The original passes checkpatch already, but I guess I must still be getting whitespace mangling if it didn't for you. (sigh) Here it is as an attachment:

We should also check performance implications. Do you have any data?

I'm getting on the order of 10-20% improvement in throughput over stock vhost, guest to host, but I do see a lot of variability in the results, even with no KVM and just over loopback. I don't know where that's coming from, but I'll do some runs and post. Thanks for all the reviews!

+-DLS

[Attachment: MRXBv4.patch]
Re: [PATCH v4] Add mergeable RX bufs support to vhost
On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote: This patch adds the mergeable RX buffers feature to vhost. Signed-off-by: David L Stevens dlstev...@us.ibm.com BTW, which userspace should I use for testing this?
Re: [PATCH v4] Add mergeable RX bufs support to vhost
On Thu, Apr 22, 2010 at 11:47:15AM -0600, David Stevens wrote: Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 05:02:25 AM: On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote: This patch adds the mergeable RX buffers feature to vhost. Signed-off-by: David L Stevens dlstev...@us.ibm.com Looks pretty clean to me. Could you send a checkpatch-clean version please? The original passes checkpatch already, but I guess I must still be getting whitespace mangling if it didn't for you. (sigh) Here it is as an attachment: No good either. I thought you managed to run sendmail somewhere? Failing that, put it on dropbox and let me know. See http://kbase.redhat.com/faq/docs/DOC-2113 -- MST
Re: [PATCH v4] Add mergeable RX bufs support to vhost
Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 10:43:49 AM: On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote: This patch adds the mergeable RX buffers feature to vhost. Signed-off-by: David L Stevens dlstev...@us.ibm.com BTW, which userspace should I use for testing this? I use the qemu-kvm patch to your git tree that I posted a while back (also attached). And it needs your TUNSETVNETHDRSIZE ioctl in the kernel as well. Also, I added the TUN ioctl defines to /usr/include for this to pick up (it compiles, but won't do the ioctls without that); will get back to that patch per your comments next. [Also a correction: I said 10-20% guest to host; I meant host-to-guest (i.e., the receiver side). h2g appears improved too, but not as much.] +-DLS [Attachment: qemu-MRXB-2.patch]
Re: [PATCH v4] Add mergeable RX bufs support to vhost
On Thu, Apr 22, 2010 at 11:59:55AM -0600, David Stevens wrote: Michael S. Tsirkin m...@redhat.com wrote on 04/22/2010 10:43:49 AM: On Mon, Apr 19, 2010 at 03:12:19PM -0700, David L Stevens wrote: This patch adds the mergeable RX buffers feature to vhost. Signed-off-by: David L Stevens dlstev...@us.ibm.com BTW, which userspace should I use for testing this? I use the qemu-kvm patch to your git tree that I posted a while back (also attached). Doesn't apply, probably also corrupted. Could you put it up on dropbox please? And it needs your TUNSETVNETHDRSIZE ioctl in the kernel as well. Also, I added the TUN ioctl defines to /usr/include for this to pick up (it compiles, but won't do the ioctls without that); will get back to that patch per your comments next. [also correction: I said 10-20% guest to host, I meant host-to-guest (i.e., the receiver side). h2g appears improved too, but not as much.] +-DLS
Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.
Hi,

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Replaces the byte-based phys_ram_dirty bitmap with four (MASTER, VGA, CODE, MIGRATION) bit-based phys_ram_dirty bitmaps. On allocation, it sets all bits in the bitmap. It uses ffs() to convert DIRTY_FLAG to DIRTY_IDX. Modifies the wrapper functions for the byte-based phys_ram_dirty bitmap to work on the bit-based phys_ram_dirty bitmaps. MASTER works as a buffer, and upon get_dirty() or get_dirty_flags(), it calls cpu_physical_memory_sync_master() to update VGA and MIGRATION.

Why use an additional bitmap for MASTER instead of just updating the VGA, CODE, and MIGRATION bitmaps together?

Regards,

Anthony Liguori

Replaces direct phys_ram_dirty access with wrapper functions to prevent direct access to the phys_ram_dirty bitmap.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp
---
 cpu-all.h |  130 +
 exec.c    |   60 ++--
 2 files changed, 152 insertions(+), 38 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index 51effc0..3f8762d 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -37,6 +37,9 @@
 
 #include "softfloat.h"
 
+/* to use ffs in flag_to_idx() */
+#include <strings.h>
+
 #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN)
 #define BSWAP_NEEDED
 #endif
@@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str);
 /* memory API */
 
 extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
 extern ram_addr_t ram_size;
 extern ram_addr_t last_ram_offset;
 extern uint8_t *bios_mem;
@@ -869,28 +871,140 @@ extern uint8_t *bios_mem;
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO	(1 << 5)
 
+/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */
+#define MASTER_DIRTY_IDX    0
+#define VGA_DIRTY_IDX       1
+#define CODE_DIRTY_IDX      2
+#define MIGRATION_DIRTY_IDX 3
+#define NUM_DIRTY_IDX       4
+
+#define MASTER_DIRTY_FLAG    (1 << MASTER_DIRTY_IDX)
+#define VGA_DIRTY_FLAG       (1 << VGA_DIRTY_IDX)
+#define CODE_DIRTY_FLAG      (1 << CODE_DIRTY_IDX)
+#define MIGRATION_DIRTY_FLAG (1 << MIGRATION_DIRTY_IDX)
+
+extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX];
+
+static inline int dirty_flag_to_idx(int flag)
+{
+    return ffs(flag) - 1;
+}
+
+static inline int dirty_idx_to_flag(int idx)
+{
+    return 1 << idx;
+}
+
 int cpu_memory_rw_debug(CPUState *env, target_ulong addr,
                         uint8_t *buf, int len, int is_write);
 
-#define VGA_DIRTY_FLAG       0x01
-#define CODE_DIRTY_FLAG      0x02
-#define MIGRATION_DIRTY_FLAG 0x08
-
 /* read dirty bit (return 0 or 1) */
 static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    return (phys_ram_dirty[MASTER_DIRTY_IDX][index] & mask) == mask;
+}
+
+static inline void cpu_physical_memory_sync_master(ram_addr_t index)
+{
+    if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) {
+        phys_ram_dirty[VGA_DIRTY_IDX][index]
+            |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+        phys_ram_dirty[MIGRATION_DIRTY_IDX][index]
+            |= phys_ram_dirty[MASTER_DIRTY_IDX][index];
+        phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL;
+    }
+}
+
+static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
+{
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+    int ret = 0, i;
+
+    mask = 1UL << offset;
+    cpu_physical_memory_sync_master(index);
+
+    for (i = VGA_DIRTY_IDX; i <= MIGRATION_DIRTY_IDX; i++) {
+        if (phys_ram_dirty[i][index] & mask) {
+            ret |= dirty_idx_to_flag(i);
+        }
+    }
+
+    return ret;
+}
+
+static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr,
+                                                    int dirty_idx)
+{
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS) & (HOST_LONG_BITS - 1);
+
+    mask = 1UL << offset;
+    cpu_physical_memory_sync_master(index);
+    return (phys_ram_dirty[dirty_idx][index] & mask) == mask;
 }
 
 static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
                                                 int dirty_flags)
 {
-    return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+    return cpu_physical_memory_get_dirty_idx(addr,
+                                             dirty_flag_to_idx(dirty_flags));
 }
 
 static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
 {
-    phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+    unsigned long mask;
+    ram_addr_t index = (addr >> TARGET_PAGE_BITS) / HOST_LONG_BITS;
+    int offset = (addr >> TARGET_PAGE_BITS)
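The index/offset arithmetic that every wrapper above repeats — one bit per page, packed into unsigned longs — can be checked in isolation. A minimal sketch, assuming 4 KiB target pages and a tiny toy bitmap; the helper names are ours, not QEMU's:

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12
#define HOST_LONG_BITS   (8 * (int)sizeof(unsigned long))

/* Address -> (word index, bit mask), as in the patch's wrappers:
 * word index = page number / bits-per-long,
 * bit offset = page number modulo bits-per-long. */
static void page_bit(uint64_t addr, uint64_t *index, unsigned long *mask)
{
    uint64_t page = addr >> TARGET_PAGE_BITS;

    *index = page / HOST_LONG_BITS;
    *mask  = 1UL << (page & (HOST_LONG_BITS - 1));
}

/* Toy MASTER bitmap mirroring cpu_physical_memory_set_dirty() and
 * cpu_physical_memory_is_dirty() on 4 words of pages. */
#define NUM_WORDS 4
static unsigned long master[NUM_WORDS];

static void set_dirty(uint64_t addr)
{
    uint64_t index;
    unsigned long mask;

    page_bit(addr, &index, &mask);
    master[index] |= mask;
}

static int is_dirty(uint64_t addr)
{
    uint64_t index;
    unsigned long mask;

    page_bit(addr, &index, &mask);
    return (master[index] & mask) == mask;
}
```

One whole-word test (`if (phys_ram_dirty[MASTER_DIRTY_IDX][index])`) is what lets sync_master propagate HOST_LONG_BITS pages into VGA and MIGRATION at once, which is the payoff over the byte-per-page scheme.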
Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

For foolproofing purposes, qemu_put_vector_prepare should be called before qemu_put_vector. Then, if any qemu_put_* function other than this one is called after qemu_put_vector_prepare, the program will abort().

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

I don't get it. What's this protecting against?

Regards,

Anthony Liguori

---
 hw/hw.h  |    2 ++
 savevm.c |   53 +
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 921cf90..10e6dda 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
+void qemu_put_vector_prepare(QEMUFile *f);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);
 
diff --git a/savevm.c b/savevm.c
index 944e788..22d928c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -180,6 +180,7 @@ struct QEMUFile {
     uint8_t *buf;
 
     int has_error;
+    int prepares_vector;
 };
 
 typedef struct QEMUFileStdio
@@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
         qemu_fflush(f);
 }
 
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
+{
+    struct iovec *iov;
+    int cnt;
+    size_t bufsize;
+    uint8_t *buf;
+
+    if (qemu_file_get_rate_limit(f) != 0) {
+        fprintf(stderr,
+                "Attempted to write vector while bandwidth limit is not zero.\n");
+        abort();
+    }
+
+    /* checks prepares vector.
+     * For foolproofing purposes, qemu_put_vector_prepare should be called
+     * before qemu_put_vector.  Then, if any qemu_put_* function except this
+     * is called after qemu_put_vector_prepare, the program will abort().
+     */
+    if (!f->prepares_vector) {
+        fprintf(stderr,
+                "You should prepare with qemu_put_vector_prepare.\n");
+        abort();
+    } else if (f->prepares_vector && f->buf_index != 0) {
+        fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
+        abort();
+    }
+    f->prepares_vector = 0;
+
+    if (f->put_vector) {
+        qemu_iovec_to_vector(v, &iov, &cnt);
+        f->put_vector(f->opaque, iov, 0, cnt);
+    } else {
+        qemu_iovec_to_size(v, &bufsize);
+        buf = qemu_malloc(bufsize + 1 /* for '\0' */);
+        qemu_iovec_to_buffer(v, buf);
+        qemu_put_buffer(f, buf, bufsize);
+        qemu_free(buf);
+    }
+
+}
+
+void qemu_put_vector_prepare(QEMUFile *f)
+{
+    if (f->prepares_vector) {
+        /* prepare vector */
+        fprintf(stderr, "Attempted to prepare vector twice\n");
+        abort();
+    }
+    f->prepares_vector = 1;
+    qemu_fflush(f);
+}
+
 int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
 {
     int size, l;
Re: [RFC PATCH 08/20] Introduce RAMSaveIO and use cpu_physical_memory_get_dirty_range() to check multiple dirty pages.
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Introduce RAMSaveIO to use writev for saving ram blocks, and modify ram_save_block() and ram_save_remaining() to use cpu_physical_memory_get_dirty_range() to check multiple dirty and non-dirty pages at once.

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp
Signed-off-by: OHMURA Kei ohmura@lab.ntt.co.jp

Perf data?

Regards,

Anthony Liguori

---
 vl.c |  221 ++---
 1 files changed, 197 insertions(+), 24 deletions(-)

diff --git a/vl.c b/vl.c
index 729c955..9c3dc4c 100644
--- a/vl.c
+++ b/vl.c
@@ -2774,12 +2774,167 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
     return 1;
 }
 
-static int ram_save_block(QEMUFile *f)
+typedef struct RAMSaveIO RAMSaveIO;
+
+struct RAMSaveIO {
+    QEMUFile *f;
+    QEMUIOVector *qiov;
+
+    uint8_t *ram_store;
+    size_t nalloc, nused;
+    uint8_t io_mode;
+
+    void (*put_buffer)(RAMSaveIO *s, uint8_t *buf, size_t len);
+    void (*put_byte)(RAMSaveIO *s, int v);
+    void (*put_be64)(RAMSaveIO *s, uint64_t v);
+
+};
+
+static inline void ram_saveio_flush(RAMSaveIO *s, int prepare)
+{
+    qemu_put_vector(s->f, s->qiov);
+    if (prepare)
+        qemu_put_vector_prepare(s->f);
+
+    /* reset stored data */
+    qemu_iovec_reset(s->qiov);
+    s->nused = 0;
+}
+
+static inline void ram_saveio_put_buffer(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    s->put_buffer(s, buf, len);
+}
+
+static inline void ram_saveio_put_byte(RAMSaveIO *s, int v)
+{
+    s->put_byte(s, v);
+}
+
+static inline void ram_saveio_put_be64(RAMSaveIO *s, uint64_t v)
+{
+    s->put_be64(s, v);
+}
+
+static inline void ram_saveio_set_error(RAMSaveIO *s)
+{
+    qemu_file_set_error(s->f);
+}
+
+static void ram_saveio_put_buffer_vector(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_iovec_add(s->qiov, buf, len);
+}
+
+static void ram_saveio_put_buffer_direct(RAMSaveIO *s, uint8_t *buf, size_t len)
+{
+    qemu_put_buffer(s->f, buf, len);
+}
+
+static void ram_saveio_put_byte_vector(RAMSaveIO *s, int v)
+{
+    uint8_t *to_save;
+
+    if (s->nalloc - s->nused < sizeof(int))
+        ram_saveio_flush(s, 1);
+
+    to_save = &s->ram_store[s->nused];
+    to_save[0] = v & 0xff;
+    s->nused++;
+
+    qemu_iovec_add(s->qiov, to_save, 1);
+}
+
+static void ram_saveio_put_byte_direct(RAMSaveIO *s, int v)
+{
+    qemu_put_byte(s->f, v);
+}
+
+static void ram_saveio_put_be64_vector(RAMSaveIO *s, uint64_t v)
+{
+    uint8_t *to_save;
+
+    if (s->nalloc - s->nused < sizeof(uint64_t))
+        ram_saveio_flush(s, 1);
+
+    to_save = &s->ram_store[s->nused];
+    to_save[0] = (v >> 56) & 0xff;
+    to_save[1] = (v >> 48) & 0xff;
+    to_save[2] = (v >> 40) & 0xff;
+    to_save[3] = (v >> 32) & 0xff;
+    to_save[4] = (v >> 24) & 0xff;
+    to_save[5] = (v >> 16) & 0xff;
+    to_save[6] = (v >> 8) & 0xff;
+    to_save[7] = (v >> 0) & 0xff;
+    s->nused += sizeof(uint64_t);
+
+    qemu_iovec_add(s->qiov, to_save, sizeof(uint64_t));
+}
+
+static void ram_saveio_put_be64_direct(RAMSaveIO *s, uint64_t v)
+{
+
+    qemu_put_be64(s->f, v);
+}
+
+static RAMSaveIO *ram_saveio_new(QEMUFile *f, size_t max_store)
+{
+    RAMSaveIO *s;
+
+    s = qemu_mallocz(sizeof(*s));
+
+    if (qemu_file_get_rate_limit(f) == 0) { /* non buffer mode */
+        /* When QEMUFile don't have get_rate limit,
+         * qemu_file_get_rate_limit will return 0.
+         * However, we believe that all kinds of QEMUFile
+         * except non-block mode has rate limit function.
+         */
+        s->io_mode = 1;
+        s->ram_store = qemu_mallocz(max_store);
+        s->nalloc = max_store;
+        s->nused = 0;
+
+        s->qiov = qemu_mallocz(sizeof(*s->qiov));
+        qemu_iovec_init(s->qiov, max_store);
+
+        s->put_buffer = ram_saveio_put_buffer_vector;
+        s->put_byte = ram_saveio_put_byte_vector;
+        s->put_be64 = ram_saveio_put_be64_vector;
+
+        qemu_put_vector_prepare(f);
+    } else {
+        s->io_mode = 0;
+        s->put_buffer = ram_saveio_put_buffer_direct;
+        s->put_byte = ram_saveio_put_byte_direct;
+        s->put_be64 = ram_saveio_put_be64_direct;
+    }
+
+    s->f = f;
+
+    return s;
+}
+
+static void ram_saveio_destroy(RAMSaveIO *s)
+{
+    if (s->qiov != NULL) { /* means using put_vector */
+        ram_saveio_flush(s, 0);
+        qemu_iovec_destroy(s->qiov);
+        qemu_free(s->qiov);
+        qemu_free(s->ram_store);
+    }
+    qemu_free(s);
+}
+
+/*
+ * RAMSaveIO will manage I/O.
+ */
+static int ram_save_block(RAMSaveIO *s)
 {
     static ram_addr_t current_addr = 0;
     ram_addr_t saved_addr = current_addr;
     ram_addr_t addr = 0;
-    int found = 0;
+    ram_addr_t dirty_rams[HOST_LONG_BITS];
+    int i, found = 0;
 
     while (addr < last_ram_offset) {
         if (kvm_enabled() && current_addr == 0) {
@@ -2787,32 +2942,38 @@ static int ram_save_block(QEMUFile *f)
             r = kvm_update_dirty_pages_log();
             if (r) {
                 fprintf(stderr, "%s: update dirty
Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page
On Thu, Apr 22, 2010 at 02:13:04PM +0800, Xiao Guangrong wrote:

If we have a new mapping to the unsync page (i.e., we add a new parent), just update the page from sp->gfn but do not write-protect gfn, and if we need to create a new shadow page from sp->gfn, we should sync it.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |   27 +++
 1 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index fd027a6..8607a64 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1196,16 +1196,20 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
-static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+			 bool clear_unsync)
 {
 	if (sp->role.cr4_pae != !!is_pae(vcpu)) {
 		kvm_mmu_zap_page(vcpu->kvm, sp);
 		return 1;
 	}
 
-	if (rmap_write_protect(vcpu->kvm, sp->gfn))
-		kvm_flush_remote_tlbs(vcpu->kvm);
-	kvm_unlink_unsync_page(vcpu->kvm, sp);
+	if (clear_unsync) {
+		if (rmap_write_protect(vcpu->kvm, sp->gfn))
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		kvm_unlink_unsync_page(vcpu->kvm, sp);
+	}
+
 	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
 		kvm_mmu_zap_page(vcpu->kvm, sp);
 		return 1;
@@ -1293,7 +1297,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 			kvm_flush_remote_tlbs(vcpu->kvm);
 		for_each_sp(pages, sp, parents, i) {
-			kvm_sync_page(vcpu, sp);
+			kvm_sync_page(vcpu, sp, true);
 			mmu_pages_clear_parents(&parents);
 		}
 		cond_resched_lock(&vcpu->kvm->mmu_lock);
@@ -1313,7 +1317,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	unsigned index;
 	unsigned quadrant;
 	struct hlist_head *bucket;
-	struct kvm_mmu_page *sp;
+	struct kvm_mmu_page *sp, *unsync_sp = NULL;
 	struct hlist_node *node, *tmp;
 
 	role = vcpu->arch.mmu.base_role;
@@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
 		if (sp->gfn == gfn) {
 			if (sp->unsync)
-				if (kvm_sync_page(vcpu, sp))
-					continue;
+				unsync_sp = sp;

Xiao, I don't see a reason why you can't create a new mapping to an unsync page. The code already creates shadow pte entries using unsync pagetables. So all you need would be to kvm_sync_pages before write-protecting. Also make sure kvm_sync_pages is in place here before enabling multiple unsync shadows, in the patch series.

 			if (sp->role.word != role.word)
 				continue;
 
+			if (unsync_sp && kvm_sync_page(vcpu, unsync_sp, false)) {
+				unsync_sp = NULL;
+				continue;
+			}
+
 			mmu_page_add_parent_pte(vcpu, sp, parent_pte);
 			if (sp->unsync_children) {
 				set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
@@ -1346,6 +1354,9 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 			trace_kvm_mmu_get_page(sp, false);
 			return sp;
 		}
+	if (unsync_sp)
+		kvm_sync_page(vcpu, unsync_sp, true);
+
 	++vcpu->kvm->stat.mmu_cache_miss;
 	sp = kvm_mmu_alloc_page(vcpu, parent_pte);
 	if (!sp)
-- 
1.6.1.2
Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp

I think the more appropriate thing to do is have qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a continuous live migration. You would then want qemu_loadvm_state() to detect real EOF and treat that the same as QEMU_VM_EOF (provided it occurred at a section boundary). Of course, this should be a flag to qemu_loadvm_state() as it's not always the right behavior.

Regards,

Anthony Liguori

---
 migration-exec.c |    2 +-
 migration-fd.c   |    2 +-
 migration-tcp.c  |    2 +-
 migration-unix.c |    2 +-
 savevm.c         |   25 ++---
 sysemu.h         |    2 +-
 6 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;
diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;
diff --git a/migration-tcp.c b/migration-tcp.c
index 56e1a3b..94a1a03 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void *opaque)
         goto out;
     }
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
         goto out;
     }
 
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;
diff --git a/savevm.c b/savevm.c
index 22d928c..a401b27 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
     int version_id;
 } LoadStateEntry;
 
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
     QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
         QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
     unsigned int v;
     int ret;
 
-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
-        return -EINVAL;
+    if (!skip_header) {
+        v = qemu_get_be32(f);
+        if (v != QEMU_VM_FILE_MAGIC)
+            return -EINVAL;
+
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+            fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
+            return -ENOTSUP;
+        }
+        if (v != QEMU_VM_FILE_VERSION)
+            return -ENOTSUP;
 
-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
     }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;
 
     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
@@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
         monitor_printf(mon, "Could not open VM state file\n");
         return -EINVAL;
     }
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     qemu_fclose(f);
     if (ret < 0) {
         monitor_printf(mon, "Error %d while loading VM state\n", ret);
diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
 
 void qemu_errors_to_file(FILE *fp);
 void qemu_errors_to_mon(Monitor *mon);
Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the transfered data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE. I'd rather you encapsulate the current protocol as opposed to extending it with a new version. You could also do this by just having a new -incoming option, right? All you really need is to indicate that this is for high frequency checkpointing, right? Regards, Anthony Liguori Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp --- savevm.c | 76 - sysemu.h |1 + 2 files changed, 75 insertions(+), 2 deletions(-) diff --git a/savevm.c b/savevm.c index 292ae32..19b3efb 100644 --- a/savevm.c +++ b/savevm.c @@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se) } #define QEMU_VM_FILE_MAGIC 0x5145564d -#define QEMU_VM_FILE_VERSION_COMPAT 0x0002 -#define QEMU_VM_FILE_VERSION 0x0003 +#define QEMU_VM_FILE_VERSION_COMPAT 0x0003 +#define QEMU_VM_FILE_VERSION 0x0004 +#define QEMU_VM_LIVE_MIGRATION_MODE 0x0005 +#define QEMU_VM_FT_MODE 0x0006 #define QEMU_VM_EOF 0x00 #define QEMU_VM_SECTION_START0x01 @@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, qemu_put_be32(f, QEMU_VM_FILE_MAGIC); qemu_put_be32(f, QEMU_VM_FILE_VERSION); + +if (ft_mode) { +qemu_put_be32(f, QEMU_VM_FT_MODE); +} else { +qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE); +} QTAILQ_FOREACH(se,savevm_handlers, entry) { int len; @@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f) return 0; } +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f) +{ +SaveStateEntry *se; + +QTAILQ_FOREACH(se,savevm_handlers, entry) { +int len; + +if (se-save_live_state == NULL) +continue; + +/* Section type */ +qemu_put_byte(f, QEMU_VM_SECTION_START); +qemu_put_be32(f, se-section_id); + +/* ID string */ +len = strlen(se-idstr); +qemu_put_byte(f, len); +qemu_put_buffer(f, (uint8_t *)se-idstr, len); + +qemu_put_be32(f, se-instance_id); +qemu_put_be32(f, 
se->version_id); +if (ft_mode == FT_INIT) { +/* This is a workaround. */ +se->save_live_state(mon, f, QEMU_VM_SECTION_START, se->opaque); +} else { +se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque); +} +} + +ft_mode = FT_TRANSACTION; +QTAILQ_FOREACH(se, &savevm_handlers, entry) { +int len; + + if (se->save_state == NULL && se->vmsd == NULL) + continue; + +/* Section type */ +qemu_put_byte(f, QEMU_VM_SECTION_FULL); +qemu_put_be32(f, se->section_id); + +/* ID string */ +len = strlen(se->idstr); +qemu_put_byte(f, len); +qemu_put_buffer(f, (uint8_t *)se->idstr, len); + +qemu_put_be32(f, se->instance_id); +qemu_put_be32(f, se->version_id); + +vmstate_save(f, se); +} + +qemu_put_byte(f, QEMU_VM_EOF); + +if (qemu_file_has_error(f)) +return -EIO; + +return 0; +} + + void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f) { SaveStateEntry *se; @@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int skip_header) if (v != QEMU_VM_FILE_VERSION) return -ENOTSUP; +v = qemu_get_be32(f); +if (v == QEMU_VM_FT_MODE) { +ft_mode = FT_INIT; +} } while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) { diff --git a/sysemu.h b/sysemu.h index 6c1441f..df314bb 100644 --- a/sysemu.h +++ b/sysemu.h @@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, int shared); int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f); int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f); void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); int qemu_loadvm_state(QEMUFile *f, int skip_header); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 15/20] Introduce FT mode support to configure.
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp No need for this. Regards, Anthony Liguori --- configure |8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/configure b/configure index 046c591..f0682d4 100755 --- a/configure +++ b/configure @@ -298,6 +298,7 @@ bsd_user=no guest_base= uname_release= io_thread=no +ft_mode=no mixemu=no kvm_trace=no kvm_cap_pit= @@ -671,6 +672,8 @@ for opt do ;; --enable-io-thread) io_thread=yes ;; + --enable-ft-mode) ft_mode=yes + ;; --disable-blobs) blobs=no ;; --kerneldir=*) kerneldir=$optarg @@ -840,6 +843,7 @@ echo --enable-vde enable support for vde network echo --disable-linux-aio disable Linux AIO support echo --enable-linux-aio enable Linux AIO support echo --enable-io-thread enable IO thread +echo --enable-ft-mode enable FT mode support echo --disable-blobs disable installing provided firmware blobs echo --kerneldir=PATH look for kernel includes in PATH echo --with-kvm-trace enable building the KVM module with the kvm trace option @@ -2117,6 +2121,7 @@ echo GUEST_BASE$guest_base echo PIE user targets $user_pie echo vde support $vde echo IO thread $io_thread +echo FT mode support $ft_mode echo Linux AIO support $linux_aio echo Install blobs $blobs echo KVM support $kvm @@ -2318,6 +2323,9 @@ fi if test $io_thread = yes ; then echo CONFIG_IOTHREAD=y $config_host_mak fi +if test $ft_mode = yes ; then + echo CONFIG_FT_MODE=y $config_host_mak +fi if test $linux_aio = yes ; then echo CONFIG_LINUX_AIO=y $config_host_mak fi -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporarily.
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: do_event_tap() is inserted into functions which actually fire outputs. By synchronizing VMs before outputs are fired, we can fail over to the receiver upon failure. To save the VM continuously, comment out assert() on cpu_single_env temporarily. Signed-off-by: Yoshiaki Tamura tamura.yoshi...@lab.ntt.co.jp --- hw/virtio-blk.c |2 ++ hw/virtio-net.c |2 ++ qemu-kvm.c |7 ++- 3 files changed, 10 insertions(+), 1 deletions(-) This would be better done in the generic layers (the block and net layers respectively). Then it would work with virtio and emulated devices. Regards, Anthony Liguori diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c index b80402d..1dd1c31 100644 --- a/hw/virtio-blk.c +++ b/hw/virtio-blk.c @@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq) .old_bs = NULL, }; +do_event_tap(); + while ((req = virtio_blk_get_request(s))) { virtio_blk_handle_request(req, &mrb); } diff --git a/hw/virtio-net.c b/hw/virtio-net.c index 5c0093e..1a32bf3 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq) { VirtIONet *n = to_virtio_net(vdev); +do_event_tap(); + if (n->tx_timer_active) { virtio_queue_set_notification(vq, 1); qemu_del_timer(n->tx_timer); diff --git a/qemu-kvm.c b/qemu-kvm.c index 1414f49..769bc95 100644 --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -935,8 +935,12 @@ int kvm_run(CPUState *env) post_kvm_run(kvm, env); +/* TODO: we need to prevent tapping events that derived from the + * same VMEXIT. This needs more info from the kernel.
*/ #if defined(KVM_CAP_COALESCED_MMIO) if (kvm_state->coalesced_mmio) { +/* prevent tapping events while handling coalesced_mmio */ +event_tap_suspend(); struct kvm_coalesced_mmio_ring *ring = (void *) run + kvm_state->coalesced_mmio * PAGE_SIZE; while (ring->first != ring->last) { @@ -946,6 +950,7 @@ int kvm_run(CPUState *env) smp_wmb(); ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX; } +event_tap_resume(); } #endif @@ -1770,7 +1775,7 @@ static void resume_all_threads(void) { CPUState *penv = first_cpu; -assert(!cpu_single_env); +/* assert(!cpu_single_env); */ while (penv) { penv->stop = 0; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section
On Thu, Apr 22, 2010 at 09:11:30PM +0300, Gleb Natapov wrote: On Thu, Apr 22, 2010 at 01:40:38PM -0300, Marcelo Tosatti wrote: On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote: On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote: On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote: On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote: On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote: On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote: Or could we make kvm_set_irq() atomic? Though the code path is a little long for a spinlock. Yes, given the sleep-inside-RCU-protected-section bug from kvm_notify_acked_irq, either that or convert the IRQ locking to SRCU. But as you said, the code paths are long and potentially slow, so SRCU is probably the better alternative. Gleb? kvm_set_irq() was converted from a mutex to RCU to make MSI-X interrupt injection scalable. We meant the ioapic lock. See the last report from Ralf on this thread. Can we solve the problem by calling the ack notifiers outside the rcu read section in kvm_notify_acked_irq()? The unregister path does: remove_from_list(entry) -> synchronize_rcu() -> kfree(entry). So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the notifier entry can be freed. What I mean is that kvm_notify_acked_irq() will iterate over all ack entries in the rcu read protected section, but instead of calling the callbacks it will collect them into an array and call them outside the rcu read section. At that point it doesn't matter if an entry is unregistered, since the copy is used to actually call the notifier. Here it is, but no, this trick can't be done safely for ack notifiers because they are unregistered at runtime by device assignment. How does the freeing path know it can proceed to free its entry if there is no reliable way to know whether there is a reference to it? (think of priv getting freed between rcu_read_unlock and ->irq_acked with the patch below).
Yeah, I see :( I can convert it to SRCU if you have no objections. AFAIR there was a disadvantage compared to RCU, but I don't remember what it was exactly. http://www.mentby.com/paul-e-mckenney/kvm-patch-v3-13-kvm-fix-race-in-irqrouting-logic.html But for the KVM IRQ paths it should not matter much, since use of grace periods is rare (registration/unregistration is very rare compared to the read side), and the IRQ paths are already large and slow, so the added overhead should not be noticeable. What about converting the PIC/IOAPIC mutexes into spinlocks? Works for me, but on large guests the spinning will be noticeable, I believe. Avi? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The transmission/transaction protocol, and most of the control logic, is implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing in the kernel side to guarantee replayability of certain events and instructions, to integrate the RAS capabilities of newer x86 hardware with the HA stack, as well as for optimization purposes, for example. Before going into details, we would like to show how Kemari looks. We prepared a demonstration video at the following location. For those who are not interested in the code, please take a look. The demonstration scenario is:
1. Play with a guest VM that has virtio-blk and virtio-net. # The guest image should be on NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command in QEMU. Just add the -k option to the usual migrate command. migrate -d -k tcp:192.168.0.20:
3. Check the status by calling info migrate.
4. Go back to the VM to play the chess animation.
5. Kill the VM. (The VNC client also disappears.)
6. Press c to continue the VM on the other host.
7. Bring up the VNC client. (Sorry, it pops up outside of the video capture.)
8. Confirm that the chess animation ends and the browser works fine, then shut down.
http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov The repository contains all patches we're sending with this message. For those who want to try it, pull the following repository. When running configure, please pass --enable-ft-mode. You also need to apply the patch attached at the end of this message to your KVM. git://kemari.git.sourceforge.net/gitroot/kemari/kemari In addition to the usual migrate environment and command, add -k to run. The patch set consists of the following components:
- bit-based dirty bitmap. (I posted v4 for upstream QEMU on April 20)
- writev() support for QEMUFile and FdMigrationState.
- FT transaction sender/receiver.
- event tap that triggers the FT transaction.
- virtio-blk, virtio-net support for event tap.
This series looks quite nice! I think it would make sense to separate out the things that are actually optimizations (like the dirty bitmap changes and the writev/readv changes) and to attempt to justify them with actual performance data. I'd prefer not to modify the live migration protocol ABI, and it doesn't seem to be necessary if we're willing to add options to the -incoming flag. We also want to be a bit more generic with respect to IO. Otherwise, the series looks very close to being mergeable. Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: QEMUFile currently doesn't support writev(). For sending multiple data, such as pages, using writev() should be more efficient. Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp Is there performance data that backs this up? Since QEMUFile uses a linear buffer for most operations that's limited to 16k, I suspect you wouldn't be able to observe a difference in practice. Regards, Anthony Liguori --- buffered_file.c |2 +- hw/hw.h | 16 savevm.c| 43 +-- 3 files changed, 42 insertions(+), 19 deletions(-) diff --git a/buffered_file.c b/buffered_file.c index 54dc6c2..187d1d4 100644 --- a/buffered_file.c +++ b/buffered_file.c @@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque, s-wait_for_unfreeze = wait_for_unfreeze; s-close = close; -s-file = qemu_fopen_ops(s, buffered_put_buffer, NULL, +s-file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL, buffered_close, buffered_rate_limit, buffered_set_rate_limit, buffered_get_rate_limit); diff --git a/hw/hw.h b/hw/hw.h index fc9ed29..921cf90 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -23,6 +23,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf, int64_t pos, int size); +/* This function writes a chunk of vector to a file at the given position. + * The pos argument can be ignored if the file is only being used for + * streaming. + */ +typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov, +int64_t pos, int iovcnt); + /* Read a chunk of data from a file at the given position. The pos argument * can be ignored if the file is only be used for streaming. The number of * bytes actually read should be returned. @@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf, typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf, int64_t pos, int size); +/* Read a chunk of vector from a file at the given position. The pos argument + * can be ignored if the file is only be used for streaming. 
The number of + * bytes actually read should be returned. + */ +typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov, +int64_t pos, int iovcnt); + /* Close a file and return an error code */ typedef int (QEMUFileCloseFunc)(void *opaque); @@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate); typedef size_t (QEMUFileGetRateLimit)(void *opaque); QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer, + QEMUFilePutVectorFunc *put_vector, QEMUFileGetBufferFunc *get_buffer, + QEMUFileGetVectorFunc *get_vector, QEMUFileCloseFunc *close, QEMUFileRateLimit *rate_limit, QEMUFileSetRateLimit *set_rate_limit, diff --git a/savevm.c b/savevm.c index 490ab70..944e788 100644 --- a/savevm.c +++ b/savevm.c @@ -162,7 +162,9 @@ void qemu_announce_self(void) struct QEMUFile { QEMUFilePutBufferFunc *put_buffer; +QEMUFilePutVectorFunc *put_vector; QEMUFileGetBufferFunc *get_buffer; +QEMUFileGetVectorFunc *get_vector; QEMUFileCloseFunc *close; QEMUFileRateLimit *rate_limit; QEMUFileSetRateLimit *set_rate_limit; @@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const char *mode) s-stdio_file = stdio_file; if(mode[0] == 'r') { -s-file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose, -NULL, NULL, NULL); +s-file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, + NULL, stdio_pclose, NULL, NULL, NULL); } else { -s-file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose, -NULL, NULL, NULL); +s-file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL, + stdio_pclose, NULL, NULL, NULL); } return s-file; } @@ -312,11 +314,11 @@ QEMUFile *qemu_fdopen(int fd, const char *mode) goto fail; if(mode[0] == 'r') { -s-file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose, -NULL, NULL, NULL); +s-file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL, + stdio_fclose, NULL, NULL, NULL); } else { -s-file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose, -NULL, NULL, NULL); +s-file = 
qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL, + stdio_fclose, NULL, NULL, NULL); } return
Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section
On Thu, Apr 22, 2010 at 04:40:30PM -0300, Marcelo Tosatti wrote: On Thu, Apr 22, 2010 at 09:11:30PM +0300, Gleb Natapov wrote: On Thu, Apr 22, 2010 at 01:40:38PM -0300, Marcelo Tosatti wrote: On Wed, Apr 21, 2010 at 09:38:39PM +0300, Gleb Natapov wrote: On Wed, Apr 21, 2010 at 03:29:11PM -0300, Marcelo Tosatti wrote: On Wed, Apr 21, 2010 at 08:58:48PM +0300, Gleb Natapov wrote: On Wed, Apr 21, 2010 at 02:37:34PM -0300, Marcelo Tosatti wrote: On Wed, Apr 21, 2010 at 08:12:27PM +0300, Gleb Natapov wrote: On Wed, Apr 21, 2010 at 12:58:41PM -0300, Marcelo Tosatti wrote: Or could we make kvm_set_irq() atomic? Though the code path is a little long for spinlock. Yes, given the sleep-inside-RCU-protected section bug from kvm_notify_acked_irq, either that or convert IRQ locking to SRCU. But as you said, the code paths are long and potentially slow, so probably SRCU is a better alternative. Gleb? kvm_set_irq() was converted to rcu from mutex to make msix interrupt injection scalable. We meant ioapic lock. See the last report from Ralf on this thread. Can we solve the problem by calling ack notifier outside rcu read section in kvm_notify_acked_irq()? The unregister path does - remove_from_list(entry) - synchronize_rcu - kfree(entry) So if kvm_notify_acked_irq sleeps, synchronize_rcu can succeed, and the notifier entry can be freed. What I mean is kvm_notify_acked_irq() will iterate over all ack entries in rcu read protected section, but instead of calling callback it will collect them into the array and call them outside rcu read section. At this point it doesn't matter if entry is unregistered since the copy is used to actually call the notifier. Here it is, but no, this trick can't be done safely for ack notifiers because they are unregistered at runtime by device assignment. How does the freeing path knows it can proceed to free its entry if there is no reliable way to know if there is a reference to itself? 
(think of priv getting freed between rcu_read_unlock and ->irq_acked with the patch below). Yeah, I see :( I can convert it to SRCU if you have no objections. AFAIR there was a disadvantage compared to RCU, but I don't remember what it was exactly. http://www.mentby.com/paul-e-mckenney/kvm-patch-v3-13-kvm-fix-race-in-irqrouting-logic.html But for the KVM IRQ paths it should not matter much, since use of grace periods is rare (registration/unregistration is very rare compared to the read side), and the IRQ paths are already large and slow, so the added overhead should not be noticeable. The msix path shouldn't be slow, though, and that is the case we are trying to optimize. SRCU's read-side primitives are also significantly slower than those of RCU. That doesn't sound too promising. What about converting the PIC/IOAPIC mutexes into spinlocks? Works for me, but on large guests the spinning will be noticeable, I believe. For interrupts going through the IOAPIC, yes, but we know that is not scalable anyway. Avi? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote: 2010/4/22 Dor Laor dl...@redhat.com: On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote: Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The transmission/transaction protocol, and most of the control logic, is implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing in the kernel side to guarantee replayability of certain events and instructions, to integrate the RAS capabilities of newer x86 hardware with the HA stack, as well as for optimization purposes, for example. [ snap] The rest of this message describes TODO lists grouped by each topic. === event tapping === Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP. IMO any type of network event should be stalled too. What if the VM runs a non-TCP protocol, and the packet that the master node sent reached some remote client before the sync to the slave, and then the master failed? In the current implementation, it actually stalls any type of network that goes through virtio-net.
However, if the application was using unreliable protocols, it should have its own recovery mechanism, or it should be completely stateless. Why do you treat TCP differently? You can damage the entire VM this way - think of a DHCP request that was dropped at the moment you switched between the master and the slave. I'm not trying to say that we should treat TCP differently, just that it's more severe. In the case of a DHCP request, the client would have a chance to retry after failover, correct? But until it times out it won't have networking. BTW, in the current implementation, it synchronizes before the DHCP ack is sent. But in the case of TCP, once you send the ack to the client before syncing, there is no way to recover. What if the guest is running a DHCP server? If we provide an IP to a client and then fail over to the secondary, it will run without knowing the master allocated this IP. [snap] === clock === Since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives. Why do you specifically care about the TSC sync? When you sync the whole IO model on snapshot, it also synchronizes the TSC. So, do you agree that an extra clock synchronization is not needed, since it is done anyway as part of the live migration state sync? I agree that it's sent as part of the live migration. What I wanted to say here is that this is not something for real-time applications. I usually get questions like, can this guarantee fault tolerance for real-time applications? First, the huge cost of snapshots won't match any real-time app. Second, even if that weren't the case, the TSC delta and kvmclock are synchronized as part of the VM state, so there is no use in trapping it in the middle. In general, can you please explain the 'algorithm' for continuous snapshots (is that what you'd like to do?): Yes, of course. Sorry for being less informative.
A trivial one would be to:
- do X online snapshots/sec
I don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running; if the guest is almost idle, there will be no snapshots in 5 seconds. On the other hand, if the guest is running I/O-intensive workloads (netperf or iozone, for example), there will be about 50 snapshots/sec.
- Stall all IO (disk/block) from the guest to the outside world until the previous snapshot reaches the slave.
Yes, it does.
- Snapshots are made of the full device model + a diff of dirty pages from the last snapshot.
- diff of dirty pages from last snapshot
This also depends on the workload. In the case of I/O-intensive workloads, dirty pages are usually fewer than 100. The hardest would be memory-intensive loads. So 100 snap/sec means a latency of 10msec, right? (not that that's not ok; with faster hw and IB you'll be able to get much more) Doesn't 100 snap/sec mean the interval between snaps is 10msec? IIUC, to get the latency,
using ftrace with kvm
I have a VM that is spinning (both vcpus at 100%). As I recall, kvm_stat has been deprecated in favor of ftrace. Is there a wiki page or document that gives suggestions on this? David -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC] KVM MMU: fix hashing for TDP and non-paging modes
I've been reading the x86 mmu.c recently and had been wondering about something. Avi's recent mmu documentation (thanks!) seems to have confirmed my understanding of how the shadow paging is supposed to be working. In TDP mode, when mmu_alloc_roots() calls kvm_mmu_get_page(), why does it pass (vcpu->arch.cr3 >> PAGE_SHIFT) or (vcpu->arch.mmu.pae_root[i]) as gfn? It seems to me that in TDP mode, gfn should be either zero for the root page table, or 0/1GB/2GB/3GB (for PAE page tables). The existing behavior can lead to multiple, semantically-identical TDP roots being created by mmu_alloc_roots, depending on the VCPU's CR3 at the time that mmu_alloc_roots was called. But the nested page tables should be* independent of the VCPU state. That wastes some memory and causes extra page faults while populating the extra copies of the page tables. *assuming that we aren't modeling per-VCPU state that might change the physical address map as seen by that VCPU, such as setting the APIC base to an address overlapping RAM. All feedback would be welcome, since I'm new to this system! A strawman patch follows. thanks, -Eric -- For TDP mode, avoid creating multiple page table roots for the single guest-to-host physical address map by fixing the inputs used for the shadow page table hash in mmu_alloc_roots().
Signed-off-by: Eric Northup digitale...@google.com
---
 arch/x86/kvm/mmu.c | 12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ddfa865..9696d65 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2059,10 +2059,12 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		hpa_t root = vcpu->arch.mmu.root_hpa;

 		ASSERT(!VALID_PAGE(root));
-		if (tdp_enabled)
-			direct = 1;
 		if (mmu_check_root(vcpu, root_gfn))
 			return 1;
+		if (tdp_enabled) {
+			direct = 1;
+			root_gfn = 0;
+		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
 				      PT64_ROOT_LEVEL, direct, ACC_ALL, NULL);
@@ -2072,8 +2074,6 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 	direct = !is_paging(vcpu);
-	if (tdp_enabled)
-		direct = 1;
 	for (i = 0; i < 4; ++i) {
 		hpa_t root = vcpu->arch.mmu.pae_root[i];
@@ -2089,6 +2089,10 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
 			root_gfn = 0;
 		if (mmu_check_root(vcpu, root_gfn))
 			return 1;
+		if (tdp_enabled) {
+			direct = 1;
+			root_gfn = i << 30;
+		}
 		sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
 				      PT32_ROOT_LEVEL, direct, ACC_ALL, NULL);
--
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: using ftrace with kvm
On Thu, Apr 22, 2010 at 3:53 PM, David S. Ahern daah...@cisco.com wrote: I have a VM that is spinning (both vcpus at 100%). As I recall, kvm_stat has been deprecated in favor of ftrace. Is there a wiki page or document that gives suggestions on this? David
Documentation/trace/* is the place to look. For me, function and function_graph give too much data even if I limit them to one function, so I use tracepoints instead. I simply enable the tracepoints I am interested in and read trace_pipe:
cat available_events
echo kvm:* > set_event
cat trace_pipe > ~/mydump.txt
(or whatever you want to do) --- regards Manish Regmi http://manish-cs.blogspot.com http://ext2read.sf.net -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC v2 6/6] KVM: introduce a new API for getting dirty bitmaps
On 22.04.2010, at 08:09, Fernando Luis Vázquez Cao wrote: On 04/22/2010 11:45 AM, Fernando Luis Vázquez Cao wrote: On 04/21/2010 06:41 PM, Alexander Graf wrote: On 21.04.2010, at 10:29, Fernando Luis Vázquez Cao wrote: On 04/20/2010 08:03 PM, Takuya Yoshikawa wrote: @@ -318,7 +318,7 @@ struct kvm_dirty_log { __u32 padding1; union { void __user *dirty_bitmap; /* one bit per page */ - __u64 padding2; + __u64 addr; This can break on x86_32 and x86_64-compat. addr is a long not a __u64. So the high 32 bits are zero. Where's the problem? If we are careful enough to cast the addr appropriately we should be fine, even if we keep the padding field in the union. I am not saying that it breaks 32 architectures but that it can potentially be problematic. + case KVM_SWITCH_DIRTY_LOG: { + struct kvm_dirty_log log; + + r = -EFAULT; + if (copy_from_user(log, argp, sizeof log)) + goto out; + r = kvm_vm_ioctl_switch_dirty_log(kvm, log); + if (r) + goto out; + r = -EFAULT; + if (copy_to_user(argp, log, sizeof log)) + goto out; + r = 0; + break; + } In x86_64-compat mode we are handling 32bit user-space addresses so we need the compat counterpart of KVM_SWITCH_DIRTY_LOG too. The compat code just forwards everything to the generic ioctls. The compat code uses struct compat_kvm_dirty_log instead of struct kvm_dirty_log to communicate with user space so the necessary conversions needs to be done before invoking the generic ioctl (see KVM_GET_DIRTY_LOG in kvm_vm_compat_ioctl). By the way we probable should move the definition of struct compat_kvm_dirty_log to a header file. It seems that it was you and Arnd who added the kvm_vm compat ioctl :-). Are you considering a different approach to tackle the issues that we have with a big-endian userspace? IIRC the issue was a pointer inside of a nested structure, no? 
Alex -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Jamie Lokier wrote: Yoshiaki Tamura wrote: Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP. IMO any type of network even should be stalled too. What if the VM runs non tcp protocol and the packet that the master node sent reached some remote client and before the sync to the slave the master failed? In current implementation, it is actually stalling any type of network that goes through virtio-net. However, if the application was using unreliable protocols, it should have its own recovering mechanism, or it should be completely stateless. Even with unreliable protocols, if slave takeover causes the receiver to have received a packet that the sender _does not think it has ever sent_, expect some protocols to break. If the slave replaying master's behaviour since the last sync means it will definitely get into the same state of having sent the packet, that works out. That's something we're expecting now. But you still have to be careful that the other end's responses to that packet are not seen by the slave too early during that replay. Otherwise, for example, the slave may observe a TCP ACK to a packet that it hasn't yet sent, which is an error. Even current implementation syncs just before network output, what you pointed out could happen. In this case, would the connection going to be lost, or would client/server recover from it? If latter, it would be fine, otherwise I wonder how people doing similar things are handling this situation. About IP idempotency: In general, IP packets are allowed to be lost or duplicated in the network. All IP protocols should be prepared for that; it is a basic property. 
However there is one respect in which they're not idempotent: the TTL field should be decreased if packets are delayed. Packets should not appear to live in the network for longer than TTL seconds. If they do, some protocols (like TCP) can react to the delayed ones differently, such as sending a RST packet and breaking a connection. It is acceptable to reduce TTL faster than the minimum. After all, it is reduced by 1 on every forwarding hop, in addition to time delays. So the problem is that when the slave takes over, it sends a packet with the same TTL as one the client may already have received. I currently don't have good numbers that I can share right now. Snapshots/sec depends on what kind of workload is running, and if the guest was almost idle, there will be no snapshots within 5 seconds. On the other hand, if the guest was running I/O-intensive workloads (netperf or iozone, for example), there will be about 50 snapshots/sec. That is a really satisfying number, thank you :-) Without this work I wouldn't have imagined that synchronised machines could work with such a low transaction rate. Thank you for your comments. Although I haven't prepared good data yet, I personally prefer to have discussions backed by an actual implementation and experimental data. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 00/20] Kemari for KVM v0.1
Anthony Liguori wrote: On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The transmission/transaction protocol, and most of the control logic is implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing in the kernel side to guarantee replayability of certain events and instructions, integrate the RAS capabilities of newer x86 hardware with the HA stack, as well as for optimization purposes, for example. Before going into details, we would like to show how Kemari looks. We prepared a demonstration video at the following location. For those who are not interested in the code, please take a look. The demonstration scenario is, 1. Play with a guest VM that has virtio-blk and virtio-net. # The guest image should be a NFS/SAN. 2. Start Kemari to synchronize the VM by running the following command in QEMU. Just add -k option to usual migrate command. migrate -d -k tcp:192.168.0.20: 3. Check the status by calling info migrate. 4. Go back to the VM to play chess animation. 5. Kill the the VM. (VNC client also disappears) 6. Press c to continue the VM on the other host. 7. Bring up the VNC client (Sorry, it pops outside of video capture.) 8. Confirm that the chess animation ends, browser works fine, then shutdown. 
http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov The repository contains all patches we're sending with this message. For those who want to try, pull the following repository. When running configure, please pass --enable-ft-mode. Also you need to apply a patch attached at the end of this message to your KVM. git://kemari.git.sourceforge.net/gitroot/kemari/kemari In addition to the usual migrate environment and command, add -k to run. The patch set consists of the following components. - bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April 20) - writev() support to QEMUFile and FdMigrationState. - FT transaction sender/receiver - event tap that triggers FT transaction. - virtio-blk, virtio-net support for event tap. This series looks quite nice! Thanks for your kind words! I think it would make sense to separate out the things that are actually optimizations (like the dirty bitmap changes and the writev/readv changes) and to attempt to justify them with actual performance data. I agree with the separation plan. For the dirty bitmap change, Avi and I discussed a patchset for upstream QEMU while you were offline (sorry, if I was wrong). Could you also take a look? http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html Regarding writev, I agree that it should be backed with actual data, otherwise it should be removed. We attempted to do everything that may reduce the overhead of the transaction. I'd prefer not to modify the live migration protocol ABI and it doesn't seem to be necessary if we're willing to add options to the -incoming flag. We also want to be a bit more generic with respect to IO. I totally agree with your approach not to change the protocol ABI. Can we add an option to -incoming? Like, -incoming ft_mode, for example. Regarding the IO, let me reply to the next message. Otherwise, the series looks very close to being mergeable. Thank you for your comment on each patch.
To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU and struggled with how to implement this in an acceptable way. Thanks, Yoshi Regards, Anthony Liguori
PXE Boot Timeout Issue...
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Hi all, Just upgraded to 12.3 user space tools from 11.0, and now when I attempt to netboot a guest, it appears that the pxe rom is timing out on dhcp before the bridge has enough time to come up. Is there a command line switch to set the dhcp timeout, or a build option that can be changed to set the timeout to a longer value, or disable it entirely? Thanks in advance... Stu - -- Spider Pig, Spider Pig, Does whatever a Spider Pig does. Can he swing, from a web, no he can't, he's a pig. Look Oot! He is a Spider Pig -- Matt Groening - Spider Pig Lyrics -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) iQIcBAEBCAAGBQJL0O2PAAoJEFKVLITDJSGSHKUP/RGNTgmaQmqQ4L5wdkRpWVY4 0BY0wsfajLCQ8W87Zr5EIoFDw3QfeKM9345bvstwSvzvX5WB19U/Q463nxzUfV3v YjBk6oCQVI7nnbSZIuz0rqxeo8QPcfuDStHQAkYZD6TlfgNntbzFRnCoMRAMuK46 EdcjmJMCcZ5bF2OezzljhpvJavqwBhOsuOwXHprlIxkhnlOQB5sBY4RSS/f886xw 34Bf1ER7J3zoyq96ElTOCNpFU5BeoaUGw0n/fDPaxQ5QmGoHOB/OpSoN8fae+h1J OxgPWWUhgcb1d0rCsU9R4frbzJGwQFy8X4tBw4gqhjPX76D3T4zyf532pbTjrnri GASOLFuSGxowPH1LmPag36Yq8IdOejVrEzK8jTyrpbmWtVBkyS9NiqQEJmqdbGpt bu4Kt+Ab4oJRI+DGf8kfnvBvfZ5wIPmfsWzV3YrV4AHn+9X6FG/1OMxg9EV8cyC7 hhy/BTGX19qwzuxxUq8s1aH0BSARip++lZoKFuYojF5AoP+SBUSoOjpdhY+NNMTh 3cldyrN3Zd9Y5/ejHIBta0QDUg6YdbXvi6Ai2B+6aKtq63B4+uT8XBfh+duKNBfF Y25gERY2aXoed4KRukSfr/6qh2y9MSQb7UbZneAy9bFMKhK21/ZK9FRM/uIUOs+g lE6iqHCL4f4esyNFIwAr =OAyU -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/5] Add a global synchronization point for pvclock
On 04/22/2010 03:11 AM, Glauber Costa wrote: On Tue, Apr 20, 2010 at 12:42:17PM -0700, Jeremy Fitzhardinge wrote: On 04/20/2010 11:54 AM, Avi Kivity wrote: On 04/20/2010 09:23 PM, Jeremy Fitzhardinge wrote: On 04/20/2010 02:31 AM, Avi Kivity wrote: btw, do you want this code in pvclock.c, or shall we keep it kvmclock specific? I think it's a pvclock-level fix. I'd been hoping to avoid having something like this, but I think it's ultimately necessary. Did you observe drift on Xen, or is this ultimately pointing at the future? People are reporting weirdnesses that clocksource=jiffies apparently resolves. Xen and KVM are faced with the same hardware constraints, and it wouldn't surprise me if there were small measurable non-monotonicities in the PV clock under Xen. May as well be safe. Of course, it kills any possibility of being able to usefully export this interface down to usermode. My main concern about this kind of simple fix is that if there's a long-term systematic drift between different CPUs' TSCs, then this will somewhat mask the problem while giving really awful time measurement on the slow CPU(s). In that case it really needs to adjust the scaling factor to correct for the drift (*not* update the offset). But if we're definitely only talking about fixed, relatively small time offsets then it is fine. Can you by any chance run Ingo's time warp test on those machines? You need to define TOD to 1, and leave out the TSC test. For me, warps exist on every machine out there but the Nehalems, so far. Or apply this patch.

diff -rup a/time-warp-test.c b/time-warp-test.c
--- a/time-warp-test.c 2010-04-15 16:30:13.955981607 -1000
+++ b/time-warp-test.c 2010-04-15 16:35:37.777982377 -1000
@@ -91,7 +91,7 @@ static inline unsigned long long __rdtsc
 {
 	DECLARE_ARGS(val, low, high);
-	asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high));
+	asm volatile("cpuid; rdtsc" : EAX_EDX_RET(val, low, high) :: "ebx", "ecx");
 	return EAX_EDX_VAL(val, low, high);
 }
[RFC PATCH] asm-generic: bitops: introduce le bit offset macro
Although we can use the *_le_bit() helpers to manipulate LE-arranged bitmaps, having the LE bit offset calculation as a separate macro gives us more freedom. For example, KVM has LE-arranged dirty bitmaps for VGA, live migration, etc., and we use them in user space too. To avoid bitmap copies between kernel and user space, we want to update the bitmaps in user space directly. To achieve this, the LE bit offset together with the *_user() functions helps us a lot. So let us factor out the LE bit offset calculation by defining it as a new macro: generic_le_bit_offset().

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 include/asm-generic/bitops/le.h | 18 +++---
 1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/asm-generic/bitops/le.h b/include/asm-generic/bitops/le.h
index 80e3bf1..6af2e54 100644
--- a/include/asm-generic/bitops/le.h
+++ b/include/asm-generic/bitops/le.h
@@ -9,6 +9,8 @@
 #if defined(__LITTLE_ENDIAN)
 
+#define generic_le_bit_offset(nr) (nr)
+
 #define generic_test_le_bit(nr, addr) test_bit(nr, addr)
 #define generic___set_le_bit(nr, addr) __set_bit(nr, addr)
 #define generic___clear_le_bit(nr, addr) __clear_bit(nr, addr)
@@ -25,22 +27,24 @@
 
 #elif defined(__BIG_ENDIAN)
 
+#define generic_le_bit_offset(nr) ((nr) ^ BITOP_LE_SWIZZLE)
+
 #define generic_test_le_bit(nr, addr) \
-	test_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+	test_bit(generic_le_bit_offset(nr), (addr))
 #define generic___set_le_bit(nr, addr) \
-	__set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+	__set_bit(generic_le_bit_offset(nr), (addr))
 #define generic___clear_le_bit(nr, addr) \
-	__clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+	__clear_bit(generic_le_bit_offset(nr), (addr))
 #define generic_test_and_set_le_bit(nr, addr) \
-	test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+	test_and_set_bit(generic_le_bit_offset(nr), (addr))
 #define generic_test_and_clear_le_bit(nr, addr) \
-	test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+	test_and_clear_bit(generic_le_bit_offset(nr), (addr))
 #define generic___test_and_set_le_bit(nr, addr) \
-	__test_and_set_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+	__test_and_set_bit(generic_le_bit_offset(nr), (addr))
 #define generic___test_and_clear_le_bit(nr, addr) \
-	__test_and_clear_bit((nr) ^ BITOP_LE_SWIZZLE, (addr))
+	__test_and_clear_bit(generic_le_bit_offset(nr), (addr))
 
 extern unsigned long generic_find_next_zero_le_bit(const unsigned long *addr, unsigned long size, unsigned long offset);
-- 
1.6.3.3
Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1
Anthony Liguori wrote: On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote: 2010/4/22 Dor Laordl...@redhat.com: On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote: Dor Laor wrote: On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote: Hi all, We have been implementing the prototype of Kemari for KVM, and we're sending this message to share what we have now and TODO lists. Hopefully, we would like to get early feedback to keep us in the right direction. Although advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community. The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27. For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year. http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html The transmission/transaction protocol, and most of the control logic is implemented in QEMU. However, we needed a hack in KVM to prevent rip from proceeding before synchronizing VMs. It may also need some plumbing in the kernel side to guarantee replayability of certain events and instructions, integrate the RAS capabilities of newer x86 hardware with the HA stack, as well as for optimization purposes, for example. [ snap] The rest of this message describes TODO lists grouped by each topic. === event tapping === Event tapping is the core component of Kemari, and it decides on which event the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP. IMO any type of network even should be stalled too. What if the VM runs non tcp protocol and the packet that the master node sent reached some remote client and before the sync to the slave the master failed? In current implementation, it is actually stalling any type of network that goes through virtio-net. 
However, if the application was using unreliable protocols, it should have its own recovering mechanism, or it should be completely stateless. Why do you treat tcp differently? You can damage the entire VM this way - think of dhcp request that was dropped on the moment you switched between the master and the slave? I'm not trying to say that we should treat tcp differently, but just it's severe. In case of dhcp request, the client would have a chance to retry after failover, correct? BTW, in current implementation, I'm slightly confused about the current implementation vs. my recollection of the original paper with Xen. I had thought that all disk and network I/O was buffered in such a way that at each checkpoint, the I/O operations would be released in a burst. Otherwise, you would have to synchronize after every I/O operation which is what it seems the current implementation does. Yes, you're almost right. It's synchronizing before QEMU starts emulating I/O at each device model. It was originally designed that way to avoid complexity of introducing buffering mechanism and additional I/O latency by buffering. I'm not sure how that is accomplished atomically though since you could have a completed I/O operation duplicated on the slave node provided it didn't notify completion prior to failure. That's exactly the point I wanted to discuss. Currently, we're calling vm_stop(0), qemu_aio_flush() and bdrv_flush_all() before qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is complete. I mimicked what existing live migration is doing. It's not enough? Is there another kemari component that somehow handles buffering I/O that is not obvious from these patches? 
No, I'm not hiding anything, and I would share any information regarding Kemari to develop it in this community :-) Thanks, Yoshi Regards, Anthony Liguori -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 01/20] Modify DIRTY_FLAG value and introduce DIRTY_IDX to use as indexes of bit-based phys_ram_dirty.
Anthony Liguori wrote: Hi, On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: Replaces byte-based phys_ram_dirty bitmap with four (MASTER, VGA, CODE, MIGRATION) bit-based phys_ram_dirty bitmap. On allocation, it sets all bits in the bitmap. It uses ffs() to convert DIRTY_FLAG to DIRTY_IDX. Modifies wrapper functions for byte-based phys_ram_dirty bitmap to bit-based phys_ram_dirty bitmap. MASTER works as a buffer, and upon get_diry() or get_dirty_flags(), it calls cpu_physical_memory_sync_master() to update VGA and MIGRATION. Why use an additional bitmap for MASTER instead of just updating the VGA, CODE, and MIGRATION bitmaps together? This way we don't have to think whether we should update VGA or MIGRATION. IIRC, we had this discussion on upstream before with Avi? http://www.mail-archive.com/kvm@vger.kernel.org/msg30728.html BTW, I also have the following TODO list regarding dirty bitmap. 1. Allocate vga and migration dirty bitmap dynamically. 2. Separate CODE and the other dirty bitmaps functions. Regards, Anthony Liguori Replaces direct phys_ram_dirty access with wrapper functions to prevent direct access to the phys_ram_dirty bitmap. Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp Signed-off-by: OHMURA Keiohmura@lab.ntt.co.jp --- cpu-all.h | 130 + exec.c | 60 ++-- 2 files changed, 152 insertions(+), 38 deletions(-) diff --git a/cpu-all.h b/cpu-all.h index 51effc0..3f8762d 100644 --- a/cpu-all.h +++ b/cpu-all.h @@ -37,6 +37,9 @@ #include softfloat.h +/* to use ffs in flag_to_idx() */ +#includestrings.h + #if defined(HOST_WORDS_BIGENDIAN) != defined(TARGET_WORDS_BIGENDIAN) #define BSWAP_NEEDED #endif @@ -846,7 +849,6 @@ int cpu_str_to_log_mask(const char *str); /* memory API */ extern int phys_ram_fd; -extern uint8_t *phys_ram_dirty; extern ram_addr_t ram_size; extern ram_addr_t last_ram_offset; extern uint8_t *bios_mem; @@ -869,28 +871,140 @@ extern uint8_t *bios_mem; /* Set if TLB entry is an IO callback. 
*/ #define TLB_MMIO (1 5) +/* Use DIRTY_IDX as indexes of bit-based phys_ram_dirty. */ +#define MASTER_DIRTY_IDX 0 +#define VGA_DIRTY_IDX 1 +#define CODE_DIRTY_IDX 2 +#define MIGRATION_DIRTY_IDX 3 +#define NUM_DIRTY_IDX 4 + +#define MASTER_DIRTY_FLAG (1 MASTER_DIRTY_IDX) +#define VGA_DIRTY_FLAG (1 VGA_DIRTY_IDX) +#define CODE_DIRTY_FLAG (1 CODE_DIRTY_IDX) +#define MIGRATION_DIRTY_FLAG (1 MIGRATION_DIRTY_IDX) + +extern unsigned long *phys_ram_dirty[NUM_DIRTY_IDX]; + +static inline int dirty_flag_to_idx(int flag) +{ + return ffs(flag) - 1; +} + +static inline int dirty_idx_to_flag(int idx) +{ + return 1 idx; +} + int cpu_memory_rw_debug(CPUState *env, target_ulong addr, uint8_t *buf, int len, int is_write); -#define VGA_DIRTY_FLAG 0x01 -#define CODE_DIRTY_FLAG 0x02 -#define MIGRATION_DIRTY_FLAG 0x08 - /* read dirty bit (return 0 or 1) */ static inline int cpu_physical_memory_is_dirty(ram_addr_t addr) { - return phys_ram_dirty[addr TARGET_PAGE_BITS] == 0xff; + unsigned long mask; + ram_addr_t index = (addr TARGET_PAGE_BITS) / HOST_LONG_BITS; + int offset = (addr TARGET_PAGE_BITS) (HOST_LONG_BITS - 1); + + mask = 1UL offset; + return (phys_ram_dirty[MASTER_DIRTY_IDX][index] mask) == mask; +} + +static inline void cpu_physical_memory_sync_master(ram_addr_t index) +{ + if (phys_ram_dirty[MASTER_DIRTY_IDX][index]) { + phys_ram_dirty[VGA_DIRTY_IDX][index] + |= phys_ram_dirty[MASTER_DIRTY_IDX][index]; + phys_ram_dirty[MIGRATION_DIRTY_IDX][index] + |= phys_ram_dirty[MASTER_DIRTY_IDX][index]; + phys_ram_dirty[MASTER_DIRTY_IDX][index] = 0UL; + } +} + +static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr) +{ + unsigned long mask; + ram_addr_t index = (addr TARGET_PAGE_BITS) / HOST_LONG_BITS; + int offset = (addr TARGET_PAGE_BITS) (HOST_LONG_BITS - 1); + int ret = 0, i; + + mask = 1UL offset; + cpu_physical_memory_sync_master(index); + + for (i = VGA_DIRTY_IDX; i= MIGRATION_DIRTY_IDX; i++) { + if (phys_ram_dirty[i][index] mask) { + ret |= dirty_idx_to_flag(i); + 
} + } + + return ret; +} + +static inline int cpu_physical_memory_get_dirty_idx(ram_addr_t addr, + int dirty_idx) +{ + unsigned long mask; + ram_addr_t index = (addr TARGET_PAGE_BITS) / HOST_LONG_BITS; + int offset = (addr TARGET_PAGE_BITS) (HOST_LONG_BITS - 1); + + mask = 1UL offset; + cpu_physical_memory_sync_master(index); + return (phys_ram_dirty[dirty_idx][index] mask) == mask; } static inline int cpu_physical_memory_get_dirty(ram_addr_t addr, int dirty_flags) { - return phys_ram_dirty[addr TARGET_PAGE_BITS] dirty_flags; + return cpu_physical_memory_get_dirty_idx(addr, + dirty_flag_to_idx(dirty_flags)); } static inline void cpu_physical_memory_set_dirty(ram_addr_t addr) { - phys_ram_dirty[addr TARGET_PAGE_BITS] = 0xff; + unsigned long mask; + ram_addr_t index = (addr TARGET_PAGE_BITS) / HOST_LONG_BITS; + int offset = (addr TARGET_PAGE_BITS) (HOST_LONG_BITS - 1); + + mask = 1UL offset; +
Re: [RFC PATCH 15/20] Introduce FT mode support to configure.
Anthony Liguori wrote: On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp No need for this. OK. Regards, Anthony Liguori --- configure | 8 1 files changed, 8 insertions(+), 0 deletions(-) diff --git a/configure b/configure index 046c591..f0682d4 100755 --- a/configure +++ b/configure @@ -298,6 +298,7 @@ bsd_user=no guest_base= uname_release= io_thread=no +ft_mode=no mixemu=no kvm_trace=no kvm_cap_pit= @@ -671,6 +672,8 @@ for opt do ;; --enable-io-thread) io_thread=yes ;; + --enable-ft-mode) ft_mode=yes + ;; --disable-blobs) blobs=no ;; --kerneldir=*) kerneldir=$optarg @@ -840,6 +843,7 @@ echo --enable-vde enable support for vde network echo --disable-linux-aio disable Linux AIO support echo --enable-linux-aio enable Linux AIO support echo --enable-io-thread enable IO thread +echo --enable-ft-mode enable FT mode support echo --disable-blobs disable installing provided firmware blobs echo --kerneldir=PATH look for kernel includes in PATH echo --with-kvm-trace enable building the KVM module with the kvm trace option @@ -2117,6 +2121,7 @@ echo GUEST_BASE $guest_base echo PIE user targets $user_pie echo vde support $vde echo IO thread $io_thread +echo FT mode support $ft_mode echo Linux AIO support $linux_aio echo Install blobs $blobs echo KVM support $kvm @@ -2318,6 +2323,9 @@ fi if test $io_thread = yes ; then echo CONFIG_IOTHREAD=y $config_host_mak fi +if test $ft_mode = yes ; then + echo CONFIG_FT_MODE=y $config_host_mak +fi if test $linux_aio = yes ; then echo CONFIG_LINUX_AIO=y $config_host_mak fi -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 14/20] Upgrade QEMU_FILE_VERSION from 3 to 4, and introduce qemu_savevm_state_all().
Anthony Liguori wrote: On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: Make a 32bit entry after QEMU_VM_FILE_VERSION to recognize whether the transfered data is QEMU_VM_FT_MODE or QEMU_VM_LIVE_MIGRATION_MODE. I'd rather you encapsulate the current protocol as opposed to extending it with a new version. You could also do this by just having a new -incoming option, right? All you really need is to indicate that this is for high frequency checkpointing, right? Exactly. I would use -incoming option. Regards, Anthony Liguori Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp --- savevm.c | 76 - sysemu.h | 1 + 2 files changed, 75 insertions(+), 2 deletions(-) diff --git a/savevm.c b/savevm.c index 292ae32..19b3efb 100644 --- a/savevm.c +++ b/savevm.c @@ -1402,8 +1402,10 @@ static void vmstate_save(QEMUFile *f, SaveStateEntry *se) } #define QEMU_VM_FILE_MAGIC 0x5145564d -#define QEMU_VM_FILE_VERSION_COMPAT 0x0002 -#define QEMU_VM_FILE_VERSION 0x0003 +#define QEMU_VM_FILE_VERSION_COMPAT 0x0003 +#define QEMU_VM_FILE_VERSION 0x0004 +#define QEMU_VM_LIVE_MIGRATION_MODE 0x0005 +#define QEMU_VM_FT_MODE 0x0006 #define QEMU_VM_EOF 0x00 #define QEMU_VM_SECTION_START 0x01 @@ -1425,6 +1427,12 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, qemu_put_be32(f, QEMU_VM_FILE_MAGIC); qemu_put_be32(f, QEMU_VM_FILE_VERSION); + + if (ft_mode) { + qemu_put_be32(f, QEMU_VM_FT_MODE); + } else { + qemu_put_be32(f, QEMU_VM_LIVE_MIGRATION_MODE); + } QTAILQ_FOREACH(se,savevm_handlers, entry) { int len; @@ -1533,6 +1541,66 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f) return 0; } +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f) +{ + SaveStateEntry *se; + + QTAILQ_FOREACH(se,savevm_handlers, entry) { + int len; + + if (se-save_live_state == NULL) + continue; + + /* Section type */ + qemu_put_byte(f, QEMU_VM_SECTION_START); + qemu_put_be32(f, se-section_id); + + /* ID string */ + len = strlen(se-idstr); + qemu_put_byte(f, len); + qemu_put_buffer(f, 
(uint8_t *)se-idstr, len); + + qemu_put_be32(f, se-instance_id); + qemu_put_be32(f, se-version_id); + if (ft_mode == FT_INIT) { + /* This is workaround. */ + se-save_live_state(mon, f, QEMU_VM_SECTION_START, se-opaque); + } else { + se-save_live_state(mon, f, QEMU_VM_SECTION_PART, se-opaque); + } + } + + ft_mode = FT_TRANSACTION; + QTAILQ_FOREACH(se,savevm_handlers, entry) { + int len; + + if (se-save_state == NULL se-vmsd == NULL) + continue; + + /* Section type */ + qemu_put_byte(f, QEMU_VM_SECTION_FULL); + qemu_put_be32(f, se-section_id); + + /* ID string */ + len = strlen(se-idstr); + qemu_put_byte(f, len); + qemu_put_buffer(f, (uint8_t *)se-idstr, len); + + qemu_put_be32(f, se-instance_id); + qemu_put_be32(f, se-version_id); + + vmstate_save(f, se); + } + + qemu_put_byte(f, QEMU_VM_EOF); + + if (qemu_file_has_error(f)) + return -EIO; + + return 0; +} + + void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f) { SaveStateEntry *se; @@ -1617,6 +1685,10 @@ int qemu_loadvm_state(QEMUFile *f, int skip_header) if (v != QEMU_VM_FILE_VERSION) return -ENOTSUP; + v = qemu_get_be32(f); + if (v == QEMU_VM_FT_MODE) { + ft_mode = FT_INIT; + } } while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) { diff --git a/sysemu.h b/sysemu.h index 6c1441f..df314bb 100644 --- a/sysemu.h +++ b/sysemu.h @@ -67,6 +67,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable, int shared); int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f); int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f); +int qemu_savevm_state_all(Monitor *mon, QEMUFile *f); void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f); int qemu_loadvm_state(QEMUFile *f, int skip_header); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 05/20] Introduce put_vector() and get_vector to QEMUFile and qemu_fopen_ops().
Anthony Liguori wrote: On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote: QEMUFile currently doesn't support writev(). For sending multiple data, such as pages, using writev() should be more efficient. Signed-off-by: Yoshiaki Tamuratamura.yoshi...@lab.ntt.co.jp Is there performance data that backs this up? Since QEMUFile uses a linear buffer for most operations that's limited to 16k, I suspect you wouldn't be able to observe a difference in practice. I currently don't have data, but I'll prepare it. There were two things I wanted to avoid. 1. Pages to be copied to QEMUFile buf through qemu_put_buffer. 2. Calling write() everytime even when we want to send multiple pages at once. I think 2 may be neglectable. But 1 seems to be problematic if we want make to the latency as small as possible, no? Regards, Anthony Liguori --- buffered_file.c | 2 +- hw/hw.h | 16 savevm.c | 43 +-- 3 files changed, 42 insertions(+), 19 deletions(-) diff --git a/buffered_file.c b/buffered_file.c index 54dc6c2..187d1d4 100644 --- a/buffered_file.c +++ b/buffered_file.c @@ -256,7 +256,7 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque, s-wait_for_unfreeze = wait_for_unfreeze; s-close = close; - s-file = qemu_fopen_ops(s, buffered_put_buffer, NULL, + s-file = qemu_fopen_ops(s, buffered_put_buffer, NULL, NULL, NULL, buffered_close, buffered_rate_limit, buffered_set_rate_limit, buffered_get_rate_limit); diff --git a/hw/hw.h b/hw/hw.h index fc9ed29..921cf90 100644 --- a/hw/hw.h +++ b/hw/hw.h @@ -23,6 +23,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf, int64_t pos, int size); +/* This function writes a chunk of vector to a file at the given position. + * The pos argument can be ignored if the file is only being used for + * streaming. + */ +typedef int (QEMUFilePutVectorFunc)(void *opaque, struct iovec *iov, + int64_t pos, int iovcnt); + /* Read a chunk of data from a file at the given position. The pos argument * can be ignored if the file is only be used for streaming. 
The number of * bytes actually read should be returned. @@ -30,6 +37,13 @@ typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf, typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf, int64_t pos, int size); +/* Read a chunk of vector from a file at the given position. The pos argument + * can be ignored if the file is only be used for streaming. The number of + * bytes actually read should be returned. + */ +typedef int (QEMUFileGetVectorFunc)(void *opaque, struct iovec *iov, + int64_t pos, int iovcnt); + /* Close a file and return an error code */ typedef int (QEMUFileCloseFunc)(void *opaque); @@ -46,7 +60,9 @@ typedef size_t (QEMUFileSetRateLimit)(void *opaque, size_t new_rate); typedef size_t (QEMUFileGetRateLimit)(void *opaque); QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer, + QEMUFilePutVectorFunc *put_vector, QEMUFileGetBufferFunc *get_buffer, + QEMUFileGetVectorFunc *get_vector, QEMUFileCloseFunc *close, QEMUFileRateLimit *rate_limit, QEMUFileSetRateLimit *set_rate_limit, diff --git a/savevm.c b/savevm.c index 490ab70..944e788 100644 --- a/savevm.c +++ b/savevm.c @@ -162,7 +162,9 @@ void qemu_announce_self(void) struct QEMUFile { QEMUFilePutBufferFunc *put_buffer; + QEMUFilePutVectorFunc *put_vector; QEMUFileGetBufferFunc *get_buffer; + QEMUFileGetVectorFunc *get_vector; QEMUFileCloseFunc *close; QEMUFileRateLimit *rate_limit; QEMUFileSetRateLimit *set_rate_limit; @@ -263,11 +265,11 @@ QEMUFile *qemu_popen(FILE *stdio_file, const char *mode) s-stdio_file = stdio_file; if(mode[0] == 'r') { - s-file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_pclose, - NULL, NULL, NULL); + s-file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, + NULL, stdio_pclose, NULL, NULL, NULL); } else { - s-file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_pclose, - NULL, NULL, NULL); + s-file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL, + stdio_pclose, NULL, NULL, NULL); } return s-file; } @@ -312,11 +314,11 @@ 
QEMUFile *qemu_fdopen(int fd, const char *mode) goto fail; if(mode[0] == 'r') { - s-file = qemu_fopen_ops(s, NULL, stdio_get_buffer, stdio_fclose, - NULL, NULL, NULL); + s-file = qemu_fopen_ops(s, NULL, NULL, stdio_get_buffer, NULL, + stdio_fclose, NULL, NULL, NULL); } else { - s-file = qemu_fopen_ops(s, stdio_put_buffer, NULL, stdio_fclose, - NULL, NULL, NULL); + s-file = qemu_fopen_ops(s, stdio_put_buffer, NULL, NULL, NULL, + stdio_fclose, NULL, NULL, NULL); } return s-file; @@ -330,8 +332,8 @@ QEMUFile *qemu_fopen_socket(int fd) QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket)); s-fd = fd; - s-file = qemu_fopen_ops(s, NULL, socket_get_buffer, socket_close, - NULL, NULL, NULL); + s-file = qemu_fopen_ops(s, NULL, NULL, socket_get_buffer, NULL, + socket_close, NULL, NULL, NULL); return s-file; } @@ -368,11 +370,11 @@ QEMUFile *qemu_fopen(const char *filename, const char *mode) goto fail;
Re: [PATCH 6/10] KVM MMU: don't write-protect if have new mapping to unsync page
Marcelo Tosatti wrote:

 	role = vcpu->arch.mmu.base_role;
@@ -1332,12 +1336,16 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
 		if (sp->gfn == gfn) {
 			if (sp->unsync)
-				if (kvm_sync_page(vcpu, sp))
-					continue;
+				unsync_sp = sp;

Hi Marcelo, Thanks for your comments; maybe the changelog is not clear, please allow me to explain here. Two cases may happen in the kvm_mmu_get_page() function: - one case is, the goal sp is already in the cache; if the sp is unsync, we only need to update it to ensure this mapping is valid, but we don't mark it sync and don't write-protect sp->gfn, since that doesn't break the unsync rule (one shadow page for a gfn) - another case is, the goal sp doesn't exist; we need to create a new sp for the gfn, i.e., the gfn may have another shadow page, and to keep the unsync rule, we should sync (mark sync and write-protect) the gfn's unsync shadow page. After enabling multiple unsync shadows, we sync those shadow pages only when the new sp is not allowed to become unsync (also for the unsync rule; the new rule is: allow all pte pages to become unsync) I don't see a reason why you can't create a new mapping to an unsync page. The code already creates shadow pte entries using unsync pagetables. Do you mean case 2? In the original code, it synced the gfn's unsync page first, regardless of whether it broke the unsync rule:

| hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
| 	if (sp->gfn == gfn) {
| 		if (sp->unsync)
| 			if (kvm_sync_page(vcpu, sp))

And, my English is poor, sorry if I misunderstand your comment :-( Xiao
Re: [PATCH 1/3] KVM MMU: make kvm_mmu_zap_page() return the number of zapped sp in total.
Gui Jianfeng wrote:

Currently, in kvm_mmu_change_mmu_pages(kvm, page), used_pages-- is performed after calling kvm_mmu_zap_page(), regardless of whether the page is actually reclaimed. Because a root sp won't be reclaimed by kvm_mmu_zap_page(), making kvm_mmu_zap_page() return the total number of reclaimed sps makes more sense. A new flag is put into kvm_mmu_zap_page() to indicate whether the top page is reclaimed.

This bug only hurts the kvm_mmu_change_mmu_pages() function. We'd better allow 'self_deleted' to be NULL, then we can pass NULL at the other call sites.

@@ -1571,7 +1584,8 @@
 restart:
     pgprintk("%s: gfn %lx role %x\n", __func__, gfn, sp->role.word);
     r = 1;
-    if (kvm_mmu_zap_page(kvm, sp))
+    ret = kvm_mmu_zap_page(kvm, sp, &self_deleted);
+    if (ret > 1 || (ret == 1 && self_deleted == 0))
         goto restart;

Maybe we can keep kvm_mmu_zap_page() returning the number of zapped children, with 'self_deleted' indicating whether the page itself was zapped. Then we don't need to modify those functions; we just fix kvm_mmu_change_mmu_pages(), that is, if 'self_deleted == 1', inc 'used_pages'.

Xiao
Re: [RFC PATCH 07/20] Introduce qemu_put_vector() and qemu_put_vector_prepare() to use put_vector() in QEMUFile.
Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

For fool-proofing purposes, qemu_put_vector_prepare should be called before qemu_put_vector. Then, if a qemu_put_* function other than this one is called after qemu_put_vector_prepare, the program will abort().

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>

I don't get it. What's this protecting against?

This was introduced to prevent mixing the order of normal writes and vector writes, and to flush the QEMUFile buffer before handling vectors. While qemu_put_buffer copies data to the QEMUFile buffer, qemu_put_vector() will bypass that buffer. It's just fool-proofing against what we encountered at the beginning, and if the users of qemu_put_vector() are careful enough, we can remove qemu_put_vector_prepare().

While writing this message, I started to think that just calling qemu_fflush() in qemu_put_vector() would be enough...

Regards,

Anthony Liguori

---
 hw/hw.h  |  2 ++
 savevm.c | 53 +
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/hw/hw.h b/hw/hw.h
index 921cf90..10e6dda 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -77,6 +77,8 @@ void qemu_fflush(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+void qemu_put_vector(QEMUFile *f, QEMUIOVector *qiov);
+void qemu_put_vector_prepare(QEMUFile *f);
 void *qemu_realloc_buffer(QEMUFile *f, int size);
 void qemu_clear_buffer(QEMUFile *f);

diff --git a/savevm.c b/savevm.c
index 944e788..22d928c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -180,6 +180,7 @@ struct QEMUFile {
     uint8_t *buf;

     int has_error;
+    int prepares_vector;
 };

 typedef struct QEMUFileStdio
@@ -557,6 +558,58 @@ void qemu_put_byte(QEMUFile *f, int v)
         qemu_fflush(f);
 }

+void qemu_put_vector(QEMUFile *f, QEMUIOVector *v)
+{
+    struct iovec *iov;
+    int cnt;
+    size_t bufsize;
+    uint8_t *buf;
+
+    if (qemu_file_get_rate_limit(f) != 0) {
+        fprintf(stderr,
+                "Attempted to write vector while bandwidth limit is not zero.\n");
+        abort();
+    }
+
+    /* checks prepares vector.
+     * For fool-proofing purposes, qemu_put_vector_prepare should be called
+     * before qemu_put_vector.  Then, if qemu_put_* functions except this
+     * one are called after qemu_put_vector_prepare, the program will abort().
+     */
+    if (!f->prepares_vector) {
+        fprintf(stderr,
+                "You should prepare with qemu_put_vector_prepare.\n");
+        abort();
+    } else if (f->prepares_vector && f->buf_index != 0) {
+        fprintf(stderr, "Wrote data after qemu_put_vector_prepare.\n");
+        abort();
+    }
+    f->prepares_vector = 0;
+
+    if (f->put_vector) {
+        qemu_iovec_to_vector(v, &iov, &cnt);
+        f->put_vector(f->opaque, iov, 0, cnt);
+    } else {
+        qemu_iovec_to_size(v, &bufsize);
+        buf = qemu_malloc(bufsize + 1 /* for '\0' */);
+        qemu_iovec_to_buffer(v, buf);
+        qemu_put_buffer(f, buf, bufsize);
+        qemu_free(buf);
+    }
+}
+
+void qemu_put_vector_prepare(QEMUFile *f)
+{
+    if (f->prepares_vector) {
+        /* prepare vector */
+        fprintf(stderr, "Attempted to prepare vector twice\n");
+        abort();
+    }
+    f->prepares_vector = 1;
+
+    qemu_fflush(f);
+}
+
 int qemu_get_buffer(QEMUFile *f, uint8_t *buf, int size1)
 {
     int size, l;
Re: [RFC PATCH 10/20] Introduce skip_header parameter to qemu_loadvm_state() so that it can be called iteratively without reading the header.
Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>

I think the more appropriate thing to do is have qemu_savevm_state_complete() not write QEMU_VM_EOF when doing a continuous live migration. You would then want qemu_loadvm_state() to detect real EOF and treat that the same as QEMU_VM_EOF (provided it occurred at a section boundary). Of course, this should be a flag to qemu_loadvm_state() as it's not always the right behavior.

Sorry, I couldn't get your intention. I would appreciate it if you could explain its good points. On the receiver side, we need to switch to ft_transaction mode. If qemu_savevm_state_complete() didn't send QEMU_VM_EOF, qemu_loadvm_state() won't get out of the loop, and therefore it can't switch to ft_transaction mode. Please let me know if I am misunderstanding.

Regards,

Anthony Liguori

---
 migration-exec.c |  2 +-
 migration-fd.c   |  2 +-
 migration-tcp.c  |  2 +-
 migration-unix.c |  2 +-
 savevm.c         | 25 ++---
 sysemu.h         |  2 +-
 6 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/migration-exec.c b/migration-exec.c
index 3edc026..5839a6d 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -113,7 +113,7 @@ static void exec_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;

-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;

diff --git a/migration-fd.c b/migration-fd.c
index 0cc74ad..0e97ed0 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -106,7 +106,7 @@ static void fd_accept_incoming_migration(void *opaque)
     QEMUFile *f = opaque;
     int ret;

-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto err;

diff --git a/migration-tcp.c b/migration-tcp.c
index 56e1a3b..94a1a03 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -182,7 +182,7 @@ static void tcp_accept_incoming_migration(void *opaque)
         goto out;
     }

-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;

diff --git a/migration-unix.c b/migration-unix.c
index b7aab38..dd99a73 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -168,7 +168,7 @@ static void unix_accept_incoming_migration(void *opaque)
         goto out;
     }

-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     if (ret < 0) {
         fprintf(stderr, "load of migration failed\n");
         goto out_fopen;

diff --git a/savevm.c b/savevm.c
index 22d928c..a401b27 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1554,7 +1554,7 @@ typedef struct LoadStateEntry {
     int version_id;
 } LoadStateEntry;

-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
 {
     QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
         QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1563,17 +1563,20 @@ int qemu_loadvm_state(QEMUFile *f)
     unsigned int v;
     int ret;

-    v = qemu_get_be32(f);
-    if (v != QEMU_VM_FILE_MAGIC)
-        return -EINVAL;
+    if (!skip_header) {
+        v = qemu_get_be32(f);
+        if (v != QEMU_VM_FILE_MAGIC)
+            return -EINVAL;
+
+        v = qemu_get_be32(f);
+        if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+            fprintf(stderr, "SaveVM v3 format is obsolete and don't work anymore\n");
+            return -ENOTSUP;
+        }
+        if (v != QEMU_VM_FILE_VERSION)
+            return -ENOTSUP;
+    }

-    v = qemu_get_be32(f);
-    if (v == QEMU_VM_FILE_VERSION_COMPAT) {
-        fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
-        return -ENOTSUP;
-    }
-    if (v != QEMU_VM_FILE_VERSION)
-        return -ENOTSUP;

     while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
         uint32_t instance_id, version_id, section_id;
@@ -1898,7 +1901,7 @@ int load_vmstate(Monitor *mon, const char *name)
         monitor_printf(mon, "Could not open VM state file\n");
         return -EINVAL;
     }
-    ret = qemu_loadvm_state(f);
+    ret = qemu_loadvm_state(f, 0);
     qemu_fclose(f);
     if (ret < 0) {
         monitor_printf(mon, "Error %d while loading VM state\n", ret);

diff --git a/sysemu.h b/sysemu.h
index 647a468..6c1441f 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -68,7 +68,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);

 void qemu_errors_to_file(FILE *fp);
 void qemu_errors_to_mon(Monitor *mon);
Re: [RFC PATCH 19/20] Insert do_event_tap() to virtio-{blk, net}, comment out assert() on cpu_single_env temporarily.
Anthony Liguori wrote:

On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:

do_event_tap() is inserted into functions which actually fire outputs. By synchronizing VMs before outputs are fired, we can fail over to the receiver upon failure. To save the VM continuously, comment out assert() on cpu_single_env temporarily.

Signed-off-by: Yoshiaki Tamura <tamura.yoshi...@lab.ntt.co.jp>

---
 hw/virtio-blk.c | 2 ++
 hw/virtio-net.c | 2 ++
 qemu-kvm.c      | 7 ++-
 3 files changed, 10 insertions(+), 1 deletions(-)

This would be better done in the generic layers (the block and net layers respectively). Then it would work with virtio and emulated devices.

I agree with your opinion that it's better if we can handle any emulated device at once. However, I have a question here: if we put do_event_tap() in the generic layers, the emulated device's state would already have been advanced, so wouldn't it be impossible to reproduce that I/O after failover? If I am wrong, I would be happy to move it, but if I am right, there would be two approaches to overcome this:

1. Sync I/O requests to the receiver, and upon failover, release those requests before running the guest VM.

2. Copy the emulated device's state before starting emulation, and once it comes to the generic layer, start synchronizing using the copied state. We should also consider the guest's VCPU state.

I previously had a similar discussion with Avi. I would like to reconfirm his idea too.

Regards,

Anthony Liguori

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b80402d..1dd1c31 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -327,6 +327,8 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         .old_bs = NULL,
     };

+    do_event_tap();
+
     while ((req = virtio_blk_get_request(s))) {
         virtio_blk_handle_request(req, &mrb);
     }

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 5c0093e..1a32bf3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -667,6 +667,8 @@ static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIONet *n = to_virtio_net(vdev);

+    do_event_tap();
+
     if (n->tx_timer_active) {
         virtio_queue_set_notification(vq, 1);
         qemu_del_timer(n->tx_timer);

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 1414f49..769bc95 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -935,8 +935,12 @@ int kvm_run(CPUState *env)

     post_kvm_run(kvm, env);

+    /* TODO: we need to prevent tapping events that derived from the
+     * same VMEXIT.  This needs more info from the kernel. */
 #if defined(KVM_CAP_COALESCED_MMIO)
     if (kvm_state->coalesced_mmio) {
+        /* prevent from tapping events while handling coalesced_mmio */
+        event_tap_suspend();
         struct kvm_coalesced_mmio_ring *ring =
             (void *) run + kvm_state->coalesced_mmio * PAGE_SIZE;
         while (ring->first != ring->last) {
@@ -946,6 +950,7 @@ int kvm_run(CPUState *env)
             smp_wmb();
             ring->first = (ring->first + 1) % KVM_COALESCED_MMIO_MAX;
         }
+        event_tap_resume();
     }
 #endif

@@ -1770,7 +1775,7 @@ static void resume_all_threads(void)
 {
     CPUState *penv = first_cpu;

-    assert(!cpu_single_env);
+    /* assert(!cpu_single_env); */

     while (penv) {
         penv->stop = 0;